Joplin's OCR seems to perform well with clear text against a simple background (such as a screenshot of a PowerPoint slide or a table), but less so with more complex images. I did a small OCR test with the attached image. Joplin (Win 3.2.8) either fails or only recognizes one line of the text. Apple, OneNote, Google Keep, and Obsidian with the Text Extractor plugin can all recognize the English text without a problem. ChatGPT can even correctly recognize the Chinese words.
Interesting. As an aside I'm thankful that Joplin doesn't incorporate AI and personally I hope it never does!
I've actually had a lot of OCR success with all kinds of documents including crumpled up receipts and badly photographed pages of books. I've very often been pleasantly surprised as I really thought the main use case was trawling clearish PDFs.
But the key thing here is that they are all documents, and your image of course isn't. I might be wrong, but I think the issue is that you're expecting something the OCR in Joplin wasn't designed for: it's for text from documents, and text over a photo is a fair way from that. That's particularly true when the text shares colours with the image - I'm not surprised it couldn't be read for that reason - it's a big ask! Though you do say other software can do it, fair enough.
Anyway, I just really wanted to present a counter viewpoint - I've found Joplin's OCR to work really well with virtually everything I've thrown at it.
I would guess that Apple, OneNote and Google Keep send all your data to their servers for OCR, which makes it a lot easier since they can have a very large model running there. We do all this locally so that your data doesn't have to leave your computer.
I'm more surprised that the Text Extractor plugin works since it looks like they use Tesseract too. I'll give it a try and see how we can tweak our settings to improve it.
Ok I know what happened. In Obsidian I'm getting this text:
ETIrT ā " I ? r A ' ' ' AL r . 1 . G - 7 ' : v U1l ā ' | ' \ 1.7 eback Mountain (2005) ' ā . /?Iļ¬ f Havoc (2005) 1 4 3.Z1i5R%j Love & Other Drugs (2010) J 4.5 H9 I The Last Thing He Wanted (2020) āl '
Some of it is ok, but it's a lot of random characters, so Tesseract probably gave it a very low confidence value. In Joplin we'd discard such text since it's not very useful.
I'm not sure we should change this, as lowering the confidence threshold means a lot more poor-quality text would get saved to the search index and pollute it.
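To make the trade-off concrete, here is a minimal sketch of confidence-based filtering. It is not Joplin's actual code: the `OcrWord` type, the `filter_by_confidence` helper, the per-word granularity and all the confidence numbers are assumptions made up for illustration (Joplin may well filter per line or per page instead). It only assumes that the OCR engine reports a confidence score on a 0-100 scale, as Tesseract does.

```python
from dataclasses import dataclass


@dataclass
class OcrWord:
    text: str
    confidence: float  # assumed 0-100 scale


def filter_by_confidence(words, min_confidence):
    """Return the text of every word whose confidence meets the threshold."""
    return [w.text for w in words if w.confidence >= min_confidence]


# Hypothetical confidence values, invented for this example only.
words = [
    OcrWord("ETIrT", 12.0),
    OcrWord("Havoc", 71.0),
    OcrWord("(2005)", 83.0),
    OcrWord("3.Z1i5R%j", 9.0),
    OcrWord("Love", 64.0),
]

# A strict threshold keeps the index clean but drops useful words.
strict = filter_by_confidence(words, 80)

# A more lenient threshold keeps "Havoc" searchable, at the risk of
# letting in more noise on other images.
lenient = filter_by_confidence(words, 60)

print(strict)
print(lenient)
```

With these made-up numbers, the strict threshold would lose "Havoc" while the lenient one keeps it, which is the judgment call being discussed: where the cut-off sits decides how much noise ends up in the search index.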
Thanks, Laurent, for the explanation. It makes sense. The benefit of the other apps' approach is that, for example, if you search for "Havoc," you will find it among your notes, and the OCR results are still useful after minimal cleanup if you want to copy out the list of movies. If we were voting on this feature, I personally would vote to include the imperfect OCR results with random characters. It's no different from having a text note that contains some code or foreign words that aren't dictionary words. But it's a judgment call. Keep up the great work!
if you search for "Havoc," you will find it among your notes
It's true that it would help in this case. I will check if the threshold can be adjusted to better handle it.