Plugin: offline OCR (extract text from images, pdf, videos, etc)

Thanks.

Which operating systems are supported? Do I have to have tesseract installed?

All OS. No extra installation is required

2 Likes

Woow this is great. Was looking for something like this for some time.

Works great on images, less so on my handwriting...

+1 for the automatic addition to a note comment.

Awesome feature! I have tons of notes with cooking receipts as fotos. Always intended to migrate them to text. This plugin now triggers me to begin that and it opens great new opportunities for my usage of Joplin! Thx!

1 Like

Oh, I was enthusiastic yesterday, now it is I can't bring it to work :expressionless:
I installed the plugin, choose the language and marked the language checkbox before scanning. But then...it does not move up the 0%...any idea?

1

Joplin 2.5.4 (prod, win32)
Win 10 Version 21H1

I have the same issue with some images, it might be an image size problem.

You may try checking your image using one of online OCR tools.

This one for instance claims to be using the same lib (tesseract) as the plugin. If it can't extract text there's a good chance your image just isn't good enough.

Thx!

I gave it a try and it started. The result wasn‘t good, but in contrast to the plugin it startet and finished the process.

And btw in Joplin my tests with other images did not work as well - with my installation.

would you mind post the original image here so I can debug?

Voila!

Hi. Thanks a lot for the PDF support. It works great.

I am looking forward to scanning and recognizing all the content of the existing notes. How are you planning to implement it? What I mean is that is it going to, for example, scan everything and add the text as a comment under the image/pdf?

Yeah, like a lot of historical texts this is a tough case for OCR. You've got the variable contrast, multiple (and incomplete) columns of text, with some bonus wrinkles to boot. That said, I didn't have the plugin fail to work. It took a while for the needle to move off 0% and my 3900X CPU's usage briefly jumped up to 10% at one point (which almost never happens with a 12 core CPU), but it did yield results.

Alas a good part of the result seemed to be the obscure German dialect known a Gibberish. I tried cropping the image to just the main paragraph and both the time and OCR results improved. There were still issues such as Tesseract's guessing that "Schwulen-Aktivist Rosa von Praunheim" should be converted to "Schwulen- 7Er geagt von Praunheim" -- but that's hardly the fault of the plugin.

I had a somewhat similar result with a Japanese image using stylized characters including an advertising blurb paragraph tilted for emphasis. Tesseract rendered that emphasized paragraph as not existing. Isolating that portion of the image and rotating it in an external graphics editor, however, generated a perfect conversion via the plugin.

P.S. Many thanks to @ylc395 for this useful plugin.

2 Likes

Okay, sounds like I have to be more patient and use better quality of the pictures. Would not have expected that different problems come together when I randomly chose pictures for OCR.

Thanks so far!
I‘ll report here what I experience now. Maybe it can lead somewhere or help someone else.

Hi all,

I'm still trying to get this plugin running on my windows system ([in 10, version 21H1, build 19043.1348, Joplin 2.5.4 (prod, win32)]
Revision: 6eced7eversion After some unsuccessful attempts with v0.1.0 I now updated the OCR Plugin to v0.2.0 by deleting the old version and installing the current version.

When I try to scan an image now, it seems to start (yeah!) but then aborts the process and returns the following error instantly:

RuntimeError: abort(undefined). Build with -s ASSERTIONS=1 for more ifo.

I made another attempt with JoplinPortable on USB at another Win10 system and failed in the same way.

I have no experience in interpreting such error message and would be happy to get some help or info here :slight_smile:

@Wimvan Can you provide a screenshot of OCR modal(before recognizing)? In fact, I reproduced this error by using an incorrect language code.

1 Like

I hope you want me to show that:

And as addition the file I intended to use with OCR:

@Wimvan So here we got the answer: your language code is not correct. Language codes are on first column of this table:

1 Like

Yeah, the problem sat in front of the screen again! I didn't look at your screenshot above carefully enough...

Thank you @ylc395 for your quick and friendly help! :bouquet:
Now I will be able to test the plugin a lot :star_struck:

version 0.3.x is released! :firecracker: Auto recognition & text insertion is available now!

auto-ocr

5 Likes

I'm confused. There is a newer version at the plugin list already:
Screenshot(20)