Plugin: offline OCR (extract text from images, pdf, videos, etc)

kartoo · 8 November 2021 00:45

Thanks.

Which operating systems are supported? Do I have to have tesseract installed?

ylc395 · 8 November 2021 01:12

All OS. No extra installation is required

mabo · 10 November 2021 15:44

Woow this is great. Was looking for something like this for some time.

Works great on images, less so on my handwriting...

+1 for the automatic addition to a note comment.

Wimvan · 13 November 2021 07:49

Awesome feature! I have tons of notes with cooking receipts as fotos. Always intended to migrate them to text. This plugin now triggers me to begin that and it opens great new opportunities for my usage of Joplin! Thx!

Wimvan · 13 November 2021 20:07

Oh, I was enthusiastic yesterday, now it is I can't bring it to work
I installed the plugin, choose the language and marked the language checkbox before scanning. But then...it does not move up the 0%...any idea?

Joplin 2.5.4 (prod, win32)
Win 10 Version 21H1

kartoo · 13 November 2021 20:50

I have the same issue with some images, it might be an image size problem.

roman_r_m · 13 November 2021 20:56

You may try checking your image using one of online OCR tools.

This one for instance claims to be using the same lib (tesseract) as the plugin. If it can't extract text there's a good chance your image just isn't good enough.

Wimvan · 13 November 2021 21:33

Thx!

I gave it a try and it started. The result wasn‘t good, but in contrast to the plugin it startet and finished the process.

And btw in Joplin my tests with other images did not work as well - with my installation.

ylc395 · 14 November 2021 02:40

would you mind post the original image here so I can debug?

Wimvan · 14 November 2021 06:33

Voila!

Xindi · 14 November 2021 13:59

Hi. Thanks a lot for the PDF support. It works great.

I am looking forward to scanning and recognizing all the content of the existing notes. How are you planning to implement it? What I mean is that is it going to, for example, scan everything and add the text as a comment under the image/pdf?

ipsum · 14 November 2021 22:18

Yeah, like a lot of historical texts this is a tough case for OCR. You've got the variable contrast, multiple (and incomplete) columns of text, with some bonus wrinkles to boot. That said, I didn't have the plugin fail to work. It took a while for the needle to move off 0% and my 3900X CPU's usage briefly jumped up to 10% at one point (which almost never happens with a 12 core CPU), but it did yield results.

Alas a good part of the result seemed to be the obscure German dialect known a Gibberish. I tried cropping the image to just the main paragraph and both the time and OCR results improved. There were still issues such as Tesseract's guessing that "Schwulen-Aktivist Rosa von Praunheim" should be converted to "Schwulen- 7Er geagt von Praunheim" -- but that's hardly the fault of the plugin.

I had a somewhat similar result with a Japanese image using stylized characters including an advertising blurb paragraph tilted for emphasis. Tesseract rendered that emphasized paragraph as not existing. Isolating that portion of the image and rotating it in an external graphics editor, however, generated a perfect conversion via the plugin.

P.S. Many thanks to @ylc395 for this useful plugin.

Wimvan · 14 November 2021 22:56

Okay, sounds like I have to be more patient and use better quality of the pictures. Would not have expected that different problems come together when I randomly chose pictures for OCR.

Thanks so far!
I‘ll report here what I experience now. Maybe it can lead somewhere or help someone else.

Wimvan · 19 November 2021 14:11

Hi all,

I'm still trying to get this plugin running on my windows system ([in 10, version 21H1, build 19043.1348, Joplin 2.5.4 (prod, win32)]
Revision: 6eced7eversion After some unsuccessful attempts with v0.1.0 I now updated the OCR Plugin to v0.2.0 by deleting the old version and installing the current version.

When I try to scan an image now, it seems to start (yeah!) but then aborts the process and returns the following error instantly:

RuntimeError: abort(undefined). Build with -s ASSERTIONS=1 for more ifo.

I made another attempt with JoplinPortable on USB at another Win10 system and failed in the same way.

I have no experience in interpreting such error message and would be happy to get some help or info here

ylc395 · 19 November 2021 15:08

@Wimvan Can you provide a screenshot of OCR modal(before recognizing)? In fact, I reproduced this error by using an incorrect language code.

Wimvan · 19 November 2021 15:32

I hope you want me to show that:

And as addition the file I intended to use with OCR:

ylc395 · 19 November 2021 15:35

@Wimvan So here we got the answer: your language code is not correct. Language codes are on first column of this table:

Wimvan · 19 November 2021 16:53

Yeah, the problem sat in front of the screen again! I didn't look at your screenshot above carefully enough...

Thank you @ylc395 for your quick and friendly help!
Now I will be able to test the plugin a lot

ylc395 · 30 November 2021 11:48

version 0.3.x is released! Auto recognition & text insertion is available now!

auto-ocr

Wimvan · 30 November 2021 12:30

I'm confused. There is a newer version at the plugin list already:
Screenshot(20)

Topic		Replies	Views
What's new in Joplin 2.14 News	4	1430	10 March 2024
OCR for existing Joplin notes Apps	17	4540	12 April 2021
Search in resources plugin Plugins	17	3752	4 March 2023
OCR does it really work? Features	5	525	13 February 2025
Still some way to go for OCR Features	5	164	8 January 2025

Plugin: offline OCR (extract text from images, pdf, videos, etc)

Related topics