OCR in Joplin (How to)

myfta · 7 April 2021 16:40

Reviewing the discussions on this forum over the past 3 years or so, I see there have been a couple of attempts to introduce OCR to Joplin. Now that I am making serious use of the app, I would like to have an OCR capability and would appreciate advice on the current best way to implement it.

I have two scenarios.

New notes captured with Joplin's Web Clipper that include an image; I would like to extract any text from the image and add it to the note.
Other notes, with either an image or a PDF attached, that were imported from Evernote; these need to be processed and the text added to the note in the same way.

Can the same solution cover both cases? Some of the discussions suggest that some previous ideas have been superseded.

A step by step installation for Windows would be helpful for the non-developer.

Thanks.

srm39 · 9 April 2021 10:02

+1 for this

myfta · 12 April 2021 17:10

Following the discussion in "OCR for existing Joplin notes", here are my "instructions" for how to achieve this in Windows.

These instructions are for Python 3.8 running in Anaconda 1.10.0 in Windows

At your Anaconda prompt install rest-uploader, ocr_joplin_notes and pytesseract

pip install rest-uploader

pip install ocr_joplin_notes

pip install pytesseract
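Before going further, it may help to confirm that the three packages actually landed in the Python you are running. The following is only a sketch: the import names in REQUIRED_MODULES are assumptions based on the usual convention that hyphens in a pip package name become underscores in the module name (ocr_joplin_notes matches the command used later in these instructions; rest_uploader is assumed).

```python
import importlib.util

# Import names are assumptions: pip package names use hyphens,
# module names normally use underscores.
REQUIRED_MODULES = ["rest_uploader", "ocr_joplin_notes", "pytesseract"]


def missing_modules(names):
    """Return the names from `names` that Python cannot find for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it from the same Anaconda prompt you used for pip, so that it checks the same Python installation.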

Install Tesseract as a regular Windows installation from the UB-Mannheim/tesseract wiki on GitHub (Home · UB-Mannheim/tesseract Wiki · GitHub). As per the recommendation there, I used tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe.

Add User Variables in Windows - Settings - Advanced System Settings

Variable name TESSDATA_PREFIX
Variable value C:\Program Files\Tesseract-OCR\tessdata

In the System variables edit PATH to ADD

C:\Users\graha\Anaconda3\python.exe

and

C:\Program Files\Tesseract-OCR

Whilst there add your Joplin Token:

Variable name JOPLIN_TOKEN
Variable value "your Joplin Token from Joplin - Tools - Options - Web Clipper"

Check or add PYTHONPATH in System Variables

Variable name PYTHONPATH

Variable value C:\Users\<username>\Anaconda3\python.exe

In my case, I am using Python3.8 in Anaconda, but this needs to point to your Python executable.
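A quick way to check the variables above is a small Python script like the one below. It is a minimal sketch, not part of any of the packages: check_environment is a hypothetical helper name, and it only tests that TESSDATA_PREFIX and JOPLIN_TOKEN are non-empty and that an executable called tesseract can be found on PATH.

```python
import os
import shutil


def check_environment(env, tesseract_path):
    """Return a list of human-readable problems; an empty list means OK.

    env            -- a mapping such as os.environ
    tesseract_path -- result of shutil.which("tesseract"), or None
    """
    problems = []
    for name in ("TESSDATA_PREFIX", "JOPLIN_TOKEN"):
        if not env.get(name, "").strip():
            problems.append(f"{name} is not set")
    if tesseract_path is None:
        problems.append("tesseract was not found on PATH")
    return problems


if __name__ == "__main__":
    found = check_environment(os.environ, shutil.which("tesseract"))
    for problem in found:
        print("Problem:", problem)
    if not found:
        print("Environment looks complete.")
```

Open a new command prompt before running it: prompts that were already open when you edited the variables keep the old values.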

Despite Anaconda being up to date with the latest version, there were some issues with the installed packages; in one case two versions of the same package were installed. The solution is to uninstall and reinstall the relevant packages using pip uninstall/install at the Anaconda prompt.

I spotted these when trying to run the final ocr_joplin_notes command: the affected packages were named in the error messages, so I fixed them one by one.

mkl-service was also missing and needed to be installed. Here is the complete list of commands:

conda install -c conda-forge mkl-service

pip uninstall opencv-python

pip install opencv-python

pip uninstall numpy (first time, to remove numpy-1.20.2)

pip uninstall numpy (second time, to remove numpy-1.19.2)

pip install numpy (installs a clean version of numpy-1.20.2)

pip uninstall pillow

pip install pillow
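If you would rather look for doubled-up packages before hitting the error messages, something along these lines may help. It is a sketch under assumptions: it relies on the standard library's importlib.metadata (available from Python 3.8), the function names are made up for illustration, and whether a duplicate install shows up this way depends on how the two copies ended up on disk.

```python
from collections import defaultdict
from importlib import metadata  # standard library from Python 3.8


def installed_pairs():
    """Yield (name, version) for every distribution visible on sys.path."""
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken entries without a name
            yield name, dist.version


def find_duplicates(pairs):
    """Map each package name seen with more than one version to its versions."""
    versions = defaultdict(set)
    for name, version in pairs:
        # Treat "Pillow", "pillow" and "opencv_python" style variants alike.
        key = name.lower().replace("_", "-")
        versions[key].add(version)
    return {k: sorted(v) for k, v in versions.items() if len(v) > 1}


if __name__ == "__main__":
    for name, found in find_duplicates(installed_pairs()).items():
        print(f"{name}: {', '.join(found)}")
```

Any package it prints is a candidate for the uninstall-twice-then-reinstall treatment shown above for numpy.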

Now make sure Joplin is running. Back up all your notes to a JEX file via Joplin - File - Export All - JEX.
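To check that Joplin is reachable, you can ask its Web Clipper service for a reply. This sketch assumes the service is enabled in Joplin - Tools - Options - Web Clipper and is listening on its default port 41184; the function names are illustrative only.

```python
import urllib.error
import urllib.request

# Default address of Joplin's Web Clipper service (assumption: the port
# has not been changed in the Joplin settings).
DEFAULT_URL = "http://localhost:41184/ping"


def is_clipper_response(body):
    """True if `body` is the text Joplin's clipper service sends for /ping."""
    return body.strip() == "JoplinClipperServer"


def joplin_is_running(url=DEFAULT_URL, timeout=3):
    """Return True if a Joplin Web Clipper service answers at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_clipper_response(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    if joplin_is_running():
        print("Joplin Web Clipper service is responding.")
    else:
        print("No reply from Joplin; is it running with the Web Clipper enabled?")
```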

Then at a regular Windows Command prompt I can run:

python -m ocr_joplin_notes.cli --mode=TAG_NOTES

and it proceeds to tag all my notes with the scheme described under Mode TAG_NOTES in GitHub - plamola/ocr-joplin-notes: Add OCR data from PDF and image files as a comment in Joplin, to enable full-text search.

myfta · 24 October 2021 18:37

harun27 has now released an update to the original Python package ocr-joplin-notes that enables the OCR of existing notes with both images and pdf attachments.

bradwww · 13 January 2024 04:43

Please tell me how to turn on the OCR function? I installed Joplin today and checked the doc here: Optical Character Recognition (OCR) | Joplin
and still can't figure out where the switch is. Help? Thanks!

JackGruber · 13 January 2024 08:16

Go to the settings in Joplin version 2.14.

bradwww · 13 January 2024 22:25

I downloaded the latest version and it is 2.13. Is 2.14 a beta or something? It was not mentioned in the docs. Thanks.

JackGruber · 13 January 2024 22:28

Yes a preview version

PackElend · 24 February 2024 19:38

@kish

read this topic and read the updated GSoC 2024 live blog

kish · 25 February 2024 04:38

Was the idea in ideas-2024.md proposed for Node.js or JavaScript? Here they have performed OCR using Python, and that's complicated for an average person to set up. This is achievable in JS.

kish · 25 February 2024 07:37

OCR for existing Joplin Notes: Feb '21

Shall I start with proposing this idea in the features section?

PackElend · 25 February 2024 08:59

why that? haven't you read yet:

kish · 25 February 2024 09:13

Oh yeah. Missed that, apologies. Thanks for taking your time

personalizedrefriger · 25 February 2024 14:39

Here are some parts of the Joplin codebase that might be helpful to look at:

joplin/packages/lib/services/ocr at dev · laurent22/joplin · GitHub
The lines in NoteEditor.tsx responsible for showing resource search results.

Isaac · 5 March 2024 13:23

Silly question... I've updated my Joplin to 2.14.x and activated OCR.
Does OCR now still only work for new notes?

laurent · 5 March 2024 13:59

No, it's going to process all your existing notes too

Isaac · 5 March 2024 14:36

perfect - can I see the progress in the log or elsewhere?
(I have about 3k notes with attachments)

laurent · 5 March 2024 15:30

Yes, if you open the console and filter on "ocr". See joplinapp.org/debugging

Isaac · 5 March 2024 16:07

I'm getting this error in the log
