OCR in Joplin (How to)

Reviewing the discussions on this forum over the past three years or so, I see there have been a couple of attempts to introduce OCR to Joplin. Now that I am making serious use of the app, I would like an OCR capability and would appreciate advice on the current best way to implement it.

I have two scenarios.

  1. New notes captured with Joplin's Web Clipper that include an image; it would be good to extract any text from the image and add it to the note.
  2. Other notes imported from Evernote with an image or PDF attached; these need to be processed and, again, the text added to the note.

Can the same solution cover both cases? Some of the discussions suggest that some previous ideas have been superseded.

A step by step installation for Windows would be helpful for the non-developer.

Thanks.


+1 for this

Returning from the discussion in "OCR for existing Joplin notes", here are my "instructions" for how to achieve this in Windows.

These instructions are for Python 3.8 running in Anaconda 1.10.0 in Windows

At your Anaconda prompt install rest-uploader, ocr_joplin_notes and pytesseract

pip install rest-uploader

pip install ocr_joplin_notes

pip install pytesseract
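A quick way to confirm the three installs worked is to ask Python whether it can find each package. This is only a sketch: the import names `rest_uploader` and `ocr_joplin_notes` are assumed from the pip package names, and `missing_modules` is a helper name made up for this example.

```python
import importlib.util

# Import names assumed from the pip package names (hyphens become underscores).
REQUIRED = ["rest_uploader", "ocr_joplin_notes", "pytesseract"]


def missing_modules(names):
    """Return the names that Python cannot locate in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All three packages are importable.")
```

Run it from the same Anaconda prompt you used for pip, so that it checks the same Python environment.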

Install Tesseract as a regular Windows installation from the UB-Mannheim/tesseract wiki on GitHub ("Home · UB-Mannheim/tesseract Wiki · GitHub"). As per the recommendation there, I used tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe

Add the following User variable under Windows - Settings - Advanced System Settings - Environment Variables:

Variable name TESSDATA_PREFIX
Variable value C:\Program Files\Tesseract-OCR\tessdata

In the System variables, edit PATH to add the folder that contains python.exe (PATH entries are folders, not files). In my case:

C:\Users\graha\Anaconda3

and

C:\Program Files\Tesseract-OCR

Whilst there add your Joplin Token:

Variable name JOPLIN_TOKEN
Variable value "your Joplin Token from Joplin - Tools - Options - Web Clipper"

Check or add PYTHONPATH in System Variables

Variable name PYTHONPATH

Variable value C:\Users\<username>\Anaconda3\python.exe

In my case, I am using Python3.8 in Anaconda, but this needs to point to your Python executable.
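The settings above can be sanity-checked from Python before going any further. The following is a minimal sketch, not part of ocr-joplin-notes; `check_environment` is a hypothetical helper that only looks at the two variables named above and searches PATH for the Tesseract executable.

```python
import os
import shutil


def check_environment(env=os.environ):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    for name in ("TESSDATA_PREFIX", "JOPLIN_TOKEN"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    tessdata = env.get("TESSDATA_PREFIX")
    if tessdata and not os.path.isdir(tessdata):
        problems.append(f"TESSDATA_PREFIX points to a missing folder: {tessdata}")
    # shutil.which searches the folders listed in PATH, as the command prompt does.
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("tesseract was not found on PATH")
    return problems


if __name__ == "__main__":
    found = check_environment()
    print("\n".join(found) if found else "Environment looks fine.")
```

Open a new command prompt before running it, because a prompt that was already open will not see newly added variables.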

Despite Anaconda being up to date, there were some issues with the installed packages; in one case two versions of the same package were installed. The solution is to uninstall and reinstall the relevant packages using pip uninstall/install at the Anaconda prompt.

I spotted these when trying to run the final ocr_joplin_notes command: the affected packages were named in the error messages, so I fixed them one by one.

mkl-service was missing and needed to be installed, so here is the complete list:

conda install -c conda-forge mkl-service

pip uninstall opencv-python

pip install opencv-python

pip uninstall numpy (first time to remove numpy-1.20.2)

pip uninstall numpy (second time to remove numpy-1.19.2)

pip install numpy (installs a clean version of numpy-1.20.2)

pip uninstall pillow

pip install pillow
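Rather than waiting for error messages, it may be possible to spot doubled-up packages directly. The sketch below lists every installed distribution Python can see and reports any name that appears with more than one version. It assumes the duplicates leave separate metadata folders behind, which may not hold for every broken install; `group_versions` and `duplicated` are names invented for this example.

```python
from collections import defaultdict
from importlib import metadata


def group_versions(pairs):
    """Map each normalised package name to the sorted list of versions seen."""
    found = defaultdict(set)
    for name, version in pairs:
        if name and version:  # skip entries with broken metadata
            found[name.lower().replace("_", "-")].add(version)
    return {name: sorted(versions) for name, versions in found.items()}


def duplicated(pairs):
    """Keep only the packages that appear with more than one version."""
    return {n: v for n, v in group_versions(pairs).items() if len(v) > 1}


if __name__ == "__main__":
    installed = [(d.metadata["Name"], d.version) for d in metadata.distributions()]
    for name, versions in sorted(duplicated(installed).items()):
        print(f"{name}: {', '.join(versions)}")
```

Any package it prints is a candidate for the repeated pip uninstall followed by a single pip install described above.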

Now make sure Joplin is running, and back up all your notes to a JEX file first: Joplin - File - Export All - JEX

Then at a regular Windows Command prompt I can run:

python -m ocr_joplin_notes.cli --mode=TAG_NOTES

and it proceeds to tag all my notes with the scheme described under "Mode TAG_NOTES" in the README of GitHub - plamola/ocr-joplin-notes (Add OCR data from PDF and image files as a comment in Joplin, to enable full-text search).
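If the command fails to connect, it is worth checking that Joplin's Web Clipper service is actually reachable. A minimal sketch, assuming the service is enabled and listening on its default port 41184 (the actual port is shown under Joplin - Tools - Options - Web Clipper); `joplin_is_running` is a hypothetical helper, not part of ocr-joplin-notes.

```python
import urllib.request

# Default address of the Web Clipper service; change the port if Joplin shows another.
DEFAULT_URL = "http://localhost:41184"


def joplin_is_running(base_url=DEFAULT_URL, timeout=3):
    """Return True if the service answers /ping with its usual identifier."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as response:
            return response.read().decode("utf-8").strip() == "JoplinClipperServer"
    except OSError:
        # Covers refused connections, timeouts and urllib's URLError.
        return False


if __name__ == "__main__":
    print("Joplin is reachable." if joplin_is_running() else "Joplin did not answer.")
```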


harun27 has now released an update to the original Python package ocr-joplin-notes that enables OCR of existing notes with both image and PDF attachments.

Please tell me how to turn on the OCR function. I installed Joplin today and checked the doc here: Optical Character Recognition (OCR) | Joplin
but I still can't figure out where the switch is. Help? Thanks!

Go to the settings in Joplin version 2.14:

I downloaded the latest version and it is 2.13?? Is 2.14 a beta or something? It was not mentioned in the docs. Thanks.

Yes, a preview version.

@kish

read this topic and read the updated GSoC 2024 live blog

Was the idea in ideas-2024.md proposed for Node.js or JavaScript? Here they have performed OCR using Python, and that's complicated for an average person to set up. This is achievable in JS.

OCR for existing Joplin Notes: Feb '21

Shall I start with proposing this idea in the features section?

why that? haven't you read yet:

Oh yeah. Missed that, apologies. Thanks for taking your time

Here are some parts of the Joplin codebase that might be helpful to look at:

Silly question... I've updated my joplin to 2.14.x and activated OCR.
Does OCR now still only work for new notes?

No, it's going to process all your existing notes too

perfect - can I see the progress in the log or elsewhere?
(I have about 3k notes with attachments)

Yes, if you open the console and filter on "ocr". See joplinapp.org/debugging
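The same filter can be applied to the log file outside the app. This is a rough sketch that assumes the desktop profile log is at `.config/joplin-desktop/log.txt` under your home folder and that OCR messages are written there as well as to the console; `ocr_lines` is a made-up helper name.

```python
from pathlib import Path


def ocr_lines(lines, keyword="ocr"):
    """Return the lines that mention the keyword, ignoring case."""
    needle = keyword.lower()
    return [line.rstrip("\n") for line in lines if needle in line.lower()]


if __name__ == "__main__":
    # Assumed log location; adjust it if your profile lives elsewhere.
    log_path = Path.home() / ".config" / "joplin-desktop" / "log.txt"
    if log_path.exists():
        with log_path.open(encoding="utf-8", errors="replace") as handle:
            for line in ocr_lines(handle):
                print(line)
    else:
        print(f"No log file found at {log_path}")
```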

I'm getting this error in the log
image