
OCR in Joplin (How to)

Reviewing the discussions on this forum over the past 3 years or so, I can see there have been a couple of attempts to introduce OCR to Joplin. Now that I am making serious use of the app I would like to have an OCR capability, and I would appreciate advice on the current best way to implement it.

I have two scenarios.

  1. New notes captured with Joplin's Web Clipper that include an image. It would be good to extract any text from the image and add it to the note.
  2. Other notes imported from Evernote that have an image or PDF attached. These need to be processed and, again, the text added to the note.

Can the same solution cover both cases? Some of the discussions suggest that some previous ideas have been superseded.

A step-by-step installation guide for Windows would be helpful for non-developers.

Thanks.


+1 for this

Following on from a discussion in "OCR for existing Joplin notes", here are my "instructions" for how to achieve this in Windows.

These instructions are for Python 3.8 running in Anaconda 1.10.0 on Windows.

At your Anaconda prompt, install rest-uploader, ocr_joplin_notes and pytesseract:

pip install rest-uploader

pip install ocr_joplin_notes

pip install pytesseract
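To confirm the three installs landed in the interpreter you are actually using, something like the sketch below may help. The import name ocr_joplin_notes matches the command used later in this post; rest_uploader and pytesseract are assumed to be the import names of the other two packages, and missing_modules is just a helper name made up for this example.

```python
import importlib.util

# Import names are assumptions: ocr_joplin_notes matches the command used
# later in this post; rest_uploader and pytesseract are the presumed import
# names for the other two pip packages.
REQUIRED_MODULES = ("rest_uploader", "ocr_joplin_notes", "pytesseract")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

Run it from the same Anaconda prompt used for the pip commands. If a module is reported missing, pip probably installed it into a different Python environment.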

Install Tesseract as a regular Windows installation from the UB-Mannheim page (Home · UB-Mannheim/tesseract Wiki · GitHub). As per the recommendation there, I used tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe.

Add the following User variable under Windows - Settings - Advanced System Settings:

Variable name TESSDATA_PREFIX
Variable value C:\Program Files\Tesseract-OCR\tessdata

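In the System variables, edit PATH to add the following two folders. PATH entries are folders rather than files, so add the folder that contains python.exe, not the executable itself:

C:\Users\graha\Anaconda3

and

C:\Program Files\Tesseract-OCR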

Whilst there add your Joplin Token:

Variable name JOPLIN_TOKEN
Variable value "your Joplin Token from Joplin - Tools - Options - Web Clipper"

Check or add PYTHONPATH in System Variables

Variable name PYTHONPATH

Variable value C:\Users\<username>\Anaconda3\python.exe

In my case I am using Python 3.8 in Anaconda, but this needs to point to your own Python executable.
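Environment variables only take effect in prompts opened after they were saved, so it can be worth checking them from a fresh prompt. The sketch below is one possible check, not part of any of the packages above: check_setup is a made-up helper name, and it only looks at the variables and the Tesseract folder named in this post.

```python
import os
import shutil


def check_setup(env=None):
    """Return a list of human-readable problems found in the environment."""
    env = os.environ if env is None else env
    problems = []

    # Both variables are the ones set in the steps above.
    for name in ("TESSDATA_PREFIX", "JOPLIN_TOKEN"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    # If TESSDATA_PREFIX is set, it should point at an existing folder.
    tessdata = env.get("TESSDATA_PREFIX")
    if tessdata and not os.path.isdir(tessdata):
        problems.append(f"TESSDATA_PREFIX folder does not exist: {tessdata}")

    # tesseract must be reachable through one of the folders listed in PATH.
    if shutil.which("tesseract", path=env.get("PATH", "")) is None:
        problems.append("tesseract was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print("Problem:", problem)
```

An empty result suggests the variables are visible to Python. It does not prove the token itself is valid.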

Even though Anaconda was up to date with the latest version, there were some issues with the installed packages. In one case two versions of the same package were installed. The solution is to uninstall and reinstall the relevant packages using pip uninstall/install at the Anaconda prompt.

I spotted these when trying to run the final ocr_joplin_notes command. The affected packages were named in the error messages, so I fixed them one by one.

mkl-service was missing and needed to be installed, so here is the complete list:

conda install -c conda-forge mkl-service

pip uninstall opencv-python

pip install opencv-python

pip uninstall numpy (first time, to remove numpy-1.20.2)

pip uninstall numpy (second time, to remove numpy-1.19.2)

pip install numpy (installs a clean copy of numpy-1.20.2)

pip uninstall pillow

pip install pillow
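Rather than waiting for error messages, a package that is installed twice can sometimes be spotted up front. The following is a rough sketch using importlib.metadata from the standard library (available from Python 3.8). The helper names find_duplicates and installed_pairs are made up for this example, and whether a stale copy shows up this way depends on how it was left behind on disk.

```python
from collections import defaultdict
from importlib import metadata  # standard library from Python 3.8


def find_duplicates(pairs):
    """Given (name, version) pairs, return names that appear with 2+ versions."""
    versions = defaultdict(set)
    for name, version in pairs:
        if name:  # skip broken entries that have no name
            versions[name.lower()].add(version)
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


def installed_pairs():
    """List (name, version) for every distribution visible to this interpreter."""
    return [(dist.metadata["Name"], dist.version) for dist in metadata.distributions()]


if __name__ == "__main__":
    duplicates = find_duplicates(installed_pairs())
    for name, found in duplicates.items():
        print(f"{name} is installed more than once: {', '.join(found)}")
    if not duplicates:
        print("No duplicate installations found.")
```

Any package it reports is a candidate for the uninstall-twice, install-once treatment shown above for numpy.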

Now make sure Joplin is running, and back up all your notes to a JEX file first (Joplin - File - Export All - JEX).
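"Running" here also means the Web Clipper service is enabled, since that is what the token belongs to. A quick way to test this is to ping the service. The sketch below assumes the default address http://localhost:41184 and the "JoplinClipperServer" reply that Joplin's ping endpoint gives. If you changed the port in Joplin's options, pass your own address. joplin_is_running is a helper name invented for this example.

```python
import http.client
import urllib.request


def joplin_is_running(base_url="http://localhost:41184", timeout=2.0):
    """Return True if the Joplin Web Clipper service answers a ping."""
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace").strip()
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or a malformed reply.
        return False
    return body == "JoplinClipperServer"


if __name__ == "__main__":
    print("Joplin Web Clipper service reachable:", joplin_is_running())
```

If this prints False, check Joplin - Tools - Options - Web Clipper and make sure the service has been started.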

Then at a regular Windows Command prompt I can run:

python -m ocr_joplin_notes.cli --mode=TAG_NOTES

It proceeds to tag all my notes using the scheme described under "Mode TAG_NOTES" in the plamola/ocr-joplin-notes repository (GitHub - plamola/ocr-joplin-notes: Rewrite the Evernote notes, to include OCR data).
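If you would rather start the tagging run from a script than type the command each time, a small wrapper along these lines might do. It is only a sketch: build_command and run_ocr are names made up for this example, and the only option used is the --mode=TAG_NOTES one shown above. It uses sys.executable so that the same Python that runs the script also runs ocr_joplin_notes, which sidesteps PATH confusion.

```python
import os
import subprocess
import sys


def build_command(mode="TAG_NOTES"):
    """Build the same command as above, using the current Python interpreter."""
    return [sys.executable, "-m", "ocr_joplin_notes.cli", f"--mode={mode}"]


def run_ocr(mode="TAG_NOTES", env=None):
    """Run ocr_joplin_notes and return its exit code.

    Refuses to start if JOPLIN_TOKEN is missing from the environment.
    """
    env = os.environ if env is None else env
    if not env.get("JOPLIN_TOKEN"):
        raise RuntimeError("JOPLIN_TOKEN is not set")
    completed = subprocess.run(build_command(mode), env=dict(env), check=False)
    return completed.returncode
```

Calling run_ocr() with Joplin open should behave like the command typed at the prompt. A non-zero return value means the run reported an error.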
