OCR in Joplin (How to)

Following up on the discussion in "OCR for existing Joplin notes", here are my "instructions" for how to achieve this in Windows.

These instructions are for Python 3.8 running in Anaconda 1.10.0 in Windows

At your Anaconda prompt, install rest-uploader, ocr_joplin_notes and pytesseract:

pip install rest-uploader

pip install ocr_joplin_notes

pip install pytesseract

Install Tesseract as a regular Windows installation from the UB-Mannheim/tesseract wiki on GitHub. As per the recommendation there, I used tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe

Add User Variables in Windows - Settings - Advanced System Settings

Variable name TESSDATA_PREFIX
Variable value C:\Program Files\Tesseract-OCR\tessdata

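In the System variables, edit PATH to add the folder that contains your Python executable (PATH entries are folders, so point at the folder rather than at python.exe itself):

C:\Users\graha\Anaconda3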

and

C:\Program Files\Tesseract-OCR

Whilst there add your Joplin Token:

Variable name JOPLIN_TOKEN
Variable value "your Joplin Token from Joplin - Tools - Options - Web Clipper"

Check or add PYTHONPATH in System Variables

Variable name PYTHONPATH

Variable value C:\Users\<username>\Anaconda3\python.exe

In my case, I am using Python3.8 in Anaconda, but this needs to point to your Python executable.
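The variables above are easy to mistype, so here is a minimal sketch of a check you could run before going further. The function name find_setup_problems and its parameters are my own invention for illustration, not part of ocr_joplin_notes; it only assumes the variable names given above and that the Tesseract executable is called tesseract.

```python
import os
import shutil

# Variable names taken from the steps above.
REQUIRED_VARIABLES = ("TESSDATA_PREFIX", "JOPLIN_TOKEN")


def find_setup_problems(env=None, which=shutil.which):
    """Return a list of problems found; an empty list means every check passed.

    env and which are parameters only so the check can be tried out
    without touching the real environment (hypothetical helper).
    """
    env = os.environ if env is None else env
    problems = []

    # Each required variable must exist and be non-empty.
    for name in REQUIRED_VARIABLES:
        if not env.get(name, "").strip():
            problems.append(f"environment variable {name} is not set")

    # TESSDATA_PREFIX should point at an existing folder.
    tessdata = env.get("TESSDATA_PREFIX", "")
    if tessdata and not os.path.isdir(tessdata):
        problems.append(f"TESSDATA_PREFIX folder does not exist: {tessdata}")

    # The Tesseract folder must be on PATH for the executable to be found.
    if which("tesseract") is None:
        problems.append("tesseract was not found on PATH")

    return problems


if __name__ == "__main__":
    for problem in find_setup_problems():
        print("Problem:", problem)
```

Open a new prompt before running it, because Windows only passes changed variables to programs started after the change.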

Even though Anaconda was up to date with the latest version, there were some issues with the installed packages. In one case, two versions of the same package were installed. The solution is to uninstall and reinstall the relevant packages using pip uninstall/install at the Anaconda prompt.

I spotted these when trying to run the final instruction for ocr_joplin_notes. The offending packages were mentioned in the error messages, so I fixed them one by one.

mkl-service was missing and needed to be installed, so here is the complete list:

conda install -c conda-forge mkl-service

pip uninstall opencv-python

pip install opencv-python

pip uninstall numpy (first time to remove numpy-1.20.2)

pip uninstall numpy (second time to remove numpy-1.19.2)

pip install numpy (installs a clean version of numpy-1.20.2)

pip uninstall pillow

pip install pillow
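Rather than waiting for error messages, a rough way to spot a package that is installed in two versions is to list what Python itself can see. This is only a sketch: find_duplicate_installs and normalize are names I made up, and it relies on importlib.metadata, which is in the standard library from Python 3.8.

```python
from collections import defaultdict
from importlib import metadata


def normalize(name):
    """Treat numpy, NumPy and similar spellings as the same package."""
    return name.lower().replace("_", "-").replace(".", "-")


def find_duplicate_installs(distributions=None):
    """Map package name -> list of versions, for packages seen more than once.

    distributions defaults to everything importlib.metadata can find on
    the current Python path (hypothetical helper, not a pip feature).
    """
    if distributions is None:
        distributions = metadata.distributions()

    versions_by_name = defaultdict(set)
    for dist in distributions:
        name = dist.metadata["Name"]
        version = dist.version
        # Skip leftovers with damaged metadata instead of crashing on them.
        if name and version:
            versions_by_name[normalize(name)].add(version)

    return {
        name: sorted(versions)
        for name, versions in versions_by_name.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    for name, versions in find_duplicate_installs().items():
        print(f"{name} is installed more than once: {', '.join(versions)}")
```

Anything it reports is a candidate for the uninstall/uninstall/install routine shown for numpy above.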

Now make sure Joplin is running. Back up all your notes to a JEX file first: Joplin - File - Export All - JEX
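If you want to confirm from Python that Joplin is reachable, the Web Clipper service answers a /ping request with the text JoplinClipperServer. The sketch below assumes the default port 41184 (Joplin may pick a nearby port if that one is taken); joplin_is_running is a made-up helper name.

```python
import http.client
import urllib.request


def joplin_is_running(base_url="http://localhost:41184", timeout=2.0):
    """Return True if the Joplin Web Clipper service answers at base_url.

    base_url assumes the default Web Clipper port; adjust it if Joplin
    shows a different port under Tools - Options - Web Clipper.
    """
    # Ignore any system proxy settings: the service runs on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(f"{base_url}/ping", timeout=timeout) as response:
            reply = response.read().decode("utf-8", errors="replace").strip()
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply.
        return False
    return reply == "JoplinClipperServer"
```

Calling joplin_is_running() returns False when Joplin is closed or the Web Clipper service is switched off, which is worth ruling out before running the OCR step.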

Then at a regular Windows Command prompt I can run:

python -m ocr_joplin_notes.cli --mode=TAG_NOTES

and it proceeds to tag all my notes with the scheme described under "Mode TAG_NOTES" in the plamola/ocr-joplin-notes GitHub repository (Add OCR data from PDF and image files as a comment in Joplin, to enable full-text search).
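If the regular command prompt picks up a different Python than the Anaconda one, the same command can be started from a script using the interpreter that is already running. This is just a sketch: build_ocr_command is a name I made up, and TAG_NOTES is the only mode used in this post.

```python
import subprocess
import sys


def build_ocr_command(mode="TAG_NOTES", python=sys.executable):
    """Build the command line from this post as a list of arguments.

    python defaults to the interpreter running this script, so the
    packages installed above are the ones that get used.
    """
    return [python, "-m", "ocr_joplin_notes.cli", f"--mode={mode}"]


def run_ocr(mode="TAG_NOTES"):
    """Run the command and return its exit code (0 normally means success)."""
    completed = subprocess.run(build_ocr_command(mode))
    return completed.returncode
```

Calling run_ocr() from a script started at the Anaconda prompt does the same thing as typing the command above.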
