OCR for existing Joplin notes

plamola · 14 February 2021 21:30

After I migrated my Evernote notes to Joplin, I was missing the functionality to do a full text search in attachments. I got inspired by a post on this forum:

After looking at how rest-uploader does OCR, I decided to write a script that could add OCR text to existing Joplin notes.

The ocr-joplin-notes script can read notes from Joplin via the web clipper interface, OCR any image or PDF and insert the text as a comment block in the note. In case of a PDF document, it can also add a preview image.
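
To illustrate the idea of inserting OCR text as a comment block, here is a minimal sketch. It is not the actual code from ocr-joplin-notes: the function names and the marker strings are made up for this example, and it assumes the note body is Markdown, so an HTML comment stays out of the rendered note while remaining part of the note text.

```python
# Hypothetical markers, not the ones used by ocr-joplin-notes.
OCR_START = "<!-- ocr-text-start"
OCR_END = "ocr-text-end -->"


def already_has_ocr(body: str) -> bool:
    """Return True if the note body already contains an OCR comment block."""
    return OCR_START in body


def add_ocr_comment(body: str, ocr_text: str) -> str:
    """Append OCR text to a Markdown note body as a hidden HTML comment."""
    if already_has_ocr(body):
        return body  # do not add the block twice
    # "-->" inside the recognised text would close the comment early.
    safe_text = ocr_text.replace("-->", "-- >")
    return f"{body.rstrip()}\n\n{OCR_START}\n{safe_text.strip()}\n{OCR_END}\n"
```

Calling add_ocr_comment() a second time on the same body returns it unchanged, which is one simple way to avoid processing a note twice.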

The script has a simple detection algorithm to skip notes it suspects were created by rest-uploader, as well as notes it has already processed. The current version of this script requires a tag to be supplied on the command line. It will only process the notes with that specific tag. Once all notes with that tag have been processed, the script will terminate. More details can be found in the readme.
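
For anyone curious how "only the notes with a specific tag" can work against the web clipper (Data) API, below is a rough sketch, not the script's real code. It assumes the API's GET /tags/:id/notes endpoint and its paginated responses (an "items" list plus a "has_more" flag). The helper names and the injectable fetch function are invented for the example, so the loop can be tried without a running Joplin.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator


def fetch_json(url: str) -> Dict:
    """Default fetcher: GET the URL and decode the JSON response."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def notes_with_tag(
    server: str,
    token: str,
    tag_id: str,
    fetch: Callable[[str], Dict] = fetch_json,
) -> Iterator[Dict]:
    """Yield every note carrying the given tag, following pagination."""
    page = 1
    while True:
        query = urllib.parse.urlencode(
            {"fields": "id,title", "limit": 10, "page": page, "token": token}
        )
        result = fetch(f"{server}/tags/{tag_id}/notes?{query}")
        yield from result["items"]
        if not result.get("has_more"):
            break  # last page reached, so the run can terminate
        page += 1
```

The loop stops as soon as a page reports no further results, which matches the "process all tagged notes, then terminate" behaviour described above.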

The ocr-joplin-notes script is written in Python and has been tested on Ubuntu.

I just published a first version to both GitHub and PyPI.

Enjoy

jb261 · 15 February 2021 01:07

Would it be possible to implement this as a plugin?

plamola · 15 February 2021 06:38

I guess this could also be implemented as a plugin.

It would be a completely different beast, since a plugin needs to be written in JavaScript, whereas this is written in Python. It might also require the user to install additional libraries to do the actual OCR part.

It's not a project I'm going to take on in the foreseeable future.

myfta · 10 April 2021 17:18

Hi, in your Docker instructions you indicate a docker-env file is required with the Joplin token. Where should this docker-env file be placed in a Windows 10 installation?

Thanks

plamola · 10 April 2021 17:48

The --env-file parameter of the docker command allows you to specify the full path to a file. The file can also be named anything you want.

The example command in the documentation of ocr-joplin-notes looks for a docker-env file in the current directory.

myfta · 10 April 2021 19:57

Thanks, that's sorted and now starts to run. But I get this error:

Environment variable JOPLIN_SERVER not set, using default value: http://localhost:41184

What am I missing?

plamola · 11 April 2021 07:37

Typically, your Joplin client listens on http://localhost:41184, for web clipper communication. So this default should be fine.

If for some reason you need to override the default, you could add a JOPLIN_SERVER=http://your.url:port line to the environment file.
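
As a minimal sketch of how such a lookup with a fallback typically looks in Python (an illustration, not copied from the script): the function name is made up, and JOPLIN_TOKEN as the name of the token variable is an assumption here, so check the readme for the exact name.

```python
import os
from typing import Mapping, Tuple

DEFAULT_SERVER = "http://localhost:41184"


def joplin_settings(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (server, token) from the environment, defaulting the server."""
    server = env.get("JOPLIN_SERVER")
    if not server:
        print(
            "Environment variable JOPLIN_SERVER not set, "
            f"using default value: {DEFAULT_SERVER}"
        )
        server = DEFAULT_SERVER
    # JOPLIN_TOKEN is an assumed variable name for this sketch.
    token = env.get("JOPLIN_TOKEN")
    if not token:
        raise RuntimeError("JOPLIN_TOKEN is not set")
    return server.rstrip("/"), token
```

With this pattern a missing JOPLIN_SERVER only produces a notice, while a missing token stops the run.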

myfta · 11 April 2021 08:26

I checked the web clipper options page again in Joplin and it confirmed that it was already running on port 41184.
However, just to be sure I added the JOPLIN_SERVER line to the docker-env file. That has got rid of the error, but it still fails to run, here is the complete output. (I have checked the token value again, just to be sure)

docker run --env-file ./docker-env --network="host" plamola/ocr-joplin-notes:0.2.3 python -m ocr_joplin_notes.cli --mode=TAG_NOTES
Mode: TAG_NOTES
Language: eng
Add previews: yes
Autorotation: yes
Tagging notes. This might take a while. You can follow the progress by watching the tags in Joplin
Connection Error. URL: http://localhost:41184/notes?order_by=title&limit=10&page=1&token=xxxxxxxx

Thanks.

plamola · 11 April 2021 09:48

If the URL in the error message is the actual URL you got (that is, you did not replace your token with xxxxxxxx before posting it here), then you've not set up the token correctly in the environment file.

If you paste the URL from the error message into a browser, you should get a response from Joplin, assuming you have the right token set up. If that doesn't work, something might be blocking access to the port. In that case, you will probably encounter a similar problem when trying to use the Joplin web clipper plugin in your browser.
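
The same check can be done from Python instead of a browser. A small sketch, assuming the /ping endpoint of the Joplin Data API (which should answer with the text JoplinClipperServer when the service is up); the function names are just for illustration:

```python
import urllib.error
import urllib.parse
import urllib.request


def notes_url(server: str, token: str, page: int = 1) -> str:
    """Build the same kind of URL that appears in the error message."""
    query = urllib.parse.urlencode(
        {"order_by": "title", "limit": 10, "page": page, "token": token}
    )
    return f"{server.rstrip('/')}/notes?{query}"


def clipper_is_reachable(server: str = "http://localhost:41184") -> bool:
    """Return True if the web clipper service answers the ping request."""
    try:
        with urllib.request.urlopen(f"{server.rstrip('/')}/ping", timeout=5) as response:
            return response.read().decode("utf-8").strip() == "JoplinClipperServer"
    except (urllib.error.URLError, OSError):
        return False
```

If clipper_is_reachable() returns True on the machine itself but False when run inside the container, the problem is most likely the network path rather than the token.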

The --network="host" parameter should give the Docker container access to the network on your machine. I've never run Docker on Windows myself, so it might work differently there.

myfta · 11 April 2021 11:01

(I had obfuscated the token.)

Now that's interesting. When I paste the URL into the browser, I get a "Welcome to Joplin" message, and it returns the first 9 notes (as text) and finishes with: ,"has_more":true}

So I guess that is good news - correct port and token. So why does Docker not work?

plamola · 11 April 2021 11:44

Hmmm, I found the following in the docker documentation:

The host networking driver only works on Linux hosts, and is not supported on Docker Desktop for Mac, Docker Desktop for Windows, or Docker EE for Windows Server.

Source: Networking using the host network | Docker Docs

I guess accessing the web clipper interface isn't that easy from a Docker container on a non-Linux system. I believe the Joplin web clipper service only binds to localhost, which is a good thing from a security perspective. But it means the service can't be reached through any actual network interface.

There might be a workaround: introducing a locally installed proxy, like nginx, to expose the web clipper service on a (virtual) network interface, which could then be accessed from a Docker container. But that makes the whole setup a lot more complex.

So your best bet might be to install the Python library directly on your Windows machine.

myfta · 11 April 2021 12:18

Thanks. So we can conclude that Windows and Docker don't work together to any useful extent.

myfta · 11 April 2021 20:06

So I now have Python 3.7.9 installed in Anaconda and can get a Python prompt (>>>).

What do I need to run from here?

Where do the environment variables get set?

This was my first attempt:

python3 ocr_joplin_notes.cli --mode=TAG_NOTES
  File "<stdin>", line 1
     python3 ocr_joplin_notes.cli --mode=TAG_NOTES
                                           ^
SyntaxError: invalid syntax

Thanks.

darkcheftar · 12 April 2021 05:24

@myfta was the issue resolved, or do you need any help with it?

I would love to help with this if you need it!

I'm working on developing an OCR plugin, and I'm an intermediate Python developer, so I think I may be able to help solve this.

myfta · 12 April 2021 07:59

No, I'm still no further on. I need some guidance on how to run this in Python.

So, here is where I am. I have Python 3.7.9 running in Anaconda. I have done the

pip install ocr-joplin-notes

and I had also previously run

pip install rest_uploader

as that seems to be implied from the instructions.

I have a desktop installation of Joplin on Windows 10, web clipper is running so I can access http://localhost:41184/ with the token.

I think I just need to know which command to run in Python and how to set the environment variables.

Thanks.

darkcheftar · 12 April 2021 09:14

OK, @myfta! I will try to run it myself and let you know if I make any progress.

plamola · 12 April 2021 13:48

I currently don't have Python installed on my machine, so I can't verify, but I suspect the command line example might have been incorrect. The script is a module, so the -m option should be specified.
Try this:
python3 -m ocr_joplin_notes.cli --mode=TAG_NOTES
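
Note that this is a command for the terminal (or the Anaconda Prompt), not something to type at the Python >>> prompt, which is what causes the SyntaxError above. If you would rather start it from Python and set the environment variables in the same place, a sketch could look like this. The variable name JOPLIN_TOKEN is assumed, and the helper names are made up for this example:

```python
import os
import subprocess
import sys
from typing import Dict, List


def build_command(mode: str = "TAG_NOTES") -> List[str]:
    """Equivalent of: python -m ocr_joplin_notes.cli --mode=<mode>"""
    # sys.executable is the interpreter running this script, so the module
    # is looked up in the same environment where pip installed it.
    return [sys.executable, "-m", "ocr_joplin_notes.cli", f"--mode={mode}"]


def build_env(token: str, server: str = "http://localhost:41184") -> Dict[str, str]:
    """Copy the current environment and add the Joplin settings."""
    env = dict(os.environ)
    env["JOPLIN_SERVER"] = server
    env["JOPLIN_TOKEN"] = token  # assumed variable name
    return env


def run_ocr(token: str, mode: str = "TAG_NOTES") -> int:
    """Run the module as a child process and return its exit code."""
    return subprocess.run(build_command(mode), env=build_env(token)).returncode
```

Calling run_ocr("your-token") from a small script file would then do the same as setting the variables in the shell and running the command by hand.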

myfta · 12 April 2021 17:01

Thanks, yes, the -m option was required. I only have Python3.8 installed so this works for me:
python3 -m ocr_joplin_notes.cli --mode=TAG_NOTES

I am just about to post all my "instructions" for getting this running in Windows which you are welcome to borrow from for your Github README.md.

Here are the Windows installation notes: OCR in Joplin (How to) - #3 by myfta
