OCR for existing Joplin notes

After I migrated my Evernote notes to Joplin, I was missing the functionality to do a full text search in attachments. I got inspired by a post on this forum:

After having a look at how rest-uploader does OCR, I decided to write a script that could add OCR text to existing Joplin notes.

The ocr-joplin-notes script can read notes from Joplin via the web clipper interface, OCR any image or PDF and insert the text as a comment block in the note. In case of a PDF document, it can also add a preview image.
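To illustrate the "comment block" idea: Markdown viewers do not display HTML comments, so text wrapped in one stays in the note body without showing up in the rendered note. The sketch below is not taken from the ocr-joplin-notes source; the function names and the `ocr-text` marker are hypothetical, chosen only for this example.

```python
import re

# Hypothetical marker; the real script may write something different.
OCR_MARKER = "ocr-text"


def append_ocr_comment(body: str, ocr_text: str) -> str:
    """Return the note body with the OCR text appended as an HTML comment."""
    # "--" is not allowed inside an HTML comment, so put a space after
    # every hyphen that is directly followed by another hyphen.
    safe_text = re.sub(r"-(?=-)", "- ", ocr_text.strip())
    return f"{body.rstrip()}\n\n<!-- {OCR_MARKER}\n{safe_text}\n-->\n"


def already_processed(body: str) -> bool:
    """Cheap check so the same note is not OCR'd twice."""
    return f"<!-- {OCR_MARKER}" in body
```

A marker like this is also one simple way to recognise notes that were handled in an earlier run.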

The script has a simple detection algorithm to skip notes it suspects were created by rest-uploader, as well as notes it has already processed. The current version of this script requires a tag to be supplied on the command line. It will only process the notes with that specific tag. Once all notes with that tag have been processed, the script will terminate. More details can be found in the readme.

The ocr-joplin-notes script is written in Python and has been tested on Ubuntu.

I just published a first version to both GitHub and PyPI.

Enjoy

Would it be possible to implement this as a plugin?

I guess this could also be implemented as a plugin.

It would be a completely different beast, since a plugin needs to be written in JavaScript, whereas this is written in Python. It also might require the user to install additional libraries to do the actual OCR part.

It's not a project I'm going to take on in the foreseeable future.

Hi, in your Docker instructions you indicate a docker-env file is required with the Joplin token. Where should this docker-env file be placed in a Windows 10 installation?

Thanks

The --env-file parameter of the docker command lets you specify the full path to a file. The file can be named anything you want.

The example command in the documentation of ocr-joplin-notes looks for a docker-env file in the current directory.

Thanks, that's sorted and now starts to run. But I get this error:

Environment variable JOPLIN_SERVER not set, using default value: http://localhost:41184

What am I missing?

Typically, your Joplin client listens on http://localhost:41184, for web clipper communication. So this default should be fine.

If for some reason you need to override the default, you could add a line like JOPLIN_SERVER=http://your.url:port to the environment file.
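Note that the message about JOPLIN_SERVER is informational rather than an error: it only reports that the default is being used. A minimal sketch of how such a fallback typically looks in Python follows. It is an illustration, not the script's actual code, and the function name is made up for this example.

```python
import os

DEFAULT_SERVER = "http://localhost:41184"


def joplin_server(env=os.environ) -> str:
    """Return the JOPLIN_SERVER value, falling back to the default."""
    server = env.get("JOPLIN_SERVER")
    if not server:
        print(
            "Environment variable JOPLIN_SERVER not set, "
            f"using default value: {DEFAULT_SERVER}"
        )
        return DEFAULT_SERVER
    # Drop a trailing slash so paths can be appended consistently.
    return server.rstrip("/")
```

With docker, the variables in the file passed via --env-file end up in the container's environment, which is where a lookup like this would find them.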

I checked the web clipper options page again in Joplin and it confirmed that it was already running on port 41184.
However, just to be sure, I added the JOPLIN_SERVER line to the docker-env file. That has got rid of the error, but it still fails to run. Here is the complete output. (I have checked the token value again, just to be sure.)

docker run --env-file ./docker-env --network="host" plamola/ocr-joplin-notes:0.2.3 python -m ocr_joplin_notes.cli --mode=TAG_NOTES
Mode: TAG_NOTES
Language: eng
Add previews: yes
Autorotation: yes
Tagging notes. This might take a while. You can follow the progress by watching the tags in Joplin
Connection Error. URL: http://localhost:41184/notes?order_by=title&limit=10&page=1&token=xxxxxxxx

Thanks.

If the URL in the error message is the actual URL you got (that is, you did not replace your token with xxxxxxxx before posting it here), then you have not set up the token correctly in the environment file.

If you paste the URL from the error message into a browser, you should get a response from Joplin, assuming you have the right token set up. If that doesn't work, something might be blocking access to the port. If that is the case, you will probably also run into a similar problem when using the Joplin web clipper extension in your browser.
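The same check can be done from Python instead of a browser. This is a small sketch using only the standard library; the helper names are invented for this example, and the query parameters are simply copied from the URL in the error message above.

```python
import urllib.error
import urllib.parse
import urllib.request


def notes_url(server: str, token: str, page: int = 1, limit: int = 10) -> str:
    """Build the same /notes URL that appears in the error message."""
    query = urllib.parse.urlencode(
        {"order_by": "title", "limit": limit, "page": page, "token": token}
    )
    return f"{server}/notes?{query}"


def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and HTTP error statuses.
        return False
```

Calling can_reach(notes_url("http://localhost:41184", "your-token")) on the machine that runs Joplin should return True when the port and token are right. Running the same call inside the container shows whether the container can reach the web clipper service at all.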

The --network="host" parameter should give the docker container access to the network on your machine. I've never run docker on Windows myself, so it might work differently there.

(I had obfuscated the token.)

Now that's interesting. When I paste the URL into the browser, I get a Welcome to Joplin message and it returns the first 9 notes (as text) and finishes with: ,"has_more":true}
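The "has_more":true at the end means there are further pages of results. A client would normally keep requesting the next page until has_more is false. Below is a rough sketch of that loop. It assumes each response is a JSON object with an "items" list next to "has_more", and fetch_page is a hypothetical callable standing in for the actual HTTP request.

```python
from typing import Callable, Dict, Iterator


def iter_notes(fetch_page: Callable[[int], Dict]) -> Iterator[Dict]:
    """Yield notes from consecutive pages until has_more is false."""
    page = 1
    while True:
        result = fetch_page(page)
        # "items" is assumed to hold the notes of the current page.
        yield from result.get("items", [])
        if not result.get("has_more"):
            break
        page += 1
```

Passing the fetch step in as a function keeps the loop easy to try out with canned data before pointing it at a real Joplin instance.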

So I guess that is good news - correct port and token. So why does Docker not work?

Hmmm, I found the following in the docker documentation:

The host networking driver only works on Linux hosts, and is not supported on Docker Desktop for Mac, Docker Desktop for Windows, or Docker EE for Windows Server.

Source: Networking using the host network | Docker Docs

I guess accessing the web clipper interface isn't that easy from a docker container on a non-Linux system. I believe the Joplin web clipper service only binds to localhost, which is a good thing from a security perspective. But it also means the service can't be reached through any actual network interface.

There might be a workaround: a locally installed proxy, like nginx, could expose the web clipper service on a (virtual) network interface, which could then be accessed from a docker container. But that makes the whole setup a lot more complex.

So your best bet might be to install the Python library directly on your Windows machine.

Thanks. So we can conclude that Windows and Docker don't play together to any useful extent.

So I now have Python 3.7.9 installed in Anaconda and can get a python prompt. >>>

What do I need to run from here?

Where do the environment variables get set?

This was my first attempt:

python3 ocr_joplin_notes.cli --mode=TAG_NOTES
  File "<stdin>", line 1
     python3 ocr_joplin_notes.cli --mode=TAG_NOTES
                                           ^
SyntaxError: invalid syntax

Thanks.

@myfta was the issue resolved, or do you need any help with it?

I would love to help with this if you need it!

I'm working on developing an OCR plugin, and I'm an intermediate Python developer, so I think I may be able to help solve this.

No, I'm still no further on. I need some guidance on how to run this in Python.

So, here is where I am. I have Python 3.7.9 running in Anaconda. I have done the

pip install ocr-joplin-notes

and I had also previously run

pip install rest_uploader

as that seems to be implied from the instructions.

I have a desktop installation of Joplin on Windows 10, web clipper is running so I can access http://localhost:41184/ with the token.

I think I just need to know what command to run in Python and how to set the environment variables.

Thanks.

OK, @myfta, I will try to run it myself and let you know if I make any progress.

I currently don't have Python installed on my machine, so I can't verify, but I suspect the command line example might have been incorrect. The script is a module, so the -m option should be specified.
Try this:
python3 -m ocr_joplin_notes.cli --mode=TAG_NOTES
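The SyntaxError earlier in the thread came from typing a shell command at the Python >>> prompt. The command belongs in a terminal (for example the Anaconda Prompt), not inside the interpreter. For anyone who would rather start it from a Python script, here is a sketch that sets the environment variables for the child process only. JOPLIN_TOKEN is an assumed variable name (check the readme for the exact one), and the helper functions are made up for this example.

```python
import os
import subprocess
import sys


def build_command(mode: str = "TAG_NOTES") -> list:
    """Command line equivalent to: python -m ocr_joplin_notes.cli --mode=..."""
    # sys.executable is the interpreter running this script, so the module
    # is looked up in the same environment where pip installed it.
    return [sys.executable, "-m", "ocr_joplin_notes.cli", f"--mode={mode}"]


def build_env(token: str, server: str = "http://localhost:41184") -> dict:
    """Copy the current environment and add the Joplin settings."""
    env = dict(os.environ)
    env["JOPLIN_TOKEN"] = token  # assumed variable name, see the readme
    env["JOPLIN_SERVER"] = server
    return env


def run_ocr(token: str) -> int:
    """Run the module as a child process and return its exit code."""
    completed = subprocess.run(build_command(), env=build_env(token))
    return completed.returncode
```

Setting the variables this way avoids changing any system-wide settings on Windows; they only exist for the duration of that one run.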

Thanks, yes, the -m option was required. I only have Python 3.8 installed, so this works for me:
python3 -m ocr_joplin_notes.cli --mode=TAG_NOTES

I am just about to post all my "instructions" for getting this running on Windows, which you are welcome to borrow from for your GitHub README.md.

Here are the Windows installation notes: OCR in Joplin (How to) - #3 by myfta
