File Uploader and OCR

kellerjustin · 25 September 2018 16:56

I built a command-line companion application to supplement Joplin. The app monitors a directory you specify; when a new file is created in that directory, it automatically uploads the file as text or as an attachment (depending on file type) to a new note in Joplin. The app performs OCR on images and PDFs, and the recognized text is dropped into the note as a comment. An image preview is also created for PDFs. The app requires Python 3.7 and is available on GitHub and PyPI. Hope you like it!
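
For readers who want a feel for the monitoring step described above, here is a minimal polling sketch. It is not the app's actual code: the function names, the extension list, and the polling approach are all assumptions made for illustration, and it uses only the standard library.

```python
import os

# Assumed list of extensions treated as plain text (illustration only).
TEXT_EXTENSIONS = {".txt", ".md", ".csv"}


def find_new_files(directory, seen):
    """Return files in `directory` that are not yet in `seen`, and record them."""
    new_files = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            new_files.append(path)
    return new_files


def classify(path):
    """Decide whether a file would become note text or an attachment."""
    extension = os.path.splitext(path)[1].lower()
    return "text" if extension in TEXT_EXTENSIONS else "attachment"
```

A caller would run find_new_files in a loop with a short sleep between passes and hand each returned path to whatever does the upload.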

Original post below.

I created a quick-and-dirty companion Python app for Joplin that scrapes a monitored directory into Joplin. This is a feature I use extensively with Evernote, where a new note is created whenever a file is dropped into a pre-defined directory.
Seems the API only supports image files, not PDFs.
This app will also create notes from plain text files. If anyone is interested, here's the repo:
REST Uploader

kellerjustin · 2 October 2018 14:57

Thanks to @laurent updating the API, I was able to enhance my companion app to handle all filetypes. It also does OCR for images and adds the OCR’d text as a Markdown comment. It will extract PDF text and create a one-page preview of the PDF. If anyone is interested…
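
A rough sketch of what "adds the OCR'd text as a Markdown comment" could look like when building the note body. The function name, the `:/resource_id` style link in the usage example, and the escaping rule are assumptions for illustration, not taken from the app's source.

```python
def build_note_body(resource_link, ocr_text):
    """Return a note body with the attachment link followed by hidden OCR text."""
    # Keep OCR output from closing the comment early if it contains "-->".
    safe_text = ocr_text.replace("-->", "--&gt;")
    return resource_link + "\n\n<!--\n" + safe_text + "\n-->\n"


body = build_note_body("![scan.png](:/abc123)", "Invoice total 42")
```

The comment stays out of the rendered note while remaining part of the note text.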

PhantamaroK · 28 February 2019 22:38

Is your source code/app available?

kellerjustin · 1 March 2019 06:40

Hit that github link for REST Uploader above.

Let me know if you have any trouble with getting it working – It’s not the most elegant thing in the world but it does work.

foxmask · 1 March 2019 08:39

I moved the post to the “apps” category to join the little group of existing ones.
Maybe you could add a setup.py to the project and then publish it on pypi.org.

kellerjustin · 1 March 2019 16:02

I should do that, yes. I’ve never published anything to PyPI, so it would be a good exercise. I also considered hooking into your API library to streamline things.

Other things on the unofficial roadmap:
auto tagging based on OCR matches
auto cropping / image rotating based on filename
a better sweep system to handle missed files in case the service stops running

… I don’t get many opportunities to work on it, though, so it’s hard to say if I’ll ever get a chance to implement those things.

foxmask · 1 March 2019 21:39

main goal: having fun

kellerjustin · 26 June 2019 21:13

I published this to PyPI today!

It’s packaged in a more sensible manner now and should be easier to get up and running. Just pip install rest_uploader, then call rest_uploader with a directory to monitor.

bitbacchus · 12 November 2019 14:46

@kellerjustin thanks for publishing your uploader! I'm currently giving it a try, and so far, it is very cool.

I have some questions, mainly regarding OCR - which is not part of your code, but maybe you have some hints for me

OCR in PDF seems to work on the first page, only. Is this intentional or a bug?
OCR appears to be more reliable with English texts. I have installed tesseract-ocr-deu for German text recognition, but it seems not to improve OCR when used with the file uploader. Do you know whether tesseract needs to "know" the language before OCR?
OCR of handwritten text is not very good. You are not, by any chance, aware of a way to get better handwriting recognition?

I would appreciate it if you had any ideas on my OCR questions.

Cheers,
Sebastian

dasym · 12 November 2019 18:44

I agree, this package is so useful. I’ve also noticed that OCR only works on the first page.

kellerjustin · 12 November 2019 22:59

Thanks for the feedback! Wasn't sure if anyone else was using it or not.

The OCR on the first page was sort of a "known bug" -- I programmed it to extract embedded PDF text if available or just use the preview image and OCR if no embedded text. I modified the code to extract text from all pages. If you're dropping in large PDFs, be aware that this will definitely take longer to process.
I also added a command line option to specify language. Provided Tesseract supports it, it should work with my program. Just add the option --language ger when launching rest_uploader (I think that's what it is for German -- not sure, didn't install the package to test)
Improve your handwriting??? -- I'm relying on Tesseract for the OCR, so I can't really speak to this specific issue.
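
The fallback described in the first answer (use embedded PDF text if present, otherwise OCR, now across all pages) can be sketched roughly as below. This is an illustration under assumptions, not the program's code: the function and parameter names are hypothetical, the per-page fallback is a guess at the behaviour, and the two callables stand in for whatever PDF and OCR libraries are used.

```python
def text_for_pages(pages, extract_embedded, run_ocr):
    """Collect text from every page, running OCR only where no embedded text exists."""
    chunks = []
    for page in pages:
        text = extract_embedded(page).strip()
        if not text:
            # No embedded text on this page: fall back to OCR of the page image.
            text = run_ocr(page).strip()
        chunks.append(text)
    # Skip pages that produced nothing at all.
    return "\n\n".join(chunk for chunk in chunks if chunk)
```

If the OCR step goes through pytesseract, run_ocr could wrap pytesseract.image_to_string(image, lang=...), which is where a language option would plug in. OCR is the slow part, which is why large scanned PDFs take longer.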

To enable the fixed multi-page OCR and language support, you'll need to update to version 0.4.0.

pip install rest_uploader --upgrade

Thanks!

bitbacchus · 13 November 2019 10:53

Awesome! Thanks a lot for the update

multipage pdf works like a charm
the language option improves OCR quite a lot, even for my lousy handwriting

I have just one more question: Is there a reason why rest_uploader converts the text to ASCII?
Tesseract happily recognizes German umlauts (ü ö ä ...), but unidecode cleverly converts them to 7-bit ASCII-compatible text (u, o, a ...). Is there a reason that you don't use Unicode?
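
To make the effect concrete, here is a small standard-library approximation of that kind of ASCII folding. It is not unidecode itself and not the uploader's code; strip_to_ascii is a hypothetical helper that decomposes accented characters and drops whatever has no ASCII form.

```python
import unicodedata


def strip_to_ascii(text):
    """Approximate ASCII folding: decompose characters, then drop non-ASCII parts."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


folded = strip_to_ascii("ü ö ä")  # the umlaut dots are lost
```

This approximation drops ß entirely, whereas unidecode transliterates it to "ss", but either way the original spelling cannot be recovered from the note text.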

Best,
Sebastian

kellerjustin · 13 November 2019 15:19

I wouldn’t have decoded it to ascii for no reason - I think I was getting some sort of encoding/decoding error or something early on. Might have been an issue specific to a test file, or that I was on an older version of python (3.5) – not sure. Now I’m using python 3.7 and it didn’t bomb with a few test files without the unidecode function, so maybe I don’t need it. I’ll test it out for a while and see if there are any issues.

dasym · 13 November 2019 18:24

Thanks for this. Works perfectly for me.

bitbacchus · 14 November 2019 10:27

@kellerjustin let me know if I can help with anything

kellerjustin · 14 November 2019 17:57

I removed the decode function and did a couple other bugfixes and released it to PyPI. Let me know if it does a better job with the OCR…
thanks!

bitbacchus · 18 November 2019 09:12

It works better than I’ve ever imagined. Many, many thanks!!

robertreems · 20 November 2019 08:25

Hi there,

This plugin seems to fit my needs to drop OneNote and switch over to Joplin.

I've run into an issue using rest_uploader.

Here is the error:

robert@robert-Latitude-6430U:~$ rest_uploader /home/robert/JoplerPDF
Traceback (most recent call last):
  File "/home/robert/.local/bin/rest_uploader", line 6, in <module>
    from rest_uploader.cli import main
  File "/home/rober/.local/lib/python2.7/site-packages/rest_uploader/cli.py", line 6, in <module>
    from .rest_uploader import watcher
  File "/home/robert/.local/lib/python2.7/site-packages/rest_uploader/rest_uploader.py", line 58
    print(f"Filesize = {filesize}. Too big for Joplin, skipping upload")
                                                                      ^
SyntaxError: invalid syntax

My setup is Ubuntu 19.10

Perhaps I've missed a step or made an error. I've done the following:

sudo apt-get install tesseract-ocr
sudo apt-get install python-pip
pip install rest-uploader

And to start the rest_uploader

python -m rest_uploader.cli /home/robert/JoplerPDF/

bitbacchus · 20 November 2019 11:33

Hi @robertreems

I have the same problem on my laptop. On my desktop, however, it works fine (both Ubuntu 18.04). My plan is now to compare the installed packages on both machines. I’ll report back tomorrow.

Sebastian

kellerjustin · 20 November 2019 14:38

Looks like it’s trying to run via Python 2.7. rest_uploader requires Python 3. Try the following:
sudo apt install python3-pip
pip3 install rest_uploader
python3 -m rest_uploader.cli
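
For anyone hitting the same SyntaxError: f-strings only exist in Python 3.6 and later, so Python 2.7 fails while parsing the file, before any code runs. A quick way to check which interpreter is in use is sketched below. The helper name is hypothetical, and the 3.7 minimum is taken from the requirement stated in the first post.

```python
import sys

MINIMUM_VERSION = (3, 7)  # per the stated requirement of Python 3.7


def is_supported(version_info=None, minimum=MINIMUM_VERSION):
    """Return True if the given (or running) interpreter meets the minimum version."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= minimum


if not is_supported():
    # Deliberately no f-string here, so the message itself parses on old interpreters.
    print("Python %d.%d or newer is required" % MINIMUM_VERSION)
```

A check like this only helps if it lives in a file that contains no f-strings, since the parse error happens per file.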
