File Uploader and OCR

I created a quick-and-dirty companion Python app for Joplin that monitors a directory and imports any file dropped there into Joplin as a new note. This is a feature I use extensively with Evernote, where a new note is created whenever a file is dumped into a predefined directory.
It seems the API only supports image files, not PDFs.
This app will also create notes from plain text files. If anyone is interested, here’s the repo:
REST Uploader
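
For readers curious how a monitored directory can feed Joplin, here is a minimal sketch, not the actual REST Uploader code. It polls a directory for new plain text files and posts each one to the Joplin Data API as a note. The port (41184 is Joplin's default Web Clipper port), the placeholder token, and all function names are assumptions made for illustration.

```python
import json
import time
import urllib.request
from pathlib import Path

# Assumptions: default Web Clipper port and a placeholder token.
JOPLIN_NOTES_URL = "http://localhost:41184/notes"
API_TOKEN = "YOUR_TOKEN_HERE"


def find_new_files(directory, seen):
    """Return files in `directory` that are not yet in `seen`, and record them."""
    new_files = sorted(
        p for p in Path(directory).iterdir() if p.is_file() and p not in seen
    )
    seen.update(new_files)
    return new_files


def build_note(path):
    """Build a note payload from a text file: file name as title, content as body."""
    return {"title": path.name, "body": path.read_text(encoding="utf-8")}


def post_note(note):
    """POST the note to Joplin (needs a running Joplin with the clipper service on)."""
    request = urllib.request.Request(
        f"{JOPLIN_NOTES_URL}?token={API_TOKEN}",
        data=json.dumps(note).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def watch(directory, interval=5.0):
    """Poll forever; files already present at startup are uploaded on the first pass."""
    seen = set()
    while True:
        for path in find_new_files(directory, seen):
            if path.suffix == ".txt":
                post_note(build_note(path))
        time.sleep(interval)
```

Calling `watch("/path/to/inbox")` would start the loop. Polling keeps the sketch dependency-free; a file-system event library would react faster but is not needed to show the idea.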

Thanks to @laurent updating the API, I was able to enhance my companion app to handle all filetypes. It also does OCR for images and adds the OCR’d text as a Markdown comment. It will extract PDF text and create a one-page preview of the PDF. If anyone is interested…
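
As a rough illustration of "OCR’d text as a Markdown comment", the sketch below wraps recognized text in an HTML comment, which Markdown renderers hide while the text stays in the note body. The function names and the `OCR text:` label are hypothetical, and the `:/resource_id` image link assumes Joplin's usual way of referencing an attached resource. In practice the text would come from Tesseract, for example via `pytesseract.image_to_string`.

```python
def ocr_comment(ocr_text):
    """Wrap OCR output in an HTML comment so it is stored but not rendered."""
    # "-->" inside the text would end the comment early, so break it up.
    safe_text = ocr_text.strip().replace("-->", "-- >")
    return f"<!-- OCR text:\n{safe_text}\n-->"


def build_image_note_body(resource_id, ocr_text):
    """Combine a link to the uploaded image with the hidden OCR text."""
    image_link = f"![](:/{resource_id})"
    return f"{image_link}\n\n{ocr_comment(ocr_text)}"
```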

Is your source code/app available?

Hit the GitHub link for REST Uploader above.

Let me know if you have any trouble with getting it working – It’s not the most elegant thing in the world but it does work.

I moved the post to the “apps” category to join the small group of existing ones.
Maybe you could add a setup.py to the project so that it can be published on pypi.org.

I should do that, yes. I’ve never published anything to PyPI, so it would be a good exercise. I also considered hooking into your API library to streamline things.

Other things on the unofficial roadmap:

  • auto tagging based on OCR matches
  • auto cropping / image rotating based on filename
  • a better sweep system to handle missed files in case the service stops running

… I don’t get many opportunities to work on it, though, so it’s hard to say if I’ll ever get a chance to implement those things.
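
The auto-tagging roadmap item is not implemented in the thread, so the following is only a guess at how it could work: a keyword table is matched against the OCR text and every rule that hits contributes a tag. The rule table, tag names, and `suggest_tags` are all hypothetical.

```python
import re

# Hypothetical rules: tag name -> keywords that trigger it.
TAG_RULES = {
    "invoice": ["invoice", "rechnung"],
    "receipt": ["receipt", "total due"],
}


def suggest_tags(ocr_text, rules=TAG_RULES):
    """Return the sorted tags whose keywords appear as whole words in the text."""
    text = ocr_text.lower()
    return sorted(
        tag
        for tag, keywords in rules.items()
        if any(re.search(rf"\b{re.escape(word)}\b", text) for word in keywords)
    )
```

Whole-word matching avoids tagging on accidental substrings, at the cost of missing plurals such as "invoices" unless they are listed explicitly.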

main goal: having fun :wink:

I published this to PyPI today!

It’s packaged in a more sensible manner now and should be easier to get up and running: just pip install rest_uploader, then call rest_uploader with the directory to monitor.

@kellerjustin thanks for publishing your uploader! I’m currently giving it a try, and so far, it is very cool.

I have some questions, mainly regarding OCR - which is not part of your code, but maybe you have some hints for me :wink:

  • OCR on PDFs seems to work on the first page only. Is this intentional or a bug?
  • OCR appears to be more reliable with English texts. I have installed tesseract-ocr-deu for German text recognition, but it seems not to improve OCR when used with the file uploader. Do you know whether tesseract needs to “know” the language before OCR?
  • OCR of handwritten text is not very good. You are not, by any chance, aware of a way to get better handwriting recognition?

I would appreciate it if you had any ideas :slight_smile: on my OCR questions.

Cheers,
Sebastian

I agree, this package is so useful. I’ve also noticed that OCR only works on the first page.

Thanks for the feedback! Wasn’t sure if anyone else was using it or not.

  • The OCR on the first page was sort of a “known bug” – I programmed it to extract embedded PDF text if available or just use the preview image and OCR if no embedded text. I modified the code to extract text from all pages. If you’re dropping in large PDFs, be aware that this will definitely take longer to process.
  • I also added a command line option to specify the language. Provided Tesseract supports it, it should work with my program. Just add the option --language deu when launching rest_uploader (deu is Tesseract’s code for German, matching the tesseract-ocr-deu package; I didn’t install the package to test).
  • Improve your handwriting??? :smile: – I’m relying on Tesseract for the OCR, so I can’t really speak to this specific issue.

To enable the fixed multi-page OCR and language support, you’ll need to update to version 0.4.0.

pip install rest_uploader --upgrade

Thanks!
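
The multi-page behaviour described above can be sketched roughly as follows: for every page, use the embedded text if there is any, otherwise OCR the page image in the requested language. This is an illustration under assumptions, not the code from version 0.4.0. In particular, falling back per page (rather than per document) is a guess, and `extract_pdf_text`, the `pages` tuples, and the `ocr_page` callable are hypothetical; a real `ocr_page` might wrap `pytesseract.image_to_string(image, lang=language)`.

```python
def extract_pdf_text(pages, ocr_page, language="eng"):
    """Collect text from every page, using OCR only where no text is embedded.

    pages:    iterable of (embedded_text, page_image) pairs, one per page
    ocr_page: callable taking (page_image, language) and returning a string
    """
    texts = []
    for embedded_text, page_image in pages:
        text = (embedded_text or "").strip()
        if not text:
            # No embedded text on this page, so fall back to OCR.
            text = ocr_page(page_image, language).strip()
        texts.append(text)
    return "\n\n".join(texts)
```

Because OCR runs only on pages without embedded text, text-based PDFs stay fast, while large scanned PDFs take longer in proportion to their page count.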

Awesome! Thanks a lot for the update :+1:

  • multipage pdf works like a charm

  • the language option improves OCR quite a lot, even for my lousy handwriting :wink:

I have just one more question: Is there a reason why rest_uploader converts the text to ASCII?
Tesseract happily recognizes German umlauts (ü ö ä …), but unidecode converts them to plain 7-bit ASCII approximations (u, o, a …). Is there a reason that you don’t use Unicode?

Best,
Sebastian

I wouldn’t have decoded it to ASCII for no reason. I think I was getting some sort of encoding/decoding error early on. It might have been an issue specific to a test file, or that I was on an older version of Python (3.5) – not sure. Now I’m using Python 3.7 and it didn’t bomb on a few test files without the unidecode function, so maybe I don’t need it. I’ll test it out for a while and see if there are any issues.
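
To make the difference concrete, here is a small standard-library sketch of what ASCII folding does to umlauts. It is a stand-in for unidecode, not the same algorithm: it decomposes characters and drops anything that is not ASCII, so a letter like "ß" disappears entirely, whereas unidecode would transliterate it. The function name `to_ascii` is made up for this example.

```python
import unicodedata


def to_ascii(text):
    """Fold text to ASCII by stripping accents and dropping what remains non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


ocr_text = "Müller"
folded = to_ascii(ocr_text)  # "Muller": the umlaut is lost
# Python 3 strings are Unicode, so the original survives a UTF-8 round trip.
kept = ocr_text.encode("utf-8").decode("utf-8")
```

Keeping the text as Unicode and encoding it as UTF-8 only when it is sent preserves the umlauts, which is what dropping the ASCII conversion amounts to.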

Thanks for this. Works perfectly for me.

@kellerjustin let me know if I can help with anything :upside_down_face:

I removed the decode function, did a couple of other bug fixes, and released it to PyPI. Let me know if it does a better job with the OCR…
Thanks!