I built a command-line companion application to supplement Joplin. The app monitors a directory you specify; when a new file is created in that directory, it automatically uploads the file as text or as an attachment (depending on file type) to a new note in Joplin. The app performs OCR on images and PDFs, and the recognized text is dropped into the note as a comment. An image preview is also created for PDFs. The app requires Python 3.7 and is available on GitHub and PyPI. Hope you like it!
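For readers curious how a monitored directory can work in principle, here is a minimal polling sketch using only the standard library. It is not taken from rest_uploader, which may well use a different mechanism. The names `TEXT_EXTENSIONS`, `classify`, `find_new_files`, and `watch` are hypothetical, and the list of text extensions is an assumption made for illustration.

```python
import os
import time

# Assumption: which extensions count as plain text is illustrative only.
TEXT_EXTENSIONS = {".txt", ".md", ".csv"}


def classify(filename):
    """Return "text" for plain-text files, "attachment" for everything else."""
    ext = os.path.splitext(filename)[1].lower()
    return "text" if ext in TEXT_EXTENSIONS else "attachment"


def find_new_files(directory, seen):
    """Return files in `directory` not yet in `seen`, and record them as seen."""
    current = {
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    }
    new_files = sorted(current - seen)
    seen.update(new_files)
    return new_files


def watch(directory, handle, interval=1.0):
    """Poll `directory` forever, calling handle(path, kind) for each new file."""
    seen = set(os.listdir(directory))  # ignore files that were already there
    while True:
        for name in find_new_files(directory, seen):
            handle(os.path.join(directory, name), classify(name))
        time.sleep(interval)
```

Calling `watch("/path/to/inbox", print)` would print each new file's path and whether it would be treated as text or as an attachment. A real uploader would replace `print` with a function that creates the note.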
Original post below.
I created a quick-and-dirty companion Python app for Joplin that scrapes a monitored directory into Joplin. This is a feature I use extensively with Evernote, where it creates a new note when a file is dumped into a pre-defined directory.
Seems the API only supports image files, not PDFs.
This app will also create notes from plain text files. If anyone is interested, here's the repo: REST Uploader
Thanks to @laurent updating the API, I was able to enhance my companion app to handle all filetypes. It also does OCR for images and adds the OCR’d text as a Markdown comment. It will extract PDF text and create a one-page preview of the PDF. If anyone is interested…
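As a rough illustration of "adds the OCR'd text as a Markdown comment", the sketch below wraps text in an HTML comment, which Markdown renderers generally hide while the text stays in the note source. The function name `build_note_body` and the example link are hypothetical, and the real app may format its notes differently.

```python
def build_note_body(attachment_markdown, ocr_text):
    """Combine an attachment link with OCR text hidden in an HTML comment."""
    # "-->" inside the OCR text would end the comment early, so break it up.
    safe_text = ocr_text.replace("-->", "-- >")
    return f"{attachment_markdown}\n\n<!--\n{safe_text}\n-->\n"


if __name__ == "__main__":
    # Hypothetical link text and OCR result, for demonstration only.
    print(build_note_body("![scan.png](scan.png)", "Invoice 2019 total due"))
```

The point of the hidden comment is that the OCR text stays out of the rendered note while remaining part of the note's source.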
I moved the post to the “apps” category to join the little group of existing ones.
Maybe you could add a setup.py to the project and then publish it on pypi.org.
I should do that, yes. I’ve never published anything to PyPI, so it would be a good exercise. I also considered hooking into your API library to streamline things.
Other things on the unofficial roadmap:
auto tagging based on OCR matches
auto cropping / image rotating based on filename
a better sweep system to handle missed files in case the service stops running
… I don’t get many opportunities to work on it, though, so it’s hard to say if I’ll ever get a chance to implement those things.
It’s packaged in a more sensible manner now and should be easier to get up and running. Just pip install it, then call rest_uploader with a directory to monitor.
@kellerjustin thanks for publishing your uploader! I'm currently giving it a try, and so far, it is very cool.
I have some questions, mainly regarding OCR, which is not part of your code, but maybe you have some hints for me:
OCR in PDFs seems to work on the first page only. Is this intentional or a bug?
OCR appears to be more reliable with English texts. I have installed tesseract-ocr-deu for German text recognition, but it seems not to improve OCR when used with the file uploader. Do you know whether tesseract needs to "know" the language before OCR?
OCR of handwritten text is not very good. You are not, by any chance, aware of a way to get better handwriting recognition?
I would appreciate it if you had any ideas on my OCR questions.
Thanks for the feedback! Wasn't sure if anyone else was using it or not.
The OCR on the first page was sort of a "known bug" -- I programmed it to extract embedded PDF text if available or just use the preview image and OCR if no embedded text. I modified the code to extract text from all pages. If you're dropping in large PDFs, be aware that this will definitely take longer to process.
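The "embedded text if available, otherwise OCR" logic described above could be sketched roughly like this. It is only an illustration under assumptions: the two callables `get_embedded_text` and `ocr_page` are hypothetical stand-ins for whatever PDF and OCR libraries are actually used, and applying the fallback page by page is my guess, not necessarily how rest_uploader does it.

```python
def extract_pdf_text(pages, get_embedded_text, ocr_page):
    """Return text for every page, using OCR only where no embedded text exists.

    `pages` is any iterable of page objects; `get_embedded_text(page)` and
    `ocr_page(page)` are caller-supplied functions that return strings.
    """
    texts = []
    for page in pages:
        text = get_embedded_text(page).strip()
        if not text:
            # No embedded text on this page, so fall back to the slower OCR.
            text = ocr_page(page).strip()
        texts.append(text)
    return "\n\n".join(texts)


if __name__ == "__main__":
    # Fake pages as dicts, just to show the control flow.
    fake_pages = [
        {"embedded": "Page one", "scan": "ignored"},
        {"embedded": "", "scan": "Seite zwei"},
    ]
    print(extract_pdf_text(fake_pages,
                           lambda p: p["embedded"],
                           lambda p: p["scan"]))
```

Because every page is visited, processing time grows with the page count, which matches the warning about large PDFs.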
I also added a command line option to specify language. Provided Tesseract supports it, it should work with my program. Just add the option --language deu when launching rest_uploader (I think that's what it is for German; not sure, I didn't install the package to test).
Improve your handwriting??? -- I'm relying on Tesseract for the OCR, so I can't really speak to this specific issue.
To enable the fixed multi-page OCR and language support, you'll need to update to version 0.4.0.
the language option improves OCR quite a lot, even for my lousy handwriting
I have just one more question: Is there a reason why rest_uploader converts the text to ASCII?
Tesseract happily recognizes German umlauts (ü ö ä ...), but unidecode converts them to 7-bit ASCII text (u, o, a ...). Is there a reason that you don't use Unicode?
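To make the effect concrete, here is a small standard-library sketch of ASCII folding. Note that it uses `unicodedata` as a stand-in and is not the unidecode package itself; the helper name `to_ascii` is hypothetical. unidecode transliterates more characters (this sketch simply drops anything without an ASCII base letter), but the loss of umlauts is the same in spirit.

```python
import unicodedata


def to_ascii(text):
    """Strip accents by decomposing characters and dropping non-ASCII parts."""
    # NFKD splits "ü" into "u" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" then discards the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    original = "Müller über Äpfel"
    print(to_ascii(original))  # umlauts are lost
    print(original)            # Python 3 strings keep Unicode as-is
```

Since Python 3 strings are Unicode by default, passing the OCR result through unchanged keeps the umlauts intact.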
I wouldn’t have decoded it to ASCII for no reason. I think I was getting some sort of encoding/decoding error early on. It might have been an issue specific to a test file, or that I was on an older version of Python (3.5); not sure. Now I’m using Python 3.7 and it didn’t bomb with a few test files without the unidecode function, so maybe I don’t need it. I’ll test it out for a while and see if there are any issues.
This plugin seems to fit my needs to drop OneNote and switch over to Joplin.
I've run into an issue using rest_uploader.
Here is the error:
robert@robert-Latitude-6430U:~$ rest_uploader /home/robert/JoplerPDF
Traceback (most recent call last):
  File "/home/robert/.local/bin/rest_uploader", line 6, in <module>
    from rest_uploader.cli import main
  File "/home/robert/.local/lib/python2.7/site-packages/rest_uploader/cli.py", line 6, in <module>
    from .rest_uploader import watcher
  File "/home/robert/.local/lib/python2.7/site-packages/rest_uploader/rest_uploader.py", line 58
    print(f"Filesize = {filesize}. Too big for Joplin, skipping upload")
                                                                      ^
SyntaxError: invalid syntax
My setup is Ubuntu 19.10
Perhaps I've missed a step or made an error. I've done the following:
I have the same problem on my laptop. On my desktop, however, it works fine (both Ubuntu 18.04). My plan is now to compare the installed packages on both machines. I’ll report back tomorrow.
Looks like it’s trying to run via Python 2.7. rest_uploader requires Python 3. Try the following:
sudo apt install python3-pip
pip3 install rest_uploader
python3 -m rest_uploader.cli
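For background, f-strings such as the one in the traceback were added in Python 3.6, so Python 2.7 rejects the file before running any of it. Below is a minimal sketch of a version guard that a launcher script could use to fail with a clearer message. It is an illustration only and not part of rest_uploader; `MIN_VERSION` and `is_supported` are hypothetical names. The guard deliberately avoids f-strings so that the guard itself still parses on old interpreters, and it only helps if it lives in a file with no Python 3-only syntax.

```python
import sys

MIN_VERSION = (3, 6)  # f-strings were introduced in Python 3.6


def is_supported(version_info=None):
    """Return True if the given (or running) interpreter is new enough."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    if not is_supported():
        # Old-style formatting on purpose: this line must parse on Python 2.
        sys.exit("Python %d.%d or newer is required, found %s"
                 % (MIN_VERSION + (sys.version.split()[0],)))
    print("Interpreter OK: " + sys.version.split()[0])
```

Running the script with `python` and then with `python3` is a quick way to see which interpreter each command actually starts.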