File Uploader and OCR

Is your source code/app available?

Hit that GitHub link for REST Uploader above.

Let me know if you have any trouble with getting it working – It’s not the most elegant thing in the world but it does work.

I moved the post to the “apps” category to join the little group of existing ones.
Maybe you could add a setup.py to the project and then publish it on pypi.org.

I should do that, yes. I’ve never published anything to PyPI, so it would be a good exercise. I also considered hooking into your API library to streamline things.

Other things on the unofficial roadmap:
  • auto tagging based on OCR matches
  • auto cropping / image rotating based on filename
  • a better sweep system to handle missed files in case the service stops running
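
As a rough illustration of the sweep idea (not rest_uploader's actual code; the function name find_missed_files and the notion of tracking uploaded filenames in a set are assumptions made for this sketch), a startup pass could list the files in the watched directory that were never uploaded:

```python
import os


def find_missed_files(directory, processed):
    """Return sorted names of regular files in `directory` not in `processed`.

    `processed` is a hypothetical set of filenames that were already uploaded.
    """
    missed = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        # Skip subdirectories; only plain files are upload candidates.
        if os.path.isfile(path) and name not in processed:
            missed.append(name)
    return sorted(missed)
```

Running this once when the service starts, and uploading whatever it returns, would catch files dropped in while the watcher was down. How the set of processed names is persisted between runs is left open here.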

… I don’t get many opportunities to work on it, though, so it’s hard to say if I’ll ever get a chance to implement those things.

main goal: having fun :wink:

I published this to PyPI today!


It’s packaged in a more sensible manner now and should be easier to get up and running. Just pip install it, then call rest_uploader with a directory to monitor.

@kellerjustin thanks for publishing your uploader! I'm currently giving it a try, and so far, it is very cool.

I have some questions, mainly regarding OCR - which is not part of your code, but maybe you have some hints for me :wink:

  • OCR in PDFs seems to work on the first page only. Is this intentional or a bug?
  • OCR appears to be more reliable with English texts. I have installed tesseract-ocr-deu for German text recognition, but it seems not to improve OCR when used with the file uploader. Do you know whether tesseract needs to "know" the language before OCR?
  • OCR of handwritten text is not very good. You are not, by any chance, aware of a way to get better handwriting recognition?

I would appreciate it if you had any ideas :slight_smile: on my OCR questions.

Cheers,
Sebastian

I agree, this package is so useful. I’ve also noticed that OCR only works on the first page.

Thanks for the feedback! Wasn't sure if anyone else was using it or not.

  • The OCR on the first page was sort of a "known bug" -- I programmed it to extract embedded PDF text if available or just use the preview image and OCR if no embedded text. I modified the code to extract text from all pages. If you're dropping in large PDFs, be aware that this will definitely take longer to process.
  • I also added a command line option to specify language. Provided Tesseract supports it, it should work with my program. Just add the option --language deu when launching rest_uploader (I think that's the Tesseract code for German, matching the tesseract-ocr-deu package -- not sure, didn't install the package to test)
  • Improve your handwriting??? :smile: -- I'm relying on Tesseract for the OCR, so I can't really speak to this specific issue.
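
The "embedded text first, OCR as fallback" behaviour described above can be sketched roughly as follows. This is only an illustration under assumptions: the function extract_pdf_text, the (embedded_text, image) page tuples, and the ocr callable are placeholders invented here, not the real rest_uploader or Tesseract interfaces, and the fallback is applied per page, which may differ from how the program actually decides.

```python
def extract_pdf_text(pages, ocr, language="eng"):
    """Collect text from every page, preferring embedded text over OCR.

    `pages` is a list of (embedded_text, image) tuples and `ocr` is any
    callable taking (image, language). Both are hypothetical stand-ins.
    """
    chunks = []
    for embedded_text, image in pages:
        if embedded_text and embedded_text.strip():
            chunks.append(embedded_text.strip())
        else:
            # No usable embedded text: fall back to OCR for this page.
            chunks.append(ocr(image, language).strip())
    return "\n".join(chunks)


# Usage with a stand-in OCR function (a real one would call Tesseract).
def fake_ocr(image, language):
    return "[{} OCR of {}]".format(language, image)


pages = [("Invoice 2020", "page1.png"), ("", "page2.png")]
result = extract_pdf_text(pages, fake_ocr, language="deu")  # "deu" is Tesseract's code for German
```

Because every page is visited, the cost grows with page count, which is why large PDFs take longer once all pages are processed.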

To enable the fixed multi-page OCR and language support, you'll need to update to version 0.4.0.

pip install rest_uploader --upgrade

Thanks!

Awesome! Thanks a lot for the update :+1:

  • multipage pdf works like a charm

  • the language option improves OCR quite a lot, even for my lousy handwriting :wink:

I have just one more question: Is there a reason why rest_uploader converts the text to ASCII?
Tesseract happily recognizes German umlauts (ü ö ä ...), but unidecode converts them to plain 7-bit ASCII (u, o, a ...). Is there a reason that you don't use Unicode?
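
To show the kind of information loss in question, here is a small sketch that folds text to ASCII using only Python's standard unicodedata module. It is a stand-in for illustration, not the unidecode library itself (whose transliteration rules differ in detail), and the helper name fold_to_ascii is made up for this example.

```python
import unicodedata


def fold_to_ascii(text):
    """Strip diacritics by decomposing characters and dropping non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "über Öl und Äpfel"
folded = fold_to_ascii(sample)  # umlauts collapse to their base letters
```

After folding, a search for "über" no longer matches the stored text, which is the practical downside of converting OCR output to ASCII.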

Best,
Sebastian

I wouldn’t have decoded it to ASCII for no reason. I think I was getting some sort of encoding/decoding error early on. It might have been an issue specific to a test file, or because I was on an older version of Python (3.5); I’m not sure. Now I’m using Python 3.7, and it didn’t bomb on a few test files without the unidecode function, so maybe I don’t need it. I’ll test it for a while and see if there are any issues.

Thanks for this. Works perfectly for me.

@kellerjustin let me know if I can help with anything :upside_down_face:

I removed the decode function and did a couple other bugfixes and released it to PyPI. Let me know if it does a better job with the OCR…
thanks!

It works better than I’ve ever imagined. Many, many thanks!!

Hi there,

This plugin seems to fit my needs to drop OneNote and switch over to Joplin.

I've run into an issue using rest_uploader.

Here is the error:

robert@robert-Latitude-6430U:~$ rest_uploader /home/robert/JoplerPDF
Traceback (most recent call last):
  File "/home/robert/.local/bin/rest_uploader", line 6, in <module>
    from rest_uploader.cli import main
  File "/home/robert/.local/lib/python2.7/site-packages/rest_uploader/cli.py", line 6, in <module>
    from .rest_uploader import watcher
  File "/home/robert/.local/lib/python2.7/site-packages/rest_uploader/rest_uploader.py", line 58
    print(f"Filesize = {filesize}. Too big for Joplin, skipping upload")
                                                                      ^
SyntaxError: invalid syntax

My setup is Ubuntu 19.10

Perhaps I've missed a step or made an error. I've done the following:

sudo apt-get install tesseract-ocr
sudo apt-get install python-pip
pip install rest-uploader

And to start the rest_uploader

python -m rest_uploader.cli /home/robert/JoplerPDF/

Hi @robertreems

I have the same problem on my laptop. On my desktop, however, it works fine (both Ubuntu 18.04). My plan is now to compare the installed packages on both machines. I’ll report back tomorrow.

Sebastian

Looks like it’s trying to run via Python 2.7. rest_uploader requires Python 3. Try the following:
sudo apt install python3-pip
pip3 install rest_uploader
python3 -m rest_uploader.cli
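
For anyone unsure which interpreter they are launching: the SyntaxError above comes from an f-string, a feature added in Python 3.6. A quick check along these lines can confirm the version before installing. This is a sketch only; the helper name supports_fstrings is made up, and the 3.6 minimum is inferred from the f-string in the traceback, not from the package's stated requirements.

```python
import sys

MIN_VERSION = (3, 6)  # f-strings were added in Python 3.6


def supports_fstrings(version_info=sys.version_info):
    """Return True if the given interpreter version can parse f-strings."""
    return tuple(version_info[:2]) >= MIN_VERSION


if not supports_fstrings():
    # str.format is used here so this check itself still runs on Python 2.
    sys.exit("Need Python 3.6+, found {}.{}".format(*sys.version_info[:2]))
```

Saving this to a file and running it with the same command used to start the uploader (python or python3) shows which interpreter that command resolves to.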

Hi Justin,

That did the trick!

@bitbacchus Thanks for your help as well. I suppose your problem should be solved with python3.

Thanks for the quick responses and the OneNote deliberation :wink:

@kellerjustin Maybe you want to update the first post in this topic, because I think you’ve updated the script quite a bit since then. For example, PDFs should work now, so the following is no longer true: “Seems the API only supports image files, not PDFs.” Or is it? Anyway, great script.
