File Uploader and OCR

I'm getting frequent cases of attachments that don't have any text recognized, even when the text is already there from running Tesseract on the file! I can send an example if anyone can debug.

Is this plugin not yet available in the official plugin list? I searched for "OCR" and "uploader" in Joplin and the plugin was not found.

This is not a Joplin plugin; it is an external script.

Thank you. I get a strange error when trying to configure the token:

/usr/lib/python3.9/site-packages/rest_uploader/api_token.py in get_token()
      8     else:
      9         token = input("Paste your Joplin API Token:")
---> 10         with open(".api_token.txt", "w") as f:
     11             f.write(token.rstrip())
     12     return token

PermissionError: [Errno 13] Permission denied: '.api_token.txt'

I ran the script interactively in IPython, and it reported '/usr/lib/python3.9/site-packages/rest_uploader' as the current path. I don't know whether there is an expected directory, or one that can be defined, where rest_uploader looks for .api_token.txt. Should that be the directory where files are uploaded? That is what would make sense to me.

I installed the script on Arch Linux with the following command:

pypi2pkgbuild.py git+https://github.com/kellerjustin/rest-uploader

which automagically builds an Arch Linux package from the GitHub repository. I don't know whether this could have modified something in the script.

The script was installed in /usr/lib/python3.9/site-packages/rest_uploader/. The following files were installed:

$ ls /usr/lib/python3.9/site-packages/rest_uploader

api_token.py  cli.py  __init__.py  __pycache__  rest_uploader.py

I should refactor how that api token file is stored and put it in user space. You should be able to get around this by altering the permissions on the site-packages/rest_uploader folder or running the script as root the first time to create the .api_token.txt file.
Thanks for reporting!
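
For readers who hit the same PermissionError, here is a minimal sketch of what "putting the token in user space" could look like. It is not the actual rest_uploader code: the function names, the `rest_uploader` directory name, and the use of `XDG_CONFIG_HOME` with a `~/.config` fallback are all assumptions made for illustration.

```python
import os
from pathlib import Path


def token_path(app_name="rest_uploader"):
    """Return a per-user location for the token file (hypothetical layout)."""
    # Assumption: follow XDG_CONFIG_HOME if set, otherwise fall back to ~/.config.
    base = os.environ.get("XDG_CONFIG_HOME") or str(Path.home() / ".config")
    return Path(base) / app_name / "api_token.txt"


def save_token(token, path=None):
    """Write the token to a user-writable file and return its path."""
    path = Path(path) if path else token_path()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(token.strip())
    try:
        # Restrict access to the owner; this has limited effect on Windows.
        path.chmod(0o600)
    except OSError:
        pass
    return path


def load_token(path=None):
    """Return the stored token, or None if no token file exists yet."""
    path = Path(path) if path else token_path()
    if path.exists():
        return path.read_text().strip()
    return None
```

Because the file lives under the user's home directory, neither root access nor changed permissions on site-packages would be needed.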

Is it possible to process all images and pdfs that are already in my notes?

I also always get this error:
Too few characters. Skipping this page Error during processing.
Am I doing something wrong or are the images just bad?

rest_uploader only handles the creation of new notes from files. You might want to check out OCR Joplin Notes for what you're trying to do.
Thanks!

Is the Tesseract lib reliable enough in general? I've read that it's difficult to use, but it seems you managed to make it work well?

If it was terribly difficult to use, I wouldn't have been able to make it work :grinning_face_with_smiling_eyes:!
I'll put it this way: the creator(s) of tesseract did the heavy lifting and made a pretty solid open-source OCR framework, and subsequently pytesseract made it easy to access from Python. I just had to hook into it. I've found it's much easier to deal with tesseract on Linux because package management makes it easy. On Windows, unless someone has released a better installer in the last six months or so, it's sort of a headache to install and involves setting environment variables.
Thanks!

Thanks for the info, it seems indeed easier than I thought. Does it successfully OCR documents most of the time? Are there any issues when the documents are not in English?

It's been pretty solid. My handwriting is garbage so I don't expect it will work with any of my handwritten notes, but it seems to do well when processing a scanned text document. You pass tesseract a language parameter and other than the occasional testing, I only ever pass in --eng so I can't adequately answer your question as to how well it works with other languages. Hope that helps!
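
As a rough illustration of the language parameter mentioned above: on the tesseract command line the language is selected with `-l`, and several languages can be combined with `+`. The sketch below only builds and runs that command; the helper names are hypothetical, it is not how rest_uploader itself calls tesseract (that goes through pytesseract), and the traineddata for each requested language is assumed to be installed.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, languages=("eng",)):
    """Build a tesseract CLI call that prints recognized text to stdout."""
    # Multiple languages are joined with "+", e.g. "eng+deu".
    lang_arg = "+".join(languages)
    return ["tesseract", str(image_path), "stdout", "-l", lang_arg]


def ocr_image(image_path, languages=("eng",)):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, languages),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For a German document, for example, one would call `ocr_image("scan.png", ("deu",))` after installing the German language data.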

I'm getting the following error message:
Check Tesseract OCR Configuration
UZN file C:\Users[username]\AppData\Local\Temp\tess_tvjq976a loaded.
Estimating resolution as 258
UZN file C:\Users[username]\AppData\Local\Temp\tess_tvjq976a loaded.
Warning. Invalid resolution 0 dpi. Using 70 instead.
Too few characters. Skipping this page
Error during processing.

Any ideas? From the GitHub issue "Warning. Invalid resolution 0 dpi. Using 70 instead." (Issue #1702, tesseract-ocr/tesseract), it sounded like it might be related to the metadata ("It means your image does not contain a resolution info in its metadata, so Tesseract warns you about this issue in the image and it tries to estimate the resolution by itself."), but I wasn't sure what to try in order to fix it.

Is this happening with all the images you try to upload, or just one particular file?

I get that error with every file I scan, unfortunately.

Hmm, it looks like it's trying to OCR the file before it's fully written. Are you monitoring a temp directory?
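
One common way to guard against OCR-ing a half-written file is to wait until the file size stops changing before processing it. The following is only a sketch of that idea, not rest_uploader's own logic; the function name and the default interval, check count, and timeout are assumptions.

```python
import os
import time


def wait_until_stable(path, interval=1.0, stable_checks=3, timeout=60.0):
    """Return True once the file size is non-zero and unchanged for
    `stable_checks` consecutive polls, or False if `timeout` expires."""
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file missing or temporarily inaccessible
        if size > 0 and size == last_size:
            stable += 1
            if stable >= stable_checks:
                return True
        else:
            # Size changed (or file is empty/missing): start counting again.
            stable = 0
            last_size = size
        time.sleep(interval)
    return False
```

A watcher could call `wait_until_stable(new_file)` and only pass the file to OCR when it returns True. A scanner that pauses mid-write for longer than the polling window would still fool this check, so it is a heuristic rather than a guarantee.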

Thank you @kellerjustin! That is likely the issue. Unfortunately, when I went to troubleshoot it, it started giving me the following error. Any thoughts on what I am doing wrong?

(base) C:\Users[username]> rest_uploader C:\Users[username].config\joplin-desktop\ImportingWithOCR
Launching Application rest_uploader.cli.main
Endpoint: http://127.0.0.1:41184
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\rest_uploader.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rest_uploader\cli.py", line 126, in main
    notebook_id = set_notebook_id(destination.strip())
  File "C:\ProgramData\Anaconda3\lib\site-packages\rest_uploader\rest_uploader.py", line 151, in set_notebook_id
    folders = res.json()["items"]
KeyError: 'items'

Check to make sure you have the latest version of Joplin and rest_uploader. There was a breaking change to the API several versions ago and this looks like what it might be.
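
To make the KeyError above easier to diagnose, a script can inspect the shape of the JSON it gets back instead of indexing "items" directly. The sketch below assumes that newer Joplin versions return a paginated object with an "items" list while older versions returned a bare list; the helper name is hypothetical and this is not code from rest_uploader.

```python
def extract_items(payload):
    """Return the list of records from a Joplin API response body.

    Assumption: newer API versions wrap results as {"items": [...], ...},
    older ones returned a plain list.
    """
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict) and "items" in payload:
        return payload["items"]
    # Anything else (for example an error object) is reported verbatim,
    # which is more informative than a bare KeyError.
    raise ValueError(f"Unexpected response from Joplin API: {payload!r}")
```

With this in place, a response such as an error object would surface its actual contents in the exception message, which helps tell a version mismatch apart from a configuration or token problem.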

OK, got it working. I think the issue was that I had a user or system environment variable set incorrectly. I followed the instructions in "OCR in Joplin (How to)" (post #3 by myfta) and the errors went away. As for my earlier issue, per your suggestion I had the scanner save the file somewhere else and moved it to the monitored folder only once it was completely saved. Sure enough, the OCR worked! Thanks so much for all your help. Truly, if you hadn't built and maintained this script I wouldn't have used Joplin at all; having OCR was a firm requirement for me as I transitioned away from Evernote.
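
The "save elsewhere, then move" workaround can also be scripted. This is a minimal sketch with hypothetical names and paths: it relies on `os.replace`, which renames the file in a single step when the staging and monitored folders are on the same filesystem, so the watcher never sees a partially written file. Across different drives or filesystems `os.replace` raises an error instead of copying, which is deliberate here, since a copy would reintroduce the partial-file problem.

```python
import os
from pathlib import Path


def move_when_done(finished_file, watched_dir):
    """Move a completely written file into the monitored folder."""
    src = Path(finished_file)
    watched = Path(watched_dir)
    watched.mkdir(parents=True, exist_ok=True)
    dst = watched / src.name
    # Rename in one step; raises OSError if src and dst are on different filesystems.
    os.replace(src, dst)
    return dst
```

For example, the scanner could write to a staging folder next to the monitored one, and `move_when_done(staging_file, monitored_folder)` would be called once scanning has finished.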

Great to hear!
I hadn't seen that thread, very cool that @myfta put that how-to together, but also reinforces that I should do better on documentation :crazy_face:...
Thanks for the update and the kind words!

Please do make use of my text and notes if it helps with the documentation. Getting the OCR working was critical for me too.
