File Uploader and OCR

mzguy · 9 March 2021 19:41

I got it figured out, so never mind the question. Thanks for the venv suggestion.

myfta · 13 April 2021 08:16

Not sure if this is the best place to post this. While catching up on how to use Tesseract with Joplin, I found that the two posts illustrate the importance of optimising the image before trying to OCR it. If this hasn't been done already, maybe these ideas should be carried through to the plugin development?

https://levelup.gitconnected.com/a-beginners-guide-to-tesseract-ocr-using-pytesseract-23036f5b2211

mzguy · 13 April 2021 20:27

I'm getting frequent cases of attachments that don't have any text recognized, even when the text is already there from running Tesseract on the file! I can send an example if anyone can debug.

manouchk · 11 June 2021 17:56

Is this plugin not yet available in the official plugins list? I searched for "OCR" and "uploader" in Joplin and the plugin was not found.

JackGruber · 11 June 2021 18:12

This is not a Joplin plugin; it is an external script.

manouchk · 12 June 2021 11:16

Thank you. I get a strange error when trying to configure the token:

/usr/lib/python3.9/site-packages/rest_uploader/api_token.py in get_token()
      8     else:
      9         token = input("Paste your Joplin API Token:")
---> 10         with open(".api_token.txt", "w") as f:
     11             f.write(token.rstrip())
     12     return token

PermissionError: [Errno 13] Permission denied: '.api_token.txt'

I ran the script interactively in IPython, and it gave me '/usr/lib/python3.9/site-packages/rest_uploader' as the current path. I don't know whether there is an expected directory, which is or could be defined, where rest_uploader would look for .api_token.txt. Should that be the directory where files are uploaded (that is what makes sense to me)?

I installed the script on Arch Linux with the following command:

pypi2pkgbuild.py git+https://github.com/kellerjustin/rest-uploader

which automagically builds an Arch Linux package from GitHub. I don't know whether this could have modified something in the script.

The script was installed in /usr/lib/python3.9/site-packages/rest_uploader/

The following files were installed:

$ ls /usr/lib/python3.9/site-packages/rest_uploader

api_token.py  cli.py  __init__.py  __pycache__  rest_uploader.py

kellerjustin · 14 June 2021 13:47

I should refactor how that api token file is stored and put it in user space. You should be able to get around this by altering the permissions on the site-packages/rest_uploader folder or running the script as root the first time to create the .api_token.txt file.
Thanks for reporting!
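
For anyone hitting the same PermissionError, here is a minimal sketch of what storing the token in user space could look like. This is not the actual rest_uploader code: the ~/.config/rest_uploader location, the api_token.txt file name, and the save_token / load_token helpers are all assumptions made for illustration.

```python
import os
from pathlib import Path

# Hypothetical location in the user's home directory (an assumption,
# not what rest_uploader actually uses).
DEFAULT_CONFIG_DIR = Path.home() / ".config" / "rest_uploader"


def token_path(config_dir=DEFAULT_CONFIG_DIR):
    """Return the path of the token file inside the given config directory."""
    return Path(config_dir) / "api_token.txt"


def save_token(token, config_dir=DEFAULT_CONFIG_DIR):
    """Write the token to a user-writable directory instead of site-packages."""
    path = token_path(config_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(token.strip(), encoding="utf-8")
    try:
        # Restrict the file to the current user where the OS supports it.
        os.chmod(path, 0o600)
    except OSError:
        pass
    return path


def load_token(config_dir=DEFAULT_CONFIG_DIR):
    """Return the stored token, or None if no token file exists yet."""
    path = token_path(config_dir)
    if path.exists():
        return path.read_text(encoding="utf-8").strip()
    return None
```

Because the file lives under the user's home directory, no root access or permission change on site-packages would be needed.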

harun27 · 19 July 2021 13:35

Is it possible to process all images and pdfs that are already in my notes?

I also always get this error:
Too few characters. Skipping this page Error during processing.
Am I doing something wrong or are the images just bad?

kellerjustin · 19 July 2021 14:05

rest_uploader only handles the creation of new notes from files. You might want to check out OCR Joplin Notes for what you're trying to do.
Thanks!

laurent · 25 July 2021 20:05

Is the Tesseract lib reliable enough in general? I've read that it's difficult to use, but it seems you managed to make it work well?

kellerjustin · 26 July 2021 14:49

If it was terribly difficult to use, I wouldn't have been able to make it work!
I'll put it this way - the creator(s) of Tesseract did the heavy lifting and made a pretty solid open-source OCR framework, and subsequently pytesseract made it easy to access from Python. I just had to hook into it. I've found it's much easier to deal with Tesseract on Linux because package management makes it easy. On Windows, unless someone has released a better installer in the last six months or so, it's sort of a headache to install and involves setting environment variables.
Thanks!

laurent · 26 July 2021 15:04

Thanks for the info, it seems indeed easier than I thought. Does it successfully OCR documents most of the time? Are there any issues when the documents are not in English?

kellerjustin · 26 July 2021 15:17

It's been pretty solid. My handwriting is garbage so I don't expect it will work with any of my handwritten notes, but it seems to do well when processing a scanned text document. You pass Tesseract a language parameter, and other than the occasional test, I only ever pass in eng, so I can't adequately answer your question as to how well it works with other languages. Hope that helps!
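
To illustrate the language parameter, here is a rough sketch that calls the tesseract command-line tool directly, using its -l option and "stdout" as the output base. It is not how rest_uploader does it (that goes through pytesseract); build_tesseract_command and ocr_image are hypothetical helper names.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, lang="eng"):
    """Build a tesseract CLI call that prints the recognized text to stdout.

    `lang` is a Tesseract language code such as "eng" or "deu"; several
    languages can be combined with "+", for example "eng+deu".
    """
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        build_tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

The matching language data has to be installed for any code other than eng, otherwise tesseract exits with an error.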

Chamo · 5 December 2021 22:13

I'm getting the following error message:
Check Tesseract OCR Configuration
UZN file C:\Users[username]\AppData\Local\Temp\tess_tvjq976a loaded.
Estimating resolution as 258
UZN file C:\Users[username]\AppData\Local\Temp\tess_tvjq976a loaded.
Warning. Invalid resolution 0 dpi. Using 70 instead.
Too few characters. Skipping this page
Error during processing.

Any ideas? From the GitHub issue "Warning. Invalid resolution 0 dpi. Using 70 instead." (Issue #1702, tesseract-ocr/tesseract), it sounded like it might be an issue related to the metadata ("It means your image does not contain a resolution info in its metadata, so Tesseract warns you about this issue in the image and it tries to estimate the resolution by itself."), but I wasn't sure what to try to fix it.

kellerjustin · 6 December 2021 15:29

Is this happening with all the images you try to upload, or just one particular file?

Chamo · 6 December 2021 18:26

I get that error with all the files I scan, unfortunately.

kellerjustin · 9 December 2021 00:33

Hmm, it looks like it's trying to OCR the file before it's fully written. Are you monitoring a temp directory?

Chamo · 10 December 2021 23:42

Thank you @kellerjustin! That is likely the issue. Unfortunately, when I went to troubleshoot it, it started giving me the following error. Any thoughts on what I am doing wrong?

(base) C:\Users[username]> rest_uploader C:\Users[username].config\joplin-desktop\ImportingWithOCR
Launching Application rest_uploader.cli.main
Endpoint: http://127.0.0.1:41184
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\ProgramData\Anaconda3\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\ProgramData\Anaconda3\Scripts\rest_uploader.exe\__main__.py", line 7, in <module>
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\click\core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\rest_uploader\cli.py", line 126, in main
    notebook_id = set_notebook_id(destination.strip())
  File "C:\ProgramData\Anaconda3\lib\site-packages\rest_uploader\rest_uploader.py", line 151, in set_notebook_id
    folders = res.json()["items"]
KeyError: 'items'

kellerjustin · 12 December 2021 02:52

Check to make sure you have the latest version of Joplin and rest_uploader. There was a breaking change to the API several versions ago and this looks like what it might be.
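
As a rough illustration of why that line can fail, here is a sketch of a more defensive way to read the folders response. Newer Joplin versions wrap paginated results in an object with an "items" key, while older ones returned a plain list. The extract_folders helper is hypothetical and not part of rest_uploader, and the assumption that a failed request comes back as an object with an "error" key should be checked against your Joplin version.

```python
def extract_folders(payload):
    """Return the list of folders from a decoded Joplin API response.

    Handles three shapes:
      - a plain list (older, non-paginated API)
      - {"items": [...], "has_more": ...} (paginated API)
      - {"error": "..."} (assumed error shape, e.g. a bad token)
    """
    if isinstance(payload, list):
        return payload
    if isinstance(payload, dict):
        if "items" in payload:
            return payload["items"]
        if "error" in payload:
            raise RuntimeError(f"Joplin API returned an error: {payload['error']}")
    raise ValueError(f"Unexpected response shape: {payload!r}")
```

With something like this in place, a bad token or wrong endpoint would surface as a readable message rather than a bare KeyError.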

Chamo · 14 December 2021 14:48

OK, got it working. I think the issue was that I had a user variable or system variable wrong. I followed the instructions in "OCR in Joplin (How to)" (post #3 by myfta) and the errors were resolved. Then, for my earlier issue, per your suggestion, I tried having the scanner save the file somewhere else, and only once the file was completely saved did I move it to the monitored folder. Sure enough, the OCR worked! Thanks so much for all your help. Truly, if you hadn't built and maintained this script I wouldn't have used Joplin at all; having OCR was a firm requirement for me as I transitioned away from Evernote.
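
For anyone who cannot change where the scanner saves, the same idea can be sketched in code: wait until the file size stops changing before handing the file to OCR. This is only an illustration under that assumption, not something rest_uploader is known to do; wait_until_stable and its parameters are hypothetical.

```python
import os
import time


def wait_until_stable(path, interval=1.0, checks=3, timeout=60.0):
    """Return True once the file size has stayed the same for `checks`
    consecutive polls, or False if `timeout` seconds pass first.

    A missing or empty file never counts as stable.
    """
    deadline = time.monotonic() + timeout
    last_size = -1
    stable = 0
    while time.monotonic() < deadline:
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1  # file missing or not accessible yet
        if size > 0 and size == last_size:
            stable += 1
            if stable >= checks:
                return True
        else:
            stable = 0
        last_size = size
        time.sleep(interval)
    return False
```

A size check like this is a heuristic. Saving elsewhere and moving the finished file into the monitored folder, as described above, is the more reliable approach.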
