File Uploader and OCR

sttrebo · 15 May 2020 14:38

good to know, thanks. seems i have a bit more learning to do…

RogerL · 15 May 2020 18:04

I have just had a PDF emailed to me, which has caused this to happen:

Monitoring directory D:\Temp\Scans for files
created -- D:\Temp\Scans\Qigong Classes.pdf
Exception in thread Thread-1:
Traceback (most recent call last):
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\watchdog\observers\api.py", line 196, in run
    self.dispatch_events(self.event_queue, self.timeout)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\watchdog\observers\api.py", line 369, in dispatch_events
    handler.dispatch(event)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\watchdog\events.py", line 336, in dispatch
    }[event.event_type](event)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\rest_uploader.py", line 70, in on_created
    self._event_handler(event.src_path)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\rest_uploader.py", line 53, in _event_handler
    if filesize < 1 or (ext == ".pdf" and not pdf_valid(path)):
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\img_process.py", line 34, in pdf_valid
    if open_pdf(filename) is None:
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\img_process.py", line 25, in open_pdf
    pdfFileObject = open(filename, "rb")
PermissionError: [Errno 13] Permission denied: 'D:\\Temp\\Scans\\Qigong Classes.pdf'

This raises two questions.

1. If this happens it seems that rest_uploader is stuck and has to be restarted… I plopped another PDF known to be transferred previously into the scan folder and nothing happened. Is this right?
2. I don’t care what the permissions are for a PDF sent to me by email, I just want it transferred to Joplin. Will I have to worry about the permissions of every file that is emailed to me?
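
On the first question: the traceback above shows the PermissionError escaping from on_created into watchdog's dispatch thread ("Exception in thread Thread-1"), which would be consistent with the watcher going quiet until it is restarted. A minimal sketch of one way a handler could survive a bad file, assuming nothing about how rest_uploader is actually structured; `keep_watching` and the stand-in `on_created` below are hypothetical names, not part of rest_uploader or watchdog:

```python
import functools
import logging

log = logging.getLogger("uploader_sketch")  # hypothetical logger name


def keep_watching(handler):
    """Wrap an event callback so an exception is logged instead of propagating."""
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        try:
            return handler(*args, **kwargs)
        except Exception:
            # log.exception records the full traceback; the caller keeps running
            log.exception("handler %s failed; skipping this event", handler.__name__)
            return None
    return wrapper


@keep_watching
def on_created(path):
    # Stand-in for the real upload step; fails the way the traceback above did.
    raise PermissionError(13, "Permission denied", path)


# The error is logged rather than raised, so a dispatch loop would carry on.
on_created(r"D:\Temp\Scans\example.pdf")
```

With this shape, one unreadable file costs a log entry rather than the whole watcher.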

kellerjustin · 16 May 2020 04:20

Hm. I haven’t run into any permission issues in the testing I’ve done. Any chance the file was open in another application or something? Can you reproduce the error with same file after restarting the program?

RogerL · 16 May 2020 14:52

I can reproduce the error with the same file after restarting the program, but I don’t understand why - it seems it’s a quirk of MS’s NTFS ownership/permissions system.

If I detach the PDF from Thunderbird directly into the rest_uploader’s monitored folder it works.

If I detach it into a holding folder on the same drive and then move it into the rest_uploader’s monitored folder it works.

However if I detach into a holding folder on the mounted VeraCrypt volume and then move into the rest_uploader’s monitored folder it fails.

If I do a DIR /Q in a Command Prompt window in this holding folder it says the file is owned by BUILTIN\Administrators, but when I move it to rest_uploader’s monitored folder a DIR /Q says it’s owned by PC12-NOVATECH2\Roger, i.e. me, who obviously has read permission.

All users have read access, so why your program can’t do an open(file, "rb") is a mystery to me. Is it possible for your program to display the attributes open() is failing on?
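
For anyone wanting to see what the OS reports at the moment open() fails, here is a small standalone sketch using only the standard library. It is not part of rest_uploader; `describe_open_failure` is a hypothetical helper name. Note that os.access() and the mode string are only a rough guide on Windows, where ACLs and file locks are not reflected in them.

```python
import os
import stat
import tempfile


def describe_open_failure(path):
    """Try to open path for reading and return a dict of what the OS reports."""
    info = {"path": path, "exists": os.path.exists(path)}
    try:
        st = os.stat(path)
        info["size"] = st.st_size
        info["mode"] = stat.filemode(st.st_mode)
        info["readable"] = os.access(path, os.R_OK)
    except OSError as exc:
        info["stat_error"] = repr(exc)
    try:
        with open(path, "rb"):
            info["open_error"] = None
    except OSError as exc:
        info["open_error"] = {
            "errno": exc.errno,
            # winerror only exists on Windows and may be None even there
            "winerror": getattr(exc, "winerror", None),
            "strerror": exc.strerror,
        }
    return info


# Usage: run it against a throwaway file, then clean up.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n")
print(describe_open_failure(tmp.name))
os.remove(tmp.name)
```

Running this against the problem PDF right after it lands in the monitored folder, and again a few seconds later, would show whether the result changes over time.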

kellerjustin · 21 May 2020 14:24

Next time I circle back to the project I can try to reproduce but it might be a few weeks – didn’t want to leave you hangin’, though. Is it something with VeraCrypt? rest_uploader does not take into account permissions at all - although watchdog or one of the PDF libraries might… ?
Thanks

RogerL · 21 May 2020 14:46

It isn’t a problem for me, I just won’t use the VeraCrypt volume as a holding folder. It’s more a question of understanding what’s going on. Rest_uploader’s error report states there is a PermissionError after an open(filename, "rb") but when I look in Explorer the PDF file’s permissions are OK. I would love to try an experiment and have rest_uploader wait a second after seeing the file appear in the folder before it does the open() because it appears the file is inheriting the properties of the folder and rest_uploader is getting it before this has taken effect.
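
The experiment described here (wait, then try the open() again) can be sketched as a small retry loop. This is only an illustration under the assumption that the PermissionError is transient; `open_when_ready` and its parameters are hypothetical and not something rest_uploader provides. The `opener` and `sleep` arguments exist only so the behaviour can be demonstrated without a real locked file.

```python
import time


def open_when_ready(path, attempts=5, delay=1.0, opener=open, sleep=time.sleep):
    """Open path for binary reading, retrying while PermissionError is raised."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return opener(path, "rb")
        except PermissionError:
            if attempt == attempts:
                raise  # still failing after the last try: let the caller see it
            sleep(delay)  # give the OS or the other program time to release the file


# Demonstration with a fake opener that fails twice and then succeeds.
calls = []


def flaky_opener(path, mode):
    calls.append(path)
    if len(calls) < 3:
        raise PermissionError(13, "Permission denied", path)
    return f"handle for {path}"


result = open_when_ready("scan.pdf", opener=flaky_opener, sleep=lambda s: None)
print(result, "after", len(calls), "attempts")
```

If the real file opens on the second or third attempt, that would support the timing theory; if it fails every time, the cause is more likely in the permissions themselves.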

kellerjustin · 23 September 2020 19:52

I released a new version of rest_uploader. Version 1.11 adds an option to move files after uploading them into Joplin (-o /path/to/directory, --moveto /path/to/directory). Also the newer version will attempt to autorotate image files based on the OCR unless you choose to turn off the autorotate option (-r no, --autorotate no). I implemented these features mostly because I really wanted to streamline my workflow, but I certainly hope others will see the benefit also!
To upgrade:
pip install --upgrade rest_uploader

Feedback welcome. Thanks!

johano · 14 October 2020 20:02

Just discovered the --moveto option today, makes things a lot easier to figure out what's going on in the watched folder. Very useful, thanks

kellerjustin · 15 October 2020 16:58

Glad you're finding it useful! Just FYI - I subsequently released version 1.12 and changed the argument from -m to -o to ensure I wouldn't conflict with python if calling as a python module. --moveto is unchanged.
Thanks!

kellerjustin · 26 October 2020 18:39

I'm assuming you're looking for a Windows solution?

This appears to be a good walk-through.

mzguy · 3 November 2020 16:09

I haven't used this solution yet, but I have a question....

What happens if I have a large database with already OCR'd attachments? Can I somehow get the text added to the note without recreating each note with an attachment? I don't need the OCR done again since it's already embedded in each PDF (and in images), but I'd like the notes to become globally searchable through Joplin.

Can I accomplish this, or would I need to extract all the PDF attachments and re-create them as new notes?

Thanks.

kellerjustin · 4 November 2020 01:57

Haven't implemented this but I have considered it. My thought would be to loop through notes and run the script (which will extract embedded PDF text) and then when it's done with the OCR, add an ocr tag to the note to indicate it's been done. I don't have the bandwidth to add something like this right now, but would certainly welcome a pull request! But yes, as a workaround you could extract the PDFs and let rest-uploader bring them in, create a preview image, and extract the existing OCR from the PDF.

Thanks!
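
The bookkeeping half of that idea (work out which notes do not yet carry the ocr tag) can be sketched without touching the OCR itself. Everything here is an assumption for illustration: `fetch_page` is a hypothetical callable that would wrap an HTTP GET against Joplin's Data API, and the `/notes` and `/tags/<id>/notes` paths and the `items` / `has_more` response shape should be checked against the API documentation for the Joplin version in use. The example runs against fake in-memory data.

```python
def iter_items(fetch_page, path):
    """Yield every item from a paginated listing, one page at a time."""
    page = 1
    while True:
        body = fetch_page(path, page)
        yield from body["items"]
        if not body.get("has_more"):
            return
        page += 1


def notes_needing_ocr(fetch_page, ocr_tag_id):
    """Return the notes that are not yet attached to the given tag."""
    done = {note["id"] for note in iter_items(fetch_page, f"/tags/{ocr_tag_id}/notes")}
    return [note for note in iter_items(fetch_page, "/notes") if note["id"] not in done]


# Fake data standing in for a real Joplin instance (hypothetical IDs).
FAKE = {
    "/notes": [{"id": "a"}, {"id": "b"}, {"id": "c"}],
    "/tags/ocr123/notes": [{"id": "b"}],
}


def fake_fetch_page(path, page, page_size=2):
    rows = FAKE[path]
    start = (page - 1) * page_size
    return {"items": rows[start:start + page_size], "has_more": start + page_size < len(rows)}


print([note["id"] for note in notes_needing_ocr(fake_fetch_page, "ocr123")])
```

Each note returned would then be processed and tagged, so a later run skips it.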

mzguy · 7 November 2020 16:52

I figure doing the same for images would be ideal.

tnwn · 7 November 2020 18:40

It is interesting to discuss this without having installed it, but I thought it could:
TESSERACT(1) Manual Page

Most image file formats (anything readable by Leptonica) are supported

I hope it would be a plugin soon with some GUI for ordinary users.

mzguy · 8 November 2020 00:37

My reply was not regarding general functionality. It was with regard to @kellerjustin specifically describing the new feature that he agreed would be useful:

My thought would be to loop through notes and run the script (which will extract embedded PDF text) and then when it's done with the OCR, add an ocr tag to the note to indicate it's been done.

I simply suggested that this process would be useful to do for images, in addition to PDFs.

mzguy · 8 November 2020 00:39

Do you plan to convert this to a Joplin Plugin, using the new mechanism?

kellerjustin · 8 November 2020 14:04

rest_uploader does an OCR extraction on image files also via tesseract, if that wasn't clear. I just specifically mentioned embedded PDF text because that's what you asked about in your previous post.

As for a plugin - cool idea, but rest_uploader is written in Python and it'd probably be less work for someone skilled at JS to start from scratch building a new plugin than trying to port my code.

mzguy · 12 November 2020 19:13

I finally tried this and got it working in Windows. However, I tried getting it to launch via pythonw.exe so that it ran in non-interactive mode as a background process, so I didn't have to keep the command window open. This resulted in all sorts of errors reported by the script after putting a PDF into the watched folder. These same errors did not show up if I launched with python or directly from the script, and supplied the same PDF.

Has anyone succeeded getting rest_uploader to run in the background without keeping a window open on the taskbar?

mzguy · 12 November 2020 19:20

When using the -o argument, the script is complaining about a file lock. The processed file gets copied to the correct folder, rather than moved (two copies exist). Permissions are lax on the folders in question.

Is this a Windows-specific bug perhaps?
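
One possible explanation, offered as a guess rather than a diagnosis: on Windows a file that is still open in the same process generally cannot be renamed or deleted, and Python's shutil.move falls back to copy-then-delete when a plain rename fails, which would leave two copies if the delete step is refused. Whether rest_uploader moves files this way is an assumption. A minimal sketch of the safe ordering, with the hypothetical helper `read_then_move`, closes the handle before moving:

```python
import shutil
import tempfile
from pathlib import Path


def read_then_move(src, dest_dir):
    """Read src fully, close it, then move it into dest_dir."""
    src = Path(src)
    with open(src, "rb") as fh:  # the handle is closed when this block exits
        data = fh.read()
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Only move once nothing in this process holds the file open.
    target = Path(shutil.move(str(src), str(dest_dir / src.name)))
    return data, target


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    watched = Path(root) / "watched"
    watched.mkdir()
    scan = watched / "scan.pdf"
    scan.write_bytes(b"%PDF-1.4 sample")
    data, target = read_then_move(scan, Path(root) / "done")
    print(target.name, len(data), scan.exists())
```

The same concern applies to any library object that keeps the PDF open; it would need to be closed or released before the move.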

mzguy · 12 November 2020 19:42

Is it possible to add a text layer to the scanned PDFs, so that they can be searched when opening them with a reader?

Also, here are two sample PDF files, where the text extract did not get pulled into the note for an entire page:
