good to know, thanks. seems i have a bit more learning to do…
I have just had a PDF emailed to me, which has caused this to happen:
Monitoring directory D:\Temp\Scans for files
created -- D:\Temp\Scans\Qigong Classes.pdf
Exception in thread Thread-1:
Traceback (most recent call last):
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\watchdog\observers\api.py", line 196, in run
    self.dispatch_events(self.event_queue, self.timeout)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\watchdog\observers\api.py", line 369, in dispatch_events
    handler.dispatch(event)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\watchdog\events.py", line 336, in dispatch
    }[event.event_type](event)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\rest_uploader.py", line 70, in on_created
    self._event_handler(event.src_path)
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\rest_uploader.py", line 53, in _event_handler
    if filesize < 1 or (ext == ".pdf" and not pdf_valid(path)):
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\img_process.py", line 34, in pdf_valid
    if open_pdf(filename) is None:
  File "D:\Progs\WPy64-3770\python-3.7.7.amd64\lib\site-packages\rest_uploader\img_process.py", line 25, in open_pdf
    pdfFileObject = open(filename, "rb")
PermissionError: [Errno 13] Permission denied: 'D:\\Temp\\Scans\\Qigong Classes.pdf'
This raises two questions.
- If this happens it seems that rest_uploader is stuck and has to be restarted… I plopped another PDF known to be transferred previously into the scan folder and nothing happened. Is this right?
- I don’t care what the permissions are for a PDF sent to me by email, I just want it transferred to Joplin. Will I have to worry about the permissions of every file that is emailed to me?
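On the first question: the traceback shows the exception escaping from Thread-1, the thread running the watchdog observer, which is consistent with later files being ignored until a restart. A minimal sketch of one way to keep a watcher alive is to catch and log errors inside the handler. The names `keep_alive` and `handle_new_file` are hypothetical and are not part of rest_uploader or watchdog.

```python
import functools
import logging


def keep_alive(func):
    """Log any exception raised by a handler instead of letting it
    propagate and end the thread that called the handler."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logging.exception records the full traceback
            logging.exception("Handler %s failed; still watching", func.__name__)
            return None
    return wrapper


@keep_alive
def handle_new_file(path):
    # Hypothetical stand-in for the real per-file processing
    with open(path, "rb") as fh:
        return len(fh.read())
```

With this in place a single unreadable file is reported and skipped, and the next file dropped into the folder is still processed.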
Hm. I haven’t run into any permission issues in the testing I’ve done. Any chance the file was open in another application or something? Can you reproduce the error with same file after restarting the program?
I can reproduce the error with the same file after restarting the program, but I don’t understand why. It seems to be a quirk of MS’s NTFS ownership/permissions system.
If I detach the PDF from Thunderbird directly into the rest_uploader’s monitored folder it works.
If I detach it into a holding folder on the same drive and then move it into rest_uploader’s monitored folder, it works.
However, if I detach it into a holding folder on the mounted VeraCrypt volume and then move it into rest_uploader’s monitored folder, it fails.
If I do a DIR /Q in a Command Prompt window in this holding folder, it says the file is owned by BUILTIN\Administrators, but when I move it to rest_uploader’s monitored folder a DIR /Q says it’s owned by PC12-NOVATECH2\Roger, i.e. me, who obviously has read permission.
All users have read access, so why your program can’t do an open(file, "rb") is a mystery to me. Is it possible for your program to display the attributes open() is failing on?
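As a rough illustration of what such a diagnostic could report, the sketch below collects the file's basic attributes and the details carried by the exception. `describe_open_failure` is a hypothetical helper, not something rest_uploader provides. On Windows the `winerror` field may help tell a plain access denial (5) from a sharing violation (32), since both surface as Errno 13. Note that `os.access` on Windows does not evaluate NTFS ACLs, so its answer is only a hint.

```python
import os
import stat


def describe_open_failure(path):
    """Return a dict describing the file and why open(path, 'rb') fails, if it does."""
    report = {"path": path, "exists": os.path.exists(path)}
    try:
        st = os.stat(path)
        report["size"] = st.st_size
        report["mode"] = stat.filemode(st.st_mode)
        report["readable"] = os.access(path, os.R_OK)
    except OSError as exc:
        report["stat_error"] = repr(exc)
    try:
        with open(path, "rb"):
            report["open_error"] = None
    except OSError as exc:
        report["open_error"] = {
            "errno": exc.errno,
            # winerror only exists on Windows; None elsewhere
            "winerror": getattr(exc, "winerror", None),
            "strerror": exc.strerror,
        }
    return report
```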
Next time I circle back to the project I can try to reproduce but it might be a few weeks – didn’t want to leave you hangin’, though. Is it something with VeraCrypt? rest_uploader does not take into account permissions at all - although watchdog or one of the PDF libraries might… ?
Thanks
It isn’t a problem for me, I just won’t use the VeraCrypt volume as a holding folder. It’s more a question of understanding what’s going on. Rest_uploader’s error report states there is a PermissionError after an open(filename, "rb") but when I look in Explorer the PDF file’s permissions are OK. I would love to try an experiment and have rest_uploader wait a second after seeing the file appear in the folder before it does the open() because it appears the file is inheriting the properties of the folder and rest_uploader is getting it before this has taken effect.
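The experiment described here could be sketched as a small retry loop around the open() call. This is only an illustration of the idea, assuming the failure is transient. `open_with_retry` and its parameters are hypothetical, and rest_uploader does not necessarily work this way.

```python
import time


def open_with_retry(path, attempts=5, delay=1.0):
    """Try to open a file for binary reading, pausing between attempts
    when a PermissionError is raised. Re-raises after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return open(path, "rb")
        except PermissionError:
            if attempt == attempts:
                raise
            # Give the OS (or another program) time to finish with the file
            time.sleep(delay)
```

Used as `with open_with_retry(path) as fh: ...`, it would show whether the error goes away after a second or persists, which in turn would hint at whether timing is really the cause.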
I released a new version of rest_uploader. Version 1.11 adds an option to move files after uploading them into Joplin (-o /path/to/directory, --moveto /path/to/directory). Also the newer version will attempt to autorotate image files based on the OCR unless you choose to turn off the autorotate option (-r no, --autorotate no). I implemented these features mostly because I really wanted to streamline my workflow, but I certainly hope others will see the benefit also!
To upgrade:
pip install --upgrade rest_uploader
Feedback welcome. Thanks!
Just discovered the --moveto option today. It makes it a lot easier to figure out what's going on in the watched folder. Very useful, thanks.
Glad you're finding it useful! Just FYI - I subsequently released version 1.12 and changed the argument from -m to -o to ensure I wouldn't conflict with python if calling as a python module. --moveto is unchanged.
Thanks!
I'm assuming you're looking for a Windows solution?
This appears to be a good walk-through.
I haven't used this solution yet, but I have a question....
What happens if I have a large database with already OCR'd attachments? Can I somehow get the text added to the note without recreating each note with an attachment? I don't need the OCR done again since it's already embedded in each PDF (and in images), but I'd like the notes to become globally searchable through Joplin.
Can I accomplish this, or would I need to extract all the PDF attachments and re-create them as new notes?
Thanks.
Haven't implemented this but I have considered it. My thought would be to loop through notes and run the script (which will extract embedded PDF text) and then when it's done with the OCR, add an ocr tag to the note to indicate it's been done. I don't have the bandwidth to add something like this right now, but would certainly welcome a pull request! But yes, as a workaround you could extract the PDFs and let rest-uploader bring them in, create a preview image, and extract the existing OCR from the PDF.
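The loop described above might look roughly like the sketch below. It is not an existing rest_uploader feature. It assumes the Joplin Data API (Web Clipper service) on its default port 41184, with paginated responses containing `items` and `has_more`. `TOKEN`, `tag_processed_notes`, and the `extract_text` callback are hypothetical, and the tag is assumed to exist already.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:41184"  # assumed default Web Clipper port
TOKEN = "YOUR_API_TOKEN"             # placeholder, copy from Joplin's settings


def joplin_request(method, path, payload=None, **params):
    """Send one request to the Joplin Data API and decode the JSON reply."""
    params["token"] = TOKEN
    url = f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def paged(request, path, **params):
    """Yield every item from a paginated GET endpoint."""
    page = 1
    while True:
        result = request("GET", path, page=page, **params)
        yield from result["items"]
        if not result.get("has_more"):
            break
        page += 1


def tag_processed_notes(request, tag_id, extract_text):
    """Append extracted text to each note not yet carrying the tag, then tag it.

    extract_text receives the note's resource list and returns a string
    (empty if there is nothing to add). Returns the ids of updated notes.
    """
    done = {n["id"] for n in paged(request, f"/tags/{tag_id}/notes", fields="id")}
    updated = []
    # Collect notes first so updates do not interfere with pagination
    for note in list(paged(request, "/notes", fields="id,body")):
        if note["id"] in done:
            continue
        resources = list(paged(request, f"/notes/{note['id']}/resources",
                               fields="id,mime"))
        text = extract_text(resources)
        if not text:
            continue
        request("PUT", f"/notes/{note['id']}",
                payload={"body": note["body"] + "\n\n" + text})
        request("POST", f"/tags/{tag_id}/notes", payload={"id": note["id"]})
        updated.append(note["id"])
    return updated
```

Passing the request function in as an argument keeps the loop testable without a running Joplin instance. In real use it would be called as `tag_processed_notes(joplin_request, tag_id, extract_text)`.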
Thanks!
I figure doing the same for images would be ideal.
It's interesting to discuss this without having installed it, but I thought it already can:
TESSERACT(1) Manual Page
Most image file formats (anything readable by Leptonica) are supported
I hope it would be a plugin soon with some GUI for ordinary users.
My reply was not regarding general functionality. It was with regard to @kellerjustin specifically describing the new feature function that he agreed would be useful:
My thought would be to loop through notes and run the script (which will extract embedded PDF text) and then when it's done with the OCR, add an ocr tag to the note to indicate it's been done.
I simply suggested that this process would be useful to do for images, in addition to PDFs.
Do you plan to convert this to a Joplin Plugin, using the new mechanism?
rest_uploader does an OCR extraction on image files also via tesseract, if that wasn't clear. I just specifically mentioned embedded PDF text because that's what you asked about in your previous post.
as for a plugin - cool idea, but rest_uploader is written in python and it'd probably be less work for someone skilled at JS to start from scratch building a new plugin than trying to port my code.
I finally tried this and got it working in Windows. However, I tried getting it to launch via pythonw.exe so that it ran in non-interactive mode as a background process, so I didn't have to keep the command window open. This resulted in all sorts of errors reported by the script after putting a PDF into the watched folder. These same errors did not show up if I launched with python or directly from the script, and supplied the same PDF.
Has anyone succeeded getting rest_uploader to run in the background without keeping a window open on the taskbar?
When using the -o argument, the script is complaining about a file lock. The processed file gets copied to the correct folder, rather than moved (two copies exist). Permissions are lax on the folders in question.
Is this a windows-specific bug perhaps?
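One possible explanation, offered only as a guess: Windows generally refuses to delete a file while a handle to it is still open, and `shutil.move` falls back to copy-then-delete when the destination is on a different filesystem, so a lingering open handle could leave two copies. The sketch below shows the general pattern of closing the file before moving it. `read_then_move` is a hypothetical helper, not rest_uploader's actual code.

```python
import os
import shutil


def read_then_move(src, dest_dir):
    """Read a file, make sure its handle is closed, then move it."""
    with open(src, "rb") as fh:
        data = fh.read()
    # The with-block has closed the handle, so the move can remove the source
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, os.path.basename(src))
    shutil.move(src, target)
    return data, target
```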
Is it possible to add a text layer to the scanned PDFs, so that they can be searched when opening them with a reader?
Also, here are two sample PDF files, where the text extract did not get pulled into the note for an entire page: