File Uploader and OCR

Correct - this is a Windows-specific bug -- I've observed it as well. I've tried a few things to correct it, so far without success. My dev machine runs Linux, so I would need to take some time to set up a testing environment on Windows.

This issue isn't Windows-specific though, is it? It seems to be a PDF / OCR parsing issue.

I have another issue with a PDF that's 17MB. When I drop it in the target folder, it appears to be completely ignored by REST_Uploader. There is no CPU activity, no error output from the script, and no output file. I do know that the process doesn't crash though, as I can send in another PDF and it gets processed just fine.

Is there a size limit or are some PDFs ignored for some reason? I'd be happy to share the PDF if that's helpful.

There is a notice in the output of the program...

File transferring...0
File transferring...14322847
File xfer complete. Size=14322847
Filesize = 14322847. Too big for Joplin, skipping upload

Just adhering to Joplin's 10 MB size limit.

If this ever changes, I can remove the code which causes the uploader to ignore largeish files.
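For readers wondering what such a check might look like, here is a minimal sketch. It is not the uploader's actual code: the function names are hypothetical, and the exact byte value of the limit is an assumption, since the thread only says "10mb".

```python
import os

# Assumed limit: the thread only says "10mb". The exact byte count the
# uploader compares against is not shown, so 10 * 1024 * 1024 is a guess.
MAX_UPLOAD_BYTES = 10 * 1024 * 1024


def should_skip_upload(size_bytes: int, limit: int = MAX_UPLOAD_BYTES) -> bool:
    """Return True when a file is larger than the assumed Joplin limit."""
    return size_bytes > limit


def check_file(path: str) -> bool:
    """Hypothetical helper: report whether the file at `path` can be uploaded."""
    size = os.path.getsize(path)
    if should_skip_upload(size):
        # Mirrors the notice quoted above from the program output.
        print(f"Filesize = {size}. Too big for Joplin, skipping upload")
        return False
    return True
```

With the file from the log above, `should_skip_upload(14322847)` is true under either reading of "10mb" (10,000,000 or 10,485,760 bytes), which is consistent with the file being skipped rather than crashing the process.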

Thanks!

It looks like the 10 MB limit is currently in Joplin as an error:

See the last few comments.

@kellerjustin

Have you had a chance to look at this OCR issue, or the text layer PDF feature request?

As mentioned in the GitHub issue, this is broken on Joplin 1.4.19.

There was a minor change in Joplin's API which broke rest_uploader. I fixed this in version 1.13.0. I haven't tested it against the older version of Joplin but I have every reason to believe that rest_uploader version 1.13.0 will not work with versions of Joplin prior to 1.4.x. Thanks!
To upgrade:
pip install -U rest_uploader

Appreciate the rapid fix!

This looks doable with Tesseract!

I'm looking for comments about how well Tesseract works for search purposes compared to Evernote. I recognize that testing these types of things is very challenging but I'm wondering if anyone has thoughts about it.

What led me to Evernote in the first place was their proprietary OCR system. If I've understood it correctly, it has what I would call an x:1 or non-linear keyword system. What I mean is that it will create several potential words for each section of an image. By contrast, most OCR systems were developed to convert printed documents to electronic word processor type files. In traditional OCR you have a 1:1 relationship between the image in the original scan and the words that are output. So, if the OCR gets a word wrong, a search system is completely blind to it. With Evernote, if the system considers a particular word or character choice to be questionable, it will look for other possible words or characters and also list them in the search keywords. This seems very smart to me for search purposes.
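To make the x:1 idea concrete, here is a small illustrative sketch. It is not how Evernote or Tesseract actually work internally; the data shape (a list of candidate words per recognized region) and the function names are assumptions made up for the example.

```python
from collections import defaultdict


def build_index(documents):
    """Index every candidate word, not just the top OCR guess.

    `documents` maps a document id to a list of regions, where each region
    is a list of candidate readings for that part of the image
    (a hypothetical shape chosen for this sketch).
    """
    index = defaultdict(set)
    for doc_id, regions in documents.items():
        for candidates in regions:
            for word in candidates:
                index[word.lower()].add(doc_id)
    return index


def search(index, term):
    """Return the ids of documents where any candidate matches `term`."""
    return sorted(index.get(term.lower(), set()))


scans = {
    "receipt-1": [["invoice"], ["tota1", "total"]],  # second region is uncertain
    "letter-2": [["dear"], ["sir"]],
}
index = build_index(scans)
print(search(index, "total"))
```

A 1:1 system that kept only the first guess, "tota1", would never return "receipt-1" for a search on "total". Keeping the alternates trades a larger index and some false positives for better recall.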

Of course, I may have this explanation of Evernote wrong. That said, I've found that it generally works quite well. For years now it has fulfilled my dream of touchless (mostly) archival.

I'm trying to get the rest-uploader set up and working soon and perhaps I will have my own opinions to post here.

I used Evernote for archival also and I've been using Joplin now for a couple years, and haven't really ever needed to go back to Evernote to find anything. Tesseract OCR+Joplin search does pretty well. I did implement a rather rudimentary system in rest-uploader to match tags in the OCR text when it finds them, so that I think is helpful for finding things later. For me it usually matches too many tags, but that's more a matter of me needing to clean out some tags I don't really use. I'd suggest finding a tagging system that works for you to help optimize search. Thanks for checking it out, and I always like hearing feedback when you get things up & running -- good luck!
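As a rough illustration of matching tags against OCR text, here is a minimal sketch. The real matching logic in rest-uploader is not shown in this thread, so the function name and the whole-word, case-insensitive rule are assumptions.

```python
import re


def match_tags(ocr_text, known_tags):
    """Return the known tags that appear as whole words in the OCR text.

    Matching is case-insensitive and single-word only; multi-word tags
    would need a different approach (an assumption of this sketch).
    """
    words = set(re.findall(r"\w+", ocr_text.lower()))
    return sorted(tag for tag in known_tags if tag.lower() in words)


text = "Invoice from the Dentist, total due in 30 days"
print(match_tags(text, ["dentist", "invoice", "taxes"]))
```

This also shows why a scheme like this can match too many tags: any short or common tag that happens to occur in the scanned text is applied, so pruning rarely used tags reduces the noise.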

rest-uploader not responding to .jpg files.

I just installed this on macOS Mojave and have run into a few issues. The first was that I didn't have the web clipper enabled, so it threw an error when I added the first file to the monitored folder.
I'm pretty sure it was a jpg. BUT, since then, I haven't been able to get it to do anything in response to a jpg. Conversely, a pdf does make it do its thing.

I've been experimenting with the open source "Open Note Scanner" app on Android to generate the initial images. It has an aggressive image processor. It seems to create unusually large files and I've had some odd issues with trying to convert them to pdf with Preview. But that aside, I think that rest-uploader is supposed to respond to jpg files, right?

Yes, it should respond to and upload jpg files. I have never tested on macOS, and I have no access to a Mac, so I really have no idea. Do you have a screenshot?

Haven't implemented this but I have considered it. My thought would be to loop through notes and run the script (which will extract embedded PDF text) and then when it's done with the OCR, add an ocr tag to the note to indicate it's been done. I don't have the bandwidth to add something like this right now, but would certainly welcome a pull request! But yes, as a workaround you could extract the PDFs and let rest-uploader bring them in, create a preview image, and extract the existing OCR from the PDF.

For all the Evernote users out there that are jumping ship (including me!), this would be very welcome. Your script is working well for PDFs and images as they are added, but both I and many others have a large archive of notes that pre-date our use of Joplin.

If you or anyone can do this, I have a suggestion: provide a flag to go through all notes and process PDF files, or all notes with a particular tag. That way, a user can tag some notes with images that need OCR, and the script doesn't OCR every image, some of which won't need it.
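The selection logic for such a pass could look roughly like the sketch below. It is only an outline of the idea discussed here: notes are represented as plain dictionaries (a hypothetical shape, not Joplin's API), the function names are invented, and the actual OCR step is left out.

```python
def select_notes_for_ocr(notes, only_tag=None, done_tag="ocr"):
    """Pick the notes that still need processing.

    Notes already carrying `done_tag` are skipped. If `only_tag` is given,
    only notes carrying that tag are selected.
    """
    selected = []
    for note in notes:
        tags = set(note.get("tags", []))
        if done_tag in tags:
            continue  # processed in an earlier run
        if only_tag is not None and only_tag not in tags:
            continue  # user did not ask for this note
        selected.append(note)
    return selected


def mark_done(note, done_tag="ocr"):
    """Add the marker tag so the note is skipped on the next run."""
    tags = note.setdefault("tags", [])
    if done_tag not in tags:
        tags.append(done_tag)


notes = [
    {"title": "Old receipt", "tags": ["needs-ocr"]},
    {"title": "Already done", "tags": ["needs-ocr", "ocr"]},
    {"title": "Holiday photo", "tags": []},
]
for note in select_notes_for_ocr(notes, only_tag="needs-ocr"):
    # The OCR itself would run here; this sketch only marks the note.
    mark_done(note)
```

Using a marker tag makes the pass safe to rerun, and the optional `only_tag` filter keeps images that don't need OCR out of the queue.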

Any chance this is on your near or medium-term radar?

Medium-term at best. I can certainly see the benefit, but I just don't see myself having time to implement this anytime soon. Sorry.

I just started with Joplin on Linux, and installed rest-uploader via pip. I got an error:
ERROR: img-processor 0.11.0 has requirement Pillow>=8.1.0, but you'll have pillow 7.0.0 which is incompatible.

Rest-uploader seems to be working fine, calling with Python3 as a module. Should I be concerned something is suboptimal due to the error above?

Also, I have a couple of requests for enhancements or bug fixes:

I sent in a JPG that was approximately 4000x3000. Rest-uploader processed it fine, but it was downsampled to about 2000x1500 in the process. Is there some way to avoid or control the downsampling?

Also, Tesseract supports multithreaded processing. However, I'm not sure how to get that to be used within the rest-uploader workflow. Is it something simple I can activate?

I suppose I have the same question for pdf2image. Both that and tesseract benefit from multithreading, but I'm thinking the relevant variables / flags aren't being used since my CPU doesn't get much above 20% utilization.
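Whether rest-uploader exposes any such options is not stated in this thread. On the pdf2image side, `convert_from_path` does accept a `thread_count` argument. For the OCR step, one generic way to use more cores is to process pages concurrently, as in the sketch below. The function name `ocr_pages` and the injected `ocr_fn` callable are hypothetical, and this is not how rest-uploader is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def ocr_pages(pages, ocr_fn, max_workers=4):
    """Run `ocr_fn` on each page concurrently and return results in page order.

    `ocr_fn` is any callable taking one page and returning its text. Threads
    help most when that callable spends its time outside the Python
    interpreter, for example waiting on an external OCR process.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order even if pages finish out of order.
        return list(pool.map(ocr_fn, pages))


# Stand-in for a real OCR call so the example runs without extra packages.
print(ocr_pages(["page one", "page two"], str.upper, max_workers=2))
```

In a real setup, `ocr_fn` might be something like `pytesseract.image_to_string` applied to the images produced by pdf2image, but that wiring is an assumption and is untested here.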

I had made some changes to the img-processor package a few weeks ago and must have pinned newer versions of things. Best practice would be to create and use a virtual environment, but the older version of those libraries should suffice. Thanks.

Thanks. I'm a newbie here and learning. I'll try to get it set up through virtualenv, but to get everything with python back to Ubuntu stock, can I just pip3 uninstall rest-uploader, and all its dependencies one-by-one? I've not installed anything else with pip3 other than rest-uploader.