Import from Evernote...why doesn't existing OCR get included?

Stupid question from someone road-testing Joplin.

As we know, Evernote does full OCR on...pretty much everything you stuff into it. It makes search very, very powerful.

I imported all 15k+ of my existing Evernote notes, going back to 2008. For the most part, they look fine. But what I don't understand is why the existing OCR data that Evernote creates with each note isn't available to Joplin. As I said, a stupid question.

This may be a path-killer for me. I was very excited at how easily Joplin imported all my data, but, for example, I have about 3,500 receipts in a notebook in Evernote, all of them PDFs scanned in from ScanSnap, and almost all of them have names along the lines of 'rcpt_scan_02876.pdf'.

Which means that's about 3,500 useless PDFs imported, since if I want to look for, say, the receipt for a car repair back in 2014, I am SOL.

Joplin being open-source is a big draw for me. Free is a side-benefit, but any application that gives me real value, I'll affirmatively pay for, either by donations or upgrades (Joplin Cloud) etc., and frankly I was looking forward to doing one or the other if I can stay on this path.

I'm aware of the assorted plugins out there, official and unofficial, that allegedly will do OCR, even on existing notes. But all of them are wonky in the extreme, requiring me to install all sorts of glue on the backend, which I'm just not inclined to do if the results are suboptimal. For example, when I tried ylc395's joplin-plugin-ocr, every single receipt was rendered upside down, and the OCR from it thus came out as gibberish.

I'm finally being pushed over the brink with Evernote because I learned that you cannot access your own local note repository unless you log in via Evernote's authentication servers. That won't work well when one day Evernote closes its doors and shuts them all down.

I have a bad tendency to write a thousand words where 150 would do the trick. Sorry.

I'm probably not much (or any) help, but the real question is how to get your stuff out of Evernote. The way OCR scanning works is that you first scan in an image, and the OCR then adds a text layer over the top of that image. Scanning as an image only is the basic error: if you scan as searchable PDF in the first place, you don't have an issue. Too late in your case. Apparently the Joplin import isn't getting both layers of the PDF file. I would not be surprised if Evernote added something to their OCR to make it not work normally.

I have thousands of scanned documents. My scanner saves them directly as searchable (OCR) documents and files them in Windows folders. I can use Windows Explorer to find anything. I use Joplin for other things, since Windows handles files just fine. Is there any way to get your documents out of Evernote as searchable PDFs usable by any PDF reader?

It won't even require EN shuttering for you to have problems with this requirement that the app 'phones home' every so often. If you don't have an internet connection when the app decides to phone home, you are locked out of your own notes until your internet connection is restored. That was one of many deal killers for me with EN.

I'm almost certain all of the pdf's were scanned as 'searchable PDF' from the beginning, though it was a loooong time ago when I first started. But I know they've been searchable for many years.
I think it boils down to your speculation that Evernote probably embeds some sort of secret sauce into their OCR data so it's not exportable. I know the PDF's are all 'attachments' to the notes in that you can directly open or save a note's PDF at will - and I'd assume EN wouldn't go to the trouble of stripping the OCR on open or save, but who knows.
Actually extracting all of those PDF's is the big question. I do have backups of all of the original PDF scans generated by the ScanSnap, so maybe that's the secret, just fill a folder with all of them and iterate through them somehow.
My brain's a bit foggy right now so it's something I'll have to investigate tomorrow. Thanks for the reply!

Actually, it's less of a problem in the here and now than you've presented, probably thanks to EN's general incompetency in recent years, which works to our benefit:
Once you authenticate in the app (at least for Mac and Windows), unless you affirmatively log out, you stay logged in. EN doesn't repeatedly phone home. That was a 'solution' one of the EN apologists over at the EN forums suggested. I can turn off my internet connection and still use Evernote (except for the bug that several thousand of my notes are not available locally for reasons unknown; EN support is clueless).
In the scenario I presented in the forums - how does one recover if you inadvertently get logged out - the EN apologist had the audacity to suggest 'don't do inadvertent things', as if there aren't numerous potential ways to get logged out (malicious teen in the house, a motor deficit causes an accidental click, etc).

But yeah. It's still an unacceptable decision to force a remote login in order to access one's own local data.

Laurent is currently working on implementing OCR in one of the upcoming versions of Joplin. I can't say how far along he is with this or when this function will be available.

Most excellent. I'll do the standard thing and ask "when will it be ready" but only for comic effect. I worked for a decade-plus with assorted devs, and the correct reply is always "when it's ready". :rofl:

Sorry for going off-topic, but what format are these searchable documents in?

The reason for not keeping the data is documented here:

It's simply that even if we imported it, it wouldn't be used for anything such as search, etc.

From the next release OCR should hopefully be included and that OCR data will be regenerated in a way that is supported by Joplin. As far as I can see the OCR system is good for PDF documents and not too bad for images. The advantage is that it all happens locally so none of your data will be uploaded to the cloud.
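If you are curious whether your Evernote export carries any OCR data at all, a rough Python sketch like the one below could count it. It assumes the notes were exported as an ENEX file (Evernote's XML export), and that any recognition data sits in a `recognition` element inside each `resource` element. The function name `count_recognition` is made up for illustration, and nothing here is how Joplin's importer works.

```python
import xml.etree.ElementTree as ET


def count_recognition(enex_path):
    """Return (total attachments, attachments carrying recognition data).

    Assumption: each attachment is a <resource> element and OCR data,
    when present, is stored in a child <recognition> element.
    """
    total = 0
    with_reco = 0
    # iterparse streams the file, which matters for very large exports.
    for _event, elem in ET.iterparse(enex_path, events=("end",)):
        if elem.tag == "resource":
            total += 1
            reco = elem.find("recognition")
            if reco is not None and (reco.text or "").strip():
                with_reco += 1
            elem.clear()  # free the (possibly large) base64 payload
    return total, with_reco
```

Calling `count_recognition("receipts.enex")` (the file name is only an example) returns two numbers, so you can see at a glance how many attachments came with recognition data in the export.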

I'm almost certain all of the pdf's were scanned as 'searchable PDF' from the beginning, though it was a loooong time ago when I first started.

I agree. That would be the ScanSnap default.

Most likely there is no Evernote OCR at all. Why OCR what is already text?

I know the PDF's are all 'attachments' to the notes

Are ANY attachments getting exported at all or only notes?

However, why put the stuff that is already in the ScanSnap Home folders into Evernote or Joplin? In the ScanSnap Home folder, type in some words and boom, your receipt or proposal or whatever will immediately pop up. Microsoft's Explorer searches and indexes PDF files.

To check whether your PDF files are already searchable, taking Evernote's OCR out of the picture entirely, open one and search within the document (probably Ctrl-F) for a word that is clearly easy to read. If it finds the word, then the file is already searchable and there would be nothing for Evernote to do.
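For a folder with thousands of scans, opening each file by hand gets old fast. Here is a minimal Python sketch of a bulk version of that check. It is only a rough heuristic and not a real PDF parser: it assumes that a PDF with a text layer mentions a `/Font` resource somewhere in its raw bytes, while an image-only scan does not. That can be wrong both ways (a PDF with compressed object streams may hide the marker, and a scan with a stamped-on page number may have a font but no usable text), so treat the result as a hint. The function names `looks_searchable` and `scan_folder` are made up for illustration.

```python
from pathlib import Path


def looks_searchable(pdf_path):
    """Heuristic only: True if the raw PDF bytes mention a /Font resource."""
    data = Path(pdf_path).read_bytes()
    return b"/Font" in data


def scan_folder(folder):
    """Map each *.pdf file name in `folder` to the heuristic result."""
    pdfs = sorted(Path(folder).glob("*.pdf"))
    return {p.name: looks_searchable(p) for p in pdfs}
```

Running `scan_folder` on the folder of original scans gives a dictionary of file names to True/False; spot-check a few files from each group with Ctrl-F before trusting it.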

I very much look forward to it. I'm getting comfortable gradually with Joplin, but will probably run EN in parallel for the time being. Appreciate your work!

Are ANY attachments getting exported at all or only notes?

Well, it's all sort of one thing, but jammed together. If that makes the least bit of sense. I can open an Evernote note with a PDF in it, and read the PDF within EN, or open the PDF in my standard reader app, or save it, whatever. With Joplin I can do the same. So the PDF is an 'attachment' but embedded within the note. Or is my brain going all fuzzy with your question...

As far as the OCR'd files go: well, I just downloaded all 5,477 of my PDF scans from iDrive, and a random check shows no ability to search the raw PDF for text. Which kind of surprises me. But the same PDF in Evernote is fully searchable. Checking the ScanSnap settings, 'Scan to Evernote' has the option 'Convert to Searchable PDF'. So I'm unsure why the PDF file isn't searchable.

Too much for my puny brain right now!