GSoC Idea - OCR Support

This is the discussion topic for the above-mentioned idea.
Anything on how to do this, how it shall be done, what features it shall include etc. is discussed here if an existing topic hasn’t been created yet; see the idea description below.
Please announce your interest in this idea here, otherwise it gets easily lost, as we would need to remember each introduction.

This topic is used to update the specification of the idea as well, even if there is an existing topic, so interested students, watch it!
Anything that shall be discussed privately, e.g. if it involves your proposal, will be discussed through a private channel, which is currently under discussion.

As of the moment I’m writing this, the idea’s description at https://joplinapp.org/gsoc2020/ideas.html#5-ocr-support is:

It is possible to add support for OCR content in Joplin via the Tesseract library. A first step would be to assess the feasibility of this project by integrating the lib in the desktop app and trying to OCR an image. OCR support should be implemented as a service of the desktop app. It would extract the text from the images, and append the content as plain text to the notes.

Expected Outcome: A service on the desktop app that extracts text from images and attaches it to the note.

Difficulty Level: High
Skills Required: JavaScript
Potential Mentor(s): CalebJohn, laurent22

OCR is one of the projects listed on the ideas page for GSoC '20. I think OCR would be a great addition to Joplin. It would help users add text from images very easily and would be a feature welcomed by everyone.

As mentioned on the ideas page, the initial stage of implementing OCR requires checking the feasibility of the project. As suggested by Laurent, the first step was to integrate the library into the desktop app and test it.

I’ve done a basic implementation of the Tesseract library in the desktop app and run some tests. I would like to discuss the results and get feedback, so that I have a solid basis for the proposal I’m submitting.

The implementation is on the ocr-tess branch of a fork of the joplin repo.
It can be found here.

Before testing, install the tesseract.js package.

npm install tesseract.js

In the Note editor, a download icon can be found in the toolbar. Click on the icon and select the image.


After selecting the image(s), the progress of the OCR is logged to the console, along with the time taken to process the image. (I haven’t added any progress bar, so monitor the console. After it logs that the API has been initialized, the process is running; give it some time.)

On completion of the process, the text is appended to the markup editor. This is what I’ve implemented so far. I tested the functionality with some sample images and I would like to answer some queries Laurent has mentioned.
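For anyone curious, the core of this basic integration looks roughly like the sketch below. This is a minimal sketch using the public Tesseract.js v2 API, not the exact code on the ocr-tess branch; appendToEditor() is just a stand-in for however the text ends up in the markup editor.

```js
const Tesseract = require('tesseract.js');

// Stand-in for the actual call that appends text to the note body.
function appendToEditor(text) {
	console.log('Would append to note:', text);
}

// Run OCR on the selected image and log progress to the console.
async function ocrImage(imagePath) {
	const started = Date.now();

	const { data } = await Tesseract.recognize(imagePath, 'eng', {
		// Progress events: loading the core, initializing the API, recognizing text, ...
		logger: m => console.log(m.status, m.progress),
	});

	console.log(`OCR finished in ${(Date.now() - started) / 1000}s`);
	return data.text;
}

ocrImage('sample.png').then(text => appendToEditor('\n' + text));
```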

Note: This is a very basic implementation of the Tesseract library and is nowhere near the specifications mentioned by Laurent in the issue. The final implementation can be much more user-friendly, with dialogs and progress bars.

Is the image OCR’ed correctly?

Well, yes and no. Clean images (images with clear text/fonts and proper spacing) were recognized perfectly.

Input image:

Output Text: (Time 18s)

Book scanning

Problem statement for the Online Qualification Round of Hash Code 2020
Introduction
Books allow us to discover fantasy worlds and better understand the world we live in.
They enable us to learn about everything from photography to compilers… and of
course a good book is a great way to relax!
Google Books is a project that embraces the value books bring to our daily lives. It
aspires to bring the world’s books online and make them accessible to everyone. In the
last 15 years, Google Books has collected digital copies of 40 million books in more
than 400 languages’, partly by scanning books from libraries and publishers all around
the world.
In this competition problem, we will explore the challenges of setting up a scanning
process for milions of books stored in libraries around the world and having them
scanned at a scanning facility.
Task
Given a description of libraries and books available, plan which books to scan from
which library to maximize the total score of all scanned books, taking into account that
each library needs to be signed up before it can ship books.

However, I tried scanning a few receipts and the text was not OCR’ed properly.

Input Image:


Output Text: (Time 6s)

Musée du Louvre
105 W Riley 5t
Easton, KS 66020
(800) 867-5309
. Touvre. r
oct 28 2015 3:38pM
order: 10002292
Date: oct 28 2015 3:37Pw
card: i
Acet #: aosmsssnaeniny
expiration: Feb 2020
Authorization: ‘00000
Total: $13.20
signature:
st
BRI
Thank You!
Werchant copy

Does it work with non-English text?

Yep. Tesseract claims the library supports around 100 languages. A point to note is that the accuracy issues mentioned above apply to all languages, and as the complexity of the language increases, those issues tend to get worse.
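For reference, switching languages in Tesseract.js is mostly a matter of loading the corresponding traineddata. Here is a rough sketch using the v2 worker API, with French as an arbitrary example (the language code and file name are only for illustration):

```js
const { createWorker } = require('tesseract.js');

// Recognize an image using French traineddata instead of English.
async function ocrFrench(imagePath) {
	const worker = createWorker();
	await worker.load();
	await worker.loadLanguage('fra');
	await worker.initialize('fra');
	const { data: { text } } = await worker.recognize(imagePath);
	await worker.terminate();
	return text;
}

ocrFrench('facture.png').then(text => console.log(text));
```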

How slow/fast is it?

For smaller notes and clearer images, the OCR is pretty quick, as can be seen from the sample inputs above. However, with larger files, it will take more time.

This image took around 3.5 minutes to complete, though the output text was pretty accurate for the input file.

From the initial testing, I would say that OCR can be implemented in the app. This basic implementation can be improved by providing more options to the worker. Tesseract also allows us to custom-train a dataset, and if required, that is something that could be looked into. The final implementation would include the OCR having its own text block (as mentioned by Laurent), and with that use case in mind, I think this is very feasible.
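As an illustration of what “more options to the worker” could mean, Tesseract.js exposes setParameters() for passing engine settings before recognition. The whitelist below is only an example of the kind of tuning that might help on constrained inputs such as receipts; it is not part of the current implementation.

```js
const { createWorker } = require('tesseract.js');

// Example of tuning the engine before recognition: restrict the
// character set to digits and a few symbols (illustrative values).
async function ocrDigitsOnly(imagePath) {
	const worker = createWorker();
	await worker.load();
	await worker.loadLanguage('eng');
	await worker.initialize('eng');
	await worker.setParameters({
		tessedit_char_whitelist: '0123456789.,$: ',
	});
	const { data: { text } } = await worker.recognize(imagePath);
	await worker.terminate();
	return text;
}
```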

Thank you for reading through this. I would love some feedback on this and actually look into implementing OCR as a part of GSoC '20.

I only reordered the post, so that it is consistent with the others.

Great input

I’m also interested in solving this problem statement!

Please check my recent workaround for how I plan to implement OCR support in Joplin. Please also let me know all your valuable suggestions and feedback @CalebJohn @laurent

OCR Support workaround

Just stay in GSoC Idea - OCR Support, that is all, so we see everything at one glance :slight_smile:

@CalebJohn @laurent can we implement the OCR in a hybrid approach for desktop/web/mobile using Ionic?

Please read the live blog closely.

I have just read up on Ionic.
It is a completely different framework.

When Laurent began creating Joplin years ago, there was some reason why he chose React over Ionic. He may share it with us.
Regardless of this, you don’t switch frameworks from one day to the next, in particular as GSoC proposals are based on the current framework.
There are probably reasons why Ionic would be a good choice, and there are options to combine them; see https://reactionic.github.io/.
Without question, we are open to a serious discussion about this if you come up with proper reasoning in a new topic in #development, wherein you explain why it would be wise to use Ionic and how a break with the current code would be avoided.

I don’t want to be harsh, but just asking “can we do this” without explaining why is a bit too time-intensive for us to answer, keeping in mind that a lot of things are currently going on here and on GH. Understandable, isn’t it?

Thanks @CalebJohn for clearing up my doubts. Based on that, I have put together a workaround for how OCR support should be implemented in Joplin.

I will be discussing the cases keeping in mind the user experience of OCR integration in Joplin. Please also share your views on it, as that will help me write a good proposal for the project.

About Tesseract

This is a pretty good library for OCR and supports more than 100 languages.

The time taken by the OCR process depends on the image:

=> Size of the image

=> Quality of the image

=> Font used in the image

=> Number of words in the image

So, basically, the process may take anywhere from seconds to minutes depending on the image.

Now, based on our discussion, I have assumed a few things and accordingly planned to test the implementation of OCR support in Joplin, keeping in mind that it should not hamper the user experience.

Let us say I am writing a note and want to extract the text from an image and use that text.

If we want to get the OCRed text in the same text area in which we are writing our note, then:

Case 1: If the image is of good quality, with a good font size and fewer words, then the OCR process will be done in seconds and the confidence level of the OCR will be good.

Case 2: If the image is of bad quality, with a small font size and a large number of words, then the OCR process will take time and the confidence level of the OCR will not be good, which means the OCRed text will contain lots of errors.

Now in Case 1, we can do it the way we discussed: upload the image and show “…”


And after processing, “…” will be replaced with the OCRed text.


But this will not be helpful in Case 2, as it can take minutes to OCR the image and get the text, and we are also not sure whether the OCRed text is correct or not.
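To make the Case 1 flow concrete, here is a minimal sketch of the placeholder idea. The editor object and its insertText()/replaceText() methods are hypothetical stand-ins, not Joplin’s actual editor interface:

```js
const Tesseract = require('tesseract.js');

// Hypothetical editor interface, only for illustration.
const editor = {
	insertText: text => console.log('insert:', text),
	replaceText: (from, to) => console.log('replace', from, 'with', to),
};

const PLACEHOLDER = '…';

// Insert a placeholder right away, then swap it for the OCRed text
// once recognition finishes. This works for quick Case 1 images;
// slow Case 2 images would need a progress dialog instead.
async function insertOcrText(imagePath) {
	editor.insertText(PLACEHOLDER);
	const { data: { text } } = await Tesseract.recognize(imagePath, 'eng');
	editor.replaceText(PLACEHOLDER, text.trim());
}

insertOcrText('whiteboard.png');
```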

Now, if we go deeper and study the user behaviour, we will come to understand how OCR support should be implemented. Maybe I could be wrong, but this is what I found. We also want OCR support to be available for multiple images. So what if we do it this way:

Check the video

In this way, we have a pretty good user experience, and it will also improve the productivity of the user while writing the note.

We will simply have an icon that opens an OCR window, where the user can add all the images and simply copy-paste the OCRed text; we will also have a dropdown to select a different language. While the images are being OCRed, the user can write some notes and then use the OCRed text accordingly.

Please do give me your suggestions

@PackElend Thanks for the reply. I understand the constraints of a GSoC proposal, so I will work with Tesseract.js along with Electron… though I will try to come up with a feasible approach in a #development topic.

Sometimes it is not that easy to say which category is the appropriate one.
My point of view is: high-level or strategic discussion in #features, and questions about how to code and build it in #development.
We are still at the high level.

https://openpaper.work/en/
could help with your research as well, but it is built with PHP.

Nice to read that this topic is still active. I will also look at openpaper, but that application places all your papers in one folder, which seems unhandy. I use, and am quite happy with, Textfairy (renard314) for OCR on Android, and I then copy and correct the text in Joplin.

Download here without Playstore:

He is also using GitHub - tesseract-ocr/tesseract: Tesseract Open Source OCR Engine (main repository).

Maybe they or he can help to get it working in Joplin?

I think that when using OCR on personal computers you always have to correct some text; I have never used one that is without errors. The OCR built into copy machines with a print server is much better, but not that personal. :wink:

Thanks for the advice about “textfairy”.

What’s the status of OCR-implementation in Joplin? Just asking out of curiosity…
Thanks to all of you for your great work!!

We have got some proposals for how to add this feature, but we got only two GSoC slots.
If anyone wants to work on it, I would happily show the options for how it could be done.

Is this the right place to make suggestions about OCR in Joplin? I will try. If not, I will just erase my message.

A few thoughts about it:
As Joplin already manages the integration of pictures, I think OCR could be added to the contextual menu accessible through a right click.
Advantage: a simpler extension or plugin, but
Disadvantage: the image is imported into Joplin.
It is, though, probably easy to delete the image if the text obtained through OCR is satisfactory.

I'm not at all a specialist in OCR, but in order to improve OCR results, dictionaries are generally used. This may be an option to consider.

Another idea would be to define preferred languages. Most people work with a few languages. In my case, I work in English, Portuguese and French. If preferred languages cannot be selected, Portuguese, for example, ends up at the end of the menu, which is not the most usable (fun) way to work.
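For what it’s worth, Tesseract.js already accepts a '+'-joined list of languages, so a preferred-languages setting could map almost directly onto the worker initialization. A rough sketch, where the preferredLanguages list is just an example and not an existing Joplin setting:

```js
const { createWorker } = require('tesseract.js');

// Hypothetical preferred-languages setting.
const preferredLanguages = ['eng', 'por', 'fra'];

// Initialize one worker with all preferred languages at once;
// Tesseract.js accepts a '+'-joined language string.
async function ocrWithPreferredLanguages(imagePath) {
	const langs = preferredLanguages.join('+');
	const worker = createWorker();
	await worker.load();
	await worker.loadLanguage(langs);
	await worker.initialize(langs);
	const { data: { text } } = await worker.recognize(imagePath);
	await worker.terminate();
	return text;
}
```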

Yes it is :slight_smile:
