
GSoC Idea - OCR Support

This is the discussion topic for the above-mentioned idea.
Anything on how to do this, how it should be done, what features it should include, etc. is discussed here if a dedicated topic hasn’t been created yet; see the idea description below.
Please announce your interest in this idea here, otherwise it gets lost easily, as we would need to remember each introduction.

This topic is used to update the specification of the idea as well, even if there is an existing topic, so interested students, watch it!
Anything that should be discussed privately, e.g. if it involves your proposal, will be handled through a private channel; which channel to use is currently under discussion.

At the moment I’m writing this, the idea’s description at https://joplinapp.org/gsoc2020/ideas.html#5-ocr-support is:

It is possible to add support for OCR content in Joplin via the Tesseract library. A first step would be to assess the feasibility of this project by integrating the lib in the desktop app and trying to OCR an image. OCR support should be implemented as a service of the desktop app. It would extract the text from the images, and append the content as plain text to the notes.

Expected Outcome: A service on the desktop app that extract text from images and attach it to the note.

Difficulty Level: High
Skills Required: JavaScript
Potential Mentor(s): CalebJohn, laurent22

OCR is one of the projects listed on the ideas page for GSoC '20. I think OCR would be a great addition to Joplin: it would let users extract text from images very easily and would be welcomed by everyone.

As mentioned on the ideas page, the initial stage of implementing OCR requires checking the feasibility of the project. As suggested by Laurent, the first step was to integrate the library into the desktop app and test it.

I’ve done a basic implementation with the Tesseract library in the desktop app and ran some tests. I would like to discuss the results and get feedback, so that I have a clear picture when submitting my proposal.

The implementation is on the ocr-tess branch of a fork of the joplin repo.
It can be found here.

Before testing, install the tesseract.js package.

npm install tesseract.js

In the Note editor, a download icon can be found in the toolbar. Click on the icon and select the image.


After selecting the image(s), the progress of the OCR is logged to the console, along with the time taken to process the image. (I haven’t added a progress bar; monitor the console instead. Once it says the API is initialized, the process is running, so give it some time.)

On completion of the process, the text is appended to the markup editor. This is what I’ve implemented so far. I tested the functionality with some sample images and I would like to answer some queries Laurent has mentioned.

Note: This is a very basic implementation of the Tesseract library and is nowhere near the specifications mentioned by Laurent in the issue. The final implementation can be much more user-friendly, with dialogs and progress bars.

Is the image OCR’ed correctly?

Well, yes and no. Clean images (images with clear text/font and proper spacing) were recognized perfectly.

Input image:

Output Text: (Time 18s)

Book scanning

Problem statement for the Online Qualification Round of Hash Code 2020
Introduction
Books allow us to discover fantasy worlds and better understand the world we live in.
They enable us to learn about everything from photography to compilers… and of
course a good book is a great way to relax!
Google Books is a project that embraces the value books bring to our daily lives. It
aspires to bring the world’s books online and make them accessible to everyone. In the
last 15 years, Google Books has collected digital copies of 40 million books in more
than 400 languages’, partly by scanning books from libraries and publishers all around
the world.
In this competition problem, we will explore the challenges of setting up a scanning
process for milions of books stored in libraries around the world and having them
scanned at a scanning facility.
Task
Given a description of libraries and books available, plan which books to scan from
which library to maximize the total score of all scanned books, taking into account that
each library needs to be signed up before it can ship books.

However, I tried scanning a few receipts and the text was not OCR’ed properly.

Input Image:


Output Text: (Time 6s)

Musée du Louvre
105 W Riley 5t
Easton, KS 66020
(800) 867-5309
. Touvre. r
oct 28 2015 3:38pM
order: 10002292
Date: oct 28 2015 3:37Pw
card: i
Acet #: aosmsssnaeniny
expiration: Feb 2020
Authorization: ‘00000
Total: $13.20
signature:
st
BRI
Thank You!
Werchant copy

Does it work with non-English text?

Yes. Tesseract claims to support around 100 languages. A point to note is that the accuracy issues mentioned above apply to all languages, and they tend to get worse as the complexity of the language increases.

How slow/fast is it?

For smaller notes and clearer images, the OCR is pretty quick, as can be seen from the sample inputs above. However, larger files will take time.

This image took around 3.5 minutes to complete, though the output text was pretty accurate for the input file.

From the initial testing, I would say that OCR can be implemented in the app. This basic implementation can be improved by providing more options to the worker. Tesseract also allows us to train a custom model, and if required, that is something that could be looked into. The final implementation would give the OCR output its own text block (as mentioned by Laurent), and with that use case in mind, I think this is very feasible.

Thank you for reading through this. I would love some feedback and to actually look into implementing OCR as part of GSoC '20.


I only reordered the post so that it is consistent with the others.

Great input

I’m also interested in solving this problem statement!

Please check my recent workaround for how I plan to implement OCR support in Joplin. Please also let me know all your valuable suggestions and feedback @CalebJohn @laurent

OCR Support workaround

Just stay in GSoC Idea - OCR Support, that is all, so we see everything at one glance :slight_smile:


@CalebJohn @laurent Can we implement the OCR in a hybrid approach for desktop/web/mobile using Ionic?

please read the live blog closely

I have just read up on Ionic.
It is a completely different framework.

When Laurent began creating Joplin years ago, there was some reason why he chose React over Ionic. He may share it with us.
Regardless, you don’t switch frameworks from one day to the next, in particular as GSoC proposals are based on the current framework.
There are probably reasons why Ionic would be a good choice, and there are options to combine them, see https://reactionic.github.io/.
That said, we are open to a serious discussion about this if you come up with proper reasoning in a new topic in #development, wherein you explain why it would be wise to use Ionic and how a break with the current code would be avoided.

I don’t want to be harsh, but just asking “can we do this” without explaining why is a bit too time-intensive for us to answer, keeping in mind that a lot of things are currently going on here and on GitHub. Understandable, isn’t it?


Thanks for clearing up my doubts, @CalebJohn. Based on that, I have made a workaround for how OCR support should be implemented in Joplin.

I will discuss the cases keeping in mind the user experience in Joplin for the OCR integration. Please also share your views on it, as it will help me write a good proposal for the project.

About Tesseract

This is a pretty good library for OCR and supports more than 100 languages.

The time taken by the OCR process depends on the image:

=> Size of the image

=> Quality of the image

=> Font in the image

=> Number of words in the image

So basically, the process may take from seconds to minutes depending on the image.

Now, based on our discussion, I have assumed a few things, and accordingly I have planned to test the implementation of OCR support in Joplin, keeping in mind that it does not hamper the user experience.

Let us say I am writing a note and want to extract the text from an image and use that text.

If we want to get the OCR’ed text in the same text area in which we are writing our note, then:

Case 1: If the image has good quality, a good font size, and fewer words, then the OCR process will finish in seconds and the confidence level of the OCR will be good.

Case 2: If the image has bad quality, a small font size, and a large number of words, then the OCR process will take time and the confidence level will not be good, which means the OCR’ed text will contain lots of errors.

Now, in Case 1, we can do it the way we discussed: upload the image and show “…”


And after processing, the “…” will be replaced with the OCR’ed text.


But this will not be helpful in Case 2, as it can take minutes to OCR the image, and we are also not sure whether the OCR’ed text is correct or not.
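The Case 1 flow (insert “…” immediately, replace it once OCR finishes) could be sketched with plain string handling. The comment-based token format below is purely illustrative, not anything Joplin actually uses:

```javascript
// Hypothetical sketch of the placeholder flow for Case 1:
// drop a unique marker into the note body right away, then
// swap it for the OCR'ed text when recognition completes.
function insertPlaceholder(noteBody, id) {
	// The "ocr-pending" token format is made up for illustration.
	return noteBody + `\n<!-- ocr-pending:${id} -->…`;
}

function replacePlaceholder(noteBody, id, ocrText) {
	return noteBody.replace(`<!-- ocr-pending:${id} -->…`, ocrText);
}
```

A unique id per job also keeps concurrent OCR runs from clobbering each other's placeholders.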

Now, if we go deeper and study user behavior, we will come to know how OCR support should be implemented. Maybe I am wrong, but this is what I found. Also, we want OCR support to be available for multiple images. So what if we do it this way:

Check the video

In this way we have a pretty good user experience, and it will also improve the user’s productivity when writing a note.

We will simply have an icon that opens the OCR window; the user can then add all the images and simply copy-paste the OCR’ed text. We also have a dropdown to select a different language. And while the images are being OCR’ed, the user can write some notes and use the OCR’ed text accordingly.

Please do give me your suggestions.


@PackElend Thanks for the reply. I understand the constraints of a GSoC proposal, so I will work with Tesseract.js along with Electron, though I will try to come up with a feasible approach in a #development topic.

Sometimes it is not that easy to say which category is the appropriate one.
My point of view is: high-level or strategic discussion goes in #features, and questions about how to code and build go in #development.
We are still at the high level.

https://openpaper.work/en/
could help with your research as well, but it is built with PHP.