GSoC Idea - OCR Support

OCR is one of the projects listed on the ideas page for GSoC '20. I think OCR would be a great addition to Joplin: it would let users extract text from images very easily and would be a welcome feature for everyone.

As mentioned on the ideas page, the initial stage of implementing OCR requires checking the feasibility of the project. As suggested by Laurent, the first step was to integrate the library into the desktop app and test it.

I’ve done a basic implementation of the Tesseract library in the desktop app and run some tests. I would like to discuss the results and get feedback, so that I have a clear picture before submitting my proposal.

The implementation is on the ocr-tess branch of a fork of the joplin repo.

Before testing, install the tesseract.js package:

npm install tesseract.js

In the Note editor, a download icon can be found in the toolbar. Click on the icon and select the image.


After selecting the image(s), the progress of the OCR is logged to the console, along with the time taken to process the image. (I haven’t added a progress bar yet, so monitor the console. Once it logs that the API has been initialized, the recognition is running; give it some time.)
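For context, the core of what I’ve wired up is essentially the standard tesseract.js worker flow. Below is a minimal sketch assuming the tesseract.js v2 API; the function name and the timing code are illustrative, not the exact code on the branch:

```js
const { createWorker } = require('tesseract.js');

// Minimal sketch of the recognition flow (tesseract.js v2 API).
// The logger callback is what produces the progress messages in the console.
async function recognizeImage(imagePath) {
  const worker = createWorker({
    logger: m => console.log(m), // e.g. { status: 'recognizing text', progress: 0.5 }
  });

  const start = Date.now();
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');

  const { data: { text } } = await worker.recognize(imagePath);
  await worker.terminate();

  console.log(`OCR finished in ${(Date.now() - start) / 1000}s`);
  return text; // this is what gets appended to the note
}
```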

On completion of the process, the text is appended to the Markdown editor. This is what I’ve implemented so far. I tested the functionality with some sample images, and below I try to answer the questions Laurent has raised.

Note: This is a very basic implementation of the Tesseract library and is nowhere near the specification mentioned by Laurent in the issue. The final implementation can be much more user-friendly, with dialogs and progress bars.

Is the image OCR’ed correctly?

Well, yes and no. Clean images (images with clear text/fonts and proper spacing) were recognized perfectly.

Input image:

Output Text: (Time 18s)

Book scanning

Problem statement for the Online Qualification Round of Hash Code 2020
Introduction
Books allow us to discover fantasy worlds and better understand the world we live in.
They enable us to learn about everything from photography to compilers… and of
course a good book is a great way to relax!
Google Books is a project that embraces the value books bring to our daily lives. It
aspires to bring the world’s books online and make them accessible to everyone. In the
last 15 years, Google Books has collected digital copies of 40 million books in more
than 400 languages’, partly by scanning books from libraries and publishers all around
the world.
In this competition problem, we will explore the challenges of setting up a scanning
process for milions of books stored in libraries around the world and having them
scanned at a scanning facility.
Task
Given a description of libraries and books available, plan which books to scan from
which library to maximize the total score of all scanned books, taking into account that
each library needs to be signed up before it can ship books.

However, when I tried scanning a few receipts, the text was not OCR’ed properly.

Input Image:


Output Text: (Time 6s)

Musée du Louvre
105 W Riley 5t
Easton, KS 66020
(800) 867-5309
. Touvre. r
oct 28 2015 3:38pM
order: 10002292
Date: oct 28 2015 3:37Pw
card: i
Acet #: aosmsssnaeniny
expiration: Feb 2020
Authorization: ‘00000
Total: $13.20
signature:
st
BRI
Thank You!
Werchant copy

Does it work with non-English text?

Yep. Tesseract claims to support around 100 languages. One point to note: the recognition-accuracy issues described above apply to every language, and they become more pronounced as the complexity of the script increases.
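For reference, switching languages is just a matter of loading different traineddata. A minimal sketch, again assuming the tesseract.js v2 API ('fra' is only an example language code):

```js
const { createWorker } = require('tesseract.js');

// Sketch: recognizing French text instead of English (tesseract.js v2 API).
// Multiple languages can also be combined, e.g. 'eng+fra'.
async function recognizeFrench(imagePath) {
  const worker = createWorker();
  await worker.load();
  await worker.loadLanguage('fra'); // fetches fra.traineddata on first use
  await worker.initialize('fra');
  const { data: { text } } = await worker.recognize(imagePath);
  await worker.terminate();
  return text;
}
```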

How slow/fast is it?

For smaller notes and clearer images the OCR is pretty quick, as the sample inputs above show. Larger files, however, take noticeably longer.

This image took around 3.5 minutes to complete, though the output text was pretty accurate for the input file.

From this initial testing, I would say that OCR can be implemented in the app. This basic implementation can be improved by passing more options to the worker. Tesseract also allows us to train custom data, and if required, that is something that could be looked into. The final implementation would give the OCR output its own text block (as mentioned by Laurent), and with that use case in mind, I think this is very feasible.
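As a rough idea of what "more options" could look like, here is a sketch (v2 API again; `langPath` and `setParameters` do exist in tesseract.js, but the specific values shown are only examples, not what the branch currently does):

```js
const { createWorker } = require('tesseract.js');

// Sketch of passing extra options to the worker (tesseract.js v2 API).
// Pointing langPath at a custom location is how a custom-trained
// .traineddata file could be used; the values below are illustrative only.
async function recognizeWithOptions(imagePath) {
  const worker = createWorker({
    langPath: 'https://tessdata.projectnaptha.com/4.0.0', // or a path to custom traineddata
  });
  await worker.load();
  await worker.loadLanguage('eng');
  await worker.initialize('eng');

  // Tesseract parameters can be tuned per use case, e.g. restricting the
  // character set when scanning receipts or serial numbers.
  await worker.setParameters({
    tessedit_char_whitelist: '0123456789$.,:/- ',
  });

  const { data: { text } } = await worker.recognize(imagePath);
  await worker.terminate();
  return text;
}
```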

Thank you for reading through this. I would love some feedback, and I hope to implement OCR as part of GSoC '20.
