Week 11: Project Search Engine

naviji · 24 July 2020 15:30

Fuzzy search using spellfix

Here’s how I think fuzzy search should work.

For each word, we’re going to use the spellfix extension to find its (three?) closest matches.

So if the search query is
important invention

spellfix(“importat”) -> “important”, “importance”, “implement”
spellfix(“inventin”) -> “invention”, “investigate”, “investment”

Note the increasing “fuzziness” from left to right.
Spellfix finds the closest match from the words indexed by FTS (giving frequently occurring words higher priority).

So the fuzzy search query becomes

important(0) importance(1) implement(2) invention(0) investigate(1) investment(2)

Where (n) represents the fuzziness level. So n==0 would mean an exact match.

Sorting

Once we get the notes satisfying this query, we need to sort it.

There are many ways to go about this.

Here’s my current plan. We sort the notes by “min fuzziness.”
So if a note contains an exact match, its “min fuzziness” would be 0 (the best we can get),
and so it should go on top of notes with “min fuzziness” 1 or 2 (there are no exact matches).

For the notes having the same “min fuzziness,” we sort based on the relevance score from Okapi BM25.
Also, we won’t be fuzzifying phrase searches.

Thoughts?

CalebJohn · 25 July 2020 04:59

I like the proposal to do fuzzy searching its a great feature. Typically with fuzzy search you would use Levenshtein distance (and select cases where distance < 1 or 2) or some other edit distance. Have you considered these measures?

What you describe is definitely an interesting way to provide approximate search but isn’t strictly fuzzy search.

naviji · 25 July 2020 05:35

Spellfix does use Levenshtein distance as edit distance.

The reason some of the words are more than 2 edit distances apart is because there are no other closer matches in the sqlite database.

We are not finding the words with the smallest edit distance from the english dictionary. Instead we’re searching from the words in the sqlite database of Joplin (no need to search for a non existing word even if it does have a closer match).

https://www.sqlite.org/spellfix1.html

CalebJohn · 25 July 2020 23:48

Thanks for the clarification (and the link), I think this is a good proposal. What are your plans for measuring performance impact?

PackElend · 26 July 2020 13:44

I have read through this.
May I ask some humble question:

In 9. The editdist3 COST table the expression cost is used what based on

Multi-character insertions, deletions, and substitutions can be enumerated in the cost table.

The expression might be a bit irritating as it figures how fuzzy the search will be isn't it?
For sure it will cost more CPU power the more fuzziness is introduced.
Any chance that the user can adjust that value?
In the same chapter, there is languages mentioned. Will it be possible to search across multiple languages as I have notes in French, German and English.
It would be great if the included languages can be selected and a default selection is saved in user preferences.

naviji · 1 August 2020 03:15

For each word in the query, we’ll have to make a SELECT query to the spellfix table to get the fuzzy matches. How long this takes will be independent of the number of notes in the database but would depend on the number of terms in the query.

The actual query will be a single FTS query with OR as connective, which is fast.

I’ll do further testing on a large corpus of notes (>10000).

naviji · 1 August 2020 03:29

I don't plan on providing a custom edit distance function. Using the built in Wagner edit-distance function would be best. But the upper bound on the number of fuzzy matches to return could be made configurable. But as this is a SELECT query to an indexed table, it wouldn't make much difference in terms of performance.

I'm looking into this. I agree this would be useful.

PackElend · 1 August 2020 11:38

each user has different hardware, so it should be to the user to adjust.
IMHO, a reasonable performance on an up-do-date sub-notebook should be the default.

great

laurent · 2 August 2020 11:37

@naviji, thanks for the detailed spec. As I understand, fuzzy search is based on the existing words in the database, is that correct? In which case, why is the language relevant since it should be able to find matches based on string similarities rather than the meaning of the words, shouldn’t it?

Also does fuzzy search also handle spelling mistake? For example would it do this:

spellfix(“inventin”) -> “invention”, “investigate”, “investment”

PackElend · 2 August 2020 17:27

look at 6. Dealing With Unusual And Difficult Spellings, there it says:

To enhance the ability to correct the spelling of "salm" into "psalm", make an addition entry like this:

so it seems there is some kind spelling mistake correction but up to what degree

naviji · 3 August 2020 04:03

Yes. Using the words in the FTS index.

You're right. I mistakenly thought it did. But the language doesn't matter (at least for languages supported by FTS).

Or if I type ”我是法人“ it would return “我是法国人” too

It won't since this would perform a basic search (FTS only works with alphabetical languages).

Yes, provided the word "invention" does exist in the FTS index. I'll edit the post to clarify.

PackElend · 3 August 2020 06:29

would it allow to fill it by external dictionaries?

naviji · 3 August 2020 06:55

No, because it won’t be useful.

Searching for a word in an external dictionary that isn’t already in the FTS index (Joplin database) will always return an empty result.

PackElend · 3 August 2020 07:13

that is true but you may need it for words similar to word being sought. For example, I have only the word “flat” in my notes but enter “apartment” in as word being sought. It would be nice if it shows all notes with “flat”.

Thesaurus support is given in SQLite FTS5 Extension - 7.1.1. Synonym Support. I found that in

naviji · 4 August 2020 14:04

That seems useful. But unfortunately, we won’t be using FTS5 because of restrictions in the mobile app.
Let’s leave this for future work.

PackElend · 4 August 2020 14:22

what restriction?

naviji · 4 August 2020 14:49

FTS5 support is kind of limited on mobile. It needs

iOS ≥ 10.0 Android ≥ API 24 (Android 7.0)

whereas FTS4 can be used from API >= 11

PackElend · 4 August 2020 15:29

so it could be optional
min Android 7.0 fraction is rising

looks better for iOS

naviji · 5 August 2020 02:57

I’ve already discussed this with @laurent, and he thinks, as I do, that we should only support one version. Making FTS5 optional would complicate maintenance.

Besides, synonym search isn’t in the scope of this project. Fuzzy search deals with the problem of approximate string matching.

What do you think @laurent , @CalebJohn ?

CalebJohn · 5 August 2020 06:56

I could see it maybe being added as a search filter e.g tag:moving synonym:apartment but I don’t really see that as being in scope for this project, but maybe a future plugin?

Topic		Replies	Views
Week 16: Project Search Engine GSoC - Search Engine Project	10	1679	28 October 2020
Week 1: Project Search Engine GSoC - Search Engine Project	3	1289	20 May 2020
Week 15: Project Search Engine GSoC - Search Engine Project	2	971	29 August 2020
Week 12: Project Search Engine GSoC - Search Engine Project	2	1027	2 August 2020
Week 13: Project Search Engine GSoC - Search Engine Project	3	732	20 August 2020

Week 11: Project Search Engine

Fuzzy search using spellfix

Sorting

Related topics