Week 11: Project Search Engine

Fuzzy search using spellfix

Here’s how I think fuzzy search should work.

For each word, we’re going to use the spellfix extension to find its (three?) closest matches.

So if the search query is
important invention

spellfix(“importat”) -> “important”, “importance”, “implement”
spellfix(“inventin”) -> “invention”, “investigate”, “investment”

Note the increasing “fuzziness” from left to right.
Spellfix finds the closest match from the words indexed by FTS (giving frequently occurring words higher priority).

So the fuzzy search query becomes

important(0) importance(1) implement(2) invention(0) investigate(1) investment(2)

Where (n) represents the fuzziness level. So n==0 would mean an exact match.

Sorting

Once we get the notes satisfying this query, we need to sort it.

There are many ways to go about this.

Here’s my current plan. We sort the notes by “min fuzziness.”
So if a note contains an exact match, its “min fuzziness” would be 0 (the best we can get),
and so it should go on top of notes with “min fuzziness” 1 or 2 (there are no exact matches).

For the notes having the same “min fuzziness,” we sort based on the relevance score from Okapi BM25.
Also, we won’t be fuzzifying phrase searches.

Thoughts?

1 Like

I like the proposal to do fuzzy searching its a great feature. Typically with fuzzy search you would use Levenshtein distance (and select cases where distance < 1 or 2) or some other edit distance. Have you considered these measures?

What you describe is definitely an interesting way to provide approximate search but isn’t strictly fuzzy search.

Spellfix does use Levenshtein distance as edit distance.

The reason some of the words are more than 2 edit distances apart is because there are no other closer matches in the sqlite database.

We are not finding the words with the smallest edit distance from the english dictionary. Instead we’re searching from the words in the sqlite database of Joplin (no need to search for a non existing word even if it does have a closer match).

https://www.sqlite.org/spellfix1.html

Thanks for the clarification (and the link), I think this is a good proposal. What are your plans for measuring performance impact?

I have read through this.
May I ask some humble question:

  1. In 9. The editdist3 COST table the expression cost is used what based on

    Multi-character insertions, deletions, and substitutions can be enumerated in the cost table.

    The expression might be a bit irritating as it figures how fuzzy the search will be isn't it?
    For sure it will cost more CPU power the more fuzziness is introduced.
    Any chance that the user can adjust that value?

  2. In the same chapter, there is languages mentioned. Will it be possible to search across multiple languages as I have notes in French, German and English.
    It would be great if the included languages can be selected and a default selection is saved in user preferences.

For each word in the query, we’ll have to make a SELECT query to the spellfix table to get the fuzzy matches. How long this takes will be independent of the number of notes in the database but would depend on the number of terms in the query.

The actual query will be a single FTS query with OR as connective, which is fast.

I’ll do further testing on a large corpus of notes (>10000).

I don't plan on providing a custom edit distance function. Using the built in Wagner edit-distance function would be best. But the upper bound on the number of fuzzy matches to return could be made configurable. But as this is a SELECT query to an indexed table, it wouldn't make much difference in terms of performance.

I'm looking into this. I agree this would be useful.

each user has different hardware, so it should be to the user to adjust.
IMHO, a reasonable performance on an up-do-date sub-notebook should be the default.

great

@naviji, thanks for the detailed spec. As I understand, fuzzy search is based on the existing words in the database, is that correct? In which case, why is the language relevant since it should be able to find matches based on string similarities rather than the meaning of the words, shouldn’t it?

Also does fuzzy search also handle spelling mistake? For example would it do this:

spellfix(“inventin”) -> “invention”, “investigate”, “investment”

look at 6. Dealing With Unusual And Difficult Spellings, there it says:

To enhance the ability to correct the spelling of "salm" into "psalm", make an addition entry like this:

so it seems there is some kind spelling mistake correction but up to what degree :thinking:

Yes. Using the words in the FTS index.

You're right. I mistakenly thought it did. But the language doesn't matter (at least for languages supported by FTS).

Or if I type ”我是法人“ it would return “我是法国人” too

It won't since this would perform a basic search (FTS only works with alphabetical languages).

Yes, provided the word "invention" does exist in the FTS index. I'll edit the post to clarify.

would it allow to fill it by external dictionaries?

No, because it won’t be useful.

Searching for a word in an external dictionary that isn’t already in the FTS index (Joplin database) will always return an empty result.

that is true but you may need it for words similar to word being sought. For example, I have only the word “flat” in my notes but enter “apartment” in as word being sought. It would be nice if it shows all notes with “flat”.

Thesaurus support is given in SQLite FTS5 Extension - 7.1.1. Synonym Support. I found that in

That seems useful. But unfortunately, we won’t be using FTS5 because of restrictions in the mobile app.
Let’s leave this for future work.

what restriction?

FTS5 support is kind of limited on mobile. It needs

iOS ≥ 10.0 Android ≥ API 24 (Android 7.0)

whereas FTS4 can be used from API >= 11

so it could be optional
min Android 7.0 fraction is rising


looks better for iOS

I’ve already discussed this with @laurent, and he thinks, as I do, that we should only support one version. Making FTS5 optional would complicate maintenance.

Besides, synonym search isn’t in the scope of this project. Fuzzy search deals with the problem of approximate string matching.

What do you think @laurent , @CalebJohn ?

I could see it maybe being added as a search filter e.g tag:moving synonym:apartment but I don’t really see that as being in scope for this project, but maybe a future plugin?