Home / GitHub Page

Week 11: Project Search Engine

Spellfix does use Levenshtein distance as edit distance.

The reason some of the words are more than 2 edit distances apart is because there are no other closer matches in the sqlite database.

We are not finding the words with the smallest edit distance from the english dictionary. Instead we’re searching from the words in the sqlite database of Joplin (no need to search for a non existing word even if it does have a closer match).

https://www.sqlite.org/spellfix1.html

Thanks for the clarification (and the link), I think this is a good proposal. What are your plans for measuring performance impact?

I have read through this.
May I ask some humble question:

  1. In 9. The editdist3 COST table the expression cost is used what based on

    Multi-character insertions, deletions, and substitutions can be enumerated in the cost table.

    The expression might be a bit irritating as it figures how fuzzy the search will be isn’t it?
    For sure it will cost more CPU power the more fuzziness is introduced.
    Any chance that the user can adjust that value?

  2. In the same chapter, there is languages mentioned. Will it be possible to search across multiple languages as I have notes in French, German and English.
    It would be great if the included languages can be selected and a default selection is saved in user preferences.

For each word in the query, we’ll have to make a SELECT query to the spellfix table to get the fuzzy matches. How long this takes will be independent of the number of notes in the database but would depend on the number of terms in the query.

The actual query will be a single FTS query with OR as connective, which is fast.

I’ll do further testing on a large corpus of notes (>10000).

I don’t plan on providing a custom edit distance function. Using the built in Wagner edit-distance function would be best. But the upper bound on the number of fuzzy matches to return could be made configurable. But as this is a SELECT query to an indexed table, it wouldn’t make much difference in terms of performance.

I’m looking into this. I agree this would be useful.

each user has different hardware, so it should be to the user to adjust.
IMHO, a reasonable performance on an up-do-date sub-notebook should be the default.

great

@naviji, thanks for the detailed spec. As I understand, fuzzy search is based on the existing words in the database, is that correct? In which case, why is the language relevant since it should be able to find matches based on string similarities rather than the meaning of the words, shouldn’t it?

Also does fuzzy search also handle spelling mistake? For example would it do this:

spellfix(“inventin”) -> “invention”, “investigate”, “investment”

look at 6. Dealing With Unusual And Difficult Spellings, there it says:

To enhance the ability to correct the spelling of “salm” into “psalm”, make an addition entry like this:

so it seems there is some kind spelling mistake correction but up to what degree :thinking:

Yes. Using the words in the FTS index.

You’re right. I mistakenly thought it did. But the language doesn’t matter (at least for languages supported by FTS).

Or if I type ”我是法人“ it would return “我是法国人” too

It won’t since this would perform a basic search (FTS only works with alphabetical languages).

Yes, provided the word “invention” does exist in the FTS index. I’ll edit the post to clarify.

would it allow to fill it by external dictionaries?

No, because it won’t be useful.

Searching for a word in an external dictionary that isn’t already in the FTS index (Joplin database) will always return an empty result.

that is true but you may need it for words similar to word being sought. For example, I have only the word “flat” in my notes but enter “apartment” in as word being sought. It would be nice if it shows all notes with “flat”.

Thesaurus support is given in SQLite FTS5 Extension - 7.1.1. Synonym Support. I found that in

That seems useful. But unfortunately, we won’t be using FTS5 because of restrictions in the mobile app.
Let’s leave this for future work.

what restriction?

FTS5 support is kind of limited on mobile. It needs

iOS ≥ 10.0 Android ≥ API 24 (Android 7.0)

whereas FTS4 can be used from API >= 11

so it could be optional
min Android 7.0 fraction is rising


looks better for iOS

I’ve already discussed this with @laurent, and he thinks, as I do, that we should only support one version. Making FTS5 optional would complicate maintenance.

Besides, synonym search isn’t in the scope of this project. Fuzzy search deals with the problem of approximate string matching.

What do you think @laurent , @CalebJohn ?

I could see it maybe being added as a search filter e.g tag:moving synonym:apartment but I don’t really see that as being in scope for this project, but maybe a future plugin?

personally, it drives me insane to think about the proper synonym.
At certain amount of notes you cannot simply think of all synonyms, so certain notes won’t be shown in the search result.

It does not have to be part of this project but showing a route how it could be implemented by its follow-up project would be appreciated.

Support for synonyms might be a useful feature but indeed it would be out of scope for Naveen’s project.

In fact it would be an entirely new project as we’d need to specific exactly how it should work, what it should do as I think this is quite complex. It also means we need to bundle a thesaurus for each supported language so perhaps it should be done after the spell checking feature, which will provide methods to look up words in dictionaries.

1 Like