Improving search with Asian scripts

Currently, when a search term contains Asian characters, the search engine automatically selects basic mode, since FTS doesn't work with these scripts. However, this is not documented anywhere, and users have no reason to expect search in these languages to work any differently. An issue about this has been raised on GitHub: The combined search returns no results for Chinese · Issue #4613 · laurent22/joplin

I was thinking that maybe this problem could at least be partially addressed, without adding a custom tokenizer, by reimplementing the most important features of the default search in a new search mode (named SEARCH_TYPE_ASIAN_SCRIPT, for example) based on regular SQL queries, without FTS. Of course, the performance would be much worse, but in my opinion it could be an acceptable tradeoff.

I tried a simple multi-word search for Chinese by constructing an SQL query like ... WHERE body LIKE '%term1%' AND body LIKE '%term2%' .... I've tested it with up to 5 terms, on 200 notes, each with 5000 randomly generated Chinese characters, and there was no noticeable slowdown on my desktop. By recording the screen and counting frames (let me know if there's a better way, I'm really not experienced in profiling :sweat_smile:), I measured about 750 ms for this hacky multi-term search, compared to about 700 ms for the default basic search.
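
To make the idea concrete, here's a rough sketch of the query construction I have in mind. This is not the actual Joplin code; the function name and the "notes"/"body" names are just assumptions, and a real implementation would go through Joplin's database layer:

```typescript
// Hypothetical sketch of the LIKE-based multi-term search described above.
// The table/column names ("notes", "id", "title", "body") are assumptions.
function buildMultiTermQuery(terms: string[]): { sql: string; params: string[] } {
	// Every term must appear somewhere in the body (AND of LIKE clauses).
	const conditions = terms.map(() => 'body LIKE ?');
	// NOTE: a real implementation should also escape the LIKE wildcards % and _.
	const params = terms.map(term => `%${term}%`);
	return {
		sql: `SELECT id, title FROM notes WHERE ${conditions.join(' AND ')}`,
		params,
	};
}

// buildMultiTermQuery(['你好', '世界']).sql
// => "SELECT id, title FROM notes WHERE body LIKE ? AND body LIKE ?"
```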

Besides multi-term search, the "any:" and "-" (exclusion) filters could also be implemented relatively easily with plain SQL queries, as sketched below.
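
Again only a sketch with made-up names, but roughly they would map to OR and NOT LIKE clauses:

```typescript
// Hypothetical continuation of the sketch above: "any:" switches the term
// clauses from AND to OR, and each "-term" becomes a NOT LIKE clause.
// Assumes at least one positive term.
function buildFilteredQuery(terms: string[], excluded: string[], matchAny: boolean): { sql: string; params: string[] } {
	const termClause = terms.map(() => 'body LIKE ?').join(matchAny ? ' OR ' : ' AND ');
	const excludeClauses = excluded.map(() => 'body NOT LIKE ?');
	const where = [`(${termClause})`, ...excludeClauses].join(' AND ');
	return {
		sql: `SELECT id, title FROM notes WHERE ${where}`,
		params: [...terms, ...excluded].map(t => `%${t}%`),
	};
}
```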

What do you guys think?

EDIT: Also let me know if my test sample was too small. With how much data is Joplin usually tested?

I suppose this may be a situation where some search, however slow, is better than nothing. Just make sure it does not affect the fast path.

Once you have it working, maybe someone who actually writes notes in Chinese (or any other language that is affected by this issue - I think there's more than one) would volunteer to give it a try on their notes.

That seems reasonable, although we should try not to create too much duplicate code just for this feature. In fact, is it not possible to somehow improve the basic search to make it work as you want?

Also it would be useful to search the forum for previous threads and find out what Chinese users search for in general and what results they expect. Just to be sure the work you might do is actually what's needed!

The feature (I would say it's more like a bugfix) is indeed needed. In fact, I'm the one who raised issue #4613. I hope it will be resolved.

Hey, sorry for not answering here. Just wanted to let you know that I'm working on this, though it's a bit more complicated than I'd imagined, but I'll try to get it done before I start my Summer of Code project.

@laurent if I have a few questions about the search internals, who's best to ask?

To be honest, I hope Joplin can solve the search problem for Chinese users in one go if possible, but that may be far off. Still, concatenating search keywords is indeed one approach. I'll try segmenting words myself and concatenating the search keywords.


At present, without going deep into the SQLite level, can I use the Joplin HTTP API to search for multiple keywords, like "Windows" OR "工具"? @laurent
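
Something like this is what I have in mind (just a sketch; the port is the Web Clipper service default and the token is a placeholder):

```typescript
// Sketch of the kind of request I mean, against the Data API exposed by the
// Web Clipper service (default port 41184). The token value is a placeholder.
async function searchNotes(query: string) {
	const token = 'YOUR_TOKEN'; // from Options > Web Clipper
	const url = `http://localhost:41184/search?query=${encodeURIComponent(query)}&token=${token}`;
	const response = await fetch(url);
	// Response shape assumed to be the paginated { items, has_more } format.
	const { items } = await response.json();
	console.log(items.map((note: { title: string }) => note.title));
}

searchNotes('"Windows" OR "工具"');
```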

Unfortunately, that wouldn't work right now: as soon as Chinese characters are detected, the search mode is switched to basic, which would match your query literally, looking for all three of those words, quotes and everything.

Okay, then I'll continue to wait. Although I only have about 400 notes now, relying on the notebook structure alone may not be enough once there are more than 1,000.

@naviji knows best about this but I might be able to help too.

Hey @rxliuli @novelx! I think I've managed to get all filters to work. Would you mind testing the fix for me to see if you can spot any issues? You can find the patch on my fork: GitHub - mablin7/joplin at nonlatin-seach. I can also probably build a binary for you if necessary; just let me know which platform.

@laurent Thanks for the tip, I messaged with him and it was really helpful! The only problem now is that there are no unit tests. There's this nice big test suite for filters:

Do you think there's a smart way of reusing that for testing with Chinese/Japanese/etc. text, without rewriting much of the tests?

That's great, thanks for looking into this @mablin7.

For the tests, I don't know how you've implemented this, but if it's based on the existing filters then whatever works for English should work for Asian languages too, shouldn't it? Then you could have additional tests for checking specifically what's different for Asian search?
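
For example, something along these lines (just a sketch with Jest-style parameterized tests and made-up helper names, not the actual test code):

```typescript
// Hypothetical helpers standing in for whatever the real filter tests use;
// these names don't exist in the actual Joplin code base.
declare function createNote(props: { body: string }): Promise<{ id: string }>;
declare function searchWithFilters(query: string): Promise<{ id: string }[]>;

// Run the same assertions once per script by parameterizing the fixtures.
const fixtures = [
	{ script: 'latin', words: ['apple', 'banana'] },
	{ script: 'chinese', words: ['苹果', '香蕉'] },
];

describe.each(fixtures)('search filters ($script)', ({ words }) => {
	test('finds a note containing all terms', async () => {
		const note = await createNote({ body: words.join(' ') });
		const results = await searchWithFilters(words.join(' '));
		expect(results.map(r => r.id)).toContain(note.id);
	});
});
```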

I finished the tests, so I opened a PR: All: Resolves #4613: Improve search with Asian scripts by mablin7 · Pull Request #5018 · laurent22/joplin · GitHub
In the end, I copied the existing filter tests, changed the strings and removed a few test cases which weren't relevant to the new search mode.

Even though it passes all the same tests as the FTS search, it would still be great if someone could give it a try!

Great to hear that! I have two laptops (one running Windows 10, the other macOS) and one Android phone, all with the Joplin client installed. There are roughly 2000 notes in the repo.

Hey, sorry for not replying, I've been trying to get a Windows build to work, but it keeps failing and complaining about the VS Build Tools version during npm install. I'll let you know as soon as I get it working.

Hey, sorry again for the delay, but I finally managed to get a Windows build working. You can get it here:
https://www.transfernow.net/dl/20210530hdessZCs (available until 6th of June)

Also, while I don't think there's anything you really have to worry about, I'd still say it's better to make a backup of your notes before trying this version. (Help > Open profile directory, then copy the folder somewhere safe).

Great! In actual testing, mixed Chinese and English search works normally.

The sorting of the search results still seems a bit problematic, but this is very good progress, thanks!

It seems that the OneDrive infinite synchronization bug has been introduced again...

Strange... Are you sure it's related to this fix? Can you try it on the dev branch? I didn't touch anything sync-related, just the search.

It may be because I didn't directly overwrite the old installation, but instead imported a JEX file and ran it as a brand new program. In any case, it was temporarily solved by overwriting the old installation. (It should have nothing to do with your modification.)

It works well for pure Chinese characters so far, but has side effects with ASCII characters. Create a note as below:

note title:

会议:Meeting

note body:

Apple (Asian) Ltd.
上面的括号是全角English words.

Try searching for these keywords:

  1. "words" - hit
  2. "meeting words" - not hit
  3. ":Meeting words" - hit
  4. "asian" - not hit
  5. "(asian“ - hit
  6. "English words" - not hit
  7. "全角English words" - hit
BR