I was thinking that maybe this problem could be at least partially addressed, without adding a custom tokenizer, by reimplementing the most important features of the default search in a new search mode (named SEARCH_TYPE_ASIAN_SCRIPT, for example) based on regular SQL queries, without FTS. Of course, the performance would be much worse, but in my opinion it could be an acceptable tradeoff.
I tried a simple multi-word search for Chinese by constructing an SQL query like ...WHERE body LIKE '%term1%' AND body LIKE '%term2%' .... I've tested it with up to 5 terms on 200 notes, each with 5000 randomly generated Chinese characters, and there was no noticeable slowdown on my desktop. By recording the screen and counting the frames (let me know if there's a better way, I'm really not experienced in profiling), I measured 750ms for this hacky multi-term search, compared to 700ms for the default basic search.
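For anyone who wants to try the idea outside Joplin, here is a minimal sketch of that query construction using Python's built-in sqlite3 module. The `notes` table, its `id` and `body` columns, and the helper names are assumptions made up for illustration; this is not Joplin's actual schema or code.

```python
import sqlite3

def escape_like(term):
    # Escape LIKE wildcards so the user's text is matched literally.
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def search_all_terms(conn, terms):
    # Build one "body LIKE ?" clause per term and join them with AND,
    # so a note matches only if it contains every term as a substring.
    if not terms:
        return []
    where = " AND ".join("body LIKE ? ESCAPE '\\'" for _ in terms)
    params = [f"%{escape_like(t)}%" for t in terms]
    rows = conn.execute(f"SELECT id FROM notes WHERE {where} ORDER BY id", params)
    return [row[0] for row in rows]

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO notes (id, body) VALUES (?, ?)", [
    (1, "这是一个搜索工具的笔记"),
    (2, "工具箱里有很多东西"),
    (3, "搜索功能测试"),
])

print(search_all_terms(conn, ["搜索", "工具"]))  # only note 1 contains both terms
```

Passing the terms as bound parameters rather than pasting them into the SQL string avoids quoting problems, and a leading `%` means no index can be used, which is why this is a full scan.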
Besides multi-term search, any and - filters could also be relatively easily implemented using SQL queries.
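As a rough sketch of how those two filters might map onto plain SQL: "any" becomes OR between the LIKE clauses, and a leading `-` becomes NOT LIKE. The `notes` table, the `build_where` helper and its `match_any` parameter are hypothetical names for illustration only, and wildcard escaping is left out to keep it short.

```python
import sqlite3

def build_where(terms, match_any=False):
    # Terms starting with "-" are exclusions; the rest are inclusions.
    include = [t for t in terms if not t.startswith("-")]
    exclude = [t[1:] for t in terms if t.startswith("-") and len(t) > 1]
    clauses, params = [], []
    if include:
        # "any" joins inclusions with OR; the default requires all of them (AND).
        joiner = " OR " if match_any else " AND "
        clauses.append("(" + joiner.join("body LIKE ?" for _ in include) + ")")
        params += [f"%{t}%" for t in include]
    for t in exclude:
        # Exclusions always apply, regardless of match_any.
        clauses.append("body NOT LIKE ?")
        params.append(f"%{t}%")
    # Fall back to a clause that matches everything if nothing was given.
    return " AND ".join(clauses) or "1", params

def search(conn, terms, match_any=False):
    where, params = build_where(terms, match_any)
    rows = conn.execute(f"SELECT id FROM notes WHERE {where} ORDER BY id", params)
    return [row[0] for row in rows]

# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO notes (id, body) VALUES (?, ?)", [
    (1, "这是一个搜索工具的笔记"),
    (2, "工具箱里有很多东西"),
    (3, "搜索功能测试"),
])

print(search(conn, ["搜索", "工具"], match_any=True))  # notes containing either term
print(search(conn, ["搜索", "-工具"]))                 # notes with 搜索 but without 工具
```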
What do you guys think?
EDIT: Also let me know if my test sample was too small. With how much data is Joplin usually tested?
I suppose this may be a situation where some search however slow is better than nothing. Just make sure it does not affect the fast path.
Once you have it working, maybe someone who actually writes notes in Chinese (or any other language that is affected by this issue, I think there's more than one) would volunteer to give it a try on their notes.
That seems reasonable, although we should try not to create too much duplicate code just for this feature. In fact, is it not possible to somehow improve the basic search to make it work as you want?
Also it would be useful to search the forum for previous threads and find out what Chinese users search for in general and what results they expect. Just to be sure the work you might do is actually what's needed!
Hey sorry for not answering here. Just wanted to let you know that I'm working on this though it's a bit more complicated than I've imagined, but I'll try to get this done before I start my summer of code project.
@laurent if I have a few questions about the search internals, who is best to ask?
To be honest, I hope Joplin can eventually solve the search problem for Chinese users once and for all, but that may be far off. In the meantime, concatenating search keywords is indeed a workable approach. I will try to segment the words myself and concatenate the search keywords.
At present, without going deep into the sqlite level, can I use joplin http api to search for multiple keywords like "Windows" OR "工具"? @laurent
Unfortunately that wouldn't work right now, since as soon as the Chinese characters are detected the search mode is switched to basic, which would match your query literally, looking for all three of these words with quotes and everything.
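To illustrate the behaviour described here, a minimal sketch of script-based mode switching might look like the following. The Unicode ranges are a non-exhaustive assumption, and the mode names `"basic"` and `"fts"` and the `pick_search_mode` function are placeholders, not Joplin's actual detection logic or identifiers.

```python
import re

# Assumed, non-exhaustive ranges: Hiragana and Katakana (U+3040-U+30FF),
# CJK Unified Ideographs (U+4E00-U+9FFF), Hangul syllables (U+AC00-U+D7AF).
CJK_RE = re.compile("[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7af]")

def pick_search_mode(query):
    # A single matching character anywhere in the query switches the whole
    # query to the literal "basic" mode, so operators such as OR and the
    # surrounding quotes are no longer interpreted.
    return "basic" if CJK_RE.search(query) else "fts"

print(pick_search_mode('"Windows" OR "工具"'))  # basic
print(pick_search_mode('"Windows" OR "tools"'))  # fts
```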
Hey @rxliuli @novelx! I think I've managed to get all filters to work. Would you mind testing the fix for me to see if you can spot any issues? You can find the patch on my fork: GitHub - mablin7/joplin at nonlatin-seach, or I can probably build a binary for you if necessary, just let me know for which platform.
@laurent Thanks for the tip, I messaged him and it was really helpful! The only problem now is that there are no unit tests. There's this nice big test suite for filters:
Do you think there's a smart way of reusing that for testing with Chinese/Japanese/etc. text, without rewriting much of the tests?
That's great, thanks for looking into this @mablin7.
For the tests, I don't know how you've implemented this, but if it's based on the existing filters then whatever works for English should work for Asian languages too, shouldn't it? Then you could have additional tests for checking specifically what's different for Asian search?
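One possible shape for this, sketched with Python's unittest rather than Joplin's real test suite: write each scenario once and run it over per-script data, then keep script-specific tests separate. The `search` helper, the `notes` table and the sample texts are all assumptions for illustration.

```python
import sqlite3
import unittest

def search(conn, terms):
    # Hypothetical stand-in for the search under test: every term must match.
    where = " AND ".join("body LIKE ?" for _ in terms)
    params = [f"%{t}%" for t in terms]
    rows = conn.execute(f"SELECT id FROM notes WHERE {where} ORDER BY id", params)
    return [row[0] for row in rows]

# The same scenario expressed in several scripts:
# (note bodies, search terms, expected matching note ids).
CASES = {
    "english": (["search tool notes", "tool box", "search test"], ["search", "tool"], [1]),
    "chinese": (["搜索工具笔记", "工具箱", "搜索测试"], ["搜索", "工具"], [1]),
    "japanese": (["検索ツールのノート", "ツールボックス", "検索テスト"], ["検索", "ツール"], [1]),
}

class MultiTermSearchTest(unittest.TestCase):
    def test_all_terms_must_match(self):
        for name, (bodies, terms, expected) in CASES.items():
            # subTest reports each script separately if one of them fails.
            with self.subTest(script=name):
                conn = sqlite3.connect(":memory:")
                conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
                conn.executemany("INSERT INTO notes (body) VALUES (?)", [(b,) for b in bodies])
                self.assertEqual(search(conn, terms), expected)
```

Running it with `python -m unittest` on the file would execute the same assertion once per script, so adding another language only means adding another entry to `CASES`.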
Hey sorry for not replying, I've been trying to get a Windows build to work, but it keeps failing and complaining about the VS Build Tools version during npm install. I'll let you know as soon as I get it working.
Also, while I don't think there's anything you really have to worry about, I'd still say it's better to make a backup of your notes before trying this version. (Help > Open profile directory, then copy the folder somewhere safe).
It may be because I did not overwrite the existing installation directly, but instead imported a JEX and ran it as a brand new program. Either way, it was solved for now by overwriting the old program. (It should have nothing to do with your modification.)