Improving search with Asian scripts

Currently, when a search term contains Asian characters, the search engine automatically selects basic mode, since FTS doesn't work with these scripts. However, this is not documented anywhere, and users have no reason to expect search in these languages to work any differently. An issue about this has been raised on GitHub: The combined search returns no results for Chinese · Issue #4613 · laurent22/joplin

I was thinking that maybe this problem could at least be partially addressed, without adding a custom tokenizer, by reimplementing the most important features of the default search in a new search mode (named SEARCH_TYPE_ASIAN_SCRIPT, for example) based on regular SQL queries, without FTS. Of course, the performance would be much worse, but in my opinion it could be an acceptable tradeoff.

I tried a simple multi-word search for Chinese by constructing an SQL query like ... WHERE body LIKE '%term1%' AND body LIKE '%term2%' .... I've tested it with up to 5 terms, on 200 notes, each with 5000 randomly generated Chinese characters, and there was no noticeable slowdown on my desktop. By recording the screen and counting frames (let me know if there's a better way, I'm really not experienced in profiling :sweat_smile:), I measured about 750 ms for this hacky multi-term search, compared to about 700 ms for the default basic search.
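
To make the idea concrete, here's a rough sketch of the query construction I have in mind. This is not the actual Joplin code; the function name and the "notes"/"body" names are just assumptions, and a real implementation would go through Joplin's database layer:

```typescript
// Hypothetical sketch of the LIKE-based multi-term search described above.
// The table/column names ("notes", "id", "title", "body") are assumptions.
function buildMultiTermQuery(terms: string[]): { sql: string; params: string[] } {
	// Every term must appear somewhere in the body (AND of LIKE clauses).
	const conditions = terms.map(() => 'body LIKE ?');
	// NOTE: a real implementation should also escape the LIKE wildcards % and _.
	const params = terms.map(term => `%${term}%`);
	return {
		sql: `SELECT id, title FROM notes WHERE ${conditions.join(' AND ')}`,
		params,
	};
}

// buildMultiTermQuery(['你好', '世界']).sql
// => "SELECT id, title FROM notes WHERE body LIKE ? AND body LIKE ?"
```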

Besides multi-term search, the "any:" and "-" (exclusion) filters could also be implemented relatively easily with plain SQL queries, as sketched below.
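
Again only a sketch with made-up names, but roughly they would map to OR and NOT LIKE clauses:

```typescript
// Hypothetical continuation of the sketch above: "any:" switches the term
// clauses from AND to OR, and each "-term" becomes a NOT LIKE clause.
// Assumes at least one positive term.
function buildFilteredQuery(terms: string[], excluded: string[], matchAny: boolean): { sql: string; params: string[] } {
	const termClause = terms.map(() => 'body LIKE ?').join(matchAny ? ' OR ' : ' AND ');
	const excludeClauses = excluded.map(() => 'body NOT LIKE ?');
	const where = [`(${termClause})`, ...excludeClauses].join(' AND ');
	return {
		sql: `SELECT id, title FROM notes WHERE ${where}`,
		params: [...terms, ...excluded].map(t => `%${t}%`),
	};
}
```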

What do you guys think?

EDIT: Also let me know if my test sample was too small. With how much data is Joplin usually tested?

I suppose this may be a situation where some search, however slow, is better than nothing. Just make sure it does not affect the fast path.

Once you have it working, maybe someone who actually writes notes in Chinese (or any other language that is affected by this issue - I think there's more than one) would volunteer to give it a try on their notes.

That seems reasonable, although we should try not to create too much duplicate code just for this feature. In fact, is it not possible to somehow improve the basic search to make it work as you want?

Also it would be useful to search the forum for previous threads and find out what Chinese users search for in general and what results they expect. Just to be sure the work you might do is actually what's needed!

The feature (I would say it's more like a bugfix) is indeed needed. In fact, I'm the one who raised issue #4613. I hope it will be resolved.

Hey, sorry for not answering here. Just wanted to let you know that I'm working on this, though it's a bit more complicated than I'd imagined, but I'll try to get it done before I start my Summer of Code project.

@laurent if I have a few questions about the search internals, who's best to ask?

To be honest, I hope Joplin can solve the search problem for Chinese users in one go if possible, but that may be far off. Still, concatenating search keywords is indeed one approach. I'll try segmenting words myself and concatenating the search keywords.


At present, without going deep into the SQLite level, can I use the Joplin HTTP API to search for multiple keywords, like "Windows" OR "工具"? @laurent
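
Something like this is what I have in mind (just a sketch; the port is the Web Clipper service default and the token is a placeholder):

```typescript
// Sketch of the kind of request I mean, against the Data API exposed by the
// Web Clipper service (default port 41184). The token value is a placeholder.
async function searchNotes(query: string) {
	const token = 'YOUR_TOKEN'; // from Options > Web Clipper
	const url = `http://localhost:41184/search?query=${encodeURIComponent(query)}&token=${token}`;
	const response = await fetch(url);
	// Response shape assumed to be the paginated { items, has_more } format.
	const { items } = await response.json();
	console.log(items.map((note: { title: string }) => note.title));
}

searchNotes('"Windows" OR "工具"');
```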

Unfortunately, that wouldn't work right now: as soon as Chinese characters are detected, the search mode is switched to basic, which would match your query literally, looking for all three of those words, quotes and everything.

Okay, then I'll continue to wait. Although I only have about 400 notes now, relying on the notebook structure alone may not be enough once there are more than 1,000.

@naviji knows best about this but I might be able to help too.

Hey @rxliuli @novelx! I think I've managed to get all filters to work. Would you mind testing the fix for me to see if you can spot any issues? You can find the patch on my fork: GitHub - mablin7/joplin at nonlatin-seach. I can also probably build a binary for you if necessary; just let me know which platform.

@laurent Thanks for the tip, I messaged with him and it was really helpful! The only problem now is that there are no unit tests. There's this nice big test suite for filters:

Do you think there's a smart way of reusing that for testing with Chinese/Japanese/etc. text, without rewriting much of the tests?

That's great, thanks for looking into this @mablin7.

For the tests, I don't know how you've implemented this, but if it's based on the existing filters then whatever works for English should work for Asian languages too, shouldn't it? Then you could have additional tests for checking specifically what's different for Asian search?
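
For example, something along these lines (just a sketch with Jest-style parameterized tests and made-up helper names, not the actual test code):

```typescript
// Hypothetical helpers standing in for whatever the real filter tests use;
// these names don't exist in the actual Joplin code base.
declare function createNote(props: { body: string }): Promise<{ id: string }>;
declare function searchWithFilters(query: string): Promise<{ id: string }[]>;

// Run the same assertions once per script by parameterizing the fixtures.
const fixtures = [
	{ script: 'latin', words: ['apple', 'banana'] },
	{ script: 'chinese', words: ['苹果', '香蕉'] },
];

describe.each(fixtures)('search filters ($script)', ({ words }) => {
	test('finds a note containing all terms', async () => {
		const note = await createNote({ body: words.join(' ') });
		const results = await searchWithFilters(words.join(' '));
		expect(results.map(r => r.id)).toContain(note.id);
	});
});
```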

I finished the tests, so I opened a PR: All: Resolves #4613: Improve search with Asian scripts by mablin7 · Pull Request #5018 · laurent22/joplin · GitHub
In the end, I copied the existing filter tests, changed the strings and removed a few test cases which weren't relevant to the new search mode.

Even though it passes all the same tests as the FTS search, it would still be great if someone could give it a try!

Great to hear that! I have two laptops (one running Windows 10, the other macOS) and one Android phone, all with the Joplin client installed. There are roughly 2000 notes in the repo.

Hey, sorry for not replying, I've been trying to get a Windows build to work, but it keeps failing and complaining about the VS Build Tools version during npm install. I'll let you know as soon as I get it working.

Hey, sorry again for the delay, but I finally managed to get a Windows build working. You can get it here:
https://www.transfernow.net/dl/20210530hdessZCs (available until 6th of June)

Also, while I don't think there's anything you really have to worry about, I'd still say it's better to make a backup of your notes before trying this version. (Help > Open profile directory, then copy the folder somewhere safe).

Great! In actual testing, mixed Chinese and English search works normally.

The sorting of the search results still seems a bit problematic, but this is very good progress, thanks!

It seems that the OneDrive infinite synchronization bug has been introduced again...

Strange... Are you sure it's related to this fix? Can you try it on the dev branch? I didn't touch anything sync-related, just the search.

It may be because I didn't directly overwrite the old installation, but instead imported a JEX file and ran it as a brand new program. In any case, it was temporarily solved by overwriting the old installation. (It should have nothing to do with your modification.)

It works well for pure Chinese characters so far, but has side effects with ASCII characters. Create a note as below:

note title:

会议:Meeting

note body:

Apple (Asian) Ltd.
上面的括号是全角English words.

Try searching for these keywords:

  1. "words" - hit
  2. "meeting words" - not hit
  3. ":Meeting words" - hit
  4. "asian" - not hit
  5. "(asian“ - hit
  6. "English words" - not hit
  7. "全角English words" - hit
BR