Improving search with Asian scripts

Sorry, I just tested the original app v1.8.5 and the 'side effects' exist there too, so they are not introduced by your modification.

Well yes, these queries would use the default FTS mode because they only contain English letters, but the notes contain Chinese characters, so I guess that's what's causing problems. I'm not really sure how that could be fixed though.

Does it mean notes that contain Chinese characters are not indexed by FTS?

Hm, well, now that I think about it, that's probably not the case. According to the SQLite docs, Unicode characters are simply skipped, but the rest of the note should still be indexed. But then I don't know what's causing the issues mentioned by @novelx. I'll look into it later today.

Maybe it's because "English words" sits right next to Chinese characters, without spaces, as in 上面的括号是全角English words ("the parentheses above are fullwidth")? It could be that when normalising the note content we should strip out all non-Latin scripts, so that the rest can be indexed properly by FTS.

I think I've figured out why this is happening: in the example provided by novelx, the parentheses and the colon were not regular ASCII characters, but their fullwidth forms (（, ）, ：), which are designed for Asian scripts. The FTS4 documentation says:

A term is a contiguous sequence of eligible characters, where eligible characters are all alphanumeric characters and all characters with Unicode codepoint values greater than or equal to 128. All other characters are discarded when splitting a document into terms.

Because these special characters have codepoint values greater than or equal to 128, they don't break the word and are not discarded the way regular parentheses are. So 会议：Meeting is indexed as a single term and therefore does not match the query meeting. The same goes for （Asian） and 上面的括号是全角English. It seems FTS can sometimes match part of a word (it does match 全角English words), but not in this case, so I don't really get the rules for that.
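
To make the behaviour concrete, here is a minimal sketch of the tokenizer rule, assuming Node.js with the better-sqlite3 package and an SQLite build that includes FTS4 (this is not Joplin code, just an illustration):

```typescript
// Illustration of the FTS4 "simple" tokenizer rule quoted above.
import Database from 'better-sqlite3';

const db = new Database(':memory:');
db.exec('CREATE VIRTUAL TABLE notes_fts USING fts4(body)');

// The fullwidth colon (U+FF1A) has a codepoint >= 128, so it is an
// "eligible" character and the whole string is indexed as one term.
db.prepare('INSERT INTO notes_fts (body) VALUES (?)').run('会议：Meeting');

// An ASCII colon and a space are not eligible characters, so here
// "会议" and "Meeting" are indexed as two separate terms.
db.prepare('INSERT INTO notes_fts (body) VALUES (?)').run('会议: Meeting');

const rows = db
  .prepare('SELECT body FROM notes_fts WHERE notes_fts MATCH ?')
  .all('meeting');

// Only the second row is returned: "会议：Meeting" was indexed as a
// single term and does not match the query "meeting".
console.log(rows);
```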

Yes, I think that would solve this. But is that possible? FTS tables are automatically generated from the notes table, no?

No, they are generated from a notes_normalized table, which is populated by the app. There's a normalizeNote_() function that's used to normalize the title and body, so I think you'd just need to change that to filter out the unsupported characters.
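
Here is a rough sketch of what that filter could look like; normalizeNote_() is the real function mentioned above, but filterUnsupportedCharacters is just a hypothetical helper to illustrate the idea, not the actual fix:

```typescript
// Hypothetical helper: replace Unicode punctuation and symbols with spaces,
// so the FTS4 "simple" tokenizer no longer glues them into the surrounding
// term. Something like this could be applied to the title and body inside
// normalizeNote_().
function filterUnsupportedCharacters(text: string): string {
  // \p{P} = punctuation, \p{S} = symbols; this covers the fullwidth （ ） ：
  // characters from the example as well as regular ASCII punctuation.
  return text.replace(/[\p{P}\p{S}]+/gu, ' ');
}

// "会议：Meeting（Asian）" becomes "会议 Meeting Asian ", so FTS indexes
// 会议, Meeting and Asian as separate terms, and a search for "meeting" matches.
console.log(filterUnsupportedCharacters('会议：Meeting（Asian）'));
```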

Ah I see, thanks! I'll give that a try then.