V1.5.4 bug: Chinese search does not work as expected

Using "地球 火星" can not search for "地球 月球 火星" notes.

For Chinese, only exact matches are supported, so "地球 月球" would work, and "月球 火星" too, but not "地球 火星". This is a limitation of the search engine we use (FTS), which is pretty much English-only, although we've improved it to also support languages with accented characters. But beyond that, it's not really possible.


To be honest, Chinese search is still very poor, especially when the text contains both Chinese and English. For example, when I search for "Windows 工具清单" I can't find a note containing "Windows 上的工具清单", even though it feels like it should be found. This is not right. @laurent

I think to properly support Chinese we would need a special search engine that understands Chinese text. Alphabetical languages separate words with delimiters like spaces and commas, but Chinese doesn't, so we'd need to split 我是法国人 into 我 是 法国人 for instance. For that, the app needs a Chinese dictionary and a special tokenizer.
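To illustrate why a dictionary is needed: a classic baseline is forward maximum matching, which greedily takes the longest dictionary word at each position. The sketch below uses a tiny hypothetical dictionary just for the 我是法国人 example; a real engine like jieba ships a full dictionary plus statistical models, so this is only a sketch of the idea, not a proposal for the actual implementation.

```javascript
// Forward-maximum-matching segmenter sketch.
// DICT and MAX_WORD_LEN are hypothetical stand-ins for a real
// Chinese dictionary shipped by an engine such as jieba.
const DICT = new Set(["我", "是", "法国人", "法国", "人"]);
const MAX_WORD_LEN = 3;

function segment(text) {
  const words = [];
  let i = 0;
  while (i < text.length) {
    let matched = null;
    // Try the longest candidate first, shrinking until a dictionary hit.
    for (let len = Math.min(MAX_WORD_LEN, text.length - i); len > 0; len--) {
      const candidate = text.slice(i, i + len);
      if (DICT.has(candidate)) { matched = candidate; break; }
    }
    // Fall back to a single character when nothing matches.
    const word = matched ?? text[i];
    words.push(word);
    i += word.length;
  }
  return words;
}

console.log(segment("我是法国人")); // [ '我', '是', '法国人' ]
```

With the words split this way, the FTS index could store 我, 是, and 法国人 as separate tokens, so a query for any one of them would match.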

If someone can create a pull request that solves some of these issues, I'd be willing to help get it done, but it's probably a tricky problem to solve.

Someone has created a related Chinese word segmentation database, but I am not sure how to add it to Joplin.

So is there a starting point for adding a Chinese word segmentation engine? Or how should I go about adding it, and where? @laurent

I added a simple example that uses jieba (结巴) to segment Chinese and English words separately. Of course, this example is very basic, but as long as the Chinese and English parts are split apart and handled by different tokenizers, it should be able to cope. @laurent

I'm not very familiar with the query parsing code since that was developed by Naveen. How easy would it be to integrate your word segmentation code?

Simply put, you install an npm package, pass in a Chinese paragraph, and it returns an array of the segmented Chinese words.
Note, however, that since jieba does not support English word segmentation, Chinese and English have to be handled separately. The example code above does exactly that.
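The "handle Chinese and English separately" step could look something like the sketch below: split the input into CJK and non-CJK runs by character range, so each run can be routed to its own tokenizer (jieba for the Chinese runs, the existing FTS tokenizer for the rest). This is only an assumed pre-processing shape, not the code from the example above, and the jieba call itself is omitted.

```javascript
// Split mixed Chinese/English text into runs of CJK vs non-CJK
// characters. Each run would then be passed to the appropriate
// tokenizer (e.g. jieba for CJK, the default FTS tokenizer otherwise).
const CJK = /[\u4e00-\u9fff]/; // basic CJK Unified Ideographs block

function splitRuns(text) {
  const runs = [];
  for (const ch of text) {
    const isCjk = CJK.test(ch);
    const last = runs[runs.length - 1];
    if (last && last.isCjk === isCjk) {
      last.text += ch; // extend the current run
    } else {
      runs.push({ isCjk, text: ch }); // start a new run
    }
  }
  return runs;
}

console.log(splitRuns("Windows 上的工具清单"));
// [ { isCjk: false, text: 'Windows ' },
//   { isCjk: true, text: '上的工具清单' } ]
```

This would let a query like "Windows 工具清单" and a note titled "Windows 上的工具清单" produce comparable token lists, since the Chinese runs of both would go through the same segmenter.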