V1.5.4 bug: Chinese search does not work as expected

Using "地球 火星" can not search for "地球 月球 火星" notes.

For Chinese, only exact matches are supported, so "地球 月球" would work, and "月球 火星" too, but not "地球 火星". This is a limitation of the search engine we use (FTS), which is pretty much English-only, although we've improved it to also support languages with accented characters. But beyond that, it's not really possible.


To be honest, Chinese search is still very poor, especially when the text contains both Chinese and English. For example, when I search for "Windows 工具清单" I can't find a note containing "Windows 上的工具清单", even though it feels like it should be found. This is not right. @laurent

I think to properly support Chinese we would need a special search engine that understands Chinese text. Alphabetical languages separate words with delimiters like spaces and commas, but Chinese doesn't, so we'd need to split 我是法国人 into 我 是 法国人 for instance. For that, the app needs a Chinese dictionary and a special tokenizer.
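To illustrate why a dictionary is needed: a classic baseline is forward maximum matching, which greedily takes the longest dictionary word at each position. The sketch below uses a tiny hypothetical dictionary just for the 我是法国人 example; a real engine like jieba ships a full dictionary plus statistical models, so this is only a sketch of the idea, not a proposal for the actual implementation.

```javascript
// Forward-maximum-matching segmenter sketch.
// DICT and MAX_WORD_LEN are hypothetical stand-ins for a real
// Chinese dictionary shipped by an engine such as jieba.
const DICT = new Set(["我", "是", "法国人", "法国", "人"]);
const MAX_WORD_LEN = 3;

function segment(text) {
  const words = [];
  let i = 0;
  while (i < text.length) {
    let matched = null;
    // Try the longest candidate first, shrinking until a dictionary hit.
    for (let len = Math.min(MAX_WORD_LEN, text.length - i); len > 0; len--) {
      const candidate = text.slice(i, i + len);
      if (DICT.has(candidate)) { matched = candidate; break; }
    }
    // Fall back to a single character when nothing matches.
    const word = matched ?? text[i];
    words.push(word);
    i += word.length;
  }
  return words;
}

console.log(segment("我是法国人")); // [ '我', '是', '法国人' ]
```

With the words split this way, the FTS index could store 我, 是, and 法国人 as separate tokens, so a query for any one of them would match.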

If someone can create a pull request that solves some of these issues, I'd be willing to help get it done, but it's probably a tricky problem to solve.

Someone has created a related Chinese word segmentation database, but I am not sure how to add it to Joplin.

So is there a starting point for adding a Chinese word segmentation engine? Or how should I go about adding it, and where? @laurent

I added a simple example that uses jieba (结巴) to segment Chinese and English words separately. Of course, this example is very basic, but as long as the Chinese and English parts are split apart and handled by different tokenizers, it should be able to cope. @laurent

I'm not very familiar with the query parsing code since that was developed by Naveen. How easy would it be to integrate your word segmentation code?

Simply put, you install an npm package, pass in a Chinese paragraph, and it returns an array of the segmented Chinese words.
Note, however, that since jieba does not support English word segmentation, Chinese and English have to be handled separately. The example code above does exactly that.
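The "handle Chinese and English separately" step could look something like the sketch below: split the input into CJK and non-CJK runs by character range, so each run can be routed to its own tokenizer (jieba for the Chinese runs, the existing FTS tokenizer for the rest). This is only an assumed pre-processing shape, not the code from the example above, and the jieba call itself is omitted.

```javascript
// Split mixed Chinese/English text into runs of CJK vs non-CJK
// characters. Each run would then be passed to the appropriate
// tokenizer (e.g. jieba for CJK, the default FTS tokenizer otherwise).
const CJK = /[\u4e00-\u9fff]/; // basic CJK Unified Ideographs block

function splitRuns(text) {
  const runs = [];
  for (const ch of text) {
    const isCjk = CJK.test(ch);
    const last = runs[runs.length - 1];
    if (last && last.isCjk === isCjk) {
      last.text += ch; // extend the current run
    } else {
      runs.push({ isCjk, text: ch }); // start a new run
    }
  }
  return runs;
}

console.log(splitRuns("Windows 上的工具清单"));
// [ { isCjk: false, text: 'Windows ' },
//   { isCjk: true, text: '上的工具清单' } ]
```

This would let a query like "Windows 工具清单" and a note titled "Windows 上的工具清单" produce comparable token lists, since the Chinese runs of both would go through the same segmenter.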