Segmenters for Chinese, Thai and Japanese languages

Unlike Western languages, texts in the East Asian languages Chinese, Thai and Japanese may not have spaces between the words in a phrase. Thus, when indexing documents in these languages, a search engine needs to know how to split phrases into separate words, and it also needs to know the word boundaries when running a search query. mnoGoSearch can find Asian word boundaries with the help of so-called segmenters.

Japanese phrase segmenter

mnoGoSearch can use the ChaSen and MeCab Japanese morphological analysis systems to break phrases into words.

To build mnoGoSearch with Japanese phrase segmenting, pass either the --with-chasen or the --with-mecab command line switch when running configure.

Chinese phrase segmenter

mnoGoSearch uses frequency dictionaries for Chinese phrase segmenting. Segmenting is implemented using the dynamic programming method to maximize the cumulative frequency of the separate words produced from a phrase.
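
The dynamic programming idea can be illustrated with a minimal Python sketch. It is not the mnoGoSearch implementation: the function name segment, the max_word_len limit, the toy frequency table, and the rule that an unknown single character counts as a word with frequency 0 are all assumptions made for this example. Latin letters stand in for ideograms.

```python
def segment(phrase, freq, max_word_len=4):
    """Split phrase into words so that the sum of word frequencies is maximal.

    freq maps a word to its frequency (a hypothetical stand-in for a
    *.freq dictionary). Unknown single characters are accepted with
    frequency 0 so that every phrase can be segmented (an assumption).
    """
    n = len(phrase)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for phrase[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in phrase[:i]
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = phrase[start:end]
            if word in freq:
                f = freq[word]
            elif end - start == 1:
                f = 0
            else:
                continue  # unknown multi-character string: not a word
            if best[start] + f > best[end]:
                best[end] = best[start] + f
                back[end] = start
    # Walk the back pointers from the end of the phrase to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(phrase[back[pos]:pos])
        pos = back[pos]
    return words[::-1]


toy_freq = {"A": 1, "B": 1, "C": 1, "D": 1, "AB": 10, "BC": 5, "CD": 8}
print(segment("ABCD", toy_freq))
```

With this toy table the split AB + CD has the cumulative frequency 18, which is higher than any other split, for example 7 for A + BC + D.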

The mnoGoSearch distribution includes two Chinese dictionaries: mandarin.freq, a Simplified Chinese dictionary, and TraditionalChinese.freq, a Traditional Chinese dictionary.

Note: When building mnoGoSearch from sources for use with the Chinese language, don't forget to add --with-extra-charsets=big5,gb2312 when running configure.

Use the LoadChineseList command to enable Chinese phrase segmenting, with this format:

LoadChineseList [charset filename]
You can optionally specify the character set name and the file name of a dictionary.

Note: By default, LoadChineseList loads the dictionary for Simplified Chinese, that is, it uses the GB2312 character set and the file mandarin.freq. You may still find it convenient to specify the default values explicitly:

LoadChineseList gb2312 mandarin.freq

To enable Traditional Chinese segmenting, use this command:

LoadChineseList big5 TraditionalChinese.freq

Thai phrase segmenter

Thai segmenting uses the same method as Chinese segmenting, with the help of the Thai frequency dictionary thai.freq, which is included in the mnoGoSearch distribution.

Use the LoadThaiList command to enable Thai phrase segmenting, with this format:

LoadThaiList [charset dictionaryfilename]

Note: The TIS-620 character set and the file thai.freq are used by default. That is, using LoadThaiList without any arguments is effectively the same as this command:

LoadThaiList tis-620 thai.freq

The CJK phrase segmenter

Starting with version 3.3.8, mnoGoSearch also supports a special universal segmenter which is suitable for Japanese, Traditional Chinese and Simplified Chinese. The universal CJK segmenter does not use dictionaries and does not require external libraries.

You can enable the CJK segmenter by adding this command to both indexer.conf and search.htm:

Segmenter cjk

The CJK segmenter considers all ideogram characters from the Unicode blocks CJK Ideographs Extension A (U+3400 - U+4DB5) and CJK Ideographs (U+4E00 - U+9FA5) as separate words. When indexing a document using the CJK segmenter, mnoGoSearch stores information about every ideogram character separately.
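
As an illustration only, the following Python sketch splits a text so that every ideogram from the two Unicode ranges above becomes its own token. The function names are hypothetical, and the handling of everything that is not an ideogram (runs of letters and digits kept together, other characters treated as separators) is an assumption of this example, not a description of the mnoGoSearch internals.

```python
def is_cjk_ideogram(ch):
    """True if ch falls into one of the two ranges named in the text."""
    cp = ord(ch)
    return 0x3400 <= cp <= 0x4DB5 or 0x4E00 <= cp <= 0x9FA5


def cjk_tokens(text):
    """Return the tokens of text, one token per ideogram."""
    tokens = []
    run = []  # pending run of non-ideogram letters and digits
    for ch in text:
        if is_cjk_ideogram(ch):
            if run:
                tokens.append("".join(run))
                run = []
            tokens.append(ch)  # every ideogram is a separate "word"
        elif ch.isalnum():
            run.append(ch)
        else:
            # Spaces and punctuation only end the current run (assumption).
            if run:
                tokens.append("".join(run))
                run = []
    if run:
        tokens.append("".join(run))
    return tokens


print(cjk_tokens("mnoGoSearch 搜索"))
```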

At search time, the search query you type is preprocessed by the CJK segmenter, which inserts delimiters between the ideograms.

If you pass the m=phrase query string parameter to search.cgi (which means exact phrase search), the CJK segmenter uses the dash character as the delimiter. Otherwise, that is, in the all words and any of the words search modes, it uses the space character.

Imagine you type the query ``ABCD'', where A, B, C, D are some ideographic characters. When the exact phrase search mode is not active, your query is preprocessed by the CJK segmenter to ``A B C D'' and the four individual "words" are searched. Note that mnoGoSearch ranks documents with a smaller distance between the query words higher than documents having the same words in different parts of the document, so if you have some documents with the exact phrase ABCD, it is very likely that they will be displayed in the top 10 results.

Note: You can try different values for the WordDistanceWeight command to see how distances between the query words in the found documents affect their final score.

Now imagine you type the same query ``ABCD'' with the exact phrase search mode enabled. The query will be preprocessed by the CJK segmenter to ``A-B-C-D''. The dash character forces automatic phrase search (see the Section called Phrase search in Chapter 10 for details on automatic phrase search), so as a result only those documents with exact phrase match will be found.
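
The two cases described above can be sketched in Python as follows. This is a simplified model and not the actual mnoGoSearch code: the function names and the mode argument are hypothetical, and the sketch assumes that a delimiter is inserted only between two adjacent ideograms.

```python
def is_cjk_ideogram(ch):
    """True for the two Unicode ranges handled by the CJK segmenter."""
    cp = ord(ch)
    return 0x3400 <= cp <= 0x4DB5 or 0x4E00 <= cp <= 0x9FA5


def preprocess_query(query, mode="all"):
    """Insert a delimiter between adjacent ideograms.

    mode="phrase" models m=phrase and uses the dash character;
    any other mode uses the space character.
    """
    delimiter = "-" if mode == "phrase" else " "
    out = []
    prev = ""
    for ch in query:
        if prev and is_cjk_ideogram(prev) and is_cjk_ideogram(ch):
            out.append(delimiter)
        out.append(ch)
        prev = ch
    return "".join(out)


print(preprocess_query("中文搜索"))
print(preprocess_query("中文搜索", mode="phrase"))
```

The first call models the all words mode and produces a space-separated query; the second call models the exact phrase mode and produces a dash-separated query.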

Note: You can also use the ordinary mnoGoSearch query syntax with quotes to enable phrase searches without having to pass the m=phrase query string variable (exact phrase search mode). For example, if you type ``"AB" "CD"'', then the documents having the ideogram A immediately followed by the ideogram B and, at the same time, the ideogram C immediately followed by the ideogram D will be found. The mutual positions of the phrases AB and CD will not affect the result set, and will affect only the result ordering.

Although the CJK phrase segmenter is not aware of the real word boundaries, tests made by native speakers indicated that in many cases it works even better and more predictably than the MeCab-based, ChaSen-based, and frequency-based segmenters.