We are now publishing the latest release of our segmentation rules in all formats: SRX, CSCX and CSEX.
This new release adds rules for Russian and for the CJK (Chinese, Japanese, Korean) languages. CJK-specific punctuation marks are in the common part, since they may appear in Latin-script text, where they should behave the same way as in genuine CJK text.
The main change concerns the behaviour of the semicolon (;) in Greek. Normally, Greek text should use the dedicated character U+037E (;) as a question mark. However, many Greek texts use the ASCII semicolon (;) instead, because the two look alike and perhaps because the semicolon is easier to find on the keyboard. The difficulty is that a question mark is a full stop (it implies a split in all cases), whereas a semicolon is a half stop (it splits only before an uppercase letter). In the new release, there is an explicit exception for Greek only, where a semicolon is treated as a full stop. In all other languages, including Asian languages, it remains a half stop.
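To illustrate the difference between a full stop and a half stop, here is a minimal Python sketch of the semicolon behaviour described above. It is not the actual content of the SRX, CSCX or CSEX rule files: the function name `split_at_semicolons`, the language code `"el"`, and the choice to treat U+037E as a full stop in every language are assumptions made for this example only.

```python
import re

# U+037E GREEK QUESTION MARK looks like the ASCII semicolon (U+003B).
GREEK_QUESTION_MARK = "\u037e"


def split_at_semicolons(text, lang):
    """Split text after semicolons (hypothetical helper, not the released rules).

    - Full stop: always split. Assumed here for U+037E, and for the
      ASCII semicolon when lang is "el" (Greek).
    - Half stop: split only when the next character is uppercase.
      Used for the ASCII semicolon in every other language.
    """
    segments = []
    start = 0
    # A candidate break is a semicolon-like mark followed by whitespace.
    for match in re.finditer(r"[;\u037e]\s+", text):
        mark = text[match.start()]
        following = text[match.end():match.end() + 1]
        is_full_stop = mark == GREEK_QUESTION_MARK or lang == "el"
        if is_full_stop or following.isupper():
            # Keep the mark with the segment it closes.
            segments.append(text[start:match.start() + 1])
            start = match.end()
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


greek = "Πού είναι; στο σπίτι."
print(split_at_semicolons(greek, "el"))  # two segments: Greek exception
print(split_at_semicolons(greek, "en"))  # one segment: half stop, lowercase follows
```

With the Greek exception, the ASCII semicolon splits the sentence even though a lowercase letter follows. With any other language code, the same text stays in one segment, because the half stop only breaks before an uppercase letter.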