Segmentation rules updated

We have now published the latest release of our segmentation rules in all formats: SRX, CSCX and CSEX.

This new release adds rules for Russian and for the CJK (Chinese, Japanese, Korean) languages. CJK-specific punctuation marks are in the common part, since they may appear in a Latin-script text and should behave there exactly as they do in real CJK texts.

The main change concerns the behaviour of ; in Greek. Normally, Greek texts should use the dedicated symbol U+037E (;) as a question mark. But many Greek texts use the ASCII semicolon (;) instead, because it looks the same and perhaps because it is easier to find on the keyboard. A question mark, however, is a full stop (implying a split in all cases), while a semicolon is a half stop (implying a split only before an uppercase letter). In the new release, there is an explicit exception for Greek only, where a semicolon is treated as a full stop. In all other languages, including Asian languages, it remains a half stop.
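
The difference between a full stop and a half stop, and the Greek exception, can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the published rules themselves: the names FULL_STOPS, HALF_STOPS and split_segments, the language codes, and the exact character sets are assumptions made for the example, and abbreviations, numbers and similar exceptions are ignored.

```python
# Hypothetical character sets, for illustration only. The real rules
# live in the SRX, CSCX and CSEX files and are far more detailed.
FULL_STOPS = {".", "?", "!", "\u037e", "\u3002", "\uff1f", "\uff01"}
HALF_STOPS = {";"}


def split_segments(text, lang):
    """Split text after every full stop, and after a half stop only
    when the next non-space character is an uppercase letter."""
    full = set(FULL_STOPS)
    half = set(HALF_STOPS)
    if lang == "el":
        # Greek exception: the ASCII semicolon acts as a question mark,
        # so it is promoted from half stop to full stop.
        full.add(";")
        half.discard(";")

    segments = []
    start = 0
    for i, ch in enumerate(text):
        if ch in full:
            split_here = True
        elif ch in half:
            rest = text[i + 1:].lstrip()
            split_here = bool(rest) and rest[0].isupper()
        else:
            split_here = False
        if split_here:
            segment = text[start:i + 1].strip()
            if segment:
                segments.append(segment)
            start = i + 1

    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


greek = "Τι κάνεις; καλά."
as_greek = split_segments(greek, "el")
as_english = split_segments(greek, "en")
half_stop = split_segments("one; two; Three.", "en")
cjk = split_segments("これは文です。次の文", "ja")
```

With the language set to "el", the Greek sample is split after the semicolon even though a lowercase letter follows. With any other language code, the same text stays in one segment, and the English sample is split only at the semicolon that precedes "Three".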

 
