DGT-OmegaT 3.3 update 3 published

Score improvement

The new score implemented in the previous version is currently being tested. For very small strings (1 word) it seems significantly better. But for medium strings, say 2 or 3 words, the previous implementation often gave a high score when only one word changed.

This is because the score counted spaces as separate items, with the same weight as words.

Example: let's compare "One string" with "The string".

  • Without spaces, both strings are 2 items long and 1 item differs, so the score is 50%
  • With spaces, both strings are 3 items long and only 1 item out of 3 differs, so the score is 66%

The problem is that the space is the default separator: it is present in most strings, so once the string has been split into items, spaces no longer carry any information. For that reason, we have decided to keep counting punctuation (dots, commas and quotation marks do add information) but not spaces.
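The arithmetic of the example above can be reproduced with a small sketch. This is not the actual DGT-OmegaT scoring code: the function names tokenize and position_score are hypothetical, and the score here simply counts items that are equal at the same position, which is enough to show the effect of counting spaces.

```python
import re


def tokenize(text, keep_spaces=False):
    """Split text into words and punctuation marks, optionally keeping spaces.

    Hypothetical helper for illustration only.
    """
    # \w+ matches a word, [^\w\s] a single punctuation mark, \s+ a run of spaces.
    pattern = r"\w+|\s+|[^\w\s]" if keep_spaces else r"\w+|[^\w\s]"
    return re.findall(pattern, text)


def position_score(a, b, keep_spaces=False):
    """Toy score: percentage of items that are equal at the same position."""
    items_a = tokenize(a, keep_spaces)
    items_b = tokenize(b, keep_spaces)
    longest = max(len(items_a), len(items_b))
    if longest == 0:
        return 100  # two empty strings are considered identical
    same = sum(1 for x, y in zip(items_a, items_b) if x == y)
    return 100 * same // longest  # integer percentage, rounded down


print(position_score("One string", "The string"))                    # spaces ignored
print(position_score("One string", "The string", keep_spaces=True))  # spaces counted
```

With spaces ignored the two strings share 1 item out of 2; with spaces counted they share 2 items out of 3, which is why the older behaviour gave a higher score for the same one-word change. Punctuation marks stay as items in both modes.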

Diff improvement

Currently, when a diff string is displayed in the matches pane and two words differ, one word is always shown as a deletion and the other as an insertion.

To make the displayed string shorter, we are now experimenting with the following changes:

  • If only the case of the first letter changes (e.g. from "create" to "Create") then we display "c" in red and "C" in blue, but the rest of the word is not considered a difference;
  • If the only difference is the addition or deletion of an & sign (which appears in localization strings) then only this sign is displayed in red or blue.

Note that these two optimizations are mutually exclusive: if the difference contains both a case change and an & change, we keep the original algorithm.
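As a rough illustration of the two rules and their mutual exclusion, here is a minimal sketch. It is not the DGT-OmegaT implementation: the function word_diff and the "same", "del" and "ins" labels are hypothetical names, and it assumes that at most one & sign is added or removed.

```python
def word_diff(old, new):
    """Return (kind, text) pairs describing how to display a changed word.

    kind is "same", "del" (deleted text) or "ins" (inserted text).
    Hypothetical sketch of the two display rules, for illustration only.
    """
    def clean(parts):
        # Drop empty segments so the result contains only visible text.
        return [(kind, text) for kind, text in parts if text]

    if old == new:
        return clean([("same", old)])

    # Rule 1: only the case of the first letter changes.
    if (old and new and old[0] != new[0]
            and old[0].lower() == new[0].lower()
            and old[1:] == new[1:]):
        return clean([("del", old[0]), ("ins", new[0]), ("same", old[1:])])

    # Rule 2: the only difference is one "&" added...
    for i, ch in enumerate(new):
        if ch == "&" and new[:i] + new[i + 1:] == old:
            return clean([("same", new[:i]), ("ins", "&"), ("same", new[i + 1:])])
    # ...or one "&" removed.
    for i, ch in enumerate(old):
        if ch == "&" and old[:i] + old[i + 1:] == new:
            return clean([("same", old[:i]), ("del", "&"), ("same", old[i + 1:])])

    # Anything else, including a case change combined with an & change,
    # falls back to the original whole-word deletion plus insertion.
    return [("del", old), ("ins", new)]


print(word_diff("create", "Create"))
print(word_diff("File", "&File"))
print(word_diff("&file", "File"))
```

The last call combines a case change with an & change, so neither rule applies and the whole word is shown as deleted and re-inserted.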

This is consistent with the way the new score works: case changes or & changes have less weight in the score than a full change in the word.

 

Keep in mind that all of this is experimental; we are interested in receiving comments in order to build the best possible scoring algorithm.

