DGT-OmegaT 3.3 update 2 published

Until now, OmegaT, including the version used inside DGT, has used three different algorithms to calculate the match score. This can be a source of confusion.

When we started using DGT-OmegaT internally, users quickly asked to remove the default score, in other words the one that takes stemming into account. More recently, they pointed out some strange behaviour with the non-stemmed and adjusted scores.

Personally, I have always had another problem with these three scores. Because they are based on tokens only, they all give poor results on very short strings: if the string contains only one token, the score can only be 0 or 100, with nothing in between. This causes problems when you want to use OmegaT for software localization (L10N), where a lot of strings are only one or two words long.
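To make the problem concrete, here is a minimal Python sketch. It assumes the score is 100 × (1 − token edit distance / longest token count) and that tokens are simply split on whitespace; the function names and this exact formula are illustrative assumptions, not OmegaT's actual code.

```python
def token_edit_distance(a, b):
    """Edit distance over token lists; a token change costs 0 or 1."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            cost = 0 if ta == tb else 1
            curr.append(min(prev[j] + 1,          # delete a token
                            curr[j - 1] + 1,      # insert a token
                            prev[j - 1] + cost))  # keep or replace a token
        prev = curr
    return prev[-1]


def token_score(source, candidate):
    """Hypothetical token-only score between 0 and 100."""
    a, b = source.split(), candidate.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 100
    return round(100 * (1 - token_edit_distance(a, b) / longest))
```

Under these assumptions, `token_score("Open", "Opens")` is 0 although the strings are almost identical, while `token_score("Open file", "Open files")` is 50.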

The ideal distance, the one that gives the best results, is the Levenshtein distance at character level. But it is very expensive to compute, so it is difficult to use as a default.
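For reference, here is a sketch of the character-level Levenshtein distance in Python, using the standard dynamic-programming formulation. The `char_score` helper and its normalization by the longer string are assumptions for illustration only. The nested loops show where the cost comes from: the work grows with the product of the two string lengths, which adds up quickly on full segments.

```python
def levenshtein(s, t):
    """Character-level Levenshtein distance (insert, delete, replace = 1)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete a character
                            curr[j - 1] + 1,      # insert a character
                            prev[j - 1] + cost))  # keep or replace a character
        prev = curr
    return prev[-1]


def char_score(source, candidate):
    """Hypothetical 0-100 score based on the character-level distance."""
    longest = max(len(source), len(candidate))
    if longest == 0:
        return 100
    return round(100 * (1 - levenshtein(source, candidate) / longest))
```

With this sketch, `char_score("Open", "Opens")` gives 80, an intermediate value that a token-only score cannot produce for one-word strings.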

Now, the algorithm we propose.

First, we consider all tokens, exactly as in the adjusted score. The only difference during tokenization is that the space which usually follows a punctuation mark (and possibly the non-breaking space which precedes it, in French for example) is not treated as a separate token: it is part of the punctuation itself, which becomes two characters long. This reduces the number of tokens used. In general, the new score is close to the adjusted score, which is the most precise of the three, and the next step makes it more precise still.
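The tokenization change can be sketched in Python with a regular expression. This is only an illustration: the real tokenizers in OmegaT are different, and the names `TOKEN_RE` and `tokenize` are hypothetical. The sketch attaches an optional preceding non-breaking space and an optional following space to each punctuation mark.

```python
import re

NBSP = "\u00a0"  # non-breaking space, used before ! ? : ; in French

# Alternatives are tried in order:
#   1. a run of word characters (letters, digits, underscore)
#   2. one punctuation mark, with an optional NBSP before and space after
#   3. any remaining single whitespace character
TOKEN_RE = re.compile(r"\w+|" + NBSP + r"?[^\w\s] ?|\s")


def tokenize(text):
    """Split text into tokens, merging punctuation with adjacent spaces."""
    return TOKEN_RE.findall(text)
```

For example, `tokenize("Hello, world")` returns `['Hello', ', ', 'world']`: three tokens instead of the four you would get if the space after the comma were counted separately.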

In the other scores, two tokens can only be identical or different, so the cost of changing a token can only be 0 or 1. Here is the main difference in our improved score: token comparison can give a cost between 0 and 1, which means that we can add refinements after tokenization, rather than only inside it.
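One way to picture this is a token-level edit distance whose replacement cost is supplied by a function. The Python sketch below is an assumption about the general shape, not the DGT-OmegaT code; `weighted_distance`, `binary_cost` and `case_aware_cost` are hypothetical names, and the 0.1 value is only an example of a fractional cost.

```python
def weighted_distance(a, b, cost):
    """Edit distance over token lists with a fractional replacement cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ta in enumerate(a, start=1):
        curr = [float(i)]
        for j, tb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,                # delete a token
                            curr[j - 1] + 1.0,            # insert a token
                            prev[j - 1] + cost(ta, tb)))  # replace a token
        prev = curr
    return prev[-1]


def binary_cost(x, y):
    """Classic behaviour: tokens are either identical or different."""
    return 0.0 if x == y else 1.0


def case_aware_cost(x, y):
    """Example of a fractional cost: a case-only change is cheap."""
    if x == y:
        return 0.0
    if x.lower() == y.lower():
        return 0.1
    return 1.0
```

With `["Open", "File"]` against `["open", "File"]`, the distance is 1.0 using `binary_cost` but only 0.1 using `case_aware_cost`, so the refinement happens after tokenization without changing the tokens themselves.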

What is currently implemented is the following:

  • When the tokens are not words (numbers, punctuation, spaces, etc.), the cost is still only 0 or 1;
  • If both tokens are words:
    1. if they differ only by case (upper- and lowercase) but are otherwise identical, the cost is only 0.1 (10%);
    2. if they differ only by the presence of an & sign (which occurs in software localization), the cost is only 0.2 (20%);
    3. if they are identical after stemming, meaning that they have the same root (for example, runs and running), the cost is only 0.4 (40%);
    4. in all other cases, the cost is 0.7 plus the Levenshtein distance at character level, scaled to a value between 0 and 0.3 (this time it is not expensive, because we compare only two words).
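The rules above can be sketched as a single Python function. Several points are assumptions made for illustration: `toy_stem` is a deliberately naive stand-in for a real stemmer, a token counts as a word if it is alphabetic once any & is removed, and the character distance is scaled by dividing by the longer word's length. None of these details is confirmed by the actual implementation.

```python
def levenshtein(s, t):
    """Character-level Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def toy_stem(word):
    """Hypothetical toy stemmer: NOT a real stemming algorithm."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            w = w[:-len(suffix)]
            break
    # Undo consonant doubling, e.g. "runn" -> "run"
    if len(w) > 2 and w[-1] == w[-2] and w[-1] not in "aeiou":
        w = w[:-1]
    return w


def is_word(token):
    """Assumption: a word is alphabetic once any & is removed."""
    return token.replace("&", "").isalpha()


def token_cost(a, b, stem=toy_stem):
    """Cost between 0 and 1 of replacing token a with token b."""
    if a == b:
        return 0.0
    if not (is_word(a) and is_word(b)):
        return 1.0                      # non-words: still only 0 or 1
    if a.lower() == b.lower():
        return 0.1                      # rule 1: case-only difference
    if a.replace("&", "") == b.replace("&", ""):
        return 0.2                      # rule 2: & sign only
    if stem(a) == stem(b):
        return 0.4                      # rule 3: same root
    # Rule 4: 0.7 plus the character distance scaled into [0, 0.3]
    return 0.7 + 0.3 * levenshtein(a, b) / max(len(a), len(b))
```

In this sketch, "runs" against "running" costs 0.4, while "cut" against "cat" costs about 0.8, because the two words do not share a root even though they differ by a single character.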

Some of you may wonder why two tokens with the same root get the better result (a cost of 40%), since it takes less effort to change "cut" to "cat" than to change "cut" to "cutting". But do not forget one important point: the score is always calculated on the source segment, while the translator's effort is always on the target. Once the text is translated, for example into French, you see that if the strings contain "cut", "cutting" and "cat", it takes less effort to change "couper" to "coupant" than to change "couper" to "chat".

For the moment, if you use DGT-OmegaT 3.3 update 2, you will see all four scores, and of course this will probably be significantly slower than before, when there were only three. But this is necessary if you want to compare them: this version is definitely not meant for production use, but for testing and comparing the scores to see whether the improvement is useful or not. In the future, I will look into calculating each kind of score only if it is used, that is, only when it appears in the matches pane configuration.

Don't hesitate to give us any feedback about the new score: this is still experimental.

 

