Expression and word mode

In the public version of OmegaT, apart from regular expressions, the engine always searches for strings: even if you do not use the jokers * or ?, a segment containing «contest» will also match the search «test», simply because what is considered is the string, not the word.

In our screens, by contrast, exact and keyword searches must additionally specify a word mode:

In exact and keyword expression modes, word modes are defined as follows:

  • Strings means a sequence of characters, independently of what comes before or after it (in exact search, however, the characters inside the string are taken into account). Jokers can be replaced by any character except spaces: for example, in French, "l'activité" will match "l*ité" while "une activité" will not match "u*ité" (the apostrophe matches *, the space does not). This is what the public version of OmegaT (3.6 or 4.1) does;
  • Whole words means a sequence of letters, where a letter is defined by the Unicode class "Letter" (in Java \p{L}), meaning "any letter in any language" (including ideograms). You generally use it to find a word exactly, without inflection unless you explicitly declare it using jokers. In this mode, jokers can be replaced by any letter, but not by punctuation: for example, "l'activité" will not match "l*té" (in whole words mode, the apostrophe is excluded from jokers) while "lâcheté" will match correctly (letters with diacritics are correctly included).
    We are perfectly aware that this may be totally unusable for languages which do not use spaces. That does not seem to be a good reason to deny this notion to everyone else. Note also that there is a request for this feature in OmegaT (see RFE 849).
  • Lemmas means that both the search criteria and the target string are tokenized, using the tokenizer for the correct language (either the source or the target language of the project, depending on the field you search in). This makes it possible to recognize grammatical inflections without recognizing words which merely have the same beginning or end. But you must be sure that you successfully loaded the correct tokenizer for the source or target language before using this criterion.
    In this mode jokers are not available.
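
The Strings and Whole words modes described above can be illustrated by translating a joker expression into a Java regular expression. This is only a minimal sketch, not the code of the search engine: the class `JokerModes`, its enum and its methods are hypothetical names, `*` is assumed to stand for zero or more characters and `?` for exactly one, and a "whole word" is assumed to be delimited by anything that is not a letter. How the real engine treats punctuation around a word is not specified here. Lemmas mode is left out because it needs a language tokenizer.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the actual implementation of the search engine.
class JokerModes {
    enum WordMode { STRINGS, WHOLE_WORDS }

    // Translates a joker expression into a regular expression.
    // Assumption: '*' = zero or more characters, '?' = exactly one character.
    public static Pattern compile(String query, WordMode mode) {
        // Strings mode: a joker is any non-space character.
        // Whole words mode: a joker is any Unicode letter (\p{L}).
        String any = (mode == WordMode.STRINGS) ? "\\S" : "\\p{L}";
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        if (mode == WordMode.WHOLE_WORDS) {
            regex.append("(?<!\\p{L})"); // assumed rule: no letter just before
        }
        for (char c : query.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(any);
                if (c == '*') {
                    regex.append('*');
                }
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        if (mode == WordMode.WHOLE_WORDS) {
            regex.append("(?!\\p{L})"); // assumed rule: no letter just after
        }
        return Pattern.compile(regex.toString());
    }

    // True if the text contains something matching the joker expression.
    public static boolean matches(String text, String query, WordMode mode) {
        return compile(query, mode).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("l'activité", "l*ité", WordMode.STRINGS));    // true
        System.out.println(matches("une activité", "u*ité", WordMode.STRINGS));  // false
        System.out.println(matches("l'activité", "l*té", WordMode.WHOLE_WORDS)); // false
        System.out.println(matches("lâcheté", "l*té", WordMode.WHOLE_WORDS));    // true
    }
}
```

Literal parts of the query are passed through `Pattern.quote`, so characters such as `.` or `(` typed by the user are searched as-is rather than interpreted as regular expression syntax.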

Let's see the difference between them in a table. If you typed "test", without joker characters, then:

| If the text contains… | Strings | Whole words | Lemmas |
|---|---|---|---|
| test | Yes | Yes | Yes |
| tested | Yes (it "contains" the string t + e + s + t) | No (only the word "test" is accepted) | Yes (if the English tokenizer is used) |
| protest | Yes (it "contains" the string t + e + s + t) | No (only the word "test" is accepted) | No (protest is not a grammatical variant of "test") |

Now let's do the same with a joker. If you typed "t*st":

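| If the text contains… | Strings | Whole words | Lemmas |
|---|---|---|---|
| test | Yes | Yes | No (Lemmas mode does not support jokers) |
| the_file.test | Yes (* means 'any character except space') | No (* does not accept '.') | No (Lemmas mode does not support jokers) |
| to test | No (* will reject the space) | No (* does not accept space) | No (Lemmas mode does not support jokers) |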

The difference between exact and keyword search does not change, even with lemmas: keyword search means that the lemmas can appear anywhere in the segment, possibly in a different order. In exact search combined with Lemmas mode, the segment is tokenized according to the MATCHING mode, meaning that each term is lemmatized but stop words are not removed (note: except in the glossaries configuration, where GLOSSARY mode is used). Let's say that you typed «test element» in exact search mode; then:

| If the text contains… | Strings | Whole words | Lemmas |
|---|---|---|---|
| test element | Yes | Yes | Yes |
| test elements | Yes (it contains the string t + e + s + t + space + e + l + e + m + e + n + t) | No (only the word "element" is accepted) (*) | Yes (if the English tokenizer is used) |
| tested elements | No (it does not contain the string t + e + s + t + space + e + l + e + m + e + n + t) (*) | No (only the words "test" and "element" are accepted) | Yes (if the English tokenizer is used) |
| contest elements | Yes (it contains the string t + e + s + t + space + e + l + e + m + e + n + t) | No (only the words "test" and "element" are accepted) | No (contest is not a grammatical variant of "test") (*) |
| contested elements | No (it does not contain the string t + e + s + t + space + e + l + e + m + e + n + t) (*) | No (only the words "test" and "element" are accepted) | No (contested is not a grammatical variant of "test") (*) |

To be complete, note that for the entries marked with (*), at least one of the words is correct, meaning that in keyword search this word will be marked. Of course, since keyword search is an AND search, the segment will still be rejected.
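
The AND behaviour of keyword search can be sketched as follows for the Whole words mode. This is only an illustration under assumptions: the class `KeywordSearchSketch` and its methods are hypothetical, the query is assumed to be split into keywords on whitespace, and a word is assumed to be delimited by anything that is not a letter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the actual implementation of the search engine.
class KeywordSearchSketch {

    // True if 'word' occurs in the segment with no letter right before or after it.
    public static boolean containsWord(String segment, String word) {
        String regex = "(?<!\\p{L})" + Pattern.quote(word) + "(?!\\p{L})";
        return Pattern.compile(regex).matcher(segment).find();
    }

    // Keywords of the query that are found in the segment (the ones that would be marked).
    public static List<String> matchedKeywords(String segment, String query) {
        List<String> found = new ArrayList<>();
        for (String term : query.trim().split("\\s+")) {
            if (containsWord(segment, term)) {
                found.add(term);
            }
        }
        return found;
    }

    // AND search: the segment is accepted only if every keyword is found, in any order.
    public static boolean keywordMatch(String segment, String query) {
        int expected = query.trim().split("\\s+").length;
        return matchedKeywords(segment, query).size() == expected;
    }

    public static void main(String[] args) {
        System.out.println(matchedKeywords("test elements", "test element"));    // [test]
        System.out.println(keywordMatch("test elements", "test element"));       // false
        System.out.println(keywordMatch("element of the test", "test element")); // true
    }
}
```

With the query «test element», the segment "test elements" yields one marked keyword but is still rejected, while "element of the test" is accepted because both keywords are present regardless of their order.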

When you select «Regular expressions», the «Word mode» is replaced by «Regular expression mode».

The options include:

  • Partial segment: the segment must contain something matching the regular expression (this is what OmegaT currently does).

  • Full segment: the entire segment must match the regular expression. This is equivalent to adding \A at the beginning and \z at the end.

  • Whole words: equivalent to adding \b at the beginning and at the end of the searched text.
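
The three options can be pictured as wrappers around the expression typed by the user. The sketch below uses only the standard `java.util.regex` API; the class `RegexModes`, its enum and its methods are hypothetical names, and wrapping the expression in a non-capturing group `(?:…)` is an assumption made here so that an alternation such as `te|st` stays entirely inside the added anchors.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: NOT the actual implementation of the search engine.
class RegexModes {
    enum RegexMode { PARTIAL_SEGMENT, FULL_SEGMENT, WHOLE_WORDS }

    public static Pattern compile(String expression, RegexMode mode) {
        switch (mode) {
            case FULL_SEGMENT:
                // \A = beginning of input, \z = end of input
                return Pattern.compile("\\A(?:" + expression + ")\\z");
            case WHOLE_WORDS:
                // \b = word boundary on both sides
                return Pattern.compile("\\b(?:" + expression + ")\\b");
            default:
                // Partial segment: the expression is used unchanged
                return Pattern.compile(expression);
        }
    }

    // True if the segment is accepted by the expression in the given mode.
    public static boolean matches(String segment, String expression, RegexMode mode) {
        return compile(expression, mode).matcher(segment).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("a contest here", "t.st", RegexMode.PARTIAL_SEGMENT)); // true
        System.out.println(matches("a test", "t.st", RegexMode.FULL_SEGMENT));            // false
        System.out.println(matches("a test here", "t.st", RegexMode.WHOLE_WORDS));        // true
        System.out.println(matches("contest", "t.st", RegexMode.WHOLE_WORDS));            // false
    }
}
```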
