Expression and word mode

In the public version of OmegaT, except for regular expressions, the engine always searches for strings: even if you do not use the joker (wildcard) characters * or ?, a search for «test» will also match a segment containing «contest», simply because what is considered is the string, not the word.

In our screens, by contrast, exact and keyword searches must also be complemented by a word mode.

In exact and keyword expression modes, word modes are defined as follows:

Strings means a sequence of characters, independently of what comes before or after (but in exact search, the characters in between are taken into account). Jokers can be replaced by any character except spaces: for example, in French, "l'activité" will match "l*ité" while "une activité" will not match "u*ité" (the apostrophe matches *, the space does not). This is what the public version of OmegaT (3.6 or 4.1) does.
Whole words means a sequence of letters, where a letter is defined by the Unicode class "Letter" (in Java \p{L}), meaning "any letter in any language" (including ideograms): generally, you use it to find a word exactly, without inflection unless you explicitly declare it using jokers. In this mode, jokers can be replaced by any letter, but not by punctuation: for example "l'activité" will not match "l*té" (in whole words mode, the apostrophe is excluded from jokers) while "lâcheté" will match correctly (letters with diacritics are correctly included).
We are perfectly aware that this may be totally unusable for languages which do not use spaces. That does not seem to be a good reason to deny this feature to everyone else. You may also note that there is a request for it in OmegaT (see RFE 849).
Lemmas means that both the search criteria and the text being searched are tokenized, using the tokenizer for the relevant language (either the source or the target language of the project, depending on the field you search in). This makes it possible to recognize grammatical inflections without also matching words that merely share the same beginning or end. You must, however, be sure that the correct tokenizer for the source or target language was loaded successfully before using this criterion.
In this mode jokers are not available.
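
The Strings and Whole words definitions above can be approximated by translating a joker query into a Java regular expression. What follows is only a sketch under stated assumptions, not the code used by the search engine: the class JokerSketch and its methods are hypothetical names, matching is case-sensitive, and in Whole words mode a word is taken to be delimited by any character that is not a letter. Lemmas mode is left out because it depends on a language tokenizer.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not taken from the OmegaT sources. */
final class JokerSketch {

    /** Builds a regex from a joker query; jokerClass is what a single joker may stand for. */
    static String toRegex(String query, String jokerClass) {
        StringBuilder sb = new StringBuilder();
        for (char c : query.toCharArray()) {
            if (c == '*') {
                sb.append(jokerClass).append('*');   // any number of allowed characters
            } else if (c == '?') {
                sb.append(jokerClass);               // exactly one allowed character
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // literal character
            }
        }
        return sb.toString();
    }

    /** Strings mode: match anywhere in the text; jokers stand for any non-space character. */
    static boolean matchesStrings(String text, String query) {
        return Pattern.compile(toRegex(query, "\\S")).matcher(text).find();
    }

    /** Whole words mode: jokers stand for letters only, and no letter may touch the match. */
    static boolean matchesWholeWords(String text, String query) {
        String regex = "(?<!\\p{L})" + toRegex(query, "\\p{L}") + "(?!\\p{L})";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // \u00e9 is "é" and \u00e2 is "â", written as escapes to avoid encoding issues.
        System.out.println(matchesStrings("contest", "test"));
        System.out.println(matchesWholeWords("contest", "test"));
        System.out.println(matchesStrings("l'activit\u00e9", "l*it\u00e9"));
        System.out.println(matchesWholeWords("l'activit\u00e9", "l*t\u00e9"));
        System.out.println(matchesWholeWords("l\u00e2chet\u00e9", "l*t\u00e9"));
    }
}
```

The only difference between the two modes in this sketch is the character class substituted for a joker (\S versus \p{L}) and the two lookarounds that keep a whole-word match from touching a neighbouring letter.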

Let's see the difference between them in a table. If you typed "test", without joker characters, then:

If the text contains…	Strings	Whole words	Lemmas
test	Yes	Yes	Yes
tested	Yes (it "contains" the string t + e + s + t)	No (only the word "test" is accepted)	Yes (if the English tokenizer is used)
protest	Yes (it "contains" the string t + e + s + t)	No (only the word "test" is accepted)	No (protest is not a grammatical variant of "test")

Now let's do the same with a joker. If you typed "t*st", the question is whether the joker can stretch the match over the whole text shown:

If the text contains…	Strings	Whole words	Lemmas
test	Yes	Yes	No (Lemmas mode does not support jokers)
the_file.test	Yes (* means 'any character except space')	No (* does not accept '.')	No (Lemmas mode does not support jokers)
to test	No (* will reject the space)	No (* does not accept space)	No (Lemmas mode does not support jokers)

The difference between exact and keyword search does not change, even with lemmas: keyword search means that the lemmas can appear anywhere in the segment, possibly in a different order. In exact search + lemmas mode, the segment is tokenized according to the MATCHING mode, meaning that each term is lemmatized but stop words are not removed (note: except in the glossary configuration, where GLOSSARY mode is used). Let's say that you typed «test element», in exact search mode, then:

If the text contains…	Strings	Whole words	Lemmas
test element	Yes	Yes	Yes
test elements	Yes (it contains the string t + e + s + t + space + e + l + e + m + e + n + t)	No (only the word "element" is accepted) (*)	Yes (if the English tokenizer is used)
tested elements	No (it does not contain the string t + e + s + t + space + e + l + e + m + e + n + t) (*)	No (only the words "test" and "element" are accepted)	Yes (if the English tokenizer is used)
contest elements	Yes (it contains the string t + e + s + t + space + e + l + e + m + e + n + t)	No (only the words "test" and "element" are accepted)	No (contest is not a grammatical variant of "test") (*)
contested elements	No (it does not contain the string t + e + s + t + space + e + l + e + m + e + n + t) (*)	No (only the words "test" and "element" are accepted)	No (contested is not a grammatical variant of "test") (*)

For completeness, note that for the entries marked with (*), at least one of the words matches, meaning that in keyword search this word will be marked. Since keyword search is an AND search, however, the segment is still rejected unless every word matches.
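
The difference between exact search and keyword search can be sketched in a few lines of Java. This is only an illustration under stated assumptions: it is restricted to Whole words mode without jokers, it is case-sensitive, and the class ExactVersusKeyword and its method names are hypothetical rather than taken from the program.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical illustration of exact versus keyword search in Whole words mode. */
final class ExactVersusKeyword {

    /** True if the literal expression occurs with no letter directly before or after it. */
    static boolean containsWord(String text, String expression) {
        String regex = "(?<!\\p{L})" + Pattern.quote(expression) + "(?!\\p{L})";
        return Pattern.compile(regex).matcher(text).find();
    }

    /** Exact search: the typed expression must appear as one contiguous unit. */
    static boolean exactSearch(String text, String query) {
        return containsWord(text, query);
    }

    /** Keyword search: every typed word must appear somewhere, in any order (AND search). */
    static boolean keywordSearch(String text, String query) {
        return Arrays.stream(query.trim().split("\\s+"))
                .allMatch(word -> containsWord(text, word));
    }

    public static void main(String[] args) {
        System.out.println(exactSearch("test element", "test element"));
        System.out.println(exactSearch("an element under test", "test element"));
        System.out.println(keywordSearch("an element under test", "test element"));
        System.out.println(keywordSearch("test elements", "test element"));
    }
}
```

In the last call, "test" is found as a whole word but "element" is not (the text only has "elements"), so the AND condition rejects the segment even though one keyword matched.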

When you select «Regular expressions», the «Word mode» is replaced by «Regular expression mode».

The options include:

Partial segment: the segment must contain something matching the regular expression (this is what OmegaT currently does).
Full segment: the entire segment must match the regular expression. This is equivalent to adding \A at the beginning and \z at the end.
Whole words: equivalent to adding \b at the beginning and at the end of the search expression.
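
The three regular expression modes can be pictured as different ways of wrapping the typed expression before it is compiled. The following Java sketch is an assumption-laden illustration, not the actual implementation: the class RegexModeSketch, the enum Mode and its constants are hypothetical names, and the non-capturing group (?:...) is an addition of this sketch so that an alternation such as a|b is anchored as a whole.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of the three regular expression modes. */
final class RegexModeSketch {

    enum Mode { PARTIAL_SEGMENT, FULL_SEGMENT, WHOLE_WORDS }

    /** Wraps the typed expression according to the selected mode. */
    static Pattern compile(String expression, Mode mode) {
        switch (mode) {
            case FULL_SEGMENT:
                // \A and \z anchor the match to the start and end of the segment.
                return Pattern.compile("\\A(?:" + expression + ")\\z");
            case WHOLE_WORDS:
                // \b requires a word boundary on both sides of the match.
                return Pattern.compile("\\b(?:" + expression + ")\\b");
            default:
                // Partial segment: the expression is used unchanged.
                return Pattern.compile(expression);
        }
    }

    /** True if the segment contains a match for the wrapped expression. */
    static boolean matches(String segment, String expression, Mode mode) {
        return compile(expression, mode).matcher(segment).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("contest", "t.st", Mode.PARTIAL_SEGMENT));
        System.out.println(matches("contest", "t.st", Mode.FULL_SEGMENT));
        System.out.println(matches("contest", "t.st", Mode.WHOLE_WORDS));
        System.out.println(matches("a test here", "t.st", Mode.WHOLE_WORDS));
    }
}
```

Note that \b in Java relies on the regular expression engine's own notion of a word character, which is not necessarily the same as the \p{L} definition used by the Whole words mode of exact and keyword searches.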
