New XML / XLIFF filters

This feature is integrated in the core of DGT-OmegaT, but is also available as a plugin compatible with standard OmegaT, tested at least with 3.6 (but under Java 8), 4.3 and 5.8. The plugin has no additional dependencies except the StAX API, which is normally already present in Java 8 (and also in later versions, at least up to Java 17, although we cannot guarantee this for the future).

In both cases, once installed, the new filters appear in the menu Options => File Filters, but you may have to activate them manually.

 

This introduces entirely new filters for XLIFF 1.2, XLIFF 2.0 and SDLXLIFF. These filters are based neither on Okapi's filters nor on the current OmegaT XLIFF filter.

OmegaT's own filter is built on its high-level API, filters3/XMLFilter, which makes it easy to write a new filter but difficult to add complex features; that is probably why it is still monolingual. We based our work on filters2/AbstractFilter instead. Since other XML filters can be built on the same foundation, we introduced a new intermediate class, filters2/AbstractXMLFilter, which implements XML parsing with Java StAX. This is lower-level than filters3, so the source code of such filters is probably harder to read, but we keep all the possibilities of the original AbstractFilter, such as bilingual support, and we can also parse complex XML formats like Microsoft's OpenXML (currently a test version, mostly working but with some known bugs) in a less generic way.
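
To give an idea of what StAX-based parsing of a bilingual file looks like, here is a minimal sketch. It is not the code of filters2/AbstractXMLFilter: the class StaxXliffSketch, its Unit type and the readUnits method are hypothetical names chosen for illustration, and the sketch assumes XLIFF 1.2 trans-unit elements whose source and target contain plain text only (no inline tags).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only, not a class of DGT-OmegaT.
final class StaxXliffSketch {

    /** One bilingual unit: source text plus the translation found in the file, if any. */
    static final class Unit {
        final String id;
        final String source;
        final String target; // null when the unit has no <target>

        Unit(String id, String source, String target) {
            this.id = id;
            this.source = source;
            this.target = target;
        }
    }

    /** Pulls events from the StAX cursor and collects one Unit per <trans-unit>. */
    static List<Unit> readUnits(String xml) {
        List<Unit> units = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String id = null;
            String source = null;
            String target = null;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("trans-unit".equals(name)) {
                        id = reader.getAttributeValue(null, "id");
                        source = null;
                        target = null;
                    } else if ("source".equals(name)) {
                        // getElementText() only accepts text-only elements (assumption of this sketch)
                        source = reader.getElementText();
                    } else if ("target".equals(name)) {
                        target = reader.getElementText();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "trans-unit".equals(reader.getLocalName())) {
                    units.add(new Unit(id, source, target));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return units;
    }
}
```

Because the parser hands over source and target within the same pass, a filter written this way can populate the translation at the same time as the source segment, which is the bilingual behaviour described above. A real filter would also have to handle inline tags, which this sketch deliberately leaves out.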

Note: some additions have been made to the filters2/filters3 API in order to implement, for example, the conversion of OmegaT's notes (which no filter wrote to output files until now) into SDLXLIFF comments. That is why the plugin does not offer all features when used in standard OmegaT. If the core team is interested in these features, they should be easy to port to OmegaT 4.

More details, with a schema, are available in the technical document.

Specific features (i.e. not possible with renumbering scripts)

Choice of tag identification character

One thing the core filter (filters3/XliffFilter) had that Okapi did not is the ability to choose the tag identifier (i.e. to use <b> for bold or <i> for italics) when the file follows XLIFF conventions. This could not be done via a script, but the StAX filter reuses the same algorithm for XLIFF 1 and an equivalent one for XLIFF 2.

But as usual, SDLXLIFF has its own way of specifying the role of a tag. Fortunately, we managed to implement a specific algorithm for SDLXLIFF, at least when the original file is DOCX. Again, compare the three options, each improving on the previous one:

Okapi filter alone

Sample <g18>Text</g18>
<segment 0010>
<g18>Texte</g18> d'exemple
<end segment>

Sample <g24>Text</g24>
<segment 0015>
<g24>Texte</g24> d'exemple
<end segment>

  • Reads <source> or <seg-source> as source segment
  • Reads <target> as auto-populated translation
  • Tag is unique at document-level
  • Segments are not recognized as identical
  • Tag type is g (as in xliff)

Okapi filter + Perl renumbering

Sample <g0>Text</g0>
<segment 0010 (+ 1 more)>
<g0>Texte</g0> d'exemple
<end segment>

Sample <g0>Text</g0>
<segment 0015 (+ 1 more)>
<g0>Texte</g0> d'exemple
<end segment>

  • Reads <source> or <seg-source> as source segment
  • Reads <target> as auto-populated translation
  • Tag is reset to 0 at each paragraph
  • Segments are recognized as identical
  • Tag type is g (as in xliff)

StAX filter

Sample <b0>Text</b0>
<segment 0010 (+ 1 more)>
<b0>Texte</b0> d'exemple
<end segment>

Sample <b0>Text</b0>
<segment 0015 (+ 1 more)>
<b0>Texte</b0> d'exemple
<end segment>

  • Reads <source> or <seg-source> as source segment
  • Reads <target> as auto-populated translation
  • Tag is reset to 0 at each paragraph
  • Segments are recognized as identical
  • Tag type is detected (here bold), in some cases
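
The effect shown in this comparison can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the algorithm used by any of the filters: the class TagRenumberSketch and its methods are hypothetical, the tag type is assumed to be available as an XLIFF 1.2 ctype value ("bold" or "italic") looked up by the original tag id, and numbering restarts at 0 on each call, i.e. for each paragraph passed in.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only, not the DGT-OmegaT implementation.
final class TagRenumberSketch {

    // Matches document-level paired tags such as <g18> and </g18>
    private static final Pattern TAG = Pattern.compile("<(/?)g(\\d+)>");

    /** Picks the tag letter from a ctype value; falls back to 'g' when the type is unknown. */
    static char letterFor(String ctype) {
        if ("bold".equals(ctype)) {
            return 'b';
        }
        if ("italic".equals(ctype)) {
            return 'i';
        }
        return 'g';
    }

    /**
     * Rewrites the tags of one paragraph so that numbering restarts at 0
     * and the letter reflects the detected type, when one is known.
     */
    static String renumber(String paragraph, Map<String, String> ctypeById) {
        Map<String, String> shortcuts = new HashMap<>(); // original id -> new tag name
        StringBuffer out = new StringBuffer();
        Matcher m = TAG.matcher(paragraph);
        while (m.find()) {
            String id = m.group(2);
            String shortcut = shortcuts.get(id);
            if (shortcut == null) {
                // First time this id is seen in the paragraph: give it the next index
                shortcut = letterFor(ctypeById.get(id)) + String.valueOf(shortcuts.size());
                shortcuts.put(id, shortcut);
            }
            m.appendReplacement(out,
                    Matcher.quoteReplacement("<" + m.group(1) + shortcut + ">"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With an empty type map, "Sample <g18>Text</g18>" and "Sample <g24>Text</g24>" both become "Sample <g0>Text</g0>", so the two segments are recognized as identical. If ids 18 and 24 are both mapped to "bold", both become "Sample <b0>Text</b0>". The same example also shows why a translation stored against <g18> or <g0> no longer matches a source displayed with <b0>.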

This feature works perfectly in the plugin for standard OmegaT (4 or 5). Be warned, however, that it makes the plugin incompatible with the Okapi filter: if you open with the StAX filter a project that was translated with the Okapi filter, you will lose the translations of segments containing tags, because they are no longer 100% matches. This is not a bug but an incompatibility between filters: doing the opposite (translating with StAX and opening with Okapi) would give the same result, and until now nobody has told them that there was a bug on their side.

Keep date and author of current translation

When you use bilingual formats in OmegaT and a segment already has a translation in the source file, OmegaT can retrieve it, but it always sets the author to unknown and records no date.

DGT-OmegaT 3.2 can retrieve the author and date if they are recorded in the source file: this is the case for SDLXLIFF files only (standard XLIFF has neither an author nor a date attribute), not for PO files. Example:

Before

Last modified by unknown
Sample <g0>Text</g0>
<segment 0010>
<g0>Texte</g0> d'exemple
<end segment>

After

Last modified by cordoth on 26-janv.-2016 at 17:06:46
Sample <g0>Text</g0>
<segment 0010>
<g0>Texte</g0> d'exemple
<end segment>

Note that, because it requires changes in the core OmegaT classes, this feature is not available via the plugin for OmegaT 4.
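
As a rough sketch of how such per-segment metadata could be collected with StAX, consider the code below. Everything in it is an assumption made for illustration and should be checked against real files: it supposes that each segment definition is an element with local name "seg" carrying an id attribute, and that author and date are stored in child "value" elements whose key attribute is "last_modified_by" and "modified_on". The class SdlSegInfoSketch is hypothetical and is not part of DGT-OmegaT.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration only; element and key names are assumptions.
final class SdlSegInfoSketch {

    /** Author and raw date string of one segment; either may be null if absent. */
    static final class SegInfo {
        final String author;
        final String modifiedOn;

        SegInfo(String author, String modifiedOn) {
            this.author = author;
            this.modifiedOn = modifiedOn;
        }
    }

    /** Returns segment id -> metadata, in document order. */
    static Map<String, SegInfo> read(String xml) {
        Map<String, SegInfo> result = new LinkedHashMap<>();
        try {
            XMLStreamReader r =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String segId = null;
            String author = null;
            String modifiedOn = null;
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = r.getLocalName();
                    if ("seg".equals(name)) {
                        segId = r.getAttributeValue(null, "id");
                        author = null;
                        modifiedOn = null;
                    } else if ("value".equals(name) && segId != null) {
                        String key = r.getAttributeValue(null, "key");
                        String text = r.getElementText();
                        if ("last_modified_by".equals(key)) {
                            author = text;
                        } else if ("modified_on".equals(key)) {
                            modifiedOn = text; // kept as a raw string, not parsed
                        }
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "seg".equals(r.getLocalName())) {
                    if (segId != null) {
                        result.put(segId, new SegInfo(author, modifiedOn));
                    }
                    segId = null;
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return result;
    }
}
```

The date is deliberately left as the raw string found in the file, since its exact format is not stated here. Passing these values on to the editor is the part that needs the core changes mentioned above.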

SDLXLIFF note/comments conversion:

  • When you open an SDLXLIFF file containing "comments" visible in Studio, these comments are readable in the "Comments" pane (or Segment Properties in OmegaT 4 or later)
  • When you produce the final SDLXLIFF file, the notes which are present in the project memory (project_save.tmx) are converted into comments readable by Trados Studio (only in DGT-OmegaT, not in the plugin)
