StaX filters 2.0.4 and 2.1.1 published

Download:
    Test release (2.0.4): Source / Binary
    Development release (2.1.1): Source / Binary

Bugs have been found in the 2.x releases of StaX filters. They also affect DGT-OmegaT 3.4 and 3.5, which will be corrected later. DGT-OmegaT 3.3, which still uses (a variant of) StaX filters 1.1 and does not include OpenXML at all, is not affected.

OpenXML filter: wrong "default" attributes

OpenXML files contain a lot of repeated tags. In particular, each formatting change such as bold or italic must be repeated for each "run" in a paragraph, even if the whole paragraph shares the same attributes.

The <w:p> element also contains <w:pPr> (paragraph properties), which I mistakenly interpreted as a factorization of the properties common to the whole paragraph. On the contrary, they apply only to the paragraph markup itself.

The consequence was that a paragraph sometimes lost its attributes when OmegaT generated the target file.

It therefore seems impossible to avoid repetitions in the generated XML file, but I am now trying a new method to factorize at least inside OmegaT: if all runs of a paragraph have some attributes in common, I keep them only once in memory. For example, if an entire paragraph is in italics but contains multiple runs with distinct attributes, the italics attribute is stored in memory as common, and I repeat it when I create the target file.
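As a rough illustration of this idea (not the actual filter code), the common part can be computed as the intersection of the attribute sets of all runs. The class name, the method names and the short attribute labels "i", "b" and "u" below are assumptions made up for this sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: each run is modelled as a plain set of attribute labels.
class RunAttributeFactoring {

    /** Attributes present in every run; empty when there are no runs. */
    static Set<String> commonAttributes(List<Set<String>> runs) {
        if (runs.isEmpty()) {
            return new LinkedHashSet<>();
        }
        Set<String> common = new LinkedHashSet<>(runs.get(0));
        for (Set<String> run : runs) {
            common.retainAll(run); // keep only what this run also has
        }
        return common;
    }

    /** What remains specific to each run once the common part is removed. */
    static List<Set<String>> specificAttributes(List<Set<String>> runs, Set<String> common) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> run : runs) {
            Set<String> own = new LinkedHashSet<>(run);
            own.removeAll(common);
            result.add(own);
        }
        return result;
    }

    /** Writing the target: the common part is put back on every run. */
    static List<Set<String>> expand(Set<String> common, List<Set<String>> specific) {
        List<Set<String>> result = new ArrayList<>();
        for (Set<String> own : specific) {
            Set<String> full = new LinkedHashSet<>(common);
            full.addAll(own);
            result.add(full);
        }
        return result;
    }

    public static void main(String[] args) {
        // A paragraph entirely in italics, with one bold run and one underlined run.
        List<Set<String>> runs = List.of(Set.of("i"), Set.of("i", "b"), Set.of("i", "u"));
        Set<String> common = commonAttributes(runs);
        List<Set<String>> specific = specificAttributes(runs, common);
        System.out.println("common   = " + common);
        System.out.println("specific = " + specific);
        System.out.println("expanded = " + expand(common, specific));
    }
}
```

In this model, only the run-specific attributes would need to appear as tags in the segment, while expand restores the repetition that the DOCX format requires.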

If all of this seems totally abstruse to you, that is perfectly normal: now you understand why Microsoft's DOCX format is so hard to parse and why OmegaT renders it with such a long list of tags. I hope it will be better with this release, but there are probably still a lot of tags whose purpose you may wonder about and which seem impossible to remove.

In any case, please don't forget that the OpenXML filter has not yet been widely tested, so it may still contain bugs. The only filters which have been widely used and can be considered stable are XLIFF 1 and SDLXLIFF.

Filter for XLIFF 2.0 now compliant with StaX filters 2.0 API

Release 2.0 of StaX filters introduces a new algorithm to read files, alternating between cursor-based and event-based parsing. Cursor-based parsing is harder to use but takes less memory, so it is used for the parts of the document which are not modified (in particular the big base64-encoded file embedded in SDLXLIFF). Events are used where there are real segments, in order to keep the association between OmegaT tags and internal tags in RAM.
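The alternation can be sketched with the standard javax.xml.stream API, as below. This is only a minimal illustration, not the code of StaX filters: the class and method names are hypothetical, the element name "segment" is taken from XLIFF 2 as an example, and the sketch assumes that an event reader wrapped around a cursor starts at the cursor's current position, which is how the JDK's built-in implementation behaves but may differ with other StAX implementations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch: cursor mode outside segments, event mode inside them.
class MixedStaxSketch {

    /** Returns, for each <segment> element, the list of events it contains. */
    static List<List<XMLEvent>> readSegments(String xml) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader cursor = factory.createXMLStreamReader(new StringReader(xml));
            List<List<XMLEvent>> segments = new ArrayList<>();
            // Cursor mode: nothing is kept in memory for the parts we skip.
            while (cursor.hasNext()) {
                int type = cursor.next();
                if (type == XMLStreamConstants.START_ELEMENT
                        && "segment".equals(cursor.getLocalName())) {
                    segments.add(bufferSegment(factory, cursor));
                }
            }
            cursor.close();
            return segments;
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Could not parse the document", e);
        }
    }

    /** Event mode: keeps every event as an object until the element is closed. */
    private static List<XMLEvent> bufferSegment(XMLInputFactory factory, XMLStreamReader cursor)
            throws XMLStreamException {
        // Assumption: the wrapper delivers the cursor's current event first.
        XMLEventReader events = factory.createXMLEventReader(cursor);
        List<XMLEvent> buffered = new ArrayList<>();
        int depth = 0;
        do {
            XMLEvent event = events.nextEvent();
            buffered.add(event);
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
        } while (depth > 0);
        // The wrapper is deliberately not closed, so the cursor stays usable.
        return buffered;
    }

    public static void main(String[] args) {
        String xml = "<xliff><skeleton>QUJD</skeleton><unit><segment>"
                + "<source>Hello <pc id=\"1\">world</pc></source>"
                + "</segment></unit></xliff>";
        for (List<XMLEvent> segment : readSegments(xml)) {
            System.out.println(segment.size() + " events buffered for one segment");
        }
    }
}
```

The buffered event objects are what makes it possible to match tags inside a segment afterwards, while the content of the skeleton element is never turned into objects at all.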

As it is not widely used at present, the filter for XLIFF 2 had not been tested, and nobody had noticed that it stayed in cursor mode everywhere, meaning that it could find no segments at all. This problem is now solved.

Theme: 
OmegaT
