Thursday, April 17, 2008

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

Not having Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...

The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.
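
For anyone wanting to repeat the step, here is a minimal Python sketch of pulling the main part out of the .docx package and pretty-printing it. It assumes the part lives at "word/document.xml" inside the zip; the function name and output path are made up for illustration. Note that minidom loads the whole document into memory, so for a part of this size a streaming tool may be the more practical choice, and the re-indented copy is for reading only since the added whitespace changes the content.

```python
import zipfile
import xml.dom.minidom


def pretty_print_main_part(docx_path, out_path, part="word/document.xml"):
    """Extract one part from a .docx package and write an indented copy.

    Returns (original size, pretty-printed size) in bytes.
    The part name is an assumption about the package layout.
    """
    # A .docx file is a zip archive; read the raw XML of the chosen part.
    with zipfile.ZipFile(docx_path) as package:
        raw = package.read(part)

    # Parse fully into memory and re-serialise with indentation.
    dom = xml.dom.minidom.parseString(raw)
    pretty = dom.toprettyxml(indent="  ", encoding="utf-8")

    with open(out_path, "wb") as out:
        out.write(pretty)
    return len(raw), len(pretty)
```

The two sizes returned make it easy to see how much the indentation inflates the file.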

So we have a document and a RELAX NG schema. All that's necessary now is to use jing (or similar) and we can validate ...
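
As a rough illustration, jing can be driven from a small Python wrapper like the one below. The jar location and the file names are placeholders, not the ones used for this test; the sketch also assumes that jing exits with a non-zero status when the document is invalid and that its messages can be collected from standard output and standard error together.

```python
import subprocess


def jing_command(schema, document, compact=False, jar="jing.jar"):
    """Build the argument list for a jing run (paths are placeholders)."""
    cmd = ["java", "-jar", jar]
    if compact:
        cmd.append("-c")  # schema is in RELAX NG compact syntax (.rnc)
    cmd += [schema, document]
    return cmd


def validate(schema, document, **kwargs):
    """Run jing and return (is_valid, list of message lines).

    Assumes a zero exit status means the document validated.
    """
    result = subprocess.run(
        jing_command(schema, document, **kwargs),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # collect every message in one stream
        text=True,
    )
    return result.returncode == 0, result.stdout.splitlines()
```

Keeping the command builder separate from the runner makes it possible to check the arguments without having Java or jing installed.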

Validating against the STRICT model

The STRICT conformance model is quite a bit different from Ecma 376, essentially because most of that format's most notorious features (non-ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.

Sure enough, jing emitted 17MB of invalidity messages (around 122,000 of them) when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values; I would expect a document which exercised a wider range of features to generate a more diverse set of error messages.
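
With that many messages, a tally by message type is more useful than reading the log. The following sketch assumes each message line has the shape "file:line:column: level: text"; that layout, and the idea of blanking out quoted names so that similar messages group together, are assumptions for illustration rather than a description of how the figures above were produced.

```python
import re
from collections import Counter

# Assumed message layout: "<file>:<line>:<col>: <level>: <text>"
LOCATION = re.compile(
    r"^(?P<file>.*?):(?P<line>\d+):(?P<col>\d+): (?P<level>\w+): (?P<text>.*)$"
)


def summarise(lines):
    """Count validation messages by their general shape."""
    counts = Counter()
    for line in lines:
        match = LOCATION.match(line.strip())
        if not match:
            continue  # skip anything that is not a located message
        # Replace quoted names and values so similar messages fall together.
        shape = re.sub(r'"[^"]*"', '"*"', match.group("text"))
        counts[shape] += 1
    return counts
```

Calling `summarise(messages).most_common(10)` on the captured output would then list the ten most frequent kinds of complaint.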

Validating against the TRANSITIONAL model

The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibility with Ecma 376 and to preserve XML structures at which legacy Office features could be targeted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.

Sure enough (again), the result is as expected: relatively few messages (84) are emitted, and they are all of the same type, complaining e.g. of the element:

<m:degHide m:val="on"/>
since the allowed attribute values for val are now "true", "false", etc.; this was one of the many tidying-up exercises performed at the BRM.
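
To give a feel for how small this particular fix is, here is a hedged Python sketch that rewrites the legacy values in place. The mapping of "on" to "true" and "off" to "false" is my assumption about the intended equivalents, the function name is made up, and the sketch naively touches every m:val attribute whether or not it is a boolean, so treat it as an illustration only.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "m" prefix in Office 2007 math markup.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
VAL = "{%s}val" % M_NS

# Assumed equivalents for the legacy boolean spellings.
LEGACY = {"on": "true", "off": "false"}


def modernise_on_off(root):
    """Rewrite legacy m:val values in place; return how many were changed."""
    changed = 0
    for element in root.iter():
        old = element.get(VAL)
        if old in LEGACY:
            element.set(VAL, LEGACY[old])
            changed += 1
    return changed
```

Run over a parsed document tree, the return value should match the number of messages of this type that the validator reported.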

Conclusions?

Such a test is only indicative, of course, but a few tentative conclusions can be drawn:

  • Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
  • Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
  • Making them conform to the TRANSITIONAL schema will require less of the same sort of surgery (since they're quite close to conformant as-is)

Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.

What's next?

To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?


- Alex.
Thursday, April 17, 2008 12:20:22 PM UTC
Monday, April 21, 2008 8:27:55 PM UTC
It would be strange for a document from the past to validate against totally new XML schemas that have not been published yet by ISO.

Groklaw is amused.
Probably their tiny brains cannot grasp the idea that documents created in the past, which validated correctly against the schemas of the time, do not necessarily validate against totally new strict XML schemas that have had a lot of transitional items removed from them.

MS Office files of course validate against the existing and available Ecma standard version of the schemas, but it will require an update/upgrade in MS Office to comply better with the new schemas.

What I am more interested in is whether the schemas will have new versioning in them, so that we can easily recognize whether a file uses the strict or transitional schemas, or the current Ecma schemas.

For Groklaw readers, a little extra newsflash: OpenOffice, years after standardization, still does not produce conforming ISO ODF files.
Tuesday, April 22, 2008 4:25:54 AM UTC
This always happens when you modify the data storage format after the application has been delivered. You cannot change a data storage format in production the way you would correct a processing bug.

I do not even understand why you tried to validate it. Your results were obvious. I do not understand why you needed to publish an entry about this.

Btw, I am not trying to defend Microsoft against anything. I am just saying your 2 cents are really biased because your test protocol is wrong.

Maybe you are not the right guy in the right position.
sebastien madelenat
Tuesday, April 22, 2008 4:18:03 PM UTC
"To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?"

funny you say popular implementation - that implies that you have a choice of applications to test.
with ooxml you only have one and it doesn't even pass. shouldn't this have been done before approving it as a standard?

maybe you should ask if openoffice can implement ooxml - isn't that the real point of standards? the public has a choice on what applications to use.

iso has totally failed in my book as far as approving ooxml. it is nowhere near complete and not one product can implement it correctly.
suezz
Tuesday, April 22, 2008 4:22:49 PM UTC
I think the groklaw-readers are quite rightly amused!

Bearing in mind the ISO fast track is for "established industry standards", this shows that there is no single working implementation of the standard in the world today. Now we are left "hoping" that the monopolist vendor of this "neutral" standard will provide us with a functioning implementation by way of a service pack.

This should be a source of great embarrassment to anyone involved in this process. I wouldn't even employ you folks to make the coffee at my office.
max stirner
Tuesday, April 22, 2008 4:54:20 PM UTC
Actually, the Part 4 document may not be a good choice for this testing. We know that there was a lot of automation involved in assembling these specs from fragments in a database, so it may be that Office 2007 wasn't actually used to create the final document - you may be testing the compliance of an in-house publishing tool!

"a popular implementation of OpenDocument" - of course, you'll be testing for conformance against the ISO standard ISO/IEC 26300:2006, rather than against any other non-ISO standards such as ODF 1.1... Probably your best approach to getting a fair test would be to save the Part 4 OOXML document as Word 97-2003 binary, then open it in your popular OpenDocument implementation of choice, and save it out again. The CleverAge OOXML->ODF converter isn't good at handling very large files.

Inigo
Tuesday, April 22, 2008 5:57:15 PM UTC
@hAl

I can't remember if the schemas are versioned or not. They should be -- otherwise it'll be necessary to perform some kind of heuristic to determine whether a document is strict or transitional.

- Alex.
Alex Brown
Tuesday, April 22, 2008 6:06:01 PM UTC
@sebastien

The purpose of the test was to see how far conformance had drifted. This is of interest, I think. I was also hoping to move the "debate" a little more towards being evidence-based.

You miss the fact that MS can change their software post deployment, as it is dynamically configurable through service packs. They can (and do) alter file formats this way, as I wrote in the blog entry.

- Alex.
Alex Brown
Tuesday, April 22, 2008 7:22:19 PM UTC
"You miss the fact that MS can change their software post deployment, as it is dynamically configurable through service packs."

MS may indeed TRY to implement a service pack, just as soon as they're done trying to get their Vista service pack out!
max stirner
Wednesday, April 23, 2008 6:59:47 AM UTC
@Inigo

You're right - this document does indeed only allow a "smoke test" - what we really need of course is a suite of test documents and validation that goes beyond grammar. But luckily, SC 34 has spent most of the last years developing just such sophisticated validation technologies!

And yes, I will be ignoring the OASIS versions of the ODF schema for testing (just as I am ignoring Ecma versions of OOXML). I'm interested in conformance to the International Standards, not the consortium ones.

- Alex.
Alex Brown