Friday, June 13, 2008


>>>  UPDATE: XML UK has cancelled this event due (remarkably) to a lack of interest :-(  <<<

Here is the full programme for XML UK's upcoming day conference, “XML in the Office”. Interested? The registration form is here.

XML in the Office

Thursday 26th June 2008
Victoria Hall, Reading Town Hall
Blagrave Street, Reading, Berkshire RG1 1QH

A growing number of organisations are taking advantage of the fact that the mainstream office application suites now use XML as the normal format for office document storage. Users are looking to make the best of whatever formats they currently use, and this event will give them an opportunity to share their experiences of how to make best use of the enthusiasm for XML among office system developers. This one-day conference brings together a number of speakers with experience in a range of office applications where XML tools can now be used to manipulate office documents for a variety of purposes.


10.00 – 10.40 : Alex Brown, Griffin Brown

How we got where we are; where we are going …

The last two years have seen the OpenDocument Format (ODF) and then Office Open XML (OOXML) progressed through various standards bodies. This presentation will describe what happened during this period, and make some predictions about what is likely to happen next.

Alex first became interested in structured markup when analysing literary texts for his doctorate (on early Shakespeare editions) in the late 1980s. Following this he worked as a developer on a heavily object-oriented C++ application framework for cross-platform multimedia publishing, at the height of the CD-ROM boom. In 1997 Alex was one of the founding directors of Griffin Brown Digital Publishing Ltd, a UK-based company providing XML-based services and products. He is responsible for leading the company’s XML consulting and implementation, and his work includes advising clients on XML/IT strategy and practice, mentoring clients’ staff, writing DTDs and Schemas, and designing and developing XML software systems in C++, Java and other languages. In 2002, Alex was invited to join the British Standards Institution (BSI) Technical Committee IST/41, where he contributes to ISO/IEC JTC1/SC34 in its formation of the DSDL ISO standard, among other things. Alex writes and speaks regularly on structured markup technologies and their application to information management.


10.40-11.00 : Coffee/networking


11.00-11.40 : Inigo Surguy

Semantic capabilities in office document metadata

Both ODF and OOXML have powerful mechanisms for storing custom metadata about a document. How can these be used to make documents more useful? How can connections be made between documents? And can the semantics of ODF and OOXML be reconciled? This talk will attempt to answer these questions by examining the semantic capabilities available in each office document format, what they enable, and how they can be used.

Inigo Surguy has 13 years' experience working in software, in a variety of areas (including publishing, healthcare, logistics, and energy research). His technical skills include software architecture design and modelling; Java, J2EE and .NET development; web development; and a wide variety of XML-based technologies including XSLT, XQuery, HL7, RDF and OWL. Inigo has written chapters for the books "Practical XML for the Web", "Content Management Systems" and "Practical Intranet Development" published by Glasshaus, and presented a paper on ontologies at the WWW2006 conference. Inigo is co-founder of 67bricks, a consultancy specialising in bespoke solutions to help companies maximise the value of their information.


11.40-12.20 : Robin La Fontaine & Nigel Whitaker

Fun and frustration processing Open Document Format

We will present some experiences in extracting the XML content of an ODT document, processing it and then re-synthesising an ODT document. The objective was to produce an improved comparison engine for large legal documents and to enable intelligent merging of two different edits of the same document.

Robin La Fontaine is CEO of DeltaXML, a company providing change management for XML including a method for representing changes to XML documents and data in XML. Robin has contributed to various ISO standards and has been project manager of several European research projects including the XML/EDI European Pilot Project. His background is in CAD data exchange and Lisp programming. Robin has a degree in Engineering Science from the University of Oxford and a Masters degree in Computer Science.


12.20-13.00 : John Collins, Francis Cave, Helena Bayler

Design, implementation and maintenance of a digital workflow for production of Parliamentary business papers

The House of Commons Vote Bundle is an assembly of Parliamentary business papers, many published on a daily basis, that are the essential working papers of Members of Parliament. The Vote Bundle includes a wide variety of material spread over a number of documents that vary from a simple notice on a single page to scores of pages of amendments to a Bill. A large number of offices within the House are involved in its production, and complex editorial procedures have evolved to ensure the accuracy and timeliness of the content of each document. By adopting XML-based workflows, the House of Commons has been able to standardise its data formats while retaining flexibility in the choice of editorial and production systems. One component of the Vote Bundle is the Order Paper, which provides Members with a detailed agenda for today's business and a timetable for future business. The Order Paper is an assembly of many distinct components of varying complexity and length, including legal text and tables. This presentation will focus upon the systems used for production of the Order Paper: how they were designed and implemented, and how they are being actively maintained to meet the changing needs of Parliament.

Francis Cave is an XML consultant based in the UK. He was a founder member of the International SGML/XML Users Group and is currently Chairman of XML UK, the UK XML user group. He is also Chairman of BSI Technical Committee IST/41, which represents the UK in the work of ISO/IEC JTC 1/SC 34, and is editor of DSDL Part 9 Namespace- and datatype-aware DTDs. For more than seven years Francis has been working with EDItEUR on the development and maintenance of a wide range of XML-based communication standards for the book and serials industries worldwide, including the ONIX and EDItX format families. As well as supporting the continuing development of the ONIX for Books product metadata standard, significant work is being done to develop XML-based standards for communication of license terms, with current and potential applications across the whole publishing and media sector. Partly as a result of this work, in 2006 Francis was appointed by the World Association of Newspapers to be Technical Project Manager for the ACAP Project.


13.00-14.00 : Lunch/networking


14.00-14.40 : Peter Cox

Structured Authoring at the Open University

The Open University uses Word styles in Microsoft Word to identify the structural elements in the course texts it publishes. Word 2003 goes a step further by allowing an XML schema to be attached to a document. It provides a more controlled structure for the content within an environment our users are already familiar with. This presentation will describe the Structured Authoring project which uses a combination of WordprocessingML in Word 2003 and our own XML schema to produce XML files. These can then be rendered in a variety of ways such as to PDF using 3B2 pagination software, for input into Moodle which is used by OpenLearn and our Virtual Learning Environment, and for output to a DAISY Digital Talking Book that automatically generates an MP3 file using synthetic speech.


14.40-15.20 : Nick Perry, Pendragon

Word-SGML workflow for regulatory documents - review of ten years of experience

As aggregators of an increasingly broad source of documents and original publishers, Pendragon's SGML-based publishing process attempts to capture a wide variety of document types from highly-structured primary legislation to one-off newsletters and pamphlets in a small handful of DTDs. In his brief review, Nick describes how the main Word-based capture and conversion process has changed in the last decade and his aspirations for simplification and rationalization.

Following an early career tour through Government engineering, educational fundraising, print publishing and nascent Web development, Nick settled into his role as Technical Manager of Pendragon just over ten years ago. Pendragon has grown from a small business of the Thomson Corporation, via a period of independence, to ownership under the Waterlow and Wilmington organizations. Its Perspective product has held the dominant position in the niche Pensions sector of legal publishing for most of that time - largely through its comprehensive coverage and its ability to reconstitute legislation at any given date in the past or future.


15.20-15.50 : Coffee/networking


15.50-16.30 : Matt Deacon, Microsoft

Microsoft’s perspective on office formats: Interoperability by design?

In today's connected world, interoperability is as important as security and reliability for IT professionals. This is due to an increase in technical heterogeneity which drives more complexity within, and on the edge of, their IT infrastructures. This leads to a greater demand for data and information integration as organizations seek to optimize process performance. In this talk we will look at Microsoft's vision to address interoperability holistically in order to better connect people, data, and diverse systems, with particular focus on XML Office formats.

Matt Deacon is the Chief Architectural Advisor for the Developer and Platform Group at Microsoft Ltd, in the UK. His primary role is to serve as an advisor to Microsoft's customers, and the public, on all matters relating to the field and profession of IT Architecture.

He chairs the Microsoft UK Architect Council, a body of 30 senior industry Architects who provide feedback and advice to Microsoft on matters of product direction and strategy, and is the owner of the Microsoft Architect Forums and Architect Insight, Microsoft UK's premier 2-day architectural conference. Matt brings over 16 years' experience in the IT industry delivering many mission critical enterprise solutions on both Microsoft and Java platforms.

As founder of the UK region of the International Association of Software Architects (IASA) and now the IASA European Regional Chair, he is successfully building an active and informed community of IT architect professionals within the IT industry in the UK and across Europe.

Friday, June 13, 2008 8:36:14 AM UTC
 Monday, June 09, 2008
Further to my earlier post about the effect of SP2 on XML saved from Word, there are two ways to resolve this.

Either bite the bullet and write one of the more (the most?) mind-bending transformation scripts of your life, or use your status as a high-profile user of Microsoft wares and ask them just what all that extra stuff means and why it was put there.

A client of ours plumped for the latter and was promptly informed that the interpolated elements in the wsp-prefix namespace are there to help Word compare and merge documents - you get more of them, apparently, the more sessions of editing a document has been through. They can be turned off via Tools-Options-Security-Store Random Number, which should be unchecked. As far as we can see, all seems well after doing so.
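For files that were saved before the option was switched off, the clean-up can be less mind-bending than feared if all that is needed is to drop everything in the wsp namespace. What follows is only a rough sketch in Python, not a tested production filter. It assumes the wsp prefix is bound to the namespace URI shown (check the xmlns:wsp declaration in your own files) and that the wsp elements carry no document content of their own. The function name strip_namespace is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the URI bound to the wsp prefix in Word 2003 SP2 output.
# Verify it against the xmlns:wsp declaration in your own documents.
WSP_NS = "http://schemas.microsoft.com/office/word/2003/wordml/sp2"


def strip_namespace(root, ns_uri=WSP_NS):
    """Remove, in place, every element and attribute in the given namespace."""
    prefix = "{" + ns_uri + "}"
    for parent in list(root.iter()):
        # Drop attributes that live in the namespace.
        for name in [a for a in parent.attrib if a.startswith(prefix)]:
            del parent.attrib[name]
        # Drop child elements in the namespace, keeping any text that followed them.
        kept = []
        for child in list(parent):
            if isinstance(child.tag, str) and child.tag.startswith(prefix):
                if child.tail:
                    if kept:
                        kept[-1].tail = (kept[-1].tail or "") + child.tail
                    else:
                        parent.text = (parent.text or "") + child.tail
                parent.remove(child)
            else:
                kept.append(child)
    return root
```

Typical use would be to parse the saved file with ET.parse, call strip_namespace on the root element, and write the tree back out. Unchecking the Word option remains the simpler fix for new documents.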

- Andrew

Monday, June 09, 2008 8:27:46 AM UTC
 Thursday, May 22, 2008

So the rumours were true – Microsoft has announced that the next version of their Office suite will have native support for the OpenDocument Format (ODF). The recently standardised OOXML format will, it seems, be likely to play second fiddle, representing the “legacy” of MS Office documents that the world has accumulated to date.

Conformance

Microsoft have over the years gathered a deserved reputation as being one of the worst software companies in the world at respecting standards, particularly for their mainstream desktop software. Historically, Microsoft’s standards-spurning Internet Explorer web browser has been viewed as the boat anchor that held back the world wide web, and their Office suite has been notorious for its closed file formats that shifted endlessly according to the whim of Redmond’s developers.

However, the world has changed and increasingly such practices are seen as bad behaviour which need not be tolerated. High-quality open source alternatives with better standards support (such as the Firefox web browser) have gathered a significant user base, and governments and big business have begun to care about whether their office systems are based on International (ISO/IEC) IT Standards. Indeed it is clearly concerns about future revenue streams which have been the primary motivation behind Microsoft’s new-found enthusiasm for standardisation. They are, after all, a business.

Whatever Microsoft’s motivations, users are set to benefit from a world in which MS Office, easily the most used office software, has aligned itself with open, documented standards. But while announcements are all well and good, the true test of Microsoft’s commitment will be found in the byte-by-byte details of the files that Office reads and writes. ODF lays down some strict rules for how these XML documents must be formed in order to be conformant, and software exists for testing them – I look forward on this blog to holding the magnifying glass to Microsoft’s efforts to see if what is claimed to be Standard really is so. Success will deserve praise; failure will deserve correction.

Whither Microsoft’s opponents?

“What customers need is direct, internal support for ODF in MS Office”.

This is not a quote from Microsoft’s press release, but, a few days ago, from IBM’s Rob Weir – one of the most prominent and eloquent opponents of OOXML standardisation. His request is being answered. It becomes difficult to oppose Microsoft when they are systematically removing the stated reasons for opposition.

While the standards world contained many who argued against OOXML for entirely respectable reasons, the standardisation of OOXML also saw a concerted campaign against Microsoft from a variety of other quarters, including commercial competitors and some more extreme “activists” who are against any form of non-free software.

Microsoft’s move to support ODF now leaves very little reasonable ground for such opponents: those who are determined will surely be forced further into extremity. Some residual shrieks that Microsoft is trying to “embrace and extend” may linger, or maybe there will be mutterings that Microsoft are “poisoning the well” – but in the end these will be tired mantras that count for little: whether Microsoft is playing fair with their formats will become a testable fact. Religious arguments will not survive in that arena.

A parallel announcement is coming from a new group calling itself “Digistan”, which appears to have been founded in part from the rump of the dubious noooxml.org campaign (many of the same names crop up and the Digistan site lives on the same server as noooxml.org). Not lacking in grandiloquence, the members are signing a document entitled “The Hague Convention” which calls on governments to “procure only information technology that implements free and open standards”. An acid test of whether they are in earnest, or whether they are a proxy anti-Microsoft effort, can be had by watching their reaction to MS Office’s plan to use ODF. By Digistan’s own logic, MS Office is in line for government procurement alongside open source alternatives. Will they say so?

A Return to Normality

Software companies should compete on the basis of the quality of their products. The last 18 months have seen them try to compete by influencing the natural course of the International standards process. This has not been edifying, and has brought little or no benefit to anyone. Let us hope now that the argy-bargy goes away from the standards committees and software vendors return to writing software – preferably software that users want.

- Alex.

Thursday, May 22, 2008 4:18:45 AM UTC
 Sunday, May 04, 2008

Just when it seemed like nobody was interested in the ODF conformance smoke test posted a few days ago, IBM's Rob Weir weighs in with a lengthy piece in response.

Rob replicates the test I ran and runs a few of his own, finding ODF validation problems along the way and ending with an eyebrow-raising take on this which, I think, sells ODF seriously short.

But before getting to that, a few technical things need to be put straight.

Is the ODF schema broken?

One of the unexpected things I found in my test was that the ODF schema itself was broken, leading me to conclude that there could be no valid ODF 1.0 documents in existence as the schema simply could not be validated against.

Rob doesn't believe there's a problem here (though he allows "Alex's proposed changes to the schema are reasonable and should be considered" – too right!), and when he finds a validator reporting the error I mention, he blithely disables the reporting of that error so he can continue on to get a bunch of "error free" validation reports when validating the ODF 1.0 spec.

Why did Rob disable this error reporting? Well, he claims the standard allows him to – he writes that "there is no claim whatsoever [in the ODF spec] that a conformant ODF 1.0 document will conform to the ID/IDREF constraints defined in Relax NG DTD Compatibility". Crucially, this claim is misguided.

The ODF 1.0 spec makes explicit use of datatypes it names "ID" and "IDREF" – it states that these are the W3C types as defined in XML Schema Part 2. If we look in turn at this document, it defines both of these types, and states that they represent the same types from XML 1.0 (Second Edition). And if we look back to that document we see that both these types have a bunch of validity constraints which need to be tested, such as the need for every IDREF to correspond to some matching ID, or that ID values must be unique per document. To validate according to these definitions, a validator must respect the semantic constraints associated with these datatype definitions. (To return to the "dummies" level, we might read the helpful description from the XML Schema Primer which states: "XML 1.0 provides a mechanism for ensuring uniqueness using the ID attribute and its associated attributes IDREF and IDREFS. This mechanism is also provided in XML Schema through the ID, IDREF, and IDREFS simple types which can be used for declaring XML 1.0-style attributes"). By switching this functionality OFF Rob may be generating good spin for his blog, but he is not validating ODF correctly, as he is ignoring the very type correctness checking that the ODF spec mandates through its datatyping! (And worryingly, this gaffe has now been perpetuated in an (official?) OASIS TC Wiki, on an immutable page!)
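To make the two constraints concrete, here is a small Python sketch of what "respecting ID/IDREF semantics" amounts to. It is not how jing or msv implement the check. The function name check_id_idref is invented, and the short attribute lists (draw:id and text:id as IDs, draw:start-shape and draw:end-shape as IDREFs) are illustrative assumptions only. A real validator derives the typed attributes from the schema and also checks the lexical form of each value.

```python
import xml.etree.ElementTree as ET

DRAW = "{urn:oasis:names:tc:opendocument:xmlns:drawing:1.0}"
TEXT = "{urn:oasis:names:tc:opendocument:xmlns:text:1.0}"

# Illustrative assumption: a hand-picked subset of ID- and IDREF-typed attributes.
ID_ATTRS = [DRAW + "id", TEXT + "id"]
IDREF_ATTRS = [DRAW + "start-shape", DRAW + "end-shape"]


def check_id_idref(root, id_attrs=ID_ATTRS, idref_attrs=IDREF_ATTRS):
    """Report duplicate ID values and IDREF values that match no ID."""
    seen = set()
    problems = []
    # First pass: every ID value must be unique within the document.
    for element in root.iter():
        for name in id_attrs:
            value = element.get(name)
            if value is None:
                continue
            if value in seen:
                problems.append(f"duplicate ID {value!r}")
            seen.add(value)
    # Second pass: every IDREF must point at some ID collected above.
    for element in root.iter():
        for name in idref_attrs:
            value = element.get(name)
            if value is not None and value not in seen:
                problems.append(f"IDREF {value!r} matches no ID")
    return problems
```

An empty list means these two constraints hold. Disabling ID/IDREF checking in a validator is equivalent to never running this step at all.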

Coming at this from another direction, we could also take into account the fact that the RELAX NG used by ODF is not "pure" ISO/IEC 19757-2, but uses mechanisms from the OASIS past of RELAX NG. In particular, it declares:

datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"

and in so doing brings into play RELAX NG's XSD datatype emulation. The OASIS spec describing this feature is Guidelines for using W3C XML Schema Datatypes with RELAX NG, and this refers to the very RELAX NG compatibility features Rob claims we can safely ignore:

[DTD Compatibility] defines the concept of an ID-type, which is an additional semantic for datatypes that allows datatypes to have [XML 1.0] cross-reference semantics. An implementation of [DTD Compatibility] that supports these guidelines should associate the ID, IDREF and IDREFS datatypes of [W3C XML Schema Datatypes] with the ID-types ID, IDREF, and IDREFS respectively.

The jing validator does support these guidelines, and accordingly performs just such an association. As the co-author of the spec, James Clark (the author of jing) can be relied on - rather more than Rob - to know what functionality applies for a particular validation scenario.

So, both formally and informally we should not be disabling ID/IDREF awareness – and there is also a third, less dry technical reason why we should not: common sense. ID/IDREF testing provides a useful first line of defence for our document, and prevents such nonsenses as duplicate IDs or broken links. Without it, we could take the ODF spec as XML, make all the IDs in it identical, and then watch as Rob's validation method passed the resulting rubbish as "a-okay". So I'm sorry Rob, but on all three counts the "it's error free if we disable error testing" approach does not cut the mustard, and is simply not something the ODF spec entitles you to do.

Where I do agree is that we need to put this in perspective. Although these findings are interesting in the context of the OOXML furor, they do not signal anything particularly momentous about ODF. Defects get found; defects get fixed – the standard improves and everybody is happy. Right?

Negativity

Amid the general downer that is Rob's blog entry is an assumption that I share such negative thoughts. I find myself described as "someone who would be well served if he could show that all consortia standards are junk, and that only SC34 (and he himself) could make them good". Hmmmmm - where did that come from?

For the record, I am an enthusiastic supporter of consortia and consortium standards and know from experience that consortia contain great people who are producing some of the best standards work on the planet: XML 1.0, ODF, XSLT, UBL, OOXML (ha!) – the list goes on. Most recently I was very pleased to see a new working draft of the important new W3C XProc specification – something that SC 34 is specifically deferring to rather than attempting something similar itself. I thoroughly disapprove of the kind of oppositional mindset that sees things in a polarised "ISO vs OASIS" or "ISO vs W3C" way. In my view that mode of thinking already did enough damage during the DIS 29500 project.

Tools that produce valid ODF?

Rob continues, re-running the tests I performed and finding the same result. Rob quibbles with many aspects of the test (which is fine, this was just a "smoke test") but, after all the huffing and puffing is done, we are left with the cold, hard fact that OpenOffice.org 2.4 (and, as Rob demonstrates, the CleverAge converter) are not emitting valid ODF documents.

It's at this point that things get a bit odd. Faced with the invalid documents before him Rob writes:

Conformance requires that [an application] is capable of writing out a valid document. And of course, success for an ODF implementation requires that its conformance to the standard is sufficient to deliver on the promises of the standard, for interoperability.

No. A conformant application needs to be more than "capable of" writing valid documents. If it claims to be emitting ODF 1.0 then valid ODF 1.0 is what it has to emit – the ODF schema is normative, not an optional extra. If the application fails to do this, it is non-conformant and consequently has a bug which needs fixing. This is what I would expect to be the message to OpenOffice: it has some (mild-looking) ODF conformance bugs which need fixing. Let's fix the application, not try to re-define what conformance means and pretend all is well!

Rob then moves on to compare the corpus of ODF documents to HTML on the Web:

So I suggest that ODF has a far better validation record than HTML and the web have, and that is an encouraging statement.

"encouraging"!? err, sorry but again: no. To compare any document type collection to the validity rubbish-heap that is the Web's corpus of HTML is saying practically nothing and, I think, sells ODF seriously short of where it's at. What is "encouraging" to me is that the schema problems in the ODF schema, and the validity errors we find in ODF emitted by a major application (OpenOffice), are so comparatively minor. The prize is in sight - with some schema fixing and bug fixing we (the users) could be using an office application which worked reliably with a truly international standard (ODF 1.0 in this case). That is surely what we should all be aiming for. Inevitably, progress in this will be slower if defects, when found, meet with denial and obfuscation rather than a willingness to move forwards.

Homework

Now that interest seems to have been awakened in performing ODF (and OOXML) validation, perhaps it is worth investigating the 25 warning messages that msv emits when parsing the ODF 1.0 schema with warnings enabled? The last two are related to the ID/IDREF problem mentioned above and are fixed by applying my proposed resolution. But are the remaining 23 all spurious? – nothing seems wrong with the schema from a quick look (this is a genuine, not a rhetorical, question BTW).

And I again renew my call: I am very interested in hearing about any application that consistently emits valid ODF (or valid OOXML for that matter). Are there really none?

Moving forward

As I wrote many times (and as was repeatedly ignored) the smoke tests for OOXML and ODF validation were, by design, crude – they just give a rough idea whether all is well. Based on the results, it is apparent that a more thorough investigation of both formats (and their applications) would be of interest. Accordingly the next step is to start constructing a validation testing framework that:

  • Uses a varied suite of documents originated natively using office applications (MS Office, OpenOffice.org and others)
  • Goes beyond schema validation to apply semantic constraints described by the standards' text (using e.g. Schematron)
  • Correlates and presents the results in full

Watch this space ...

- Alex.

Sunday, May 04, 2008 12:40:14 PM UTC
 Wednesday, April 30, 2008

Following on from the recent smoke test of Office 2007 conformance to ISO/IEC 29500 here, as promised, is a repeat of the exercise using ISO/IEC 26300 (ODF 1.0).

Like OOXML, ODF has (sensibly) a schema defined using RELAX NG (ISO/IEC 19757-2). This schema is published in the standard itself and is available for download from OASIS.

ODF Schema Woes

The first problem encountered was in trying to use this schema. Both James Clark’s jing and Sun’s Multi-Schema Validator emitted error messages when processing it. Further investigation reveals that the schema has a critical flaw in the way its open models conflict with its typed attribute values. At the end of this blog entry is a detailed defect report with a proposal for how to fix the schema. By filing this I nail my colours to the mast as a staunch ODF supporter!

The consequence of this schema flaw is that the formal definition of document validity in ODF 1.0 is broken. I suspect tools which claim to use the schema with success are based on Libxml, whose RELAX NG validator is incomplete. Don’t trust them.

Imagine the outrage there'd have been if OOXML had passed with this kind of defect!

Getting an ODF Document

For parity with the OOXML test, I used the same document (Ecma 376 Part 4) for testing. This required several steps of conversion: from Ecma 376 format to Word binary, and then (using OpenOffice.org 2.4.0) from Word binary to ODF. The process took several hours, but in the end it produced a .odt file of approx 59MB.
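A practical aside for anyone wanting to repeat this: an .odt file is a zip package, and the document body that gets validated lives in the package member content.xml. A minimal Python sketch for pulling that part out is below. The helper name read_odf_part is made up here, and this is not the exact procedure used for the test.

```python
import zipfile


def read_odf_part(odt_path, part="content.xml"):
    """Return the raw bytes of one member (by default content.xml) of an ODF package."""
    with zipfile.ZipFile(odt_path) as package:
        return package.read(part)
```

The same call with part="styles.xml" or part="meta.xml" fetches the other XML members, which can be written to disk and handed to a validator.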

Validation Result

Validating the ODF document against the (patched) schema yielded 7,525 validation errors – mostly of the same type (use of an undeclared soft-page-break element).
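With thousands of messages, the quickest way to see that they are "mostly of the same type" is to tally them by message text. The Python sketch below assumes, and it is only an assumption, that each validator message is a single line of the form file:line:column: error: text. Adjust the pattern to whatever your validator actually prints. The name tally_errors is invented for the illustration.

```python
import re
from collections import Counter

# Assumption: one message per line, shaped like "file:line:column: error: text".
ERROR_LINE = re.compile(r"^(?P<file>.*):(?P<line>\d+):(?P<col>\d+): error: (?P<msg>.*)$")


def tally_errors(lines):
    """Count validation errors by message text, ignoring lines that do not match."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group("msg")] += 1
    return counts
```

Feeding it the saved validator output (for example, the lines of a log file) and calling most_common() on the result lists the dominant error types first.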

Conclusion

Again, only tentative conclusions can be drawn from a smoke test (readers unfamiliar with this term as applied to software testing are recommended to read the Wikipedia article on it before grumbling about the depth of the test, please).

  • For ISO/IEC 26300:2006 (ODF) in general, we can say that the standard itself has a defect which prevents any document claiming validity from being actually valid. Consequently, there are no XML documents in existence which are valid to ISO ODF.
  • Even if the schema is fixed, we can see that OpenOffice.org 2.4.0 does not produce valid XML documents. This is to be expected and is a mirror-case of what was found for MS Office 2007: while MS Office has not caught up with the ISO standard, OpenOffice has rather bypassed it (it aims at its consortium standard, just as MS Office does).

I’d be very interested to find an office application that does work with valid ISO/IEC 26300 content. Do any readers know of one?

Looking Forward

A smoke test only scratches the surface – a fuller document conformance test suite would give a much better idea of the semantic (as well as the syntactic) validity of documents that claim conformance to either 29500 or 26300.

Fortunately SC 34 has spent the past years working on exactly the kinds of technologies (ISO/IEC 19757, DSDL) that will allow a more complete validation of XML documents. I am hopeful that we will see some more meaningful testing in time, and note with interest that the Italian National Standards Body have invited participation in such activities.

The unfortunate reality for concerned users is that there are no office application suites on the planet that create XML valid to International Standards, although both MS Office and OpenOffice.org get you within sniffing distance. The remedies for this shortfall are for Microsoft (on the one hand) to update its Office product, and for ODF developers (on the other hand) to pay more attention to XML validity – especially when targeting the upcoming ISO standard version of ODF 1.2. The world is moving on, and users do not want to spend time battling with incorrect outputs of their office applications: they want a reliable format they can use to build further applications on. Let us hope the coming months and years will see marked improvements in document conformance levels!

N.B. As this blog entry “goes to press”, Jesper Lund Stocholm has posted a blog entry on ODF conformance which is also well worth reading.

I suspect neither his blog entry, nor this one, will receive as much attention as the one reporting findings on MS Office's XML! Let's see.





Defect Report ISO/IEC 26300:2006

Clause 16.2 defines an “open model” for custom content using two patterns, as follows:

<define name="anyAttListOrElements">
  <zeroOrMore>
    <attribute>
      <anyName/>
      <text/>
    </attribute>
  </zeroOrMore>
  <ref name="anyElements"/>
</define>
<define name="anyElements">
  <zeroOrMore>
    <element>
      <anyName/>
      <mixed>
        <ref name="anyAttListOrElements"/>
      </mixed>
    </element>
  </zeroOrMore>
</define>

Similar definitions are also used (clause 15.2) for the modelling of mathematical markup:

<!-- To avoid inclusion of the complete MathML schema, anything -->
<!-- is allowed within a math:math top-level element -->

<define name="mathMarkup">
  <zeroOrMore>
    <choice>
      <attribute>
        <anyName/>
      </attribute>
      <text/>
      <element>
        <anyName/>
        <ref name="mathMarkup"/>
      </element>
    </choice>
  </zeroOrMore>
</define>

However, the declaration of attributes here with any name and any value of any type conflicts with the declaration elsewhere in the schema of attributes that have an ID or IDREF type. Consequently the schema cannot be processed by validating processors which respect type consistency (e.g. jing [1], or msv [2] used with warnings enabled).

Proposed Solution

The schema must be corrected. This can be done by excluding the typed attributes from the custom model as follows:

  <define name="anyAttListOrElements">
<zeroOrMore>
<attribute>
<anyName>
<except>
<name>smil:targetElement</name>
<name>text:id</name>
<name>text:change-id</name>
<name>form:id</name>
<name>presentation:master-element</name>
<name>draw:id</name>
<name>anim:id</name>
<name>draw:shape-id</name>
<name>draw:end-shape</name>
<name>draw:start-shape</name>
<name>draw:control</name>
</except>
</anyName>
<text/>
</attribute>
</zeroOrMore>
<ref name="anyElements"/>
</define>
<define name="anyElements">
<zeroOrMore>
<element>
<anyName/>
<mixed>
<ref name="anyAttListOrElements"/>
</mixed>
</element>
</zeroOrMore>
</define>

If it is intended these attributes should be allowed in custom data, they should be re-included (correctly typed) as necessary.

In general, the custom data model should be revisited – is it really the intention that it should be so open?

Similarly, the math markup model would be better made more restrictive either by incorporating a MathML schema, or at least by restricting the allowed elements to certain Namespaces. For the time being it should at least re-use the custom model to avoid unnecessary replication of patterns.

References

[1] Jing - A RELAX NG validator in Java http://www.thaiopensource.com/relaxng/jing.html

[2] Sun Multi-Schema Validator https://msv.dev.java.net/

Wednesday, April 30, 2008 10:50:15 AM UTC  #    Disclaimer  |  Comments [21]  | 
 Wednesday, April 23, 2008


Empire State Building

To New York for three days of client meetings. With an afternoon free and very pleasant weather, what better way to spend the time than taking a trip up the Empire State Building? (The sign in the lobby said "visibility: 10 miles".)


Pigeons on the 86th floor

How nice to have three days of purely commercial work stretching ahead, with no OOXML or standards politics in sight. There is a certain clarity to doing technical work in an environment where the requirements are clearly on the table; and technically and conceptually the schema I'm working on here is miles ahead of OOXML/ODF - but maybe in saying that I'm influenced by the fact that I am the chief designer ;-)

ODF Conformance catch-up

When I get back to the UK I hope to post a blog entry on ODF conformance. I'm surprised nobody has risen to the challenge I issued in my last blog entry to predict the result. So, I renew the call! I'd be particularly interested in hearing about any ODF implementations that people think should be conformant …

One immediate problem came up in that the published RELAX NG schemas in the ISO standard (ISO/IEC 26300) appear to have a technical fault which makes them unusable. I wonder, am I the first person ever to make a serious attempt to validate an ODF document against its International Standard specification?

- Alex.
Wednesday, April 23, 2008 7:26:54 AM UTC  #    Disclaimer  |  Comments [8]  | 
 Thursday, April 17, 2008

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

Not having Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...

The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.

So we have a document and a RELAX NG schema. All that's necessary now is to use jing (or similar) and we can validate ...

Validating against the STRICT model

The STRICT conformance model is quite a bit different from Ecma 376, essentially because most of that format's most notorious features (non ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.

Sure enough, jing emitted 17MB (around 122,000) of invalidity messages when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values: I would expect a document which exercised a wider range of features to generate a more diverse set of error messages.
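With that many messages, a tally by message kind is more useful than reading the raw output. The following Python sketch groups messages after blanking out the quoted names. It assumes each message arrives on one line in the form file:line:column: error: text, and the sample lines are invented for illustration rather than copied from a real jing run.

```python
import re
from collections import Counter

# Assumed line shape: <file>:<line>:<column>: error: <message>
ERROR_LINE = re.compile(r"^(.*?):(\d+):(\d+): error: (.*)$")

# Hypothetical sample output, not taken from an actual validation run.
SAMPLE = [
    'document.xml:2:100: error: attribute "w:foo" not allowed here',
    'document.xml:2:250: error: attribute "w:bar" not allowed here',
    'document.xml:3:10: error: value of attribute "m:val" is invalid',
    "a line that is not an error message",
]


def tally(lines):
    """Count error messages by kind, ignoring the specific quoted names."""
    counts = Counter()
    for line in lines:
        match = ERROR_LINE.match(line.strip())
        if match is None:
            continue  # skip anything that is not an error line
        # Replace every quoted string so that similar messages group together.
        kind = re.sub(r'"[^"]*"', '"..."', match.group(4))
        counts[kind] += 1
    return counts


if __name__ == "__main__":
    for kind, count in tally(SAMPLE).most_common():
        print(count, kind)
```

For a real run, the lines would be read from a file holding the validator's saved output instead of the SAMPLE list.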

Validating against the TRANSITIONAL model

The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibility with Ecma 376 and to preserve XML structures at which legacy Office features could be targeted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.

Sure enough (again) the result is as expected: relatively few messages (84) are emitted and they are all of the same type complaining e.g. of the element:

<m:degHide m:val="on"/>
since the allowed attribute values for val are now "true", "false", etc. - this was one of the many tidying-up exercises performed at the BRM.
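The kind of fix-up this implies can be sketched in a few lines of Python. The sketch rewrites val attributes from "on"/"off" to "true"/"false". The namespace URI is a placeholder, the on/off mapping is an assumption drawn from the single example above, and the function name normalise_on_off is invented; a real converter would need to know which attributes are actually boolean-typed.

```python
import xml.etree.ElementTree as ET

M_NS = "http://example.com/math"  # placeholder namespace, not the real OOXML one
ON_OFF = {"on": "true", "off": "false"}  # assumed mapping


def normalise_on_off(xml_text):
    """Rewrite m:val="on"/"off" to "true"/"false"; return (new XML, change count)."""
    ET.register_namespace("m", M_NS)  # keep the m: prefix when serialising
    root = ET.fromstring(xml_text)
    key = "{%s}val" % M_NS
    changed = 0
    for element in root.iter():
        value = element.get(key)
        if value in ON_OFF:
            element.set(key, ON_OFF[value])
            changed += 1
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    source = (
        '<m:r xmlns:m="http://example.com/math">'
        '<m:degHide m:val="on"/><m:x m:val="3"/></m:r>'
    )
    fixed, count = normalise_on_off(source)
    print(count, fixed)
```

Values other than "on" and "off" are left untouched, so the step is safe to run more than once.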

Conclusions?

Such a test is only indicative, of course, but a few tentative conclusions can be drawn:

  • Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
  • Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
  • Making them conform to the TRANSITIONAL will require less of the same sort of surgery (since they're quite close to conformant as-is)

Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.

What's next?

To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?


- Alex.
Thursday, April 17, 2008 12:20:22 PM UTC  #    Disclaimer  |  Comments [9]  | 
 Thursday, March 13, 2008

There has been some interesting discussion on xml-dev recently about the future of XML, and in particular whether the XML specification itself needs to be fundamentally revisited. One idea that particularly interested me was that DTDs could/should be removed from the XML specification as they place a heavy burden on implementors and implementations in what is the Age of the Schema (apparently).

I think we can go a lot further than that, and that there is a general need to be able to communicate to a processor what features of the XML family a document uses. I think a good way to do this is with a PI that follows the XML declaration, so:

<?xml version="1.0"?>
<?profile dtd="no"?>

would do the trick in conveying that a document makes no use of DTD constructs.

We could go further:

<?xml version="1.0"?>
<?profile dtd="no" namespaces="no"?>

et voila we convey to our processor that there will be no use of XML Namespaces in a document. Conversely, specifying namespaces="yes" would tell a processor that support for that spec is required. Currently this sort of thing has to be done using ad hoc processor-specific features.

We could use this kind of mechanism to tell a processor whether it should recognize xml:id, XML Inclusions, etc. etc.
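To show how cheaply a processor could pick up such a declaration, here is a minimal Python sketch that reads a profile PI from the document prolog. The profile PI is the proposal made in this post, not an existing standard, and the function name read_profile and the pseudo-attribute parsing are assumptions of the sketch.

```python
import re
from xml.dom import minidom

# PI data has no formal attribute syntax, so pseudo-attributes are picked out by regex.
PSEUDO_ATTRIBUTE = re.compile(r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def read_profile(xml_text):
    """Return the pseudo-attributes of the first <?profile ...?> PI as a dict."""
    document = minidom.parseString(xml_text)
    for node in document.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "profile"):
            return {
                name: double_quoted or single_quoted
                for name, double_quoted, single_quoted
                in PSEUDO_ATTRIBUTE.findall(node.data)
            }
    return {}  # no profile PI: the document makes no claims


if __name__ == "__main__":
    text = '<?xml version="1.0"?>\n<?profile dtd="no" namespaces="no"?>\n<doc/>'
    print(read_profile(text))
```

A processor could then consult the returned dictionary before deciding which features to switch on.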

Getting more controversial

We can go further still. What about this?

<?xml version="1.0"?>
<?profile edition="4"?>

In an attempt to stop the slippety-slide of XML 1.0 fifth edition into our document space.

And what about this?

<?xml version="1.0"?>
<?profile attributes="no"?>

i.e., turning off a "core" feature of XML – the use of attributes. SML by the back door? Hmmmmmmm, I like.

And of course, such profiled XML documents would always be 100% conformant XML too ...

What's not to like? If I can just type it up we can have it fast-tracked through ISO in a jiffy ;-)

- Alex.
Thursday, March 13, 2008 12:42:30 PM UTC  #    Disclaimer  |  Comments [2]  | 
 Thursday, July 26, 2007

I have recently recommended to a large publishing client that they adopt RELAX NG as the basis of the formal definitions of their content, in preference to W3C XML Schema Definition Language (WXS).

There are lots of individual bits of information on why RELAX NG should be preferred all over the web. Here is an attempt to condense some of the key information into ten points …

1. A better spec means better interoperability

We, in common with many people working with WXS schemas, have been tripped up by interoperability problems caused by different tools having a different take on how WXS should be implemented. Even Microsoft, a developer who is generally sympathetic to WXS, has reported a number of interoperability problems, and that for its customers WXS had “stuffed up the ready interoperability they thought they were buying into with XML”. [1]

The root of such interoperability problems is that the WXS specification is notoriously hard to interpret. James Clark has called it “without doubt the hardest to understand specification that I have ever read”. [2] Little wonder then that mere mortal developers have difficulty interpreting it!

RELAX NG has, by contrast, a clear formal description of the semantics of a RELAX NG schema – and for those who want to skip the formal text of the standard, the technology can be clearly explained even in a short tutorial.

2. Availability of a compact syntax

Unlike WXS, RELAX NG has a compact syntax (as explained in this tutorial). Using it, a DTD like:

<!DOCTYPE addressBook [
<!ELEMENT addressBook (card*)>
<!ELEMENT card (name, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>

can be expressed with this syntax:

element addressBook {
element card {
element name { text },
element email { text }
}*
}

Much nicer!

3. The specification is a stable ISO standard

RELAX NG first became an OASIS standard in 2001 and then went through a full ISO standardisation process to become an ISO Standard (ISO 19757-2:2003 [free ZIP download]) in 2003. It has proved stable and complete from the start and no revisions to it are planned.

WXS emerged from a vendor-dominated consortium (the W3C), and is currently anticipated to be revised and released in a 'mostly compatible' version 1.1 and, later, revised to a 2.0 release. It is unclear what level of vendor support these new releases will enjoy.

4. No PSVI

The PSVI, or Post-Schema-Validation Infoset, is the result of validating a document against a WXS schema. It consists of the normal XML infoset, plus extra information that might have been gleaned from the schema, such as type information about content.

This is a bad thing.

The main reason why it's a bad thing is that it introduces into the processing model, information that cannot be expressed as XML. If a processing pipeline needs to make use of the kind of information embodied by the PSVI, then every step in that pipeline has to become PSVI-aware and the result is a tightly-coupled system that is no longer XML-based, but based on something other than the XML Infoset, the PSVI.

Both James Clark [3] and Elliotte Rusty Harold [4] say all that needs to be said about the perils of the PSVI.

5. No content defaulting

RELAX NG, at least in its ISO form, provides no mechanisms for content defaulting. For reasons why this is good, see this other blog entry.

6. Better datatyping support

WXS provides a set of datatypes that may be used to constrain and bind values in content. This is a good idea.

Unfortunately, there are a number of serious problems[5] with the way this has been done (and the fact that type information is communicated using the PSVI).

RELAX NG, in contrast, has the option of pluggable type libraries, which may be implemented through an API. Most validators ship with WXS-mirroring type libraries (if you must) too.

(In future, when we're all using pipeline processing for validation, a nice datatype language like DTLL could more properly perform the task of datatype validation.)

7. More sophisticated modelling

WXS gives us barely more sophistication in grammar modelling than DTDs did. RELAX NG introduces useful new features for modelling interdependent attribute and element content.

8. More sophisticated grammatical validation

WXS grammars have to be deterministic. RELAX NG grammars can be ambiguous.

Score one for WXS, you might think. But wait - WXS's means of preventing ambiguity is through a constraint called Unique Particle Attribution (UPA). The problem with this, as the Microsoft report notes, is that “it breaks idiomatic uses of XML”. So if you want to express a grammar like (title?,para+)|(title,subtitle?,para+) (i.e. subtitle is only permitted when there is a title) the UPA rule will prevent you, as a validator cannot know which 'branch' of the model it is following during validation. The problem becomes more acute if one starts adopting some of the wildcarding features permitted in WXS.

RELAX NG, on the other hand, will happily accommodate non-deterministic content models.
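The content model above is easy to check once the determinism requirement is dropped. As a rough illustration, the Python sketch below encodes each child element as a single letter and matches the sequence against a regular expression with the same two branches. Python's backtracking regex engine is only a stand-in here: RELAX NG validators use a different algorithm, and the names TOKENS and matches_model are invented.

```python
import re

# One letter per element name, so a child sequence becomes a short string.
TOKENS = {"title": "t", "subtitle": "s", "para": "p"}

# (title?, para+) | (title, subtitle?, para+)
MODEL = re.compile(r"(?:t?p+|ts?p+)\Z")


def matches_model(child_names):
    """Return True if the sequence of child element names fits the content model."""
    try:
        encoded = "".join(TOKENS[name] for name in child_names)
    except KeyError:
        return False  # an element the model does not know about
    return MODEL.match(encoded) is not None


if __name__ == "__main__":
    print(matches_model(["title", "subtitle", "para"]))
    print(matches_model(["subtitle", "para"]))
```

On seeing a title, the matcher does not need to know in advance which branch it is in; it simply tries the first and falls back to the second, which is exactly the freedom the UPA rule takes away.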

In most applications (and probably all publishing applications) the question of whether a governing schema's content model is deterministic or not, is a dry technicality, of absolutely no consequence to the work in hand.

9. Instances have no dependency

WXS schemas (like DTDs) provide a mechanism for associating an instance with a schema: the xsi:schemaLocation attribute. This is problematic in two ways: first, the W3C recommendation makes it optional for processors to use this mechanism - and so behaviour is unpredictable; secondly, this is a potential security problem: it is possible to specify an unwanted schema here knowing that an application may not be free to ignore it.

RELAX NG schemas, on the other hand, have no formal association with instances. The validation model is one in which the validation process has separate inputs for the data being tested, and the tests themselves - users do not have to validate a document each and every time it is processed.

10. Growing consensus

A growing number of key XML languages are being normatively defined using RELAX NG, such as XHTML 2.0, the Atom Syndication Format, OpenDocument Format and DocBook 5. It's clear (if there is a shift) which direction that shift is in, particularly for document-like modelling. And when Tim Bray, one of the original editors of XML 1.0 comes out against WXS it really is time to listen:

Everybody who actually touches the technology has known the truth for years, and it’s time to stop sweeping it under the rug. W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs. [6]

- Alex.

References

[1] Microsoft Corp., XML Schema Language Experience Report, http://www.w3.org/2005/05/25-schema/microsoft.html

[2] James Clark, RELAX NG and W3C XML Schema, http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html

[3] James Clark, PSVI considered harmful, http://osdir.com/ml/org.w3c.tag/2002-06/msg00118.html

[4] Elliotte Rusty Harold, Pretend There's No Such Thing as the PSVI, http://safari.awprofessional.com/0321150406/ch25 [pay-for content]

[5] Comments on XML Schema Datatype made by ISO/IEC JTC 1/SC 34/WG1, http://www.jtc1sc34.org/repository/0392.htm

[6] Tim Bray, Choose RELAX Now, http://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax

Thursday, July 26, 2007 8:05:16 AM UTC  #    Disclaimer  |  Comments [1]  | 
 Monday, July 23, 2007

Francis Cave and I are running training days on XML in Publishing this summer at The Publishing Training Centre at Book House.

This course is for those who have to manage the production of electronic content for a range of applications. It requires no prior knowledge. During it, participants will find out:

  • the basic principles of mark-up languages
  • the roles XML can play in publishing
  • what it is like to work with XML data.

The next session is scheduled for 25th September, and is already filling up. To book a place, please contact The Publishing Training Centre directly ...

- Alex.
Monday, July 23, 2007 2:22:37 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Thursday, April 26, 2007
XSLT transformations are the stock-in-trade of XML developers. In general, you shouldn't have to worry too much about how different engines work, but in some rare edge cases, it's a consideration.

Now, it would be foolish to rely on the order of attributes in your physical instance, since none is imposed on them by an XML processor. (OK, you could in theory rely on the ordering if you canonicalize the document first: "An element's attribute nodes are sorted lexicographically with namespace URI as the primary key and local name as the secondary key" -- W3C Canonical XML Version 1.0).

The ordering of attribute nodes in XPath is similarly undefined, so trusty XPath engines do not necessarily produce the same results when an element's attributes are processed together. Given the XML instance:

    <foo c='1' b='2' a='3'/>

and the XPath expression:

    name(//@*)


the result might be 'c', 'b' or 'a', and could conceivably differ between runs of the same engine.

That XPath expression is artificial and unlikely to be used seriously, but it serves to make the point: different XPath engines producing different results can all still be considered to have produced "correct" output.
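If a process genuinely needs a repeatable attribute order, it has to impose one itself. The Python sketch below sorts an element's attributes by namespace URI and then local name, echoing the Canonical XML rule quoted above. The function names are invented, and the sketch makes no claim about how any particular XPath or XSLT engine orders attribute nodes.

```python
import xml.etree.ElementTree as ET


def split_name(qualified_name):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if qualified_name.startswith("{"):
        uri, local = qualified_name[1:].split("}", 1)
        return uri, local
    return "", qualified_name


def attribute_names_in_canonical_order(element):
    """Attribute names sorted by namespace URI, then by local name."""
    return sorted(element.attrib, key=split_name)


if __name__ == "__main__":
    foo = ET.fromstring("<foo c='1' b='2' a='3'/>")
    print(attribute_names_in_canonical_order(foo))
```

Sorting explicitly like this gives the same answer whatever order the parser happens to deliver the attributes in.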

- Andrew

Thursday, April 26, 2007 9:50:36 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Wednesday, February 14, 2007

XML UK are holding a day conference entitled “Publishing 2.0” at Bletchley Park on Wednesday 25 April 2007.

Beyond being an eye-catching title, what we (as organisers) intend “Publishing 2.0” to mean is that the conference will examine some of the more cutting-edge applications of XML(ish) technology to publishing. We're putting together a cracking program which already includes:

And how cool a venue is Bletchley Park? To be in the presence of the ghost of Alan Turing adds an extra geeky frisson to the occasion.

A full programme will be announced shortly, but I confidently predict this event will sell out (the venue is limited to 100 people), so to reserve an early space contact XML UK with your credit card in hand.

- Alex.
Wednesday, February 14, 2007 9:50:35 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, February 13, 2007

… is the title of a paper I will be giving at The XTech 2007 Conference, which is to be held in Paris from 15 - 18 May.

The focus of the presentation is the new ONIX-PL XML language for expressing licences electronically, so that they are machine-processable. In the first instance the licenses being modelled are between publishers and libraries.

This will be a two-hander, with Francis Cave handling an explanation of the wider business issues, and me concentrating on how we've used Orbeon Forms as the basis of a web application for authoring and managing these complex electronic documents (see a sample license for an example). Francis has been hard at work on an innovative way of annotating XML schemas to affect how instances governed by them are rendered in XForms engines.

Now to write that paper ... Here's the abstract ...


Abstract

As more and more content is published electronically so the need for controlling access to it has risen. Early efforts in this field focused on copy-protection technologies (DRM), but a more enlightened approach emerges if instead content licenses can be agreed between parties and content then used according to that agreement.

This presentation focuses on the design and system implementations around the new ONIX-PL industry standard (developed by EDItEUR), for representing license agreements between content producers and content recipients. Early adopters of the standard are publishers, libraries and academic institutions wishing to agree licensing terms for the use of high-value scholarly content. ONIX-PL is a high-profile initiative enjoying support from JISC, DLF/ERMI and a number of US and European universities, commercial publishers and library systems vendors.

ONIX-PL license expressions are XML documents which are machine actionable. For that reason they need to capture with precise semantics the implications of the legal clauses they embody.

This presentation will examine the challenges of representing machine-actionable legal agreements using XML, and in particular look at the semantic web technologies considered and used (or rejected) in the XML model designs.

Standards and models are of no use if they have no implementation or take up. The presentation will therefore consider how EDItEUR chose to develop a free Open Source software application for authoring and managing these complex XML documents, and how ultimately a full range of Web 2.0 technologies including XForms, pipelining, and AJAX were necessary in consort with more established technologies such as XSLT, XHTML and J2EE, in order to have a web application that dealt properly with the problem space while meeting tight development deadlines.

The presentation will thus conclude with some real-world tales of software development and deployment (together with a demonstration) of licenses being created and used using EDItEUR’s chosen infrastructure technology, Orbeon Forms (whose developers the presenters have no affiliation with).

In summary, attendees can expect to learn:

  • why there is a need for electronic expressions of licenses
  • how XML and semantic technologies can be used for this purpose
  • what an XML electronic license expression looks like ‘for real’
  • why XML licenses need to be created by non-technical users
  • how to rapidly develop a web application for them, and the ‘real world’ software development challenges faced in doing so.
- Alex.
Tuesday, February 13, 2007 1:55:48 PM UTC  #    Disclaimer  |  Comments [0]  | 

… is a presentation that I won't be giving at The XTech 2007 Conference, as the proposal was not accepted (I will however be speaking on another topic). Based on my experience of speaking at, and reviewing for, XML conferences over several years the rejection of this paper surprises me. Maybe XTech really is losing the XML-focus of its XML Europe past.

I do hope somebody is covering DSDL, as the technologies it contains are important ones that deserve public airing.

Anyway, here's the abstract of the paper that didn't make it:


Description

ISO is expected shortly to standardise three new schema languages as part of DSDL. Learn about them, and the DSDL project as a whole, in this update.

Abstract

It has recently been proclaimed that “among the XML cognoscenti, the debate is effectively over. Everyone is choosing RELAX NG”. And indeed the early indicators are that RELAX NG is getting increasing traction (if still only being the grammar modelling language of the “cognoscenti”). So for example:

  • The W3C have defined XHTML 2.0 normatively using RELAX NG
  • Microsoft have agreed to have the schemas for Office re-expressed in RELAX NG as part of their standardisation effort
  • DocBook 5 is being primarily developed using RELAX NG.

But RELAX NG is only one part of a 10 part ISO standard: DSDL (or Document Schema Definition Languages, ISO 19757) aims to offer a complete family of XML validation languages, in which RELAX NG covers just the specialised area of regular-grammar-based validation.

The other two fully-standardised parts of DSDL (Schematron and NVDL) are also gaining wider adoption in public XML models and in implementations.

But DSDL is about to include, in their final forms, three new standards which are currently less well known, even among “the cognoscenti”: DTLL (Datatype Library Language), DSRL (Document Schema Renaming Language) and Datatype- and Namespace-aware DTDs.

Drawn from real world experience in the ISO working group, and in editing and implementing part of DSDL, this presentation will include a description of DSDL, and in particular will set out the function of the three lesser-known parts which are soon to be standardised. It will explain why DSDL as a whole offers an elegant and complete solution to the problems of XML validation, and why users should care.

  • DTLL will introduce data-typing into the validation mix in a way which overcomes the limitations of W3C Schema’s fixed typing scheme. It will allow users to define their own type libraries in elegant declarative XML.
  • Influenced by architectural forms, DSRL acts as a schema adapter, allowing users to validate XML as though it were valid to a schema, by modifying it ‘on the fly’. As such it powerfully supports internationalisation and content defaulting.
  • Part 9 of DSDL will retro-fit some of its major features into DTDs, allowing users with heavy investment in DTD technology to get more life out of them.

In summary, attendees will hear:

  • A conceptual overview of the need for DSDL and an appreciation of the problem space it addresses
  • What the 10 parts of DSDL are
  • A more detailed description of the three upcoming parts of DSDL
  • Examples and/or demonstrations of these in action
  • A report of progress made in working-group meetings running alongside XTech 2007
  • A roadmap for the completion of the project and details on how to get involved.
Tuesday, February 13, 2007 1:36:27 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, January 09, 2007

XML DTDs and W3C Schemas both have mechanisms that allow default information to be supplied at validation time. Here's an example of this from the W3C's own XHTML Schema Module Implementations for attributes of the input element:

<xs:attribute name="type" default="submit">

When XHTML instances are validated against a schema with this construct, the validation process will default the value "submit" for the type attribute, if none is present in the instance being validated.

In the early days of the company we were enthusiastic content defaulters, priding ourselves on designing DTDs that could 'take the strain' by providing default information that might otherwise have to be tediously keyed in. But now we think this is bad practice. Here's why:

1. Conceptual confusion. There's validation, and there's transformation. A schema (or DTD) should be used for validation, and transformation languages (like XSLT) for transformation. Trying to do both jobs in one language confuses these concerns.

2. Defaulting models can't be expressed with RELAX NG. RELAX NG as standardised by ISO contains no mechanisms for defaulting content (unlike its OASIS-standardised predecessor). So content models which expect the schema language to provide default content won't be expressible in RELAX NG.

3. WYSINWYG. What you see is not what you get. One of the great strengths of XML (despite the W3C's wrong-headed pronouncement that XML is not meant to be read) is that you can open up an XML document in a text editor and actually see, at the text 'n' tags level, what is going on. But if a schema or DTD will be providing default content, you might not be seeing everything 'in' the instance.

4. Having to take the Schema everywhere. Relatedly, if the DTD or schema provides default information, that DTD or schema must be present every time its governed instances are parsed, as such documents are not standalone. This is an overhead.

5. The Namespace problem. See here for reasons why providing Namespace support through content defaulting can be tricky.

6. Saving typing is sooooo over. Content defaulting, like tag minimisation features, emerges from an earlier era where saving keystrokes and precious VDU screen real estate were important concerns. These things are generally less important now, and - if they are - there are better tools for the job than content-defaulting schemas or DTDs.
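Reason 1 can be illustrated by moving the defaulting out of the schema and into a visible transformation step. In the Python sketch below, Python stands in for XSLT purely for brevity; the DEFAULTS table mirrors the type="submit" example above, and the function name apply_defaults and the un-namespaced element names are assumptions of the sketch.

```python
import xml.etree.ElementTree as ET

# Element name -> attribute defaults to write into the instance (illustrative only).
DEFAULTS = {"input": {"type": "submit"}}


def apply_defaults(xml_text, defaults=DEFAULTS):
    """Write missing default attribute values explicitly into the document."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        for name, value in defaults.get(element.tag, {}).items():
            if element.get(name) is None:  # never override an authored value
                element.set(name, value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(apply_defaults('<form><input/><input type="text"/></form>'))
```

After this step the instance carries every value in plain sight, so it can be validated and processed without the schema having to supply anything.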

Okay, so that's six reasons. Under-promising and over-delivering again ;-)

Tuesday, January 09, 2007 8:17:00 PM UTC  #    Disclaimer  |  Comments [0]  | 

XML users who choose DTDs for modelling are faced with practical problems caused by the lack of information in the XML specifications about how XML DTDs and XML Namespaces should co-exist. Strictly speaking there is no mechanism by which DTD users can specify that element definitions are declared in a particular namespace - however a frequently-seen approach to trying to achieve this (as in the W3C's own XHTML 1.0 DTDs) is to declare an attribute called 'xmlns' and fix a Namespace URI to it like this:

<?xml version='1.0'?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a xmlns CDATA #FIXED "http://example.com/ns">
]>
<a/>

There are, however, a number of problems with this, not least the fact that 'xmlns' is not (and cannot be - the sequence x m l is reserved at the start of XML names) an attribute. However, we might close our eyes, cross our fingers and hope the hack works.

And we'd be okay in most cases. The Xerces and rxp parsers, for example, happily process this XML and behave as if the element a is associated with the Namespace URI http://example.com/ns.

Microsoft parsers, however, will not parse this content. They halt with the message "Use of default namespace declaration attribute in DTD not supported." Any Windows user launching such XML for viewing in Internet Explorer (as customers do) will get this message. Microsoft's developers are arguably quite correct in doing this, but as in the well-known joke perhaps also unhelpful.

What to do?

What we do is this:

<?xml version='1.0'?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a xmlns CDATA #REQUIRED >
]>
<a/>

… and in the documentation accompanying the DTD specify that instances must specify the Namespace properly using the usual xmlns mechanism (ideally this is enforced in another validation layer, with Schematron e.g.). This way the XML is 'correct' and all conformant parsers - including Microsoft's - are happy.
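The extra validation layer need not be elaborate. As a stand-in for the Schematron check mentioned above, this Python sketch confirms that the root element of an instance really is in the expected namespace. The function names are invented, and EXPECTED_NS reuses the example URI from the DTD above.

```python
import xml.etree.ElementTree as ET

EXPECTED_NS = "http://example.com/ns"  # the example URI used in this entry


def root_namespace(xml_text):
    """Return the namespace URI of the root element, or None if it has none."""
    root = ET.fromstring(xml_text)
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


def declares_expected_namespace(xml_text, expected=EXPECTED_NS):
    """True only if the instance itself puts its root element in the expected namespace."""
    return root_namespace(xml_text) == expected


if __name__ == "__main__":
    print(declares_expected_namespace('<a xmlns="http://example.com/ns"/>'))
    print(declares_expected_namespace("<a/>"))
```

Because the parser used here is non-validating, the check does not depend on the DTD being available at all.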

The only downside (one might think) is that users have to go to the trouble of putting that Namespace URI in the instance, rather than letting the DTD supply it for them. But we think this is a good thing — why we think that is the subject of another blog entry.

Tuesday, January 09, 2007 7:59:52 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Thursday, January 04, 2007

The company is 10 years old in February! We've been discussing what suitably eye-catching initiative it would be good to celebrate with.

One idea was a "back to 1997 prices" promotion for our consulting rates … but on second thoughts maybe the market couldn't stand such a steep rise ;-)

A Happy New Year to everyone!

- Alex.
Thursday, January 04, 2007 3:56:59 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, December 05, 2006

There are many areas of computing and development where what might be termed questions of "taste" apply. A function should fit on a screen; a module should contain no more than a few dozen functions; a database table should have no more than a few dozen fields, etc.

The same question applies to XML document instances, and in general they shouldn't be more than a few megabytes in size. If you find you're working with XML and your documents are often bigger than this, more often than not it's a symptom of a deeper architectural malaise.

The reasons for having small functions, etc, are not just capricious – things that are smaller are easier to debug, view and maintain — in short easier to comprehend, given that our puny human brains can only function effectively when not overloaded with content.

XML documents should be human-consumable too, despite the W3C's wrong-headed strictures that XML isn't meant to be read. This means they should be small enough to be worked with by human beings, and shouldn't be too big.

What is this "too big"? Well, a clue lies in the fact that XML describes "documents", not "reference-libraries" or "databases". XML documents are well-suited to representing things like book chapters, journal articles, employee records, or tax returns; they are not good at modelling (as single standalone documents) book collections, a journal series, a large company's employee details, or a country's tax return collection. While (the old joke runs) software engineers are only interested in three numbers (zero, one and infinity), this is damaging when it shades into a temptation to say that a system must model "one" XML document. We must learn to be comfortable with "some" documents, of a certain size.

When dealing with complex systems, human minds seem happiest when they have clear divisions between the mental contexts in which they apprehend parts of that system. So in the "complex system" we call life, we might of a morning leave the house, open the garage door and get into our car. It is of no help to us at all to learn that the house, the garage and the car are all types of container (true though that is); for our human brains, knowing the type similarity between things is usually just unwelcome noise [1]. In the same way, creating usable software can require being canny about concealing the underlying similarity between things. We see a folder on our desktop containing files: it is of no help to us at all to know that the files, the folder, and the desktop itself, are all "file system entries" of one kind or another.

In the same way, usable XML systems have three levels of mental context. At the core is text content, with its own rules and appeal to our mind; above this is the XML structure (elements and attributes) of the document; above this is the storage system for the documents — whether they are files in a file system or entries in a database. We should embrace this storage layer as a useful abstraction, instead of trying to expand the remit of the document to supplant it.

The ultimate exemplar of this model is, of course, the Web itself. It is not a single document, and cannot be represented as such (it has no root, or starting point). In smaller systems this multi-document model is exemplified by such things as "collections" of XML documents. XML storage applications such as eXist have such collections as a central feature of their storage philosophy.
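To make the "collection" idea concrete, here is a minimal Python sketch in which a directory of small files stands in for a collection, and a query is answered by visiting each document in turn. This is not eXist's API; the employee records, element names and function names are all hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def write_collection(directory, records):
    """Store each (name, department) record as its own small XML document."""
    for name, department in records:
        employee = ET.Element("employee")
        ET.SubElement(employee, "name").text = name
        ET.SubElement(employee, "department").text = department
        path = Path(directory) / (name.lower() + ".xml")
        ET.ElementTree(employee).write(str(path), encoding="utf-8")


def names_in_department(directory, department):
    """Query the collection by parsing one small document at a time."""
    names = []
    for path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(str(path)).getroot()
        if root.findtext("department") == department:
            names.append(root.findtext("name"))
    return names


# Hypothetical data: three employee records, one document each.
with tempfile.TemporaryDirectory() as tmp:
    write_collection(tmp, [("Ann", "Sales"), ("Bob", "IT"), ("Cy", "Sales")])
    sales = names_in_department(tmp, "Sales")

print(sales)
```

Only one small document is held in memory at any moment, and each record remains a file that can be opened, emailed or edited on its own.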

Fragments can be combined, of course, to make composites of arbitrary size — but the correct way to do this is to use linking and embedding technologies so that the overall collection is a loosely-coupled assembly of reasonably-sized documents, not to munge them into some XML mega-document which you rely on a fragmentation tool to extract them from before they can be used.
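One possible embedding technology is XInclude, which is an assumption here since the text names no specific mechanism. The minimal Python sketch below assembles a composite from two small documents using the standard library's ElementInclude module; the document names, their content and the in-memory store are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical small documents, keyed by the href used to reference them.
CHAPTERS = {
    "chapter1.xml": "<chapter><title>One</title></chapter>",
    "chapter2.xml": "<chapter><title>Two</title></chapter>",
}

# The composite holds only references; the chapters stay separate documents.
BOOK = """<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="chapter1.xml"/>
  <xi:include href="chapter2.xml"/>
</book>"""


def load_from_store(href, parse, encoding=None):
    """Loader for ElementInclude: resolve an href against the in-memory store."""
    if parse != "xml":
        raise ValueError("this sketch only handles parse='xml'")
    return ET.fromstring(CHAPTERS[href])


def assemble(book_text):
    """Expand the xi:include elements in place and return the composite root."""
    root = ET.fromstring(book_text)
    ElementInclude.include(root, loader=load_from_store)
    return root


book = assemble(BOOK)
titles = [title.text for title in book.iter("title")]
print(titles)
```

The composite is built on demand for a particular task, while the stored units remain the reasonably-sized chapter documents.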

If the above paragraphs set out abstract reasons for preferring small documents, there are a number of practical considerations too:

  • Small documents are friendly to other desktop applications, so they can easily be opened in a text editor, emailed around, and visualised with web browsers and stylesheets.
  • Even some dedicated commercial XML editors can bog down horribly when they are fed big XML documents to edit.
  • Day-to-day XML processing tasks (such as XSLT transformations) rely on an in-memory representation of the XML being built. A Xerces-J DOM, for example, takes 7 times as much memory as the original document, so on typical desktop machines processing documents of more than a few hundred MB is difficult or impossible.
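The memory arithmetic in the last point can be sketched in a few lines of Python. The factor of 7 is the Xerces-J figure quoted above; the 2048 MB memory budget and the function names are assumptions made for illustration, and other parsers will have different overheads.

```python
# Multiplier quoted in the text for a Xerces-J DOM; other parsers will differ.
DOM_OVERHEAD_FACTOR = 7


def estimated_dom_megabytes(document_megabytes, factor=DOM_OVERHEAD_FACTOR):
    """Rough in-memory size of a DOM built from a document of the given size."""
    return document_megabytes * factor


def fits_in_memory(document_megabytes, available_megabytes, factor=DOM_OVERHEAD_FACTOR):
    """True if the estimated DOM fits in the memory available to the process."""
    return estimated_dom_megabytes(document_megabytes, factor) <= available_megabytes


# Hypothetical desktop with 2048 MB available to the processing tool.
for size in (5, 300):
    print(size, estimated_dom_megabytes(size), fits_in_memory(size, 2048))
```

Under these assumptions a 5 MB document needs about 35 MB, while a 300 MB document needs about 2100 MB and no longer fits.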

So, when designing a model or a system, look for the human-sized things (there always are some) that can be modelled as optimally-sized XML documents, and build around them. In our experience, users and developers will both thank you in the end.

- Alex.

[1] Except for poets, but that is a different story.
Tuesday, December 05, 2006 8:26:37 PM UTC

Tuesday, November 28, 2006

… so blogs Tim Bray, in comment to Elliotte Rusty Harold's piece, RELAX Wins.

There, people are finally coming out and saying it — and as long term Schema skeptics1 we're pleased to see the view expressed, if only to make clear that this topic is still open.

However, the commercial reality is not that simple, as Michael Champion outlines in an xml-dev posting. The issue is tooling. Many