Thursday, July 26, 2007

I have recently recommended to a large publishing client that they adopt RELAX NG as the basis of the formal definitions of their content, in preference to W3C XML Schema Definition Language (WXS).

Scattered all over the web are lots of individual bits of information on why RELAX NG should be preferred. Here is an attempt to condense some of the key information into ten points …

1. A better spec means better interoperability

We, in common with many people working with WXS schemas, have been tripped up by interoperability problems caused by different tools having a different take on how WXS should be implemented. Even Microsoft, a developer who is generally sympathetic to WXS, has reported a number of interoperability problems, noting that for its customers WXS had “stuffed up the ready interoperability they thought they were buying into with XML”. [1]

The root of such interoperability problems is that the WXS specification is notoriously hard to interpret. James Clark has called it “without doubt the hardest to understand specification that I have ever read”. [2] Little wonder then that mere mortal developers have difficulty interpreting it!

RELAX NG has, by contrast, a clear formal description of the semantics of a RELAX NG schema – and for those who want to skip the formal text of the standard, the technology can be clearly explained even in a short tutorial.

2. Availability of a compact syntax

Unlike WXS, RELAX NG has a compact syntax (as explained in this tutorial). Using it, a DTD like:

<!DOCTYPE addressBook [
<!ELEMENT addressBook (card*)>
<!ELEMENT card (name, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>

can be expressed with this syntax:

element addressBook {
  element card {
    element name { text },
    element email { text }
  }*
}

Much nicer!

3. The specification is a stable ISO standard

RELAX NG first became an OASIS standard in 2001 and then went through a full ISO standardisation process to become an ISO Standard (ISO 19757-2:2003 [free ZIP download]) in 2003. It has proved stable and complete from the start and no revisions to it are planned.

WXS emerged from a vendor-dominated consortium (the W3C), and is currently anticipated to be revised and released in a 'mostly compatible' version 1.1 and, later, revised to a 2.0 release. It is unclear what level of vendor support these new releases will enjoy.

4. No PSVI

The PSVI, or Post-Schema-Validation Infoset, is the result of validating a document against a WXS schema. It consists of the normal XML infoset, plus extra information that might have been gleaned from the schema, such as type information about content.

This is a bad thing.

The main reason why it's a bad thing is that it introduces into the processing model information that cannot be expressed as XML. If a processing pipeline needs to make use of the kind of information embodied by the PSVI, then every step in that pipeline has to become PSVI-aware, and the result is a tightly-coupled system that is no longer XML-based, but based on something other than the XML Infoset: the PSVI.

Both James Clark [3] and Elliotte Rusty Harold [4] say all that needs to be said about the perils of the PSVI.

5. No content defaulting

RELAX NG, at least in its ISO form, provides no mechanisms for content defaulting. For reasons why this is good, see this other blog entry.

6. Better datatyping support

WXS provides a set of datatypes that may be used to constrain and bind values in content. This is a good idea.

Unfortunately, there are a number of serious problems[5] with the way this has been done (and the fact that type information is communicated using the PSVI).

RELAX NG, in contrast, has the option of pluggable type libraries, which may be implemented through an API. Most validators also ship with WXS-mirroring type libraries (if you must).
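
As a rough sketch of what the WXS-mirroring library looks like in use (the price and currency names are hypothetical, not taken from any real schema), the compact syntax lets a pattern reference a datatype by prefix and narrow it with parameters:

```
# Bind a prefix to the datatype library that mirrors the WXS types.
# (The xsd prefix is predeclared in the compact syntax; it is spelled out here for clarity.)
datatypes xsd = "http://www.w3.org/2001/XMLSchema-datatypes"

element price {
  # Three upper-case letters, e.g. a currency code
  attribute currency { xsd:string { pattern = "[A-Z]{3}" } },
  # A non-negative decimal value as the element content
  xsd:decimal { minInclusive = "0" }
}
```

Using a different library should only mean binding a different URI in the datatypes declaration, assuming the validator in use has an implementation for it.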

(In future, when we're all using pipeline processing for validation, a nice datatype language like DTLL could more properly perform the task of datatype validation.)

7. More sophisticated modelling

WXS gives us barely more sophistication in grammar modelling than DTDs did. RELAX NG introduces useful new features for modelling interdependent attribute and element content.
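
A minimal sketch of the sort of thing this means, adapted from the address book example above (the choice of attributes is an assumption for illustration): each of name and email may be given either as an attribute or as a child element, but not both.

```
element addressBook {
  element card {
    # name is supplied exactly once, as an attribute or as an element
    (attribute name { text } | element name { text }),
    # likewise for email
    (attribute email { text } | element email { text })
  }*
}
```

Neither DTDs nor WXS 1.0 treat attributes as part of the content model, so a choice of this kind is awkward to state there.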

8. More sophisticated grammatical validation

WXS grammars have to be deterministic. RELAX NG grammars can be ambiguous.

Score one for WXS, you might think. But wait - WXS's means of preventing ambiguity is through a constraint called Unique Particle Attribution (UPA). The problem with this, as the Microsoft report notes, is that “it breaks idiomatic uses of XML”. So if you want to express a grammar like (title?,para+)|(title,subtitle?,para+) (i.e. subtitle is only permitted when there is a title) the UPA rule will prevent you, as a validator cannot know which 'branch' of the model it is following during validation. The problem becomes more acute if one starts adopting some of the wildcarding features permitted in WXS.

RELAX NG, on the other hand, will happily accommodate non-deterministic content models.
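
For instance, the title and subtitle model above might be written like this in the compact syntax (the section wrapper element is an assumption added to make a complete pattern):

```
element section {
  # Branch 1: optional title, no subtitle
  (element title { text }?, element para { text }+)
  # Branch 2: title required, subtitle optional
  | (element title { text }, element subtitle { text }?, element para { text }+)
}
```

A validator reading a title cannot tell which branch it is in until it sees what follows, and RELAX NG does not require it to.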

In most applications (and probably all publishing applications) the question of whether a governing schema's content model is deterministic or not is a dry technicality, of absolutely no consequence to the work in hand.

9. Instances have no dependency

WXS schemas (like DTDs) provide a mechanism for associating an instance with a schema: the xsi:schemaLocation attribute. This is problematic in two ways: first, the W3C recommendation makes it optional for processors to use this mechanism, and so behaviour is unpredictable; secondly, this is a potential security problem: it is possible to specify an unwanted schema here knowing that an application may not be free to ignore it.

RELAX NG schemas, on the other hand, have no formal association with instances. The validation model is one in which the validation process has separate inputs for the data being tested and the tests themselves. Users do not have to validate a document each and every time it is processed.

10. Growing consensus

A growing number of key XML languages are being normatively defined using RELAX NG, such as XHTML 2.0, the Atom Syndication Format, OpenDocument Format and DocBook 5. It's clear (if there is a shift) which direction that shift is in, particularly for document-like modelling. And when Tim Bray, one of the original editors of XML 1.0, comes out against WXS it really is time to listen:

Everybody who actually touches the technology has known the truth for years, and it’s time to stop sweeping it under the rug. W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs. [6]

- Alex.

References

[1] Microsoft Corp., XML Schema Language Experience Report, http://www.w3.org/2005/05/25-schema/microsoft.html

[2] James Clark, RELAX NG and W3C XML Schema, http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html

[3] James Clark, PSVI considered harmful, http://osdir.com/ml/org.w3c.tag/2002-06/msg00118.html

[4] Elliotte Rusty Harold, Pretend There's No Such Thing as the PSVI, http://safari.awprofessional.com/0321150406/ch25 [pay-for content]

[5] Comments on XML Schema Datatype made by ISO/IEC JTC 1/SC 34/WG1, http://www.jtc1sc34.org/repository/0392.htm

[6] Tim Bray, Choose RELAX Now, http://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax
