Sunday, May 04, 2008

Just when it seemed like nobody was interested in the ODF conformance smoke test posted a few days ago, IBM's Rob Weir weighs in with a lengthy piece in response.

Rob replicates the test I ran and runs a few of his own, finding ODF validation problems along the way and ending with an eyebrow-raising take on this which, I think, sells ODF seriously short.

But before getting to that, a few technical things need to be put straight.

Is the ODF schema broken?

One of the unexpected things I found in my test was that the ODF schema itself was broken, leading me to conclude that there could be no valid ODF 1.0 documents in existence as the schema simply could not be validated against.

Rob doesn't believe there's a problem here (though he allows "Alex's proposed changes to the schema are reasonable and should be considered" – too right!), and when he finds a validator reporting the error I mention, he blithely disables the reporting of that error so he can continue on to get a bunch of "error free" validation reports when validating the ODF 1.0 spec.

Why did Rob disable this error reporting? Well, he claims the standard allows him to – he writes that "there is no claim whatsoever [in the ODF spec] that a conformant ODF 1.0 document will conform to the ID/IDREF constraints defined in Relax NG DTD Compatibility". Crucially, this claim is misguided.

The ODF 1.0 spec makes explicit use of datatypes it names "ID" and "IDREF" – it states that these are the W3C types as defined in XML Schema Part 2. If we look in turn at this document, it defines both of these types, and states that they represent the same types from XML 1.0 (Second Edition). And if we look back to that document we see that both these types have a bunch of validity constraints which need to be tested, such as the need for every IDREF to correspond to some matching ID, or that ID values must be unique per document. To be valid according to these definitions a validator must respect the semantic constraints associated with these datatype definitions. (To return to the "dummies" level, we might read the helpful description from the XML Schema Primer which states: "XML 1.0 provides a mechanism for ensuring uniqueness using the ID attribute and its associated attributes IDREF and IDREFS. This mechanism is also provided in XML Schema through the ID, IDREF, and IDREFS simple types which can be used for declaring XML 1.0-style attributes".) By switching this functionality OFF Rob may be generating good spin for his blog, but he is not validating ODF correctly, as he is ignoring the very type correctness checking that the ODF spec mandates through its datatyping! (And worryingly, this gaffe has now been perpetuated in an (official?) OASIS TC Wiki, on an immutable page!)

Coming at this from another direction, we could also take into account the fact that the RELAX NG used by ODF is not "pure" ISO/IEC 19757-2, but uses mechanisms from the OASIS past of RELAX NG. In particular, it declares:

datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes"

and in so doing brings into play RELAX NG's schema XSD datatype emulation. The OASIS spec describing this feature is Guidelines for using W3C XML Schema Datatypes with RELAX NG and this refers to the very RELAX NG compatibility features Rob claims we can safely ignore:

[DTD Compatibility] defines the concept of an ID-type, which is an additional semantic for datatypes that allows datatypes to have [XML 1.0] cross-reference semantics. An implementation of [DTD Compatibility] that supports these guidelines should associate the ID, IDREF and IDREFS datatypes of [W3C XML Schema Datatypes] with the ID-types ID, IDREF, and IDREFS respectively.

The jing validator does support these guidelines, and accordingly performs just such an association. As the co-author of the spec, James Clark (the author of jing) can be relied on - rather more than Rob - to know what functionality applies for a particular validation scenario.

So, both formally and informally we should not be disabling ID/IDREF awareness – and there is also a third, less dry technical reason why we should not: common sense. The ID/IDREF testing performs a useful first-line of defense testing on our document, and prevents such nonsenses as duplicate IDs or broken links. Without it, we could take the ODF spec as XML, make all the IDs in it identical, and then watch as Rob's validation method passed the resulting rubbish all as "a-okay". So I'm sorry Rob, but on all three counts the "it's error free if we disable error testing" approach does not cut the mustard, and is simply not something the ODF spec entitles you to do.
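To make the "first line of defense" point concrete, here is a minimal sketch in Python of the two checks at stake: ID values must be unique, and every IDREF must point at some ID. It is not what jing does internally. The function name and the attribute names `id` and `target` are invented for illustration, and a real check would take the ID-typed attribute names from the schema (namespaced attributes appear in ElementTree as `{uri}local`).

```python
import xml.etree.ElementTree as ET

def check_id_idref(xml_text, id_attrs, idref_attrs):
    """Return a list of problems: duplicate IDs and IDREFs with no matching ID.

    id_attrs / idref_attrs are sets of attribute names to treat as ID / IDREF
    typed (an assumption of this sketch; a validator reads this from the schema).
    """
    root = ET.fromstring(xml_text)
    seen = set()
    refs = []
    problems = []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name in id_attrs:
                if value in seen:
                    problems.append(f"duplicate ID: {value}")
                seen.add(value)
            elif name in idref_attrs:
                refs.append(value)
    # References are resolved only after the whole document has been read,
    # because an IDREF may legitimately point forwards.
    for value in refs:
        if value not in seen:
            problems.append(f"dangling IDREF: {value}")
    return problems

# Hypothetical documents, invented for this example.
GOOD = '<doc><p id="a"/><p id="b"/><link target="a"/></doc>'
BAD = '<doc><p id="a"/><p id="a"/><link target="c"/></doc>'

print(check_id_idref(GOOD, {"id"}, {"target"}))
print(check_id_idref(BAD, {"id"}, {"target"}))
```

With the check disabled, the second document would be reported as error free, which is exactly the "all IDs identical" rubbish described above.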

Where I do agree is that we need to put this in perspective. Although these findings are interesting in the context of the OOXML furor, they do not signal anything particularly momentous about ODF. Defects get found; defects get fixed – the standard improves and everybody is happy. Right?

Negativity

Amid the general downer that is Rob's blog entry, is an assumption that I share such negative thoughts. I find myself described as "someone who would be well served if he could show that all consortia standards are junk, and that only SC34 (and he himself) could make them good". Hmmmmm - where did that come from?

For the record, I am an enthusiastic supporter of consortia and consortium standards and know from experience that consortia contain great people who are producing some of the best standards work on the planet: XML 1.0, ODF, XSLT, UBL, OOXML (ha!) – the list goes on. Most recently I was very pleased to see a new working draft of the important new W3C XProc specification – something that SC 34 is specifically deferring to rather than attempt something similar itself. I thoroughly disapprove of the kind of oppositional mindset that sees things in a polarised "ISO vs OASIS" or "ISO vs W3C" way. In my view that mode of thinking already did enough damage during the DIS 29500 project.

Tools that produce valid ODF?

Rob continues, re-running the tests I performed and finding the same result. Rob quibbles with many aspects of the test (which is fine, this was just a "smoke test") but, after all the huffing and puffing is done, we are left with the cold, hard fact that OpenOffice.org 2.4 (and, as Rob demonstrates, the CleverAge converter) are not emitting valid ODF documents.

It's at this point that things get a bit odd. Faced with the invalid documents before him Rob writes:

Conformance requires that [an application] is capable of writing out a valid document. And of course, success for an ODF implementation requires that its conformance to the standard is sufficient to deliver on the promises of the standard, for interoperability.

No. A conformant application needs to be more than "capable of" writing valid documents. If it claims to be emitting ODF 1.0 then valid ODF 1.0 is what it has to emit – the ODF schema is normative, not an optional extra. If the application fails to do this, it is non-conformant and consequently has a bug which needs fixing. This is what I would expect to be the message to OpenOffice: it has some (mild-looking) ODF conformance bugs which need fixing. Let's fix the application, not try to re-define what conformance means and pretend all is well!

Rob then moves on to compare the corpus of ODF documents to HTML on the Web:

So I suggest that ODF has a far better validation record than HTML and the web have, and that is an encouraging statement.

"encouraging"!? err, sorry but again: no. To compare any document type collection to the validity rubbish-heap that is the Web's corpus of HTML is saying practically nothing and, I think, sells ODF seriously short of where it's at. What is "encouraging" to me is that the schema problems in the ODF schema, and the validity errors we find in ODF emitted by a major application (OpenOffice), are so comparatively minor. The prize is in sight - with some schema fixing and bug fixing we (the users) could be using an office application which worked reliably with a truly international standard (ODF 1.0 in this case). That is surely what we should all be aiming for. Inevitably, progress in this will be slower if defects, when found, meet with denial and obfuscation rather than a willingness to move forwards.

Homework

Now that interest seems to have been awakened in performing ODF (and OOXML) validation, perhaps it is worth investigating the 25 warning messages that msv emits when parsing the ODF 1.0 schema with warnings enabled? The last two are related to the ID/IDREF problem mentioned above and are fixed by applying my proposed resolution. But are the remaining 23 all spurious? – nothing seems wrong with the schema from a quick look (this is a genuine, not a rhetorical, question BTW).

And I again renew my call: I am very interested in hearing about any application that consistently emits valid ODF (or valid OOXML for that matter). Are there really none?

Moving forward

As I wrote many times (and as was repeatedly ignored) the smoke tests for OOXML and ODF validation were, by design, crude – they just give a rough idea whether all is well. Based on the results, it is apparent that a more thorough investigation of both formats (and their applications) would be of interest. Accordingly the next step is to start constructing a validation testing framework that:

  • Uses a varied suite of documents originated natively using office applications (MS Office, OpenOffice.org and others)
  • Goes beyond schema validation to apply semantic constraints described by the standards' text (using e.g. Schematron)
  • Correlates and presents the results in full

Watch this space ...

- Alex.

Sunday, May 04, 2008 12:40:14 PM UTC  #    Disclaimer  |  Comments [23]  | 
 Wednesday, April 30, 2008

Following on from the recent smoke test of Office 2007 conformance to ISO/IEC 29500 here, as promised, is a repeat of the exercise using ISO/IEC 26300 (ODF 1.0).

Like OOXML, ODF has (sensibly) a schema defined using RELAX NG (ISO/IEC 19757-2). This schema is published in the standard itself and is available for download from OASIS.

ODF Schema Woes

The first problem encountered was in trying to use this schema. Both James Clark’s jing and Sun’s Multi-schema validator emitted error messages when processing it. Further investigation reveals that the schema has a critical flaw in the way its open models conflict with its typed attribute values. At the end of this blog entry is a detailed defect report with a proposal how to fix the schema. By filing this I nail my colours to the mast as a staunch ODF supporter!

The consequence of this schema flaw is that the formal definition of document validity in ODF 1.0 is broken. I suspect tools which claim to use the schema with success are based on Libxml, whose RELAX NG validator is incomplete. Don’t trust them.

Imagine the outrage there'd have been if OOXML had passed with this kind of defect!

Getting an ODF Document

For parity with the OOXML test, I used the same document (Ecma 376 Part 4) for testing. This required several steps of conversion, from Ecma 376 format to Word binary, and then (using OpenOffice.org 2.4.0) from Word binary to ODF. The process took several hours, but in the end it resulted in a .odt file of approx 59MB.

Validation Result

Validating the ODF document against the (patched) schema yielded 7,525 validation errors – mostly of the same type (use of an undeclared soft-page-break element).

Conclusion

Again, only tentative conclusions can be drawn from a smoke test (readers unfamiliar with this term as applied to software testing are recommended to read the Wikipedia article on it before grumbling about the depth of the test, please).

  • For ISO/IEC 26300:2006 (ODF) in general, we can say that the standard itself has a defect which prevents any document claiming validity from being actually valid. Consequently, there are no XML documents in existence which are valid to ISO ODF.
  • Even if the schema is fixed, we can see that OpenOffice.org 2.4.0 does not produce valid XML documents. This is to be expected and is a mirror-case of what was found for MS Office 2007: while MS Office has not caught up with the ISO standard, OpenOffice has rather bypassed it (it aims at its consortium standard, just as MS Office does).

I’d be very interested to find an office application that does work with valid ISO/IEC 26300 content. Do any readers know of one?

Looking Forward

A smoke test only scratches the surface – a fuller document conformance test suite would give a much better idea of the semantic (as well as the syntactic) validity of documents that claim conformance to either 29500 or 26300.

Fortunately SC 34 has spent the past years working on exactly the kinds of technologies (ISO/IEC 19757, DSDL) that will allow a more complete validation of XML documents. I am hopeful that we will see some more meaningful testing in time, and note with interest that the Italian National Standards Body have invited participation in such activities.

The unfortunate reality for concerned users is that there are no office application suites on the planet that create XML valid to International Standards, although both MS Office and OpenOffice.org get you within sniffing distance. The remedies for this shortfall are for Microsoft (on the one hand) to update its Office product, and for ODF developers (on the other hand) to pay more attention to XML validity – especially when targeting the upcoming ISO standard version of ODF 1.2. The world is moving on, and users do not want to spend time battling with incorrect outputs of their office applications: they want a reliable format they can use to build further applications on. Let us hope the coming months and years will see marked improvements in document conformance levels!

N.B. As this blog entry “goes to press”, Jesper Lund Stocholm has posted a blog entry on ODF conformance which is also well worth reading.

I suspect neither his blog entry, nor this one, will receive as much attention as the one reporting findings on MS Office's XML! Let's see.





Defect Report ISO/IEC 26300:2006

Clause 16.2 defines an “open model” for custom content using two patterns, as follows:

<define name="anyAttListOrElements">
<zeroOrMore>
<attribute>
<anyName/>
<text/>
</attribute>
</zeroOrMore>
<ref name="anyElements"/>
</define>
<define name="anyElements">
<zeroOrMore>
<element>
<anyName/>
<mixed>
<ref name="anyAttListOrElements"/>
</mixed>
</element>
</zeroOrMore>
</define>

Similar definitions are also used (clause 15.2) for the modelling of mathematical markup:

 <!-- To avoid inclusion of the complete MathML schema, anything -->
<!-- is allowed within a math:math top-level element -->

<define name="mathMarkup">
<zeroOrMore>
<choice>
<attribute>
<anyName/>
</attribute>
<text/>
<element>
<anyName/>
<ref name="mathMarkup"/>
</element>
</choice>
</zeroOrMore>
</define>

However, the declaration here of attributes with any name and a value of any type conflicts with the declaration elsewhere in the schema of attributes that have an ID or IDREF type. Consequently the schema cannot be processed by validating processors which respect type consistency (e.g. jing [1] or msv [2] used with warnings enabled).

Proposed Solution

The schema must be corrected. This can be done by excluding the typed attributes from the custom model as follows:

  <define name="anyAttListOrElements">
<zeroOrMore>
<attribute>
<anyName>
<except>
<name>smil:targetElement</name>
<name>text:id</name>
<name>text:change-id</name>
<name>form:id</name>
<name>presentation:master-element</name>
<name>draw:id</name>
<name>anim:id</name>
<name>draw:shape-id</name>
<name>draw:end-shape</name>
<name>draw:start-shape</name>
<name>draw:control</name>
</except>
</anyName>
<text/>
</attribute>
</zeroOrMore>
<ref name="anyElements"/>
</define>
<define name="anyElements">
<zeroOrMore>
<element>
<anyName/>
<mixed>
<ref name="anyAttListOrElements"/>
</mixed>
</element>
</zeroOrMore>
</define>

If it is intended these attributes should be allowed in custom data, they should be re-included (correctly typed) as necessary.
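The list of excluded names above was compiled by hand. As a cross-check, something like the following Python sketch could collect the ID-typed attribute names from a RELAX NG grammar and print them as an except block. It is only a sketch under assumptions: it looks only for attributes with a fixed name whose content is a data pattern of type ID, IDREF or IDREFS, either directly or through a ref to a define that directly contains such a data pattern. The toy grammar is invented for illustration and is not the ODF schema.

```python
import xml.etree.ElementTree as ET

RNG = "{http://relaxng.org/ns/structure/1.0}"
ID_TYPES = {"ID", "IDREF", "IDREFS"}

def id_typed_attributes(schema_text):
    """Return the sorted names of attributes typed ID, IDREF or IDREFS."""
    root = ET.fromstring(schema_text)
    # Named patterns that are nothing more than an ID-typed <data/>.
    typed_defines = set()
    for define in root.iter(RNG + "define"):
        for child in define:
            if child.tag == RNG + "data" and child.get("type") in ID_TYPES:
                typed_defines.add(define.get("name"))
    names = set()
    for attr in root.iter(RNG + "attribute"):
        name = attr.get("name")
        if name is None:
            continue  # attributes declared with a name class have no fixed name
        for node in attr.iter():
            direct = node.tag == RNG + "data" and node.get("type") in ID_TYPES
            via_ref = node.tag == RNG + "ref" and node.get("name") in typed_defines
            if direct or via_ref:
                names.add(name)
    return sorted(names)

def except_block(names):
    """Format attribute names as a RELAX NG <except> name class."""
    lines = ["<except>"] + [f"<name>{n}</name>" for n in names] + ["</except>"]
    return "\n".join(lines)

# Toy grammar, invented for this example (not the ODF 1.0 schema).
TOY_SCHEMA = """
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <define name="ID"><data type="ID"/></define>
  <define name="thing">
    <element name="thing">
      <attribute name="text:id"><ref name="ID"/></attribute>
      <attribute name="draw:end-shape"><data type="IDREF"/></attribute>
      <attribute name="text:style-name"><data type="NCName"/></attribute>
    </element>
  </define>
</grammar>
"""

print(except_block(id_typed_attributes(TOY_SCHEMA)))
```

Run against the real schema, a script of this kind might find names missing from, or surplus to, the hand-made list; I have not verified that it reproduces the list exactly.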

In general, the custom data model should be revisited – is it really the intention that it should be so open?

Similarly, the math markup model would be better made more restrictive either by incorporating a MathML schema, or at least by restricting the allowed elements to certain Namespaces. For the time being it should at least re-use the custom model to avoid unnecessary replication of patterns.

References

[1] Jing - A RELAX NG validator in Java http://www.thaiopensource.com/relaxng/jing.html

[2] Sun Multi-Schema Validator https://msv.dev.java.net/

Wednesday, April 30, 2008 10:50:15 AM UTC  #    Disclaimer  |  Comments [21]  | 
 Wednesday, April 23, 2008


Empire State Building

To New York for three days of client meetings. With an afternoon free and very pleasant weather, what better way to spend the time than taking a trip up the Empire State Building? (The sign in the lobby said "visibility: 10 miles".)


Pigeons on the 86th floor

How nice to have three days of purely commercial work stretching ahead, with no OOXML or standards politics in sight. There is a certain clarity to doing technical work in an environment where the requirements are clearly on the table; and technically and conceptually the schema I'm working on here is miles ahead of OOXML/ODF, though maybe in saying that I'm influenced by the fact that I am the chief designer ;-)

ODF Conformance catch-up

When I get back to the UK I hope to post a blog entry on ODF conformance. I'm surprised nobody has risen to the challenge I issued in my last blog entry to predict the result. So, I renew the call! I'd be particularly interested in hearing about any ODF implementations that people think should be conformant …

One immediate problem came up in that the published RELAX NG schemas in the ISO standard (ISO/IEC 26300) appear to have a technical fault which makes them unusable. I wonder, am I the first person ever to make a serious attempt to validate an ODF document against its International Standard specification?

- Alex.
Wednesday, April 23, 2008 7:26:54 AM UTC  #    Disclaimer  |  Comments [8]  | 
 Thursday, April 17, 2008

I was excited to receive from Murata Makoto a set of the RELAX NG schemas for the (post-BRM) revision of OOXML, and thought it would be interesting to validate some real-world content against them, to get a rough idea of how non-conformant the standardisation of 29500 had made MS Office 2007.

Not having Office 2007 installed at work (our clients aren't using it – yet), the first problem is actually getting a reasonable sample for testing. Fortunately, the Ecma 376 specification itself is available for download from Ecma as a .docx file, and this hefty document is a reasonable basis for a smoke test ...

The main document ("document.xml") content for Part 4 of Ecma 376 weighs in at approx. 60MB of XML. Looking at it ... I'm sorry, but I'm not working on that size of document when it's spread across only two lines. Pretty-printing the thing makes it rather more usable, but pushes the file size up to around 100MB.

So we have a document and a RELAX NG schema. All that's necessary now is to use jing (or similar) and we can validate ...

Validating against the STRICT model

The STRICT conformance model is quite a bit different from Ecma 376, essentially because most of that format's most notorious features (non ISO dates, compatibility settings like autospacewotnot, VML, etc.) have been removed. Thus the expectation is that existing Office 2007 documents might be some distance away from being valid according to the strict schemas.

Sure enough, jing emitted 17MB of invalidity messages (around 122,000 of them) when validating in this scenario. Most of them seem to involve unrecognised attributes or attribute values: I would expect a document which exercised a wider range of features to generate a more diverse set of error messages.

Validating against the TRANSITIONAL model

The TRANSITIONAL conformance model is quite a bit closer to the original Ecma 376. Countries at the BRM (rather more than Ecma, as it happened) were very keen to keep compatibility with Ecma 376 and to preserve XML structures at which legacy Office features could be targeted. The expectation is therefore that an MS Office 2007 document should be pretty close to valid according to the TRANSITIONAL schema.

Sure enough (again) the result is as expected: relatively few messages (84) are emitted and they are all of the same type complaining e.g. of the element:

<m:degHide m:val="on"/>
since the allowed attribute values for val are now "true", "false", etc.; this was one of the many tidying-up exercises performed at the BRM.

Conclusions?

Such a test is only indicative, of course, but a few tentative conclusions can be drawn:

  • Word documents generated by today's version of MS Office 2007 do not conform to ISO/IEC 29500
  • Making them conform to the STRICT schema is going to require some surgery to the (de)serialisation code of the application
  • Making them conform to the TRANSITIONAL will require less of the same sort of surgery (since they're quite close to conformant as-is)

Given Microsoft's proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way. Indeed, a strong motivation for approving 29500 as an ISO/IEC standard was to discourage Microsoft from this kind of file format rug-pulling stunt in future.

What's next?

To repeat the exercise with ISO/IEC 26300:2006 (ODF 1.0) and a popular implementation of OpenDocument. Will anybody be brave enough to predict what kind of result that exercise will have?


- Alex.
Thursday, April 17, 2008 12:20:22 PM UTC  #    Disclaimer  |  Comments [9]  | 
 Thursday, March 13, 2008

There has been some interesting discussion on xml-dev recently about the future of XML, and in particular whether the XML specification itself needs to be fundamentally revisited. One idea that particularly interested me was that DTDs could/should be removed from the XML specification as they place a heavy burden on implementors and implementations in what is the Age of the Schema (apparently).

I think we can go a lot further than that, and that there is a general need to be able to communicate to a processor what features of the XML family a document uses. I think a good way to do this is with a PI that follows the XML declaration, so:

<?xml version="1.0"?>
<?profile dtd="no"?>

would do the trick in conveying that a document makes no use of DTD constructs.

We could go further:

<?xml version="1.0"?>
<?profile dtd="no" namespaces="no"?>

et voila we convey to our processor that there will be no use of XML Namespaces in a document. Conversely, specifying namespaces="yes" would tell a processor that support for that spec is required. Currently this sort of thing has to be done using ad hoc processor-specific features.

We could use this kind of mechanism to tell a processor whether it should recognize xml:id, XML Inclusions, etc. etc.
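For what it's worth, reading such a PI takes very little code. The sketch below uses Python's standard minidom to find a `profile` processing instruction ahead of the root element and turn its pseudo-attributes into a dictionary. The `profile` PI is only the proposal floated in this post, not an existing convention, and the function name is invented.

```python
import re
from xml.dom import minidom

def read_profile(xml_text):
    """Return the pseudo-attributes of a top-level <?profile ...?> PI as a dict."""
    doc = minidom.parseString(xml_text)
    for node in doc.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "profile":
            # PI content is plain text, so name="value" pairs are picked out
            # with a regular expression rather than an attribute parser.
            return dict(re.findall(r'(\w+)="([^"]*)"', node.data))
    return {}  # no profile PI: nothing is declared

SAMPLE = '<?xml version="1.0"?>\n<?profile dtd="no" namespaces="no"?>\n<doc/>'
print(read_profile(SAMPLE))
```

A processor could then consult the returned dictionary before deciding which features of the XML family to switch on.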

Getting more controversial

We can go further still. What about this?

<?xml version="1.0"?>
<?profile edition="4"?>

In an attempt to stop the slippety-slide of XML 1.0 fifth edition into our document space.

And what about this?

<?xml version="1.0"?>
<?profile attributes="no"?>

i.e., turning off a "core" feature of XML – the use of attributes. SML by the back door? Hmmmmmmm, I like.

And of course, such profiled XML documents would always be 100% conformant XML too ...

What's not to like? If I can just type it up we can have it fast-tracked through ISO in a jiffy ;-)

- Alex.
Thursday, March 13, 2008 12:42:30 PM UTC  #    Disclaimer  |  Comments [2]  | 
 Thursday, July 26, 2007

I have recently recommended to a large publishing client that they adopt RELAX NG as the basis of the formal definitions of their content, in preference to W3C XML Schema Definition Language (WXS).

There are lots of individual bits of information on why RELAX NG should be preferred all over the web. Here is an attempt to condense some of the key information into ten points …

1. A better spec means better interoperability

We, in common with many people working with WXS schemas, have been tripped up by interoperability problems caused by different tools having a different take on how WXS should be implemented. Even Microsoft, a developer who is generally sympathetic to WXS, has reported a number of interoperability problems, and that for its customers WXS had “stuffed up the ready interoperability they thought they were buying into with XML”. [1]

The root of such interoperability problems is that the WXS specification is notoriously hard to interpret. James Clark has called it “without doubt the hardest to understand specification that I have ever read”. [2] Little wonder then that mere mortal developers have difficulty interpreting it!

RELAX NG has, by contrast, a clear formal description of the semantics of a RELAX NG schema – and for those who want to skip the formal text of the standard, the technology can be clearly explained even in a short tutorial.

2. Availability of a compact syntax

Unlike WXS, RELAX NG has a compact syntax (as explained in this tutorial). Using it, a DTD like:

<!DOCTYPE addressBook [
<!ELEMENT addressBook (card*)>
<!ELEMENT card (name, email)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
]>

can be expressed with this syntax:

element addressBook {
element card {
element name { text },
element email { text }
}*
}

Much nicer!

3. The specification is a stable ISO standard

RELAX NG first became an OASIS standard in 2001 and then went through a full ISO standardisation process to become an ISO Standard (ISO 19757-2:2003 [free ZIP download]) in 2003. It has proved stable and complete from the start and no revisions to it are planned.

WXS emerged from a vendor-dominated consortium (the W3C), and is currently anticipated to be revised and released in a 'mostly compatible' version 1.1 and, later, revised to a 2.0 release. It is unclear what level of vendor support these new releases will enjoy.

4. No PSVI

The PSVI, or Post-Schema-Validation Infoset, is the result of validating a document against a WXS schema. It consists of the normal XML infoset, plus extra information that might have been gleaned from the schema, such as type information about content.

This is a bad thing.

The main reason why it's a bad thing is that it introduces into the processing model information that cannot be expressed as XML. If a processing pipeline needs to make use of the kind of information embodied by the PSVI, then every step in that pipeline has to become PSVI-aware and the result is a tightly-coupled system that is no longer XML-based, but based on something other than the XML Infoset, the PSVI.

Both James Clark [3] and Elliotte Rusty Harold [4] say all that needs to be said about the perils of the PSVI.

5. No content defaulting

RELAX NG, at least in its ISO form, provides no mechanisms for content defaulting. For reasons why this is good, see this other blog entry.

6. Better datatyping support

WXS provides a set of datatypes that may be used to constrain and bind values in content. This is a good idea.

Unfortunately, there are a number of serious problems[5] with the way this has been done (and the fact that type information is communicated using the PSVI).

RELAX NG, in contrast, has the option for pluggable type libraries which may be implemented through an API. Most validators ship with WXS-mirroring type libraries (if you must) too.

(In future, when we're all using pipeline processing for validation, a nice datatype language like DTLL could more properly perform the task of datatype validation.)

7. More sophisticated modelling

WXS gives us barely more sophistication in grammar modelling than DTDs did. RELAX NG introduces useful new features for modelling interdependent attribute and element content.

8. More sophisticated grammatical validation

WXS grammars have to be deterministic. RELAX NG grammars can be ambiguous.

Score one for WXS, you might think. But wait - WXS's means of preventing ambiguity is through a constraint called Unique Particle Attribution (UPA). The problem with this, as the Microsoft report notes, is that “it breaks idiomatic uses of XML”. So if you want to express a grammar like (title?,para+)|(title,subtitle?,para+) (i.e. subtitle is only permitted when there is a title) the UPA rule will prevent you, as a validator cannot know which 'branch' of the model it is following during validation. The problem becomes more acute if one starts adopting some of the wildcarding features permitted in WXS.

RELAX NG, on the other hand, will happily accommodate non-deterministic content models.
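To see that there is nothing exotic about such a model, here is a rough Python sketch which writes the grammar above as a regular expression over the sequence of child element names. This is not how a RELAX NG validator works internally; it only illustrates that a matcher which is free to try both branches has no trouble with the model. The encoding (each name followed by a space) and the function name are my own choices.

```python
import re

# (title?, para+) | (title, subtitle?, para+), over space-terminated names.
MODEL = re.compile(
    r"(?:(?:title )?(?:para )+"            # branch 1: optional title, then paras
    r"|title (?:subtitle )?(?:para )+)"    # branch 2: title, optional subtitle, paras
)

def matches(children):
    """True if the list of child element names satisfies the content model."""
    encoded = "".join(name + " " for name in children)
    return MODEL.fullmatch(encoded) is not None

print(matches(["title", "subtitle", "para"]))  # subtitle with a title
print(matches(["subtitle", "para"]))           # subtitle without a title
```

The first call succeeds via the second branch after the first branch fails, which is the kind of look-ahead that the UPA rule forbids a WXS validator from needing.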

In most applications (and probably all publishing applications) the question of whether a governing schema's content model is deterministic or not, is a dry technicality, of absolutely no consequence to the work in hand.

9. Instances have no dependency

WXS schemas (like DTDs) provide a mechanism for associating an instance with a schema: the xsi:schemaLocation attribute. This is problematic in two ways: first, the W3C recommendation makes it optional for processors to use this mechanism - and so behaviour is unpredictable; secondly, this is a potential security problem: it is possible to specify an unwanted schema here knowing that an application may not be free to ignore it.

RELAX NG schemas, on the other hand, have no formal association with instances. The validation model is one in which the validation process has separate inputs for the data being tested, and the tests themselves - users do not have to validate a document each and every time it is processed.

10. Growing consensus

A growing number of key XML languages are being normatively defined using RELAX NG, such as XHTML 2.0, the Atom Syndication Format, OpenDocument Format and DocBook 5. It's clear (if there is a shift) which direction that shift is in, particularly for document-like modelling. And when Tim Bray, one of the original editors of XML 1.0, comes out against WXS it really is time to listen:

Everybody who actually touches the technology has known the truth for years, and it’s time to stop sweeping it under the rug. W3C XML Schemas (XSD) suck. They are hard to read, hard to write, hard to understand, have interoperability problems, and are unable to describe lots of things you want to do all the time in XML. Schemas based on Relax NG, also known as ISO Standard 19757, are easy to write, easy to read, are backed by a rigorous formalism for interoperability, and can describe immensely more different XML constructs. [6]

- Alex.

References

[1] Microsoft Corp., XML Schema Language Experience Report, http://www.w3.org/2005/05/25-schema/microsoft.html

[2] James Clark, RELAX NG and W3C XML Schema, http://www.imc.org/ietf-xml-use/mail-archive/msg00217.html

[3] James Clark, PSVI considered harmful, http://osdir.com/ml/org.w3c.tag/2002-06/msg00118.html

[4] Elliotte Rusty Harold, Pretend There's No Such Thing as the PSVI, http://safari.awprofessional.com/0321150406/ch25 [pay-for content]

[5] Comments on XML Schema Datatype made by ISO/IEC JTC 1/SC 34/WG1, http://www.jtc1sc34.org/repository/0392.htm

[6] Tim Bray, Choose RELAX Now, http://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax

Thursday, July 26, 2007 8:05:16 AM UTC  #    Disclaimer  |  Comments [1]  | 
 Monday, July 23, 2007

Francis Cave and I are running training days on XML in Publishing this summer at The Publishing Training Centre at Book House.

This course is for those who have to manage the production of electronic content for a range of applications. It requires no prior knowledge. During it, participants will find out:

  • the basic principles of mark-up languages
  • the roles XML can play in publishing
  • what it is like to work with XML data.

The next session is scheduled for 25th September, and is already filling up. To book a place, please contact The Publishing Training Centre directly ...

- Alex.
Monday, July 23, 2007 2:22:37 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Thursday, April 26, 2007
XSLT transformations are the stock-in-trade of XML developers. In general, you shouldn't have to worry too much about how different engines work, but in some rare edge cases, it's a consideration.

Now, it would be foolish to rely on the order of attributes in your physical instance, since none is imposed on them by an XML processor. (OK, you could in theory rely on the ordering if you canonicalize the document first: "An element's attribute nodes are sorted lexicographically with namespace URI as the primary key and local name as the secondary key" -- W3C Canonical XML Version 1.0).

The ordering of attribute nodes in XPath is similarly undefined, so trusty XPath engines do not necessarily produce the same results when an element's attributes are processed together. Given the XML instance:

    <foo c='1' b='2' a='3'/>

and the XPath expression:

    name(//@*)


the result might be 'c', 'b' or 'a', and could conceivably differ between runs of the same engine.

That XPath expression is artificial and unlikely to be used seriously, but it serves to make the point: different XPath engines producing different results can all still be considered to have produced "correct" output.
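
If a stable order matters, the safest course is to impose one yourself rather than trust whatever the engine hands back. Here is a minimal sketch in Python using only the standard library. The helper names are made up for illustration, and `canonicalize` (Python 3.8+) implements C14N 2.0 rather than the Canonical XML 1.0 quoted above, although it also emits attributes in sorted order.

```python
import xml.etree.ElementTree as ET

SAMPLE = "<foo c='1' b='2' a='3'/>"

def attribute_names(xml_text):
    """Return the root element's attribute names in an explicit, sorted order."""
    root = ET.fromstring(xml_text)
    # Sort explicitly: the data model promises no attribute ordering.
    return sorted(root.attrib)

def canonical_form(xml_text):
    """Serialise the document canonically, so attributes come out sorted."""
    return ET.canonicalize(xml_text)

print(attribute_names(SAMPLE))
print(canonical_form(SAMPLE))
```

Either way the ordering is now something the code asks for, not something it happens to get.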

- Andrew

Thursday, April 26, 2007 9:50:36 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Wednesday, February 14, 2007

XML UK are holding a day conference entitled “Publishing 2.0” at Bletchley Park on Wednesday 25 April 2007.

Beyond being an eye-catching title, what we (as organisers) intend “Publishing 2.0” to mean is that the conference will examine some of the more cutting-edge applications of XML(ish) technology to publishing. We're putting together a cracking programme which already includes:

And how cool a venue is Bletchley Park? To be in the presence of the ghost of Alan Turing adds an extra geeky frisson to the occasion.

A full programme will be announced shortly, but I confidently predict this event will sell out (the venue is limited to 100 people), so to reserve an early space contact XML UK with your credit card in hand.

- Alex.
Wednesday, February 14, 2007 9:50:35 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, February 13, 2007

… is the title of a paper I will be giving at The XTech 2007 Conference, which is to be held in Paris from 15 - 18 May.

The focus of the presentation is the new ONIX-PL XML language for expressing licences electronically, so that they are machine-processable. In the first instance the licenses being modelled are between publishers and libraries.

This will be a two-hander, with Francis Cave handling an explanation of the wider business issues, and me concentrating on how we've used Orbeon Forms as the basis of a web application for authoring and managing these complex electronic documents (see a sample license for an example). Francis has been hard at work on an innovative way of annotating XML schemas to affect how instances governed by them are rendered in XForms engines.

Now to write that paper ... Here's the abstract ...


Abstract

As more and more content is published electronically so the need for controlling access to it has risen. Early efforts in this field focused on copy-protection technologies (DRM), but a more enlightened approach emerges if instead content licenses can be agreed between parties and content then used according to that agreement.

This presentation focuses on the design and system implementations around the new ONIX-PL industry standard (developed by EDItEUR ), for representing license agreements between content producers and content recipients. Early adopters of the standard are publishers, libraries and academic institutions wishing to agree licensing terms for the use of high-value scholarly content. ONIX-PL is a high-profile initiative enjoying support from JISC, DLF/ERMI and a number of US and European universities, commercial publishers and library systems vendors.

ONIX-PL license expressions are XML documents which are machine actionable. For that reason they need to capture with precise semantics the implications of the legal clauses they embody.

This presentation will examine the challenges of representing machine-actionable legal agreements using XML, and in particular look at the semantic web technologies considered and used (or rejected) in the XML model designs.

Standards and models are of no use if they have no implementation or take up. The presentation will therefore consider how EDItEUR chose to develop a free Open Source software application for authoring and managing these complex XML documents, and how ultimately a full range of Web 2.0 technologies including XForms, pipelining, and AJAX were necessary in consort with more established technologies such as XSLT, XHTML and J2EE, in order to have a web application that dealt properly with the problem space while meeting tight development deadlines.

The presentation will thus conclude with some real-world tales of software development and deployment (together with a demonstration) of licenses being created and used using EDItEUR’s chosen infrastructure technology, Orbeon Forms (whose developers the presenters have no affiliation with).

In summary, attendees can expect to learn:

  • why there is a need for electronic expressions of licenses
  • how XML and semantic technologies can be used for this purpose
  • what an XML electronic license expression looks like ‘for real’
  • why XML licenses need to be created by non-technical users
  • how to rapidly develop a web application for them, and the ‘real world’ software development challenges faced in doing so.
- Alex.
Tuesday, February 13, 2007 1:55:48 PM UTC  #    Disclaimer  |  Comments [0]  | 

… is a presentation that I won't be giving at The XTech 2007 Conference, as the proposal was not accepted (I will however be speaking on another topic). Based on my experience of speaking at, and reviewing for, XML conferences over several years the rejection of this paper surprises me. Maybe XTech really is losing the XML-focus of its XML Europe past.

I do hope somebody is covering DSDL, as the technologies it contains are important ones that deserve public airing.

Anyway, here's the abstract of the paper that didn't make it:


Description

ISO is expected shortly to standardise three new schema languages as part of DSDL. Learn about them, and the DSDL project as a whole, in this update.

Abstract

It has recently been proclaimed that “among the XML cognoscenti, the debate is effectively over. Everyone is choosing RELAX NG”. And indeed the early indicators are that RELAX NG is getting increasing traction (if still only being the grammar modelling language of the “cognoscenti”). So for example:

  • The W3C have defined XHTML 2.0 normatively using RELAX NG
  • Microsoft have agreed to have the schemas for Office re-expressed in RELAX NG as part of their standardisation effort
  • DocBook 5 is being primarily developed using RELAX NG.

But RELAX NG is only one part of a 10 part ISO standard: DSDL (or Document Schema Definition Languages, ISO 19757) aims to offer a complete family of XML validation languages, in which RELAX NG covers just the specialised area of regular-grammar-based validation.

The other two fully-standardised parts of DSDL (Schematron and NVDL) are also gaining wider adoption in public XML models and in implementations.

But DSDL is about to include, in their final forms, three new standards which are currently less well known, even among “the cognoscenti”: DTLL (Datatype Library Language), DSRL (Document Schema Renaming Language) and Datatype- and Namespace-aware DTDs.

Drawn from real world experience in the ISO working group, and in editing and implementing part of DSDL, this presentation will include a description of DSDL, and in particular will set out the function of the three lesser-known parts which are soon to be standardised. It will explain why DSDL as a whole offers an elegant and complete solution to the problems of XML validation, and why users should care.

  • DTLL will introduce data-typing into the validation mix in a way which overcomes the limitations of W3C Schema’s fixed typing scheme. It will allow users to define their own type libraries in elegant declarative XML.
  • Influenced by architectural forms, DSRL acts as a schema adapter, allowing users to validate XML as though it were valid to a schema, by modifying it ‘on the fly’. As such it powerfully supports internationalisation and content defaulting.
  • Part 9 of DSDL will retro-fit some of its major features into DTDs, allowing users with heavy investment in DTD technology to get more life out of them.

In summary, attendees will hear:

  • A conceptual overview of the need for DSDL and an appreciation of the problem space it addresses
  • What the 10 parts of DSDL are
  • A more detailed description of the three upcoming parts of DSDL
  • Examples and/or demonstrations of these in action
  • A report of progress made in working-group meetings running alongside XTech 2007
  • A roadmap for the completion of the project and details on how to get involved.
Tuesday, February 13, 2007 1:36:27 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, January 09, 2007

XML DTDs and W3C Schemas both have mechanisms that allow default information to be supplied at validation time. Here's an example of this from the W3C's own XHTML Schema Module Implementations for attributes of the input element:

<xs:attribute name="type" default="submit">

When XHTML instances are validated against a schema with this construct, the validation process will default the value "submit" for the type attribute, if none is present in the instance being validated.
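
The same effect is easy to see for yourself with a DTD default. The sketch below is an illustration only. It uses a DTD in the internal subset instead of the W3C schema above (Python's standard library has no XSD validator), the `input` document is a cut-down stand-in, and the helper name is hypothetical. It relies on the `specified_attributes` switch of the standard expat bindings to compare what the parser reports with what is physically in the instance.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE input [
<!ELEMENT input EMPTY>
<!ATTLIST input type CDATA "submit">
]>
<input/>"""

def root_attributes(xml_text, include_defaults=True):
    """Return the attributes expat reports for the first element parsed."""
    parser = xml.parsers.expat.ParserCreate()
    # When specified_attributes is true, expat reports only the attributes
    # actually written in the instance, not those supplied by the DTD.
    parser.specified_attributes = not include_defaults
    elements = []
    parser.StartElementHandler = lambda name, attrs: elements.append(attrs)
    parser.Parse(xml_text, True)
    return dict(elements[0])

print(root_attributes(DOC))                          # what the application gets
print(root_attributes(DOC, include_defaults=False))  # what is in the instance
```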

In the early days of the company we were enthusiastic content defaulters, priding ourselves on designing DTDs that could 'take the strain' by providing default information that might otherwise have to be tediously keyed in. But now we think this is bad practice. Here's why:

1. Conceptual confusion. There's validation, and there's transformation. A schema (or DTD) should be used for validation, and transformation languages (like XSLT) for transformation. Trying to do both jobs in one language confuses these concerns.

2. Defaulting models can't be expressed with RELAX NG. RELAX NG as standardised by ISO contains no mechanisms for defaulting content (unlike its OASIS-standardised predecessor). So content models which expect the schema language to provide default content won't be expressible in RELAX NG.

3. WYSINWYG. What you see is not what you get. One of the great strengths of XML (despite the W3C's wrong-headed pronouncement that XML is not meant to be read) is that you can open up an XML document in a text editor and actually see, at the text 'n' tags level, what is going on. But if a schema or DTD will be providing default content, you might not be seeing everything 'in' the instance.

4. Having to take the Schema everywhere. Relatedly, if the DTD or schema provides default information, you need to make sure that the DTD or schema is present every time its governed instances are parsed, as such documents are not standalone. This is an overhead.

5. The Namespace problem. See here for reasons why providing Namespace support through content defaulting can be tricky.

6. Saving typing is sooooo over. Content defaulting, like tag minimisation features, emerges from an earlier era where saving keystrokes and precious VDU screen real estate were important concerns. These things are generally less important now, and - if they are - there are better tools for the job than content-defaulting schemas or DTDs.

Okay, so that's six reasons. Under-promising and over-delivering again ;-)

Tuesday, January 09, 2007 8:17:00 PM UTC  #    Disclaimer  |  Comments [0]  | 

XML users who choose DTDs for modelling are faced with practical problems caused by the lack of information in the XML specifications about how XML DTDs and XML Namespaces should co-exist. Strictly speaking there is no mechanism by which DTD users can specify that element definitions are declared in a particular namespace — however a frequently-seen approach to trying to achieve this (as in the W3C's own XHTML 1.0 DTDs) is to declare an attribute called 'xmlns' and fix a Namespace URI to it like this.

<?xml version='1.0'?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a xmlns CDATA #FIXED "http://example.com/ns">
]>
<a/>

There are, however, a number of problems with this, not least the fact that 'xmlns' is not (and cannot be - names beginning with the sequence x m l are reserved by the XML specification) an attribute. However, we might close our eyes, cross our fingers and hope the hack works.

And we'd be okay in most cases. The Xerces and rxp parsers, for example, happily process this XML and behave as if the element a is associated with the Namespace URI http://example.com/ns.

Microsoft parsers, however, will not parse this content. They halt with the message "Use of default namespace declaration attribute in DTD not supported." Any Windows user launching such XML for viewing in Internet Explorer (as customers do) will get this message. Microsoft's developers are arguably quite correct in doing this, but as in the well-known joke perhaps also unhelpful.

What to do?

What we do is this:

<?xml version='1.0'?>
<!DOCTYPE a [
<!ELEMENT a EMPTY>
<!ATTLIST a xmlns CDATA #REQUIRED >
]>
<a xmlns='http://example.com/ns'/>

… and in the documentation accompanying the DTD specify that instances must specify the Namespace properly using the usual xmlns mechanism (ideally this is enforced in another validation layer, with Schematron e.g.). This way the XML is 'correct' and all conformant parsers - including Microsoft's - are happy.
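
As a stand-in for that extra validation layer, here is a minimal sketch of the check in Python, not Schematron. The function name is hypothetical and the namespace URI is the example one used above. All it does is confirm that the root element really is in the expected Namespace.

```python
import xml.etree.ElementTree as ET

EXPECTED_NS = "http://example.com/ns"

def root_in_namespace(xml_text, namespace=EXPECTED_NS):
    """True if the document's root element is in the given Namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace-uri}local-name".
    return root.tag.startswith("{" + namespace + "}")

print(root_in_namespace("<a xmlns='http://example.com/ns'/>"))
print(root_in_namespace("<a/>"))
```

An instance that leaves the declaration out fails the check, which is exactly the case the DTD alone cannot be trusted to catch in every parser.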

The only downside (one might think) is that users have to go to the trouble of putting that Namespace URI in the instance, rather than letting the DTD supply it for them. But we think this is a good thing — why we think that is the subject of another blog entry.

Tuesday, January 09, 2007 7:59:52 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Thursday, January 04, 2007

The company is 10 years old in February! We've been discussing what suitably eye-catching initiative it would be good to celebrate with.

One idea was a "back to 1997 prices" promotion for our consulting rates … but on second thoughts maybe the market couldn't stand such a steep rise ;-)

A Happy New Year to everyone!

- Alex.
Thursday, January 04, 2007 3:56:59 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, December 05, 2006

There are many areas of computing and development where what might be termed questions of "taste" apply. A function should fit on a screen; a module should contain no more than a few dozen functions; a database table should have no more than a few dozen fields, etc.

The same question applies to XML document instances, and in general they shouldn't be more than a few megabytes in size. If you find you're working with XML and your documents are often bigger than this, more often than not it's a symptom of a deeper architectural malaise.

The reasons for having small functions, etc, are not just capricious – things that are smaller are easier to debug, view and maintain — in short easier to comprehend, given that our puny human brains can only function effectively when not overloaded with content.

XML documents should be human-consumable too, despite the W3C's wrong-headed strictures that XML isn't meant to be read. This means they should be small enough to be worked with by human beings, and shouldn't be too big.

What is this "too big"? Well, a clue lies in the fact that XML describes "documents", not "reference-libraries" or "databases". XML documents are well-suited to representing things like book chapters, journal articles, employee records, or tax returns; they are not good at modelling (as single standalone documents) book collections, a journal series, a large company's employee details, or a country's tax return collection. While (the old joke runs) software engineers are only interested in three numbers: zero, one and infinity, this is damaging when it shades into a temptation to say that a system must model "one" XML document. We must learn to be comfortable with "some" documents, of a certain size.

When dealing with complex systems, human minds seem happiest when they have clear divisions between the mental contexts in which they apprehend parts of that system. So in the "complex system" we call life, we might of a morning leave the house, open the garage door and get into our car. It is of no help to us at all to learn that the house, the garage and the car are all types of container (true though that is) — for our human brains, knowing the type similarity between things is usually just unwelcome noise1. In the same way, creating usable software can require being canny about concealing the underlying similarity between things. We see a folder on our desktop containing files: it is of no help to us at all to know that the files, the folder, and the desktop itself, are all "file system entries" of one kind or another.

In the same way, usable XML systems have three levels of mental context. At the core is text content, with its own rules and appeal to our mind; above this is the XML structure (elements and attributes) of the document; above this is the storage system for the documents — whether they are files in a file system or entries in a database. We should embrace this storage layer as a useful abstraction, instead of trying to expand the remit of the document to supplant it.

The ultimate exemplar of this model is, of course, the Web itself. It is not a single document, and cannot be represented as such (it has no root, or starting point). In smaller systems this multi-document model is exemplified by such things as "collections" of XML documents. XML storage applications such as eXist have such collections as a central feature of their storage philosophy.

Fragments can be combined, of course, to make composites of arbitrary size — but the correct way to do this is to use linking and embedding technologies so that the overall collection is a loosely-coupled assembly of reasonably-sized documents, not to munge them into some XML mega-document which you rely on a fragmentation tool to extract them from before they can be used.

If the above paragraphs set out abstract reasons for preferring small documents, there are a number of practical considerations too:

  • Small documents are friendly to other desktop applications, so can be opened in a text editor, emailed around and visualised with web-browsers + stylesheets, easily
  • Even some dedicated commercial XML editors can bog down horribly when they are fed big XML documents to edit
  • Day to day XML processing tasks (such as XSLT transformations) rely on in-memory representations of the XML being built. A Xerces-J DOM takes (as an example) 7 times as much memory as the original document to represent; meaning on typical desktop machines processing documents more than a few hundred MBs is difficult or impossible.

So, when designing a model or a system look for the human-sized things (there always are some) that can be modelled to represent optimally-sized XML documents, and build around them. In our experience, users and developers will both thank you in the end.
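
To make that concrete, here is a rough sketch of carving one over-sized document into a collection of human-sized ones. Everything about the data is invented for illustration (the `employees` and `employee` element names, the sample content and the function name), and a real system would go on to store each piece in a file system or XML database.

```python
import xml.etree.ElementTree as ET

MEGA = (
    "<employees>"
    "<employee id='1'><name>Ann</name></employee>"
    "<employee id='2'><name>Bob</name></employee>"
    "</employees>"
)

def split_into_documents(xml_text, record_tag):
    """Return each child record of the root as a standalone XML document string."""
    root = ET.fromstring(xml_text)
    documents = []
    for record in root.findall(record_tag):
        record.tail = None  # drop any whitespace that followed the record
        documents.append(ET.tostring(record, encoding="unicode"))
    return documents

for document in split_into_documents(MEGA, "employee"):
    print(document)
```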

- Alex.

1 Except for poets, but that is a different story.
Tuesday, December 05, 2006 8:26:37 PM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, November 28, 2006

… so blogs Tim Bray, in comment to Elliotte Rusty Harold's piece, RELAX Wins.

There, people are finally coming out and saying it — and as long term Schema skeptics1 we're pleased to see the view expressed, if only to make clear that this topic is still open.

However, the commercial reality is not that simple, as Michael Champion outlines in an xml-dev posting. The issue is tooling. Many vendors have bet the farm on XSD becoming the schema language of choice, resulting in lots of business-critical software, like editors and databases, being firmly bound-in to XSD.

Since XSD will, in practice, be here to stay for a while (if only as a legacy technology), right now our recommendation (as blogged elsewhere) is to use RELAX NG (preferably the compact syntax) to author grammars, and then auto-generate DTDs or XSD from it using trang. As well as letting you create the models with the civilised RELAX NG compact syntax, this has the very nice side effect of making your XSD schemas 'defensive' — they won't be full of those edge features where the difficulty of interpreting the spec makes tools incompatible.

- Alex.


1 See for example this 2002 article on XSD Schemas in publishing.

Tuesday, November 28, 2006 9:40:06 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Friday, November 17, 2006

In a break with the past, ISO are beginning to make some key IT standards documents publicly and freely available. See http://www.iso.org/PubliclyAvailableStandards.

Of particular interest to XML users are the three parts of DSDL that have achieved standards status so far:

- Alex.

Friday, November 17, 2006 7:49:55 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Tuesday, October 17, 2006

While in a meeting discussing updating a Schema the other day, the question arose whether to change the Namespace of the declared elements in line with the version of the Schema itself, so version 1.0 might have a Namespace of http://example.com/myschema/1.0, version 1.1 a Namespace of http://example.com/myschema/1.1, and so on.

Some fairly well-known schemas adopt this approach (for example the CrossRef Schema), but in our experience this is a really bad idea, because when Namespaces change, developers cry. Any (properly-written) XSLT stylesheets and XML-aware programs need to be altered to be made aware of the new Namespace. And if there is a desire to support legacy versions of the schema then the existence of multiple Namespaces makes things even messier.

It's much better practice to have something in the schema itself to denote its version (a version attribute on the root element seems a popular choice), while keeping the Namespace the same. A bit like XSLT, where the Namespace remains the same as the version changes.
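
A small sketch of why this is kinder to developers: with a stable Namespace, the same lookup code serves every version, and the version attribute is there when behaviour really does need to differ. The unversioned Namespace URI, the `doc` and `title` element names and the function name are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/myschema"  # assumed: one Namespace for all versions

V1_0 = f'<doc xmlns="{NS}" version="1.0"><title>First</title></doc>'
V1_1 = f'<doc xmlns="{NS}" version="1.1"><title>Second</title></doc>'

def read_title(xml_text):
    """Return (version, title); the lookup path is the same for every version."""
    root = ET.fromstring(xml_text)
    version = root.get("version", "1.0")  # assumed default when absent
    title = root.findtext(f"{{{NS}}}title")
    return version, title

print(read_title(V1_0))
print(read_title(V1_1))
```

Had the Namespace changed between versions, the `title` lookup would have needed a second code path (or silently returned nothing) for the newer documents.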

The only reason for changing the Namespace for a Schema version is if the desire is to make sure that the existing body of processors for it break when they encounter the new version. But this is often very likely to inhibit adoption of the new version — to take a parallel example, just look what's (not) happened to XML 1.1, where the 'new version breaks existing processors' approach has severely crimped up-take.

- Alex.

Tuesday, October 17, 2006 7:47:36 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Friday, October 13, 2006
Transforming the XML that Word 2003 produces into something elegant was never going to be a doddle, but recent front-line experience of a client's Word-based XML workflow suggests it just got that bit harder.

We were taken aback to learn that a styled run of text, which was being saved as this WordML

        <w:r>
          <w:rPr>
            <w:rStyle w:val='Refjnltitle'/>
          </w:rPr>
          <w:t>NASMHPD Medical Directors Best Practice Symposium</w:t>
        </w:r>

        
was being lengthened to this monstrous markup:

        <w:r wsp:rsidRPr='00987ACD'>
          <w:rPr>
            <w:rStyle w:val='Refjnltitle'/>
          </w:rPr>
          <w:t>NASMHPD</w:t>
        </w:r>
        <w:r>
          <w:rPr>
            <w:rStyle w:val='Refjnltitle'/>
          </w:rPr>
          <w:t> </w:t>
        </w:r>
        <w:r wsp:rsidRPr='00987ACD'>
          <w:rPr>
            <w:rStyle w:val='Refjnltitle'/>
          </w:rPr>
          <w:t>Medical</w:t>
        </w:r>
        <w:r>
          <w:rPr>
            <w:rStyle w:val='Refjnltitle'/>
          </w:rPr>
          <w:t> </w:t>
        </w:r>
        <w:r wsp:rsidRPr='00987ACD'>
          <w:rPr>
            <w:rStyle w:val='Refjnltitle'/>
          </w:rPr>
          <w:t>Directors</w:t>
        </w:r>
        <!-- etc. -->

        
This has obvious consequences for the transformation process, complicating it considerably. The culprit? Well, note the wsp-prefixed attributes. They belong to a namespace with the URI http://schemas.microsoft.com/office/word/2003/wordml/sp2, introduced in SP2. Not only are these attributes undocumented by Microsoft (the only clue to their purpose is that they reference IDs in the document properties used for tracking style revisions), they are not even declared in the Word 2003 schema, making documents containing them invalid against that schema.

It's all right, though, because such documents also have this in them:

    <w:ignoreElements w:val="http://schemas.microsoft.com/office/word/2003/wordml/sp2"/>
    
to indicate that elements (what about the attributes!?) in that namespace should be ignored by Word. What a relief - as long as Word can read and render it, who cares about other users of the XML...?
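
For anyone stuck with SP2 output, one conceivable workaround is a pre-processing pass that drops the wsp attributes and glues adjacent, identically-formatted runs back together. What follows is only a sketch of that idea in Python and has not been tried on a production workflow. The function names are invented, it assumes the usual WordML 2003 namespace for the w prefix, it assumes the undocumented wsp attributes can safely be discarded, and it only merges simple runs consisting of an optional w:rPr plus a single w:t.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
WSP = "http://schemas.microsoft.com/office/word/2003/wordml/sp2"
R, RPR, T = (f"{{{W}}}{name}" for name in ("r", "rPr", "t"))

def _strip_sp2_attributes(element):
    """Remove every attribute in the SP2 (wsp) namespace."""
    for name in [n for n in element.attrib if n.startswith("{" + WSP + "}")]:
        del element.attrib[name]

def _formatting(run):
    """Comparable summary of the children of a run's w:rPr."""
    rpr = run.find(RPR)
    if rpr is None:
        return ()
    return tuple((c.tag, tuple(sorted(c.attrib.items()))) for c in rpr)

def _is_simple_run(element):
    # Only runs made of an optional w:rPr and exactly one w:t are merged.
    return (element.tag == R
            and len(element.findall(T)) == 1
            and all(child.tag in (RPR, T) for child in element))

def merge_runs(parent):
    """Merge adjacent simple runs that share formatting, in place."""
    previous = None
    for child in list(parent):
        _strip_sp2_attributes(child)
        if (previous is not None and _is_simple_run(child)
                and previous.attrib == child.attrib
                and _formatting(previous) == _formatting(child)):
            target = previous.find(T)
            target.text = (target.text or "") + (child.find(T).text or "")
            parent.remove(child)
            continue
        previous = child if _is_simple_run(child) else None
    return parent

RUN = '<w:r{attr}><w:rPr><w:rStyle w:val="Refjnltitle"/></w:rPr><w:t>{text}</w:t></w:r>'
SAMPLE = (
    f'<w:p xmlns:w="{W}" xmlns:wsp="{WSP}">'
    + RUN.format(attr=' wsp:rsidRPr="00987ACD"', text="NASMHPD")
    + RUN.format(attr="", text=" ")
    + RUN.format(attr=' wsp:rsidRPr="00987ACD"', text="Medical")
    + "</w:p>"
)

paragraph = merge_runs(ET.fromstring(SAMPLE))
print(len(paragraph.findall(R)), repr(paragraph.find(R).find(T).text))
```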

The only fix we've found so far is to save down using a pre-SP2 version of Word 2003 - not ideal for organizations which have applied SP2 en masse. We've begun warning others with similar workflows off applying SP2, and continue recommending ODF instead, as a mature and truly transparent, open standard.

- Andrew.

Friday, October 13, 2006 10:40:25 AM UTC  #    Disclaimer  |  Comments [0]  | 
 Thursday, October 12, 2006

We have released a new technical white paper setting out some of the thinking behind XMLProbe and describing its technical features.

We're also pleased to announce that a new release of XMLProbe, version 1.4, has now entered beta testing, and is scheduled for release in time for the Online Exhibition in London (Nov 28-30). Support pack subscribers will get this version 1.4 automatically when it is released.

New features in this release include:

  • reports can now be emitted as HTML directly using an in-built stylesheet
  • new <probe:for-each> element allows simpler iteration over node-sets
  • GUI with drag-and-drop validation (desktop version only)
  • is-valid-issn() extension function
  • graphics sniffing functions (check your files with GIF extensions really are GIFs).
  • preconfigured validators available for CrossRef metadata and the NLM Journal Publishing DTD 2.2
  • performance improvements
  • The CrossRef and NLM rules can be previewed using our online validation service at http://www.xmlprobe.com/

    And, as ever, we want to hear what features you want to see in upcoming releases. Please mail us with your requests.

    Thursday, October 12, 2006 7:13:48 AM UTC  #    Disclaimer  |  Comments [0]  | 
     Tuesday, October 03, 2006

    Norm Walsh has released a new beta of DocBook version 5.

    Version 5 is a significant rewrite of DocBook 4.x that "is true to the spirit of DocBook while simultaneously removing inconsistencies that have arisen as a natural consequence of DocBook's long, slow evolution".

    One interesting feature is that DocBook 5.x is normatively defined (so, written in) RELAX NG, and the DTD and W3C Schema are auto-generated from that. What a sensible idea :-)

    - Alex.

    Tuesday, October 03, 2006 2:46:13 PM UTC  #    Disclaimer  |  Comments [0]  | 
     Thursday, September 28, 2006

    For your diary …

    Thursday 30 November: STM E-Production Seminar (London, UK)

    Hear the great and the good of the STM publishing world (and me) talk about the burning issues affecting digital production of book and serial publications. One day event. €450 (STM members), €675 (non-members).

    December 5 - 7 XML 2006 (Boston, USA)

    This annual fixture in the XML calendar has a dedicated publishing track (see the programme) running over the 3 days of the conference. Boston in December? Take warm clothing! Three day conference (+ ancillary events). US$ 795 (IDEAlliance members), US$ 1,250 (non-members), US$ 275 (students).

    Saturday 20 January 2007 PLAN-X 2007 (Nice, France)

    Still at the CFP stage, this looks to be a hard-core XML programming conference which may be of interest to those pushing the envelope in publishing applications, especially if they do get submissions on topics like 'Languages and systems that can cope with XML fragments (messages) or very large XML instances (beyond main-memory size)'. Hmmm. wonder if I should be submitting an update on our frozen streams work. Pricing TBA.

    - Alex.
    Thursday, September 28, 2006 12:48:53 PM UTC  #    Disclaimer  |  Comments [0]  | 

    Well, it is if you're using it to describe a programming language, says Allen Holub in his article, Just Say No to XML.

    Following a number of questionable assertions (the ‘original design’ of XML was ‘as a data-description language’ - really?), Holub gets the bit between his teeth, and lays into ‘dilletante’ ‘so-called’ XML programmers who don't have the knowledge to justify calling themselves ‘professional’. This ‘need-to-know’ stuff includes:

    • ‘a deep understanding of data structures and key algorithms’
    • ‘[knowing] how the hardware works’
    • ‘knowing how to build a compiler’.