What’s wrong with XML?

I’ve uploaded my Dutch article ‘Wat is er mis met XML?’ on the many well-known shortcomings of XML: namespaces, DOM, canonicalization, XML Schema and, in general, complexity and counter-intuitiveness. Entertaining, and necessary since I’m more and more often accused of being an XML evangelist; at best, XML is sometimes the least evil of the possible alternatives.


6 Responses to What’s wrong with XML?

  1. Some comments on your XML rant:

    “After 10 years the XML community still has not settled whether or not the URI that defines a Namespace should point to a resource (e.g. a Schema or RDF file, or an HTML page) that says something meaningful about that Namespace. Confusion all round for the (new) user, since everyone associates a URI with a web page.”

    The I in URI means ‘identifier’ in the broad sense of the word. That people usually dereference URIs does not mean you cannot use them as identifiers.

    “After this hurdle comes the abbreviated notation, the prefix, of a Namespace. No attempt was ever made to register these, as is done for Media Types, for example. The result: the prefix for XML Schema can be (and all of these occur regularly) xs:, or xsd:, or wxs:, or even something made up like kul:, and worse, we can perfectly well use xs: as the prefix for XSLT elements.”

    This is a feature; a requirement for registration of prefixes places an undesirable hurdle on the author for creating his own namespace. Also, convenient short prefixes would very quickly run out.

    “So no programmer can say anything meaningful about an element without first looking up the prefix in the namespace declaration, reading the entire URI there (who knows the XML Schema URI by heart?) and checking whether it really is XML Schema that is being dealt with.”

    You’re exaggerating greatly here; the namespace URI for XML Schema has XML Schema in it, and not XSLT. There is no need to memorise anything. Also, from the context (element names, etc.) it should immediately be clear whether you are dealing with an XML Schema or XSLT document, even without looking up the prefix mapping.
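
    The point about prefixes versus URIs can be illustrated with a minimal sketch, assuming Python's standard xml.etree.ElementTree parser. The two sample documents and the helper name expanded_names are made up for illustration; only the XML Schema namespace URI is real. A namespace-aware parser discards the prefix and reports the URI, so xs: and kul: yield identical element names:

```python
import xml.etree.ElementTree as ET

# The same schema fragment written with two different prefixes (illustrative samples).
doc_xs = (
    '<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">'
    '<xs:element name="a"/></xs:schema>'
)
doc_kul = (
    '<kul:schema xmlns:kul="http://www.w3.org/2001/XMLSchema">'
    '<kul:element name="a"/></kul:schema>'
)


def expanded_names(xml_text):
    """Return the {namespace-URI}local-name of every element, in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


names_xs = expanded_names(doc_xs)
names_kul = expanded_names(doc_kul)

# The prefix is gone after parsing; only the URI and local name remain.
print(names_xs)
print(names_xs == names_kul)
```

    Software never has to guess from the prefix; the ambiguity discussed here only affects a human reading the raw markup.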

    “As if this mess were not bad enough, it is possible to declare namespaces anywhere in a document. Especially when a default Namespace is used (that is, a Namespace without a prefix), looking for a Namespace declaration is like searching a haystack.”

    Did you want XML-in-XML wrapping or not? This facilitates that (partially). And even though this is possible, most of the time the namespaces are all declared on the root element, so you will not have to consider this. This is true at least for most of our documents at work, even though we have no internal guideline that requires it.

    “And be honest: who can tell at first glance whether an attribute is in a Namespace or not?”

    The fact that attributes are in the null namespace is definitely unintuitive at first and an often-made mistake. I remember well when I found this out, and have often had to explain it to others. But I do not see how this could be different; otherwise you would need to prefix every attribute like so: <xhtml:div xhtml:class="asdf">, which would probably be even harder for authors to grasp. At the moment, for most document authors and consumers the namespace of attributes is a detail that they do not need to know about.
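
    A small sketch of the attribute rule, again assuming Python's xml.etree.ElementTree. The element and attribute values are illustrative, and the x: prefix is an arbitrary choice. A default namespace applies to the element but not to its unprefixed attributes; only an explicitly prefixed attribute ends up in a namespace:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Default namespace on the element, unprefixed attribute.
unprefixed = ET.fromstring('<div xmlns="http://www.w3.org/1999/xhtml" class="asdf"/>')

# Same element, but with the attribute explicitly prefixed.
prefixed = ET.fromstring(
    '<x:div xmlns:x="http://www.w3.org/1999/xhtml" x:class="asdf"/>'
)

# The element is in the XHTML namespace in both cases.
print(unprefixed.tag, prefixed.tag)

# The unprefixed attribute has no namespace; the prefixed one does.
print(unprefixed.attrib)
print(prefixed.attrib)
```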

    “But it gets even worse. […] Against all conventions, what is between quotation marks is significant.”

    QNames in content is an age-old debate that has both pros and cons. Also, I would say this is a choice made by individual languages based on XML (XSLT, etc.), not by XML and XML Namespaces themselves.

    Wrt XML Schema, it is definitely pretty convoluted. I guess the reason you cannot express certain common things that are seemingly simple (e.g. ‘this attribute (X)OR this child element’) is some design decisions wrt computability and performance that I don’t fully understand. That is the only explanation I can think of. XML Schema Part 2: Datatypes is pretty decent though.

    Oh, and your JSON example doesn’t facilitate any kind of distributed extensibility. And it needs to be normalised as well to be properly compared (e.g. “a” and “\u0061” are equivalent but have different serialisation).
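
    The normalisation point can be checked with a minimal sketch using Python's standard json module. The key name "key" and both sample strings are made up for illustration. Two serialisations that differ in escaping and whitespace decode to the same value, so comparing the raw text gives the wrong answer:

```python
import json

# Two serialisations of the same data: one literal, one using a \u escape
# and no whitespace after the colon.
plain_text = '{"key": "a"}'
escaped_text = '{"key":"\\u0061"}'

plain_value = json.loads(plain_text)
escaped_value = json.loads(escaped_text)

# Equal as data, different as text.
print(plain_value == escaped_value)
print(plain_text == escaped_text)
```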

    If I look at the parts of XML you criticise, most of them JSON does not provide either, except for an easy object model. So er, JSON fail, too? :)

  2. * otherwise you would need to prefix every attribute like so <xhtml:div xhtml:class="asdf">

  3. Oh actually, the fact that JSON allows for different variations of white space alone causes it to need canonicalization. I think it will be hard to find a (readable, so Mork doesn’t count :)) text-based format that doesn’t require it, really.

    Also, the new DOM Element Traversal spec (http://www.w3.org/TR/ElementTraversal/) will make looping over elements while ignoring non-element nodes easier. Unfortunately there does not seem to be a getChildElements(‘nn’) method yet.

    It would be nice if E4X was supported by more browsers… Because that is really the biggest advantage JSON has over XML in JavaScript; it is a language primitive.

  4. @laurens:

    “a requirement for registration of …namespace, MdG… prefixes places an undesirable hurdle on the author for creating his own namespace. Also, convenient short prefixes would very quickly run out”

    This is easily solved: if a namespace prefix is declared locally in a doc, it overrides registered namespaces; if not, the prefix must be registered… alternatively, the W3C could have claimed all x* namespace prefixes.

    “You’re exaggerating greatly here…”

    That was the fun part :-)

    “QNames in content is an age-old debate that has both pros and cons…”

    It’s an age-old mess… XML NS should have laid down rules for where namespaces are allowed and where not; the way it is now is simply counterintuitive.

    True, XSD part 2 is better and much easier to grasp than part 1.

    And yes, JSON has canonicalization issues too, and you’re right, probably any serious text-based format will. But a c14n proposal for JSON can say: “Whitespace is not permitted between tokens. Leading and trailing whitespace is likewise disallowed.” (see: http://wiki.laptop.org/go/Canonical_JSON). A data-oriented format can say that. A document-oriented format such as XML cannot, due to mixed content.
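
    A rough sketch of why that whitespace rule works for JSON but not for XML, using Python's json and xml.etree.ElementTree modules. The helper canonical_sketch is hypothetical and is not the Canonical JSON algorithm from the linked page; sorting the keys is an extra assumption added here so that two equal objects produce identical text:

```python
import json
import xml.etree.ElementTree as ET


def canonical_sketch(value):
    """Serialise with no whitespace between tokens (hypothetical helper).

    sort_keys is an assumption of this sketch, not part of the quoted rule.
    """
    return json.dumps(value, separators=(",", ":"), sort_keys=True)


# Same data, different whitespace and key order.
loose = json.loads('{ "b" : 1,\n  "a" : [ 1, 2 ] }')
tight = json.loads('{"a":[1,2],"b":1}')
print(canonical_sketch(loose) == canonical_sketch(tight))

# In mixed content the whitespace between tags is part of the text itself,
# so a canonical form cannot simply drop it.
paragraph = ET.fromstring("<p>a <em>b</em> c</p>")
with_spaces = "".join(paragraph.itertext())
without_spaces = "".join(piece.strip() for piece in paragraph.itertext())
print(repr(with_spaces), repr(without_spaces))
```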

    “If I look at the parts of XML you criticise, most of them JSON does not provide either, except for an easy object model. So er, JSON fail, too? :)”

    JSON isn’t as extensible as XML: it doesn’t have namespaces. There’s no schema language. There’s not as much tool support. I think JSON has its sweet spot in simple data exchange models, where one party controls the model of the data exchanged. In that area though, JSON is much better: no namespaces, simple format, no mixed content, easy object model. For the harder jobs, in spite of my article, I’d still choose XML anytime. For lack of an even better alternative…

  5. “This is easily solved: if a namespace prefix is declared locally in a doc, it overrides registered namespaces; if not, the prefix must be registered… alternatively, w3c could have claimed all x* namespace prefixes.”

    But that makes non-registered namespaces second-class citizens… I’m not sure that is a desirable situation.

    Anyway, yeah I guess I agree with the basic point; XML is not perfect. Then again, for most of the things you sum up, I can see the rationale for why they were done the way they are.

    Such as the basic syntax of XML; it does force the language and its extensions into a form that is probably not optimal or most user-friendly (e.g. namespace syntax, end tags, the need for shorthands, DTDs). Yet the decision to base XML on SGML has given the language an early boost in popularity, contributing to its success.

    Or the DOM syntax; dropping text nodes is not possible because markup languages such as XHTML depend on it (e.g. <em>x</em> <strong>y</strong> would show “xy” instead of “x y” without it). And not-exactly-short property and method names were chosen to minimise conflicts with existing object properties (such as those of ‘DOM 0’).
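
    The text-node argument can be reproduced with a minimal sketch, assuming Python's xml.dom.minidom as a stand-in for a browser DOM. The wrapping <p> element and the helpers text_content and child_elements are hypothetical, added only to make the fragment parseable and the comparison explicit:

```python
from xml.dom import minidom

# The fragment from the comment, wrapped in a <p> so it is a well-formed document.
document = minidom.parseString("<p><em>x</em> <strong>y</strong></p>")
paragraph = document.documentElement


def text_content(node):
    """Concatenate all text below a node, text nodes included."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_content(child) for child in node.childNodes)


def child_elements(node):
    """Return only the element children, skipping text nodes."""
    return [child for child in node.childNodes if child.nodeType == child.ELEMENT_NODE]


# Three children: <em>, a text node holding the single space, and <strong>.
node_types = [child.nodeType for child in paragraph.childNodes]

full_text = text_content(paragraph)
elements_only_text = "".join(text_content(e) for e in child_elements(paragraph))
print(node_types, repr(full_text), repr(elements_only_text))
```

    Skipping text nodes is convenient for data-style documents, but here it silently drops the space between the two words.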

    Or why attributes are in the null namespace as I mentioned earlier.

    Etc. etc.

  6. However, the fact that JSON is seeing so much traction does show that there is some hope. I mean, even in the sorry state JSON is in, just 1 advantage (easy DOM access) has made it so popular in such a short amount of time. So it is possible to overthrow XML with something better.

    Something designed with distributed extensibility in mind from the ground up (unlike XML and JSON). Something with a simple object model, yet powerful enough to express complex documents.

    It needs to be introduced cleverly though, probably as something that can piggy-back on the success of an often-used web language :).
