Validate for Machines, not Humans

Mark Baker misses an important distinction in “Validation Considered Harmful” when he writes:

“Today’s sacred cow is document validation, such as is performed by technologies such as DTDs, and more recently XML Schema and RelaxNG.

Surprisingly though, we’re not picking on any one particular validation technology. XML Schema has been getting its fair share of bad press, and rightly so, but for different reasons than we’re going to talk about here. We believe that virtually all forms of validation, as commonly practiced, are harmful; an anathema to use at Web scale.”

Dare Obasanjo replied in “Versioning does not make validation irrelevant“:

“Let’s say we have a purchase order format which in v1 has an element which can have a value of "U.S. dollars" or "Canadian dollars" then in v2 we now support any valid currency. What happens if a v2 document is sent to a v1 client? Is it a good idea for such a client to muddle along even though it can't handle the specified currency format?”

to which Mark replied:

“No, of course not. As I say later in the post; ‘rule of thumb for software is to defer checking extension fields or values until you can’t any longer'”

The most important question is whether the data ends up with a human, or with software – either stored in a database for possible later retrieval, or used to generate a reply message without human intervention. Humans can make sense of unexpected data: when they see “Euros” where “EUR” was expected, they will understand. There, validating as little as possible makes sense.

When software does all the processing, stricter validation is necessary – trying to make software ‘intelligent’ by enabling it to process (not just store, but process) as-yet-unknown format deviations is a road to sure disaster. So in that case strict validation makes a lot of sense: we accept “EUR” and “USD”, not “Euros”. And if we do that, the best thing for two parties who exchange anything is to make those agreements explicit in a schema. If we “defer checking extension fields or values until you can’t any longer”, we end up with some application’s error message deep in the pipeline. You don’t want to return that to the partner who sent you the message – you want to return “Your message does not validate against our agreed-upon schema”, so they know what to fix (though sometimes you’ll want your own people to look at it first, depending on the business case).
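A minimal sketch of this kind of boundary validation, in Python. The message shape, field name, and error text are hypothetical – the point is that the agreed-upon codes are checked up front, and the rejection names the agreement rather than surfacing some downstream application error:

```python
# Hypothetical machine-to-machine order message, validated at the boundary.
ACCEPTED_CURRENCIES = {"EUR", "USD"}  # the agreed codes, not free text like "Euros"

def validate_order(message: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the message is accepted."""
    errors = []
    currency = message.get("currency")
    if currency not in ACCEPTED_CURRENCIES:
        # Reject here with a message the sender can act on, instead of letting
        # "Euros" travel on and break an application deep in the pipeline.
        errors.append(
            f"currency {currency!r} not in agreed set {sorted(ACCEPTED_CURRENCIES)}"
        )
    return errors

print(validate_order({"currency": "EUR"}))    # []
print(validate_order({"currency": "Euros"}))  # one explicit, actionable error
```

In practice the check would be generated from the schema itself, but the shape of the reply – “your message violates our agreement, here is how” – is the same.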

Of course one should not include unnecessary constraints in schemas – but whether humans or machines will process the message is central in deciding what to validate and what not.

Another point is what to validate – values in content, or structure – and Uche Ogbuji realistically adds:

“Most forms of XML validation do us disservice by making us nit-pick every detail of what we can live with, rather than letting us make brief declarations of what we cannot live without.”

Yes, XML Schema and its peers impose structural requirements that add unnecessary constraints. Unexpected elements can often simply be ignored, and ignoring them enhances flexibility.
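Ogbuji’s “declare what you cannot live without” can be sketched with Python’s standard library: require the few elements the software actually needs, and silently ignore anything extra. The element names here are hypothetical:

```python
import xml.etree.ElementTree as ET

# Only these child elements are things we "cannot live without".
REQUIRED = {"customer", "quantity"}

def missing_required(xml_text: str) -> set:
    """Return the set of required child elements that are missing from the root."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    return REQUIRED - present  # unexpected extra elements are simply ignored

order = ("<order><customer>ACME</customer>"
         "<quantity>3</quantity><note>rush</note></order>")
print(missing_required(order))  # set() — the unexpected <note> does not fail validation
```

A full structural schema would have rejected the unknown `<note>` element; this check only fails when something indispensable is absent.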

4 Replies to “Validate for Machines, not Humans”

  1. That’s a perfect example of a constraint that should not be there – and I agree with you one should not include validations which make no business sense, and there are way too many of those.

    However, in this example, if we accept only ordered quantities > 0, like we probably should, it makes a lot of sense to define quantity as xsd:positiveInteger, not xsd:integer, and it makes sense to lay this rule down in a schema and reject incoming messages which violate it. If my customer orders minus one pencils, his software must be broken. The schema describes what I’m willing to accept, and I should be generous, but not that generous. What my apps down the line can or cannot accept (not -1 in this case) is no information for my customers, therefore I need a schema for this – or prose, but just prose seems a step back.

  2. Agreed, a positive integer would be a far more reasonable constraint because it’s less apt to change over time. But schema users haven’t, IMO, gotten this point, and most schemas I’ve seen have the most brittle of assumptions built in (heck, that example is from the XSD primer where you’d expect to see best practice practiced!!). Besides, the schema specs largely limit extensibility by default, meaning that it takes a guru to develop a decent schema. So I say we just junk the practice of only processing “valid” documents (as my latest post explains in more detail), and let the determination of obvious constraints (such as positiveInteger) be done by the software responsible for processing that value.
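The quantity rule these replies discuss could be declared in a schema roughly as follows – a sketch only, with a hypothetical element name:

```xml
<!-- Quantity must be a whole number greater than zero;
     an order for -1 pencils is rejected at the boundary. -->
<xs:element name="quantity" type="xs:positiveInteger"/>
```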

Comments are closed.