Validation Considered Essential

I just ran into a disaster scenario which Mark Baker recently described as the way things should be: a new message exchange without schema validation. He writes: “If the message can be understood, then it should be processed” and in a comment “I say we just junk the practice of only processing ‘valid’ documents … and let the determination of obvious constraints … be done by the software responsible for processing that value.” I’ll show this is unworkable, undesirable and impossible (in that order).

I’ve got an application out there which reads XML sent to my customer. The XML format is terrible, and old – it predates XML Schema. So there is no schema, just an Excel file with “AN…10” style descriptions and value lists. It is built into my software, and works pretty well – my code does the validation, and the incoming files are always processed fine.

Now a second party is going to send XML in the same format. Since there is no schema, we started testing in the obvious way – entering data on their website, exporting it to XML, sending it over, importing it into my app, seeing what goes wrong, fixing it, and starting over. We have had an awful lot of those cycles so far, and no error-proof XML yet. Given a common schema, we could have had a decent start in the first cycle. Check unworkable.

So I wrote a RelaxNG schema for the XML. It turned out there were hidden errors which my software did not notice. For instance, there is a code field, and if it has some specific value, such as ‘D7’, my customer must be alerted immediately. My code checks for ‘D7’ and alerts the users if it comes in. The new sender sent ‘d7’ instead. My software did not see the ‘D7’ code and gave no signal. I wouldn’t have caught this so easily without the schema – I would have caught it in the final test rounds, but it is so much easier to catch those errors early, which schemas can do. Check undesirable.
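To make the ‘d7’ case concrete, here is a minimal Python sketch of the difference between processing logic alone and a schema-style check against the agreed value list. It is not my actual code: the names `ALLOWED_CODES`, `is_urgent`, `validate_code` and `process` are made up for illustration, and the codes other than ‘D7’ are invented.

```python
# Hypothetical value list standing in for the one in the Excel description.
# Only 'D7' comes from the story; the other codes are invented.
ALLOWED_CODES = {"A1", "B4", "D7"}
URGENT_CODE = "D7"


def is_urgent(code: str) -> bool:
    """Processing logic only: an exact match on the urgent code."""
    return code == URGENT_CODE


def validate_code(code: str) -> None:
    """Schema-style check: reject anything outside the agreed value list."""
    if code not in ALLOWED_CODES:
        raise ValueError(f"code {code!r} is not in the agreed value list")


def process(code: str, validate: bool) -> bool:
    """Return True if the customer must be alerted."""
    if validate:
        validate_code(code)
    return is_urgent(code)
```

Without validation, `process("d7", validate=False)` quietly returns `False` and nobody is alerted. With validation switched on, the same value raises a `ValueError` in the first test cycle, which is what the schema did for me.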

Next look at an element with value ‘01022007’. According to Mark, if it can be understood, it should be processed. And indeed I can enter ‘Feb 1, 2007’ in the database. Or did the programmer serialize the American MM/DD/YYYY format as MMDDYYYY and is it ‘Jan 2, 2007’? Look at the value ‘100,001’ – perfectly understandable, one hundred thousand and one – or is this one hundred and one thousandth, with a decimal comma? Questions like that may not be common in an American context, but in Europe they arise all the time – on the continent we use DD-MM-YYYY dates and decimal commas, but given the amount of American software, MM/DD/YYYY dates and decimal points occur everywhere. The point is the values can apparently be understood, but aren’t in fact. One cannot catch those errors with processing logic because the values are perfectly acceptable to the software. Check impossible.
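The ambiguity is easy to reproduce. The sketch below, using only the Python standard library, parses the two values from the paragraph above under both conventions; the function names are hypothetical. Every call succeeds, so the software has nothing to complain about.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_date_both_ways(raw: str) -> tuple[date, date]:
    """Parse an 8-digit date as DDMMYYYY and as MMDDYYYY."""
    continental = datetime.strptime(raw, "%d%m%Y").date()
    american = datetime.strptime(raw, "%m%d%Y").date()
    return continental, american


def parse_number_both_ways(raw: str) -> tuple[Decimal, Decimal]:
    """Read the comma as a thousands separator and as a decimal comma."""
    thousands_separator = Decimal(raw.replace(",", ""))
    decimal_comma = Decimal(raw.replace(",", "."))
    return thousands_separator, decimal_comma


continental, american = parse_date_both_ways("01022007")
print(continental, american)   # 2007-02-01 2007-01-02

large, small = parse_number_both_ways("100,001")
print(large, small)            # 100001 100.001
```

Both readings are valid values of the right type, so no amount of type checking in the receiving code can tell which one the sender meant. Only an agreement on the format settles it.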

In exchanges, making agreements and checking if those agreements are met is essential. Schemas wouldn’t always catch the third kind of errors either, but they provide a way to avoid the misinterpretations. The schema is the common agreement – unless one prefers to fall back to prose – and once we have it, not using it for validation seems pointless. Mark Baker makes some good points on over-restrictive validation rules, but throws out the baby with the bathwater.

5 Replies to “Validation Considered Essential”

  1. Thanks for the response. To address each of your 3 examples …

    Unworkable – you need a data model to address that problem, not a schema. While a schema can help in specifying one, it is neither necessary nor sufficient.

    Undesirable – I don’t understand. Why is it undesirable to have your software do a case-insensitive check? I hope you’re doing this anyway, even if you’re also using a schema, because software should generally validate inputs (hence my point about encapsulation).

    Impossible – I agree that explicit data typing is often a good idea in examples like that, but again, a schema is neither necessary nor sufficient to do that. e.g. rdf:datatype.

  2. Actually I’m quite sympathetic to the idea of using some other form of datatyping for exchanges. I once got terribly fed up with handcrafting schemas for data which – usually – comes from a relational db and ends up in one. So I got the idea it should be possible to define any rdb-to-rdb exchange with SQL DDL (CREATE TABLE etc.) and a solid set of deserialization rules from XML into relational tables. Parties then simply exchange the DDL, create the exchange db and send XML. Further processing into the target db can be done with the favorite local tools. Basically this is delegating the datatyping to SQL DDL. The project got stuck in the details, which can be muddy, but maybe I’ll work out the details sometime and see if I can get it published.

    But for now there is no commonly accepted way to specify data models and data types for XML other than schemas.

  3. You can even use a schema initially if you want, if that is really your lingua franca, but your real problem is getting the other party to *tell you* what the format is! This problem you are describing has nothing to do with validation vs no validation.

  4. Ben: “your real problem is getting the other party to *tell you* what the format is”

    This was my day job for years (sometimes still is), and yes, that can be a huge problem.

    However, I described a situation where *I* have to tell the other party what data is allowed in, and then to validate or not to validate is a choice for me – validate, I’d say.

  5. My point is that the title “Validation Considered Essential” does not seem to relate to what you ended up writing about. Using XML validation is your choice, but that seems to be beside the point.

    And what about this story involves you telling them what data is allowed in? It seems like you are entering data on their site to get sample data sets to practice processing. It certainly doesn’t sound like you are handing them a schema that they have to conform to, on the contrary, you are being subjected to whatever format they produce.

    I took your message to be that if they had given you an XSD your life would have been easier. I can agree with that and I would say use it as a guide to implementing your processing, but don’t actually validate their documents against it, that would have no purpose. If the XSD gets out of sync with theirs it can serve as an early warning system, but you will still have to simply disable the validation and make do with what they are giving you.
