I just ran into a disaster scenario which Mark Baker recently described as the way things should be: a new message exchange without schema validation. He writes: “If the message can be understood, then it should be processed” and in a comment “I say we just junk the practice of only processing ‘valid’ documents … and let the determination of obvious constraints … be done by the software responsible for processing that value.” I’ll show this is unworkable, undesirable and impossible (in that order).
I’ve got an application out there which reads XML sent to my customer. The XML format is terrible, and old – it predates XML Schema. So there is no schema, just an Excel file with “AN…10” style descriptions and value lists. It is built into my software, and works pretty well – my code does the validation, and the incoming files are always processed fine.
Now a second party is going to send XML in the same format. Since there is no schema, we started testing in the obvious way – entering data on their website, exporting to XML, send it over, import in my app, see what goes wrong, fix, start over. We have had an awful lot of those cycles so far, and no error-proof XML yet. Given a common schema, we could have a decent start the first cycle. Check unworkable.
So I wrote a RelaxNG schema for the XML. It turned out there where hidden errors which my software did not notice. For instance, there is a code field, and if it has some specific value, such as ‘D7’, my customer must be alerted immediately. My code checks for ‘D7’ and alerts the users if it comes in. The new sender sent ‘d7’ instead. My software did not see the ‘D7’ code and gave no signal. I wouldn’t have caught this so easily without the schema – I would have caught it in the final test rounds, but it is so much easier to catch those errors early, which schema’s can do. Check undesirable.
Next look at an element with value ‘01022007’. According to Mark, if it can be understood, it should be processed. And indeed I can enter ‘Feb 1, 2007’ in the database. Or did the programmer serialize the American MM/DD/YYYY format as MMDDYYYY and is it ‘Jan 2, 2007’? Look at the value ‘100,001’ – perfectly understandable, one hundred thousand and one hundred – or is this one hundred and one thousandth, with a decimal comma? Questions like that may not be common in an American context, but in Europe they arise all the time – on the continent we use DD-MM-YYYY dates and decimal comma’s, but given the amount of American software MM/DD/YYYY dates and decimal points occur everywhere. The point is the values can apparently be understood, but aren’t in fact. One cannot catch those errors with processing logic because the values are perfectly acceptable to the software. Check impossible.
In exchanges, making agreements and checking if those agreements are met is essential. Schema’s wouldn’t always catch the third kind of errors either, but they provide a way to avoid the misinterpretations. The schema is the common agreement – unless one prefers to fall back to prose – and once we have it, not using it for validation seems pointless. Mark Baker makes some good points on over-restrictive validation rules, but throws out the baby with the bathwater.