The #referent Convention

Update: I learned from the TAG list that Dan Connolly already proposed using #it or #this for the same purpose, and Tim Berners-Lee proposed using #i to refer to oneself in a similar way. My idea therefore was not very original, and since I regularly read the TAG list and similar sources, it’s even possible I read the idea somewhere and (much) later thought of it as one of my own – though if this happened it was certainly unintentional.

There is a very simple solution to the entire hash-versus-slash debate: whenever you want to identify anything with a hashless URI, suffix it with #referent. The meaning of x#referent is: whatever x is about. And x itself is simply an information resource (which is about x#referent).
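
To make this concrete, here is a minimal sketch in Python with rdflib – the triples and the FOAF terms are my illustration, not part of any proposal:

    # A minimal sketch of the #referent convention as RDF triples (rdflib).
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import FOAF, RDF

    page = URIRef("https://www.marcdegraauw.com/marcdegraauw.html")
    person = URIRef("https://www.marcdegraauw.com/marcdegraauw.html#referent")

    g = Graph()
    # x#referent is the person the page is about ...
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal("Marc de Graauw")))
    # ... and x itself is just an information resource about x#referent.
    g.add((page, RDF.type, FOAF.Document))
    g.add((page, FOAF.primaryTopic, person))

    print(g.serialize(format="turtle"))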

The httpRange-14 debate is about what hashless URIs (without a #) refer to: can they refer only to documents (information resources), or to anything at all – persons, cars, concepts? Is it meaningful to say https://www.marcdegraauw.com/marcdegraauw/ refers to ‘Marc de Graauw’? Or does it then identify both a web page and a person, and is that meaningful and/or desirable?

Hash URIs aren’t thought to be much of a problem in this respect. They do have some drawbacks, however. It may be desirable to retrieve an entire information resource which describes what the referent of the URI is. And since everything before the # must be retrievable as a single resource, putting many identifiers behind one base URI makes that file large. Norman Walsh ran into this: http://norman.walsh.name/knows/who#norman-walsh identifies Norman Walsh, and the ‘who’ file got big. So Norm switched to hashless URIs: http://norman.walsh.name/knows/who/norman-walsh identifies Norman Walsh. The httpRange-14 solution requires Norm to answer a GET on this URI with a 303 redirect, in this case to http://norman.walsh.name/knows/who/norman-walsh.html, which does not identify Norman, but simply is an information resource.
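
For comparison, this is roughly what the 303 dance looks like from the client side – a sketch in Python with the third-party requests library, assuming the server still answers as httpRange-14 prescribes:

    # Sketch: how a client experiences the httpRange-14 303 redirect.
    import requests

    # Ask for the thing itself (a person) ...
    resp = requests.get("http://norman.walsh.name/knows/who/norman-walsh",
                        allow_redirects=False)

    if resp.status_code == 303:
        # ... and get redirected to a document *about* the thing,
        # e.g. http://norman.walsh.name/knows/who/norman-walsh.html
        print("303 See Other ->", resp.headers["Location"])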

If we use the #referent convention, I can say: https://www.marcdegraauw.com/marcdegraauw.html#referent identifies me. And https://www.marcdegraauw.com/marcdegraauw.html is simply an information resource, which is about me. Problem solved.

If I put https://www.marcdegraauw.com/marcdegraauw.html#referent in a browser, I will simply get the entire https://www.marcdegraauw.com/marcdegraauw.html resource, which is a human-readable resource about https://www.marcdegraauw.com/marcdegraauw.html#referent. Semantic Web software which understands the #referent convention will know that https://www.marcdegraauw.com/marcdegraauw.html#referent refers to a non-information resource (except when web pages are about other web pages) and that https://www.marcdegraauw.com/marcdegraauw.html is simply an information resource. Chances of collision with existing #referent fragment identifiers are very small (apart from Semantic Web jokers who do this intentionally), and even when a collision does occur it seems pretty harmless. The only thing the #referent convention does not solve is all the existing hashless URIs out there which (are purported to) identify non-information resources.
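
Software that understands the convention needs nothing beyond the fragment handling it already has. A sketch of the rule in plain Python (the function name and return shape are mine):

    # Sketch: all that #referent-aware software has to know.
    from urllib.parse import urldefrag

    def interpret(uri):
        base, fragment = urldefrag(uri)
        if fragment == "referent":
            # The URI denotes whatever the base resource is about;
            # the base itself is the information resource to retrieve.
            return {"denotes": "the referent of " + base, "retrieve": base}
        return {"denotes": uri, "retrieve": base}

    print(interpret("https://www.marcdegraauw.com/marcdegraauw.html#referent"))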

In Semantic Web architecture, there is no need ever for hashless URIs. The #referent convention is easier, is more explicit about what is meant, and retrieves a nice descriptive human-readable information resource in a browser, along with all necessary RDF metadata for Semantic Web applications.

Validation Considered Essential

I just ran into a disaster scenario which Mark Baker recently described as the way things should be: a new message exchange without schema validation. He writes: “If the message can be understood, then it should be processed” and in a comment “I say we just junk the practice of only processing ‘valid’ documents … and let the determination of obvious constraints … be done by the software responsible for processing that value.” I’ll show this is unworkable, undesirable and impossible (in that order).

I’ve got an application out there which reads XML sent to my customer. The XML format is terrible, and old – it predates XML Schema. So there is no schema, just an Excel file with “AN…10” style descriptions and value lists. Those rules are built into my software, and this works pretty well – my code does the validation, and the incoming files are always processed fine.
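
For readers who have never seen such specs: “AN…10” means an alphanumeric field of at most 10 characters. Hand-rolled checks of that kind might look like this – a sketch, not my actual code, with an invented value list:

    # Sketch: hand-rolled validation of an "AN..10" field plus a value list,
    # the kind of logic the Excel file forces into application code.
    import re

    CODE_LIST = {"D7", "A1", "B2"}  # invented values

    def check_an10(value):
        return bool(re.fullmatch(r"[A-Za-z0-9]{1,10}", value))

    def check_code(value):
        return value in CODE_LIST

    assert check_an10("ABC123")
    assert not check_an10("this is way too long")
    assert check_code("D7")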

Now a second party is going to send XML in the same format. Since there is no schema, we started testing in the obvious way – enter data on their website, export to XML, send it over, import into my app, see what goes wrong, fix, start over. We have had an awful lot of those cycles so far, and no error-proof XML yet. Given a common schema, we could have had a decent start in the first cycle. Check unworkable.

So I wrote a RelaxNG schema for the XML. It turned out there were hidden errors which my software did not notice. For instance, there is a code field, and if it has some specific value, such as ‘D7’, my customer must be alerted immediately. My code checks for ‘D7’ and alerts the users if it comes in. The new sender sent ‘d7’ instead. My software did not see the ‘D7’ code and gave no signal. I wouldn’t have caught this so easily without the schema – I would have caught it in the final test rounds, but it is so much easier to catch those errors early, and schemas make that possible. Check undesirable.
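
A schema catches this kind of thing immediately. A sketch with the third-party lxml library and a RelaxNG enumeration – element names invented, ‘D7’ as in the story:

    # Sketch: a RelaxNG enumeration rejects the lowercase 'd7' up front.
    from io import StringIO
    from lxml import etree

    rng = etree.RelaxNG(etree.parse(StringIO("""
    <element name="message" xmlns="http://relaxng.org/ns/structure/1.0">
      <element name="code">
        <choice>
          <value>D7</value>
          <value>A1</value>
        </choice>
      </element>
    </element>
    """)))

    good = etree.fromstring("<message><code>D7</code></message>")
    bad = etree.fromstring("<message><code>d7</code></message>")

    print(rng.validate(good))  # True
    print(rng.validate(bad))   # False - caught before it slips through silently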

Next, look at an element with value ‘01022007’. According to Mark, if it can be understood, it should be processed. And indeed I can enter ‘Feb 1, 2007’ in the database. Or did the programmer serialize the American MM/DD/YYYY format as MMDDYYYY, and is it ‘Jan 2, 2007’? Look at the value ‘100,001’ – perfectly understandable, one hundred thousand and one – or is it one hundred and one thousandth, with a decimal comma? Questions like that may not be common in an American context, but in Europe they arise all the time – on the continent we use DD-MM-YYYY dates and decimal commas, but given the amount of American software, MM/DD/YYYY dates and decimal points occur everywhere. The point is that the values apparently can be understood, but in fact aren’t. One cannot catch those errors with processing logic, because the values are perfectly acceptable to the software. Check impossible.
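
The ambiguity is easy to demonstrate: the same bytes parse cleanly under either convention, so no amount of processing logic will flag them – a sketch, with Python’s strptime standing in for whatever the real software does:

    # Sketch: '01022007' is valid under both date conventions, and '100,001'
    # under both number conventions - neither parser raises an error.
    from datetime import datetime

    raw = "01022007"
    print(datetime.strptime(raw, "%d%m%Y").date())  # 2007-02-01 (European)
    print(datetime.strptime(raw, "%m%d%Y").date())  # 2007-01-02 (American)

    amount = "100,001"
    print(float(amount.replace(",", "")))   # 100001.0 (thousands separator)
    print(float(amount.replace(",", ".")))  # 100.001  (decimal comma)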

In exchanges, making agreements and checking that those agreements are met is essential. Schemas wouldn’t always catch the third kind of error either, but they provide a way to avoid such misinterpretations. The schema is the common agreement – unless one prefers to fall back on prose – and once we have it, not using it for validation seems pointless. Mark Baker makes some good points about over-restrictive validation rules, but throws out the baby with the bathwater.