To ZIB or not to ZIB

Dutch healthcare information is moving rapidly towards ZorgInformatieBouwstenen (health and care information models): the ZIBs. The idea is simple: all healthcare information that qualifies for reuse is recorded once, unambiguously, at the source, using the same ZIBs. That way everything only needs to be entered once. The ZIBs are then exchanged, and everyone works with the same data.

Reality is more unruly. To be clear: I think the ZIBs are a great idea, and this is certainly the direction healthcare IT needs to go. But there is often quite a gap between the ZIBs and reality. A few examples.

Smoking

There is a ZIB 'Tabakgebruik' (tobacco use):

Tabakgebruik (zibs.nl, CC BY-ND)

It allows smoking behaviour to be recorded in detail. "SoortTabakGebruik" (type of tobacco use) records what someone smokes (cigarettes, cigars, pipe, etc.) and "TabakGebruikStatus" (tobacco use status) records the smoking behaviour (smokes daily, smokes occasionally, former smoker, non-smoker, etc.). A "Hoeveelheid" (quantity) can be added as well (30 cigarettes per week, 100 grams of rolling tobacco, etc.).

When we want to use this ZIB in maternity care, we immediately run into a problem. Maternity care already exchanges a data item "Rookgedrag" (smoking behaviour), with values such as 1-10 and 11-20. What is being smoked is not recorded. How do we turn that into a ZIB? We could fudge it and store 1-10 as "5" in the ZIB, and 11-20 as "15". But that of course suggests a precision the care provider or the pregnant woman never expressed. Having software adjust numbers in the primary care process is simply not a good idea.
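
A minimal sketch of why such a conversion is lossy (the field and value names below are made up for illustration, not the actual ZIB or PWD definitions): the midpoint can be computed, but the original range can never be recovered with certainty.

```python
# Hypothetical mapping from the maternity-care "Rookgedrag" ranges to a single
# number for a ZIB-style quantity field. Names and values are illustrative only.
RANGE_TO_MIDPOINT = {
    "1-10": 5,
    "11-20": 15,
}

def to_zib_quantity(rookgedrag: str) -> int:
    """Convert a recorded range to a point value: this invents precision."""
    return RANGE_TO_MIDPOINT[rookgedrag]

def back_to_range(quantity: int) -> str:
    """Guess the original range from the stored number.

    Every value from 1 to 10 ends up as 5, so the reverse mapping cannot tell
    whether the care provider meant 2 or 9 cigarettes.
    """
    return "1-10" if quantity <= 10 else "11-20"

print(to_zib_quantity("1-10"))  # 5, but nobody ever said "5"
print(back_to_range(5))         # "1-10", a guess rather than a recorded fact
```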

We could also bite the bullet and go all in on the ZIB. All vendors would then have to adapt their midwifery and gynaecology systems, and from then on we record the ZIB Tabakgebruik. But then another problem surfaces. The Rookgedrag value list of maternity care has two more categories: "stopped before current pregnancy" and "stopped during current pregnancy". Relevant during pregnancy, but not part of the ZIB Tabakgebruik. We could record it somewhere else, outside the ZIB, but then smoking behaviour, which used to be one handy little list, is suddenly spread over two lists. A further problem with switching to the ZIB is that data from the past can no longer be properly compared with the present, because something different is now being recorded.

Scores

Healthcare uses a lot of scores, for instance the Apgar score for newborns. There is a ZIB for it, and it works as follows: the midwife or gynaecologist gives a 0, 1 or 2 for a number of observations of the newborn, such as breathing, skin colour, etc. These are added up to a total score. A score of 7 or higher is good; a lower score calls for attention or intervention. Thirteen such scores are included in the ZIBs.

A problem arises when we want to pour scores that are not among the ZIBs into a ZIB, and there are a great many of those: many dozens in oncology alone. The individual items can still go into the ZIB AlgemeneMeting (general measurement), which is meant for measurements or determinations for which no specific ZIB exists. But with ZIBs there is no way to group the total score and the individual determinations into a single Score ZIB. The relationship between the items is lost.
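
A small sketch of the problem with made-up record structures (not actual ZIB definitions): each item and the total end up as separate, unrelated general measurements, and nothing in the exchanged records ties them together.

```python
# Illustrative only: each observation becomes its own AlgemeneMeting-like record.
apgar_items = [
    {"name": "breathing",   "value": 2},
    {"name": "skin colour", "value": 1},
    {"name": "heart rate",  "value": 2},
]
total = {"name": "Apgar total", "value": sum(item["value"] for item in apgar_items)}

# Exchanged as loose measurements, a receiver can no longer see which items
# produced which total: the grouping exists only in this local list.
records = apgar_items + [total]
print(records)
```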

Due date


For laboratory determinations the choice was made, quite rightly, for a single ZIB with a test code (from LOINC, for the type of test) and a result, rather than one ZIB per lab test. There are far too many of those, and new ones are added regularly. Nicely solved, then.

For non-laboratory determinations there is AlgemeneMeting. Its definition says: "A general measurement records the outcome of a measurement or determination performed on a patient" and: "Depending on the type of measurement, the result consists of a value with a unit, a coded value (ordinal or nominal), or a textual result." There are a few problems with that too.

A few other patient data items:

  • Date of last menstrual period: important in obstetrics. But it is not a value with a unit, nor a coded value. Should it then just be recorded as text, even though it is a date and not just any text? And it is not a measurement either: the midwife does not measure anything.
  • Maternity care also records whether there is sexual violence or domestic violence. The same problem applies: it is not a "measurement" but a question. The answer is "yes" or "no", not a value with a unit and not a code. So it does not really fit in AlgemeneMeting.
  • The estimated due date ('à terme datum'): the date on which the woman is due. It is part of the ZIB Zwangerschap (pregnancy), but that is too imprecise for maternity care, which also wants to know what the due date was based on: the last menstrual period, an ultrasound, etc. If we then want to use an AlgemeneMeting, we run into the same problem as with the scores: there is no way to "join" two data items (due date and basis of determination) into one ZIB.

In summary: although AlgemeneMeting is a useful ZIB, it is not yet broad enough. What is needed is a "general finding" ('Algemene Bevinding'), which accommodates not only measurements but also other findings and questions, which allows dates and yes/no answers, and, importantly, in which data items can be related to each other. Once that exists, most things can be recorded as a ZIB (or a specialization of one).
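
A rough sketch of what such a "general finding" could look like, purely to illustrate the grouping idea; the names and fields below are my own invention, not an existing ZIB.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Union

@dataclass
class Finding:
    """Hypothetical 'general finding': a named question or determination whose
    answer may be a number, a code, a date, a yes/no or free text, and which
    can carry related findings."""
    name: str
    value: Union[float, str, bool, date]
    unit: Optional[str] = None
    related: List["Finding"] = field(default_factory=list)

# The estimated due date together with the basis on which it was determined,
# kept in one related group instead of two unconnected records.
due_date = Finding(
    name="a terme datum",
    value=date(2024, 5, 1),
    related=[Finding(name="basis of determination", value="ultrasound")],
)
print(due_date)
```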

Weight


A somewhat smaller problem, but still a good illustration of how complex 'ZIB-ifying' ('verzibben') is. Weight, one would think, is a basic data item if anything is. But the ZIB Lichaamsgewicht (body weight) uses LOINC code 29463-7: "Body weight". The PWD of the maternity care sector uses LOINC code 3141-9: "Body weight Measured". Are these the same thing? Can we just put it in the ZIB, throw away one LOINC code and adopt the other? Here too: having computers replace codes that carry a subtly different meaning is rarely a good idea.

Blood group

Something similar applies to blood group. There is no ZIB for it (a fairly common data item, so you would expect one). In the absence of a ZIB, we can choose between LabUitslag (lab result) and AlgemeneMeting. When a lab test is actually performed, LabUitslag is the obvious choice. But when only the blood group A, B or O is recorded, it gets more complicated. There are two lab tests in the Dutch Labcodeset: ABO and ABO+rhesus, and therefore two LOINC codes. A GP's or midwife's system will not record which test was performed, only A, B or O. So we cannot use LabUitslag, because the LOINC code can no longer be determined. Rhesus is even trickier: when only the rhesus factor is passed on, the LOINC code for ABO+rh cannot be used, because it would then be unclear whether the value refers to ABO or to rh. AlgemeneMeting is a way out. The awkward part is that we then record a blood group sometimes in the ZIB LabUitslag and sometimes in AlgemeneMeting. Not exactly standardization.
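
A sketch of the kind of decision logic this forces on implementers; the record structures are made up, and the point is only that the choice of ZIB depends on what happens to be known about the test.

```python
from typing import Optional

def blood_group_record(value: str, loinc_code: Optional[str] = None) -> dict:
    """Illustrative only: pick a record type for a blood group result.

    If we know which test was performed (a LOINC code is available), a
    LabUitslag-style record fits; if only "A", "B", "AB" or "O" was recorded,
    we fall back to an AlgemeneMeting-style record.
    """
    if loinc_code is not None:
        return {"zib": "LabUitslag", "test": loinc_code, "result": value}
    return {"zib": "AlgemeneMeting", "name": "blood group", "result": value}

print(blood_group_record("A", loinc_code="<LOINC code for the ABO test>"))
print(blood_group_record("A"))  # test unknown: ends up in a different ZIB
```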

Conclusion

Once more: the ZIBs are a good idea, and this is the way forward. But the practice of converting today's healthcare information to ZIBs, the 'ZIB-ifying', still faces a great many obstacles. Simply exchanging everything as ZIBs overnight is not realistic. It requires a very different way of recording data, and a great many changes to systems. And, as argued above, also some changes to the ZIBs themselves.


Semantics and behavior

Larry Masinter believes there is a risk in making ‘the “meaning” of [a] language depend on operational behavior‘.

In discussions like this there should be a sharp distinction between natural and computer languages. Larry says meaning ‘depends on what the speaker intends and how the listener will interpret the utterance’ – that’s a valid viewpoint for natural languages, but for markup (html, xml etc.) or programming languages it’s really stretching definitions. There is no speaker, and no direct intentions – unless software has intentions. For markup document instances, there are usually three indirect intentions: those of the language designer, the software builder and the end-user producing the instance. I’m not sure an approach based on intentions is workable for computer languages at all – whose intention does the document reflect? How do we know the intentions of end-user, software builder and language designer do not conflict? How do we know their intentions at all?

Besides, semantics in this way can be described in prose or first order logic. I don’t think first order logic is appropriate for most language design (it’s not human readable, only logician-readable), and prose may be fine but is often inexact. I spoke at length at Balisage 2008 on this issue, and discussed it with David Orchard and others, and I believe an operational approach is often appropriate for computer languages. Why? It certainly is problematic for natural language semantics. There may be an utterance, and a meaning, without any behavior – I can say ‘North Korea tested an A-bomb’, and that has a meaning, even if no-one exhibits any behavior whatsoever after my words. For computer languages, it’s different. It’s possible to describe operational behavior as testable conditions – if I send a system test message A, it should do X. And that makes operational semantics a nearly perfect fit for computer languages – the semantics can be tested with a test suite. That’s enough. No need for real-time behavior after receiving each message.
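
A minimal sketch of what such testable conditions can look like in practice: a toy message processor and a conformance test for an imaginary spec that says a system receiving message "A" must answer "X".

```python
import unittest

def process(message: str) -> str:
    """Toy processor for an imaginary spec: on message "A", answer "X"."""
    return "X" if message == "A" else "unknown"

class ConformanceSuite(unittest.TestCase):
    """Operational semantics as a test suite: a processor conforms if it
    passes these tests; no real-time observation of behaviour is needed."""

    def test_message_A_yields_X(self):
        self.assertEqual(process("A"), "X")

if __name__ == "__main__":
    unittest.main()
```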

Larry writes: ‘…standards organizations are in the business of defining languages … and not in … telling organizations and participants … how they are supposed to behave’. That’s far too lenient. If I render text marked <b> as italic, and <i> as bold, I’m not following your spec. Period. If I implement a language specification, and claim conformance, I’m not free to do as I choose. This is even more true in other contexts – business, healthcare, insurance. There organizations exchanging documents sign contracts, and are legally bound to do certain things upon receiving documents – paying after ordering, for instance. Larry’s free-for-all approach to language definitions does not apply to the real world. It’s only true that behavior should not be constrained any further than necessary – but without behavioral consequences, document exchange is meaningless indeed.

Axioms of Versioning

Obsolete, please see the latest version

Version 1

An attempt to define syntactical and semantical compatibility of versions in a formal way. Much derives from the writings of David Orchard, especially the parts on syntactical forward and backward compatibility (though my terminology differs).

  1. Let U be a set of (specifications of) software processable languages {L1, L2, … Ln}
    1. This axiomatization does not concern itself with natural language
  2. The extension of a language Lx is a set of conforming document instances ELx = {Dx1, Dx2, … Dxn}
    1. Iff ELx = ELy then Lx and Ly are extensionally equivalent
      • Lx and Ly may still be different: they may define different semantics for the same syntactical construct
    2. Iff ELx ⊆ ELy then Lx is an extensional sublanguage of Ly
    3. Iff ELx ⊇ ELy then Lx is an extensional superlanguage of Ly
    4. D is the set of all possible documents; or the union of all ELx where Lx ∈ U
  3. For each Lx ∈ U there is a set of processors Px = {Px1, Px2, … Pxn} which implement Lx
    1. Each Pxy exhibits behaviour as defined in Lx
    2. Processors can produce and consume documents
    3. Each Pxy produces only documents it can consume itself
    4. At consumption, Pxy may accept or reject documents
  4. The behaviour of a processor Pxy of language Lx is a function Bxy
    1. The function Bxy takes as argument a document, and its value is a processor state
      • We assume a single processor state before each function execution, alternatively we could assume a <state, document> tuple as function argument
    2. If for two processors Pxy and Pxz for language Lx for a document d Bxy(d) = Bxz(d) then the two processors behave similar for d
      • Two processor states for language Lx are deemed equivalent if a human with thorough knowledge of language specification Lx considers the states equivalent. Details may vary insofar as the language specification allows it.
      • Processor equivalence is not intended to be formally or computably decidable; though in some cases it could be.
    3. If ∀d ( d ∈ ELx → Bxy(d) = Bxz(d) ) then Pxy and Pxz are behaviourally equivalent for Lx
      • If two processors behave the same for every document which belongs to a language Lx, the processors are behaviourally equivalent for Lx.
  5. An ideal language specifies all aspects of desired processor behaviour completely and unambiguously; assume all languages in U are ideal
  6. A processor Px is an exemplary processor of a language Lx if it fully implements language specification Lx; assume all processors for all languages in U are exemplary
    1. Because they are (defined to be) exemplary, every two processors for a language Lx are behaviourally equivalent
    2. ELx = { d is a document ∧ Px accepts d }
    3. The complement of ELx is the set of everything (normally, every document) which is rejected by Px
    4. The make set MLx = { d is a document ∧ Px can produce d }
  7. A language Lx is syntactically compatible with Ly iff MLx ⊆ ELy and MLy ⊆ ELx
    • Two languages are syntactically compatible if they accept the documents produced by each other.
    1. A language Ln+1 is syntactically backward compatible with Ln iff MLn ⊆ ELn+1 and Ln+1 is a successor of Ln
      • A language change is syntactically backward compatible if a new receiver accepts all documents produced by an older sender.
    2. A language Ln is syntactically forward compatible with Ln+1 iff MLn+1 ⊆ ELn and Ln+1 is a successor of Ln
      • A language change is syntactically forward compatible if an old receiver accepts all documents produced by a new sender.
  8. A document d can be a member of the extension of any number of languages
    1. Px is an (exemplary) processor of Lx, Py is an (exemplary) processor of language Ly
    2. Two languages Lx and Ly are semantically equivalent iff ELx = ELy ∧ ∀d ( d ∈ ELx → Bx(d) = By(d) )
      • If two languages Lx and Ly take the same documents as input, and their exemplary processors behave the same for every document, the languages are semantically equivalent.
      • Two languages can only be compared if their exemplary processors are similar enough to be compared.
      • Not every two languages can be compared.
      • “Semantic” should not be interpreted in the sense of “formal semantics”.
  9. The semantical equivalence set of a document d for Lx = { y ∈ ELx | Bx(d) = Bx(y) }
    1. Or: SLx,d = { y ∈ ELx | Bx(d) = Bx(y) }
      • The semantical equivalence set of a document d is the set of documents which make a processor behave the same as d
      • Semantical equivalence occurs for expressions which are semantically equivalent, such as i = i + 1 and i += 1 for C, or different attribute order in XML etc.
    2. d ∈ SLx,d
    3. Any two distinct semantical equivalence sets of Lx are necessarily disjoint
      • If z ∈ SLx,e were also z ∈ SLx,d then every member of SLx,e would be in SLx,d and vice versa, and thus SLx,d = SLx,e
  10. A language Ly is a semantical superlanguage of Lx iff ∀d ( d ∈ MLx → By(d) = Bx(d) )
    1. For all documents produced by Px, Py behaves the same as Px
      • Equivalence in this case should be decided based on Lx; if Ly makes behavioural distinctions which are not mentioned in Lx, behaviour is still the same as far as Lx is concerned
    2. It follows: ∀d ( d ∈ MLx → ∃SLy,d ( SLy,d ⊆ ELy ∧ ( SLx,d ∩ MLx ) ⊆ SLy,d ∧ By(d) = Bx(d) ) )
      • For any document produced by Px, the part of its semantical equivalence set which Px can actually produce, is a subset of the semantical equivalence set of Py for this document
    3. For all d ∈ ELx ∧ d ∉ MLx there may be many equivalence sets in Ly for which By(d) ≠ Bx(d)
      • In other words: for documents accepted but not produced by Px, Ly may define additional behaviours
    4. Lx is a semantical sublanguage of Ly iff Ly is a semantical superlanguage of Lx
  11. A language Ln+1 is semantically backward compatible with Ln iff Ln+1 is a semantical superlanguage of Ln and Ln+1 is a successor of Ln
    1. An old sender may expect a newer, but semantically backward compatible, receiver to behave as the sender intended
    2. A language Ln is semantically forward compatible with Ln+1 iff Ln+1 is a semantical sublanguage of Ln and Ln+1 is a successor of Ln
    3. Semantic forward compatibility is only possible if a language loses semantics; i.e. its processors exhibit less functionality, and produce less diverse documents
    4. A processor cannot understand what it does not know about yet
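
A small executable sketch of the syntactic part of these axioms, as my own illustration rather than part of the axiomatization: toy languages whose extensions and make sets are finite sets of strings, and the subset checks of axiom 7.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Language:
    name: str
    extension: FrozenSet[str]  # E: documents the exemplary processor accepts
    make_set: FrozenSet[str]   # M: documents the exemplary processor produces

def syntactically_backward_compatible(new: "Language", old: "Language") -> bool:
    """Axiom 7.1: everything an old sender produces is accepted by the new
    receiver (M of the old language is a subset of E of the new one)."""
    return old.make_set <= new.extension

def syntactically_forward_compatible(old: "Language", new: "Language") -> bool:
    """Axiom 7.2: everything a new sender produces is accepted by the old
    receiver (M of the new language is a subset of E of the old one)."""
    return new.make_set <= old.extension

L1 = Language("L1", extension=frozenset({"a", "b", "c"}), make_set=frozenset({"a", "b"}))
L2 = Language("L2", extension=frozenset({"a", "b", "c", "d"}), make_set=frozenset({"a", "d"}))

print(syntactically_backward_compatible(L2, L1))  # True: L2 accepts all L1 produces
print(syntactically_forward_compatible(L1, L2))   # False: L1 rejects "d"
```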

The #referent Convention

Update: I learned from the TAG list that Dan Connolly already proposed using #it or #this for the same purpose, and  Tim Berners-Lee proposed using #i to refer to oneself in a similar way. My idea therefore was not very original, and since I regularly read the TAG list and similar sources, it’s even possible I read the idea somewhere and (much) later thought of it as one of my own – though if this happened it was certainly unintentional.

There is a very simple solution to the entire hash-versus-slash debate: whenever you would want to identify anything with a hashless URI, suffix it with #referent. The meaning of x#referent is: I identify whatever x is about. And x is simply an information resource (about x#referent).

The httpRange-14 debate is about what hashless URI’s (without a #) refer to: can they refer only to documents (information resources) or to anything, i.e. persons or cars or concepts? Is it meaningful to say https://www.marcdegraauw.com/marcdegraauw/ refers to ‘Marc de Graauw’? Or does it now identify both a web page and a person, and is this meaningful and/or desirable?

Hash URI’s aren’t thought to be much of a problem in this respect. They have some drawbacks however. It may be desirable to retrieve an entire information resource which describes what the referent of the URI is. And putting all identifiers in one large file makes the file large. Norman Walsh did this: http://norman.walsh.name/knows/who#norman-walsh identifies Norman Walsh, and the ‘who’ file got big. So Norm switched to hashless URI’s: http://norman.walsh.name/knows/who/norman-walsh identifies Norman Walsh. The httpRange-14 solution requires Norm to answer to a GET on this URI with a 303 redirect, in this case to http://norman.walsh.name/knows/who/norman-walsh.html, which does not identify Norman, but simply is an information resource.

If we use the #referent convention, I can say: https://www.marcdegraauw.com/marcdegraauw.html#referent identifies me. And https://www.marcdegraauw.com/marcdegraauw.html is simply an information resource, which is about me. Problem solved.

If I put https://www.marcdegraauw.com/marcdegraauw.html#referent in a browser, I will simply get the entire https://www.marcdegraauw.com/marcdegraauw.html resource, which is a human-readable resource about https://www.marcdegraauw.com/marcdegraauw.html#referent. Semantic Web software which understands the #referent convention will know https://www.marcdegraauw.com/marcdegraauw.html#referent refers to a non-information resource (except when web pages are about other web pages) and https://www.marcdegraauw.com/marcdegraauw.html is simply an information resource. Chances of collision of the #referent fragment identifier are very small (apart from Semantic Web jokers who do this intentionally) and even in the case of collision with existing #referent fragment identifiers the collision seems pretty harmless. The only thing the #referent convention does not solve is all the existing hashless URI’s out there which (are purported to) identify non-information resources.
In Semantic Web architecture, there is no need ever for hashless URI’s. The #referent convention is easier, more explicit about what is meant, and retrieves a nice descriptive human-readable information resource in a browser, along with all necessary rdf metadata for Semantic Web applications.
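
A trivial sketch (my own illustration) of how software aware of the convention could treat such a URI: strip the fragment to get the information resource, keep the full URI as the name of its subject.

```python
from urllib.parse import urldefrag

def split_referent(uri: str) -> dict:
    """Under the #referent convention, the URI without the fragment is the
    information resource and the full URI names whatever that resource is about."""
    resource, fragment = urldefrag(uri)
    if fragment == "referent":
        return {"information_resource": resource, "subject": uri}
    return {"information_resource": uri, "subject": None}

print(split_referent("https://www.marcdegraauw.com/marcdegraauw.html#referent"))
# {'information_resource': 'https://www.marcdegraauw.com/marcdegraauw.html',
#  'subject': 'https://www.marcdegraauw.com/marcdegraauw.html#referent'}
```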

Validation Considered Essential

I just ran into a disaster scenario which Mark Baker recently described as the way things should be: a new message exchange without schema validation. He writes: “If the message can be understood, then it should be processed” and in a comment “I say we just junk the practice of only processing ‘valid’ documents … and let the determination of obvious constraints … be done by the software responsible for processing that value.” I’ll show this is unworkable, undesirable and impossible (in that order).

I’ve got an application out there which reads XML sent to my customer. The XML format is terrible, and old – it predates XML Schema. So there is no schema, just an Excel file with “AN…10” style descriptions and value lists. It is built into my software, and works pretty well – my code does the validation, and the incoming files are always processed fine.

Now a second party is going to send XML in the same format. Since there is no schema, we started testing in the obvious way – entering data on their website, exporting to XML, sending it over, importing it in my app, seeing what goes wrong, fixing, starting over. We have had an awful lot of those cycles so far, and no error-proof XML yet. Given a common schema, we could have had a decent start in the first cycle. Check unworkable.

So I wrote a RelaxNG schema for the XML. It turned out there were hidden errors which my software did not notice. For instance, there is a code field, and if it has some specific value, such as ‘D7’, my customer must be alerted immediately. My code checks for ‘D7’ and alerts the users if it comes in. The new sender sent ‘d7’ instead. My software did not see the ‘D7’ code and gave no signal. I wouldn’t have caught this so easily without the schema – I would have caught it in the final test rounds, but it is so much easier to catch those errors early, which schemas can do. Check undesirable.
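
The real format is far larger, but a made-up miniature shows the idea: a RelaxNG schema that enumerates the allowed codes rejects a wrong-case ‘d7’ at the door instead of letting it slip through, here validated with lxml.

```python
from lxml import etree

# Miniature schema, illustrative only: the code element may hold "D7" or "A1".
schema = etree.RelaxNG(etree.fromstring("""
<element name="record" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="code">
    <choice>
      <value>D7</value>
      <value>A1</value>
    </choice>
  </element>
</element>
"""))

good = etree.fromstring("<record><code>D7</code></record>")
bad = etree.fromstring("<record><code>d7</code></record>")

print(schema.validate(good))  # True
print(schema.validate(bad))   # False: caught immediately, not in the final test rounds
```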

Next, look at an element with value ‘01022007’. According to Mark, if it can be understood, it should be processed. And indeed I can enter ‘Feb 1, 2007’ in the database. Or did the programmer serialize the American MM/DD/YYYY format as MMDDYYYY and is it ‘Jan 2, 2007’? Look at the value ‘100,001’ – perfectly understandable, one hundred thousand and one – or is this one hundred and one thousandth, with a decimal comma? Questions like that may not be common in an American context, but in Europe they arise all the time – on the continent we use DD-MM-YYYY dates and decimal commas, but given the amount of American software, MM/DD/YYYY dates and decimal points occur everywhere. The point is the values can apparently be understood, but aren’t in fact. One cannot catch those errors with processing logic because the values are perfectly acceptable to the software. Check impossible.
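
Both readings really do parse cleanly, which is exactly why processing logic cannot flag them; a short illustration:

```python
from datetime import datetime

raw = "01022007"
# Both interpretations parse without any error being raised:
as_ddmmyyyy = datetime.strptime(raw, "%d%m%Y")  # 1 February 2007
as_mmddyyyy = datetime.strptime(raw, "%m%d%Y")  # 2 January 2007
print(as_ddmmyyyy.date(), as_mmddyyyy.date())

amount = "100,001"
# Thousands separator or decimal comma? Both conversions succeed:
us_reading = float(amount.replace(",", ""))   # 100001.0
eu_reading = float(amount.replace(",", "."))  # 100.001
print(us_reading, eu_reading)
```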

In exchanges, making agreements and checking whether those agreements are met is essential. Schemas wouldn’t always catch the third kind of error either, but they provide a way to avoid the misinterpretations. The schema is the common agreement – unless one prefers to fall back to prose – and once we have it, not using it for validation seems pointless. Mark Baker makes some good points on over-restrictive validation rules, but throws out the baby with the bathwater.

The trouble with PSI’s

Published Subject Identifiers face a couple of serious problems, as do all URI-based identifier schemes. A recent post of Lars Marius Garshol reminded me of the – pardon the pun – subject. I was pretty occupied with PSI’s some time ago, maybe now is the moment to write down some of my reservations about PSI’s. PSI’s are URI’s which uniquely identify something, which is – ideally – described on the web page you see when you browse to the URI – read Lars’ post for a real introduction.

First, PSI’s solve the wrong problem. The idea of using some scheme to uniquely identify things is hardly novel. Any larger collection of names or id’s for whatever sooner or later faces the problem of telling whether two names point to the same whatever or not. So we’ve got ISBN numbers for books, codes for asteroids and social security numbers for people. They are supposed to designate a single individual or single concept. The problem with all those identifier schemes is not the idea, but the fact that they get polluted over time. Through software glitches and human error and – mostly – intentional fraud a single social security number may point to two or more persons and two or more social security numbers may point to a single person. I can’t speak for non-Dutch realms, but the problem here is very real, and I assume given human nature it is not much different elsewhere. So the real problem is not inventing an identification scheme, the problem is avoiding pollution. This may seem like unfair criticism – no, PSI’s don’t solve famine and war either – but it does set PSI’s in the right light – they are not a panacea for identity problems.

Second, PSI’s are supposed to help identification for both computers – they will compare URI’s, and conclude two things are the same if their URI’s are equivalent – and for humans, through an associated web page. The trouble is what to put on the web page. Let’s make a PSI for me, using my social security number: http://www.sofinummer.nl/0123.456.789. Now what can we put on the web page? We could say “This page identifies the person with the Dutch social security number 0123.456.789” – but that is hardly additional information. If we elaborate – “This page identifies Marc de Graauw, the son of Joop de Graauw and Mieke Hoendervangers, who was born on the 6th of March 1961 in Tilburg, the Netherlands, the person with the Dutch social security number 0123.456.789” – we get into trouble. I could find out for instance I was not actually born in Tilburg, but that my parents for some reason falsely reported this as my birthplace to the authorities. Now even if this were the case, 0123.456.789 would still be my social security number, and it would identify me, not someone else. But if we look at the page, we have to conclude http://www.sofinummer.nl/0123.456.789 identifies nobody, since nobody fits all the criteria listed. The same goes for any other fact we could list – I could find out I was born on another day, to other parents et cetera. The only truly reliable information, the one piece we cannot change, is “This page identifies the person who has been given the Dutch social security number 0123.456.789 by the Dutch Tax Authority”, which is hardly any information beyond the social security number itself. All we’ve achieved is prepending my social security number with http://www.sofinummer.nl/, and this simple addition won’t solve any real-world problem. The problem is highlighted by Lars’ example of a PSI, the date page for my birthday, http://psi.semagia.com/iso8601/1961-03-06. This page has no information whatsoever which could not be conveyed with a simple standardized date format, such as ‘1961-03-06’ in ISO 8601.

Third, in a real-world scenario, establishing identity statements across concepts from diverse ontologies is the problem to solve. Getting everybody to use the same single identifier for a concept is not feasible. Take an example such as Chimezie Ogbuji’s work on a Problem-Oriented Medical Record Ontology, where it says about ‘Person’:

cpr:person = foaf:Person and galen:Person and rim:EntityPerson and dol:rational-agent

In FOAF a person is defined with: "Something is a foaf:Person if it is a person. We don't nitpic about whether they're alive, dead, real or imaginary."

HL7’s RIM defines Person as:

“A subtype of LivingSubject representing a human being.”

and defines LivingSubject as:
“A subtype of Entity representing an organism or complex animal, alive or not.”

In FOAF persons can be imaginary and ‘not real’, in the RIM they cannot. Now Chimezie wisely uses ‘and’, which is an intersection in OWL, so his cpr:Person does not include imaginary persons. And for patient records, which are his concern, it doesn’t matter: we don’t treat Donald Duck for bird flu, so for medical care the entire problem is theoretical. But what about PSI’s: could we ever reconcile the concepts behind foaf:Person and rim:EntityPerson? Probably not: there are a lot of contexts where imaginary persons make sense. So if we make two PSI’s, foaf:Person and rim:EntityPerson, our subjects won’t merge, even when – in a certain context such as medical care – they should. Or we could forbid the use of foaf:Person in the medical realm, but this seems too harsh: the FOAF approach to personal information is certainly useful in medical care.

Identity of concepts is context-dependent. The definitions behind the concepts don’t matter much. Trying to find a universal definition for any complex concept such as ‘person’ will only lead to endless semantic war. Usually natural language words will do for a definition (but you do need disambiguation for homonyms). Way more important than trying to establish a single new id system with new definitions, are ways to make sensible context-dependent equivalences between existing id systems.


Validate for Machines, not Humans

Mark Baker misses an important distinction in “Validation Considered Harmful” when he writes:

“Today’s sacred cow is document validation, such as is performed by technologies such as DTDs, and more recently XML Schema and RelaxNG.

Surprisingly though, we’re not picking on any one particular validation technology. XML Schema has been getting its fair share of bad press, and rightly so, but for different reasons than we’re going to talk about here. We believe that virtually all forms of validation, as commonly practiced, are harmful; an anathema to use at Web scale.”

Dare Obasanjo replied in “Versioning does not make validation irrelevant“:

“Let’s say we have a purchase order format which in v1 has an element which can have a value of "U.S. dollars" or "Canadian dollars", then in v2 we now support any valid currency. What happens if a v2 document is sent to a v1 client? Is it a good idea for such a client to muddle along even though it can't handle the specified currency format?”

to which Mark replied:

“No, of course not. As I say later in the post; ‘rule of thumb for software is to defer checking extension fields or values until you can’t any longer'”

With software the most important point is whether the data sent ends up with a human, or ends up in software – either to be stored in a database for possible later retrieval, or to be used to generate a reply message without human intervention. Humans can make sense of unexpected data: when they see “Euros” where “EUR” was expected, they’ll understand. Validating as little as possible makes sense there. When software does all the processing, stricter validation is necessary – trying to make software ‘intelligent’ by enabling it to process (not just store, but process) as-yet-unknown format deviations is a road to sure disaster. So in the latter case stricter validation makes a lot of sense – we accept “EUR” and “USD”, not “Euros”. And if we do that, the best thing for two parties who exchange anything is to make those agreements explicit in a schema. If we “defer checking extension fields or values until you can’t any longer” we end up with some application’s error message. You don’t want to return that to the partner who sent you a message – you’ll want to return “Your message does not validate against our agreed-upon schema”, so they know what to fix (though sometimes you’ll want your own people to look at it first, depending on the business case).

Of course one should not include unnecessary constraints in schemas – but whether humans or machines will process the message is central in deciding what to validate and what not.
Another point is what to validate – values in content, or structure – and Uche Ogbuji realistically adds:

“Most forms of XML validation do us disservice by making us nit-pick every detail of what we can live with, rather than letting us make brief declarations of what we cannot live without.”

Yes, XML Schema and others make structural requirements which impose unnecessary constraints. Unexpected elements often can be ignored, and this enhances flexibility.

The Semantics of Addresses

There has been a lot of discussion over the past 10-something years on URI’s: are they names or addresses? However, there does not appear to have been much investigation into the semantics of addresses. This is important, since while there are several important theories on the semantics of names (Frege, Russell, Kripke/Donnellan/Putnam et al.), there are few classical accounts of the semantics of addresses. Here’s a shot.

What are addresses? Of course, first come the standard postal addresses we’re all accustomed to:

Tate Modern
Bankside
London SE1 9TG
England

 

Other addresses, in a broad sense, could be:

52°22’08.07” N 4°52’53.05” E (The dining table on my roof terrace, in case you ever want to drop by. For an outdoor dining table, though, I suggest coming in late spring or summer.)

e2, g4, a8 etc. on a chess board

The White House (if further unspecified, almost anyone would assume the residence of the President of the United States)

(3, 6) (in some x, y coordinate system)

Room 106 (if we are together in some building)

//Myserver/Theatre/Antigone.html

128.30.52.47

Addresses are a lot like names – they are words, or phrases, which point to things in the real world. They enable us to identify things, and to refer to things – like names. ‘I just went to the van Gogh Museum’ – ‘I was at Paulus Potterstraat 7 in Amsterdam’ – pretty similar, isn’t it?
So what makes addresses different from names, semantically? The first thing which springs to mind is ordinary names are opaque, and addresses are not. Addresses contain a system of directions, often but not always, hierarchical. In other words: there is information in parts of addresses, whereas parts of names do not contain useful information. From my postal address you can derive the city where I live, the country, the street. From chess notations and (geo-)coordinates one can derive the position on two (or more) axes. So addresses contain useful information within them, and names for the most part do not.

This is not completely true – names do contain some informative parts – from ‘Marc de Graauw’ you can derive that I belong to the ‘de Graauw’ family, and am member ‘Marc’ of it, but this does not function the way addresses do – it is not: go to the collection ‘de Graauw’ and pick member ‘Marc’. On a side note, though ‘de Graauw’ is an uncommon last name even in the Netherlands, I know at least one other ‘Marc de Graauw’ exists, so my name is not unique (the situation could have been worse though). I don’t even know whether my namesake is part of my extended family or not, so ‘looking up’ the ‘de Graauw’ family is not even an option for me.

Unique names or identifiers are usually even more opaque than natural names – my social security number does identify me uniquely in the Dutch social security system, but nothing can be derived from its parts other than a very vague indication of when it was established. So even when names contain some information within their parts, it is not really useful in the sense that it doesn’t establish much – not part of the location, or identity, or reference. The parts of addresses do function as partial locators or identifiers, the parts of names provide anecdotal information at best.

Names and addresses are fundamentally different when it comes to opacity. What else? Ordinary names – certainly unique names – denote unmediated, they point directly to an individual. Addresses denote mediated, they use a system of coordination to move step-by-step to their endpoint. Addressing systems are set up in such a way they provide a drilling-down system to further and further refine a region in a space until a unique location is denoted. Addresses are usually unique in their context, names sometimes are, and sometimes not. So, e4 denotes a unique square on a chess board, and my postal address a unique dwelling on Earth. The name ‘Amsterdam’ does denote a unique city if the context is the Netherlands, but my name does not denote a unique individual. So addresses pertain to a certain space, where a certain system of directives applies.

Addresses do not denote things, they denote locations. My postal address does not denote my specific house: if we tear it down and build another, the address does not change. e4 does not denote the pawn which stands there, it denotes a square on a chess board, whatever piece is there. So addresses do not denote things, but slots for things. Addresses uniquely denote locations, in a non-opaque, mediated way. If we use ‘name’ in a broad sense, where names can be non-opaque, we could say: addresses are unique names for locations in a certain space.

                Names          Addresses
Can             identify       identify
Can             refer          refer
Denote          directly       mediated
Point into      the world      a space
Denote          things         slots
Are             opaque         not opaque

Where does this leave us with URI’s? It’s quite clear URL’s (locator URI’s) are addresses. Looking at a URL like http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#loc_independent, this tells us a lot:

1) this is the http part of URI space we’re looking at,

2) this is on host www.w3.org

3) the path (on this host) to the resource is 2001/tag/doc/URNsAndRegistries-50.html

4) and within this, I’m pointing to fragment #loc_independent

So URL’s fulfill all conditions of addresses. They are not opaque. Their parts contain useful information. Their parts – scheme, authority, path etc. – provide steps to the URL’s destination – the resource it points to. They identify, they refer, like names. No, URL’s are not addresses of files on file systems on computers, not addresses in this naive sense. But URL’s are addresses in URI space. HTTP URI’s are names of locations in HTTP space. Semantically, URL’s are addresses – at least. Whether URL’s can be names too is another question.
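
The non-opacity is easy to make concrete with standard-library URL parsing (just an illustration):

```python
from urllib.parse import urlsplit

url = "http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#loc_independent"
parts = urlsplit(url)
print(parts.scheme)    # 'http': which part of URI space we are in
print(parts.netloc)    # 'www.w3.org': the host
print(parts.path)      # '/2001/tag/doc/URNsAndRegistries-50.html': path on the host
print(parts.fragment)  # 'loc_independent': the fragment pointed to
```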

Do we have to know we know to know?

John Cowan wrote ‘Knowing knowledge‘ a while ago, about what it means to know something. His definition (derived from Nozick) is:

‘The following four rules explain what it is to know something. X knows the proposition p if and only if:

  1. X believes p;
  2. p is true;
  3. if p weren’t true, X wouldn’t believe it;
  4. if p were true, X would believe it.’

This raises an interesting question. A common position of religious people (or at least religious philosophers) is: ‘I believe in the existence of God, but I cannot know whether God exists’. God’s existence is a matter of faith, not proof. I don’t hold such a position myself, but would be very reluctant to denounce it on purely epistemological grounds.

Now suppose, for the sake of the argument, that God does in fact exist, and that the religious philosopher, X, would not have believed in the existence of God if God had not existed (quite coherently, since typically in such views nothing would have existed without God, so no one would have believed anything). Our philosopher’s belief would then satisfy the above four criteria. Yet could we say ‘X knows p’, when X himself assures us he does not know whether p is true? In other words: doesn’t knowing something presuppose the knower would be willing to assert knowing his or her knowledge?