Published Subject Identifiers face a couple of serious problems, as do all URI-based identifier schemes. A recent post by Lars Marius Garshol reminded me of the (pardon the pun) subject. I was pretty occupied with PSI’s some time ago, so maybe now is the moment to write down some of my reservations about them. PSI’s are URI’s which uniquely identify something, which is ideally described on the web page you see when you browse to the URI; read Lars’ post for a real introduction.
First, PSI’s solve the wrong problem. The idea of using some scheme to uniquely identify things is hardly novel. Any larger collection of names or id’s for whatever sooner or later faces the problem of telling whether two names point to the same whatever or not. So we’ve got ISBN numbers for books, codes for asteroids and social security numbers for people. They are supposed to designate a single individual or single concept. The problem with all those identifier schemes is not the idea, but the fact that they get polluted over time. Through software glitches, human error and (mostly) intentional fraud, a single social security number may point to two or more persons and two or more social security numbers may point to a single person. I can’t speak for non-Dutch realms, but the problem here is very real, and I assume given human nature it is not much different elsewhere. So the real problem is not inventing an identification scheme, the problem is avoiding pollution. This may seem like unfair criticism (no, PSI’s don’t solve famine and war either), but it does set PSI’s in the right light: they are not a panacea for identity problems.
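To make the two kinds of pollution concrete, here is a minimal Python sketch. The function name find_pollution and the sample records are made up for illustration, and the sketch assumes we already know which person each record really belongs to, which in practice is exactly the hard part.

```python
from collections import defaultdict

def find_pollution(records):
    """Report identifiers shared by several persons and persons holding several identifiers.

    `records` is an iterable of (identifier, person) pairs.
    """
    persons_by_id = defaultdict(set)
    ids_by_person = defaultdict(set)
    for identifier, person in records:
        persons_by_id[identifier].add(person)
        ids_by_person[person].add(identifier)
    # One identifier pointing to two or more persons.
    shared_ids = {i: p for i, p in persons_by_id.items() if len(p) > 1}
    # Two or more identifiers pointing to one person.
    duplicate_ids = {p: i for p, i in ids_by_person.items() if len(i) > 1}
    return shared_ids, duplicate_ids

# Hypothetical registry contents, invented for this example.
records = [
    ("0123.456.789", "person A"),
    ("0123.456.789", "person B"),   # same number, different person
    ("0987.654.321", "person C"),
    ("0555.555.555", "person C"),   # same person, second number
]
shared_ids, duplicate_ids = find_pollution(records)
```

Putting http://www.sofinummer.nl/ in front of each number changes nothing in this picture: the same pairs are polluted before and after.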
Second, PSI’s are supposed to help identification both for computers (they will compare URI’s, and conclude two things are the same if their URI’s are equivalent) and for humans, through an associated web page. The trouble is what to put on the web page. Let’s make a PSI for me, using my social security number: http://www.sofinummer.nl/0123.456.789. Now what can we put on the web page? We could say “This page identifies the person with the Dutch social security number 0123.456.789”, but that is hardly additional information. If we elaborate (“This page identifies Marc de Graauw, the son of Joop de Graauw and Mieke Hoendervangers, who was born on the 6th of March 1961 in Tilburg, the Netherlands, the person with the Dutch social security number 0123.456.789”), we get into trouble. I could find out, for instance, that I was not actually born in Tilburg, but that my parents for some reason falsely reported this as my birthplace to the authorities. Now even if this were the case, 0123.456.789 would still be my social security number, and it would identify me, not someone else. But if we look at the page, we have to conclude http://www.sofinummer.nl/0123.456.789 identifies nobody, since nobody fits all the criteria listed. The same goes for any other fact we could list: I could find out I was born on another day, to other parents, et cetera. The only truly reliable information, the one piece we cannot change, is “This page identifies the person who has been given the Dutch social security number 0123.456.789 by the Dutch Tax Authority”, which is hardly any information at all beyond the social security number itself. All we’ve achieved is prefixing my social security number with http://www.sofinummer.nl/, and this simple addition won’t solve any real-world problem. The problem is highlighted by Lars’ example of a PSI, the date page for my birthday, http://psi.semagia.com/iso8601/1961-03-06. This page has no information whatsoever which could not be conveyed with a simple standardized date format, such as ‘1961-03-06’ in ISO 8601.
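The machine side of this can be sketched in a few lines of Python. The normalize function below is my own deliberately simple notion of “equivalent URI’s” (lowercase scheme and host, no trailing slash), not something taken from the Topic Maps standards, and the helper names are hypothetical.

```python
from urllib.parse import urlsplit

def normalize(uri: str) -> str:
    """Lowercase scheme and host and drop a trailing slash (a simplistic equivalence rule)."""
    parts = urlsplit(uri)
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"

def same_subject(uri_a: str, uri_b: str) -> bool:
    """Two subjects merge when their identifiers are equivalent URI's."""
    return normalize(uri_a) == normalize(uri_b)

DATE_PREFIX = "http://psi.semagia.com/iso8601/"

def date_from_psi(psi: str):
    """Return the ISO 8601 date carried by a date PSI, or None for any other URI."""
    return psi[len(DATE_PREFIX):] if psi.startswith(DATE_PREFIX) else None

merged = same_subject(
    "http://www.sofinummer.nl/0123.456.789",
    "HTTP://WWW.SOFINUMMER.NL/0123.456.789/",
)
birthday = date_from_psi("http://psi.semagia.com/iso8601/1961-03-06")
```

In both cases the decision rests entirely on the part after the prefix: comparing the bare numbers, or the bare ISO 8601 strings, would give the same answer.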
Third, in a real-world scenario, establishing identity statements across concepts from diverse ontologies is the problem to solve. Getting everybody to use the same single identifier for a concept is not feasible. Take an example such as Chimezie Ogbuji’s work on a Problem-Oriented Medical Record Ontology, where it says about ‘Person’:
cpr:person = foaf:Person and galen:Person and rim:EntityPerson and dol:rational-agent
In FOAF a person is defined with: “Something is a foaf:Person if it is a person. We don’t nitpic about whether they’re alive, dead, real or imaginary.”
HL7’s RIM defines Person as:
“A subtype of LivingSubject representing a human being.”
and defines LivingSubject as:
“A subtype of Entity representing an organism or complex animal, alive or not.”
In FOAF persons can be imaginary and ‘not real’, in the RIM they cannot. Now Chimezie wisely uses ‘and’, which is an intersection in OWL, so his cpr:person does not include imaginary persons. And for patient records, which are his concern, it doesn’t matter: we don’t treat Donald Duck for bird flu, so for medical care the entire problem is theoretical. But what about PSI’s: could we ever reconcile the concepts behind foaf:Person and rim:EntityPerson? Probably not: there are a lot of contexts where imaginary persons make sense. So if we make two PSI’s, foaf:Person and rim:EntityPerson, our subjects won’t merge, even when (in a certain context such as medical care) they should. Or we could forbid the use of foaf:Person in the medical realm, but this seems too harsh: the FOAF approach to personal information is certainly useful in medical care.
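As a toy illustration of why the intersection works while identifier comparison does not, here is a Python sketch that models each class as a plain set of individuals. The memberships are made up, and the galen:Person and dol:rational-agent parts of the definition are left out for brevity.

```python
# Toy extensions of the two classes (hypothetical members, for illustration only).
foaf_person = {"Marc de Graauw", "Donald Duck"}   # real or imaginary, FOAF doesn't mind
rim_entity_person = {"Marc de Graauw"}            # human beings only

# OWL 'and' behaves like set intersection on the class extensions.
cpr_person = foaf_person & rim_entity_person

# Comparing the identifiers themselves never merges the two concepts.
identifiers_merge = "foaf:Person" == "rim:EntityPerson"
```

The intersection quietly drops the imaginary person, while the identifier comparison keeps the two subjects apart in every context, including the ones where they should coincide.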
Identity of concepts is context-dependent. The definitions behind the concepts don’t matter much. Trying to find a universal definition for any complex concept such as ‘person’ will only lead to endless semantic war. Usually natural language words will do for a definition (but you do need disambiguation for homonyms). Way more important than trying to establish a single new id system with new definitions are ways to make sensible context-dependent equivalences between existing id systems.
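What such context-dependent equivalences could look like is sketched below in Python. The table, the context names and the function equivalent are all hypothetical; this is one possible shape under those assumptions, not an existing Topic Maps or OWL mechanism.

```python
# Per context, groups of identifiers that may be treated as the same concept.
# Contexts and groupings are invented for this example.
EQUIVALENCES = {
    "medical-care": [frozenset({"foaf:Person", "rim:EntityPerson"})],
    "fiction": [],  # imaginary persons matter here, so nothing is merged
}

def equivalent(id_a: str, id_b: str, context: str) -> bool:
    """True if the two identifiers denote the same concept within the given context."""
    if id_a == id_b:
        return True
    return any(id_a in group and id_b in group
               for group in EQUIVALENCES.get(context, []))

in_medical_care = equivalent("foaf:Person", "rim:EntityPerson", "medical-care")
in_fiction = equivalent("foaf:Person", "rim:EntityPerson", "fiction")
```

The existing identifiers stay untouched; only the mapping changes from one context to the next.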
15 Sep 2008: Comments are closed
Frankly, I think the biggest problem with PSIs is that they’ve been oversold, so that lots of people think they are something they’re not, and others dismiss them for failing to do what they were never intended to do.
I’ll try to put down what I think of your objections. In general, I think they are valid and that it’s good that you point them out, because that helps dispel all the confusion around this. I think you overstate the case a bit (e.g: “PSIs solve the wrong problem”), but I think that’s mostly because of all the overblown PSI marketing. With luck we’ll find that after the stick has been pushed too far first one way and then the other it will finally be sticking straight up. :-)
1: Yes, the pollution problem is real. No, PSIs will not solve it, or even affect it at all. That just means we have a challenge, though. It’s not a reason to give up on PSIs, just as governments don’t give up on social security numbers because of pollution problems.
2: You are right that the Semagia PSI page provides no information beyond what knowing that a date is an ISO 8601 date will. But then that’s a special case. The PSI for, say, tm:supertype-subtype will (once published) define the meaning of that association type. A URI can’t do that. So for subjects where there actually is something useful to define, the subject indicator will do that. Where there isn’t something useful to define, PSIs add nothing to basic URIs, and you might as well skip the indicator.
3: It’s definitely true that getting everyone to use the same PSI for every concept is a hopeless task, but I think your cpr:person example shows perfectly what you can do with URIs to identify concepts. With global identifiers (PSIs or just plain URIs) you can reuse someone else’s concept where that works, or relate your own concept to someone else’s (with subclassing, DL expressions, or whatever).
make it easy on yourself and just use Wikipedia URIs (or, if you must, dbPedia.org URIs)…
you are both worrying too much… ;-)
>> “PSI’s are URI’s which uniquely identify something”
To be more correct, PSIs are the resources that you get when you resolve the URIs.