Deduplicating Patient Summaries

The International Patient Summary (IPS) is the to-go exchange format for cross-border patient care – and a growing inspiration for in-country data exchange as well. The IPS is a document containing a summary of the most essential healthcare data.

There is, however, a problem with the idea of a Patient Summary to be exchanged internationally – most patients do not have such a single PS in their own country. Most often their data resides with several healthcare organizations – a GP, some pharmacists, one or more hospitals and so on. So how should a single IPS be generated from separate data sources? Here is a take a the problem based on the Dutch proposals (not final yet).

For this discussion I’ll take the IPS FHIR format as starting point – so I’ll assume the various data holders will generate an PS per data source in the IPS FHIR format, and those PSes are consolidated into a single IPS. This should not be a problem when all data sources only store their “own” data, but this is normally not the case. Data is more often than not copied between healthcare organizations, and the resulting IPS would contain a lot of duplicate information.

When not to dedup

For a lot of data, we choose not to deduplicate at all. For patient demographics, healthcare providers in the Netherlands use a single source (SBV-Z), and they do not usually copy addresses etc. between them – they just access the single source. Likewise, for patient data not in the single source, such as email address or mobile phone of first contact, deduplicating is not worth the effort. Ever been to a hospital? They usually ask if your first contact, email etc. is still correct anyway, so why deduplicate when data is checked anyway?

In other cases deduplication may not be worth the effort: we’ll check sections in detail below.

How to deduplicate

One approach to deduplicating FHIR resources is simply comparing the entire resources. This, however is hard. In XML, attribute order may be different but the resources may still be the same. In JSON, objects need not be in the same order at all. On top of that comes non-significant whitespace and other problems such as character encoding – the same issues that make electronic signatures on XML or JSON data a non-trivial effort. To sum up: comparing raw resources may not be impossible, but is far from ideal. Of course for those who want to this and do have a rigid algorithm for raw resource comparison, deduplication is allowed: if resources are the same on this level, they are true duplicates. I will however explore a higher level approach here.

True copies

There are other ways to find out whether resources are the same or not. On a FHIR server, this is not a problem. Any resource will look like this: [base]/[resourcetype]/[id], and it will yield a single resource – and adding a version id retrieves a specific version of it. The IPS however is a FHIR document, and things are more complicated. The id, to start with, cannot be expected to be unique in a FHIR Document: it is only so on a FHIR server. (If the meta.source element is included in the resource, meta.source+id might suffice to establish resource identity – but discussions on Zulip indicate that meta.source is not very strictly defined, so without further agreements this is not a very reliable approach.)

Photo by Jan Antonin Kolar on Unsplash

Looking at the actual international examples of the IPS from various countries, the meta element is not often populated. Within an ecosystem with tightly defined rules, the various items in meta could in principle be used to deduplicate resources in a FHIR document, but such an ecosystem is not in place internationally.

Relevant resources may have an identifier (a business identifier, such as social security number for a Patient). This should be globally unique, due to the system namespace. If we find two resources with the same identifier, they will represent the same entity, but maybe not the same version of that entity. The main question then is: what entities change over time, and may have separate versions, and which do not? (In FHIR, in theory any resource can have versions but in practice a lot won’t change over time.) Unfortunately an identifier by itself won’t indicate version information.

Last there is the Provenance resource, which could be the best solution, but in none of the IPS examples there are Provenance resources (unsurprisingly, since the IPS says nothing on Provenance), and this would require an ecosystem with even tighter rules. To sum up: there is no foolproof way (yet) in an international context to establish whether two resources are same or not, and certainly not which is the latest version.

Luckily in practice the situation is not always as dire as this paragraph might suggest.

Regarding Observations

For Observations there usually is no real need to deduplicate. First, duplicates for Observations should be rarer than for some other FHIR resources. A measurement such a body temperature or a lab result is usually done at one place at some specific time, and though doctors will want to see previous results, they may not always need to store those in their own EHR.

Second, if there are duplicates, this is usually not much of a problem. If there is an entry of my Body Weight stating that I weigh 79 kilograms on January 4, 2024, 09:57, and there is a second entry stating the same, if I plot the two, the points on the plot will be indiscernible. Likewise, if they are in a date-ordered table, no doctor will get confused. Of course, if one entry states that I weigh 79 kilograms and the second that it’s 89 kilograms on exactly the same datetime, there is a problem, but such data corruption problems would need to be addressed by humans anyway, not machines. On other words: don’t deduplicate, leave as is or just mark apparent conflicts.

The same applies to the Results section comprising lab, pathology, radiology and other Observations. In theory a lab result can change, but he IPS (and our Dutch PS) restricts lab results to only final results – so if there are duplicates (the same code and value on the same datetime), this should not be much of a problem. Of course the UI rendering the IPS can do something about duplicate results, i.e. plot on a single plot, or indent or shade apparent copies.

Guideline: do not bother to deduplicate Observations, except when they are known to be true copies

The same reasoning would apply to (most of) the Diagnostic Results, Vital Signs, Pregnancy, Social History and Functional Status sections of the IPS.

Conditions et cetera

The problem list is one of the main suspects for duplicated data. Doctors will want to see a patient’s current and past problems, and quite often some or most of this list will come from other sources – that means, duplicates from elsewhere. In the Netherlands entries from the problem list of one’s GP usually float around everywhere, and since the list can be long, this is a real problem. Doctors do not want to fish in a large pond of duplicated data to find out what the problems of a patient are.

Let’s look at some examples:

A patient has shortness of breath and goes to his GP. The GP makes a record, and since the condition is not easily diagnosed, sends the patient to a pulmonologist in a hospital, along with the record. After treatment the condition has abated and the shortness of breath is no longer an active problem. What could happen here?

Photo by National Cancer Institute on Unsplash
  • The GP could send a record with an identifier, and the hospital could copy and update this record and set it to: ‘not active’.
  • The GP could send a record with an identifier, but the hospital might only consult that, make its own record with its own identifier, and set that record to: ‘not active’.
  • Or the GP could send a record without a proper identifier, and force the hospital to make its own record.
  • The GP could code the condition as Dyspnea (Snomed: 267036007) and the hospital might retain that code.
  • Or the hospital’s doctor might refine the code to Dyspnea on exertion (Snomed: 60845006).
  • After a year or so, the condition might recur, and the GP might set it to ‘active’ again – illustrating that the status of a condition cannot be used for establishing which instance comes first.
  • If the two doctors both conclude that the patient has Asthma (Snomed: 195967001), the patient will have one asthma (which is a non-curable disease) and not two asthmas, even when the diagnoses are apart in time.
  • Even though there is only one asthma for this patient, there can very well be multiple records, i.e. multiple FHIR Condition resources – highlighting that the Condition resource is often not really about a patient problem, but about a record of that problem.
  • But if the patient is diagnosed with Covid-19 (Snomed: 840539006), the patient might very well have had Covid twice, not once, likely indeed if the diagnoses are removed in time.

All this illustrates that with Condition there is preciously little to rely on when deduplicating. When everything is a FHIR server, with support for updates, changing an existing record is not much of a problem: send an update or patch to the server which hosts the resource. (Of course, this assumes an ecosystem with agreements on who can update what.) But in the real world, not everything is a FHIR server with support for updates. When a PS is exchanged as a FHIR document, or retrieved with RESTful read, but without RESTful updates, one cannot update the original resource. To handle the list of possible changes to a Condition, we are considering a set of rules.

Guideline: when copying a resource, retain the original identifiers and do not change the resource.

Guideline: when changing a copied resource, issue a new identifier and discard the old ones.

The guidelines in practice mean that a resource, such as Condition, essentially becomes a doctor’s record of that Condition. If two doctors are saying something about a patient’s disorder, there will be two records. For instance, if the “Shortness of breath” problem is diagnosed by the GP as active, and later set to to not active by the pulmonologist, there will be two copies of the Condition, and another doctor receiving both should be able to see how the actual patient’s disorder has evolved over time. In the future, we might also consider relating Condition resources, enabling one to say: this record was based on that one. That also is useful when the code for the diagnosis changes, like from ‘Dyspnea’ to ‘Dyspnea on exertion’ above.

When did it happen?

Some resources have a dedicated element which indicates when a record was made or changed. As mentioned above, Observation has an effective[x] element with a datetime, period, instant etc. (There also is Observation.issued, but if we limit ourselves to Observations with status ‘final’ effective[x] seems to do.)

Condition does not have such an element. There’s recordedDate, but that is ‘Date record was first recorded’ – so that won’t change when the record is changed. There is meta.lastUpdated, but this is shaky ground. For reads on a single server, this could be useful, but if records are copied to another server, there should be a new meta.lastUpdated on the copy, even if nothing else has changed, and it’s not even clearly defined what the value of meta.lastUpdated should be in a FHIR Document: the value from the server? The moment it was written to the FHIR Document?

So for Condition and other resources there is a need for an element to capture when the record was updated. With such an ‘update time’ any PS could be rendered with possible copies grouped on Condition.code and shown in ascending time order, thus allowing the doctor to see how the Condition has changed over time – or, if wanted, simply show the last state of affairs. (We still have to decide how to register the ‘update time’ in FHIR – maybe Provenance.)

Medication

Medication in the Netherlands is handled by a separate project, and we’re moving to an ecosystem where all medication information is available to all (authorized) care providers. In theory one could go a long way with deduplication using subject, medication code and effective[x], but deduplication (especially of MedicationStatement) is error-prone and consequences of faulty deduplication can be serious. Good advice would be to always include identifiers and leave deduplication to the expert, not the machine, unless there are rigid rules on which to rely. Since we do have such an (evolving) set of rules in the Netherlands, I won’t go into details.

Other sections

For Allergies and Intolerances, the situation is similar to the Problem list, although in most cases the allergy list won’t be as long as the problem list can be, making duplicates easier to handle for humans.

Immunizations also should be of overseeable length, and grouping by code can help. Since an immunization once given should not change, deduplication on identifiers is also possible.

The entries in History of procedures should normally be past events which do not change (in some cases, the procedure could be ongoing or planned). Adding identifiers helps, and if two copies are seen, a look a status could help. Unlike Observations, the IPS does not restrict procedures to completed ones.

Medical devices such as pacemakers will carry an UDI which allows deduplication, and devices which do not, such as wheelchairs, can easily be grouped on code and should not constitute a very long list anyway.

Plan of care is often more the current plan of care. Deduplication should not be tried – if there are multiple plans, a doctor should decide what to do.

For Advance directives, the list should not be too long, and errors in deduplication are serious. So we advice not to deduplicate.

Conclusion

Deduplication of patient summaries is a though problem. However, when looking in detail, it often is not necessary and when it is, there are guidelines which can help a lot. One final piece of advice:

Guideline: when in doubt, let the humans decide.