The Hash Function For Radiation Oncologists

Hashing as a way to de-identify without losing track of a Patient

One of the issues with sharing electronic data is that if it leaves the confines of your network, it MUST BE de-identified.

It is worth pointing out that many items can serve to identify patients; actual dates and addresses are particularly revealing. While patients may share as much as they like online, we are called to a higher standard. For this reason, my spreadsheets never include anything more than ages.

If you wish to store locality data, you should deliberately obfuscate it. I can think of two methods. First, you can pick a personally defined reference spot and record positions as coordinates relative to that place. Second, you could use something like what3words to find, for example, the front door to my home. Since this is specified in 3 words, 'think party figs', you can then process this text to give slightly different but predictable variations such as 'think+party+figs', 'thinkthinkpartypartyfigsfigs', 'thinkfigspartythinkfigsparty', '1think2party3figs' or '1/think2/party3/figs', which you can hash to make them unidentifiable to others. I'll show you this as examples later.
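As a minimal sketch of both methods, assuming Python and its standard hashlib module: the reference point, the coordinates and the helper names (offset_from_origin, variants) are made up for illustration, and the three words are the example used above.

```python
import hashlib

# Hypothetical private reference point (latitude, longitude); keep it to yourself.
ORIGIN = (10.0, 20.0)

def offset_from_origin(lat, lon, origin=ORIGIN):
    """Method 1: store a position only as an offset from a private origin."""
    return (round(lat - origin[0], 6), round(lon - origin[1], 6))

def variants(words):
    """Method 2: predictable rearrangements of a three-word address."""
    a, b, c = words
    return [
        "+".join(words),                                      # think+party+figs
        a + a + b + b + c + c,                                # thinkthinkpartypartyfigsfigs
        (a + c + b) * 2,                                      # thinkfigspartythinkfigsparty
        "".join(f"{i}{w}" for i, w in enumerate(words, 1)),   # 1think2party3figs
        "".join(f"{i}/{w}" for i, w in enumerate(words, 1)),  # 1/think2/party3/figs
    ]

# Hash each variation so that only the digest is stored or shared.
for text in variants(["think", "party", "figs"]):
    print(text, "->", hashlib.md5(text.encode("utf-8")).hexdigest())
```

Whichever variation you pick, you have to use the same one every time, otherwise the hashes will not match between exports.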

There are several ways of de-identifying data, each with consequences that may cause trouble later.

  1. Take your spreadsheet and remove all identifiers completely. If you do this to your main spreadsheet you have a big problem: how will you ever update the data?
  2. Take the MRN and alter it to something else. If you choose random replacements, you might as well remove the MRNs altogether, as there is no pattern to the change and so no way to find the same patient again.
  3. Use a stable hashing system with non-changing identifiers and re-generate the hashes any time you need them. This is what I do. I extract my spreadsheets from an Oncology Information System, and I auto-generate the HashID each time I pull them out. When I anonymise the patient's DICOM-RT files with DICOMpyler, I use the HashID as the new ID.
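The difference between a random replacement and a stable hash can be sketched as follows. This is only an illustration in Python: the function names are made up, uuid4 stands in for "a random pattern", and the input is the composite string used further down this page.

```python
import hashlib
import uuid

def random_id():
    """A random replacement ID: different every time it is generated."""
    return uuid.uuid4().hex

def stable_id(composite):
    """A hash-based ID: the same input always gives the same result."""
    return hashlib.md5(composite.encode("utf-8")).hexdigest()

composite = "ICCCDEPTAlexis11051955"  # example composite string from this page

# Two exports with random IDs cannot be matched up afterwards...
print(random_id() == random_id())                    # False in practice
# ...whereas the hash can be re-generated whenever it is needed.
print(stable_id(composite) == stable_id(composite))  # True
```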

How To Hash an Identifier

I add an extra ID column to my spreadsheet, built from the MRN, FirstName and Date of Birth joined together without extraneous characters. I use the MD5 algorithm (MD5SUM), which gives me a probability of about 1 x 10^-27 of two records getting the same result. If this is too high for you then you can use the SHA-1 algorithm, which has a much lower chance of repetition.

Your String
ICCCDEPTAlexis11051955
MD5 Hash
225748c15f74e40755ec328cb1d96be7
SHA1 Hash
db7da9948204641ee8e469ec0bc04fa96d587e7f
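If you would rather not paste identifiers into an online generator, the same step can be sketched in Python with the standard hashlib module. The function names are hypothetical, and the sketch assumes that "extraneous characters" means anything other than the letters A-Z and digits, and that the text is encoded as UTF-8 before hashing.

```python
import hashlib
import re

def composite_id(mrn, first_name, date_of_birth):
    """Join the fields and drop everything except ASCII letters and digits."""
    return re.sub(r"[^A-Za-z0-9]", "", f"{mrn}{first_name}{date_of_birth}")

def hash_ids(text):
    """Return the MD5 and SHA-1 digests of the text as hex strings."""
    data = text.encode("utf-8")
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()

cid = composite_id("ICCCDEPT", "Alexis", "11/05/1955")
md5_hex, sha1_hex = hash_ids(cid)

print(cid)       # ICCCDEPTAlexis11051955
print(md5_hex)   # 32 hex characters
print(sha1_hex)  # 40 hex characters
```

Running it lets you compare the two digests against the values listed above. Note that the filter used here would also strip accented letters from names, so adjust it if your names contain them.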

So my spreadsheet will look like this:

MRN       FirstName  DateOfBirth  CompositeID             HashID
ICCCDEPT  Alexis     11/05/1955   ICCCDEPTAlexis11051955  225748c15f74e40755ec328cb1d96be7

When it comes time to share, I remove all columns to the left of HashID.
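The spreadsheet step can be sketched with Python's standard csv module. This is an illustration rather than my actual export script: the in-memory CSV stands in for a real export, the Age column and its value are made up, and the function names are hypothetical.

```python
import csv
import hashlib
import io
import re

# Stand-in for an exported spreadsheet; Age is a made-up clinical column.
export = io.StringIO(
    "MRN,FirstName,DateOfBirth,Age\n"
    "ICCCDEPT,Alexis,11/05/1955,60\n"
)

IDENTIFIERS = ["MRN", "FirstName", "DateOfBirth"]

def add_hash_id(row):
    """Return the row with CompositeID and HashID placed after the identifiers."""
    composite = re.sub(r"[^A-Za-z0-9]", "", "".join(row[k] for k in IDENTIFIERS))
    out = {k: row[k] for k in IDENTIFIERS}
    out["CompositeID"] = composite
    out["HashID"] = hashlib.md5(composite.encode("utf-8")).hexdigest()
    out.update({k: v for k, v in row.items() if k not in IDENTIFIERS})
    return out

def shareable(row):
    """Keep HashID and every column to its right; drop everything to its left."""
    keys = list(row)
    return {k: row[k] for k in keys[keys.index("HashID"):]}

full_rows = [add_hash_id(r) for r in csv.DictReader(export)]
shared_rows = [shareable(r) for r in full_rows]

print(list(full_rows[0]))    # all columns, identifiers first
print(list(shared_rows[0]))  # ['HashID', 'Age']
```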

How sensitive is the Hash function? Well, let's say that I choose to leave in the '/' in the date, so that my CompositeID is ICCCDEPTAlexis11/05/1955 rather than ICCCDEPTAlexis11051955. These are the results:

old HashID
ICCCDEPTAlexis11051955 » 225748c15f74e40755ec328cb1d96be7
new HashID
ICCCDEPTAlexis11/05/1955 » 72168ac3178b9106a0c47283a143e475

So you see that the Hash output has changed completely. The same happens even if I only put a space at the end of the line:

new HashID
'ICCCDEPTAlexis11/05/1955' » 72168ac3178b9106a0c47283a143e475
new HashID + space
'ICCCDEPTAlexis11/05/1955 ' » 0efe8e223ea11ef52e74c1840db2b91b
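You can repeat this comparison yourself. The following is a small sketch in Python using the three strings above; the helper names are made up, and the count of differing positions is just one simple way of showing how much of the digest changes.

```python
import hashlib

def md5_hex(text):
    """MD5 digest of the UTF-8 encoded text, as a hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def differing_positions(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

inputs = [
    "ICCCDEPTAlexis11051955",     # no slashes
    "ICCCDEPTAlexis11/05/1955",   # slashes left in
    "ICCCDEPTAlexis11/05/1955 ",  # slashes and a trailing space
]
digests = [md5_hex(s) for s in inputs]

for s, d in zip(inputs, digests):
    print(repr(s), "->", d)

print(differing_positions(digests[0], digests[1]), "of 32 positions differ")
print(differing_positions(digests[1], digests[2]), "of 32 positions differ")
```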

The Hash function is used in IT to quickly tell if a file has been changed, because files can be hashed just like text. Many online hash generators will allow you to upload a file to generate a hash. Take a picture, get its hash, then edit ONE PIXEL in the file and see if the hash changes.
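The same experiment can be sketched offline in Python. A temporary file of 1000 zero bytes stands in for the picture here, and changing one byte stands in for editing one pixel; the function name file_md5 is hypothetical.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=65536):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Stand-in for a picture: 1000 zero bytes written to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(1000))
    path = tmp.name

before = file_md5(path)

# Change a single byte in the middle of the file.
with open(path, "r+b") as handle:
    handle.seek(500)
    handle.write(b"\x01")

after = file_md5(path)
os.remove(path)  # clean up the temporary file

print(before)
print(after)
print(before != after)  # True
```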

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License