Big Data For Radiation Oncologists

Big Data

It is almost impossible that you have not heard about "Big Data". The message usually arrives with the gusto of a salesperson in full flight, spruiking the way the world will be changed by "Big Data".

Some years ago Google published its characteristics of big data, calling them the 4 Vs: volume, variety, velocity, and variability. Unfortunately, on three of these fronts medical data does not fulfill the characteristic.


Volume

It is frequently claimed that medical data is BIG DATA because of the amount of space it takes up. This is a gross oversimplification of the real situation.

I downloaded and analysed a PET/CT recently. The image files occupied over 250MB, while the report occupied over 250kB. That's a 1000-fold difference. And on our PACS we have CT sequences that occupy over 1GB. The sad reality is that imaging files make up the vast majority of our stored medical data by volume, yet none of them yield any medical information without interpretation.
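The arithmetic behind that claim is worth making explicit. The sizes below are simply the round figures quoted above, not measured constants:

```python
# Illustrative arithmetic only: these are the round figures quoted in the text.
image_bytes = 250_000_000   # ~250 MB of PET/CT image files
report_bytes = 250_000      # ~250 kB for the written report

ratio = image_bytes / report_bytes
print(f"Imaging occupies {ratio:.0f} times the space of the report")  # 1000
```

The "volume" of medical data is therefore dominated by pixels, not by the interpreted findings that actually carry clinical meaning.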


Variety

Variety we do have. However, the current paradigm of collection is that a health professional does the collecting face-to-face, which is of course the most expensive way to get data. We have not utilised Patient Reporting as a way of discovering the wealth of data we are too busy to ask about.

The increasing implementation of Synoptic Reporting is very useful here; however, many of its items remain free text (see comments below).


Velocity

I have had 3 PSAs, 2 MRIs, and perhaps 30 plain X-rays in a life of over 60 years; the velocity of this data is… slow! Even in cancer medicine, a scan every 3 months does not constitute fast data.


Variability

Here we have the situation of both little and a lot!

There is little variability in structured items: T staging has only 3-6 possible entries, and PSAs change, but not dramatically.

There is, however, huge variability in our free text entries. Free text continues to stump the Natural Language Processing gurus. The obvious stuff is easy (it's why it's called 'obvious'!), but subtle inference is not. Nor is inferring what the text does not say: the situation where you read a report and think "Oh, did they have a stroke?" even though no such diagnosis is written anywhere.
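A toy sketch of why the obvious is easy and the subtle is not. The `mentions_stroke` function below is a hypothetical, deliberately naive keyword matcher, not any real NLP tool:

```python
import re

def mentions_stroke(report_text: str) -> bool:
    """Naive keyword search: flags any occurrence of 'stroke' or 'CVA'."""
    return bool(re.search(r"\b(stroke|cva)\b", report_text, re.IGNORECASE))

# The obvious case works:
print(mentions_stroke("Past history: stroke in 2015."))   # True

# Negation defeats it: the keyword is present, but the condition is absent.
print(mentions_stroke("No history of stroke or TIA."))    # True (wrong!)

# Inference defeats it entirely: a human reads this and asks
# "Oh, did they have a stroke?", but no keyword appears at all.
print(mentions_stroke("Sudden onset of left-sided weakness and slurred speech."))  # False
```

Real clinical NLP systems add negation detection and context modelling, but the inferential gap illustrated by the last example is exactly where free text still defeats automated analysis.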


Veracity

I would ask you to notice that the most important V of all, veracity, is not included. Our retrospective medical analyses depend on veracity: the assumption that the data is the truth.

The danger in the Big Data space is that analysis occurs on whatever data is available, without investigation of its veracity. Births, Deaths & Marriages (BDM) data is an excellent example of this phenomenon. You can say three things about BDM data with respect to Death:

  1. Date/Time of Death - there can be no argument with this datum as the doctor is standing right there filling it in, and a glance at a watch or phone gives the data.
  2. Place of Death - while it might not be the actual place of death, the site of declaration of death is likewise accurate because the doctor will know where they are standing.
  3. Cause of Death - this datum is suspect and often incorrect (perhaps in more than 25% of cases). If there is this much inaccuracy, can you use it for oncological analysis?

Consider this example. A study assumes that the 2-year overall survival rate (which uses Date of Death from any cause) is a reliable surrogate for cure in lung cancer (this has been done, and is frequent in the Radiomics literature). The problem with this assumption is that these patients are typically elderly and carry a huge co-morbidity burden, so many die of causes other than their cancer: overall survival grossly understates cause-specific survival (which uses Date of Death from Cancer), and therefore understates cure. This approach would be far more reasonable in a cancer of young people, who have little other reason to die in the next 2 years.
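The gap between the two endpoints can be shown with a toy Monte Carlo sketch. The mortality rates below are invented for illustration only, not clinical data; the point is simply that in an elderly, comorbid cohort, 2-year overall survival sits well below cause-specific survival:

```python
import random

random.seed(42)

# Invented, illustrative 2-year mortality rates -- not clinical data.
N = 100_000
p_cancer_death = 0.20   # assumed probability of dying of the cancer within 2 years
p_other_death = 0.25    # assumed probability of dying of comorbidity within 2 years

alive_overall = 0       # survives everything
alive_cancerwise = 0    # survives the cancer (may still die of other causes)

for _ in range(N):
    dies_of_cancer = random.random() < p_cancer_death
    dies_of_other = random.random() < p_other_death
    if not dies_of_cancer:
        alive_cancerwise += 1
    if not (dies_of_cancer or dies_of_other):
        alive_overall += 1

os_2yr = alive_overall / N       # overall survival: death from ANY cause counts
css_2yr = alive_cancerwise / N   # cause-specific survival: cancer deaths only
print(f"2-year OS  ~ {os_2yr:.2f}")   # ~0.60
print(f"2-year CSS ~ {css_2yr:.2f}")  # ~0.80
```

With these assumed rates, roughly a quarter of "failures" counted by overall survival have nothing to do with the cancer, which is exactly why using Date of Death alone, without a verified cause, misleads an oncological analysis.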

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-NonCommercial-NoDerivs 3.0 License