Modern medicine runs on data. That’s how it works. That’s how we know that vaccines don’t cause autism, how we know which treatments work better, how we know that smoking causes cancer and that exercise makes you live longer. It’s a relatively simple concept: if you compare 1,000 people who share some characteristic (“obese”, “smoker”, “being treated with Herceptin”) with 1,000 people who don’t, on some predetermined outcome (survival rates, levels of reported pain, hospital admissions, whatever), you can see what effect that characteristic has on that outcome. “Of the 1,000 people with headaches to whom we gave aspirin, 950 reported that the headache went away within 20 minutes, compared with just 600 of those to whom we gave a placebo. Thus we conclude that aspirin can make headaches go away.”
Of course it’s a little more complicated than that. I’ve pretended, above, that studies of interventions like aspirin or Herceptin are the same as studies of lifestyle factors like exercise and smoking. But of course they’re not. You can give someone aspirin, or Herceptin, fairly straightforwardly over a relatively short period of time. But changes to lifestyle can’t be doled out in 100mg tablets. If you’re doing a study into whether eating a piece of fruit every day improves health, you can’t prescribe 1,000 people an apple a day for 10 years and tell 1,000 people not to eat apples for the same period (or to eat a placebo apple), for what I hope are obvious reasons. So you have to ask people what they eat, or smoke, or drink, and try to correct what they say for known tendencies to misreport. This is why you can pretty much discount any claims about a “superfood” which is really good for you; the data just won’t be good enough.
It’s more complicated still. People aren’t random collections of characteristics. If you’re not careful, you can end up taking 1,000 people who tend to share other characteristics as well. So, for instance, if smokers are more likely to be drinkers as well, then your results might show that smoking is more harmful than it actually is, because you’d be comparing 1,000 “smokers who tend to be drinkers” with 1,000 “non-smokers who tend not to be drinkers”, and ill health caused by drinking could be blamed on smoking. So what you have to do is break down those sets of 1,000 into subsets – “smokers who drink more than X units”, “smokers who don’t drink more than X units”, “non-smokers who…” and so on – and compare them against each other.
All of which makes it important that those sets are as big as possible. That’s important anyway: the more data you have, the less likely it is that any of your results will be simple coincidence. If you’ve got three smokers and three non-smokers, you might find that two of the non-smokers are killed in car crashes aged 25, and one of the smokers is your gran who lived to 102 despite smoking 40 a day for six decades. If that was all the data you had, you might conclude that smoking doubles life expectancy. If you do the same analysis with 1,000 smokers and 1,000 non-smokers, though, the odds of that freak event happening are far smaller.
But when your data sets have to be subdivided (that “smokers who don’t drink” stuff), it becomes doubly – or triply, or quadruply – important. Your hefty-looking sets of 1,000 are suddenly hacked down to piddling little 250s or whatever. And all of a sudden your ability, as a health researcher, to draw weighty conclusions from the data is crippled.
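You can see the problem in a toy simulation (a hypothetical sketch with made-up numbers, not real health data): give two groups an identical true outcome rate, and watch how big a difference between them can appear by pure chance at different group sizes.

```python
import random

random.seed(42)

def fluke_gap(n):
    """Simulate two groups of size n whose members have an IDENTICAL
    60% chance of a 'good outcome', and return the apparent
    percentage-point gap between the groups -- pure coincidence."""
    a = sum(random.random() < 0.6 for _ in range(n))
    b = sum(random.random() < 0.6 for _ in range(n))
    return abs(a - b) / n * 100

# Average the apparent gap over many repeated pretend studies.
for n in (3, 250, 1000):
    avg = sum(fluke_gap(n) for _ in range(2000)) / 2000
    print(f"group size {n:4d}: average chance difference ~{avg:.1f} points")
```

Run it and the chance gap is enormous for groups of three (your gran and the car crashes), noticeable at 250 (the subdivided subsets), and small at 1,000 – which is exactly why researchers want the subsets to stay big.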
All of which is a long-winded way of saying that large data sets might not be sexy things, but they save actual lives of actual people, because they allow doctors and other health professionals to make good decisions in healthcare, whether that’s prescribing breast cancer drugs, or recommending that we eat five pieces of fruit and veg a day.
And that, in turn, means that the decision to delay rolling out the NHS data-sharing system, care.data, for six months out of fears for privacy, will almost certainly mean that years of people’s lives are lost. The system will, when it goes ahead, allow millions and millions of patients’ health records, and all the lifestyle and medical data that they contain, to be shared with NHS researchers (and, in carefully limited ways, outside bodies). That’s a vast data set, which could be subdivided and subdivided and still provide decent-size subsets. But people, and the media, are worried about privacy in the age of the internet, and often quick to assume that any data sharing is a bad thing, so the NHS has been scared off, albeit temporarily.
That’s not, necessarily, to say that the delay is a bad idea. The good folks at the Science Media Centre have been compiling comments from senior scientists on it, and quite a few of them seem to think that the NHS simply hasn’t done a good enough job of explaining why the system will save lives, and how patients’ privacy will be safeguarded. There are, it seems, strong safeguards against misuse of the data – it will, for instance, be illegal for insurance companies to use it to work out who’s an insurance risk – and powerful checks and balances where identifiable data will be used. But the efforts to inform the public about it have been inadequate, leaving people concerned, or unaware, and unable to make an informed judgment about the risks and benefits.
So perhaps the delay is justified; it’s a value judgment over whether allaying privacy concerns outweighs a small but probably not negligible number of life-years lost (years spent with grandchildren, years spent walking through cornfields with the sun on your back, years spent kite-surfing in the Algarve, whatever emotionally manipulative image you prefer). But when, in future, we complain about some poor health outcome in Britain – for instance, that five-year cancer survival rates are on a par with those in eastern Europe – remember that we could make things better, easily.
More by Tom Chivers