‘Big data’ raises big questions for biomedicine

We are used to numbers getting extreme. The European Bioinformatics Institute has its 1,000 genomes. The UK10K project is looking at ten times that number of exomes. Now we have the Government’s 100,000 genomes project and China’s BGI has announced the intention to sequence a million. If biology crossed the threshold of ‘big science’ with the Human Genome Project all those years ago, it is now making itself at home there.

We are used to hearing about the cost of sequencing falling through the floor and the speed increasing steeply, but a salient feature of these developments is the unprecedented generation of data. And it’s not just genomics, but all the other ‘omics’, medical imaging, routine health care, telemonitoring, citizen science, ‘body hacking’ and social networking that are generating ‘biomedical’ data. IBM’s website estimates that 90% of the data generated in human history were generated in the last two years.

And these data are not only being generated; they are accumulating. Data technologies such as digitisation and cloud storage mean that the data can be stored, communicated, linked and manipulated indefinitely. This is made possible by the way that so many professional, public and even social interactions are increasingly mediated by electronic devices. Another estimate has it that less than 2% of the stored information that exists in the world is in non-digital form.* Developments in data science, now grouped under the popular rubric of ‘big data’, offer powerful ways to analyse and ‘mine’ the data for meaningful insights, to build models with it and to predict future states of affairs.

Biomedical data no longer constitutes merely an asset with a clearly defined purpose but has become a valuable resource with multiple potential uses. In this new light we may be seeing the preoccupations of data owners and policy makers turn from ‘data protection’ to ‘value extraction’. This value can be scientific knowledge, insight that leads to the development of new treatments, the prevention of disease, or service improvement. In another register it can also be national economic growth and international competitiveness for the UK among the world’s knowledge economies. Or, again, it could be value of a shadier kind: there is, apparently, a vibrant black market in unlawfully acquired personal health data.

Just yesterday there was an announcement from NHS England that we would all be getting leaflets next January to explain the virtues of ‘care.data’ – a programme to improve health services by linking data across all care settings – and, of course, to remind us that we have a right to opt out of this. As announcements in October go, “You’ll all be hearing from us in January” is not exactly a blockbuster, but the tectonic shifts that underlie these minor media eruptions are truly transformative. One of the great continental plates in data tectonics is the Health and Social Care Information Centre (HSCIC), which underlies care.data and is intended to provide a ‘safe harbour’ for pretty much all health data that is not written freehand in the traditionally illegible medics’ scrawl.

The potential of linking the data on these great tectonic plates is clearly enormous. One, UK Biobank, was based on a prospectus that it would provide a substantial research resource in the public interest, by taking some baseline measurements and tracking the health of participants. HSCIC provides a similar kind of resource from the side of health care, delivering Care Episode Statistics (a greatly broadened version of Hospital Episode Statistics, or HES). Genomics England will, among other things, link its 100,000 genome sequences to health records of patients with complex and rare diseases. The potential of such linking will be developed by bodies such as the new Farr Institute and Health eResearch Centre. As the tectonic plates enmesh and the data becomes combined, distinctions between research and treatment, medical and non-medical data, become even less clear.

These developments – huge accumulation of data combined with increasingly powerful tools for putting it to use – and the existence of strong incentives to link and use data put pressure on the measures that have evolved to govern it. As I suggested in an earlier blog post, the conventional approach of “consent or anonymise” appears fragile when statistical methods allow information about individuals to be picked out of ‘anonymised’ data sets and when the ways in which data may be used cannot be predicted or perhaps even imagined when the data are first collected.

This, then, is the background to our current project examining the ethical questions raised by the collection, linking, use and exploitation of biological and health data.

The change in incentives and potentials (for harm as well as benefit) brings a need to re-examine the relationship between values, interests and norms in the light of these developments. We will start with the data and consider what opportunities exist to use it, and how these opportunities might be opened up or constrained. In doing so we will explore the significance of individual privacy and the public interest, as well as the impact of the new data paradigm on norms of social and professional behaviour (e.g. social networking, research methodologies). We will identify the values and principles that will guide how we govern and control the use of biomedical data.

Today we are launching an open consultation to invite people to contribute their information and views. This is a really important part of how we work at the Council, since it provides both a method of gathering information and an opportunity to expand the expert advice available to us via our interdisciplinary Working Party and other advisors.

* Mayer-Schönberger and Cukier (2013) Big Data, p.9