Aug 22, 2022
In this episode Miles is joined by Professor Luciano Floridi of Oxford University; Simon Whitworth of the UK Statistics Authority; and Pete Stokes from the ONS to talk about data ethics and public trust in official statistics.
Hello, I'm Miles Fletcher, and in this episode of Statistically Speaking we're exploring data ethics and public trust in official statistics. In 2007, 15 years ago to the very day we are recording this, the UK Parliament gave the Office for National Statistics the objective of promoting and safeguarding the production and publication of official statistics that serve the public good. But what does, or should, the “public good” mean? How does the ONS seek to deliver it in practice? Why should the public trust us to act in their interests at a time of exponential growth in data of all kinds? And where are the lines to be drawn between individual privacy and anonymity on the one hand, and, on the other, the potential of data science to improve public services and government policies, to achieve better health outcomes, even to save lives?
Joining me to discuss these topics today are Simon Whitworth, Head of Data Ethics at the UK Statistics Authority; Pete Stokes, Director of the Integrated Data Programme here at the ONS; and Luciano Floridi, Professor of Philosophy and Ethics of Information and Director of the Digital Ethics Lab at the Oxford Internet Institute.
Professor, let's start with this big concept, and with you. What do you think Parliament meant when it said that the ONS should serve the public good in this context?
It might have meant many things, and
I suspect that a couple of them must have been in their minds.
First of all, we know that data or information, depending on the
vocabulary, has an enormous value if you know how to use it. And collecting it and using it properly matters for the future of the country:
to implement the right policies, to avoid potential mistakes and to
see things in advance - knowledge is power, information is power.
So, this might have been one of the things that they probably meant
by “public good”. The other meaning, it might be a little bit more
specific...It's when we use the data appropriately, ethically, to
make sure that some sector or some part of the population is not
left behind, to learn who needs more help, to know what help and
when to deliver it, and to whom. So, it's not just a matter of the
whole nation doing better, or at least avoiding problems, but also
specific sectors of the population being helped, and to make sure
that the burden and the advantages are equally distributed among
everybody. That's normally what we mean by public good and
certainly, that analysis is there to serve it.
So there's that dilemma between using the power of data to actually achieve positive outcomes and, on the other hand, government being seen as overbearing, or Orwellian, spying on people through the use of their data?
That would be the risk that sometimes comes under the term “paternalism”: that knowing a lot about your citizens might lead to the temptation of manipulating their lives, their choices, their preferences. I wouldn't over-emphasise this though. The kind of legislation that we have and the constraints, the rules, the double checking, make sure that the advantage is always in view and can more easily be squeezed out of the data that we accumulate, while the potential abuses and mistakes, the inevitable temptation to do the wrong thing, are kept in check. So yes, the State, with its political power, might misuse data, and so we need to be careful, but I wouldn't list that as my primary worry. My primary worry, perhaps, would be under-using the data that we have, or making mistakes inadvertently.
Do you think then, perhaps as a country, the UK has been too cautious in this area in the past?
I don't think it has been too
cautious, either intellectually or strategically. There's been a
lot of talking about doing the right thing. I think it's been
slightly cautious, or insufficiently radical, in implementing
policies that have been around for some time. But we now have seen
several governments stating the importance of data analysis,
statistical approaches to evidence, and so on. But I think that
there is more ambition in words than in deeds, so I would like to
see more implementation, more action and fewer statements. Then the ambition will be matched by actions on the ground.
One of the reasons perhaps there
might have been caution in the past is of course concern about how
the public would react to that use of data. What do we know of
public attitudes now, in 2022, to how government bodies utilise data?
I think the impression is that,
depending on whom you ask, whether it is the younger population or
slightly older people my age, people who lived in the 50s
versus my students, they have different attitudes. We're
getting used to the fact that our data are going to be used. The
question is no longer are they going to be used, but more like, how
and who is using them? For what purposes? Am I in charge? Can I do
something if something goes wrong? And I would add also, in terms
of attitude, one particular feature which I don't see sufficiently
stressed, is who is going to help me if something goes wrong?
Because the whole discussion, or discourse, should look more at how
we make people empowered, so that they can check, they have
control, they can go do this, do that. Well, who has the time, the
ability, the skills, and indeed the will, to do that? It's much
easier to say, look, there will be someone, for example the
government, who will protect your rights, who you can
approach, and they will do the right thing for you. Now we're
getting more used to that. And so, I believe that the attitude is
slightly changing towards a more positive outlook. As long as everything is in place, we are seeing an increasingly positive attitude towards public use of public data.
Pete, your role is to make this
happen. In practice, to make sure that government bodies, including
the ONS, are making ethical use of data and serving the public
good. Just before we get into that though, explain if you would,
what sort of data is being gathered now, and for what purposes?
So we've got a good track record of supporting research use of survey data, collected largely in the ONS, but in other government departments as well. But over the last few years, there's been an acceleration and a real will to make use of data that have been collected for other purposes. We make a lot of use now of administrative data; these are data that are collected by government not for an analytical purpose but for an operational purpose. For example, data that are collected by HMRC from people when they're collecting tax, or by the Department for Work and Pensions when they're collecting benefits, or by local authorities when they're collecting council tax - all of those administrative data are collected and stored. There's an increasing case to make those data available for analysis, which we're looking to support. And then the other new area is what's often called “faster data”; these are data that are typically readily available, usually in the public domain, where you don't get as deep an insight as you'd get from a survey or administrative data, but you can get a really quick answer. And a good example of that from within the ONS is that we calculate inflation. As a matter of routine, we collect prices from lots of organisations, but you can do some of that more quickly if you can pull data that are readily available on the internet to give you those quicker indicators, faster information of where prices are rising quickly and where they're dropping quickly. There's a place for all of these depending on the type of analysis that you want to do.
This is another area where this
ethical dilemma might arise though isn't it, because when you sit
down with someone and they've agreed to take part in the survey,
they know what they're going in for. But when it comes to other
forms of information, perhaps tax information that you've mentioned
already, some people might think, why do they want to know that?
When people give their data to HMRC or to DWP as part of the process of receiving a service, like paying tax for example, I think people generally understand what they need to give that department for their specific purpose. When we then want to use those data for a different purpose, there is a larger onus on us to make sure that we are protecting those data, that we're protecting the individual, and that those data are only being used ethically, specifically in the public interest. So, it's important that we absolutely protect the anonymity of the individuals, that we are clear about how their data are used, and that we are not using the data of those data subjects as individuals, but instead as part of a large dataset to look for trends and patterns within those data. And finally, that the analyses that are then undertaken with them are explicitly and demonstrably in the public interest, that they serve the public good of all parts of society.
And that's how you make the ethical
side of this work in practice, by showing that it can be used to
produce faster and more accurate statistics than we could possibly
get from doing a sample survey?
Yes, exactly, and sample surveys are very, very powerful when you want to know about a specific subject, but they're still relatively small. The largest sample survey that the ONS does is the Labour Force Survey, which collects data from around 90,000 people every quarter. Administrative datasets have got data from millions of people, which enables you to draw your insights not just at a national level, on national patterns, but on smaller geographic areas: administrative data give you the power to do that analysis when surveys simply don't. But any and all use of data must go through a strict governance process to ensure that the confidentiality of the data subjects is preserved, and that the use will not only be clearly and demonstrably in the public interest, but will also be ethically sound and stand up to scrutiny in that way as well.
And who gets to see this stuff?
The data are seen by the accredited researchers that apply to use them. So, a researcher applies to use the data, they demonstrate their research competence and their trustworthiness, and they're accredited. They can then use those data in a
secure locked-down environment, and they do their analysis. When they
complete their analysis, those can then be published.
Everybody in the country can see the results of those analyses. If
you've taken part in a social survey, or you've contributed some
data to one of the administrative sources that we make available,
you can then see the results of all the analyses that are done
with those data.
But when you say it's data, this is where the whole process of anonymisation is important, isn't it? Because if I'm an accredited researcher, am I able to see names and addresses, or people's sensitive personal information?
No, absolutely not. And the researchers only get to see the data that they need for their analysis. And because we have this principle that the data are being used as an aggregated dataset, you don't need to see people's names or people's addresses. You need to know where people live geographically, in a small or broad area, but not the specific address. You need to know someone's demographic characteristics, but you don't need to know their name, so you can't see their name in the data. And that principle of pseudonymisation, or the de-identification of data before they're used, is really important. When the analyses are completed and the outputs are produced, those are then reviewed by an expert team at the ONS, and so the data are managed by us to ensure that they are fully protected, wholly non-disclosive, and that it's impossible to identify a member of the public from the published outputs.
Historically, government departments
didn't have perhaps the best record in sharing data around other
bodies for the public benefit in this way. But all that changed,
didn't it? A few years back with a new piece of legislation which
liberalised, to an extent, what the ONS is able to do.
So, the Digital Economy Act, passed
in 2017, effectively put on a standard footing the ability of other
departments to make their data available for researchers in the
same way that the ONS had already been able to do since the Statistics and Registration Service Act 2007. It gave us parity, which then gave
other departments the ability to make their data available and
allow us to help them to do so, to take the expertise that the ONS
has in terms of managing these data securely, managing access to
them appropriately, accrediting the researchers, checking all the
outputs and so on, to give the benefit of our expertise to the rest
of government, in order that the data that they hold, which have arguably been underutilised previously, could then be fully used
for analyses to develop policies or deliver services, to improve
understanding of the population or cohorts of the population or
geographic areas of the country, or even sectors of industry or
segments of businesses, for example, in a way that hasn't
previously been possible, and clearly benefits the country as a whole.
So the aim here is to make full use of a previously untapped reservoir, a vast reservoir, an ocean you might even say, of public data. But who decides what data gets brought in in this way?
We work closely with the departments that control the data, but ultimately, those departments decide what use can be made of their data. So, it is for HMRC, DWP, the Department for Education, it’s for them to decide which data they choose to make available through the Secure Research Service (SRS) or the Integrated Data Service (IDS) that we run in ONS. When they're supportive and recognise the analytical value of their data, we then manage the service where researchers apply to use those data. Those applications are then assessed by ONS first and foremost, we then discuss those requests and the use cases with the data owning departments and say, do you agree this would be a sensible use of your data?
There is then an independent accreditation panel, which reports to the UK Statistics Authority Board, that assesses whether the request to use the data is in the public interest, that it serves the public good.
The ethics of the proposal are also
assessed by an independent ethics advisory committee, whether it's
the national statistician's data ethics advisory committee or
another. There's a lot of people involved in the process to make
sure that any and every use of data is in the public interest.
From what we know from the evidence
available, certainly according to the latest Public Confidence in Official Statistics survey - that's a big biannual survey run by
the UK Statistics Authority (UKSA) - I guess for that, and other
reasons, public trust remains high. The Survey said 89% of people
that gave a view trusted ONS, and 90% agreed that personal
information provided to us would be kept confidential. But is there
a chance that we could lose some of that trust now, given that
there is much greater use, and much greater sharing, of admin data?
It should be said that it doesn't give people the chance to opt out.
I think one of the reasons that trust
has remained high is because of the robust controls we have around
the use of data. Because of the comprehensive set of controls and
the framework that we put around use of data that protects
confidentiality, that ensures that all uses are in the public
interest. And another important component of it is that all use of
data that we support is transparent by default. So, for any analyst wanting to use data that are held by the ONS, or by another department that we support, we publish the details of who those
analysts are, which data they're using, what they're using them
for, and then we require them to publish the outputs as well. And
that transparency helps maintain public trust because if someone
wants to know what their data is being used for, they can go to our
website or directly to the analyst, and they can see the results
tangibly for themselves. Now, they might not always agree that
every use case is explicitly in the public interest, but they can
see the thought process. They can see how the independent panel has
reached that conclusion, and that helps us to retain the trust.
There's a second half of your question around whether there is a
risk of that changing. There is always a risk but we are very alive
to that, which is why, as we build the Integrated Data Service and look to make more and more government data available, we don't take for granted the trust we've already got, and we continue to work with the public, and with privacy groups, to make
sure that as we build the new service and make more data available,
we don't cross a line inadvertently, and we don't allow data to be
used in a way that isn't publicly acceptable. We don't allow data
to be combined in a way that would stretch that comfort. And this
is that kind of proactive approach that we're trying to take, that
we believe will help us retain public trust, despite making more
and more data available.
Professor Floridi, we gave you those
survey results there, with people apparently having confidence in
the system as it stands, but I guess it just takes a couple of
negative episodes to change sentiment rapidly. What examples have
we seen of that, and how have institutions responded?
I think the typical examples are when data are lost, for example, inadvertently because of a breach and there is nobody at fault, but maybe someone introduced the wrong piece of software. It could be a USB, someone may be disgruntled, or someone else has found a way of entering the database - then the public gets very concerned immediately. The other case is when there is the impression, which I think is largely unjustified, but the impression remains, that the data in question are being used unjustly to favour maybe some businesses, or perhaps support some policies rather than others. And I agree with you, unfortunately, as in all cases, reputation is something very hard to build and can be easily lost. It's a bit unfair, but as always in life, building is very difficult but breaking down and destroying is very easy. I think that one important point here to consider is that there is a bit of a record as we move through the years. The work that we're talking about, as we heard, 2017 is only a few years ago, but as we build confidence and a good historical record, mistakes will happen, but they will be viewed as mistakes. In other words, there will be glitches and there will be forgiveness from the public built into the mechanism, because after say 10 or 15 years of good service, if something were to go wrong once or twice, I think the public will be able to understand that yes, things may go wrong, but they will go better next time and the problem will be repaired. So, I would like to see this fragility if you like, this brittle nature of trust, being counterbalanced by a reinforced sense of long-term good service that you know delivers, and delivers more and more and better and better, well then you can also build a little bit of tolerance for the occasional mistakes that are inevitable, as in everything human, they will occur once or twice.
Okay, well, touching wood, or what is in effect my desk, I can say that I don't think the ONS has had an episode such as you describe, but of course, that all depends on the system holding up. And that seems a good point to bring in Simon Whitworth from the UK Statistics Authority, as kind of the overseeing body of all this.
Simon, how does the authority go
about its work? One comment you see quite commonly on social media
when these topics are discussed, is while I might trust the body I
give my data to, I don't trust them not to go off and sell it, and
there have been episodes of data being sold off in that way. I
think it's important to state isn't it, that the ONS certainly
never sells data for private gain. But if you could talk about some
of the other safeguards that the authority seeks to build into the system.
The big one is around the ethical use
of data. The authority, and Pete referred to this previously, back in 2017 established something called the National Statistician's Data Ethics Advisory Committee, and that's an independent committee
of experts in research, ethics and data law. And we take uses of
data to that committee for their independent consideration. And
what's more, we're transparent about the advice that that committee
provides. So, what we have done, what we've made publicly
available, is a number of ethical principles which guide our work.
And that committee provides independent guidance on a particular use
of data, be they linking administrative data, doing new surveys,
using survey data, whatever they may be, they consider projects
from across this statistical system against those ethical
principles and provide independent advice and guidance to ensure
that we keep within those ethical principles. So that's one thing
we do, but there's also a big programme of work that comes from
something that we've set up called the UK Statistics Authority
Centre for Applied Data Ethics, and what that centre is trying to
do is to really empower analysts and data users to do that work in
ethically appropriate ways, to do their work in ways that are
consistent with those ethical principles. And that centres around
trying to promote a culture of ethics by design, throughout the
lifecycle of different uses of data, be they the collection of data
or the uses of administrative data. We've provided lots of guidance
pieces recently, which are available on our website, around
particular uses of data - geospatial data, uses of machine learning
- we've provided guidance on public good, and we're providing
training to support all of those guidance pieces. And the aim there
is, as I say, to empower analysts from across the analytical
system, to be able to think about ethics in their work and identify
ethical risks and then mitigate those ethical risks.
You mentioned the Ethics Committee, which is probably not a well-known body - independent experts, though, you say; these are not civil servants but academics and experts in the field. When do they typically caution researchers and statisticians, and when do they send people back to think again?
It's not so much around what people
do, it's about making sure how we do it is in line with those
ethical principles. So, for example, they may want better
articulations of the public good and consideration of potential
harms. Public good for one section of society might equal public
harm to another section of society. It's very often navigating that
and asking for consideration of what can be done to mitigate those
potential public harms and therefore increase the public good of a
piece of research. The other thing I would say is being
transparent. Peter alluded to this earlier, being transparent
around data usage and taking on board wherever possible, the views
of the public throughout the research process. Encouraging
researchers as they're developing the research, speaking to the
public about what they're doing, being clear and being transparent
about that and taking on board feedback that they receive from the
public whose data they're using. I would say that they're the two
biggest areas where NSDEC provides comments and really useful
and valuable feedback to the analytical community.
Everyone can go online and see the
work of the committee, to get the papers and minutes and so forth.
And this is all happening openly and in an accountable way?
Yes, absolutely. We publish minutes of the meetings and outcomes from those meetings on the UK Statistics Authority’s website. We also make a range of presentations over the course of the year around the work of the committee and the infrastructure that supports it, because we have developed a self-assessment tool which allows analysts, at the research design phase, to consider those ethical principles, and different components of the ethical principles, against what they're trying to do. That's proved to be extremely popular as a useful framework to enable analysts to think through some of these issues, and, I suppose, move ethics from theory to something a bit more applied. In terms of its work last year, over 300 projects from across the analytical community, both within government and academia, used that ethics self-assessment tool, and the guidance and training that sits behind it is again available on our website.
I'm conscious of sounding just a little bit sceptical, and putting you through your paces to explain how the accountability and ethical oversight works, but can you think of some examples where there's been ethical scrutiny, and research outcomes having satisfied that process, have gone on to produce some really valuable benefits?
The ONS has done a number of surveys with victims of child sex abuse to inform various inquiries and various government policies. Those have some very sensitive ethical issues that require real thinking and careful handling. The benefits of that research have been hugely important in showing the extent of child sex abuse that perhaps previously was unreported, and in providing statistics to both policymakers and charities around experiences of child sex abuse. In terms of administrative data, yes, there are numerous big data linkage projects that have come to the ONS and been considered by the ONS - in particular, linkages that follow people over time. Linkages done over time provide tremendous analytical value, but of course need careful handling to ensure that access to those data is provided in an ethically appropriate way, and that we're being transparent. So those are the two that come to mind: big things that we are thinking about in an ethically appropriate way. And being able to do them in an ethically appropriate way has really allowed us to unleash the analytical value of those particular methods, but in a way that takes the public with us and generates that public trust.
Pete, you are part of the
organisation that in fact runs an award scheme to recognise some of
the outstanding examples of the secure use of data.
We do, and it's another part of promoting the public benefit that comes from use of data. Every year we invite the analysts who use the Secure Research Service (SRS), or other similar services around the country, to put themselves forward for research excellence awards. So that we can genuinely showcase the best projects from across the country, but then also pick up these real examples of where people have made fantastic use of data, and innovative use of data, really demonstrating the public good. We've got the latest of those award ceremonies in October this year, and it's an open event so anybody who is interested in seeing the results of that, the use of data in that way, they would be very welcome to attend.
Give us a couple of examples of recent winners, what they've delivered.
One of the first award winners was looking at the efficacy of testing for men who may or may not have been suffering from prostate cancer. It analysed, if a person was given this test, what the likelihood of its accuracy was, and therefore whether they should start treatment, and the research was able to demonstrate that, given that efficacy, it wasn't appropriate to treat everyone who got a positive test, because there was a risk of doing more harm than good - which is really valuable. But this year we'll be seeing really good uses of data in response to the pandemic. For example, tying this back to the ethics, when you talk about the use made of data during the pandemic in retrospect, it's clearly ethical, it's clearly in the public interest. But at the start of the pandemic we had to link together data from the NHS on who was suffering from COVID, which was really good in terms of the basic details of who had COVID, how seriously and, sadly, whether they died, but it missed a lot of other detail that helps us to understand why.
We then linked those data with data from the 2011 Census, where you can get data on people's ethnic group, on their occupation, on their living conditions, on the type and size of the family they live with, which enabled much richer insights. Most importantly, it enabled government to target its policy at those groups who were reluctant to get the vaccination, and to understand whether people were suffering from COVID due to their ethnicity, or whether it was actually more likely to be linked to the type of occupation they did. Really, really valuable insights came from being able to link these data together, which now sounds sensible, but at the time it did raise those serious ethical questions: can we take these two big datasets that people didn't imagine we could link together, and keep the analyses ethically sound and in the public interest? That's what we were able to do.
That's certainly a powerful example. But before we pat ourselves on the back too much for that survey I mentioned, some of the research we've been doing at the ONS does suggest that there is nevertheless a hardcore cohort of sceptics on all of this - particularly, it is suggested, among the older age groups, the over-55s in particular. I mentioned the social media reaction you see as well. Kind of ironic, you might think, given the amount of data that big social media platforms and other private organisations hold on people.
Professor, do you think there's a
paradox at work there? People are apparently inclined not to trust
public bodies, accountable public bodies, but will trust the big
social media and internet giants? Or is it just a question of
knowledge, do you think?
I think it might be partly knowledge, the better you know the system, who is doing what, and also the ability to differentiate between the different organisations and how they operate, under what kind of constraints, how reliable they are, etc, versus for example, commercial uses, advertisement driven, etc.
The more you know - and it happens, almost inevitably, to track the younger you are - the more you might view, with a different degree of trust but also almost indifference, the fact that the data are being collected and what kind of data are being collected. I think the statistics that you were mentioning seem to show an overlapping pattern. A less young population, a less knowledgeable population, is also the population that is less used to social media, sharing, using data daily, etc. And it is also almost inevitably a little bit more sceptical when it comes to giving the data for public good, or knowing that something is going to be done by, for example, cross-referencing different databases.
On the other side, you find the
slightly younger, the more socially active, the kids who have been
growing with social media - and they are not even on Facebook these
days anymore, as my students remind me, Facebook is for people like
me - so let's get this right: now, when it comes to TikTok, they know that they are being monitored, they know
that the data is going to be used all over the place. There is a
mix of inevitability, a sense of who cares, but also a sense of,
that's okay. I mean data is the air you breathe, the energy you
must have, it's like electricity. We don't get worried every time
we turn on the electricity in the house because we might die if
someone has unreliably connected the wires, we just turn it on and
trust that everything is going to be okay. So, I think that as we
move on with our population becoming more and more well acquainted
with technology, and who does work with the data and what rules are
in place, as we heard before, from Simon and Pete, I mean, there
are plenty of frameworks and robust ways of double checking that
nothing goes wrong, and if something goes wrong, it gets rectified
as quickly as possible. But the more we have that, I think the less
the sceptics will have a real chance of being any more than people
who subscribe to the flat earth theory. But we need to consider
that the point you made is relevant. A bit of extra education is needed on the digital divide, which we mentioned implicitly in our conversation today. Who is benefiting from what? And on which side of the digital innovation are these people placed? I think that needs to be addressed precisely now, to avoid scepticism which might not be well grounded.
I hope through this interesting discussion we've managed to go some way to explaining how it's all done, and why it's so very important. Simon Whitworth, Pete Stokes, Professor Luciano Floridi, thank you very much indeed for taking part in Statistically Speaking today.
I'm Miles Fletcher, and thanks for listening. You can subscribe to new episodes of this podcast on Spotify, Apple Podcasts and all the other major podcast platforms. You can comment or ask us a question on Twitter at @ONSFocus. Our producer at the ONS is Julia Short. Until next time, goodbye.