Statistically Speaking: Integrating Data: Boosting the capabilities of researchers to inform policymaking.

Jan 31, 2023

Miles explores how data linking can help tackle cross-cutting issues in an increasingly uncertain world, and how the ONS’ new Integrated Data Service will provide a step-change transformation in how researchers will be able to access public data.

Joining him are ONS colleagues Bill South, Deputy Director of Research Services and Data Access; Jason Yaxley, Director of the Integrated Data Programme; and award-winning researcher Dr Becky Arnold, from the University of Keele.

TRANSCRIPT

MILES FLETCHER

Welcome again to Statistically Speaking - the Office for National Statistics Podcast. I'm Miles Fletcher and in this episode, we're going to step back from the big news making numbers and take a detailed look at an aspect of the ONS which is less well known, but arguably just as important.

The ONS gather an awful lot of data of course, and much of it remains valuable long after it's been turned into published statistics. It is used by analysts and government, universities and the wider research community. So we're going to explain how that's done and look at some really interesting and valuable examples of how successful that has been to date. And we're also going to hear about a step-change transformation that's now underway in how public data is made available to researchers, and the future potential of that really important, exciting process. Our guides through this subject are Jason Yaxley, Director of the ONS’s integrated data programme, Bill South who is Deputy Director of the Research Services and Data Access Division here at the ONS, and later in the podcast we’ll hear from Dr. Becky Arnold who is an award-winning researcher from Keele University.

Right Bill, set the scene for us to start with then, we are talking here about the ONS Secure Research Service, take it from the top please. What is it? What's it all about? What does it do? What do we get from it?

BILL SOUTH

Hi Miles, thank you. Yes, the Secure Research Service, or the SRS, is the ONS’ trusted research environment. We've been running now for about 15 years, and we provide secure access to unpublished de-identified micro data for research that's in the public good. So in terms of numbers, we hold over 130 datasets, we've got about 5,000 researchers accredited to use the service and about 1,500 of those would be working in the system at any given time on about 600 live projects.

MF
So what sort of data, what is stored and what's made available? Is this survey responses?

BS
Traditionally the SRS has held most of our ONS surveys. So that's the labour market, business...all of our surveys really. In the last four years, thanks to funding we've received from Administrative Data Research UK (ADRUK), we've been able to grow the amount of data we hold, so now we've increasingly got data coming from other government departments. And we've got more linked datasets that enable us to offer new insights into the data.

MF
And so these are people's responses to survey questions and people's records, as well as data that are held by other departments?

BS
Indeed, yes, the data coming from other departments is often administrative data, so not from surveys but more admin data.

MF
And a lot of the value in that is in being able to compare and to link this data to achieve different research insights?

BS

Absolutely. I mean, a good example of that is a dataset that's been added in the last year or so where our ONS census data from 2011 was linked to educational attainment data from the Department for Education into a research dataset called Growing up in England (GUiE). And it's hugely important because we have a lot of rich information from the census but you know, linking that with the educational attainment data offers new insights about how kids do at school, and how they're linked to the characteristics of their background.
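As a rough illustration of the kind of linkage Bill describes, the sketch below joins two small tables on a shared de-identified key. It is a minimal sketch only: the field names, the pseudonymised IDs and the `link_records` helper are all invented for illustration and are not taken from the Growing up in England dataset or from any ONS system.

```python
# Hypothetical records: field names and pseudonymised IDs are invented.
census = [
    {"person_id": "a1", "household_size": 4},
    {"person_id": "b2", "household_size": 2},
]
attainment = [
    {"person_id": "a1", "exam_points": 52},
    {"person_id": "c3", "exam_points": 40},
]


def link_records(left, right, key):
    """Inner-join two lists of dicts on a shared de-identified key."""
    # Index the right-hand table by key so each lookup is a dict access.
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


linked = link_records(census, attainment, "person_id")
print(linked)
```

Only the person who appears in both tables survives the join, which is why linked datasets can answer questions that neither source could answer alone.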

MF
So you use the underpinning of census to provide a really universal picture of what's going on across that particular population, and therefore gain some insight into how people have achieved educationally in a way that we wouldn't have done before. Of course, the power of all this is clear in that example, but a lot of people might think, oh my gosh, they must know an awful lot about me in that case. Tell us about how privacy and anonymity are protected in those circumstances.

BS
Yeah, absolutely. It's a central part of our operation, and clearly the word secure in the name is key there. So we follow a five safes principle which underpins everything we do. The first of the five safes is safe people, so anyone who uses the SRS has to be trained and go through an assessment to be accredited by us to use the environment. The second is safe projects: once they're accredited, they then have to apply to have a project running in the system, and that gets independently assessed. There are a number of checks around whether it's ethically sound, whether the use of data is appropriate, but the key thing really is around the public good. So all research projects that happen in the SRS have to be in the public good and there's a commitment to be transparent. So for every project that happens in the SRS, there's a record which is published on the UK Statistics Authority website. The third safe is around the settings, so it's a very controlled environment where people access the data. The fourth safe is around the data, so although we've said it's record level data, it's already de-identified. Names and addresses, any identifiers, are stripped out of the data before researchers can access it. And the final safe, the final part of the researcher journey if you like, is around outputs. What that means is we do checks to ensure that when any analysis leaves the environment, no individual or business can be identified from the published results.

MF
So in essence, you must convince the ONS that you are a bona fide researcher, and you also have to convince them that what you're doing is definitely for the public benefit.

BS
That's right. And the other thing that's worth noting is that the SRS, like a number of other trusted research environments across the country, has been accredited under the Digital Economy Act to be a data processor, which means we go through a rigorous assessment process around the security, the environment, but also our capability to run it. So that's our processes, our procedures, whether our staff are adequately trained to run the service. That's a key part of that accreditation under the Digital Economy Act.

MF
So, on that point then about anonymity, you can drill right down to individual level, but you'll never know who those individuals actually are or be able to identify them?

BS
That's right. Researchers typically will run their code against the record level data, but when they've got the results of the analysis, there are clear rules that say you won't be allowed to take out very low counts. So that means like our published outputs, there's no way of identifying anyone once the research is published.
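The low-count rule Bill mentions can be sketched in a few lines. This is a minimal illustration, not the SRS's actual output-checking procedure: the threshold of 10, the `suppressed_counts` helper and the sample records are all assumptions made for the example, since the transcript does not state the real rule.

```python
from collections import Counter

# Hypothetical threshold: the transcript does not state the real SRS rule.
MIN_COUNT = 10


def suppressed_counts(records, key, min_count=MIN_COUNT):
    """Count records per category, masking any count below min_count."""
    counts = Counter(record[key] for record in records)
    return {
        category: (n if n >= min_count else None)  # None marks a suppressed cell
        for category, n in counts.items()
    }


# Invented sample data: 12 records in one region, 3 in another.
records = [{"region": "North"}] * 12 + [{"region": "South"}] * 3
table = suppressed_counts(records, "region")
print(table)
```

The analysis still runs against every record, but the small cell is masked before the table leaves the environment, so a published figure cannot single out a handful of individuals.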

MF
And the SRS has built up over the years a good reputation for actually doing this effectively and efficiently.

BS
Yes, I think that's fair to say. We have a good reputation, and the service is growing in terms of the number of datasets and the number of projects and the number of people using it. So, I think that speaks for itself.

MF
Okay, let's pull out another I think powerful example of why this facility is so important and that comes from the recent COVID pandemic. Many listeners will be aware that the ONS ran a very, very large survey involving upwards of 100,000 people providing samples, taking COVID tests, and they were sent off to be analysed creating an awful lot of community level data about COVID infections, and we in the ONS then publish our estimates and continue to do so as we record estimates every week of fluctuating infection levels. But behind all that work, there were expert researchers in institutions around the country who were doing far more with that data. And the SRS was fundamental to delivering the data to them. Tell us about how that operated Bill, and some of the results that we got out of it.

BS
Yeah, sure. I mean, the COVID infection survey that you refer to there, that dataset is available for accredited researchers to apply to use, and they have done, but we've also brought in a number of others, about 20 COVID related datasets are in the SRS, so things around vaccination or the schools infection survey, mortality, etc.

So since the start of the pandemic we've had over 50 projects that have either taken place and completed, or are currently underway, in the environment. Some of those are directly using the COVID related datasets. So looking, if you like, at the health impact, but there are also projects that are looking at, if you like, non COVID data, economic data or education data, that are projects dedicated to understanding the impact of COVID.

MF
What sort of insights have we seen from those?

BS
In terms of those using the COVID related data there's been analysis to highlight the disproportionate impact of the virus on ethnic minorities, that went on to inform a number of government interventions. Another project assessed the role of schools in Coronavirus transmission. We had another project that was run specifically on behalf of local authorities to inform their response to the pandemic that offered insights into the risks between occupations. Also research into footfall in retail centres and how business sectors were affected by the pandemic. So a really huge range of things. There were other research projects looking at the wider impact, and an example there was a project that looked at learning loss. So, kids not being in school for that sort of 2020 to 2021 academic year. Similarly, the Bank of England ran a project looking at the financial stability of the UK during the pandemic period. So hopefully those examples give you a sense of the range.

MF
An incredibly impressive array of projects, all underpinned by that big survey, the likes of which the ONS has a unique ability to run: a big survey run across the United Kingdom, with people answering questionnaires as well as providing samples. And don't take our word for it, I mean, it was reported in the Daily Mirror no less. A researcher who benefited from that data described the COVID Infection Survey as, when it came to the pandemic, one of the most valuable resources on the planet. So that's a powerful example of the research value that can be extracted through the secondary uses of data gathered by the ONS.

Anyway, enough of blowing our own trumpet. The service has been running a very successful award scheme that recognises the achievements of external researchers, Bill. Tell us about some of the projects that have been recognised in that.

BS
It’s worth mentioning I think also that we've got case studies on our website, the Secure Research Service website and the ADRUK website, which show in a little bit more detail the impact some of these research projects have had, but like you say, we also hold an annual Research Excellence Awards, which is great. We have different categories of awards where people can submit their project and explain where their research has been published and had an impact. And like I said, we get a lot of nominations and reviewing the applications, which I did last year, it really emphasises the breadth and quality of the research taking place in the SRS.

MF
Check those out then if you're interested in learning more about those projects, some of the examples that Bill mentions and winners of the Research Excellence Awards, of course, one of whom I'm very pleased to say joins us now and that's Dr. Becky Arnold from the University of Keele, who took home the cross-government analysis award for her team's work on controlling the spread of COVID-19 in vulnerable settings in a project undertaken at the UK Health Security Agency.

Becky, I guess that's yet another example of the kind of secondary uses of the COVID infection data. Welcome to the podcast. Please tell us all about that.

DR BECKY ARNOLD
Yeah, very, very glad to. So first thing I want to talk about essentially is what a vulnerable setting is. And that was really key to the sort of cross governmental aspects of this because vulnerable settings are settings like care homes, hospitals, prisons, schools, where you have a lot of quite often vulnerable people in a really dense environment where COVID can sort of spread and get out of control really quickly. And if we want to define a testing policy for that, so our testing policy being perhaps everybody takes like three LFT tests a week, or maybe one monthly PCR test, but also other factors, like what's your isolation policy? So, if somebody is infected with COVID, how many days do they have to be isolated for? Do they need a negative test to be released? What is your outbreak policy in these institutions, if you know that there's an outbreak going on? It's this really, really complicated thing. And you know, for government policy, you need a testing regime to try and keep COVID under control in these settings. But there's a few difficulties with that. The first thing is that the settings are all really different. So, when I just mentioned about the cross governmental thing, it meant interacting with lots of different departments, lots of different data sources to try and understand these particular settings and their particular characteristics. The really, really critical point I want to make is that the whole project was about trying to understand what that testing policy should be. And the best testing policy in one setting may not be the best testing policy in another setting, because when we're trying to give advice to policymakers and policy departments about what testing strategy you should use in an institution, you don't want to just pull that out of the hat. You don't want to just go oh, I think this many LFT tests a week. We want to give data-driven, informed, evidence-based advice. 
So essentially, what this project was looking at was all of these different settings in a lot of detail, looking at the demographics within them and their particular vulnerabilities. So, care home residents are particularly vulnerable, as are people in prison, who are more clinically vulnerable than people of the same age who are not in prison. We looked at a bunch of different aspects: how people interact in these different settings, how infection spreads in these different settings. And from that, essentially, we created a model where you can simulate the spread of COVID in these different settings under different testing strategies. So, you can answer questions like if we use ‘x’ testing strategy versus ‘y’ testing strategy, what is the likely impact going to be on the number of people that die, the number of people that need hospitalisation, how many of those people that go to hospital are going to need intensive care, which often comes with long recovery and sometimes permanent impacts on people's lives? So, there are huge things to consider. And essentially the point of this project was to study these environments and try and make something which can provide that evidence to inform decision making.
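To give a flavour of how a question like ‘x’ testing strategy versus ‘y’ testing strategy can be posed in code, here is a deliberately simple, deterministic toy model. It is not the team's model, whose details are not given in this episode: the `simulate_outbreak` function, its update rule and every parameter value below are assumptions chosen purely for illustration.

```python
def simulate_outbreak(population, days, transmission_rate, recovery_rate,
                      tests_per_week, test_sensitivity, initial_infected=1.0):
    """Toy closed-setting model: return cumulative infections after `days`."""
    susceptible = population - initial_infected
    infectious = initial_infected
    cumulative = initial_infected
    # Assumed daily chance that an infectious person is detected and isolated.
    detection_rate = (tests_per_week / 7.0) * test_sensitivity
    for _ in range(days):
        new_infections = transmission_rate * susceptible * infectious / population
        new_infections = min(new_infections, susceptible)
        # People stop transmitting by recovering or by being isolated.
        removal_rate = min(1.0, recovery_rate + detection_rate)
        removed = removal_rate * infectious
        susceptible -= new_infections
        infectious += new_infections - removed
        cumulative += new_infections
    return cumulative


# Hypothetical setting: 100 people, 60 days, invented disease parameters.
weekly_testing = simulate_outbreak(100, 60, 0.4, 0.1, tests_per_week=1,
                                   test_sensitivity=0.7)
three_per_week = simulate_outbreak(100, 60, 0.4, 0.1, tests_per_week=3,
                                   test_sensitivity=0.7)
print(f"one test a week:    {weekly_testing:.1f} infections")
print(f"three tests a week: {three_per_week:.1f} infections")
```

Changing only the testing inputs while holding the setting fixed is what lets two strategies be compared on the same footing; a realistic model would add setting-specific contact patterns, isolation rules and health outcomes such as hospitalisation.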

MF
This was data being gathered, presumably then in institutional settings up and down the country and then being collected centrally and made available to you at a single point of contact?

BA
It would have been very nice if that was the case. Because we're looking at so many different settings we were kind of scrambling around quite a lot just to try and identify what datasets were available and to sort of gather them together. And also there were so many different types of data that we needed to drive this. So firstly, like you say, the health outcomes data. In some cases, there were specific datasets available for certain institution types, but we weren't always able to get access to those for various reasons. But there were also considerations like the sort of data that was published every day, which has a nationwide aspect. Another data type we were looking at is how people interact within these different settings. For that we used an awful lot of literature review. We spoke to people that work in the settings. We spoke to people that work in care homes, we spoke to care home franchise owners to understand their staffing policies and things related to that. We also spoke to government departments like the Department of Justice. So, it was a lot of different data sources all sort of gathered together for the various aspects of this project.

MF
This model you’ve created, what's its future? Perhaps in different scenarios that might arise in the future.

BA
The model was very, very carefully constructed to be as flexible as possible, with potential future COVID variants in mind at the time, but because of that, it means it's very adaptable to different infectious diseases. So if you change just a few input parameters, like the mortality rates, you know, the infection rate, a few factors like that, it's quite easy to transform this model to simulate the spread of other infectious diseases. So, things like flu, which has a big impact on care homes every year, and the model has the potential to be used to better understand how to combat that. But another thing that I think is very useful about this model is it has the ability to help us game plan for potential future pandemics, because I think it's fair to say that governments around the world, when COVID came along, were kind of caught by surprise, or wrong-footed, sort of without a game plan of how to respond. And as we know, whether it's a single pandemic or an individual outbreak, it's those early stages which are really, really critical. With this sort of model, we can game plan, you know, what response should we give if we have a future pandemic with these properties? Say we've got this transmissibility, it's got this mortality rate, we have tests that cost this much and they give you this accuracy. In that scenario, what should we do? And to be able to do that research upfront and to have some sort of game plan in mind so that if and when future pandemics come along, we are better prepared and can respond efficiently and quickly to try and have the best outcomes possible. So that's something I think is really exciting for the future of this model.

MF

Okay, that's beautifully explained, thank you very much indeed.

Bill, so we've heard from Becky about how the data that she had to access had to come from many different places, but I guess that might have been an impediment to actually producing a model as rapidly in the pressing circumstances of the pandemic as it could potentially have been achieved. Does that suggest then that while the SRS has achieved on its own terms, a great deal, nevertheless, there have been limitations, and perhaps it's time to be doing this kind of data sharing across the public sector in a much bigger and better way?

BS
Yes. When I look at the sort of challenges and limitations around the SRS, I think there's probably three things, one of which is the ability to get the data sharing moving as fast as we need to meet this sort of policy need. The second area would be around the fact that actually the SRS is ageing technology now, and although it's performed really well, and especially during that sort of pandemic response we talked about earlier, it's fair to say it has struggled to cope with some of the really sort of heavy processing requirements that have come out of that sort of COVID response. Some of the modelling required was much larger than the traditional sort of research projects we might have had in the SRS. And then the final thing is around some of the processes that we described earlier, that sort of five safes framework. All of our processes and rules apply to users, regardless of their sector. What that means is for government analysts who are seeking to access government data, working on government systems to inform government policy, there's a feeling that we could do things faster. Only 25% of our user base is government analysts at the moment, you know, I think that's something we certainly could improve to build that area of the service.

MF
Building the service then for the future is where Jason comes in, Jason Yaxley. As the director of the new Integrated Data Service, we've heard about potential, we've heard about the opportunity to do more in future. Tell us then about the Integrated Data Service, which promises to expand the amount of data available to researchers to speed up the delivery of it and to really produce a huge step-change or transformation in the ability of researchers to do this kind of work in the future. Is that a fair expectation?

JASON YAXLEY
Hi Miles, pleased to be here. Yes, I think it's a very fair expectation. So I have the pleasure of being the programme director for the Integrated Data Programme, which will deliver the Integrated Data Service and the ONS is the lead delivery partner for all of government to deliver a transformation both in how government uses data, but also the underpinning technology that enables us to analyse and use that data much more quickly. And so that's a reason why we're one of the key enablers of the government's data strategy and why I view this very much as a transformation rather than just another big data lake where lots of government data goes and we can't really get into it. So, it's a really exciting opportunity. We're in the sort of middle stage of the programme where we have a service that is built and now we have to sort of grow it and expand it and get more data to really enhance its functionality, but it's a really exciting time. A really great job to have.

MF
And in terms of scale, what's the difference between IDS coming in, the Integrated Data Service, compared to the old, if I can put it that way, Secure Research Service?

JY
When it comes to the SRS, it is brilliant at what it does, but its technology is starting to age and that is causing limitations. And I think what makes the Integrated Data Service sort of a step-change and perhaps unique across government falls into sort of four broad categories. There's the enabling infrastructure itself, which will be state of the art cloud-based, there is the data which will be much more friction free and it will be quicker and easier to access data, use data, share data. It will enable data visualisation in a way that's never been done before. And rather than having to do individual agreements to link one bit of data to a different bit of data, what we will have here is a service for people that will be scalable, repeatable, standardised, which makes it much much easier on a regular basis to link and index and then do research against much larger datasets much more quickly and produce faster results, which is going to be a huge benefit to the public good through the lens of better more informed and evidence-based policy decision making, that has much more statistical and analytical evidence that sits underneath it.

And so we're transforming both the data access itself and the technology that enables that, but also the sort of almost the cultural lens through which we work together. We share information to simplify it. I really want to stress the IDS is keeping all the really good parts of SRS around the five safes, around the de-identification of data, protecting that data and ensuring that you know, public concerns about how government holds and uses data are entirely met.

MF
That's an obvious question isn't it, if this is happening much more widely on a much bigger scale, how are those safeguards that we heard about from Bill going to be protected? How are they going to persist, and the same level of protection be provided?

JY
2023 is a big year for the programme, particularly March when we hope and we're aiming to receive our own Digital Economy Act accreditation in the same way that the SRS has. So we will carry forward the same safeguards that SRS has used so successfully, as I say, around the five safes, around how users are accredited, but through technology and through the service that we operate, streamline and simplify that, particularly for government users using government data. So this is about that cultural journey as well as that technological journey. Very central to what we're doing is the security of data, the protection of data, you know, we have to convince all of the Chief Technical Officers and all the data analysts across Whitehall that we are as safe and as secure as we could possibly be. So that they'll be comfortable with us having access to that data.

MF
The potential there is that most UK government data will be made available and will be accessible by researchers?

JY
And that's the end game. Absolutely. As I say, we're on a journey at this point. Again, 2023 is important to us. We've just brought in what we're calling super early adopters, which are strategic experienced government analysts from both Whitehall departments and the devolved administrations, particularly Welsh Government right now, and we have brought census 2021 data into the system very early. And so we're already working with government analysts to start to do early exploratory projects that unlock the information and the power of the census data against certain government priorities, for example, around the economy or around energy, and particularly, we're working with Welsh Government to look at what is the impact of the recent economic situation on the Welsh farming community and how can we analyse the industry against the information that we hold in the census data and other data sources to find outcomes of what's happened in say, the last 10 years between the two census datasets.

MF
So what happens next, what are the next steps on this? And particularly what's the message to researchers who think that they would like to be involved in this project?

JY
2023’s really big steps are, as I've just mentioned, DEA accreditation, and we reach the next level of maturity for our functionality, also in March, which means in the rest of 2023, having had these two points in time, we’ll be in position to unlock the full sort of power of IDS. We will be wanting to encourage particularly more government researchers. Our aspiration is that every government professional analyst will be registered on and be able to use the service. We will accelerate our pipeline with Whitehall departments with data that we want to bring in. And over the life of the programme we will want to transition SRS itself, and its data and its users into IDS unlocking for those users as I say, the enabling technology of data visualisation, the speed and the pace, the scale. So, I at the moment feel that what we have is a huge warehouse with one corner that has data in it but the potential to fill it with as much data as we can in a way that is linked and matched and indexed. So that you can do much greater analytical research than hitherto has been possible. Just to illustrate that, the way I like to think of it is there are a lot of people both in government and in academia that can do point to point linkage between dataset A, dataset B, and then run some research against it. And you can think of that perhaps as a ferry crossing a river from point A to point B on the other side. What helps visualise why IDS will be different is to think of us as a bridge and a road that goes over the river and so we can have multiple streams of traffic. We can have a much greater flow of information and research and all the agreements only have to be done once and then it's just repeatable from there. And that's one of the reasons why I'm so excited to be working with the colleagues on the programme and colleagues across government and academia to deliver the transformation which we aim to complete by March 2025.
So we still have some way to go to fully exploit all of the technology and get all the data in, but we're on our way.

MF
In the meantime however, there are a couple of examples already out there that listeners might care to check out for themselves if they haven't already. The first of which is the climate statistics data dashboard, creating a one-stop shop if you like for statistics on climate change related topics, bringing together data from around government, you can see it at climate-change.data.gov.uk and another one is the violence against women and girls data dashboard that's vawg.GSS-data.org.uk, which has been created as an important part of the government's 2021 tackling violence against women and girls strategy. And of course, the very popular and widely used COVID dashboard which continues to be available as well. So real living examples of the Integrated Data Service already serving the public benefit.

Becky, if I could bring you back in again, if we're able to deliver on this and the warehouse as Jason described, it becomes bursting with data from right across government sources, presumably then in the future, the kind of work you told us about your award winning work during the pandemic will become that much faster, much easier to execute.

BA
Yes, it really, really would. And I also can't overstate the integration value of it. Having things in the same place and linked just saves so much time, because trying to track down what data is available and then trying to combine it all together is such an undertaking. Having that sort of delivered there, sort of knowing what is available in a much more accessible way. Being able to use it much more readily would vastly, vastly speed up the sort of research that I did. But it would also be hugely, hugely valuable.

MF
Perhaps some of those listening to this Becky might be surprised actually at how difficult it has been to access public data like this in the past, and that government departments haven't collaborated in making it available in a single place.

BA
One of the biggest difficulties in doing the research I did was trying to get access. Just trying to find what datasets are out there is also a really, really big time sink and the idea of these all being integrated together and much more findable in a way that they aren't now is really, really exciting because it means that if you know what data there is you can use the most appropriate data for what you're trying to use, rather than trying to cobble together what you know exists and you can get your hands on. So integrating this all together in one place where it's findable. It would be a huge, huge win for the sort of research like what I did - or what my team did a lot more accurately. Another factor on that as well is the linking. It is so difficult if you've got different datasets compiled for completely different purposes by different departments - trying to combine those together is really hard. Even if they are about the same sorts of people, the same sorts of things. So having datasets that are already integrated would be a huge, huge step forward in trying to use that data as effectively as possible for the sort of research to drive evidence-based decision making in policy, which I think is something that is so important, and it's something I'm really passionate about.

MF

Becky, thank you very much for joining us. And thanks also to Jason Yaxley, and to Bill South for taking us through this important topic.

I'm conscious that we've approached it largely through the perspective of researchers. The whole issue of data ethics and how public good is assessed is something we've tackled in a previous podcast - do please listen to that and hear about the work of the data ethics committee as well because obviously, confidence in these kinds of initiatives, public trust in these kinds of initiatives, depends very much on people understanding the ethical framework under which this work goes on. That's another big topic we will return to in the future, no doubt, and we will also track progress in the ongoing development of the Integrated Data Service and of some of the fantastic research projects that have already resulted from this kind of work, and the potential ones, very excitingly, in future too.

I’m Miles Fletcher, and thanks once again for listening to Statistically Speaking. You can subscribe to new episodes of this podcast on Spotify, Apple podcasts and all the other major podcast platforms.

Our producers at the ONS are Steve Milne and Alisha Arthur. Until next time, goodbye.

ENDS