Jan 31, 2023
Miles explores how data linking can help tackle cross-cutting issues in an increasingly uncertain world, and how the ONS’ new Integrated Data Service will provide a step-change transformation in how researchers will be able to access public data.
Joining him are ONS colleagues Bill South, Deputy Director of Research Services and Data Access; Jason Yaxley, Director of the Integrated Data Programme; and award-winning researcher Dr Becky Arnold, from the University of Keele.
TRANSCRIPT
MILES FLETCHER
Welcome again to Statistically Speaking - the Office for National Statistics Podcast. I'm Miles Fletcher and in this episode, we're going to step back from the big news-making numbers and take a detailed look at an aspect of the ONS which is less well known, but arguably just as important.
The ONS gather an awful lot of data of course, and much of it remains valuable long after it's been turned into published statistics. It is used by analysts and government, universities and the wider research community. So we're going to explain how that's done and look at some really interesting and valuable examples of how successful that has been to date. And we're also going to hear about a step-change transformation that's now underway in how public data is made available to researchers, and the future potential of that really important, exciting process. Our guides through this subject are Jason Yaxley, Director of the ONS’s integrated data programme, Bill South who is Deputy Director of the Research Services and Data Access Division here at the ONS, and later in the podcast we’ll hear from Dr. Becky Arnold who is an award-winning researcher from Keele University.
Right Bill, set the scene for us to
start with then, we are talking here about the ONS Secure Research
Service, take it from the top please. What is it? What's it all
about? What does it do? What do we get from it?
BILL SOUTH
Hi Miles, thank you. Yes, the Secure
Research Service, or the SRS, is the ONS’ trusted research
environment. We've been running now for about 15 years, and we
provide secure access to unpublished de-identified micro data for
research that's in the public good. So in terms of numbers, we
hold over 130 datasets, we've got about 5,000 researchers accredited
to use the service and about 1500 of those would be working in the
system at any given time on about 600 live
projects.
MF
So what sort of data, what is stored
and what's made available? Is this survey
responses?
BS
Traditionally the SRS has held most of
our ONS surveys. So that's the labour market, business...all
of our surveys really. In the last four years, thanks to funding
we've received from Administrative Data Research UK (ADRUK), we've
been able to grow the amount of data we hold, so now we've
increasingly got data coming from other government departments. And
we've got more linked datasets that enable us to offer new insights
into the data.
MF
And so these are people's responses to
survey questions and people's records, as well as data
that are held by other departments?
BS
Indeed, yes, the data coming from other
departments is often administrative data, so not from surveys but
more admin data.
MF
And a lot of the value in that is
in being able to compare and to link this data to achieve
different research insights?
BS
Absolutely. I mean, a good example of
that is a dataset that's been added in the last year or so where
our ONS census data from 2011 was linked to educational
attainment data from the Department for Education into a research
dataset called Growing up in England (GUiE). And it's
hugely important because we have a lot of rich information
from the census but you know, linking that with the educational
attainment data offers new insights about how kids do at school,
and how they're linked to the characteristics of their
background.
MF
So you use the underpinning of census to
provide a really universal picture of what's going on across that
particular population, and therefore gain some insight into how
people have achieved educationally in a way that we wouldn't have
done before. Of course, all this and the power of it is clear in
that example, but a lot of people might think, oh my gosh,
they must know an awful lot about me in that case. Tell
us about how privacy and anonymity are protected in those
circumstances.
BS
Yeah, absolutely. It's a central part of
our operation, and clearly the word secure in the name is key
there. So we follow a five safes principle which underpins
everything we do. The five safes are safe people, so that anyone
who uses the SRS has to be trained and go through an assessment
to be accredited by us to use the environment. Once they're
accredited, they then have to apply to have a project that's
running in the system, and that gets independently assessed. There
are a number of checks around whether it's ethically sound,
whether the use of data is appropriate, but the key thing really is
around the public good. So all research projects that happen in the
SRS have to be in the public good and there's a commitment to be
transparent. So every project that happens in the SRS, there's a
record which is published on the UK Statistics Authority website.
The third safe is around the settings, so it's a very controlled
environment where people access the data. The fourth safe is
around the data, so although we've said it's record level data it's
already de-identified. Names and addresses, any identifiers are
stripped out of the data before researchers can access it. And the
final safe, the final part of the researcher journey if you
like, is around outputs. What that means is we do checks to ensure
that when any analysis leaves the environment, no individual or
business can be identified from the published
results.
MF
So in essence, you must convince the
ONS that you are a bona fide researcher, and you also have to
convince them that what you're doing is definitely for the public
benefit.
BS
That's right. And the other thing that's
worth noting is that the SRS, like a number of other trusted
research environments across the country, has been accredited
under the Digital Economy Act to be a data processor, which means
we go through a rigorous assessment process around the
security, the environment, but also our capability to run it. So
that's our processes, our procedures, whether our staff are
adequately trained to run the service. That's a key part
of that accreditation under the Digital Economy
Act.
MF
So, on that point then about anonymity,
you can drill right down to individual level, but you'll never know
who those individuals actually are or be able to identify
them?
BS
That's right. Researchers typically will
run their code against the record level data, but when they've got
the results of the analysis, there are clear rules that say you
won't be allowed to take out very low counts. So that means like
our published outputs, there's no way of identifying anyone once
the research is published.
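The low-count rule Bill describes can be illustrated with a minimal sketch. It assumes a simple frequency table and a hypothetical threshold of 10; the transcript only says "very low counts" cannot be taken out, so the threshold, the function name and the sample records below are illustrative assumptions, not the actual SRS output-checking rules.

```python
from collections import Counter

# Hypothetical threshold for illustration only; the real SRS rule is not
# stated in the transcript.
MIN_COUNT = 10


def suppress_low_counts(records, key, min_count=MIN_COUNT):
    """Tabulate records by `key`, replacing counts below `min_count` with None."""
    counts = Counter(record[key] for record in records)
    return {
        category: (n if n >= min_count else None)
        for category, n in counts.items()
    }


# Made-up, de-identified records: 25 in region "A" and 3 in region "B".
records = [{"region": "A"}] * 25 + [{"region": "B"}] * 3
table = suppress_low_counts(records, "region")
```

In this sketch the analysis runs on record-level data, but only the aggregated table, with the small cell blanked out, would be a candidate for release.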
MF
And the SRS has built up over the years
a good reputation for actually doing this effectively and
efficiently.
BS
Yes, I think that's fair to say. We have
a good reputation, and the service is growing in terms of the
number of datasets and the number of projects and the number of
people using it. So, I think that speaks for
itself.
MF
Okay, let's pull out another I
think powerful example of why this facility is so important and
that comes from the recent COVID pandemic. Many listeners will be
aware that the ONS ran a very, very large survey involving upwards
of 100,000 people providing samples, taking COVID tests, and they
were sent off to be analysed creating an awful lot of community
level data about COVID infections, and we in the ONS then
publish our estimates and continue to do so as we
record estimates every week of fluctuating infection levels.
But behind all that work, there were expert researchers in
institutions around the country who were doing far more with that
data. And the SRS was fundamental to delivering the data to them.
Tell us about how that operated Bill, and some of the results that
we got out of it.
BS
Yeah, sure. I mean, the COVID infection
survey that you refer to there, that dataset is available for
accredited researchers to apply to use, and they have done, but
we've also brought in a number of others, about 20 COVID
related datasets are in the SRS, so things around vaccination or
the schools infection survey, mortality, etc.
So since the start of the pandemic
we've had over 50 projects that have either taken place and
completed, or are currently underway, in the environment. Some of
those are directly using the COVID related dataset. So looking, if
you like, at the health impact, but there's also projects that
are looking at, if you like, non COVID data, economic data or
education data, that are projects dedicated to understanding the
impact of COVID.
MF
What sort of insights have we seen from
those?
BS
In terms of those using the COVID
related data there's been analysis to highlight the
disproportionate impact of the virus on ethnic minorities, that
went on to inform a number of government interventions.
Another project assessed the role of schools in
Coronavirus transmission. We had another project that was run
specifically on behalf of local authorities to inform their
response to the pandemic that offered insights into the risks
between occupations. Also research into footfall in retail centres
and how business sectors were affected by the pandemic. So a really
huge range of things. There were other research projects looking at
the impact and you know, an example there was a project that looked
at learning loss. So, kids not being in school for that sort
of 2020 to 2021 academic year. Similarly, the Bank of England ran a
project looking at the financial stability of the UK during the
pandemic period. So hopefully those examples give you
a sense of the range.
MF
An incredibly impressive array of
projects, all underpinned by that big survey, the likes of which
the ONS has a unique ability to run, that big survey
run across the United Kingdom with people taking part by answering
questionnaires as well as providing samples. And don't
take our word for it, I mean, it was reported in the Daily Mirror
no less. A researcher who benefited from that data described the
COVID Infection Survey as, when it came to the pandemic, one
of the most valuable resources on the planet. So that's a powerful
example of the research value that can be extracted through the
secondary uses of data gathered by the ONS.
Anyway, enough of blowing our own
trumpet, the service has been running a very successful award
scheme that recognises the achievements of external researchers
Bill. Tell us about some of the projects that have been recognised
in that.
BS
It’s worth mentioning I think also that
we've got case studies on our website, the Secure Research Service
website and the ADRUK website, which show in a little bit more
detail the impact some of these research projects have had, but
like you say, we also hold the annual Research Excellence Awards,
which is great. We have different categories of awards where people
can submit their project and explain where their research has been
published and had an impact. And like I said, we get a lot of
nominations and reviewing the applications, which I did last
year, it really emphasises the breadth and quality of the
research taking place in the SRS.
MF
Check those out then if you're
interested in learning more about those projects, some of the
examples that Bill mentions and winners of the Research
Excellence Awards, of course, one of whom I'm very pleased to say
joins us now and that's Dr. Becky Arnold from the University of
Keele, who took home the cross-government analysis award for her
team's work on controlling the spread of COVID-19 in
vulnerable settings in a project undertaken at the UK Health
Security Agency.
Becky, I guess that's yet another
example of the kind of secondary uses of the COVID infection data.
Welcome to the podcast. Please tell us all about
that.
DR BECKY ARNOLD
Yeah, very, very glad to. So first thing
I want to talk about essentially is what a vulnerable setting is.
And that was really key to the sort of cross governmental aspects
of this because vulnerable settings are settings like care homes,
hospitals, prisons, schools, where you have a lot of quite
often vulnerable people in a really dense environment where COVID
can sort of spread and get out of control really quickly. And if we
want to define a testing policy for that, so our testing policy
being perhaps everybody takes like three LFT tests a week, or
maybe one monthly PCR test, but also other factors, like what's
your isolation policy? So, if somebody is infected with COVID, how
many days do they have to be isolated for? Do they need a negative
test to be released? What is your outbreak policy in these
institutions, if you know that there's an outbreak going on? It's
this really, really complicated thing. And you know, for government
policy, you need a testing regime to try and keep COVID under
control in these settings. But there's a few difficulties with
that. The first thing is that the settings are all really
different. So, when I just mentioned about the cross governmental
thing, it meant interacting with lots of different departments,
lots of different data sources to try and understand these
particular settings and their particular characteristics. The
really, really critical point I want to make is that the whole
project was about trying to understand what that testing
policy should be. And the best testing policy in one setting
may not be the best testing policy in another setting, because when
we're trying to give advice to policymakers and policy departments
about what testing strategy you should use in an institution, you
don't want to just pull that out of the hat. You don't want to
just go oh, I think this many LFT tests a week. We want to
give data-driven, informed, evidence-based advice. So essentially,
what this project was looking at was all of these
different settings in a lot of detail, looking at the demographics
within them and their particular vulnerabilities. So, care home
residents are particularly vulnerable, as are people in prison.
They're more clinically vulnerable than people of the same age that
are not in prison. And we looked at a bunch of different aspects: how
people interact in these different settings, how infection spreads in
these different settings. And from that, essentially, we created a
model where you can simulate the spread of COVID in these different
settings under different testing strategies. So, you can answer
questions like if we use ‘x’ testing strategy versus
‘y’ testing strategy, what is the likely impact going to
be on the number of people that die, the number of people
that need hospitalisation, how many of those people that go to
hospital are going to need intensive care, which often comes with
long recovery and sometimes permanent impacts on people's lives.
So, there are huge things to consider. And actually
the point of this project was to study these environments and try
and make something which can provide that evidence to inform
decision making.
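The kind of comparison Becky describes, 'x' testing strategy versus 'y' testing strategy in a single setting, can be sketched with a toy model. This is not the team's model: it is a deliberately simple deterministic compartmental sketch, and every name and number in it (the `simulate` function, the transmission rate, recovery time, test sensitivity and fatality rate) is a hypothetical assumption chosen only to show the shape of the calculation.

```python
def simulate(population, days, beta, recovery_days,
             tests_per_week, sensitivity, fatality_rate,
             initial_infected=1.0):
    """Toy deterministic outbreak model for one closed setting.

    All parameters are illustrative assumptions:
      beta           - daily transmission rate per infectious person
      recovery_days  - average days a person stays infectious
      tests_per_week - tests each person takes per week
      sensitivity    - chance a test detects an infectious person
      fatality_rate  - share of infections that end in death
    """
    susceptible = population - initial_infected
    infectious = initial_infected          # infectious and still mixing
    total_infected = initial_infected

    # Daily chance that an infectious person is detected and isolated.
    detection_rate = min(1.0, tests_per_week / 7.0 * sensitivity)
    # Daily share of infectious people who stop mixing (recover or isolate).
    removal_rate = min(1.0, 1.0 / recovery_days + detection_rate)

    for _ in range(days):
        new_infections = beta * susceptible * infectious / population
        new_infections = min(new_infections, susceptible)
        susceptible -= new_infections
        infectious += new_infections - infectious * removal_rate
        total_infected += new_infections

    return {
        "infections": total_infected,
        "deaths": total_infected * fatality_rate,
    }


# Same hypothetical setting, two testing strategies.
setting = dict(population=100, days=120, beta=0.3,
               recovery_days=10, sensitivity=0.7, fatality_rate=0.02)
no_testing = simulate(tests_per_week=0, **setting)
with_testing = simulate(tests_per_week=3, **setting)
```

Changing only the inputs, for example the transmission or fatality rate, is what would let a model of this general kind be pointed at a different setting or a different disease; a real policy model would also need setting-specific contact patterns, isolation rules and uncertainty, none of which this sketch attempts.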
MF
This was data being gathered, presumably
then in institutional settings up and down the country and then
being collected centrally and made available to you at a single
point of contact?
BA
It would have been very nice if that was
the case. Because we're looking at so many different settings we
were kind of scrambling around quite a lot just to try and identify
what datasets were available and to sort of gather them together.
And also there were so many different types of data that we needed
to drive this. So firstly, like you say, the health outcomes data,
in some cases, there were specific datasets available for certain
institution types, but we weren't always able to get access to
those for various reasons. But there were also
considerations like the sort of data that was published
every day, so there's a nationwide aspect. Another
data type we were looking at is how people interact within these
different settings. For that we used an awful lot of literature
review. We spoke to people that work in the settings. We spoke to
people that work in care homes, we spoke to care home franchise
owners to understand their staffing policies and things related to
that. We also spoke to government departments like the
Department of Justice. So, it was a lot of different data sources
all sort of gathered together for the various aspects of this
project.
MF
This model you’ve created, what's
its future? Perhaps in different scenarios that might arise in the
future.
BA
The model was very, very carefully
constructed to be as flexible as possible at the time, with potential
future COVID variants in mind, but because of that, it means it's
very adaptable to different infectious diseases. So if you
change just a few input parameters, like the mortality rates, you
know, the infection rate, a few factors like that, it's quite easy
to transform this model to simulate the spread of other infectious
diseases. So, things like flu, which has a big impact on care homes
every year, and the model has the potential to be used to better understand
how to combat that. But another thing that I think is very useful
about this model is it has the ability to help us game plan for
potential future pandemics, because I think it's fair to say that
governments around the world when COVID came along, were kind
of caught by surprise, or wrong-footed, sort of without a game plan
of how to respond. And as we know, the early stages, whether
it's a single pandemic or an individual outbreak, it's those
early stages which are really, really critical. With this sort of
model, we can game plan, you know, what response should we give if we
have a future pandemic with these properties? Say we've got this
transmissibility, it's got this mortality rate, we have tests that
cost this much and they give you this accuracy. In that scenario,
what should we do? And to be able to do that research upfront and
to have some sort of game plan in mind so that if and when future
pandemics come along, we are better prepared and can respond
efficiently and quickly to try and have the best outcomes possible.
So that's something I think is really exciting for the
future of this model.
MF
Okay, that's beautifully explained, thank you very much indeed.
Bill, so we've heard from Becky
about how the data that she had to access had to come
from many different places, but I guess that might have been
an impediment to actually producing a model as rapidly in the
pressing circumstances of the pandemic as it could potentially have
been achieved. Does that suggest then that while the SRS has
achieved on its own terms, a great deal, nevertheless, there
have been limitations, and perhaps it's time to be doing this kind
of data sharing across the public sector in a much bigger and
better way?
BS
Yes. When I look at the sort of
challenges and limitations around the SRS, I think there's probably
three things, one of which is the ability to get the data sharing
moving as fast as we need to meet this sort of policy need. The
second area would be around the fact that actually the SRS is
ageing technology now, and although it's performed really well, and
especially during that sort of pandemic response we talked about
earlier, it's fair to say it has struggled to cope with some of the
really sort of heavy processing requirements that have come out of
that sort of COVID response. Some of the modelling required
was much larger than the traditional sort of research projects we
might have had in the SRS. And then the final thing is around some
of the processes that we described earlier, that sort of five
safes framework. All of our processes and rules apply to users,
regardless of their sector. What that means is for government
analysts who are seeking to access government data, working on
government systems to inform government policy, there's a feeling
that we could do things faster. Only 25% of our user base is
government analysts at the moment, you know, I think that's
something we certainly could improve to build that area of the
service.
MF
Building the service then for the future
is where Jason comes in, Jason Yaxley. As the director of the new
Integrated Data Service, we've heard about potential, we've heard
about the opportunity to do more in future. Tell us then about the
Integrated Data Service, which promises to expand the amount of
data available to researchers to speed up the delivery of it and to
really produce a huge step-change or transformation in the ability
of researchers to do this kind of work in the future. Is that a
fair expectation?
JASON YAXLEY
Hi Miles, pleased to be here. Yes,
I think it's a very fair expectation. So I have the pleasure of
being the programme director for the Integrated Data Programme,
which will deliver the Integrated Data Service and the ONS is
the lead delivery partner for all of government to deliver a
transformation both in how government uses data, but also the
underpinning technology that enables us to analyse and use that
data much more quickly. And so that's a reason why we're one of the
key enablers of the government's data strategy and why I view this
very much as a transformation rather than just another big
data lake where lots of government data goes and we can't really
get into it. So, it's a really exciting opportunity. We're in the
sort of middle stage of the programme where we have a service that
is built and now we have to sort of grow it and expand it and get
more data to really enhance its functionality, but it's a really
exciting time. A really great job to have.
MF
And in terms of scale, what's the
difference between IDS coming in, the Integrated Data Service,
compared to the old, if I can put it that way, Secure Research
Service?
JY
When it comes to the SRS, it is
brilliant at what it does, but its technology is starting to age
and that is causing limitations. And I think what makes the
Integrated Data Service sort of a step-change and perhaps unique
across government falls into sort of four broad categories.
There's the enabling infrastructure itself, which will be state of
the art cloud-based, there is the data which will be much more
friction free and will be quicker and easier to access data, use
data, share data. It will enable data visualisation in a way that's
never been done before. And rather than having to do individual
agreements to link one bit of data to a different bit of data, what
we will have here is a service for people that will be scalable,
repeatable, standardised, which makes it much much easier on a
regular basis to link and index and then do research against much
larger datasets much more quickly and produce faster results, which
is going to be a huge benefit to the public good through the lens
of better, more informed and evidence-based policy decision
making, that has much more statistical and analytical evidence
that sits underneath it.
And so we're transforming both the
data access itself and the technology that enables that, but also
the sort of almost the cultural lens through which we work
together. We share information to simplify it. I really
want to stress the IDS is keeping all the really good parts of
SRS around the five safes, around the de-identification of data,
protecting that data and ensuring that you know, public concerns
about how government holds and uses data are entirely
met.
MF
That's an obvious question isn't it, if
this is happening much more widely on a much bigger scale, how
are those safeguards that we heard about from Bill going to be
protected? How are they going to persist, and the same level of
protection be provided?
JY
2023 is a big year for the programme,
particularly March when we hope and we're aiming to receive our own
Digital Economy Act accreditation in the same way that the SRS has.
So we will carry forward the same safeguards that SRS has used
so successfully, as I say, around the five safes, around how users
are accredited, but through technology and through the service that
we operate, to streamline and simplify that, particularly for
government users using government data. So this is about that
cultural journey as well as that technological journey. Very
central to what we're doing is the security of data, the protection
of data, you know, we have to convince all of the Chief Technical
Officers and all the data analysts across Whitehall that we are as
safe and as secure as we could possibly be. So that they'll be
comfortable with us having access to that data.
MF
And the potential is that most UK
government data will be made available and accessible to
researchers?
JY
And that's the end game. Absolutely. As
I say, we're on a journey at this point. Again, 2023 is important
to us. We've just brought in what we're calling super early
adopters, which are strategic experienced government analysts from
both Whitehall departments and the devolved administrations,
particularly Welsh Government right now, and we have brought census
2021 data into the system very early. And so we're already working
with government analysts to start to do early exploratory projects
that unlock the information and the power of the census data
against certain government priorities, for example, around the
economy or around energy, and particularly, we're working with
Welsh Government to look at what is the impact of the recent
economic situation on the Welsh farming community and how can
we analyse the industry against the information that we hold in the
census data and other data sources to find outcomes of what's
happened in say, the last 10 years between the two census
datasets.
MF
So what happens next, what are the next
steps on this? And particularly what's the message to researchers
who think that they would like to be involved in this
project?
JY
2023’s really big steps are, as I've
just mentioned, DEA accreditation, we reach the next level of
maturity for our functionality also in March, which means in the
rest of 2023, having had these two points in time, we’ll be in a
position to unlock the full sort of power of IDS, we will be
wanting to encourage particularly more government researchers. Our
aspiration is that every government professional analyst will be
registered on and be able to use the service. We will accelerate
our pipeline with Whitehall departments with data that we want
to bring in. And over the life of the programme we will want to
transition SRS itself, and its data and its users into
IDS unlocking for those users as I say, the enabling
technology of data visualisation, the speed and the pace, the
scale. So, I at the moment feel that what we have is a huge
warehouse with one corner that has data in it but the potential to
fill it with as much data as we can in a way that is linked and
matched and indexed. So that you can do much greater analytical
research than hitherto has been possible. Just to illustrate that,
the way I like to think of it is there are a lot of people
both in government and in academia that can do point to point
linkage between dataset A, dataset B, and then run some
research against it. And you can think of that perhaps as
a ferry crossing a river from point A to point B on the other
side. What helps visualise why IDS will be different is to think of
us as a bridge and a road that goes over the river and so we can
have multiple streams of traffic. We can have a much greater flow
of information and research and all the agreements only have to be
done once and then it's just repeatable from there. And that's one
of the reasons why I'm so excited to be working with the colleagues
on the programme and colleagues across government and academia to
deliver the transformation which we aim to complete by March 2025.
So we still have some way to go to fully exploit all of the
technology and get all the data in, but we're on our
way.
MF
In the meantime however, there are a
couple of examples already out there that listeners might care to
check out for themselves if they haven't already. The first of
which is the climate statistics data dashboard, creating a one-stop
shop if you like for statistics on climate change related topics,
bringing together data from around government, you can see it at
climate-change.data.gov.uk and another one is the violence against
women and girls data dashboard that's vawg.GSS-data.org.uk, which
has been created as an important part of the government's 2021
tackling violence against women and girls strategy. And of course,
the very popular and widely used COVID dashboard which continues to
be available as well. So real living examples of the Integrated
Data Service already serving the public benefit.
Becky, if I could bring you back in
again, if we're able to deliver on this and the warehouse as Jason
described, it becomes bursting with data from right across
government sources, presumably then in the future, the kind of work
you told us about, your award-winning work during the pandemic, will
become that much faster, much easier to execute.
BA
Yes, it really, really would. And I also
can't overstate the integration value of it. Having
things in the same place and linked just saves so much time, because trying
to track down what data is available and then trying to combine it
all together is such an undertaking. Having that sort of delivered
there, sort of knowing what is available in a much more accessible
way. Being able to use it much more readily would vastly, vastly
speed up the sort of research that I did. But it would also be
hugely, hugely valuable.
MF
Perhaps some of those listening to this
Becky might be surprised actually at how difficult it has been to
access public data like this in the past, and that government
departments haven't collaborated in making it available in a single
place.
BA
One of the biggest difficulties in doing
the research I did was trying to get access. Just trying to find
what datasets are out there is also a really, really big time sink
and the idea of these all being integrated together and much more
findable in a way that they aren't now is really, really exciting
because it means that if you know what data there is you can use
the most appropriate data for what you're trying to use, rather
than trying to cobble together what you know exists and you can get
your hands on. So integrating this all together in one place where
it's findable would be a huge, huge win for the sort of
research that I did - or what my team did, more
accurately. Another factor on that as well is the linking. It is so
difficult if you've got different datasets compiled for completely
different purposes by different departments - trying to combine
those together is really hard. Even if they are about the same
sorts of people, the same sorts of things. So having datasets that
are already integrated would be a huge, huge step forward in trying
to use that data as effectively as possible for the sort of
research to drive evidence-based decision making in policy, which I
think is something that is so important, and it's something I'm
really passionate about.
MF
Becky, thank you very much for joining us. And thanks also to Jason Yaxley, and to Bill South for taking us through this important topic.
I'm conscious that we've approached it largely through the perspective of researchers. The whole issue of data ethics and how public good is assessed is something we've tackled in a previous podcast - do please listen to that and hear about the work of the data ethics committee as well because obviously, confidence in these kinds of initiatives, public trust in these kinds of initiatives, depends very much on people understanding the ethical framework under which this work goes on. That's another big topic we will return to in the future, no doubt, and we will also track progress in the ongoing development of the Integrated Data Service and the progress of some of the fantastic research projects that have already resulted from this kind of work, and the potential ones, very excitingly, in future too.
I’m Miles Fletcher, and thanks once again for listening to Statistically Speaking. You can subscribe to new episodes of this podcast on Spotify, Apple podcasts and all the other major podcast platforms.
Our producers at the ONS are Steve Milne and Alisha Arthur. Until next time, goodbye.
ENDS