Aiden Doherty, David Eyre, and Tom Nichols describe the ‘Data Challenge’ undertaken by students in Healthcare Data Science.

Healthcare Data Science students participating in a data challenge exercise.

What is the Data Challenge?

Our doctoral training programme in Healthcare Data Science (formerly Health Data Science) starts with two terms of intensive structured teaching and coursework, which give hands-on experience through computational practicals and assignments based on modest-sized, well-curated data. Each term concludes with a two-week ‘Data Challenge’, where students work in small teams, guided by clinical and domain experts, to address healthcare questions using ‘raw’ health datasets and present their results.

Why did you introduce the Data Challenge?

Whether assessing how to best respond to a global pandemic or using millions of hospital records to improve health outcomes, health data scientists need cutting-edge analytical techniques to address clinical research questions. There is a growing demand for health data scientists, in particular with experience of analysing large-scale datasets, and this requires specialised training with real health data.

What makes the Data Challenge successful?

The key elements of a successful data challenge lie in the combination of students, staff and real health data. Projects are designed to develop or solidify partnerships with clinical collaborators, and often result in further work. For example, data challenges using electronic health records have served as pilot projects with our local hospital group, and preparing the data for the challenge has led to improved data quality for related projects.

Who is involved in the Data Challenge?

Students come from diverse technical backgrounds, typically with an MSc in computer science, statistics, engineering, or a similar quantitative discipline. For the Data Challenge, they are divided into groups of three to five students, chosen to ensure a balance of skills and backgrounds. Their initial coursework includes material on statistics, machine learning, deep learning, epidemiology, and the ethics of health data, and we ensure that students are equipped with the necessary skills for their specific challenge before it starts.

A member of staff is responsible for coordinating all facets of each challenge, such as identifying datasets, clinical collaborators, and tutors (senior graduate students and postdocs), and works closely with them to develop challenges that can be addressed with suitable data within two weeks. The tutors also do basic curation and may prepare notebooks that illustrate how to access the data and its key features.

The Healthcare Data Science students wrangle the data, fit models, and work closely with the experts to interpret the data and present their findings.

Could you describe the datasets used for the challenge?

The data must be appropriately large, comparable in size to the datasets that cutting-edge health researchers routinely handle – we are based in the Big Data Institute, after all! They must be appropriate for the specific research questions addressed through the challenge, but not too heavily filtered, so that students are encouraged to critically identify useful variables. Datasets must also be sufficiently curated, but ‘raw’ enough to capture the inherent noise and complexity of real-world health data. Typically, the health data used are sensitive and require working on a secure private cloud that needs to be configured with basic data science tools.

Could you describe a typical Data Challenge week?

At the beginning of the week, the lead introduces the data and general scope of the challenge. Clinical collaborators introduce the research question and explain how it can be addressed with the data. The lead tutor then gives an overview of the data, how the dataset has been prepared and how to access it.

The final presentation must address ethical issues associated with each project, providing possible solutions. Team tutors provide support for any questions about the data and the nature of the challenge and assist with data wrangling and modelling. Clinical collaborators also check in throughout the week, based on their availability and interest.

A check-in at the end of week one (an informal discussion or a mini presentation) ensures that all groups are on track. Groups may be encouraged to wrap up extended data wrangling, or to curtail exploration of machine learning phenotyping models. Groups are also given feedback on proposed ethical issues and research plans to associate phenotypes with clinical outcomes.

By the middle of week two, students have created various versions of their models and start developing their final presentation in coordination with their clinical experts. The final-day presentations often draw an audience from outside the immediate research and clinical community.

What sorts of research questions do the students consider?

Our data science projects have covered a range of topics, including:

  • Improving early detection of clinical deterioration in patients with heart disease
  • Predicting demand for hospital beds and patient flow through hospitals
  • Measuring inequalities in colorectal cancer care and outcomes
  • Understanding patterns of change in kidney function over time in patients with chronic kidney disease
  • Selecting optimal treatments for pneumonia
  • Associating steps with polygenic risk scores and coronary artery disease
  • Associating sleep with cognitive/brain health
  • Detecting COVID-19 faster, so patients with COVID-19 could be identified before diagnostic results were back
  • Measuring the impact of COVID-19 on acute surgical admissions.

What do students say about the challenge?

‘I learnt a lot, very fast, in fact much faster than during the coursework weeks. Having something to do instead of to read or listen to I think is much more effective.’

‘What I liked about the data challenge was that it was open-ended and creative, and that we were able to work in groups and share ideas.’

‘It provides a fascinating insight into problems in healthcare and hospitals today.’

‘It was a fascinating experience to go through the Data Challenge which almost mimics the whole lifecycle of a research project covering: data annotation, data cleaning all the way to model development and deployment for population health inference.’

‘I think the Data Challenge gives a good introduction to the uncertainty and difficulties of doing research with real data.’

Acknowledgements

The Data Challenges have been created and delivered by Jim Davies, Aiden Doherty, David Eyre, Kathy Jarvis, Konstantinos Kamnitsas, Angeliki Kerasidou, Federica Lucivero, Katrina Lythgoe, Tom Nichols, Jasmina Panovska-Griffiths, and Jens Rittscher.

The Data Challenges and the entire training programme are funded by UK Research and Innovation, as the Engineering and Physical Sciences Research Council Centre for Doctoral Training in Healthcare Data Science. The NHS data are made accessible thanks to support from the National Institute for Health and Care Research Oxford Biomedical Research Centre.