Analysing large-scale electronic health records to identify at-risk populations
2025/52
EXTERNAL SUPERVISOR
Professor Sarah Walker, Nuffield Department of Medicine
BACKGROUND
Large-scale electronic health records offer the opportunity to investigate many more potential risk factors for infections than traditional questionnaire-based epidemiological studies. Their volume and scale are continuously increasing as ever larger amounts of healthcare data, for example from GPs and hospitals, are linked and de-identified for research. However, the vast number of potential risk factors that could be considered poses challenges to traditional epidemiological approaches. These challenges resemble those faced by genome-wide association studies, both in the number of factors and in the potential for strong associations between them. For example, diabetes/pre-diabetes may be reflected in some or all of: hospital diagnostic codes, procedure (e.g. surgery) codes and lab test results such as HbA1c (a marker of longer-term blood sugar levels).
A recently developed statistical analysis approach called ‘doublethink’, which has been applied to COVID-19, aims to address these challenges. This DPhil project will investigate its application to identifying populations at risk of the most common bloodstream infections – defined by microbiological isolation of specific pathogens – and will explore different approaches for defining large-scale sets of covariates based on diagnostic and procedure codes. The goal is to gain epidemiological insights into the populations at highest risk of these infections, to whom interventions could be targeted. Depending on the student’s interests, they could go on to develop further statistical approaches to using these large-scale data, compare results with those from machine learning models, or explore a wider range of outcomes and risk factors. In the first instance, the project will use an existing large data warehouse of de-identified individual patient data, the Infections in Oxfordshire Database.
RESEARCH EXPERIENCE, RESEARCH METHODS AND TRAINING
Specific training will be in a wide range of statistical and epidemiological methods, use of large-scale health record data, and microbiology. Attending relevant specialised training courses will be encouraged.
The student will join a dynamic team of around 30 DPhil students and post-docs with expertise in biostatistics, epidemiology, machine learning, infectious diseases, microbiology, molecular biology and bioinformatics. This provides opportunities for skills and career development, both in biostatistics/epidemiology specifically and in research careers more broadly. Our group, located primarily at the John Radcliffe Hospital, has strong inter-disciplinary links with national and international collaborators and public health agencies, offering a unique opportunity to pursue pathways to research translation in infectious disease diagnostics.
PROSPECTIVE STUDENT
This project would suit a student with a Bachelor's or Master's degree in a science or quantitative subject, and an interest in the epidemiological analysis of electronic health records, statistical methods and the use of large datasets to answer real-world questions.