external supervisor

Dr Anshul Thakur, Department of Engineering Science


As health data for epidemiological research becomes increasingly large and complex, classical statistical methods for medical research face new challenges. The first challenge is to identify relevant features (i.e. risk factors) for a given disease outcome. The second challenge is to model the complex non-linear relationships among potential predictors.

Machine Learning (ML), a branch of Artificial Intelligence (AI), uses methods that draw mathematical inferences by learning from data samples. The objective of this project is to bring the benefits of ML into epidemiological research, using data from UK Biobank, a prospective cohort study that provides extensive phenotypic and genetic data on approximately 500,000 individuals.

The research questions are:

  1. For common diseases/cancers, can ML methods identify novel risk factors to augment the existing statistical models?
  2. For rare diseases/cancers, can ML methods overcome the challenge of class imbalance and data scarcity?

We will first explore ML methods for feature selection, including random forests, Bayesian variable selection methods and gradient analysis in deep neural networks, aiming to extract the relevant features from a vast number of variables. Using the extracted features as inputs, we will then build modern ML models (e.g. DeepSurv), as well as classical statistical models (e.g. Cox models), to predict a given disease outcome.
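As a rough illustration of this two-stage idea (select features, then model time to disease), the sketch below ranks candidate features by fitting a single-covariate Cox model to each one. It makes several assumptions: the cohort is simulated rather than drawn from UK Biobank, univariate screening stands in for the random forest, Bayesian and neural network selection methods named above, event times are untied, and all function names (`simulate_cohort`, `fit_univariate_cox`, `screen_features`) are hypothetical. It uses only the Python standard library.

```python
import math
import random


def cox_score_and_info(beta, times, events, x):
    """Gradient and observed information of the Cox partial log-likelihood
    for a single covariate (assumes no tied event times)."""
    # Walk from the longest to the shortest follow-up time, so the running
    # sums always cover the risk set {j : t_j >= t_i} of the current subject.
    order = sorted(range(len(times)), key=lambda i: times[i], reverse=True)
    s0 = s1 = s2 = 0.0
    grad = info = 0.0
    for i in order:
        w = math.exp(beta * x[i])
        s0 += w
        s1 += w * x[i]
        s2 += w * x[i] * x[i]
        if events[i]:
            mean = s1 / s0
            grad += x[i] - mean
            info += s2 / s0 - mean * mean
    return grad, info


def fit_univariate_cox(times, events, x, n_iter=25, tol=1e-8):
    """Newton-Raphson estimate of the log hazard ratio and its standard error."""
    beta = 0.0
    for _ in range(n_iter):
        grad, info = cox_score_and_info(beta, times, events, x)
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    _, info = cox_score_and_info(beta, times, events, x)
    return beta, 1.0 / math.sqrt(info)


def simulate_cohort(n=300, n_features=3, true_beta=1.0, censor_time=2.0, seed=0):
    """Simulated cohort: only feature 0 affects the hazard (illustrative only)."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
    times, events = [], []
    for row in X:
        t = rng.expovariate(math.exp(true_beta * row[0]))  # hazard = exp(beta * x0)
        times.append(min(t, censor_time))  # administrative censoring
        events.append(t <= censor_time)
    return X, times, events


def screen_features(X, times, events):
    """Rank features by the absolute z-score of their univariate Cox fit."""
    results = []
    for k in range(len(X[0])):
        col = [row[k] for row in X]
        beta, se = fit_univariate_cox(times, events, col)
        results.append((k, beta, beta / se))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


X, times, events = simulate_cohort()
ranking = screen_features(X, times, events)
for k, beta, z in ranking:
    print(f"feature {k}: log hazard ratio = {beta:+.3f}, z = {z:+.2f}")
```

In practice the top-ranked features would be passed to a multivariable model, and an established survival analysis package would replace the hand-written fit; the version here is kept dependency-free only so that each step is visible.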

The developed methodology should be applicable to a wide range of disease outcomes, from rare to common diseases. The specific disease areas are subject to the student's personal interest and further discussion.


This interdisciplinary project will allow candidates to build analytical skills in both machine learning and medical statistics. Candidates will receive professional mentorship through regular supervisory meetings, and acquire research skills by attending seminars and workshops. Candidates will work closely with other team members, and communicate their findings at international conferences and to the public.


There may be opportunities to work with external partners and/or on different datasets.


Candidates should have an MSc degree in statistics, mathematics or machine learning, or an equivalent mathematical background. Prior knowledge of genetics is not required, but the student is expected to learn the basic genetics needed for computing polygenic risk scores.

Candidates will write programs in the statistical language R and, where necessary, in Python for ML work. The student is expected to document their ongoing findings clearly, and to publish one or more peer-reviewed articles by the end of their DPhil.