Machine learning for large epidemiological studies
OPH/23/30
external supervisor
Dr Anshul Thakur, Department of Engineering Science, University of Oxford
BACKGROUND
As health data for epidemiological research become increasingly large and complex, classical statistical methods for medical research face new challenges. The first challenge is to identify relevant features (i.e. risk factors) for a given disease outcome. The second challenge is to model the complex non-linear relations among potential predictors.
Machine learning (ML) methods draw mathematical inferences by learning from data samples. The aim of this project is to bring some of the benefits of ML into epidemiological research. At the feature selection stage, candidate approaches include feature-importance estimators based on Shapley values (e.g. SHAP) applied to ensemble-based classifiers (e.g. XGBoost) that have time-to-event extensions, Bayesian variable selection methods, and gradient analysis in deep neural networks.
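To illustrate what a Shapley-value feature attribution measures, the following is a minimal sketch that computes exact Shapley values by brute force for a toy risk score. The function `toy_risk`, the example individual `x` and the all-zero `baseline` are hypothetical and chosen only for illustration; libraries such as SHAP rely on much faster algorithms, and nothing here is specific to the project's data or models.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, n_features):
    """Exact Shapley values for a set function value_fn(frozenset) -> float."""
    phi = [0.0] * n_features
    for i in range(n_features):
        others = [j for j in range(n_features) if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = (factorial(size) * factorial(n_features - size - 1)
                      / factorial(n_features))
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_risk(features):
    """Hypothetical risk score with one interaction term (illustration only)."""
    x0, x1, x2 = features
    return 2.0 * x0 + 1.0 * x1 + 3.0 * x0 * x2


x = [1.0, 1.0, 1.0]          # hypothetical individual to explain
baseline = [0.0, 0.0, 0.0]   # hypothetical reference values


def coalition_value(subset):
    """Score when features in `subset` take x's values, the rest the baseline."""
    mixed = [x[j] if j in subset else baseline[j] for j in range(len(x))]
    return toy_risk(mixed)


phi = shapley_values(coalition_value, len(x))
```

The attributions in `phi` sum to the difference between the score for `x` and the score for `baseline`, and the interaction term is shared equally between the two features involved. The brute-force loop visits every subset of features, so it is only practical for a handful of features.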
The selected features can then be used as inputs to predictive models for a given disease outcome, including modern ML methods (e.g. DeepSurv, a neural-network architecture for time-to-event data) as well as their classical statistical counterparts (e.g. the Cox proportional hazards model).
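As a rough illustration of the quantity a Cox model optimises, the sketch below evaluates the negative log partial likelihood for a linear predictor, assuming no tied event times. The data values and the function name `cox_neg_log_partial_likelihood` are hypothetical. DeepSurv-style models are generally described as minimising a loss of the same form, with the linear predictor replaced by a neural network output; this sketch does not attempt that.

```python
import math


def cox_neg_log_partial_likelihood(beta, X, times, events):
    """Negative log partial likelihood of a Cox model (assumes no tied times).

    beta   : list of coefficients
    X      : list of covariate rows, one per individual
    times  : follow-up time for each individual
    events : 1 if the event was observed, 0 if censored
    """
    # Linear predictor (log relative hazard) for each individual.
    scores = [sum(b * v for b, v in zip(beta, row)) for row in X]
    nll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored individuals contribute only through risk sets
        # Risk set: everyone still under observation at time t_i.
        risk_sum = sum(math.exp(scores[j])
                       for j, t_j in enumerate(times) if t_j >= t_i)
        nll -= scores[i] - math.log(risk_sum)
    return nll


# Hypothetical toy data: one binary covariate, three individuals.
X = [[1.0], [0.0], [0.0]]
times = [1.0, 2.0, 3.0]
events = [1, 1, 0]

nll_null = cox_neg_log_partial_likelihood([0.0], X, times, events)
nll_fit = cox_neg_log_partial_likelihood([1.0], X, times, events)
```

With a zero coefficient, every individual in a risk set is equally likely to fail, so `nll_null` equals log(3) + log(2). Here the exposed individual fails first, so a positive coefficient gives a lower value in `nll_fit`.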
Potential data sources include the UK Biobank prospective cohort study, which provides extensive phenotypic and multi-omics (including genomic, transcriptomic, proteomic, and metabolomic) data from approximately 500,000 individuals.
The developed methodology will be applicable to a wide range of diseases (e.g. pancreatic cancer, myeloma). The specific disease areas will depend on the student's interests and further discussion.
RESEARCH EXPERIENCE, RESEARCH METHODS AND TRAINING
This interdisciplinary project will allow candidates to build analytical skills in both machine learning and medical statistics. Candidates will receive professional mentorship through regular supervisory meetings, and acquire research skills by attending seminars and workshops. Candidates will work closely with other team members, and communicate their findings at international conferences and to the public.
FIELD WORK, SECONDMENTS, INDUSTRY PLACEMENTS AND TRAINING
There may be opportunities to work with external partners and/or on different datasets.
PROSPECTIVE STUDENT
Candidates should have postgraduate training in statistics, mathematics, or machine learning, or an equivalent mathematical background. Prior knowledge of genetics is not required, but students are expected to learn basic genetics for computing polygenic risk scores.
Proficiency with statistical and machine learning software is required; strong programming skills are essential. Students are expected to document their ongoing findings clearly, and to publish 2-3 articles as the lead author in peer-reviewed scientific journals by the end of their DPhil.