Creating simulated electronic healthcare records for data protection, study design, and education
Dr Tingting Zhu, Institute of Biomedical Engineering, Department of Engineering
The UK and many other countries with advanced healthcare systems collect large volumes of data on the health of their citizens. Often collected for administrative purposes, these datasets are increasingly being linked to epidemiological studies to investigate the causes and consequences of disease.
Because they contain highly sensitive personal health information, legal and practical restrictions are placed on access to health data. These restrictions are well justified, but they have side effects: for example, they prevent educational uses, in which realistic data are invaluable for demonstrating and assessing population health knowledge, and they delay or prevent access for researchers who have limited financial resources or lack suitable institutional affiliations.
Simulated data could help to address these limitations, and could have other benefits in study design and methodological research. However, such data must be realistic and, for most uses, must neither contain identifiable information about real patients nor introduce bias into later analyses of genuine data.
Research experience, methods and training
This health data science project will use statistical and machine learning methods to generate simulated health and healthcare data that are realistic and, as necessary, privacy-preserving and non-biasing.
The successful candidate will gain experience working with large UK healthcare databases, such as hospital or primary care records and cancer and death registries. The candidate will learn how to process and clean these data, and how to develop statistical and machine learning models that simulate real-world patient data meeting the criteria noted above. Relevant models may include random processes such as Markov chains, and deep learning methods such as generative adversarial networks. The project will also involve some applied statistical analysis, using methods from descriptive and analytical epidemiology to develop and test examples using the simulated data. The project may use R, Python, or both for computing.
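As an illustration of the simplest kind of model mentioned above, the following is a minimal Python sketch of a Markov chain that generates simulated patient trajectories. The health states, the transition probabilities, and the function names `simulate_patient` and `simulate_cohort` are all hypothetical and chosen purely for illustration; they are not estimated from any real dataset and do not represent the method the project will adopt.

```python
import random

# Hypothetical health states and illustrative transition probabilities.
# These values are invented for the example, NOT derived from real records.
STATES = ["well", "ill", "hospitalised", "dead"]
TRANSITIONS = {
    "well":         [0.90, 0.08, 0.01, 0.01],
    "ill":          [0.30, 0.55, 0.12, 0.03],
    "hospitalised": [0.10, 0.40, 0.40, 0.10],
    "dead":         [0.00, 0.00, 0.00, 1.00],  # absorbing state
}


def simulate_patient(n_steps, rng, start="well"):
    """Return one simulated trajectory containing n_steps + 1 states."""
    trajectory = [start]
    for _ in range(n_steps):
        current = trajectory[-1]
        # Draw the next state using the row of probabilities for the current state.
        next_state = rng.choices(STATES, weights=TRANSITIONS[current], k=1)[0]
        trajectory.append(next_state)
    return trajectory


def simulate_cohort(n_patients, n_steps, seed=0):
    """Simulate independent trajectories for a cohort, reproducibly via seed."""
    rng = random.Random(seed)
    return [simulate_patient(n_steps, rng) for _ in range(n_patients)]


if __name__ == "__main__":
    # Example: five simulated patients followed over twelve time steps.
    for patient_id, trajectory in enumerate(simulate_cohort(5, 12, seed=42)):
        print(patient_id, " -> ".join(trajectory))
```

Because every value is drawn from a fixed table of probabilities rather than copied from an individual record, no simulated trajectory corresponds to a real patient. Whether such output is realistic enough, and whether it biases later analyses, would still need to be assessed against the criteria described above.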
Planned fieldwork, industry placements and training
Training in the use of specific health datasets will be provided as needed via internal or external courses. Training in epidemiology, statistics and machine learning will be available as needed through departmental courses. Internal and on-the-job training will be available in writing analysis code in relevant programming languages and in developing R and/or Python software packages. There will be opportunities to present the work at relevant epidemiological and machine learning conferences and at other internal and external meetings and workshops.
Suitable candidates could have a variety of backgrounds, but will have an interest in health research and in the use of large-scale health datasets to address important applied and methodological problems. They will have some prior experience in statistical computing, machine learning or another data science-related field.