Important information about diagnosis, treatment, and outcomes is often available only in the form of unstructured data: in clinical or laboratory reports, in patient notes, or as free-text responses on case report forms. Even where the information also exists in coded form, there may be questions about the accuracy or completeness of the coding.

Research Experience, Research Methods and Training

The studentship will focus on the development and evaluation of methodologies for the management and transformation of unstructured data. This will involve: a systematic review of literature in natural language processing, domain-specific modelling, model-driven transformation, data governance, trials design and compliance; the design and implementation of techniques for automatic analysis, de-identification, and quality assurance; the development of metrics for measuring the applicability of these techniques to different classes of unstructured data; the development of domain-specific modelling languages and ontologies for the classification and management of information contained within and derived from unstructured data; and the establishment of key properties of these languages and ontologies, in terms of mathematical foundations and relationships to alternative approaches.

The HPS2-THRIVE and HPS3-REVEAL trials constitute a valuable resource for the development and evaluation of techniques. Not only have these trials collected large quantities of unstructured data, including more than 42,000 medication reports, but this data has also been manually interpreted against an agreed ontology, at considerable expense in terms of time and clinical effort. The raw, unstructured data and the coded interpretation will be made available to support this research.

Field Work, Secondments, Industry Placements and Training

No field work is required, although opportunities for collaboration will be available through the MRC Hub Network, the Farr Institute, the UK Healthcare Text Analytics Research Network, the Oxford Big Data Institute, and existing research collaborations, including Stanford University (NIH National Center for Biomedical Ontology; Stanford Center for Biomedical Informatics Research), Vanderbilt University (PheKB, eMERGE consortium), and the University of Washington/Fred Hutchinson Cancer Research Center.

Prospective Candidate

The candidate needs to demonstrate:

1. Proven academic excellence in computer science or a related discipline (i.e., a first-class or upper second-class undergraduate degree, or international equivalent; a master’s degree)

2. Proficiency in English and excellent communication skills

Research or employment experience relevant to population health would be beneficial, as would experience with unstructured data.
