
Introduction

We have a large cohort study of half a million people, which continually incorporates new data from health insurance records, centre for disease control records, death certificates, resurveys, and ongoing quality assurance and participant information updates. To support our researchers we need data that are correct, up to date, and unchanging.

Objectives and Approach

We must deliver the new data, fixes, and corrections to researchers without omitting anything or introducing new issues. We make sequential iterations of our data available to researchers on a biannual basis: each release provides a static version that earlier work can reference, while the newest version supports new work. Given the very large size of the data and code base and the small size of the team managing it, delivering this without error is challenging. To mitigate the risk we developed testing scripts that catch issues and flag them for resolution before release to researchers.

Results

We currently have 32 tests, which catch all known issues that can occur during a rebuild. Whenever a new type of issue is encountered, we develop tests that would catch it and related issues. As a result, our last few releases have gone far more smoothly, with few, if any, issues reported after a release and no recurrence of previously encountered issues. Examples of current tests include: detecting a failed health insurance import; confirming that the number of participants is unchanged; catching a failure to increment the version number between releases; and checking that disease counts have not changed dramatically over the shared timeframe.

Conclusion/Implications

Producing multiple static releases is a good way to balance researchers' needs for both static and current data, but it does introduce opportunities for both human and computer error. Mitigating this risk with automated testing is convenient and effective.
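The release checks listed in the Results lend themselves to simple assertion-based scripts. The sketch below illustrates the idea in Python; it is not the study's actual test suite, and the release structure (version, n_participants, disease_counts keys) and the 5% drift tolerance are illustrative assumptions.

```python
# A minimal sketch, assuming each release is summarised as a dict with
# "version", "n_participants", and "disease_counts" keys (illustrative
# names, not the study's actual schema or test suite).

def check_version_incremented(old: dict, new: dict) -> None:
    """Catch a failure to increment the version number between releases."""
    assert new["version"] > old["version"], (
        f"version not incremented: {old['version']} -> {new['version']}"
    )

def check_participant_count_unchanged(old: dict, new: dict) -> None:
    """The cohort is fixed, so the participant count must not change."""
    assert new["n_participants"] == old["n_participants"], (
        f"participant count changed: "
        f"{old['n_participants']} -> {new['n_participants']}"
    )

def check_disease_counts_stable(old: dict, new: dict,
                                tolerance: float = 0.05) -> None:
    """Flag disease counts that shift dramatically over the shared timeframe."""
    for disease, old_count in old["disease_counts"].items():
        new_count = new["disease_counts"].get(disease, 0)
        drift = abs(new_count - old_count) / max(old_count, 1)
        assert drift <= tolerance, (
            f"{disease}: {old_count} -> {new_count} "
            f"({drift:.1%} exceeds {tolerance:.0%} tolerance)"
        )

if __name__ == "__main__":
    previous = {"version": 14, "n_participants": 512_891,
                "disease_counts": {"diabetes": 30_120, "stroke": 18_404}}
    candidate = {"version": 15, "n_participants": 512_891,
                 "disease_counts": {"diabetes": 30_455, "stroke": 18_512}}
    for check in (check_version_incremented,
                  check_participant_count_unchanged,
                  check_disease_counts_stable):
        check(previous, candidate)
    print("All release checks passed.")
```

Running such checks on every rebuild, so that a release is blocked until all of them pass, mirrors the approach described above of flagging issues for resolution prior to release.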

Original publication

DOI

10.23889/ijpds.v3i4.645

Type

Journal article

Journal

International Journal of Population Data Science

Publisher

Swansea University

Publication Date

23/08/2018

Volume

3