Hash-based core genome multi-locus sequence typing for Clostridium difficile
Eyre DW., Peto TEA., Crook DW., Walker AS., Wilcox MH.
<jats:title>Abstract</jats:title><jats:sec><jats:title>Background</jats:title><jats:p>Pathogen whole-genome sequencing has huge potential as a tool to better understand infection transmission. However, rapidly identifying closely related genomes among a background of thousands of other genomes is challenging.</jats:p></jats:sec><jats:sec><jats:title>Methods</jats:title><jats:p>We describe a refinement to core-genome multi-locus sequence typing (cgMLST) where alleles at each gene are reproducibly converted to a unique hash, or short string of letters (hash-cgMLST). This avoids the resource-intensive need for a single centralised database of sequentially numbered alleles. We test the reproducibility and discriminatory power of cgMLST/hash-cgMLST compared to mapping-based approaches in <jats:italic>Clostridium difficile</jats:italic> using repeated sequencing of the same isolates (replicates) and data from consecutive infection isolates from six English hospitals.</jats:p></jats:sec><jats:sec><jats:title>Results</jats:title><jats:p>Hash-cgMLST provided the same results as standard cgMLST with minimal performance penalty. Among 272 pairs of replicate sequences, reference-based mapping identified 0, 1 or 2 SNPs between 262 (96%), 5 (2%) and 1 (<1%) pairs, respectively. Using hash-cgMLST or standard cgMLST, 197 (72%) replicate pairs had zero gene differences, while 37 (14%), 8 (3%) and 30 (11%) pairs had 1, 2 and >2 differences, respectively. False gene differences were clustered in specific genes and associated with fragmented assemblies. Of 413 pairs of infections differing by ≤2 SNPs, i.e. consistent with recent transmission, 266 (64%) had ≤2 gene differences and 50 (12%) had ≥5 differences. Comparing a genome to 100,000 others took <1 minute using hash-cgMLST.</jats:p></jats:sec><jats:sec><jats:title>Conclusion</jats:title><jats:p>Hash-cgMLST is an effective surveillance tool that can rapidly identify clusters of related genomes.
However, cgMLST/hash-cgMLST potentially generates more false variants than mapping-based analysis. Refined mapping-based variant calling is likely required to precisely define close genetic relationships.</jats:p></jats:sec>
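
The hashing idea described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the hash function, so SHA-1 from Python's standard library is assumed here, and the function names, gene names and sequences are all hypothetical. The sketch shows how any laboratory could derive the same allele identifier from the same sequence without consulting a central database, and how two profiles might be compared by counting differing genes.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """Map an allele's DNA sequence to a reproducible hex string.

    Assumption: SHA-1 over the upper-cased sequence; the abstract does
    not specify which hash function is used.
    """
    normalised = sequence.strip().upper()
    return hashlib.sha1(normalised.encode("ascii")).hexdigest()


def profile(genes: dict[str, str]) -> dict[str, str]:
    """Convert {gene name: allele sequence} into {gene name: allele hash}."""
    return {gene: allele_hash(seq) for gene, seq in genes.items()}


def gene_differences(a: dict[str, str], b: dict[str, str]) -> int:
    """Count genes present in both profiles whose allele hashes differ.

    Assumption: genes missing from either profile are ignored rather
    than counted as differences.
    """
    shared = a.keys() & b.keys()
    return sum(1 for gene in shared if a[gene] != b[gene])


# Hypothetical toy data: three short "genes" in one isolate, two in another.
isolate_a = {"gene1": "ATGGCTAAA", "gene2": "ATGCCCTGA", "gene3": "ATGTTTTAA"}
isolate_b = {"gene1": "atggctaaa", "gene2": "ATGCCTTGA"}

profile_a = profile(isolate_a)
profile_b = profile(isolate_b)

# gene1 matches after case normalisation, gene2 differs, gene3 is not shared.
print(gene_differences(profile_a, profile_b))  # prints 1
```

Because each identifier depends only on the allele sequence, profiles computed independently at different sites remain directly comparable, which is the property that removes the need for sequentially numbered alleles.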