Benchmarking transformer-based models for medical record de-identification in a single-center, multi-specialty evaluation.
Kuo R., Soltan AAS., O'Hanlon C., Hasanic A., Clifton DA., Collins G., Furniss D., Eyre DW.
Protecting patient confidentiality is central to enabling research using electronic health records. Automated text de-identification offers a scalable alternative to manual redaction, but approaches vary in accuracy and adaptability. We evaluated four transformer-based, task-specific models and five large language models on 3,650 clinical records spanning general and specialty datasets from a UK hospital group. Records were dual-annotated by clinicians, enabling precise comparison of performance. The Microsoft Azure de-identification service achieved the highest F1 score, approaching clinician performance; fine-tuned AnonCAT and GPT-4-0125 with few-shot prompting also performed strongly. Smaller LLMs frequently over-redacted or produced hallucinatory content, limiting interpretability. Task-specific models were more stable across datasets, and low-level adaptation improved performance in both model classes. These findings indicate that automated de-identification systems can effectively support large-scale sharing of clinical records, but success depends on careful model choice, adaptation strategy, and safeguards to ensure robust data utility and privacy.
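The models above are compared against dual clinician annotations using F1 score. As a hypothetical illustration only (the paper's exact matching criteria, e.g. exact versus partial span overlap, are not stated here), span-level precision, recall, and F1 can be computed by treating the clinician annotations as gold-standard spans:

```python
# Hypothetical sketch of span-level evaluation; the study's actual
# matching rules and entity categories are not specified in this abstract.

def span_f1(gold, predicted):
    """Compute precision, recall, and F1 over (start, end) character spans,
    counting a prediction as correct only on an exact span match."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)  # spans correctly redacted by the model
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative record: clinician-annotated identifiers vs. model output.
gold = [(0, 8), (25, 40), (60, 72)]   # gold spans (e.g. name, date, address)
pred = [(0, 8), (25, 40), (80, 90)]   # model misses one span, adds one spurious
p, r, f1 = span_f1(gold, pred)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 0.67 0.67
```

Under this scheme, over-redaction (as observed with the smaller LLMs) lowers precision, while missed identifiers lower recall; F1 balances the two.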
