Development of an algorithm for ethnicity recording in cohorts from the UK Clinical Practice Research Datalink primary care and linked Hospital Episode Statistics databases
Main Authors: | , , |
---|---|
Format: | Article |
Language: | English |
Published: | BMJ Publishing Group 2025-07-01 |
Series: | BMJ Open |
Online Access: | https://bmjopen.bmj.com/content/15/7/e100533.full |
Summary: | Objective To evaluate various prioritisation strategies within an algorithm designed to ascertain the most likely ethnicity and create a standardised methodology to benefit future research. Design Retrospective cohort study. Setting The Clinical Practice Research Datalink (CPRD) primary care and linked Hospital Episode Statistics (HES) data sets. Participants The population of 54 029 174 patients included all acceptable patients registered at English practices in CPRD GOLD or CPRD Aurum from the May 2023 to May 2022 builds, respectively. Primary outcome measure Ethnicity data within CPRD and HES data sets were identified by employing established code lists and subsequently categorised into broader ethnic groups. Changes were made to a previously used algorithm to assess their effect on ethnic categorisations. Modifications included prioritising primary over secondary care data, recent over frequent records and ‘non-other’ ethnicity categories. Different data sources were examined: CPRD with all HES data sets, CPRD with HES Admitted Patient Care (APC) only, CPRD only and HES APC only. Ethnic distributions from these variations were compared using counts and percentages, evaluating inter-rater reliability using Cohen’s kappa. Sensitivity analyses included repetition using only currently registered patients and after removing cases with unknown ethnicity. Ethnic distributions were compared with English Census 2021. Results There was almost perfect agreement in ethnicity distributions whether prioritising primary over secondary care data (kappa=1.0000, SE=0.0001), whether prioritising most frequently or most recently recorded data (kappa=0.9824, SE=0.0001) and whether prioritising ‘non-Other’ categories (kappa=0.9705, SE=0.0001). There was moderate agreement in ethnicity distributions when sourcing data from single data sources (CPRD only (kappa=0.5554, SE=0.0001) or HES APC only (kappa=0.5526, SE=0.0001)) compared with combined data sources (CPRD and HES data sets). Conclusions All variations of the algorithm produced similar population-level ethnicity distributions. Versions using data from multiple sources had higher inter-rater reliability than those using a subset of sources; however, there was little difference in categorisations produced by varying the hierarchical decision-making of the ethnicity algorithm. The CPRD population was representative of the English population in terms of ethnicity. While researchers should remain vigilant of the limitations of using these data, the CPRD Ethnicity Records provide a standardised and pragmatic approach to ascertaining ethnicity for future research. |
---|---|
ISSN: | 2044-6055 |
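
The summary compares algorithm variants (for example, prioritising the most frequent versus the most recent record) using Cohen's kappa. The following is a minimal sketch of that kind of comparison, not the authors' implementation. The patient records, category labels, function names and the tie-breaking rule (ties on frequency go to the most recently recorded category) are all hypothetical assumptions introduced for illustration.

```python
from collections import Counter


def most_frequent(records):
    """Pick the category recorded most often.

    records: list of (iso_date, category) tuples for one patient.
    Assumption: ties on frequency are broken by the latest record date.
    """
    counts = Counter(cat for _, cat in records)
    latest = {}
    for date, cat in records:
        latest[cat] = max(latest.get(cat, date), date)
    return max(counts, key=lambda c: (counts[c], latest[c]))


def most_recent(records):
    """Pick the category on the latest-dated record (ISO dates sort as text)."""
    return max(records, key=lambda r: r[0])[1]


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of category labels."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of marginal proportions, summed over categories.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both variants assign a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: four patients with dated ethnicity records.
patients = {
    "p1": [("2010-01-01", "White"), ("2015-01-01", "White"), ("2020-01-01", "Other")],
    "p2": [("2012-01-01", "Asian")],
    "p3": [("2011-01-01", "Black"), ("2019-01-01", "Black")],
    "p4": [("2013-05-01", "Mixed"), ("2014-05-01", "Asian"), ("2016-05-01", "Asian")],
}

freq = [most_frequent(r) for r in patients.values()]
recent = [most_recent(r) for r in patients.values()]
kappa = cohens_kappa(freq, recent)
print(f"kappa = {kappa:.4f}")
```

In this toy data the two variants disagree only for `p1`, so kappa falls well below 1 because the sample is tiny; in a cohort of millions where most patients have consistent records, the same calculation can yield values close to 1, as the summary reports.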