Table 3 Characteristics of the source databases for validation. #, number. Data from anonymous queries of source databases prior to per-patient de-identified data extraction. ‡ Evidence of under-coding. ¶ Measured during tests where database queries and hashing were a separate step; for final analysis, data were extracted and hashed in a single step. † Indicates evidence of some coding errors (e.g. DOB in the future). * Categories combined for accuracy analysis. Abbreviations: “ < 10” small-group suppression applied; CTV3, Clinical Terms Version 3; ICD-10, World Health Organization International Classification of Diseases, tenth revision; k, thousand; MH, mental health; SD, standard deviation; y, year

From: De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation

Columns CDL, PCMIS, RiO and SystmOne are the four source databases.

| Property | CDL | PCMIS | RiO | SystmOne |
| --- | --- | --- | --- | --- |
| NATURE | | | | |
| Nature and dates | Secondary care MH services (> 10 k referrals/year from 1999–2012) | Psychological therapy services (> 1 k referrals/y, 2008–20, > 10 k referrals/y, 2015–) | Secondary care MH services (> 10 k referrals/y from 2012–21) | Community services (> 10 k referrals/y from 2007–); secondary care MH services (2020–). Live link to the NHS Spine, likely to improve validation of identifiers |
| Middle names extracted | None recorded | One | Up to four (aliases etc. also recorded but only “usual name” used here) | One (including some single-character names i.e. likely initials) |
| Postcode handling | Current | Current and previous | Dated history | Dated history |
| Principal coding system | ICD-10 | ICD-10 | ICD-10 | Read/CTV3 |
| SIZE | | | | |
| Total number of people | 162,874 | 120,966 | 216,739 | 619,062 |
| ‡ Number with no DOB | 0 | 0 | 352 | < 10 |
| Number included (valid NHS# + DOB) | 152,888 | 117,961 | 208,632 | 613,169 |
| † Duplicated NHS numbers: #records (#distinct NHS numbers duplicated) | 0 (0) | 6,356 (3,142) | 0 (0) | 0 (0) |
| SOFTWARE PERFORMANCE | | | | |
| ¶ Time to hash identity file (s) | 56 | 41 | 138 | 328 |
| Time to link to self (s) | 608 | 544 | 1259 | 5437 |
| Time to link to next (s) | → PCMIS, 605 | → RiO, 903 | → SystmOne, 2651 | → CDL, 1151 |
| DEMOGRAPHICS | | | | |
| Year of birth | | | | |
| † Range (years) | 1890–2012 | 1915–2049 | 1902–2021 | 1899–2022 |
| Mean ± SD (years) | 1963 ± 26 | 1979 ± 15 | 1973 ± 24 | 1974 ± 29 |
| Sex/gender | | | | |
| Female (%) | 55.4 | 63.9 | 55.4 | 54.1 |
| Male (%) | 44.6 | 35.5 | 44.6 | 45.9 |
| * Other (%) | 0.0007 | 0 | 0.03 | 0.004 |
| ‡* Unknown (%) | 0.003 | 0.6 | 0.01 | 0.02 |
| Ethnicity | | | | |
| Asian (%) | 0.86 | 2.47 | 1.56 | 3.12 |
| Black (%) | 0.37 | 0.90 | 0.84 | 0.88 |
| Mixed (%) | 0.41 | 1.97 | 1.42 | 1.02 |
| White (%) | 54.47 | 74.08 | 57.18 | 39.70 |
| Other (%) | 0.79 | 1.53 | 1.01 | 1.53 |
| ‡ Unknown (%) | 43.10 | 19.05 | 37.99 | 53.75 |
| Coded ICD-10 diagnoses | | | | |
| Severe mental illness (%) | 2.65 | 0.35 | 2.95 | 0.18 |
| MH (‘F’) code but no SMI (%) | 5.89 | 88.76 | 16.67 | 0.62 |
| * Code but no MH code (%) | 1.68 | 0.34 | 0.77 | 0.03 |
| * No ICD-10 codes (%) | 89.78 | 10.56 | 79.61 | 99.17 |
| Address information | | | | |
| Postcodes per person: mean (range) | 0.992 (0–1) | 0.972 (0–2) | 1.22 (0–20) | 0.998 (0–1) |
| Deprivation centile (0 least, 100 most) | | | | |
| Range | 0.027–100 | 0.12–99.7 | 0.015–100 | 0–100 |
| Mean ± SD | 43.5 ± 26.9 | 41.1 ± 26.2 | 44.7 ± 27.1 | 48.6 ± 27.4 |
| Unknown (%) | 0.9 | 6.9 | 1.1 | 2.1 |
| Age at first MH care | | | | |
| † Range (years) | 0–115 | −39–97 | −10–113 | −1–105 |
| Mean ± SD (years) | 43.5 ± 26.1 | 36.7 ± 14.9 | 39.1 ± 24.1 | 37.5 ± 21.9 |
| Unknown (%) | 0 | 0.0017 | 42.8 | 85.7 |
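
The "Number included (valid NHS# + DOB)" row implies a per-record inclusion filter. The table does not say how NHS number validity was determined, so the following is only a minimal sketch, assuming the standard NHS Modulus 11 check-digit rule and assuming that a record qualifies when its NHS number passes that check and a DOB is present. The function names `nhs_number_is_valid` and `is_included` are hypothetical and are not taken from the paper or its software.

```python
from datetime import date
from typing import Optional


def nhs_number_is_valid(nhs_number: str) -> bool:
    """Check a 10-digit NHS number against the Modulus 11 check-digit rule."""
    digits = nhs_number.replace(" ", "")  # tolerate the common "3 3 4" spacing
    if len(digits) != 10 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weight the first nine digits by 10, 9, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    if check == 10:
        return False  # a check value of 10 never corresponds to a valid number
    return check == int(digits[9])


def is_included(nhs_number: Optional[str], dob: Optional[date]) -> bool:
    """Assumed inclusion rule: valid NHS number and a recorded DOB."""
    if nhs_number is None or dob is None:
        return False
    return nhs_number_is_valid(nhs_number)
```

This sketch deliberately does not test whether a DOB is plausible; the table's † footnote notes that some coding errors (e.g. DOB in the future) were present in the source databases.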