Table 1. Descriptive statistics for the datasets. mimic3-echo does not contain enough PHI to train a model and is therefore used for testing only. NAME, DATE, and LOCATION are shown to illustrate how the frequency of PHI types varies across the datasets.

From: Customization scenarios for de-identification of clinical notes

| Dataset | Note source | # of patients | # of notes | Train/Test partition by note | Total tokens | Total PHIs | % NAME | % DATE | % LOCATION |
|---|---|---|---|---|---|---|---|---|---|
| i2b2-2014 | diabetic longitudinal records | 296 | 1304 | 61% / 39% | 758 k | 28.8 k | 24.2% | 43.3% | 15.2% |
| i2b2-2006 | discharge notes | 889 | 889 | 75% / 25% | 487 k | 19.5 k | 24.0% | 36.4% | 13.7% |
| physionet | nursing notes | 163 | 2434 | 59% / 41% | 345 k | 1.9 k | 32.5% | 29.7% | 25.9% |
| mimic3-radiology | radiology notes | 1000 | 1000 | 50% / 50% | 205 k | 4.1 k | 10.2% | 44.8% | 1.8% |
| mimic3-echo | echocardiogram notes | 1000 | 1000 | Test only | 276 k | 2.5 k | 9.7% | 88.7% | 1.1% |
| mimic3-discharge | discharge notes | 1000 | 1000 | 81% / 19% | 128 k | 40.8 k | 21.2% | 61.1% | 9.9% |
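
As a rough illustration of how the percentage columns relate to the PHI totals, the sketch below converts each dataset's NAME, DATE, and LOCATION shares into approximate counts. The values are transcribed from Table 1. The names `TABLE_1`, `approx_counts`, and `SUMMARY` are illustrative and do not come from the paper, and because the totals are reported only to the nearest 0.1 k, the results are estimates rather than exact counts.

```python
# Values transcribed from Table 1. PHI totals are rounded to the nearest
# 0.1 k in the source, so every derived count below is an approximation.
TABLE_1 = {
    # dataset: (total_phis, pct_name, pct_date, pct_location)
    "i2b2-2014": (28_800, 24.2, 43.3, 15.2),
    "i2b2-2006": (19_500, 24.0, 36.4, 13.7),
    "physionet": (1_900, 32.5, 29.7, 25.9),
    "mimic3-radiology": (4_100, 10.2, 44.8, 1.8),
    "mimic3-echo": (2_500, 9.7, 88.7, 1.1),
    "mimic3-discharge": (40_800, 21.2, 61.1, 9.9),
}


def approx_counts(total_phis, pct_name, pct_date, pct_location):
    """Estimate per-type PHI counts from a total and percentage shares."""
    shares = {"NAME": pct_name, "DATE": pct_date, "LOCATION": pct_location}
    counts = {
        phi_type: round(total_phis * pct / 100)
        for phi_type, pct in shares.items()
    }
    # PHI types not broken out in Table 1 make up the remaining share.
    counts["OTHER"] = total_phis - sum(counts.values())
    return counts


# Hypothetical helper structure: one estimate dictionary per dataset.
SUMMARY = {name: approx_counts(*row) for name, row in TABLE_1.items()}

if __name__ == "__main__":
    for name, counts in SUMMARY.items():
        print(f"{name}: {counts}")
```

Comparing the estimated counts, rather than the percentages alone, shows how many instances of each type a dataset contributes, which can differ widely between datasets with similar PHI totals.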