Customization scenarios for de-identification of clinical notes

Table 1 Descriptive statistics for the datasets. The mimic3-echo dataset does not contain enough PHI to train a model and is therefore used for testing only. We report Name, Date, and Location to illustrate how the frequency of PHI types varies across the datasets.

Dataset	Note source	# of patients	# of notes	Train/Test partition by note	Total tokens	Total PHIs	% NAME	% DATE	% LOCATION
i2b2-2014	diabetic longitudinal records	296	1304	61% / 39%	758 k	28.8 k	24.2%	43.3%	15.2%
i2b2-2006	discharge notes	889	889	75% / 25%	487 k	19.5 k	24.0%	36.4%	13.7%
physionet	nursing notes	163	2434	59% / 41%	345 k	1.9 k	32.5%	29.7%	25.9%
mimic3-radiology	radiology notes	1000	1000	50% / 50%	205 k	4.1 k	10.2%	44.8%	1.8%
mimic3-echo	echocardiogram notes	1000	1000	Test only	276 k	2.5 k	9.7%	88.7%	1.1%
mimic3-discharge	discharge notes	1000	1000	81% / 19%	128 k	40.8 k	21.2%	61.1%	9.9%
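
The percentages in Table 1 can be easier to compare once they are converted to approximate counts and to PHI density per 1,000 tokens. The following is a minimal sketch of that arithmetic, not part of the original study. The dictionary `TABLE1` and the helpers `phi_per_1k_tokens`, `approx_count`, and `summarize` are hypothetical names introduced here for illustration; the values are the rounded figures copied from the table.

```python
# Rounded values copied from Table 1:
# dataset -> (total tokens, total PHIs, % NAME, % DATE, % LOCATION)
TABLE1 = {
    "i2b2-2014":        (758_000, 28_800, 24.2, 43.3, 15.2),
    "i2b2-2006":        (487_000, 19_500, 24.0, 36.4, 13.7),
    "physionet":        (345_000,  1_900, 32.5, 29.7, 25.9),
    "mimic3-radiology": (205_000,  4_100, 10.2, 44.8,  1.8),
    "mimic3-echo":      (276_000,  2_500,  9.7, 88.7,  1.1),
    "mimic3-discharge": (128_000, 40_800, 21.2, 61.1,  9.9),
}


def phi_per_1k_tokens(tokens, phis):
    """PHI density: number of PHI instances per 1,000 tokens."""
    return 1000 * phis / tokens


def approx_count(total_phis, percent):
    """Approximate number of PHI instances of one type (unrounded)."""
    return total_phis * percent / 100


def summarize(table=TABLE1):
    """Return per-dataset PHI density and approximate per-type counts."""
    summary = {}
    for name, (tokens, phis, pct_name, pct_date, pct_loc) in table.items():
        summary[name] = {
            "phi_per_1k": phi_per_1k_tokens(tokens, phis),
            "name": approx_count(phis, pct_name),
            "date": approx_count(phis, pct_date),
            "location": approx_count(phis, pct_loc),
        }
    return summary


if __name__ == "__main__":
    for dataset, row in summarize().items():
        print(
            f"{dataset:18s} PHI/1k tokens: {row['phi_per_1k']:6.1f}  "
            f"NAME ~{row['name']:.0f}  DATE ~{row['date']:.0f}  "
            f"LOCATION ~{row['location']:.0f}"
        )
```

Because the token and PHI totals in the table are themselves rounded, the derived counts are approximate and are suitable only for rough comparison between datasets.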

ISSN: 1472-6947