Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models

Table 5 Statistics of the de-identification dataset: number of sentences and of annotated entities per type, by data split and annotation method

Data split/annotation method	#Sentences	#DOCTOR	#PATIENT	#DATE	#VILLE	#ZIP	#STR	#EMAIL	#PHONE
training/automatic	4,948,186	3,883,360	1,853,646	4,948,519	2,544,287	1,305,402	1,165,009	276,208	2,210,577
validation/automatic	608,305	479,821	229,925	607,383	315,771	162,791	144,654	35,288	271,081
test/automatic	620,581	489,008	232,030	620,028	322,279	165,492	147,432	35,322	276,873
test/manual	23,196	1,206	510	2,078	764	293	234	96	545

ISSN: 1472-6947