Table 5 Statistical data concerning the de-identification dataset

From: Automatic de-identification of French electronic health records: a cost-effective approach exploiting distant supervision and deep learning models

| Data split/annotation method | #Sentences | #DOCTOR | #PATIENT | #DATE | #VILLE | #ZIP | #STR | #EMAIL | #PHONE |
|---|---|---|---|---|---|---|---|---|---|
| training/automatic | 4,948,186 | 3,883,360 | 1,853,646 | 4,948,519 | 2,544,287 | 1,305,402 | 1,165,009 | 276,208 | 2,210,577 |
| validation/automatic | 608,305 | 479,821 | 229,925 | 607,383 | 315,771 | 162,791 | 144,654 | 35,288 | 271,081 |
| test/automatic | 620,581 | 489,008 | 232,030 | 620,028 | 322,279 | 165,492 | 147,432 | 35,322 | 276,873 |
| test/manual | 23,196 | 1,206 | 510 | 2,078 | 764 | 293 | 234 | 96 | 545 |