Table 1 Overview of the datasets

From: Leveraging text skeleton for de-identification of electronic medical records

 

| Statistic | i2b2–2006 | i2b2–2014 | Chinese |
| --- | --- | --- | --- |
| Number of records | 669 | 1,304 | 9,700 |
| Number of tokens | 560,852 | 1,005,582 | 3,026,944 |
| Number of PHIs | 19,498 | 28,862 | 48,072 |
| Number of PHI tokens | 29,917 | 38,435 | 137,496 |
| Vocabulary size | 20,254 | 41,879 | 32,265 |
| Percentage of ID | 24.6% | 3.6% | 8.8% |
| Percentage of DATE | 36.4% | 43.2% | 38.9% |
| Percentage of HOSPITAL | 12.3% | 8.0% | 2.2% |
| Percentage of DOCTOR | 19.2% | 16.6% | 14.7% |
| Percentage of PATIENT | 4.7% | 7.6% | 17.3% |
| Percentage of AGE | 0.1% | 6.9% | 16.1% |