Table 1 Overview of the datasets

From: Leveraging text skeleton for de-identification of electronic medical records

| Statistic | i2b2–2006 | i2b2–2014 | Chinese |
| --- | --- | --- | --- |
| Number of records | 669 | 1304 | 9700 |
| Number of tokens | 560,852 | 1,005,582 | 3,026,944 |
| Number of PHIs | 19,498 | 28,862 | 48,072 |
| Number of PHI tokens | 29,917 | 38,435 | 137,496 |
| Vocabulary size | 20,254 | 41,879 | 32,265 |
| Percentage of ID | 24.6% | 3.6% | 8.8% |
| Percentage of DATE | 36.4% | 43.2% | 38.9% |
| Percentage of HOSPITAL | 12.3% | 8.0% | 2.2% |
| Percentage of DOCTOR | 19.2% | 16.6% | 14.7% |
| Percentage of PATIENT | 4.7% | 7.6% | 17.3% |
| Percentage of AGE | 0.1% | 6.9% | 16.1% |
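
The raw counts in Table 1 are easier to compare across datasets as ratios. The following is a minimal Python sketch that derives three such ratios from the counts above: average tokens per record, the share of all tokens that belong to a PHI, and the average number of tokens per PHI. These derived quantities are not reported in the source. The dictionary layout, the ASCII dataset keys, and the names `datasets`, `derive`, and `derived` are assumptions made for illustration only.

```python
# Counts copied from Table 1. The dictionary layout and key names are
# illustrative assumptions, not part of the original paper.
datasets = {
    "i2b2-2006": {"records": 669, "tokens": 560_852, "phis": 19_498, "phi_tokens": 29_917},
    "i2b2-2014": {"records": 1304, "tokens": 1_005_582, "phis": 28_862, "phi_tokens": 38_435},
    "Chinese": {"records": 9700, "tokens": 3_026_944, "phis": 48_072, "phi_tokens": 137_496},
}


def derive(stats):
    """Return ratios computed from one dataset's raw counts."""
    return {
        # Average record length in tokens.
        "tokens_per_record": stats["tokens"] / stats["records"],
        # Fraction of all tokens that are part of a PHI mention.
        "phi_token_share": stats["phi_tokens"] / stats["tokens"],
        # Average length of a PHI mention in tokens.
        "tokens_per_phi": stats["phi_tokens"] / stats["phis"],
    }


derived = {name: derive(stats) for name, stats in datasets.items()}

for name, d in derived.items():
    print(
        f"{name}: {d['tokens_per_record']:.1f} tokens/record, "
        f"{d['phi_token_share']:.2%} PHI tokens, "
        f"{d['tokens_per_phi']:.2f} tokens/PHI"
    )
```

Computed this way, PHI tokens make up only a few percent of each corpus, and the Chinese records are shorter on average while their PHI mentions span more tokens. The second observation depends on how each corpus was tokenized, which the table does not specify.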