BMC Medical Informatics and Decision Making

Table 1 Overview of the datasets

From: Leveraging text skeleton for de-identification of electronic medical records

Statistic	i2b2–2006	i2b2–2014	Chinese
Number of records	669	1304	9700
Number of tokens	560,852	1,005,582	3,026,944
Number of PHIs	19,498	28,862	48,072
Number of PHI tokens	29,917	38,435	137,496
Vocabulary size	20,254	41,879	32,265
Percentage of ID	24.6%	3.6%	8.8%
Percentage of DATE	36.4%	43.2%	38.9%
Percentage of HOSPITAL	12.3%	8.0%	2.2%
Percentage of DOCTOR	19.2%	16.6%	14.7%
Percentage of PATIENT	4.7%	7.6%	17.3%
Percentage of AGE	0.1%	6.9%	16.1%
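
The raw counts above can be turned into a few derived ratios that make the three datasets easier to compare. The following is a minimal sketch, not part of the original article: the names `DATASETS`, `derived_stats`, and `STATS`, as well as the choice of ratios, are assumptions made for illustration, and only the four count rows of Table 1 are used as input.

```python
# Counts copied from Table 1. The dictionary layout and all names below
# are illustrative assumptions, not taken from the article.
DATASETS = {
    "i2b2–2006": {"records": 669, "tokens": 560_852, "phis": 19_498, "phi_tokens": 29_917},
    "i2b2–2014": {"records": 1304, "tokens": 1_005_582, "phis": 28_862, "phi_tokens": 38_435},
    "Chinese": {"records": 9700, "tokens": 3_026_944, "phis": 48_072, "phi_tokens": 137_496},
}


def derived_stats(counts):
    """Return simple ratios computed from one dataset's raw counts."""
    return {
        # Average record length in tokens.
        "tokens_per_record": counts["tokens"] / counts["records"],
        # Average number of PHI mentions per record.
        "phis_per_record": counts["phis"] / counts["records"],
        # Average length of one PHI mention in tokens.
        "tokens_per_phi": counts["phi_tokens"] / counts["phis"],
        # Fraction of all tokens that belong to a PHI mention.
        "phi_token_share": counts["phi_tokens"] / counts["tokens"],
    }


STATS = {name: derived_stats(counts) for name, counts in DATASETS.items()}

for name, stats in STATS.items():
    print(name)
    for key, value in stats.items():
        print(f"  {key}: {value:.4f}")
```

Under these definitions, PHI tokens make up only a small fraction of each corpus, and PHI mentions in the Chinese dataset span more tokens on average than those in the two i2b2 datasets, which may partly reflect tokenization differences.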

ISSN: 1472-6947
