From: Customization scenarios for de-identification of clinical notes
Dataset | Note source | # of patients | # of notes | Train/Test partition by note | Total tokens | Total PHIs | % NAME | % DATE | % LOCATION |
---|---|---|---|---|---|---|---|---|---|
i2b2-2014 | diabetic longitudinal records | 296 | 1304 | 61% / 39% | 758 k | 28.8 k | 24.2% | 43.3% | 15.2% |
i2b2-2006 | discharge notes | 889 | 889 | 75% / 25% | 487 k | 19.5 k | 24.0% | 36.4% | 13.7% |
physionet | nursing notes | 163 | 2434 | 59% / 41% | 345 k | 1.9 k | 32.5% | 29.7% | 25.9% |
mimic3-radiology | radiology notes | 1000 | 1000 | 50% / 50% | 205 k | 4.1 k | 10.2% | 44.8% | 1.8% |
mimic3-echo | echocardiogram notes | 1000 | 1000 | Test only | 276 k | 2.5 k | 9.7% | 88.7% | 1.1% |
mimic3-discharge | discharge notes | 1000 | 1000 | 81% / 19% | 128 k | 40.8 k | 21.2% | 61.1% | 9.9% |
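
The three category columns do not sum to 100%, so each dataset also contains PHI of other types. The sketch below derives that remainder and approximate per-category counts from the table. The names `ROWS` and `summarize` are hypothetical, not from the source. The counts are only estimates, since the table reports rounded totals and percentages.

```python
# Values copied from the table above:
# (dataset, total PHIs in thousands, % NAME, % DATE, % LOCATION)
ROWS = [
    ("i2b2-2014", 28.8, 24.2, 43.3, 15.2),
    ("i2b2-2006", 19.5, 24.0, 36.4, 13.7),
    ("physionet", 1.9, 32.5, 29.7, 25.9),
    ("mimic3-radiology", 4.1, 10.2, 44.8, 1.8),
    ("mimic3-echo", 2.5, 9.7, 88.7, 1.1),
    ("mimic3-discharge", 40.8, 21.2, 61.1, 9.9),
]


def summarize(rows):
    """Return approximate per-category PHI counts and the share of other PHI types."""
    summary = {}
    for dataset, phis_k, pct_name, pct_date, pct_location in rows:
        total_phis = phis_k * 1000  # the table reports totals in thousands
        summary[dataset] = {
            "approx_name": round(total_phis * pct_name / 100),
            "approx_date": round(total_phis * pct_date / 100),
            "approx_location": round(total_phis * pct_location / 100),
            # Share of PHI not covered by the three listed categories.
            "pct_other": round(100 - (pct_name + pct_date + pct_location), 1),
        }
    return summary


summary = summarize(ROWS)
for dataset, stats in summary.items():
    print(dataset, stats)
```

Read this way, the datasets differ mainly in category mix: in mimic3-echo nearly all PHI are dates, while physionet spreads PHI more evenly across names, dates, and locations.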