Table 1 Descriptive statistics for the datasets. The mimic3-echo dataset does not contain enough PHI to train a model and is thus used for testing only. The Name, Date, and Location columns are shown to illustrate how the frequency of PHI types varies across the datasets.

From: Customization scenarios for de-identification of clinical notes

| Dataset | Note source | # of patients | # of notes | Train/test partition (by note) | Total tokens | Total PHIs | % NAME | % DATE | % LOCATION |
|---|---|---|---|---|---|---|---|---|---|
| i2b2-2014 | diabetic longitudinal records | 296 | 1304 | 61% / 39% | 758 k | 28.8 k | 24.2% | 43.3% | 15.2% |
| i2b2-2006 | discharge notes | 889 | 889 | 75% / 25% | 487 k | 19.5 k | 24.0% | 36.4% | 13.7% |
| physionet | nursing notes | 163 | 2434 | 59% / 41% | 345 k | 1.9 k | 32.5% | 29.7% | 25.9% |
| mimic3-radiology | radiology notes | 1000 | 1000 | 50% / 50% | 205 k | 4.1 k | 10.2% | 44.8% | 1.8% |
| mimic3-echo | echocardiogram notes | 1000 | 1000 | Test only | 276 k | 2.5 k | 9.7% | 88.7% | 1.1% |
| mimic3-discharge | discharge notes | 1000 | 1000 | 81% / 19% | 128 k | 40.8 k | 21.2% | 61.1% | 9.9% |
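
As a reading aid for Table 1, the percentage columns can be converted into approximate absolute counts per PHI type. The sketch below is illustrative only: the dictionary `TABLE_1` and the helper `approx_counts` are hypothetical names introduced here, the values are copied from the table, and because the totals are rounded in the source (for example, 2.5 k), the derived counts are rough estimates rather than figures reported by the authors.

```python
# Values copied from Table 1. Totals are rounded in the source,
# so every derived count below is an approximation.
TABLE_1 = {
    # dataset: (total PHIs, % NAME, % DATE, % LOCATION)
    "i2b2-2014": (28_800, 24.2, 43.3, 15.2),
    "i2b2-2006": (19_500, 24.0, 36.4, 13.7),
    "physionet": (1_900, 32.5, 29.7, 25.9),
    "mimic3-radiology": (4_100, 10.2, 44.8, 1.8),
    "mimic3-echo": (2_500, 9.7, 88.7, 1.1),
    "mimic3-discharge": (40_800, 21.2, 61.1, 9.9),
}


def approx_counts(total_phi, pct_name, pct_date, pct_location):
    """Turn percentage shares into approximate absolute PHI counts."""
    shares = {"NAME": pct_name, "DATE": pct_date, "LOCATION": pct_location}
    return {phi_type: round(total_phi * pct / 100) for phi_type, pct in shares.items()}


# Print one line per dataset with the estimated count for each PHI type.
for dataset, row in TABLE_1.items():
    print(dataset, approx_counts(*row))
```

Under these assumptions, mimic3-echo comes out with only a few dozen LOCATION instances and a few hundred NAME instances, with DATE making up nearly all of the rest.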