Figure 5 | BMC Medical Informatics and Decision Making

Figure 5

From: Improved de-identification of physician notes through integrative modeling of both public and private medical text

Term frequency distributions in PHI and non-PHI word tokens. In each of the four histograms, the log-normalized term frequency (x-axis) is plotted against the percentage of word tokens (y-axis). PHI words (red) are more common on the left-hand side of each histogram, showing that PHI words tend to be rarer than non-PHI words (blue). The top panels (a) and (b) show training data; the bottom panels (c) and (d) show testing data. The training and testing histograms are characteristically similar. The left-hand histograms (a) and (c) refer to words matched according to their part of speech; the right-hand histograms (b) and (d) refer to raw word matches.
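
As a rough illustration of the kind of histogram the caption describes, the sketch below computes a log-normalized term frequency per token and bins PHI and non-PHI tokens separately as percentages. The caption does not define the normalization, so the code assumes log10 of a token's relative frequency within a single token list. The function names, the toy tokens and labels, and the bin edges are all hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def log_normalized_tf(corpus_tokens):
    """Map each token to log10(count / total tokens).

    Assumed definition: the caption does not say how the frequency is normalized.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: math.log10(c / total) for tok, c in counts.items()}


def percent_histogram(values, bin_edges):
    """Percentage of values in each [edge_i, edge_{i+1}) bin.

    The last bin also includes its right edge; values outside all bins are
    ignored in the counts but still count toward the denominator.
    """
    bins = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(bins)):
            is_last = i == len(bins) - 1
            in_bin = bin_edges[i] <= v < bin_edges[i + 1]
            if in_bin or (is_last and v == bin_edges[-1]):
                bins[i] += 1
                break
    n = len(values)
    return [100.0 * b / n if n else 0.0 for b in bins]


# Hypothetical toy data: (token, is_phi) pairs.
labeled = [
    ("patient", False), ("smith", True), ("the", False), ("the", False),
    ("patient", False), ("boston", True), ("the", False), ("seen", False),
]

tf = log_normalized_tf([tok for tok, _ in labeled])
phi_vals = [tf[tok] for tok, is_phi in labeled if is_phi]
non_phi_vals = [tf[tok] for tok, is_phi in labeled if not is_phi]

# Hypothetical bin edges on the log10 relative-frequency axis.
edges = [-1.0, -0.75, -0.5, -0.25, 0.0]
phi_hist = percent_histogram(phi_vals, edges)
non_phi_hist = percent_histogram(non_phi_vals, edges)
```

In this toy data both PHI tokens occur only once, so they fall entirely into the leftmost (rarest) bin, while the non-PHI tokens spread toward higher frequencies. That is the qualitative pattern the caption reports, although real data would need a proper reference corpus and finer bins.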
