Improved de-identification of physician notes through integrative modeling of both public and private medical text

Table 2 Complete list of all 28 features annotated by the NLP pipeline

Lexical	Frequency	Medical dictionary	Known PHI
Part of Speech	Term Frequency (Token)	# matches HL7 2.5	# matches US Census Names
Part of Speech (Binned)	Term Frequency (Token, Part of Speech)	# matches HL7 3.0
Capitalization		# matches ICD9 CM	# matches for pattern HOSPITAL
Word or Number		# matches ICD10 CM	# matches for pattern AGE
Length		# matches ICD10 PCS	# matches for pattern DATE
		# matches LOINC	# matches for pattern DOCTOR
		# matches MESH	# matches for pattern LOCATION
		# matches RXNORM	# matches for pattern PATIENT
		# matches SNOMED	# matches for pattern ID
		# matches COSTAR	# matches for pattern PHONE
		# consecutive tokens any dictionary	# consecutive tokens any pattern

In the lexical phase, part of speech and capitalization usage are annotated for each word token. In the frequency phase, each word is annotated with its frequency of appearance in public and private medical texts. In the dictionary phase, each word is compared to lists of standard medical concepts in UMLS sources. In the Known PHI phase, tokens and phrases are compared against suspicious patterns of HIPAA identifiers.
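
The four annotation phases can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the dictionaries, term counts, census names, and regular expressions below are toy placeholders chosen for illustration, and the function names `capitalization` and `annotate` are hypothetical. Part-of-speech tagging and the consecutive-token features are omitted because they would need an external tagger and sentence context.

```python
import re
from collections import Counter

# Toy stand-ins (assumptions). Real inputs would be UMLS source vocabularies,
# term counts from public and private corpora, and the study's own PHI patterns.
TOY_DICTIONARIES = {
    "ICD9 CM": {"diabetes", "asthma"},
    "SNOMED": {"diabetes", "fever"},
}
TOY_TERM_COUNTS = Counter({"patient": 120, "diabetes": 45, "fever": 30})
TOY_CENSUS_NAMES = {"smith", "jones"}
TOY_PHI_PATTERNS = {
    "DATE": re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}"),
    "PHONE": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def capitalization(token):
    """Lexical phase: classify the capitalization pattern of a token."""
    if not any(ch.isalpha() for ch in token):
        return "none"
    if token.isupper():
        return "upper"
    if token.istitle():
        return "title"
    if token.islower():
        return "lower"
    return "mixed"


def annotate(token):
    """Return a feature dictionary for one token (hypothetical helper)."""
    lower = token.lower()
    features = {
        # Lexical phase (part of speech omitted in this sketch)
        "capitalization": capitalization(token),
        "word_or_number": "number" if token.isdigit() else "word",
        "length": len(token),
        # Frequency phase: a Counter returns 0 for unseen terms
        "term_frequency": TOY_TERM_COUNTS[lower],
    }
    # Dictionary phase: one match indicator per medical vocabulary
    for name, terms in TOY_DICTIONARIES.items():
        features[f"matches {name}"] = int(lower in terms)
    # Known PHI phase: name list lookup plus pattern matching
    features["matches US Census Names"] = int(lower in TOY_CENSUS_NAMES)
    for name, pattern in TOY_PHI_PATTERNS.items():
        features[f"matches pattern {name}"] = int(pattern.fullmatch(token) is not None)
    return features


tokens = "Smith seen 12/03/2015 for diabetes".split()
annotated = [annotate(token) for token in tokens]
```

In this sketch each token yields one feature dictionary, so a surname such as "Smith" is flagged by the census name lookup while "diabetes" is flagged only by the medical dictionaries. Features of this kind could then be passed to a classifier that separates PHI from non-PHI tokens.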

ISSN: 1472-6947