Table 2 Complete list of all 28 features annotated by the NLP pipeline

From: Improved de-identification of physician notes through integrative modeling of both public and private medical text

| Lexical | Frequency | Medical dictionary | Known PHI |
| --- | --- | --- | --- |
| Part of Speech | Term Frequency (Token) | # matches HL7 2.5 | # matches US Census Names |
| Part of Speech (Binned) | Term Frequency (Token, Part of Speech) | # matches HL7 3.0 | |
| Capitalization | | # matches ICD9 CM | # matches for pattern HOSPITAL |
| Word or Number | | # matches ICD10 CM | # matches for pattern AGE |
| Length | | # matches ICD10 PCS | # matches for pattern DATE |
| | | # matches LOINC | # matches for pattern DOCTOR |
| | | # matches MESH | # matches for pattern LOCATION |
| | | # matches RXNORM | # matches for pattern PATIENT |
| | | # matches SNOMED | # matches for pattern ID |
| | | # matches COSTAR | # matches for pattern PHONE |
| | | # consecutive tokens any dictionary | # consecutive tokens any pattern |

  1. In the lexical phase, part of speech and capitalization usage are annotated for each word token. In the frequency phase, each word is annotated with its frequency of appearance in public and private medical texts. In the dictionary phase, each word is compared against a list of standard medical concepts from UMLS sources. In the known PHI phase, tokens and phrases are compared against patterns that suggest HIPAA identifiers.
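
The four annotation phases can be pictured with a minimal Python sketch. It is an illustration only, not the authors' pipeline: the toy dictionaries, the name list, the regular expressions, and every function and feature name are assumptions made for this example. Part-of-speech tagging and the consecutive-token counts are left out because they need a tagger and phrase-level context.

```python
import re
from collections import Counter

# Hypothetical toy resources. The pipeline described in the table draws on
# UMLS source vocabularies and US Census names; these stand-ins are assumptions.
DICTIONARIES = {
    "ICD10_CM": {"diabetes", "hypertension"},
    "RXNORM": {"metformin", "lisinopril"},
}
CENSUS_NAMES = {"smith", "johnson"}
PHI_PATTERNS = {
    "DATE": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "PHONE": re.compile(r"^\d{3}-\d{3}-\d{4}$"),
}


def capitalization(token):
    """Coarse capitalization class for a single token."""
    if token.isupper():
        return "ALL_CAPS"
    if token[:1].isupper():
        return "INITIAL_CAP"
    return "NONE"


def annotate(tokens, corpus_counts):
    """Return one feature dictionary per token."""
    rows = []
    for tok in tokens:
        low = tok.lower()
        row = {
            "token": tok,
            # Lexical phase (part of speech omitted in this sketch)
            "capitalization": capitalization(tok),
            "word_or_number": "NUMBER" if any(c.isdigit() for c in tok) else "WORD",
            "length": len(tok),
            # Frequency phase: count of the token in a reference corpus
            "term_frequency": corpus_counts[low],
            # Known PHI phase: name-list lookup
            "census_name_matches": int(low in CENSUS_NAMES),
        }
        # Dictionary phase: one match indicator per medical vocabulary
        for name, vocab in DICTIONARIES.items():
            row["matches_" + name] = int(low in vocab)
        # Known PHI phase: one match indicator per pattern
        for name, pattern in PHI_PATTERNS.items():
            row["pattern_" + name] = int(bool(pattern.match(tok)))
        rows.append(row)
    return rows


corpus_counts = Counter(
    "patient smith takes metformin for diabetes patient stable".split()
)
tokens = ["Smith", "takes", "Metformin", "since", "12/03/2014"]
features = annotate(tokens, corpus_counts)
```

Each entry of `features` is a flat dictionary, so the rows can be passed to a classifier as one feature vector per token. A real system would replace the toy sets with the full vocabularies listed in the table.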