Table 4 Feature sets for CRF

From: Mining FDA drug labels for medical conditions

Token features
Current token features: The original, lowercase, and stemmed forms of the current token.
Tokens in a 5-token window: The previous two tokens and the next two tokens, in their original form.
Bigram of current token: The bigram of the current token and the bigram of the previous token.
Linguistic features
POS features: The part-of-speech (POS) tags of the tokens in a 5-token window: the current token, the previous two tokens, and the next two tokens.
Initial capital features: Whether each token in the 5-token window (the current token, the previous two tokens, and the next two tokens) begins with an upper-case letter.
Number or not features: Whether the current token is numeric, alphabetic, or mixed.
Capital feature: Whether the current token is entirely upper-case or contains a mix of upper-case and other characters.
Prefix and suffix: The prefix and suffix of the current token (its first two and last two characters).
Token length: The length of the current token in characters.
Semantic features
CUI: The CUI (Concept Unique Identifier) code of the current token, assigned by cTAKES using a dictionary-based method.
TUI: The TUI (Type Unique Identifier) code of the current token, assigned by cTAKES, which provides the semantic type information contained in the UMLS Metathesaurus.
Note: The features in bold are taken directly from the cTAKES output; the remaining features are generated or modified by custom-developed processes.
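
To make the token and linguistic feature sets in the table more concrete, here is a minimal Python sketch of how such features might be computed for one token and collected into a dictionary, a common input format for Python CRF toolkits. It is not the authors' implementation. The function and key names, the `<PAD>` placeholder for out-of-sentence positions, and the reading of "bigram" as the token joined with its left neighbour are all assumptions. The POS tags are supplied by the caller, the stemmer is an optional hypothetical callable, and the CUI and TUI features are omitted because they come from cTAKES output.

```python
from typing import Callable, Dict, List, Optional

PAD = "<PAD>"  # assumed placeholder for positions outside the sentence


def char_class(token: str) -> str:
    """'Number or not': numeric, alphabetic, or mixed."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "alphabetic"
    return "mixed"


def case_pattern(token: str) -> str:
    """'Capital feature': all upper-case, partly upper-case, or neither."""
    if token.isupper():
        return "all_caps"
    if any(ch.isupper() for ch in token):
        return "mixed_caps"
    return "no_caps"


def token_features(
    tokens: List[str],
    pos_tags: List[str],
    i: int,
    stem: Optional[Callable[[str], str]] = None,  # hypothetical stemmer hook
) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""

    def at(seq: List[str], j: int) -> str:
        # Return the item at j, or PAD when j falls outside the sentence.
        return seq[j] if 0 <= j < len(seq) else PAD

    word = tokens[i]
    feats: Dict[str, object] = {
        "token": word,
        "lower": word.lower(),
        "char_class": char_class(word),
        "case": case_pattern(word),
        "prefix2": word[:2],   # first two characters
        "suffix2": word[-2:],  # last two characters
        "length": len(word),
        # Assumed bigram definition: a token joined with its left neighbour.
        "bigram[0]": f"{at(tokens, i - 1)} {word}",
        "bigram[-1]": f"{at(tokens, i - 2)} {at(tokens, i - 1)}",
    }
    if stem is not None:
        feats["stem"] = stem(word)

    # 5-token window: two tokens before, the current token, two tokens after.
    for offset in (-2, -1, 0, 1, 2):
        tok = at(tokens, i + offset)
        key = f"{offset:+d}"
        if offset != 0:
            feats[f"token[{key}]"] = tok
        feats[f"pos[{key}]"] = at(pos_tags, i + offset)
        feats[f"init_cap[{key}]"] = tok != PAD and tok[:1].isupper()
    return feats


# Usage with a made-up sentence and made-up POS tags.
tokens = ["Avoid", "use", "in", "patients", "with", "HIV", "infection"]
pos_tags = ["VB", "NN", "IN", "NNS", "IN", "NNP", "NN"]
feats = token_features(tokens, pos_tags, 5)
```

Applying `token_features` to every position in a sentence gives one dictionary per token. The semantic features could then be added as extra keys (for example `"cui"` and `"tui"`, hypothetical names) once the cTAKES annotations are available.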