Table 4 Feature sets for CRF

From: Mining FDA drug labels for medical conditions

Token features

Current token features

The original, lowercase, and stemmed forms of the current token.

Tokens in a 5-token window

The previous two tokens and the next two tokens in their original form.

Bigram of current token

The current token bigram and the previous token bigram.

Linguistic features

POS features

The part-of-speech (POS) tags of the tokens in a 5-token window: the current token, the previous two tokens, and the next two tokens.

Initial capital features

Features indicating whether each token in the 5-token window (the current token, the previous two tokens, and the next two tokens) begins with an uppercase letter.

Number or not features

Features indicating whether the current token is numeric, alphabetic, or mixed.

Capital feature

A feature indicating whether the current token is entirely uppercase or contains a mix of uppercase and lowercase characters.

Prefix and suffix

The prefix and suffix of the current token (its first two and last two characters, respectively).

Token length

The character length of the current token.

Semantic features

CUI

The UMLS Concept Unique Identifier (CUI) of the current token, assigned by cTAKES using a dictionary-based method.

TUI

The Type Unique Identifier (TUI) of the current token, assigned by cTAKES, which encodes the token's UMLS semantic type.

Note: The features in bold are taken directly from the cTAKES output; the remaining features are generated or modified by custom-developed processes.
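
As an illustration of how the token and linguistic features in this table might be assembled for one token, the following Python sketch builds a feature dictionary from a tokenized sentence and its POS tags. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, feature keys, the `<PAD>` placeholder for positions outside the sentence, the toy `naive_stem` stemmer, and the reading of the two bigram features as (previous, current) and (current, next) are all hypothetical choices made for the example. The semantic features (CUI and TUI) are omitted because they depend on cTAKES output.

```python
PAD = "<PAD>"  # assumed placeholder for positions outside the sentence


def naive_stem(token):
    """Toy stand-in for a real stemmer: strip a few common English suffixes."""
    lowered = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


def at(seq, i):
    """Return seq[i], or PAD when i falls outside the sequence."""
    return seq[i] if 0 <= i < len(seq) else PAD


def char_class(token):
    """'Number or not' feature: numeric, alphabetic, or mixed."""
    if token.isdigit():
        return "numeric"
    if token.isalpha():
        return "alphabetic"
    return "mixed"


def capital_class(token):
    """'Capital' feature: all uppercase, partly uppercase, or neither."""
    if token.isupper():
        return "all_caps"
    if any(ch.isupper() for ch in token):
        return "mixed_caps"
    return "none"


def token_features(tokens, pos_tags, i):
    """Build the token and linguistic features for the token at index i."""
    token = tokens[i]
    feats = {
        "token": token,
        "lower": token.lower(),
        "stem": naive_stem(token),
        # Assumed interpretation of the two bigram features.
        "bigram_prev": at(tokens, i - 1) + " " + token,
        "bigram_next": token + " " + at(tokens, i + 1),
        "char_class": char_class(token),
        "capital_class": capital_class(token),
        "prefix2": token[:2],
        "suffix2": token[-2:],
        "length": len(token),
    }
    # 5-token window: two tokens before, the current token, two tokens after.
    for offset in range(-2, 3):
        neighbor = at(tokens, i + offset)
        feats[f"pos[{offset}]"] = at(pos_tags, i + offset)
        feats[f"init_cap[{offset}]"] = neighbor[:1].isupper()
        if offset != 0:
            feats[f"token[{offset}]"] = neighbor
    return feats


# Example sentence with hand-assigned POS tags (illustrative only).
tokens = ["Indicated", "for", "hypertension", "in", "adults"]
pos_tags = ["VBN", "IN", "NN", "IN", "NNS"]
feats = token_features(tokens, pos_tags, 2)
```

Returning one dictionary per token keeps each table row visible as a named key, which makes it straightforward to switch individual feature groups on or off when comparing feature sets.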