Token features | |
Current token features | The original, lowercase, and stemmed forms of the current token. |
Tokens in a 5-token window | The previous two tokens and the next two tokens, in their original form. |
Bigram of current token | The current token bigram and the previous token bigram. |
Linguistic features | |
POS features | The part-of-speech (POS) tags of the tokens in a 5-token window: the current token, the previous two tokens, and the next two tokens. |
Initial capital features | Features indicating whether each token in the 5-token window (the current token, the previous two tokens, and the next two tokens) begins with an uppercase letter. |
Number or not features | Features indicating whether the current token consists of digits only, letters only, or a mix of both. |
Capital feature | A feature indicating whether the current token is entirely uppercase or contains a mix of uppercase and lowercase characters. |
Prefix and suffix | The prefix and suffix of the current token (its first two and last two characters). |
Token length | The character length of the current token. |
Semantic features | |
CUI | The Concept Unique Identifier (CUI) of the current token, assigned by cTAKES using a dictionary-based method. |
TUI | The Type Unique Identifier (TUI) of the current token, assigned by cTAKES, which encodes the token's UMLS semantic type. |
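
As an illustration, the token-level and orthographic features in the table could be computed roughly as in the minimal Python sketch below. It is not the authors' implementation: the function names, feature keys, and the `<PAD>` placeholder for positions outside the sentence are hypothetical, and the two bigram features reflect only one possible reading of the bigram row. The stemmed form, POS tags, and CUI/TUI codes are omitted because they depend on external tools (a stemmer, a POS tagger, and cTAKES).

```python
def char_class(token):
    """Classify a token as 'digit', 'alpha', or 'mixed'."""
    if token.isdigit():
        return "digit"
    if token.isalpha():
        return "alpha"
    return "mixed"


def case_pattern(token):
    """Classify capitalization as 'all_caps', 'mixed_caps', or 'no_caps'."""
    if token.isupper():
        return "all_caps"
    if any(ch.isupper() for ch in token):
        return "mixed_caps"
    return "no_caps"


def token_features(tokens, i, pad="<PAD>"):
    """Build a feature dict for tokens[i] using a 5-token window.

    `pad` is a hypothetical placeholder for positions outside the sentence.
    """
    def at(j):
        return tokens[j] if 0 <= j < len(tokens) else pad

    tok = tokens[i]
    feats = {
        "tok": tok,                      # original form
        "lower": tok.lower(),            # lowercase form
        "prefix2": tok[:2],              # first two characters
        "suffix2": tok[-2:],             # last two characters
        "length": len(tok),              # character length
        "char_class": char_class(tok),   # digits / letters / mixed
        "case": case_pattern(tok),       # all caps / mixed caps / none
        # Assumed reading of the bigram row: the current token paired
        # with its left and right neighbours.
        "bigram_prev": at(i - 1) + "_" + tok,
        "bigram_next": tok + "_" + at(i + 1),
    }
    # Surrounding tokens in their original form.
    for off in (-2, -1, 1, 2):
        feats[f"tok[{off:+d}]"] = at(i + off)
    # Initial-capital flag for every position in the window.
    for off in (-2, -1, 0, 1, 2):
        feats[f"init_cap[{off:+d}]"] = at(i + off)[:1].isupper()
    return feats
```

Calling `token_features(["Patient", "took", "Aspirin", "81mg", "daily"], 2)` returns one dictionary for the token "Aspirin". A dictionary per token is the input shape that many sequence-labeling toolkits accept, which is why the sketch uses it.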