
Table 2 Descriptions of the pre-processing steps for English eligibility criteria sentences

From: Semantic categorization of Chinese eligibility criteria in clinical trials using machine learning methods

Pre-processing step for English eligibility criteria sentences

Descriptions

Delete ordinal numbers

Ordinal numbers appear in many forms (e.g., “1.”, “”, “(1)”); they were deleted using regular expressions

Replace the ASCII code

We replaced the ASCII code with a format that MetaMap can handle, using a set of rules

Lemmatization

Lemmatization groups together the different inflected forms of a word so that they can be analyzed as a single canonical form. We performed it with the Python package NLTK

Replace abbreviations

We replaced each abbreviation with its full spelling, based on a dictionary

Delete number, operator, and unit symbols

The varied formats of numbers, operators, and units sometimes interfere with the output of MetaMap; they were deleted using regular expressions
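
The five steps in the table can be sketched as a small Python pipeline. This is only an illustration under stated assumptions, not the authors' implementation: the table does not give the actual regular expressions, replacement rules, or abbreviation dictionary, so `CHAR_RULES`, `ABBREVIATIONS`, `ORDINAL_RE`, `QUANTITY_RE`, and the function names are hypothetical. The sketch also assumes that the "ASCII code" step means mapping symbols MetaMap may not accept onto plain-ASCII equivalents. The lemmatizer is passed in as a callable so the example runs without extra data; with NLTK installed and its WordNet data downloaded, `nltk.stem.WordNetLemmatizer().lemmatize` could be supplied instead.

```python
import re

# Hypothetical rule tables: the source does not list the real rules or dictionary.
CHAR_RULES = {"≥": ">=", "≤": "<=", "μ": "u"}
ABBREVIATIONS = {"HBV": "hepatitis B virus", "BMI": "body mass index"}

# Leading ordinal markers such as "1.", "(1)" or "1)" (assumed patterns).
ORDINAL_RE = re.compile(r"^\s*(?:\(\d+\)|\d+[.)])\s*")

# A number, with an optional comparison operator before it and an
# optional unit after it (the unit list here is a small assumed sample).
QUANTITY_RE = re.compile(
    r"(?:>=|<=|[<>=])?\s*\d+(?:\.\d+)?\s*(?:kg/m2|mg/dl|mmhg|years?|%)?",
    re.IGNORECASE,
)


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace whole-word abbreviations with their full spelling."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)


def preprocess(sentence, lemmatize=lambda word: word):
    """Apply the five table steps in order; `lemmatize` maps one token to its lemma."""
    # 1. Delete the leading ordinal number.
    text = ORDINAL_RE.sub("", sentence)
    # 2. Rule-based character replacement.
    for source, target in CHAR_RULES.items():
        text = text.replace(source, target)
    # 3. Lemmatize token by token (identity function by default).
    text = " ".join(lemmatize(token) for token in text.split())
    # 4. Expand abbreviations from the dictionary.
    text = expand_abbreviations(text)
    # 5. Delete numbers together with adjacent operators and units.
    text = QUANTITY_RE.sub(" ", text)
    # Collapse the whitespace left behind by the deletions.
    return " ".join(text.split())


print(preprocess("(1) BMI ≥ 18 kg/m2"))  # body mass index
```

Running the steps in the table's order means the operator introduced in step 2 is removed again in step 5, so the string passed on to MetaMap contains only words. A real rule set would need a far larger unit list and more ordinal formats than this sample covers.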