Table 2 Descriptions of the pre-processing steps for English eligibility criteria sentences

From: Semantic categorization of Chinese eligibility criteria in clinical trials using machine learning methods

| Pre-processing step (English eligibility criteria sentences) | Description |
| --- | --- |
| Delete ordinal numbers | Ordinal numbers appear in many forms (e.g., “1.”, “”, “(1)”) and were deleted by regular expression |
| Replace the ASCII code | ASCII codes were replaced, based on rules, with a format that MetaMap can handle |
| Lemmatization | Lemmatization groups together the different inflected forms of a word so that they can be analyzed as the word's canonical form. We performed it with the Python package NLTK |
| Replace abbreviations | Abbreviations were replaced with their fully spelled-out forms based on a dictionary |
| Delete symbols of number, operator and unit | The varied expression formats of numbers, operators and units sometimes interfere with the output of MetaMap, so they were deleted by regular expression |
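
The regular-expression and dictionary steps in the table can be illustrated with a minimal Python sketch. The source does not give its actual patterns, rules, or abbreviation dictionary, so every pattern, dictionary entry, and function name below is a hypothetical stand-in chosen for illustration. The ASCII replacement and NLTK lemmatization steps are left out because their rules and configuration are not described.

```python
import re

# Hypothetical pattern: a leading "1.", "1)" or "(1)" marker.
# The patterns actually used in the study are not given in the source.
ORDINAL_RE = re.compile(r"^\s*(?:\(\d+\)|\d+[.)])\s*")

# Hypothetical abbreviation dictionary (illustrative entries only).
ABBREVIATIONS = {
    "BMI": "body mass index",
    "ECG": "electrocardiogram",
}

# Hypothetical pattern: optional comparison operator, a number,
# and an optional unit drawn from a small illustrative unit list.
NUM_OP_UNIT_RE = re.compile(
    r"(?:[<>]=?|[=≤≥])?\s*\d+(?:\.\d+)?\s*(?:(?:years?|kg|mg|ml)\b|%)?"
)


def delete_ordinal(sentence: str) -> str:
    """Remove one leading ordinal marker, if present."""
    return ORDINAL_RE.sub("", sentence, count=1)


def expand_abbreviations(sentence: str, table=ABBREVIATIONS) -> str:
    """Replace whole-word abbreviations with their full spelling."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], sentence)


def delete_number_operator_unit(sentence: str) -> str:
    """Delete number/operator/unit expressions and tidy the whitespace."""
    stripped = NUM_OP_UNIT_RE.sub(" ", sentence)
    return re.sub(r"\s+", " ", stripped).strip()


def preprocess(sentence: str) -> str:
    """Apply the three sketched steps in the order listed in the table."""
    sentence = delete_ordinal(sentence)
    sentence = expand_abbreviations(sentence)
    return delete_number_operator_unit(sentence)


if __name__ == "__main__":
    print(preprocess("1. BMI > 30 kg"))
```

Under these assumed patterns, a sentence such as "1. BMI > 30 kg" is reduced to "body mass index", leaving only the text that would then be passed to MetaMap.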