An approach for transgender population information extraction and summarization from clinical trial text

Background Gender information frequently exists in the eligibility criteria of clinical trial text as essential information for participant population recruitment. Particularly, current eligibility criteria text contains the incompleteness and ambiguity issues in expressing transgender population, leading to difficulties or even failure of transgender population recruitment in clinical trial studies. Methods A new gender model is proposed for providing comprehensive transgender requirement specification. In addition, an automated approach is developed to extract and summarize gender requirements from unstructured text in accordance with the gender model. This approach consists of: 1) the feature extraction module, and 2) the feature summarization module. The first module identifies and extracts gender features using heuristic rules and automatically-generated patterns. The second module summarizes gender requirements by relation inference. Results Based on 100,134 clinical trials from ClinicalTrials.gov, our approach was compared with 20 commonly applied machine learning methods. It achieved a macro-averaged precision of 0.885, a macro-averaged recall of 0.871 and a macro-averaged F1-measure of 0.878. The results illustrated that our approach outperformed all baseline methods in terms of both commonly used metrics and macro-averaged metrics. Conclusions This study presented a new gender model aiming for specifying the transgender requirement more precisely. We also proposed an approach for gender information extraction and summarization from unstructured clinical text to enhance transgender-related clinical trial population recruitment. The experiment results demonstrated that the approach was effective in transgender criteria extraction and summarization.


Background
Clinical trials are observations or experiments carried out in clinical research. Under efficacious guidance and strict administration, they can generate reliable testimony and contribute remarkably to evidence-based medicine [1,2].
To a large extent, obtaining satisfactory outcomes from clinical trials depends on the effectiveness of identifying and recruiting suitable participants [3,4]. However, the recruitment of target population has turned into a main impediment in clinical trials due to the time-consuming labor, great cost of fund, and rising complexity [5][6][7][8][9]. The unsatisfactory result of recruitment can bring tremendous trouble to clinical investigators and omitted opportunities to patients [4,[10][11][12].
In the recruiting phase, eligibility criteria evaluate the qualification of participants for certain clinical care or research. Gender is a widely-used and core criterion for electronic prescreening and the gender requirement is approximately defined as structured information by every clinical trial [13][14][15]. Besides, gender is widely utilized for clinical data searching and processing, e.g., the gender identification in Electronic Medical Record (EMR) data [16], the transgender status estimation using EMR data [17], and the core data element for clinical retrieval [18]. Thus, the explicit gender type requirement is essential for enrolling proper participants in clinical trials.
However, prevalent applied gender-required criteria in clinical trial may contain gender issues including incompleteness and ambiguity issues, particularly for transgender-recruiting trials. Transgender is an umbrella term denoting the population whose gender identity or expression is different from their assigned sex at birth [19,20]. Existing websites for registry of clinical trials only define Male, Female and Both as structured information in the gender requirement. Whereas, the transgender type is generally neglected in gender specification during registration.
Taking ClincialTrials.gov, 1 the largest official registry of clinical trials in the world, as an example. Clinical-Trials.gov, sponsored by the United States National Institutes of Health (NIH), provides a summary for each registered clinical trial [21]. In this registry, clinical trials recruiting certain types of transgender population need to clarify gender requirements in sex eligibility section. Nevertheless, only "Male", "Female", and "All" are listed as the structured gender types alternatives for trial registration, which may induce unsuitable recruitment for transgender population. For example, in the clinical trial NCT01880489, 2 eligible participants need to be transgender female according to "Identify as a transgender woman (assigned male at birth and currently identify as female)" from inclusion criteria section. However, the gender requirement is incorrectly registered as "Female" in the structured section "Sexes Eligible for Study" due to lack of the "transgender" option on the website. For trials recruiting transgender populations (e.g., clinical trials studying Human Immunodeficiency Virus), the incorrect transgender information issue is much common. As a result, this incorrect gender registration may cause wrong information in clinical trial text processing and negative influences in clinical trial electrical recruitment.
Meanwhile, the population of transgender and the number of trials requiring for transgender has been increasing. As reported in 2011, around 700,000 transgender adults (about 0.3% of total adult population) were identified in the United States [22]. This number rose rapidly to almost 1.4 million in 2016 [23] and was nearly twice as high as the number in 5 years ago. To explore the demographic changes of transgender population in clinical trials, we conducted a manual investigation on ClinicalTrials.gov. We used keyword searching combined with manual review to identify transgender-recruiting trials from all 277,012 clinical trials on ClinicalTrials.gov as to 2018/07/ 10. With respect to the "first posted" and "last update posted" dates of the clinical trials during 2000 to 2017, the number of trials recruiting transgender population was calculated and reported in Fig. 1, demonstrating an upward trend. This growth trend illustrates the increasing importance of the identification of transgender-recruiting trials for transgender population in the participation of clinical trials. Therefore, the data quality issue caused by inappropriate gender registration may lead to more incorrect transgender population recruitment cases among clinical trial studies. Thus, it is imperative and urgent to deal with the transgender data quality issues.
To that end, focusing on improving transgender related-trials searching and recruiting, we designed a virtual gender model to present more complete and explicit gender requirement information. In addition, an automated approach was developed for transgender information extraction and summarization from unstructured clinical trial text. This approach consists of the feature extraction module and the feature summarization module. The feature extraction module utilizes patterns automatically learned from annotated clinical trial text and combines with a list of heuristic rules to extract gender features from clinical trials. The second module computes these features and summarizes gender requirements by relation reasoning for a clinical trial.
We further treated the whole procedure of gender information extraction and summarization as a multi-classification task and compared our approach with 20 machine learning methods on the same clinical text datasets, which incorporate transgender-recruiting trials and non-transgender-recruiting trials together. Based on the largest dataset containing 100,134 trials, our approach achieved a macro-averaged precision of 0.885, a macro-averaged recall of 0.871 and a macro-averaged F 1 -measure of 0.878. The results outperformed all the other baseline methods and demonstrated the effectiveness of the approach in gender information extraction and summarization.

Methods
To solve the transgender-related issues, a virtual gender model is proposed, as shown in Fig. 2. This model extends the widely and commonly used conventional gender types in ClinicalTrials.gov registration options ('Male' , 'Female' , and 'All'). To include transgender types, 13  Based on the gender model, we further propose an automated approach for extracting and summarizing gender information from free clinical trial text. This approach comprises the feature extraction module for extracting gender features and the feature summarization module for concluding final gender requirements based on the proposed gender model. The overall framework of the approach is shown as Fig. 3.

Feature extraction
Clinical trial text frequently contains gender mentions. The feature extraction module utilizes a group of heuristic rules to identify, extract and verify gender information features from clinical trial text including "Study Description" and "Eligibility Criteria" sections. These rules consist of predefined logical relations, a list of gender mention features from clinical trials, and regular expressions. Then, a set of structural patterns is automatically generated from clinical text based on transgender mention annotations. After that, the heuristic rules and the patterns are combined to be applied to gender information extraction and verification from clinical trial text.

Heuristic rule generation
According to the investigation of existing clinical trial text, a variety of gender mention features is categorized into distinct gender mention types. We treat the gender mention detection as a process of feature identification. The examples of some gender mention types and the corresponding features are listed as Table 1. For instance, the feature extraction module regards features 'males' , 'male' , 'man' , 'men' , 'gay' , 'gays' and 'masculine' as the gender mention type [Male].
Combining with the defined gender mention types, we develop a set of heuristic rules to detect the mentions in clinical trial text. The rules contain logical relations such as "If (gender X) in exclusion criteria, then output (Not gender X)" and regular expressions such as "match '([Transgender] ' are two pre-defined gender mention types. Given a sentence "Self identify as a transgender woman" (NCT03270969 3 ), the gender mention "transgender woman" is identified using the regular expression and is annotated as "<Detected Gender Mention = Transgender Female>".

Transgender pattern learning
Identifying transgender features by utilizing heuristic rules only may lead to erroneous extraction. For instance, in sentence " … Participants who were female at birth, who now identify as male, will not be excluded … " (NCT02356302 4 ), the "male" indicates transgender male rather than biological male from the context. However, it is wrongly regarded as not transgender features by heuristic rule as this sentence does not contain any transgender-specific key words. Thus, we incorporate automatic structural pattern learning to improve feature extraction performance. The essential notion of the pattern learning is to extract all pattern candidates based on original annotated text and to filter patterns which may have less significance to feature extraction.
To automatically generate the transgender matching patterns, the clinical trial text with transgender feature annotations is split into sentences initially. The feature extraction module leverages a sentence boundary identification algorithm by rule-based matching to split text into individual sentences, especially for the text lacking full stop symbols. Those sentences are verified by pre-defined rules to rectify incorrect cases, such as treating the "." in "e.g.," as stop symbols. Then, the original transgender mention annotation tags are replaced with a specified tag "<TG_Start > <TG_End>" for context extraction. For example, a sentence is processed as " … Participants who were female at birth, <TG_Start>who now identify as male<TG_End>, will not be excluded … " (NCT02356302 5 ).
After replacement, a list of pattern candidates is extracted by setting a word window length as a parameter β. The optimized value of β is empirically chosen as 7, presenting that the number of words around the transgender tag is not larger than 7. All patterns containing the tags and their surrounding words are extracted and  Table 1 The gender mention types and their related gender mention features Gender mention types Gender mention features [Partner] 'partner' , 'partners' , 'sexual partner' , 'sexual partners' , 'wife' , 'husband' [Negation_Word] 'no' , 'not' , 'except' , 'besides' , 'rather' , 'rather than' , 'neither' , 'not identify as' , 'not identified as' regarded as candidate patterns. Each candidate pattern is further matched back to the training clinical trial text. For example, the pattern "female at birth, <TG>, will not" can be matched with " … Participants who were female at birth, <TG_Start>who now identify as male<TG_End>, will not be excluded … " (NCT02356302 6 ), " … Participants who are female at birth, <TG_Start>who now identify as male<TG_End>, will not be excluded … " (NCT03467347 77 ) and "… Participants who are female at birth, <TG_Start > who now identify as male < TG_End>, will not be excluded …" (NCT03234400 8 ). The confidence and support values of each candidate pattern are calculated after matching. We define a support metric ∂ S as the count of correct matches and a confidence metric ∂ C as the rate of correct expression matches among all matches. The candidate patterns with confidence or support values lower than the two metrics are regarded as invalid patterns and are filtered out. To enlarge the matching coverage of the generated patterns, ∂ C is set as 0.7 and ∂ S is set to 4 empirically in this study.

Gender mention verification
The automatically learned patterns and heuristic rules are utilized to identify gender mention features from free clinical trial text sequentially. After the detection, the mentions are identified and annotated with corresponding gender types. A context-based method is then proposed to verify the gender mention annotations. In certain cases, some identified mentions are not adaptive to context information and should be excluded. For example, the "male" identified in "male sex partners" (NCT02704208 9 ) and the "female" identified in "Male Patients with female sexual partners" (NCT00231465 10 ) do not represent required target population and should be removed from the annotations. Therefore, we develop a list of regular expressions to verify the annotations. The rule "<Detected Gender Mention> ([Partner])" thus is used to detect the identified "male" and "female" in similar examples. In addition, some identified mentions are associated with negation words, which may change the meaning of required gender types. To detect and rectify those negation cases, a list of negation features as [Negation_Word] is defined, as listed in Table 1. The module identifies the negation features and filters out irrelevant gender mentions in sentence-level context. For example, the rule "[Negation_Word] <Detected Gender Mention>" can be used to identify "transgendered" in "Biologically male (not transgendered)" (NCT01023620 11 ), where "transgendered" should be annotated together with "not".
Algorithm 1 illustrates the feature extraction module for extracting gender information. Firstly, an unstructured clinical trial text is split into sentences. The algorithm then detects and extracts gender mentions from each sentence by incorporating the generated transgender patterns and heuristic rules. At this step, we apply the patterns from Generated_Patterns and heuristic rules from Heuristic_Rules to annotate gender features from each sentence. Generated_Patterns are patterns generated automatically from clinical trial text with manual annotations, as shown in line 5. Then rules from Verification_Rules are used to remove marked mentions that should not be included. Those extraction and verification procedures are shown as line 7-22.

Gender summarization Gender mention inference
Since different gender types have internal relations, we design a list of gender inference functions for relation calculation and deduction. These functions are composed of logical judgment functions and transformation functions.
The logical judgment function determines whether gender types are conformed to a certain relation. Some examples of logical judgment functions are shown as Table 2 and each function contains a function description in logical way. For example, we used SuperJudgement function to determine whether a gender type has superior relation with the other or another. Using this function, 'Transgender All' can be treated as superior gender of 'Transgender Male' or 'Transgender Female' , and the model can therefore compute the relations among those gender types.
With the logical judgment functions, we further develop a list of transformation functions to operate and transform different gender types for concluding final gender requirements. The examples of transformation functions are shown as Table 3, where some of them contain parameter restrictions of logical judgment functions. For example, 'Transgender All' can be split into 'Transgender Male' and 'Transgender Female' using the function Split('Transgender All'). 'Biological Male' and 'Biological Female' can be merged into 'Biological All' using the function Merge('Biological Male' , 'Biological Female'). 'Biological Female' can be converted into 'Transgender Male' using the function TransConstrain('Biological Female'). The function TransConstrain can be applied to transform biological gender into transgender types while the context is identified as transgender condition.

Gender requirements summarization
To conclude the required gender types of a clinical trial, all the mapped gender types with valid annotations are split into a list of meta gender types, i.e., 'Biological Male' , 'Biological Female' , 'Transgender Male' and 'Transgender Female' according to the gender relations defined in the feature summarization model. For example, 'Biological All' is split into 'Biological Male' and 'Biological Female'.
Since a text may contain multiple gender types while some of them may be noise, we design a strategy using majority rule to detect frequently mentioned gender types considering that some meta gender types are predominant in a text. All meta gender types are then sorted by their frequencies in descending order. If the frequency of a meta gender type ranked at top i + 1 multiplies a threshold μ is lower than the previous meta gender type ranked at i, the feature summarization module treats the meta gender types from i + 1 to n as noise. Using MG i to present the frequency of a meta gender type ranked at top i and μ to denote the threshold, the final predominant score as Pred is calculated using Eq. (1). If Pred < 1, the module treats (MG 1 , … MG i ) as predominant gender types and treats the reminder as noise. The optimization of μ is presented in Experiment and Result section.
Taking clinical trial NCT02401867 12 as an example. The study description contains the statement "among sexually active female-to-male (FTM) transgender adults", "among 150 FTM patients in Boston", "online focus groups with FTMs" and "to gather information on the sexual health needs of FTM individuals". The study population description contains the statement "enroll 150 female-to-male (FTM) individuals", "recruited from the existing FTM patient population" and "recruit 40% racial/ethnic minority FTMs". The inclusion criteria Splitting G 1 into G 2 and G 3 SplitJudgement(G 1 ) == True Input includes "Assigned a female sex at birth and now self-identifies as a man, trans masculine, trans man, FTM, transgender, genderqueer/non-binary, transsexual, male, and/or another diverse transgender identity or expression". After gender information extraction and verification of the above text, the identified and valid 'Transgender Male' type occurs 10 times, while 'Transgender All' twice, 'Biological Male' twice and 'Biological Female' once. After splitting 'Transgender All' into two meta gender types, 'Transgender Male' is counted as 12 times and 'Transgender Female' as 2. According to the Eq. (1), 'Transgender Male' is treated as MGT 1 while 'Transgender Female' , 'Biological Male' and 'Biological Female' are treated as MGT 2 , MGT 3 and MGT 4 . By using the threshold μ as 5, the Pred of MGT 1 is lower than 1. Therefore, the feature summarization module takes 'Transgender Male' as the predominate gender type and ignores the rest gender types. The module treats all kept meta gender types as equal and merges them using the transformation function Merge(G 1 , G 2 ) for generating a final gender conclusion. For example, the meta gender types 'Transgender Female' and 'Transgender Male' are merged into the finial gender 'Transgender All'.

Evaluation metrics
For performance evaluation, we treat the gender information identification and summarization as a multi-classification task. As commonly used as performance evaluation metrics in Nature Language Processing (NLP) and information retrieval tasks, precision, recall and F β -measure are adopted in the experiment [24,25]. Typically, in a binary classification task, a data is labeled as either positive or negative (where positive and negative represent two generic categories). A confusion matrix can be generated according to True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). In the matrix, precision represents the percentage of correctly classified positive data divided by the total number of data classified as positive (Precision = TP/ (TP + FP)). Recall is the percentage of correctly classified positive data divided by the total number of data expecting to be classified as positive (Recall = TP/ (TP + FN)). F β -measure is the harmonic mean of precision and recall (Eq. 2). Non-negative real value β enables F β -measure to balance emphasize precision or recall. We empirically use F 1 -measure by setting β = 1 to equal the weights of precision and recall.
In addition, the proportion of non-transgender-recruiting clinical trials is much higher than transgender-recruiting trials. The results of precision, recall, and F 1 -measure may be affected by such an unbalanced data. We thus use micro-averaged values as additional metrics to reduce the effect of unbalanced quantity of predominated gender types. The macro-averaged metrics assign equal weights to categories in the evaluation to discount the performance of better-populated categories [24,25]. The calculations of macro-averaged precision, macro-averaged recall, and macro-averaged F 1 -measure are shown as Eqs. 3, 4, and 5, respectively, where n denotes the number of gender types. Dataset The 277,012 clinical trials on the ClinicalTrials.gov as to 2018/07/10 were used as experimental data. All transgender-related keywords were used to match the trial text to retrieve transgender-recruiting clinical trials as a candidate dataset. Three human annotators including one clinician and two clinical researchers manually annotated the dataset independently using the proposed gender data model. The inter-agreement rate was 73% using Fleiss Kappa. After discussion, the three annotators solved all disagreements and formed the final gold standard for transgender criteria in clinical trials. As a result, 134 clinical trials were identified as transgender-recruiting trials, generating a dataset TG.
To generate transgender annotation dataset for automated transgender patterns learning, we leveraged a bootstrap method which reduced the impact of dataset size difference and increase the efficiency of experimental estimation [26]. The bootstrap method is useful when the scale of dataset was not large and effectively partitioning training sets was difficult [26]. Based on the dataset TG, we used a bootstrap method to generate a transgender dataset TG'. One trial from TG was randomly selected and its copy was sent into TG'. This execution will repeat until the scale of TG' is equal to TG. Then, we randomly extracted 10,000 clinical trials containing non-transgender-related features. These clinical trials were added into TG' to form transgender patterns learning training dataset with better validation by enlarging data scale. Based on transgender patterns learning training dataset with manually annotated transgender features, our approach extracted all pattern candidates from this dataset and calculated the confidence and support by matching back to original annotated training dataset. After calculation, the patterns with a confidence and a support lower than a threshold were filtered out. As a result, the approach generated 14 patterns.

Result
To optimize the threshold μ described in the Method section, the performances in terms of F 1 -measure values were calculated by setting the threshold from 1 to 10 leveraging 10-fold cross-validation. Taking dataset G as an example, as shown in Table 4, the results showed that the F 1 -measure obtained the different values when the threshold increased from 1 to 10 in round 1 to 10 on the training datasets (nine of ten using 10-fold), respectively. We thus selected μ = 5 in round 1-3 and 5-10 while μ = 4 in round 4 as the optimized parameters. Eventually, μ = 5 was chosen as the best parameter value for the following experiments.
To test the stability of our approach, it ran on all the datasets A to G. The macro-averaged precision, recall and F 1 -measure in each round were calculated. The values were further averaged based on ten rounds and were reported in Fig. 4 In addition, to compare our approach with state-of-the-art methods, we applied 20 widely used machine learning algorithms as baselines. These baselines were implemented in a suite of machine learning toolkit -Weka 3.8 [27], including Bayesian Network [28], Naive Bayes [29], SMO (Sequential Minimal Optimization) [30], Random Forest [31], LMT (Logistic Model Tree) [32], and J48(C4.5) [33]. The same features were processed by those baseline algorithms in WEKA using the same 10-fold cross-validation strategy. Our approach was compared with those algorithms using macro-averaged F 1 -measure on the dataset A to G. The performance greater than 0.6 in terms of macro-averaged F 1 -measure were reported in Table 5. Classical Random Forest, LMT, and Bayes Net achieved the macro-averaged F 1 -measure 0.765, 0.665 and 0.655 on dataset G, respectively. Our approach achieved the highest macro-averaged F 1 -measure score, outperforming all the baselines on every dataset.

Discussion
Our approach was proposed for automatically extracting and summarizing transgender information from unstructured clinical trial text. On the basis of our previous work at [34], we improved the transgender extraction method by introducing an automatic pattern learning method. Compared with the previous work, the new approach intentionally applied the macro-averaged metric in order to better validate the approach considering that the experiment datasets contain much less transgenderrecruiting trials than non-transgender-recruiting trials. Besides, the new approach was compared with 20

Precision
Recall F1 Fig. 4 The performance of our approach on different datasets commonly applied machine learning algorithms on the same experiment datasets and achieved higher performance. The overall micro-averaged F 1 -measure and macro-averaged F 1 -measure of our approach on the largest dataset G was 0.98 and 0.878, respectively, achieving 0.02 and 0.113 higher compared with the best baseline algorithm Random Forest. According to the results, the approach could remain stable when the number of clinical trials was increasing.
To demonstrate the effectiveness of integrating pattern-learning method, we compared the performance of our approach with or without pattern matching on the dataset A to G. The macro-averaged F 1 -measure values without pattern matching were 0.791, 0.804, 0.801, 0.807, 0.813, 0.803 and 0.813 respectively. The results illustrated that the performances of the approach with pattern matching were consistently higher than using heuristic rule only.
To understand the weakness of our approach for further improvement, we analyzed all error cases and identified the following error types: 1. Context verification errors: The incorrect gender mention identifications incurred when context containing irrelevant information. For example, in "The specific objectives of this study are reduce stigma towards lesbian, gay, bisexual, and transgender persons in Swaziland and Lesotho" (NCT02410434 13 ), the "lesbian, gay, bisexual, and transgender persons" was annotated as ['Transgender All, Biological All'] by the approach, while human annotators treated it as irrelevant information. In "this is a process that provides an opportunity to study the sex hormone dependent influences that explain differences in morbidity in men and women respectively" (NCT02518009 14 ), the approach treated "men and women" as ["Biological Both"] while human annotators treated it as irrelevant information. 2. Pattern matching errors: While matching the correct features, the pattern might also identify the wrong information. For instance, the pattern "with men (msm) and <TG> (" correctly identified the transgender feature "transgender women" in " … thai men who have sex with men (msm) and transgender women (tg) … " (NCT01869595 15 ). However, this pattern incorrectly matched the nontransgender information "female sex workers" in " … including early injectors, men who have sex with men (msm) and female sex workers (fsw) … " (NCT02573948 16 ). We intend to open the source code of the proposed approach in this paper. The code is publicly available at https://github.com/ Tony-Hao/GenX.

Conclusions
This paper focused on gender, fundamental information in clinical trial for electrical prescreening to recruit appropriate participants. To facilitate transgender population recruitment, a virtual gender model was developed. An automated approach was further proposed for gender information extraction and gender summarization from unstructured clinical trial text. Based on 100,134 real clinical trials, our approach was compared with 20 machine learning algorithms. The results presented that our approach achieved the best performance using both widely adopted metrics and macro-averaged metrics, demonstrating the effectiveness of the approach in gender information processing.