An improved BM25 algorithm for clinical decision support in Precision Medicine based on co-word analysis and Cuckoo Search

Background Retrieving gene and disease information from a vast collection of biomedical abstracts to provide doctors with clinical decision support is one of the important research directions of Precision Medicine. Method We propose a novel article retrieval method based on expanded words and co-word analysis, and apply Cuckoo Search to optimize the parameters of the retrieval function. The main goal is to retrieve the abstracts of biomedical articles that refer to treatments. Existing methods adopt the BM25 algorithm to calculate the scores of abstracts. We instead propose an improved version of BM25 that also computes the scores of expanded words and co-words, leading to a composite retrieval function, which is then optimized using Cuckoo Search. The proposed method rewards articles in which both the disease and the gene of a query appear in the same abstract, thereby raising the scores of the most relevant articles. In addition, we investigate the influence of different parameters on the retrieval algorithm and summarize how they meet various retrieval needs. Results The data used in this manuscript are sourced from the medical articles of the Text Retrieval Conference (TREC) Clinical Decision Support (CDS) and Precision Medicine tracks of 2017, 2018, and 2019. A total of 120 topics are tested. Three indicators are employed to compare the proposed method against methods based on the original BM25 algorithm and its improved versions. The results show that the proposed algorithm achieves better performance. Conclusion The proposed method, an improved version of the BM25 algorithm combining co-word analysis and Cuckoo Search, is verified to achieve better results on a large number of experimental sets. Besides, a relatively simple query expansion method is implemented in this manuscript.
Future research will focus on ontology and semantic networks to expand the query vocabulary.


Background
With the proliferation of computer technology, the information available on the Internet has increased rapidly, and various systems have been developed to extract information from medical articles; medical practice has thus entered the age of Big Data. However, managing this immense volume of data and extracting information from it is a critical endeavor, and improving this process would greatly benefit medical doctors. For instance, some routine decision-making tasks require significant repetition, which takes time and increases costs, whereas computerized medical information retrieval systems can effectively improve efficiency, save costs, and reduce errors. The development of medical information retrieval systems is therefore crucial. In practice, every decision of a doctor is critical to the patient, so doctors must keep abreast of the state-of-the-art technologies and methods of clinical science. The academic literature reporting the latest research results of the medical community can be accessed via the Internet, and medical retrieval models play a crucial role in this access. Furthermore, searching the relevant biomedical literature for reference can be highly beneficial for medical practitioners who encounter a difficult problem in a certain medical record.
Information Retrieval (IR) methods for Clinical Decision Support (CDS) have been the focus of recent research and assessment campaigns. Specifically, the CDS track of the 2014 to 2016 Text Retrieval Conferences (TREC) [1][2][3] sought to assess systems providing evidence-based information, in the form of either full texts or abstracts from an open-access subset of MEDLINE, to clinicians in response to their queries. Furthermore, the tracks from 2017 to 2019 [4][5][6] focused on clinical decision support providing both useful and precise medical information to clinicians treating cancer patients. In these tracks, each case described the disease (a type of cancer), the relevant genetic variants (which genes), and basic demographic information (age and sex) of a patient. Precision Medicine, introduced in [7], is a new medical concept of individualized medicine that develops with the rapid progress of genome sequencing technology and the cross-application of bioinformatics and Big Data science.

Preliminaries
IR aims to retrieve related documents based on a given query. The relevancy of documents to queries is often gauged by the score assigned by an IR model, e.g., the widely implemented BM25 model [8]. The past few decades have witnessed the application of machine learning technology to information retrieval. Document ranking methods can be classified into three groups: (i) single-document methods, (ii) document-pair methods, and (iii) document-list methods. The common single-document methods, such as [9] utilizing a logistic regression technique, take the feature vector of each document as input and output the relevance of each document. The document-pair methods, e.g., the ones utilizing RankSVM [10] or RankBoost [11], take the feature vectors of a pair of documents as input and output the relative order of the documents. The document-list methods, e.g., ListNet [12], AdaRank [13], or LambdaMART [14], take a set of documents associated with a query as input and output a ranked list. In recent years, query expansion methods have been widely implemented in information retrieval. Singh et al. [15] suggested a method based on fuzzy logic, in which the top-ranked documents were regarded as relevance feedback documents for mining query information, and the choice of different query expansion terms was determined according to their importance. Such methods often assign each term a relevance score and then select expansion terms based on a certain threshold.
Keikha et al. [16] considered the Wikipedia corpus as the feedback space to train a word vector model and determined the best expansion terms in both supervised and unsupervised models. Almasri et al. [17] also utilized vectors to represent query words and the query expansion terms returned by pseudo-relevance feedback. They added cosine similarity to the Bag-of-Words model, and the frequency of each word in the query terms was recalculated. Singh et al. [18] proposed a classic relevance feedback method, which increased the term weights of the relevant documents and reduced those of the non-relevant ones. However, one disadvantage of this method was that assessing the relevance of documents was very time-consuming for practitioners.
Cui et al. [19] developed a query expansion method for web search logs utilizing the interaction information of users. The key assumption behind this method was that the documents a user chose to read were related to the query. The new words in the related documents were sorted according to their similarity to the user query, and the words with the highest similarity were selected as expanded words. The candidate expanded words were extracted from the top documents and then weighted and sorted by the probability generated by a language model. Aronson and Rindflesch [20] proposed a query expansion method based on the Unified Medical Language System (UMLS), which benefitted from the MetaMap program [21] to identify the medical phrases in the original query and then expanded the query with new phrases. The experimental results showed that query expansion utilizing the UMLS was an effective way to improve the performance of information retrieval.
Li et al. [22] proposed a keyword-weighted network analysis method to implement medical full-text recommendation, which helped to expand the medical acronym list by searching the full text. Domain experts verified that the algorithm worked well in terms of accuracy in recommending medical literature. Balaneshinkordan et al. [23] developed a query expansion method utilizing a Bayesian approach, which expanded the genes of a disease with no less than three words. The experiments revealed that the algorithm achieved higher precision.
The literature review brings us the idea of using both query expansion and keywords to retrieve documents that are highly related to a query. Hence, this manuscript proposes a method utilizing expanded words and co-word analysis as new tools to optimize the retrieval of biomedical articles, with the BM25 algorithm as the base method. The scores of the abstract, the expanded words, and the co-words are combined into a composite retrieval function. Moreover, when a disease and a gene both appear in the same biomedical article, the score of the article tends to increase. Finally, the Cuckoo Search Algorithm [28] is utilized to optimize the parameters of the proposed retrieval function.

Data structure
The abstracts of biomedical articles are presented in XML format. The MeSH headings, chemical lists, and keyword lists of the XML documents are selected along with the abstracts; their layout is presented in Fig. 1.

Data distribution
While the total number of biomedical articles in both 2017 and 2018 TREC Precision Medicine is 26,613,834, the 2019 set has 29,137,637 articles. Table 1 shows some of the statistics that are used in information retrieval, where Abstract-Mean-Length represents the average length of the abstracts after deleting stop-words; Abstract-Number represents the number of articles with abstracts; Chemical-Mean-Length represents the average length of the chemical lists; Chemical-Number represents the number of articles with a chemical list; Mesh-Mean-Length represents the average length of the MeSH headings; Mesh-Number represents the number of articles with MeSH headings; Keyword-Mean-Length represents the average length of the keyword list, and Keyword-Number represents the number of articles with keyword lists.

Query expansion
Medical Subject Headings (MeSH) is a controlled vocabulary developed by the U.S. National Library of Medicine, which is mainly utilized to index, catalog, and search articles relevant to both biomedical and health sciences. The important role of MeSH in medical information retrieval is mainly manifested in two aspects, namely accuracy and specificity. Indexers input information into the retrieval system, and researchers retrieve it with respect to these two aspects; MeSH serves as the platform that keeps the terms consistent between indexing and searching to achieve the best outcomes. Hence, the accurate and comprehensive usage of MeSH has a significant impact on the results of information retrieval. In this manuscript, we utilize the MeSH database (meshb.nlm.nih.gov/MeSHon-Demand) to find expansion terms or new words and their broader terms. MeSH on Demand is utilized to expand query terms and obtain additional terms where possible. Table 2 shows Topic 2017-1 as an example, and Table 3 presents the results of the extended words.

Age expansion
The variable age included in the demographic field is expanded to the terms proposed by Kastner et al. [24]. We readjusted the age division so that those over 18 are considered adults. Table 4 presents our expansion model based on the variable age.
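The age expansion above can be sketched as a simple mapping; the exact bands and term names below are illustrative assumptions (the paper adapts the groups of Kastner et al. and treats everyone over 18 as an adult), not the authors' exact vocabulary.

```python
def expand_age(age):
    """Map a patient's age (in years) to illustrative expansion terms.

    The bands and term names are assumptions for illustration; only the
    over-18 = adult rule is stated in the text.
    """
    if age >= 18:
        return ["adult"]
    if age >= 13:
        return ["adolescent", "teenager"]
    if age >= 2:
        return ["child"]
    if age >= 1:
        return ["infant"]
    return ["newborn", "infant"]
```

A query for a 35-year-old patient would thus be expanded with the term "adult" before retrieval.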

The proposed model
We first utilize MeSH on Demand to find MeSH terms and additional terms that can be used in the retrieval of abstracts for any given query. Then, we construct a "Word List" including chemical words, keywords, and MeSH headings, on which query expansion is applied, thereby finding documents that are more related to the expanded query. This increases the relevance scores of such documents. In the next step, we perform the co-word analysis utilizing either each separate resource (abstract, keywords, chemical words, or MeSH headings) or all resources at a time to find the co-occurrence of selected words, namely a disease and a gene in our case. While the first step computes the score of the abstract based on a query and its morphemes, the second step computes the score of the expanded words. Then, the co-word score is calculated; as the number of candidate documents is reduced based on the "Word List", the co-word score tends to increase. In the last step, we compute the composite score consisting of the three scores of the abstract, the expanded words, and the co-words. Afterward, Cuckoo Search [28], an evolutionary optimization method, is applied to optimize the parameters of the proposed retrieval model.

The abstract scoring model
BM25 [8] is a classical information retrieval model. It analyzes a query Q into morphemes q_i; for each search result d, it calculates the correlation between each morpheme q_i and d, and finally sums the weighted correlation scores of the q_i with respect to d to obtain the correlation score between Q and d. The general formula of BM25 can be expressed as

Score(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d), \quad (1)

with the IDF weight

W_i = \log \frac{D - card(\{j \mid q_i \in d_j\}) + 0.5}{card(\{j \mid q_i \in d_j\}) + 0.5}, \quad (2)

where D represents the total number of corpus documents, and card({j | q_i ∈ d_j}) represents the number of documents containing morpheme q_i. According to (2), the more documents contain q_i, the lower the weight of q_i for a given set of documents. In other words, when many documents contain q_i, the discriminative power of q_i is weak, and so is its importance for judging relevance. The relevance score R(q_i, d) of morpheme q_i with respect to document d is defined as

R(q_i, d) = \frac{f_i (k_1 + 1)}{f_i + K} \cdot \frac{qf_i (k_2 + 1)}{qf_i + k_2}, \quad (3)

where parameter K is

K = k_1 \left(1 - b + b \cdot \frac{dl}{avgdl}\right), \quad (4)

and k_1, k_2, and b are adjustment factors that are usually set according to experience, f_i is the frequency of q_i in d, qf_i is the frequency of q_i in the query, dl is the length of document d, and avgdl is the average length of all documents. In most cases, q_i appears only once in the query, i.e., qf_i = 1. Hence, (3) can be rewritten as

R(q_i, d) = \frac{f_i (k_1 + 1)}{f_i + K}. \quad (5)

As seen from the definition, the role of parameter b is to tune the impact of the document length on the relevance. The larger b is, the greater the impact of the document length on the relevance score will be, and vice versa. Similarly, the longer the relative length of the document is, the larger K, and hence the smaller the relevance score will be. In the end, the correlation score of the abstract of document d can be expressed as

Score_{abstract}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d). \quad (6)

The expanded word score

As seen in Table 1, most biomedical articles have both abstracts and titles. The number of biomedical articles containing chemical words, MeSH headings, and keywords varies widely.
Specifically, there exist 13,113,093 articles containing chemical words, 2,438,717,151 articles containing MeSH headings, and 4,005,446 articles containing keywords. As the literature suggests, direct utilization of BM25 leads to failure when dealing with a large selection of documents [50]. In this subsection, we propose an improved BM25 algorithm to compute the scores of expanded words. We combine the chemical words, MeSH headings, and keywords into a list called the "Word List". The length of the "Word List" of document d is defined by

dwl = dcl + dml + dkl, \quad (7)

where dcl is the length of the chemical words in document d, dml is the length of the MeSH headings in document d, and dkl is the length of the keywords in document d.
The IDF value of an expanded word appearing in the "Word List" of a document is given by

W_i' = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}, \quad (8)

where N represents the number of documents for which dwl > 0, and n(q_i) represents the number of documents containing the extended morpheme q_i. The score of the expanded words of the word list is defined by

Score_{wordlist}(Q, d) = \sum_{i=1}^{n} W_i' \cdot R'(q_i, d), \quad (9)

where n represents the number of expanded words in query Q, and q_i represents the morpheme of each expanded word in query Q. The score of an expanded word in document d is defined by

R'(q_i, d) = \frac{f_i' (k_2 + 1)}{f_i' + k_2 \left(1 - b_1 + b_1 \cdot \frac{dwl}{avgdwl}\right)}, \quad (10)

where k_2 and b_1 are adjustment factors that are usually set according to experience, f_i' is the frequency of q_i in the "Word List" of d, and avgdwl represents the average length of all word lists.
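Both scoring functions can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names, the use of `list.count` for term frequencies, the precomputed document-frequency dictionaries, and the default parameter values are all illustrative assumptions.

```python
import math

def bm25_abstract_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl,
                        k1=1.2, b=0.75):
    """Classic BM25 score of a document's abstract, assuming each query
    morpheme occurs once in the query (qf_i = 1).

    doc_freqs maps a term to the number of documents containing it."""
    dl = len(doc_terms)
    score = 0.0
    for q in query_terms:
        n_q = doc_freqs.get(q, 0)
        idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))  # IDF weight W_i
        f = doc_terms.count(q)                              # frequency of q in d
        K = k1 * (1.0 - b + b * dl / avgdl)                 # length normalization
        score += idf * f * (k1 + 1.0) / (f + K)
    return score

def wordlist_score(expanded_terms, word_list, wl_doc_freqs, n_wl_docs,
                   avgdwl, k2=1.0, b1=0.75):
    """Improved-BM25 score over the combined chemical/MeSH/keyword
    'Word List'; the word list plays the role of the document and
    avgdwl the role of avgdl."""
    dwl = len(word_list)
    score = 0.0
    for q in expanded_terms:
        n_q = wl_doc_freqs.get(q, 0)
        idf = math.log((n_wl_docs - n_q + 0.5) / (n_q + 0.5))
        f = word_list.count(q)
        K = k2 * (1.0 - b1 + b1 * dwl / avgdwl)
        score += idf * f * (k2 + 1.0) / (f + K)
    return score
```

Note that this smoothed IDF can go negative for terms appearing in more than half of the documents, which is a known property of this BM25 variant.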

The co-word score
Co-word analysis utilizes the co-occurrence of lexical pairs or noun phrases in an article set to determine the relationship between topics in the discipline represented by the article set. In this manuscript, co-word analysis is introduced into the article scoring model for the cases when a disease and a gene co-occur across the abstract, Chemical List, MeSH heading, and Keyword List (as presented in Fig. 2) or co-occur within any one of the abstract, Chemical List, MeSH heading, or Keyword List (as presented in Fig. 3).
We utilize the IDF value as the co-word score to distinguish the importance of a gene, which can be formulated as

W_{g_i} = \log \frac{N - n(g_i) + 0.5}{n(g_i) + 0.5}, \quad (11)

where N represents the number of documents, and n(g_i) represents the number of documents containing gene morpheme g_i.
Finally, the co-word score is defined by

Score_{co\text{-}word}(Q, d) = \sum_{i=1}^{n} W_{g_i}, \quad (12)

where n is the number of genes in query Q that co-occur with the disease in document d, and g_i is the morpheme of each such gene in query Q.

Retrieval model
We utilize the composite score as the final score of document d under query Q, which can be formulated as

Score_{composite}(Q, d) = Score_{abstract}(Q, d) + Score_{wordlist}(Q, d) + \alpha \cdot Score_{co\text{-}word}(Q, d), \quad (13)

where \alpha weights the contribution of the co-word score.
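The co-word score and the composite score can be sketched as follows. The exact form of the composite function is truncated in the text, so the placement of the weight `alpha` on the co-word term below is an assumption (alpha is one of the six tuned parameters), and the function names are illustrative.

```python
import math

def coword_score(query_genes, coword_doc_counts, n_docs):
    """Co-word score of a document: sum of IDF weights over the genes of
    the query that co-occur with the disease in this document.

    coword_doc_counts maps a gene to the number of documents containing it."""
    score = 0.0
    for g in query_genes:
        n_g = coword_doc_counts.get(g, 0)
        if n_g > 0:  # only genes actually co-occurring contribute
            score += math.log((n_docs - n_g + 0.5) / (n_g + 0.5))
    return score

def composite_score(abstract_s, wordlist_s, coword_s, alpha=1.0):
    """Composite retrieval score: abstract BM25 score plus word-list score
    plus an alpha-weighted co-word score (where alpha enters is an
    assumption, not confirmed by the text)."""
    return abstract_s + wordlist_s + alpha * coword_s
```

A document mentioning both the queried disease and one of its genes thus receives a strictly higher composite score than an otherwise identical document without the co-occurrence.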

Parameter optimization
The proposed method has six parameters, k_1, k_2, k_3, b_1, b_2, and α, and the choice of parameters can affect the results of information retrieval. Various algorithms, e.g., the Genetic Algorithm (GA) [25], the Simulated Annealing (SA) Algorithm [26], and the Ant Colony (AC) Algorithm [27], have been implemented to optimize objective functions. With the continuous effort in developing better algorithms, several new Swarm Intelligence Optimization (SIO) algorithms have emerged in recent years, such as Cuckoo Search (CS) [28], Glowworm Swarm Optimization (GSO) [29], and Particle Swarm Optimization (PSO) [30]. Among these, SIO algorithms have been widely utilized.

Cuckoo Search Algorithm
CS is an SIO algorithm proposed by Yang et al. [28] in 2009. Guerrero et al. [31] claimed that CS outperformed GA in terms of efficiency. The idealized rules utilized by CS can be given as: (1) Each cuckoo lays only one egg at a time and selects a parasitic nest at random to hatch its egg. (2) The best parasitic nest is handed down to the next generation. (3) The number of available parasitic nests is fixed, and the probability that the host of a parasitic nest detects the egg is P_a ∈ (0, 1).
The cuckoo finds a nest and updates its position according to the above-given rules. The position update formula is

x_i^{(t+1)} = x_i^{(t)} + T \oplus Levy(\lambda), \quad (14)

where T is the step size (T > 0), \oplus is the point-to-point multiplication operator, and Levy(\lambda) is the search path following the Levy distribution [32,33]. The pseudo-code of the algorithm [28] is presented as follows. (Fig. 4 presents the architecture of the system retrieving biomedical articles.)

Objective function

Precision is calculated as

Precision = \frac{RR}{RR + RN}, \quad (15)

where RR and RN refer to the relevant and irrelevant documents retrieved, respectively. Then, P@10 is defined as the Precision when RR + RN = 10. Hence, the average P@10 can be formulated as

Avg\,P@10 = \frac{1}{n} \sum_{t=1}^{n} P@10(t), \quad (16)

where P@10(t) represents the P@10 value of the t-th topic among n topics.

The Normalized Discounted Cumulative Gain (nDCG) is a commonly utilized index to assess the quality of ranking in information retrieval. Let ϑ denote the relevance grade, and gain(ϑ) denote the gain associated with ϑ. Also, assume that g_1, g_2, ..., g_Z are the gain values associated with the Z documents retrieved by a system in response to query q, such that g_i = gain(ϑ) if the relevance grade of the document at rank i is ϑ. Hence, the nDCG value of this system can be calculated as

nDCG = \frac{1}{DCG_I} \sum_{i=1}^{Z} \frac{g_i}{\log_2(i + 1)}, \quad (17)

where DCG_I denotes the DCG value of an ideal ranked list for query q. We define the average nDCG as

Avg\,nDCG = \frac{1}{n} \sum_{t=1}^{n} nDCG(t), \quad (18)

where nDCG(t) represents the nDCG value of the t-th topic among n topics.
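The two evaluation measures can be sketched as follows; this is a minimal illustration using the standard log2(rank + 1) discount, and the function names and the binary-judgement input for P@10 are assumptions.

```python
import math

def precision_at_k(relevance, k=10):
    """P@k: fraction of the top-k retrieved documents that are relevant.

    relevance is a ranked list of 0/1 relevance judgements."""
    return sum(relevance[:k]) / float(k)

def ndcg(gains, ideal_gains):
    """nDCG: DCG of the system ranking divided by the DCG of the ideal
    ranking, with the usual log2(rank + 1) discount at rank i (1-based)."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs))
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

A perfect ranking yields nDCG = 1, and any mis-ordering of graded documents yields a value strictly below 1.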

Algorithm flow
Since k_2 has a fixed value (k_2 = 1), we utilize k_1, k_3, b_1, b_2, and α as input parameters. Firstly, the algorithm generates the initial population and sets the maximum number of iterations or a stopping criterion. If the number of iterations reaches the maximum or the stopping criterion is met, the algorithm ends and returns the optimal solution. Otherwise, the algorithm performs a series of operations to optimize the objective function. This manuscript defines Avg P@10 + Avg nDCG as the objective function and employs the 2017 Precision Medicine dataset as the training dataset to optimize the parameters. Figure 5 presents the flowchart of the proposed algorithm.
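The optimization loop above can be sketched as a minimal Cuckoo Search. This is not the authors' implementation: the Mantegna Levy-step generator, the move rule relative to the current best nest, the population size, and the toy maximization interface are all illustrative assumptions (the real objective would be Avg P@10 + Avg nDCG evaluated by running the retrieval pipeline).

```python
import math
import random

def levy_step(beta=1.5):
    """Draw one Levy-distributed step via the Mantegna algorithm."""
    sigma_u = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
               (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u = random.gauss(0, sigma_u)
    v = random.gauss(0, 1)
    return u / abs(v) ** (1 / beta)

def cuckoo_search(objective, bounds, n_nests=15, pa=0.25, n_iter=200, step=0.05):
    """Minimal Cuckoo Search maximizing `objective` over box `bounds`:
    Levy-flight moves around the best nest, greedy acceptance, and
    abandonment of a fraction pa of the worst nests per generation."""
    rnd = lambda: [random.uniform(lo, hi) for lo, hi in bounds]
    clip = lambda x: [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]
    nests = [rnd() for _ in range(n_nests)]
    fit = [objective(x) for x in nests]
    for _ in range(n_iter):
        best = nests[fit.index(max(fit))]
        for i in range(n_nests):
            # Levy flight biased by the distance to the current best nest
            cand = clip([x + step * levy_step() * (x - b)
                         for x, b in zip(nests[i], best)])
            f = objective(cand)
            if f > fit[i]:          # greedy replacement
                nests[i], fit[i] = cand, f
        # abandon a fraction pa of the worst nests (the best is preserved)
        order = sorted(range(n_nests), key=lambda i: fit[i])
        for i in order[:int(pa * n_nests)]:
            nests[i] = rnd()
            fit[i] = objective(nests[i])
    best_i = fit.index(max(fit))
    return nests[best_i], fit[best_i]
```

In the paper's setting, a nest would encode the parameter vector (k_1, k_3, b_1, b_2, α) and the objective would rerun retrieval on the 2017 training topics.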

Experimental results and comparison

Dataset
The data used in this work were sourced from the medical articles of the 2017, 2018, and 2019 TREC Precision Medicine tracks. Due to the semi-structured nature of the XML format, we used MongoDB as the database for document storage and Python as the programming language. All the code can be found on the corresponding author's GitHub (https://github.com/Bruce-V/CS-BM25). Table 5 presents the parameter values used in the proposed algorithm.

Experimental results
In Table 6, "Normal" refers to the values of the empirical parameters, whereas "CS" refers to the parameters trained on the 2017 dataset, using the 1000 highest-scoring documents returned by the selected retrieval model.
Across the data of the three years, the optimized parameters outperform the empirical parameters. For an information retrieval system, users desire related documents to appear earlier; hence, infNDCG and P@10 are two important indicators in assessing the performance of the retrieval process. The parameters optimized for both nDCG and P@10 increase the weight of the word list. The word list includes extended information about age, gender, and genes, which is crucial for distinguishing the relevant literature from the irrelevant. In conclusion, different parameters can be utilized to meet the needs of various users.
In Figs. 6, 7, and 8, RR represents the relevant documents among the retrieved co-word documents, while RN represents all relevant documents. The average coverage rate of the 30 topics of 2017 is 52.9%; it is 74.13% for the 50 topics of 2018 and 54.4% for the 40 topics of 2019. These outcomes reveal that the co-word analysis has a positive impact on the retrieval of relevant documents and greatly reduces the scope of the search. In information retrieval, NR, NN, RR, and RN respectively represent the relevant documents that are not retrieved, the irrelevant documents that are not retrieved, the relevant documents that are retrieved, and the irrelevant documents that are retrieved. Here, Precision is defined as in formula (15).
Recall is defined as

Recall = \frac{RR}{RR + NR}, \quad (19)

and the F1-score is defined as

F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}. \quad (20)

As seen in Figs. 9, 10, and 11, the optimized parameters outperform the empirical parameters for both P@10 and infNDCG. Since we utilize the 2017 Precision Medicine dataset as the training set, the optimization effect is most obvious on this dataset. On the 2018 and 2019 test data, both P@10 and infNDCG improve, but R-predicted declines. This happens because the adopted objective function improves the ranking of the most relevant documents. Precision and Recall are inversely related in a retrieval system. However, in our case of retrieving biomedical articles, we are more concerned with the precision rate, so as to ease the doctors' decision-making.
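The three measures can be computed directly from the retrieval counts defined above; the function name and argument order below are illustrative.

```python
def retrieval_metrics(rr, rn, nr):
    """Precision, Recall, and F1 from retrieval counts:
    rr = relevant retrieved, rn = irrelevant retrieved,
    nr = relevant not retrieved."""
    precision = rr / (rr + rn) if rr + rn else 0.0
    recall = rr / (rr + nr) if rr + nr else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, a system retrieving 8 of 16 relevant documents along with 2 irrelevant ones has Precision 0.8 and Recall 0.5, illustrating the trade-off discussed above.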

Experimental comparison
Considering the results of the models from the 2017, 2018, and 2019 TREC Precision Medicine tracks, three indicators, namely infNDCG, R-predicted, and P@10, are selected for comparison; the experimental results are presented in Tables 7, 8, and 9. We use the three years of TREC datasets to verify our experimental results. The selected methods utilize either the BM25 algorithm or an improved version of it. On the 2017 dataset, our proposed method showed a significant improvement in all indicators. On the 2018 dataset, our method performed better than similar algorithms for P@10 and ranked second for infNDCG. The same was observed for the 2019 dataset.

Conclusion
This manuscript proposes a BM25-based method incorporating co-word analysis to retrieve biomedical articles. We improved the BM25 algorithm and used it to compute the score of expanded words, combining the co-word score with the gene appearance weight. Then, we utilized the Cuckoo Search Algorithm to optimize the parameters on an evaluation function based on both P@10 and nDCG. The optimization results suggest that increasing the score weight of the "Word List" can effectively improve the ranking of related documents. The manuscript also discusses the influence of different parameters on the retrieval algorithm and presents parameters meeting different retrieval needs. Although the proposed algorithm is based on an improved version of BM25, it highlights general rules for tuning the parameters of the BM25 algorithm, which were verified through numerous experiments. Since the query expansion used in this manuscript is simple, our future research will focus on adopting more linked data to investigate the use of topic data.