An Improved BM25 for Clinical Decision Support in Precision Medicine Based on Co-word Analysis and Cuckoo Search

Background: Retrieving gene and disease information from a very large collection of biomedical abstracts to provide doctors with clinical decision support is one of the important research directions of Precision Medicine. Method: We propose a new method for retrieving biomedical articles that uses expanded words and co-word analysis and applies Cuckoo Search in the final stage to optimize the parameters of the retrieval function. The specific goal is to retrieve abstracts of biomedical articles that address treatments. The method first applies the BM25 algorithm to compute the score of the abstract, and then applies an improved version of BM25 to compute the scores of expanded words and co-words, which together yield a composite retrieval function. Afterward, the retrieval function is optimized using Cuckoo Search. The proposed method looks for both the disease and the gene in the abstract of the same biomedical article; when both are found, the article is more likely to be relevant, so its score is increased. In addition, the manuscript discusses the influence of different parameters on the retrieval algorithm and summarizes parameter settings that meet various retrieval needs. Results: All data are taken from the medical articles provided by the Precision Medicine tracks of the Text Retrieval Conference (TREC) Clinical Decision Support (CDS) campaigns of 2017, 2018, and 2019. A total of 120 standard topics are tested, and three evaluation measures are used to compare the methods. To keep the experiments comparable, only methods based on the BM25 algorithm or an improved version of it are compared. The experimental results show that the proposed algorithm achieves better retrieval results and better rankings.
Conclusion: The proposed algorithm, an improved version of the BM25 algorithm, uses both co-word analysis and Cuckoo Search, and experiments on a large number of experimental sets verify that it produces better results. Because the query expansion used here is simple, expanding the query vocabulary with a semantic network is planned as future work.

With the continuous development of computer technologies, the amount of information available on the internet has increased sharply, which has led to a variety of methods for extracting insights from medical articles. Medical treatment has thus also entered the age of big data, and managing this immense volume of data and extracting insights from it has become a critical endeavor. Improving this process would benefit medical doctors on a large scale. For instance, some routine decision-making tasks involve significant repetition, which takes time and raises costs, whereas the sound application of computerized medical information retrieval systems could improve efficiency, save costs, and reduce errors. Therefore, the development of computer-assisted medical information retrieval systems is highly significant. In practice, every decision made by a doctor is critical to the patient, so doctors must constantly update their knowledge and follow the latest technology and methods of clinical science. Because the authoritative literature and the latest research results of the medical community can be consulted via the internet, medical retrieval models play a crucial role. Moreover, for medical practitioners who encounter a difficult problem in a particular medical record, searching the relevant biomedical literature online for reference and inspiration is an important way of resolving it.
Information Retrieval (IR) methods for Clinical Decision Support (CDS) have been the focus of several recent studies and evaluation campaigns. Specifically, the CDS tracks of the 2014 to 2016 Text Retrieval Conferences (TREC) [1][2][3] sought to assess systems that provide clinicians with evidence-based information, in the form of either full-text articles or abstracts from an open-access subset of MEDLINE, in response to queries. The 2017-2019 tracks [4][5][6] focused on an important application of clinical decision support: providing useful and precise medical information to clinicians treating cancer patients. In these tracks, each case described the patient's disease (a type of cancer), the relevant genetic variants (which genes), and basic demographic information (age, sex). Precision Medicine (PM) [7] is a new medical concept and model of individualized medicine that has developed with the rapid progress of genome sequencing technology and the cross-application of bioinformatics and big data science.

2. Preliminaries
The objective of Information Retrieval (IR) is to retrieve documents related to a given query. The relevance of a document to a query is usually gauged by the score assigned by an IR model, for example, the widely used classical BM25 model [8]. Over the past few decades, machine learning has also been applied to information retrieval. These approaches fall into three categories: (1) single-document (pointwise) methods, (2) document-pair (pairwise) methods, and (3) document-list (listwise) methods. Single-document methods, for instance logistic regression [9], take the feature vector of each document as input and output the relevance of that document. Document-pair methods, for example Rank-SVM [10] or Rank-Boost [11], take the feature vectors of a pair of documents as input and output the relative relevance of the two documents. Document-list methods, for instance List-Net [12], Ada-Rank [13], or Lambda-Mart [14], take the set of documents associated with a query as input and output a ranked list. In recent years, query expansion methods have been widely applied in information retrieval. Singh et al. [15] suggest a query expansion method based on fuzzy logic, in which the top-ranked documents are regarded as relevance feedback documents for mining additional information related to the query. Expansion terms are chosen according to their importance in the top documents [15]. Methods of this type assign each term a relevance score and then select the expansion terms using a threshold.
Keikha et al. [16] further use the Wikipedia corpus as a feedback space to train a word vector model and determine the long-term selection of the best features in both supervised and unsupervised models. Almasri et al. [17] also use vectors to represent the query words and the query expansion terms returned by pseudo-relevance feedback. Cosine similarity is added to the Bag-of-Words Model, and the frequency of each word in the query is recomputed. Rocchio et al. [18] suggest a classic relevance feedback method, which increases the term weights of relevant documents and reduces the term weights of non-relevant documents. However, one disadvantage of this method is that assessing the relevance of documents is very time-consuming for practitioners.
Cui et al. [19] propose a query expansion method for web search logs that uses practitioners' interaction information. The key assumption behind this method is that the documents a user chooses to read are related to the query. The new words in the related documents are sorted by their similarity to the user query, and the new words with the highest similarity are selected as expanded words. The candidate expanded words are extracted from the top documents and are then weighted and sorted by the probability generated by the language model. Aronson [20] proposes a query expansion method based on the UMLS, which uses the Meta-Map program [21] to identify medical phrases in the original query and then expands the query with new phrases. The experimental results show that query expansion using the UMLS is an effective way to improve the performance of information retrieval.
Li et al. [22] propose a keyword-weighted network analysis method for medical full-text recommendation, which helps expand the medical acronym list by searching the full text and is evaluated by domain experts. The algorithm is shown to work well in terms of recommendation accuracy on the medical literature. Saeid et al. [23] propose a query expansion method based on a Bayesian approach, which expands the genes of a disease to no fewer than 3 words; their experiments suggest that the algorithm achieves higher precision.
The literature reviewed above suggests that both query expansion and keywords help retrieve documents that are highly related to a query. Hence, this manuscript proposes a new method that uses expanded words and co-word analysis to optimize the retrieval of biomedical articles. The BM25 algorithm serves as the base method for computing the scores of the abstract, the expanded words, and the co-words, which are combined into a composite retrieval function. In addition, when a disease and a gene both appear in the same biomedical article, the score of that article is increased. Finally, Cuckoo Search is used to optimize the parameters of the retrieval algorithm.

3. Data

3.1 Data structure
The biomedical articles are provided as scientific abstracts in XML format, together with information about each article. In addition to the abstract, the MeSH headings, chemical list, and keyword list of each XML document are selected; an example is presented in Figure 1.

Data distribution
The 2017 and 2018 TREC Precision Medicine sets both contain 26,613,834 biomedical articles, while the 2019 TREC Precision Medicine set contains 29,137,637. Some of the statistics used in information retrieval are tabulated in Table 1, with the following definitions: Abstract-Mean-Length represents the average length of the abstracts after deleting stopwords; Abstract-Number represents the number of articles with abstracts; Chemical-Mean-Length represents the average length of the chemical lists; Chemical-Number represents the number of articles with a chemical list; Mesh-Mean-Length represents the average length of the MeSH headings; Mesh-Number represents the number of articles with MeSH headings; Keyword-Mean-Length represents the average length of the keyword lists; and Keyword-Number represents the number of articles with keyword lists.

Query expansion
Medical Subject Headings (MeSH) is a controlled vocabulary developed by the U.S. National Library of Medicine, which is mainly used to index, catalog, and search articles in the biomedical and health sciences. The importance of MeSH in medical information retrieval lies mainly in two aspects, accuracy and specificity. Two processes are involved: indexers enter information into the retrieval system, and researchers search the information in the system. MeSH serves as the platform that keeps the terms consistent between indexing and searching to achieve the best retrieval outcomes. Hence, the accurate and comprehensive use of MeSH has a significant impact on the results of retrieval. In this manuscript, we use MeSH on Demand (meshb.nlm.nih.gov/MeSHonDemand) to expand the query terms, obtaining expansion terms and their corresponding broader terms, together with additional terms where available.
Topic 2017-1, presented in Table 2, is taken as an example. The resulting expanded words are presented in Table 3.

Age expansion
The age given in the demographic field is also expanded into the terms proposed by Kastner et al. [24]. We have adjusted the age divisions and treat everyone over 18 as an adult. Our age-based expansion is presented in Table 4.

The Proposed Model
We first use MeSH on Demand to find the MeSH terms and additional terms that are used to retrieve abstracts for a given query. Then, we construct a "word list" that includes chemical words, keywords, and MeSH headings, and search it with the expanded query. In this way, documents that are more closely related to the expanded query can be found, and their relevance scores increase. Next, co-word analysis is conducted to find co-occurrences of selected words, in our case a disease and a gene, either across the abstract, keywords, chemical words, and MeSH headings taken together or within each of these sources separately. Scoring proceeds in three steps: the first step computes the score of the abstract based on the query and its morphemes, the second step computes the score of the expanded words, and the third step computes the co-word score. Because searching with the "word list" reduces the candidate set to a smaller number of documents, the co-word score tends to increase. Finally, we compute a composite score from the abstract, expanded-word, and co-word scores, and apply Cuckoo Search, an evolutionary optimization method, to optimize the parameters of the composite retrieval model.

Scoring model for abstract
The BM25 model [8] is a classical information retrieval model. Its main idea is to analyze a query $Q$ to obtain its morphemes $q_i$; then, for each search result $d$, to calculate the relevance score of each morpheme $q_i$ with respect to $d$; and finally to take the weighted sum of these relevance scores to obtain the relevance score between $Q$ and $d$. The general formula of the BM25 algorithm is expressed by
$$\mathrm{Score}(Q, d) = \sum_{i=1}^{n} W_i \cdot R(q_i, d) \tag{1}$$
where $n$ is the number of morphemes in $Q$ and $W_i$ is a weight that determines the relevance of a word to a document. We use the Inverse Document Frequency (IDF) as $W_i$:
$$\mathrm{IDF}(q_i) = \log \frac{D - n(q_i) + 0.5}{n(q_i) + 0.5} \tag{2}$$
where $D$ represents the total number of documents in the corpus, and $n(q_i) = |\{d \mid q_i \in d\}|$ represents the number of documents containing morpheme $q_i$. According to the definition of the IDF, the more documents contain $q_i$, the lower the weight of $q_i$ for a given set of documents. In other words, when many documents contain $q_i$, its discriminative power is weak, so it contributes little to judging relevance. The relevance score $R(q_i, d)$ of morpheme $q_i$ to document $d$ is defined by
$$R(q_i, d) = \frac{f_i \, (k_1 + 1)}{f_i + K} \cdot \frac{qf_i \, (k_2 + 1)}{qf_i + k_2} \tag{3}$$
where $K$ is defined as follows:
$$K = k_1 \left( 1 - b + b \cdot \frac{dl}{avgdl} \right) \tag{4}$$
Here $k_1$, $k_2$ and $b$ are adjustment factors, which are usually set according to experience; $f_i$ represents the frequency of $q_i$ in $d$; $qf_i$ represents the frequency of $q_i$ in the query; $dl$ represents the length of document $d$; and $avgdl$ represents the average length of all documents. In most cases, $q_i$ appears only once in the query, that is, $qf_i = 1$, so the formula can be simplified to
$$R(q_i, d) = \frac{f_i \, (k_1 + 1)}{f_i + K} \tag{5}$$
As seen from the definition of $K$, the role of parameter $b$ is to tune the impact of the document length on the relevance. The larger $b$ is, the greater the impact of the document length on the relevance score, and vice versa. The longer the document is relative to the average length, the larger $K$ is, and hence the smaller the relevance score. In the end, the relevance score of the abstract of the document can be expressed by
$$\mathrm{Score}_{abstract}(Q, d) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f_i \, (k_1 + 1)}{f_i + k_1 \left( 1 - b + b \cdot \frac{dl}{avgdl} \right)} \tag{6}$$
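To make the abstract scoring step concrete, the following is a minimal Python sketch of BM25 with the query-frequency factor dropped (each morpheme occurring once in the query). It assumes the smoothed BM25 IDF and documents that are already tokenized into lists of morphemes. The function names, the toy corpus, and the defaults `k1=1.2` and `b=0.75` are illustrative assumptions, not the values used in this manuscript.

```python
import math
from collections import Counter


def idf(term, docs):
    """Smoothed BM25 IDF: log((D - n + 0.5) / (n + 0.5))."""
    D = len(docs)                              # total number of documents
    n = sum(1 for d in docs if term in d)      # documents containing the term
    return math.log((D - n + 0.5) / (n + 0.5))


def bm25_abstract_score(query, doc, docs, k1=1.2, b=0.75):
    """Sum of IDF-weighted, length-normalized term frequencies for one document.

    k1 and b are common textbook defaults (an assumption of this sketch).
    """
    avgdl = sum(len(d) for d in docs) / len(docs)
    K = k1 * (1 - b + b * len(doc) / avgdl)    # document length normalization
    tf = Counter(doc)
    return sum(
        idf(q, docs) * tf[q] * (k1 + 1) / (tf[q] + K)
        for q in query
        if tf[q] > 0
    )


# Toy corpus of tokenized abstracts (hypothetical data).
docs = [
    ["braf", "melanoma", "treatment"],
    ["melanoma", "surgery"],
    ["diabetes", "insulin", "therapy"],
    ["asthma", "inhaler"],
    ["hypertension", "diet"],
]
query = ["braf", "melanoma"]
scores = [bm25_abstract_score(query, d, docs) for d in docs]
```

In this toy corpus the first document, which contains both query morphemes, receives the highest score, and documents containing neither morpheme score zero. Note that the smoothed IDF becomes negative for a term that occurs in more than half of the documents.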

Score of expanded word
Based on the statistics in Table 1 [52], this subsection proposes an improved BM25 algorithm to compute the scores of expanded words. We combine the Chemical Words, MeSH headings, and Keywords of a document into a list called the "Word List". The length of the "Word List" of document $d$ is defined by
$$dl_{word} = dl_{chemical} + dl_{mesh} + dl_{keyword} \tag{7}$$
where $dl_{chemical}$ represents the length of the Chemical List in document $d$; $dl_{mesh}$ represents the length of the MeSH headings in document $d$; and $dl_{keyword}$ represents the length of the Keyword list in document $d$. The IDF value of an expanded word appearing in the "Word List" of a document is defined by
$$\mathrm{IDF}_{word}(q_i, d) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} \tag{8}$$
where $N$ represents the number of documents in which $dl_{word} > 0$, and $n(q_i)$ represents the number of documents containing the expanded morpheme $q_i$. The frequency value of a word-list term and the score of an expanded word in a document are then computed over the $n$ expanded words in query $Q$, where $q_i$ represents the morpheme of each expanded word in $Q$.
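As a small illustration of the "Word List" and its IDF, the sketch below builds the word list of each document and counts only documents with a non-empty word list when computing the IDF. The dictionary keys `chemicals`, `mesh`, and `keywords`, the toy documents, and the logarithm in the IDF are assumptions of this sketch; the term-frequency part of the expanded-word score is not shown.

```python
import math


def word_list(doc):
    """Concatenate the chemical list, MeSH headings, and keyword list of a document."""
    return doc.get("chemicals", []) + doc.get("mesh", []) + doc.get("keywords", [])


def idf_word(term, docs):
    """Smoothed IDF over documents whose word list is non-empty."""
    lists = [word_list(d) for d in docs]
    N = sum(1 for w in lists if w)             # documents with word-list length > 0
    n = sum(1 for w in lists if term in w)     # documents whose word list has the term
    return math.log((N - n + 0.5) / (n + 0.5))


# Hypothetical documents; the last one has no word list and is excluded from N.
docs = [
    {"chemicals": ["vemurafenib"], "mesh": ["melanoma", "humans"], "keywords": ["braf"]},
    {"mesh": ["melanoma"]},
    {"keywords": ["asthma"]},
    {},
]
dl_word = len(word_list(docs[0]))   # 1 chemical + 2 MeSH headings + 1 keyword
braf_idf = idf_word("braf", docs)   # N = 3, n = 1
```

Restricting $N$ to documents that actually have a word list keeps the IDF from being inflated by the many articles without chemical, MeSH, or keyword metadata.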

Score of co-word
Co-Word analysis uses the co-occurrence of lexical pairs or noun phrases in an article set to determine the relationship between topics in the discipline represented by the article set. In this manuscript, Co-Word analysis is introduced into the article scoring model. A co-word is counted when a disease and a gene co-occur across the abstract, Chemical List, MeSH headings, and Keyword List of an article, as presented in Figure 2, or co-occur within any one of the abstract, Chemical List, MeSH headings, or Keyword List, as presented in Figure 3.
We use the Inverse Document Frequency (IDF) value in the Co-Word score to distinguish the importance of a gene. It is defined by
$$\mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
where $N$ represents the number of documents and $n(q_i)$ represents the number of documents containing gene morpheme $q_i$. The Co-Word score is then computed over the $n$ genes in query $Q$ that co-occur with a disease, where $q_i$ represents the morpheme of each gene in $Q$.
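The two kinds of co-occurrence can be sketched as follows. This is only an illustration under stated assumptions: each document is a dictionary of tokenized fields with the hypothetical keys `abstract`, `chemicals`, `mesh`, and `keywords`, and disease and gene mentions are matched as exact tokens. The scoring of detected co-words is not shown.

```python
FIELDS = ("abstract", "chemicals", "mesh", "keywords")


def cooccurs_across(doc, disease, gene):
    """Disease and gene both appear somewhere in the document, possibly in different fields."""
    tokens = set()
    for field in FIELDS:
        tokens.update(doc.get(field, []))
    return disease in tokens and gene in tokens


def cooccurs_within(doc, disease, gene):
    """Disease and gene both appear inside at least one single field."""
    return any(
        disease in doc.get(field, []) and gene in doc.get(field, [])
        for field in FIELDS
    )


# Hypothetical documents.
doc_a = {"abstract": ["melanoma", "patients"], "mesh": ["braf"]}   # different fields
doc_b = {"abstract": ["braf", "melanoma", "inhibitor"]}            # same field
doc_c = {"abstract": ["melanoma", "surgery"]}                      # gene missing
```

Under these assumptions `doc_a` counts as a co-word document only in the across-field sense, `doc_b` in both senses, and `doc_c` in neither.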

Retrieval Model
We use a composite score, which combines the abstract score, the expanded-word score, and the co-word score, as the final score of a document $d$ under query $Q$. The architecture of the system for biomedical article retrieval is depicted in Figure 4.

Parameter Optimization
There are six parameters in the proposed method, including $k_1$, $k_2$, $k_3$, and $b$. Choosing better parameters improves the retrieval results. Many optimization algorithms can be used to optimize such a function, for example the Genetic Algorithm (GA) [25], the Simulated Annealing (SA) Algorithm [26], and the Ant Colony (AC) Algorithm [27]. With the continuous effort to develop better intelligent optimization algorithms, many new Swarm Intelligence Optimization (SIO) Algorithms have emerged in recent years, such as the Cuckoo Search (CS) Algorithm [28], the Glow Worm Swarm Optimization (GWSO) Algorithm [29], and the Particle Swarm Optimization (PSO) Algorithm [30]. SIO Algorithms have been widely used for function optimization.

The Cuckoo Search (CS) Algorithm is a Swarm Intelligence Optimization (SIO) Algorithm proposed by Yang [28] in 2009. Maribel [31] suggests that the CS Algorithm is more efficient than the GA. The CS Algorithm relies on the following idealized rules:
(1) Each cuckoo lays only one egg at a time and places it in a randomly selected parasitic nest to hatch.
(2) The best parasitic nest is kept for the next generation.
(3) The number of available parasitic nests is fixed, and the probability that the owner of a parasitic nest discovers the foreign egg is $P_a \in (0,1)$.
The cuckoo finds a nest and updates its position according to the above idealized rules. The position update formula is defined by
$$x_i^{(t+1)} = x_i^{(t)} + \alpha \oplus \mathrm{Levy}(\lambda)$$
where $x_i^{(t)}$ is the position of the $i$-th nest at generation $t$, $\alpha > 0$ is the step size, the operator $\oplus$ denotes point-to-point multiplication, and Levy($\lambda$) is the search path, which follows the Levy distribution [32][33].
The pseudo-code of the algorithm [28] is presented as follows:
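A minimal Python sketch of these steps is given below. It is an illustration rather than the exact procedure of [28] or of this manuscript: Levy steps are drawn with Mantegna's algorithm, the step is scaled by the width of each parameter range, and the defaults `n_nests`, `pa`, `alpha`, `lam`, and `iters` are assumed values chosen only for the example.

```python
import math
import random


def levy_step(rng, lam=1.5):
    """Draw one Levy-distributed step using Mantegna's algorithm."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2)
             / (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.gauss(0, sigma)
    v = rng.gauss(0, 1)
    return u / max(abs(v), 1e-12) ** (1 / lam)


def cuckoo_search(objective, bounds, n_nests=15, pa=0.25, alpha=0.01, iters=200, seed=0):
    """Maximize `objective` over the box `bounds`; returns (best_position, best_value)."""
    rng = random.Random(seed)

    def clip(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    def random_nest():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    nests = [random_nest() for _ in range(n_nests)]
    fitness = [objective(x) for x in nests]
    for _ in range(iters):
        # Rule 1: one cuckoo lays one egg via a Levy flight and picks a random nest.
        i = rng.randrange(n_nests)
        egg = clip([v + alpha * (hi - lo) * levy_step(rng)
                    for v, (lo, hi) in zip(nests[i], bounds)])
        f_egg = objective(egg)
        j = rng.randrange(n_nests)
        if f_egg > fitness[j]:
            nests[j], fitness[j] = egg, f_egg
        # Rule 3: a fraction pa of the worst nests is discovered and rebuilt.
        # Rule 2: the current best nest is never abandoned.
        best = max(range(n_nests), key=fitness.__getitem__)
        worst_first = sorted(range(n_nests), key=fitness.__getitem__)
        for k in worst_first[:int(pa * n_nests)]:
            if k != best:
                nests[k] = random_nest()
                fitness[k] = objective(nests[k])
    best = max(range(n_nests), key=fitness.__getitem__)
    return nests[best], fitness[best]
```

In the retrieval setting, `objective` would map a parameter vector to the evaluation score of the resulting ranking, and `bounds` would give the search range of each parameter. Because the best nest is never abandoned and a nest is only overwritten by a better egg, the best value found cannot decrease from one generation to the next.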

Objective function
Let RR denote the number of relevant documents retrieved and RN the number of irrelevant documents retrieved. Precision is defined by
$$\mathrm{Precision} = \frac{RR}{RR + RN}$$
P@10 is then defined as the Precision at RR + RN = 10, that is, over the top 10 retrieved documents. We define the average P@10 as follows:
$$\overline{\mathrm{P@10}} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{P@10}(i)$$
where P@10($i$) represents the P@10 value of the $i$-th topic among the $M$ topics.
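The computation of P@10 and its average over topics can be sketched in a few lines of Python. The data layout is an assumption of this sketch: `runs` maps a topic identifier to a ranked list of document identifiers, and `qrels` maps a topic identifier to the set of relevant document identifiers.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k


def mean_precision_at_k(runs, qrels, k=10):
    """Average P@k over all topics in `runs`."""
    return sum(precision_at_k(runs[t], qrels[t], k) for t in runs) / len(runs)


# Hypothetical rankings and relevance judgments for two topics.
runs = {
    "t1": ["d%d" % i for i in range(1, 11)],     # d1 ... d10
    "t2": ["d%d" % i for i in range(11, 21)],    # d11 ... d20
}
qrels = {
    "t1": {"d1", "d3", "d99"},                   # two of the top 10 are relevant
    "t2": {"d11", "d12", "d13", "d14"},          # four of the top 10 are relevant
}
avg_p10 = mean_precision_at_k(runs, qrels)
```

Dividing by `k` rather than by the number of documents returned means that a run with fewer than ten results is penalized, which is the usual convention for P@10.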
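NDCG [70] is a commonly used index to assess the quality of ranking in information retrieval. Let $\vartheta$ denote a relevance grade and $gain(\vartheta)$ denote the gain associated with $\vartheta$. In addition, let $g_1, g_2, \ldots, g_Z$ be the gain values associated with the $Z$ documents retrieved by a system in response to a query $q$, such that $g_i = gain(\vartheta)$ if the relevance grade of the document at rank $i$ is $\vartheta$. Then, the nDCG value for this system is computed by
$$\mathrm{nDCG} = \frac{\mathrm{DCG}}{\mathrm{DCG}_I}, \qquad \mathrm{DCG} = \sum_{i=1}^{Z} \frac{g_i}{\lg(i + 1)}$$
where $\mathrm{DCG}_I$ denotes the DCG value for an ideal ranked list for query $q$. We define the average nDCG as follows:
$$\overline{\mathrm{nDCG}} = \frac{1}{M} \sum_{i=1}^{M} \mathrm{nDCG}(i)$$
where nDCG($i$) represents the nDCG value of the $i$-th topic among the $M$ topics.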
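A short Python sketch of the nDCG computation follows. It assumes a base-2 logarithm for the rank discount and takes the gains of the retrieved documents in rank order; the optional `judged_gains` argument, a name introduced here for illustration, holds the gains of all judged documents for the query so that the ideal ranking can be built from them. The sketch computes plain nDCG and does not perform the sampling-based estimation behind infNDCG.

```python
import math


def dcg(gains):
    """Discounted cumulative gain of a list of gains given in rank order."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))


def ndcg(gains, judged_gains=None):
    """DCG of the ranking divided by the DCG of an ideal ranking of the same length.

    If judged_gains is omitted, the ideal ranking is built from the retrieved
    documents themselves (an assumption made to keep the sketch small).
    """
    pool = gains if judged_gains is None else judged_gains
    ideal = dcg(sorted(pool, reverse=True)[:len(gains)])
    return dcg(gains) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1])      # already in ideal order
reversed_order = ndcg([1, 2, 3])
no_relevant = ndcg([0, 0])
```

A ranking that is already sorted by gain obtains the maximum value, while placing the highest-gain document last lowers the value because its gain is discounted more heavily.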

Algorithm flow
Since $k_2$ is a fixed value, namely $k_2 = 1$, the remaining five parameters are used as the input parameters. First, the algorithm generates the initial population and sets the maximum number of iterations or a stop criterion. If the number of iterations reaches the maximum generation or the stop criterion is met, the algorithm stops and returns the optimal solution. Otherwise, the algorithm performs a series of optimization operations based on the value of the objective function. This manuscript uses P@10 + nDCG as the objective function and employs the 2017 Precision Medicine data set as the training set to optimize the parameters. The flow chart of the algorithm is presented in Figure 5.

Algorithm for parameter setting
The parameter values of the algorithm in this manuscript are presented in Table 5.

Experimental results
In Table 6, "Normal" denotes the empirical parameter values, while "CS" denotes the parameters trained on the 2017 data set with the Cuckoo Search algorithm, using the 1000 highest-scoring documents returned by the retrieval model.
When the three years of data are compared, the optimized parameters perform better than the empirical parameters. Users of a retrieval system want related documents to appear early in the ranking, so infNDCG and P@10 are two important indicators for assessing information retrieval. Optimizing the parameters on both NDCG and P@10 increases the weight of the word list. The word list contains extended information about age, gender, and genes, which is very important for distinguishing relevant literature from non-relevant literature. Different parameters could therefore be used to meet the needs of various users.
In Figures 6, 7 and 8, RR represents the relevant documents that are co-word documents, and RN represents all other relevant documents. Many relevant documents contain both a disease and a gene. We define the average relevant document coverage rate as the share of co-word documents among the relevant documents, RR / (RR + RN), averaged over the topics of each year. The average coverage rate of the 30 topics in 2017 is 52.9%, that of the 50 topics in 2018 is 74.13%, and that of the 40 topics in 2019 is 54.4%. These outcomes suggest that Co-Word Analysis has a positive impact on the retrieval of relevant documents and greatly reduces the search scope.
For the evaluation of retrieval, NR, RR, NN, and RN denote the relevant documents that are not retrieved, the relevant documents that are retrieved, the irrelevant documents that are not retrieved, and the irrelevant documents that are retrieved, respectively. Precision is defined by
$$\mathrm{Precision} = \frac{RR}{RR + RN}$$
Recall is defined by
$$\mathrm{Recall} = \frac{RR}{RR + NR}$$
The F1-score is defined by
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
As seen in Figures 9, 10 and 11, the optimized parameters are better than the empirical parameters on both P@10 and infNDCG. Since we use the 2017 Precision Medicine dataset as the training set, the optimization effect is most obvious on the 2017 dataset. On the 2018 and 2019 test data, both P@10 and infNDCG improve, but R-prec declines.
This happens because the adopted objective function improves the ranking of the most relevant documents. In a retrieval system, Precision and Recall tend to trade off against each other. For the retrieval of biomedical articles, we are more concerned with precision, in order to support doctors' scientific decision-making. The comparison results for the three years are presented in Tables 7, 8, and 9, respectively. We select three years of TREC datasets to verify our experimental results. The compared methods all use the BM25 algorithm or an improved BM25 algorithm. In the 2017 experiments, the proposed method shows a significant improvement on all three indicators. In the 2018 experiments, the proposed method is better than the other similar algorithms on P@10 and R-prec and ranks second on infNDCG. In the 2019 experiments, the proposed method is again better than the other similar algorithms on P@10 and R-prec and ranks second on infNDCG.

Conclusion
This manuscript proposes a BM25-based method that takes Co-Word Analysis into account to retrieve biomedical articles. We improve the BM25 algorithm and use it to compute the score of expanded words, and we combine the co-word score with the gene appearance weight. We then use the Cuckoo Search Algorithm to optimize the parameters with an evaluation function based on both P@10 and NDCG. The optimization results suggest that increasing the score weight of the "word list" can effectively improve the ranking of related documents. In addition, the manuscript discusses the influence of different parameters on the retrieval algorithm and presents parameter settings that meet different retrieval needs. Although the proposed algorithm is specific to an improved version of BM25 with expanded words, co-word analysis, and parameter optimization of the retrieval function, the manuscript also summarizes general rules for tuning the parameters of the BM25 algorithm and verifies them through many experiments. Since the query expansion in this manuscript is simple, our future research will focus on using more linked data together with the topic data.

Ethics approval and consent to participate
On behalf of, and having obtained permission from all the authors, I declare that: the material has not been published in whole or in part elsewhere; the paper is not currently being considered for publication elsewhere; all authors have been personally and actively involved in substantive work leading to the report, and will hold themselves jointly and individually responsible for its content; all relevant ethical safeguards have been met concerning patient or subject protection, or animal experimentation. I testify to the accuracy of the above on behalf of all the authors.

Consent for publication
This article utilizes public datasets