
An improved BM25 algorithm for clinical decision support in Precision Medicine based on co-word analysis and Cuckoo Search

Abstract

Background

Retrieving gene and disease information from a vast collection of biomedical abstracts to provide doctors with clinical decision support is one of the important research directions of Precision Medicine.

Method

We propose a novel article retrieval method based on expanded-word and co-word analyses, using Cuckoo Search to optimize the parameters of the retrieval function. The main goal is to retrieve the abstracts of biomedical articles that refer to treatments. Like many existing methods, ours adopts the BM25 algorithm to score abstracts. We propose an improved version of BM25 that also computes scores for expanded words and co-words, yielding a composite retrieval function that is then optimized with Cuckoo Search. The proposed method rewards articles in which both the disease and the gene of a query appear, raising the scores of the most relevant abstracts. We also investigate the influence of different parameters on the retrieval algorithm and summarize how they meet various retrieval needs.

Results

The data used in this manuscript are sourced from medical articles presented in the Text Retrieval Conference (TREC) Clinical Decision Support (CDS) Precision Medicine tracks of 2017, 2018, and 2019. A total of 120 topics are tested. Three indicators are employed for comparison; the compared methods are selected from among those based on the BM25 algorithm or its improved versions so that the experiments are comparable. The results show that the proposed algorithm achieves better results.

Conclusion

The proposed method, an improved version of the BM25 algorithm utilizing both co-word analysis and Cuckoo Search, has been verified to achieve better results on a large number of experimental sets. However, the query expansion method implemented in this manuscript is relatively simple; future research will focus on ontologies and semantic networks to expand the query vocabulary.


Background

With the proliferation of computer technologies, the information available on the Internet has grown rapidly, leading to various applications for extracting information from medical articles; medical treatment has thus entered the age of Big Data. However, managing this immense volume of data and extracting information from it is a critical endeavor, and improving this process would greatly benefit medical doctors. For instance, some routine decision-making tasks are highly repetitive, which takes time and increases costs, whereas computerized medical information retrieval systems can improve efficiency, save costs, and reduce errors. Proper use of computer technology benefits every field in which it is applied, so the development of medical information retrieval systems is crucial. In practice, every decision a doctor makes is critical to the patient, so doctors must follow state-of-the-art techniques and keep abreast of the latest technology and methods of clinical science. The academic literature reporting the latest research results in the medical community is accessed via the Internet, where medical retrieval models play a crucial role. Moreover, searching the relevant biomedical literature for reference can be highly beneficial for practitioners who encounter a difficult problem in a particular medical record.

Information Retrieval (IR) methods for Clinical Decision Support (CDS) have been the focus of recent research and assessment campaigns. Specifically, the CDS track between 2014 and 2016 Text Retrieval Conferences (TREC) [1,2,3] sought to assess the systems providing evidence-based information in the form of either full-text or abstracts from an open-access subset of MEDLINE to the clinicians in return to their queries. Furthermore, the tracks from 2017 to 2019 [4,5,6] focused on important implementations in clinical decision support providing both useful and precise medical information to clinicians treating cancer patients. In these, each case described the disease (a type of cancer), the relevant genetic variants (which genes), and basic demographic information (age and sex) of patients. Precision Medicine introduced in [7] is a new medical concept utilizing individualized medicine that develops with the rapid progress of genome sequencing technology and the cross-application of bioinformatics and Big Data science.

Preliminaries

The IR aims to retrieve related documents based on a given query. The relevance of documents to queries is often gauged by the score assigned by an IR model, e.g., the widely implemented BM25 model [8]. The past few decades witnessed the application of machine learning technology to information retrieval. Document ranking methods can be classified into three groups: (i) single-document methods, (ii) document-pair methods, and (iii) document-list methods. The common single-document methods, such as [9] utilizing logistic regression, take a feature vector of each document as input and output the relevance of each document. The document-pair methods, e.g., those utilizing Rank-SVM [10] or Rank-Boost [11], take a feature vector of a pair of documents as input and output the relative order of the pair. The document-list methods, e.g., those proposing List-Net [12], Ada-Rank [13], or Lambda-Mart [14], take a set of documents associated with a query as input and output a ranked list. In recent years, query expansion methods have been widely implemented in information retrieval. Singh et al. [15] suggested a method based on fuzzy logic, in which the top-ranked documents were regarded as relevance-feedback documents for mining query information, and expansion terms were chosen according to their importance. Such methods often assign each term a relevance score and then select expansion terms using a threshold.

Keikha et al. [16] used the Wikipedia corpus as the feedback space to train a word-vector model and determined the best long-term features in both supervised and unsupervised settings. Almasri et al. [17] also utilized vectors to represent query words and the expansion terms returned by pseudo-relevance feedback; they added cosine similarity to the Bag-of-Words model and recalculated the frequency of each word in the query. Singh et al. [18] proposed a classic relevance-feedback method, which increased the term weights of the relevant documents and reduced those of the non-relevant ones. However, one disadvantage of this method was that assessing the relevance of documents was very time-consuming for practitioners.

Cui et al. [19] developed a query expansion method for web search logs utilizing practitioners' interaction information. The key assumption behind this method was that the documents a user chose to read were related to the query. The new words in the related documents were sorted according to their similarity with the user query, and those with the highest similarity were selected as expansion words. Candidate expansion words were extracted from the top documents and then weighted and sorted by the probability assigned by a language model. Aronson and Rindflesch [20] proposed a query expansion method based on the Unified Medical Language System (UMLS), which used the MetaMap program [21] to identify the medical phrases in the original query and then expanded the query with new phrases. Their experimental results showed that query expansion utilizing the UMLS was an effective way to improve retrieval performance.

Li et al. [22] proposed a keyword-weighted network analysis method for medical full-text recommendation, which helped expand the medical acronym list by searching the full text. Domain experts verified that the algorithm recommended medical literature accurately. Balaneshinkordan et al. [23] developed a query expansion method utilizing a Bayesian approach, which expanded the gene terms of a disease to no fewer than three words. The experiments revealed that the algorithm achieved higher precision.

The literature review motivates the use of both query expansion and keywords to retrieve documents that are highly related to a query. Hence, this manuscript proposes a method utilizing expanded words and co-word analysis to optimize the retrieval of biomedical articles, with the BM25 algorithm as the base method. The scores of the abstract, expanded words, and co-words are combined into a composite retrieval function. Moreover, when a disease and a gene both appear in the same biomedical article, the score of the article increases. Finally, the Cuckoo Search algorithm [28] is utilized to optimize the parameters of the proposed retrieval function.

As a classical information retrieval algorithm, BM25 has been frequently implemented on TREC, such as 2017, 2018, and 2019 Precision Medicine [34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49]. These algorithms mainly utilize either the original BM25 algorithm or its improved version to retrieve information [37, 38].

Experimental data

Data structure

The abstracts of biomedical articles are provided in XML format. The MeSH headings, chemical lists, and keyword lists of the XML documents are used alongside the abstracts; their general structure is presented in Fig. 1.

Fig. 1

General structure and the XML attributes of MEDLINE abstracts

Data distribution

While the total number of biomedical articles in both 2017 and 2018 TREC Precision Medicine is 26,613,834, the 2019 set has 29,137,637 articles. Table 1 shows some of the statistics that are used in information retrieval, where Abstract-Mean-Length represents the average length of the abstracts after deleting stop-words; Abstract-Number represents the number of articles with abstracts; Chemical-Mean-Length represents the average length of the chemical lists; Chemical-Number represents the number of articles with a chemical list; Mesh-Mean-Length represents the average length of the MeSH headings; Mesh-Number represents the number of articles with MeSH headings; Keyword-Mean-Length represents the average length of the keyword list, and Keyword-Number represents the number of articles with keyword lists.

Table 1 The statistics of the TREC Precision Medicine covering the period of 2017–2019

Query expansion

Medical Subject Headings (MeSH) is a controlled vocabulary developed by the U.S. National Library of Medicine, mainly utilized to index, catalog, and search articles in the biomedical and health sciences. Its important role in medical information retrieval manifests in two aspects, namely accuracy and specificity: indexers use the MeSH when entering information into the retrieval system, and researchers rely on the same vocabulary when querying it. The MeSH thus serves as the platform that keeps terms consistent between indexing and search, so its accurate and comprehensive usage has a significant impact on retrieval results. In this manuscript, we utilize the MeSH database (meshb.nlm.nih.gov/MeSHonDemand) to find expansion terms or new words and their broader terms; MeSH on Demand is used to expand query terms and obtain additional terms where possible.

Table 2 shows Topic 2017-1 as an example, and Table 3 presents the results of the extended words.

Table 2 The retrieval topic description of the TREC Precision Medicine
Table 3 The expanded MeSH of 2017 TREC Precision Medicine retrieval task 1

Age expansion

The variable age, included in the demographic field, is expanded to the terms proposed by Kastner et al. [24]. We readjusted the age divisions, assuming that anyone over 18 is an adult. Table 4 presents our expansion model based on the variable age.
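The age-expansion step can be sketched as a simple lookup. The group labels and cut-offs below are illustrative assumptions (the text only fixes the adult threshold at 18; the full divisions are in Table 4):

```python
def expand_age(age: int) -> list[str]:
    """Map a numeric age to expansion terms.

    Illustrative sketch of the age expansion (Table 4); only the
    18+ = adult rule comes from the text, the other groups are
    hypothetical examples.
    """
    if age >= 18:
        return ["adult"]
    elif age >= 13:
        return ["adolescent"]
    elif age >= 2:
        return ["child"]
    else:
        return ["infant"]
```

A query mentioning a "35-year-old male" would then be expanded with the term for its age group before retrieval.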

Table 4 The expanded age of the TREC Precision Medicine

The proposed model

We first utilize MeSH on Demand to find MeSH terms and additional terms that can be used to retrieve abstracts for a given query. Then, we construct a “word list” consisting of chemical words, keywords, and MeSH headings, which the query expansion uses to find documents more related to the expanded query, thereby increasing their relevance scores. Next, we perform the co-word analysis, using either each separate resource (abstract, keywords, chemical words, or MeSH headings) or all sources at once, to find co-occurrences of selected words, in our case a disease and a gene. The first step computes the score of the abstract based on a query and its morphemes; the second step computes the score of the expansion words; the score of the co-words is then calculated. As the number of candidate documents is reduced based on the “word list”, the score of the co-words tends to increase. In the last step, we compute the composite score consisting of the three scores of the abstract, the expanded words, and the co-words. Afterward, Cuckoo Search [28], an evolutionary optimization method, is applied to optimize the parameters of the proposed retrieval model.

The abstract scoring model

The BM25 [8] is a classical information retrieval model that decomposes a query \(Q\) into morphemes \(q_{i}\). For each search result \(d\), it calculates the correlation between each morpheme \(q_{i}\) and \(d\), and then computes a weighted sum of these correlation scores to obtain the relevance score between \(Q\) and \(d\). The general formula of the BM25 can be expressed by:

$$Score\left( {Q,d} \right) = \mathop \sum \limits_{i}^{n} W_{i} \times R\left( {q_{i} ,d} \right)$$
(1)

where \(W_{i}\) is a weight determining the importance of morpheme \(q_{i}\) for a document, commonly taken to be its Inverse Document Frequency.

The Inverse Document Frequency (IDF) is defined as:

$$IDF\left( {q_{i} } \right) = log\frac{D}{{card(\{ d_{j} |q_{i} \in d_{j} \} )}}$$
(2)

where \(D\) represents the total number of corpus documents, and \(card(\{ d_{j} |q_{i} \in d_{j} \} )\) represents the number of documents containing morpheme \(q_{i}\). According to (2), the more documents contain \(q_{i}\), the lower the weight of \(q_{i}\) for the given document set. In other words, when many documents contain \(q_{i}\), the term discriminates poorly between documents, so its importance in judging relevance is weak.

The relevance score \(R \left( {q_{i} , d} \right)\) of morpheme \(q_{i}\) with respect to document \(d\) is defined as:

$$R\left( {q_{i} ,d} \right) = \frac{{f_{i} \times \left( {k_{1} + 1} \right)}}{{f_{i} + K}} \times \frac{{qf_{i} \times \left( {k_{2} + 1} \right)}}{{qf_{i} + k_{2} }}$$
(3)

where parameter \(K\) is:

$$K = k_{1} \times \left( {1 - b_{1} + b_{1} \times \frac{dl}{{avgdl}}} \right)$$
(4)

where \(k_{1}\), \(k_{2}\), and \(b_{1}\) are adjustment factors that are usually set empirically, \(f_{i}\) is the frequency of \(q_{i}\) in \(d\), \(qf_{i}\) is the frequency of \(q_{i}\) in the query, \(dl\) is the length of document \(d\), and \(avgdl\) is the average length of all documents. In most cases, \(q_{i}\) appears only once in the query, i.e., \(qf_{i} = 1\). Hence, (3) can be rewritten as:

$$R\left( {q_{i} ,d} \right) = \frac{{f_{i} \times \left( {k_{1} + 1} \right)}}{{f_{i} + K}}$$
(5)

As seen from the definition, the role of parameter \(b_{1}\) is to tune the impact of the document length on the relevance score. The larger \(b_{1}\) is, the greater the impact of the document length on the relevance score, and vice versa. Similarly, the longer the document is relative to the average, the larger \(K\) becomes, and hence the smaller the relevance score. In the end, the correlation score of the abstract of document \(d\) can be expressed as:

$$Score_{abstract} \left( {Q,d} \right) = \mathop \sum \limits_{i}^{n} IDF\left( {q_{i} } \right) \times \frac{{f_{i} \times \left( {k_{1} + 1} \right)}}{{f_{i} + k_{1} \times \left( {1 - b_{1} + b_{1} \times \frac{dl}{{avgdl}}} \right)}}$$
(6)
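Equation (6) can be sketched directly in Python. This is a minimal illustration, not the authors' implementation; the data-structure choices (term-frequency and document-frequency dictionaries) and default parameter values are assumptions:

```python
import math

def score_abstract(query_terms, doc_tf, doc_len, avg_doc_len,
                   df, num_docs, k1=1.2, b1=0.75):
    """Sketch of Eq. (6): BM25 score of an abstract for a query.

    query_terms : list of morphemes q_i
    doc_tf      : dict term -> frequency f_i in the abstract
    df          : dict term -> number of documents containing the term
    num_docs    : total number of corpus documents D
    k1, b1      : adjustment factors (defaults are illustrative)
    """
    # Eq. (4): document-length normalization constant K
    K = k1 * (1 - b1 + b1 * doc_len / avg_doc_len)
    score = 0.0
    for q in query_terms:
        f = doc_tf.get(q, 0)
        if f == 0 or df.get(q, 0) == 0:
            continue                              # term absent: contributes 0
        idf = math.log(num_docs / df[q])          # Eq. (2)
        score += idf * f * (k1 + 1) / (f + K)     # Eq. (5) weighted by IDF
    return score
```

Raising the term frequency `f_i` increases the score with diminishing returns, which is the saturation behavior BM25 is known for.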

The expanded word score

As seen in Table 1, most biomedical articles have both abstracts and titles, but the numbers of articles containing chemical words, MeSH headings, and keywords vary widely. Specifically, 13,113,093 articles contain chemical words, 2,438,717,151 contain MeSH headings, and 4,005,446 contain keywords. As the literature suggests, directly applying the BM25 fails when dealing with such a large collection of documents [50]. In this subsection, we propose an improved BM25 algorithm to compute the scores of expanded words. We combine chemical words, MeSH headings, and keywords into a list called the “Word List”, whose length in a document is defined by:

$$dwl = dcl + dml + dkl$$
(7)

where \(dcl\) is the length of chemical words in document \(d\), \(dml\) is the length of MeSH headings in document \(d\), and \(dkl\) is the length of keywords in document \(d\).

The IDF value of the expanded word appearing in the “Word List” of document \(d\) can be given by:

$$IDF_{word} \left( {q_{i} ,d} \right) = log\frac{{N - n\left( {q_{i} } \right) + 0.5}}{{n\left( {q_{i} } \right) + 0.5}}$$
(8)

where \(N\) represents the number of documents in which \(dwl\) > 0, and \(n\left( {q_{i} } \right)\) represents the number of documents containing the extended morpheme \(q_{i}\). The frequency value of the term of the word list is defined by:

$$tf_{word} \left( {Q,d} \right) = \mathop \sum \limits_{i}^{n} IDF_{word} \left( {q_{i} ,d} \right)$$
(9)

where \(n\) represents the number of expanded words in query \(Q\), and \(q_{i}\) represents the morpheme of each expanded word in query \(Q\). The score of an expanded word in document \(d\) is defined by:

$$Score_{word} \left( {Q,d} \right) = \frac{{tf_{word} \left( {Q,d} \right) \times \left( {k_{3} + 1} \right)}}{{tf_{word} \left( {Q,d} \right) + k_{3} \times \left( {1 - b_{2} + b_{2} \times \frac{dwl}{{avgdwl}}} \right)}}$$
(10)

where \(k_{3}\) and \(b_{2}\) are adjustment factors that are usually set empirically, and \(avgdwl\) represents the average length of all word lists.
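Equations (8)–(10) can be sketched as follows. This is an illustrative rendering, with assumed data structures (a set for the document's word list, dictionaries for document frequencies) and illustrative defaults for \(k_3\) and \(b_2\):

```python
import math

def score_word(expanded_terms, doc_wordlist, df_wordlist,
               n_docs_with_wordlist, dwl, avg_dwl, k3=1.2, b2=0.75):
    """Sketch of Eqs. (8)-(10): score of the expanded words against the
    'Word List' (chemical words + MeSH headings + keywords).

    expanded_terms       : expanded morphemes q_i of query Q
    doc_wordlist         : set of terms in the document's word list
    df_wordlist          : dict term -> number of documents containing it
    n_docs_with_wordlist : N, number of documents with dwl > 0
    dwl, avg_dwl         : word-list length of d and its corpus average
    """
    # Eq. (9): sum of Eq. (8) IDF values over the matched expanded words
    tf_word = 0.0
    for q in expanded_terms:
        if q in doc_wordlist:
            n_q = df_wordlist.get(q, 0)
            tf_word += math.log(
                (n_docs_with_wordlist - n_q + 0.5) / (n_q + 0.5))
    # Eq. (10): BM25-style saturation with word-list length normalization
    denom = tf_word + k3 * (1 - b2 + b2 * dwl / avg_dwl)
    return tf_word * (k3 + 1) / denom if denom > 0 else 0.0
```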

The co-word score

The co-word analysis utilizes the co-occurrence of lexical pairs or noun phrases in an article set to determine the relationship between topics in the discipline represented by that set. In this manuscript, the co-word analysis is introduced into the article scoring model for the cases when a disease and a gene co-occur across the abstract, Chemical List, MeSH heading, and Keyword List (as presented in Fig. 2) or co-occur within any single one of these fields (as presented in Fig. 3).

Fig. 2

The cross co-word of abstract, chemical list, MeSH heading, and keyword list

Fig. 3

The co-word of abstract, Chemical List, MeSH heading, and Keyword List

We utilize the IDF value as the co-word score to distinguish the importance of a gene, which can be formulated as:

$$IDF_{gene} \left( {g_{i} ,d} \right) = log\frac{{N - n\left( {g_{i} } \right) + 0.5}}{{n\left( {g_{i} } \right) + 0.5}}$$
(11)

where \(N\) represents the number of documents, and \(n\left( {g_{i} } \right)\) represents the number of documents containing gene morpheme \(g_{i}\).

Finally, the Co-Word score is defined by:

$$Score_{{co}\text{-}{word}} \left( {Q,d} \right) = \mathop \sum \limits_{i}^{n} IDF_{gene} \left( {g_{i} ,d} \right)$$
(12)

where \(n\) is the number of genes in query \(Q\) that co-occur with the disease, and \(g_{i}\) is the morpheme of each such gene.
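Equations (11)–(12) reduce to summing IDF values over the co-occurring genes; a minimal sketch with assumed dictionary inputs:

```python
import math

def score_co_word(genes_with_disease, df_gene, num_docs):
    """Sketch of Eqs. (11)-(12): IDF-weighted co-word score.

    genes_with_disease : morphemes g_i of query genes co-occurring with
                         the disease in the document
    df_gene            : dict gene -> number of documents containing it
    num_docs           : total number of documents N
    """
    score = 0.0
    for g in genes_with_disease:
        n_g = df_gene.get(g, 0)
        # Eq. (11): IDF of gene g_i; rarer genes contribute more
        score += math.log((num_docs - n_g + 0.5) / (n_g + 0.5))
    return score
```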

Retrieval model

We utilize the composite score as the final score for document \(d\) under query \(Q\), which can be formulated as:

$$Score_{composite} \left( {Q,d} \right) = Score_{abstract} \left( {Q,d} \right) + Score_{word} \left( {Q,d} \right) + \alpha \times Score_{{co}\text{-}{word}} \left( {Q,d} \right)$$
(13)
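The composite score of Eq. (13) is a straightforward weighted sum of the three component scores:

```python
def score_composite(s_abstract, s_word, s_co_word, alpha):
    """Eq. (13): final document score under query Q.

    alpha weights the co-word score and is one of the parameters
    later tuned by Cuckoo Search.
    """
    return s_abstract + s_word + alpha * s_co_word
```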

Figure 4 depicts the architecture of the biomedical article retrieval system.

Fig. 4

Architecture of the system retrieving biomedical articles

Parameter optimization

The proposed method has six parameters, \(k_{1}\), \(k_{2}\), \(k_{3}\), \(b_{1}\), \(b_{2}\), and \(\alpha\), and the choice of parameters affects the retrieval results. Various algorithms, e.g., the Genetic Algorithm (GA) [25], the Simulated Annealing (SA) Algorithm [26], and the Ant Colony (AC) Algorithm [27], have been implemented to optimize an objective function. With the continuous effort to develop better algorithms, several new Swarm Intelligence Optimization (SIO) algorithms have emerged in recent years, such as Cuckoo Search (CS) [28], Glow Worm Swarm Optimization (GWSO) [29], and Particle Swarm Optimization (PSO) [30], and these SIO algorithms have been widely utilized.

Cuckoo Search Algorithm

CS is an SIO algorithm proposed by Yang et al. [28] in 2009. Guerrero et al. [31] reported that CS outperformed the GA in terms of efficiency. The idealized rules utilized by CS are:

  1. Each cuckoo lays only one egg at a time and deposits it in a randomly selected parasitic nest.

  2. The best parasitic nests are carried over to the next generation.

  3. The number of available parasitic nests is fixed, and the probability that a host detects an alien egg is \(P_{a} \in \left( {0,1} \right)\).

The cuckoo selects a nest and updates its position according to the rules above. The position-update formula is:

$$X_{i}^{{\left( {t + 1} \right)}} = X_{i}^{\left( t \right)} + T \oplus Levy\left( \lambda \right)$$
(14)

where \(T\) is the step size (\(T > 0\)), \(\oplus\) denotes entry-wise multiplication, and \(Levy\left( \lambda \right)\) is a search path drawn from the Lévy distribution [32, 33]; the full pseudo-code of the algorithm is given in [28].
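The rules above can be sketched as a minimal Cuckoo Search in Python. This is an illustrative reconstruction, not the authors' implementation: the function names, defaults (`n_nests`, `pa`, `t_step`, `iters`), and the Mantegna approximation of the Lévy step are assumptions.

```python
import math
import random

def levy_step(lam=1.5):
    """Draw a Lévy-distributed step via Mantegna's algorithm (lam ~ 1.5)."""
    sigma = (math.gamma(1 + lam) * math.sin(math.pi * lam / 2) /
             (math.gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return random.gauss(0, sigma) / abs(random.gauss(0, 1)) ** (1 / lam)

def cuckoo_search(objective, bounds, n_nests=15, pa=0.25, t_step=0.01, iters=100):
    """Minimal sketch of Cuckoo Search [28], maximizing `objective`.

    objective : maps a parameter vector to a fitness value
    bounds    : list of (low, high) per dimension
    """
    rnd = lambda: [random.uniform(lo, hi) for lo, hi in bounds]
    clip = lambda x: [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]
    nests = [rnd() for _ in range(n_nests)]
    fit = [objective(x) for x in nests]
    for _ in range(iters):
        for i in range(n_nests):
            # Eq. (14): propose a new egg via a Levy-flight move
            cand = clip([x + t_step * levy_step() for x in nests[i]])
            f = objective(cand)
            j = random.randrange(n_nests)
            if f > fit[j]:                      # replace a random worse nest
                nests[j], fit[j] = cand, f
            # abandon discovered nests with probability pa (keep the best)
            if random.random() < pa and fit[i] < max(fit):
                nests[i] = rnd()
                fit[i] = objective(nests[i])
    best = max(range(n_nests), key=lambda i: fit[i])
    return nests[best], fit[best]
```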

Objective function

Precision is calculated as:

$$Precision = \frac{RR}{{\left( {RR + RN} \right)}}$$
(15)

where \(RR\) and \(RN\) refer to relevant and irrelevant documents retrieved, respectively.

Then, \(P@10\) is defined as the Precision when \(RR + RN = 10\). Hence, the average \(P@10\) can be formulated as:

$$Avg_{P@10} = \frac{{\mathop \sum \nolimits_{t = 1}^{n} P@10\left( t \right)}}{n}$$
(16)

where \(P@10\left( t \right)\) represents the \(P@10\) value of the \(t\)th topic among \(n\) topics.
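Equations (15)–(16) specialize to a short computation over ranked relevance labels; a sketch with an assumed 0/1 label representation:

```python
def p_at_10(ranked_rel):
    """P@10 (Eq. 15 with RR + RN = 10): fraction of relevant documents
    among the first ten retrieved. ranked_rel is a 0/1 relevance list
    in ranked order."""
    return sum(ranked_rel[:10]) / 10

def avg_p_at_10(per_topic):
    """Eq. (16): average P@10 over the n topics."""
    return sum(p_at_10(r) for r in per_topic) / len(per_topic)
```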

The Normalized Discounted Cumulative Gain (nDCG) is a commonly utilized index to assess the quality of ranking in information retrieval. Let \(\vartheta\) denote the relevance grade, and \(gain\left( \vartheta \right)\) denote the gain associated with \(\vartheta\). Also, assume that \(g_{1} ,g_{2} , \ldots ,g_{Z}\) are the gain values associated with the \(Z\) documents retrieved by a system in response to query \(q\), such that \(g_{i} = gain\left( \vartheta \right)\) if the relevance grade of the document at rank \(i\) is \(\vartheta\). Hence, the nDCG value for this system can be calculated as:

$$nDCG = \frac{DCG}{{DCG_{I} }}, \quad {\text{where}}\,\, DCG = \mathop \sum \limits_{i = 1}^{Z} \frac{{g_{i} }}{{{\text{log}}\left( {i + 1} \right)}}$$
(17)

and \(DCG_{I}\) denotes the \(DCG\) value for an ideal ranked list for query \(q\).

We define the average \(nDCG\) as follows:

$$Avg_{nDCG} = \frac{{\mathop \sum \nolimits_{t = 1}^{n} nDCG\left( t \right)}}{n}$$
(18)

where \(nDCG\left( t \right)\) represents the \(nDCG\) value of the \(t\)th topic among \(n\) topics.
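Equations (17)–(18) can be sketched as follows; the gain lists are an assumed representation (one gain value per ranked document):

```python
import math

def dcg(gains):
    """DCG of Eq. (17): sum of g_i / log(i + 1), ranks starting at i = 1."""
    return sum(g / math.log(i + 1) for i, g in enumerate(gains, start=1))

def ndcg(gains, ideal_gains):
    """Eq. (17): DCG of the system ranking over DCG of the ideal ranking."""
    return dcg(gains) / dcg(ideal_gains)

def avg_ndcg(per_topic):
    """Eq. (18): average nDCG over topics.

    per_topic is a list of (gains, ideal_gains) pairs, one per topic."""
    return sum(ndcg(g, ig) for g, ig in per_topic) / len(per_topic)
```

A perfect ranking yields nDCG = 1; swapping a relevant document to a lower rank discounts its gain by the log of the rank.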

Algorithm flow

Since \(k_{2}\) has a fixed value (\(k_{2} = 1\)), we use \(k_{1}\), \(k_{3}\), \(b_{1}\), \(b_{2}\), and \(\alpha\) as input parameters. The algorithm first generates the initial population and sets the maximum number of iterations or a stopping criterion. If the iteration limit is reached or the stopping criterion is met, the algorithm ends and returns the best solution found; otherwise, it performs a series of operations to improve the objective function. This manuscript defines \(Avg_{P@10} + Avg_{nDCG}\) as the objective function and employs the 2017 Precision Medicine dataset as the training set to optimize the parameters. Figure 5 presents the flowchart of the proposed algorithm.

Fig. 5

Flowchart of the proposed algorithm

Experimental results and comparison

Dataset

The data used in this work were sourced from the medical articles of the 2017, 2018, and 2019 TREC Precision Medicine tracks, available at http://www.trec-cds.org/2017.html, http://www.trec-cds.org/2018.html, and http://www.trec-cds.org/2019.html, respectively. Each article is formatted in XML. The relevance assessments of the articles were obtained from https://trec.nist.gov/data/precmed/qrels-final-abstracts.txt, https://trec.nist.gov/data/precmed/qrels-treceval-abstracts-2018-v2.txt, and https://trec.nist.gov/data/precmed/qrels-treceval-abstracts.2019.txt.

Due to the semi-structured nature of the XML format, we used MongoDB as the database for document storage and Python as the programming language. All the code can be found on the corresponding author’s GitHub (https://github.com/Bruce-V/CS-BM25).
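Extracting the fields the model uses from a MEDLINE citation can be sketched with the standard library's XML parser. The element names (`AbstractText`, `DescriptorName`, `NameOfSubstance`, `Keyword`) follow the MEDLINE DTD, but this simplified parser is an illustration, not the repository's code; the resulting dictionary could then be inserted into MongoDB (e.g., via pymongo):

```python
import xml.etree.ElementTree as ET

def parse_medline(xml_text):
    """Sketch: pull the abstract, MeSH headings, chemical list, and
    keywords out of one MEDLINE citation record."""
    root = ET.fromstring(xml_text)
    grab = lambda tag: [e.text for e in root.iter(tag) if e.text]
    return {
        "abstract": " ".join(grab("AbstractText")),
        "mesh": grab("DescriptorName"),
        "chemicals": grab("NameOfSubstance"),
        "keywords": grab("Keyword"),
    }
```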

Parameter setting

Table 5 presents the parameter values used in the proposed algorithm.

Table 5 The parameter settings of the Cuckoo Search Algorithm

Experimental results

In Table 6, “Normal” refers to the empirical parameter values, while “CS” refers to the parameters trained on the 2017 dataset; each retrieval model returns the 1000 highest-scoring documents.

Table 6 The results of the Cuckoo Search Algorithm

Comparing the three years of data, the optimized parameters outperform the empirical ones. Users of an information retrieval system want related documents to appear early in the ranking, so infNDCG and P@10 are two important indicators for assessing retrieval performance. Parameters optimized with both nDCG and P@10 increase the weight of the word list, which includes expanded information about age, gender, and genes and is crucial for distinguishing relevant literature from irrelevant literature. In conclusion, different parameter settings can be chosen to meet the needs of various users.

In Figs. 6, 7, and 8, \(RR\) represents the relevant documents that contain co-words, while \(RN\) represents the remaining relevant documents. The figures show that many relevant documents contain both a disease and a gene. Accordingly, we define the average relevant-document coverage rate as:

$$Avg_{cov} = \frac{1}{n}\mathop \sum \limits_{t = 1}^{n} \frac{{relevance\,in\,co{\text{-}}word}}{{relevance}}$$
(19)
Fig. 6

2017 TREC Precision Medicine relevance in co-word biomedical articles

Fig. 7

2018 TREC Precision Medicine relevance in co-word biomedical articles

Fig. 8

2019 TREC Precision Medicine relevance in co-word biomedical articles

The average coverage rate of the 30 topics in 2017 is 52.9%, compared with 74.13% for the 50 topics in 2018 and 54.4% for the 40 topics in 2019. These outcomes reveal that the co-word analysis has a strong positive impact on the retrieval of relevant documents and greatly reduces the scope of the search.

When information retrieval is a concern, \(RR\), \(RN\), \(NR\), and \(NN\) respectively represent the relevant documents that are retrieved, the irrelevant documents that are retrieved, the relevant documents that are not retrieved, and the irrelevant documents that are not retrieved. Here, Precision is defined as in formula (15).

Recall is defined as:

$$Recall = \frac{RR}{{\left( {RR + NR} \right)}}$$
(20)

and F1-score is defined as:

$$F1{ - }score = 2 \times \frac{Precision \times Recall}{{Precision + Recall}}$$
(21)
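Formulas (15), (20), and (21) combine into a short computation over the retrieval counts:

```python
def precision_recall_f1(rr, rn, nr):
    """Eqs. (15), (20), (21) from the counts RR (retrieved relevant),
    RN (retrieved irrelevant), and NR (relevant not retrieved)."""
    precision = rr / (rr + rn)                                  # Eq. (15)
    recall = rr / (rr + nr)                                     # Eq. (20)
    f1 = 2 * precision * recall / (precision + recall)          # Eq. (21)
    return precision, recall, f1
```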

As seen in Figs. 9, 10, and 11, the optimized parameters outperform the empirical ones on both P@10 and infNDCG. Since we use the 2017 Precision Medicine dataset as the training set, the optimization effect is most pronounced on that dataset. On the 2018 and 2019 test data, both P@10 and infNDCG improve, but R-predicted declines. This happens because the adopted objective function improves the ranking of the most relevant documents, and Precision and Recall are inversely related in a retrieval system. However, for retrieving biomedical articles we care more about precision, to ease doctors' decision-making.

Fig. 9

Comparison of 2017 TREC Precision Medicine indicators

Fig. 10

Comparison of 2018 TREC Precision Medicine indicators

Fig. 11

Comparison of 2019 TREC Precision Medicine indicators

Experimental comparison

Considering the results of the models taken from 2017, 2018, and 2019 TREC Precision Medicine, three indicators called infNDCG, R-predicted, and P@10 are selected for comparison, and the experimental results are presented in Tables 7, 8, and 9.

Table 7 Experimental comparison of methods published in 2017 TREC Precision Medicine
Table 8 Experimental comparison of methods published in 2018 TREC Precision Medicine
Table 9 Experimental comparison of methods published in 2019 TREC Precision Medicine

We use three years of TREC datasets to verify our experimental results; the selected baselines all utilize the BM25 algorithm or an improved version of it. On the 2017 dataset, our proposed method showed a significant improvement on all indicators. On the 2018 dataset, it performed best for P@10 and ranked second for infNDCG; the same held for the 2019 dataset.

Conclusion

This manuscript proposes a BM25-based method incorporating co-word analysis to retrieve biomedical articles. We improved the BM25 algorithm to compute the scores of expanded words and combined them with a co-word score that weights gene occurrences. We then utilized the Cuckoo Search Algorithm to optimize the parameters against an objective function combining P@10 and nDCG. The optimization results suggest that increasing the score weight of the “word list” effectively improves the ranking of related documents. The manuscript also discusses the influence of different parameters on the retrieval algorithm and presents parameter settings that meet different retrieval needs. Although the proposed algorithm is based on an improved version of BM25, it highlights general rules for tuning the parameters of the BM25 algorithm, which were verified through numerous experiments. Since the query expansion used in this manuscript is simple, our future research will focus on adopting more linked data and investigating the use of topic data.

Availability of data and materials

The data used in this work were sourced from the medical articles of the 2017, 2018, and 2019 TREC Precision Medicine tracks, available at http://www.trec-cds.org/2017.html, http://www.trec-cds.org/2018.html, and http://www.trec-cds.org/2019.html, respectively. Each article is formatted in XML. The relevance assessments of the articles were obtained from https://trec.nist.gov/data/precmed/qrels-final-abstracts.txt, https://trec.nist.gov/data/precmed/qrels-treceval-abstracts-2018-v2.txt, and https://trec.nist.gov/data/precmed/qrels-treceval-abstracts.2019.txt. All the code can be found on the corresponding author's GitHub (https://github.com/Bruce-V/CS-BM25).

Abbreviations

TREC: Text Retrieval Conference
CDS: Clinical Decision Support
IR: Information Retrieval
SVM: Support Vector Machines
UMLS: Unified Medical Language System
NDCG: Normalized Discounted Cumulative Gain
GA: Genetic Algorithm
SA: Simulated Annealing
AC: Ant Colony
SIO: Swarm Intelligence Optimization
CS: Cuckoo Search
GWSO: Glow Worm Swarm Optimization
PSO: Particle Swarm Optimization


Acknowledgements

Not applicable.

Funding

No funding was obtained for this study.

Author information

Affiliations

Authors

Contributions

This article was completed independently by ZZ. The author read and approved the final manuscript.

Corresponding author

Correspondence to Zicheng Zhang.

Ethics declarations

Ethics approval and consent to participate

On behalf of, and having obtained permission from all authors, I declare that: the material has not been published in whole or in part elsewhere; the paper is not currently being considered for publication elsewhere; all authors have been actively involved in substantial work leading to the submitted version, and will hold themselves jointly and individually responsible for its content; all relevant ethical safeguards have been met concerning patient or subject protection, or animal experimentation. I testify to the accuracy of the above on behalf of all authors. The datasets in this manuscript are public datasets and do not require any administrative permissions.

Consent for publication

This article uses publicly available datasets.

Competing interests

All authors declare that: (i) no support, financial or otherwise, has been received from any organization that may have an interest in the submitted work; and (ii) there are no other relationships or activities that could appear to have influenced the submitted work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Zhang, Z. An improved BM25 algorithm for clinical decision support in Precision Medicine based on co-word analysis and Cuckoo Search. BMC Med Inform Decis Mak 21, 81 (2021). https://doi.org/10.1186/s12911-021-01454-5

Keywords

  • Clinical decision support
  • Precision Medicine
  • Information retrieval
  • Co-word analysis
  • Improved BM25
  • Cuckoo Search