AMFormulaS: an intelligent retrieval system for traditional Chinese medicine formulas

Abstract

Background

Formulas are an important means by which traditional Chinese medicine (TCM) treats disease, and they have great research significance. Many formula databases exist, but efficient access to rich information is difficult because the databases are small in scale and lack an intelligent search engine.

Methods

We selected 38,000 formulas from a semi-structured database, then segmented the text, extracted information, and standardized terms. We then constructed an ontology-based structured formula database and an intelligent retrieval engine that relies on the calculated weight of each decoction piece in a formula.

Results

The intelligent retrieval system, named AMFormulaS (Ancient and Modern Formula System), was built on the structured database, the ontology, and the intelligent retrieval engine. It supports retrieval and statistical analysis of formulas and decoction pieces.

Conclusions

AMFormulaS is a large-scale intelligent retrieval system that combines a large volume of formula data, an efficient information extraction system, and a search engine. It provides users with efficient retrieval and comprehensive data support. In addition, the system's statistical analysis can suggest new research ideas and support patent review as well as new drug research and development.

Background

Traditional Chinese medicine (TCM) is a science that studies human life, health, and disease, and it summarizes the valuable experience the Chinese nation has gained through long-term survival and practice. TCM doctors treat patients based on syndrome differentiation by looking, listening, questioning, and feeling the pulse, and then prescribe TCM formulas to achieve a therapeutic effect. The formula is therefore an important therapeutic concept in TCM and a persistent research hotspot. Research directions include the theory of formula compatibility [1,2,3,4], the dosimetry of formulas [5, 6], and the relationship between formulas and diseases [7,8,9].

However, knowledge of formulas is mainly recorded in diverse TCM books, which makes it difficult to retrieve and acquire. Integrating formula information and constructing a database can therefore greatly improve retrieval efficiency and make knowledge acquisition and use more convenient for researchers and clinicians. Several formula databases already exist. Guo et al. constructed a formula knowledge graph that presents knowledge as node-relation-node triples covering traditional Chinese medicines, dosage, processing, efficacy, and so on [10]. Min He et al. constructed a traditional Chinese medicine database in the form of nodes and properties, covering Chinese medicines, original plants, and bioactive components, with search and display functions [11]. Both databases store and display formula information in node-edge-node form. This form, however, can show only the most important information as terms rather than complete sentences, which might result in information loss. Shen et al. structured data on marketed proprietary Chinese medicines and built a Chinese patent medicine database that integrates patent medicine information, but this information is not sufficient for academic and clinical research [12]. Ruichao Xue et al. constructed a traditional Chinese medicine integrative database that links traditional Chinese medicine with western medicine and includes TCM knowledge such as formulas, TCM drugs, and herbal ingredients. That database focuses mainly on herb molecular mechanism analysis and does not meet the needs of other formula research [13].

The above databases offer basic functions such as information retrieval and knowledge display, but they have limitations: they are small in scale and do not give access to the original information. Meanwhile, with the development of computer technology, users tend to prefer rich, strongly correlated retrieval results in practice. To improve the efficiency of retrieval and use, and to enable knowledge mining and discovery for formulas, we propose an intelligent retrieval system named AMFormulaS (Ancient and Modern Formula System, 古今方药系统 in Chinese). It is based on a database containing a large number of formulas and their related information, such as formula name, composition, and dosage of each Chinese medicine. The system can also efficiently extract formula information from text data to extend the database. In addition, we propose a method for calculating the weight of the drugs in a formula to improve retrieval efficiency; this method is the core of the intelligent search engine.

Methods

In this study, we first constructed an automatic standardization system that embeds word segmentation packages and a term dictionary, and used it to process the semi-structured data into structured, standardized formula records. A structured formula database was designed using an ontology modeling method. We also designed and implemented an algorithm that calculates the weight of the drugs in a formula to improve retrieval efficiency. Lastly, the intelligent formula retrieval system AMFormulaS was implemented (see the pipeline in Fig. 1).

Fig. 1

The construction pipeline of AMFormulaS

Data sources

In this study, the data for AMFormulaS were selected from the formula database maintained by the Institute of Information of Chinese medicine, Chinese academy of traditional Chinese medicine. It is a semi-structured database of 85,989 ancient and modern Chinese medicine formulas drawn from more than 710 ancient books and modern literature. Considering the difference between ancient and modern medication habits, we took modern medication habits as the reference standard and selected 38,000 formulas for which the medicinal source of the components can be found or ascertained at present.

Data processing

Word segmentation

The formula database involves a large amount of data, so computer technology is needed to process it efficiently. Considering the heavy workload and our existing research, we adopted an information extraction solution that integrates a word segmentation algorithm with large-scale terminology.

Ancient Chinese medicine texts are distinctive in grammar, expression, and professional terminology, so the word segmentation standard had to be investigated and determined first. In one of our previous studies [14], we constructed a corpus for training the word segmentation algorithm. Following the classification method of traditional Chinese medicine philology [15], we first selected 30 ancient TCM books of the Qing Dynasty covering 10 categories: materia medica, formulas, febrile diseases, internal medicine, surgery, gynecology, pediatrics, facial features, acupuncture and massage, and medical cases. From these books we manually selected 150 pieces of raw corpus containing 1705 sentences and 88,889 words to train the model. The selected corpus was then tagged manually, using TCM teaching material for higher education students [16], TCM reference books [17], and TCM-related standards [18, 19] as terminology sources. From this work we preliminarily summarized a word segmentation standard for ancient TCM books: existing facts and semantic changes are the primary principle, while part-of-speech grammar and semantic type are also considered. Text was segmented under this principle into 17 semantic types: physiology, symptom, syndrome, pathological factors, pathological products, efficacy, method of treatment, channel meridian and acupuncture points, four diagnostic methods, traditional Chinese drug, prescription, nature and flavor, toxicity, processing, contraindications, decoction method, and proprietary words in Chinese medicine.

After manually labeling the training set, we trained a word segmentation model based on the capsule network algorithm [20]. Compared with other algorithms, the capsule network performed well on word segmentation in ancient traditional Chinese medicine literature, so the capsule network model was used for word segmentation.
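To illustrate how a term dictionary can guide segmentation, the sketch below uses forward maximum matching, a much simpler dictionary-based technique. It is not the capsule network model used in this study, and the mini-lexicon and the function name `forward_max_match` are hypothetical.

```python
def forward_max_match(text, lexicon, max_len=None):
    """Segment `text` greedily, preferring the longest lexicon term at each position."""
    if max_len is None:
        max_len = max((len(term) for term in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink; a single character
        # is always accepted so that unknown text still gets segmented.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-lexicon of drug names (a real system would load a large term dictionary).
LEXICON = {"炙甘草", "甘草", "人参", "白术"}

tokens = forward_max_match("人参白术炙甘草", LEXICON)
print(tokens)
```

Because the longest match wins, "炙甘草" (processed licorice) is kept as one term instead of being split into "炙" and "甘草". A learned model such as the capsule network can additionally resolve ambiguities that a purely greedy dictionary lookup cannot.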

Information extraction and standardization

Because the data structures are heterogeneous and formula information lacks standards, we built a system, named the automatic standardization system of formulas in successive dynasties, to extract and standardize formula information under the guidance of the word segmentation standard and algorithm described above. The system first identifies, extracts, and standardizes formula information; the processed data is then submitted to the formula database after manual verification. The extracted and standardized content includes the name, composition, source, and formation year of each formula, the dose of each decoction piece, the processing method of natural crude Chinese medicines, etc. [21]. For example, the formula Tiefen pellet (铁粉丸 in Chinese) comes from the book You you xin shu (《幼幼新书》 in Chinese), written in the Song Dynasty. The system transforms the text into structured data: it first recognizes formula information in the text, such as the formula name, the traditional Chinese medicines, and the dosage, and then extracts and standardizes it. For instance, one component is recorded as “Shehuang (蛇黄 in Chinese)” in the original text; the system identifies it and normalizes it to “Shehanshi (蛇含石 in Chinese)”. The dosage and the measuring unit can also be normalized (as shown in Fig. 2).
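The normalization step can be sketched as a lookup against an alias table plus a small parser for dose and unit. This is a minimal illustration under stated assumptions, not the system's actual implementation: the alias table holds only the Shehuang example from the text, the unit labels and the single-character numeral table are hypothetical, and the dose in the usage example is made up.

```python
import re
from dataclasses import dataclass

# Variant name -> standard name (only the example given in the text).
ALIASES = {"蛇黄": "蛇含石"}
# Hypothetical standard labels for measuring units; no conversion is attempted.
UNITS = {"两": "liang", "钱": "qian", "克": "g", "g": "g"}
# Hypothetical table covering single-character Chinese numerals only.
NUMERALS = {"半": 0.5, "一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
            "六": 6, "七": 7, "八": 8, "九": 9, "十": 10}

# name (lazy) + amount (Chinese numeral or Arabic number) + unit
ENTRY = re.compile(
    r"^(?P<name>\D+?)"
    r"(?P<amount>[半一二三四五六七八九十]|\d+(?:\.\d+)?)"
    r"(?P<unit>两|钱|克|g)$"
)


@dataclass
class Component:
    name: str
    amount: float
    unit: str


def normalize_component(raw: str) -> Component:
    """Parse one raw component entry and map name and unit to standard forms."""
    match = ENTRY.match(raw.strip())
    if match is None:
        raise ValueError(f"unrecognized component entry: {raw!r}")
    name = ALIASES.get(match["name"], match["name"])
    amount_text = match["amount"]
    amount = NUMERALS[amount_text] if amount_text in NUMERALS else float(amount_text)
    return Component(name, float(amount), UNITS[match["unit"]])


# The dose below is invented for illustration, not taken from the original book.
print(normalize_component("蛇黄一两"))
```

Entries that the pattern does not recognize raise an error instead of being guessed, which matches the manual verification step described above.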

Fig. 2

The information extraction and standardization of Tiefen pellet

Design of formula database

Formula information is varied, so the contents of the formula database should be complete and well organized, including formula name, source, formation date, author, and composition. Building on our former research on ontology-based modeling [22, 23], we analyzed and determined the concepts, relations, and properties, and then designed the schema of the formula database using the conceptual modeling method of ontology and authoritative TCM references such as the Pharmacopoeia of the People's Republic of China (part 1) [24], Coding Rules and Codes of Traditional Chinese Medicine [25], Chinese materia medica [26], and the Dictionary of traditional Chinese medicine [27]. The entities of the ontology model contain information about the formula (name, source, author, subordinate department, effect, and nature, flavor and channel tropism) and information about Chinese medicine (medicinal name, medicinal source, effect, Chinese patent medicine, decoction pieces, and nature, flavor and channel tropism). The core concept graph of the formula database is shown in Fig. 3.

Fig. 3

Conceptual data model of formula database
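As a rough illustration of the entities listed above, the conceptual model could be expressed with a few record types. The class and field names below are hypothetical and chosen for readability; the actual schema in Fig. 3 may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DecoctionPiece:
    """Information about a Chinese medicine (hypothetical field names)."""
    name: str
    medicinal_source: Optional[str] = None
    effects: List[str] = field(default_factory=list)
    nature_flavor_tropism: List[str] = field(default_factory=list)


@dataclass
class Ingredient:
    """Relation between a formula and one of its decoction pieces."""
    piece: DecoctionPiece
    dose: Optional[float] = None
    unit: Optional[str] = None
    processing: Optional[str] = None


@dataclass
class Formula:
    """Information about a formula (hypothetical field names)."""
    name: str
    source: Optional[str] = None
    author: Optional[str] = None
    department: Optional[str] = None
    formation_period: Optional[str] = None
    effects: List[str] = field(default_factory=list)
    ingredients: List[Ingredient] = field(default_factory=list)
    original_text: Optional[str] = None


# Only facts stated in the text are filled in; everything else stays empty.
tiefen = Formula(
    name="Tiefen pellet",
    source="You you xin shu",
    formation_period="Song Dynasty",
    ingredients=[Ingredient(DecoctionPiece("Shehanshi"))],
)
print([item.piece.name for item in tiefen.ingredients])
```

Keeping `Ingredient` as its own type lets dose, unit, and processing method attach to the formula-piece relation rather than to the piece itself, since the same decoction piece is used at different doses in different formulas.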

Implementation of the intelligent retrieval system

Some existing formula databases and retrieval systems integrate formula information, but most support only full-text or keyword retrieval. Formulas, however, are composed according to the TCM theory of Monarch, Minister, Assistant and Guide (君臣佐使 in Chinese). When searching by drug, users prefer results in which the searched herb plays an important role in the composition. For example, on entering ‘processed licorice (炙甘草 in Chinese)’, users usually expect formulas in which processed licorice plays the Monarch role. We therefore proposed a method for calculating the weight of the drugs in a formula, so that results can be sorted by the importance of the decoction pieces in the formula or by the formation time of the formula [28]. Three factors were used to calculate the weight of a drug in the formula, and a multiple linear regression then combined them:

  1. Whether the decoction piece is part of the name of the prescription, determined by string matching;

  2. The relative dose of the traditional Chinese medicine. The relative dose was calculated as:

    $$f(o|S) = \frac{1}{n}\sum\limits_{i = 1}^{n} G_{h}(t,t_{i}) = \frac{1}{n\sqrt{2\pi}}\sum\limits_{i = 1}^{n} e^{-\frac{d(t,t_{i})^{2}}{2h^{2}}}$$
    (1)

    where \(d(t,t_{i})^{2}\) represents the dose-distance between tuples \(t\) and \(t_{i}\), \(G_{h}(t,t_{i}) = \frac{1}{\sqrt{2\pi}}e^{-\frac{d(t,t_{i})^{2}}{2h^{2}}}\) is the Gaussian kernel function, and \(n\) represents the number of different doses (or dose intervals) in \(T\).

  3. Whether the decoction piece is commonly used. The weight was calculated as:

    $$w(t) = \log (n/f(t))$$
    (2)

    where \(w(t)\) is the weight of decoction piece \(t\), \(n\) is the number of all different decoction pieces in set \(S\), and \(f(t)\) is the number of formulas containing the specified decoction piece \(t\).

  4. Multiple linear regression was used to calculate the optimal parameters:

    $$y = w_{0} + x_{1} w_{1} + \cdots + x_{i} w_{i} + \cdots + x_{n} w_{n}$$
    (3)

    where \(x_{1}\) is whether the drug is commonly used, that is, the occurrence frequency of the drug; \(x_{2}\) is whether the drug appears in the formula name; and \(x_{3}\) is the ratio of the dose used to the general dose of the drug.
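The three factors above can be sketched in a few lines. This is an illustration under stated assumptions rather than the patented method [28]: the function names are hypothetical, \(d(t,t_{i})\) is taken to be the absolute difference between two doses, the bandwidth \(h\) defaults to 1, and the toy formulas are made up.

```python
import math


def name_match(piece, formula_name):
    """Factor (1): 1.0 if the piece name occurs in the formula name, else 0.0."""
    return 1.0 if piece in formula_name else 0.0


def dose_density(dose, observed_doses, bandwidth=1.0):
    """Factor (2), following Eq. (1) as printed.

    Assumption: d(t, t_i) is the absolute difference between two doses,
    and `bandwidth` plays the role of h.
    """
    if not observed_doses:
        raise ValueError("at least one observed dose is required")
    kernel_sum = sum(
        math.exp(-((dose - other) ** 2) / (2 * bandwidth ** 2))
        for other in observed_doses
    )
    return kernel_sum / (len(observed_doses) * math.sqrt(2 * math.pi))


def rarity_weight(piece, formulas):
    """Factor (3), Eq. (2): log(n / f(t)).

    n is the number of distinct pieces over all formulas and f(t) is the
    number of formulas that contain `piece`.
    """
    distinct_pieces = set().union(*formulas)
    containing = sum(1 for pieces in formulas if piece in pieces)
    if containing == 0:
        raise ValueError(f"{piece!r} does not occur in any formula")
    return math.log(len(distinct_pieces) / containing)


# Made-up toy data: each formula is a set of decoction piece names.
FORMULAS = [
    {"ginseng", "largehead atractylodes rhizome", "Indian bread", "processed licorice"},
    {"ginseng", "largehead atractylodes rhizome"},
    {"processed licorice", "cassia twig"},
]

print(name_match("Suzi", "Suzi Decoction"))
print(dose_density(9.0, [3.0, 9.0, 9.0]))
print(rarity_weight("cassia twig", FORMULAS))
```

A piece that occurs in few formulas receives a larger weight from Eq. (2) than one that occurs in many, and a dose close to the commonly observed doses receives a higher density from Eq. (1).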

Based on a training data set of 400 records (part of the experimental results are shown in Table 1), the obtained training parameters were:

$$w_{1} = 52.8231,\; w_{2} = 0.8773,\; w_{3} = 0.0470,\; w_{0} = 3.0705$$

Table 1 Part of the experimental results of parameter calculation

The standard deviation between the predicted result and the labeled result was 0.719, and the error was within the acceptable range. The algorithm was then applied to the intelligent formula search engine.
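With the reported parameters, Eq. (3) reduces to a weighted sum that can be used as a sort key. The sketch below only illustrates that scoring step; the text does not state how the three features were scaled, so the feature values and formula labels are made up, and the helper names are hypothetical.

```python
# Parameters reported above for Eq. (3).
W0 = 3.0705
W1, W2, W3 = 52.8231, 0.8773, 0.0470


def importance(x1, x2, x3):
    """Linear score y = w0 + w1*x1 + w2*x2 + w3*x3 from Eq. (3).

    x1: commonness of the drug, x2: name match, x3: dose ratio.
    """
    return W0 + W1 * x1 + W2 * x2 + W3 * x3


def rank_formulas(candidates):
    """Sort (formula_name, (x1, x2, x3)) pairs by descending importance score."""
    return sorted(candidates, key=lambda item: importance(*item[1]), reverse=True)


# Made-up feature values: the two formulas differ only in the name-match feature.
candidates = [
    ("Formula A", (0.01, 0.0, 1.0)),
    ("Formula B", (0.01, 1.0, 1.0)),
]
ranked = rank_formulas(candidates)
print([name for name, _ in ranked])
```

Here Formula B ranks first because the searched piece appears in its name, which adds \(w_{2}\) to its score while the other two features are equal.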

In addition, other retrieval functions were implemented, including:

  1. Full-text retrieval: the search terms are matched against the index base to perform a global keyword search.

  2. Precise retrieval: search terms of different semantic types are used to achieve precise retrieval, including:

    (a) by decoction pieces

    (b) by Chinese crude drug

    (c) by the creation time of the formula

    (d) by the department of the formula

    (e) by the classification of the formula efficacy

    (f) by the nature, flavor, and meridian tropism of the formula

    (g) combination of full-text and precise retrieval: by keywords and semantic entries
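The retrieval modes listed above can be illustrated with a small in-memory search that combines an optional keyword with field filters. This is a simplified sketch: the real system queries an index base and a MySQL database, and the records, field names, and the `search` helper here are hypothetical.

```python
def field_matches(value, wanted):
    """Multi-valued fields match by membership, single-valued fields by equality."""
    if isinstance(value, (list, tuple, set)):
        return wanted in value
    return value == wanted


def search(records, keyword=None, **filters):
    """Return records whose text contains `keyword` and whose fields satisfy all filters."""
    hits = []
    for record in records:
        if keyword is not None and keyword not in record["text"]:
            continue
        if all(field_matches(record.get(name), wanted) for name, wanted in filters.items()):
            hits.append(record)
    return hits


# Made-up records for illustration only.
RECORDS = [
    {"name": "Formula A", "pieces": ["ginseng", "processed licorice"],
     "period": "Song", "department": "internal medicine",
     "text": "Formula A: ginseng, processed licorice"},
    {"name": "Formula B", "pieces": ["ginseng"],
     "period": "Qing", "department": "pediatrics",
     "text": "Formula B: ginseng"},
]

full_text = search(RECORDS, keyword="ginseng")                 # full-text retrieval
precise = search(RECORDS, pieces="processed licorice")         # precise retrieval
combined = search(RECORDS, keyword="ginseng", period="Qing")   # combined retrieval
print([r["name"] for r in combined])
```

Passing only a keyword corresponds to full-text retrieval, passing only filters corresponds to precise retrieval, and passing both corresponds to the combined mode.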

Results

AMFormulaS was developed on a browser/server (B/S) architecture with Java and MySQL 5.7. It is composed of modules for information retrieval of formulas, decoction pieces, and Chinese crude drugs, and for statistical analysis and visualization (the home page of the retrieval system is shown in Fig. 4). On the search results page, users can browse not only the specific information of formulas and the related information of decoction pieces and decoction piece combinations, but also global statistics on decoction pieces and formulas across the whole database. Users can search for relevant information according to their needs and select the appropriate presentation page: formula retrieval, decoction pieces retrieval, decoction pieces combination retrieval, or Dashboard.

Fig. 4

The home page of AMFormulaS

Formula retrieval

When the name of a formula is entered, the page displays the formula's information, including its ID, composition, efficacy, nature, flavor and channel tropism, department, source, and formation time, as well as the original text of the formula. A formula is made up of decoction pieces, and adding or removing drugs leads to continuous changes in the formula, such as its name and efficacy. Taking Suzi Decoction as an example, the retrieval results are shown in Fig. 5. The efficacy relation and the graph of the addition or subtraction of component drugs are shown in Fig. 6.

Fig. 5

The retrieval results of Suzi Decoction

Fig. 6

The efficacy relation and the graph of addition or subtraction of drugs on Suzi Decoction

Decoction pieces retrieval

When the name of a decoction piece is entered, the system returns its basic information, the time distribution of formulas containing it, and the frequency with which its various dosages are used. Taking ginseng as an example, the retrieval results are shown in Fig. 7.

Fig. 7

The decoction pieces retrieval results of ginseng
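The two statistics returned for a decoction piece, the time distribution of formulas and the frequency of each dosage, amount to simple counting over the structured records. The sketch below assumes a hypothetical record layout and uses made-up data.

```python
from collections import Counter


def piece_statistics(records, piece):
    """Count formulas per period and occurrences per dose for one decoction piece."""
    by_period = Counter()
    by_dose = Counter()
    for record in records:
        for name, dose in record["ingredients"]:
            if name == piece:
                by_period[record["period"]] += 1
                by_dose[dose] += 1
                break  # count each formula once, using its first matching entry
    return by_period, by_dose


# Made-up records: (piece name, dose) pairs with an invented period label.
RECORDS = [
    {"period": "Song", "ingredients": [("ginseng", 9), ("Indian bread", 9)]},
    {"period": "Song", "ingredients": [("ginseng", 6)]},
    {"period": "Qing", "ingredients": [("ginseng", 9)]},
    {"period": "Qing", "ingredients": [("Indian bread", 12)]},
]

by_period, by_dose = piece_statistics(RECORDS, "ginseng")
print(by_period)
print(by_dose)
```

In a database-backed system the same counts would typically come from grouped queries; the in-memory version only shows what is being counted.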

Retrieval of decoction pieces combination

The system supports queries on combinations of decoction pieces. When the names of decoction pieces are entered, the system displays the basic information of the combination, such as clinical application, indications, action classification, efficacy, and compatibility (as shown in Fig. 8).

Fig. 8

The basic information about the combination of ginseng and largehead atractylodes rhizome

Figure 9 shows rich information about the relations between formulas and the combination of ginseng (人参 in Chinese) and largehead atractylodes rhizome (白术 in Chinese). For example: (1) ginseng and largehead atractylodes rhizome appear together in 3,047 formulas; (2) ginseng, largehead atractylodes rhizome and Indian bread (茯苓 in Chinese) appear together in 1,665 formulas; and (3) ginseng, largehead atractylodes rhizome and dried tangerine peel (陈皮 in Chinese) appear together in 1,143 formulas.

Fig. 9

The compatibility and frequency of ginseng and largehead atractylodes rhizome
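Counts of this kind can be obtained by testing each formula for the queried combination and tallying the remaining pieces. The following is a minimal sketch with made-up formulas; the `cooccurrence` helper is hypothetical and the toy counts are unrelated to the figures reported above.

```python
from collections import Counter


def cooccurrence(formulas, combination):
    """Return the number of formulas containing every piece in `combination`,
    plus a Counter of the other pieces found in those formulas."""
    combo = set(combination)
    matching = [set(pieces) for pieces in formulas if combo <= set(pieces)]
    companions = Counter(piece for pieces in matching for piece in pieces - combo)
    return len(matching), companions


# Made-up formulas, each given as a list of decoction piece names.
FORMULAS = [
    ["ginseng", "largehead atractylodes rhizome", "Indian bread"],
    ["ginseng", "largehead atractylodes rhizome", "Indian bread", "dried tangerine peel"],
    ["ginseng", "largehead atractylodes rhizome"],
    ["ginseng", "Indian bread"],
]

total, companions = cooccurrence(FORMULAS, ["ginseng", "largehead atractylodes rhizome"])
print(total)
print(companions.most_common(2))
```

The first return value corresponds to statement (1) above, and each companion count corresponds to a three-piece combination such as statements (2) and (3).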

In addition, the retrieved formulas containing these decoction pieces can be browsed and sorted according to the importance of the combination in each formula. For instance, entering “ginseng” and “largehead atractylodes rhizome” retrieves 3,047 formulas. After intelligent sorting, the first formula shown to users is “Renshenbaizhu soup” (as shown in Fig. 10). Similarly, when searching “ginseng”, “largehead atractylodes rhizome” and “Indian bread”, the first formula is “Sanwu soup” (as shown in Fig. 11).

Fig. 10

The retrieval results about the combination of ginseng and largehead atractylodes rhizome

Fig. 11

The retrieval results about the combination of ginseng, largehead atractylodes rhizome and Indian bread

Dashboard

The dashboard shows statistics on all the Chinese medicine decoction pieces and formulas in the database, such as the top 10 decoction pieces and formula efficacies appearing in the database, the statistics on nature, flavor and channel tropism of formulas, and the number of formulas formed in each dynasty (as shown in Fig. 12).

Fig. 12

Dashboard of AMFormulaS

Discussion

AMFormulaS aims to organize formula information and provide intelligent retrieval services for medical staff, researchers, and students. At the same time, it provides data support for the generation of class formulas, the screening of classic formulas, data mining of formulas, and new drug research and development. Based on the integrated formula data, the system performs multi-dimensional statistical analysis of the formation time, medication frequency, and dosage of formulas and drugs across past dynasties. The data and analytical results on time, medication habits, and dosage of traditional Chinese medicine can help suggest new research directions. Meanwhile, the system can provide comprehensive and accurate intelligent query services for patent application and protection.

Because formula knowledge covers a wide scope and an enormous quantity of material, more formulas need to be included in the future. The current version of AMFormulaS aims only at verifying the system design and retrieval algorithm. As the database and the number of users grow, more performance tests and updates will be carried out to meet users' needs for a more accurate retrieval engine, high-quality data, and other services.

Conclusions

In this study, a total of 38,000 formulas were structured and standardized through information extraction methods and then imported into the structured formula database. A novel intelligent formula retrieval system, AMFormulaS, was built that supports multi-dimensional retrieval and statistical analysis of formula information. The system collects, standardizes, and integrates a large amount of formula information, including the original text of formulas. It not only provides efficient retrieval and statistical analysis but also enables users to access the original data source.

Availability of data and materials

The data supporting the findings of this study are available from the corresponding author on request.

Abbreviations

AMFormulaS:

Ancient and Modern Formula System

TCM:

Traditional Chinese medicine

References

  1. Pei M, Duan X, Pei X, et al. Research on compatibility chemistry of acid-alkaline pair medicines in formulas of traditional Chinese medicine. Zhongguo Zhong Yao Za Zhi (China Journal of Chinese Materia Medica). 2009;34(15):1989–93.

  2. Sun B. Study on the properties theory and compatibility law of the mild-nature traditional Chinese medicine. Jinan: Shandong University of Traditional Chinese Medicine; 2010.

  3. Li YH. Discussion on the syndrome-factors and the formula-factors. China J Tradit Chin Med Pharm. 2009;24(02):117–21.

  4. Wang J, Wang Y, Yang G. Methods and modes about the theory of traditional Chinese prescription composition. China J Chin Mater Med. 2005;7:9–12.

  5. Liu S. Study on the historical track of clinical dosage in Dacheng Qi Decoction. Beijing: Beijing University of Chinese Medicine; 2016.

  6. Song Y, Fu Y. A preliminary study on the dosage of medicines in Li Dongyuan’s prescriptions. J Tradit Chin Med. 2011;62:64–81.

  7. Ai N. Malignant disease syndromes the literature of traditional Chinese medicine research. Harbin: Heilongjiang University of Chinese Medicine; 2017.

  8. Leung WK, Wu JCY, Liang SM, et al. Treatment of diarrhea-predominant irritable bowel syndrome with traditional Chinese herbal medicine: a randomized placebo-controlled trial. Am J Gastroenterol. 2006;101(7):1574–80.

  9. Iwasaki K, Kato S, Monma Y, et al. A pilot study of banxia houpu tang, a traditional Chinese medicine, for reducing pneumonia risk in older adults with dementia. J Am Geriatr Soc. 2008;55(12):2035–40.

  10. Guo W. Research and implementation of knowledge mapping of Traditional Chinese Medicine Prescription. Lanzhou: Lanzhou University; 2019.

  11. He M, Yan X, Zhou J, et al. Traditional Chinese medicine database and application on the web. J Chem Inf Comput. 2001;32(2):273–7.

  12. Shen D. Chinese patent drug database construction and prescription rule research. Beijing: Chinese Academy of Traditional Chinese Medicine; 2014.

  13. Xue R, Fang Z, Zhang M, Yi Z, Wen C, Shi T. TCMID: Traditional Chinese Medicine integrative database for herb molecular mechanism analysis. Nucleic Acids Res. 2013;41(Database issue).

  14. Fu L, Li S, Li M, et al. Discussion on the standard of word segmentation of ancient Chinese medicine books: taking the medical books of Qing dynasty as an example. China J Chin Mater Med. 2018;33(10):454–9.

  15. Yan J, Gu Z, et al. Traditional Chinese medicine philology. Beijing: China Press of Traditional Chinese Medicine; 2002.

  16. Zhou Z, Tang D. Traditional Chinese pharmacology. Beijing: China Press of Traditional Chinese Medicine; 2016.

  17. Wu L, et al. Chinese traditional medicine and materia medica subject headings (last volume). Beijing: Traditional Chinese Medicine Classics Press; 2008. p. 01.

  18. National standard of the People’s Republic of China. Chinese information processing vocabulary Part 01: basic terms (GB12200·1-90). Beijing: Standards Press of China; 1991.

  19. National standard of the People's Republic of China. Modern Chinese word segmentation standard for information processing (GB/T13715–92). Beijing: Standards Press of China; 1992.

  20. Li S, Li M, Xu Y, et al. Capsules based Chinese word segmentation for ancient Chinese medical books. IEEE Access. 2018;6:70874–83.

  21. Liu L, Zhu Y, Li H, et al. Building of Traditional Chinese Medicine Integrative Data Model (TCMIDM). China Dig Med. 2015;10(10):70–2.

  22. Liu L, Liu J, Jia L, et al. Study of concept description of Chinese medicine ontology. China Dig Med. 2016;11(2):90–2.

  23. Liu L, Zhu Y. Building Chinese medicine conceptual data model based on semantic representation. World Chin Med. 2017;12(4):936–9.

  24. Chinese Pharmacopoeia Commission. Pharmacopoeia of People’s Republic of China: Part one. Beijing: China Medical Science Press; 2010.

  25. General Administration of Quality Supervision, Inspection and Quarantine of the People's Republic of China, Standardization Administration of China. Coding Rules and Codes of Traditional Chinese Medicine: GB/T 31774–2015. Beijing: Standards Press of China; 2015.

  26. Editorial board of Chinese materia medica of National Administration of Traditional Chinese Medicine. Chinese Materia Medica. Shanghai: Shanghai Scientific & Technical Publishers; 1999.

  27. Nanjing Medical University. Dictionary of Traditional Chinese Medicine. Shanghai Renmin Chubanshe; 1977.

  28. CN109801697A. An evaluation method of the importance of Chinese Herbal Pieces [EB/OL]. https://zhuanli.tianyancha.com/80a3d65404855cb9ee3ec8fd244b2dd2, 2020-4-29.

Acknowledgements

We thank Mingzhe Li for help with critical proofreading of the manuscript.

About this supplement

This article has been published as part of BMC Medical Informatics and Decision Making Volume 21, Supplement 2 2021: Health Big Data and Artificial Intelligence. The full contents of the supplement are available at https://bmcmedinformdecismak.biomedcentral.com/articles/supplements/volume-21-supplement-2

Funding

This study was supported by National Key R&D Program of China (2019YFC1710400; 2019YFC1710401). The work was also partially supported by the Fundamental Research Funds for the Central public welfare research institutes (ZZ13-YQ-126; ZZ13-YQ-127) and Beijing Natural Science Foundation (7174328). The publication charges of this study come from Fundamental Research Funds for the Central public welfare research institutes (ZZ13-YQ-127).

Author information

Contributions

YZ designed this study; YC and BG processed the data; LL and JL reviewed the data. YC and BG wrote the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Yan Zhu.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent to publication

Not applicable.

Competing interests

The authors declare that there are no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Cui, Y., Gao, B., Liu, L. et al. AMFormulaS: an intelligent retrieval system for traditional Chinese medicine formulas. BMC Med Inform Decis Mak 21 (Suppl 2), 56 (2021). https://doi.org/10.1186/s12911-021-01419-8

  • DOI: https://doi.org/10.1186/s12911-021-01419-8

Keywords