Measuring diversity in medical reports based on categorized attributes and international classification systems
© Přečková et al; licensee BioMed Central Ltd. 2012
Received: 29 June 2011
Accepted: 12 April 2012
Published: 12 April 2012
Narrative medical reports do not use standardized terminology and often provide insufficient information for statistical processing and medical decision making. The objectives of the paper are to propose a method for measuring diversity in medical reports written in any language, to compare the diversity of narrative and structured medical reports, and to map attributes and terms to selected classification systems.
A new method based on the general concept of f-diversity is proposed for measuring the diversity of medical reports in any language. The method is based on categorized attributes recorded in narrative or structured medical reports and on international classification systems. Values of categories are expressed by terms. Using SNOMED CT and ICD-10, we map attributes and terms to predefined codes. We use f-diversities of the Gini-Simpson and Number of Categories types to compare the diversity of narrative and structured medical reports. The comparison is based on attributes selected from the Minimal Data Model for Cardiology (MDMC).
We compared the diversity of 110 Czech narrative medical reports and 1119 Czech structured medical reports. The selected categorized attributes of MDMC mostly had different numbers of categories and used different terms in narrative and structured reports. We found more than 60% of MDMC attributes in SNOMED CT. We showed that attributes in narrative medical reports had greater diversity than the same attributes in structured medical reports. Further, we replaced each category value (term) used for attributes in narrative medical reports with the closest term and category used in MDMC for structured medical reports. We found that relative Gini-Simpson diversities in structured medical reports were significantly smaller than those in narrative medical reports, except for the "Allergy" attribute.
Terminology in narrative medical reports is not standardized. It is therefore nearly impossible to map values of attributes (terms) to codes of known classification systems. The high diversity of terminology in narrative medical reports makes them more difficult to process by computer than structured medical reports, and some information may be lost during this process. Setting a standardized terminology would give healthcare providers complete and easily accessible information about patients, which would result in better healthcare.
There are two approaches to storing information about a patient's state of health in a medical report. The first stores information as free text, creating narrative medical reports. The second stores information in a structured form (e.g. using a structured electronic health record), creating structured medical reports.
In current medicine there are many synonyms for a single concept, e.g. a single disease. For this reason, coding systems providing codes for any medical finding have arisen. Coding systems limit the variability of expression: only approved terms and phrases can be used, according to strictly given rules. Formal codes are usually used instead of the approved terms. In many cases it is useful if coding systems also list non-approved terms that are often used as synonyms for approved terms.
The most widespread international classification systems include the International Classification of Diseases and Related Health Problems (ICD), the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), Medical Subject Headings (MeSH), and Logical Observation Identifiers Names and Codes (LOINC); in total there are more than 100 classification systems.
ICD is one of the oldest medical classification systems. Its foundation was laid in 1855. The World Health Organization took it over in 1948, at the time of its 6th revision. Since 1994 the 10th revision of ICD has been in use; it contains 22 chapters [1–3]. ICD has become an international standard for the classification of diseases and for many epidemiological and management needs in healthcare. These include describing the general health situation of different population groups and monitoring the incidence and prevalence of various diseases and other health problems in relation to other variables. It is used to classify diseases and other health problems recorded in many types of health records, including death certificates and hospital records. ICD is available in the six official languages of WHO and in 36 other languages, including Czech.
SNOMED CT [4–12] is a comprehensive clinical terminology that provides clinical content and expressivity for clinical documentation and reporting. It can be used to code, retrieve, and analyze clinical data. SNOMED CT resulted from the merger of SNOMED Reference Terminology, developed by the College of American Pathologists, and Clinical Terms Version 3, developed by the National Health Service of the United Kingdom. The terminology comprises concepts, terms and relationships, with the objective of precisely representing clinical information across the scope of health care. SNOMED CT provides a standardized clinical terminology that is essential for effective collection of clinical data, its retrieval, aggregation and re-use, as well as interoperability. SNOMED CT is considered to be the most comprehensive multilingual clinical healthcare terminology in the world. Since April 2007 it has been owned, maintained, and distributed by the International Health Terminology Standards Development Organisation, a not-for-profit association in Denmark. American, British, Spanish, and German versions of SNOMED CT are currently available.
MeSH [13, 14] is a controlled vocabulary maintained by the National Library of Medicine (NLM). It is composed of terms that name keywords. Keywords are arranged not only alphabetically but also hierarchically, and this hierarchy helps with searching at various levels of specificity. NLM uses MeSH for indexing papers from the world's leading biomedical journals for the MEDLINE/PubMed database. MeSH is also used for a database cataloguing books, documents, and audiovisual materials. Each bibliographical reference is connected with a class of terms in the MeSH classification system. Search queries also use the MeSH vocabulary to find papers on the required topics. A Czech translation of MeSH also exists.
LOINC [15, 16] is a clinical terminology that is important for laboratory tests and laboratory results. In 1999 the HL7 organization accepted LOINC as a preferred coding system for names of laboratory tests and clinical observations.
The growing number of classification systems and nomenclatures requires various conversion tools for transfer among the main classification systems and for recording relations among terms in these systems. Extensive ontologies and semantic networks are modeled for information transfer among various databases. Metathesauri are designed to monitor and connect information from various heterogeneous sources. The most extensive is the Unified Medical Language System (UMLS) [17–19], which we used in our study.
The Czech language belongs to the western group of Slavic languages and is a free constituent order language. The style of writing narrative medical reports in the Czech Republic, as in any other country, is not standardized, and reports are written mostly as free text. As in other languages, Czech narrative medical reports do not use standardized terminology and often provide insufficient information for statistical processing and medical decision making. A single term often has more than ten synonyms. Synonyms in narrative reports lead to inaccuracy and misunderstanding. This problem has intensified with the introduction of computer technology to healthcare. Using computers demands greater unambiguity in data entry and term definitions, precise naming, etc., so this drawback becomes more noticeable. Generally, in scientific terminology it is more advantageous to use only one expression for one concept. Computers are able to learn synonyms, but this enlarges dictionary databases and increases the number of necessary operations. Moreover, standardized terminology is a basic prerequisite for semantic interoperability.
One of our objectives was to propose a new application for measuring the diversity of medical reports written in any language. The method is based on categorized attributes recorded in medical reports and on international classification systems. The general concept of diversity is derived from f-diversity and its modifications (relative f-diversity, self f-diversity and marginal f-diversity). Here we use f-diversities of the Gini-Simpson and Number of Categories types. The method can be applied to compare the diversities of two samples of medical reports. We compared the diversities of samples of Czech narrative medical reports and Czech structured medical reports. Both samples were collected in two outpatient departments of preventive cardiology of the Municipal Hospital in Čáslav. The first outpatient department was located in Prague and the second in Čáslav. The Municipal Hospital in Čáslav approved the use of these medical reports for our research. We used categorized attributes selected from the Minimal Data Model for Cardiology (MDMC). The medical reports were recorded by four physicians. We analyzed 1119 structured medical reports collected in Prague and 110 narrative medical reports collected in Čáslav. The narrative medical reports from Čáslav were written by the same physicians who also collected the data for the structured medical reports in Prague.
Minimal data model for cardiology
Electronic health records (EHRs) are currently developing rapidly. There is general agreement that the EHR has the potential to improve the quality of medical care. The most important requirement on EHRs seems to be the exchange and management of structured health information. We chose the field of cardiology for our study because since 1994 the EuroMISE Centre has been running two outpatient departments of preventive cardiology under the auspices of the Municipal Hospital in Čáslav, and we therefore have access mainly to cardiological data and medical records focused on cardiology. In 2002 the Minimal Data Model for Cardiology (MDMC) was developed within this research center [27, 28]. MDMC is a set of approximately 150 attributes together with their categorization, mutual relations, integrity restrictions, units, etc. Prominent professionals in Czech cardiology agreed on these attributes as the basic data necessary for a cardiological examination of a patient.
MDMC consists of several groups of attributes. The first is the administrative part. Next is the family history part, with information on parents and siblings. The social history and addiction part focuses on marital status, physical activities, mental stress, and levels of smoking and alcohol consumption. One part of MDMC is devoted to allergies, mainly drug allergies. The personal history part records the presence of diabetes mellitus, whether the patient has suffered a stroke, and whether he/she is being treated for ischemic disease of the peripheral arteries; it also contains attributes related to aortic aneurysm, other relevant diseases and, in women, menopause. In the part called Current difficulties of a possible cardiological origin, physicians focus on shortness of breath, chest pain, palpitations, swellings, syncope, cough, hemoptysis, and claudication. Another part records what kind of treatment the patient undergoes, what type of diet is prescribed and which medications he/she uses. In the physical examination part, the patient's weight, height, body temperature, BMI, WHR, blood pressure, pulse and breathing rates, and pathological findings are recorded. Laboratory testing is focused on blood glucose, uric acid, total cholesterol, HDL-cholesterol, LDL-cholesterol, and triacylglycerols. The last part is focused on attributes related to ECG: the beat frequency, the average PQ and QRS intervals and the results of ECG are fully described there.
Traditional measures of diversity
Traditional measures of diversity are based on categorized attributes. For a given attribute we determine categories A_1, ..., A_{k-1}. We then summarize the remaining findings in the "others" category, denoted A_k. Let p = (p_1, ..., p_k) denote the probability distribution over these categories. The two best-known measures of diversity are the following.
The Gini-Simpson index H_GS(p) = 1 - Σ p_i^2 takes values in the interval [0, (k - 1)/k]. The lower bound 0 is reached if and only if there is only one category of the studied attribute, and the upper bound (k - 1)/k is reached for the uniform probability distribution p = u = (1/k, 1/k, ..., 1/k). It was originally suggested as a measure of income inequality by Gini and later discussed by Simpson as a measure of ecological diversity.
The Shannon information index H_S(p) = -Σ p_i log p_i takes values in the interval [0, log k]. The lower bound 0 is reached if and only if there is only one category of the attribute, and the upper bound log k is reached for the uniform probability distribution p = u = (1/k, ..., 1/k).
It is hard to give a universal preference to one of these two measures. Some researchers are more familiar with the Shannon entropy, and it is easier for them to interpret particular numerical values of H_S(p) than those of H_GS(p). On the other hand, the Gini-Simpson index is a very well-known traditional measure of diversity.
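As an illustration, the two traditional indices can be computed from the recorded categories of a single attribute. The following is a minimal sketch, assuming the standard definitions (Gini-Simpson as one minus the sum of squared probabilities, Shannon with the natural logarithm) and relative frequencies as probability estimates; the function names and the sample values are hypothetical and are not taken from the study data.

```python
import math
from collections import Counter


def proportions(labels):
    """Estimate category probabilities p_i as relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def gini_simpson(p):
    """Gini-Simpson index: 1 minus the sum of squared probabilities."""
    return 1.0 - sum(pi * pi for pi in p)


def shannon(p):
    """Shannon index with natural logarithm; zero probabilities are skipped."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Hypothetical category values of one attribute, for illustration only.
sample = ["non-smoker", "non-smoker", "smoker", "ex-smoker"]
p = proportions(sample)
print(gini_simpson(p), shannon(p))
```

Both functions return 0 for a single category and reach their maxima, (k - 1)/k and log k, for the uniform distribution over k categories.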
f-diversity and relative f-diversity
where p(x, y) are the joint probabilities and p(x), p(y) are the marginal probabilities of the categories of the attributes X and Y.
This measure of diversity will be further called Shannon diversity.
The f-entropy H_f(p) can be interpreted as the average unpredictability of the individual categories A_i of the attribute X. In this sense the f-entropy H_f(p) is a measure of diversity depending on the distribution p. H_f(p) will be called an f-diversity if it moreover satisfies the following conditions:
• H_f(p) is non-negative,
• H_f(p) reaches its minimal value when there is one category with probability 1,
• H_f(p) reaches its maximal value when p = u is the uniform distribution,
• H_f(p) is a symmetric function of p,
• H_f(p) is a concave function on the system of all probability distributions p.
We can see that H_f(p) is a sum of two expressions, the second of which is the well-known Gini-Simpson index H_GS(p) multiplied by the constant f(0). In what follows we call the Gini-Simpson index the Gini-Simpson diversity. It has been proved that f-diversities can be found among the f-entropies for which g(t) = (f(t) - f(0))/t is a concave function. The f-entropy H_f(p) of the attribute X then reaches its maximal value for the uniform distribution of categories p = u. The Gini-Simpson diversity H_GS(p) is an f-diversity with f(t) = t - 1 for t > 1 and f(t) = 0 otherwise. Similarly, the Shannon diversity is an f-diversity with f(t) = t log t.
Measures of rarity, self and marginal f-diversity
Three widely used diversity indexes are:
We can see that for β = 0 we obtain the Shannon diversity, for β = 1 the Gini-Simpson diversity and for β = -1 the Number of categories diversity. As shown above, all three of these diversity indexes belong to the family of f-diversities.
Therefore the f-diversity H_f(p) is the weighted average of the self f-diversities R_{f,i}(p).
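The three special cases and the weighted-average reading can be illustrated with a short sketch. It assumes a standard rarity-based parametrisation, in which a category with probability p_i has rarity (1 - p_i^β)/β (with -log p_i as the limit for β = 0) and the diversity is the probability-weighted average of the rarities. This parametrisation is an assumption chosen because it reproduces the three cases named above; the exact formulas and notation of the study may differ, and the function names are hypothetical.

```python
import math


def rarity(pi, beta):
    """Rarity of a category with probability pi: (1 - pi**beta) / beta.

    For beta == 0 the limit -log(pi) is used.
    """
    if beta == 0:
        return -math.log(pi)
    return (1.0 - pi ** beta) / beta


def diversity(p, beta):
    """Probability-weighted average of rarities over non-empty categories."""
    return sum(pi * rarity(pi, beta) for pi in p if pi > 0)


p = [0.5, 0.25, 0.25]
print(diversity(p, 1))   # Gini-Simpson type: 1 - sum of squares
print(diversity(p, 0))   # Shannon type
print(diversity(p, -1))  # Number of categories type: k - 1
```

Under this assumption the Number of categories diversity equals the number of non-empty categories minus one, so it is 0 when only one category is used.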
When comparing diversities on the sample of 110 Czech narrative medical reports and 1119 Czech structured medical reports, we used diversities of the Gini-Simpson type. The reason is that an ideal estimator of the Shannon type diversity has been shown not to exist.
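A comparison of two samples on a common scale can be sketched as follows. The sketch assumes that a relative Gini-Simpson diversity is the Gini-Simpson index divided by its maximum (k - 1)/k for k predefined categories, and that probabilities are estimated by relative frequencies; both are assumptions made for illustration, and the function name and sample values are hypothetical.

```python
from collections import Counter


def relative_gini_simpson(labels, k):
    """Gini-Simpson index divided by its maximum (k - 1) / k.

    k is the number of predefined categories of the attribute
    (assumed to be fixed by the categorization, e.g. in a data model).
    """
    if k < 2:
        return 0.0
    counts = Counter(labels)
    n = sum(counts.values())
    h = 1.0 - sum((c / n) ** 2 for c in counts.values())
    return h / ((k - 1) / k)


# Hypothetical recorded categories of one attribute in two samples.
narrative = ["yes", "no", "unknown", "no"]
structured = ["no", "no", "no", "yes"]
print(relative_gini_simpson(narrative, 3), relative_gini_simpson(structured, 3))
```

Dividing by the maximum puts attributes with different numbers of categories on the same 0 to 1 scale, which is what makes a comparison between samples meaningful.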
MDMC has become a basis for selection of attributes and their categorization.
The analysis of the suitability and usability of individual terminological thesauri started with mapping the clinical content of the Minimal Data Model for Cardiology to various terminological classification systems.
First, we tried to map the attributes and terms of MDMC to SNOMED CT. The first prerequisite for this mapping was the translation of the MDMC attributes and terms into English, as there is no Czech version of SNOMED CT.
As ICD-10 is one of the few international medical classifications translated into Czech, as the second step we tried to map the attributes and terms of MDMC to ICD-10. The results of these mappings have been published previously.
We found the following types of MDMC attributes and terms from the point of view of possibilities of their mapping to SNOMED CT and ICD-10 classification systems:
Trouble-free terms and attributes - i.e. terms and attributes that can be mapped directly, so that only one possibility of mapping exists; at most there are synonyms with exactly the same meaning and therefore the same classification code (e.g. patient first name, current smoker, motility, height of a patient, etc.).
Partially problematic terms and attributes - i.e. terms and attributes, which can be mapped in a way that there are several possibilities of mapping to different synonyms, which differ slightly in their meanings and usually in their classification codes (e.g. ischemic cerebro-vascular stroke, angina pectoris, hypertension, congestive cardiac failure, etc.).
Terms and attributes with a too small granularity - i.e. terms and attributes describing certain characteristics on a too general level so that classification systems contain only terms of a narrower meaning (e.g. e-mail in MDMC versus e-mail to work/e-mail to home/e-mail of a physician and so on in classification systems).
Terms and attributes with a too big granularity - i.e. terms and attributes describing certain characteristics on such a narrow level so that classification systems contain only a term of a more general meaning (e.g. symmetrical pulse of carotids, etc.).
Terms and attributes, which cannot be found in classification systems, e.g. dyslipidemy, etc.
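The five types above can be expressed as a small decision rule over the candidate codes found for a term. The following is only a sketch of one possible encoding: the `Relation` labels, the `Candidate` structure and the order of the checks are assumptions made for illustration, and the codes in the usage example are placeholders rather than real SNOMED CT or ICD-10 codes.

```python
from dataclasses import dataclass
from enum import Enum


class Relation(Enum):
    EXACT = "exact"        # same meaning as the MDMC term
    CLOSE = "close"        # synonym with a slightly different meaning
    NARROWER = "narrower"  # classification term is more specific
    BROADER = "broader"    # classification term is more general


class MappingType(Enum):
    TROUBLE_FREE = 1
    PARTIALLY_PROBLEMATIC = 2
    TOO_SMALL_GRANULARITY = 3  # MDMC term is too general
    TOO_BIG_GRANULARITY = 4    # MDMC term is too specific
    NOT_FOUND = 5


@dataclass(frozen=True)
class Candidate:
    code: str
    relation: Relation


def classify(candidates):
    """Assign one of the five mapping types to a list of candidate codes."""
    if not candidates:
        return MappingType.NOT_FOUND
    relations = {c.relation for c in candidates}
    exact_codes = {c.code for c in candidates if c.relation is Relation.EXACT}
    if len(exact_codes) == 1 and Relation.CLOSE not in relations:
        return MappingType.TROUBLE_FREE
    if exact_codes or Relation.CLOSE in relations:
        return MappingType.PARTIALLY_PROBLEMATIC
    if Relation.NARROWER in relations:
        return MappingType.TOO_SMALL_GRANULARITY
    return MappingType.TOO_BIG_GRANULARITY


# Placeholder codes, not taken from any real classification system.
print(classify([Candidate("CODE-A", Relation.EXACT)]))
print(classify([Candidate("CODE-B", Relation.CLOSE), Candidate("CODE-C", Relation.CLOSE)]))
```

In practice the relation between a term and a candidate code has to be judged by a domain expert; the sketch only records that judgment and derives the mapping type from it.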
Linguistics and lexical analysis of narrative medical reports
In the following part we present our findings on linguistic and lexical differences in Czech narrative medical reports.
The Czech language uses a writing system with diacritics, for example the letters "ě, č, ř, ž". However, it is faster for physicians to write without these diacritical marks and to use the letters "e, c, r, z" instead. Such text is understandable for native Czech speakers but is difficult to process computationally.
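One common way to make text with and without diacritics comparable is to strip the diacritical marks from both sides before matching. The following minimal sketch uses Unicode decomposition from the Python standard library; the function names are hypothetical, and the approach is an illustration rather than the procedure used in the study.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks so that 'č' and 'c' compare as equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_term(a, b):
    """Compare two terms while ignoring diacritics and letter case."""
    return strip_diacritics(a).casefold() == strip_diacritics(b).casefold()


print(strip_diacritics("ěčřž"))
print(same_term("Čáslav", "caslav"))
```

Stripping diacritics loses information, so two different Czech words may collapse into the same string; the normalized form is suitable for matching, not as a replacement for the original text.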
Typing errors are a bigger problem and are very frequent in any language. Text containing them is hardly usable for computational processing.
A similar problem is the omission of spaces between words, which merges two words into one.
Computational processing is also hampered when a physician types the digit 0 instead of the capital letter O.
Many discrepancies are connected with numerical values. One physician may round an attribute to integers, while another records the same attribute with a precision of one or two decimal places. Sometimes numerical values are presented as ranges, e.g. "70-80". Often only an approximate indication is entered, e.g. "diastolic pressure around 70". Some attributes are expressed not in numbers but in words, e.g. "blood pressure is within the normal range".
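The variants listed above (exact value, range, approximate value, verbal description) can be brought to one representation before any statistical processing. The following is a minimal sketch under stated assumptions: English wording as in the translated examples, a hypothetical `Reading` structure, and a small list of approximation words; real Czech reports would need their own patterns.

```python
import re
from dataclasses import dataclass
from typing import Optional

NUMBER = r"\d+(?:[.,]\d+)?"
RANGE_RE = re.compile(rf"({NUMBER})\s*-\s*({NUMBER})")
APPROX_RE = re.compile(rf"\b(?:around|about|approximately)\s+({NUMBER})", re.IGNORECASE)
NUMBER_RE = re.compile(NUMBER)


@dataclass
class Reading:
    low: Optional[float]   # lower bound, or None if no number was found
    high: Optional[float]  # upper bound, equal to low for a single value
    approximate: bool      # True when the text marks the value as approximate


def _to_float(text):
    """Accept both decimal point and decimal comma."""
    return float(text.replace(",", "."))


def parse_reading(text):
    """Turn a free-text numerical statement into a Reading."""
    m = RANGE_RE.search(text)
    if m:
        return Reading(_to_float(m.group(1)), _to_float(m.group(2)), False)
    m = APPROX_RE.search(text)
    if m:
        value = _to_float(m.group(1))
        return Reading(value, value, True)
    m = NUMBER_RE.search(text)
    if m:
        value = _to_float(m.group(0))
        return Reading(value, value, False)
    return Reading(None, None, False)


for example in ["70-80", "diastolic pressure around 70",
                "blood pressure is within the normal range"]:
    print(parse_reading(example))
```

Purely verbal statements yield a reading without numbers, which makes the loss of information explicit instead of silently dropping the record.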
Arabic and Roman numerals
There is a divergence in usage of Arabic and Roman numerals. For example heart sounds may be found in both ways "heart sounds 2" and "heart sounds II".
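Both notations can be unified by converting Roman numerals to Arabic ones before comparison. The sketch below handles only small numerals built from I, V and X, which is an assumption about what occurs in such findings; the function names are hypothetical.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}
ROMAN_RE = re.compile(r"\b[IVX]+\b")


def roman_to_int(numeral):
    """Convert a Roman numeral made of I, V and X to an integer."""
    total = 0
    previous = 0
    for ch in reversed(numeral):
        value = ROMAN_VALUES[ch]
        # A smaller symbol before a larger one is subtracted (e.g. IV = 4).
        if value < previous:
            total -= value
        else:
            total += value
        previous = value
    return total


def normalize_numerals(text):
    """Replace standalone Roman numerals in the text by Arabic numerals."""
    return ROMAN_RE.sub(lambda m: str(roman_to_int(m.group(0))), text)


print(normalize_numerals("heart sounds II"))
```

The pattern matches any standalone sequence of the capital letters I, V and X, so in English text the pronoun "I" would also be converted; a real implementation would restrict the replacement to known attribute contexts.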
The Czech language is very rich in synonyms, and they are widely used in medical reports as well.
Some physicians use a newer version of Czech spelling, others the older one.
The recording of time is not standardized either. In medical reports we can find the name of the month, e.g. "February 2006", but also the number of the month, e.g. "2/2006".
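The two date notations can be reduced to a common (year, month) pair. The following minimal sketch uses English month names, matching the translated examples; Czech reports would need Czech month names and their inflected forms, and the function name is hypothetical.

```python
import re

MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}
NUMERIC_RE = re.compile(r"^\s*(\d{1,2})\s*/\s*(\d{4})\s*$")
NAMED_RE = re.compile(r"^\s*([A-Za-z]+)\s+(\d{4})\s*$")


def parse_month_year(text):
    """Return (year, month) for 'February 2006' or '2/2006', else None."""
    m = NUMERIC_RE.match(text)
    if m:
        month, year = int(m.group(1)), int(m.group(2))
        return (year, month) if 1 <= month <= 12 else None
    m = NAMED_RE.match(text)
    if m:
        month = MONTHS.get(m.group(1).lower())
        return (int(m.group(2)), month) if month else None
    return None


print(parse_month_year("February 2006"), parse_month_year("2/2006"))
```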
There are various ways to describe the time when a patient should take a drug (e.g. 1-0-0 vs. 1 pill in the morning vs. 1 in the morning vs. 1× in the morning).
These problems are not specific to medical reports; the same orthographic errors may be found, e.g., in web pages.
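The dosage variants quoted above can be mapped to a single scheme. The sketch below assumes that "1-0-0" means the number of pills in the morning, at noon and in the evening, and it recognizes only a few English phrasings modeled on the translated examples; both the phrasings and the function name are assumptions for illustration.

```python
import re

SCHEME_RE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*-\s*(\d+)\s*$")
PHRASE_RE = re.compile(
    r"^\s*(\d+)\s*[×x]?\s*(?:pills?|tablets?)?\s*"
    r"(in the morning|at noon|in the evening)\s*$",
    re.IGNORECASE,
)
SLOT = {"in the morning": 0, "at noon": 1, "in the evening": 2}


def parse_dosage(text):
    """Return (morning, noon, evening) pill counts, or None if not recognized."""
    m = SCHEME_RE.match(text)
    if m:
        return tuple(int(g) for g in m.groups())
    m = PHRASE_RE.match(text)
    if m:
        doses = [0, 0, 0]
        doses[SLOT[m.group(2).lower()]] = int(m.group(1))
        return tuple(doses)
    return None


for variant in ["1-0-0", "1 pill in the morning", "1 in the morning", "1× in the morning"]:
    print(parse_dosage(variant))
```

All four variants from the example yield the same triple, so after normalization they count as one category instead of four.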
Mapping MDMC attributes to international classification systems
After translating the MDMC attributes from Czech to English, we found that more than 60% of MDMC attributes could be mapped to SNOMED CT.
As the second step, we mapped the MDMC attributes to ICD-10. As the very title of the International Classification of Diseases shows, this classification can be used to encode particular diseases, syndromes, pathological conditions, injuries, difficulties and other reasons for contact with healthcare services, i.e. the type of information that is registered by a physician. Unfortunately, this classification cannot be used to map many attributes of MDMC, such as marital status, education, mental stress, physical stress, physical activity, smoking, alcohol drinking, physical examination (weight, height, body temperature, BMI, WHR, etc.) or laboratory tests (total cholesterol, HDL-cholesterol). ICD-10 can be used only for the parts of MDMC related to personal history and current difficulties of a possible cardiological origin. Therefore only 25% of MDMC has been mapped to ICD-10.
Similar results were obtained when analyzing the standardization possibilities of attributes of the Data Standard of the Ministry of Health of the Czech Republic (DASTA), in which the majority of healthcare information systems in the Czech Republic communicate. DASTA is based on the national classification system called the National code-list of laboratory items (NCLP). These standards are developed and administered by the developers of healthcare information systems, which are specialized companies, universities or research institutions in the Czech Republic. The development of the standard is supported by the Czech Ministry of Health. DASTA specializes mainly in the transfer of requests for and results of laboratory analyses. The current version of DASTA is XML based and also provides functionality for sending statistical reports to the Institute of Health Information and Statistics of the Czech Republic and limited functionality for the exchange of free-text clinical information. Unfortunately, DASTA has almost no relation to international communication standards such as HL7 or European standards like EN13606.
Diversities of selected attributes and their categories in narrative and structured medical reports
Table: Number of categories. Columns: attribute; no. of narrative reports with recorded attribute; no. of narrative reports with missing attribute; Number of categories diversity (narrative reports); no. of structured reports with recorded attribute; no. of structured reports with missing attribute; Number of categories diversity (structured reports). Rows include "Ischemic heart disease".
Each narrative medical report was read and analysed individually, and all the different ways of describing the selected MDMC attributes were highlighted and recorded. As there were many more ways of describing these attributes in narrative reports than in the structured reports categorized according to MDMC, the Number of categories diversity is much higher in narrative medical reports than in structured medical reports.
Table: Gini-Simpson relative diversities. Columns: attribute; no. of categories; number of narrative reports; Gini-Simpson relative diversity (narrative reports); number of structured reports; Gini-Simpson relative diversity (structured reports). Rows include "Ischemic heart disease".
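Table: Gini-Simpson self diversities and relative marginal diversities. Columns: attribute and category; no. of narrative reports; Gini-Simpson relative marginal diversity (narrative reports); no. of structured reports; Gini-Simpson relative marginal diversity (structured reports). Rows include "Ischemic heart disease"; category labels include "do not know" and "I do not know".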
Finally, we compared how often the selected attributes appear in narrative and in structured medical reports, with the following results:
smoking has been recorded in 64.5% of narrative and in 96.5% of structured medical reports;
allergy in 81.8% of narrative and in 95.2% of structured medical reports;
whether a patient suffers from an ischemic heart disease has been recorded in 60.9% of narrative and in 93.4% of structured medical reports;
presence or absence of dyspnoea in 71.8% of narrative and in 93.6% of structured medical reports;
whether a patient suffers from a chest pain has been found in 34.5% of narrative and in 93.7% of structured medical reports;
questions about palpitations have been recorded only in 15.5% of narrative but in 94.2% of structured medical reports;
swelling has been recorded in 86.4% of narrative and in 93.8% of structured medical reports and
the diabetes mellitus attribute has been recorded in 62.7% of narrative and 95.9% of structured medical reports.
Close cooperation with physicians is essential for solving mapping problems. We consulted the four physicians examining patients in both outpatient departments of preventive cardiology. It was often necessary to choose the right synonym to substitute for a certain technical term, and this had to be done very carefully so as not to lose or misinterpret information. If mapping is not possible without loss of information, the better way is to describe a non-coded term by means of a set of several coded terms, possibly showing their mutual semantic relations. If this is not possible either, we can discuss with specialists whether these "indescribable" terms (attributes) can be replaced by other, more standard equivalents. In special cases it is possible to add a term to an upcoming version of a coding system. If none of the above possibilities can be used, we have to accept that mapping will never be 100% complete. An insufficient mapping process limits the interoperability of heterogeneous systems used for various purposes in healthcare. Restricted interoperability often follows inevitably from the root of the problem, e.g. insufficient harmonization of the clinical content of heterogeneous electronic health record systems.
We can also see that when the results of examinations are recorded in narrative medical reports, the terms for categories of attributes are not standardized and the Number of categories diversity is higher than for the same attribute in structured medical reports. Moreover, many attributes in narrative medical reports are left unrecorded. This may have several reasons. Physicians do not have a strictly given template according to which they should proceed, so they may forget to collect some attributes. Another reason may be that physicians know from previous attributes that the next attribute cannot be present, and therefore they neither ask about it nor record it. From the narrative medical reports, however, we cannot tell whether these missing attributes were checked or whether physicians deduced them on the basis of previous knowledge. In structured medical reports (e.g. based on a structured EHR) all attributes should be recorded. The application should not let physicians continue if findings are not recorded. However, in our application based on MDMC this was not the case, and we can see that our physicians sometimes did not record some findings that could not be derived from other data.
The new method for measuring the diversity of medical reports can be applied to medical reports written in any language with categorized attributes. Moreover, it can quantitatively express the possibilities for extracting information from medical reports and, more generally, from any free-text document. The analysis of narrative medical reports has shown that the recording of attributes is very inaccurate and not standardized. The biggest problems for computational processing are typing errors, the varying length of abbreviated expressions and the usage of synonyms. Another problem in Czech healthcare is the lack of international classification systems translated into Czech. Despite these problems, the use of international classification systems in Czech healthcare is a necessary first step towards interoperability of heterogeneous health record systems. Sufficient semantic interoperability of these systems is the basis for shared care, which leads to efficiency in healthcare, financial savings and a reduction of the burden on patients. In this work we have tried to analyze how international classification systems could best be used for the needs of Czech healthcare.
High diversity in narrative medical reports makes them more difficult to process by computer than structured medical reports, and some information may be lost during this process. It is therefore very important to set a standardized terminology to be used in medical reports. Using international classification systems and nomenclatures, we can compare the diversities of medical reports written in the same or different languages among physicians or healthcare organizations.
A standardized terminology would bring many benefits to physicians. It would support the development of electronic health records that can easily collect structured medical information. Structured medical reports have a smaller diversity and fewer missing observations. They can provide physicians, patients, administrators, software developers and payers with much more accurate and objective information. A standardized clinical terminology would help healthcare providers by giving them complete and easily accessible information belonging to the process of healthcare (the patient's medical record, diseases, treatments, laboratory results, etc.), and it would result in better care of patients.
As physicians are often pressed for time, they abbreviate words while writing medical reports. Unfortunately, there is no single rule for how particular attributes should be abbreviated, so the same word can be abbreviated in different ways.
The work was partially supported by the project 1M06014 of the Ministry of Education of the Czech Republic and by the AV0Z10300504 project of the Institute of Computer Science AS CR.
- World Health Organization: International Classification of Diseases (ICD). ©2011, homepage available at [http://www.who.int/classifications/icd/en/] (last accessed October 10, 2011)
- International Classification of Diseases and Related Health Problems. The Tenth Revisions. Instructing Manual. ÚZIS ČR. (In Czech)
- Stausberg J, Lehmann N, Kaczmarek D, Stein M: Reliability of diagnose coding with ICD-10. Int J Med Inform. 2008, 77: 50-57. 10.1016/j.ijmedinf.2006.11.005.
- The International Health Terminology Standards Development Organisation: SNOMED Clinical Terms. ®, homepage available at [http://www.ihtsdo.org/snomed-ct/] (last accessed October 10, 2011)
- The International Health Terminology Standards Development Organisation: SNOMED Clinical Terms® User Guide. ©2002-2009, July 2009 International Release, 1-70
- Schulz S, Hanser S, Hahn U, Rodgers J: The semantics procedures and diseases in SNOMED® CT. Methods Inf Med. 2006, 45: 354-358.
- Cornet R: Definitions and qualifiers in SNOMED CT. Methods Inf Med. 2009, 48: 177-183.
- Lee D, Cornet R, Lau F: Implications of SNOMED CT versioning. Int J Med Inform. 2011, 80: 442-453. 10.1016/j.ijmedinf.2011.02.006.
- Ceusters W: SNOMED CT's FR2: Is the Future Bright?. User Centered Networked Health Care. Edited by: Moen A et al. 2011, IOS Press, 829-833.
- Conley E, Benson T: SNOMED CT: Who Needs to Know What?. European Journal for Biomedical Informatics. 2011, 7 (2): 40-47.
- Park HA, Lundberg C, Coenen A, Konicek D: Evaluation of the content coverage of SNOMED CT representing ICNP seven-axis version 1 concepts. Methods Inf Med. 2011, 50: 472-478. 10.3414/ME11-01-0004.
- Cornet R, de Keizer N: Forty years of SNOMED: a literature review. BMC Med Inform Decis Mak. 2008, 8 (Suppl 1): S2. 10.1186/1472-6947-8-S1-S2.
- U. S. National Library of Medicine, National Institutes of Health: Medical Subject Headings. Homepage available at [http://www.nlm.nih.gov/mesh/] (last accessed October 10, 2011)
- Gault LV, Schultz M: Variations in Medical Subject Headings (MeSH) mapping: from the natural language of patron terms to the controlled vocabulary of mapped lists. J Med Libr Assoc. 2002, 90 (2): 173-180.PubMedPubMed CentralGoogle Scholar
- Regenstrief Institute, Inc: Logical Observation Identifiers Names and Codes (LOINC®). ©1994-2011, homepage available at [http://www.regenstrief.org/medinformatics/loinc/] (last accessed October 10, 2011)
- Khan AN, Griffith SP, Moore C, Russell D, Rosario AC, Bertolli J: Standardizing laboratory data by mapping to LOINC. J Am Med Inform Assoc. 2006, 13 (3): 353-355. 10.1197/jamia.M1935.View ArticlePubMedPubMed CentralGoogle Scholar
- U.S. National Library of Medicine, National Institute of Health: Unified Medical Language System® (UMLS®). homepage available at [http://www.nlm.nih.gov/pubs/factsheets/umls.html] (last accessed October 10, 2011)
- Han S-B, Choi J: The comparative study on concept representation between the UMLS and the clinical terms in Korean Medical Records. Int J Med Inform. 2005, 74: 67-76. 10.1016/j.ijmedinf.2004.09.004.View ArticlePubMedGoogle Scholar
- Campbell JR, Olivek DE, Shortliffe: UMLS: towards a collaborative approcah for solving terminological problems. J Am Med Inform Assoc. 1998, 5: 12-16. 10.1136/jamia.1998.0050012.View ArticlePubMedPubMed CentralGoogle Scholar
- Massari P, Pereira S, Thirion B, Derdeville A, Darmoni SJ: Use of super-concepts to customize electronic medical records data display. Stud Health Technol Inform. 2008, 136: 845-850.PubMedGoogle Scholar
- Meystre SM, Savova K, Klipper-Schuler C, Hurdle JF: Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research. IMIA Yearbook of Medical Informatics. 2008, 128-144.Google Scholar
- Liu K, Chapman WW, Savova G, Chute CG, Sioutos N, Crewley RS: Effectiveness of Lexico-syntactic Pattern Matchng for Ontology Enrichment with Clinical Documents. Methods of Information in Medicine. 2001, 40 (5): 397-407.Google Scholar
- Eryiğit G, Nivre J, Oflazer K: Dependency parsing of Turkish. Computational Linguistics. 2008, 34 (3): 357-389. 10.1162/coli.2008.07-017-R1-06-83.View ArticleGoogle Scholar
- Zvára K, Kašpar V: Identification of units and other terms in Czech medical records. European Journal for Biomedical Informatics. 2010, 6 (1): 78-82.Google Scholar
- Bleich HL, Slack WV: Reflections on electronic medical record: when doctor will use them and when they will not. Int J Med Inform. 2010, 79: 1-4. 10.1016/j.ijmedinf.2009.10.002.View ArticlePubMedGoogle Scholar
- Zvárová J: Biomedical Informatics Research and Education at the EuroMISE Center. IMIA Yearbook of Medical Informatics, Schattauer GmbH. 2006, 166-173.Google Scholar
- Adášková J, Anger Z, Aschermann M, Bencko V, Berka P, Filipovský J, Goláň L, Grus T, Grünfeldová H, Haas T, Hanuš P, Hanzlíček P, Holcátová I, Hrach K, Jiroušek R, Kejřová E, Kocmanová D, Kolář J, Kotásek P, Králíková E, Krupařová M, Kyloušková M, Malý M, Mareš R, Matoulek M, Mazura I, Mrázek V, Novotný L, Novotný Z, Pecen L, Peleška J, Prázný M, Pudil P, Rameš J, Rauch J, Reissigová J, Rosolová H, Rousková B, Říha A, Sedlak P, Slámová A, Somol P, Svačina Š, Svátek V, Šabík D, Šimek S, Škvor J, Špidlen J, Štochl J, Tomečková M, Umnerová V, Zvára K, Zvárová J: A proposal of the Minimal Data Model for Cardiology and the ADAMEK software application (in Czech). Internal research report of the EuroMISE Centre - Cardio. 2002, Prague: Institute of Computer Science AS CRGoogle Scholar
- Mareš R, Tomečková M, Peleška J, Hanzlíček P, Zvárová J: Interface of patient database systems - an example of the application designed for data collection in the framework of minimal data model for cardiology (in Czech). Cor Vasa. 2002, 44 (4 Suppl): 76-Google Scholar
- Gini C: Variabilità e Mutabilità. Studi Economico-Giuridici della R. Univ. di Cagliari. 3, 1912; Part 2 80
- Simpson EH: Measurement of diversity. Nature. 1949, 163: 688-10.1038/163688a0.View ArticleGoogle Scholar
- Vajda I: Theory of statistical inference and information. 1989, Kluwer: BostonGoogle Scholar
- Zvarova J, Studeny M: Information theoretical approach to constitution and reduction of medical data. Int J Med Inf. 1997, 45: 65-74. 10.1016/S1386-5056(97)00036-1.View ArticleGoogle Scholar
- Peng H, Long F, Ding Ch: Feature selection based on mutual information: criteria of max-dependency, mas-relevance and min-redundancy. IEEE Trans Pattern Anal Mach Intell. 2005, 27 (8): 1226-1238.View ArticlePubMedGoogle Scholar
- Benish WA: Intuitive and axiomatic arguments for quantifying diagnostic test performance in units of information. Methods Inf Med. 2009, 48: 552-557. 10.3414/ME0627.View ArticlePubMedGoogle Scholar
- Blokh D, Zurgil N, Stambler I, Afrimzon E, Shafran Y, Korech E, Sandbank J, Deutsch M: An information-theoretical model for breast cancer detection. Methods Inf Med. 2008, 47: 322-557.PubMedGoogle Scholar
- Zvárová J: On measures of statistical dependence. Casopis pro pestovani matematiky. 1974, 99: 15-29.Google Scholar
- Zvárová J, Vajda I: On genetic information, diversity and distance. Methods Inf Med. 2006, 2: 173-179.Google Scholar
- Patil GP, Tailie C: Diversity as a concept and its measurement. J Am Stat Assoc. 1982, 77: 548-561. 10.2307/2287709.View ArticleGoogle Scholar
- Zvárová J, Zvára K: Stochastic modelling of biodiversity: f-diversity, self f-diversity and marginal f-diversity. Proceedings of the 6th Summer School on Computational Biology, Deterministic and Stochastic Modelling in Biology and Medicine. Edited by: Hrebicek J, Holcik J. 2010, Akademické nakladatelství CERM, Brno, 108-119.Google Scholar
- Bonachela JA, Hinrichsen H, Munoz MA: Entropy estimates of small data sets. Journal of Physics A: Mathematical and Theoretical. 2009, 41: 1(11)-Google Scholar
- Lee DH, Lau FY, Juan H: A method for encoding clinical datasets with SNOMED CT. BMC Med Inform Decis Mak. 2010, 10: 53-10.1186/1472-6947-10-53.View ArticlePubMedPubMed CentralGoogle Scholar
- Přečková P: Language of Czech medical reports and classification systems in medicine. European Journal for Biomedical Informatics. 2010, 6 (1): 58-65.Google Scholar
- Ringlestetter C, Schulz KU, Mihov S: Orthographic errors in web pages: toward cleaner web corpora. Computational Linguistics. 2006, 32 (3): 295-340. 10.1162/coli.2006.32.3.295.View ArticleGoogle Scholar
- Ministry of Health of the Czech Republic (homepage on the internet), Data Standard of MH CR - DASTA and NCLP. (last accessed October 10, 2011), [http://ciselniky.dasta.mzcr.cz]
- Institute of Health Information and Statistics of the Czech Republic (homepage on the internet). (last accessed October 10, 2011), [http://www.uzis.cz]
- Health Level Seven, Inc. (homepage on the internet) Health Level 7. (last accessed October 10, 2011), [http://www.hl7.org]
- European Committee for standardisation (CEN), Technical Committee CEN/TC251: European Standard EN 13606, "Health informatics - Electronic health record communication".
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6947/12/31/prepub