Using n-gram analysis to cluster heartbeat signals
- Yu-Chen Huang^^{1, 3},
- Hanjun Lin^^{1, 3},
- Yeh-Liang Hsu^{1, 3}Email author and
- Jun-Lin Lin^{2}
DOI: 10.1186/1472-6947-12-64
© Huang et al.; licensee BioMed Central Ltd. 2012
Received: 26 November 2011
Accepted: 21 June 2012
Published: 8 July 2012
Abstract
Background
Biological signals may carry specific characteristics that reflect basic dynamics of the body. In particular, heart beat signals carry specific signatures that are related to human physiologic mechanisms. In recent years, many researchers have shown that representations which used non-linear symbolic sequences can often reveal much hidden dynamic information. This kind of symbolization proved to be useful for predicting life-threatening cardiac diseases.
Methods
This paper presents an improved method called the “Adaptive Interbeat Interval Analysis (AIIA) method”. The AIIA method uses the Simple K-Means algorithm for symbolization, which offers a new way to represent subtle variations between two interbeat intervals without human intervention. After symbolization, it uses the n-gram algorithm to generate different kinds of symbolic sequences. Each symbolic sequence stands for a variation phase. Finally, the symbolic sequences are categorized by classic classifiers.
Results
In the experiments presented in this paper, AIIA method achieved 91% (3-gram, 26 clusters) accuracy in successfully classifying between the patients with Atrial Fibrillation (AF), Congestive Heart Failure (CHF) and healthy people. It also achieved 87% (3-gram, 26 clusters) accuracy in classifying the patients with apnea.
Conclusions
The two experiments presented in this paper demonstrate that AIIA method can categorize different heart diseases. Both experiments acquired the best category results when using the Bayesian Network. For future work, the concept of the AIIA method can be extended to the categorization of other physiological signals. More features can be added to improve the accuracy.
Background
Biological signals may carry specific characteristics that reflect basic dynamics of the body. In many studies, biological signals are mapped into symbolic sequences for further analysis. For example, the DNA-sequence, which is composed of adenine (A), cytosine (C), guanine (G) and thymine (T), is a well-known biological symbolic sequence. When mapping to symbolic sequences, the essential information of the original signals must be preserved.
The human heart beat time series is another well-studied example. Human cardiac autonomic activity is affected by two different interactions: sympathetic activity increases heart rate, and parasympathetic activity decreases heart rate. Since these opposite effects are stimulated by many different kinds of stimuli, human heart beat time series is highly variable and complex. [Cysarz et al. 1] demonstrated that even regular heartbeat dynamics may be associated with cardiac health. They found that in healthy subjects, continuous adaptation to different activities occurs during daytime, but there was erratic behavior in Congestive Heart Failure (CHF) patients.
Regular heart beat dynamics contains distinct alternation of acceleration and deceleration. Some early traditional linear methods could reliably describe partial actions in autonomic regulation, such as respiration [2, 3]. However, non-linear methods are needed to analyze highly variable data, such as heartbeat signals [2, 4, 5]. In recent years, many researchers have shown that representations which used non-linear symbolic sequences can often reveal much hidden dynamic information. This kind of symbolization proved to be useful for predicting life-threatening cardiac diseases [6–11].
At present, there are three different approaches for using non-linear symbolic sequences to represent heart beat time series. The first approach is based on the deviation of the heart rate time series from the local mean, and a symbol is assigned to each heartbeat. For example, if the momentary heart rate is close to the mean value, it is assigned a “1”; if the heart rate is lower than the mean value, it is assigned a “2”; others are assigned a “3”. [Voss et al. 7] found that there were some specific patterns in patients after suffering myocardial infarction using the symbolization based on deviation from the mean value. They later improved this method to identify patients with other high risk cardiac diseases [12].
The second approach is to symbolize the increase or decrease of the momentary heart rate by two different symbols. For example, Yang et al. [10] simplified the heartbeat dynamics via mapping the output to binary sequences, where the increases of the interbeat intervals were denoted by “1” and others were denoted by “0”. They presented a distance method based on rank order statistics to calculate the dissimilarity between two symbolic sequences. According to the results, this method can robustly recognize the difference between healthy people and patients with heart diseases. [Peng et al. 11] of the same research team, combined the distance method with a weighting function, resulting in less overlap between groups, and more clearly distinguished classes corresponding to the level of subjects in the CHF group. [Van et al. 13] also found that symbolization can be applied to quantify the fetal heart rate, demonstrating that development of the autonomic nervous system and emergence of behavioral states lead to increase in both irregular and regular heart rate patterns.
The third approach is to divide the range between minimum and maximum heart rate into a few equidistant intervals, or to map a time series onto a symbolic sequences of permutation rank [14–16]. Entropy and entropy rate were used to evaluate the complexity of heart variability. [Porta et al. 14] used the pattern classification method to auto identify different physiological conditions by the activation of different mechanisms responsible for cardiovascular regulation. Permutation entropy and modified permutation entropy analysis have also been studied, which maps a time series onto a symbolic sequence of permutation rank [15, 16].
The second approach described above for symbolization does not need any parameter settings (e.g., the mean heart rate is required in the first approach), and it is independent of any other features of heart rate variations. In contrast to the third approach described above, it does not need to adjust the range of intervals which might affect the results of classification. However, the second approach used only binary symbols (e.g., 0 and 1) to represent acceleration and deceleration of interbeat intervals, which might not be able to represent the degree of variations. For example, the difference between two interbeat intervals such as +250 and +100 may both be represented as acceleration and assigned “1”, but actually they are not the same in a detailed interpretation, and the degree information of acceleration is lost in this binary representation.
To address this problem, this paper presents an improved method called the “Adaptive Interbeat Interval Analysis (AIIA) method”. The AIIA method uses the Simple K-Means algorithm for symbolization, which offers a new way to represent subtle variations between two interbeat intervals without human intervention. After symbolization, it uses the n-gram algorithm to generate different kinds of symbolic sequences. Each symbolic sequence stands for a variation phase. Finally, the symbolic sequences are categorized by classic classifiers.
This paper is organized as follows. Section 2 describes the procedure of the AIIA method. Sections 3 and 4 present two experiments to validate this method in classifying different diseases. Finally, Section 5 concludes the paper.
Methods
Preliminary treatment – Calculating the RRI difference
Symbolization – Using Simple K-Means algorithm to cluster the RRI differences
Simple K-Means is one of the most popular clustering techniques, and it has been adapted to many problem domains because of its simplicity and efficiency. This algorithm was voted as one of the top 10 algorithms in the data mining research area for identifying hidden patterns and revealing underlying knowledge from large data collections [17].
After calculating the RRI differences for each time series, the Simple K-Means algorithm is used to cluster the RRI differences. In this algorithm, parameter k represents the number of clusters desired. The output of the clustering algorithm is k clusters, which should correspond to any known classes in terms of instance distribution.
Identifying styles and signatures – Using the n-gram algorithm
This research uses 26 clusters (k = 26) with 1-gram, 2-gram and 3-gram for analysis, which includes 18,278 (18,278 = 26^{1} + 26^{2} + 26^{3}) different kinds of string combinations. That is to say, 18,278 different kinds of variations in the sample are considered.
Classification – Using Classic classifiers
Prior to classification, a probability matrix according to the occurrences of each gram in the last step is generated. Then 6 classic classifiers, including Bayesian Network, Logistic, Naïve Bayesian, Neural Network, Support Vector Matrix (SVM) and Tree-J48, are used to classify the samples into different heart diseases.
In the next section, two examples are used to demonstrate how the AIIA method is applied to categorize different types of heart rate time series. The databases for the examples were provided by PhysioBank, which was created under the auspices of the National Center for Research Resources of the National Institutes of Health, USA. It is a large and growing archive of well-characterized digital recordings of physiological signals and related data for use by the biomedical research community. The biomedical signals from healthy subjects and from patients with a variety of diseases are included [18].
The 10-fold cross-validation is used to assess the result. In the 10-fold cross-validation, the original samples were randomly partitioned into 10 subsets. Of the 10 subsets, a single subset was retained as the validation data for testing the model and the remaining 9 subsets were used as training data. This step was then repeated 10 times. Each subset was used exactly once as the validation data. Finally, the 10 results from the 10 subsets were averaged to produce a single estimation. The advantage of this method was that all observations were used for both training and validation and each observation was used for validation exactly once.
Results
Example 1 – Using the AIIA method to classify heart diseases
The 5 groups of the example 1
No. | Group | Subjects | Description | Source |
---|---|---|---|---|
1. | Congestive Heart Failure (CHF) | 43 | 15 females and 28 males, average age 55.5 years. It takes 16 to 24 hours for each sample (around 75,000 RRI). | BIDMC Congestive Heart Failure Database [19] |
2. | Atrial Fibrillation (AF) | 9 | Takes only 2 hours for recording (around 12,000 RRI). | Albert C.-C. Yang |
3. | Healthy Young (HY) | 20 | 10 females and 10 males, average age 25.9 years. It takes 2 hours for each sample (around 7,100 RRI). | |
4. | Healthy Elderly (HE) | 20 | 10 females and 10 males, average age 74.5 years. It takes 2 hours for each sample (around 7,200 RRI). | |
5. | White Noise (WNU) | 50 | Uniform distribution. It takes 6 hours for each sample (around 15,000 RRI). | Artificially generated |
Total | 142 |
The top 4 classified results (2 gram, 26 clusters)
Classifier | Cluster Number (k) | Total number of instances | Correctly classified instances | Incorrectly classified instances | Accuracy | Best Performance |
---|---|---|---|---|---|---|
Bayesian Network | 20 | 142 | 126 | 16 | 88.7% | 20 clusters, 88.7% |
SVM | 20 | 142 | 124 | 18 | 87.3% | 24 clusters, 88.7% |
Tree-J48 | 20 | 142 | 121 | 21 | 85.2% | 24 clusters, 88.0% |
Naïve Bayse | 20 | 142 | 113 | 29 | 79.6% | 24 clusters, 81.7% |
Detailed classification results form using Bayesian Network
Group | AF | CHF | HE | HY | WNU |
---|---|---|---|---|---|
Total | 9 | 43 | 20 | 20 | 50 |
Correct | 7 | 38 | 17 | 13 | 50 |
Incorrect | 2 | 5 | 3 | 7 | 0 |
Accuracy | 77.8% | 88.4% | 85.0% | 65.0% | 100% |
Example 2 – Using the AIIA method to classify patients with apnea
The 4 groups of the example 2
No. | Group | Subject | Description | Source |
---|---|---|---|---|
1. | Apnea (APNEA) | 20 | This experiment uses the class ‘A’ set which includes 20 records for the target set of Apnea. These records meet all Apnea criteria. Recordings in class A contain at least one hour with an apnea index of 10 or more, and at least 100 minutes with apnea during the recording. It takes 8 hours for each sample (around 35,000 RRI). | Apnea-ECG database [23] |
2. | Health Young (HY) | 20 | 10 females and 10 males, average age 55.5 years. It takes 2 hours for each sample (around 7,100 RRI). | |
3. | Health Elderly (HE) | 20 | 10 females and 10 males, average age 74.5 years. It takes 2 hours for each sample (around 7,200 RRI). | |
4. | White Noise (WNU) | 20 | Uniform distribution. It takes 6 hours for each sample (around 15,000 RRI). | Artificially generated |
Total | 80 |
Top 4 classification results (2 gram, 26 clusters)
Classifier | Cluster Number (k) | Total number of instances | Correctly classified instances | Incorrectly classified instances | Accuracy | Best Performance |
---|---|---|---|---|---|---|
Bayesian Network | 11 | 80 | 68 | 12 | 85.0% | 11 clusters, 85.0% |
Tree-J48 | 11 | 80 | 63 | 17 | 78.6% | 17 clusters, 81.3% |
Logistic | 11 | 80 | 55 | 25 | 68.8% | 17 clusters, 83.8% |
SVM | 11 | 80 | 54 | 26 | 67.5% | 23 clusters, 83.8% |
Detailed classification results using Bayesian Network
Group | Apnea | HE | HY | WNU |
---|---|---|---|---|
Total | 20 | 20 | 20 | 20 |
Correct | 19 | 17 | 12 | 20 |
Incorrect | 1 | 3 | 8 | 0 |
Accuracy | 95.0% | 85.0% | 60.0% | 100% |
Discussion
As interest continues to grow in analyzing heart diseases, symbolic analysis will clearly remain an important research tool. It offers advantages such as computational efficiency, ease of visualization, as well as the ability to combine with other algorithms, information theories and language that may not be matched by any other approach. The most significant issue in the application of symbolic analysis is how to develop an algorithm to appropriately define symbols in the absence of generating partitions. Although some information is always lost during the symbolic transformation process and it involves some degree of imprecision, many associated applications have proved it to be viable and realistic.
The AIIA method presented here also cannot assure that no information is lost, but it tries to capture small variations when doing the symbolic transformation. First, the method uses up to 26 symbols (a to z) to represent variations between interbeat intervals to show the increase or decrease phases and the degree of variation. Second, the symbols are not generated by artificial experiences or functions, but by the Simple K-Means algorithm, which is one of the most popular clustering techniques that supplies clusters with minimal total variance [17]. The criterion of minimal total variance yields the most closed clusters. That is, if variations belong to the same cluster, they are similar. This step is totally different from previous studies. Finally, it uses the n-gram algorithm to generate symbolic sequences. Closely associated with the problem of symbol definition, there always needs to be an efficient algorithm for defining the appropriate length of symbolic sequences. The n-gram algorithm can automatically change the lengths of sequences according to the experimental performance. The complexity of calculating the occurrence of each “gram” is, where n is the number of clusters and m is the number of grams. In general, more clusters and grams may lead to better performance, but it requires a large amount of computation and takes a long CPU time. It also may lead to an overfitting problem.
Conclusions
Biological signals may carry specific characteristics that reflect basic dynamics of the body. Therefore, finding and analyzing the hidden signals of dynamical structures which raise a lot of clinical interests. The AIIA method presented here uses the Simple K-Means algorithm for symbolization, which offers a new way to represent subtle variations between two interbeat intervals without human intervention.
The two experiments presented in this paper demonstrate that AIIA method can categorize different heart diseases. Both experiments acquired the best category results when using the Bayesian Network. For future work, the concept of the AIIA method can be extended to the categorization of other physiological signals. Further study is required to show robustness of the AIAA method, and more features can be added to improve its accuracy.
Declarations
Acknowledgements
We thank for Doctor Albert C.-C. Yang. He offered the samples of the AF group and provided a detailed explanation.
Authors’ Affiliations
References
- Cysarz D, Lange S, Matthiessen PF, Leeuwen P: Regular heart-beat dynamics are associated with cardiac health. Am. J. Physiol. Regulatory Integrative Comp. Physiol. 2007, 292: 368-372.View ArticleGoogle Scholar
- Kaplan DT, Talajic M: Dynamics of heart rate. Chaos. 1991, 1: 251-256. 10.1063/1.165837.View ArticlePubMedGoogle Scholar
- Yana K, Saul JP, Berger RD, Perrott MH, Cohen RJ: A time domain approach for the fluctuation analysis of heart rate related to instantaneous lung volume. IEEE Trans Biomed Eng. 1993, 40: 74-81. 10.1109/10.204773.View ArticlePubMedGoogle Scholar
- Goldberger AL, Bhargava V, West BJ, Mandell AJ: On a mechanism of cardiac electrical stability the fractal hypothesis. Biophys J. 1985, 48: 525-528. 10.1016/S0006-3495(85)83808-X.View ArticlePubMedPubMed CentralGoogle Scholar
- Kobayashi M, Musha T: 1/f Fluctuation of heartbeat period. IEEE Trans Biomed Eng. 1982, 29: 456-457.View ArticlePubMedGoogle Scholar
- Kurths J, Voss A, Saparin P, Witt A, Kleiner HJ, Wessel N: Quantitative analysis of heart rate variability. Chaos. 1995, 5: 88-94. 10.1063/1.166090.View ArticlePubMedGoogle Scholar
- Voss A, Kurths J, Kleiner HJ, Witt A, Wessel N, Saparin P, Osterziel KJ, Schurath R, Dietz R: The application of methods of non-linear dynamics for the improved and predictive recognition of patients threatened by sudden cardiac death. Cardiovasc Res. 1996, 31: 419-433.View ArticlePubMedGoogle Scholar
- Cysarz D, Bettermann H, Leeuwen P: Entropies of short binary sequences in heart period dynamics. Am. J. Physiol. Regulatory Integrative Comp. Physiol. 2000, 278: 2163-2172.Google Scholar
- Wessel N, Ziehmann C, Kurths J, Meyerfeldt U, Schirdewan A, Voss A: Short-term forecasting of life-threatening cardiac arrhythmias based on symbolic dynamics and finite-time growth rates. Phys. Rev. 2000, 61: 733-739.View ArticleGoogle Scholar
- Yang AC, Hseu SS, Yien HW, Goldberger AL, Peng CK: Linguistic Analysis of the Human Heartbeat Using Frequency and Rank Order Statistics. Phys. Rev. Lett. 2003, 90: 103108-Google Scholar
- Peng CK, Yang AC, Goldberger AL: Statistical physics approach to categorize biologic signals: From heart rate dynamics to DNA sequences. Chaos. 2007, 17: 015115-10.1063/1.2716147.View ArticlePubMedGoogle Scholar
- Wessel N, Voss A, Malberg H, Ziehmann C, Voss HU, Schirdewan A, Meyerfeldt U, Kurths J: Nonlinear analysis of complex phenomena in cardiological data. Herzschritt. and Elektroph. 2000, 11 (3): 159-173. 10.1007/s003990070035.View ArticleGoogle Scholar
- Leeuwen P, Cysarz D, Lange S, Geue D, Groenemeyer D: Quantification of fetal heart rate regularity using symbolic dynamics. Chaos. 2007, 17: 015-119.Google Scholar
- Porta A, Guzzetti S, Montano N, Furlan R, Pagani M, Malliani A, Cerutti S: Entropy, entropy rate, and pattern classification as tools to typify complexity in short heart period variability series. IEEE Trans. Biomed. Eng. 2001, 48: 1282-1291. 10.1109/10.959324.View ArticlePubMedGoogle Scholar
- Bandt C, Pompe B: Permutation Entropy: A Natural Complexity Measure for Time Series. Phys. Rev. Lett. 2002, 88: 174102-View ArticlePubMedGoogle Scholar
- Bian C, Qin C, Ma QD, Shen Q: Modified permutation-entropy analysis of heartbeat dynamics. Phys. Rev. E. 2012, 85: 021906-View ArticleGoogle Scholar
- Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou ZH, Steinbach M, Hand DJ, Steinberg D: Top 10 algorithms in data mining. Knowl Inform Syst. 2008, 14: 1-37. 10.1007/s10115-007-0114-2.View ArticleGoogle Scholar
- Goldberger AL, Amaral LA, Glass L, Hausdorff JM, Ivanov PC, Mark RG, Mietus JE, Moody GB, Peng CK, Stanley HE: PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation. 2000, 101 (23): 215-220. 10.1161/01.CIR.101.23.e215.View ArticleGoogle Scholar
- Physionet/The BIDMC Congestive Heart Failure Database. http://www.Physionet.org/physiobank/database/chfdb.
- Physionet/FantasiaDatabase. http://www.physionet.org/physiobank/database/fantasia.
- Iyengar N, Peng CK, Morin R, Goldberger AL, Lipsitz LA: Age-related alterations in the fractal scaling of cardiac interbeat interval dynamics. Am J Physiol. 1996, 271: 1078-1084.Google Scholar
- Schafer H, Koehler U, Ploch T, Peter JH: Sleep-Related Myocardial Ischemia and Sleep Structure in Patients With Obstructive Sleep Apnea and Coronary Heart Disease. Chest. 1997, 111: 387-393. 10.1378/chest.111.2.387.View ArticlePubMedGoogle Scholar
- Penzel T, Moody GB, Mark RG, Goldberger AL, Peter JH: The Apnea-ECG Database. Computers in Cardiology. 2000, 27: 255-25.Google Scholar
- The pre-publication history for this paper can be accessed here:http://www.biomedcentral.com/1472-6947/12/64/prepub
Pre-publication history
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.