BMC Medical Informatics and Decision Making, BioMed Central, 2002

Background: Classification of the electrocardiogram using Neural Networks has become a widely used method in recent years. The efficiency of these classifiers depends upon a number of factors including network training. Unfortunately, there is a shortage of evidence available to enable specific design choices to be made and as a consequence, many designs are made on the basis of trial and error. In this study we develop prediction models to indicate the point at which training should stop for Neural Network based Electrocardiogram classifiers in order to ensure maximum generalisation.


Background
For more than 4 decades, computers have been used in the classification of the Electrocardiogram (ECG) resulting in a huge variety of techniques [1,2] all designed to enhance the classification accuracy to levels comparable to that of a 'gold standard' of expert cardiology opinion. Included in these techniques are Multivariate Statistics, Decision Trees, Fuzzy Logic, Expert Systems and Hybrid approaches. The recent interest in Neural Networks (NNs) coupled with their high levels of performance has resulted in many instances of their application in this field [2,3].
In designing an ECG classifier based on NNs, the normal procedure is first to train the network by presenting it with training data that is representative of the unknown data it is likely to encounter during classification. A well-chosen training algorithm results in a NN which is capable of generating a non-linear mapping function that represents the relationships between given ECG features and cardiac disorders. A well designed NN will exhibit good generalisation, meaning that a correct input-output mapping is obtained even when the input is slightly different from the examples used to train the network [4]. In designing a NN, for example a multi-layered perceptron (MLP), the designer must make a number of choices with regard to the system architecture: what is the appropriate number of hidden layers to be included? How many nodes should each layer have? What activation function should be employed and in which configuration? Unfortunately, there is a shortage of evidence available to designers that would enable them to make specific design choices based on a clear scientific rationale and, as a consequence, many designs are made on the basis of trial and error.
There are other design issues associated with the level and extent of training required for such a network, in particular locating the point at which the network is considered to be sufficiently trained. Conventional methods of training MLPs involve a process whereby the network is trained to the point of minimum error based upon the training data. Subsequently, the network's internal parameters are fixed and it is tested with unseen data to evaluate its performance. This has been the most common approach in the development of NN ECG classifiers [3,5-7]. A danger exists with this approach in that the NN, during training, may memorise the training data. If this is the case, the NN may be biased towards the training data and hence not fully represent the underlying function that is to be modelled. In such instances, poor generalisation is attained when unseen data is presented as input to the network. This phenomenon is referred to as over-fitting. Hence it is possible to over-fit the NN if training is not stopped at the correct point.
By employing the 'early stopping method of training' [8] it is possible to test the NN at various stages of training on a validation data set to ensure that over-fitting is avoided. With such an approach it is usual to find that the learning performance of the NN increases monotonically with the number of epochs. The validation performance, in contrast, increases monotonically to a maximum and then begins to decrease gradually as training continues. The suggested point at which to stop learning is the maximum point on the validation curve. Figure 1 shows an example of a NN trained in this fashion. As indicated in Figure 1, the validation performance reaches its maximum at just below 500 epochs. After this point, although the learning performance continues to increase, the validation performance begins to decrease gradually. Thus, by employing the early stopping method of training, a network can be trained to a point of maximal generalisation based on a validation set, and over-fitting is avoided. Although this increases the computational requirements of the NN learning process, benefit is obtained in terms of higher levels of generalisation.
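The stopping rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `early_stopping_epoch` and the performance values are hypothetical, and validation performance is assumed to have been recorded at a series of evaluation points during training.

```python
def early_stopping_epoch(epochs, validation_performance):
    """Return the epoch at which validation performance peaks.

    Ties are resolved in favour of the earliest epoch, which keeps
    training as short as possible.
    """
    if len(epochs) != len(validation_performance) or not epochs:
        raise ValueError("need equally sized, non-empty sequences")
    best_index = max(range(len(epochs)),
                     key=lambda i: (validation_performance[i], -i))
    return epochs[best_index]


# Hypothetical curve: validation performance rises, peaks, then declines
# while training continues.
epochs = [100, 200, 300, 400, 500, 600, 700]
validation = [0.62, 0.71, 0.78, 0.83, 0.82, 0.80, 0.77]
stop_at = early_stopping_epoch(epochs, validation)
```

In practice the network weights saved at the selected epoch would be the ones retained for classification.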
The authors have previously developed a framework for classification of the 12-lead ECG based on a configuration of bi-group NNs (BGNNs) [9]. The framework has the capability to analyse a feature vector comprising approximately 300 features extracted from the 12-lead ECG and classify it into one of a possible 6 diagnostic categories: Inferior Myocardial Infarction, Anterior Myocardial Infarction, Combined Myocardial Infarction, Left Ventricular Hypertrophy, Combined Myocardial Infarction and Left Ventricular Hypertrophy and Normal. Each BGNN in the framework is represented by a single layer MLP with one output node. Each network has the ability to be trained to specifically detect the presence (or absence) of one of the aforementioned diagnostic categories and through a combination matrix an overall classification of the ECG currently under evaluation can be made.
The design procedures, as with similar studies, were largely based on trial and error. This increases the amount of effort required during the design process before the optimal solution can be identified. Additionally, to avoid over-fitting of the BGNNs, the early stopping method of training was employed, further increasing the computational requirements of the design process. Subsequently, a means of predicting a final solution and hence avoiding the aforementioned design constraints can be seen as being largely beneficial.

Figure 1
Example of a NN trained with the early stopping method of training depicting the performance with learning and test data.
To address the needs of such a prediction approach to assist in the design process, two methods have been presented: one based on NNs and the other on Genetic Programming (GP). GP can be defined as a search method based on natural selection rules [10,11]. The process begins by selecting an initial set of contending solutions for a particular problem. This set is created by randomly assembling programs from a gene pool of program parts consisting of, for example, operators, variables and constants. In order to obtain a good individual or solution (i.e. the program that solves the problem) a number of steps must be taken. Firstly, the set of contending solutions is exposed to the environment or problem that it is trying to address in order to determine each solution's 'fitness', i.e. how well it fits the problem at hand. Secondly, the evolutionary process in GP emulates natural evolution in that the unfit members are removed, fit members remain and new members are generated [10,11]. The entire process is repeated over successive generations until a specific criterion is met and the best solution obtained [12].
The aims of the current study have been to generate two models based on NNs and GP techniques to estimate, given a NN design for ECG classification, the point at which training should stop. The following section describes the methods and approaches of the study.

Methods
In order to develop the prediction models a number of variable attributes considered to affect the training and generalisation of the NNs were identified. These are variable conditions in the development of the NNs and hence are considered as having potential effects on the location of the point of maximum validation performance. The variables identified were:
1. Number of nodes in the hidden layer (n). (During training various network configurations are evaluated to locate the optimal solution.)
2. Feature Selection method employed (fs). (In an effort to maximise generalisation between classes, various feature selection methods are employed [13]. These aim to reduce the dimensionality of the input to the network, yet still maintain sufficient information to permit discrimination.)
(Depending on the training set size and feature selection method employed, the input feature vector size varies.)
As a final variable, the point at which the network attained maximum performance during training, in the form of the number of epochs (m), was also included.
These 5 variables can be considered, in the current study, to potentially contribute to the location of the point of maximum validation performance and hence be used as inputs to the prediction model to give the optimum number of epochs. As mentioned in the introductory section, these are variable factors which will change during the design process, in order to attain the optimal solution.

Figure 2
Comparison between actual and predicted values for the NN based prediction model for training data.
As a starting point for the current study, only the data collected from the development of the BGNN classifier for the Anterior Myocardial Infarction was analysed. This involved the analysis of the design of 44 different BGNN classifiers developed to specifically classify Anterior Myocardial Infarction. For each classifier a record was created detailing the above 5 design parameters. In addition, the point at which the network attained maximum performance following analysis of the results attained from the early stopping method of training was included. This can be considered to be the desired output of the prediction model and hence be used to compare with the actual output from the prediction model to evaluate how well the model fits the problem. The data was partitioned with two thirds allocated as training data (29 records) and one third as test data (15 records).
Two approaches were investigated to develop the necessary prediction model: a NN approach and a GP approach.

NN Approach to the Implementation of the Prediction Model
In an effort to develop a suitable prediction model, a NN based system has been employed. An MLP NN topology was adopted for the prediction model. During the development of the prediction model various architectures (in terms of the numbers of hidden layers and neurons in each layer) were developed and evaluated. The back propagation training algorithm was employed. The input layer of the NN had 5 neurons, one neuron for each of the aforementioned variable design parameters. A single neuron was used in the output layer with a sigmoidal activation function. The output from this neuron was linearly de-normalised to produce a value for the predicted number of epochs to stop training in the range 0-5000. Following assessment of the different neural classifiers developed, in terms of comparison between desired and actual outputs, the optimal NN generated for the required prediction model had a 5-4-1 architecture. Results generated are presented in the next section.
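The forward pass of such a 5-4-1 network, followed by the linear de-normalisation of the output, can be sketched as below. This is an illustration only: the weights are random and untrained, the use of a sigmoidal activation in the hidden layer is an assumption (the text specifies it only for the output neuron), the inputs are assumed to be normalised to the range 0-1, and all function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer with sigmoidal activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def predict_epochs(features, hidden_w, hidden_b, out_w, out_b,
                   max_epochs=5000.0):
    """Forward pass of a 5-4-1 MLP followed by linear de-normalisation."""
    hidden = layer(features, hidden_w, hidden_b)   # 5 inputs -> 4 hidden
    output = layer(hidden, out_w, out_b)[0]        # 4 hidden -> 1 output in (0, 1)
    return output * max_epochs                     # map (0, 1) onto 0-5000


# Untrained, randomly initialised weights (illustrative only).
rng = random.Random(0)
hidden_w = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]
hidden_b = [rng.uniform(-1, 1) for _ in range(4)]
out_w = [[rng.uniform(-1, 1) for _ in range(4)]]
out_b = [rng.uniform(-1, 1)]

# Five design parameters, assumed already normalised to the range 0-1.
example_features = [0.4, 0.2, 0.6, 0.3, 0.5]
predicted = predict_epochs(example_features, hidden_w, hidden_b, out_w, out_b)
```

Training with back propagation would adjust `hidden_w`, `hidden_b`, `out_w` and `out_b`; only the prediction step is shown here.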

Genetic Programming Approach to the Implementation of the Prediction Model
To further the investigations in terms of the development of suitable prediction models a GP approach was also employed. For the GP prediction model, populations of 3000 individuals were initially evolved and the arithmetic functions add, minus, protected division and product were defined as the function set [10]. Sets of random float type constants between 0.0 and 5.0, 0.0 and 50.0, and 0.0 and 500.0 were defined as the terminal set [10], together with the aforementioned 5 input parameters (n, fs, N, s, m). The fitness function (i.e. the measure of how well the model fits the problem) was based on absolute errors for the desired output parameter and the complexity of each individual. Following the evolution process, the individual (solution program) found had a raw fitness of 340.5 and a complexity of 127. Figures 2 and 3 indicate the prediction capabilities of both models following training. These graphs indicate the predicted epoch cycles from the NN and the GP models and the actual points at which the BGNN attained maximum performance.
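The function set, terminal set and fitness ingredients described above can be sketched as follows. This is a hedged illustration rather than the system used in the study: protected division is assumed to return 1.0 for a zero denominator (a common GP convention), complexity is assumed to be the number of nodes in the program tree, raw fitness is assumed to be the sum of absolute errors, and the text does not state how error and complexity were combined, so they are reported separately. All names, and the record and target values, are hypothetical.

```python
import random


def protected_div(a, b):
    """Division that returns 1.0 when the denominator is (near) zero."""
    return a / b if abs(b) > 1e-9 else 1.0


FUNCTIONS = {
    "add": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "product": lambda a, b: a * b,
    "pdiv": protected_div,
}
VARIABLES = ["n", "fs", "N", "s", "m"]
CONSTANT_RANGES = [(0.0, 5.0), (0.0, 50.0), (0.0, 500.0)]


def random_tree(rng, depth):
    """Randomly assemble a program from the function and terminal sets."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return rng.choice(VARIABLES)
        low, high = rng.choice(CONSTANT_RANGES)
        return rng.uniform(low, high)
    name = rng.choice(sorted(FUNCTIONS))
    return (name, random_tree(rng, depth - 1), random_tree(rng, depth - 1))


def evaluate(tree, record):
    """Evaluate a program tree on one record of input parameters."""
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, record), evaluate(right, record))
    if isinstance(tree, str):
        return record[tree]
    return tree


def complexity(tree):
    """Number of nodes in the program tree (assumed complexity measure)."""
    if isinstance(tree, tuple):
        return 1 + complexity(tree[1]) + complexity(tree[2])
    return 1


def raw_fitness(tree, records, targets):
    """Sum of absolute errors over all records (lower is better)."""
    return sum(abs(evaluate(tree, r) - t) for r, t in zip(records, targets))


# Hypothetical record and target; a small population for illustration.
rng = random.Random(1)
population = [random_tree(rng, depth=3) for _ in range(20)]
records = [{"n": 4, "fs": 1, "N": 100, "s": 30, "m": 900}]
targets = [250]
best = min(population, key=lambda t: raw_fitness(t, records, targets))
```

A full GP run would repeat selection, crossover and mutation over successive generations; only the initial population and its scoring are shown.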

Results
Both models closely follow the actual number of epochs at which maximum performance was attained. The number of epochs at which the given BGNN attained minimum error on the training data lies in the range 250-2500, in comparison with 50-500 for the actual values of epochs based on the early stopping method of training, 12-275 for the NN prediction model and 60-500 for the GP prediction model. The data is not normally distributed, so a non-parametric test, Wilcoxon's signed rank sum test for paired data, was utilised. This tests the hypothesis that the two outputs have the same distribution without making any assumptions as to their shape. It was applied to the results for both the NN and GP models for comparison of the desired and predicted results for both training and test data sets; these are given in Table 1 in the standard Wilcoxon's output fashion. These results indicate that there are no significant (sig.) differences between predicted and actual values for either the NN or the GP method based on the training data. For the test data, the NN model showed no significant differences between predicted and actual values, while the GP results are just marginally significantly different (p = 0.047).
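The rank sums and mean ranks that underlie the signed rank test can be computed as in the sketch below. It is an illustration only: the function name and the example values are hypothetical, the sign convention (positive when the prediction exceeds the actual value) is an assumption, and the p-value calculation is omitted.

```python
def signed_rank_summary(actual, predicted):
    """Rank the absolute paired differences and summarise them by sign.

    Zero differences are dropped and tied absolute differences share
    their average rank, as in the usual signed rank procedure.
    """
    diffs = [p - a for a, p in zip(actual, predicted) if p != a]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(ordered):
        end = start
        while (end + 1 < len(ordered)
               and abs(diffs[ordered[end + 1]]) == abs(diffs[ordered[start]])):
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[ordered[position]] = average_rank
        start = end + 1
    positive = [r for r, d in zip(ranks, diffs) if d > 0]
    negative = [r for r, d in zip(ranks, diffs) if d < 0]
    return {
        "sum_positive": sum(positive),
        "sum_negative": sum(negative),
        "mean_positive": sum(positive) / len(positive) if positive else 0.0,
        "mean_negative": sum(negative) / len(negative) if negative else 0.0,
    }


# Hypothetical actual and predicted stopping epochs.
summary = signed_rank_summary([100, 200, 300, 400], [110, 195, 330, 420])
```

A larger mean rank for the positive differences than for the negative ones indicates a tendency to overestimate.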
Another performance measure is that related to the mean absolute error (MAE) of the two train and the two test cases (NN and GP). These can be tested with each other by a t-test for paired samples for the means of the differences, given the degrees of freedom (d.f.). The results for this are given in Table 2. This indicates that the means of the errors of the training sets are significantly different (p = 0.008), indicating that the GP training performs significantly better. For the test data, the means of the errors are not significantly different (p = 0.955), hence the NN and the GP could be considered to perform equally well on the test sets.
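The MAE and the paired t statistic on the per-record absolute errors can be sketched as follows. The function names and all numeric values are hypothetical and are not taken from Table 2; looking up the p-value for the resulting t and degrees of freedom is omitted.

```python
import math


def mean_absolute_error(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def paired_t_statistic(errors_a, errors_b):
    """t statistic and degrees of freedom for paired samples."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the paired differences (n - 1 in the denominator).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    t = mean_diff / math.sqrt(variance / n)
    return t, n - 1


# Hypothetical actual values and predictions from two models.
actual = [100, 200, 300, 400]
nn_predicted = [90, 220, 310, 370]
gp_predicted = [105, 190, 320, 395]

nn_errors = [abs(p - a) for a, p in zip(actual, nn_predicted)]
gp_errors = [abs(p - a) for a, p in zip(actual, gp_predicted)]
nn_mae = mean_absolute_error(actual, nn_predicted)
gp_mae = mean_absolute_error(actual, gp_predicted)
t_value, degrees_of_freedom = paired_t_statistic(nn_errors, gp_errors)
```

The statistic is undefined when all paired differences are identical, since the variance is then zero.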
Some discrepancy in test results may be accounted for by the difference in approaches. The Wilcoxon's test here indicates that, although the GP performs as well as the NN with regard to the overall errors on the test sets (shown here by the t-test), the GP consistently slightly overestimates the predictions. This is evidenced by the difference in mean ranks (4.17 -ve; 10.56 +ve) and, on closer examination of the graph (Figure 5), this effect can just be discerned.

Discussion and Conclusions
As shown in the previous section, both the NN and GP models have the ability, to a certain extent, to predict the epoch number at which the BGNN should stop training. [...] Table 1, but also exhibited good generalisation, Figures 4 and 5. Although the GP performed significantly better than the NN in terms of training, both were comparable following evaluation with the test data, with no significant differences in this given study. The range of values for attained minimum error indicates that not only is there a gain in the reduction of the computational intensity of the learning process by being able to indicate when a given network should stop its training, but, as previously stated, a gain is also achieved in generalisation.

Figure 4
Comparison between actual and predicted values, following exposure to the test data, for the NN based prediction model.

Figure 5
Comparison between actual and predicted values, following exposure to the test data, for the GP based prediction model.

This study demonstrates that it is possible to generate prediction models to detect the point at which training should stop, in terms of epochs, based on variable design parameters. This provides an indication of the point at which maximum performance, and subsequently maximal generalisation, can be attained. The models presented in essence provide a means of reverse engineering the heuristic problem of NN design: given the variable parameters of a network architecture and its associated optimal learning performance based on the training data, an indication can be given as to when the network should stop training in order to provide the maximum level of generalisation. This can be seen to be of benefit to developers of NNs, not only in the presented case of NN based ECG classifiers, but potentially in other classification problems. Such an approach reduces the effort of a trial and error process of NN design and additionally helps to ensure good generalisation.
Further work is planned to develop prediction models for the remaining BGNNs classifying different diagnostic classes and investigate further commonalities and differences in the results generated by both the NN and GP approaches. Initial results have indicated similar findings for the prediction models when applied to different diagnostic classes [14].