A comparative study of CNN-capsule net, CNN-transformer encoder, and Traditional machine learning algorithms to classify epileptic seizure
BMC Medical Informatics and Decision Making volume 24, Article number: 60 (2024)
Abstract
Introduction
Epilepsy is a disease characterized by excessive neuronal discharges, generally occurring without any external stimulus, that produce convulsions. About 2 million people are diagnosed each year worldwide. Diagnosis is carried out by a neurologist using an electroencephalogram (EEG), which is a lengthy process.
Method
To make this process more efficient, we have turned to artificial intelligence methods for classifying EEG signals. Comparing traditional machine learning and deep learning models with cutting-edge models, in this case CapsuleNet and Transformer Encoder architectures, is crucial for finding the most accurate model and helping the doctor reach a faster diagnosis.
Result
In this paper, a comparison was made between different models for binary and multiclass classification of the epileptic seizure detection database, achieving a binary accuracy of 99.92% with the CapsuleNet model and a multiclass accuracy with the Transformer Encoder model of 87.30%.
Conclusion
Artificial intelligence is essential in diagnosing pathology. Comparing models is useful because it helps to discard those that are not efficient. State-of-the-art models outperform conventional models, but data processing also plays an essential role in achieving higher model accuracy.
Introduction
Epilepsy is a neurological disorder characterized by discharges in the nervous system that occur without an external stimulus and produce convulsions, episodes of unusual behavior, and sometimes loss of consciousness, affecting people of all ages and geographical locations. It is a common but stigmatized disease, which makes its diagnosis and treatment challenging, especially in low-resource countries, where mortality rates are higher than in developed countries [1]. This condition encompasses four main classes: focal, generalized, focal generalized, and unknown. Recent studies have shown that epilepsy is not just seizures; patients can also experience neuropsychiatric and neurobehavioral symptoms [2]. The symptoms of a seizure can vary widely. Some people with epilepsy only stare briefly during a seizure, while others constantly move their arms or legs [3].
Epilepsy diagnosis and treatment pose unique challenges, especially in low-resource countries where stigma and lack of access to specialized care increase mortality rates. The interpretation of the electroencephalogram (EEG), crucial in diagnosis, is an intensive and variable process dependent on the specialist's experience [4]. In this context, artificial intelligence (AI) and deep learning are promising solutions, particularly methods based on convolutional neural networks (CNN) that promise to analyze EEG data with greater precision and efficiency [5]. Despite advances in AI for the diagnosis of epilepsy, there is a significant gap in comparing different deep learning architectures with traditional machine learning techniques, which is crucial for identifying the most effective models. This study aims to fill this gap by comparing CNN-based methods and traditional machine learning techniques, seeking to improve the accuracy and efficiency of epilepsy diagnosis. The findings of this research could transform the diagnosis of epilepsy, offering faster and more precise methods and reducing the economic and social burden of this condition, especially in regions with limited access to neurology specialists.
In machine learning mechanisms, hyperparameters are adjusted [6]. A pipeline mechanism is used to modify these hyperparameters. It aims to chain together different steps in an organized manner to extract features and make adjustments to a model. Following this, a grid search is employed. This technique explores the best values and evaluates the model's performance for each combination of values [7].
In recent years, new deep-learning mechanisms have improved the model's capacity. Suat Toraman discusses Capsule Neural Networks (CapsuleNet) and their role in enhancing the performance of image prediction models. Additionally, Toraman proposes a CapsuleNet model for predicting epileptic seizures along with 1D convolutional networks [8]. More recently, Shuaicong Hu et al. proposed a hybrid transformer model for classifying epileptic seizures, which primarily consists of four blocks: Rhythm Embedding, Positional Encoding, Self-Attention, and Classifier [9].
CapsuleNets are a new type of machine learning (ML) architecture recently developed to overcome the disadvantages of CNNs. CapsuleNet is robust to affine rotations and translations, which is useful when dealing with medical image datasets. In addition, Vision Transformer (ViT) based solutions have recently been proposed to address the long-term dependency limitations of CNNs. Incorporating CapsuleNet and Transformer Encoder into deep learning models offers improvements in performance and computational cost, since CapsuleNet requires less training data than CNNs and Transformer Encoder models are more robust and perform better. However, this combination has yet to be explored in the medical data field [10].
Yi Wei et al. [11] propose combining a Transformer model and a CapsuleNet to improve performance in emotion recognition: the Transformer model is used to extract information, and the CapsuleNet is used to refine the features, thus avoiding limitations that arise from CNNs and achieving excellent performance in this field of study. The combination of these models served as a motivation for our research.
Considering that the purpose of this paper is to compare models for the classification of electroencephalograms, the main contributions are:

The CapsuleNet model is modified and applied to signals or flat data, releasing the code to be replicated in other problems.

State-of-the-art architectures were combined to create new optimized models that achieve the best possible classification. In addition, the models are compared, and a verdict is given as to which is the most efficient for the database used. As with the modified CapsuleNet model, the repository is published for experimentation on other types of pathologies or with different databases.

The optimization of the models improves code compilation times, helping to reduce computational costs and giving way to the ease of doing multiple experiments.
This paper is structured into key sections: "Introduction", which sets the stage by providing a comprehensive overview of the addressed ideas; "Related work", exploring works related to the current study; "Materials and methods", offering insights into the methodologies employed; "Results", presenting the outcomes of the study; "Discussion", analyzing and interpreting the results; and "Conclusion", summarizing the key findings and implications. Each section contributes to a comprehensive understanding of the research endeavor.
Related work
Epilepsy is a chronic brain disease that affects people of all ages. It is estimated that around 50 million people suffer from this disease, making it one of the most common neurological diseases. The World Health Organization estimates that 70% of people with epilepsy can live seizure-free if properly diagnosed and treated, so in recent years new research has emerged to identify epilepsy using deep learning, such as the study conducted by Gaowei Xu et al. [12]. They implemented a one-dimensional convolutional neural network combined with long short-term memory (1D CNN-LSTM) to analyze epileptic seizures through EEG signals. They first preprocessed and normalized the data. The CNN then extracts features from the data, which pass to LSTM (Long Short-Term Memory) layers that extract the temporal features, and these outputs are finally fed into fully connected layers. This achieved an accuracy of 99.39% in binary detection and 82% in multiclass detection, demonstrating the potential of deep learning models for epilepsy detection.
On the other hand, Mengnan Ma et al. [13] used a combination of a recurrent neural network (IndRNN) and a 1D CNN to detect interictal, preictal, and ictal periods of epilepsy. The 1D CNN was used to extract the features of the EEG signal, while the IndRNN was used to distinguish the categories based on the extracted features; with this combination, a deep learning model was created for the spatiotemporal detection of the disease. On small-sample datasets from the University of Bonn, the proposed method achieved 100% classification accuracy and specificity in detecting all three classes.
Another study, carried out by Rubén San-Segundo et al. [14], used an EEG database from Bern-Barcelona [15] and the epileptic seizure recognition database [16]. The first contains data from two categories, unlike the second, which is divided into three classes: healthy (Z), interictal (F), and ictal (S). Several transformations of the EEG signal with Fourier, wavelet, and decomposition methods were evaluated empirically across various seizure detection scenarios, and the best results were obtained with the Fourier transform. Accuracy increased from 99.0% to 99.5% for classifying non-seizure vs. seizure records, from 91.7% to 96.5% when differentiating between healthy, non-focal, and seizure records, and from 89.0% to 95.7% when considering adjustment, focal and seizure records.
Amirmasoud Ahmadi et al. [17] presented a new algorithm for seizure classification using the wavelet packet transform (WPT) to better identify the essential characteristics of the signal. They used a public database from the Epilepsy Centre at Bonn University, which contained EEG signals from five healthy subjects and five patients with epilepsy. This dataset was divided into 17 subsegments that, in turn, were organized into WP trees. From these coefficients, they computed statistical characteristics such as standard deviation (STD) and root mean square (RMS). Then, they used a support vector machine (SVM) classifier for binary classification in seven cases. The best result was obtained by classifying class A (healthy person with open eyes) versus class E (epileptic seizure) with an accuracy of 99.64%, while for the binary classification of class E versus the remaining four classes, an accuracy of 97.85% was obtained.
Lina Wang et al. [18] first preprocessed a database from the University Hospital Bonn, Germany, that contained data from five healthy subjects and five patients with epilepsy, filtering the EEG signal with a wavelet-threshold denoising method. They analyzed the signals in the time, frequency, and time-frequency domains and performed a nonlinear analysis using empirical mode decomposition (EMD). They implemented five algorithms, including K-nearest neighbors (kNN) and SVM, the latter being the classifier with the highest accuracy: it obtained 99.25% with the nonlinear multi-domain analysis under 10-fold cross-validation, with a standard deviation of 0.28.
Shen et al. [19] propose a real-time approach to detecting epileptic seizures using EEG. This approach combines a tunable-Q wavelet transform and a CNN. The authors extract spectral and time-domain features from the EEG, such as statistical moments and spectral power, and convert them into image-like data to feed the CNN. The proposed method was evaluated using the CHB-MIT database, and promising results were obtained. The accuracy was 97.57%, with a sensitivity of 98.90% and a false positive rate of 2.13%. In addition, the feasibility of implementing this approach in real time is highlighted, making it suitable for application in clinical settings for seizure detection.
Finally, a recent study by Chen et al. [20] proposes an automated method for detecting epileptic seizures in EEG signals using a CNN-based classifier with feature fusion and selection. The authors extract mixed features from EEG signals using discrete wavelet decomposition (DWT), including approximate entropy (ApEn), fuzzy entropy (FuzzyEn), sample entropy (SampEn), and STD, use a random forest algorithm to select relevant features, and apply CNNs to classify epileptic EEG signals. Experimental results on reference datasets, such as Bonn EEG and New Delhi, demonstrate the efficacy of the proposed method. For the Bonn dataset's interictal and ictal classification tasks, the model achieves an accuracy of 99.9%, a sensitivity of 100%, a precision of 99.81%, and a specificity of 99.8%. For the interictal-ictal case of the New Delhi dataset, the model achieves 100% classification accuracy, 100% sensitivity, 100% specificity, and 100% precision.
This research demonstrates the ability of the proposed approach to detect and classify EEG signals associated with epileptic seizures with high accuracy, which is of great relevance in the clinical detection of epilepsy. These studies are detailed in Table 1.
Materials and methods
Database
The epileptic seizure recognition database [16] consists of recordings from 5 individuals, each with 4097 data points covering 23.6 seconds. In this database, each recording is split and shuffled into 23 chunks of 178 data points, each chunk corresponding to 1 second of signal. It is divided into five classes (a, b, c, d, e):

Class (a) represents the recording of epileptic activity.

Class (b) represents the EEG recording from the area where a tumor was present.

Class (c) represents the healthy part of the brain after tumor localization.

Class (d) represents the recording of the patient with closed eyes.

Class (e) represents the EEG recording of the individual with eyes open.
The database was divided in two ways to compare the results obtained from the study and to have clear and precise information on how to classify the different brain activities corresponding to epilepsy. The original dataset has five folders with 100 records each from another patient, totaling 5 individuals/persons. Each file is a recording of brain activity for 23.6 seconds. The corresponding time series is sampled into 4097 data points. The dataset used here was modified by dividing and shuffling each 4097-point recording into 23 chunks, each containing 178 data points for 1 second of signal. As a result, the dataset contains 11,500 samples (500 recordings × 23 chunks).
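The chunking arithmetic can be checked with a short sketch. This is only an illustration on synthetic data, not the dataset authors' code; it assumes that the 3 points left over after taking 23 × 178 = 4094 points are discarded, and the names `split_recording` and `recordings` are hypothetical.

```python
import numpy as np

POINTS_PER_RECORDING = 4097   # samples in one 23.6 s recording (from the text)
CHUNK_LEN = 178               # samples per 1 s chunk (from the text)
CHUNKS_PER_RECORDING = 23     # chunks per recording (from the text)

def split_recording(signal):
    """Split one recording into 23 chunks of 178 points each."""
    signal = np.asarray(signal)
    usable = CHUNKS_PER_RECORDING * CHUNK_LEN  # 4094; trailing 3 points dropped (assumption)
    return signal[:usable].reshape(CHUNKS_PER_RECORDING, CHUNK_LEN)

# Synthetic stand-in for 5 folders x 100 files; not real EEG data.
rng = np.random.default_rng(0)
recordings = rng.normal(size=(500, POINTS_PER_RECORDING))

chunks = np.concatenate([split_recording(r) for r in recordings])
rng.shuffle(chunks)           # "mixing" step: shuffle the chunk order
print(chunks.shape)           # (11500, 178)
```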
With this in mind, it was decided to split the dataset into two classifications, a binomial and a multinomial classification, looking for the best performance of the model when classifying the different signals.
Binary partition
With this dataset division, the work was divided into two phases. The first phase worked with only two classes in search of the best binomial classification. The classes were divided as follows: the first class represents brain activity during an epileptic seizure (Class 1), with a total of 2300 samples, and the second represents no epileptic activity (Class 0), with a total of 9200 samples. The latter groups the other four classes of the original dataset, in none of which an epileptic seizure occurred, as shown in Fig. 1.
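A minimal sketch of this relabeling follows. It assumes, purely for illustration, that the five original classes are coded as integers 1 to 5 with 1 denoting the seizure class; the variable names are hypothetical.

```python
import numpy as np

# Hypothetical label vector: five classes coded 1..5, 2300 samples each,
# with 1 assumed to be the seizure class.
y_multiclass = np.repeat(np.arange(1, 6), 2300)

# The seizure class becomes Class 1; the other four classes merge into Class 0.
y_binary = (y_multiclass == 1).astype(int)

counts = np.bincount(y_binary)
print(counts)  # [9200 2300]
```

The resulting 4:1 imbalance between Class 0 and Class 1 is what motivates the data balancing step for the binary task.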
Multiclass partition
The second phase was carried out in search of the best multinomial classification with the five original classes of the dataset. Each class represents a different condition under which the brain activity was recorded: Class (a) represents the recording of epileptic activity, Class (b) represents the EEG recording from the area where a tumor was present, Class (c) represents the healthy part of the brain after tumor localization, Class (d) represents the recording of the patient with closed eyes, and Class (e) represents the EEG recording of the individual with eyes open. Each class has 2300 brain activity samples from the five folders, each with 100 patients, as shown in Fig. 2.
Database preparation
As mentioned earlier, the database was divided into two and five classes. For two classes, standard normalization was performed. For five classes, the following preprocessing steps were carried out:

Any: The database is used without additional preprocessing.

Scaling: Database with standard normalization.

PCA: Principal Component Analysis (PCA) was performed on the database. It is a technique used to extract the most relevant features by finding the directions of highest variability in the data, representing the data in a smaller dimension without losing too much information [21]. It consists of a standardization step, followed by calculation of the covariance matrix, calculation and selection of the component vectors, and finally a projection of the data. For this database, PCA reduced the features from 178 to 40 in the multiclass models, and the models with and without PCA were compared with the standard scaler.

Scaling + PCA: The database underwent normalization followed by PCA. As when PCA alone was performed, the features were reduced from 178 to 40.
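The "Scaling + PCA" option can be sketched with NumPy, following the steps listed above (standardization, covariance matrix, component selection, projection). This is an illustrative sketch on synthetic data rather than the authors' implementation; the function names are hypothetical.

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance scaling per feature (the 'Scaling' step)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant features
    return (X - mu) / sigma

def pca_project(X, n_components=40):
    """Project X onto the top principal components of its covariance matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)            # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]             # components sorted by variance

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 178))              # synthetic stand-in for 178-point chunks
X_scaled_pca = pca_project(standardize(X), n_components=40)
print(X_scaled_pca.shape)                     # (1000, 40)
```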
Hyperparameters
Grid search
Grid search is a traditional method for hyperparameter optimization in which an exhaustive search is performed over a manually bounded subset of the model's hyperparameter space. Bounds are needed because the range of a hyperparameter can include real-valued or unbounded values. One of the main problems with grid search is that it suffers in high-dimensional spaces, so specific limits must be applied. Its great advantage is that each combination of hyperparameter values is evaluated independently of the others, so the process is easy to interrupt or to distribute [22].
Pipeline
The term pipeline is used for objects capable of combining estimators and various transformers to create a combined estimator [23]. It also helps to organize the data flow into the desired model, covering the elements essential to its proper functioning, such as features, results, predictions, and raw data. The benefit is substantially improved performance and effectiveness, which is fundamental in developing many machine learning models.
This paper used a grid search and a pipeline to find the hyperparameters best fitting the machine learning models. These can be seen in Table 2.
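As an illustration of how a pipeline and a grid search fit together, the sketch below uses scikit-learn's `Pipeline` and `GridSearchCV`. It is not the authors' configuration: the data are synthetic, the choice of an SVM is arbitrary, and the grid values are placeholders rather than the values reported in Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; the real input would be the 178-feature EEG chunks.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Chain the preprocessing step and the classifier into one estimator.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", SVC()),
])

# Placeholder grid; step-name prefixes route each value to the right step.
param_grid = {
    "clf__C": [0.1, 1, 10],
    "clf__kernel": ["linear", "rbf"],
}

# Every combination is evaluated with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Placing the scaler inside the pipeline means it is refit on each training fold, which avoids leaking statistics from the validation fold.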
Batch normalization (BN)
BN normalizes the features in each feature map to have a mean of 0 and a variance of 1 and then allows the distribution to be rescaled and retranslated. During training, this allows for a higher learning speed [24] (see Eq. 1).
Where:
\(\gamma\) = The rescaling scalar.
\(\beta\) = retranslation scalar.
E[X] = expectation.
Var[X] = variance.
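With the symbols defined above, batch normalization is commonly written as \(y = \gamma \frac{x - E[X]}{\sqrt{Var[X] + \epsilon}} + \beta\). The sketch below implements this standard form in NumPy; the small constant \(\epsilon\) for numerical stability and the function name are assumptions not taken from the text.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then rescale and retranslate."""
    mean = x.mean(axis=0)                     # E[X], per feature
    var = x.var(axis=0)                       # Var[X], per feature
    x_hat = (x - mean) / np.sqrt(var + eps)   # mean 0, variance close to 1
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 8))   # synthetic batch of 64 samples
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6), y.var(axis=0).round(3))
```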
Scaled exponential linear unit (SELU)
It was proposed by Klambauer et al. in 2017 [25]. It is a nonlinear function that behaves linearly for positive values and exponentially for negative values. This allows the values to be scaled as they propagate through the multiple layers of the neural network using its two constants: \(\lambda\), with a value of about 1.0507, and \(\alpha\), which governs the negative part, with an approximate value of 1.67326. Furthermore, it is considered a self-normalizing function since, as the information flows through the network, the mean and variance remain stable, helping to improve the stability of the model [26] (see Eq. 2).
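In its usual form, SELU is \(\lambda x\) for \(x > 0\) and \(\lambda \alpha (e^{x} - 1)\) otherwise. A minimal NumPy sketch of this definition, using the two constants quoted above (the function name is hypothetical):

```python
import numpy as np

LAMBDA = 1.0507   # scale constant (value from the text)
ALPHA = 1.67326   # constant for the negative branch (value from the text)

def selu(x):
    """Linear for positive inputs, scaled exponential for negative inputs."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs
    negative_branch = ALPHA * (np.exp(np.minimum(x, 0.0)) - 1.0)
    return LAMBDA * np.where(x > 0, x, negative_branch)

print(selu([-2.0, 0.0, 2.0]))
```

For very negative inputs the output saturates at \(-\lambda\alpha\), roughly -1.758.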
Dropout
Dropout is a regularization method that counteracts overfitting by temporarily deactivating randomly selected nodes and their connections. This prevents the neural network from excessively co-adapting and relying too heavily on specific features, which would limit it to recognizing only the training data. Dropout not only addresses overfitting but also contributes to developing more resilient networks: by forcing the network to operate with various sampled sub-networks, it promotes robustness. This approach facilitates the averaging of predictions and reduces test time [27]. It is used, as highlighted in Fig. 3, to mitigate overfitting in the network.
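The mechanism can be sketched in a few lines. The version below is "inverted" dropout, the variant used by common deep learning frameworks, in which surviving activations are rescaled during training so that nothing needs to change at test time; the function name and the synthetic input are assumptions.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    if not training or rate == 0.0:
        return x                          # inference: all units stay active
    keep = rng.random(x.shape) >= rate    # True for units that survive
    return x * keep / (1.0 - rate)        # rescale to preserve the expected value

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.5, rng=rng)
print(np.unique(dropped), dropped.mean().round(2))
```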
Data balancing
Synthetic minority oversampling technique (SMOTE)
The data set for the binary classification is very unbalanced, so data balancing is performed with SMOTE. It generates synthetic samples from similar neighboring samples and linear combinations between them; this increases the data of the minority class, allowing the model to better learn the patterns of the under-represented class [28].
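The interpolation idea behind SMOTE can be illustrated as follows. This is a simplified sketch on synthetic data, not the reference implementation (libraries such as imbalanced-learn provide one); `smote_sketch` and its parameters are hypothetical names, and it assumes the minority set has more than `k` samples.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    # Pairwise Euclidean distances within the minority class.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors per point
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                 # position along the segment
    return X_min[base] + gap * (X_min[chosen] - X_min[base])

rng = np.random.default_rng(0)
X_minority = rng.normal(size=(50, 4))            # synthetic minority samples
X_synthetic = smote_sketch(X_minority, n_new=150, rng=rng)
print(X_synthetic.shape)                         # (150, 4)
```

For the binary partition described earlier, fully balancing the classes would mean generating 9200 - 2300 = 6900 synthetic seizure samples.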
Adaptive synthetic sampling (ADASYN)
It is used to create synthetic data from the minority class to balance the data. It focuses on generating synthetic data in feature regions where minority class examples are few, helping the model better capture the class with fewer data and avoiding over-generalization of the model [29]. Synthetic samples are created by selecting a minority example and randomly choosing some of its neighbors, for which the density is calculated. Generation is based on interpolation and employs straight-line or k-means techniques. Based on the relative density, more examples of the minority class are generated [29].
Table 3 compiles the best hyperparameters of the machine learning models obtained in the GridSearch and Pipeline process for data balancing with SMOTE and ADASYN.
Models
Machine learning models:

The Extra Trees Classifier and the Random Forest Classifier (ETC and RFC): These are machine learning models built on decision trees, which are used for classification and regression. However, decision trees tend to overfit, which causes problems with new data [30]. To address this, the RFC trains multiple decision trees on random subsets of the training data. Finally, a voting algorithm is applied to obtain the best results [31, 32]. The ETC adds randomness to the training process to increase diversity among the trees and improve the model's performance [33].

The Support vector machine (SVM): This model can solve linear and nonlinear classification and regression problems. It is particularly well-suited for small and moderately complex data sets. The fundamental concept behind SVM classification is to separate classes by maximizing the margin between the decision boundary and the closest training patterns. Nonlinearity is introduced through kernel functions: SVMs [30] achieve linearly separable classes by utilizing kernel functions that modify or add features based on the training set.

Gradient Boosting: Combines multiple weak learning models into a single robust model [34]. The general idea is that the Gradient Boosting (GB) training process starts with a simple base model and fits it to the training data. Then, the residuals of this first base model are calculated. A new weak model is trained using the residuals as the target in each subsequent iteration. This new model is added to the existing ensemble of models and fitted to the updated residuals [30].

The Decision Tree classifier (DTC): Is based on a decision tree, which selects the most relevant features or attributes from the training set. In addition to this, additional criteria such as node stopping or pruning can be added to the decision tree [35].

KNeighbors Classifier (KNN): In the KNN algorithm, the example data is represented in an n-dimensional space, where n is the number of attributes of the data. Each point in this n-dimensional space is labeled with its corresponding class value. To classify an unclassified data point, it is placed in this n-dimensional space and the class labels of its k nearest data points are observed. Typically, k is an odd number. The class that appears most frequently among the k nearest data points is taken as the class of the new data point. In other words, the decision is made by voting among the k neighboring points. One of the significant advantages of this generic K-Nearest Neighbor algorithm for classification is that it lends itself to parallel operations [36].

Stochastic Gradient Descent (SGD): SGD is a variant of the Gradient Descent algorithm. Unlike the latter, it does not use the entire training data set in each iteration but instead uses mini-batches to calculate the gradient and adjust the model parameters. This decreases the computational burden. In addition to choosing its loss function, hyperparameters such as the learning rate and the number of mini-batches must be adjusted [37].
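Of the models listed above, the KNN voting rule is the simplest to illustrate. The sketch below uses a toy two-class data set invented for the example; `knn_predict` is a hypothetical name and Euclidean distance is an assumption.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: three points near the origin (class 0), three near (1, 1) (class 1).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.15, 0.10])))  # 0
print(knn_predict(X_train, y_train, np.array([1.00, 0.95])))  # 1
```

Using an odd k, as the text suggests, avoids tied votes in two-class problems.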
Convolutional neural network
The convolutional neural network is based on the premise that data have locally important patterns or features that can be extracted. There are multiple convolutional neural network designs; however, most follow the same structure, built from three types of layers. The first is the convolutional layer, which aims to learn feature representations of the input; it is composed of several convolution kernels that compute the different feature maps, followed by an activation function. The second is the fully connected layer, in which all neurons from the previous layer are directly connected to the next one to generate global semantic information. Finally, there is an output layer, which for classification tasks commonly has a softmax activation operator and an optimizer [38].
Eq. 3 denotes this:
Where: \(M^l\) represents each of the feature maps, \(M_i^{l-1}\) is the feature map of the previous layer, \(K_i^l\) is the kernel, b is the bias, and * refers to convolution.
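From these definitions, the convolutional layer is commonly written as \(M^l = f\left(\sum_i M_i^{l-1} * K_i^l + b\right)\), where the activation \(f\) is an assumption here since it is not defined in the text. The NumPy sketch below computes one output feature map in this way for a 1D signal with 'same' padding; it assumes an odd kernel size, and the function name is hypothetical.

```python
import numpy as np

def conv1d_feature_map(prev_maps, kernels, bias, activation=np.tanh):
    """Compute one feature map M^l from the previous layer's maps.

    prev_maps: (channels, length) array, the maps M_i^{l-1}
    kernels:   (channels, k) array, one kernel K_i^l per input map (k odd)
    """
    channels, length = prev_maps.shape
    pad = kernels.shape[1] // 2
    padded = np.pad(prev_maps, ((0, 0), (pad, pad)))   # zero 'same' padding
    out = np.full(length, bias, dtype=float)
    for i in range(channels):
        # np.convolve flips the kernel, i.e. a true convolution
        out += np.convolve(padded[i], kernels[i], mode="valid")
    return activation(out)

prev_maps = np.arange(10.0).reshape(2, 5)              # two input maps of length 5
identity_kernels = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
summed = conv1d_feature_map(prev_maps, identity_kernels, bias=0.0,
                            activation=lambda z: z)
print(summed)  # [ 5.  7.  9. 11. 13.]
```

With identity kernels, zero bias, and no activation, the output is simply the sum of the input maps, which makes the summation over \(i\) easy to verify.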
The neural network we use in this paper is described in Figs. 3 and 4. The 1D convolutional layers (Conv1D) use a kernel of size 3, 'same' padding, and a SELU activation. At the output of each set, we have a 1D max pooling layer and a dropout of 0.5. The sets are ordered in such a way that we have 2 layers of 32 neurons, 2 layers of 64 neurons, 3 layers of 128 neurons, 3 layers of 256 neurons, and 6 layers of 512 neurons divided into groups of 3. Connected to this last output, we have a 1D global max pooling layer that leads to the fully connected dense layers. These use batch normalization with a SELU activation and have 1024, 512, 256, 128, 128, 64, 32, and 16 neurons. Finally, we have a classification layer of 5 neurons, one for each class, with a softmax activation, and an Adam optimizer with a learning rate of 0.001.
Capsulenet
CNNs have limitations, such as the need for large amounts of training data, the inability to handle ambiguity and changes in object orientation, and the loss of information across layers. To overcome these shortcomings, Geoffrey E. Hinton proposed a new approach known as Capsule Neural Networks (CapsuleNet), described in Fig. 5. CapsuleNet implements groups of neurons called capsules, which encode spatial information and the probability of the existence of an object in an image.
Each capsule represents the instantiation parameters of a specific entity, such as an object or a part of an object. The length of a capsule's vector indicates the probability that the entity exists, while its orientation represents the instantiation parameters. In CapsuleNet, the model learns to represent an image inversely by examining it and attempting to predict the corresponding instantiation parameters. This is achieved by trying to reproduce the object the model thinks it has detected and comparing it to labeled examples in the training data, thus improving the ability to predict the instantiation parameters.
Active capsules at one level predict the instantiation parameters of higher-level capsules using transformation matrices. When several predictions match, a higher-level capsule is activated. Unlike the max pooling used in CNNs, CapsuleNet does not lose information about the exact position of the entity within a region, allowing higher-level capsules to cover larger regions of the image [39]. Within the hierarchy, the lower-level capsules encode more basic information, such as simple geometric shapes and their spatial position. In contrast, the higher-level capsules represent more structured geometries.
A key feature of CapsuleNet is its ability to handle spatial and hierarchical relationships between entities in an image. Unlike CNNs, where features are combined using convolution and pooling layers, CapsuleNet allows lower-level capsules to interact and predict the properties of higher-level capsules using transformation matrices. This architecture more effectively captures objects' hierarchical relationships and geometry in an image, resulting in a more robust and complete representation of visual features. A conventional CNN, by comparison, first extracts learned features that are then fed into a fully connected neural network that produces a classification. The network learns features by chaining convolutional blocks whose layers learn simple features; because the blocks are usually connected through pooling, they improve the classification by discarding unimportant activations, which makes the classifier robust to small transformations in the input data [40].
The routing algorithm for the CapsuleNet is as follows:

1. procedure ROUTING \((\hat{U}_{ji}, r, l)\)

2. for all capsule i in layer l and capsule j in layer \((l+1)\): \(b_{ij} \leftarrow 0\).

3. for r iterations do:

4. for all capsule i in layer l: \(c_{i} \leftarrow \texttt{softmax}(b_{i})\), where softmax is \(c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})}\).

5. for all capsule j in layer \((l+1)\): \(s_{j} \leftarrow \sum_{i} c_{ij} \hat{U}_{ji}\).

6. for all capsule j in layer \((l+1)\): \(v_{j} \leftarrow \texttt{squash}(s_{j})\), where squash is \(v_{j} = \frac{\|s_{j}\|^{2}}{1+\|s_{j}\|^{2}}\frac{s_{j}}{\|s_{j}\|}\).

7. for all capsule i in layer l and capsule j in layer \((l+1)\): \(b_{ij} \leftarrow b_{ij} + \hat{U}_{ji} \cdot v_{j}\).

return \(v_{j}\), where \(s_{j} = \sum_{i} c_{ij} \hat{U}_{ji}\) and \(\hat{U}_{ji} = W_{ij} u_{i}\).
The vector output \(v_{j}\) of capsule j represents its resulting output, while \(s_{j}\) represents the total input received by that capsule. In layers beyond the initial layer, the total input \(s_{j}\) of a capsule is calculated as a weighted sum of the "prediction vectors" \(\hat{U}_{ji}\) from the capsules in the layer below. This is achieved by multiplying the output \(u_{i}\) of a capsule in the lower layer by a weight matrix \(W_{ij}\). The coupling coefficients \(c_{ij}\) play a crucial role in determining the weights and are obtained through an iterative dynamic routing process [39].
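The routing procedure can be sketched in NumPy as follows. This is an illustrative reading of the steps above, not the code released with the paper: the prediction vectors \(\hat{U}_{ji}\) are passed in as an array of random values, the small `eps` in `squash` is an added numerical safeguard, and at least one routing iteration is assumed.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Shrink each vector to a length in [0, 1) while keeping its direction."""
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def routing(u_hat, r=3):
    """Dynamic routing; u_hat has shape (n_lower, n_upper, dim), r >= 1."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                 # routing logits b_ij
    for _ in range(r):
        c = softmax(b, axis=1)                       # coupling coefficients c_ij
        s = np.einsum("ij,ijd->jd", c, u_hat)        # weighted sum s_j
        v = squash(s)                                # capsule outputs v_j
        b = b + np.einsum("ijd,jd->ij", u_hat, v)    # agreement update
    return v

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(6, 3, 4))   # 6 lower capsules, 3 upper capsules, dim 4
v = routing(u_hat, r=3)
print(v.shape, np.linalg.norm(v, axis=1).round(3))
```

Because of `squash`, each output vector has a length below 1, which is what allows the length to be read as a probability that the entity exists.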
The CapsuleNet was incorporated into the model as a subsequent layer to the convolutional and pooling layers, with dimensions adapted to facilitate efficient processing of the capsule vectors. In the first step, an activation function is applied that normalizes and compresses the output values to ensure they are in an appropriate range. This compressed output is passed to the CapsuleNet, where linear transformations generate a tensor in response.
Transformer encoder
Transformers consist of an encoder-decoder architecture, where the encoder processes the input sequence and generates a representation, while the decoder generates the output sequence based on that representation. Each encoder and decoder layer of a transformer consists of multiple self-attention heads and feed-forward neural networks [41].
The key component of transformers is the attention mechanism, which allows the model to focus on different parts of the input sequence when making predictions. This attention mechanism allows the transformers to capture contextual information from both preceding and following words in a sentence, leading to better understanding and representation of the input data [41].
Adequate transformer performance is due to the use of Attention, which allows the model to focus on the relationship to other words directly related to the text sequence in the input. Transformers are helpful in most NLP tasks, such as linguistic modeling and text classification. There are different structures for different types of problems. The basic coding layer is a standard building block for these architectures, with various specific āheadsā to apply depending on the problem being solved.
In the transformer, the Attention module repeats the computation several times in parallel. Each of these is referred to as an attention head. The Attention module splits its N query, key, and value parameters and passes each split independently through a separate header. These similar attention calculations are combined to produce a final attention score. This draws attention from multiple heads and allows the transformer to encode multiple conditions and nuances for each word [42].
Given the same set of queries, keys, and values, they were entering the practical application, opting for a model that combines knowledge of different behaviors of the exact attention mechanism to capture dependencies of various ranks within a sequence. The attention mechanism must jointly use different representation subspaces of queries, keys, and values; the latter are transformed with independently learned linear projections. In the end, the results of the attention grouping are concatenated and transformed with another learned linear forecast to produce the final result, where each of the outputs of the attention clustering is a head, resulting in the design known as multiheaded attention [42]. The model used in this article is shown in the Fig. 6. Equation 4 that describes it is represented taking into account that āQā is the vector that represents the current token and is used to calculate the following tokens, āKā is the key, and āvā is the value of the vector that contains relevant information. ādkā is an attention normalization constant.
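As an illustration of the multi-head mechanism, the sketch below implements the standard scaled dot-product attention of [41], \(\mathrm{softmax}(QK^{T}/\sqrt{d_{k}})V\), which we assume is the form of Eq. 4. The input shape (2, 512) and the 16 heads mirror values reported in this article; the random projection matrices `Wq`, `Wk`, `Wv`, and `Wo` are placeholders for learned weights, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                                    # key dimension
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)       # similarity of queries and keys
    weights = softmax(scores, axis=-1)                   # attention weights per query
    return weights @ V, weights

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    heads, weights = scaled_dot_product_attention(Q, K, V)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return concat @ Wo, weights                                  # final linear projection

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 2, 512, 16
X = rng.normal(size=(seq_len, d_model))          # stand-in for the CNN feature tensor
Wq, Wk, Wv, Wo = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                  for _ in range(4))
out, weights = multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads)
```

The output keeps the input shape, so the block can be stacked or followed by a classification head.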
Metrics
Tabares-Soto et al. explain the relevance of metrics in the evaluation of a model, highlighting the distinction between false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN) [43,44,45]. The most important metrics are the following:
Accuracy
Accuracy is a fraction between 0 and 1 that represents the proportion of correct predictions. It is obtained by dividing the number of correct predictions by the total number of predictions made [44,45,46,47,48] (see Eq. 5).
Precision
This metric identifies the proportion of predicted positive cases that are actually positive. It is calculated by dividing the number of true positives by the sum of true positives and false positives [44, 45, 48, 49] (see Eq. 6).
Recall
Also known as sensitivity, it shows the ability of the classifier to find the positive cases. It is calculated by dividing the number of true positives by the sum of true positives and false negatives [44, 45, 48] (see Eq. 7).
F1
F1 is a metric used to assess the model's ability to accurately identify positive and negative cases. This metric is sensitive to imbalance. It is calculated as the harmonic mean between precision and recall [44, 45, 48, 50] (see Eq. 8).
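The four metrics above can be computed directly from the TP, FP, FN, and TN counts. The following is a minimal sketch using the standard definitions; the function name and the counts in the usage example are illustrative and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts, used only to exercise the function.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)
```

The guards against empty denominators are a design choice for the sketch; libraries differ in how they report a metric when no positive predictions exist.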
Support
This metric indicates the number of samples of each class in the test set.
Confusion matrix
The confusion matrix cross-tabulates the actual and predicted classes. The rows represent the predicted classes, and the columns represent the real classes [44, 48].
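A small sketch may help fix the orientation used here, with predicted classes on the rows and real classes on the columns. Note that some libraries use the transposed layout, so the orientation should be checked before reading off values. The labels and predictions below are invented for the example, and the helper names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    # matrix[row][col]: row = predicted class, column = real class.
    index = {label: k for k, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for real, predicted in zip(y_true, y_pred):
        matrix[index[predicted]][index[real]] += 1
    return matrix

def support(y_true, labels):
    # Number of test samples that truly belong to each class.
    return {label: sum(1 for real in y_true if real == label) for label in labels}

y_true = [0, 0, 1, 1, 1, 2]   # illustrative real classes
y_pred = [0, 1, 1, 1, 2, 2]   # illustrative predictions
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2])
sup = support(y_true, labels=[0, 1, 2])
```

With this layout, the sum of each column equals the support of the corresponding class, and the diagonal holds the correct predictions.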
Cross-validation (CV)
Cross-validation is used to evaluate the performance of a model. It divides the data set into k subsets of similar size, known as "folds". The model is then trained and evaluated k times, each time holding out a different fold for testing, and the resulting scores are averaged [44, 48]. In this case, we used ten folds (see Eq. 9).
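The procedure can be sketched as follows with ten folds. This is a plain-Python illustration under simple assumptions: the shuffling scheme, the toy data, and the majority-class classifier passed as `fit` and `predict` are placeholders and do not correspond to the models evaluated in this article.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    # Shuffle the sample indices once and deal them into k folds of similar size.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=10):
    scores = []
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        predictions = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / len(test_idx))   # accuracy on the held-out fold
    return sum(scores) / k, scores                # average over the k folds

# Toy stand-ins: a majority-class "model" on an 80/20 two-class data set.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, X_test):
    return [model] * len(X_test)

X = list(range(100))
y = [1 if i % 5 == 0 else 0 for i in X]
mean_score, fold_scores = cross_validate(X, y, fit_majority, predict_majority, k=10)
```

The spread of `fold_scores` around `mean_score` is what a standard deviation over folds would summarize.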
Model configuration
Convolutional neural network-fully connected (CNNs-Fully)
This model combines the feature extraction base (Fig. 3) with a densely connected neural network (Fig. 4). The input tensor of the convolutional neural network has a shape of (None, 178, 1), and the input tensor to the Fully Connected network is (None, 2, 512). This changes when we apply PCA, as the input tensor for the convolutional network becomes (None, 40, 1), and the input to the Fully Connected network is (None, 1, 512).
Convolutional neural network-caps_net (CNNs-capsulenet)
Here the main feature extraction base (Fig. 3) is connected to the capsule modified for signal reading (Fig. 5). The input tensor of the convolutional neural network has a shape of (None, 178, 1), and the input tensor to the CapsuleNet is (None, 2, 512). This changes when we apply PCA, as the input tensor for the convolutional network becomes (None, 40, 1), and the input to the CapsuleNet is (None, 1, 512).
Convolutional neural network-transformer encoder (CNNs-Tf)
The feature extraction base (Fig. 3) is followed by the transformer encoder attention model (Fig. 6). The input tensor of the convolutional neural network has a shape of (None, 178, 1), and the input tensor to the transformer encoder is (None, 2, 512). This changes when we apply PCA, as the input tensor for the convolutional network becomes (None, 40, 1), and the input to the transformer encoder is (None, 1, 512).
Convolutional neural network-transformer encoder-fully connected (CNNs-Tf-fully)
The feature extraction base (Fig. 3) is followed by the transformer encoder attention model (Fig. 6) and the densely connected network model (Fig. 4). The input tensor for the convolutional neural network has a shape of (None, 178, 1), the input tensor for the transformer is (None, 2, 512), and the input tensor for the Fully Connected network is (None, 2, 512). This changes when applying PCA, as the input tensor for the convolutional network becomes (None, 40, 1), the input to the Transformer Encoder is (None, 1, 512), and the input to the Fully Connected network is (None, 1, 512).
Convolutional neural network-transformer encoder and capsulenet (CNNs-Tf-capsulenet)
The feature extraction layer (Fig. 3) is followed by the transformer encoder attention model (Fig. 6) and, finally, the modified capsule (Fig. 5). The input tensor for the convolutional neural network has a shape of (None, 178, 1), the input tensor for the transformer is (None, 2, 512), and the input tensor for the CapsuleNet is (None, 2, 512). This changes when applying PCA, as the input tensor for the convolutional network becomes (None, 40, 1), the input to the Transformer Encoder is (None, 1, 512), and the input to the CapsuleNet is (None, 1, 512).
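The change in the convolutional input shape under PCA can be illustrated with a short NumPy sketch. It only reproduces the input shapes quoted above, (None, 178, 1) and (None, 40, 1); the SVD-based PCA, the batch of 200 random samples, and the helper name `pca_reduce` are assumptions for the example, and the convolutional layers that produce the (None, 2, 512) and (None, 1, 512) tensors are not modeled.

```python
import numpy as np

def pca_reduce(X, n_components):
    # PCA via SVD: center the data and project onto the leading components.
    X_centered = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 178))               # random stand-in for 178-point EEG segments

cnn_input = X[..., np.newaxis]                      # (200, 178, 1): no PCA
cnn_input_pca = pca_reduce(X, 40)[..., np.newaxis]  # (200, 40, 1): with 40 components
```

The trailing axis of size one is the single channel expected by a one-dimensional convolutional layer.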
Hardware and resources
The experiments used Google Colab, where specific computations were performed on the NVIDIA GP100GL [T4 PCIe 15GB] platform, equipped with 250W power, CUDA Version 10.1, and 12 GB of RAM.
Results
Iteration hyperparameters of the transformer model
To achieve the best possible results, the hyperparameters of the transformer encoder, such as the number of attention heads and layers, were iterated to find the best values for each variation of the models. Table 4 shows the results of each iteration; the best model was the transformer encoder without any additional module, with 16 attention heads. This search was feasible because our database does not entail a significant computational cost.
Binary classification
For binary classification, that is, whether the patient had epilepsy or not, Table 5 shows the results obtained without applying a balancing method. Table 6 shows the results for the data balanced with SMOTE, and Table 7 shows the results for the data balanced with ADASYN. In these three tables, the first column is the model used, followed by its accuracy, its cross-validation score, and finally the model's sensitivity for each class.
With the unbalanced data, the results are very high in most cases. The worst model was a machine learning model, specifically SGD, with an accuracy of 0.83 and an equal cross-validation score; it correctly classifies patients with epilepsy but fails on the other classes. The classification model with the highest accuracy was the transformer encoder, with an accuracy of 0.9974, a cross-validation score of 0.998, and an accuracy of 0.998 in both classes. The best machine learning model was SVM, with an accuracy of 0.9830 and a precision that, although close, was higher for the patients with the pathology.
On the other hand, with the balanced database, as with the unbalanced one, the worst model is the machine learning SGD; in this case, it reaches an accuracy of 0.669 and a cross-validation score of 0.54, showing that the model is not efficient for this classification problem. The best model was the CapsuleNet model, with an accuracy of 0.992. The best machine learning model was KNN, with a remarkable cross-validation score of 1.
Multiclass
Table 8 shows the results obtained from the models evaluated on the five classes of the database. All the models were evaluated in four different ways: the first is "Any", which means the database without any processing before entering the network; the second, "Scaling", refers to the use of Standard Scaler; the third uses PCA; and the fourth combines Standard Scaler and PCA. The table also lists the hyperparameters used and their results.
Fig. 7 compares the models. The best machine learning model, the extra trees classifier, presents the lowest result, although its accuracy in the class of patients with an epileptic seizure stands out. The best model is the CapsuleNet, as in the binary case, with the highest accuracy in all classes and a more balanced confusion matrix.
Time of compilation
As the results of the best models are so close, it is essential to analyze other variables, such as compilation time and computational resource expenditure. Figure 8 shows a bar chart comparing the compilation times of the best models for 500 epochs: the CapsuleNet model takes the least time, the fully connected and Transformer Encoder+CapsuleNet models take the same time, and the transformer encoder model is the slowest. The compilation time is calculated during training by multiplying the duration of each epoch in seconds by the total number of epochs and dividing by 60 to convert from seconds to minutes.
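The time calculation described above amounts to a one-line formula. In the sketch below, the function name and the 3-second epoch duration are hypothetical; only the 500 epochs and the division by 60 come from the text.

```python
def compilation_time_minutes(seconds_per_epoch, num_epochs):
    # Duration of one epoch (seconds) x number of epochs, converted to minutes.
    return seconds_per_epoch * num_epochs / 60

# Hypothetical epoch duration of 3 s over the 500 epochs used in this study.
minutes = compilation_time_minutes(seconds_per_epoch=3.0, num_epochs=500)
```

This assumes every epoch takes roughly the same time; if epoch durations vary, summing the measured durations would be more faithful.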
Comparison with the state of the art
The results are compared with the state of the art using the same database and division, as shown in Table 9.
Gradient-weighted class activation mapping (Grad-CAM)
Grad-CAM interprets convolutional neural network models by visually presenting the input regions the model deems most crucial for making predictions. It relies on calculating the gradient of the predicted class score with respect to the feature maps of the final convolutional layer. These gradients are then globally averaged to derive weights that multiply their respective feature maps, resulting in a map highlighting the importance of each region of the input [51]. The Grad-CAM for this problem can be observed in Fig. 9, which pertains to patients with epileptic seizures, and in Fig. 10, which illustrates the remaining patients.
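For a one-dimensional signal, the Grad-CAM computation of [51] reduces to a weighted sum over channels followed by a ReLU. The sketch below assumes the feature maps and the gradients of the class score have already been extracted from the network by the deep learning framework; here they are replaced by random arrays of an arbitrary shape, and the final rescaling to [0, 1] is an optional step added for display.

```python
import numpy as np

def grad_cam_1d(feature_maps, gradients):
    # feature_maps, gradients: arrays of shape (time_steps, channels), taken from
    # the last convolutional layer for one input signal.
    weights = gradients.mean(axis=0)               # global average of gradients per channel
    cam = np.maximum(feature_maps @ weights, 0.0)  # weighted sum of maps, then ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                      # optional rescaling for plotting
    return cam

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(44, 64))   # placeholder activations
gradients = rng.normal(size=(44, 64))      # placeholder d(score)/d(activations)
cam = grad_cam_1d(feature_maps, gradients)
```

To overlay the map on the EEG segment, `cam` would still need to be interpolated to the length of the input signal.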
Feature importance
Understanding the significance of features is a fundamental technique in interpreting ML models. It enhances our comprehension of the model's functioning and assists in recognizing biases and crucial features. This approach is essential as artificial intelligence models have grown progressively complex and challenging to interpret, mainly owing to scientific advancements [52] (see Fig. 11).
Discussion
Epilepsy is a severe disease that, due to lack of knowledge, has been treated as taboo and considered less important than it is, worsening the patient's quality of life and even causing death. However, accurate and quick treatment can help the patient lead a relatively normal life, so it is essential to use new technologies to achieve a more efficient process.
It is common to use an electroencephalogram for diagnosing the pathology since its cost can be low compared to medical imaging methods. This opens the possibility of using artificial intelligence methods with electroencephalogram databases. However, there are very few public databases, so it is necessary to explore all options. In this case, different ways of processing the database were used to find the best result by evaluating them with both traditional and state-of-the-art models.
For traditional machine learning models, their efficiency is directly related to the tuning of their hyperparameters. Here is where the pipeline and grid search algorithms that we observe in Table 2 are essential to efficiently search for the best combination for the evaluation of each of the database partitions observed in Figs. 1 and 2.
As shown in Fig. 1, the two-class data set, i.e., patients with epilepsy against the rest of the categories, is unbalanced. To correct this problem, we generated synthetic data using algorithms such as SMOTE and ADASYN. Regarding the tuning of hyperparameters in the ML models, Table 3 shows that the hyperparameters do not vary significantly between the two types of data balancing, so either one can be used.
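The core idea behind SMOTE-style balancing is to create a synthetic minority sample by interpolating between a real minority sample and one of its nearest minority neighbors. The sketch below illustrates only that interpolation step; it is not the library implementation used in the experiments, and the function name, the 20 random minority samples, and the value k = 5 are assumptions chosen for the example.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    # X_min: minority-class samples, shape (n_samples, n_features), n_samples > k.
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                   # a sample is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]      # k nearest minority neighbors
    base = rng.integers(0, len(X_min), size=n_new)    # sample to start from
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                      # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 178))    # random stand-in for minority EEG segments
X_synthetic = smote_like_oversample(X_minority, n_new=60)
```

Because every synthetic point lies on a segment between two real samples, the new data stay inside the region already covered by the minority class. ADASYN follows the same interpolation idea but generates more samples around minority points that are harder to learn.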
As for the state-of-the-art algorithms, described in the "Models" and "Model configuration" sections, they all use CNN feature extraction as their starting point. The best layers and activation methods were obtained through experiments. For the convolutional layers, Fig. 3 shows that the most efficient activation method for this database is selu. Modifying the capsule, initially designed for reading images, is an essential contribution to the state of the art. It should be emphasized that, according to the author of the original capsule, max pooling can generate problems and worsen the accuracy. In the case of these signals, however, global max pooling yields an improvement, possibly because we are working with signals and not with images.
Continuing with the state of the art, as briefly mentioned in the "Introduction" section, the combination of models has been the latest trend for classification problems; however, the state of the art does not report combinations associated with this pathology or with this specific database. Experimenting with a variety of models to exploit the best features of each one is therefore an important contribution of this paper.
Focusing now on the results obtained in this article, Table 4 shows the iteration over the hyperparameters of the transformer encoder model with its different variations. It is essential to highlight that, by using flat data, the expenditure of computational resources is not very high, which makes it possible to increase the number of attention heads and to experiment without problems caused by the lack of powerful graphics cards.
In binary classification, based on whether the patient has epilepsy or not, we can observe that, although state-of-the-art models show higher efficiency, there is no significant difference among the models in terms of their standard deviation or cross-validation on unbalanced data, as seen in Table 5. This difference is much smaller when the data are balanced, as shown in Tables 6 and 7. However, the metrics improve when we have balanced classes. This indicates that synthesized data effectively increase the system's signal recognition capacity. Of the two balancing methods, SMOTE and ADASYN, SMOTE proves to be more efficient, achieving an accuracy of 99.92% with the capsule net model and 99.59% with the KNN model. However, since the data are synthetic, validating again with data from another database is necessary.
Turning to the five classes of the database, shown in Fig. 2, and as mentioned at the beginning of the discussion, database processing plays a fundamental role in finding the best model. As seen in Table 8, processing the database is necessary since the unprocessed database yields poor results, ranging from 20% to 69% in ML models and from 76% to 88% in DL models. Analyzing the variability of the PCA components does not show an improvement in the results obtained. This may be because each point in the EEG contributes information and variability to the model. With PCA, the ML models obtain results ranging from 19% to 75%. Although SVM presents the worst results, the DL models with PCA outperform the ML models and demonstrate superior metrics.
Standard scaling is the processing that best suits the database in most cases, with both ML and DL models showing a significant improvement in their metrics, except for some cases like ETC, where the best option is to apply PCA and standard scaling. However, machine learning models are not efficient enough for classifying the five classes in this problem; the best results are observed in decision-tree-based models like ETC, with 73.48% accuracy. They lag behind DL models such as CNN+TF, which shows 88.34% accuracy, or the CNN+CapsuleNet model, which shows 87.13% accuracy with lower standard deviation and compilation time. In this case, individual models perform better than their combination, as in the case of CNN+TF+CapsuleNet, which proves to be inferior to each model evaluated individually, achieving only 85.09% accuracy.
Fig. 7, which compares the best ML and state-of-the-art models, shows that the state-of-the-art models are much more efficient. However, one pattern holds across all models: EEGs related to brain tumors, classes b and c, are difficult to classify.
In addition, we can use Grad-CAM on the convolutional layers for better interpretability, as shown in Figs. 9 and 10. These visually depict the waveform and its behavior in the final convolutional layers. Additionally, the feature importance graph in Fig. 11 shows that most of the features or points in the EEG are crucial for classification. This may be a reason why PCA does not yield good results.
Finally, regarding the best model, the difference between the two best models, CNNs+CapsuleNet and CNNs+Transformer Encoder, is only one percentage point. Still, the CapsuleNet has a lower standard deviation, achieving a more stable result. In addition, Fig. 8 shows that the compilation time, and consequently the computational resource expenditure, of CNNs+CapsuleNet is three times lower than that of the CNNs+Transformer Encoder model. We therefore conclude that the best classification model is the proposed CNNs+CapsuleNet model modified for signals, which achieves an accuracy of 87.30% and a standard deviation of ± 1%. Both models surpass the state of the art, and their compilation time is significantly shorter, even with more parameters. This is due to the efficiency of the models and their parallel processing, unlike LSTMs, which operate sequentially, as shown in Table 9.
The creation of diagnostic tools with the help of artificial intelligence is an innovative field in medical technology. Tools such as classification models applied to medical services can make treatments more accurate and faster, as doctors would have an additional tool to confirm or reject a diagnosis. This can reduce time-consuming and costly processes in developing countries, with a direct impact on the affected users. Artificial intelligence and diagnostic tools have the potential to save lives.
Conclusion
The use of artificial intelligence models for diagnosing pathologies has experienced significant growth in the last decade, making it essential to find the model that best suits each studied disease. In the case of epilepsy using electroencephalographic signals, the difference between models is minimal, so it is crucial to analyze other aspects, such as computational resource expenditure and compilation time. Regarding machine learning models, compilation times are minimal due to their high optimization. However, their metrics are significantly lower when classifying multiple types of electroencephalograms, possibly due to the reduced amount of data. Among the ML models, ETC and RFC generally show the best results for this problem. Concerning state-of-the-art models, it can be concluded from this analysis that the best model for classifying electroencephalograms is the CNNs+CapsuleNet model. It achieves an accuracy only one percentage point below the transformer model but with a lower standard deviation and, most importantly, half the compilation time.
Suggestions for future research
For future work, evaluating the proposed models with new data and different signal acquisition methods is advisable to verify their suitability for deployment in a hospital setting. This is why a repository with the codes is included so that experiments can be replicated and eventually improved.
Availability of data and materials
The datasets and code used during the current study are available at: https://github.com/BioAITeam/AComparativeStudyofCNNCapsuleNetCNNTransformerEncoderandtraditionalMachineLearningAl.
Abbreviations
EEG: Electroencephalogram
CNNs: Convolutional neural networks
CapsuleNet: Capsule neural networks
ML: Machine learning
ViT: Vision transformer
LSTM: Long short-term memory network
DL: Deep learning
ETC: Extra Trees Classifier
RFC: Random Forest Classifier
SVM: Support Vector Machine
GB: Gradient Boosting
DTC: Decision Tree Classifier
KNN: K-Neighbors Classifier
SGD: Stochastic Gradient Descent
Fully: Fully Connected
Tf: Transformer Encoder
References
Thijs RD, Surges R, O'Brien TJ, Sander JW. Epilepsy in adults. Lancet. 2019;393(10172):689–701.
Perucca P, Bahlo M, Berkovic SF. The genetics of epilepsy. Annu Rev Genomics Hum Genet. 2020;21:205–30.
Mayo Foundation for Medical Education and Research. Epilepsy: symptoms and causes. Mayo Clin Proc. 2023. https://www.mayoclinic.org/eses/diseasesconditions/epilepsy/symptomscauses/syc20350093. Accessed 14 Oct 2023.
Soufineyestani M, Dowling D, Khan A. Electroencephalography (EEG) technology applications and available devices. Appl Sci. 2020;10(21):7453.
Shoeibi A, Khodatars M, Ghassemi N, Jafari M, Moridian P, Alizadehsani R, et al. Epileptic seizures detection using deep learning techniques: a review. Int J Environ Res Public Health. 2021;18(11):5780.
Zhou ZH. Machine learning. Springer Nature; 2021.
Lowery BR, Langou J. A Greedy Algorithm for Optimally Pipelining a Reduction. 2013. arXiv preprint arXiv:1310.4645.
Toraman S. Automatic recognition of preictal and interictal EEG signals using 1D-capsule networks. Comput Electr Eng. 2021;91:107033.
Hu S, Liu J, Yang R, Wang Y, Wang A, Li K, et al. Exploring the Applicability of Transfer Learning and Feature Engineering in Epilepsy Prediction Using Hybrid Transformer Model. IEEE Trans Neural Syst Rehabil Eng. 2023;31:1321–32.
Akinyelu AA, Zaccagna F, Grist JT, Castelli M, Rundo L. Brain Tumor Diagnosis Using Machine Learning, Convolutional Neural Networks, Capsule Neural Networks and Vision Transformers, Applied to MRI: A Survey. J Imaging. 2022;8(8). https://doi.org/10.3390/jimaging8080205. https://www.mdpi.com/2313433X/8/8/205.
Wei Y, Liu Y, Li C, Cheng J, Song R, Chen X. TC-Net: A Transformer Capsule Network for EEG-based emotion recognition. Comput Biol Med. 2023;152:106463.
Xu G, Ren T, Chen Y, Che W. A one-dimensional CNN-LSTM model for epileptic seizure recognition using EEG signal analysis. Front Neurosci. 2020;14:578126.
Ma M, Cheng Y, Wei X, Chen Z, Zhou Y. Research on epileptic EEG recognition based on improved residual networks of 1D CNN and indRNN. BMC Med Inf Decis Mak. 2021;21:1–13.
San-Segundo R, Gil-Martín M, D'Haro-Enríquez LF, Pardo JM. Classification of epileptic EEG recordings using signal transforms and convolutional neural networks. Comput Biol Med. 2019;109:148–58.
Andrzejak RG, Schindler K, Rummel C. Nonrandomness, nonlinear dependence, and nonstationarity of electroencephalographic recordings from epilepsy patients. Phys Rev E. 2012;86(4):046206.
Andrzejak RG, Lehnertz K, Mormann F, Rieke C, David P, Elger CE. Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: Dependence on recording region and brain state. Phys Rev E. 2001;64(6):061907.
Ahmadi A, Shalchyan V, Daliri MR. A new method for epileptic seizure classification in EEG using adapted wavelet packets. In: 2017 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT). IEEE; 2017. p. 1–4.
Wang L, Xue W, Li Y, Luo M, Huang J, Cui W, et al. Automatic epileptic seizure detection in EEG signals using multi-domain feature extraction and nonlinear analysis. Entropy. 2017;19(6):222.
Shen M, Wen P, Song B, Li Y. Real-time epilepsy seizure detection based on EEG using tunable-Q wavelet transform and convolutional neural network. Biomed Sig Process Control. 2023;82:104566. https://doi.org/10.1016/j.bspc.2022.104566. https://www.sciencedirect.com/science/article/pii/S1746809422010205
Chen W, Wang Y, Ren Y, Jiang H, Du G, Zhang J, et al. An automated detection of epileptic seizures EEG using CNN classifier based on feature fusion with high accuracy. BMC Med Inform Decis Mak. 2023;23(1). https://doi.org/10.1186/s1291102302180w.
Liu J, Cai W, Shao X. Cancer classification based on microarray gene expression data using a principal component accumulation method. Sci China Chem. 2011;54:802–11.
Liashchynskyi P, Liashchynskyi P. Grid search, random search, genetic algorithm: a big comparison for NAS. 2019. arXiv preprint arXiv:1912.06059.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37. ICML'15. JMLR.org; 2015. p. 448–456. https://doi.org/10.1016/j.molstruc.2016.12.061. http://arxiv.org/abs/1502.03167, http://dl.acm.org/citation.cfm?id=3045118.3045167.
Klambauer G, Unterthiner T, Mayr A, Hochreiter S. Self-Normalizing Neural Networks. 2017.
Rasamoelina AD, Adjailia F, Sinčák P. A review of activation function for artificial neural network. In: 2020 IEEE 18th World Symposium on Applied Machine Intelligence and Informatics (SAMI). IEEE; 2020. p. 281–6.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15:1929–58.
Camacho L, Douzas G, Bacao F. Geometric SMOTE for regression. Expert Syst Appl. 2022:116387.
He H, Bai Y, Garcia EA, Li S. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE; 2008. p. 1322–8.
Géron A. Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media, Inc.; 2022. p. 153–203.
Garavand A, Salehnasab C, Behmanesh A, Aslani N, Zadeh AH, Ghaderzadeh M, et al. Efficient model for coronary artery disease diagnosis: a comparative study of several machine learning algorithms. J Healthc Eng. 2022;2022.
Sadoughi F, Ghaderzadeh M. A hybrid particle swarm and neural network approach for detection of prostate cancer from benign hyperplasia of prostate. In: e-Health–For Continuity of Care. IOS Press; 2014. p. 481–485.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
Friedman JH. Stochastic gradient boosting. Comput Stat Data Anal. 2002;38(4):367–78.
Safavian SR, Landgrebe D. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cybern. 1991;21(3):660–74.
Pandya VJ. Comparing handwritten character recognition by AdaBoostClassifier and KNeighborsClassifier. In: 2016 8th International Conference on Computational Intelligence and Communication Networks, (CICN). Tehri: IEEE; 2016. p. 271–4.
Bottou L. Stochastic gradient descent tricks. Neural Networks: Tricks of the Trade. 2nd ed. 2012. p. 421–436.
Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, et al. Recent advances in convolutional neural networks. Pattern Recognit. 2018;77:354–77.
Sabour S, Frosst N, Hinton GE. Dynamic Routing Between Capsules. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, et al., editors. Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc.; 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/2cad8fa47bbef282badbb8de5374b894Paper.pdf.
Dombetzki LA. An overview over capsule networks. Netw Archit Serv. 2018; 2–4.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. 2023.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
Tabares-Soto R, Arteaga-Arteaga HB, Mora-Rubio A, Bravo-Ortíz MA, Arias-Garzón D, Alzate-Grisales JA, et al. Sensitivity of deep learning applied to spatial image steganalysis. PeerJ Comput Sci. 2021;7:616.
Hosseini A, Eshraghi MA, Taami T, Sadeghsalehi H, Hoseinzadeh Z, Ghaderzadeh M, et al. A mobile application based on efficient lightweight CNN model for classification of B-ALL cancer from non-cancerous cells: a design and implementation study. Inform Med Unlocked. 2023;39:101244.
Arteaga-Arteaga HB, Mora-Rubio A, Florez F, Murcia-Orjuela N, Diaz-Ortega CE, Orozco-Arias S, et al. Machine learning applications to predict two-phase flow patterns. PeerJ Comput Sci. 2021;7:798. https://doi.org/10.7717/peerjcs.798.
Fernando GP, Brayan AAH, Florina AM, Liliana CB, Héctor-Gabriel AM, Reinel TS. Enhancing Intrusion Detection in IoT Communications Through ML Model Generalization With a New Dataset (IDSAI). IEEE Access. 2023;11:70542–59.
Ghaderzadeh M, Aria M, Hosseini A, Asadi F, Bashash D, Abolghasemi H. A fast and efficient CNN model for B-ALL diagnosis and its subtypes classification using peripheral blood smear images. Int J Intell Syst. 2022;37(8):5113–33.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Hossin M, Sulaiman MN. A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process. 2015;5(2):1.
Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061. 2020.
Selvaraju RR, Das A, Vedantam R, Cogswell M, Parikh D, Batra D. Grad-CAM: Why did you say that? arXiv preprint arXiv:1611.07450. 2016.
Adler AI, Painsky A. Feature importance in gradient boosting trees with cross-validation feature selection. Entropy. 2022;24(5):687.
Acknowledgements
Mario Alejandro Bravo-Ortiz and Harold Brayan Arteaga-Arteaga are supported by a Ph.D. grant "Convocatoria 22 OCAD de Ciencia, Tecnología e Innovación del Sistema General de Regalías y al Ministerio de Ciencia, Tecnología e Innovación". We would like to thank Universidad Autónoma de Manizales for making this paper as part of the "Clasificación de los estadios del Alzheimer utilizando Imágenes de Resonancia Magnética Nuclear y datos clínicos a partir de técnicas de Deep Learning" with code 873139 and "Aplicación de Vision Transformer para clasificar estadios del Alzheimer utilizando imágenes de resonancia magnética nuclear y datos clínicos" project with code 8472023 TD also to the projects "CHT1246: Oportunidades de Mercado para las Empresas de Tecnología Compras Públicas de Algoritmos Responsables, Éticos y Transparentes", ANID PIA/BASAL FB0002 and ANID/PIA/ANILLO ACT210096.
Funding
This work was funded by Universidad Autónoma de Manizales as part of the project “Clasificación de los estadios del Alzheimer utilizando Imágenes de Resonancia Magnética Nuclear y datos clínicos a partir de técnicas de Deep Learning” with code 873139, and also by the projects “CHT1246: Oportunidades de Mercado para las Empresas de Tecnología - Compras Públicas de Algoritmos Responsables, Éticos y Transparentes”, ANID PIA/BASAL FB0002, and ANID/PIA/ANILLOS ACT210096.
Author information
Contributions
SAHG, EGN, AEDC, MAPC, HBAA, GAR, RTS, and MABO designed the study. MABO provided the data. SAHG, EGN, AEDC, MAPC, HBAA, and MABO preprocessed the data. SAHG, EGN, AEDC, MAPC, HBAA, and MABO developed the tools, performed the analyses, and produced the results. SAHG, EGN, AEDC, MAPC, HBAA, GAR, RTS, and MABO analysed the results and wrote the manuscript. GAR and RTS acquired the funding and provided the resources. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Holguin-Garcia, S.A., Guevara-Navarro, E., Daza-Chica, A.E. et al. A comparative study of CNN-capsule-net, CNN-transformer encoder, and Traditional machine learning algorithms to classify epileptic seizure. BMC Med Inform Decis Mak 24, 60 (2024). https://doi.org/10.1186/s12911-024-02460-z
DOI: https://doi.org/10.1186/s12911-024-02460-z