
Volume 20 Supplement 12

Slow Onset Detection in Epilepsy

Learning to detect the onset of slow activity after a generalized tonic–clonic seizure



Sudden death in epilepsy (SUDEP) is rare in the US; however, it accounts for 8–17% of deaths in people with epilepsy. The condition involves complicated physiological patterns, and it is still not clear which physiological or biological markers can be used as indicators to predict SUDEP so that care providers can intervene and treat patients in a timely manner. To this end, the UTHealth School of Biomedical Informatics (SBMI) organized a machine learning Hackathon to call for advanced solutions.


In recent years, deep learning has become state of the art for many domains with large amounts of data. Although healthcare has accumulated a lot of data, the data are often not abundant enough for subpopulation studies where deep learning could be beneficial. Taking these limitations into account, we present a framework to apply deep learning to the detection of the onset of slow activity after a generalized tonic–clonic seizure, as well as other EEG signal detection problems exhibiting data paucity.


We conducted ten training runs for our full method and seven model variants, statistically demonstrating the impact of each technique used in our framework with a high degree of confidence.


Our findings point toward deep learning being a viable method for detection of the onset of slow activity, provided appropriate regularization is performed.


Recent advancements in deep learning have significantly improved performance for classification and detection tasks [1, 2]. However, generalization ability is still limited due to the lack of sufficient high-quality training data for many domains. This holds true for many problems in the biomedical domain where data is often limited (especially for sub-population studies), which constrains the capacity of highly powerful supervised deep learning frameworks [3]. Since deep learning is known for requiring a considerable amount of data [4], applying it to a problem such as detection of markers (onset of slow activity) to predict critical patterns in a rare condition like SUDEP is not straightforward.


Our method attempts to build a framework to apply recent advancements in deep learning [2, 5,6,7] to detection problems such as detection of the onset of slow activity after a generalized tonic–clonic seizure, where availability of training data is limited. We combine a variety of preprocessing (Resampling), regularization (Anti-aliased temporal downsampling [6], Global temporal downsampling [8], Global batch-wise z-scoring, Kernel regularization [9]), and optimization (Batch size [10], Loss discount factor) techniques to work around the data paucity issue. We also develop a system for real-time visualization of our model's predictions to emphasize which parts of the signal contributed most to the decision.


From a high level, we feed an EEG sequence x into our binary classification model \( y = f(x) \), which estimates the probability \( y \approx P(y|x) \) that the sequence contains the onset of slow activity (i.e., label). The chosen model architecture is a residual neural network [11] utilizing stacked convolution layers [12], skip connections [11], batch normalization [7], downsampling [6], and non-linear activation functions. We train our model using mini-batch stochastic gradient descent (SGD). We implemented our model using Python 3.7 and tensorflow.keras [13]. Our full source code is available on Github:


The original source of the training data D contains variable length sequences composed of recordings from ten pairwise offsets of two adjacent EEG electrodes [14]:

$$\begin{aligned} F = \{fp_1-f_7, f_7-t_7, t_7-p_7, p_7-o_1, fp_2-f_8, f_8-t_8, t_8-p_8, p_8-o_2, fz-cz, cz-pz\} \end{aligned}$$

The sequences were recorded from 134 different patients, each with their own variable length sequence [14]. It follows that \( |D| = 134 \). The EEG sampling rate \(F_s\) is 200 Hz, and each timestep \( t_n \) is labeled \( y \in \{0, 1\} \) for the presence of slow activity [14]. We create a training set T derived from this set in Sequence generation. The validation dataset V contains \( |V| = 12,345 \) ten-second sequences sampled from 34 patients with the same EEG channels and sampling rate [14]. Each sequence is labeled \( y \in \{0, 1\} \). The validation set V has a class imbalance for label y, with \( |V_{pos}| = 3,219 \) and \( |V_{neg}| = 9,126 \).

Inputs / output format

Inputs Detection of the onset of slow activity must happen within a short time span in order to be clinically useful. A sequence length of 10 seconds was chosen based on this requirement. It follows that the input sequence to the model contains \({\mathbf {len}}~ {\textit{seq}}_{input} = 10r = 2000 \) timesteps, where \( r = 200 \) is the number of timesteps per second at the sampling rate \( F_s \). Each training example contains ten sequences of pairwise offsets, one per channel in F. Considering both the sequence length and number of channels, the input to our model has the shape \( ({\mathbf {len}}~ {\textit{seq}}_{input}, |F|) = (2000, 10) \).

Outputs Our model estimates P(y|x), which is a scalar value ranging between 0 and 1. Hence, the output of our model has the shape (1,).


Sequence generation In order to make maximum use of the original training data, we first create a set \( S_{pos} \) of as many positive sequences with length \( {\mathbf {len}}~ {\textit{seq}}_{input} = 2000 \) as possible for an individual patient, starting with the sequence whose final timestep is the onset (\( t_f = t_{onset} \)) and stopping after the sequence whose initial timestep is the onset (\( t_i = t_{onset} \)). For memory efficiency, a stride of 5 was used during the creation of each sequence in \( S_{pos} \). We then create a disjoint set \( S_{neg} \) by randomly sampling at most \( |S_{pos}| \) negative sequences with replacement from a uniform distribution over every possible negative training example (sequences with \( t_f < t_{onset} \)) from the same patient. This process is repeated for each patient, and the final training set T contains the union of each \( S_{pos} \) and \( S_{neg} \) set.
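The window enumeration described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a single onset per recording, treats \( t_i \) and \( t_f \) as the first and last timestep of a window, and clips windows to the bounds of the recording. The function name `window_starts` and its arguments are hypothetical.

```python
import random

SEQ_LEN = 2000  # 10 s at 200 Hz
STRIDE = 5      # stride used when enumerating positive windows


def window_starts(n_timesteps, t_onset, seed=0):
    """Return (positive_starts, negative_starts) for one patient's recording.

    A window starting at s covers timesteps s .. s + SEQ_LEN - 1.
    """
    # Positive windows: from t_f = t_onset up to t_i = t_onset, clipped to the recording.
    first = max(0, t_onset - SEQ_LEN + 1)
    last = min(t_onset, n_timesteps - SEQ_LEN)
    positives = list(range(first, last + 1, STRIDE))

    # Negative windows: last timestep strictly before the onset (t_f < t_onset).
    candidates = range(0, max(0, t_onset - SEQ_LEN + 1))
    rng = random.Random(seed)
    negatives = []
    if len(candidates) > 0:
        # Uniform sampling with replacement, one draw per positive window.
        negatives = [rng.choice(candidates) for _ in positives]
    return positives, negatives
```

With a stride of 5, a recording that fully contains the 2000 candidate positive windows yields 400 of them, and the same number of negative draws.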

Resampling Before training, 50% of generated sequences were randomly cropped relative to the first timestep, resulting in a new sequence \( {\textit{seq}}_{input}' \) with the relationship \( {\mathbf {len}}~ {\textit{seq}}_{input}' = u {\mathbf {len}}~ {\textit{seq}}_{input} \), where \( u \in [0.9, 1.1] \) is sampled from a uniform distribution. \( {\textit{seq}}_{input}' \) was then resampled to the original length \( {\mathbf {len}}~ {\textit{seq}}_{input} = 2000\). While this is a commonly used image augmentation technique for object detection [15, 16], it should also be beneficial here since we are interested in augmenting the temporal relationship between frequency and phase rather than the frequency and phase itself.
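A rough sketch of this augmentation is shown below. The interpolation method is not specified in the text, so linear interpolation via `numpy.interp` is an assumption, as is the idea of drawing the rescaled window from a slightly longer buffer so that \( u > 1 \) is possible. The names `resample_to_length` and `random_rescale` are hypothetical.

```python
import numpy as np


def resample_to_length(seq, target_len=2000):
    """Linearly interpolate seq of shape (n, channels) to (target_len, channels)."""
    n = seq.shape[0]
    src = np.linspace(0.0, 1.0, n)
    dst = np.linspace(0.0, 1.0, target_len)
    columns = [np.interp(dst, src, seq[:, c]) for c in range(seq.shape[1])]
    return np.stack(columns, axis=1)


def random_rescale(buffer, rng, target_len=2000, p=0.5):
    """With probability p, take u * target_len timesteps and resample to target_len.

    buffer must start at the window's first timestep and hold at least
    1.1 * target_len timesteps (an assumption of this sketch).
    """
    if rng.random() >= p:
        return buffer[:target_len]
    u = rng.uniform(0.9, 1.1)
    n = min(int(round(u * target_len)), buffer.shape[0])
    return resample_to_length(buffer[:n], target_len)
```

A call such as `random_rescale(buffer, np.random.default_rng(0))` always returns an array of shape (2000, 10) for a ten-channel buffer, stretched or compressed in time by up to 10%.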

Network architecture

Other researchers have demonstrated success with residual neural network variants for detecting complicated patterns in signals [2]. Thus, we use a similar variation of ResNet as a starting point with pre-activation style blocks [5] as shown in Fig. 4. Through trial and error, we settled on a \( D = 32 \) dimensional kernel for the first few convolution layers, before increasing to 2D and ending with 4D. Increasing D by factors of 2 resulted in overfitting, while reducing D by factors of 2 resulted in underfitting. With \( D = 32 \), our model has \(p = \) 165,664 trainable parameters.

Anti-aliased temporal downsampling We explored several different methods of temporal downsampling in our network architecture, as well as investigating recent advancements in reducing aliasing [6]. After deciding on other hyper parameters, we trained our network with an anti-aliased version of strided downsampling. We use a three point Gaussian low pass kernel with \( \sigma \approx 0.79577 \) during downsampling. We use the same \( \sigma \) for each of the three downsampling operations to encourage the network to learn a feature representation increasingly focused on lower frequencies. Each downsampling operation divides the temporal axis of the sequence by two.
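To make the blur-then-subsample step concrete, here is a minimal NumPy sketch of one downsampling operation. It assumes the three-point kernel is a Gaussian sampled at offsets -1, 0, 1 and normalized to sum to one, and it uses edge padding. Neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np


def gaussian_kernel3(sigma=0.79577):
    """Three-point Gaussian low-pass kernel, normalized to sum to 1."""
    x = np.array([-1.0, 0.0, 1.0])
    w = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return w / w.sum()


def blur_downsample(seq, sigma=0.79577):
    """Low-pass filter seq of shape (timesteps, channels), then keep every second step."""
    k = gaussian_kernel3(sigma)
    padded = np.pad(seq, ((1, 1), (0, 0)), mode="edge")  # assumption: edge padding
    blurred = k[0] * padded[:-2] + k[1] * padded[1:-1] + k[2] * padded[2:]
    return blurred[::2]
```

Applied three times, this reduces a 2000-step sequence to 1000, 500, and then 250 steps, with each stage attenuating the highest remaining frequencies before subsampling.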

Global temporal downsampling Recent papers in deep learning have increasingly relied on global pooling layers to reduce the number of trainable parameters and improve generalization for a variety of problems [8, 11, 17]. We considered several different global downsampling strategies including global max pooling (GMP), global average pooling (GAP) [8], and flattening. GAP was excluded because it may not be able to effectively handle sequences where only a small percentage contains the onset. Flattening significantly increases the number of trainable parameters, and may bias towards certain parts of the sequence in the training set. GMP provides the largest activation value from each channel regardless of where it occurred. With these considerations in mind, GMP was selected for global temporal downsampling on the top of the network.
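The difference between GAP and GMP for a brief event can be illustrated with a toy feature map. The sizes and values below are illustrative only and are not taken from the trained network.

```python
import numpy as np

# Toy feature map: (timesteps, channels). Sizes are illustrative assumptions.
features = np.zeros((250, 4))
features[120, 2] = 5.0  # one brief, strong activation in channel 2

gap = features.mean(axis=0)  # global average pooling: the spike is diluted to 5 / 250
gmp = features.max(axis=0)   # global max pooling: the spike is preserved as 5.0
```

Here the averaged value for channel 2 is 0.02 while the max-pooled value stays at 5.0, which mirrors the argument that GAP may wash out an onset occupying only a small part of the sequence.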

Online augmentation

During training, online augmentations were employed to help the network learn how to handle differences in variance and bias from patient to patient. We employ global batch-wise z-scoring, which, when combined with a small stride size during sequence generation, smaller batch sizes, and sample-wise shuffling, forces the network to generalize to a considerable number of different scales and biases.

Global batch-wise z-scoring Z-scoring was done along the batch, temporal, and channel axes, normalizing the entire batch using a single mean and standard deviation. Let \( B_n \) be a mini batch of shape \( (|B|, \mathbf{len }~{\textit{seq}}_{input}, |F|) = (16, 2000, 10) \) for a batch size of 16. Each mini batch \( B_n \) is randomly sampled without replacement from a uniform distribution at the start of every training epoch. We calculate the mean \( \mu _{batch} \) and standard deviation \( \sigma _{batch} \) by reducing all three axes to a single scalar value. We then apply standard z-scoring as follows: \( B_{n}' = \frac{B_{n} - \mu _{batch}}{\sigma _{batch}} \). \( B_{n}' \) is then used to calculate the loss during training. When validating our model's performance, we instead z-score the validation set using the training set population mean and standard deviation.
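A minimal NumPy sketch of the two normalization modes described above follows. The small `eps` term guarding against division by zero is an assumption of this sketch, and the function names are hypothetical.

```python
import numpy as np


def zscore_batch(batch, eps=1e-8):
    """Training: normalize a (batch, timesteps, channels) array with one scalar mean and std."""
    mu = batch.mean()      # reduces all three axes to a scalar
    sigma = batch.std()
    return (batch - mu) / (sigma + eps)


def zscore_with_population(x, train_mean, train_std, eps=1e-8):
    """Validation: normalize with the training-set population statistics."""
    return (x - train_mean) / (train_std + eps)
```

Because the statistics are shared across the whole batch, an individual sequence inside a batch generally does not end up with zero mean or unit variance, which is what exposes the network to a range of scales and biases.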


Since our neural network is a binary classifier, we used a binary cross-entropy based cost function to train the network.

Kernel regularization In order to encourage the model to not overemphasize a small subset of learned features which may be biased towards the training set, we used \(L_2\) kernel regularization [9]. \( \lambda = 0.01 \) was chosen through trial and error as the \(L_2\) penalty for all convolution kernels.

Loss discount factor While GMP may help with cases where only a small part of the onset is present, some positive sequences generated using our methodology contain only a small number of positive time steps, which may negatively impact convergence. If more data were available, we could simply omit ambiguous regions during training. Due to data paucity, however, another solution is needed. We define a cost discounting function \( \alpha (p) \) where p is defined as the number of positive time steps in a sequence divided by the total length of the sequence:

$$\begin{aligned} \alpha \left( p = \frac{n_{pos}}{\mathbf{len }~seq_{input}}\right) = {\left\{ \begin{array}{ll} 0.95 &{} p = 0 \\ 10p &{} 0< p \le 0.1 \\ 1 &{} 0.1 < p \\ \end{array}\right. } \end{aligned}$$

This effectively discounts loss during the first second after the onset, starting from complete discount at \( t_{onset} = t_{final} \) and ending with no discount at \( t_{onset} = t_{final} - r \), with our discount linearly decreasing as \( t_{onset} \rightarrow t_{final} - r \). Since our classes are balanced, we chose to discount a proportional amount from all negative examples in order to avoid bias. Finally, we define our cost function as:

$$\begin{aligned} loss(y_{true}, y_{pred}) = \alpha \cdot bce(y_{true}, y_{pred}) + \lambda \sum _{i=1}^{p} \beta _i^2 \end{aligned}$$
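The discount function and weighted loss can be sketched in plain Python for a single example. This is an illustration under assumptions rather than the authors' code: the clipping constant `eps` is an assumption, the \(L_2\) term is passed in as a precomputed number, and all function names are hypothetical.

```python
import math


def alpha(p):
    """Loss discount as a function of the fraction p of positive timesteps."""
    if p == 0:
        return 0.95          # proportional discount applied to negative examples
    if p <= 0.1:
        return 10.0 * p      # linear ramp over the first second after the onset
    return 1.0


def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one example, with clipping for numerical safety."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1.0 - y_pred))


def discounted_loss(y_true, y_pred, n_pos, seq_len=2000, l2_penalty=0.0):
    """alpha(p) * bce + L2 penalty, where p = n_pos / seq_len."""
    return alpha(n_pos / seq_len) * bce(y_true, y_pred) + l2_penalty
```

One possible reading of the 0.95 weight for negatives, which is our interpretation and not stated in the text: if p is spread roughly evenly over the positive sequences, about 10% of them fall on the ramp with an average weight of 0.5, giving a mean positive weight of 0.1 × 0.5 + 0.9 × 1 = 0.95.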


We optimized our network during training using mini-batch stochastic gradient descent (SGD).

Batch size We used a mini-batch size of 16 during each training step. While a much higher batch size could easily fit into memory during training, smaller batch sizes result in a wider range of scale and bias when utilizing batch-wise z-scoring. Smaller batch sizes have also been observed to have a regularizing effect on the model when training with SGD [10].

Training parameters We selected an initial learning rate of \( \eta _{i} = 0.0001 \), decaying by a factor of 2 every 15 epochs for a total of 75 epochs. Momentum was set to \( \beta = 0.9 \).
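The schedule can be written as a small function. This sketch assumes a step (staircase) decay in which the rate is halved at epochs 15, 30, 45, and 60; the text does not say whether the decay is stepped or continuous, and the function name is hypothetical.

```python
def learning_rate(epoch, initial=1e-4, drop_every=15, factor=2.0):
    """Step decay: divide the initial rate by `factor` once every `drop_every` epochs."""
    return initial / (factor ** (epoch // drop_every))


# Learning rate for each of the 75 training epochs (epochs indexed from 0).
schedule = [learning_rate(e) for e in range(75)]
```

Under this assumption the final 15 epochs run at one sixteenth of the initial rate.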

Experimental setup While developing our method, we observed a high variability of outcome with different random seeds. In order to test the reliability of our methods, we conducted ten runs using different random seeds with our method during training.

Method variants In addition to our full method, we applied the same experiment setup to different variants omitting batch-wise z-scoring, \(L_2\) kernel regularization, and anti-aliased downsampling. For the z-scoring variant, we normalize each sequence with its own mean and standard deviation during training and validation. The \(L_2\) variant simply omits the \(L_2\) penalty. The method without anti-aliased downsampling performs a strided downsampling before the residual connection, and adds a max pooling layer on the residual in order to match the sequence lengths. Two additional variants use batch sizes of 32 and 64. Finally, we created a baseline variant without batch z-scoring, \(L_2\) regularization, anti-aliased downsampling, and the discount factor. For this variant we chose a batch size of 64. All variants share the same ten random seeds used in the full method for comparison.

Metrics Due to class imbalance in the validation set, we use receiver operator characteristic area under curve (ROC-AUC) to evaluate the accuracy of our model. Despite the imbalance, we are also interested in the trade-off between sensitivity and specificity for each of our variants. To compute sensitivity and specificity, values of \( y_{pred} > 0.5 \) are considered true, and values of \( y_{pred} \le 0.5 \) are considered false. The same threshold also applies for accuracy.
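For reference, the three metrics can be computed with the standard library alone, as in the sketch below. The pairwise ROC-AUC formulation (the fraction of positive/negative pairs ranked correctly, with ties counted as one half) is equivalent to the area under the ROC curve but is quadratic in the number of examples, so a sorting-based routine such as scikit-learn's `roc_auc_score` would be preferable at the scale of the validation set. Function names are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, threshold=0.5):
    """Sensitivity and specificity with y_pred > threshold counted as positive."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        predicted_positive = p > threshold
        if t == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, y_pred):
    """ROC-AUC as the probability that a positive outranks a negative."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))
```

Unlike sensitivity and specificity, ROC-AUC does not depend on the 0.5 threshold, which is why it is less affected by the class imbalance.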


Accuracy over ten training runs is shown in Table 1. Table 2 shows the best single validation ROC-AUC of each variant. Finally, Table 3 shows the result of 20 additional training runs for our full method.

Table 1 Comparing our full method to methods which omit one technique: ten runs \( \mu \pm \sigma \)
Table 2 Comparing our full method to methods which omit one technique: ten runs best validation
Table 3 Full method additional training runs: maximum ROC–AUC


Average accuracy

Our full model had the highest average ROC–AUC and the highest and most consistent accuracy out of all of our variants. In our variant which omitted batch-wise z-scoring, we observe a significant increase in metric variance as well as the lowest average sensitivity and ROC–AUC. We hypothesize there is not enough variance in scale and bias in the training set without this augmentation. The variant without \(L_2\) regularization struggled with ROC–AUC and specificity, while having slightly higher average sensitivity than our full method. Even considering the fact that our model only has \(\approx \) 165 K trainable parameters, without \(L_2\) kernel regularization there is clear evidence that a small number of features are overemphasized. Our variant without anti-aliasing has a higher sensitivity than our full method. However, this comes at a significant cost in specificity. We hypothesize that this is due to the model associating aliasing with the presence of the onset, and that anti-aliasing and/or removal of high frequency information is important for reducing the frequency of false positives. The variant without loss discounting was the closest to our best results, trading off more specificity than was gained in sensitivity. For both larger batch sizes (32 and 64), increasing the batch size from 16 has a significant negative impact on ROC–AUC during validation. Our baseline model predictably had the worst results overall.

Maximum accuracy

We observe that our full method has the highest single epoch ROC–AUC of all variants. All of our variants appear to be heavily dependent on weight initialization and mini-batch selection during training, with many separate training runs needed to achieve the highest generalization. We hypothesize that this is due to both the paucity of the data set and unstable gradients caused by lower batch sizes.

In addition to the ten runs for our full method, we conducted approximately twenty additional runs for our full method with new random seeds. In Table 3, we show the best overall model in terms of ROC-AUC. The model has much higher sensitivity without sacrificing a significant amount of specificity. We use this model for all following discussion and visualization of model behavior.

Explaining model predictions

Salience In order to help explain our model's predictions, we computed the gradient of y with respect to input sequences from the test set and summed the absolute value of the gradient for each feature channel together:

$$\begin{aligned} {\textit{salience}}(t) = \sum _{f=0}^{9} \left| \frac{\partial {\textit{y}}}{\partial {\textit{seq}}_{t,f}}\right| \end{aligned}$$

For visualization purposes, we normalize salience with the timestep containing the maximum value: \( {\textit{salience}}_{vis}(t) = \frac{salience(t)}{salience(t_{max})} \). In each visualization we see only strong, sparse activation contributing to the model's decision due to the GMP layer at the top of the network.
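Given the gradient of the model output with respect to one input sequence, the salience computation and its normalization reduce to a few array operations. The sketch below assumes the gradient has already been obtained as a (timesteps, channels) array, for example through automatic differentiation in the training framework. The function name and the guard for an all-zero gradient are assumptions of this sketch.

```python
import numpy as np


def salience_from_gradient(grad):
    """Normalized salience per timestep from a (timesteps, channels) gradient array."""
    salience = np.abs(grad).sum(axis=1)   # sum |dy / d seq[t, f]| over the channels f
    peak = salience.max()                 # salience at t_max
    return salience / peak if peak > 0 else salience
```

The result lies in [0, 1] and equals 1 at the timestep that contributed most, which makes plots comparable across sequences.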

Example: true positive Arguably the strongest activation overall appears to happen when almost every channel simultaneously increases, which can happen several times around the onset. We visualize this in Fig. 1, where we observe strong activation on the rising edge of a global increase.

Fig. 1

Salience: \( y_{true} = 1, \lfloor y_{pred} \rceil = 1 \) (true positive)

Example: false negative Only some instances of the onset of slow activity exhibit strong cross channel correlation, as demonstrated in Fig. 2. While most channels appear to move simultaneously, there is less positive correlation as well as some negative correlation between channels. In this particular example, there appears to be a wide spread of channel bias and low dynamic range. We hypothesize that z-scoring using the population mean and standard deviation may not be optimal for all examples, and that an adaptive strategy could improve validation performance.

Fig. 2

Salience: \( y_{true} = 1, \lfloor y_{pred} \rceil = 0 \) (false negative)

Example: false positive Fig. 3 demonstrates that not all instances of cross channel correlation are useful for predicting the onset by themselves. We hypothesize that a model may need to take into account the temporal nature of the problem in order to avoid these types of false positives.

Fig. 3

Salience: \( y_{true} = 0, \lfloor y_{pred} \rceil = 1 \) (false positive)

Fig. 4

Our network has 12 convolutional layers, each of which is followed by batch normalization and a rectified linear unit. Residual connections are used to improve gradient propagation throughout the network


While our naive baseline model had relatively poor accuracy, we demonstrated the impact of many different regularization techniques. It follows that deep learning can be an effective tool for signal detection problems with a small amount of available training data. By conducting our experiment over many different training runs, we show the statistical significance of our results. Finally, we demonstrated that while our model may be a black box, we can make the results easier to interpret with salience and effective visualization.

Future work

We recognize that the loss discount factor could be made into a continuous function across the entire sequence. Currently, examples with a negative label could contain the start of the onset because the labeling task is particularly challenging, but they are weighted as heavily as unambiguous examples. In addition, we observed examples of false positives which would be relatively easy for a human to classify correctly due to drastic changes in overall behavior patterns. An improved model would be able to recognize these changes over time in addition to identifying channel cross correlation.

Availability of data and materials

The data include protected health information, thus are not publicly available.





Abbreviations

GAP: Global average pooling

GMP: Global max pooling

ROC-AUC: Receiver operator characteristic area under curve

SGD: Stochastic gradient descent

SUDEP: Sudden death in epilepsy


References

  1. Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Proceedings of the 25th international conference on neural information processing systems - volume 1. NIPS’12. Red Hook: Curran Associates Inc.; 2012. p. 1097–105.

  2. Rajpurkar P, Hannun AY, Haghpanahi M, Bourn C, Ng AY. Cardiologist-level arrhythmia detection with convolutional neural networks. CoRR abs/1707.01836 2017. arxiv:1707.01836.

  3. Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow P-M, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141):20170387.

  4. Marcus G. Deep learning: a critical appraisal. CoRR abs/1801.00631 2018. arxiv:1801.00631.

  5. He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. CoRR abs/1603.05027 2016. arxiv:1603.05027.

  6. Zhang R. Making convolutional networks shift-invariant again. CoRR abs/1904.11486 2019. arxiv:1904.11486.

  7. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167 2015. arxiv:1502.03167.

  8. Lin M, Chen Q, Yan S. Network in network. CoRR abs/1312.4400 2013.

  9. Schmidhuber J. Deep learning in neural networks: an overview. CoRR abs/1404.7828 2014. arxiv:1404.7828.

  10. Breuel TM. The effects of hyperparameters on SGD training of neural networks. CoRR abs/1508.02788 2015. arxiv:1508.02788.

  11. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. CoRR abs/1512.03385 2015. arxiv:1512.03385.

  12. LeCun Y, Haffner P, Bottou L, Bengio Y. Object recognition with gradient-based learning. In: Forsyth DA, et al., editors. Shape, contour and grouping in computer vision. Heidelberg: Springer; 1999. p. 319.

  13. Chollet F, et al. Keras. 2015. Accessed on 2020-09-20.

  14. Jiang X, Kim Y. SBMI Healthcare Machine Learning Hackathon. School of Biomedical Informatics. 2019. Accessed on 2020-09-20.

  15. Zhao Z, Zheng P, Xu S, Wu X. Object detection with deep learning: a review. CoRR abs/1807.05511 2018. arxiv:1807.05511.

  16. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):60.

  17. Lathuilière S, Mesejo P, Alameda-Pineda X, Horaud R. A comprehensive analysis of deep regression. CoRR abs/1803.08450 2018. arxiv:1803.08450.



We would like to thank Marijane de Tranaltes, Judy Young, David Ha, Luyao Chen, Queen Chambliss, Marcos Hernandez, Angela Wilkes, and everyone else involved in organizing the SBMI Healthcare Machine Learning Hackathon.

About this supplement

This article has been published as part of BMC Medical Informatics and Decision Making Volume 20 Supplement 12, 2020: Slow Onset Detection in Epilepsy. The full contents of the supplement are available online at


This challenge is supported by the startup grant from UTHealth for the Center for Secure Artificial Intelligence For hEalthcare (SAFE) and Elimu Inc. Data for this challenge is provided with support from the Center for SUDEP Research (NINDS U01NS090408 and U01NS090405). Publication costs are funded by XJ’s discretionary funding from UTHealth. The funding bodies had no roles in the design of the study, analysis, and interpretation of data and in writing the manuscript.

Author information




CV developed methodology; SL, GZ, ST, LC, and XL provided data; YK and XJ provided guidance and feedback. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Carroll Vance.

Ethics declarations

Ethics approval and consent to participate

This study was approved by the Institutional Review Board of University of Texas Health Science Center at Houston (HSC-MS-19-0045).

Consent for publication

Not applicable

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Vance, C., Kim, Y., Zhang, G. et al. Learning to detect the onset of slow activity after a generalized tonic–clonic seizure. BMC Med Inform Decis Mak 20, 330 (2020).



  • Electroencephalogram
  • Sudden death in epilepsy
  • Generalized tonic–clonic seizure
  • Onset of slow activity
  • Signal detection
  • Machine learning
  • Deep learning
  • Neural network
  • Convolutional neural network
  • Data paucity