Continual learning framework for a multicenter study with an application to electrocardiogram

Deep learning has been increasingly utilized in the medical field and has achieved many goals. Since the size of data dominates the performance of deep learning, several medical institutions conduct joint research to obtain as much data as possible. However, sharing data is usually prohibited owing to the risk of privacy invasion. Federated learning is a reasonable approach to training on distributed multicenter data without direct access; however, it needs a central server to merge and distribute models, which is expensive and difficult to approve under various legal regulations. This paper proposes a continual learning framework for a multicenter study, which does not require a central server and can prevent catastrophic forgetting of previously trained knowledge. The proposed framework contains a continual learning method selection process, assuming that a single method is not omnipotent for all involved datasets in a real-world setting and that there could be a proper method to select for specific data. We utilized fake data based on a generative adversarial network to evaluate methods prospectively, not ex post facto. We used four independent electrocardiogram datasets for a multicenter study and trained an arrhythmia detection model. Our proposed framework was evaluated against supervised and federated learning methods, as well as finetuning approaches that do not include any regulation to preserve previous knowledge. Even without a central server and access to past data, our framework achieved stable performance (AUROC 0.897) across all involved datasets, comparable to federated learning (AUROC 0.901). Supplementary Information The online version contains supplementary material available at 10.1186/s12911-024-02464-9.


Introduction
Given the importance of the size of available data in deep learning, multicenter studies are among the most common approaches in research using medical data. However, the use of personal information in medical institutions is largely prohibited, and obtaining permission to access data requires several arduous procedures, such as approval by an institutional review board evaluating the potential risk of using the data. Even when access to data is approved, sharing or exporting data is mostly precluded, and additional approval is required according to the bylaws of each institution. Thus, merging and training on all involved data at once is practically difficult.
Federated learning is one of the alternatives for decentralized data [1][2][3][4]. In federated learning, models are distributed to be trained, aggregated periodically, and distributed again [5]. Federated learning only shares the weights of the trained model without direct access to data, alleviating the potential risk of privacy invasion caused by data sharing. Nevertheless, federated learning requires a central server, which is challenging to construct because of the various legal regulations governing a server shared by several institutions [6][7][8][9]. In an environment without a central server, federated learning is infeasible because trained models cannot be automatically merged and redistributed until convergence.
As an alternative to federated learning, we propose a continual learning framework for a multicenter study. The goal of continual learning is to gradually extend acquired knowledge without catastrophic forgetting [10,11]. Although continual learning intrinsically focuses on problems with a sequential stream of data, we reduced the distributed environment of a multicenter study to a sequence of tasks: the data from each institution is provided one by one, and the knowledge from the preceding trained model is retained by suppressing catastrophic forgetting. In this way, the model is trained to fit all involved datasets. Continual learning can be executed without a central server and requires less communication than federated learning, which needs a central server and repeated communication until the model converges, a point that is unclear in advance.
Several remarkable continual learning methods have been developed; however, a major challenge in applying these methods to a multicenter study is selecting the most proper one for specific, inaccessible preceding datasets. To the best of our knowledge, most studies proposing continual learning methods evaluated performance retrospectively. In other words, those studies compared their methods to the baselines using all datasets after all experiments were finished. However, only the current institution's dataset is accessible in a real-world setting, and it may be necessary to use a particular method suited to the specific data, rather than relying on the state-of-the-art method.
Inspired by this issue, we focus on selecting a proper continual learning method for each institution in a multicenter study. All involved institutions are assumed to strictly prohibit data sharing and only allow sharing the parameters of the trained model. Under this circumstance, we propose an algorithm to choose the best among the concerned methods while training, not after. The main idea is that synthesized data from a generative adversarial network (GAN) is introduced to equivalently evaluate the performance of the model trained by each continual learning method. To alleviate the potential risk of privacy invasion caused by the fake data, we randomly paired patient demographic data with the generated ECGs, increasing the difficulty of patient identification. In experiments, we used four different openly accessible electrocardiogram (ECG) datasets: Shaoxing and Ningbo Hospital ECG Database [12], PTB-XL [13], Georgia 12-Lead ECG Challenge Database [14], and China Physiological Signal Challenge in 2018 (CPSC 2018) [15]. Our contributions are as follows:
• We propose an algorithm to select the most suitable continual learning method in a multicenter study under a segregated environment without access to preceding datasets. To the best of our knowledge, this is the first approach to compare continual learning methods in advance, not ex post facto.
• Under a real-world setting involving institutions with different data distributions and data collection equipment, we validated our proposed method using four independent real-world ECG datasets.
• We utilized GAN-based fake data to equivalently evaluate the performance of the model trained by each continual learning method. We mitigated the potential privacy risk of the fake data by randomly pairing demographic data with the generated ECGs.

Federated learning
Federated learning assumes that several mobile devices hold privacy-sensitive data and that merging all data into a single storage for training is not allowed. Each device sends its trained model to a central server, where the models are merged and redistributed to each device repeatedly [16]. Federated learning can be categorized into horizontal and vertical federated learning: horizontal federated learning applies when datasets share the same feature space but differ in samples, and vertical federated learning is the opposite [17]. In this study, a horizontal federated learning setting is applied, assuming that the size of the input data from each institution is matched.

Continual learning
The goal of continual learning is to learn a new task while preserving pre-trained knowledge [10,11]. However, as new tasks are added, degradation of the performance on previously learned tasks is inevitable. This phenomenon is called the stability-plasticity dilemma: stability implies preserving previous knowledge, and plasticity implies integrating new knowledge [18,19]. Continual learning methods fall into three categories: (1) replay methods, (2) regularization-based methods, and (3) parameter isolation methods. Replay methods use samples from previous data or synthesize fake data to train the model together with the current task data. The primary considerations of replay methods are how many samples to store, which representative samples to choose, and how to synthesize data that retain the previous distribution [20][21][22]. Meanwhile, storing sampled data may cause privacy invasion. Compared with replay methods, regularization-based methods do not require previously sampled data. Instead, they use an additional term in the loss function to keep the weights of essential parameters close to those of the previous model [23][24][25]. Similarly, parameter isolation methods do not use sampled data. However, they differ from regularization-based methods because they fix the parameters allocated to each task; thus, the number of tasks should be defined in advance [26].

Generative adversarial network
To equivalently evaluate the stability of continual learning, this study introduces synthesized data from a generative adversarial network (GAN). The GAN was first introduced by Goodfellow et al. in 2014 [27]. A GAN has two main components: a generator and a discriminator. The generator maps random noise to the input space, and the discriminator classifies whether the received data is real or synthesized. Despite the brilliant idea of the GAN, it is well known to be hard to train because no explicit convergence point exists. The Wasserstein GAN (WGAN), proposed by Arjovsky et al., uses the Wasserstein-1 distance, defined as a distance between two distributions [28]. In the WGAN setting, the discriminator does not classify samples as real or fake but is used to estimate the Wasserstein-1 distance.
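As a minimal sketch of this objective (the function name and list-based scores are illustrative, not the paper's implementation), the WGAN critic and generator losses can be written as:

```python
def wgan_losses(critic_real, critic_fake):
    """WGAN losses: the critic estimates the Wasserstein-1 distance in its
    dual form instead of classifying real vs. fake.

    critic_real / critic_fake are lists of critic scores on real and
    synthesized samples."""
    mean = lambda xs: sum(xs) / len(xs)
    # The critic maximizes mean(D(real)) - mean(D(fake)); we minimize the negation.
    critic_loss = mean(critic_fake) - mean(critic_real)
    # The generator minimizes -mean(D(fake)), pushing fake scores up.
    generator_loss = -mean(critic_fake)
    return critic_loss, generator_loss
```

In practice (as in WGAN-GP, used later for the ECG synthesizer) a gradient penalty term is added to the critic loss to enforce the Lipschitz constraint.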

Continual learning method candidates
As the basic idea of this study is to select an appropriate continual learning method for each specific dataset, we considered three regularization-based continual learning methods as candidates: Learning without Forgetting (LwF) [23], Elastic Weight Consolidation (EWC) [24], and Memory Aware Synapses (MAS) [25].

LwF preserves preceding knowledge by adding the knowledge distillation loss proposed by Hinton et al. [37] to the loss function:

L(\theta_{new}) = \lambda \, L_{old}(y_o, \hat{y}_o) + L_B(\theta_{new}), \quad (1)

where L_B(\theta_{new}) is the loss for the new task only, \lambda is a hyperparameter setting the importance of the old task, and L_{old} is the knowledge distillation loss over the l labels, computed from the recorded probabilities y_o softened by a temperature hyperparameter T:

y_o'^{(i)} = \frac{(y_o^{(i)})^{1/T}}{\sum_{j=1}^{l} (y_o^{(j)})^{1/T}}. \quad (2)

EWC is an algorithm that keeps important parameters close to their old values. To discover the important parameters that contain preceding information, EWC introduces the Fisher information matrix F, approximated from the Gaussian distribution of the parameters. Accordingly, the loss function for EWC is:

L(\theta_{new}) = L_B(\theta_{new}) + \sum_i \frac{\lambda}{2} F_i \, (\theta_{new,i} - \theta_{old,i}^{*})^2, \quad (3)

where \theta denotes the weights of the model's parameters. The Fisher information matrix is equivalent to the second derivative of the loss near a minimum and can be computed from first-order derivatives of the loss, making it easy to calculate even for large models [38].

MAS also keeps important parameters close to their old values, but instead of gradients of the loss function, it uses the gradients of the squared L2 norm of the trained model's output. Thus, the importance weight \Omega_i for parameter \theta_i is:

\Omega_i = \frac{1}{N} \sum_{k=1}^{N} \left\| \frac{\partial \| M(x_k) \|_2^2}{\partial \theta_i} \right\|, \quad (4)

where M is the trained model, x_k is the k-th data point, and N is the number of data points. The loss function for MAS is obtained from Eq. (3) by replacing F with \Omega:

L(\theta_{new}) = L_B(\theta_{new}) + \sum_i \frac{\lambda}{2} \Omega_i \, (\theta_{new,i} - \theta_{old,i}^{*})^2. \quad (5)
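The quadratic anchoring term shared by EWC and MAS can be sketched as follows; the function name and list-based parameters are illustrative, and in a real model the sums would run over tensors rather than flat lists:

```python
def regularization_penalty(theta_new, theta_old, importance, lam):
    """Quadratic penalty shared by EWC and MAS: parameters deemed important
    for the previous task are anchored to their old values.

    For EWC, `importance` holds the diagonal Fisher information F_i;
    for MAS, it holds the importance weights Omega_i derived from the
    gradients of the squared L2 norm of the model output."""
    return lam * sum(
        w * (new - old) ** 2
        for w, new, old in zip(importance, theta_new, theta_old)
    )
```

Parameters with zero importance may drift freely toward the new task, while highly important parameters are pulled back toward their previous values.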

Arrhythmia detection model
In this study, each institution trains a model to detect arrhythmia based on ECG data, age, and sex. The model has three components: an ECG waveform processing layer based on residual one-dimensional convolutional neural networks, a patient information processing layer using a multi-layer perceptron (MLP), and an arrhythmia detection layer that uses another MLP to return the probability of arrhythmia from the concatenated outputs of the previous two layers. The model's architecture is shown in Fig. 1, and its detailed configuration is described in Table 1. Note that this architecture has been used in several studies of physiological signals [35, 39]. We empirically modified the architecture for arrhythmia detection.

Methods
In this section, we present a continual learning framework for a multicenter study, as shown in Fig. 2. First, we present our continual learning method selection algorithm in a segregated environment without access to any previous data. Then, we describe the process of constructing fake data using a GAN-based ECG synthesizer.

Framework notations
We represent each continual learning method as M_l for l ∈ {1, ..., N}. D_k is the data of the k-th institution. The model parameters trained by M_l on D_k are defined as θ^l_k.

Continual learning method selection
Some studies on continual learning have succeeded and advanced the field, but they measured performance using all the data at once after finishing all their experiments. This approach makes sense for determining the best method, but it is not practical in real-world multicenter studies where accessing past data is not permitted. Additionally, it is hard to guarantee that a single method will work well for all the datasets involved.
Accordingly, we assume that there is a suitable continual learning method depending on the dataset. To find the most suitable one, the stability of candidate methods should be compared under equivalent conditions without preceding data. In this study, the equivalent condition is fulfilled by the fake data, which is equally utilized to evaluate the given methods, and stability is defined as the performance of a method calculated on the fake data.
The optimal hyperparameters of continual learning methods, such as λ in Eqs. (1), (3), and (5), should also be determined without preceding data. We referred to the continual hyperparameter selection introduced by De Lange et al. [10]. Our proposed continual learning method selection process in each institution is illustrated in Algorithm 1.

Algorithm 1 Continual learning method selection
The proposed algorithm consists of three steps. First, the previous model θ*_{k−1} is finetuned on D_k without any regulation to preserve previous knowledge, and the baseline performance p* and the corresponding model hyperparameters h*_k are returned. Second, the trained model parameters θ^l_k for each continual learning method M_l are determined. In this step, the previous model θ*_{k−1} is trained by M_l. The continual hyperparameters H_l are initially set to maximize stability and then relaxed by the decaying factor α until the performance on the current data meets the reference value, which is the dropped baseline performance (1 − δ)p*, where δ is the performance drop margin. In the last step, each θ^l_k is evaluated on the previously accumulated fake data D^S_{k−1}.
Then the best-performing model's parameters θ * k are selected and returned.
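The three steps above can be sketched as follows; all names, the callback interfaces, and the default values are illustrative assumptions, not the paper's implementation:

```python
def select_method(methods, train, evaluate_current, evaluate_fake,
                  h_init=1.0, alpha=0.9, delta=0.05):
    """Sketch of the continual learning method selection loop.

    `train(method, h)` returns trained parameters (method=None means plain
    finetuning); `evaluate_current` scores them on the current site's
    validation data; `evaluate_fake` scores them on the accumulated fake
    data from previous sites (stability)."""
    # Step 1: finetune without any forgetting regulation -> baseline p*.
    p_star = evaluate_current(train(None, 0.0))
    best, best_stability = None, float("-inf")
    for m in methods:
        # Step 2: start with maximal stability and decay h until the
        # current-site performance reaches the dropped baseline (1 - delta) * p*.
        h = h_init
        params = train(m, h)
        while evaluate_current(params) < (1 - delta) * p_star and h > 1e-6:
            h *= alpha
            params = train(m, h)
        # Step 3: stability is measured on the accumulated fake data.
        stability = evaluate_fake(params)
        if stability > best_stability:
            best, best_stability = (m, params), stability
    return best
```

The `h > 1e-6` guard is an added safety floor: as h approaches zero, the regularized training approaches plain finetuning, so the baseline performance is eventually recoverable.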

Fake data construction
This study mainly focused on comparing various continual learning methods in the process of a multicenter study. To reflect reality, we assumed a precluded environment where access to other data is unavailable and only one dataset can be accessed for training at each step, and we used synthesized data as a surrogate for real data to equivalently evaluate the stability of the model trained by each continual learning method. As an ECG waveform synthesizer, Pulse2Pulse, proposed by Thambawita et al., was used [31]. The original Pulse2Pulse generates an 8-channel ECG composed of leads I, II, and V1 to V6, and constructs the remaining leads (III, aVR, aVL, aVF) by linear calculation from the eight leads. Since it was not verified whether the excluded leads (III, aVR, aVL, aVF) of all used datasets were calculated or directly measured, we modified the layer sizes of Pulse2Pulse's generator and discriminator to synthesize the full 12-lead ECG. Keeping the original setting of Pulse2Pulse, only the sizes of the discriminator's input layer and the generator's input and output layers were changed from 8 to 12. The training data were separated into groups with and without arrhythmia, and a generator was trained on the data from each group. To preserve the approximate age and sex distributions of the original data, we randomly sampled age and sex records from each group and randomly paired them with the synthesized waveforms of the corresponding group. Note that this process maintains the joint distribution of age and sex, but not the complete distribution including waveforms. By this approach, we increased the difficulty of identifying individuals, ensuring that the synthesized ECG and corresponding demographic information differed significantly from the original data. Meanwhile, even though some prior studies have explored human identification using ECG [40, 41], there is currently no standardized technique for identifying individuals by ECG, and as pointed out by Thambawita et al., generating realistic synthetic data can be an alternative solution to privacy issues [31]. In this way, our fake data construction process alleviated the potential risk of privacy invasion.
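The random pairing step can be sketched as follows; the function name and record layout are illustrative. Sampling (age, sex) records with replacement within an arrhythmia group keeps their joint distribution approximately intact while detaching demographics from the waveforms they originally accompanied:

```python
import random

def pair_demographics(synth_waveforms, real_demographics, seed=0):
    """Randomly pair sampled (age, sex) records with synthesized waveforms
    of the same arrhythmia group, so a generated ECG never carries the
    demographics of any specific source patient."""
    rng = random.Random(seed)
    # Sampling with replacement preserves the approximate joint
    # distribution of age and sex within the group.
    return [(wave, rng.choice(real_demographics)) for wave in synth_waveforms]
```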

Data and code availability
All datasets used for the development and validation of the proposed framework in this study are publicly available [12][13][14][15]. The code for ECG and demographic data preprocessing, model development, and all experiments, including arrhythmia detection and fake data construction, is available in our source code repository at https://anonymous.4open.science/r/CLMS-FB72.

Datasets
We conducted experiments using four publicly available ECG datasets (Shaoxing and Ningbo Hospital ECG Database, PTB-XL, Georgia 12-Lead ECG Challenge Database, and CPSC 2018) that include arrhythmia labels [12][13][14][15]. Each 12-lead ECG was sampled at a frequency of 500 Hz for 10 s. All datasets contain age and sex information. In this study, we used ECGs of patients aged between 18 and 100. The baseline characteristics of all datasets are shown in Table 2. The ECGs were band-pass filtered between 0.5 and 40 Hz using a fifth-order Butterworth filter and scaled to the range from −1 to 1. All datasets were randomly split into training, validation, and test sets according to an 8:1:1 ratio.
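The scaling and splitting steps can be sketched as below (the band-pass filter itself would typically be built with `scipy.signal.butter` and `filtfilt`, which is omitted here to keep the sketch dependency-free; names are illustrative):

```python
import random

def scale_to_unit(signal):
    """Min-max scale a signal to the range [-1, 1].
    Assumes a non-constant signal (hi > lo)."""
    lo, hi = min(signal), max(signal)
    return [2.0 * (s - lo) / (hi - lo) - 1.0 for s in signal]

def split_8_1_1(records, seed=0):
    """Random 8:1:1 train/validation/test split."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```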

Training on a single domain
We first considered the effect of our continual learning framework on a single domain, as a "weak" multicenter study. Training on a single domain assumes the plain condition that the datasets are collected independently at each site with different cohort distributions, but the recording device and regional factors are shared. PTB-XL was used as the single domain and split into four non-IID (not independent and identically distributed) datasets because its arrhythmia labels were the most evenly distributed.

Non-IID data generation
The splitting procedure is as follows: First, divide the dataset into four groups based on age (under or over 60) and sex [42]. Second, randomly subdivide each group into ten subgroups. Then, for each group, randomly select three of the ten subgroups and distribute them to the other three groups. In this way, four non-IID datasets corresponding to each site are generated, and a summary of their baseline characteristics is shown in Table 3.
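The procedure above can be sketched as follows; the record layout and function name are illustrative, and ties in group sizes are handled naively:

```python
import random

def non_iid_split(dataset, seed=0):
    """Sketch of the non-IID split: stratify by age 60 and sex into four
    groups, cut each group into ten subgroups, then send three random
    subgroups from each group to the other three sites."""
    rng = random.Random(seed)
    # Four strata: (age < 60 / >= 60) x (sex F / M).
    groups = [[], [], [], []]
    for rec in dataset:
        idx = (rec["age"] >= 60) * 2 + (rec["sex"] == "M")
        groups[idx].append(rec)
    sites = [[] for _ in range(4)]
    for g, records in enumerate(groups):
        rng.shuffle(records)  # shuffling makes the subgroup choice random
        chunk = max(1, len(records) // 10)
        subgroups = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        # The first three subgroups go to the other three sites...
        others = [s for s in range(4) if s != g]
        for j, other in enumerate(others):
            if j < len(subgroups):
                sites[other].extend(subgroups[j])
        # ...and the remaining subgroups stay at the originating site.
        for sub in subgroups[3:]:
            sites[g].extend(sub)
    return sites
```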

Training on multiple domains
Non-IID data from a single domain may reflect the segregated environment to some extent. However, a real-world multicenter study involves datasets that differ more in cohort distribution, data collection devices, database structure, and other hard-to-explain regional factors such as overall income level, ethnicity, and climate. Accordingly, we set up the experiment on multiple domains as a "strong" multicenter study, using four independently collected datasets: the PTB-XL ECG dataset (PTB-XL), the Shaoxing and Ningbo Hospital ECG Database (Shaoxing), the Georgia 12-Lead ECG Challenge Database (Georgia), and the CPSC2018 dataset (CPSC). As shown in Table 2, PTB-XL has a ratio of arrhythmia much different from the other datasets. Regarding recording devices, the PTB-XL dataset was recorded by devices from Schiller AG, while the Shaoxing dataset was recorded by the GE MUSE ECG system. As to region, Georgia was collected in the USA, while Shaoxing and CPSC were collected in China.

Supervised learning
For the baseline of supervised learning, we trained the arrhythmia detection model using the individual data from each site and all merged data. A cross entropy loss weighted for the class imbalance of the training set was minimized, where N is the number of training data, l_i is the loss of the i-th sample, and y_i is the label of the i-th sample. The training epoch was set to 100, and the model with the best validation area under the receiver operating characteristic curve (AUROC) was selected. The model was optimized by Adam [43] with a learning rate of 0.0001, and the batch size was set to 256.

Federated learning
FedAvg by McMahan et al. was adopted as a baseline of federated learning [16]. The parameter averaging process is as follows:

w_{t+1} = \sum_{k=1}^{K} \frac{n_k}{n} \, w^k_{t+1},

where w_t denotes the merged model parameters at round t, w^k_t the updated parameters of the k-th institution, K the number of involved institutions, n the total number of data, and n_k the number of data of the k-th institution. We additionally conducted federated learning experiments with FedProx by Li et al. [44], which adds an L2 regularization term to the loss function L in the local training of FedAvg as follows [45]:

L(w) + \frac{\mu}{2} \, \| w - w_t \|^2,

where \mu is a hyperparameter that controls the regularization.
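The FedAvg aggregation step can be sketched as below; flat parameter lists stand in for real model tensors:

```python
def fedavg(site_weights, site_sizes):
    """FedAvg aggregation: average each parameter across sites, weighted
    by the number of samples n_k held by each site."""
    n = sum(site_sizes)
    merged = [0.0] * len(site_weights[0])
    for w_k, n_k in zip(site_weights, site_sizes):
        for i, w in enumerate(w_k):
            merged[i] += (n_k / n) * w
    return merged
```

Sites with more data thus pull the merged model more strongly toward their local optimum, which is one reason FedAvg can underperform on a site whose label distribution differs sharply from the rest.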
For every round, each site trained the model in the manner of supervised learning for 20 epochs, and training was early stopped if there was no increase of the AUROC for more than five epochs. This process was repeated for 30 rounds, and the model with the best weighted average AUROC was selected. The formula for the weighted average AUROC in this study is:

AUROC_{overall} = \sum_{i} \frac{N_i}{N} \, AUROC_i,

where N_i denotes the number of data in the i-th institution and N the total number of data across all institutions.
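The weighted average used for model selection and for the "overall" columns of the result tables is a straightforward size-weighted mean (function name illustrative):

```python
def weighted_average_auroc(aurocs, sizes):
    """Weighted average AUROC: each site's AUROC is weighted by its share
    of the total number of samples across all institutions."""
    total = sum(sizes)
    return sum(a * n / total for a, n in zip(aurocs, sizes))
```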

Finetuning and continual learning
For finetuning and continual learning, the order of the sites must be considered. However, since there is no standard ordering technique [10], we arbitrarily determined the training order by sorting from small to large, and vice versa, based on each site's dataset size (N in Tables 2 and 3). For each site, the training epoch was set to 100, and the training process was early stopped if there was no increase in the AUROC for more than ten epochs. No regularization was applied to finetuning. Continual learning followed Algorithm 1, with LwF, EWC, and MAS as the method candidates. For the hyperparameters of each method, λ in Eqs. (1), (3), and (5) was initially set to 1, and T in Eq. (2) was set to 10, empirically.
The decaying factor α and performance drop margin δ were also empirically set to 0.9 and 0.95, respectively.

Generative model training
To evaluate continual learning methods, we used synthesized data from a generative model. As the generative model, we adopted the modified Pulse2Pulse to synthesize 12-lead ECGs. The training epoch was set to 2000, and early stopping was activated if there was no decrease of the negative critic loss of WGAN-GP for more than 100 epochs [29]. The model was optimized by Adam [43] with a learning rate of 0.0001, and the batch size was set to 64. For the ECG synthesizer, the generator was updated once for every five discriminator updates. The parameters of the first trained synthesizer were transferred to the following site and used to initialize the new synthesizer to reduce training time. Note that we only transferred the trained synthesizer, with no previously sampled data.
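The alternating update schedule can be sketched as follows; the step callbacks stand in for the actual WGAN-GP optimization steps, and the function name is illustrative:

```python
def train_wgan(generator_step, critic_step, epochs, n_critic=5):
    """Alternating WGAN training schedule: the critic (discriminator) is
    updated n_critic times for every single generator update."""
    history = []
    for _ in range(epochs):
        for _ in range(n_critic):
            history.append(("critic", critic_step()))
        history.append(("generator", generator_step()))
    return history
```

Training the critic more often keeps its Wasserstein-1 estimate accurate before each generator update, which is the standard WGAN recipe.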
A sample of a synthesized normal ECG is shown in Fig. 3. The comparison of ECG features extracted from the original and synthesized ECGs is shown in Supplementary Table 1, confirming that our generator successfully addressed mode collapse [46].

Computational information
The

Performance on a single domain
The test performances of all methods on a single domain are presented in Table 4. Note that the overall AUROC was calculated as the weighted average AUROC described in the Federated learning section. When the model was trained using all merged data, assuming full accessibility, supervised learning on the merged data consistently outperformed single-site supervised learning. The overall performance was best in our proposed framework with large-to-small order, resulting in an AUROC of 0.914; however, since the data from all sites are derived from the identical dataset, PTB-XL, the performance differences between the methods were hypothesized not to be very large. Figure 4 shows the fluctuation of the validation AUROC of all methods as training progresses. For continual learning method selection, the selection rates of LwF, EWC, and MAS throughout all experiments were 10.0%, 40.0%, and 50.0%, respectively.

Performance on multiple domains
In Table 5, the test performances of all methods on multiple domains are presented. In supervised learning, performance was improved by training with the merged data, but the performance on PTB-XL, which has a significantly different arrhythmia ratio compared to the other datasets, became worse (AUROC 0.930 → 0.842). FedAvg and FedProx also showed weak performance on PTB-XL, with AUROCs of 0.751 and 0.735, while their overall performance was best, with AUROCs of 0.901 and 0.900, respectively. Finetuning with small-to-large order showed the best performance on the Shaoxing dataset; however, this is because the model was trained on the Shaoxing dataset last in this order. The large-to-small order showed the best performance on CPSC2018 for the same reason. Among all methods, only continual learning with large-to-small order achieved an AUROC over 0.87 on all datasets, with an overall AUROC of 0.897, which is only 0.004 lower than the best score (FedAvg). The small-to-large order showed relatively weaker performance than the large-to-small order, as with finetuning, but the decline in performance during training was much smaller than with finetuning. For all orders, continual learning maintained the performance of each site as training progressed by suppressing catastrophic forgetting, as shown in Fig. 5. For continual learning method selection, the selection rates of LwF, EWC, and MAS

Discussion
In this study, we proposed a continual learning framework for a multicenter study with a segregated environment where data sharing is strictly prohibited.
We focused on evaluating various continual learning methods without preceding data during training, not after. Fake data were synthesized by a proper generative model to evaluate each continual learning method equivalently, and in this process, no raw data were shared. To mitigate the potential risk of privacy invasion, we randomly paired the sampled demographic data with the synthesized waveforms, disturbing the original data distribution. Our proposed framework with the proper order (large-to-small) showed competitive performance (AUROC 0.897) compared to federated learning (AUROC 0.901) and successfully suppressed catastrophic forgetting regardless of the dataset. Beyond performance, our framework has higher utility than federated learning since continual learning does not require a central server, which is an essential component of federated learning. The selection rates of the method candidates (LwF, EWC, MAS) were not one-sided, which supports our assumption that a single method is not omnipotent for all involved datasets in a real-world setting and that there could be a proper method to select for specific data.
There are several directions for future work. We proposed a continual learning method selection process based on fake data from a generative model. However, discovering and training a proper generative model for a specific kind of data requires a lot of effort. Toward the goal of equivalent evaluation of continual learning methods, considering more efficient and mathematically reasonable evaluation processes would be future work. Continual learning methods usually have an additional term in the loss function to control catastrophic forgetting; directly comparing these terms on a coordinated scale could reduce the evaluation time without a generative model. On the other hand, there could be persistent privacy risks despite utilizing fake data, and thus we plan to evaluate privacy invasion through reasonable metrics for privacy-preserving fake data. Also, we performed experiments with orders from small to large and vice versa to set the order according to a criterion (despite weak significance) instead of random ordering, because there is no standard ordering technique yet [10]. However, this approach might not be feasible if there are many sites or if the sample sizes in each domain are similar or dynamic, and other ordering techniques are to be further explored. Meanwhile, this study was performed with standard 12-lead ECGs sampled at 500 Hz for 10 s, so the methods of the study should be validated on other formats of ECG, such as Holter monitor records, overcoming diverse measurement environments simultaneously. Lastly, expanding the continual learning method candidates and analyzing the impact of our framework on other models, such as lightweight deep learning models, remains future work.
θ^l_k. The best-performing model's parameters are denoted as θ*_k. Note that the initial state of the training model at the k-th institution is θ*_{k−1}. S_k is the synthesizer trained on D_k, and D^S_k is the fake data generated by S_k. For continual learning method selection, each θ^l_k is evaluated on the accumulated fake data D^S_1, ..., D^S_{k−1}. Every k-th institution transfers the selected model θ*_k and the accumulated fake data D^S_1, ..., D^S_k to the following institution.

Fig. 1
Fig. 1 The architecture of the arrhythmia detection model. The representations of the ECG waveforms and patient information, including age and sex, are concatenated and pass through the arrhythmia detection layer to return the probability of arrhythmia

Fig. 4
Fig. 4 Validation AUROC of all methods on a single domain. (upper left) supervised learning, (upper right) federated learning, (lower left) continual learning (small to large), (lower right) continual learning (large to small). Results are averaged across five random seeds, and the shaded area indicates one standard deviation

Fig. 5
Fig. 5 Validation AUROC of all methods on multiple domains. (upper left) supervised learning, (upper right) federated learning, (lower left) continual learning (small to large), (lower right) continual learning (large to small). Results are averaged across five random seeds, and the shaded area indicates one standard deviation

Kwon et al. and Lin et al. showed significant performance in detecting electrolyte imbalances, including potassium, sodium, and calcium, using ECG and deep learning methods [33, 34]. Also, Raghunath et al. used a deep neural network (DNN) to predict mortality from 12-lead ECG [35]. Kiyasseh et al. proposed CLOCS, a novel contrastive learning method for effective representation of ECG [36]. CLOCS showed state-of-the-art performance on the downstream task of arrhythmia (heartbeat abnormality) detection.

Table 1
Configuration of the arrhythmia detection model. n indicates the batch size

Table 2
Baseline characteristics of all datasets. The mean and standard deviation of age, the percentage of male sex, and the percentage of arrhythmia are presented

Table 3
Baseline characteristics of the non-IID groups of PTB-XL. The mean and standard deviation of age, the percentage of male sex, and the percentage of arrhythmia are presented

Table 4
Test performances of all methods on a single domain (PTB-XL). The mean and standard deviation across five random seeds are shown. The overall performance is the weighted average AUROC by the number of data in each site. Bold is the best and underlined is the second best

Table 5
Test performances of all methods on multiple domains. The mean and standard deviation across five random seeds are shown. The overall performance is the weighted average AUROC by the number of data in each site. Bold is the best and underlined is the second best