Stacked LSTM based deep recurrent neural network with Kalman smoothing for blood glucose prediction

Background: Blood glucose (BG) management is crucial for type-1 diabetes patients, creating the need for reliable artificial pancreas and insulin infusion systems. In recent years, deep learning techniques have been utilized for more accurate BG level prediction. However, continuous glucose monitoring (CGM) readings are susceptible to sensor errors, and inaccurate CGM readings degrade BG prediction and make it unreliable, even if the most optimal machine learning model is used.

Methods: In this work, we propose a novel approach to predicting blood glucose levels with a stacked long short-term memory (LSTM) based deep recurrent neural network (RNN) model that takes sensor faults into account. We use the Kalman smoothing technique to correct CGM readings that are inaccurate due to sensor error.

Results: For the OhioT1DM (2018) dataset, containing eight weeks' data from six different patients, we achieve an average RMSE of 6.45 and 17.24 mg/dl for prediction horizons (PH) of 30 and 60 min, respectively.

Conclusions: To the best of our knowledge, this is the leading average prediction accuracy for the OhioT1DM dataset. Different physiological information, e.g., Kalman smoothed CGM data, carbohydrates from meals, bolus insulin, and cumulative step counts in a fixed time interval, is crafted into meaningful features used as inputs to the model. The goal of our approach is to lower the difference between the predicted CGM values and the fingerstick blood glucose readings, which we treat as the ground truth. Our results indicate that the proposed approach is feasible for more reliable BG forecasting and might improve the performance of artificial pancreas and insulin infusion systems for T1D diabetes management.


Background
Diabetes mellitus is a chronic disorder associated with abnormally high levels of blood glucose because the body is unable to produce enough insulin to meet its needs. The α-cells and β-cells in the pancreas are responsible for maintaining the glucose level in the blood by secreting the insulin and glucagon hormones [1]. Diabetes is classified primarily into two categories. Type-1 diabetes is due to β-cell destruction and causes absolute insulin deficiency, while type-2 diabetes is due to a progressive insulin secretory defect on the background of insulin resistance [2]. In type-1 diabetes, hypoglycemia sets in when blood sugar levels are too low (blood glucose concentration < 70 mg/dl) [3], and hyperglycemia occurs when blood sugar levels are too high for a prolonged period (blood glucose concentration > 180 mg/dl) [4][5][6]. In the long term, hyperglycemia causes severe complications of the heart, blood vessels, eyes, kidneys, and other organs [7]. Therefore, proper diabetes management is vital for human health.
External insulin treatment is indispensable for T1D patients to maintain the blood glucose level in a healthy range [8]. In the naive approach to diabetes management, a patient measures BG concentration several times throughout the day and night using the fingerstick test; currently, this is the most common self-monitoring approach. Improved techniques, such as the combination of an insulin pump for continuous subcutaneous insulin infusion (CSII) and a device for continuous glucose monitoring (CGM), are required for an effective blood glucose management system, known as sensor-augmented pump (SAP) therapy [9]. A CGM device takes glucose measurements at a fixed interval; for example, most CGM devices take 288 measurements per day at 5 min intervals. SAP therapy has been further improved by utilizing control algorithms for dynamic insulin delivery, known as the "Artificial Pancreas" (AP), i.e., closed-loop control of blood glucose in diabetes [10]. Statistical and machine learning techniques, given the availability of previous continuous BG records, make BG prediction more convenient. This prediction mechanism allows a patient or control algorithm to take the initiative to lower the adverse effects of unintended glycemic events.
Various statistical and machine learning methodologies have been proposed for blood glucose forecasting. The autoregressive integrated moving average (ARIMA) model-based algorithm is an example of a classical statistical method for blood glucose prediction [11]. The primary limitation of naive machine learning approaches (e.g., support vector regression [12] or the k-nearest neighbor classifier [13]) is that they fundamentally depend on the representation of the input data. Daskalaki et al. [14] proposed a generic physiological model of blood glucose dynamics to generate essential features for a support vector regression (SVR) model that took daily events such as insulin boluses and meals into consideration. Meal, insulin, and exercise models were used with the SVR model in [15].
Several works have applied artificial neural networks (ANNs) to blood glucose prediction [16][17][18][19]. Classically, an ANN has a single hidden layer, whereas deep learning models have many hidden layers. These deep models with higher complexity outperform traditional shallow neural networks and can learn patterns automatically from data [20,21]. Especially for sequential data, recurrent neural networks outperform feed-forward ANNs. However, classical RNNs are limited in their ability to learn long-term dependencies because of the vanishing/exploding gradient problem. This issue has been addressed by long short-term memory (LSTM) networks [22], which add a memory cell and a forget gate to the classical RNN. In very recent works [23][24][25][26], convolutional RNNs and LSTMs have been investigated for more accurate BG prediction. Fox et al. [27] used an RNN with gated recurrent unit (GRU) cells, and Sun et al. [28] utilized a Bi-LSTM based RNN for BG forecasting. However, the accuracy achieved with state-of-the-art models on real patient data is not high enough, so these approaches might not be applicable in the health domain; for example, the best average RMSE value achieved for the OhioT1DM dataset in the most recent works is 19.04 mg/dl. Moreover, these recently proposed RNN methodologies do not take sensor faults into consideration. This comparatively low accuracy of existing BG prediction schemes motivates this work to improve the reliability and accuracy of the BG forecasting mechanism.
Early detection and prevention of potential glycemic events such as hypoglycemia and hyperglycemia is one of the primary purposes of the artificial pancreas system. Comparatively more work has been done on hypoglycemia detection than on the detection of hyperglycemic events. Several statistical methods were studied, such as linear prediction models [29], recursive auto-regressive partial least squares models [30], and multi-variable models [31], for modeling the CGM sensor data for a reliable hypoglycemia early alarm system. In [32], an RNN model and two different autoregressive models were proposed to design a hypoglycemia/hyperglycemia early warning system (EWS). In [33], the author developed a time-sensitive ANN-based hypoglycemia prediction system that can predict future hypoglycemic events within a prediction horizon (PH) of 30 min. Additional physiological models, along with an ANN, are studied in [34] to predict nocturnal hypoglycemia. However, there are very limited works explicitly on the detection of hyperglycemia [35]; the authors of that work have shown that electrocardiographic (ECG) signals can be employed with an ANN model to detect hyperglycemic events practically and non-invasively.
CGM sensor readings are a crucial factor in BG prediction, as a slight error in a reading might result in a wrong prediction. However, sensor faults are very common in CGM systems, which is why most clinical BG data sets are prone to errors. Several factors are responsible for such faults, including the decay of sensor sensitivity, pressure-induced sensor attenuation (PISA) [36], and interruptions in signal transmission. Furthermore, bias and latency might be present in a CGM reading [37]. Another issue is that readings from sensors of the same manufacturer might differ due to manufacturing variability. As a consequence, the predicted BG value and the insulin estimate might be erroneous due to such faulty BG readings, which can decrease the efficacy of the diabetes management system. Therefore, to propose more reliable BG forecasting methodologies, sensor faults should be taken into consideration. Synthesized data sets are an exception, as they are outside the scope of sensor faults.
Moreover, there is another issue responsible for inaccurate BG prediction. CGM sensor glucose is measured from interstitial fluid samples instead of blood samples [38], and there is a discrepancy in time and magnitude between BG and interstitial glucose (IG) [39]. In a BG forecasting system, CGM sensor readings are used as one of the primary inputs to predict the future BG level; since IG and BG values differ, BG prediction with IG as input is challenging. We find in our experiments that BG predictions made with Kalman smoothed CGM readings are closer to the actual BG readings measured with fingerstick than predictions made with unprocessed CGM readings. In this study, we assume the fingerstick BG reading is the ground truth.
In this work, we propose a deep learning approach for blood glucose prediction using a multi-layered LSTM based recurrent neural network model. We use the OhioT1DM (2018 version) dataset [40], containing eight weeks' data for each of six people with type-1 diabetes, for training and evaluation of our proposed model. We utilize several features instead of only the CGM measurement, because the dynamics of the glucoregulatory system depend on several factors, such as carbohydrate intake from meals, the amount of bolus dose or infused insulin, exercise, and physical activeness; the CGM measurement alone might not be enough to learn those complex dynamics. Therefore, we investigate several combinations of the physiological information in the OhioT1DM dataset and choose the optimal combination for our proposed model. We find that step count information from a fitness band can be a useful feature, and the combination of CGM readings, step count, carbohydrate intake from meals, and bolus dose turns out to be the optimal feature set for the input of the model. Note that we use CGM data preprocessed with Kalman smoothing (KS) [41] instead of raw CGM data to mitigate the effect of sensor faults on BG prediction. The overall architecture of our proposed BG prediction system is illustrated in Fig. 1.
Our main contributions can be summarized as follows:

• We propose a novel approach using a stacked LSTM based deep recurrent neural network architecture for BG prediction.

• We introduce a more reliable BG prediction system by utilizing Kalman smoothing to mitigate sensor faults in CGM readings. We demonstrate that predictions made with this approach are closer to the actual BG level (fingerstick BG reading) than those of the conventional prediction method.

• Our study reveals that a person's step count information from a fitness band can be effectively utilized to improve BG prediction accuracy.
Overall, our proposed methodology provides more accurate and reliable BG forecasting in terms of RMSE.

Dataset
We use the OhioT1DM (2018 version) [40] dataset to train and test our model. This dataset includes eight weeks of data for each of six type-1 diabetes patients, two male and four female. Medtronic 530G insulin pumps and Medtronic Enlite CGM sensors were used throughout the data collection period, and each patient reported daily event data via a smartphone app and a fitness band. The dataset includes the CGM blood glucose level every 5 min (288 samples per day), blood glucose levels from fingerstick (self-monitoring), insulin doses in the form of bolus and basal, self-reported meal times with estimated carbohydrate intake, exercise times, sleep, work, stress, and illness. The dataset also includes 5-min aggregations of step count, heart rate, galvanic skin response (GSR), skin temperature, and air temperature.

In this study, we experiment with each of these attributes to find the optimal attribute set for the BG prediction model. We first train the model with only CGM values as input and calculate its accuracy, which we treat as the base accuracy. In the next phase of the study, we use an incremental, brute-force strategy for adding new attributes to the input of the model. After including a new attribute, we determine the model's accuracy; only if the accuracy is better than that of the model trained without that particular attribute do we consider the attribute a potential attribute for further experiments. For example, we notice that the accuracy of the model trained with CGM values and bolus information is better than that of the model trained with only CGM values (the base accuracy), so we consider bolus information a potential attribute. In contrast, adding heart-rate information to the CGM values does not yield better accuracy than the base accuracy, so we do not include this attribute in the input of our model.
From our experiments, it turns out that CGM values, carbohydrate intake from meals, insulin dose as a bolus, and the 5-min aggregation of step counts from the fitness band have a positive effect on the accuracy of the model. It is worth mentioning that some attributes, such as basal insulin, sleep, heart rate, GSR, and skin temperature, have no effect, or sometimes an adverse effect, on the accuracy of our proposed model. Table 2 shows the prediction accuracy for different feature sets as input.
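The incremental attribute-selection strategy described above can be sketched as follows. Here `evaluate` and the RMSE table are hypothetical stand-ins for "train the model and measure its RMSE"; the numbers are illustrative only, not results from the paper:

```python
# Greedy feature selection: keep an attribute only if adding it lowers RMSE.
def select_features(candidates, evaluate, base=("cgm",)):
    selected = list(base)
    best_rmse = evaluate(selected)          # base accuracy: CGM-only model
    for attr in candidates:
        trial = selected + [attr]
        rmse = evaluate(trial)
        if rmse < best_rmse:                # lower RMSE = better accuracy
            selected, best_rmse = trial, rmse
    return selected, best_rmse

# Mocked RMSE lookup (illustrative values only)
MOCK_RMSE = {
    ("cgm",): 21.0,
    ("cgm", "bolus"): 20.1,
    ("cgm", "bolus", "heart_rate"): 20.5,   # worse -> heart rate rejected
    ("cgm", "bolus", "meal"): 19.3,
    ("cgm", "bolus", "meal", "steps"): 19.0,
}

features, rmse = select_features(
    ["bolus", "heart_rate", "meal", "steps"],
    lambda fs: MOCK_RMSE[tuple(fs)],
)
print(features)   # ['cgm', 'bolus', 'meal', 'steps']
```

With these mocked scores, the procedure retains exactly the four attributes the study found useful and rejects heart rate, mirroring the selection logic in the text.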

Feature extraction from physiological information
We select CGM values, meal information, insulin dose, and step count information as the final feature set. These four features constitute the 4-channel input for the proposed model.

Carbohydrate information
In the case of meal information, we consider the fact that blood sugar starts to rise about 15 min after a meal and reaches its peak after 1 h [42]. We name this phase the increasing phase; afterwards, the carbohydrate level in the blood starts to decrease, which is the decreasing phase. At any time-index t_s, we calculate the amount of carbohydrate that is effective in the blood at that moment after a meal, and this calculated amount is treated as the input for time-index t_s to the model. Every time the subject has a meal, we keep track of and update the time-index t_meal and the amount of carbohydrate C_meal from the meal. We ignore the first 15 min (3 data points at 5 min intervals) right after a meal, since it takes 15 min for the meal to affect the blood glucose level. Thus, the effective carbohydrate C_eff in the blood at any time-index t_s within the first 60 min after the meal is

C_eff = β_inc (t_s − t_meal − 3) C_meal, for t_meal + 3 ≤ t_s ≤ t_meal + 12,

where the offset of 3 excludes the first 15 min after the meal, and β_inc is the carbs increasing factor. We use β_inc = 0.111, which implies that the carbohydrate level in the blood reaches its maximum (100% of the amount of carbohydrate taken from the meal) by increasing at a rate of 11.1% per time-index (5 min interval). At the 60th minute after the meal, C_eff attains its maximum: since only nine of the twelve time-indexes within the 60-min interval are counted, the factor (t_s − t_meal − 3) β_inc is approximately 0.999 ≤ 1.00, so C_eff does not exceed 100% of C_meal. When C_eff reaches its maximum, the decreasing phase begins.
Within the decreasing phase, the amount of carbohydrate effective (appearing in the blood as glucose) at any particular time-index t_s is

C_eff = (1 − β_dec (t_s − t_peak)) C_meal,

where t_peak is the time-index at which C_eff reached its maximum and β_dec is the carbs decreasing factor. We set β_dec = 0.028 according to the assumption that the duration of the decreasing phase is around 3 h; after 3 h (36 time-indexes), C_eff is approximately 0. We update C_eff at every step with C_eff = max(0, C_eff), i.e., we ignore any negative values of C_eff.
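The two-phase carbohydrate model can be sketched as a single piecewise function. This is one possible reading of the description above (linear rise at β_inc per time-index after a 15 min delay, then linear decay at β_dec); it is a sketch under those assumptions, not the authors' exact implementation:

```python
BETA_INC = 0.111   # carbs increasing factor (per 5-min time-index)
BETA_DEC = 0.028   # carbs decreasing factor (per 5-min time-index)
DELAY = 3          # first 15 min (3 time-indexes) after a meal are ignored
PEAK = DELAY + 9   # C_eff peaks 60 min (12 time-indexes) after the meal

def effective_carbs(t_s, t_meal, c_meal):
    """Effective carbohydrate at time-index t_s for a meal of c_meal grams
    taken at time-index t_meal (assumed piecewise-linear model)."""
    dt = t_s - t_meal
    if dt < DELAY:                      # absorption has not started yet
        return 0.0
    if dt <= PEAK:                      # increasing phase
        c_eff = BETA_INC * (dt - DELAY) * c_meal
    else:                               # decreasing phase (~3 h back to zero)
        c_eff = (1.0 - BETA_DEC * (dt - PEAK)) * c_meal
    return max(0.0, c_eff)              # C_eff is never negative
```

For a 100 g meal at t_meal = 0, this gives roughly 99.9 at the 60-min peak (index 12) and decays to 0 about 3 h later, matching the rates quoted in the text.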

Insulin information
We use crafted insulin information as one of the four inputs to the proposed RNN model, considering only bolus information for further crafting; in this study, we find that basal insulin has no considerable effect on blood glucose prediction accuracy. There are two pathways of insulin absorption [43], slow and fast. Approximately 67% of delivered insulin passes through the slow channel with an average absorption rate of 0.011 min−1, whereas 33% passes through the fast one with an absorption rate of 0.021 min−1. Therefore, in this experiment, we use the weighted average of these two absorption rates, 0.014 min−1, which for a 5 min interval we denote as R_insulin = 5 × 0.014. We calculate the effective insulin in the body I_eff at any particular time-index t_s as

I_eff = (1 − R_insulin (t_s − t_bolus)) I_bolus,

where I_bolus is the amount of insulin delivered to the patient in the form of a bolus, and t_bolus is the time-index at which the most recent insulin was delivered. We update I_eff at every step with I_eff = max(0, I_eff). This calculated effective insulin I_eff is used as one of the inputs to the RNN model. In the above equation, we do not consider insulin absorption delay.
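A minimal sketch of this effective-insulin feature, assuming the linear decay at R_insulin per time-index described above (absorption delay ignored, as in the text):

```python
R_INSULIN = 5 * 0.014   # weighted absorption rate per 5-min time-index (0.07)

def effective_insulin(t_s, t_bolus, i_bolus):
    """Insulin from a bolus of i_bolus units at time-index t_bolus that is
    still effective at time-index t_s (assumed linear-decay model)."""
    i_eff = (1.0 - R_INSULIN * (t_s - t_bolus)) * i_bolus
    return max(0.0, i_eff)              # clipped at zero, as in the text
```

Under this assumption, a bolus is fully effective at delivery time and decays to zero after roughly 14 time-indexes (about 70 min).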

Step count information
We compute a weighted average S_avg of the number of steps taken by the patient at time-index t_s. To calculate S_avg, we consider the previous 50 min (10 readings at 5 min intervals), where the step count for the most recent time-index has the most significant weight and the weights decrease gradually with older time-indexes:

S_avg = (Σ_{i=0}^{n−1} w_i S(t_s − i)) / (Σ_{i=0}^{n−1} w_i),

where S(t) is the 5-min step count at time-index t, the weights w_i decrease as i grows, and n = 10, as we only consider the previous 50 min (10 data points). The computed S_avg is treated as one of the inputs. Theoretically, an RNN can learn from previous sequential data by its nature. However, we have observed in our extensive experiments that applying additional weights when calculating the step information improves the prediction accuracy to a small extent, but consistently for each of the six patients. This explicit weighting of the recent 50 min out of the 120 min step count history (24 data points) emphasizes the effect of the recent walking pattern, and this emphasis helps the RNN model predict the BG level with marginally better accuracy.
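The weighted step average can be sketched as follows. Linearly decreasing weights are an assumption for illustration; the paper does not spell out the exact weighting scheme:

```python
def weighted_step_avg(steps, n=10):
    """Weighted average of the last n (=10) 5-min step counts.
    `steps` is ordered oldest -> newest; the most recent reading gets the
    largest weight (linearly decreasing weights are assumed here)."""
    recent = steps[-n:]
    weights = list(range(1, len(recent) + 1))   # 1 (oldest) .. n (newest)
    return sum(w * s for w, s in zip(weights, recent)) / sum(weights)
```

For example, a burst of 55 steps in only the most recent 5-min slot yields an average of 10.0, while the same 55 steps placed 50 min ago would contribute far less, reflecting the emphasis on the recent walking pattern.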

Glucose level information from CGM readings
Either the unprocessed or the Kalman smoothed CGM readings are used as one of the four input channels.

Kalman smoothing for preprocessing
In this section, we briefly discuss Kalman filtering and Kalman smoothing. The KS method outputs an interpolated time series of glucose estimates with mean and variance; it can automatically correct errors in the CGM readings, and the estimated variance can be utilized to determine at which times the data are reliable. In our study, KS is used as a pre-processing technique for sensor fault correction in the CGM readings. We use a modified implementation of KS for the OhioT1DM dataset from [41]. The Kalman filter is a technique for estimating the current state of a dynamical system from previous observations. In Kalman filtering, past records of data are used to compute the estimates; thus, the Kalman filter is appropriate for real-time data processing. It is a forward algorithm in which each step is computed analytically. The model and observation can be written as

x_{k+1} = φ x_k + u_k + v_k,
y_k = H x_k + w_k,

where x, u, and y are the system's internal state, the input to the system, and the measured output, respectively, while v is the process noise and w is the measurement noise, both assumed to be zero-mean Gaussian. φ is the transition matrix, and H is the measurement matrix.
In the time update phase, the Kalman filter computes the a priori estimates: the state estimate x̂_k^− and the state covariance matrix P_k^− [44]:

x̂_k^− = φ x̂_{k−1} + u_{k−1},
P_k^− = φ P_{k−1} φ^T + Q.

Then the measurement update phase is performed, where the a posteriori estimates x̂_k and P_k are calculated as follows [44]:

K_k = P_k^− H^T (H P_k^− H^T + R)^{−1},
x̂_k = x̂_k^− + K_k (y_k − H x̂_k^−),
P_k = (I − K_k H) P_k^−,

where Q and R are the process and measurement noise covariances and K_k is the Kalman gain. Kalman smoothing can be applied to obtain better estimates than Kalman filtering; however, it requires the whole dataset to be available when the smoothing is performed, which is the case in our experiment. The Rauch-Tung-Striebel (RTS) algorithm [44] utilizes data both before and after time k to generate the estimate. In RTS, one forward pass through the available data applies the Kalman filter to generate the a priori and a posteriori estimates and covariance matrices; these estimates and covariances are then treated as input to a subsequent backward pass, in which RTS calculates the smoothed estimates x_k^s and P_k^s.
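The forward Kalman filter and the RTS backward pass can be sketched for the scalar case as follows. The model parameters and noise covariances Q and R are illustrative defaults, not the tuning used in [41]:

```python
import numpy as np

def rts_smooth(y, phi=1.0, H=1.0, Q=1.0, R=25.0, x0=None, P0=100.0):
    """Scalar Kalman filter followed by a Rauch-Tung-Striebel backward pass.
    Returns smoothed estimates and their variances for measurements y."""
    n = len(y)
    xf = np.zeros(n); Pf = np.zeros(n)      # a posteriori (filtered)
    xp = np.zeros(n); Pp = np.zeros(n)      # a priori (predicted)
    x, P = (y[0] if x0 is None else x0), P0
    for k in range(n):
        # time update (a priori estimate)
        xp[k] = phi * x
        Pp[k] = phi * P * phi + Q
        # measurement update (a posteriori estimate)
        K = Pp[k] * H / (H * Pp[k] * H + R)
        x = xp[k] + K * (y[k] - H * xp[k])
        P = (1.0 - K * H) * Pp[k]
        xf[k], Pf[k] = x, P
    # backward pass: smoothed estimates use future data as well
    xs = xf.copy(); Ps = Pf.copy()
    for k in range(n - 2, -1, -1):
        C = Pf[k] * phi / Pp[k + 1]
        xs[k] = xf[k] + C * (xs[k + 1] - xp[k + 1])
        Ps[k] = Pf[k] + C * (Ps[k + 1] - Pp[k + 1]) * C
    return xs, Ps

# Toy CGM trace with a PISA-like spike at one reading
y = np.full(40, 100.0)
y[20] = 140.0
xs, Ps = rts_smooth(y)
```

On this toy trace, the smoothed estimate damps the single-reading spike well below 140 mg/dl while leaving the stable baseline near 100, which is exactly the behavior exploited for sensor fault correction.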
We apply Kalman smoothing to the CGM readings from the OhioT1DM dataset to obtain smoothed CGM values. In the OhioT1DM dataset, there are two separate files for each patient (12 data files in total), containing the training and test data, respectively. These training and test data sets are processed separately for each patient.

Modeling
Our model uses a neural network to learn the prediction. A recurrent neural network is a neural network that can model sequential data by sharing weights between the elements of the sequence over time.
There are diverse variants of RNN. In the basic variant, termed the vanilla RNN, the transition function is a linear transformation of the hidden state vector h_{t−1} and the input vector x_t, followed by an activation function for non-linearity:

h_t = tanh(W [h_{t−1}, x_t] + b),

where W is the weight matrix, b is a bias vector, and tanh is the activation function. The classical RNN has the form of a chain of repeating modules of neural networks with straightforward architecture. Theoretically, an RNN can learn long-term dependencies; in practice, however, it suffers from the vanishing and exploding gradient problems as a result of such long-term dependencies, which make the RNN less useful and more challenging to train [45]. Hochreiter and Schmidhuber [22] proposed long short-term memory (LSTM) to address this issue. The LSTM network mitigates the long-term dependency problem to some extent by utilizing the concepts of a memory cell, the gate structure, and the constant error carousel. In the case of our study, LSTM is well suited to model blood glucose levels, as sequential diabetes patient data have dependencies on the immediately preceding entries. Consequently, we prefer the LSTM based RNN model over other model architectures to make the BG prediction more rigorous.
In our work, we build an LSTM network with a 128-element hidden state vector. We add a dropout layer after the first LSTM layer; dropout is intended to reduce overfitting and improve the generalization of the model [46]. The last LSTM layer outputs a vector h_i, which is fed as the input to a fully connected multi-layer network. This network consists of three layers: two dense layers with 512 and 128 neurons, respectively, and an output layer with a single neuron. Each layer has an activation function; we choose the rectified linear unit (ReLU) activation for the two dense layers and the exponential activation for the output layer.
The input of the model is a multi-dimensional sequence of the preprocessed BG level from CGM readings, the carbohydrate amount from meals, the insulin amount from boluses, and step count related data. The output of the model is a prediction of the BG level within a prediction horizon; we experiment with prediction horizons of 30 and 60 min.
Our proposed model adopts the negative log-likelihood (NLL) loss function. We use Adaptive Moment Estimation (Adam) as the optimizer.
We study the correlation between BG prediction accuracy and the depth of the model architecture. Previous work shows that deep recurrent neural networks are empirically superior to shallow networks [47]; a shallow network cannot precisely model information with a temporal hierarchy [48]. However, the concept of depth in an RNN is not the same as in feed-forward neural networks [49]. To make the RNN model deeper, we employ the stacking technique. Our proposed model consists of two LSTM layers: the first LSTM layer produces a sequence output that is fed as the input to the LSTM layer above it. Both LSTM layers have the same internal architecture described earlier. Figure 3 illustrates the architecture of the two-layered stacked LSTM.
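The dataflow of two stacked LSTM layers (the first layer's full hidden sequence feeding the second) can be sketched in NumPy. Random, untrained weights stand in for the trained model; the shapes follow the 128-unit, 4-channel, 2-hour-history configuration described in this paper:

```python
import numpy as np

def lstm_layer(x_seq, h_dim, seed=0):
    """Run one LSTM layer over x_seq of shape (T, in_dim) and return the
    full hidden sequence of shape (T, h_dim). Weights are random: this is
    a shape/dataflow sketch of stacking, not the trained model."""
    T, in_dim = x_seq.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (4 * h_dim, in_dim + h_dim))  # gates: i, f, o, g
    b = np.zeros(4 * h_dim)
    h = np.zeros(h_dim); c = np.zeros(h_dim)
    out = np.zeros((T, h_dim))
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for t in range(T):
        z = W @ np.concatenate([x_seq[t], h]) + b
        i, f, o, g = np.split(z, 4)
        c = sig(f) * c + sig(i) * np.tanh(g)   # memory cell with forget gate
        h = sig(o) * np.tanh(c)
        out[t] = h
    return out

# Two stacked layers: layer 1 emits its whole hidden sequence into layer 2.
x = np.random.default_rng(1).normal(size=(24, 4))  # 2 h history, 4 channels
h1 = lstm_layer(x, 128, seed=2)
h2 = lstm_layer(h1, 128, seed=3)
print(h2.shape)   # (24, 128); the last vector h2[-1] feeds the dense layers
```

The design point of stacking is visible in the shapes: the second layer consumes a (24, 128) hidden sequence rather than the raw 4-channel input, letting it model a coarser temporal hierarchy.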
We also experiment with the GRU cell instead of the LSTM cell. However, in this application, the network with LSTM cells outperforms the network with GRU cells; a comparison of the performance of these two RNN types is shown in the "Results" section. Thus, we choose the LSTM cell for our final RNN model.
We also consider the practicality of self-attention based models for BG prediction. In recent years, models with attention mechanisms, such as the transformer [50], have outperformed naive RNN models, especially in machine translation and natural language processing tasks [51]. However, Mirshekarian et al. [52] show that a neural attention module improves BG prediction performance only for synthetic data; the authors do not find any prediction performance improvement for real data. As we focus on improving BG prediction accuracy for real data only, we do not consider the self-attention mechanism further in this study.

Experimental setup
We conduct extensive experiments on data and our proposed model to tune hyperparameters and determine the optimal setup. We measure the efficacy of our proposed method to predict BG values for two different setups. In the first setup, we use raw CGM readings, whereas, in the second setup, we use Kalman smoothed CGM values for sensor error correction.

Hyperparameters
We experiment with two different RNN cell types: LSTM and GRU cells. The dynamics of the glucoregulatory system are considered nonlinear; hence, learning blood glucose dynamics is a complicated task in which consideration of previous data is crucial for an effective prediction mechanism. As a consequence, we skip experimenting with vanilla RNN cells because of their long-term dependency problem [53]. Our experimental results demonstrate that the LSTM network achieves better prediction accuracy than a network with GRU cells. We train our model with the six patients' training data from the OhioT1DM dataset. We experiment with 30, 60, 120, and 240 min of data history as the input for the proposed model; these contain 6, 12, 24, and 48 input time steps, respectively, as CGM readings are taken at 5 min intervals. We also investigate the effect of the number of LSTM hidden state units and fully connected layers in the network. Different LSTM state sizes of 64, 128, and 248, combined with two fully connected layers, are tested; we find that the network with 128 LSTM states and two fully connected layers outperforms the other configurations. A learning rate of 10^−3 is used for training with a maximum of 6000 epochs; however, an early-stopping patience threshold of 128 epochs is employed for better convergence.

Training, testing, and validation
We use the OhioT1DM dataset for training, validation, and testing purposes. This dataset contains 12 files for the six data contributors; for each person, there is one file of training data and one file of testing data. We split the data in the training file into training and validation sets, partitioning it such that 80% of the data is used for training and the rest for validation. The entire testing file is used for testing purposes for each subject. The purpose of the validation data is to provide an unbiased evaluation of the model while tuning hyperparameters.
In the training phase, the prediction from the model is used to determine the subsequent prediction curve at each epoch. The model can be evaluated at a certain point in time only if at least 24 prior data points (2 h or 120 min of history) are available; this threshold ensures that there are enough data points for the model to make a feasible prediction. For the training process, we set the batch size to 128; we also try batch sizes larger than 128 and find that they worsen the prediction accuracy in terms of RMSE. Finally, we train two instances of the model separately, with raw CGM data and Kalman smoothed CGM data, respectively, along with the other features, including carbohydrates from meals, bolus insulin, and cumulative step counts in a fixed time interval. In this study, we consider the previous 50 min of history for calculating the cumulative step numbers.
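A sketch of how the sliding windows (24 prior 5-min points predicting a value PH steps ahead) and the 80/20 split described above might be constructed; a single toy channel stands in for the 4-channel input:

```python
def make_windows(series, history=24, ph=6):
    """Build (input window, target) pairs: `history` prior 5-min points
    (24 -> 2 h) predict the value `ph` steps ahead (6 -> 30 min PH)."""
    X, y = [], []
    for i in range(history, len(series) - ph + 1):
        X.append(series[i - history:i])     # 2 h of history
        y.append(series[i + ph - 1])        # reading 30 min after the window
    return X, y

series = list(range(100))                   # toy CGM trace
X, y = make_windows(series)

split = int(0.8 * len(X))                   # 80/20 train/validation split
train_X, val_X = X[:split], X[split:]
train_y, val_y = y[:split], y[split:]
```

With a 100-point trace, the first sample's inputs are points 0..23 and its target is point 29, i.e., the reading 30 min after the last input.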

Development environment
We implement our proposed model in Python 2.7, with Tensorflow-gpu v1.8.0 as the machine learning framework. To utilize the GPU's computing capability, we use CUDA 9.0 and cuDNN v7.3.0. The other libraries used in this project are Keras (built into TensorFlow), SciPy, Pandas, NumPy, Matplotlib, and Seaborn. Moreover, we implement the preprocessing module (Kalman smoothing) in Matlab.

Evaluation criteria and results
In this section, we discuss the BG prediction results of our proposed deep RNN model on the OhioT1DM testing dataset. The performance of the model is compared based on accuracy over the 30 and 60 min prediction horizons. The hyperparameters of the model are tuned extensively for optimal results.

Evaluation criteria
Initially, we investigate the usefulness of preprocessing the CGM values with Kalman smoothing. We compare the deviation of the raw CGM readings and the preprocessed CGM values from the ground-truth reference value; in our analysis, we assume the BG level measured by fingerstick to be the ground truth. The glucose readings sampled by the sensor in interstitial fluid can differ substantially from blood glucose values measured at the same time [39]. As a result, CGM manufacturers suggest that patients use capillary blood glucose measurements before making any treatment decisions. Moreover, self-measurement of blood glucose (SMBG) has been used as a reference for comparing the accuracy of different CGM systems [54], and fingerstick testing is one of the most convenient SMBG methods. These are the principal reasons for choosing the fingerstick reading as the ground truth when comparing raw and preprocessed CGM values. We use the root-mean-square error (RMSE) to calculate the difference between CGM values (raw or preprocessed) and the fingerstick reference BG values at a particular time.
RMSE = sqrt((1/N) Σ_{i=1}^{N} (y_i − x_i)^2),

where y denotes the CGM values (either raw or smoothed CGM values), x denotes the fingerstick reference BG values, and N is the number of paired readings.
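The RMSE comparison can be computed directly. The paired readings below are hypothetical, purely to illustrate the calculation:

```python
import math

def rmse(y, x):
    """Root-mean-square error between paired readings y and x."""
    return math.sqrt(sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(y))

# Hypothetical readings (mg/dl) taken at the fingerstick measurement times
fingerstick = [110, 150, 95, 180]
raw_cgm     = [122, 138, 104, 168]
print(round(rmse(raw_cgm, fingerstick), 2))   # 11.32
```

The same function applied to the Kalman smoothed values in place of `raw_cgm` yields the second column of the comparison; a lower value means the smoothed series tracks the fingerstick ground truth more closely.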
We find that the preprocessed CGM values are much closer to the fingerstick BG readings than the raw CGM values. Table 3 compares the accuracy, in terms of RMSE, of raw CGM and Kalman smoothed CGM values with respect to fingerstick BG readings. It shows that both the raw CGM readings (measured from interstitial fluid) and the preprocessed CGM values differ from the fingerstick readings (measured from blood); most importantly, however, the processed CGM values have a lower error than the raw CGM values. Thus, we conclude that the model trained with Kalman smoothed CGM values, along with the other features, is more effective at forecasting the BG level.
However, the performance of a model can be evaluated with different criteria. Among them, the root-mean-square error (RMSE) between the reference CGM values and the predicted BG level is one of the most widely adopted measures of BG prediction accuracy. Thus, we evaluate the performance of our model in terms of RMSE in this paper:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}(i \mid i - \mathrm{PH}) - y(i)\big)^2}$$

where ŷ(i | i − PH) denotes the model's prediction given the previous data, y denotes the reference CGM reading, and N is the number of data points.

Results for the dataset with unprocessed/raw CGM readings
In this section, we discuss the BG forecasting accuracy for the proposed stacked RNN model for the OhioT1DM dataset, where CGM readings are raw/unprocessed.
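Before looking at the results, it may help to make the prediction-horizon setup concrete. With the 5-min sampling interval of the CGM sensor, a PH of 30 or 60 min corresponds to predicting 6 or 12 samples ahead. A minimal sketch of building supervised input/target pairs for such a horizon (the window length of 12 here is a hypothetical choice for illustration, not the paper's exact history length):

```python
def make_windows(series, history, steps_ahead):
    """Build (input window, target) pairs: each input is `history`
    consecutive samples, and the target is the sample `steps_ahead`
    points past the end of the window (PH / sampling interval)."""
    pairs = []
    for start in range(len(series) - history - steps_ahead + 1):
        window = series[start:start + history]
        target = series[start + history + steps_ahead - 1]
        pairs.append((window, target))
    return pairs

# 5-min sampling: PH = 30 min -> steps_ahead = 6.
cgm = list(range(100, 200, 5))  # toy monotone CGM trace, mg/dl
pairs = make_windows(cgm, history=12, steps_ahead=6)
x0, y0 = pairs[0]
print(len(x0), y0)
```

Each such pair is one training example; for PH = 60 min, `steps_ahead` becomes 12 and the usable number of pairs shrinks accordingly.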
Initially, we investigate the effect of the depth of the LSTM layers on the prediction accuracy of the model. In our experiments, we consider prediction horizons of 30 and 60 min. The results for models with a single LSTM layer and with stacked LSTM layers are summarized in Table 4. They demonstrate that the RNN model with a stacked LSTM architecture performs better than the RNN with a single LSTM layer in all cases. The proposed (stacked LSTM) model's predictions for patients #570 and #575 over PH = 30 min are illustrated in Fig. 4. The proposed model yields the lowest RMSE for patient #570 and the highest for patient #575.
We also experiment with a single GRU cell-based RNN for BG level prediction. However, both the single and stacked LSTM models provide better RMSE than the model with the GRU cell, whose mean RMSE over the six patients is 20.07 and 31.12 for PH = 30 and 60 min, respectively.

Results for the dataset with Kalman smoothed CGM readings
In this section, the predictive accuracy of the proposed model with stacked LSTM layers is evaluated on the testing dataset preprocessed with the Kalman smoothing technique described in the "Kalman smoothing for preprocessing" section. Note that, in our study, only the CGM values are preprocessed; the rest of the features remain the same. We then estimate the RMSE of the model's predictions with the preprocessed CGM values and with the raw CGM values, respectively, from the testing dataset, as our goal is to lower the difference between the predicted CGM values and the real fingerstick blood glucose readings. The preprocessing techniques are described in the "Feature Extraction from Physiological Information" section. Table 5 presents the forecasting accuracy (RMSE) of the model trained with the preprocessed (Kalman smoothed CGM) dataset and with the original (raw CGM) dataset. Figure 5 demonstrates that preprocessing the CGM readings with Kalman smoothing improves the prediction accuracy to a substantial extent.
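To illustrate what fixed-interval Kalman smoothing does to a CGM trace, the sketch below implements a generic 1-D Rauch-Tung-Striebel (RTS) smoother under a random-walk state model. The noise variances `q` and `r` are illustrative placeholders, not the tuned values used in our experiments:

```python
def kalman_smooth(z, q=0.05, r=4.0, p0=1.0):
    """Fixed-interval (RTS) Kalman smoothing of a 1-D CGM series under
    a random-walk model: x_k = x_{k-1} + w_k, z_k = x_k + v_k, with
    process noise variance q and measurement noise variance r."""
    n = len(z)
    xf = [0.0] * n; pf = [0.0] * n  # filtered state estimates/variances
    xp = [0.0] * n; pp = [0.0] * n  # one-step predictions
    x, p = z[0], p0
    for k in range(n):
        # Predict step (random walk: state carries over, variance grows).
        xp[k], pp[k] = x, p + q
        # Update step with measurement z[k].
        g = pp[k] / (pp[k] + r)          # Kalman gain
        x = xp[k] + g * (z[k] - xp[k])
        p = (1.0 - g) * pp[k]
        xf[k], pf[k] = x, p
    # Backward (RTS) pass: refine each estimate using future data.
    xs = xf[:]
    for k in range(n - 2, -1, -1):
        c = pf[k] / pp[k + 1]
        xs[k] = xf[k] + c * (xs[k + 1] - xf[k])
    return xs

# An isolated spike (a possible sensor error) is pulled toward its neighbours.
raw = [120.0, 122.0, 180.0, 124.0, 126.0]
smooth = kalman_smooth(raw)
```

Because the backward pass uses future as well as past measurements, an isolated spike is attenuated far more strongly than by causal filtering alone, which is why abrupt sensor artifacts become less prominent in the preprocessed dataset.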

Analysis and result comparison
Initially, we compare our work with different machine learning models for the OhioT1DM dataset. Xie et al. provide a BG prediction performance comparison of several traditional machine learning approaches for the OhioT1DM dataset in their work [55]; we use their experimental results for comparison. The results in Table 6 show that the LSTM model outperforms the conventional machine learning models. Our proposed approach is also more generalized, as the prediction RMSE for all six patients is uniformly improved. As a consequence, we do not experiment further with traditional machine learning approaches.
From Table 4, it is evident that forecasting becomes more difficult with a wider prediction horizon. However, we observe that deep approaches with stacked LSTM layers provide advantages over the shallow model with a single LSTM layer in forecasting the BG level, particularly for a higher prediction horizon. As a result, we choose the deeper model for the final experiment.
From Table 5, we can observe that the prediction RMSE per patient ranges from 15.94 to 20.94 for the model trained with raw CGM readings and from 4.73 to 8.54 for the model trained with processed CGM readings. The predictions for patients #575 and #591 are comparatively less accurate than for the other patients, whereas we achieve better RMSE for patients #570 and #588. There are several possible reasons for this difference. First, #575 and #591 have a larger number of missing data points. Second, as shown in Fig. 4, the 12-h period BG level of #570 contains less fluctuation than that of #575. It is noteworthy that the prediction error is highest around the spikes and turning regions of the CGM trajectory. Furthermore, there is a marginal time delay in the prediction curve, which contributes to the prediction error. Such abrupt fluctuations in the training and testing datasets make it difficult for the model to learn and predict the BG level accurately. One possible reason for such fluctuation is the complicated dynamics of the glucoregulatory system; another is sensor fault in the CGM system. We mitigate such undesirable errors by applying the Kalman smoothing technique, which makes the CGM values less prone to abrupt fluctuations. This in turn boosts the learning capability of the model, resulting in significantly better prediction accuracy. Figure 5 illustrates how error correction of the CGM readings enhances the precision of the prediction curve for a particular 12-h time window for patient #563; we use patient #563 because this patient's dataset has average fluctuation. Note that the SD (light blue region) is remarkably smaller for the predictions made with the model trained on processed data. Hence, the model trained with the error-corrected dataset (Fig. 5, right) forecasts BG with more confidence than the model trained with the unprocessed dataset (Fig. 5, left).
Moreover, we notice a reduction in the prediction time delay with the proposed model.
We provide a performance comparison of our work with related works on the OhioT1DM dataset in Table 7. For comparison, we consider previous works [24, 25, 56] that use the OhioT1DM dataset for evaluation. Table 7 demonstrates that our proposed model with KS provides the best accuracy for every patient among all the related works. Even without utilizing KS, we achieve the best accuracy for four patients out of six; for patients #563 and #570, the work in [24] has slightly better accuracy than ours. [25] and [24] use an LSTM and a dilated RNN for BG prediction, respectively, whereas [56] uses a convolutional neural network (CNN) trained on the OhioT1DM dataset. To the best of our knowledge, we achieve the leading average prediction accuracy for the OhioT1DM dataset.

Discussion
Our proposed method provides a notably precise prediction mechanism for the typical blood glucose range (70-180 mg/dl). However, our detailed experimental results suggest that prediction accuracy decreases slightly when the blood glucose values (ground truth) lie outside this ideal range. The primary reason might be the smoothing process: with the KS preprocessing technique, spikes and extreme CGM readings are smoothed and become less sharp. The smoothed values are slightly biased toward the average BG value and tend to lie in the normal range. Since these smoothed values are used in the training steps, the learning capability of the model is marginally reduced, particularly for extremely high or low BG values. Further study might improve the prediction accuracy for the extremely high or low blood glucose ranges.
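The bias described above is a generic property of smoothing. The toy illustration below uses a centered moving average as a stand-in smoother (our method uses Kalman smoothing, which behaves similarly at sharp peaks) to show how an extreme reading is pulled toward its neighbours:

```python
def moving_average(xs, w=3):
    """Centered moving average with window w; a stand-in smoother used
    only to illustrate peak attenuation (not the paper's KS method)."""
    half = w // 2
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

# A hyperglycemic peak (mg/dl) is attenuated after smoothing, so the
# training data under-represents how extreme the true excursion was.
trace = [150.0, 170.0, 210.0, 175.0, 155.0]
smooth = moving_average(trace)
print(max(trace), max(smooth))  # the smoothed peak is lower than the raw peak
```

Since the model trains on the attenuated values, its predictions for the hypo- and hyperglycemic extremes inherit this bias toward the normal range.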
Another point is that there are some missing data in the OhioT1DM dataset. Patients #575 and #591 have 1309 and 782 missing data points, respectively, with continuous gaps observed between Dec 26 and Dec 28 and between Dec 26 and Jan 5, whereas patient #570 has only 649 missing data points. These missing data might have an adverse effect on prediction. Moreover, the device used for acquiring the patients' data is fixed, namely the Medtronic Enlite CGM sensor, and its slow sampling rate limits the size of the dataset. In our future work, we would consider a faster sampling rate for patient data acquisition. Also, the number of patients and their racial diversity are limited in the dataset. We plan to collect more patient data with several devices from a broad and racially diverse patient base and use those data to explore more generalized approaches to BG forecasting. Additionally, we plan to work on glycemic event detection, especially hypoglycemic event detection. There might also be scope for further experiments to assess the potential of the self-attention mechanism for BG prediction. Furthermore, precise glucose level prediction might help to address different types of attacks on wireless insulin pumps [57][58][59][60].

Conclusion
This work investigated methods for more accurate blood glucose prediction. We demonstrated that preprocessing the CGM readings with Kalman smoothing for sensor error correction can improve the robustness of BG prediction. We utilized different physiological information, such as meals, insulin, aggregated step counts, and preprocessed CGM data, in our method, and proposed a novel approach leveraging a stacked LSTM based deep RNN model to improve BG prediction accuracy. It is evident from our study that preprocessing the CGM values with Kalman smoothing makes the BG prediction curve less uncertain and less fluctuating. Our proposed approach provides more reliable predictions than traditional methods, with fingerstick BG readings assumed as the ground truth in our experiments. Our aim is to lower the difference between the predicted CGM values and the real fingerstick blood glucose readings, and the BG prediction with Kalman smoothed CGM data is closer to the actual BG level (the fingerstick BG reading) than the prediction without it. This more accurate prediction can help diabetes patients avoid adverse glycemic events. Our proposed methodologies could be employed to gain insight into T1D patients' future BG level trends, resulting in a more dependable diabetes management system.