Prefix span
Prefix Span [22] is an efficient sequential pattern mining algorithm that grows patterns incrementally from prefixes and does not generate candidate sequences. The basic idea is that the database is not projected with respect to every possible frequent subsequence. Only the prefix subsequences are examined, and their corresponding suffix subsequences are projected into projection databases. In each projection database, sequential patterns are grown by exploring local frequent patterns, and the choice of the support threshold and the risk threshold is key to the algorithm [23]. Prefix Span overcomes the disadvantage of FreeSpan in constructing projection databases, and its mining process can be viewed as growing a tree of patterns. In terms of mining time, Prefix Span has a significant advantage over other algorithms such as FreeSpan [24], Apriori [25], and GSP [26]. Therefore, the Prefix Span algorithm is selected as the sequential pattern mining algorithm in this paper, and its relevant definitions are listed below.
Definition 1
Subsequence. For sequences \({\text{A}} = < a_{1} ,a_{2} , \ldots ,a_{n} >\) and \({\text{B}} = < b_{1} ,b_{2} , \ldots ,b_{m} >\left( {n \le m} \right)\), if there exist integers \(1 \le j_{1} < j_{2} < \ldots < j_{n} \le m\) such that \(a_{1} \subseteq b_{j_{1}} ,a_{2} \subseteq b_{j_{2}} , \ldots ,a_{n} \subseteq b_{j_{n}}\), then A is a subsequence of B.
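The containment test in Definition 1 can be illustrated with a minimal sketch. It assumes that each element of a sequence is an itemset (modelled here as a Python `frozenset`) and that each element of A must be matched to a distinct, later element of B; the function name `is_subsequence` is hypothetical and not taken from the paper.

```python
from typing import FrozenSet, Sequence

Itemset = FrozenSet[str]  # assumption: one sequence element is a set of items


def is_subsequence(a: Sequence[Itemset], b: Sequence[Itemset]) -> bool:
    """Return True if every element of a is contained in a distinct element
    of b, with the matched elements of b appearing in the same order."""
    j = 0
    for elem in a:
        # Greedily take the earliest remaining element of b that contains elem.
        while j < len(b) and not elem <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next element of a must match strictly later in b
    return True
```

Greedy earliest matching is sufficient here, because matching an element of A as early as possible never rules out a later match.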
Definition 2
Support threshold. For a sequence database D, the support of a sequence S, denoted sup, is the ratio of the number of sequences in D that contain S to the total number of sequences in D. The support threshold is the minimum support that a sequence must reach.
Definition 3
Frequent sequences. A sequence A is frequent in the sequence set if the support of A in the sequence set is greater than or equal to the support threshold.
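Definitions 2 and 3 can be made concrete with a small sketch. For brevity it assumes that every sequence element is a single symbol, so that a sequence can be written as a string; the names `contains`, `support`, and `is_frequent` are hypothetical.

```python
def contains(sequence: str, pattern: str) -> bool:
    """True if pattern is an order-preserving, not necessarily contiguous,
    subsequence of sequence."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must occur in order.
    return all(item in it for item in pattern)


def support(database: list, pattern: str) -> float:
    """Fraction of the sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in database) / len(database)


def is_frequent(database: list, pattern: str, min_sup: float) -> bool:
    """A pattern is frequent if its support reaches the support threshold."""
    return support(database, pattern) >= min_sup
```

For the toy database `["abcd", "acd", "bd", "abd"]`, the pattern `"ad"` is contained in three of the four sequences, so its support is 0.75.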
Definition 4
Prefix. Given sequences \({\text{A}} = < a_{1} ,a_{2} , \ldots ,a_{n} >\) and \({\text{B}} = < b_{1} ,b_{2} , \ldots ,b_{m} > \left( {m \le n} \right)\), if \(b_{i} = a_{i} \left( {i \le m - 1} \right)\), \(b_{m} \subseteq a_{m}\), and all items in \(\left( {a_{m} - b_{m} } \right)\) come after those in \(b_{m}\), then B is a prefix of A.
Definition 5
Projection. Given sequences A and B such that B is a subsequence of A, a subsequence A′ of A is the projection of A with respect to the prefix B if B is a prefix of A′ and A′ is the largest subsequence of A that satisfies this condition.
Definition 6
Suffix. Given sequences \({\text{A}} = < a_{1} ,a_{2} , \ldots ,a_{n} >\) and \({\text{B}} = < b_{1} ,b_{2} , \ldots ,b_{m'} > \left( {m \le n} \right)\), where B is a subsequence of A. If \(A' = < b_{1} ,b_{2} , \ldots ,b_{p} > (m < p \le n)\) is the projection of sequence A with respect to the subsequence B, then the suffix of A′ with respect to B is \(< b_{m''} ,b_{m + 1} , \ldots ,b_{p} >\), where \(b_{m''} = (b_{m} - b_{m'} )\).
Definition 7
Projection database. Suppose A is a sequential pattern in the sequence database S. The projection database of A, denoted S|A, is the collection of the suffixes, with respect to the prefix A, of all sequences in S that have A as a prefix.
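The definitions above combine into the pattern-growth loop of Prefix Span. The following is a minimal sketch rather than the implementation used in this paper: it assumes single-item elements (so the itemset cases of Definitions 4 and 6 do not arise), uses an absolute support count `min_count` instead of a ratio, and the function name `prefixspan` is hypothetical.

```python
from collections import defaultdict


def prefixspan(database, min_count):
    """Return {pattern (tuple of items): support count} for all patterns
    that occur in at least min_count sequences."""
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per suffix in the projection database.
        counts = defaultdict(int)
        for suffix in projected:
            for item in set(suffix):
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project: keep what follows the first occurrence of the item.
            new_projected = [s[s.index(item) + 1:]
                             for s in projected if item in s]
            mine(pattern, new_projected)

    mine((), [tuple(seq) for seq in database])
    return results
```

Each recursive call only scans the suffixes that survive projection, which is why no candidate sequences need to be generated and tested against the full database.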
Longest common subsequence
The longest common subsequence algorithm [27] aims to obtain the longest sequence that is common to two or more known sequences. Given two sequences X and Y, a sequence Z is a common subsequence of X and Y when Z is a subsequence of both X and Y. If Z is a common subsequence of X and Y and every other common subsequence of X and Y is no longer than Z, then Z is defined as a longest common subsequence of X and Y. Each element of such a subsequence belongs to the original sequences and keeps their order, but the elements are not necessarily contiguous. For example, for the sequences X = <ffghj> and Y = <fghfgh>, the common subsequences include the contiguous sequence <fgh> as well as the non-contiguous sequence <ffgh>, so the longest common subsequence of the two is <ffgh>. At present, the brute force method and the dynamic programming algorithm are applied to solve longest common subsequence problems [28, 29].
In this paper, dynamic programming is applied to solve the longest common subsequence problem in four steps: state division, state identification, state transition, and boundary determination. For sequences X of length m and Y of length n, a two-dimensional array C[i, j] records the length of the longest common subsequence of the prefixes \(< x_{1} ,x_{2} , \ldots ,x_{i} >\) of X and \(< y_{1} ,y_{2} , \ldots ,y_{j} >\) of Y. Considering the recursive relation between the subproblem solutions, the recursive equation is obtained as follows:
$$C\left[ {i,j} \right] = \begin{cases} 0, & i = 0\;\text{or}\;j = 0 \\ C\left[ {i - 1,j - 1} \right] + 1, & i,j > 0\;\text{and}\;x_{i} = y_{j} \\ \max \left( {C\left[ {i,j - 1} \right],C\left[ {i - 1,j} \right]} \right), & i,j > 0\;\text{and}\;x_{i} \ne y_{j} \end{cases} \tag{1}$$
The longest common subsequence is recovered by backtracking from the lower right corner of the constructed matrix. The procedure is as follows. First, initialize the first row and the first column of the two-dimensional array to 0. If the two elements compared at position (i, j) are equal, then C[i, j] = C[i − 1, j − 1] + 1; if they differ, then C[i, j] is the larger of C[i, j − 1] and C[i − 1, j]. The values at all positions of the array are computed in turn in this way. Once every element has been filled in, the length of the longest common subsequence is the value in the lower right corner of the array. Finally, the completed array is used to backtrack all longest common subsequences: starting from the lower right corner of the matrix, all routes are searched exhaustively.
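The table construction and the backtracking step can be sketched as follows. This is a minimal illustration under the assumption that the sequences are strings of single-character symbols; the names `lcs_table` and `all_lcs` are hypothetical, and memoisation is added only to avoid revisiting the same cell during the exhaustive search.

```python
from functools import lru_cache


def lcs_table(x: str, y: str):
    """Fill C[i][j] with the LCS length of x[:i] and y[:j] (Eq. 1)."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i][j - 1], c[i - 1][j])
    return c


def all_lcs(x: str, y: str):
    """Backtrack from the lower right corner and return every distinct
    longest common subsequence, sorted alphabetically."""
    c = lcs_table(x, y)

    @lru_cache(maxsize=None)
    def back(i, j):
        if i == 0 or j == 0:
            return frozenset({""})
        if x[i - 1] == y[j - 1]:
            return frozenset(s + x[i - 1] for s in back(i - 1, j - 1))
        found = set()
        if c[i - 1][j] == c[i][j]:   # moving up keeps the optimal length
            found |= back(i - 1, j)
        if c[i][j - 1] == c[i][j]:   # moving left keeps the optimal length
            found |= back(i, j - 1)
        return frozenset(found)

    return sorted(back(len(x), len(y)))
```

For X = "ffghj" and Y = "fghfgh", the lower right corner of the table holds 4 and the only longest common subsequence is "ffgh", in agreement with the example given earlier.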
Hypoglycemia early alarm
Definition of hypoglycemia early alarm problem
Based on Prefix Span, this paper mines hypoglycemia early alarm sequence patterns and builds the hypoglycemia early alarm sequence library. The basic strategy for constructing the hypoglycemia model library is as follows:
(1) Obtain the existing CGM records.
(2) Classify the preprocessed CGM time series data according to the blood glucose values in a fixed continuous period, and build the hypoglycemia pattern library from the classification results.
(3) Remove frequent and redundant sequences from the database.
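The pruning in step (3) could be sketched as below. The paper does not spell out the pruning rules at this point, so the sketch makes three loudly stated assumptions: patterns are strings of symbols, a pattern is dropped if it also appears among the non-alarm patterns or is shorter than a minimum length, and a pattern counts as redundant if it is a subsequence of a longer pattern that is kept. The names `is_subseq` and `prune_library` are hypothetical.

```python
def is_subseq(short: str, long: str) -> bool:
    """True if short is an order-preserving subsequence of long."""
    it = iter(long)
    return all(ch in it for ch in short)


def prune_library(alarm_patterns, non_alarm_patterns, min_length):
    """Return the alarm patterns that survive pruning, longest first."""
    non_alarm = set(non_alarm_patterns)
    # Drop duplicates, short patterns, and patterns shared with non-alarm data.
    candidates = sorted(
        {p for p in alarm_patterns if len(p) >= min_length and p not in non_alarm},
        key=lambda p: (-len(p), p),
    )
    kept = []
    for pattern in candidates:
        # Assumed redundancy rule: already covered by a kept, longer pattern.
        if not any(is_subseq(pattern, other) for other in kept):
            kept.append(pattern)
    return kept
```

Sorting the candidates from longest to shortest means each pattern only has to be compared with patterns that could cover it.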
In this paper, the early alarm of hypoglycemia based on sequential pattern mining is divided into two parts: the construction of the hypoglycemia frequent sequence library and the pattern matching of the target blood glucose sequence. The concepts related to pattern mining of the hypoglycemia early alarm sequence are defined as follows [30].
Definition 1
Alarm event. According to the alarm rule (blood glucose ≤ 3.9 mmol/L [31]), the starting index \(S_{i}\) and the ending index \(S_{i} + L_{w}\) of the alarm sequence are determined, as shown in Fig. 1.
Definition 2
Early alarm sequence. Given a fixed window length \(L_{e}\), the early alarm sequence consists of the \(L_{e}\) glucose values immediately preceding the alarm sequence. That is, the values starting at sequence index (\(S_{i} - L_{e}\)) form the early alarm sequence of length \(L_{e}\).
Definition 3
Non-alarm sequence. A non-alarm sequence is the glucose sequence lying between one alarm sequence and the next, that is, the sequence formed by the values at sequence indices \((S_{i} - L_{p})\) ~ \((S_{i} - L_{e})\). If a non-alarm sequence is longer than the alarm sequence, it must be processed in advance so that the alarm and non-alarm sequences are mined at the same sequence length.
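Definitions 1 and 2 can be illustrated with a short sketch that locates alarm events in a list of CGM readings and cuts out the preceding early alarm sequences. It assumes that an alarm event is a maximal run of consecutive readings at or below the threshold, that the end index is exclusive (so it equals the start index plus the run length), and that events with fewer than `window` preceding readings are skipped. The function names are hypothetical, and overlap between an early alarm window and a previous alarm event is not handled.

```python
def find_alarm_events(glucose, threshold=3.9):
    """Return (start, end) index pairs of maximal runs with
    glucose <= threshold; end is exclusive."""
    events = []
    start = None
    for idx, value in enumerate(glucose):
        if value <= threshold:
            if start is None:
                start = idx          # a new alarm event begins here
        elif start is not None:
            events.append((start, idx))
            start = None
    if start is not None:            # the record ends during an alarm event
        events.append((start, len(glucose)))
    return events


def early_alarm_sequences(glucose, window, threshold=3.9):
    """Return the `window` readings immediately before each alarm event."""
    sequences = []
    for start, _ in find_alarm_events(glucose, threshold):
        if start >= window:          # skip events without enough history
            sequences.append(glucose[start - window:start])
    return sequences
```

With readings in mmol/L, the default threshold follows the alarm rule of Definition 1; other thresholds can be passed to build libraries for other alarm levels.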
Multi-level hypoglycemia early alarm
The pattern matching of the hypoglycemia sequence predicts the future glucose trend from the similarity between sequences, which offers a certain advantage in prediction accuracy for early alarm. There is a causal relationship between the construction of the frequent sequential pattern libraries of hypoglycemia and the selection of the hypoglycemia threshold. A framework with multiple thresholds is obtained by setting different hypoglycemia thresholds. Different thresholds can be given according to clinical needs or the actual needs of the patients. In this paper, the hypoglycemia thresholds [32] (such as 3.0 mmol/L, 3.9 mmol/L and 4.4 mmol/L) correspond to the level-I, level-II and level-III libraries, which in turn trigger different early alerts. The details are as follows:
- Level-I (Red Alert with threshold 3.0 mmol/L): represents the highest level of alert, which requires immediate measures to prevent adverse symptoms such as coma;
- Level-II (Orange Alert with threshold 3.9 mmol/L): means that the glucose level is approaching the diagnostic limit of hypoglycemia;
- Level-III (Yellow Alert with threshold 4.4 mmol/L): signals the possible occurrence of hypoglycemia, which suggests that attention should be paid to the blood glucose dynamics.
By matching the real-time glucose sequence against the frequent sequence libraries of the three early alarm levels, different matching results are obtained and an early alert of the corresponding level is issued, as shown in Fig. 2. Different levels of early alarm enable diabetic patients to adopt treatment strategies appropriate to each level, which helps avoid the situation in which the warning time is too short to respond. Therefore, a reasonable early alarm of hypoglycemia is an important means of effectively preventing and controlling adverse events.
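The three thresholds can be written down as a small lookup. The sketch below only shows how a single glucose value would be assigned to the most severe level whose threshold it reaches, for example when labelling alarm events for the three libraries; it is not the sequence matching itself. The "less than or equal" comparison is an assumption carried over from the alarm rule, and the names `LEVELS` and `alert_level` are hypothetical.

```python
# (level, alert colour, threshold in mmol/L), ordered from most to least severe
LEVELS = [
    ("I", "red", 3.0),
    ("II", "orange", 3.9),
    ("III", "yellow", 4.4),
]


def alert_level(glucose_mmol_l):
    """Return (level, colour) of the most severe threshold reached,
    or None if the value is above every threshold."""
    for level, colour, threshold in LEVELS:
        if glucose_mmol_l <= threshold:
            return level, colour
    return None
```

Checking the thresholds from the lowest upward ensures that a value of 2.8 mmol/L is reported as level I rather than level III.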
The flow chart of multi-level hypoglycemia early alarm
In this paper, the idea of subsequence pattern matching is applied to hypoglycemia early alarm: the longest common subsequence is computed between the target sequence in the test set and the frequent subsequences in the early alarm pattern library. The overall process is illustrated in Fig. 3. The main steps of the multi-level hypoglycemia early alarm are as follows:
- Step 1: In the original time series data, the alarm event data are screened out according to the hypoglycemia early alarm rules of the different levels, and the blood glucose data of the patients are divided into an early alarm sequence set and a non-alarm sequence set.
- Step 2: Symbolic mapping rules are used to symbolize the CGM records. On the one hand, this sidesteps the high dimensionality of the time series; on the other hand, it addresses the problem that the projection for each prefix may produce a large number of single items, which improves storage efficiency and speeds up processing.
- Step 3: The Prefix Span algorithm mines the frequent glycemic sequence patterns of the early alarm and non-alarm sequences, producing the preliminary multi-level glucose sequence set.
- Step 4: The longest common subsequence algorithm is used to delete frequent subsequences shorter than the minimum length threshold from the alarm and non-alarm sequence libraries. This step is needed because the glucose sequence set formed in Step 3 contains all frequent sequence patterns of every level that meet the support threshold, including frequent subsequences that are too long or too short. The minimum frequent subsequence length \(L_{min}\) is defined from the length of the early alarm sequence \(L_{e}\) to reduce the probability of false positives.
- Step 5: The longest common subsequence is used to remove from the frequent alarm pattern library the frequent sequences that are common to the alarm and non-alarm sequences. The longest common subsequence is then applied again to remove the redundant patterns in the current alarm sequence library, which completes the library.
- Step 6: Multiple frequent sequence alarm pattern libraries of different levels are obtained at this point. Meanwhile, a sliding window with the same fixed size as the alarm sequence is moved over the real-time CGM records to obtain the real-time sequences to be matched.
- Step 7: Match each real-time glucose sequence against the hypoglycemia early alarm pattern libraries of the different levels by the longest common subsequence.
- Step 8: If the match is unsuccessful, the next sequence is matched; if the match is successful, a hypoglycemia alert of the corresponding level is issued, with the three levels corresponding to yellow, orange, and red alerts, respectively.
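Steps 2 and 6 to 8 can be sketched end to end as follows. Several details are assumptions made only for illustration, since the text above does not fix them: the symbolic mapping uses hypothetical bin edges and letters, a match is declared when the longest common subsequence covers at least `min_ratio` of a library pattern, and the libraries are checked from the most severe level downward. All function names are hypothetical.

```python
def symbolize(glucose, edges=(3.0, 3.9, 4.4, 7.0), symbols="abcde"):
    """Map each reading to a letter by counting the bin edges it exceeds
    (assumed mapping; the paper's symbolic rules may differ)."""
    return "".join(symbols[sum(value > edge for edge in edges)] for value in glucose)


def lcs_length(x, y):
    """Length of the longest common subsequence, keeping one table row."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if xi == yj else max(cur[j - 1], prev[j]))
        prev = cur
    return prev[-1]


def match_level(window, libraries, min_ratio=1.0):
    """Return the most severe level whose library matches the window."""
    for level in ("I", "II", "III"):
        for pattern in libraries.get(level, []):
            # Assumed match rule: the LCS covers min_ratio of the pattern.
            if lcs_length(window, pattern) >= min_ratio * len(pattern):
                return level
    return None


def sliding_alerts(glucose, libraries, window_size):
    """Slide a fixed window over the symbolized record and collect
    (window start index, level) for every successful match."""
    symbols = symbolize(glucose)
    alerts = []
    for start in range(len(symbols) - window_size + 1):
        level = match_level(symbols[start:start + window_size], libraries)
        if level is not None:
            alerts.append((start, level))
    return alerts
```

With `min_ratio=1.0` a library pattern must appear in full, in order, inside the window; lowering the ratio would tolerate partial matches at the cost of more false alarms.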