AliClu - Temporal sequence alignment for clustering longitudinal clinical data

Background Patient stratification is a critical task in clinical decision making since it can allow physicians to choose treatments in a personalized way. Given the increasing availability of electronic medical records (EMRs) with longitudinal data, one crucial problem is how to efficiently cluster the patients based on the temporal information from medical appointments. In this work, we propose applying the Temporal Needleman-Wunsch (TNW) algorithm to align discrete sequences with the transition time information between symbols. These symbols may correspond to a patient’s current therapy, their overall health status, or any other discrete state. The transition time information represents the duration of each of those states. The obtained TNW pairwise scores are then used to perform hierarchical clustering. To find the best number of clusters and assess their stability, a resampling technique is applied. Results We propose the AliClu, a novel tool for clustering temporal clinical data based on the TNW algorithm coupled with clustering validity assessments through bootstrapping. The AliClu was applied for the analysis of the rheumatoid arthritis EMRs obtained from the Portuguese database of rheumatologic patient visits (Reuma.pt). In particular, the AliClu was used for the analysis of therapy switches, which were coded as letters corresponding to biologic drugs and included their durations before each change occurred. The obtained optimized clusters allow one to stratify the patients based on their temporal therapy profiles and to support the identification of common features for those groups. Conclusions The AliClu is a promising computational strategy to analyse longitudinal patient data by providing validated clusters and by unravelling the patterns that exist in clinical outcomes. Patient stratification is performed in an automatic or semi-automatic way, allowing one to tune the alignment, clustering, and validation parameters. The AliClu is freely available at https://github.com/sysbiomed/AliClu.


Introduction Introduction
In this Additional File we present all the details required for pre-processing data, from panel to prefix-encoded sequences, needed by the TNW algorithm. Afterwards, we present the temporal Needleman-Wunsch algorithm and results of the AliClu algorithm on synthetic data.

Pre-Processing
Longitudinal data is usually available in a panel format. Therein, each row or line corresponds to a medical appointment of a certain patient; the columns contain the several features measured during the medical appointments.
In this work, we consider that each patient experiences a sequence of events spaced in time. Let A and B be events of interest for a given patient, with time-distance t between them; a prefix-encoded (PE) sequence for that patient is defined as 0.A, t.B. Three variables are required: id patient, event and time. The id patient refers to the unique identifier of the patient, event is the observed feature and time is the variable that provides the temporal information. For each patient, a temporal sequences is built as follows. The observed feature event is always categorical. If a numerical numerical one is desired a discretization step must be done before proceeding. In addition, the variable time can appear formatted as a date or just as a number in any time unit (seconds, minutes, days, etc.). Depending on the type of the time variable we can have two different pre-processing steps, as described in the next sections.

Pre-Processing I: time is a number
In this case the time variable is just a number. An example of raw data that needs this type of pre-processing is presented in Table 1. Each row shows an event experienced by a id patient event time  3. Create the prefix-encoded temporal sequences.
For illustration purposes, the temporal sequences obtained with the data presented in  We note that event Z is artificially introduced to mark the end of a sequence, hence, the last observed event of a patient is given by the penultimate event in the sequence.

Pre-Processing II: time is a date
Herein, time is formatted as a date. An example of raw data that needs this type of preprocessing is presented in Table 3. In this example the observed variable event is again categorical, but this time already represented as a letter. In this case, pre-processing raw data requires one additional step (when compared with Pre-processing I) to compute the time interval between two consecutive events.
For each patient with id id patient, the main pre-processing steps are: 1. Eliminate rows with NA values.
2. If event is a categorical variable represented with a number, convert this number into a letter in alphabetic order; e.g. 1 → A, 2 → B, etc. If event is already a letter, skip this step.
3. Compute the time elapsed between two consecutive events and associate it with the corresponding row.
The temporal sequences obtained with data in Table 3 are presented in Table 4. Therein, for instance, patient with id number 12 has a transition between event A and B of 12 days.
Note, however, that it is not possible to build a temporal sequence for patient 15 from   Table 4: Prefix-encoded temporal sequences for patients 12 and 20 built from data in Table   3.
After the pre-processing step, all patients are fully characterized by the temporal sequences. These sequences provide the clinical history of the patient regarding a specific observed variable event. Data in this format is the input for the Temporal Needleman-Wunsch (TNW) algorithm.

Pairwise alignment
Herein, we present the Needleman-Wunsch (NW) algorithm and then its temporal extension.

Needleman-Wunsch algorithm
The NW algorithm was developed by Saul B. Needleman and Christian D. Wunsch and published in 1970. It is proven to find the optimal alignment between two sequences. It is based on dynamic programming where the basic idea is to solve the problem by dividing it into smaller ones; solve the smaller problems optimally and then use the sub-solutions to construct an optimal solution for the original problem.
For a pair of sequences, X = x 1 , ..., x m and Y = y 1 , ..., y n , where x i and y j , for i ∈ [1, m] and j ∈ [1, n], are the symbols of sequences X and Y with size m and n, respectively. The algorithm uses two matrices -score matrix H and traceback matrix T -that considers all possible pairs of symbols in the two sequences to build the alignment. The NW algorithm consists of three steps: 1. Initialisation of the score matrix H: 2. Calculation of the scores and filling the traceback matrix T . The remaining entries of the score matrix H are defined as: 3. Deducing the alignment from the traceback matrix T .
In Equations (1) and (2), S(x i , y j ) is a user-defined scoring schema that measures the similarity between symbols x i and y j in the sequences. Moreover, g is a user-defined constant, called the gap penalty, that allows to penalize the alignment score whenever a gap is inserted.
Example To illustrate the steps of the algorithm consider two sequences given by: We start by defining a scoring schema S(x, y), presented in Figure 1. If two symbols x i and y j of our sequences are equal then the matching score is 1 (S(x i , y j ) = 1) otherwise we have a mismatch and the score is The gap penalty is assumed to be 2 (g = 2). Now we follow the three steps described above: S -1 -1 -1 -1 1 Figure 1: User-defined scoring schema S.

Initialisation step:
The score matrix H is initialized with m + 1 rows and n + 1 columns where m and n are the size of sequences X and Y , respectively. The extra row and column is given for alignments with gaps. With Equation (1) we fill the first row and first column (c.f. Figure 2). The traceback matrix T is initialised according to Figure 3.   The same procedure is followed to compute all the other cells. The final score H and traceback T matrices are presented in Figure 6. The final score of the alignment is given by the last filled entry, corresponding in this example to the entry H 3,4 = −1. In the first move we look at the value in the bottom right cell that is "diag", which means that the pair "DD" of letters corresponding to the two sequences are aligned.
Then we move diagonally from the position (3,4) to the position (2,3). The latter cell also stores the value "diag", hence, the same procedure applies here and the the pair "NN" is aligned. In the third step the cell (1,2) has the value "left". This means that a gap is introduced in the left sequence, i.e., the letter "E" from sequence "SEND" is aligned with a gap and the letter "A" from sequence "AND" will look for other Figure 7: Traceback procedure using traceback matrix.
alignments while we traceback. Finnaly, in step 4 we encounter again the value "diag" at cell (1,1) and letter "A" is aligned with "S" concluding the alignment procedure.
In summary, the values stored in the traceback matrix indicate the alignment: "diag", where the letters from two sequences are aligned; "left", where a gap is introduced in the left sequence; "up", where a gap is introduce in the top sequence.

Temporal Needleman-Wunsch algorithm
The temporal Needleman-Wunsch (TNW) algorithm, proposed by Haider Syed and Amar K. Das, is a modified version of the NW method that incorporates the transition times between elements of a sequence. Therefore, it can be used for pattern discovery and matching sequences where time between events is relevant. The authors used the proposed method for autonomous chemotherapy-protocol recognition where patients treatment histories are aligned with standard recommended protocols to understand which protocol a patient is following.
The TNW algorithm can compute the temporal penalties with the total transition time between consecutive match pairs, by using PE sequences and two auxiliary matrices T R and T C that accumulate these transition times for events that align with gaps associated with the first X and second sequence Y , given by . . , t xm−1 .x m and Y = 0.y 1 , t y1 .y 2 , . . . , t yn−1 , . . . , y n .
The score matrix H is calculated using: for all i ∈ [1, m] and j ∈ [1, n], where T R and T C are given by and The only difference between Equations (2) and (3) is the introduction of a term f (t xi , t yj ) that is a user defined temporal penalty function. The aim of this function is to reduce the similarity S(x i , y j ) by an amount that depends on t xi and t yj . In the original paper the following temporal penalty function was used: where T p is some user-defined constant, called temporal penalty, that will impose the maximum penalty on S(x i , x j ). This penalty function computes a percentage discrepancy between times t i and t j of two events x i and x j that are compared. If the events being compared have the same transition times associated to both of them that means that they are similar and the penalty function will be zero. But if the transition times are different then a penalty that depends on the times will be imposed. Indeed, instead of using the transition times of the match-pair alone (−f (t xi , t yj )) this penalty function considers the total transition times between matched event pairs (−f (t xi + T R i−1,j−1 , t yj + T C i−1,j−1 ) in Equation (3)) associated with the two sequence being alignment.
As an example, the result obtained by applying the TNW algorithm in the sequences X = 0.A, t1.B, t2.C, t3.D and Y = 0.A, t4.D is shown in Figure 8 and Figure 9, where the resulting score H, traceback T , and the additional T R and T C matrices are presented. In this example the pre-defined scoring schema S consisted on scoring 1 matching events and −1.1 mismatching ones. The gap penalty was defined as 0.5. These scores were chosen in order reflect preference for aligning events with gaps instead of aligning mismatch events.
As already explained T R and T C matrices store values that represent the accumulated transition times for events that are aligned with gaps in the first sequence X and second sequence Y , respectively. This is clearly visible in Figure 9 where in the left side (T R matrix) it is possible to see an accumulation of transition times (t 1 , t 2 , t 3 ) of sequence X as the row i increases; the same happens in the right side (T C matrix) but now column wise with only t 4 being accumulated. For example, the cell position T 2,0 indicates 'up' which means that there is no alignment; in this case, T R 2,0 = T R 1,0 + t x2 = 0 + t 1 = t1 and T C 2,0 = T C 1,0 = 0. Because event B aligned to a gap we stored this transition time to use in a future calculation when some events align. In cases where we found a match, we add the diagonal T R and T C values to the transition times t xi and t yj of the events being compared. In the matrices positions (4, 2) a match is discovered, hence, the temporal penalty is computed using T R 3,1 + t x4 = t 1 + t 2 + t 3 and T C 3,1 + t y1 = 0 + t 4 = t 4 which gives the function −f (t 1 +t 2 +t 3 , t 4 ) that is zero. Since we found a match pair the values for the corresponding cell positions of T R and T C are reset to zero. In this manner subsequent events that are aligned will use time information from the last pair of events that were matched.

Synthetic results
In this section, we present results of the AliClu algorithm on synthetic data.

Temporal sequence generation
Synthetic data consist of temporal sequences generated by continuous-time Markov chains (CTMC). These type of models are adequate given the fact that they simulate very well both the transitions between discrete states and the corresponding times.
A CTMC is defined by two components: first, a discrete-time Markov chain, called the jump chain, whose set of states are denoted by E; second, a set of holding time parameters λ = {λ i } i∈E , where λ i defines the average amount of time spent in the i-th state. A CTMC can be represented via a Q-matrix.
The matrix Q = (q ij ) i,j∈E , also referred to as the generator matrix, provides an alternative way of specifying a CTMC. This matrix has the following properties: An adequate way to present the information contained in a Q-matrix is through a transition rate diagram. In this diagram, the values q ij are shown on the edges, whereas the values of q ii are not usually shown because they can be derived from the others.
For a given Q-matrix, the associated CTMC has a jump chain Π defined as follows: where q i = i =j q ij . Moreover, the holding parameters λ are such that λ i = q i , for all i ∈ E.
To generate temporal sequences, as described in Algorithm 1, an initial probability distribution vector α for the states is needed. The algorithm starts by initializing the CTMC with an initial state i being drawn from α. Then, in Step 2, the initial state i is converted into a letter, say I and the temporal sequence is initialized as tseq ='0.I'. Since, in our experiments, we never used more than five states, a simple conversion of the states into strings in alphabetical order is made, i.e a state i = 1 is converted to I = A, state i = 2 to I = B, etc. In Step 3, the time t of the jump of the CTMC to another state is simulated and then the new state j is drawn (Step 4). Again, after simulating the new state j a conversion to a latter J is made and the temporal sequence is updated as tseq = tseq+ ',t.J'. Finally, we consider the existence of an absorbing state, in order to mark the end of the sequences, hence, the simulation continues (Step 5) until an absorbing state is found.
Algorithm 1 Temporal sequence generation 1: Initialize the CTMC at t = 0 with initial state i drawn from the initial distribution α.
2: Convert the initial state i to a letter, say I, and initialize the temporal sequence as tseq ='0.I'.
3: Let the time between two consecutive events t be an observation of random variable with exponential distribution with mean value q i . -If q i = 0, simulate a discrete random variable with probability distribution given by the i-th row of the Π-matrix, i.e, qij qi with j = i; -Convert j to a letter, say J, and let tseq = tseq+ ',t.J'.
We used this data generating process to design two different experiments, detailed in the next sections, to test AliClu. Two remarks are of notice. First, we skip the first step of AliClu, concerning temporal sequences generation from raw data, for obvious reasons.
Second, as true cluster labels are known a priori, an additional analysis is made where clustering indices are computed between the true labels, i.e. the original parameter sets used to generate the data, and the clusters found by AliClu. In these experiments AliClu is run automatically.

Two clusters from the same two-state transition diagram
Two different clusters were generated from CTMCs with transition diagram given by Figure 10, cluster 1 with λ 1 = 1000 and cluster 2 with λ 2 = 1. Each cluster is composed by a number of sequences given by N . The goal is to separate correctly the clusters in the generated data. Results are presented in Figure 11. Averages are quite high with low standard deviations. In addition, all measures agree, i.e.
they present values close to each other and, therefore, it is possible to conclude that elicited clusters are equal to the original ones in the majority of the experiments.

Four clusters from different transition diagrams
Herein, data composed of four clusters generated from different CTMCs in Figure 12, were generated to assess AliClu. As before, the number of sequences N per cluster varies, taking values 5, 15, 25 and 50.
The percentages of correct decisions for different gap (g from −1.0 to 1.0 with incremental steps of 0.1) and temporal (T p values of 0.25, 1, 1.5 and 2) penalties of the TNW algorithm are plot in Figures 13-16. Ward's method was used in the agglomerative algorithm of AliClu in all experiments. In Figure 13 it is possible to observe that the percentage of correct decisions is zero for the majority of gap penalties and N . However, by increasing the parameter T p , the percentages of correct decisions increase, mainly for positive gap values, which is observed in Figure 14 for T p = 1. Overall, a good performance is achieved  Table 5: Average, standard deviation and median of five clustering indices when using Ward's method (c.f. Figure 11).
when g = 0.3 and g = 0.4; the latter presents the highest percentages for all N . In Figure 15, where T p = 1.5, the method performed better from g = 0.6 to g = 1. The best result is achieved for g = 0.8 with 100% of correct decisions on the number of clusters for all N . A deterioration of the performance is observed again in Figure 16 where T p = 2. As before, the percentages of correct decisions are mainly zero for the majority of gap values, as for the case with T p = 0.25. Finally, we note that the clustering indices between the groups found and the original ones, for cases where AliClu performed best (Figures 14 with g ∈ [0.3 0.5] and 15 with g ∈ [0.7 0.9]), validate that the correct clusters are obtained.
Indeed, the medians and averages of the clustering indices are always one or almost one, whereas the standard deviations are zero or almost zero.
To conclude, it was shown that by increasing the parameter T p , good performances can be obtained. The analysis of these synthetic data suggests that gap values used in AliClu should range over by positive gap values, rather than negative ones.   Figure 14: Results for T p = 1. Figure 15: Results for T p = 1.5. Figure 16: Results for T p = 2.