Skip to main content

Medication adherence prediction through temporal modelling in cardiovascular disease management



Chronic conditions place a considerable burden on modern healthcare systems. Within New Zealand and worldwide cardiovascular disease (CVD) affects a significant proportion of the population and it is the leading cause of death. Like other chronic diseases, the course of cardiovascular disease is usually prolonged and its management necessarily long-term. Despite being highly effective in reducing CVD risk, non-adherence to long-term medication continues to be a longstanding challenge in healthcare delivery. The study investigates the benefits of integrating patient history and assesses the contribution of explicitly temporal models to medication adherence prediction in the context of lipid-lowering therapy.


Data from a CVD risk assessment tool is linked to routinely collected national and regional data sets including pharmaceutical dispensing, hospitalisation, lab test results and deaths. The study extracts a sub-cohort from 564,180 patients who had primary CVD risk assessment for analysis. Based on community pharmaceutical dispensing record, proportion of days covered (PDC) \(\ge\) 80 is used as the threshold for adherence. Two years (8 quarters) of patient history before their CVD risk assessment is used as the observation window to predict patient adherence in the subsequent 5 years (20 quarters). The predictive performance of temporal deep learning models long short-term memory (LSTM) and simple recurrent neural networks (Simple RNN) are compared against non-temporal models multilayer perceptron (MLP), ridge classifier (RC) and logistic regression (LR). Further, the study investigates the effect of lengthening the observation window on the task of adherence prediction.


Temporal models that use sequential data outperform non-temporal models, with LSTM producing the best predictive performance achieving a ROC AUC of 0.805. A performance gap is observed between models that can discover non-linear interactions between predictor variables and their linear counter parts, with neural network (NN) based models significantly outperforming linear models. Additionally, the predictive advantage of temporal models become more pronounced when the length of the observation window is increased.


The findings of the study provide evidence that using deep temporal models to integrate patient history in adherence prediction is advantageous. In particular, the RNN architecture LSTM significantly outperforms all other model comparators.

Peer Review reports


The management of CVD risk is necessarily longterm. Such management typically involves disease modifying life style adjustments such as smoking cessation, weight management, diet and physical activity as well as long-term drug therapy for patients assessed to be above the threshold for pharmacological intervention. Studies in the United States have found that among adult populations only a small fraction (2% and 3%) maintain a healthy lifestyle as indicated by nonsmoking, ideal BMI, daily consumption of fruits and vegetables and regular physical exercise [1, 2]. This signals that from a population health perspective, it is insufficient to rely on lifestyle intervention alone in CVD risk management. For patients assessed to be above the threshold for pharmacological treatment, effective drugs exist [3]. Lipid-lowering treatment forms a key component of CVD management and statins are the preferred lipid-lowering drug [4, 5] with strong evidence of effectively lowering CVD risk in primary and secondary prevention [3, 6]. However, maintaining adherence to long-term medication presents a significant challenge to patients benefiting from treatment, with non-adherence associated with risk of major adverse cardiovascular events and mortality [7, 8].

Medication adherence is defined by the International Society for Pharmacoeconomics and Outcomes Research (ISPOR) as “the extent to which a patient acts in accordance with the prescribed interval and dose of a dosing regimen.” [9]. Commonly used CVD medications such as antiplatelets, statins, beta blockers, ACE inhibitors and angiotensin II antagonists are generally well tolerated with severe side effects occurring very rarely [10]. Despite the estimated 60–80% reduction to CVD risk provided by these preventive medicines [11] non-adherence to long-term medication continues to be a longstanding challenge in healthcare delivery. Evidence from a number of studies have found non-adherence to common CVD medication to be up to 80% for antihypertensives, as high as 75% for statins and up to 29 % for antiplatelets [12,13,14,15,16,17,18].

Both international and New Zealand studies have found long-term adherence to statin (a class of lipid-lowering drugs)—e.g. simvastatin, atorvastatin—to be low [12, 13]. In New Zealand, adherence to statin over the 3 years after an Acute Coronary Syndrome (ACS) was found to be 66%, although it was 82% for those on a statin prior to ACS admission [19]. In primary prevention in New Zealand, statin adherence in the first year after initiation of treatment was found to be 63% [20]. A US study found non-adherence to statin to be as high as 56.0% for secondary prevention patients and 56.4% for primary prevention patients [21]. Similarly, a UK based study found patterns of discontinuation of treatment for 41% of patients who are using statin as secondary prevention and 47% of patients who are using statin as primary prevention, although many of these patients restarted their treatment following discontinuation (75% and 72% respectively) [22].

The lack of adherence has dramatic clinical and economic implications. Poor adherence has been associated with approximately twice the risk of death in CVD patients [23, 24]. In the United States, approximately 125,000 deaths, at least 10% of all hospitalisation and significant increase in morbidity are attributed to lack of adherence, annually costing the healthcare system an estimated $100 billion to $289 billion [25]. As such, non-adherence represents a large flaw in current healthcare delivery for chronic condition management. The ability to accurately identify patients at risk of non-adherence could provide a valuable component for clinical decision support to target adherence-promoting interventions.

Patients’ adherence to therapy can be measured in a number of ways; through direct means such as monitoring a drug or its metabolite concentration in blood or urine, or through indirect means including patient self-reporting, the use electronic monitoring devices that record the frequency and time pill bottles have been used, and pill counts [26,27,28]. A method for measuring adherence that is used with increasing prevalence is by leveraging pharmacy dispense data due to it being non-invasive, low cost and its ability to cover a large population. Although this method makes the assumption that patterns of dispensing are consistent with patterns of actual ingestion/consumption, studies have validated the approach and have shown that this assumption is an acceptable estimate [29, 30]. Adherence measures based on pharmacy dispensing data are numerous; two widely used measures are medication possession ratio (MPR) and proportion of days covered (PDC) [31, 32]. Formally, MPR is defined as

$$\begin{aligned} MPR = \left( \frac{\text {Sum of days' supply in period}}{\text {Number of days in period}} \right) \times 100 \end{aligned}$$

and PDC is defined as

$$\begin{aligned} PDC = \left( \frac{\text {Number of days in period ``covered''}}{\text {Number of days in period}} \right) \times 100. \end{aligned}$$

The notion of “covered” stands for days in the period where the patient can reliably be in possession of their medication. For example, the interval between the dispense date and the number of days supplied after it are considered to be “covered”. Alternatively, if there exists a gap between when the supply runs out and the subsequent dispense this gap will be considered not “covered”. Both measures calculate a percentage value for adherence. Often, the implementation of MPR does not account for gaps, allowing subsequent overlapping dispenses to close earlier gaps leading to overestimation of adherence and potentially a nonsensical value of larger than 100%. PDC is a more conservative measure that accounts for gaps and in addition captures the notion of “stockpile”. Stockpile can happen when there is an overlapping of days supplied between dispenses or when there is leftover medication from the previous periods assessed. For PDC, overlapping subsequent dispenses are shifted allowing succeeding gaps to be covered by supply in the stockpile. PDC avoids the type of overestimation that MPR is prone to and is always \(\le\) 100%. Figure 1 illustrates their differences. Here, two identical dispensing patterns are shown where in the case of MPR, the presence of significant overlaps between supplies close off an earlier gap producing an MPR of >100%. Whereas in the case of PDC the earlier gap is maintained through shifting subsequent dispense to the end of the stockpile when overlaps of supplies occur, producing a PDC of 97%.

Fig. 1
figure 1

MPR and PDC of two identical dispense patterns. Each dispense represented by the red bar is of 90 days supply, with the left ending of the bar representing the dispense date and the right ending of the bar representing when the supply of the dispense will run out. Top: MPR sums days supply indiscriminately with respect to overlaps and gaps resulting in a value of > 100%. Bottom: PDC accounts for gaps and addresses overlaps by shifting subsequent dispense date to when supply runs out

Logistic regression is a statistical method for modelling the relationship between one or more predictor variables and a dichotomous response variable of the values 1 or 0. It is a function of the odds ratio, and it models the proportion of new incidents developed within a given period of time. Mathematically, the logistic regression model logit(p) is defined by

$$\begin{aligned} \begin{aligned} \frac{p}{1 - p} = \text {exp}(\beta _0 + \beta _1x_1 +\cdots + \beta _{m-1}x_{m-1})\\ \end{aligned} \end{aligned}$$

where \(\beta _0\) is the constant and \(\beta _1,\ldots ,\beta _{m-1}\) are the coefficients of the predictor variables \(x_1,\ldots ,x_{m-1}\), p is the probability of the event and \(\frac{p}{1-p}\) is the odds for an event [33]. The above equation can also be written as

$$\begin{aligned} \text {ln}\left( \frac{p}{1 - p}\right) = \beta _0 + \beta _1x_1 + \cdots + \beta _{m-1}x_{m-1} \end{aligned}$$

Equation 4 is referred to as the logit transformation of the probability of an event, showing that the natural log of the odds ratio is a linear function of predictor variables x [33, 34]. The probability of the event p can be calculated by [33]

$$\begin{aligned} \begin{aligned} p&= \frac{\text {exp}(\beta _0 + \beta _1x_1 +\cdots + \beta _{m-1}x_{m-1})}{1 + \text {exp}(\beta _0 + \beta _1x_1 +\cdots + \beta _{m-1}x_{m-1})}\\&= \frac{\text {exp}({\textbf {x}} \varvec{\beta })}{1 + \text {exp}({\textbf {x}} \varvec{\beta })}\\&= \frac{1}{1 + \text {exp}(-{\textbf {x}} \varvec{\beta })} \end{aligned} \end{aligned}$$

Regression based analysis is widely used in epidemiological studies to uncover the relationship between risk factors and an outcome of interest [35, 36]. In the context of CVD, logistic regression is commonly incorporated in research as a representative traditional statistical method to be compared against more recent methods such as decision tree, gradient boosting as well as less interpretable methods such as SVM, NN, and random forest [37,38,39,40]. Researchers have shown the model is competitive and is a valuable tool for risk prediction under certain conditions: moderate sample size (\(\sim\) 10,000 patients), small incident rates, and a limited number of predictors [39].

Ridge regression and its classification variant ridge classifier are linear models that address the problem of multicollinearity in the predictor variables [41]. The models are part of a family of penalised regression models including Lasso [42] and Elastic Net [43] that adds a penalty to the loss. This penalty constrains and shrinks the size of the model coefficients, which has a regularisation effect and prevents overfitting. Formally, the ridge regression minimizes

$$\begin{aligned} \sum _{i=1}^{n} \left( Y_i - \sum _{j=1}^{p} X_{ij} \beta _{j}\right) ^{2} + L2 \sum _{j=1}^{p} \beta _{j}^{2}. \end{aligned}$$

Here, n is the number of samples, p is the number of model coefficients, Y is the response variable, X is the predictor variable and \(L2 \ge 0\) is the regularisation parameter. The left hand side of Eq. 6 is the sum of the squared estimate of error and the right hand side of the equation is the penalty term. Ridge regression places a quadratic constraint on \(\beta\)s, where the regularisation parameter L2 controls the amount of shrinkage [44]. L2 is a hyperparameter, the value of which needs to be searched, typically through a cross-validation procedure. For classification problems, ridge classifier first modifies the binary response to − 1 and 1 and then treats the task as a regression task, minimising the loss in Eq. 6. The sign of the regressor’s prediction then represents the predicted class.

Ridge regression/classification has shown to be a promising modelling technique in the domain of epidemiology, particularly in high dimensional settings where the number of features is large, such as in genomic data analysis [45, 46]. As a comparatively more interpretable model, it has shown to be competitive against models with the capacity to model non-linear relationships such as Support Vector Machines (SVM) and neural networks (NN) [47].

A multilayer preceptron (MLP) is a densely connected feedforward neural network that in its most simple form consists of 3 layers: input, hidden and output. In contrast to a recurrent neural network, feedforward means information flows from one end of the network to the other without any feedback connections. A feedforward network defines a mapping of \(\varvec{y} = f(\varvec{x};\varvec{\theta })\) where \(\varvec{x}\) is the input, \(\varvec{y}\) the output and \(\varvec{\theta }\) the learnt parameters that best approximate the function. Commonly represented as a composite of functions \(f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))\) where \(f^{(i)}\) represents the ith layer [48]. For a MLP, \(\varvec{\theta }\) consists of the weight matrix \(\varvec{W}\) and the bias vector \(\varvec{b}\). Mathematically, a 3-layer MLP is defined as

$$\begin{aligned}&{\varvec{h}}^{\mathbf{(1)}} = \phi ^{(1)}({\varvec{W}}^{\mathbf{(1)}}\varvec{x} + {\varvec{b}}^{\mathbf{(1)}}) \\&{\varvec{h}}^{\mathbf{(2)}} = \phi ^{(2)}({\varvec{W}}^{\mathbf{(2)}} {\varvec{h}}^{\mathbf{(1)}} + {\varvec{b}}^{\mathbf{(2)}}) \\&\varvec{y} = \phi ^{(3)}({\varvec{W}}^{\mathbf{(3)}}{\varvec{h}}^{\mathbf{(2)}} + {\varvec{b}}^{\mathbf{(3)}}). \end{aligned}$$

Here, \(\varvec{W^{(i)}}\), \(\varvec{b^{(i)}}\) and \(\phi ^{(i)}\) are the weights, bias and activation for the ith layer. \(\varvec{h^{(i)}}\) is the output of the layer i. The network can be trained end-to-end using backpropagation. Conventionally, with the exception of the output layer, \(\phi\) is a non-linear activation, common among which are sigmoid, hyperbolic tangent also known as tanh or the more recently developed rectified linear unit (ReLU). It is the non-linear activation that provides the expressive power of MLP. Even with only a single hidden layer, an MLP can be universal (represent arbitrary functions) under certain technical conditions [49]. Increasing the depth of the network allows the network to represent complex functions more compactly. The hidden layer(s) of MLP can be thought of as learning nonlinear feature mapping, transforming a nonlinearly separable representation of the features to one that is linearly separable [48, 49]. This enables MLP to represent nonlinear relationships, overcoming the limitations of linear models. MLP is often used as an NN comparator among a number of other classification approaches for prediction tasks in the biomedical domain [37,38,39, 50]. While some studies have shown the strength of MLP over other classical machine learning techniques such as LR, classification and regression tree, gradient boosting, and random forest [37, 38], other studies have shown that MLP is not always superior [39, 50]. Primarily, MLP is known as a model that can capture complex non-linear relationships and interactions among the predictor variables, and is the simplest manifestation of a class of machine learning algorithms (NN). Within the context of this study, the limitation of MLP is that the model is intrinsically non-temporal, and is unable to model temporal relationships and interactions. While MLP is not explicitly a temporal model, it is flexible and will serve as a comparison for the performance of other models.

Recurrent Neural Networks (RNNs) are connectionist models that include edges connecting across adjacent time steps. At time t the input to node \(h^{(t)}\) is comprised of both the current input data \(x^{(t)}\) and the hidden node values from the previous time step \(h^{(t-1)}\). At each t the hidden node’s value \(h^{(t)}\) is then used to calculate the network’s output \(\hat{y}^{(t)}\). The forward pass at each time step of a RNN can be fully specified by two equations.

$$\begin{aligned} \varvec{h}^{(t)} = \sigma (W^{hx}\varvec{x}^{(t)} + W^{hh}\varvec{h}^{(t-1)} + \varvec{b}^{h} ) \end{aligned}$$
$$\begin{aligned} \varvec{\hat{y}}^{(t)} = softmax(W^{yh}\varvec{h}^{(t)} + \varvec{b}^{y}) \end{aligned}$$

where \(W^{hx}\) is the weight matrix between the input and the hidden node, \(W^{hh}\) is the recurrent weight matrix connecting the hidden layer between adjacent time steps and \(W^{yh}\) is the weight matrix between the hidden node and the output. \(\varvec{b}^{h}\) and \(\varvec{b}^{y}\) are the bias vectors for the hidden node and output respectively. \(\sigma\) is a sigmoid activation function and softmax is a softmax activation function [51]. Other activation functions may be used, commonly used activations include, ReLU and tanh [52].

RNNs suffer from a problem known as ‘vanishing’ or ‘exploding’ gradient. By unfolding the recurrent edges of the hidden layer, RNNs can be interpreted as deep NNs with one layer per time step where the weights are shared across time steps. The widely used algorithm backpropagation through time (BPTT) for training RNNs is the application of backpropagation through the (unfolded) network across time steps. However, as outlined by [53] training RNNs using gradient descend is difficult. The problems arise when the error signal propagated backwards in time vanishes or explodes as the evolution of the backpropagated error exponentially depends on the weights of the recurrent edge. Earlier experiments showed that backpropagation was unable to discover contingencies that span long temporal intervals, settling in suboptimal solutions that learnt short-range dependencies but failed to learn dependencies that are long-range [53].

To address the difficulties in training RNNs, Hochreiter and Schmidhuber introduced Long Short-Term Memory (LSTM) [54]. LSTM features a constant error carrousel (CEC) to allow constant error to flow through the self-connected units, a multiplicative input gate unit and a multiplicative output gate unit to protect the network’s memory from perturbation by irrelevant inputs as well as irrelevant memory perturbing other units. The extended unit is called a memory cell. The proposed LSTM solved numerous complex tasks requiring the learning of long-range dependencies that were unable to be solved by previous RNN algorithms. Gers et al. further added a forget gate unit to LSTM so that it may overcome the weakness of the internal cells’ values growing without bounds when the network is learning from continual input streams that are not previously segmented into training sequences with clearly demarcated beginnings and ends [55]. The forget gates learn to reset the contents of the memory cells once they are no longer needed. Since its introduction, forget gates have proven effective and are now standard in LSTM implementations [51]. The full algorithm of LSTM with forget gate is given by the following equations:

$$\begin{aligned}&\varvec{g}^{(t)} = \phi (W^{gx}\varvec{x}^{(t)} + W^{gh}\varvec{h}^{(t-1)} + \varvec{b}^{g}) \end{aligned}$$
$$\begin{aligned}&\varvec{i}^{(i)} = \sigma (W^{ix}\varvec{x}^{(t)} + W^{ih}\varvec{h}^{(t-1)} + \varvec{b}^{i}) \end{aligned}$$
$$\begin{aligned}&\varvec{f}^{(t)} = \sigma (W^{fx}\varvec{x}^{(t)} + W^{fh}\varvec{h}^{(t-1)} + \varvec{b}^{f}) \end{aligned}$$
$$\begin{aligned}&\varvec{o}^{(t)} = \sigma (W^{ox}\varvec{x}^{(t)} + W^{oh}\varvec{h}^{(t-1)} + \varvec{b}^{o}) \end{aligned}$$
$$\begin{aligned}&\varvec{s}^{(t)} = \varvec{g}^{(t)} \odot \varvec{i}^{(t)} + \varvec{s}^{(t-1)} \odot \varvec{f}^{(t)} \end{aligned}$$
$$\begin{aligned}&\varvec{h}^{(t)} = \phi (\varvec{s}^{(t)}) \odot \varvec{o}^{(t)} \end{aligned}$$

Here \(\varvec{g}^{(t)}\) is the input node that takes activation from the input layer \(\varvec{x}^{(t)}\) and the hidden layer \(\varvec{h}^{(t-1)}\). The superscripts t and \(t-1\) indicate time steps, where t is the current time step and \(t-1\) the previous time step. \(W^{gx}\) and \(W^{gh}\) are weights for the input layer to the input node and the hidden layer to the input node respectively. \(\phi\) is a tanh activation function for the summed weighted input and bias vector \(\varvec{b}^{g}\). \(\varvec{i}^{(t)}\), \(\varvec{f}^{(t)}\) and \(\varvec{o}^{(t)}\) are the input, forget and output gate units. Each gate also takes activation from the summed weighted \(\varvec{x}^{(t)}\), \(\varvec{h}^{(t-1)}\) and their respective bias vectors. Here, \(\sigma\) is a sigmoid function; if the value of the gate is one, all flow is passed through, if the value is zero, the flow is entirely blocked. \(\varvec{s}^{(t)}\) is the internal state of the memory cell. \(\odot\) denotes pointwise multiplication. The value of the internal state is updated by the sum of the input node \(\varvec{g}^{(t)}\) multiplied by the input gate unit \(\varvec{i}^{(t)}\) and the internal state from the previous time step \(\varvec{s}^{(t-1)}\) multiplied by the forget gate unit \(\varvec{f}^{(t)}\). The value of the hidden layer \(\varvec{h}^{(t)}\) is then derived first by passing the internal state \(\varvec{s}^{(t)}\) through a tanh function and then by multiplying the value of the output gate unit \(\varvec{o}^{(t)}\).

Recently, a number of researchers have applied RNN, in particular, LSTM in the biomedical domain. The task of making predictions in the healthcare domain is influenced by a recency effect akin to human memory as well as simultaneously requiring learning the dependencies of distant past events, for these reasons the architecture of LSTM is well suited. These studies include [56] where LSTM is used to model multivariate pediatric intensive care time series to predict diagnoses and [57] where LSTM is used to jointly analyse episodic clinical events and continuous monitoring data in the ICU settings to predict deterioration of patient conditions and their length of stay.

Pham et al. [58] proposed an extension to the LSTM model called DeepCare. The model regulates the input gate of LSTM with information on the patient’s diagnoses and admission method (planned or unplanned) and regulates the forget and output gates with information about the patient’s diagnoses and procedures or medications received. DeepCare not only was able to learn long term dependencies in a patient’s disease trajectory, it can learn the confounding interaction between disease progression and treatment. The authors demonstrated the superiority of DeepCare at the task of diagnosis and intervention prediction against a Markov model and a plain RNN for two distinct diseases; diabetes and mental health. In the same study, DeepCare was shown to be superior to Support Vector Machine (SVM) and Random Forest in predicting future risk of readmissions.

In one particular recent study LSTM was applied to the task of medication adherence prediction. In this study, patients who were on self-administered injection therapy were monitored through a internet of things (IoT)-connected smart sharps bin. Data collected from the smart bin was then used to develop ensemble and deep learning machine learning models to predict patient adherence at the next scheduled injection [59]. Against model comparators including: extreme gradient boosting, extremely randomized trees, random forest, gradient tree boosting, MLP, and RNN; LSTM was found to be the best performing model on the held-out test set, achieving AUC of 0.8902 [59].

Similar to the above mentioned study, other recent studies in medication adherence have also underscored the temporal patterns present in patient adherence behaviour and the relationships between clinical events in patients history to adherence [60, 61]. To the best of our knowledge, the prediction of long-term medication adherence using temporal deep learning models on a large routinely collected population data set has not been investigated. The current study integrates patient history into the analytics task and aims to identify individuals within a population who might be at risk of medication non-adherence. The experiment focuses on lipid-lowering pharmacological treatment. All patients included in the cohort of the study have had at least one dispense of lipid-lowering medication in the observation window, with a two week look ahead past the index date. Informed by clinical practice, the study makes the assumption that once the patient has been prescribed a lipid-lowering medication, the patient should continue to be prescribed lipid-lowering medication; thus the lack of dispensed medication supply indicates a flaw in the CVD risk management process, such as failure to get repeat prescriptions or failure to have them dispensed.

Given the prolonged nature of pharmacological therapy for CVD, the hypothesis for this study is that computational methods that allow the integration of patient history including the history of pharmaceutical dispensing, hospitalisation, and lab test results (indicating patients’ physiological changes), through temporal modelling will aid predictive performance. The study investigates explicitly temporal models for sequential modelling including Long short-term memory (LSTM) recurrent neural network (RNN) architecture and simple recurrent neural networks (Simple RNN) as well as non-temporal models multilayer perceptron (MLP), ridge classifier (RC) and logistic regression (LR). LSTM contains internal mechanisms that facilitate the learning of important events and discarding of unimportant events in the distant past, overcoming the problem known as vanishing and exploding gradient suffered by Simple RNN [51, 54]. For this prediction task, the hypothesis that by lengthening the observation window temporal models, specifically LSTM, might gain a performance advantage is also investigated.


Adherence measure and medication switching

The current study uses PDC as the measure of adherence, but the commonly-observed phenomenon of medication switching in long-term therapy must be addressed. A frequent example in CVD management in New Zealand is the switching from simvastatin to atorvastatin. These two drugs constitute the majority of lipid-lowering medication prescribed in the country. Atorvastatin is a more potent drug at blocking target enzyme HMGCoA [62] and prior to September 2010 approval for atorvastatin required first-line use of simvastatin [63]. During medication switching, the total days “covered” of a class of drugs could exceed 100% even if one uses PDC as the measure. To apply the method of shifting the dispense date based on overlap across the two drugs is unrealistic as once the new drug is dispensed, the likely scenario is the patient will discontinue taking the older medication. In this study this issue is addressed by summing the PDC across any lipid-lowering drugs dispensed in the same quarter and set an upper-bound for the value to 100%. Figure 2 illustrates the method using simvastatin to atorvastatin switching as an example.

Fig. 2
figure 2

Dispensing pattern of simvastatin to atorvastatin switching and their respective PDC measures. A Lipid-lowering PDC is calculated by summing their respective PDC and setting an upper bound of 100

Data sources and cohort

PREDICT is a web-based CVD risk assessment and management decision support system developed for primary care in New Zealand. The system is integrated with the general practice electronic health record (EHR) and since its deployment in 2002 has produced a constantly growing cohort of CVD risk profiles. Through the use of encrypted National Health Identifier number (NHI), the de-identified cohort is annually linked to other routinely collected databases to produce a research cohort. The PREDICT cohort and its use in improving CVD risk assessment has been described in detail previously [64, 65]. Vascular Informatics Using Epidemiology and the Web (VIEW) is a research programme with the goal of reducing inequities in vascular disease outcomes. The VIEW team has made a de-identified extract from the PREDICT cohort and linked data available for the present study [66].

The current study links the PREDICT cohort to TestSafe (Auckland regional laboratory test results [67]) and national collections by the Ministry of Health - the Pharmaceutical collection, the National Minimum Dataset (hospital events) and the Mortality Collection [68]. TestSafe is used to obtain lab test results of clinically relevant measures: high-density lipoproteins (HDL), low-density lipoproteins (LDL), triglyceride (TRI), total cholesterol (TCL), total cholesterol to high density lipoprotein ratio (TC/HDL), glycated hemoglobin (HbA1c) and estimated Glomerular Filtration Rate (eGFR). The Pharmaceutical collection is used to obtain dispensing history of medication relevant to the management of CVD including lipid-lowering, blood pressure lowering, antiplatelets, and anticoagulants as well as dispensings of drugs used in the management of important comorbidities e.g. insulin. The National Minimum Dataset (NMDS) is used to identify hospitalisation with their dates of admission and discharge and diagnosis. The Mortality collection enables the identification of patients who died during the study period and their cause of death. From these sources, history of CVD, treatment trajectories, important comorbidities as well as CVD events can be derived.

Table 1 VIEW CVD categories: CVD history, CVD mortality and CVD outcome, feature names under the categories and feature descriptions

A lookup table constructed by the VIEW research team is used to identify relevant chemical names from the Pharmaceutical collection. Identified chemical names using this lookup table are grouped into 3 broad categories Lipid-lowering, CVD and Other. Similarly, a lookup table constructed by the VIEW research team is used to identify ICD-10 codes in the hospitalisation collection that are related to CVD conditions: more specifically, International Statistical Classification of Diseases and Related Health Problems, Tenth Revision, Australian Modification, ICD-10-AM, which was used in New Zealand from 1999 to 2019 [69]. The conditions are broadly in two categories: history and outcome, with the addition of mortality. For the list of the CVD conditions and their respective categories see Table 1. For the definitions of listed conditions see

The study cohort was selected through a number of exclusion criteria. First, patients having their PREDICT assessment prior to 01/01/2007 and after 30/12/2013 are excluded as their pharmaceutical records are censored in the observation or prediction windows. Second, informed by our interest in integrating the temporal pattern of disease states, patients without all components of cholesterol test results (HDL, LDL, TRI, TCL and TC/HDL) reported in either the observation or prediction windows are excluded. Third, informed by our interest in integrating the temporal pattern of disease management process, patients without lipid-lowering medication dispensed in the observation window with a 2-week look ahead post PREDICT assessment (to account for patients prescribed lipid-lowering medication in response to the PREDICT assessment) are excluded. Patients with infeasible data values and patients under the age of 18 are excluded. Lastly, Ethnicity MELAA (Middle Eastern, Latin American and African; which comprises only 1.5% of the New Zealand population [70]) and Other are excluded due to small sample size. See Fig. 3 for the study cohort selection flowchart.

Fig. 3
figure 3

Flow chart of study cohort selection

Of the 100,096 patients in the selected cohort, 25,419 patients have prior history of CVD, defined as having a hospital admission prior to their PREDICT assessment date with an ICD-10-AM code matching the ‘broad CVD history’ category (HX_BROAD_CVD) defined by VIEW.

Study design

A study design is formulated using each patient’s PREDICT assessment as the index date, and the \(\sim\) 2 years (8 \(\times\) 90 day quarters) prior to the index date and the \(\sim\) 5 years (20 \(\times\) 90 day quarters) after the index date as the observation window and the prediction window respectively. (see Fig. 4). A time-step of a quarter (90 days) is used for the constructed time-series. This decision is informed by 90 days being the most common value in the pharmaceutical record for DAYS_SUPPLY of Lipid-lowering medication, the CVD preventive treatment of principal interest. A \(\sim\) 5 years interval for the prediction window is chosen because it aligns with MoH guildlines for CVD risk assessment and is underpinned by the fact that patients’ CVD risk and risk management can change considerably over a longer period (i.e. 10 years), most randomised controlled trials of CVD medications are based on a period of 5 years or less and that practitioners are accustomed to this approach [3]. A \(\sim\) 2 years interval for the observation window is chosen in the interest of retaining enough samples in the data set, as dispense data extracted from the pharmaceutical collection begins from 2005, the longer the observation window grows the larger the number of samples that will need to be excluded.

Fig. 4
figure 4

Study design showing date range from index date for the observation window (shaded in green) and the prediction window (shaded in red)

Descriptive statistics

Based on the study design outlined in the section Study design and the result of the cohort selection outlined in the section Data sources and cohort, quarterly time-series based on 90-day quarters are constructed for each patient in the cohort using the linked data outlined in the section Data sources and cohort. The features of the data fall into 8 categories: Demographic, Lipids, Lipid-lowering drugs, CVD drugs, Other drugs, Hospitalisation, HbA1c and eGFR, and PREDICT. See Tables 2, 3, 4, 5 and 6 for the features’ descriptive statistics, and Table 7 for the variable descriptions of the PREDICT variables.

Table 2 Descriptive statistics: demographic variables
Table 3 Descriptive statistics: cholesterols
Table 4 Descriptive statistics: hospitalisation
Table 5 Descriptive statistics: HbA1c and eGFR
Table 6 Descriptive statistics: PREDICT
Table 7 PREDICT variables and their descriptions

Prediction outcome

The adherence prediction problem is formulated as a binary classification task: predicting adherent or non-adherent. The feature LL_PDC is a time-series that sums over all lipid-lowering PDCs: simvastatin, bezafibrate, atorvastatin, ezetimibe, nicotinic acid, acipimox, cholestyramine, cholestipol hydrochloride, pravastatin, ezetimibe with simvastatin and gemfibrozil, with an upper bound of 100 as described in the section Adherence measure and medication switching. Thus, any sums of lipid-lowering PDCs exceeding the value of 100 are set to 100. This feature is used to assess patient adherence using a historical and widely applied threshold of \(\ge 80\) indicating adherence [71, 72]. Here, patients’ mean LL_PDC in the prediction window determines their class (\(\ge 80\) equals class label 1 for adherence, 0 for non-adherence otherwise). Of the 100,096 patients included in the study cohort, 42,428 patients are classed as non-adherent and 57,668 patients are classed as adherent.

Prediction models and evaluation

The models LSTM, Simple RNN, MLP, RC and LR are compared. The input data for LSTM and Simple RNN are explicitly sequential, and the input data for MLP, RC as well as LR are flattened across the time-step dimension and concatenated. To examine the effect of multicollinearity as well as the effect of using history on RC and LR, two other input data sets are constructed. First, instead of concatenating the features across multiple time-steps, an input data set is constructed that uses the values of the last time-step in the observation window (quarter 8) for features that are invariable across time (i.e. SEX, ETHNICITY, NZDEP) and the mean value of features that are variable across time (i.e. TC/HDL, LL_SIMVASTATIN, HX_BROAD_CVD). Here, an exception is AGE where the value at the 8th quarter is used. This data set is from here on referred to as aggregated. Second, an input data set is constructed using only the values of the last quarter in the observation window. This data set is from here on referred to as last quarter.

All NN models are connected to a 2-unit densely connected output layer. This layer uses the softmax activation from which the probability distribution of the two classes is derived. The RNN models (LSTM and Simple RNN) require an architecture that takes multiple inputs across the observation window and only outputs once at the last time-step. The unrolled view across the time-step dimension of the RNN models is shown in Fig. 5.

Fig. 5
figure 5

An unrolled view of RNN across the time-step dimension. Here, RNN can be a layer of Simple RNN or LSTM. NN is a layer of densely connected NN with softmax activation. \(x_{n}\) are the inputs across n time-steps. \(\hat{y}\) is the output

Software setup

Experiments are carried out using Python 3.6.8 [73], with neural network models using library Keras 2.2.4 [52] with Tensorflow 1.13.1 [74] backend and linear models RC and LR using library Scikit-learn 0.21.2 [75]. Experiments also used R version 3.6.0, package pROC 1.16.2 [76] for conducting DeLong’s test and SciPy library 1.5.4 for conducting Kolmogorov–Smirnov test [77].

Procedures for hyperparameter search

This section outlines the procedures carried out to search for the optimal set of hyperparameters for the LSTM, Simple RNN and MLP models. The samples in the data set are first randomly shuffled, and then from the entire data set, 10,096 samples are set aside as the test set and removed from the search process. The remaining data (90,000 samples) are used in the search process. For each combination of hyperparameters, a fivefold cross validation is carried out where while the proportion of data used for the train and validation sets are consistent, with 90% train (81,000 samples) and 10% validation (9000 samples), different splits of train and validation sets are used in the experiments. See Fig. 6 for a visual illustration of how the data is split into train and validation sets across the 5 folds. In these experiments, the validation error is monitored and the lowest mean validation error is used to determine the best set of hyperparameters.

Fig. 6
figure 6

Illustration of the procedure used in splitting data into test, train and validation sets across different folds

For all experiments, the optimizer ADAM [78] is used due to its capacity to adaptively adjust the learning rate during the training process and because its default hyperparameters have been shown to work on a range of problems. The ADAM optimizer is used with the default hyperparameter values outlined in [78]. These hyperparameter values are, learning rate \(\alpha = 0.001\), the exponential decay rate for the 1st moment estimate \(\beta _1 = 0.9\), the exponential decay rate for the 2nd moment estimate \(\beta _2 = 0.999\), and the small constant for numeric stability \(\hat{\epsilon } = 1e-7\) [52]. See Table 8 for the found optimal hyperparameters of the NN models.

Table 8 NN model hyperparameters for the adherence prediction experiment

Scikit-learn’s RidgeClassifierCV class [75] provides an implementation of ridge classifier that uses a built-in generalised cross-validation to search for the optimal L2 value from an array of values. For this experiment, the values \(1e^{-6}\), \(1e^{-5}\), \(1e^{-4}\), \(1e^{-3}\), \(1e^{-2}\), 0.1, 1 and 10.0 were searched. See Table 9 for the found optimal L2 values and their respective accuracy on the validation set.

Table 9 Optimal L2 values found for RidgeClassifier for adherence prediction and their respective accuracy on the validation set

Assessment of model performance

Once the optimal hyperparameters for each NN model have been found, the models are trained using the found hyperparameters with the data split shown in Fold 1 in Fig. 6. The test set that is held aside is then used to assess model performance. The linear models RC, LR and their respective aggregated and last quarter variants are trained using the same training samples in Fold 1 and use the same test samples to measure model performance. For RC, the value of L2 is estimated using the training samples, accuracy reported are calculated using the validation set and model performance is assessed based on prediction made using the test set.

The performance of the models on the test set are compared by using the metric of receiver operating characteristic (ROC) area under the curve (AUC) score [75, 79]. To assess the statistical significance of the difference in performance between the models, DeLong’s test is used to conduct pairwise comparisons. DeLong’s test is a nonparametric test that can be used to obtain a p-value when two ROC AUC are compared [80]. The Bonferroni adjustment is used to address the increased likelihood of making a type I error when making multiple comparisons [81]. Sensitivity analysis is conducted by randomly resampling the test set with replacement. Here, 1000 (repeats) of 10,096 (resampled test sets) are used to assess model performance, with each model producing 1000 ROC AUC scores. The models’ scores are then compared using two-sample Kolmogorov-Smirnov test [77, 82].

A potentially important consideration for adherence analysis using dispensing data is the number of days the patient is hospitalised in the prediction window. During this time the patient is given their medication but is not recorded as part of the dispensing data (as inpatient medicine supply is not tracked within the community-pharmacy based data source). However, in the extracted de-identified quarterly time-series it was difficult to accurately track the number of days the patient is an inpatient as, for instance, consecutive days of outpatient visits and a single inpatient stay could appear identical. It was determined that only 0.54% of the training set changed from non-adherent to adherent with a maximal estimate of in-patient hospitalisation days, and therefore hospitalisation was ignored in computing PDC in the observation period.

Post-PREDICT experiment

Based on the hypothesis that the integration of patient history through temporal modelling will aid predictive performance, and that LSTM has the capacity to learn to retain relevant and ignore unimportant patient history, a further question is raised: Could lengthening the observation window demonstrate a more distinct advantage of using LSTM? To answer this question, a further set of experiments is conducted. Using data after the PREDICT assessment/index date (20 quarters), a task is formulated to predict the adherence behaviour in the last year (quarter 17 to quarter 20) using 1, 2, 3 and 4 years (4, 8, 12 and 16 quarters) of patient history before and up to quarter 16 (the new index date). See Fig. 7 for an illustration of the study design of this set of experiments. For this task, patients who died before quarter 16 (4897 individuals) are removed from the data. These experiments are from here on referred to as the Post-PREDICT adherence prediction experiments. For these experiments, models LSTM, Simple RNN, MLP and RC are compared.

Fig. 7
figure 7

Study design of the Post-PREDICT adherence prediction experiments showing the initial index date, new index date, the observation windows (shaded in green) and the prediction window (shaded in red)


Table 10 shows the ROC AUC scores of the models on the task of adherence prediction. Figure 8 shows the ROC curves of the models. It can be observed LSTM, Simple RNN, MLP and in some regions RC compare favourably to other model comparators. Over the ROC space, the NN based models dominate over the regression based models with the exception that for a small region RC performed similarly well to MLP. Here, LSTM’s predictive superiority is shown, dominating all other models for the most part of the ROC space. The performance strength of NN models are confirmed in Table 10 where there appears to be a performance gap between NN models and the regression based models, suggesting for this task the capacity to model complex nonlinear relationships is advantageous. Here, the best performing regression models are RC and LR using full sets of features across the observation window. Interestingly, in Fig. 8, the aggregated variant of RC and LR dominates over their last quarter variants in the region where the false positive rate (FPR) is < 0.35, however, where the FPR is > 0.35 the last quarter variants dominate over the aggregated variant.

Fig. 8
figure 8

ROC curves of adherence prediction

Table 10 Model performance on adherence prediction

The results of the pairwise comparison of ROC curves using DeLong’s test are shown in Table 11. The model comparisons of the adherence prediction experiments include 36 hypotheses (pairwise comparison of 9 individual models). The significance level of 0.05 is chosen for determining statistical significance. In Table 11, all p-values below the Bonferroni adjusted significance threshold of 0.00139 are in bold, the Bonferroni adjusted threshold is derived from 0.05/36 as there are 36 pairwise comparisons. Alongside the results shown in Table 10, these results supports our stated hypothesis that an integration of patient history through temporal modelling can aid predictive performance. Here, LSTM and Simple RNN are the top performing models with LSTM shown to perform significantly better than all model comparators. Simple RNN ranking second performed significantly better than all other models except for MLP. The results of the sensitivity analysis are shown in Table 12. The pairwise comparison of ROC AUC scores using Kolmogorov-Smirnov test shows that the models are robust against randomly resampled test sets.

Table 11 pvalues of pairwise comparison of models performance on adherence prediction using DeLong’s test
Table 12 p values of pairwise comparison of models performance on 1000 (repeats) randomly resampled with replacement test sets (10,096 samples) using Kolmogorov–Smirnov test
Fig. 9
figure 9

Model performance on the Post-PREDICT adherence prediction with varying observation window lengths

The results of the experiments using data after the PREDICT assessment date with varying lengths of observation window is shown in Fig. 9. The corresponding results of DeLong’s tests are shown in Table 13. Here, it can be observed that as the observation window grew in length, the performance of temporal models LSTM and Simple RNN continued to improve up to 12 quarters. The non-temporal comparator MLP and RC were unable to leverage additional context beyond 8 quarters to improve its prediction. Interestingly, RC (aggregated)’s performance deteriorated with longer observation window. These results confirm that a more distinct advantage of using LSTM can be demonstrated by lengthening the observation window.


The results of the adherence prediction experiments show that for predicting adherence to lipid-lowering medication over a 5-year period based on a 2-year (8 quarters) observation window it is beneficial to use the full sets of features across the observation window. With RC and LR, both models perform significantly better than their aggregated and last quarter counterparts despite the strong presence of multicollinearity. There is also evidence of nonlinear relationships among the predictor variables as shown by the superior performance of MLP in contrast to the linear models. Additionally, the observed ROC AUC of Simple RNN is higher than that of MLP, but the result is not statistically significant with the adjusted significance level (it is worthwhile to mention that the Bonferroni adjustment is conservative and can lead to a high rate of false negatives [83]). Overall, LSTM is the superior model, outperforming all comparators. These results demonstrate that for the problem of adherence prediction, explicitly modelling history sequentially and learning to “remember” and “forget” influential and unimportant events in the past is valuable for predictive performance. Our findings of the strengths of deep learning models when compared with regression based models are consistent with observations in other recent studies in the biomedical domain, including for the task of adherence prediction among hypertensive patients [84], and specifically the advantage of LSTM for the tasks of in-hospital mortality, decompensation, length of stay and phenotyping prediction for ICU patients using clinical time−series data [85].

In our experiments using data after the PREDICT assessment with varying lengths of observation window, it is observed that only certain models’ performance improved with longer observation windows, namely LSTM, Simple RNN and RC (See Fig. 9). Additionally, it can be observed that the method of aggregating data from the past, with equal weight, is detrimental to the prediction of regression based models. From Table 13, results of DeLong’s tests show LSTM and Simple RNN are able to leverage additional contexts of patient history to produce predictions that are statistically significantly better than the performance of MLP (LSTM from 8 quarters and Simple RNN from 12 quarters). For this specific task, 12 quarters of observation window seems to be the most effective context for LSTM, allowing it to perform significantly better than RC. Our findings aligns with other studies that showed the capacity of explicitly sequential models to improve with longer context of observation [86, 87]. However, in contrast to [87] MLP did not benefit from longer observation windows in our experiments, and aggregating data across the temporal dimension negatively impacted the performance of regression based models as shown in RC (aggregated).

Table 13 Post-PREDICT adherence prediction

The results demonstrate the advantage of lengthening the observation window for LSTM and Simple RNN models that take explicitly sequential data as input for the task of adherence prediction. Interestingly, the ability to model nonlinear relationships does not give MLP an advantage over RC. This could be due to the length of the prediction window. To predict adherence behaviour over 1 year is a much easier task than to predict adherence behaviour over 5 years. The results suggest there are complex nonlinear relationships in adherence prediction that become important when the prediction window is longer.

There are a number of limitations of the current study. It is not uncommon in the course of a chronic condition for patients with the condition to also suffer from one or more comorbidities. Some of these comorbidities can have consequential effects on the risk and outcome of the patient. In the case of CVD, diabetes and renal disease are known independent risk factors that contribute to an increased risk of a CVD event [3, 88]. While covariates that indicate the presence of comorbidities are included in the current study (e.g. eGFR and HbA1c) the effects of the combination of multiple drugs or polytherapy and by extension the complexity of treatment regimen on adherence has not been addressed. Additionally, the effects of medication titration and known side-effects of statins such as myopathy, elevated creatine kinase levels, and diabetes [3] have not been explored in the context of adherence prediction. Further, researchers of CVD therapy have pointed to the knowledge gap that exists between the evidence from randomised clinical trails, typically only lasting a few years, and the effect of long-term medication treatment (where it is common for therapy to continue for decades) in secondary prevention [89]. The study design used in this thesis was unable to capture the long-term (defined in the scale of decades) effect of disease progression and treatment trajectory. While preserving a useful number of cases, the data construction used in this thesis was only able to achieve a 7 year window to divide between observation and prediction. In the future, however, this will change as routinely collected electronic health records lengthen year on year. LSTM, like other NN models, is a class of black box models where the influence of and interactions between predictor variables cannot be readily explained. Considerable research has been carried out investigating methods to interpret and explain neural models [90, 91], and some specifically for RNNs such as through the use of an attention mechanism [92] or deriving feature attribution from Learned Binary Masks and KernelSHAP [93]. These methods are clearly worthy directions of future work as they hold the potential for aiding risk communication. Finally, the study was limited to patients in the PREDICT cohort; while PREDICT is widely-used in New Zealand general practice, the cohort is not identical to all users of lipid-lowering therapy in the general population. Without having external validation conducted, the generalisability of the findings are limited.

The current study confirms the hypothesis that integrating patient history through temporal modelling is beneficial to the prediction of medication adherence. This has implications for the monitoring of long-term statin therapy as well as other similar diseases where the management and treatment of the disease is long-term. In this study we have refrained from exploring in depth how the model would be operationalized in clinical decision support (e.g. in terms of specific operating thresholds or forms of alerts). However, the observed ROC AUC of 0.805 on 5-year adherence indicates that 80% of pairings of adherent and non-adherent patients are correctly ordered in terms of predicted risk [94]. Practical medication adherence promotion strategies include reminder packaging, reviewing a patient’s total set of medications and simply practicing more effective communication [95]. The performance of the LSTM model presented herein would provide a reasonably-accurate adherence risk stratification to help prioritise such efforts.


The current study provides evidence that routinely collected health data can be leveraged for the task of adherence prediction. Our findings show integrating patient history using temporal deep learning models is beneficial to predictive performance. LSTM’s performance compares favourably against other model comparators and is a promising model for identifying individuals within a population who might be at risk of medication non-adherence.

Availability of data and materials

The data that support the findings of this study are not available from the authors due to the terms on which we accessed them from the VIEW research programme. Researchers interested in using the VIEW datasets in collaboration with the VIEW programme should contact Professor Rod Jackson at The VIEW research programme data access procedures incorporate the“Five Safes”principles, an internationally recognised risk assessment framework encompassing safe projects, safe people, safe settings, safe data and safe output [96]. To ensure safe projects are conducted using the VIEW data, a template will be provided to allow completion of a data access proposal (DAP) that outlines the proposed research and the specific data required. The DAP will then be considered by the VIEW Leadership team, which is comprised of senior academics, including Professor Rod Jackson and Associate Professor Matire Harwood who are co-directors of the research programme. The research credentials of the applicant will also be reviewed. If approved, researchers are required to adhere to the VIEW team Code of Practice and to sign a Data Release Agreement form that outlines the conditions for data storage and use that must be adhered to (i.e.“safe people”). In terms of safe settings, source data and research-ready VIEW data are not publicly accessible. Researchers will only be able to access data through a VIEW virtual machine that is separate from the wider University network, is disconnected from the internet and is not accessible by USB. Approved researchers are expected to be physically based at the Section of Epidemiology and Biostatistics at the University of Auckland for a few days at the outset of data access. Subsequently, remote access to data can be arranged. Access to source data is highly restricted and requires authorisation from the VIEW team data manager. With consideration of the“safe data”principle, all data made available to researchers are anonymised; the VIEW team does not have access to identifiable health data and therefore the potential for re-identification of research-ready data is minimised. Finally, including a VIEW team member as a co-investigator for all approved research projects involving VIEW data contributes to safe outputs, as all results are reviewed by at least one VIEW team member for any identifying results such as small counts.





Acute Coronary Syndrome


Area under the curve


Body mass index


Cardiovascular disease


Electronic health record


Estimated glomerular filtration rate


Glycated haemoglobin


High-density lipoprotein


3-Hydroxy-3-methylglutaryl coenzyme A


International Statistical Classification of Diseases and Related Health Problems, 10th Revision


International Statistical Classification of Diseases and Related Health Problems, 10th Revision, Australian Modification


Intensive care unit


International Society for Pharmacoeconomics and Outcomes Research


Low-density lipoprotein


Logistic regression


Long short-term memory


Middle Eastern, Latin American and African


Multilayer perceptron


Medication possession ratio


National Health Index


National Minimum Dataset


Neural network


New Zealand Index of Deprivation


Proportion of days covered


Ridge classifier


Receiver operating characteristic


Recurrent neural network


Total cholesterol/high-density lipoprotein ratio


Total cholesterol




Vascular Informatics using Epidemiology and the Web


  1. Bambs C, Kip KE, Dinga A, Mulukutla SR, Aiyer AN, Reis SE. Low prevalence of ideal cardiovascular health in a community-based population: the heart strategies concentrating on risk evaluation (Heart SCORE) study. Circulation. 2011;123(8):850–7.

    Article  PubMed  PubMed Central  Google Scholar 

  2. Reeves MJ, Rafferty AP. Healthy lifestyle characteristics among adults in the United States, 2000. Arch Intern Med. 2005;165(8):854–7.

    Article  PubMed  Google Scholar 

  3. Ministry of Health: Cardiovascular Disease Risk Assessment and Management for Primary Care. 2018. Accessed 3 July 2020.

  4. The Heart Foundation of New Zealand: Cardiovascular Disease Risk Assessment and Management. 2022. Accessed 26 Sep 2022.

  5. World Health Organisation: Cardiovascular disease (CVDs). 2021. Accessed 26 Sep 2022.

  6. Cholesterol Treatment Trailists’ (CTT) Collaborators. Efficacy and safety of cholesterol-lowering treatment prospective meta-analysis of data from 90056 participants in 14 randomised trails of statins. The Lancet. 2005;366(9493):1267–78.

  7. World Health Organization: Adherence to Long-Term Therapies-Evidence for Action. 2003. Accessed 5 July 2018.

  8. Simon ST, Kini V, Levy AE, Ho PM. Medication adherence in cardiovascular medicine. BMJ. 2021;374.

  9. Cramer JA, Roy A, Burrell A, Fairchild CJ, Fuldeore MJ, Ollendorf DA, Wong PK. Medication compliance and persistence: terminology and definitions. Value Health. 2008;11(1):44–7.

    Article  PubMed  Google Scholar 

  10. Institute for Quality and Efficiency in Health Care (IQWiG: Medication for the long-term treatment of coronary artery disease. 2006. Accessed 14 Jan 2019.

  11. Kerr A, Stewart RAH. Non-adherence to medication and cardiovascular risk. N Z Med J. 2011;124(1343):6–10.

    PubMed  Google Scholar 

  12. Brown MT, Bussell JK. Medication adherence: WHO cares? Mayo Clin Proc. 2011;86(4):304–14.

    Article  PubMed  PubMed Central  Google Scholar 

  13. Mabotuwana T, Warren J, Harrison J, Kenealy T. What can primary care prescribing data tell us about individual adherence to long-term medication? Comparison to pharmacy dispensing data. Pharmacoepidemiol Drug Saf. 2009;18:956–64.

    Article  PubMed  Google Scholar 

  14. Kulkarni SP, Alexander KP, Lytle B, Heiss G, Peterson ED. Long-term adherence with cardiovascular drug regimens. Am Heart J. 2006;151:185–91.

    Article  PubMed  Google Scholar 

  15. Costa FV. Compliance with antihypertensive treatment. Clin Exp Hypertens. 1996;18:463–72.

    Article  CAS  PubMed  Google Scholar 

  16. Cramer JA, Benedict A, Muszbek N, Keskinaslan A, Khan ZM. The significance of compliance and persistence in the treatment of diabetes, hypertension and dyslipidaemia: a review. Int J Clin Pract. 2008;62:76–87.

    Article  CAS  PubMed  Google Scholar 

  17. Chodick G, Shalev V, Gerber Y, Heymann AD, Silber H, Simah V, Kokia E. Long-term persistence with statin treatment in a not-for-profit health maintenance organization: a population-based retrospective cohort study in Israel. Clin Ther. 2008;30(11):2167–79.

    Article  PubMed  Google Scholar 

  18. Newby LK, LaPointe NMA, Chen AY, Kramer JM, Hammill BG, DeLong ER, Muhlbaier LH, Califf RM. Long-term adherence to evidence-based secondary prevention therapies in coronary artery disease. Circulation. 2006;113(2):203–12.

    Article  CAS  PubMed  Google Scholar 

  19. Grey C, Jackson R, Wells S, Thornley S, Marshall R, Crengle S, Harrison J, Riddell T, Kerr A. Maintenance of statin use over 3 years following acute coronary syndromes: a national data linkage study (ANZACS-QI-2). Heart. 2014;100(10):770–4.

    Article  CAS  PubMed  Google Scholar 

  20. Sigglekow F, Horsburgh S, Parkin L. Statin adherence is lower in primary than secondary prevention: a national follow-up study of new users. PLoS ONE. 2020;15:1–17.

    Article  Google Scholar 

  21. Ellis JJ, Erickson SR, Stevenson JG, Bemstein SJ, Stiles RA, Fendrick MA. Suboptimal statin adherence and discontinuation in primary and secondary prevention populations. J Gen Intern Med. 2004;19(6):638–45.

    Article  PubMed  PubMed Central  Google Scholar 

  22. Vinogradova Y, Coupland C, Brindle P, Hippisley-Cox J. Discontinuation and restarting in patients on statin treatment: prospective open cohort study using a primary care database. The BMJ. 2016;353:1–16.

    Article  Google Scholar 

  23. Ho PM, Spertus JA, Masoudi FA, Reid KJ, Peterson ED, Magid DJ, Krumholz HM, Rumsfeld JS. Impact of medication therapy discontinuation on mortality after myocardial infarction. Arch Intern Med. 2006;166(17):1842–7.

    Article  PubMed  Google Scholar 

  24. Rasmussen JN, Chong A, Alter DA. Relationship between adherence to evidence-based pharmacotherapy and long-term mortality after acute myocardial infarction. J Am Med Assoc. 2007;297(2):177–86.

    Article  CAS  Google Scholar 

  25. Viswanathan M, Golin CE, Jones CD, Ashok M, Blalock SJ, Wines RCM, Coker-Schwimmer EJL, Rosen DL, Sista P, Lohr KN. Interventions to improve adherence to self-administered medications for chronic diseases in the United States. Ann Intern Med. 2014;157:1–16.

    Google Scholar 

  26. Lehmann A, Aslani P, Ahmed R, Celio J, Gauchet A, Bedouch P, Bugnon O, Allenet B, Schneider MP. Assessing medication adherence: options to consider. Int J Clin Pharm. 2014;36(1):55–69.

    Article  PubMed  Google Scholar 

  27. Lam WY, Fresco P. Medication adhernce measures: an overview. Biomed Res Int. 2015;2015(217047):1–12.

    Google Scholar 

  28. Andrade SE, Kahler KH, Frech F, Chan KA. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiol Drug Saf. 2006;15(8):565–74.

    Article  PubMed  Google Scholar 

  29. Fairman K, Matheral B. Evaluating medication adherence. J Manag Care Pharm. 2000;6(6):499–504.

    Article  Google Scholar 

  30. Grymonpre R, Cheang M, Fraser M, Metge C, Sitar DS. Validity of a prescription claims database to estimate medication adherence in older persons. Med Care. 2006;44(5):471–7.

    Article  PubMed  Google Scholar 

  31. Krueger K, Griese-Mammen N, Schubert I, Kieble M, Botermann L, Laufs U, Kloft C, Schulz M. In search of a standard when analyzing medication adherence in patients with heart failure using claims data: a systematic review. Heart Fail Rev. 2018;23(1):63–71.

    Article  PubMed  Google Scholar 

  32. Pharmacy Times: Do you know the difference between these adherence measures. 2015. Accessed 20 June 2021.

  33. Department of Statistic Online Programs: Logistic Regression. Published by Penn State Eberly College of Science. 2018. Accessed 26 Aug 2018.

  34. MedCalc: Logistic Regression. Published by MedCalc easy-to-use statistical software. 2018. Accessed 26 Aug 2018.

  35. Haghighi M, Johnson SB, Qian X, Lynch KF, Vehik K, Huang S. The TEDDY Study Group: a comparison of rule-based analysis with regression methods in understanding the risk factors for study withdrawal in a pediatric study. Sci Rep. 2016;6(April):1–12.

    Article  CAS  Google Scholar 

  36. Goldstein BA, Navar AM, Carter RE. Moving beyond regression techniques in cardiovascular risk prediction: applying machine learning to address analytic challenges. Eur Heart J. 2017;38(23):1805–14.

    Article  PubMed  Google Scholar 

  37. Kurt I, Ture M, Kurum AT. Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Syst Appl. 2008;34(1):366–74.

    Article  Google Scholar 

  38. Weng SF, Reps J, Kai J, Garibaldi JM, Qureshi N. Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLoS ONE. 2017;12(4):1–14.

    Article  CAS  Google Scholar 

  39. Nusinovici S, Tham YC, Chak Yan MY, Wei Ting DS, Li J, Sabanayagam C, Wong TY, Cheng CY. Logistic regression was as good as machine learning for predicting major chronic diseases. J Clin Epidemiol. 2020;122:56–69.

    Article  PubMed  Google Scholar 

  40. Mohd Faizal AS, Thevarajah TM, Khor SM, Chang SW. A review of risk prediction models in cardiovascular disease: conventional approach vs artificial intelligent approach. Comput Methods Prog Biomed. 2021;207:1–11.

    Article  Google Scholar 

  41. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67.

    Article  Google Scholar 

  42. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B. 1996;58(1):267–88.

    Google Scholar 

  43. Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B. 1996;67:301–20.

    Article  Google Scholar 

  44. Taylor J. Introduction to regression and analysis of variance: penalized models. Stanford University Statistics, vol. 203. 2005. jtaylo/courses/stats203/notes/penalized.pdf.

  45. Cule E, De lorio M. Automatic choice of the ridge parameter. Ridge regression in prediction problems. Genetic Epidemiol. 2013;37:704–14.

    Article  Google Scholar 

  46. Vlaming DR, Groenen PJF. The current and future use of ridge regression for prediction in quantitative genetics. BioMed Res Int. 2015.

  47. Niemann U, Boecking B, Brueggenmann P, Mebus W, Mazurek B, Spiliopoulou M. Tinnitus-related distress after multimodel treatment can be characterized using a key subset of baseline variables. PLos ONE. 2020;15:e0228037.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Goodfellow I, Bengio Y, Courville A. Deep learning; 2016. MIT Press.

  49. Grosse R. Lecture 5: multilayer perceptrons. Intro to neural networks and machine learning. 2018. Accessed 8 Nov 2020.

  50. Churpek MM, Yuen TC, Winslow C, MeltZer DO, Kattan MW, Edelson DP. Multicenter comparison of machine learning methods and conventional regression for predicting clinical deterioration on the wards. Crit Care Med. 2016;44(2):368–74.

    Article  PubMed  PubMed Central  Google Scholar 

  51. Lipton, ZC. A critical review of recurrent neural networks for sequence learning. 2015.arXiv:1506.00019.

  52. Chollet F, et al. Keras. 2015. Accessed 10 Nov 2020.

  53. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw. 1994;5(2):157–66.

    Article  CAS  PubMed  Google Scholar 

  54. Hochreiter S, Urgen Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80.

    Article  CAS  PubMed  Google Scholar 

  55. Gers FA, Schmidhuber J, Cummins F. Learning to forget: continual prediction with LSTM. Neural Comput. 2000;12(10):2451–71.

    Article  CAS  PubMed  Google Scholar 

  56. Lipton ZC, Kale DC, Elkan C, Wetzel R. Learning to diagnose with LSTM recurrent neural networks. 2015.

  57. Xu Y, Biswal S, Deshpande SR, Maher KO, Sun J. RAIM: recurrent attentive and intensive model of multimodal patient monitoring data. Kdd. 2018;18:2565–73.

    Article  Google Scholar 

  58. Pham T, Tran T, Phung D, Venkatesh S. DeepCare: a deep dynamic memory model for predictive medicine. Lecture notes in computer science (including subseries lecture notes in artificial intelligence and lecture notes in bioinformatics) 9652 LNAI(i); 2016. p. 30–41.

  59. Gu Y, Zalkikar A, Liu M, Kelly L, Hall A, Daly K, Ward T. Predicting medication adherence using ensemble learning and deep learning models with large scale healthcare data. Sci Rep. 2021;11:18961.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  60. Hu F, Warren J, Exeter DJ. Interrupted time series analysis on first cardiovascular disease hospitalization for adherence to lipid-lowering therapy. Pharmacoepidemiol Drug Saf. 2019.

    Article  PubMed  Google Scholar 

  61. Pietrzykowski L, Michalski P, Kosobucka A, Kasprzak M, Fabiszak T, Stolarek W, Siller-Matula JM, Kubica A. Medication adherence and its determinants in patients after myocardial infarction. Nat Sci Rep. 2020;10(12028):1–11.

    Google Scholar 

  62. Moon CJ. Switching statins. BMJ. 2006;332(7554):1344–5.

    Article  PubMed  PubMed Central  Google Scholar 

  63. Mehta S, Wells S, Jackson R, Harrison J, Kerr A. The effect of removing funding restrictions for atorvastatin differed across sociodemographic groups among New Zealanders hospitalised with cardiovascular disease: A national data linkage study. N Z Med J. 2016;129(1443):18–29.

    PubMed  Google Scholar 

  64. Pylypchuk R, Wells S, Kerr A, Poppe K, Riddell T, Harwood M, Exeter D, Mehta S, Grey C, Wu BP, Metcalf P, Warren J, Harrison J, Marshall R, Jackson R. Cardiovascular disease risk prediction equations in 400 000 primary care patients in New Zealand: a derivation and validation study. The Lancet. 2018;391(10133):1897–907.

    Article  Google Scholar 

  65. Wells S, Riddell T, Kerr A, Pylypchuk R, Chelimo C, Marshall R, Exeter DJ, Mehta S, Harrison J, Kyle C, Grey C, Metcalf P, Warren J, Kenealy T, Drury PL, Harwood M, Bramley D, Gala G, Jackson R. Cohort profile: the PREDICT cardiovascular disease cohort in New Zealand primary care (PREDICT-CVD 19). Int J Epidemiol. 2017;46(1):22.

    Article  PubMed  Google Scholar 

  66. Jackson R. Using big data to tackle inequalities in vascular disease. 2018. Accessed 28 Sep 2021.

  67. CareConnet: Welcome to TestSafe. 2020. Accessed 27 Jan 2019.

  68. Ministry of Health, Manatū Hauora: Collections. 2019. Accessed 30 May 2021.

  69. Ministry of Health. Manatū Hauora: ICD-10-AM/ACHI/ACS Development. 2021. Accessed 29 Aug 2021.

  70. Stats NZ. Ethnic group summaries reveal New Zealand’s multicultural make-up. 2020. Accessed 22 May 2021.

  71. Hayes RB, Taylor DW, Sackett DL, Gibson ES, Bernholz CD, Mukherjee J. Can simple clinical measurements detect patient noncompliance? Hypertension. 1980;2(6):757–64.

    Article  Google Scholar 

  72. Baumgartner PC, Hayes RB, Hersberger KE, Arnet I. A systematic review of medication adherence thresholds dependent of clinical outcomes. Front Pharmacol. 2018;9:1–10.

    Article  Google Scholar 

  73. Python Software Foundation: Python Language Reference. 2017. Accessed 10 Nov 2020.

  74. Abadi M, Agarwal A, Barham P, et al. TensorFlow: large-scale machine learning on heterogeneous systems. Software. 2015.

  75. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

    Google Scholar 

  76. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, Müller M. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 2011;12:77.

    Article  Google Scholar 

  77. Virtanen P, et al. SciPy 1.0 contributors: SciPy 1.0: fundamental algorithms for scientific computing in python. Nat MethodDs. 2020;17:261–72.

  78. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: 3rd International conference for learning representations; 2014. p. 1–15.

  79. Swets AJ. Measuring the accuracy of diagnostic systems. Sci New Ser. 1988;240(4857):1285–93.

    Article  CAS  Google Scholar 

  80. DeLong ER, DeLong MD, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45.

    Article  CAS  PubMed  Google Scholar 

  81. Fan Z. STATS 200: introduction to statistical inference lecture 11: testing multiple hypotheses. 2016. Accessed 24 Nov 2020.

  82. Massachusetts Institute of Technology 6.S085 Statistics for Research Projects: IAP2015: Nonparametric statistics and model selection. 2015.

  83. Goldman M. Why is multiple testing a problem?. 2008.

  84. Li X, Xu H, Li M, Zhao D. Using machine learning models to study medication adherence in hypertensive patients based on national stroke screening data. In: 2021 IEEE 9th international conference on bioinformatics and computational biology (ICBCB); 2021. p. 135–9.

  85. Harutyunyan H, Khachatrian H, Kale DC, Steeg GV, Galstyan A. Multitask learning and benchmarking with clinical time series data. Sci Data. 2019;6(96):1–18.

    Article  Google Scholar 

  86. Mirshekarian S, Bunescu R, Marling C, Schwartz F. Using lstms to learn physiological models of blood glucose behavior. In: 2017 39th Annual international conference of the IEEE engineering in medicine and biology society (EMBC); 2017. p. 2887–91.

  87. Choi E, Schuetz A, Stewart WF, Sun J. Using recurrent neural network models for early detection of heart failure onset. J Am Med Inform Assoc. 2017;24(2):361–70.

    Article  PubMed  Google Scholar 

  88. Church E, Poppe K, Harwood M, Mehta S, Grey C, Selak V, Marshall MR, Wells S. Relationship between estimated glomerular filtration rate and incident cardiovascular disease in an ethnically diverse primary care cohort. N Z Med J. 2019;132(1491):11–26.

    PubMed  Google Scholar 

  89. Rossello X, Pocock SJ, Julian DG. Long-term use of cardiovascular drugs challenges for research and for patient care. J Am Coll Cardiol. 2015;66(11):1273–85.

    Article  CAS  PubMed  Google Scholar 

  90. Holzinger A, Biemann C, Pattichis CS, Kell DB. What do we need to build explainable AI systems for the medical domain? 2017. arXiv:1712.09923.

  91. Goebel R, Chander A, Holzinger K, Lecue F, Akata Z, Stumpf S, Kieseberg P, Holzinger A. Explainable AI: the New 42? Mach Learn Knowl Extract. 2018;11015:295–303.

    Article  Google Scholar 

  92. Choi E, Bahadori MT, Kulas JA, Schuetz A, Stewart WF, Sun J. RETAIN: an interpretable predictive model for healthcare using reverse time attention mechanism (Nips); 2016. p. 1–13.

  93. Ho V, Aczon L, Ledbetter M, Wetzel DR. Interpreting a recurrent neural network’s predictions of ICU mortality risk. J Biomed Inform. 2021;114:1–18.

    Article  Google Scholar 

  94. Statology: How to interpret the C-Statistic of a logistic regression model. 2019. Accessed 18 Sep 2021.

  95. Kocurek B. Promoting medication adherence in older adults and the rest of us. Diabetes Spectr. 2009;22(2):80–4.

    Article  Google Scholar 

  96. Desai T, Ritchie F, Welpton R. Five safes: designing data access for research. University of the West of England, Bristol. 2016. Accessed 15 May 2022.

Download references


This study is supported by the University of Auckland Doctoral Scholarship and in part by New Zealand Health Research Council programme grant HRC16-609. We thank Kylie Chen for code checking the time-series construction code and Mike Merry for facilitating network connection to the GPU machine during COVID lockdowns. Thanks to the members of the VIEW research team for their feedback on earlier drafts of the manuscript.


James R. Warren received funding from the Health Research Council of New Zealand.

Author information

Authors and Affiliations



All authors contributed to the manuscript’s conception, design, and revision. WH conducted the experiments and wrote the first draft. All authors read and approved the final manuscript.

Corresponding author

Correspondence to William Hsu.

Ethics declarations

Ethics approval and consent to participate

The VIEW research programme granted permission to access the data of the PREDICT cohort and its related linked datasets.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Hsu, W., Warren, J.R. & Riddle, P.J. Medication adherence prediction through temporal modelling in cardiovascular disease management. BMC Med Inform Decis Mak 22, 313 (2022).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI:


  • Pharmacoepidemiology
  • Medication adherence
  • Cardiovascular disease
  • Deep learning
  • Recurrent neural networks
  • Machine learning