How to measure temporal changes in care pathways for chronic diseases using health care registry data

Background Disease trajectories for chronic diseases can span over several decades, with several time-dependent factors affecting treatment decisions. Thus, there is a need for long-term predictions of disease trajectories to inform patients and healthcare professionals on the long-term outcomes and provide information on the need of future health care. Here, we propose a state transition model to describe and predict disease trajectories up to 25 years after diagnosis in men with prostate cancer (PCa), as a proof of principle. Methods States, state transitions, and transition probabilities were identified and estimated in Prostate Cancer data Base of Sweden (PCBaSeTraject), using nationwide population-based data from 118,743 men diagnosed with PCa. A state transition model in discrete time steps (i.e., 4 weeks) was developed and applied to capture all possible transitions (PCBaSeSim). Transition probabilities were estimated for changes in both treatment and comorbidity. These models combined yielded parameter estimates to run an individual-level simulation based on the state-transition model to obtain prediction estimates. Predicted estimates were then compared to real world data in PCBaSeTraject. Results PCBaSeSim estimates for the cumulative incidence of first and second transitions, death from PCa and death from other causes were compared to observed transitions in PCBaSeTraject. A good agreement was found between simulated and observed estimates. Conclusions We developed a reliable and accurate simulation tool, PCBaSeSim that provides information on disease trajectories for subjects with a chronic disease on an individual and population-based level. Electronic supplementary material The online version of this article (10.1186/s12911-019-0823-y) contains supplementary material, which is available to authorized users.


Background
Many previously fatal diseases have become chronic diseases due to improvements in diagnosis and treatment [1]. These disease pathways (referred to as disease trajectories below) can span over several decades and incorporate a variety of treatments and outcomes. Moreover, ageing and increasing comorbidity will affect treatment decisions. Thus, there is a need to predict disease trajectories for several decades to inform patients and healthcare professionals on the long-term outcomes of disease and provide information on the need of future health care [2].
There is a wide range in the clinical course of prostate cancer (PCa) [3]. Due to improvements in detection, diagnostics, and treatment, many men currently diagnosed with PCa have a low risk of PCa death during the first 15 years after diagnosis and hence for many men PCa has become a chronic disease, albeit with a small risk of progression many years after date of diagnosis [4].
This study aimed to create a model to describe and predict disease trajectories over a period of 25 years using PCa as a proof of principle for a chronic disease. More specifically, we used data on PCa from nationwide, population-based health care registers in Sweden to predict disease trajectories in men with different disease severity, comorbidities, and treatments using state transition models.

Study population
This study was based on 118,743 men diagnosed with PCa from 1992 until 2014 who were registered in The National Prostate Cancer Register (NPCR) Sweden [5]. Men with unknown disease severity at time of diagnosis (3%) were excluded [6]. By use of the unique Swedish personal identity number, NPCR has been linked to other population-based healthcare registers and demographic databases to form the Prostate Cancer data Base Sweden (PCBaSe Traject ). This provided additional information on cause of death, comorbidities, drug use, and socio-economic status, and treatment changes [7,8]. Comorbidities were specifically measured with the Charlson comorbidity index (CCI), which was based on discharge diagnoses retrieved from the National Patient Register and the National Cancer Register [9]. The linkage of PCBaSe was approved by the Research Ethics Board at Umeå University.
The men were divided into five groups based on the primary management strategy: active surveillance (AS), radical prostatectomy (RP), radiotherapy (RT) performed as either external beam radio therapy or brachy therapy, watchful waiting (WW), and androgen deprivation therapy (ADT). Different risk categories were defined for each management strategy. For AS risk categories, we applied a modification of the National Comprehensive Cancer Network guidelines [10]. The two lowest NCCN risk categories (low and intermediate-risk PCa) were divided into two subgroups each. The least favourable intermediate-risk category was not considered as suitable for AS, resulting in three AS risk categories (Additional file 2: Table S1). Curative treatment consisted of RP or RT, which could be either primary treatment or initiated after a period of AS, i.e. deferred RP or RT. Primary and deferred RPs were divided into six risk categories based on pathological tumour stage (pT stage) and pathological Gleason Grade Group (pGGG) as recorded in NPCR (Additional file 2: Table S2). Primary and deferred RT was divided into eight risk categories (Additional file 2: Table S3). WW as primary management  strategy was classified into six risk categories, based on T stage, N stage, M stage, serum levels of prostate-specific antigen (PSA), and Gleason Grade Group (GGG) (Additional file 2: Table S4). Men receiving ADT were divided into eight risk categories (Additional file 2: Table S5). ADT consisted of either antiandrogen monotherapy (AA) or gonadotropinreleasing hormone (GnRH) agonists (Additional file 2: Figure S1). For all treatment-specific risk categories, the lowest category indicated men with the most favourable risk.

Missing data
There is no specific follow-up data in NPCR. For example, men primarily managed with AS who eventually experience disease progression can be treated with curative intention with RP or RT. At this point, the updated TNM stage, GGG and PSA have not routinely been reported to NPCR. This yields a missing data problem. We solved this problem by applying multiple imputation by chained equations [11] using five imputation datasets. Similar problems with lack of detailed disease progression data also appeared for other non-primary treatments, where change was triggered by disease progression and these situations were also solved by multiple imputation. Further details regarding imputation models are presented in the Additional file 1.

Analysis
We applied a method based on a state transition model (Fig. 1). Men diagnosed with PCa entered their primary state according to their primary treatment (AS, RP, RT, WW, ADT) and their treatment specific risk category (AS 1 -AS 3 , RP 1 -RP 6 , RT 1 -RT 8 , WW 1 -WW 6 , and ADT 1 -ADT 8 ). The probability of a transition to another state was based on age, comorbidity, history of previous treatments, and treatment-specific risk category. All transitions were considered irreversible and state transitions were allowed until an absorbing state, i.e. PCa-death or death from other causes, was reached. Transitions between treatment-specific risk categories were not considered, i.e. each man stayed in his designated treatment risk category until a new treatment was introduced. Our model building consisted of several steps: First, we randomly split our set of data in two equally large subsets, the first set was a training set that was used to estimate transition probabilities and the second set was used to assess internal validity of the simulation process.
Second, we simplified follow-up time by use of fourweek time steps. At the end of each time step a man either remained in his current state or transited to a new Risk categories ADT 1 -ADT 8 f Screening, prostate cancer detected due to work-up after PSA testing in asymptomatic men state. The discretised follow-up data was arranged using long format, i.e., each man was represented by several rows of data, one for each time step in which he was alive. Age and CCI were updated in each time step.
Next, we estimated state transition probabilities, which were determined from both registered and nonregistered data regarding date of treatment change. Specifically, for all treatment state transitions, except the transition from AS ➔ WW, the date of treatment change was retrieved from PCBaSe Traject using the training set of data. Left truncation on January 1, 2006, was used as previously described [8]. The non-registered transition date for the transition of AS ➔ WW was managed as previously described [12] (Additional file 2: Figures S2 and S3). We considered a change in CCI as a state transition. For such a CCI transition, we applied a CCI state transition model as previously described [9]. The probability of CCI changes was state-specific, i.e. different CCI model parameters were estimated for each state.
Finally, our prediction was carried out as a microsimulation, an individual-level simulation based on the state-transition model [13]. Simulations were performed for each specific combination of treatment, age and CCI at treatment start, and treatment-specific risk categories. The simulation process consisted of three different steps, which are described in brief below. Further

1) A man's vital status at the end of each time step
was determined as alive, dead from PCa or dead from other causes. The probability of PCa death was modelled in a logistic regression, whereas the probability of death from other causes was modelled conditioned on no previous PCa death in a second logistic regression (see Additional file 1 for further details). 2) For a man who was alive at the end of a time step, we determined if a treatment change had occurred. Such change could be either direct, when the treatment risk category in the new state was known and remained unchanged, e.g. RP RT adj/salv , or preceded by an estimation of new treatmentspecific risk category (Additional file 2: Table S6). In  3) If a man was alive at the end of a time step, CCI was updated, according to a separate CCI transition model based on a combination of Poisson regression and logistic regression models, as described previously [9]. The CCI transition model was specific for each individual treatment.
We refer to this simulation method in the following text as PCBaSe Sim . Further specifications of possible transitions and used models are provided in the Additional file 1. In order to assess internal validity of PCBaSe Sim , we graphically compared cumulative incidence of subsequent transitions obtained from the simulation with those observed in validation set.

Results
The study population consisted of 118,743 men diagnosed with PCa and included in NPCR. Most of these men (n = 32,537, 27%) were primarily managed with RP (Table 1). Detailed characteristics of men included in PCBaSe Traject were previously described [7,8].
Our proposed model starts at time of PCa diagnosis and ends in either of the following absorbing states: death from PCa or death from other causes. Figure 1 illustrates states and state transitions, with all the possible treatment changes included in the model.
To assess the internal validity of the simulation, the cumulative incidence of the first transitions observed in PCBaSe Traject was compared with the simulated first transitions for the same men up to 25 years after diagnosis. To decrease random error, we ran the simulation 100 times for each man. Observed and simulated first transitions almost perfectly overlapped for all groups (Fig. 2). At ten years after RP, 21% of the observed PCBaSe Traject cohort had transitioned to adjuvant/salvage radiotherapy vs. 20% as predicted by PCBaSe Sim . Estimates at 20-years were still consistent for all the analysed primary management strategies (e.g., 5% of men treated with primary RP received GnRH agonists in both PCBaSe Traject and PCBaSe Sim ).
Similarly, we analysed second transitions. Men first treated with RP or RT, which are the most common treatment strategies overall for PCa, were followed until their second transition. Also in this model, the observed and simulated cumulative incidences were very similar, e.g. 10year proportion of men treated with primary RT and later receiving GnRH agonists was 5% in both PCBaSe Traject and PCBaSe Sim . At 15 years these proportions were 7% in PCBaSe Traject and 9% in PCBaSe Sim , respectively (Fig. 3). Eventually, the agreement between predicted vs. observed cumulative incidence of death was acceptable for both PCa and other causes of death (Fig. 4).

Discussion
We developed, applied, and tested a novel state transition model, based on longitudinal data in health care registers including men with PCa, to predict long term disease trajectories. More specifically, we tested the accuracy of first and second transitions (i.e. treatment changes), as well as the transitions to final absorbing states, by comparing observed data vs. data obtained by our newly developed tool, PCBaSe Sim .
Treatment decisions are a challenging process in real life, as well as predictions of health-related outcomes. Due to the long life expectancy, e.g. > 20 years for healthy men with low-risk PCa [4], and the availability of many therapeutic options [3], PCa disease trajectories can thus be difficult to predict at time of diagnosis. These pathways include different treatment options and may be protracted over time. Although international guidelines recommend therapeutic options according to disease status, there are several combinations of disease states and life expectancy for which there is little evidence to support treatment decisions [3]. Moreover, recently introduced management strategies, e.g. active surveillance, lack long-termfollow-up, so there are no data on which to base outcome predictions. Previous attempts have been made to implement state-transition models in urology [14,15] and other medical areas [16], mainly from the health-economy standpoint. However, these attempts were not based on a detailed and comprehensive, population-based source of data, such as PCBaSe Traject . Therefore, previous models have not been validated against real world data. In contrast, the state transition model we propose in this paper was compared with real world data, and provides as a user-friendly data output that makes both interpretation of results and replicability easy.
More specifically, these results provide a proof-ofprinciple for state transition models that can be applied to many other chronic diseases for which there is also a need for long term predictions of outcomes. Using complete nationwide, population-based data of~120,000 Swedish men with PCa [8], we were able to develop a simulation programme, PCBaSe Sim , which reliably models treatment changes up until 25 years after diagnosis. To assess the reliability of our state transition model, we compared realworldfollow-up data in PCBaSe Traject with simulated data from PCBaSe Sim . We found that observed and simulated transition profiles were overlapping and consistent, suggesting that PCBaSe Sim estimates were accurate and reliable. Moreover, in PCBaSe Sim , it is possible to set baseline parameter (e.g. number of subjects, initial state, subjectand disease-specific features, temporal duration of the simulation) prior to starting the simulation process. During the simulation, individual features are constantly updated (e.g. age increase, comorbidity changes [9]) and used for estimation of transition probabilities. Once a simulation is complete (i.e. end of virtual follow-up time), it is possible to retrieve every change in both treatment and individual characteristics (i.e. age, CCI, and risk group).
Our proposed state transition modelling approach consistently differs from conventional survival analyses and prediction tools, which focus on a single outcome over time, with little information regarding the actual path that leads to a specific outcome and the changes in the disease trajectory [17]. Conventional survival analyses inevitably result in a consistent loss of information and accuracy, especially when dealing with extended follow-up times. We have previously shown that it is possible to model temporal changes in comorbidity for cancer patients [9]. In PCBaSe Sim , thanks to the continuous update of baseline characteristics at the end of every time step, not only are the predictions likely to be accurate, but also is it possible to keep track of individual and disease-specific features and trajectories. Although the difference in model fit between time-updated CCI and CCI only measured at baseline was small (data not shown), we used the time-updated CCI since this approach was crucial to solve the problem of nonregistered AS ➔ WW transition [12]. Therefore, in order to be congruent with our previous publication [12] as well as with future studies on health economy, we support the use of time-updated CCI since increases in comorbidity are related to increased costs and therefore affect treatment decisions.
Moreover, PCBaSe Sim makes it possible to model disease trajectories over a long time-horizon and can therefore make long-term predictions even in the absence of "observed" (i.e. registered) data. Eventually, the flexibility of the model allows to include novel modelling parameters and states.
The main limitation of our proposed model is the absence of an external validation. However, PCBaSe Sim is ready to be tested in an external setting, and the authors welcome collaborators to validate our simulation program. It is worth noticing that in PCBaSe Traject , we had to use left truncation on January 1, 2006 since the Swedish Prescribed Drug Registry was initiated on July 1, 2005 [8]. The relatively low number of diagnosis during the early PCBaSe Rapid period is caused by this left truncation. This could theoretically limit the predictive power of PCBaSe Sim . On the other hand, we argue that the most recent data are the most valuable for prediction for current patients and that therefore it is advantageous that the impact of historically older diagnosis and treatments on predictions was limited by the use of this approach. Further limitations include the use of administrative data for the definition of comorbidities, though it has previously been shown that the accuracy of ICD codes for discharge diagnoses in the Patient Registry is high in the range of 85-95% [18]. Another limitation is represented by missing data related on some transitions/risk categories. However, we have previously [12] and in this current study provided reliable solutions to these issues.