Colorectal cancer (CRC) is the second most common cancer in the United States for men and women combined [1]. In 2017, there were roughly 135,000 new CRC cases, with 45% of men and 39% of women younger than 65 at the age of diagnosis [2]. Colorectal neoplasia development can take many years, remaining asymptomatic for much or all of this time. The development starts with small pre-cancerous polyps growing in the internal lining of the colon and rectum. These polyps may gradually increase in size or develop advanced histological features. Finally, advanced, precancerous polyps may evolve into invasive adenocarcinoma, eventually spreading locally or systemically through lymph and blood vessels. The five-year survival rate of CRC is 90% when the cancer is confined to the colon and rectum, whereas the five-year survival rate declines to 12% when it has spread to distant locations [3].
For CRC prevention, along with improving the accuracy and convenience of screening tests, there is a need to improve the prediction of tumor incidence and symptom onset via computational and predictive modeling of colorectal neoplasia development. Once this need is addressed, diagnostic screening and surveillance can be better targeted at those at high risk of rapid disease progression. In recent clinical practice, patients are often further classified by the detection of advanced precancerous polyps, which include adenomas and sessile serrated polyps ≥ 10 mm, and adenomas with villous histology or high-grade dysplasia [4, 5]. Individuals with advanced precancerous lesions are more likely to develop other advanced lesions and asymptomatic CRC [6]. With improved prediction of precancerous lesion advancement, population surveillance can become more effective and cost-effective.
As additional evidence emerges from more recent clinical studies, our understanding of colorectal neoplasia development must be updated. Moreover, many of these studies explore the effect of risk factors on this development, e.g., by comparing men and women. As a result, we often need to retrain existing computational models (i.e., update the estimates of model parameters) to quantify such differences, using new data from the same population and/or data from a new population whose features differ from those of previously studied ones. Doing so can provide predictive intelligence for the timely adjustment of CRC screening and surveillance strategies in terms of cost-effectiveness [7,8,9,10,11,12,13]. However, computational models tend to become much more expensive to run as additional risk factors are incorporated. Therefore, there is a need to develop an efficient algorithmic procedure for re-calibrating expensive computational models.
This paper used the predictive modeling of sex-specific colorectal neoplasia development as a proof-of-concept study. We adapted two independently developed and well-established CRC disease models, both of which are individual-based state-transition models [15, 22]. Because of CRC-related behavior changes and available clinical interventions (e.g., polypectomy), real-world patient medical records cannot provide sufficient age- and sex-specific incidence information about colorectal neoplasia development under natural circumstances. In response, we resorted to model calibration against sex-specific prevalence data on key CRC preclinical stages. Note that there are multi-year population surveillance studies that collect colonoscopy images and derive sex-specific prevalence on the preclinical natural history, e.g., Brenner et al. [14].
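To make the notion of an individual-based state-transition model concrete, the following is a minimal sketch, not the V/NCS or CMOST model. The four states, the annual progression probabilities, the cohort size, and the function name `simulate_prevalence` are all hypothetical choices made for illustration. The sketch only shows how sex-specific prevalence by age, the kind of quantity used as a calibration target, can be read off a simulated cohort.

```python
import random

# Hypothetical preclinical states, ordered by progression (illustration only).
STATES = ("healthy", "adenoma", "advanced_adenoma", "preclinical_crc")

# Hypothetical annual probabilities of moving to the next state.
# These numbers are NOT calibrated values from any study.
ANNUAL_PROGRESSION = {
    "men": (0.010, 0.030, 0.040),
    "women": (0.007, 0.025, 0.035),
}


def simulate_prevalence(sex, n_people=5000, start_age=20, end_age=80, seed=0):
    """Simulate a cohort year by year and return {age: {state: fraction}}."""
    rng = random.Random(seed)
    probs = ANNUAL_PROGRESSION[sex]
    state = [0] * n_people  # every individual starts in the "healthy" state
    prevalence = {}
    for age in range(start_age + 1, end_age + 1):
        for i in range(n_people):
            s = state[i]
            # Individuals in the last state have no further transition here.
            if s < len(probs) and rng.random() < probs[s]:
                state[i] = s + 1
        counts = [0] * len(STATES)
        for s in state:
            counts[s] += 1
        prevalence[age] = {
            name: counts[k] / n_people for k, name in enumerate(STATES)
        }
    return prevalence


if __name__ == "__main__":
    for sex in ("men", "women"):
        at_60 = simulate_prevalence(sex)[60]
        print(sex, "adenoma-or-worse prevalence at 60:", round(1 - at_60["healthy"], 3))
```

In a calibration setting, the entries of `ANNUAL_PROGRESSION` would play the role of the unknown parameters, and the simulated prevalence would be compared with observed prevalence by age group and sex.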
Only a handful of papers in the CRC computational modeling literature have reported their model calibration work in detail. Roberts et al. [15] developed the V/NCS model, a discrete-event simulation model used in the current study, on a self-created object-oriented simulation platform, with a focus on the modeling of CRC events and the relationships among them. The authors reported in their prior manuscripts (e.g., [16,17,18]) a series of heuristic model calibration activities against epidemiological adenoma prevalence and CRC incidence data. Erenay et al. [19] developed an individual-based event-driven state-transition simulation that mimics the natural history of metachronous colorectal cancer (MCRC) for a 5-year period following the treatment of primary CRC. The model comprises five states, namely polyp-free, polyp, MCRC, metastatic-MCRC, and MCRC-related death. The authors estimated six unknown parameters of the natural history of MCRC by calibrating this simulation against two calibration targets, namely the 5-year MCRC incidence and mortality rates, using the least sum-of-squared-errors principle. For the calibration, the authors ran the simulation model exhaustively with every possible combination of the unknown parameters and selected those combinations whose simulated outputs matched the benchmark statistics of a well-defined patient cohort derived from the SEER database.
Rose et al. [20] proposed an individual-based state-transition model consisting of two interacting submodels, namely a continuous-time disease-progression submodel and a discrete-time Markov submodel for surveillance and retreatment. The key components for modeling the disease progression are recurring transitions to unresectability and symptom onset, each of which is governed by a transition time modeled with an exponential distribution. The authors estimated seven unknown parameters of disease progression by calibrating this simulation against seven observable outcomes reported in Pietra et al. [21]. They developed a calibration procedure that consists of several rounds of calibration with increasingly narrowed candidate parameter sets and against a series of calibration targets. Prakash et al. [22] developed the CMOST model, an open-source framework for the microsimulation of CRC screening strategies that is also used in our study and that facilitates automated parameter calibration against epidemiological adenoma prevalence and CRC incidence data. The authors used a heuristic greedy algorithm followed by Nelder-Mead optimization [23] to minimize the squared error between the benchmark values and the corresponding model predictions.
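The exhaustive, least-squares style of calibration described above can be sketched as follows. This is an illustrative toy, not the procedure of any cited study: the three-state cohort model `toy_model`, its two parameters, the grid values, and the synthetic targets are all assumptions introduced here. The sketch evaluates every parameter combination on a grid and keeps the one with the smallest sum of squared errors against two targets (5-year cumulative incidence and mortality).

```python
import itertools


def toy_model(p_incidence, p_death, years=5):
    """Hypothetical deterministic cohort model (illustration only).

    Each year a fraction p_incidence of disease-free individuals develops
    disease, and a fraction p_death of diseased individuals dies.
    Returns (cumulative incidence, cumulative mortality) after `years`.
    """
    free, diseased, dead = 1.0, 0.0, 0.0
    cumulative_incidence = 0.0
    for _ in range(years):
        new_cases = free * p_incidence
        deaths = diseased * p_death
        free -= new_cases
        diseased += new_cases - deaths
        dead += deaths
        cumulative_incidence += new_cases
    return cumulative_incidence, dead


# Synthetic calibration targets generated from known parameters, so the
# example has a verifiable answer. Real targets would come from cohort data.
TARGETS = toy_model(0.04, 0.20)

# Hypothetical candidate grids for the two unknown parameters.
INCIDENCE_GRID = [i / 100 for i in range(1, 11)]  # 0.01 ... 0.10
DEATH_GRID = [i / 20 for i in range(1, 11)]       # 0.05 ... 0.50


def sum_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


def grid_search_calibration(targets):
    """Evaluate every grid combination and return (best_params, best_sse)."""
    best_params, best_sse = None, float("inf")
    for params in itertools.product(INCIDENCE_GRID, DEATH_GRID):
        sse = sum_squared_error(toy_model(*params), targets)
        if sse < best_sse:
            best_params, best_sse = params, sse
    return best_params, best_sse


if __name__ == "__main__":
    print(grid_search_calibration(TARGETS))
```

The number of model runs grows multiplicatively with each added parameter (here 10 × 10 = 100 runs for two parameters), which is why exhaustive search becomes impractical when each run is an expensive stochastic simulation.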
Sai et al. [24] investigated the efficiency of a Gaussian process-based surrogate modeling approach that approximates the CMOST model to alleviate the computational burden of calibrating it. In contrast to the papers above, we studied a different version of the calibration problem, in which we have the option of using a baseline parameter design from the literature and/or previous studies as the starting point for the model parameter adjustments. In addition, we conducted comparative studies on the effect of a global search as the predecessor of the Nelder-Mead optimization and compared different settings of Nelder-Mead to further improve the calibration efficiency.
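The idea of starting a Nelder-Mead search from a baseline parameter design can be illustrated with the sketch below. It is a simplified, pure-Python Nelder-Mead variant (reflection, expansion, inside contraction, shrink) written for illustration, not the implementation or settings used in this study. The logistic prevalence curve `toy_prevalence`, the "true" parameters, and the baseline values are hypothetical; in practice the objective would wrap an expensive simulation run.

```python
import math

AGES = (50, 60, 70, 80)
TRUE_PARAMS = (-1.0, 0.5)   # hypothetical parameters of the "new" population
BASELINE = (-1.2, 0.4)      # hypothetical baseline design from a prior study


def toy_prevalence(params, age):
    """Hypothetical logistic prevalence curve (stand-in for a simulation)."""
    level, slope = params
    z = level + slope * (age - 60) / 10.0
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic targets so the example has a known answer.
TARGETS = [toy_prevalence(TRUE_PARAMS, age) for age in AGES]


def sse(params):
    """Sum of squared errors between model prevalence and targets."""
    return sum((toy_prevalence(params, a) - t) ** 2 for a, t in zip(AGES, TARGETS))


def nelder_mead(f, x0, step=0.1, max_evals=500, tol=1e-12):
    """Simplified Nelder-Mead. Returns (best_x, best_f, n_evals)."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):  # initial simplex: x0 plus one offset vertex per axis
        v = list(x0)
        v[i] += step
        simplex.append(v)
    values = [f(v) for v in simplex]
    evals = n + 1
    while evals < max_evals:
        order = sorted(range(n + 1), key=lambda k: values[k])
        simplex = [simplex[k] for k in order]
        values = [values[k] for k in order]
        if values[-1] - values[0] < tol:
            break
        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        f_r = f(reflected)
        evals += 1
        if f_r < values[0]:  # very good: try expanding further
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            f_e = f(expanded)
            evals += 1
            if f_e < f_r:
                simplex[-1], values[-1] = expanded, f_e
            else:
                simplex[-1], values[-1] = reflected, f_r
        elif f_r < values[-2]:  # better than second worst: accept
            simplex[-1], values[-1] = reflected, f_r
        else:  # contract toward the centroid, or shrink toward the best
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            f_c = f(contracted)
            evals += 1
            if f_c < values[-1]:
                simplex[-1], values[-1] = contracted, f_c
            else:
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)]
                    for v in simplex[1:]
                ]
                values = [values[0]] + [f(v) for v in simplex[1:]]
                evals += n
    k = min(range(n + 1), key=lambda i: values[i])
    return simplex[k], values[k], evals


if __name__ == "__main__":
    x, fx, n_evals = nelder_mead(sse, BASELINE)
    print("estimate:", x, "sse:", fx, "model runs:", n_evals)
```

The returned evaluation count is the quantity of interest when each evaluation is an expensive simulation: comparing it across different starting points (a literature baseline versus the output of a global search) or different Nelder-Mead settings is one simple way to frame the efficiency comparison described above.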
It is evident that sex plays an essential role in CRC incidence and progression, in addition to a wide range of risk factors, including family history [25] and lifestyle-related ones such as smoking [26] and a red-meat diet [27], among other factors [28,29,30]. More men than women are diagnosed with CRC. While men and women have similar genetic predispositions, there are substantial differences in CRC incidence between the two sexes [31, 32]. In addition, several studies suggest that females diagnosed with CRC have significantly longer survival than males [33, 34]. Further, men have a higher prevalence of adenomas than women. For example, Ferlitsch et al. [35] reported that adenoma prevalence was higher among men than women by an absolute difference of 10%, in a study of more than 44 thousand participants in a national screening colonoscopy program in Austria. In a study of more than 50,000 Polish participants, Regula et al. [36] reported that advanced precancerous polyps were found in a significantly higher percentage of men than women. Brenner et al. [14] reported that adenoma prevalence (both advanced and non-advanced) was substantially higher in men than in women across age groups, based on an observational study of more than 3.6 million German participants. In contrast to the above observational studies, we applied computational and predictive modeling to differentiate colorectal neoplasia development between the two sexes and across age groups.
From the above literature review, we concluded that existing studies have not addressed several challenges in the modeling and model calibration of CRC natural history and beyond. The main contribution of the current study is the development of an efficient re-calibration procedure for expensive stochastic simulations of disease natural history. We believe our method is well suited to individual-based state-transition disease models with a high-dimensional model parameter space, an unbounded value range for each parameter, some prior knowledge of the association among different parameters, and computationally expensive simulation runs. Further, through our proof-of-concept study, we quantified the age-dependent sex differences in colorectal neoplasia development.