A multiscale and multiparametric approach for modeling the progression of oral cancer
© Exarchos et al.; licensee BioMed Central Ltd. 2012
Received: 25 April 2012
Accepted: 1 November 2012
Published: 22 November 2012
Skip to main content
© Exarchos et al.; licensee BioMed Central Ltd. 2012
Received: 25 April 2012
Accepted: 1 November 2012
Published: 22 November 2012
In this work, we propose a multilevel and multiparametric approach in order to model the growth and progression of oral squamous cell carcinoma (OSCC) after remission. OSCC constitutes the major neoplasm of the head and neck region, exhibiting a quite aggressive nature, often leading to unfavorable prognosis.
We formulate a Decision Support System assembling a multitude of heterogeneous data sources (clinical, imaging tissue and blood genomic), aiming to capture all manifestations of the disease. Our primary aim is to identify the factors that dictate OSCC progression and subsequently predict potential relapses of the disease. The discrimination potential of each source of data is initially explored separately, and afterwards the individual predictions are combined to yield a consensus decision achieving complete discrimination between patients with and without a disease relapse. Moreover, we collect and analyze gene expression data from circulating blood cells throughout the follow-up period in consecutive time-slices, in order to model the temporal dimension of the disease. For this purpose a Dynamic Bayesian Network (DBN) is employed which is able to capture in a transparent manner the underlying mechanism dictating the disease evolvement, and employ it for monitoring the status and prognosis of the patients after remission.
By feeding as input to the DBN data from the baseline visit we achieve accuracy of 86%, which is further improved to complete discrimination when data from the first follow-up visit are also employed.
Knowing in advance the progression of the disease, i.e. identifying groups of patients with higher/lower risk of reoccurrence, we are able to determine the subsequent treatment protocol in a more personalized manner.
Oral cancer refers to the cancer that arises in the head and neck region, i.e. in any part of the oral cavity or oropharynx. OSCC constitutes the 8th most frequent neoplasm in humans according to the worldwide cancer incidence ranking, and has been primarily associated with smoking and alcohol consumption . In terms of sex, men face twice the risk of being diagnosed with oral cancer than women . Moreover, sun exposure constitutes a significant risk factor, particularly for the cancer of the lip. There has also been suggested in the literature, that infection with the Human Pappilomavirus (HPV) is associated with oral cancer, especially with occurrences in the back of the mouth (oropharynx, base of tongue, tonsillar pillars and crypt, as well as the tonsils themselves) . Although current advances in treatment protocols  have led to high rates of successful eradication of the disease (i.e. a state called remission), a significant percentage, in the range of 25-48% , of remittent patients suffer from locoregional relapses, owed to the deeply infiltrative nature of these tumors, as well as the significant potential for occult neck metastasis . The accurate modeling of the disease progression and consequently the timely identification of a potential reoccurrence can provide patient-specific treatment.
In the literature, several studies have identified factors affecting the oral cancer invasion, progression and metastasis, both from a clinical and molecular perspective; yet they still remain limited in number and efficacy, leading to unsatisfactory results . Specifically, [7, 8] derive a gene expression profile in order to diagnose lymph node metastasis originating from a primary head and neck carcinoma; similarly, in , future metastases of head and neck carcinoma are predicted. In [10–12], the progression of tongue carcinoma is studied, and a subset of genes is identified, able to predict potential metastasis of the primary tumor is in the lymph nodes. Recently Reis et al.  have performed a meta-analysis based on five publicly available microarray datasets, and identified a four-gene signature that is of prognostic value for oral cancer reoccurrence. However, it should be noted that all the aforementioned approaches do not take into account the temporal dimension of the disease and its actual evolution over time. Other approaches identified in the broader field of biomedical engineering deal with pairs of consecutive time-slices rather than representing the follow-up as a whole .
The proposed approach encompasses in a complementary manner a multitude of heterogeneous data, varying in scale and dimension, therefore, "framing" all possible manifestations of the disease, from a clinical, imaging and genomic point of view. Among the aims of this work is to identify a limited subset of factors that are highly correlated with oral cancer progression, thus, formulating the disease profile. Based on this profile, we are able to calculate the risk of relapse for each patient and subsequently discriminate the patients in high and low risk groups. Moreover, information derived from time-varying parameters (i.e. gene expression from circulating blood cells) is employed in order to model the evolvement of the disease over time, and capture the temporal dimension of the disease as well. The outcome of this analysis is the representation of the disease mechanism which is subsequently used in order to conjecture if, when and why a patient is prone to suffer a relapse of the disease. Having a timely and accurate estimation of the relapse probability can be proven quite helpful towards determining the most proper treatment for a specific patient; i.e. patients in high risk can be monitored more intensely, whereas, patients in low risk are subject to less aggressive treatment.
In the sections that follow, we lay out our approach, which is organized in two types of analyses, namely: i) Baseline Data Analysis, which involves the analysis of data collected during the baseline visit of each patient, and are used to stratify the patient either at high or low risk of developing a relapse; ii) Disease Evolution Monitoring, which employs data varying over the follow-up period, i.e. gene expression from circulating blood cells, in order to assess the relapse probability coupled with the approximate timing that this relapse is more likely to occur.
Clinical features examined in this work
Body mass index (BMI)
Grade of differentiation
Anneroths Mod Score
Immunosuppressor Treatments Presence
Type Of Infection
Family History Of Malignance
Type Of Physical Agent
Tumor Maximum Diameter
Fe Haematic Concentration
Depth Of Invasion
Quantity Per Day
Hb Haematic Concentration
Lympho Plasmacytic Rection
B12 Vitamins Haematic Concentration
Lympho Plasmacytic Invasion
A Vitamins Haematic Concentration
E Vitamins Haematic Concentration
Degree Of Cells Keratinisation
Number of Mitoses per 10HPF
Imaging features employed in this work
Contrast take-up rate
Side of lymph nodes
Minor Axis Bigger than 10mm
Side Relative to Tumor
Extra Nodal Spreading
Number of Lymph Nodes
Number of Lymph Nodes Bigger than 3
Site of lymph nodes
As shown in Figure 2, the flow of operations applied upon each source of data includes a series of preprocessing steps; next, the resulting input vector is subject to feature selection in order to maintain the most discriminatory features, which are subsequently fed into a classification algorithm that assigns each patient either as high risk or low risk in terms of the calculated relapse probability. Eventually, a multi-type classification scheme combines all four individual single-classifier decisions, in order to yield an overall outcome. The preprocessing steps, the feature selection schemes as well as the classification algorithms have been invoked through the Weka Machine Learning software .
During this step, we utilize a series of preprocessing steps in order to enhance the quality of the input. Specifically, features with high percentage (>90%) of missing values are omitted from our analysis, whereas the values of the features with less percentage of missing values are imputed with the modes and means, in the case of nominal and numeric features, respectively. Another important issue present in our dataset is that the enrolled patients are unevenly distributed in the classes of relapsers and non-relapsers, resulting in considerable class imbalance. For this purpose, we employ the Synthetic Minority Oversampling Technique (SMOTE) , which utilizes a k-NN approach in order to oversample the minority class. It should be noted that SMOTE neither discards potentially useful samples nor merely replicates existing samples, therefore, posing an advantageous solution compared to sampling-based approaches. On the contrary, undersampling of the majority class may lead to loss of significant information, whereas, oversampling of the minority class does not add any new information and may duplicate possibly noisy or even erroneous instances.
Especially for the case of genomic data (both tissue and blood ones), the initial feature vector consisting of 45,015 gene expression values, is subject to certain steps in order to enhance the quality of the raw input data, that is subsequently employed for data analysis; to this end, redundant and control genes are removed, as well as genes with low data quality or high percentage of missing values. The outcome is a set of 33,491 high quality genes that are further analyzed in order to procure a limited subset of genes that are mostly differentially expressed between relapsers and non-relapsers. For this purpose we employ the Significance Analysis of Microarrays (SAM) algorithm , which performs multiple gene specific t-tests on permutations of the initial dataset, in order to account for the initially enormous number of input genes.
As soon as the feature vectors from each source of data are assembled, we either feed them directly as input to the classification algorithms, or employ a feature selection algorithm in order to discard redundant or correlated features and maintain a more informative subset, thus, facilitating the classification task as well. For this purpose two feature selection algorithms have been employed, namely the Correlation-based Feature Subset selection (CFS)  and the wrapper algorithm  (with the default settings of the Weka software). CFS maintains features that exhibit low correlation among them and high correlation with the class attribute; on the other hand, the wrapper algorithm evaluates all possible feature combinations and retains the best performing subset, tailored to the target classification algorithm.
Next, we examine the performance of six popular classification algorithms  towards the discrimination between patients with and without disease relapse; specifically, we employ Bayesian Networks (BN), Naive Bayes (NB), Artificial Neural Networks (ANN), Support Vector Machines (SVM), Decision Trees (DT) and Random Forests (RF).
For evaluation purposes, we employ two techniques, namely 10-fold cross validation and leave-one-patient-out. During 10-fold cross validation the dataset is split into 10 stratified sets, whereby 9/10 are used for training and the remaining 1/10 is used for testing; after a full rotation, the results over the 10 testing sets are averaged in order to procure the overall performance of the algorithm. The leave-one-patient-out technique is quite similar to 10-fold cross validation, where the number of folds is equal to the number of patients in the dataset. Specifically all patients but one are used for training and the remaining one is used for testing in a round-robin manner. The evaluation metrics used to compare the employed classification schemes are: sensitivity, specificity, accuracy, the Kappa statistic, as well as the Area Under Curve (AUC), which is an evaluation index obtained from the Receiver Operating Curve (ROC) analysis.
Initially, the gene expression data obtained from 45,015 genes are filtered in order to omit duplicate and control genes as well as the ones that are of low quality; it should be noted that all microarray experiments have been conducted using the same platform, the same array design and the same feature extraction software version in order to exclude unwanted data perturbations other than biological variability. Next, we identify the genes that are mostly differentially expressed between relapsers and non-relapsers; in addition, we extract a personalized gene subset (i.e. Personalized Genetic Signature) aiming to capture patient-specific perturbations of the disease progression within its molecular basis. Both the aforementioned inputs are fed as input to a Dynamic Bayesian Network (DBN)  in order to model the disease evolution over the follow-up and predict potential relapses.
After applying some basic filtering steps upon the gene expression data (i.e. omission of duplicate, control and low-quality genes), we are left with a set of 33,491 genes, that are fed to the next steps of our analysis. The SAM algorithm  is subsequently employed, which analyzes differentially unpaired time-course gene expression data between two groups; specifically we perform the Wilcoxon statistical test which identifies those genes that are mostly differentially expressed between the two groups of patients in all time-slices of the follow-up.
In addition, we extract a patient-specific genetic indicator denoting the progression of the disease for a specific patient; for each patient we compare the gene expression values before treatment (cancerous profile) and during the first stages of remission (cancer-free profile). The outcome is a limited set of differentially expressed genes representative for each patient, allowing for personalized modeling of the disease evolvement. The expression of these genes from all succeeding follow-up visits is compared in turn with the cancerous and the cancer-free profile, calculating the correlation and the Euclidean distance; these metrics provide, respectively, a qualitative and quantitative measure of the patient's prognosis. In the case of the Euclidean distance a weighted variant is employed which takes into account the significance of each gene in the personalized genetic signature. This weighting factor is proportional to the differential expression of each gene between the cancerous and the cancer-free profile.
In order to estimate the best performing DBN architecture, i.e. the dependencies among the employed variables within the same time-slice (intra-slice dependencies), as well as across successive time-slices (inter-slice dependencies), we employ two search algorithms, namely the Greedy and the Simulated Annealing. Based on the trained model, we are able to conjecture the values of all variables from future time-slices, including of course the probability of reoccurrence, by using the junction tree algorithm for inferencing from the DBN. In order to formulate and fine-tune the DBN network, as well as for inferencing from the trained model, the miniTUBA system has been used . Due to the relatively limited number of patients available for evaluating the Disease Evolution Monitoring, we employ the leave-one-patient-out technique which is rather suited for making the most out of refined datasets.
In the sections that follow, we present the results obtained in our effort to stratify the patients in high and low risk groups based on the reoccurrence probability. Initially, each source of data is treated independently and subsequently all single-classifier decisions are combined using a multi-type classifier. In the first step, each source of data is subject to preprocessing, and then the resulting feature vector is fed to the target classifier either unaltered or after applying certain feature selection algorithms, i.e. CFS and wrapper. As for the actual classification task, we have examined the performance of the following classifiers: Bayesian Networks, Naive Bayes, Artificial Neural Networks, Support Vector Machines, Decision Trees and Random Forests.
Results obtained using the clinical data and all classification schemes
No feature selection
The employment of the CFS algorithm for feature selection maintains the following features as most discriminatory: smoker, tumor thickness, lymphoplasmacytic rection, perineural invasion, num mitoses HPF, surgical margins, p53 stain and N staging. The employment of the wrapper algorithm has pinpointed the following features for each classification algorithm: BN: ecog status, cholesterol, grade differentiation and N staging; NB: allergies, cholesterol, depth invasion, lympphoplasmacytic rection and N staging; ANN: ecog status, cholesterol, depth of invasion and N staging; SVM: ecog status, smoking duration and N staging; DT: depth of invasion, p16ink4a stain and N staging; RF: quantity of cigarettes, galvanic current, eating habits, BMI, depth invasion and N staging.
Results obtained using the imaging data and all classification schemes
No feature selection
The employment of the CFS algorithm for feature selection resulted in a refined feature vector containing the following features: extra-tumor spreading, extra-nodal spreading, texture (lymph node), site (lymph node), side (lymph node) and number of lymph nodes. Afterwards, the employment of the wrapper feature selection algorithm has resulted in the following lists of features for each classification algorithm: BN: extra-tumor spreading and site (lymph node); NB: extra-tumor spreading and site (lymph node); ANN: extra-tumor spreading and site (lymph node); SVM: extra-tumor spreading, floor of the mouth invasion and site (lymph node); DT: extra-tumor spreading, perineural infiltration (tumor), bone infiltration (tumor) and site (lymph node); RF: extra-tumor spreading and site (lymph node).
List of the most significant genes as pinpointed by the SAM algorithm
Results obtained using the tissue genomic data and all classification schemes
No feature selection
The utilization of the CFS algorithm for feature selection, subsequently retains the following features: TCAM, SOD2, AMDHD1, AY358224, PHACTR1, AK026836 and RPRM. Afterwards, the employment of the wrapper algorithm coupled with a different target classification algorithm each time has retained the following features: BN: TCAM1, AMDHD1, AY358224, PHACTR1 and RPRM; NB: TCAM1, SOD2, AMDHD1, PHACTR1 and RPRM; ANN: TCAM1, SOD2, AMDHD1, PHACTR1, SLC5A12, AK026836 and RPRM; SVM: TCAM1, SOD2, FCAR, AMDHD1, AY358224 and PHACTR1; DT: AMDHD1, AY358224, PHACTR1, AK026836 and RPRM; RF: SOD2, FCAR, AMDHD1, AY358224, PHACTR1 and RPRM.
Comparison among the most prominent classification schemes
Source of gene set
# of genes
Union of literature and current work
It is noteworthy that even though both gene sets perform quite satisfactory, the union of the two gene sets significantly ameliorates the obtained results, enhancing the generalization capability of the classification procedure.
List of most significant genes as pinpointed by the SAM algorithm
Results obtained using the blood genomic data and all classification schemes
No feature selection
The employment of the CFS algorithm maintains the following genes as most significant: A_24_P221960, THC2399272, BM683433, OXCT2, A_24_P230388, A_32_P57247 and AL566369. Next, the employment of the wrapper algorithm maintains the genes that are specifically tuned to achieve the best performance using a specific classification algorithm. Those genes are for BN: A_24_P221960, THC2399272, BM683433 and OXCT2; for NB: THC2410448, A_24_P221960, BM683433, OXCT2, A_32_P57247 and A_24_P942151; for ANN: A_24_P221960, THC2399272 and CN391963; for SVM: THC2410448, OXCT2 and A_24_P942151; for DT: A_24_P221960; for RF: A_24_P221960 and THC2399272.
Best performing classifications schemes based on each source of data
Source of data
Union of genes from literature and current work
No feature selection
List of mostly differentially expressed genes over the follow-up period
Overall performance of the DBN model
Baseline & follow-up #1
In this work we have collected and analyzed a broad set of heterogeneous data from various sources, i.e. clinical, imaging tissue genomic and blood genomic, in order to capture the progression of the disease during remission and predict potential disease relapses. For this purpose a twofold analysis has been performed: i) Baseline Data Analysis which employs solely data obtained at the baseline visit aiming to identify a potential disease relapse and ii) Disease Evolution Monitoring, that explores gene expression from genes obtained at predefined intervals over the follow-up coupled with a personalized genetic signature in order to capture the temporal progression of the disease and hence predict the approximate timing of a potential relapse.
We observe that the currently proposed methodology exhibits superior results compared to the other methods presented in the literature, nevertheless, direct comparison cannot be performed since different datasets have been employed in each case. It should be noted that the currently employed dataset contains a considerable number of patients, i.e. 86, compared to the other ones in the literature. A significant advantage of the current work is the multitude of data that has been employed (clinical, imaging, tissue genomic and blood genomic) whereby a separate classifier has been trained from each source of data, as well as an overall one that combines the individual classifiers. On the other hand, the other methodologies in the literature exploit mostly genomic data coupled with information from the clinical profile of the patients.
In addition, the modular architecture of the current work allows for inspecting the "verdict" based on each source of data separately and hence conjecturing about the contribution and validity of each type of data. Of course an overall decision is calculated which weighs the individual predictions yielding a more accurate consensus outcome. Alternatively, the combination of all sources of data could be achieved by pooling all data in a single dataset (i.e. the bag of features approach); however, with this approach the intersection of patients across all sources of data would have to be used, and given the uneven distribution of patients this would conclude in a rather limited dataset. In terms of validation, the current work has been evaluated using 10-fold cross validation and the leave-one-patient-out technique, in order to assess thoroughly the achieved performance. The results obtained with the leave-one-patient-out technique are in complete accordance with the ones obtained using 10-fold cross validation, therefore, the former ones are not included in the manuscript.
The best performing classification schemes based on each source of data are summarized in Table 10. The features maintained in each case constitute a minimal subset of features that bears enhanced discriminative potential. Therefore, in terms of prediction, we have come down to a rather refined set of features that can be employed in order to estimate the relapse probability for a specific patient.
The features maintained from each source of data can be inspected independently, however, it should be noted that it is their combination that yields the results presented previously. Based on the clinical data, the best performing classification scheme involves a Decision Tree where the initial input has been stripped off from redundant features using the wrapper feature selection algorithm, maintaining the following features as most significant: depth of invasion, p16ink4a stain and N staging. Especially the depth of invasion and the N staging constitute major factors affecting the disease prognosis. For the case of imaging data the Naive Bayes classifier coupled with the wrapper algorithm for feature selection employs the combination of extra-tumor spreading and (lymph node) site, thus, achieving the highest performance. Next, we move on to the genomic data (from tissue and blood) where the best performing gene subsets are shown in Table 5 and Table 8, respectively. Among those genes, there are certain genes that have been correlated in the literature with the evolvement of oral cancer and its manifestations. Specifically, TCAM1 has been associated with the HPV status of patients with head and neck squamous cell carcinomas ; another example is SOD2 that has been implicated with the progression and metastasis of oral cancer [26, 27]. For the remaining genes more distant and sporadic associations have been mined in the literature for relevant diseases or other types of cancer, e.g. AMDHD1 and PHACTR1 have been linked to tobacco use disorders  or RPRM with colorectal cancers .
Both for the clinical and the imaging input vector, the employment of the wrapper algorithm yields the feature subset with the highest discriminating ability; in accordance with our findings, the wrapper algorithm has been reported in the literature to often outperform other feature selection algorithms due to the fact that it is tuned to the target classification algorithm . However, in the case of tissue and blood genomic data, the output from the SAM algorithm constitutes the best performing gene subset, and further feature selection does not ameliorate the results; this is more or less expected since the SAM algorithm aims at finding those genes that differ the most between the two target classes, and therefore bear enhanced discriminating potential.
An important aspect of the current work that should be highlighted, is the fact that for the tissue genomic data, we have merged the gene subset identified using the current dataset with a set of genes pinpointed in the literature as highly correlated and descriptive of oral cancer reoccurrence. As shown in the Results section, the union of the two gene subsets achieves superior performance compared to the individual sets. Besides the performance amelioration, it is very important that in this manner we consolidate information from other data resources, thus, achieving enhanced generalization capability.
As for the Disease Evolution Monitoring, a substantial advancement is the incorporation of the time dimension, thus, capturing the temporal nature of the disease. This is particularly interesting from a medical point of view, since we are able to conjecture the approximate timing that a potential relapse is more likely to occur. Moreover, the ability of DBNs to capture time-varying parameters, resembles better the actual disease progression and facilitates the modeling of the evolvement over the follow-up period. The employment of a DBN which features a transparent architecture allows for inspecting the interplay among the involved parameters and therefore, reasoning is provided for each decision. DBNs have been elsewhere used in other domains in order to capture time-varying events, yielding quite satisfactory and accurate results.
As shown in the DBN architecture depicted in Figure 5, nine genes have been maintained as most discriminatory and their combination, as represented in the DBN dependencies, can be used to conjecture about the relapse probability of a specific patient. The genes that have been maintained are: HMCN1, RGMA, TSC1, AK023526, NOTCH2, STX6, THC2447689, THC2344152, LEPRE1. It should be noted that the features extracted from the personalized genetic signature that has been described previously, were not maintained during the training of the DBN and therefore they have not been included in the employed architecture. The majority of the aforementioned genes have not been elsewhere associated in the literature with oral cancer progression; however, gene TSC1 is a notable exception since it has been correlated with esophageal cancer as well as the reoccurrence of head and neck cancer [30, 31]. Furthermore, gene AK023526 has been found to constitute a marker for cancer stem cells , and gene NOTCH2 has been associated with tobacco use disorders and certain types of cancer .
The relatively small association of the extracted genes with literature findings related to cancer and more specifically oral cancer, was expected since blood gene expression has been scarcely studied, and therefore few and sporadic references exist in the literature. However, given the fact that even a subset of the currently extracted genes have been identified in cancer related pathways, is quite encouraging and further supports the proof-of-concept that is intended with the approach of Disease Evolution Monitoring in the current work. In any case, the relatively small set of patients, compared to the initially enormous number of genes was a hindrance in the first place, yet the preliminary results obtained using the leave-one-patient-out technique are quite satisfactory.
In this work we have presented a multiscale and multiparametric approach for modeling the progression of oral cancer during remission. Specifically, our approach consist of two main analyses, i) Baseline Data Analysis where clinical, imaging, tissue genomic and blood genomic data were employed in order to predict a potential disease relapse and ii) Disease Evolution Monitoring aiming to capture the progression of the disease based on gene expression data acquired from circulating blood cells, and subsequently exploit this information in order to predict if and roughly when a relapse is more likely to appear. This information can be used to stratify the patients into high and low risk groups according to the relapse probability; hence, the treatment protocol can be either intensified or relaxed accordingly. Moreover, it is very important to unify our study with findings from similar studies alongside with further verification of the results in order to achieve enhanced generalization capability and conjecture meaningful and solidified corollaries.
The proposed approach encompasses heterogeneous sources of data that are expected to assess the disease status from several aspects and therefore, can be proven very helpful towards studying complex diseases, such as cancer.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.