Distinctive microscopic patterns for stage ta versus T1 bladder tumors
At least three morphological features have been identified to distinguish between stages Ta (Fig. 1a-c) and T1 (Fig. 1d-f) bladder tumors. The first pattern is desmoplastic reaction, which is characterized with dense fibrosis around the nests of T1 tumor cells (Fig. 1d). This pattern is most definite for invasion, but T1 lesions often lack it. The second pattern is retraction artifact, which is the result of tissue shrinkage after dehydration during tissue processing, seen around the nests of T1 tumor cells (Fig. 1e). The third pattern is more abundant, pinker cytoplasm in T1 tumor cells, presumably due to higher uptake of eosin, compared with that of Ta tumor cells (Fig. 1f). Although pathologists usually make a diagnosis of tumor invasion based on these patterns under microscope, a quantitative representation will allow automatic extraction and analysis of the patterns in H&E-stained slides.
H&E-stained slide digitalization, image processing and feature extraction
We obtained 1177 H&E-stained histopathology images of Ta or T1 bladder tumors from the archive in the Department of Pathology and Laboratory Medicine at URMC. To digitize these slides, each image was captured at × 100 magnification with 2048 × 2048 pixels. Although the overall images were very clear, a dark spot was often found at the lower right-hand corner. We therefore cropped and tiled the central part of the raw images to get smaller ones with 700 × 700 pixels.
To extract objective morphological information from these images, we used ImageJ and CellProfiler to extract image patterns into numerical numbers. We therefore built nine fully automated image pattern extraction pipelines to capture the above three microscopic patterns. Due to the complexity of pathological images, each pattern consisted of various features. The general procedure of feature extraction is described below. We first masked unwanted areas using methods like color thresholding and matrix subtractions before extracting the features. Since all of the raw images were consistent in staining quality, the parameters for extracting each feature were kept the same across all images. The image features included nuclear size distribution, crack edge, sample ratio, distribution of pixel intensity in the connective tissue and cytoplasm, as well as the shape of connective tissue and nuclei of tumor cells. The numerical representation of the features was outputted in spreadsheets and placed in columns.
For example, to extract the retraction artifact pattern, we developed a pipeline to differentiate two types of non-tissue regions in a H&E-stained image, one was around cells (i.e., small space around cells) and the other was between tissue parts (i.e., large space between tissue edges) (Fig. 2a). To only catch the small space surrounding cells (named “cracks” for simplicity), we first converted an original color image to a monochrome image with black or white color on each pixel and all non-tissue regions were in white (Fig. 2b). Then we converted the original color image to an 8-bit grayscale image. The regions with more than 40 pixels in diameters were considered to be the regions between tissue parts, which were shown in white (Fig. 2c). This 8-bit image was then converted to a 1-bit image with black and white colors inverted; now the inter-tissue space was shown in black (Fig. 2d). Then we combined images shown in Fig. 2b and d to get the final image in which the inter-tissue spacer was masked and cracks were shown in white (Fig. 2e). The number of pixels in white regions represented the size of cracks around cells.
We also developed pipelines (Supplementary Figures S2 and S3) to extract features in the pinker cytoplasm pattern and the desmoplastic reaction pattern. Note that each of the three microscopic patterns was extracted separately, and the numeric representation of each pattern was later combined into a large spreadsheet in the CSV format, in which each row represented an image and each column represented a feature. For every image (out of 1177), 740 quantitative features were extracted to represent the three microscopic patterns.
Unsupervised clustering of cancer images
To understand whether extracted features were sufficient to differentiate the histopathological images of Ta and T1 tumors, we set out to conduct a cluster analysis of the features. We first reduced data dimension through principal component analysis (PCA) because, through PCA, we were able to rank top components by their eigenvalues. However, as shown in Fig. 3a-b, plotting the top components with the highest eigenvalue failed to find recognizable clusters. In addition, by performing k-means analysis on PCA components, we found no apparent clusters between k = 2 and k = 9 (Fig. 3c-d). Combining the PCA and k-means analyses, we found that the non-invasive and invasive tumor images were highly overlapped. Splitting the clusters resulted in less than 0.006 in information gain. These data suggested that non-invasive and invasive bladder cancer images were not separable with simple linear transformation. Therefore, supervised learning methods were considered.
Feature reduction and supervised classification of cancer images
To select meaningful ones from the 740 features extracted by ImageJ and CellProfilers, we first manually trimmed questionable features that were related to time (i.e., time when images are processed or taken), index (i.e., labels of images), descriptive string (i.e., initials of processing methods or channels of image processing), or those containing missing values ‘N/A’ as the results of ImageJ and CellProfilers processing. In addition, the features containing no numeric values were also removed. As a result, 696 features were selected for further analysis.
Given that the training set contained 930 images, 696 features might raise the concern of overfitting. To address this concern, we reduced the number of features by employing decision tree (DT) with k-fold cross-validation to rank the relative importance of the features. Specifically, we first used all 696 features as the input to build 20 forests, each with 40 trees. Similar to RF, each tree was constructed by random samples, but the number of features was fixed to 696. The DT method was used to evaluate the importance of each feature by averaging the importance values of the feature in all trees of a forest. We therefore ranked the relative importance of all 696 features based on their average importance values. This rank determined the order of the features added to ML models. That is, after measuring the impact of the first feature, we iteratively added the next feature in the rank to the models. As shown in Fig. 4a, as the features were added in the ranking order, the performance of 6 ML classifiers including PNN, increased and reached a plateau between 70th and 100th features (Fig. 4b). After adding 200 features, the performance started to drop (Fig. 4a). To examine whether the ranking order of features was critical for the observed tendency, we randomized the feature order and found that a plateau was still reached between 70th and 100th features (Supplementary Figure S4A-B). This result suggests that the DT method successfully selects the most important 100 features from the original 696 features.
To compare further the two feature sets (100 vs. 696) in predicting Ta and T1 bladder cancers, we used 6 ML classifiers, including PNN [23,24,25], RF [26, 27], SVM [28], bagging (Adaboost) [29], LR [30], and MLP. Three metrics were used to evaluate the performance of the classifiers, including accuracy, ROC curve, and the AUC. We found that the average accuracy was over 90% for all classifiers (Fig. 5). Moreover, the 100-feature set outperformed the 694-feature set in five out of six classifiers, except LR (Fig. 5). The same trend is observed in ROC and AUC (Fig. 6a-b and Supplementary Figure S5A-B). Of note, PNN outperformed other classifiers with the AUC of 0.99 (Fig. 6b and Supplementary Figure S5B) and the accuracy of 96.7% (Fig. 5). Overall, our work clearly showed that the top 100 features generally had a higher predictive power than the 696 features.
To examine the performance of deep learning models on our data, we used both pre-trained VGG16 and VGG19 networks to extract features. Specifically, we took the convolutional base of the networks, ran the Ta and T1 cancer images through it, and trained a new classifier on top of the output. We found that the accuracies of VGG16 and VGG19 reached 84 and 81%, respectively (Fig. 7a and Supplementary Figure S6A), whereas their AUC values were 0.926 and 0.912, respectively (Fig. 7b and Supplementary Figure S6B). Our results showed that the general ML classifiers outperformed deep learning models, suggesting that, for cancer histopathological images, feature extraction based on domain knowledge performed better than computer-based feature extraction.
Relative importance of three microscopic patterns
To assess the relative importance of the three microscopic patterns in predicting non-invasive versus invasive bladder cancer images, we separated the 696 features into three groups and assessed the performance of the 6 ML classifiers. We found that features extracted from the desmoplastic reaction pattern had the highest overall accuracy of 90.5% with the average AUC values of 0.98 (Fig. 8a and Supplementary Figure S7A). By contrast, pinker cytoplasm had 74.5% overall accuracy with the average AUC of 0.825 (Fig. 8b and Supplementary Figure S7B), whereas retraction artifact had 73.4% overall accuracy with the average AUC values of 0.802 (Fig. 8c and Supplementary Figure S7C). It was noteworthy that desmoplastic reaction had 675 features, whereas pinker cytoplasm and retraction artifact had 13 and 15 features, respectively. These observations suggest that the models with the desmoplastic reaction features may be overfitting. Reducing from 675 features to 70 features in the desmoplastic reaction pattern still outperformed the pinker cytoplasm and retraction artifact patterns with an accuracy of over 90% (data not shown). To some extent, all three patterns could distinguish Ta and T1 tumor images with a reasonable accuracy (> 70%), suggesting that some features extracted from these patterns might be correlated (see Discussion).
To understand which features in the desmoplastic reaction pattern are more important, we ranked all features based on 40 DTs. We found that features, such as the number of nuclei and distributions of nuclei sizes, came out at the very top of our ranking (Supplementary Figure S8). These findings were consistent with the main microscopic characteristics of the desmoplastic reaction pattern, in which a large number of inflammatory cells surround the nests of tumor cells. Our result suggests that the desmoplastic reaction pattern contains the most informative features in distinguishing Ta versus T1 bladder tumors.