Using autoencoders as a weight initialization method on deep neural networks for disease detection

Background: As of today, cancer is still one of the most prevalent and high-mortality diseases, accounting for more than 9 million deaths in 2018. This has motivated researchers to study the application of machine learning-based solutions for cancer detection to accelerate its diagnosis and help its prevention. Among several approaches, one is to automatically classify tumor samples through their gene expression analysis.
Methods: In this work, we aim to distinguish five different types of cancer through RNA-Seq datasets: thyroid, skin, stomach, breast, and lung. To do so, we have adopted a previously described methodology, with which we compare the performance of 3 different autoencoders (AEs) used as a deep neural network weight initialization technique. Our experiments consist in assessing two different approaches when training the classification model (fixing the weights after pre-training the AEs, or allowing fine-tuning of the entire network) and two different strategies for embedding the AEs into the classification network, namely by only importing the encoding layers, or by inserting the complete AE. We then study how varying the number of layers in the first strategy, the AEs' latent vector dimension, and the imputation technique in the data preprocessing step impacts the network's overall classification performance. Finally, with the goal of assessing how well this pipeline generalizes, we apply the same methodology to two additional datasets that include features extracted from images of malaria thin blood smears and of breast mass cell nuclei. We also rule out the possibility of overfitting by using held-out test sets for the image datasets.
Results: The methodology attained good overall results for both RNA-Seq and image-extracted data. We outperformed the established baseline for all the considered datasets, achieving an average F1 score of 99.03, 89.95, and 98.84 and an MCC of 0.99, 0.84, and 0.98, for the RNA-Seq (when detecting thyroid cancer), the Malaria, and the Wisconsin Breast Cancer data, respectively.
Conclusions: We observed that the approach of fine-tuning the weights of the top layers imported from the AE reached higher results for all the presented experiments and all the considered datasets. We outperformed all the previously reported results when comparing to the established baselines.


Background
Cancer is a label for a group of diseases that is characterized by abnormal and continuous cell growth, with the potential to spread through its surrounding tissues and other body parts [1]. During 2018, cancer was the second leading cause of death globally, accountable for 9.6 million deaths, where around 70% were in developing countries [2]. Throughout the years, and given the evolution of techniques, technology, and treatments in medicine, cancer survival rates have been improving [3]. However, there are still some types that have survival rates of under 20%, such as pancreatic, esophagus, and liver cancers. Its prevalence makes it more crucial to correctly and accurately classify such diseases. For tackling this need, many research groups have been trying to help on accelerating cancer diagnosis, by experimenting and studying the application of machine learning algorithms to this problem [4].
When automatically classifying tumor samples, one approach is to analyze the molecular information derived from the samples, namely their gene expression signatures. Gene expression is the phenotypic manifestation of a gene or genes by the processes of genetic transcription and translation [5]. Studying this gene map can help to better understand cancer's molecular basis, which can have a direct influence on this disease's life cycle: prognosis, diagnosis, and treatment. There are two main cancer genomics projects, The Cancer Genome Atlas (TCGA) [6] and The International Cancer Genome Consortium (ICGC) [7], that aim to translate gene expression, systematizing thousands of samples across different types of cancers. With this elevated number of features, each representing a particular gene, one may find genome-wide gene expression assay datasets in these projects. However, this type of data presents some challenges, because of (1) a low number of samples, (2) an unbalanced class distribution, with few examples of healthy samples, and (3) a high potential of underlying noise and errors, due to eventual technical and biological covariates [8]. This difficulty in gathering data accurately underlies the creation of every dataset. The equipment used to collect the data has intrinsic errors associated (mechanical, of acquisition, and others); hence, the dataset will reflect these errors.
Several authors have chosen the previously mentioned approach of analyzing the gene expression of tumor samples. Many of the developed methodologies in this scope use straightforward supervised training, especially when using deep neural networks (DNNs), relying on their depth to produce the best results. Gao et al. [9] proposed DeepCC, a supervised deep cancer subtype classification framework based on deep learning of functional spectra quantifying activities of biological pathways, robust to missing data. The authors conducted two studies, each with a different cancer detection (colorectal and breast cancer data). The authors claimed that the described method achieved overall higher sensitivity, specificity, and accuracy compared with other classical machine learning methods widely used for this kind of task, namely random forests, support vector machine (SVM), gradient boosting machine, and multinomial logistic regression algorithms, with an accuracy higher than 90%.
Sun et al. [10] proposed Genome Deep Learning (GDL), a methodology aiming to study the relationship between genomic variations and traits based on DNNs. This study analyzed over six thousand samples of Whole Exon Sequencing (WES) mutation files from 12 different cancer types from TCGA, and nearly two thousand healthy WES samples from the 1000 Genomes Project. The main goal of GDL was to distinguish cancerous from healthy samples. The authors built 12 models to identify each type of cancer separately, a total-specific model able to detect healthy and cancerous samples, and a mixed model to distinguish between all 12 types of cancer, all based on GDL. All the experiments were evaluated through: (a) three performance metrics (accuracy, sensitivity, and specificity) and (b) Receiver Operating Characteristic curves, with the respective Area Under the Curve (ROC-AUC). This methodology achieved a mean accuracy of 97.47% on the specific models, 70.08% on mixture models, and 94.70% on total specific models, for cancer identification.
In [11], Kim et al. compared the performances of: (1) a neural network, (2) a linear SVM, (3) a radial basis function-kernel SVM, (4) a k-nearest neighbors classifier, and (5) a random forest when identifying 21 types of cancers and healthy tissues. The classifiers were trained with RNA-Seq and scRNA-seq data from TCGA, where they selected up to the 300 most significant genes expressed for each of the cancer variations. To determine the optimal number of genes for each classifier's binary classification task, the methods mentioned above were trained with 12 different sizes of gene expression datasets (from 5 to 300 genes). When learning with 300 genes, the neural network, the linear SVM, and the radial basis function-kernel SVM models achieved their best performance, with a Matthews Correlation Coefficient (MCC) of 0.92, 0.80, and 0.83, respectively. The k-nearest neighbors and random forest models achieved an MCC of 0.8 and 0.83, respectively, when using 200 genes. Furthermore, the authors identified 10 classes with an accuracy of over 90%, and achieved a mean MCC of 0.88 and a mean accuracy of 0.88, with the neural network classifier.
However, many DNNs, besides the known open challenges regarding their training setting [12], have a higher tendency to overfit, which one can detect when applying the same architecture to unseen data (or to a held-out test). Thus, our motivation focuses on exploring unsupervised pre-training methods based on a lower-dimensional latent representation with the usage of an autoencoder (AE). This approach is grounded in the hypothesis that (a) there is unessential information in high dimensionality datasets, and (b) the acquisition and processing errors potentially present in the dataset are discarded, contributing to a lower probability of overfitting [13]. Furthermore, pre-training AEs and using the learned weights as priors of the supervised classification task not just improves the model initialization, but also often leads to better generalization and performance [13]. This may be one of the reasons why AEs are found to be the most predominant strategy when analyzing RNA-Seq data [14].
To support our motivation and choices, we present some works that include unsupervised training in their methodologies. In [15], the authors designed a solution by combining a Multilayer Perceptron and Stacked Denoising Autoencoder (MLP-SAE), aiming to predict to what extent genetic variants can be a factor in gene expression changes. This model is composed of 4 layers (input, two hidden layers from the AEs, and output) and was trained to minimize the chosen loss function, the Mean Squared Error (MSE). The authors started by training the AEs with a stochastic gradient descent algorithm to later use them on the multilayer perceptron training phase as weight initialization; cross-validation was used to select the best model. The performance of the chosen model was compared with the Lasso and Random Forest methods and evaluated on predicting gene expression values for a different dataset. The authors concluded that their approach (1) outperformed both the Lasso and Random Forest algorithms (with an MSE of 0.2890 versus 0.2912 and 0.2967, respectively), and (2) was able to capture the change in gene expression quantification.
The authors in [16] described a study of four different methods of unsupervised feature learning (Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), Denoising AE (DAE), and Stacked Denoising AE) combined with distinct sampling methods when tackling a classification task. The authors focused on assessing how influential the input nodes are on the reconstructed data of the AE's output, when feeding these combinations to a shallow artificial network trained to distinguish papillary thyroid carcinoma from healthy samples. The authors highlighted two different results in their 5-fold cross-validation experiment: the combination of SMOTE [17] with Tomek links and a KPCA was the one with the best overall performance, with a mean F1 score of 98.12, while the usage of a DAE achieved a mean F1 score of 94.83.
The authors in [18] presented a stacked sparse autoencoder (SSAE) semi-supervised deep learning pipeline, applied to cancer detection using RNA-Seq data. By employing layerwise pre-training and a sparsity penalty, this approach helps to capture more significant information from the known high dimensionality of RNA-Seq datasets, passing the filtered information on to the subsequent classification task. The SSAE model was tested on three different TCGA RNA-Seq datasets (corresponding to lung, stomach, and breast cancers) with healthy and cancerous samples, and was compared to four other classification methods: an SVM, a Random Forest, a neural network (supervised learning only), and a vanilla AE. The authors performed 5-fold cross-validation and evaluated the model's performance through four metrics: accuracy, precision, recall, and F1 score. The results show that the semi-supervised deep learning approach achieved superior performance over the other considered methods, with an average F1 score of 98.97% across the three used datasets.
The authors in [19] developed a methodology for detecting papillary thyroid carcinoma. They analyzed how the usage of AEs as a weight initialization method affected the performance of a DNN. Six types of AEs were considered: Basic AE, Denoising AE, Sparse AE, Denoising Sparse AE, Deep AE, and Deep Sparse Denoising AE. Before being integrated into the classifier architecture, all AEs were trained to minimize the reconstruction error. Subsequently, they were used to initialize the weights of the first layers of the classification neural network (meaning that the AE layers become the top layers of the whole classification architecture), using two different strategies when importing the weights: (1) just the encoding layers, and (2) all the pre-trained AE. Moreover, in the training phase, the authors studied two different approaches when building the classifier: (a) fixing the weights of the AE and (b) allowing subsequent fine-tuning of all the network's weights. The authors used stratified 5-fold cross-validation and evaluated the model through 6 distinct metrics: Loss, Accuracy, Precision, Recall, and F1 score. The authors reported that the overall best result was achieved through a combination of Denoising AE, followed by its complete import into the classification network, and by allowing subsequent fine-tuning through supervised training, yielding an F1 score of 99.61.
In [20], the authors present a transfer learning methodology, in which the main goal is to explore whether leveraging the information extracted from a large RNA-Seq data repository, with multiple cancer types, helps to extract important latent features that can support complex and specific prediction tasks, such as identifying breast cancer neoplasia. The authors used the TCGA PanCancer dataset, which is composed of approximately 11,000 RNA-Seq gene expression examples of 33 distinct tumor types. This data was split into two sets: breast cancer and non-breast cancer data. The non-breast data is first used to train the three architectures selected for this study: a sparse AE, a deep sparse AE, and a deep sparse denoising AE. Then, the breast data is used to fine-tune the resulting AEs. After pre-training these models, the authors aim to predict the breast tumor intrinsic subtypes, which are given by the PAM50 subtype information included in the clinical data of the PanCancer dataset. The extracted features from the AE-based architectures are then fed as input to three different machine learning classifiers, namely Logistic Regression, Support Vector Machine, and a shallow Neural Network. To assess the deep AEs' performance as feature extraction methods, the authors compared them to other classical feature extraction methods, combining them with the classification algorithms previously mentioned: ANOVA, Mutual Information, Chi-Squared, and PCA. A 10-fold cross-validation was performed, and all the combinations were compared through the accuracy metric. The results showed that the deep sparse denoising AE performs best when using the AE-extracted features, where the combination with a shallow neural network leads to the best overall accuracy of 90.26% (±2.85).
In [21], Ferreira et al. used the same methodology described in [19] to discriminate different types of cancer, instead of distinguishing cancerous samples from healthy ones. In this case, they aimed to identify thyroid, skin, and stomach cancer correctly. Given that a Denoising AE was the AE that led to the best results in previous studies, the authors chose to single it out, instead of using the original 6. The rest of the experiments remained the same: two strategies for importing the pre-trained AE into the top layers of the classifier, two approaches when training the classifier to detect different types of cancer, and the same evaluation of the obtained results. Although in a different domain, the best outcome was reached with the same combination of strategy and approach as in the previous work [19], with an F1 score of 98.04, when identifying thyroid cancer.

Methods
We extend the previously described work in [21] by assembling three different types of experiments, divided into two main parts, where we use three different AEs and five types of cancer samples. In the first one, we analyze the performance of a deep neural network (DNN), using the same pipeline to identify different types of cancer. In the second part, we choose one of the used AEs to assess how: (1) the variance of its latent vector dimension impacts the essential information capture and therefore possibly influencing the classifier's performance, and (2) different data imputation strategies can influence the overall performance in the classification task. Moreover, we study if the network architecture is correlated with its overall performance, and how the model reacts when training with a different data type dataset. We built this pipeline in Python, using: the Numpy [22] and Pandas [23] packages for the data preprocessing step; the Keras deep learning library [24] running on top of TensorFlow and the Scikit-Learn [25] package to train and evaluate the models; and the Matplotlib [26] library for visualization. Additionally, we used an NVIDIA GeForce RTX 2080 Ti GPU, on a Ubuntu 18.04 operating system. This section is organized as follows: "The data" subsection describes the used data and its inherent preprocessing. "Autoencoders" subsection overviews the AEs considered to this study. "Methodology" subsection outlines the pipeline, for each of the referred experiments. "Evaluation" subsection details how we evaluate the results to provide statistical evidence. Finally, "Baseline" subsection presents the established baseline results for all the used datasets.

The data
In our experiments, we use two different types of data, which are described in the subsections that follow.

RNA-Seq data
We used five different RNA-Seq datasets, from The Cancer Genome Atlas (TCGA) [6], each representing a type of cancer: thyroid, skin, stomach, breast, and lung. One can find a sample of the described data in Table 1.
Table 1 (caption): The first line (the header) contains the gene names, and the column values represent their expression, sample-wise (except for the first column, which is the sample ID). NA stands for a missing value, for a particular gene and sample.
The datasets were downloaded from the cBioPortal [27], which gathers cancer-related data from different projects, including TCGA. To train DNNs, we need as much data as we can get. Therefore, our first criterion was to choose cancer types that had the highest number of examples. Additionally, we decided to give priority to cancer types with high mortality and high incidence rates. We use the same thyroid, skin, and stomach datasets presented in [21], alongside the lung and breast datasets. The data filtering process in the cBioPortal comprised searching with the keyword PanCancer, sorting the obtained results from highest to lowest number of RNA-Seq examples, and finally selecting the thyroid, skin, stomach, breast, and lung datasets. All five datasets are composed of approximately 20 thousand features. Each column feature in these datasets represents a specific gene, and the cell values for each column are the expression of that gene in a particular sample. All the RNA-Seq data were normalized according to the distribution based on all samples. The expression distribution of a gene is estimated by calculating the mean and variance of all samples with expression values, and discarding zeros and non-numeric values such as NA, Null or NaN, which are substituted by NA [28]. With the five datasets, we gathered 509 examples of thyroid cancer, 472 of skin cancer, 415 of stomach cancer, 1,083 of breast cancer, and 511 of lung cancer. We would like to emphasize that this dataset is only a toy dataset since the data does not fairly reflect the immense difficulty associated with identifying cancer in a real scenario.
The preprocessing pipeline was executed for each RNA-Seq dataset separately. Firstly, we removed the columns that had only one value throughout all samples. When a value is constant for all the examples, it carries no entropy; with no value variation, one cannot infer any information. In total, 2,056, 2,072, 1,993, 457, and 591 columns were removed from the thyroid, skin, stomach, breast, and lung datasets, respectively. By default, we imputed the remaining missing values (represented by NA in the dataset, as observable in Table 1) with the mean value of the column in which the missing value occurs [29]. Further normalization was not applied to the data. Finally, we added the Label column, to link the instances to their type of cancer, when training the classifier.
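The two preprocessing steps above (dropping constant columns, then mean imputation) can be sketched as follows. This is a minimal NumPy illustration and not the pipeline's actual code; the function name `drop_constant_and_impute` and the use of NaN to stand in for NA are assumptions made for the example.

```python
import numpy as np

def drop_constant_and_impute(X):
    """X: 2-D float array (samples x genes); NaN marks a missing value (NA).

    Hypothetical helper: returns the imputed matrix and a boolean mask
    telling which of the original columns were kept.
    """
    X = np.asarray(X, dtype=float)
    # A column is constant when all of its observed values are identical.
    # Assumes every column has at least one observed value.
    col_min = np.nanmin(X, axis=0)
    col_max = np.nanmax(X, axis=0)
    keep = col_max > col_min
    X = X[:, keep]
    # Replace each remaining NaN with the mean of the observed values
    # in the same column (the default imputation described in the text).
    col_mean = np.nanmean(X, axis=0)
    filled = np.where(np.isnan(X), col_mean, X)
    return filled, keep
```

Returning the `keep` mask makes it possible to record which genes survived in each dataset, which is what the later intersection step needs.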
Since we aim to distinguish several cancer variations, we test all cancers against each other, assigning the positive value one to the class of interest, and zero to the remaining ones. When detecting thyroid cancer, all thyroid examples are labeled as one and the skin, stomach, breast, and lung instances as zero, and henceforward.
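The one-against-the-rest labelling described above amounts to a single comparison per cancer type. A small sketch, assuming each sample carries a string naming its cancer type (the names `CANCER_TYPES` and `one_vs_rest_labels` are illustrative, not taken from the original pipeline):

```python
import numpy as np

# Hypothetical identifiers for the five cancer types used in the study.
CANCER_TYPES = ["thyroid", "skin", "stomach", "breast", "lung"]

def one_vs_rest_labels(sample_types, positive_type):
    """Return 1 for samples of `positive_type` and 0 for all other samples."""
    return (np.asarray(sample_types) == positive_type).astype(int)

# One binary label vector per detection task, built from the same samples.
samples = ["thyroid", "skin", "lung", "thyroid", "breast"]
label_sets = {c: one_vs_rest_labels(samples, c) for c in CANCER_TYPES}
```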
After processing all the datasets, it is improbable that the preprocessing phase removed the same columns in all of them. To guarantee that the same features describe all the samples, we intersect all the datasets and use the result as our final dataset. Also, given that the breast cancer dataset had almost double the instances of the others, we apply downsampling and randomly select 500 breast cancer examples, to keep the final dataset as evenly distributed across all the cancers as possible. In the end, the resulting dataset has approximately 3,000 instances and more than 17 thousand genes.
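A possible way to express the feature intersection and the random downsampling is shown below. It is only a sketch: the helper names, the fixed seed, and the representation of each dataset as a list of gene names plus a NumPy matrix are assumptions for illustration.

```python
import numpy as np

def intersect_genes(gene_lists):
    """Genes present in every dataset, kept in the order of the first list."""
    common = set(gene_lists[0]).intersection(*gene_lists[1:])
    return [g for g in gene_lists[0] if g in common]

def downsample(X, n_keep, seed=0):
    """Randomly keep `n_keep` rows of X without replacement."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_keep, replace=False)
    return X[np.sort(idx)]

# Toy usage: three datasets sharing only some genes.
shared = intersect_genes([["A", "B", "C"], ["C", "A"], ["A", "C", "D"]])
subset = downsample(np.arange(20).reshape(10, 2), n_keep=4)
```

In the setting described in the text, `n_keep` would be 500 for the breast cancer samples.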

Data of features extracted from images
We use two datasets of two different diseases, composed of features extracted from images: malaria and breast cancer. Since we aim to evaluate how well this methodology generalizes, by using distinct types of data, we are now able to gather evidence supporting this premise.
The malaria dataset was created by the Fraunhofer AICOS institution, through the MalariaScope project [30]. Their main goal is to develop low-cost solutions that can provide fast, reliable, and accurate results on detecting such disease, particularly in developing countries. In [31], the authors thoroughly describe the feature extraction process, from thin blood smear images exclusively acquired with smartphones. The resulting dataset is composed of 26,839 samples and 1,058 features. These features were normalized between [ −1, 1] via scaling and grouped into three main groups: geometry, color, and texture. From all the examples, approximately 8% contain malaria parasites. Due to the high unbalance between Malaria and Non-Malaria labels, we performed downsampling on the Non-Malaria class, where we randomly selected 60% examples. We decided to choose 60% instead of 50% due to a wide variety of non-parasite artifacts. Once the samples were selected, and similarly to the preprocessing step of the RNA-Seq data, we verify if there are features with constant values and remove them if that is the case. Our working malaria dataset has 5,906 instances (60% negative and 40% positive) and 1,052 feature columns.
The Wisconsin Breast Cancer dataset [32] from the UCI Machine Learning Repository is composed of 569 examples and 30 features. These features are computed from a digitized image of a fine needle aspirate of a breast mass and describe the characteristics of the cell nuclei present in those images, such as texture, area, concavity, and symmetry. Of the 569 examples, approximately 60% are benign samples, and 40% are malignant ones. No under- or oversampling techniques were applied, since we did not find them necessary. As with the malaria data, we checked whether there were columns with constant values and found none. The data was used as is, with the proportions and characteristics described above.

Autoencoders
An autoencoder (AE) [33] is an unsupervised feature learning neural network, that aims to copy its input based on a lower dimensional representation. This type of architecture is able to extract features by reducing the dimension of its hidden layer [33], which helps the AE to focus on capturing the essential features that best represent the data.
Let the encoding and decoding functions of the AE be $f$ and $g$, parameterized on $\theta_e$ and $\theta_d$ respectively, where $\theta = \theta_e \cup \theta_d$, $L$ being the loss function, and $J$ the cost function to be minimized. When learning, the AE aims to find the value of $\theta$ that satisfies:

$$\arg\min_{\theta} J(\theta; X) = L\big(X, g_{\theta_d}(f_{\theta_e}(X))\big) \tag{1}$$

penalizing the reconstruction of the input, given by $\hat{X} = g_{\theta_d}(f_{\theta_e}(X))$; the more distinct $\hat{X}$ is from $X$, the bigger the applied penalty. When training an AE, we use Mean Squared Error (MSE) as the loss function, and the Rectified Linear Units activation function (ReLU) [34] for all its layers. Currently, using ReLU as activation is the default recommendation when training neural networks [35]. Similarly, using MSE as the loss function is a fairly common practice in the literature when training AEs [15, 35-37].
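To make the cost concrete, the forward pass and MSE reconstruction cost of a single-hidden-layer AE with ReLU activations can be written in a few lines. This is a NumPy sketch of the quantity being minimized, not the Keras model used in the experiments; the weight and bias names are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def encode(X, W_e, b_e):
    """f: maps the input to the latent representation."""
    return relu(X @ W_e + b_e)

def decode(H, W_d, b_d):
    """g: maps the latent representation back to input space."""
    return relu(H @ W_d + b_d)

def reconstruction_cost(X, W_e, b_e, W_d, b_d):
    """Mean squared error between X and its reconstruction g(f(X))."""
    X_hat = decode(encode(X, W_e, b_e), W_d, b_d)
    return np.mean((X - X_hat) ** 2)
```

Training then consists of adjusting the encoder and decoder parameters to lower this value over the training set.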
We use the AEs as a weight initialization technique [38] since evidence supports that using "unsupervised pretraining guides the learning towards basins of attraction of minima that support better generalization from the training dataset" [13]. Thus, we pre-trained them before importing the encoding part or all their layers to the classification neural network.

Basic autoencoder (AE)
The simplest AE has only one hidden layer. This type of AE learns through the optimization cost function presented in Eq. 1. With the combination of piecewise-linear activations (ReLU) and the MSE loss function, these AEs behave similarly to the Principal Component Analysis (PCA) method: when trained with an MSE, an AE consequently learns the principal subspace of the training data [35].

Denoising autoencoder (DAE)
A Denoising AE (DAE) [39] aims not just to reproduce the input, but also to keep its information intact to undo the effect of an intentional corruption process applied to the original data. Its cost function can be described by:

$$\arg\min_{\theta} J(\theta; X) = L\big(X, g_{\theta_d}(f_{\theta_e}(\tilde{X}))\big) \tag{2}$$

where $\tilde{X}$ is a copy of the input $X$, intentionally corrupted by a sort of noise [35]. To simulate a form of Bernoulli Noise [40], we apply a Dropout layer, immediately after the input layer, where 10% of the connections are randomly cut.

Fig. 1 Overall pipeline of our experiments. This figure illustrates the chosen methodology for our work. Firstly, we pre-train the autoencoders (AEs), before embedding them into the top layers of the classification network, fulfilling either Strategy 1 (import only the encoding layers from the AE) or Strategy 2 (import the complete AE). Each of the fully assembled architectures is then trained to detect one of the 5 cancer types in the input data. The training process can follow two different approaches, regarding the imported weights of the AEs: (A) fixing them or (B) allowing subsequent fine-tuning. I represents the input layer, E the encoding layer, and Î the output layer of the AE; in the classification region of the network, D represents the fully connected layer, and O the output of the classifier.
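The corruption step can be illustrated with a Bernoulli mask applied to the input. The sketch below is an assumption-laden stand-in for the Dropout layer mentioned in the text: it takes the dropout rate to be 0.1, and it omits the rescaling of the kept inputs by 1/(1 - rate) that Keras' Dropout layer applies during training.

```python
import numpy as np

def corrupt(X, rate=0.1, seed=0):
    """Zero out each entry of X independently with probability `rate`."""
    rng = np.random.default_rng(seed)
    keep_mask = rng.random(X.shape) >= rate  # True where the entry is kept
    return X * keep_mask

def denoising_cost(X, reconstruct, rate=0.1, seed=0):
    """MSE between the clean X and the reconstruction of its corrupted copy.

    `reconstruct` is any callable playing the role of g(f(.)).
    """
    X_tilde = corrupt(X, rate, seed)
    return np.mean((X - reconstruct(X_tilde)) ** 2)
```

Note that the target of the loss is the clean input, so the network is rewarded for undoing the corruption and not for copying the corrupted copy.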

Sparse autoencoder
Similarly to a DAE, a Sparse AE (SAE) learning process also has two main goals: (1) minimizing the reconstruction error when aiming to copy the input data, and (2) applying a sparsity penalty (represented by $\Omega$) to the parameters involved in the encoding part:

$$\arg\min_{\theta} J(\theta; X) = L\big(X, g_{\theta_d}(f_{\theta_e}(X))\big) + \lambda\,\Omega(\theta_e) \tag{3}$$

Although it also tries to reproduce $X$, an SAE can address unique statistical features of the dataset it has been trained on [35, 41]. To deliver that sparsity element, we use an L1 penalty, with a $\lambda$ of $10^{-5}$.
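As a rough illustration of the sparse objective, the cost below adds an L1 term on the encoder weights to the MSE reconstruction error, with λ set to 10^-5. Applying the penalty to the weights (and not to the activations) follows the wording "parameters involved in the encoding part" and is otherwise an assumption; the function name is illustrative.

```python
import numpy as np

def sparse_cost(X, X_hat, encoder_weights, lam=1e-5):
    """MSE reconstruction error plus lam times the L1 norm of the encoder weights.

    `encoder_weights` is a list of weight arrays belonging to the encoder.
    """
    mse = np.mean((X - X_hat) ** 2)
    l1 = sum(np.abs(W).sum() for W in encoder_weights)
    return mse + lam * l1
```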

Methodology
We have adopted the methodology described in [19], which was also used in [21]. Our experiments consist of an analysis of the performance of a DNN, trained to classify different cancer types, studying how three different factors may impact the network performance: 1. The top layers, where we use three different AEs as weight initialization; 2. The dimension of the latent vector of the AEs, that is, the encoding layer size; 3. The imputation technique, to replace missing data when preprocessing the datasets.
For all these, we follow the same pipeline (see Fig. 1). For each experience, we start by pre-training a different AE to minimize the reconstruction error, before importing them into the top of the classification architecture. When doing so, we choose one of the two strategies considered for this study: (1) add just the encoding layers, or (2) add all the pre-trained AE. After the embedding of the AE to the top layers, we consider two different approaches in the training process: (A) fixing the imported weights of the AE layers, and (B) by allowing them to be fine-tuned, during the model training for the classification task.
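The combination of the two strategies and the two approaches can be summarized with a framework-agnostic sketch, in which each layer is a plain dictionary holding its weights and a `trainable` flag. All names and the hidden size are hypothetical, and Batch Normalization is left out for brevity; in a Keras implementation, the same idea would correspond to copying the pre-trained layers' weights and setting their `trainable` attribute.

```python
import numpy as np

def make_layer(n_in, n_out, rng, trainable=True):
    """Randomly initialised dense layer stored as a plain dict."""
    return {"W": rng.normal(scale=0.01, size=(n_in, n_out)),
            "b": np.zeros(n_out),
            "trainable": trainable}

def assemble(encoder, decoder, strategy, approach, rng, hidden=32):
    """Build the classifier's layer list from a pre-trained AE.

    strategy 1: import only the encoder; strategy 2: import encoder and decoder.
    approach "A": freeze the imported weights; approach "B": allow fine-tuning.
    """
    imported = [encoder] if strategy == 1 else [encoder, decoder]
    fine_tune = (approach == "B")
    top = [{"W": l["W"].copy(), "b": l["b"].copy(), "trainable": fine_tune}
           for l in imported]
    # Classification part: two hidden dense layers and one output neuron.
    n_in = top[-1]["W"].shape[1]
    head = [make_layer(n_in, hidden, rng),
            make_layer(hidden, hidden, rng),
            make_layer(hidden, 1, rng)]
    return top + head
```

An optimizer would then update only the layers whose `trainable` flag is true, which is the whole difference between approaches (A) and (B).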
With the complete architectures (AE as the top part of the classification network) assembled, we train each one to distinguish: • The RNA-Seq input data as one of 5 cancers, namely thyroid, skin, stomach, breast, and lung; • The malaria input data as Malaria or Non-Malaria; • The breast masses input data as Malign or Benign.
Besides the top layers imported from the AE, the classification part of the full architecture is composed of a Batch Normalization layer [42], followed by two Fully Connected layers with a ReLU [34] activation. Since we aim to detect one type of cancer at the time, the last layer -the predictive one -is a single neuron layer with a Sigmoid non-linearity [43]. This activation considers that if the probability of the classification is lower than 0.5, the sample is classified as negative (that is not having the disease); otherwise, the sample is classified as positive.
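The decision rule of the output neuron can be stated compactly. In this sketch, `logit` denotes the pre-activation value of the last neuron, a name introduced here for illustration:

```python
import numpy as np

def predict_label(logit, threshold=0.5):
    """Apply the sigmoid and classify as positive (1) when p >= threshold."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logit, dtype=float)))
    return (p >= threshold).astype(int)
```

Probabilities strictly below 0.5 map to the negative class, and everything else maps to the positive class, matching the rule described above.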
To assess the following experiments, we decided to only use the AE that achieved the best results in the first experiments. For points (2) and (3), we try three different dimensions: 64, 32, and 16. For the data imputation study, we use three strategies: replacing the missing data with (a) the mean column value (used as default), (b) a constant value (in this case, zero), and (c) the most frequent value.
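The three imputation strategies can be compared side by side with a short column-wise routine. This is an illustrative NumPy version (scikit-learn's `SimpleImputer` offers the equivalent `mean`, `constant`, and `most_frequent` strategies); it assumes NaN marks missing entries and that every column has at least one observed value.

```python
import numpy as np

def impute(X, strategy="mean", fill_value=0.0):
    """Fill NaNs column by column using the chosen strategy."""
    X = np.array(X, dtype=float)  # work on a copy
    for j in range(X.shape[1]):
        col = X[:, j]             # view into X, so edits are in place
        missing = np.isnan(col)
        observed = col[~missing]
        if strategy == "mean":
            value = observed.mean()
        elif strategy == "constant":
            value = fill_value
        elif strategy == "most_frequent":
            values, counts = np.unique(observed, return_counts=True)
            value = values[np.argmax(counts)]
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        col[missing] = value
    return X
```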
Furthermore, we want to study if when using Strategy 2 (importing the complete AE into the classification network) the model yields better results just because it has one more layer and, therefore, more parameters to train. To observe if the classifier is better only by being deeper, we pre-trained the AE and, at the embedding step for Strategy 1, we add a decoder layer, with all its weights randomized, guaranteeing that there are no discrepancies concerning the network's topological complexity, for both strategies.
Finally, we want to assess how the pipeline behaves when dealing with different data types, besides RNA-Seq entries. Hence, we apply the same methodology to the image-extracted feature datasets described in "The data" section, to assess if the model can adapt and generalize well to these data characteristics.

Evaluation
We use stratified 10-fold cross-validation to provide statistical evidence. The AEs are trained for 300 epochs, and the classifier for 500, with a batch size of 100. The classification model is trained with the binary cross-entropy loss function [35] and with an Adam optimizer [44]. Furthermore, we assess the overall performance of the model on the training and validation sets by analyzing five more metrics: Accuracy, Matthews Correlation Coefficient (MCC) [45], Precision, Recall, and F1 score, and provide the Receiver Operating Characteristic curve, with the respective Area Under the Curve (ROC-AUC), and the Precision-Recall Curve.
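For reference, the F1 score and the MCC can both be computed directly from the entries of the binary confusion matrix. The following standard-library sketch uses the usual definitions; returning zero when a denominator vanishes is a convention chosen here, not something stated in the text.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient, in the range [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike the F1 score, the MCC also takes the true negatives into account, which makes it informative when the classes are unbalanced.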
Furthermore, to study how the model generalizes to data unseen during the training phase, we evaluate the performance of the best architecture combination on a held-out test set, for the Malaria and the Wisconsin Breast Cancer datasets. For each dataset separately, we use a ratio of one third to create two new splits. We perform a stratified split, meaning that we preserve the distribution of the label in both the train and test sets. With the training set, we follow the same stratified cross-validation strategy described above. The performance on the held-out set is assessed through the same metrics as well.
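A stratified hold-out split of this kind can be sketched as follows. This is not the code used in this work: the function name `stratified_holdout`, the fixed seed, the toy labels, and the reading that the one-third portion is the test set are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_holdout(labels: Sequence[int], test_ratio: float = 1 / 3,
                       seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) that preserve the label proportions."""
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx: List[int] = []
    test_idx: List[int] = []
    for indices in by_class.values():
        # Shuffle within each class, then cut off the same fraction of every class.
        rng.shuffle(indices)
        n_test = round(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 60 negative and 30 positive samples.
labels = [0] * 60 + [1] * 30
train_idx, test_idx = stratified_holdout(labels)
```

With these toy labels, the test set holds 20 negative and 10 positive samples, so both splits keep the original 2:1 class ratio. The stratified cross-validation is then run on `train_idx` only.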

Baseline
To support our claim that using AEs as weight initialization improves DNN performance, we defined three different baselines, for each of the datasets used.
For the RNA-Seq data, we established as baseline the results from the classification part of our methodology, without the top layers of the AEs. The baseline model was trained under the conditions described in the previous section. The results of this experiment can be found in Table 2, where the best overall performance was achieved when classifying skin cancer, with a mean F1 score of 51.15%.
We further added another baseline for the RNA-Seq datasets, where we use a simple AE with random and fixed weights, with the intent of discarding the possibility of our pipeline yielding better results only because its classification architecture is slightly deeper. These baseline results are presented in Table 3 and are assessed later in this paper, in the Results and discussion section.
For the malaria dataset, we consider the results of two different approaches applied to the same domain. Firstly, in [31], the authors used a support vector machine (SVM) to automatically classify each species-stage combination of the malaria parasite. The authors studied the SVM hyperparameters and their influence on the classifier's performance. In terms of F1 score, this classifier's performance ranged from 18.8% to 87.4% across all the malaria parasite species-stage combinations. Secondly, in [46], a 5-class MobileNet v2 convolutional neural network was used to directly classify the thin blood smear images. The chosen architecture presented an F1 score of 53% when distinguishing parasites from artifacts.
For the Wisconsin Breast Cancer dataset, we chose as baseline the work presented in [47], where the authors studied different machine learning algorithms, combined with a Principal Component Analysis (PCA), to detect tumorous and non-tumorous samples in this dataset. Furthermore, they compared their top 3 models with some state-of-the-art models. Their overall best was the combination of a Naïve Bayes with a Sigmoid PCA, with an F1 score of approximately 97%.

Results and discussion
Autoencoders as weight initialization can efficiently predict diseases when applied to different biological and feature-extracted data.
Given the results, one tends to assume that the methodology originally presented in [19] generalizes to different data and problems. This work can be seen as further empirical evidence supporting this premise. We outperform the results of Ferreira et al. [21] and the baseline results presented in Tables 2 and 3; our best performance was achieved by importing the pre-trained AE encoding layers into the upper layers of the deep classification network (Strategy 1) and allowing subsequent fine-tuning (Approach B), with an F1 score of 99.03% and an MCC of 0.99 when distinguishing thyroid from the other cancer types (and an average F1 score of 98.27% when considering all cancer classifications). The various network combinations also achieved very high results for each cancer type, as observable in Table 4. Furthermore, our methodology outperformed the established baselines for both image-based feature datasets. The best overall performances were:
• Importing the pre-trained DAE encoding layers into the upper layers of the deep classification network (Strategy 1) and allowing subsequent fine-tuning (Approach B), with an F1 score of 89.95% and an MCC of 0.84, on the Malaria dataset (as highlighted in Table 5);
• Importing the pre-trained AE encoding layers into the upper layers of the deep classification network (Strategy 1) and allowing subsequent fine-tuning (Approach B), with an F1 score of 98.84% and an MCC of 0.98, on the Wisconsin Breast Cancer dataset (as shown in Table 6).
With these results, there is evidence that this methodology can generalize to other types of data and tasks.

Subsequent model fine-tuning (Approach B) leads to better results than fixing the weights (Approach A).
Similarly to [19], and now with the new data, our results for all the experiments in the three datasets support that allowing the imported weights of the AEs to be fine-tuned in the training phase gives better results than fixing them.

[Table caption] When measuring loss, lower is better. For all the remaining metrics, higher is better. All the presented results are the 10-fold cross-validation mean values on the validation set, selecting the best-performing model according to its F1 score. The highlighted values correspond to the combination that led to the overall best result (detecting thyroid cancer, importing only the encoding layers of a Basic AE into the classification network, and allowing subsequent fine-tuning when training for the classification task).

There is high evidence supporting that importing only the encoding part of the AE leads to good results.
According to the results in Table 7, and considering Approach A, Strategy 1 with the extra random decoding layer yielded better results than Strategy 2 for all the combinations except when using an SAE. Regarding Approach B, all combinations achieved quite close results for all the performed experiments. Thus, one can argue that less complex models can achieve better results, similar to what was concluded in [21].

When analyzing Approach B (Fine-Tuning Weights), the results in Table 9 show no significant variation in the DNN performance for either AE embedding strategy, with an F1 score variation of 1% to 3% compared with the default size experiment. In Approach A (Fixing Weights), the performance difference was more significant, with the F1 score decreasing nearly 20% with a latent vector dimension of 64, and approximately 60% with a dimension of 16, for Strategy 1.
There is no evidence supporting a conclusion on which is the best data imputation strategy.
In the imputation strategy experiment, the mean strategy led to the highest performance in the classification task when considering Approach B. However, one can observe in Table 10 that the mode strategy yielded better results for Approach A, but all the other imputation strategies achieved similar results. Hence, we cannot affirm that any particular strategy leads to the best results.

[...] we used that specific AE to assess if there were changes in the classification network performance. However, due to space constraints, we opted to present only the results for the breast cancer class, since it had a greater variance in results between strategies, especially for Approach A, as seen in Table 4.

Conclusions
We compared the performance of a deep neural network (DNN) when using three different autoencoders (AEs) to initialize its weights. To do so, each AE was pre-trained and then attached to the top layers of our classifier. In the importation phase, two different strategies were studied: (1) importing only the AE's encoding layer, and (2) importing all the AE's layers. Each of the three built architectures was then trained to classify the input data as one of the five types of cancer in this study. Two different approaches were analyzed in the training process: (A) fixing the imported weights, and (B) allowing them to be fine-tuned during supervised training. Additionally, we studied (1) how changing the encoding space dimension impacts the AE and DNN performances, and (2) how the missing data replacement strategy influences the performance in the classification task. We also assessed the impact that the number of imported AE layers has on the overall DNN performance. Furthermore, we extended the generalization study of this methodology by applying it to two different datasets: the MalariaScope thin blood smears dataset and the Wisconsin Breast Cancer tumors dataset.
We outperformed the best result reported in [21], according not just to the F1 score, but to all the other evaluation metrics as well. After a 10-fold cross-validation training process, embedding only the encoding layers of a pre-trained Basic AE into the top layers of the DNN (Strategy 1), followed by fine-tuning, achieved the best overall performance, with an F1 score of 99.03±1.21. Moreover, we also outperformed the other established baselines for the MalariaScope and Wisconsin Breast Cancer datasets, supporting the claim that this methodology generalizes well, including when dealing with other data types. After evaluating on two distinct held-out test sets, we could conclude that our models generalize well to unseen and different data, without overfitting during the training phase. Allowing fine-tuning (Approach B) of the imported AE weights consistently led to better results than fixing the weights of the top layers (Approach A), as can be observed in the results. Approach A is more sensitive to variations in the latent vector dimension, whereas Approach B is more stable. Finally, the results showed no evidence on which imputation strategy is the best, considering the RNA-Seq data.
In conclusion, this methodology led to state-of-the-art performance in cancer classification from gene expression, strongly supporting that using AEs as weight initialization can help DNNs achieve better performance. We believe that it also has high potential to generalize well to other data and problems, as shown in the results using datasets of features extracted from images.
In the long term, and although some of the datasets used are considered toy datasets, we expect that this work will lead to a more efficient and robust automated system for the diagnosis of diseases, in particular cancer, providing a faster diagnosis and improving the expected treatment outcome.