Introduction
Colorectal cancer (CRC) is a prevalent form of cancer, ranking third worldwide in diagnosis rates and second among causes of cancer-related death, with 1.4 million new cases and 693,900 fatalities reported in 2020 alone [1]. Early detection and precise diagnosis of CRC can significantly increase the chances of patient survival and pave the way for effective treatment plans.
Medical imaging is a crucial tool in the timely detection and diagnosis of CRC. Histopathological images are frequently employed in CRC diagnosis to identify abnormal cells and tissues in colon tissue samples. However, analyzing histopathology images is a complex task that demands a high level of expertise in pathology. Deep learning algorithms have emerged as a promising solution to this challenge, demonstrating remarkable accuracy in the classification of histology images. Among them, the Convolutional Neural Network (CNN) is one of the most widely used approaches, exhibiting exceptional performance in extracting salient features from images [2], [3].
One of the main challenges in the medical field is the limited availability of samples. The collection and labeling of histology images are challenging and time-consuming, and the available datasets are often smaller than in other fields. Transfer learning is therefore a crucial technique for overcoming this scarcity. It is a machine learning technique in which a pre-existing model trained on one task is used as a foundation for a new model solving a different problem, re-purposing CNN models pre-trained on larger datasets to improve the performance of models trained on smaller ones [4].
In this article, we present a novel deep learning approach, termed DeepCon, for the classification of CRC histopathology images. DeepCon is an advancement of the previously developed Decompose, Transfer, and Compose (DeTraC) model [5] that aims to investigate the impact of a trained composition on the learning process. DeTraC relies on class decomposition, partitioning image classes into subsets based on specific guidelines and assigning new labels to these subsets. It then performs composition, mapping each instance classified into a subclass back to its original (parent) class. In contrast, DeepCon retains the class decomposition but takes a transformative step in the composition process: it introduces a learnable composition step that dynamically assembles subclass results using a pre-trained CNN, fine-tuning its weights for a precise, data-driven composition.
The complex nature of histopathological images poses unique challenges, especially due to intricate tissue structures, and accurately distinguishing between sub-classes within CRC is crucial for informing treatment decisions. DeepCon utilizes transfer learning in various modes by applying models pre-trained on ImageNet [6] to a dataset of 5000 CRC images. Transfer learning from ImageNet-based models has been widely adopted as an efficient way of training deep learning models for classification tasks. A notable feature of DeepCon is its learnable composition of the decomposed classes, unlike the traditional composition presented in [5]: a pre-trained CNN automatically learns to classify the input image into one of the original classes. The learned composition stage initializes this CNN with the weights obtained from the subclass-level learning and fine-tunes it for the original class-level classification. Our experiments show that the deep tuning mode of DeepCon outperforms the shallow tuning mode, achieving the highest accuracy among all pre-trained models employed. The use of learned composition in DeepCon is a significant advancement over the previous DeTraC model, allowing more efficient and effective learning. In a nutshell, DeepCon follows a divide-and-conquer approach: in the divide (decomposition) stage, learning is performed at the subclass level; in the conquer (composition) stage, learning is performed at the original class level.
DeepCon offers several advantages in this domain. Firstly, it achieves enhanced classification accuracy, enabling more precise diagnoses. Furthermore, its unique ability to perform subclass-level classification provides finer granularity, assisting clinicians in tailoring treatment plans. Additionally, DeepCon is characterized by its flexibility in transfer learning, allowing it to adapt readily to various datasets and tasks. Lastly, it facilitates efficient learning, optimizing the utilization of available resources. However, it is important to acknowledge a limitation of our proposed solution. The inclusion of two transfer learning stages in DeepCon, while enhancing classification accuracy, may introduce computational overhead.
The contributions of the article can be summarized as follows:
Introduces DeepCon, a novel deep learning approach tailored for CRC histology image classification, addressing the critical need for accurate CRC diagnosis.
Employs a unique divide-and-conquer methodology with a focus on learned composition, enhancing the transferability of features between domains and improving classification accuracy.
Demonstrates the effectiveness of two-stage transfer learning with multiple loss functions, showcasing its potential for optimizing CRC histology image classification.
In recent work, Chattopadhyay et al. [7] introduced a deep learning model for colorectal cancer analysis that uses snapshot ensembling for effective feature selection, which aligns with our DeepCon model's optimization of feature utility from decomposed image data for more accurate classification. Similarly, the work of Tsai et al. [8] on predicting multi-omics aberrations and prognoses from histopathology images complements DeepCon's aim to enhance diagnostic precision.
The article is organized as follows. Section II provides an overview of the latest related work conducted for CRC histology image classification. In Section III, we describe the methodology of the proposed DeepCon model, including the different transfer learning modes utilized and the process of learned composition. Section IV presents the experimental study carried out on the CRC imaging dataset. Finally, in Section V, we draw conclusions from our findings.
Related Work
In this section, we review the latest work conducted for the classification of CRC histopathology images.
Several studies have been conducted in the field of CRC image analysis [9], [10] and, more specifically, on classification tasks using deep learning methods [11], [12], [13], [14]. For example, Peng et al. [12] developed a multitask deep learning framework for simultaneous classification and retrieval of colorectal histopathological images, making use of the well-known concept of k-nearest neighbors to improve the interpretability of the model. Their framework can be built on top of any existing classification network (pre-trained models) by combining a triplet loss function with a novel triplet sampling strategy to compare distances between samples, and adding a hashing loss function to accelerate the search for neighbors. Raczkowski et al. [15] introduced ARA-CNN, a precise and reliable Bayesian network that employs an active learning approach to classify colorectal cancer histopathology images. The network is developed based on residual network principles and utilizes variational dropout techniques in its design. Shaban et al. [11] presented a method for classifying colorectal images using two stacked CNNs. Their approach incorporates a larger contextual view of the images through a context-aware neural network: the model first converts the local representation of a histopathology image into high-dimensional features and then combines these features, taking into account their spatial arrangement, to make a final prediction.
Research conducted by [13] explored the application of the ResNet architecture to the detection of colorectal cancer through deep learning image classification. The study focused on training ResNet-18 and ResNet-50 models on colon gland images to differentiate between benign and malignant colorectal cancer. The work presented by [14] focused on using deep learning architectures to classify and identify colon cancer regions in sparsely annotated histopathological data; it reviews and compares the latest CNNs and utilizes transfer learning techniques to overcome the limited annotated data. The models were tested on the AiCOLO colon cancer and CRC-5000 datasets. The work presented by [16] aimed to automatically identify eight tissue types in CRC histopathological evaluation using transfer learning from CNN architectures. CNN structures were modified to extract features from images, which were then fed into various machine learning methods, including naive Bayes, multilayer perceptron, k-nearest neighbors, random forest, and support vector machine (SVM). A total of 108 extractor-classifier combinations were evaluated, with DenseNet169 plus SVM achieving the best results.
Furthermore, Wang et al. [17] propose a transformer-based unsupervised contrastive learning strategy named semantically-relevant contrastive learning (SRCL), combined with a hybrid model CTransPath. This approach achieves exceptional results across diverse downstream tasks, highlighting its robustness and transferability. Kumar et al. [18] introduce CRCCN-Net, a lightweight convolutional neural network framework for automated colorectal tissue classification. The framework showcases impressive performance, positioning it as a potential diagnostic tool for clinicians. Additionally, Zhou et al. [19] developed the HCCANet method, a computer-aided diagnosis (CAD) system for grading colorectal cancer based on a CNN architecture and a novel attention mechanism named MCCBAM. The model's interpretability is improved through gradient-weighted class activation maps (Grad-CAM).
Moreover, Sabol et al. [20] contributed an explainable classifier to improve accountability in decision-making for colorectal cancer diagnosis from histopathological images. The model offers human-friendly explanations about the plausibility of decisions through a Cumulative Fuzzy Class Membership Criterion (CFCMC). The classifier is shown to be comparable to state-of-the-art neural networks in accuracy and is particularly suited for use by human experts in the medical domain.
Changjiang et al. [21] presented a framework that utilized features of varying magnifications of Whole Slide Images (WSI) to classify and localize colorectal cancer, relying solely on global labels. The work presented by Zhou et al. [22] developed a deep learning framework called the cell graph convolutional neural network (CGC-Net) to grade colorectal cancer. Their method involves converting each large histopathological image into a graph representation, where the nuclei within the image are represented as nodes and the cellular associations are represented as edges based on the similarity between the nodes. The network utilizes the local features and spatial dependencies of the nodes to improve its accuracy. Haoyuan et al. [23] proposed the IL-MCAM framework for colorectal histopathology image classification, which involves two stages: automatic learning and interactive learning. The automatic learning stage uses three attention mechanism channels and CNNs to extract multiple channel features, while the interactive learning stage incorporates misclassified images into the training set to improve the model's classification performance.
Awan et al. [24] developed Best Alignment Metric (BAM), a gland-shape metric that correlates with the grade of colon cancer. Their model uses a Deep CNN to detect gland boundaries and an SVM classifier to determine cancer grade. Wang et al. [25] proposed a deep transferable semi-supervised domain adaptation model called HisNet-SSDA to classify histopathological WSIs with limited labeled data. The method uses a pre-trained network to extract features from both source and target domains, then matches the two domains via semi-supervised domain adaptation with multiple-weighted loss functions and a manifold regularization term. The final image-level classification is obtained by combining the estimated probabilities of the sampled patches.
Here, we present a deep learning approach, termed DeepCon, for the classification of histopathology images. DeepCon is a new divide-and-conquer deep learning technique that addresses the challenging problem of data irregularity in histopathological images, whose samples are characterized by high visual variability. DeepCon operates as a two-stage, coarse-to-fine transfer learning approach, guided by a divide-and-conquer training strategy that learns the composition of the decomposed images during transfer.
Material and Methods
In this section, we explain in detail the different stages of DeepCon. First, we describe the dataset utilized in the study. Second, we explain the preprocessing stage, which applies stain normalization to the dataset samples. Then, we describe the transfer learning modes used, followed by the class decomposition and composition stages of DeepCon. Finally, we describe the training and hyperparameter settings. Fig. 1 shows the complete architecture of DeepCon, comprising three important stages. The first stage applies decomposition to the samples of the normalized CRC dataset using the k-means clustering algorithm; the aim is to reduce the complexity and data irregularity of the dataset samples, and this stage yields 16 decomposed classes. In the second stage, we apply initial transfer learning using a CNN pre-trained on ImageNet, fine-tuning the last few layers of the network for our specific problem (the 16-class classification task over the decomposed CRC dataset). Lastly, we introduce a novel automatic class composition strategy via a second round of transfer learning: the CNN pre-trained on the decomposed CRC dataset is fine-tuned in its last few layers for the final classification of the original class labels (the 8-class classification task). See Fig. 1.
Full architecture of the DeepCon model. (A) Decomposition Stage: applying k-means clustering to the dataset samples, resulting in 16 decomposed classes. (B) Transfer Learning Stage: transferring knowledge from a CNN pre-trained on ImageNet to the decomposed CRC dataset for classifying the 16 decomposed classes. (C) Learned Composition Stage: transferring knowledge from the CNN pre-trained on the decomposed CRC dataset to the original CRC dataset to classify the 8 original classes.
A. Dataset
In this work, we used the dataset "CRC-VAL-HE-7K" from the Institute of Pathology (University Medical Center Mannheim, Heidelberg University, Mannheim, Germany) [26]. The dataset contains 5000 images, equally distributed over 8 classes of 625 images each: TUMOR, STROMA, COMPLEX, LYMPHO, DEBRIS, MUCOSA, ADIPOSE, and EMPTY/BACKGROUND. All images are 224×224 pixels at 0.5 microns per pixel, in TIF format.
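For reference, a minimal loading sketch is shown below; the root path and the one-folder-per-class layout are assumptions about how the archive is unpacked, not part of the dataset specification.

```python
from pathlib import Path

import numpy as np
from PIL import Image

# The eight class folder names mirror the labels listed above; the root
# path is illustrative and depends on how the archive was extracted.
CLASSES = ["TUMOR", "STROMA", "COMPLEX", "LYMPHO",
           "DEBRIS", "MUCOSA", "ADIPOSE", "EMPTY"]

def load_crc_dataset(root):
    """Return (images, labels) as NumPy arrays from the TIF tiles."""
    images, labels = [], []
    for label, name in enumerate(CLASSES):
        for tif in sorted(Path(root, name).glob("*.tif")):
            images.append(np.asarray(Image.open(tif).convert("RGB")))
            labels.append(label)
    return np.stack(images), np.array(labels)
```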
B. Pre-Processing
Due to the high visual variability of the images, a stain normalization method [27] was used to combine the stain density maps with the stain color basis of a selected target image, thereby altering the color profile of the images in our dataset while preserving the structure described by the stain density maps; see Fig. 2. Additionally, affine transformations were used for data augmentation to increase the size of the dataset: random orthogonal rotations were combined with horizontal and vertical flips. Although CNNs are not rotationally invariant, there is no canonical orientation of histology images, so rotations from the full 360-degree range are valid. Orthogonal rotations were chosen to prevent the empty corners produced by rotation angles that are not multiples of 90 degrees; although fill methods exist for such empty regions, they can have a negative effect and are more computationally intensive than orthogonal rotation.
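A minimal sketch of this augmentation policy, applied on the fly to NumPy image arrays (the function name and RNG handling are ours):

```python
import numpy as np

def augment(image, rng=None):
    """Random orthogonal rotation plus optional horizontal/vertical flips.

    Rotations restricted to multiples of 90 degrees leave no empty
    corners to fill, unlike arbitrary-angle rotations.
    """
    if rng is None:
        rng = np.random.default_rng()
    image = np.rot90(image, k=rng.integers(4))  # 0, 90, 180, or 270 degrees
    if rng.integers(2):
        image = np.fliplr(image)                # horizontal flip
    if rng.integers(2):
        image = np.flipud(image)                # vertical flip
    return image
```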
Stain normalization process. Transferring color from a target/reference image to source images.
C. Class Decomposition
The research carried out by [5] explored a class decomposition approach in transfer learning known as DeTraC. The method clusters the classes within a dataset prior to training a network with transfer learning, with the aim of improving accuracy when the data classes are unevenly distributed. After training, the labels are mapped back to their original superclasses before performance metrics are evaluated. The method was evaluated on the CRC dataset and tested on three different classes, and the results demonstrated an increase in accuracy compared to other approaches. In this work, we developed a novel automated method for restoring the original classes from the decomposition stage, with the aim of achieving higher accuracy.
The decomposition approach follows the steps outlined in the original paper [5]. For each of the 8 classes in the original dataset, a pre-trained Xception network was utilized to extract a 2048-dimensional feature vector for each of the 625 images in that class. We applied Principal Component Analysis (PCA) to project the high-dimensional feature space into a lower dimension, so that highly correlated features were discarded before each class was clustered into two subclasses with k-means. This step is crucial for the class decomposition to generate more homogeneous classes, minimize memory requirements, and increase the framework's effectiveness. To apply PCA, the eigenvectors of the feature covariance matrix were computed and the data projected onto the top principal components. After centering, the covariance matrix of the feature vectors was constructed as:
\begin{equation*}
\Sigma =\frac{1}{n} \sum _{i=1}^{n}\left(x_{i}-\mu \right)\left(x_{i}-\mu \right)^{T}, \tag{1}
\end{equation*}
Following [5], let $A$ denote the original dataset of $n$ images, each represented by $m$ features, and let $\mathbf{C}$ be the set of the $j$ original class labels:
\begin{align*}
A = \left[\begin{array}{cccc}a_{11} & a_{12} & \ldots & a_{1m} \\
a_{21} & a_{22} & \ldots & a_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \ldots & a_{nm} \end{array}\right], \quad \mathbf {C}=\left\lbrace c_{1}, c_{2}, \ldots, c_{j}\right\rbrace. \tag{2}
\end{align*}
Class decomposition partitions each of the $j$ classes into $l$ clusters (here $j = 8$ and $l = 2$), so the resulting set $S$ of subclass labels satisfies
\begin{equation*}
|S|=j \times l. \tag{3}
\end{equation*}
The labeled dataset $A$ is thereby converted into a decomposed dataset $B$ carrying the new subclass labels:
\begin{align*}
A & =\left[\begin{array}{ccccc}a_{11} & a_{12} & \ldots & a_{1m} & c_{1} \\
a_{21} & a_{22} & \ldots & a_{2m} & c_{1} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
a_{n1} & a_{n2} & \ldots & a_{nm} & c_{8} \end{array}\right], \\
B & =\left[\begin{array}{ccccc}b_{11} & b_{12} & \ldots & b_{1m} & s_{11} \\
b_{21} & b_{22} & \ldots & b_{2m} & s_{1l} \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
b_{n1} & b_{n2} & \ldots & b_{nm} & s_{8l} \end{array}\right]. \tag{4}
\end{align*}
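Under these definitions, the decomposition stage can be sketched as follows. The Xception features, PCA projection, and two clusters per class follow the text above; the number of retained principal components is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from tensorflow.keras.applications import Xception
from tensorflow.keras.applications.xception import preprocess_input

# ImageNet-pretrained Xception backbone producing 2048-d feature vectors.
extractor = Xception(weights="imagenet", include_top=False, pooling="avg")

def decompose_class(class_images, n_components=64, n_clusters=2, seed=0):
    """Split the images of one parent class into `n_clusters` subclasses."""
    feats = extractor.predict(preprocess_input(class_images.astype("float32")))
    reduced = PCA(n_components=n_components).fit_transform(feats)
    return KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(reduced)

# Subclass label of image i in class c: c * n_clusters + cluster_id,
# yielding the 16 decomposed labels used in the next stage.
```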
Fig. 3 shows a set of images that were randomly picked from the two clusters of the Stroma class. A clear visual difference can be observed between the images in each cluster, indicating that the clustering process was effective in separating the irregular images. Table 1 shows the distribution of dataset samples after the decomposition stage.
D. Coarse Transfer Learning
In this work, two models, namely Xception [29] and InceptionV3 [30], were chosen for transfer learning. The models were initialized with pre-trained ImageNet weights, with the top fully connected layers excluded. To reduce the number of parameters, the convolutional bases were passed through a GlobalAveragePooling2D layer instead of being flattened. A new dense, fully connected layer with 1024 neurons and ReLU activation was then added, and the final output layer had eight neurons with SoftMax activation, corresponding to the number of classes. Three transfer learning strategies were used to train the models: shallow tuning, fine-tuning, and deep tuning. In shallow tuning, only the fully connected layers are trained while the parameters of all other layers in the convolutional base are frozen. Fine-tuning unfreezes the weights from layer 14 onward (roughly the top 25--35% of the network) to adapt them to the dataset. Finally, in deep tuning, all layers are trained.
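A sketch of this construction and of the three tuning modes is given below; treating the mode as a function argument and the exact unfreezing boundary are illustrative choices:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

def build_model(num_classes, mode="deep", unfreeze_from=14):
    """Pre-trained backbone + GAP + 1024-unit ReLU dense + softmax head.

    mode: "shallow" trains only the new head, "finetune" additionally
    unfreezes layers from `unfreeze_from` onward, "deep" trains all layers.
    """
    base = Xception(weights="imagenet", include_top=False, pooling="avg")
    if mode == "shallow":
        base.trainable = False
    elif mode == "finetune":
        for layer in base.layers[:unfreeze_from]:
            layer.trainable = False
    # mode == "deep": every layer stays trainable.

    x = layers.Dense(1024, activation="relu")(base.output)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(base.input, outputs)
```

The same constructor serves both stages: `num_classes` is 16 for the coarse (subclass-level) stage and 8 for the original class-level stage.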
As we are addressing a multi-class classification problem, the categorical cross-entropy loss function was used during the training process of the pre-trained CNN utilized for coarse transfer learning (Fig. 1B). In this stage, the decomposed classes generated from the decomposition stage were used. The cross-entropy loss is a measure of the difference between the predicted class probabilities and the true class probabilities, and it is defined as:
\begin{equation*}
L_{Coarse}(y, \hat{y}) = - \sum _{i=1}^{|S|} y_{i} \log (\hat{y}_{i}), \tag{5}
\end{equation*}
where $|S|$ is the number of decomposed subclasses (16 in our case), $y_{i}$ is the one-hot ground-truth label, and $\hat{y}_{i}$ is the predicted probability for subclass $i$.
E. Learnable Composition Using Fine Transfer Learning
One of the standout features of DeepCon is its ability to learn the composition needed to predict the original classes. This is achieved through an additional stage of transfer learning: a pre-trained CNN retains some of the features learned for the decomposed classes and is fine-tuned for the classification task over the original classes. In other words, the learned composition stage of DeepCon initializes a CNN with the weights obtained from the subclass-level classification and fine-tunes it to perform the original class-level classification.
Similar to the previous pre-trained CNN utilized for the coarse transfer learning, categorical cross-entropy is used during the training process, but with the original class labels. The cross-entropy loss used in the learnable composition stage is defined as follows:
\begin{equation*}
L_{Fine}(y, \hat{y}) = -\sum _{i=1}^{C} y_{i} \log (\hat{y}_{i}), \tag{6}
\end{equation*}
where $C$ is the number of original classes (8 in our case).
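A sketch of this second transfer stage, assuming the coarse-stage model was built as in the earlier sketch (the layer indexing is illustrative; the SGD optimizer and the 0.005 learning rate are those reported in the results section):

```python
from tensorflow.keras import layers, models, optimizers

def compose_model(coarse_model, num_original_classes=8):
    """Reuse the coarse-stage weights and swap the 16-way head for an 8-way one."""
    penultimate = coarse_model.layers[-2].output  # 1024-d dense features
    outputs = layers.Dense(num_original_classes,
                           activation="softmax")(penultimate)
    fine_model = models.Model(coarse_model.input, outputs)
    fine_model.compile(optimizer=optimizers.SGD(learning_rate=0.005),
                       loss="categorical_crossentropy",
                       metrics=["accuracy"])
    return fine_model
```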
Results
In this work, all experiments were performed using 10-fold cross-validation, and accuracy and F1-score were adopted to evaluate the performance of the models. They are determined as follows:
\begin{align*}
Accuracy &= \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \tag{7}
\\
F1\text{-}score &= \frac{\text{TP}}{\text{TP}+\frac{1}{2}(\text{FP}+\text{FN})} \tag{8}
\end{align*}
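Both metrics can be computed from the fold predictions, e.g. with scikit-learn; macro-averaging the F1-score over the eight classes is our assumption:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate_fold(y_true, y_pred):
    """Accuracy and F1-score for one cross-validation fold."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),  # assumed averaging
    }
```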
Each model was trained with the different transfer learning modes, with stratified cross-validation used to preserve the class balance in each training and test fold. A batch size of 32 was selected, as this has been shown to improve generalizability and training stability with Stochastic Gradient Descent (SGD) in transfer learning settings, especially in the fine-tuning mode. All models were implemented using the TensorFlow 2.1 framework and trained and tested on an NVIDIA 2070 GPU with 8 GB of memory under CentOS 7.8.
Table 2 shows the configuration of the six different experiments for the selected models and the tuning strategies.
For the transfer learning stage of the DeepCon model, the Xception and InceptionV3 networks were tested with the shallow, fine-tuning, and deep transfer learning modes on the clustered dataset, using the same hyperparameters and decay rate. The number of epochs for the shallow and fine-tuning modes was changed to 60 and 50, respectively. For evaluation, the labels of each cluster pair were collapsed back into their parent class, for both predictions and ground truth, before the evaluation metrics were calculated. For each cross-validation iteration, the DeTraC model was trained on the decomposed dataset; its output layer was then replaced with one of 8 neurons, and the model was trained on the original dataset, ensuring the same training and testing images were used across models for each iteration. For the second stage, the number of epochs was 7, 10, and 15 for the deep, fine-tuning, and shallow modes, respectively. The learning rate was also reduced to 0.005, with no learning rate decay.
Table 3 compares the results obtained by the Xception and InceptionV3 networks under the three fine-tuning modes. The results of the initial transfer learning experiments show that deep transfer learning, using the pre-trained weights as weight initialization, with the Xception model outperformed all the other models. However, despite a notable performance difference between the fine-tuning and shallow-tuning modes for the Xception model, no such difference is observed for InceptionV3.
The DeTraC method degraded overall system performance across the various fine-tuning modes and pre-trained models. In contrast, our proposed DeepCon shows significant improvements of 1.2% and 0.2% for the Xception network with the fine-tuning and deep-tuning modes, respectively. Tables 4 and 5 illustrate the performance difference after applying the DeTraC and DeepCon models.
Finally, a paired Student's t-test was used to determine whether the results obtained by DeTraC and DeepCon differed statistically significantly from the base experiments. The results showed that only the deep DeTraC model was significant, with p = 0.027. For DeepCon, the fine-tuning and deep models were significant, both with p = 0.0002.
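For reference, such a test can be reproduced on the matched per-fold scores with SciPy (a sketch; function and variable names are ours):

```python
from scipy.stats import ttest_rel

def paired_significance(base_scores, model_scores):
    """Paired Student's t-test over per-fold accuracies of two models.

    Both score lists must come from the same cross-validation folds.
    """
    t_stat, p_value = ttest_rel(model_scores, base_scores)
    return t_stat, p_value
```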
Conclusion
Transfer learning based on state-of-the-art CNN image classification models has been widely researched in medical applications, where data is often limited. This paper presented a novel deep learning model, called DeepCon, to investigate the effect of divide-and-conquer applied to the original classes on the learning process. DeepCon introduces a two-stage transfer learning mechanism: knowledge is first transferred to the decomposed subclasses of the original classes with coarse transfer learning, and a learned composition with fine transfer learning is then employed to recover the original classes. The proposed model has been validated on a clinically valid dataset of 5000 colorectal cancer (CRC) images. DeepCon in the deep-tuning mode outperforms the shallow-tuning mode, achieving the highest accuracy among all pre-trained models used in this work and confirming the impact of the composition learned by DeepCon.