Exploiting Cascaded Ensemble of Features for the Detection of Tuberculosis Using Chest Radiographs

Tuberculosis (TB) is a communicable disease and one of the top 10 causes of death worldwide according to the World Health Organization. Hence, early detection of tuberculosis is essential to save millions of lives from this life-threatening disease. For diagnosing TB from chest X-rays, different handcrafted features were utilized previously, and they provided high accuracy even on small datasets. At present, however, deep learning (DL) has gained popularity in many computer vision tasks because of its better performance compared to traditional machine learning approaches based on manual feature extraction, and tuberculosis detection is no exception. Considering these facts, a cascaded ensembling method is proposed that combines both hand-engineered and deep learning-based features for the tuberculosis detection task. To make the proposed model more generalized, rotation-invariant augmentation techniques are introduced, which are found to be very effective in this task. Using the proposed method, outstanding performance is achieved through extensive simulation on two benchmark datasets (99.7% and 98.4% accuracy on the Shenzhen and Montgomery County datasets respectively), which verifies the effectiveness of the method.


I. INTRODUCTION
According to the World Health Organization (WHO), in 2018, more than 1.5 million people died worldwide due to tuberculosis (TB) [1], [2]. Though effective medications have been discovered by researchers to treat TB patients, early detection is required before the treatment process can begin. The current gold-standard definitive test for TB detection is the identification of Mycobacterium tuberculosis in a clinical sputum or pus sample [3]. Additionally, other techniques such as sputum smear microscopy, in which bacteria in sputum samples are observed under a microscope, are also used for TB detection.
Nowadays, the chest radiograph has become a more acceptable and widely used technique for the detection of tuberculosis. Another imaging technique, computed tomography (CT) of the chest region, can provide significantly more diagnostic information than a chest radiograph, but it is less popular among low-income people because of its high cost [4], [5]. Hence, more people will benefit if the detection of TB from chest radiographs is made automatic and more reliable.
To detect tuberculosis (TB) from chest radiographs, different computer-aided methods have been introduced by researchers, and most of these methods have provided impressive results along with information about abnormalities [6]-[8]. When using computer-aided methods to detect TB from chest radiographs, an important step is the feature extraction process, which involves two alternative techniques: conventional hand-engineered features and deep learning features. In tuberculosis detection, different combinations of shape, edge, and texture descriptors have been used as hand-engineered features, and these features have also been applied successfully to microscopy images of cells, for which the cell cycle phase has been classified based on appearance and pattern [9]-[14].
However, neural networks, especially the convolutional neural network (CNN), have recently brought significant improvement to computer-aided image classification tasks and provided more significant results than conventional machine learning approaches [15], [16]. As a result, many researchers now use deep learning features for abnormality detection in the chest region and obtain better results than conventional machine learning approaches [17]-[19]. In previous CNN-based tuberculosis detection approaches, the weights of different layers of multiple existing pre-trained models are fine-tuned, and the features extracted from those layers are combined to create a deep learning ensemble model [14], [20]. The main advantage of using a CNN is that it extracts discriminative features automatically by considering the data characteristics. As both conventional machine learning and deep learning-based features have error profiles that can degrade the output result, an approach is introduced in this paper utilizing hand-engineered features together with features extracted from fine-tuned pre-trained deep convolutional network (DCN) models to predict TB in a chest X-ray. The main contribution of this paper is to employ a cascaded ensemble of hand-engineered features with deep learning features extracted from multiple pre-trained models, and to evaluate this ensemble technique as an alternative to the previously used approach of separately ensembling different handcrafted features and deep learning features. Besides, rotation-invariant augmentation techniques are applied to the training data in the pre-processing step, which is found to be very effective in increasing the performance of the proposed method.

II. PROPOSED METHOD
All major steps involved in the proposed method are depicted in Fig. 1. At first, images are reshaped to a uniform size (224 × 224) followed by min-max normalization, and then rotation-invariant augmentation techniques are used to augment the training samples. Next, these samples are passed through two distinct types of feature extraction procedures. The first one involves the hand-engineered feature extraction process applied to the training images; by applying feature selection and reduction procedures, these features are then prepared for the subsequent aggregation process. The second feature extraction process employs three fine-tuned, transfer-learned deep convolutional neural networks to extract deep learning-based features from the same training dataset. To obtain probability scores, a logistic regression classifier is applied to the hand-engineered feature set, and a receptive-field-aware neural network is trained using the features of selected layers of the aforementioned transfer-learned models. A receptive-field-aware neural network can enlarge the receptive field of the deep learning models, which in turn increases the learning ability and enhances the discrimination ability of the deep learning-based features. Finally, the probability scores from both processing branches are combined to create the feature vector, which is then passed to the final classifier to obtain the desired ensemble model. A detailed description of each of the processes is presented in the following subsections.

A. ROTATION INVARIANT AUGMENTATION
Although both of the datasets used in this paper are benchmark datasets whose chest X-ray images were collected over many years, the number of chest radiographs available is limited. For this reason, data augmentation is required to increase the number of training samples and to achieve better generalizability. The most common technique used for image augmentation is rotation. However, rotation in chest radiographs can adversely affect the performance of subsequent automated processing steps in screening algorithms, such as lung segmentation and detection [40]. Considering this fact, we utilize three rotation-invariant data augmentation techniques in the proposed method, which are visualized in Fig. 2. Besides, a brief analysis of the effect of introducing rotation into the training samples is presented in Section III.

1) ELEMENT-WISE POWER
Each pixel of an image I is raised to a power p, defined as

p = 1 + r f

where f is a random number drawn from the Gaussian distribution X ~ N(µ = 0, σ = 1) and r is a number less than unity. Finally, the augmented image I_a is generated by

I_a = I^p

2) GAUSSIAN FILTERING
A Gaussian filter, defined by a variance σ between 0.3 and 0.9, is applied to each of the images in the training set. The radius of the kernel is chosen as r = 4σ.

3) SHEARS
Shearing is applied to each image using an affine transformation of the form

[x', y']^T = [[1, s_x], [s_y, 1]] [x, y]^T

where s_x and s_y denote the shear factors along the horizontal and vertical axes respectively.
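The three augmentation techniques above can be sketched as follows. This is a minimal illustration, assuming images are min-max normalized to [0, 1]; the exact form of the exponent p = 1 + r·f, the value of r, and the shear factor are assumptions, since the paper does not state them.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, affine_transform

rng = np.random.default_rng(0)

def elementwise_power(img, r=0.1):
    # Raise every pixel to a power p = 1 + r*f with f ~ N(0, 1);
    # the exact form of p is an assumption.
    f = rng.standard_normal()
    p = 1.0 + r * f
    return np.clip(img, 1e-6, 1.0) ** p  # img assumed in [0, 1]

def gaussian_blur(img):
    # Gaussian filtering with sigma drawn from [0.3, 0.9]; scipy
    # truncates the kernel at truncate*sigma, so truncate=4 matches
    # the paper's kernel radius r = 4*sigma.
    sigma = rng.uniform(0.3, 0.9)
    return gaussian_filter(img, sigma=sigma, truncate=4.0)

def shear(img, sx=0.1):
    # Horizontal shear via an affine map; the factor is illustrative.
    mat = np.array([[1.0, sx], [0.0, 1.0]])
    return affine_transform(img, mat)
```

Each transform preserves the image shape, so augmented samples can be mixed freely with the originals during training.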

B. HANDCRAFTED FEATURE EXTRACTION
Handcrafted feature extraction is a conventional image processing approach that uses various algorithms to collect the information present in the image itself. Many image classification tasks depend on the local characteristics of images. In order to capture these local characteristics, a large number of manually designed handcrafted features are extracted. Most of the time, the selection of these features depends entirely on their accuracy and computational efficiency. The extraction process of the handcrafted features used for tuberculosis detection is discussed in this section.

1) HISTOGRAM OF ORIENTED GRADIENTS
The histogram of oriented gradients is a frequently used descriptor in medical diagnosis tasks [21]. In order to extract it, the image is first divided into small connected regions called cells. Let g_x and g_y be the gradients of an image I(x, y) along the x and y directions respectively, and θ_g the angle of gradient. The gradient magnitude and the angle of gradient at each pixel are calculated as [22]:

g = sqrt(g_x^2 + g_y^2), θ_g = arctan(g_y / g_x)

Next, for each pixel within a cell, a histogram of gradient directions is calculated by dividing the angles into equally spaced intervals called histogram bins, and each pixel inside the cell votes for a histogram bin according to its angle of gradient. In this method, the number of histogram bins is set to 8 and the cell size is selected as 16 × 16.
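A HOG descriptor with the stated settings (8 orientation bins, 16 × 16-pixel cells) can be computed with scikit-image; this is a sketch, and the `cells_per_block=(1, 1)` block setting is an assumption, since the paper does not describe block normalization.

```python
import numpy as np
from skimage.feature import hog

# HOG with the paper's settings: 8 orientation bins, 16x16-pixel cells.
img = np.random.rand(224, 224)
features = hog(img, orientations=8, pixels_per_cell=(16, 16),
               cells_per_block=(1, 1), feature_vector=True)
# A 224x224 image yields 14x14 cells with 8 bins each,
# i.e. a 1568-dimensional descriptor.
print(features.shape)
```

The resulting vector can be concatenated with the other handcrafted descriptors before feature selection.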

2) BINARY PATTERN OF PHASE CONGRUENCY (BPPC)
The binary pattern of phase congruency (BPPC) employs both the Phase Congruency (PC) and Local Binary Pattern (LBP) approaches to describe the image data [23]. Phase congruency information enhances the textural content of an image, such as edges and lines, and is a common descriptor in the literature. It is calculated as:

PC(x) = E(x) / (Σ_n A_n(x) + ε)

where A_n(x) is the amplitude of the wavelet transform at a given wavelet scale n, E(x) is the local energy, and ε is a small constant that avoids division by zero. For Phase Congruency, the number of filter orientations and the number of filter wavelengths are set to 6 and 3 respectively. On the other hand, for LBP, the number of neighboring pixels is chosen to be 8.

3) IMPROVED WEBER BINARY CODE(IWBC)
The Improved Weber Binary Code (IWBC) is a feature map obtained by encoding each of the components of the Improved Weber Local Descriptor (IWLD) features [24]. IWLD captures the local variation of pixels in terms of magnitude and orientation and provides a feature map of an image. With the help of this feature map, three histograms are created in each sub-region (M_r × M_b) of the image and then concatenated across the different regions to represent the image. For our work, the size of the sub-region is selected as 8 × 8.

4) CURVATURE DESCRIPTOR HISTOGRAM (CDH)
The Curvature Descriptor Histogram is a shape-based feature [25]. For each pixel of an image I(x, y), it is computed from λ_1 and λ_2, the eigenvalues of the Hessian matrix. An 8-bin histogram is then formed, where each bin serves as a feature.

5) SHAPE DESCRIPTOR HISTOGRAM (SDH)
The shape descriptor histogram is used to describe the relationship among different key points of the image [26]. For an image I(x, y), the shape descriptor SD is calculated from λ_1 and λ_2, the eigenvalues of the Hessian matrix. An 8-bin histogram is then formed where each bin serves as a feature, as in the case of the curvature descriptor histogram.

6) LOCAL ARC PATTERN (LAP)
The Local Arc Pattern (LAP) finds the gray-level intensity differences of a certain pixel with respect to its neighboring pixels in all possible directions [27]. For a particular 5 × 5 block, two binary patterns are created. The first pattern consists of 4 bits, found by comparing the neighboring pixels of the center pixel of that block in a particular direction. The second pattern consists of 8 bits, found in a similar manner. Finally, the histograms of the two separate binary patterns are computed and concatenated to extract the LAP features.

7) LOCAL TRANSITIONAL PATTERN (LTP)
The local transitional pattern consists of two basic steps. In the first step, a Gabor filter is applied to represent the textural content of the image [28]. The Gabor filter is well known for modeling the visual cortex of mammalian brains and for capturing textural representations [29]. In the 2D domain, the Gabor filter is defined as follows:

g(x, y) = exp(-(x'^2 + γ^2 y'^2) / (2σ^2)) cos(2π x'/λ + ψ), with x' = x cos θ + y sin θ and y' = -x sin θ + y cos θ

where σ, ψ, λ, and γ denote the filter size, the phase shift, the wavelength, and the spatial aspect ratio of the sinusoid respectively, and θ is the filter orientation. Then, the local transitional binary pattern is computed by comparing the neighboring pixels of different levels at different orientations.
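The Gabor filtering step can be sketched with scikit-image's built-in Gabor kernel; the frequency and orientation values below are illustrative, not the paper's.

```python
import numpy as np
from skimage.filters import gabor

img = np.random.rand(64, 64)
# Apply a Gabor filter at frequency 0.4 and orientation pi/4; the
# response has a real and an imaginary part, and the real part is
# typically used as the textural representation.
real, imag = gabor(img, frequency=0.4, theta=np.pi / 4)
print(real.shape)
```

The binary transitional pattern would then be computed on the filtered response rather than on the raw image.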

8) MEDIAN BINARY PATTERN (MBP)
The Median Binary Pattern (MBP) is very similar in operation to the Local Binary Pattern [30]. At first, the median value of the local neighboring pixels around the center pixel of a block is calculated, and then each pixel of that block is compared against this median value. For extracting the feature, a 3 × 3 block is considered.
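A minimal MBP sketch over 3 × 3 blocks is shown below; the bit ordering of the neighbors is a convention chosen here for illustration, as the paper does not specify one.

```python
import numpy as np

def median_binary_pattern(img):
    # MBP: threshold each of the 8 neighbors of a pixel against the
    # median of its 3x3 block and pack the resulting bits into a code.
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Clockwise neighbor offsets starting from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            block = img[i - 1:i + 2, j - 1:j + 2]
            med = np.median(block)
            code = 0
            for k, (di, dj) in enumerate(offsets):
                if img[i + di, j + dj] >= med:
                    code |= 1 << k
            codes[i - 1, j - 1] = code
    return codes
```

As with LBP, a histogram of the resulting codes serves as the feature vector.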

C. RATIONALE BEHIND THE PROPOSED HAND-ENGINEERED FEATURE SELECTION PROCESS
All of these extracted features may not perform well in the tuberculosis detection process. Hence, after feature extraction, the quality of the hand-crafted features is evaluated in terms of class separability using two standard goodness-of-feature measures, namely the Bhattacharyya Distance (BD) and the Geometrical Separability Index (GSI). The similarity of two probability distributions is measured using the Bhattacharyya Distance, which is closely related to the Bhattacharyya Coefficient (BC); BC gives information about the overlap of the two distributions. For Gaussian class distributions, BD is computed as [31]

BD = (1/8)(µ_1 - µ_2)^T δ^{-1} (µ_1 - µ_2) + (1/2) ln( det(δ) / sqrt(det(δ_1) det(δ_2)) ), with δ = (δ_1 + δ_2)/2

Here δ_i and µ_i represent the covariance matrix and mean vector of the i-th cluster. The Bhattacharyya coefficient is computed as

BC = e^{-BD}

On the other hand, GSI measures the separability of classes in the nearest-neighbor sense and is defined as [32]

GSI = (1/n) Σ_{i=1}^{n} [f(x_i) = f(x_iNN)]

Here x_i and x_iNN represent a data point and its nearest neighbor respectively, f(.) is the binary decision classification function which gives the class number of a data point, and [.] equals 1 when its argument is true and 0 otherwise.
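Both measures can be sketched with NumPy and SciPy; the Bhattacharyya distance below assumes Gaussian class-conditional distributions, as in the formulas above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def bhattacharyya_distance(x1, x2):
    # BD between two feature clusters under a Gaussian assumption.
    m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
    c1 = np.cov(x1, rowvar=False)
    c2 = np.cov(x2, rowvar=False)
    c = (c1 + c2) / 2.0
    d = m1 - m2
    term1 = 0.125 * d @ np.linalg.inv(c) @ d
    term2 = 0.5 * np.log(np.linalg.det(c) /
                         np.sqrt(np.linalg.det(c1) * np.linalg.det(c2)))
    return term1 + term2  # BC = exp(-BD)

def separability_index(x, y):
    # GSI: the fraction of points whose nearest neighbor (excluding
    # themselves) carries the same class label.
    d = cdist(x, x)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)
    return float((y == y[nn]).mean())
```

Two well-separated clusters give a large BD (hence a small BC) and a GSI close to 1.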

D. DEEP LEARNING BASED FEATURE EXTRACTION
Compared to conventional machine learning approaches based on hand-engineered features, deep neural networks have recently achieved remarkably better performance in various detection and classification tasks. Most of the deep learning-based approaches applied in these tasks use transfer learning or deep residual networks [33], [34]. Surprisingly, in some cases they even outperform human detection capability. In this paper, fine-tuned layers of pre-trained deep neural networks are used to extract the high-level features inherent in the data. Three deep convolutional network models, namely Inception-V3, DenseNet-169, and ResNet-50, are utilized here to extract the deep learned features [34], [38], [39]. These deep convolutional networks differ from each other in architecture and in the number of convolutional layers. These architectures expect a 224 × 224 input image size; hence, the input images are reshaped to 224 × 224. To fine-tune these pre-trained networks and extract features from a particular layer of a model, all layers up to the desired layer are frozen and a binary classifier layer is added on top to classify the image. A receptive-field-aware neural network classifier is then used to evaluate the features collected from the different layers of each model. Convolutional neural networks (CNNs) are devised such that layer l_i feeds from layer l_{i-1} and feeds its outputs to layer l_{i+1}. Hence, whatever layer l_i learns is a composition of features from the initial layers up to layer l_i. Because of this property, deeper layers of a CNN cover a much larger effective area of the input image, which makes them capable of extracting more complex features than the earlier layers, which learn basic features such as edges and shapes. Hence, the features of the last layers of a DCN can be used to train a classifier for the classification task. Accordingly, for the chest X-ray classification task, the last few layers of the DCNs are used to train a shallow neural network. All the models have been implemented in Keras, and the parameters θ_i of the i-th layer at time t are updated with the Adam optimizer with learning rate α.
The extracted deep learning features and their performance in this particular task are described in Section III. The notations used for the layer names of these models follow those used in the Keras library. Features yielding higher accuracy are selected for the subsequent ensembling process with the hand-engineered features. Based on accuracy, the features of layer conv5_block3_out of ResNet-50, batch_normalization_93 of Inception-V3, and conv5_block32_concat of DenseNet-169 are selected for the ensembling process.
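Extracting features from an intermediate layer of a pretrained Keras model can be sketched as below. `conv5_block3_out` is the ResNet-50 layer named in the text; `weights=None` (random weights) is used here purely for illustration — in practice, ImageNet or chest X-ray pretrained weights would be loaded and fine-tuned.

```python
import numpy as np
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model

# Build a feature extractor ending at an intermediate ResNet-50 layer.
base = ResNet50(weights=None, input_shape=(224, 224, 3), include_top=False)
extractor = Model(inputs=base.input,
                  outputs=base.get_layer("conv5_block3_out").output)

# A single 224x224 RGB image yields a 7x7x2048 feature map.
x = np.random.rand(1, 224, 224, 3).astype("float32")
features = extractor.predict(x, verbose=0)
print(features.shape)  # (1, 7, 7, 2048)
```

The same pattern applies to Inception-V3 and DenseNet-169 by substituting the model class and the layer name.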

E. CLASSIFIER SELECTION
Choosing an appropriate classification model for a dataset is difficult because of the applicability of the classifier to the particular data, high feature dimensionality, and the computational cost of the classification algorithm. To choose the appropriate classification model for this case, different popular models are examined and their performance is evaluated. Among all the classification models, Logistic Regression (LR) [35] is chosen because of its ability to perform well in low-dimensional spaces and on binary classification problems. Moreover, it has low computational complexity as well. In this method, each chest X-ray image is classified into two classes using logistic regression, based on the handcrafted and deep learning features extracted in Section II-B and Section II-D.
On the other hand, for the deep learning features, a receptive-field-aware neural network classifier is used, as shown in Fig. 3. This neural network classifier consists of one receptive-field-aware block (shown in Fig. 4), one Global Average Pooling (GAP) layer, one dropout layer, and three dense layers. The receptive-field-aware block consists of three parallel convolutional blocks with kernel sizes of 1 × 1, 3 × 3, and 5 × 5 respectively. Each of these convolutional blocks is followed by another convolutional block with a kernel size of 3 × 3 but dilation rates of 1, 3, and 5 respectively. At the end, the outputs of these convolutional blocks are concatenated and passed through a convolutional block with a 1 × 1 kernel. The GAP and dropout layers help to reduce overfitting of the model and interpret the feature maps as category confidence maps. This type of architecture utilises multi-branch pooling with varying kernels, which strengthens the deep learned features [44]. On top of that, the dilated convolutions control their eccentricities and generate a good final representation. Since a dropout rate greater than 0.2 on the input data adversely affects the training of the model, a dropout rate of 0.2 is used. The three consecutive dense layers help to reduce underfitting of the network and increase the probability of capturing global relationships between the features. The activation functions used in the first and last dense layers are the Rectified Linear Unit (ReLU) and the sigmoid function respectively.
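The receptive-field-aware classifier described above can be sketched in Keras as follows. The filter counts and the widths of the intermediate dense layers are assumptions, since the paper does not state them; the branch structure, dilation rates, dropout rate, and activation choices follow the text.

```python
from tensorflow.keras import layers, Model, Input

def receptive_field_aware_classifier(input_shape=(7, 7, 2048)):
    # Three parallel branches: 1x1, 3x3, 5x5 convs, each followed by a
    # 3x3 conv with dilation rate 1, 3, 5 respectively; branches are
    # concatenated, fused by a 1x1 conv, then GAP -> Dropout(0.2) ->
    # three dense layers (first ReLU, last sigmoid).
    inp = Input(shape=input_shape)
    branches = []
    for k, d in [(1, 1), (3, 3), (5, 5)]:
        b = layers.Conv2D(64, k, padding="same", activation="relu")(inp)
        b = layers.Conv2D(64, 3, padding="same", dilation_rate=d,
                          activation="relu")(b)
        branches.append(b)
    x = layers.Concatenate()(branches)
    x = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.2)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dense(32)(x)  # middle layer width/activation assumed
    x = layers.Dense(1, activation="sigmoid")(x)
    return Model(inp, x)

model = receptive_field_aware_classifier()
```

The model consumes the intermediate feature maps produced by the transfer-learned backbones and outputs a single abnormality probability.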

F. CASCADED ENSEMBLE OF SELECTED FEATURES
A cascaded ensemble is a technique where the outputs of multiple learning algorithms are considered and combined sequentially to make the final prediction. In this paper, instead of concatenating multiple features, we propose a cascaded ensemble of hand-engineered features along with different deep learned features, building multiple ensemble models and utilizing them to generate the final predictive model. The reason for selecting this approach is to find a better predictive model than any single one. Referring to Fig. 5, at first, each logistic regression classifier produces a probability score matrix over the sample space ω, where in our case ω = {normal, abnormal}. Hence, the dimension of the probability score matrix is two times the number of samples, and the matrices are represented by p_BPPC, p_IWBC, p_LAP, p_LTP, and p_MBP for the hand-engineered feature sets. Then, the probability scores from the selected hand-engineered features are combined by the following equation:

P = Σ_{i=1}^{n} w_i p_i (12)

Here, n is the number of methods used, w_i is a weighting factor whose value is taken as 1/n when no prior information exists about the class probabilities, and p_i is the score obtained from the i-th method. This equation produces the final probability score of the hand-engineered features, P_HE, which is then fed to a logistic regression classifier to produce the final ''Hand-Engineered Ensemble'' model. Similarly, on the other branch, each shallow neural network produces a similar predictive score matrix, represented by p_DENSENET, p_RESNET, and p_INCEPTION. These probability scores from the deep learned features are also combined using (12) to produce the final predicted score P_DL. This predicted score is then used as the input to a logistic regression classifier to obtain the ''Deep Learning Ensemble'' model.
Finally, the combined scores P_HE and P_DL are once again combined using (12) to obtain the final predicted score P_Final, which is then classified through a logistic regression classifier to produce the final ensemble model.
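One stage of this cascade — averaging per-method probability scores with uniform weights and feeding the result to a logistic regression classifier — can be sketched as below. The three score vectors are synthetic toy data for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def combine_scores(score_list):
    # Weighted sum of per-method probability scores with uniform
    # weights w_i = 1/n, used when no prior class information exists.
    n = len(score_list)
    return sum(score_list) / n

# Toy probability scores from three hypothetical methods (100 samples).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)
scores = [np.clip(y + rng.normal(0, 0.3, 100), 0, 1) for _ in range(3)]

p_combined = combine_scores(scores)
# The combined score feeds a logistic regression to form one ensemble
# stage; the cascade repeats this for the HE branch, the DL branch,
# and finally for the pair (P_HE, P_DL).
clf = LogisticRegression().fit(p_combined.reshape(-1, 1), y)
```

Stacking the stages this way lets each logistic regression learn how much to trust the averaged scores of the previous level.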

III. RESULTS AND ANALYSIS
In this section, a detailed overview of the simulations at the different stages of our proposed method is given. Details of the datasets, the feature quality analysis, and the performance of the hand-engineered, deep learned, and proposed cascaded ensemble models are discussed in the following subsections.

A. DATASET
The proposed method is evaluated on two standard chest X-ray datasets. The first one is the Montgomery County (MC) dataset, which was made publicly available by the Department of Health and Human Services, Montgomery County, Maryland, USA [9]. It comprises 138 frontal chest X-rays, of which 80 are normal and the rest show manifestations of tuberculosis. Each image in this set has a resolution of either 4,020 × 4,892 or 4,892 × 4,020 pixels.
The second dataset used for evaluation is the Shenzhen dataset, a standard digital image database for tuberculosis created by the National Library of Medicine, Maryland, USA in collaboration with Shenzhen No. 3 People's Hospital, Guangdong Medical College, Shenzhen, China [12]. It contains 662 cases in total, of which 336 show manifestations of tuberculosis and the rest are normal. These X-rays are of variable size, but all are approximately 3000 × 3000 pixels. This dataset also provides a clinical reading of each of the CXRs. The data were captured within a one-month period as part of the daily routine at Shenzhen Hospital using a Philips DR Digital Diagnost system. Typical examples of a chest X-ray image of a normal patient and of a patient suffering from tuberculosis are depicted in Fig. 6.
In order to improve the diagnostic capability of the transfer-learned models used in the proposed method, pretraining the models plays a very important role. It is worth mentioning that most of these models are pretrained on the ImageNet dataset, which comprises various natural images. For our objective, however, it is necessary to pretrain the models with images similar in nature to chest X-ray images. Hence, a third dataset is utilized in our method solely for pretraining the models. It is taken from the NIH Chest X-ray dataset [41], which comprises 108,948 X-ray images covering 8 thorax abnormalities.

B. FEATURE QUALITY ANALYSIS
As stated earlier, two standard measures, the Bhattacharyya Coefficient (BC) and the Geometric Separability Index (GSI), are utilized to select the optimum feature set from the feature pool. A higher BC value for a feature indicates that it has less capability to separate the data into clusters; hence, features with lower BC values are desired. The opposite holds for GSI, where a lower value indicates less capability to separate the data into clusters. The computed BC and GSI of our preliminarily selected hand-engineered features are shown in Table 1. From the table, it is found that the Histogram of Oriented Gradients (HOG), Curvature Descriptor Histogram (CDH), and Shape Descriptor Histogram (SDH) have the highest BC and lowest GSI values, and hence these three features are excluded from the selection process.

C. EXPERIMENTAL SETUP
All the experiments in this study have been run on the Google Cloud Platform with an NVIDIA P100 GPU as the hardware accelerator, using 5- and 10-fold cross-validation schemes. In the intermediate layer of the transfer-learned neural network classifier, an L2 regularizer with a factor of 0.01 is imposed on both the kernel and bias terms to learn sparse features. The Adam optimizer is employed with an initial learning rate of 0.001, which is decayed at a rate of 0.99 after every 5 epochs. To evaluate the performance, four standard evaluation metrics are used: accuracy, sensitivity, specificity, and area under the ROC curve (AUC). The accuracy, sensitivity, and specificity are defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN) (13)
Sensitivity = TP / (TP + FN) (14)
Specificity = TN / (TN + FP) (15)

respectively, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. The higher the AUC, the better the model predicts.
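These four metrics can be computed directly from a confusion matrix and the predicted probabilities; the labels and scores below are toy values for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.7, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

# sklearn orders the flattened 2x2 confusion matrix as tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
auc = roc_auc_score(y_true, y_prob)  # threshold-independent
```

Note that the AUC is computed from the raw probability scores rather than the thresholded predictions, which is why it can discriminate between models that share the same accuracy.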

1) PERFORMANCE ANALYSIS OF HAND-ENGINEERED FEATURES AND THEIR ENSEMBLE
Generally, the performance of hand-engineered features depends entirely on the quality of the particular feature and on the task to which it is applied. The performance of the individual handcrafted features used in this method is shown in Table 2, from which a significant improvement is found when the selected features are ensembled.
In Table 3, the performance of the individual transfer-learned neural network features is summarized. The features of conv5_block3_out extracted from the fine-tuned ResNet-50 architecture, the features of conv5_block32_concat of the transfer-learned DenseNet-169, and the features of batch_normalization_93 of Inception-V3 provide prominent results on the Shenzhen dataset. Similarly, for the Montgomery County (MC) dataset, features extracted from the same layers provide noticeable classification performance.
From Table 4, it is found that the ensemble of the selected features extracted from the transfer-learned models improves when associated with pretraining and our proposed augmentation techniques. Pretraining the models with chest X-ray data provides improvements of 2.05, 3.05, 1.94, and 3.04 percent in accuracy, AUC, sensitivity, and specificity respectively, and when combined with the augmentation techniques, improvements of 4.1, 5.22, 3.7, and 5.21 percent are found in accuracy, AUC, sensitivity, and specificity respectively. A further improvement is found by applying the proposed HE and DL ensemble model. The AUC increases from the highest value of 0.974, found with the ensembled DL model, to 0.995, as shown in Fig. 7(a). Hence, a 2.15 percent higher AUC is found compared to the DL ensemble model on the Shenzhen dataset. On the other hand, a 2.17 percent improvement in AUC is found on the Montgomery County dataset, as shown in Fig. 7(b). The performance of all ensemble models is summarized in Table 6. It is clear that our proposed fusion of handcrafted and deep learning features is very effective in the tuberculosis detection task. To assess the performance of different classification schemes in terms of accuracy, AUC, sensitivity, and specificity, different classifiers are evaluated in the HE and DL ensemble model, and their performance is reported in Table 5. It is found from the table that logistic regression provides better classification performance than the other classifiers.
Since cross-validation techniques are used to evaluate the performance of the proposed ensemble models, box-and-whisker plots of their accuracies on the two datasets are shown in Fig. 9. It is found from the plot that the HE and DL ensemble technique has a low standard deviation of accuracy among the different folds of the two datasets in comparison to the other models. For the Shenzhen dataset, the standard deviation of the accuracies obtained using the HE and DL ensemble technique is 0.187, whereas it is 0.30 and 0.54 for the other two models. A similar scenario is observed on the MC dataset.
Finally, in order to localize the particular areas of interest in the images that provide valuable information in the decision-making process of the models, Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations of two random samples are provided in Fig. 8. By visualizing the Grad-CAM of the samples from the three transfer-learned networks that serve as the backbone of the proposed receptive-field-aware neural network, it can be concluded that the regions of interest are consistent among the networks.

E. PERFORMANCE COMPARISON BETWEEN DIFFERENT STATE-OF-THE-ART APPROACHES
The performance of previously proposed tuberculosis classification models on the Montgomery County (MC) and Shenzhen datasets is presented in Table 7. Previously, Jaeger et al., Govindarajan and Swaminathan, and Chandra et al. used hand-engineered feature-based approaches for this particular problem [14], [36], [37]. Abideen et al. and Pasa et al. exploited deep learning approaches to handle the tuberculosis classification problem [42], [43] and achieved significant performance.
Comparing all the models with our proposed one, it is found from Table 7 that our HE and DL ensemble model performs significantly better than the other models on both datasets. We have used different cross-validation techniques to compare our results with recent methods. Govindarajan and Swaminathan [14] and Chandra et al. [37] used 10-fold cross-validation on the Montgomery County (MC) dataset and obtained 87.8% and 95.6% accuracy respectively. On the other hand, Abideen et al. [42] and Pasa et al. [43] used 5-fold cross-validation on the MC dataset and obtained 96.42% and 79% accuracy respectively, whereas our proposed method scored 97% and 98.4% accuracy on the MC dataset when 5-fold and 10-fold cross-validation were used respectively. On the Shenzhen dataset, Chandra et al. [37] used 10-fold cross-validation, whereas Abideen et al. [42] and Pasa et al. [43] used 5-fold cross-validation; they acquired 99.4%, 86.46%, and 84.4% accuracy respectively. With 5-fold and 10-fold cross-validation on the Shenzhen dataset, our proposed method scored 97.8% and 99.7% accuracy respectively.

IV. CONCLUSION
Tuberculosis (TB) is the second-largest cause of patient morbidity and mortality, so early detection of TB is crucial to prevent death. For diagnosing TB, the most common imaging tool is the chest X-ray. However, as the number of patients increases day by day, computer-aided diagnosis systems are very advantageous for the rapid detection of TB. Machine learning and deep learning are both very helpful in this regard. As both of them have advantages of their own, in this paper we proposed a method that ensembles hand-engineered features with deep learning features to reduce the error probability. From our investigation, we have found that our proposed method provides better accuracy than employing deep learning or machine learning individually for tuberculosis detection purposes. To obtain a more generalized outcome, rotation-invariant augmentation techniques are applied, which make the model more generalized. After extensive simulation, a high level of accuracy is found on two standard datasets, which demonstrates the effectiveness of the proposed method.