Mitigating Domain Shift in AI-Based TB Screening With Unsupervised Domain Adaptation

We demonstrate that Domain Invariant Feature Learning (DIFL) can improve the out-of-domain generalizability of a deep learning Tuberculosis (TB) screening algorithm. It is well known that state-of-the-art deep learning algorithms often have difficulty generalizing to unseen data distributions due to "domain shift." In the context of medical imaging, this can lead to unintended biases, such as the inability to generalize from one patient population to another. We analyze the performance of a ResNet-50 classifier for TB screening using the four most popular public datasets, which draw imagery from geographically diverse sources. We show that without domain adaptation, ResNet-50 has difficulty generalizing between the imaging distributions of these geographically distributed public TB screening datasets. However, with the incorporation of DIFL, out-of-domain performance is greatly enhanced. Analysis criteria include a comparison of accuracy, sensitivity, specificity, and AUC for both the baseline and the DIFL-enhanced algorithms. We conclude that DIFL improves the generalizability of TB screening while maintaining acceptable accuracy on source-domain imagery when applied across a variety of public datasets.


I. INTRODUCTION
Generalizability beyond the source domain is an important and difficult challenge for machine learning. In medical imaging, it can have a major impact on clinical trustworthiness, because it is not known whether a deep learning algorithm trained on patients from one population will generalize to another without extensive out-of-domain testing. Such out-of-domain testing, however, is rarely performed in the Machine Learning (ML) literature; in-domain testing is far more common, yet it is inadequate because it does not measure the negative impact of domain shift on real-world performance. In the words of Thrall et al. [1], ''Tolerances of using AI programs in imaging between different patient populations is not yet known. Failure to recognize that a program is not generalizable, for example from adults to children or between different ethnic groups could lead to incorrect results.'' Machine learning algorithms [2]-[4] have elevated the efficacy of Computerized Aided Diagnosis (CAD) across different classes of medical imagery, such as tomography and X-ray. However, these algorithms are rarely subjected to cross-domain validation.

The associate editor coordinating the review of this manuscript and approving it for publication was Nadeem Iqbal.

A. IN-DOMAIN VERSUS OUT-OF-DOMAIN TESTING
The real-world performance of machine learning algorithms is universally lower than the reported in-domain performance due to domain shift. But how much lower? Although degradation is sometimes minor, there are reported instances of machine learning algorithms, including deep learning, whose performance drops to surprisingly low levels when tested out-of-domain. A canonical example is the difficulty that deep learning networks trained on the MNIST digits dataset have in accurately predicting the USPS zip-code digits dataset without retraining or domain adaptation. It is easy for a researcher to erroneously believe that a deep learning algorithm trained to predict MNIST digits could be applied whole-cloth to handwritten digit recognition tasks in the wild, such as parsing postal zip codes or handwritten checks. But this is often not the case, and most concerning is that this degradation is typically not measured. Fortunately, many recent works have shown that Unsupervised Domain Adaptation (UDA)-enabled algorithms trained on MNIST can generalize not only to USPS, but even to the more difficult Street View House Numbers (SVHN), which are typically printed rather than handwritten. Furthermore, UDA does not require labels for the target dataset and can be performed using only unlabeled imagery from the target domain. As such, UDA holds promise as a means to overcome the out-of-domain degradation caused by domain shift, so long as representative unlabeled images can be obtained.

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

B. IN-DOMAIN TESTING IN AI FOR RADIOLOGY
Medical imaging datasets contain a unique distribution of meta-attributes, including patient demographics, disease presentations, imaging procedures, labeling criteria, and scanner equipment and settings. It is possible for a machine learning or deep learning algorithm to overfit [5] to these attributes and thereby have difficulty generalizing to imagery from a new hospital institution or clinic, even if it appears to work well in in-domain testing, such as at a source institution for which the training data is well representative.
Nevertheless, the extent of domain shift has not been adequately studied in the context of AI for medical imaging. For the purposes of TB screening in particular, patients across different countries may exhibit different disease presentations. Disease presentation may vary due to the particular strain of the disease that is prevalent in the region, but may also vary with the severity of the disease. For example, in an institution where patients are screened only once they present severe symptoms, one might expect active TB cases to exhibit more advanced pulmonary infection than if patients are screened upon exhibiting milder symptoms. Furthermore, variation in X-ray equipment, settings, and patient positioning may affect the performance of deep learning algorithms if this variation is not well represented in the training data. Changes in image resolution, and scanning and compression artifacts, may also contribute to domain shift. Finally, inadequate diversity in patient demographics may lead to variation in performance and effectiveness across sensitive demographic groups, including potentially biased AI performance between males and females, adults and children, and/or racial and ethnic groups.

C. CONTRIBUTIONS AND TAKEAWAYS
To the best of our knowledge, no prior published work has evaluated or reported the effect of domain shift on the out-of-domain accuracy of CNNs across public TB datasets. Furthermore, no prior published work has attempted to improve out-of-domain classification accuracy. As such, the main contributions of this work are as follows:
• We measure the out-of-domain performance of a strong baseline Convolutional Neural Network (CNN) algorithm for TB screening through out-of-domain testing across the four most prevalent public datasets for TB screening with chest radiographs.
• We present a novel algorithm for adversarial DIFL for UDA with chest radiograph images.
• We evaluate the extent to which DIFL can improve out-of-domain algorithmic performance across TB screening tasks.
• We qualitatively evaluate the results across four public datasets, and include the opinion of a Board Certified Radiologist in determining the capabilities and limitations of the approach for hospital institutional use cases.
The key takeaways are as follows:
• Severe degradation of accuracy is observed in a strong baseline TB screening algorithm when measured across the geographically diverse public TB X-ray datasets.
• UDA through DIFL is capable of significantly reducing this domain-shift degradation for all source and target dataset combinations.
• The extent of the improvement, although significant for all source/target pairs, was more substantial in certain combinations than in others, suggesting that certain source datasets are better suited for generalizability with UDA.
Finally, a discussion of the limitations of this study is included at the end of this manuscript, including the hypothesis that disease presentation was a major factor in the ability of UDA to enable the CNN algorithm to generalize from one patient population to another. Furthermore, we discuss the limitations of UDA using DIFL and propose a strategy for future work to overcome these challenges.

II. TUBERCULOSIS SCREENING
TB is a contagious bacterial infection of the lungs that is widespread globally, affecting an estimated 25% of the world's population. Of those infected, 90% will never have symptoms, but 10% will progress to active TB, which frequently causes severe coughing, chest pain, and pulmonary scarring. TB develops more frequently in populations of developing countries, with higher rates of infection in Low- and Middle-Income Countries (LMIC) than in High-Income Countries (HIC). In particular, many countries on the African and South American continents have a high prevalence of TB infection.
It is desirable to train a deep learning algorithm for TB screening that can generalize to populations from around the world. However, due to the aforementioned problem of domain shift across patient populations, this has proved to be a challenging endeavor. In this paper, the impact of domain shift on TB screening algorithms is explored by utilizing four major public datasets that are available for the identical task of TB screening from chest X-ray images. By training AI models using regular deep learning algorithms on one dataset and cross-testing them on the other datasets, we can evaluate the presence of bias (i.e., the inability to generalize to other domains) in these AI models that arises due to domain shift.
In this work we make use of four public TB datasets. Three out of the four datasets are regional datasets. Each regional dataset originates wholly from a singular medical institution, but each of these medical institutions is located in a different country. As such, for the remainder of this manuscript we refer to regional datasets by their origin country. The first dataset originates from a medical institution in Shenzhen (China) [6]. The second originates from an institution in New Delhi (India) [7], and the third from an institution in Montgomery County Maryland (USA) [6].
The last public dataset, referred to as the TBX dataset, is much more extensive, containing 11,200 chest X-ray images taken from several medical institutions around the world [8]. As such, one might expect a deep learning model trained on the TBX dataset to perform well on identical TB screening tasks using out-of-domain data, such as the three regional datasets. However, in this paper we demonstrate that this is not necessarily the case: a ResNet-50 trained on TBX achieves greatly reduced out-of-domain accuracy, comparable to random guessing, when tested on the regional datasets. Nevertheless, this out-of-domain accuracy can be greatly improved through the proposed DIFL technique, achieving performance closer to that of a fully supervised model on the target distribution.
The availability of geographically distributed data sources for an identical classification screening task makes this problem an ideal candidate for evaluating unsupervised DIFL methods in the context of medical imaging. As mentioned earlier, DIFL methods have found success in relatively simple cross-domain computer vision tasks such as handwritten digit classification and object recognition. However, there is relatively little prior work employing DIFL methods for more complex tasks, particularly in the medical imaging domain. TB screening from chest X-ray images is an important problem, as TB is a very common disease in many countries internationally. Furthermore, country-to-country variation in imagery may affect classification performance, but the impact of such domain shift on AI classification has not been adequately measured, and attempts to reduce it through unsupervised DIFL have not been previously tested.

III. RELATED WORK
While there has been growing interest and development in the field of domain adaptation recently, it remains an area of machine learning that is yet to be fully explored. Domain adaptation has found its place in a wide variety of applications, ranging from being part of a broader transfer learning routine [9]-[12], to explicitly modelling the transformation between two or more domains [13], [14], and even data augmentation [15], [16]. With a wide variety of flavors to choose from, such as supervised [17]-[19], semi-supervised [20]-[22], and unsupervised [23]-[25], domain adaptation has been incorporated into an increasing number of tasks, such as handwritten digit classification [26], [27], object recognition in images acquired under different conditions [27], [28], 3D pose estimation [29], and a variety of others.
Recent advancements in deep domain adaptation include Zero-shot Deep Domain Adaptation (ZDDA) [30]- [33] which circumvents the need for target-domain data-points within a Task of Interest (ToI), so long as source and target domain datapoints can be obtained from a representative Irrelevant Task (IrT) exhibiting analogous domain shift. Prominent ZDDA techniques include the coupled generative adversarial network (CoGAN) for learning the joint distribution of data samples in dual-domains [31], [32], as well as its conditional coupled variant (CoCoGAN) [33], [34]. These methods show promising results to adaptation of tasks for which the target domain is unseen but exhibits analogous domain shift from an irrelevant source/target domain pair with more plentiful data availability.
However, domain adaptation has largely been applied to relatively simple tasks, and work in the context of medical imaging has been limited thus far. While there has been a recent uptick in the use of domain adaptation techniques for medical imaging tasks, UDA remains relatively unexplored in this context. Logically, UDA is the most difficult setting compared to supervised and semi-supervised domain adaptation, in large part due to the absence of classification labels in the target domain(s). Given the relatively complex nature of machine learning in medical imaging, i.e., the data distribution is not as easily learned as in simpler tasks such as handwritten digit classification, the large majority of domain adaptation work in medical imaging has shied away from UDA.
Perhaps the closest recent work to ours is that of Zhou et al. [35], which explores the use of semi-supervised domain adaptation methods to improve label classification accuracy across different domains for the task of predicting Covid-19 from chest X-ray images. Their proposed Semi-supervised Open Set Domain Adversarial (SODA) network utilizes an adversarial semi-supervised method of training, so that the network learns features that adapt to the target domain while maintaining relatively high accuracy on both the source and target domains. However, this approach assumes the availability of labeled data (at least in limited quantities) in the target domain, which may not be feasible in all real-world use cases.
Similarly, Madani et al. [36] examine the potential of semi-supervised domain adaptation methods to increase the accuracy of label classification on different domains for the task of predicting cardiac abnormalities from chest X-ray images, with an added focus on generating synthetic X-ray images to address data scarcity in the field of chest X-ray imaging. As with the previous work, this approach also assumes the availability of partially labeled data in the target domain.
While both works rely on DIFL methods to increase the generalizability of the classification model across domains, a semi-supervised approach assumes the availability of partially labelled data in all potential target domains. In reality, such labelled data may not be available in medical imaging (particularly for chest X-ray images), due to the difficulty of obtaining gold-standard ground truths. Unlabelled data, by contrast, is much more easily obtainable, which raises the question of whether UDA methods would be as effective as semi-supervised domain adaptation methods for the same or similar tasks.
In this paper, we investigate this hypothesis, i.e., how effective UDA is in reducing algorithmic bias due to domain shift, as measured by the performance difference of a common CNN algorithm, ResNet-50, trained with imagery from one institution but tested using imagery from another. Existing research has not yet addressed this ambitious translational task of using UDA as a means of algorithmic tuning to enhance the ability of an algorithm to generalize and maintain accuracy when applied to imagery from an institution different from the one on which it was initially trained.

IV. METHODOLOGY
In this section, we delineate our proposed unsupervised DIFL model in detail.
The implemented DIFL model draws inspiration from the Generative Adversarial Network (GAN) architecture, as discussed in [37]. The two main components of a basic GAN architecture are the generator network and the discriminator network. These two networks have opposing goals: the generator tries to manipulate the input data such that the discriminator fails to correctly categorize the generator's output into the possible classes, i.e., the generator attempts to fool the discriminator, while the discriminator attempts to perform accurate classification. As such, both networks have custom loss functions tailored to the exact function they serve. A simplified example of the GAN architecture is shown in Figure 1.
In the following discussion, an input image is represented by the variable x ∈ X, where X is the set of all input images, and its corresponding classification label is defined by the variable y ∈ Y, where Y is the set of all classification labels. The sets of images and classification labels are also subdivided into source and target domains, denoted by the subscripts S and T respectively. As such, we use the variable x_S to denote images from the set of all source-domain images X_S, where x_S ∈ X_S ⊆ X, and likewise the variable x_T to denote images from the set of all target-domain images X_T, where x_T ∈ X_T ⊆ X. Similarly, we define the variables y_S and y_T to denote classification labels from the set of all source classification labels Y_S and the set of all target classification labels Y_T respectively, where y_S ∈ Y_S ⊆ Y and y_T ∈ Y_T ⊆ Y. It is to be noted, however, that the target classification labels Y_T are unobserved, and hence they are not used for training the DIFL model. For this reason, the proposed method is called unsupervised DIFL.
Additionally, to differentiate between the source and target domains, a domain label is added for each image x ∈ X. Domain labels are denoted by the variable d ∈ D, where D is the set of all domain labels. The variable d_S denotes domain labels taken from the set of all source domain labels D_S, such that d_S ∈ D_S ⊆ D, and the variable d_T denotes domain labels taken from the set of all target domain labels D_T, such that d_T ∈ D_T ⊆ D. Domain labels differentiate the possible domains from which any particular image x could be taken. For the purposes of this paper, the source domain labels d_S ∈ D_S are set to 0 for all source images, while the target domain labels d_T ∈ D_T are set to 1 for all target images.
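The domain-labeling convention above can be sketched as follows; this is a minimal NumPy illustration (the array shapes are illustrative assumptions, not taken from the paper):

```python
import numpy as np

# Toy stand-ins for one mini-batch of source and one of target images
# (8 images each, flattened to 16 features; sizes are illustrative).
x_s = np.random.rand(8, 16)   # source-domain batch
x_t = np.random.rand(8, 16)   # target-domain batch

# Domain labels: 0 for every source image, 1 for every target image.
d_s = np.zeros(len(x_s))
d_t = np.ones(len(x_t))

# The domain discriminator is trained on the concatenated batch.
x = np.concatenate([x_s, x_t])
d = np.concatenate([d_s, d_t])
```

The concatenated batch (x, d) is exactly the image/domain-label tuple used in the domain invariance step.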
A detailed overview of the DIFL model architecture is shown in Figure 2; the model and its training process are explained in the following sections.

A. TRAINING THE DIFL MODEL
The ultimate aim of the DIFL model is to produce an adequate classification model for the target domain while using only unlabeled images from that domain. To accomplish this, it is necessary to learn generalized feature representations of the images x ∈ X, i.e., from both the source and target domains, while also performing successful label classification on the source images x_S ∈ X_S. This overall task is accomplished by training three separate neural networks simultaneously: a label classifier network, denoted C; a domain invariant feature generator network, denoted G; and a domain discriminator network, denoted D.
In the ideal scenario, the DIFL model learns perfectly generalizable features of the images x ∈ X, in which case the label classifier network, trained to correctly classify the generalized features of the source images x_S ∈ X_S, can also make accurate classifications on the generalized features of the unlabeled target-domain images x_T ∈ X_T, as the task is unchanged between the source and target domains. Thus, ideally, the DIFL model performs equivalently well on both source and out-of-domain data, which is not feasible for conventional classification models trained using data from a single domain. Training the DIFL model involves training all three networks G, C, and D, where each network makes use of a custom adversarial loss function. Broadly, the training process can be subdivided into two major steps: a label classification step and a domain invariance step. The DIFL model is trained by conducting these steps simultaneously until convergence is observed.

B. LABEL CLASSIFICATION STEP
In the label classification step, image and classification label tuples (x_S, y_S) from the source domain are used to train part of the DIFL model. The input images x_S are passed through the domain invariant feature generator network G to produce G(x_S), the domain invariant features of the input images x_S. These features are then passed through the label classifier network C to obtain C(G(x_S)), the predicted classification labels of the images x_S. We define a variable ŷ_S to represent the predicted classification labels of the source images, as follows:

ŷ_S = C(G(x_S))

The predicted classification labels ŷ_S are then compared with the true classification labels y_S through an appropriate loss function to produce the classification loss, l_C. Due to the binary nature of the classification labels, the Binary Cross Entropy loss function is utilized to calculate l_C. This can be represented mathematically as follows, where the variable N indicates the total number of images in the training batch, and the subscript i denotes each individual image:

l_C = −(1/N) Σ_{i=1}^{N} [ y_{S,i} log(ŷ_{S,i}) + (1 − y_{S,i}) log(1 − ŷ_{S,i}) ]

The loss value l_C is used to update the weights of the domain invariant feature generator G and the label classifier C, by differentiating the loss with respect to the weights of the respective networks and multiplying the resulting gradients by an appropriate learning rate before applying the updates. In the following discussions, this learning rate is referred to as the classification learning rate, represented by the variable α_C.
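The classification loss l_C can be sketched in NumPy as follows (the function name is ours, not from the released code, and a small epsilon is added for numerical stability):

```python
import numpy as np

def classification_loss(y_true, y_pred, eps=1e-7):
    """Binary cross entropy l_C, averaged over the N images in the batch."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))

# Example: two source images whose predictions are close to the true labels.
y_s = np.array([1.0, 0.0])
y_s_hat = np.array([0.9, 0.1])
loss = classification_loss(y_s, y_s_hat)
```

When every prediction is 0.9 away from the wrong class, as here, the loss reduces to −log(0.9) ≈ 0.105.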

C. DOMAIN INVARIANCE STEP
In the domain invariance step, image and domain label tuples (x, d) from both the source and target domains are used to train part of the DIFL model. The input images x are passed through the domain invariant feature generator network G to produce G(x), the domain invariant features of the input images x. These features are then passed through the domain discriminator network D to obtain D(G(x)), the predicted domain labels of the images x. We define a variable d̂ to represent the predicted domain labels of the images, as follows:

d̂ = D(G(x))

As the domain invariant feature generator network G and the domain discriminator network D are set up in a GAN-like fashion, updating these networks in the domain invariance step proceeds in a similar manner as with regular GANs. Regular GANs utilize the minimax loss function to update the generator and discriminator networks, first introduced in (Goodfellow et al., 2014), the same paper that proposed the original GAN structure. The minimax loss, represented by the value function V(G, D), is as follows:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

While the loss functions for the domain invariant feature generator network G and the domain discriminator D in the domain invariance step are similar, they need not be identical to the minimax loss above. Let us analyze the functions of these networks in closer detail to set up their respective loss functions.
The domain invariant feature generator network G attempts to produce domain invariant feature representations G(x) of the source and target images x ∈ X , such that the domain discriminator network D is unable to correctly identify which domain the images are taken from. The domain discriminator network D takes G(x) as input and aims to accurately classify them into their appropriate domains, i.e. correctly predict their domain labels.
Thus, the loss function for the domain discriminator network D can be set up in a straightforward manner: the predicted domain labels d̂ are compared with the actual domain labels d through an appropriate loss function to produce a loss value, represented by the variable l_D. Due to the binary nature of the domain labels (an image can only belong to either the source or the target domain), the Binary Cross Entropy loss function is used to calculate l_D. This can be represented mathematically using the following formula, where the variable N indicates the total number of images in the training batch, and the subscript i denotes each individual image:

l_D = −(1/N) Σ_{i=1}^{N} [ d_i log(d̂_i) + (1 − d_i) log(1 − d̂_i) ]

Setting up the loss function for the domain invariant feature generator network G requires examining its purpose in closer detail. Let us first define the term domain invariant. The feature representations of a particular image are domain invariant, i.e., highly generalized, when the domain discriminator D is unable to distinguish which domain the image originates from. Mathematically, this means the domain discriminator D assigns equal probability to the image being from the source and the target domain.
Thus, the goal of the domain invariant feature generator network is for the domain discriminator's output, the predicted domain labels d̂, to indicate a probability of 0.5 for both the source and target domains. These are termed the ideal domain labels for the domain invariant feature generator network G, represented by the variable d̂_gen. Hence, the loss for the domain invariant feature generator network G, represented by the variable l_G, is calculated by comparing the predicted domain labels d̂ with the ideal domain labels d̂_gen through an appropriate loss function. As with the domain discriminator network, due to the binary nature of the domain prediction subtask, the Binary Cross Entropy loss function is used to calculate l_G. This can be represented mathematically using the following formula, where the variable N indicates the total number of images in the training batch, the subscript i denotes each individual image, and d̂_gen,i = 0.5:

l_G = −(1/N) Σ_{i=1}^{N} [ d̂_gen,i log(d̂_i) + (1 − d̂_gen,i) log(1 − d̂_i) ]

The generator loss l_G is used to update the weights of the domain invariant feature generator network G, while the discriminator loss l_D is used to update the weights of the domain discriminator network D. This is done by differentiating the respective loss values with respect to the weights of the corresponding networks and multiplying the resulting gradients by an appropriate learning rate before applying the updates. In the following discussions, this learning rate is referred to as the domain invariance learning rate, represented by the variable α_DI. We implemented DIFL using TensorFlow, and the code for this project is made publicly available.
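The two adversarial losses of the domain invariance step can be sketched in NumPy as follows (function names are ours; the released TensorFlow code may organize this differently):

```python
import numpy as np

def bce(target, pred, eps=1e-7):
    """Binary cross entropy averaged over the batch."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))

def discriminator_loss(d_true, d_pred):
    """l_D: BCE between the true domain labels and the discriminator's predictions."""
    return bce(d_true, d_pred)

def generator_loss(d_pred):
    """l_G: BCE between the ideal domain labels d_gen = 0.5 and the
    discriminator's predictions on the generated features."""
    return bce(np.full_like(d_pred, 0.5), d_pred)
```

Note that l_G is minimized (at the value log 2 ≈ 0.693) exactly when the discriminator outputs 0.5 everywhere, i.e. when the features are domain invariant; any confident domain prediction increases the generator's loss.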

V. EXPERIMENTAL DESIGN
We now detail the experimental design undertaken to develop the final DIFL model.

A. DATASETS
The datasets used for testing and evaluating the models are listed in this section. All datasets comprise chest X-ray scans taken for the primary purpose of detecting TB.
Jaeger et al. [6] present two public TB datasets that were used in this study. The first consists of chest X-ray scans collected at Shenzhen No. 3 People's Hospital, Guangdong Medical College, Shenzhen, China; it contains 326 healthy cases and 336 TB cases, for a total of 662 X-ray images. The second comprises chest X-ray scans taken through the cooperation of the Department of Health and Human Services, Montgomery County, Maryland, USA; it has a total of 138 X-ray images, of which 80 are healthy cases and the remaining 58 are TB cases. The third dataset was obtained from Chauhan et al. [7] and contains chest X-ray scans taken at the National Institute of TB and Respiratory Diseases, New Delhi, India; it has a total of 176 chest X-ray images, of which 102 are healthy cases and the remaining 74 are TB cases.
The fourth dataset used in our experiments is the TBX dataset [8]. The TBX dataset was originally proposed to provide a more comprehensive and larger TB imaging database, without which training deep CNNs is potentially infeasible. Liu et al. [8] present this dataset, which integrates TB data from multiple hospital institutions, comprising 11,200 images at a resolution of 512 × 512, higher than prior available datasets. The TBX dataset has a total of 10,000 TB-negative cases, 5,000 of them asymptomatic and 5,000 presenting some non-TB symptoms. Furthermore, the dataset contains 1,200 TB-positive cases: 924 active, 212 latent, 54 indeterminate-latency, and 10 uncertain TB cases. For the purposes of binary classification, this dataset is compiled into 10,000 TB-negative and 1,200 TB-positive cases.
Throughout this paper, these four datasets are referred to as the China Dataset, USA Dataset, India Dataset, and TBX Dataset. The models listed in this paper use an 80:20 train-test split for both the source and target domain datasets; only 80% of each dataset is used for training, while the remaining 20% is not seen by the model until the final testing stage. Sample images from the four datasets, for both the TB-positive and TB-negative classes, are shown in Figure 3.
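The dataset sizes above and the 80:20 partition can be summarized as follows (the class counts are taken from the text; the exact rounding of the 80% split is our assumption):

```python
# Class counts reported for each dataset: (TB-negative, TB-positive).
datasets = {
    "China": (326, 336),
    "USA":   (80, 58),
    "India": (102, 74),
    "TBX":   (10000, 1200),
}

# (total, train, test) sizes under an 80:20 split.
splits = {}
for name, (neg, pos) in datasets.items():
    total = neg + pos
    n_train = int(0.8 * total)     # rounding convention is our assumption
    splits[name] = (total, n_train, total - n_train)
```

For example, the TBX Dataset yields 8,960 training and 2,240 held-out test images under this split.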

B. MODEL DEVELOPMENT
To establish baselines for comparison, two non-DIFL models were first developed. Both models were evaluated by using each of the four TB datasets as the source domain and the remaining three as target domains, giving a total of 12 (source, target) pairs.
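The evaluation protocol enumerates every ordered pairing of the four datasets; a one-line sketch:

```python
from itertools import permutations

datasets = ["China", "USA", "India", "TBX"]

# Every ordered (source, target) combination with source != target:
# 4 choices of source x 3 remaining targets = 12 pairs.
pairs = list(permutations(datasets, 2))
```

Each pair then corresponds to one train/cross-test experiment for the baseline and DIFL models.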
The first non-DIFL model was developed using a simple convolutional neural network architecture; the second used the much more complex ResNet-50 architecture. The performance of these models on the aforementioned task is detailed in the Results section. The DIFL model was then implemented, as described in the Methodology section, and evaluated in the same fashion as the non-DIFL models. Manual hyperparameter tuning was performed to determine the classification learning rate and the domain invariance learning rate for the DIFL models.

C. CLASS IMBALANCE AND WEIGHTED LOSS
While training the DIFL model with TBX [8] in a (source, target) pair, the most immediate issue we encountered was the difference in scale between the two datasets, i.e., the number of training images in each. Since TBX is a much larger dataset, pairing it with any of the other datasets (China, India, US) creates an imbalance in the training loop: while only a fraction of the TBX dataset is looped over, the other domain's dataset, owing to its smaller sample size, yields far fewer mini-batches per epoch. To circumvent this, additional data augmentation (scaling and rotation) was applied to the smaller datasets in order to balance the number of images per epoch with the TBX dataset.
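One way to realize this balancing, assuming the scale/rotation augmentation itself is applied elsewhere in the input pipeline, is to oversample indices of the smaller dataset so that both domains contribute the same number of samples per epoch. The following sketch is hypothetical (the paper does not give its exact balancing code); repeated indices correspond to images that would receive extra augmentation:

```python
import random

def balanced_epoch_indices(small_n, large_n, seed=0):
    """Repeat (and randomly top up) indices of the smaller dataset so
    that it contributes large_n samples per epoch, matching the larger
    domain's mini-batch count."""
    rng = random.Random(seed)
    reps, rem = divmod(large_n, small_n)
    idx = list(range(small_n)) * reps + rng.sample(range(small_n), rem)
    rng.shuffle(idx)
    return idx
```

With this scheme, every image in the smaller dataset appears at least `large_n // small_n` times per epoch, so neither domain runs out of mini-batches before the other.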
Another issue in training two differently scaled datasets at the same time was class imbalance. Across all datasets, but especially the TBX dataset, the weights on the final-layer logits had to be balanced so that the DIFL model is not biased towards the more prevalent (TB-negative) class. We used a Weighted Cross-Entropy Loss function [38] to counter this class imbalance, as delineated in the following equation.
We used a value of β = 0.35 in our experiments. Weighted cross-entropy helps by re-weighting the loss contributions of the more abundant examples from the large-scale domains, so that the optimizer does not become heavily biased towards those domain features. Furthermore, having this module as part of DIFL allows us to pair data domains of widely different scales and sample sizes, for example the TBX and China datasets.
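The exact form of the weighted loss is cited from [38] and not reproduced above, so the following is a sketch of a common weighted binary cross-entropy, with β scaling the positive-class term and (1 − β) the negative-class term; treat the precise weighting scheme as an assumption:

```python
import math

def weighted_bce(y_true, p_pred, beta=0.35, eps=1e-7):
    """Weighted binary cross-entropy over a mini-batch.

    beta scales the loss on positive (TB-positive) examples and
    (1 - beta) the loss on negative examples; this is one standard
    form, not necessarily the exact equation of [38]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
        total += -(beta * y * math.log(p)
                   + (1.0 - beta) * (1.0 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

Setting β = 0.5 recovers (half of) the ordinary unweighted cross-entropy; moving β away from 0.5 shifts the optimizer's attention between the two classes.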

D. EVALUATION METRICS
Several measures were used to evaluate the performance of the DIFL model and determine its effectiveness at the task of domain adaptation. These measures are detailed in the following discussion.

1) ACCURACY
Accuracy of label classification is one of the most important measures for evaluating the DIFL model, as it gives a direct indication of how well the model performs its primary task of TB prediction.
Calculating the accuracy of the DIFL model is straightforward. Input images and their corresponding classification labels (x, y) are sampled from a domain, i.e. either the source or the target domain. The input images are passed through the domain invariant feature generator network G to produce the domain invariant features G(x). These features are then passed through the label classifier network C to produce the predicted classification label C(G(x)), which can also be represented as ŷ. This predicted label is compared with the true classification label y.
The above process is repeated for all data in the domain, and a confusion matrix is built, consisting of the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The accuracy of the DIFL model for this particular domain can then be calculated using the following formula: Accuracy = (TP + TN) / (TP + TN + FP + FN). The resulting accuracy value is a direct indicator of how well the DIFL model classifies the input data. However, a single performance metric is not a sufficient basis for fully evaluating a model's performance, so additional metrics are used to supplement our analysis.

2) SENSITIVITY
Another metric which can be utilized to analyze the DIFL model is sensitivity. Evaluating the sensitivity is done in a similar fashion as with accuracy, by first building the confusion matrix. The sensitivity value is then calculated using the following formula: Sensitivity = TP / (TP + FN). The sensitivity value provides a good measure of how well the model correctly classifies the TB-positive instances. Essentially, it represents the proportion of TB-positive cases that were correctly predicted as TB-positive by the model, and hence this value is also known as the true positive rate. Apart from sensitivity, specificity is another metric that can be used to evaluate the DIFL model. Using the confusion matrix, the specificity value is calculated using the following formula: Specificity = TN / (TN + FP). The specificity value is analogous to the sensitivity value, but instead indicates how well the model correctly classifies the TB-negative instances, i.e. it represents the proportion of TB-negative cases that were correctly predicted as TB-negative by the model. Thus, this value is also known as the true negative rate.
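The three confusion-matrix metrics above can be computed together from the four counts; a minimal sketch:

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate) and specificity
    (true negative rate) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # proportion of TB-positive cases found
    specificity = tn / (tn + fp)   # proportion of TB-negative cases found
    return accuracy, sensitivity, specificity
```

For example, a domain with TP = 80, FP = 10, TN = 90, FN = 20 yields an accuracy of 0.85, sensitivity of 0.80 and specificity of 0.90.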

3) ROC -AUC MEASURE
The Receiver Operating Characteristic (ROC) curve is a graphical plot that visualizes the classification ability of a binary classifier model by varying its discrimination threshold and evaluating its performance on the data. This is done by plotting its True Positive Rate (sensitivity) against its False Positive Rate (1 − specificity) as the discrimination threshold is varied from 0 to 1. While the ROC curve enables one to visualize a model's performance, the graph alone does not, however, provide a concrete method to objectively compare the performance of different models.
To supplement the analysis of a model's ROC curve, an additional metric, known as the Area Under Curve (AUC) value, can be calculated and used to judge the performance of the binary classifier model. The AUC value is determined by taking the area under the ROC curve (i.e., by integrating the curve), and provides a simple but effective way of directly comparing the performances of different models.
As the True Positive Rate and False Positive Rate each range from a minimum of 0 to a maximum of 1, the maximum possible AUC value is 1 × 1 = 1, which is obtainable only by a model that performs perfect binary classification. A random binary classifier is expected to achieve an AUC value of 0.5, as it makes correct predictions only about half of the time. As the AUC value approaches 1, we can infer that the model is better able to classify the instances of the data, which is indicative of better model performance.
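The AUC can be computed numerically from sampled (FPR, TPR) points of the ROC curve using the trapezoidal rule; a sketch:

```python
def auc_trapezoid(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule.

    Expects matching lists of (FPR, TPR) points sorted by increasing
    FPR, spanning (0, 0) to (1, 1) as the threshold sweeps 1 -> 0."""
    area = 0.0
    for i in range(1, len(fpr)):
        area += (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
    return area
```

A perfect classifier's curve, passing through (0, 1), integrates to 1.0, while the diagonal of a random classifier integrates to 0.5, matching the bounds discussed above.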

VI. RESULTS
In this section, the obtained results from the aforementioned testing will be detailed and discussed.
In the first round of experiments, each of the regional China, India and US datasets was utilized as the source dataset, while one of the remaining datasets was used as the target dataset. There are six possible (source, target) combinations among these three datasets, and a total of 10 trials were conducted for each of these six combinations. The obtained results were then averaged, and are presented in Figure 4. A second round of experiments is also shown, adapting to (from) the geographically distributed TBX dataset from (to) each of the three regional datasets. There are six combinations of this experiment, as seen in Figure 5.
For the sake of simplifying labels, the X → Y nomenclature is used to define the source and target datasets, where X is the source dataset and Y is the target dataset, as seen in Figure 4.
In each trial per combination, three different types of models are trained and evaluated accordingly: the first model is a non-DIFL model, which is trained on the source dataset, and then tested on the target dataset. This model, as expected, does not achieve good performance measures, due to the presence of domain shift. As such, the performance scores achieved by this model are used as a baseline, and hence this model is termed as the lower baseline model.
The second model is a non-DIFL model, wherein the model is directly trained on the target dataset, and consequently tested on the target dataset. As it is being directly trained on the data upon which it is also tested, this model is expected to perform well. The performance scores from this model provide an ''upper bound'' to which the results from the experimental DIFL model can be compared against, and thus this model is termed as the upper baseline model.
The final model is the DIFL model, which is trained on the source dataset, and tested on the target dataset utilizing the DIFL algorithm. The performance of this model can be evaluated by comparing it against the previously mentioned lower and upper baseline models.
It is also important to mention that the upper baseline non-DIFL, lower baseline non-DIFL and DIFL models all use ResNet-50 as the backbone architecture. Using the same architecture among these three models allows a justified comparison of their performance, because any resulting difference in performance is then engendered by the training procedure, the dataset selection mechanism and the incorporation of the DIFL method in the pipeline, and not by the CNN architecture itself.

A. ACCURACY SCORES
The accuracy scores of all three of these models, for each of the twelve possible combinations amongst the regional and large datasets, are detailed in Tables 1-12. It is observed that in all twelve combinations, the non-DIFL lower baseline model only achieves accuracy scores of around 0.5, signifying that these models perform no better than random guessing. The upper baseline model, as expected, achieves high accuracy scores in the region of 0.8-0.9.
Looking at the DIFL models, we observe that they achieve significantly higher accuracy scores than their respective lower baseline models, by approximately 0.2. While the DIFL model does not reach the accuracy of the upper baseline models, it comes close in one combination, namely the US → China case, in which it achieves an accuracy score of 0.80. Hence, the DIFL model greatly outperforms the lower baseline in all cases, and in some cases achieves accuracy close to that of the upper baseline model. As observed in Figure 6, when the source domain is TBX, the performance of DIFL is much closer to the upper baseline than in the other (source, target) combinations. This is a consequence of TBX being a much larger dataset, containing cases from a variety of hospitals across different regions, and therefore being representative of a broader spectrum of disease presentation, population samples and equipment settings. Figures 6, 7, and 8 compare the accuracy, false positive, and false negative rates of the lower baseline (red), DIFL (green) and upper baseline (blue) models, respectively. It can be observed that in all cases, DIFL improves accuracy while simultaneously reducing the false positive and false negative rates. As seen in Figure 8, the reduction in the false negative rate is particularly pronounced, falling from 30% to less than 12% in all cases. For the TBX-regional dataset pairs, the false negative rate falls below 5% in all three cases. For 11 of the 12 source/target combinations, the false negative rate of the DIFL model is within 2% of that of the fully supervised model, the exception being the India → TBX case, which is particularly challenging given the size and diversity of the TBX target domain relative to the source domain.
As seen in Figure 7, the reduction in the false positive rate is also very large, although somewhat less dramatic than the reduction in the false negative rate. This result is reasonable, especially considering that for screening purposes, in the event of a less accurate diagnostic test, it is usually preferable to achieve a low false negative rate even at the expense of a somewhat higher false positive rate. We observe that in all cases, the DIFL method achieves a false positive rate roughly 8-10% higher than that of the fully supervised technique. In the case of the TBX-regional datasets, the false positive rate improves from 30-40% (lower baseline) to less than 20% (DIFL), as compared to 6-9% (upper baseline). In all cases, we observe large improvements in the false positive rate, although less dramatic than those observed in the false negative rate.

C. PRESENCE OF DOMAIN SHIFT
Upon analysis of the results obtained in the previous sections, the presence of domain shift is confirmed, and its role in causing conventional models to fail when tested on datasets other than the one they were trained on is evident. The lower baseline model, trained on dataset X and tested on dataset Y, was unable to perform as well as the upper baseline model, trained on dataset Y and tested on dataset Y, even though the classification task is identical and the image data from datasets X and Y are largely similar to the human eye. Since the non-DIFL model architectures used in the lower and upper baseline models are identical, this provides concrete evidence that the drop in performance when testing on other datasets does not arise from potential problems in the other datasets, but rather from the presence of domain shift.
It is also shown that the domain shift present between the source and target datasets can be mitigated by utilizing the proposed DIFL approach. By first producing generalized features of the data from both the source and target domains, before using those features for the classification task at hand, the DIFL algorithm enables the model to perform significantly better on the target dataset than the lower baseline model, which indicates that the DIFL model generalizes better across these domains. The US → China combination saw the highest improvement in performance over the lower baseline model, and the closest performance to the upper baseline model of the regional-regional dataset combinations.
A sample of images was qualitatively inspected by a board certified chest radiologist. Upon further analysis of the chest X-ray imagery from both datasets, this particular observation can be attributed to differences in disease presentation between the datasets.
As TB is a progressive disease, its effects on the lungs can be classified into two broad categories: more severe TB (significant damage to the lungs) and less severe TB (minimal damage to the lungs). Images classified as TB-positive in the datasets could belong to either category. It is important to note that the disease presentation of TB in these two categories may differ, and may not produce the same indications on chest X-rays. The categories are not exclusive, however; in particular, features of less severe TB cases are likely to be present in more severe TB cases, but the converse is not true.
The US dataset contains chest X-ray imagery with less severe manifestations of TB, while the Chinese dataset comprises mostly chest X-ray imagery with more severe manifestations of TB. As such, when the US dataset is used as the source (generalizing using features from the US dataset), the DIFL model is able to perform relatively well on the Chinese dataset, as features from less severe manifestations of TB generalize well to detecting more severe manifestations. However, the reverse is not true: features from more severe manifestations of TB do not generalize well to detecting less severe manifestations. This is confirmed by the observations from using the Chinese dataset as the source and the US dataset as the target; this combination does not perform as well as the inverse combination.
As such, it is crucial to consider the nature of the datasets, particularly in the context of screening of TB from chest X-ray images, when employing the DIFL approach. Due to the complex nature of TB, factors such as disease presentation may affect the effectiveness of the DIFL approach, and it is important to evaluate these factors before deciding on the source and target datasets.

VII. CONCLUSION
We present an analysis of the cross-domain performance of ResNet-50 for tuberculosis classification from four geographically distributed data sources, as well as a novel DIFL method for unsupervised domain adaptation (UDA) that mitigates the effects of domain shift on out-of-domain performance.
We demonstrate that the baseline non-DIFL ResNet-50 models are unable to generalize to international TB datasets on which they were not trained, even though the actual classification task (TB screening) is identical and the X-ray images look very similar to a human radiologist. This is largely due to the presence of ''domain shift'', which causes a non-DIFL model trained on a source dataset to underperform when tested on other target datasets. The DIFL approach mitigates this problem by using unlabeled data from the target dataset to produce more generalized features from the source and target datasets, upon which the classification task is then conducted. As such, the DIFL approach enables classification on the target dataset even when the target data is unlabeled.
Additionally, while the DIFL model performs significantly better than the non-DIFL models, in most cases it does not perform as well as the model directly trained on the target dataset. However, there are circumstances in which the DIFL model performs comparably to a model directly trained on the target dataset. We believe this variation is largely dependent on the disease presentation of the datasets, although future work is necessary to confirm this hypothesis.
In the context of medical imaging and TB, the chest X-ray images in the Chinese dataset present a more severe appearance of TB, in the opinion of a board certified radiologist, than those in the US dataset, which present a less severe appearance. Using the US dataset as the source and the Chinese dataset as the target, the DIFL model achieved performance comparable to that of the model directly trained on the Chinese dataset.

VIII. FUTURE WORK
The field of DIFL, and domain adaptation in general, has great potential for many purposes, in particular ensuring that algorithms exhibit tailored performance to a given hospital institution. However, further algorithmic refinement is necessary to demonstrate the effectiveness of this method on a consistent basis, and retraining algorithms to a particular hospital demographic brings up additional regulatory challenges that are continuing to evolve. This section discusses a few of the possible extensions through which one could continue the work that has been done in this paper.
The main facet of this experiment that could be improved to achieve better results is hyperparameter tuning, particularly of the classification step learning rate and the domain invariance step learning rate. These learning rates were decided through manual tuning, which is a time-intensive process. We anticipate that it is possible to improve the loss function of the DIFL algorithm to directly maximize the expected target accuracy, rather than optimizing two separate objectives (source accuracy and generalizability). In this way, it may be possible to eliminate this hyperparameter tuning step.
Apart from utilizing DIFL algorithms for classification of TB from chest X-rays, one could also implement similar DIFL algorithms for classification of other diseases from other types of medical imaging, e.g. mammograms. The effectiveness of DIFL algorithms on other disease presentations is largely unexplored, and could be an area of interest for anyone seeking to expand upon the subject of this paper.
Additionally, another aspect which could be tested would be evaluating the effectiveness of the DIFL algorithm when multiple source datasets or multiple target datasets are involved. While it is shown in this paper that the DIFL algorithm, in the context of TB screening from chest X-rays, can generalize and mitigate the domain shift between a single source dataset and a single target dataset, its effectiveness in dealing with multiple source or target datasets is unknown. The DIFL algorithm could be modified slightly to make this change possible, and experiments could be conducted to assess how using multiple source or target datasets affects the end result.

ALAN SCHWEITZER received the B.S. and M.E.E. degrees from Cornell University. He is currently working as the Technical Director of informatics with RAD-AID International. He provides technical direction for RAD-AID Informatics Group, including systems architecture, vendor technical liaison, and management of radiology I.T. initiatives at RAD-AID partner sites. Previously, he worked as a CTO with the Radiology Consulting Group (RCG), Massachusetts General Hospital. He provided consultation services for all areas of radiology information technology, including planning, procurement, vendor selection, and implementation of PACS; radiology information systems; and enterprise medical imaging solutions. He has assisted more than 50 clients worldwide with a variety of imaging-related information technology initiatives. Prior to that, he was the CTO with the Massachusetts General Imaging Business Development Group. He developed and implemented I.T. infrastructure for MGH imaging's teleradiology service. He has presented at the AHRA, SIIMS, and HIMSS conferences, as well as publishing several articles in the areas of speech recognition, storage strategies, PACS vendor evaluations, and imaging in the developing world.
AMEENA ELAHI received the B.S. and R.T. degrees in health administration and medical imaging from Drexel University, in 2003, and the M.P.A. degree in public administration from Keller, in 2011. She is a Senior Technical Analyst with the Medical Imaging Management (MIM) Team at the Penn Medicine's Information Services Department. As the Co-Chair of the Clinical Imaging Artificial Intelligence (AI) Program, her duties include the product management and implementation of AI technology for research and evaluation. She has worked with RAD-AID International, Nigeria, during a friendship PACS installation, in 2019, and has since joined the RAD-AID Leadership Team as the Informatics Operations Director to pursue her passion for global outreach and improving healthcare for all. She was also elected to the Society of Imaging Informatics in Medicine (SIIM), in 2021, for a three-year term. Previously, she held roles as a Diagnostic and Interventional Radiologic Technologist with the Hospital of the University of Pennsylvania.
FAROUK DAKO received the Master of Public Health degree from the Johns Hopkins Bloomberg School of Public Health. He attended medical school at St. George's University. He completed a surgical internship at the Mayo Clinic and Diagnostic Radiology Residency at Temple University. He then completed fellowships in cardiothoracic radiology and imaging informatics at the University of Maryland. He is an Assistant Professor of radiology with the Cardiothoracic Imaging Division; a Scholar with the Center for Global Health; and a Senior Fellow of the Leonard Davis Institute of Health Economics, Perelman School of Medicine, University of Pennsylvania. He is also the Director of RAD-AID International Nigeria Program and has led this program for over five years. His research interests include population and global health, specifically the role radiology can play in advancing health equity. He is interested in the utilization of big data to improve health outcomes of traditionally underserved populations.
DANIEL MOLLURA received the B.A. degree in government/international relations from Cornell University, in 1994, and the medical degree (M.D.) from the Johns Hopkins University School of Medicine, in 2003. He completed his internal medicine training, in 2004; the diagnostic radiology residency program, in 2008; and the nuclear medicine/molecular imaging fellowship at the Johns Hopkins Hospital, in 2009. Based on his background as a Goldman Sachs Financial Analyst, from 1994 to 1996; premedical/scientific research training at Columbia University, from 1996 to 1999; and prior founding of three other successful start-ups in the media, technology, and public sectors, he founded RAD-AID International, a nonprofit 501c3 organization for increasing health and radiology medical care to underserved and resource-poor communities, in 2008, to become a global nonprofit organization with nearly 15,000 members serving over 85 hospitals in 38 countries. He served for ten years on the Radiology Clinical and Research Faculty of the National Institutes of Health (NIH), Bethesda, MD, USA, from 2009 to 2019. At NIH, in addition to providing clinical patient-care in radiology and nuclear medicine, he led the Artificial Intelligence (AI) Laboratory, NIH Clinical Center, to develop and study AI for quantitative molecular imaging applications. In 2020, he became the full-time President/CEO of RAD-AID International and continues to build the charitable global efforts of the organization, including the integral contribution of AI, IT, clinical radiology, education, training, infrastructure, and equipment to low-resource health facilities in medically underserved communities and low-middle income countries. He is the Founder, the President, and the CEO of RAD-AID International.
DAVID CHAPMAN received the B.S. degree in computer science and the M.S. and Ph.D. degrees from the University of Maryland, Baltimore County (UMBC), in 2006, 2008, and 2012, respectively. From 2012 to 2014, he was a Postdoctoral Fellow with the Lamont Doherty Earth Observatory (LDEO), Columbia University. From 2014 to 2018, he was a Design Engineer at Oceaneering International Inc. Currently, he is an Assistant Professor with the Department of Computer Science, University of Maryland, Baltimore County, and the Head of the Vision and Image Processing Algorithms Research Group (VIPAR), which focuses on computer vision, machine learning, and medical imaging informatics. The VIPAR is working in collaboration with the University of Maryland School of Medicine; the University of California at San Francisco; the Mercy Medical Center, Baltimore; and RAD-AID Informatics on problems related to AI in radiological imaging. His research interests include challenges in the areas of semi-supervised and unsupervised learning as well as improved AI evaluation methodologies toward clinical translation and improved real-world performance.