Secure and Robust Machine Learning for Healthcare: A Survey

Recent years have witnessed widespread adoption of machine learning (ML)/deep learning (DL) techniques due to their superior performance for a variety of healthcare applications ranging from the prediction of cardiac arrest from one-dimensional heart signals to computer-aided diagnosis (CADx) using multi-dimensional medical images. Notwithstanding the impressive performance of ML/DL, there are still lingering doubts regarding the robustness of ML/DL in healthcare settings (which are traditionally considered quite challenging due to the myriad security and privacy issues involved), especially in light of recent results showing that ML/DL models are vulnerable to adversarial attacks. In this paper, we present an overview of various application areas in healthcare that leverage such techniques from a security and privacy point of view and present the associated challenges. In addition, we present potential methods to ensure secure and privacy-preserving ML for healthcare applications. Finally, we provide insight into the current research challenges and promising directions for future research.


I. INTRODUCTION
We are living in the age of algorithms, in which machine learning (ML)/deep learning (DL) systems have transformed multiple industries such as manufacturing, transportation, and governance. Over the past few years, DL has provided state-of-the-art performance in different domains, e.g., computer vision, text analytics, and speech processing. Due to the extensive deployment of ML/DL algorithms in various domains (e.g., social media), such technology has become inseparable from our routine life. ML/DL algorithms are now beginning to influence healthcare as well, a field that has traditionally been impervious to large-scale technological disruptions [1]. ML/DL techniques have recently shown outstanding results in diverse tasks such as recognition of body organs from medical images [2], classification of interstitial lung diseases [3], detection of lung nodules [4], medical image reconstruction [5], [6], and brain tumor segmentation [7], to name a few.
It is widely expected that intelligent software will assist radiologists and physicians in examining patients in the near future [8] and that ML will revolutionize medical research and practice [9]. Clinical medicine has emerged as an exciting application area for ML/DL models, and these models have already achieved human-level performance in clinical pathology [10], radiology [11], ophthalmology [12], and dermatology [13]. Some of these studies have even reported that DL models outperform human physicians on average. The better performance of DL models in comparison with humans has led to the development of computer-aided diagnosis systems; for instance, the U.S. Food and Drug Administration has announced the approval of an intelligent diagnosis system for medical images that will not require any human intervention.
The potential of ML models for healthcare applications is also benefitting from the progress in concomitantly-advancing technologies like cloud/edge computing, mobile communication, and big data technology [14]. Together with these technologies, ML/DL is capable of producing highly accurate predictive outcomes and can facilitate the human-centered intelligent solutions [15]. Along with other benefits like enabling remote healthcare services for rural and low-income zones, these technologies can play a vital role in revitalizing the healthcare industry.
Notwithstanding the impressive performance of DL algorithms, many recent studies have raised concerns about the security and robustness of ML models; for instance, Szegedy et al. demonstrated for the first time that DL models are vulnerable to carefully crafted adversarial examples [16]. Similarly, various types of data and model poisoning attacks have been proposed against DL systems [17], and different defenses against such strategies have been proposed in the literature [18]. However, the robustness of these defense methods is also questionable, and several studies have shown that most defense techniques can be defeated by some particular attack. The discovery that DL models are neither secure nor robust significantly hinders their practical deployment in security-critical applications like predictive healthcare, which is essentially life-critical. For instance, researchers have already demonstrated the threat of adversarial attacks on ML-based medical systems [19], [20]. Therefore, ensuring the integrity and security of DL models and health data is paramount to the widespread adoption of ML/DL in the industry.
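To make the notion of a carefully crafted adversarial example concrete, the following minimal sketch perturbs the input of a toy logistic-regression classifier with a one-step sign-gradient perturbation (in the style of the fast gradient sign method). This is an illustrative stand-in, not the attack of [16], [19], or [20]; the weights, bias, input, and the budget eps are all hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm_perturb(w, b, x, y, eps):
    """Move each feature by at most eps in the direction that increases the loss."""
    p = predict_proba(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input is (p - y) * w.
    grad = [(p - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (assumptions, not taken from the survey).
w = [2.0, -1.5, 0.5]
b = -0.2
x = [0.4, 0.1, 0.3]
y = 1  # true label of x

x_adv = fgsm_perturb(w, b, x, y, eps=0.3)
p_clean = predict_proba(w, b, x)
p_adv = predict_proba(w, b, x_adv)
print(f"clean p(y=1) = {p_clean:.3f}, adversarial p(y=1) = {p_adv:.3f}")
```

In this toy setting a bounded change to each feature is enough to move the prediction across the 0.5 decision threshold, which is the kind of failure that motivates the robustness concerns above.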
In this paper, we present a comprehensive survey of existing literature on the security and robustness of ML/DL models with a specific focus on their applications in healthcare systems. We also highlight various challenges and sources of vulnerabilities that hinder the robust application of ML/DL models in healthcare applications. In addition, potential solutions to address these challenges are presented in this paper. In summary, the following are the specific contributions of this paper.
1) We present an overview of different applications of ML/DL models in healthcare.
2) We formulate the ML pipeline for predictive healthcare and identify various sources of vulnerabilities at each stage.
3) We highlight various conventional security and privacy-related challenges as well as ones that arise with the adoption of ML/DL models.
4) We present potential solutions for the robust application of ML/DL techniques for healthcare applications.
5) Finally, we highlight various open research issues that require further investigation.
Organization of the Paper: The rest of the paper is organized as follows. In Section II, various applications of ML and DL techniques in healthcare are discussed. Section III presents the ML pipeline in data-driven healthcare and various sources of vulnerabilities along with different challenges associated with the use of ML. Different potential solutions to ensure secure and privacy-preserving ML are discussed in Section IV and various open research issues are outlined in Section V. Finally, we conclude the paper in Section VI.

II. ML FOR HEALTHCARE: APPLICATIONS
In this section, various prominent applications of ML in healthcare are discussed and we start by providing the big picture of ML in the context of healthcare.

A. ML in Healthcare: The Big Picture
The major phases for developing an ML-based healthcare system are illustrated in Figure 1, and the major types of ML/DL that can be used in healthcare applications are briefly described next.
1) Unsupervised Learning: ML techniques that utilize unlabelled data are known as unsupervised learning methods. Widely used examples of unsupervised learning methods are clustering of data points using a similarity metric and dimensionality reduction, which projects high-dimensional data to lower-dimensional subspaces (sometimes loosely referred to as feature selection). In addition, unsupervised learning can be used for anomaly detection, e.g., using clustering [21]. Classical examples of unsupervised learning methods in healthcare include the prediction of heart diseases using clustering [22] and the prediction of hepatitis disease using principal component analysis (PCA), which is a dimensionality reduction technique [23].
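As a minimal sketch of clustering data points with a similarity metric, the following k-means example groups unlabelled samples using squared Euclidean distance. The choice of k-means, the distance metric, the deterministic initialization, and the toy (resting heart rate, systolic blood pressure) values are all assumptions made for illustration; they are not taken from [21] or [22].

```python
def kmeans(points, k, iters=20):
    """Plain k-means with squared Euclidean distance and a deterministic start."""
    # Assumption: use the first k points as initial centroids for reproducibility.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical unlabelled samples: (resting heart rate, systolic blood pressure).
points = [(62, 115), (95, 150), (60, 118), (98, 155), (64, 112), (92, 148)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

No labels are used at any point; the two groups emerge only from the distances between samples. In practice the features would usually be rescaled first so that no single measurement dominates the distance.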
2) Supervised Learning: Methods that learn a mapping between inputs and outputs from labeled training data are characterized as supervised learning methods. If the output is discrete, the task is called classification; if the output is continuous, the task is called regression. Classical examples of supervised learning methods in healthcare include the classification of different types of lung diseases (nodules) [4] and the recognition of different body organs from medical images [2]. Sometimes the training data contains both labeled and unlabelled samples, so the learning problem is neither purely supervised nor purely unsupervised. Methods utilizing such data are known as semi-supervised learning methods. A systematic review of supervised and unsupervised learning techniques can be found in [24].
3) Semi-supervised Learning: Semi-supervised learning methods are useful when both labelled and unlabelled samples are available for training, typically a small amount of labelled data and a large amount of unlabelled data. Semi-supervised learning techniques can be particularly useful for a variety of healthcare applications, as acquiring a sufficient amount of labelled data for model training is difficult in healthcare. Different facets of semi-supervised learning using different learning techniques have been proposed in the literature. For instance, a semi-supervised clustering method for healthcare data is presented in [25], and a semi-supervised ML approach for activity recognition using sensor data is presented in [26]. In [27], [28], the authors applied a semi-supervised learning method to medical image segmentation.
4) Reinforcement Learning: Methods that learn a policy function given a set of observations, actions, and rewards in response to actions performed over time fall in the class of reinforcement learning (RL) [29]. RL has great potential to transform many healthcare applications, and recently it has been used for context-aware symptom checking for disease diagnosis [30]. Furthermore, the potential of using RL for healthcare applications can be seen through the recent example of the Go game, where a computer using RL with the integration of supervised and unsupervised learning methods defeated a human champion player [31].
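The idea of learning a policy from observations, actions, and rewards can be sketched with tabular Q-learning on a toy problem. The four-state chain environment, the reward of 1.0 for reaching the final state, and all hyperparameters below are hypothetical and chosen only for illustration; this is not the setup used in [30] or [31].

```python
import random

N_STATES = 4      # states 0..3; state 3 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Deterministic toy environment: reward 1.0 only when the last state is reached."""
    next_state = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

def choose_action(Q, state, eps, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < eps or Q[state][0] == Q[state][1]:
        return rng.choice(ACTIONS)
    return 0 if Q[state][0] > Q[state][1] else 1

def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap the episode length
            action = choose_action(Q, state, eps, rng)
            next_state, reward, done = step(state, action)
            # Q-learning update: bootstrap from the best action in the next state.
            target = reward if done else reward + gamma * max(Q[next_state])
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q

Q = q_learning()
# Greedy policy for the non-terminal states.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The learned greedy policy moves right in every non-terminal state, i.e., it takes the shortest path to the rewarded state. Clinical applications replace this toy chain with patient states, treatment actions, and outcome-based rewards.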

B. Applications of ML in Healthcare
Healthcare service providers generate a large amount of heterogeneous data and information daily, making it difficult for "traditional methods" to analyze and process it. ML/DL methods help to effectively analyze this data for actionable insights. In addition, there are heterogeneous sources of data that can augment healthcare data, such as genomics, medical data, data from social media, and environmental data. A depiction of these sources of data is shown in Figure 2. The four major application areas in healthcare that can benefit from ML/DL techniques are prognosis, diagnosis, treatment, and clinical workflow, which are described next.

1) Applications of ML in Prognosis:
Prognosis is the process of predicting the expected development of a disease in clinical practice. It includes identifying the symptoms and signs related to a specific disease and whether they will worsen, improve, or remain stable over time, as well as identifying potential associated health problems, complications, the ability to perform routine activities, and the likelihood of survival. In clinical settings, multi-modal patient data is collected (e.g., phenotypic, genomic, and proteomic data, pathology test results, and medical images), which can empower ML models to facilitate disease prognosis, diagnosis, and treatment [32]. For instance, ML models have been widely developed for the identification and classification of different types of cancers, e.g., brain tumors [33] and lung nodules [34]. The potential applications of ML for disease prognosis, i.e., prediction of disease symptoms, risks, survivability, and recurrence, have been explored in recent translational research efforts that aim to enable personalized medicine. However, the field of personalized medicine is still nascent and requires extensive development of adjacent fields such as bioinformatics, strong validation strategies, and demonstrably robust applications of ML to achieve large translational impact.
2) Applications of ML in Diagnosis:
a) Electronic Health Records (EHRs): Hospitals and other healthcare service providers produce a large collection of electronic health records (EHRs) on a daily basis. These records comprise structured and unstructured data that contain the complete medication history of patients. ML-based methods have been utilized for the extraction of clinical features to facilitate the diagnosis process [35]. For example, a semi-supervised approach for the extraction of diagnosis information from unstructured EHRs is presented in [36]. The use of ML for the diagnosis of diabetes from EHRs is presented in [37]. In [38], the robustness of features derived from EHR data is examined with respect to the year of care of each record for two tasks, i.e., mortality prediction and length-of-stay prediction; the authors showed that prediction performance degrades when ML models are trained on historical data and tested on unseen (future) data.
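The evaluation setting described for [38], training on historical records and testing on later ones, can be sketched as a chronological split. The record fields (patient_id, year, length_of_stay), the values, and the cutoff year are hypothetical; the actual protocol of [38] may differ.

```python
def temporal_split(records, cutoff_year):
    """Train on records before cutoff_year; test on records from cutoff_year onward."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test

# Hypothetical EHR-like records (illustrative values only).
records = [
    {"patient_id": "p1", "year": 2008, "length_of_stay": 3},
    {"patient_id": "p2", "year": 2009, "length_of_stay": 7},
    {"patient_id": "p3", "year": 2010, "length_of_stay": 2},
    {"patient_id": "p4", "year": 2011, "length_of_stay": 5},
    {"patient_id": "p5", "year": 2012, "length_of_stay": 9},
]

train, test = temporal_split(records, cutoff_year=2011)
print(len(train), "training records;", len(test), "test records")
```

Unlike a random split, this keeps every test record strictly later than every training record, so the measured performance reflects how a model trained on past data behaves on future data.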
b) ML in Medical Image Analysis: In medical image analysis, ML techniques are used for efficient and effective extraction of information from medical images that are acquired using different imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT), ultrasound, and positron emission tomography (PET). These modalities provide important functional and anatomical information about different body organs and play a crucial role in the detection/localization and diagnosis of abnormalities. A taxonomy of key medical imaging modalities is presented in Figure 3. The key purpose of medical image analysis is to assist clinicians and radiologists in the efficient diagnosis and prognosis of diseases. The prominent tasks in medical image analysis include detection, classification, segmentation, retrieval, reconstruction, and image registration, which are discussed next. Moreover, fully automated intelligent medical image diagnosis systems are expected to be part of next-generation healthcare systems.
• Enhancement: Enhancement of degraded medical images is an important pre-processing step that directly affects the diagnosis process. Many sources of noise and disturbance are encountered in the medical image acquisition process, and they degrade the quality and significance of the resultant images. For instance, generating MRI images is a lengthy process that typically requires several minutes to produce a good-quality image, and to acquire detailed soft-tissue contrast, patients have to remain as still and straight as possible. Because movements can cause false artifacts in image acquisition, the complete process often has to be repeated multiple times to produce useful images. Also, depending on the body area being scanned and the number of images to be taken, patients might be asked to hold their breath during short scans [40]. Any residual movement of the subject can still introduce artifacts in the acquired image. Moreover, mechanical noise is sometimes introduced in the output image. In the literature, different DL models are used for denoising medical images, such as convolutional denoising autoencoders [41] and GANs. In addition, GANs have been successfully used for cleaning motion artifacts introduced in multi-shot MRI images [14]. Super-resolution is yet another powerful and impactful enhancement technique for medical images, e.g., MRI denoising [42].
• Detection: The process of identifying specific disease patterns or abnormalities (e.g., tumor, cancer) in medical images is known as detection. In traditional clinical practice, such abnormalities are identified by expert radiologists or physicians, which often requires a lot of time and effort. DL-based methods, in contrast, have shown their potential for this task, and various studies have been presented in the literature for the detection of diseases.
For instance, a locality-sensitive approach utilizing a CNN for the detection and classification of nuclei in colon cancer histopathological images is presented in [43]. A hybrid method utilizing handcrafted features and a CNN model for the detection of mitosis in breast cancer images is presented in [44].
• Classification: DL models, in particular convolutional neural networks (CNNs), have proven to give high performance in medical image classification tasks when compared with other state-of-the-art non-learning-based techniques. Modality classification and the recognition of different body organs and abnormalities from medical images using CNNs have been extensively studied in the literature. In [2], an approach using a CNN for multi-instance recognition of different body organs is presented, and a CNN-based method for the classification of interstitial lung diseases (ILDs) is presented in [3]. In another study, a CNN model is trained for the classification of lung nodules [4]. Transfer learning approaches have also been used for medical image classification [45]. In transfer learning, a pre-trained DL model (typically trained on natural images) is fine-tuned on a comparatively small dataset of medical images. The results obtained by this approach, as reported in the literature, are promising; however, a few studies have reported contradictory results. For instance, the results obtained by transfer learning in [46] and [47] are contradictory.
• Segmentation: The segmentation of tissues and organs in medical images enables quantitative analysis of abnormalities in terms of clinical parameters, e.g., automatically measuring the volume and shape of cancer in brain images. In addition, the extraction of such clinically significant features is an important first step in computer-aided detection and diagnosis systems, which we discuss later in this section.
The process of segmentation deals with the partitioning of an image into multiple non-overlapping parts using a pre-defined criterion such as intrinsic color, texture, or contrast. Addressing the problem of segmentation using various DL models (e.g., CNNs and recurrent neural networks (RNNs) [48]) is widely studied in the literature, and the common architecture used for segmentation of medical images is U-net [49]. Various DL architectures have been proposed for the segmentation of multi-modal images (e.g., brain, skin cancer, and CT images) as well as for the segmentation of volumetric images [50]. An overview of various DL models for segmentation of medical images is presented in [51].
• Reconstruction: The process of generating interpretable images from raw data acquired from the imaging sensor is known as medical image reconstruction. The fundamental problem in medical image reconstruction is to accelerate the inherently slow data acquisition process, which is an ill-posed inverse problem in which we want to determine the system's input given its output. Many important medical imaging modalities, e.g., MRI and CT, require a lot of time for reconstructing an image from the raw data samples. Thus, in medical image reconstruction, we aim to reduce image acquisition time and storage space.
Research on medical image reconstruction using deep models is increasing rapidly, and various DL models such as CNNs [52] and autoencoders [6] have been extensively used for the reconstruction of MRI and CT images. Recently, generative adversarial networks (GANs) have been widely used for the reconstruction of medical images and have produced outstanding results. For instance, a GAN-based MRI reconstruction method that also cleans motion artifacts is presented in [53].
• Image Registration: Image registration is the process of mapping input images with respect to a reference image, and it is the first step in image fusion. Image registration has many potential applications in medical image analysis, as described in detail by El-Gamal et al. [54]; however, its use in actual clinical applications is very limited [55]. Image registration is usually applied in spinal surgery or neurosurgery for the localization of a spinal bony landmark or a tumor, respectively, to facilitate a surgical spinal screw implant or tumor removal. Various similarity metrics and reference points are calculated to align the sensed image with the reference image. In [56], a framework for deformable image registration named Quicksilver is proposed that uses the large deformation diffeomorphic metric mapping (LDDMM) model with a patch-wise prediction strategy. Similarly, an unsupervised learning-based method for deformable image registration has also been proposed. In [57], a CNN-based regression approach for 2D/3D image registration is presented that addresses two fundamental limitations of existing intensity-based image registration methods, i.e., small capture range and slow computation.
• Retrieval: The recent era has witnessed a revolution in digital interventions, from large-scale image and video collections to big data.
This trend holds for medical imaging as well: every hospital and clinic with radiology services produces thousands of medical images daily in diverse modalities, resulting in the growth of large-scale multi-modal medical image repositories that are difficult to manage and query. The problem is particularly challenging for multi-modal medical data. Traditional methods are not sufficient to facilitate the production and management of multi-modal medical data, and various ML/DL techniques have been proposed in the literature [58], [59]. In routine practice, clinicians usually compare the current case with previous ones, mainly to effectively plan the diagnosis and treatment of the patient being examined. In this regard, identifying the modality (i.e., modality classification discussed above) is of great significance, as it serves as an initial tool to facilitate the process of comparison: an efficient modality classification system reduces the search space by only looking for relevant images in the collections of the desired modality.
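The way modality classification narrows the retrieval search space can be sketched as a two-stage lookup: first keep only the stored images of the query's modality, then rank them by feature similarity. The database entries, feature vectors, identifiers, and the use of cosine similarity are hypothetical choices for illustration and are not the methods of [58] or [59].

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, query_modality, database, top_k=2):
    """Restrict the search to one modality, then rank by similarity."""
    candidates = [item for item in database if item["modality"] == query_modality]
    ranked = sorted(
        candidates,
        key=lambda item: cosine(query_vec, item["features"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]

# Hypothetical repository; in practice the features would come from a trained model.
database = [
    {"id": "mri_001", "modality": "MRI", "features": [0.9, 0.1]},
    {"id": "ct_001", "modality": "CT", "features": [1.0, 0.0]},
    {"id": "mri_002", "modality": "MRI", "features": [0.1, 0.9]},
    {"id": "mri_003", "modality": "MRI", "features": [0.7, 0.7]},
]

result = retrieve([1.0, 0.0], "MRI", database)
print(result)
```

Here the CT image is never compared against the query even though its feature vector matches exactly, because the modality filter removes it before ranking.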
3) Applications of ML in Treatment:
a) Image Interpretation: As discussed above, medical images are widely used in routine clinical practice, and the analysis and interpretation of these images are performed by expert physicians and radiologists. To narrate their findings regarding the images being studied, they write textual radiology reports about each body organ that was examined in the conducted study. However, writing such reports is very challenging in some scenarios, e.g., for less experienced radiologists and for healthcare service providers in rural areas where the quality of healthcare services is not up to the mark. On the other hand, for experienced radiologists and pathologists, the process of preparing high-quality reports can be tedious and time-consuming, which can be exacerbated by a large number of patients visiting daily. Therefore, various researchers have attempted to address this problem using natural language processing (NLP) and ML techniques. In [60], an NLP-based method is proposed for annotating clinical radiology reports. A multi-task ML-based framework is proposed for automatic tagging and description of medical images in [61]. In a similar study [62], an end-to-end architecture developed with the integration of a CNN and an RNN is presented for thorax disease classification and reporting in chest X-rays. In [63], a novel multi-modal model utilizing a CNN and a long short-term memory (LSTM) network is developed for automatic report generation.
b) ML in Real-time Health Monitoring: Real-time monitoring of critical patients is crucial and is a key component of the treatment process. Continuous health monitoring using wearable devices, IoT sensors, and smartphones is gaining interest among people. In a typical setting of continuous health monitoring, health data is collected using a wearable device and smartphone and then transmitted to the cloud for analysis using an ML/DL technique.
Then the outcomes are transmitted back to the device for appropriate action(s). For instance, a framework having a similar system architecture is presented in [64]. The system is developed by integrating mobile and cloud for monitoring of heart rate using PPG signals. Similarly, a review of different ML techniques for human activity recognition with application to remote monitoring of patients using wearable devices is presented in [65]. The sharing of health data with clouds for further analysis raises many privacy and security challenges that we discuss in the next section.
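The collect, transmit, analyze, and respond loop described above can be sketched as follows. The functions cloud_analyze and monitor are hypothetical names, the network transfer is replaced by a direct function call, and a simple threshold rule stands in for the ML/DL model; the thresholds are illustrative and are not clinical guidance or the system of [64].

```python
from statistics import mean

def cloud_analyze(window, low=50, high=120):
    """Stand-in for cloud-side analysis: a threshold rule instead of an ML/DL model."""
    avg = mean(window)
    if avg > high:
        return "alert_high"
    if avg < low:
        return "alert_low"
    return "ok"

def monitor(stream, window_size=3):
    """Device side: buffer samples, send each full window, and collect the outcomes."""
    actions = []
    window = []
    for sample in stream:
        window.append(sample)
        if len(window) == window_size:
            # In a real deployment this call would be a network request to the cloud.
            actions.append(cloud_analyze(window))
            window = []
    return actions

# Hypothetical heart-rate readings in beats per minute.
readings = [70, 72, 71, 130, 135, 128, 45, 44, 46]
actions = monitor(readings)
print(actions)
```

Every window leaves the device in this design, which is exactly the point at which the privacy and security concerns about sharing health data with the cloud arise.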

4) Applications of ML in Clinical Workflows: a) Disease Prediction and Diagnosis:
The early prediction and diagnosis of diseases from medical data is one of the exciting applications of ML. Various studies have highlighted the potential of using predictive healthcare for the timely treatment of diseases. For instance, cardiovascular risk prediction using different ML algorithms with clinical data is studied in [66], and the study concluded that ML techniques improved the prediction efficacy. A survey of various ML techniques for the detection and diagnosis of different diseases (such as diabetes, dengue, hepatitis, heart, and liver diseases) is presented in [67]. The potential of using ML-based methods for the prediction and prognosis of cancer is highlighted in [68].
b) ML in Computer-Aided Detection or Diagnosis: Computer-aided detection (CADe) or computer-aided diagnosis (CADx) systems are being developed mainly for the automatic interpretation of medical images to assist radiologists in their clinical practice. Such a system works by utilizing different functionalities including ML/DL, traditional computer vision, and image processing techniques, and it relies heavily on the performance of these techniques. IBM's Watson is a classical example of a CADx system developed by integrating various techniques including ML. However, any task in medical image and signal analysis automated by the application of ML/DL models can be deemed a CADe or CADx system, e.g., automated detection of fatty liver in ultrasound kurtosis imaging [69].
c) Clinical Reinforcement Learning: In reinforcement learning, the key objective is to learn a policy function for making precise decisions in an uncertain environment to maximise accumulated reward. In clinical medicine, RL can be used for providing optimal diagnosis and treatment for patients with distinct characteristics [70].
The performance evaluation of different RL techniques (i.e., Q-value iteration, tabular Q-learning, fitted Q-iteration (FQI), and deep Q-learning) for the treatment of sepsis in the ICU using a real-world medical dataset is presented in [71]. Sepsis is a severe infection involving organ dysfunction and is a leading cause of mortality, with treatment that is expensive and often suboptimal. The dataset contains trajectories of a patient's physiological state and the treatments provided by clinicians at each time, along with the outcome (i.e., survival or mortality). The study concluded that simple tabular Q-learning can learn effective policies for sepsis treatment and that its performance is comparable with a complex continuous state-space method, i.e., deep Q-learning.
d) ML for Clinical Time-Series Data: One of the tasks in clinical workflows is the modeling of clinical time-series data. Applications of clinical time-series modeling include prediction of clinical interventions in intensive care units (ICUs) using CNNs and LSTMs [72], mortality prediction in patients with traumatic brain injury (TBI) [73], and estimation of mean arterial blood pressure (ABP) and intracranial pressure (ICP), which are important indicators of cerebrovascular autoregulation (CA) in TBI patients. In a recent study, attention models are used for ICU forecasting tasks (such as diagnosis, estimation, and prediction) by integrating clinical notes with multivariate time-series measurement data [74]. In a similar study, the problem of unexpected respiratory decompensation is investigated using ML techniques [75].
e) Clinical Natural Language Processing: Clinical notes are widely used by clinicians to communicate patient state. The use of clinical text is crucial as it often contains the most important information.
Progress in clinical NLP techniques is envisioned to be incorporated in future clinical software for extracting relevant information from unstructured clinical notes to improve clinical practice and research [76]. Clinical NLP poses unique challenges such as the use of acronyms, language disparity, partial structure, and quality variance. The challenges and opportunities of clinical NLP for languages other than English, along with a review of clinical NLP techniques, are presented in [77]. In [78], the authors presented a toolkit named CLAMP that provides different state-of-the-art NLP techniques for clinical text analysis.
f) Clinical Speech and Audio Processing: In the clinical environment, clinicians have to do a lot of documentation, e.g., preparing clinical notes, discharge summaries, and radiology reports. According to Dr. Simon Wallace, clinicians spend 50% of their time on clinical documentation and are highly demotivated due to clinical workload, administrative tasks, and lack of leisure time [79]. Typically, they spend more time preparing clinical documentation than interacting directly with patients. To overcome such challenges, clinical speech and audio processing offers new opportunities such as speech interfaces for interaction-less services, automatic transcription of patient conversations, and synthesis of clinical notes. Using speech and audio processing tools in the clinical environment benefits each stakeholder, i.e., patients (speech is a new modality for determining patient state), clinicians (efficiency and time saving), and the healthcare industry (enhanced productivity and cost reduction). In the literature, speech processing has been used for the identification of disorders related to speech, e.g., vocal hyperfunction [80], as well as disorders that manifest through speech, e.g., dementia [81]. Alzheimer's disease identification using linguistic features is presented in [82]. Disfluency and utterance segmentation are two well-known challenges of clinical speech processing.

III. SECURE, PRIVATE, AND ROBUST ML FOR HEALTHCARE: CHALLENGES
In this section, we analyze the security and robustness of ML/DL models in healthcare settings and present various associated challenges.

A. Sources of Vulnerabilities in ML Pipeline
ML application in healthcare settings suffers from various privacy and security challenges that we will thoroughly discuss in this section. In addition, the three major phases of ML model development along with different potential sources of vulnerabilities causing such challenges in each step of the ML pipeline are depicted in Figure 4.
1) Vulnerabilities in Data Collection: Training of ML/DL models for clinical decision support requires the collection of a large amount of data (in formats such as EHRs, medical images, and radiology reports), which is often time-consuming and requires significant human effort. Although in practice medical data is carefully collected to ensure the effectiveness of the diagnosis, there can be many sources of vulnerabilities that affect the proper (expected) functionality of the underlying ML/DL systems; a few of them are described next.
Instrumental and Environmental Noise: The collected data often contains many artifacts that arise due to instrumental and environmental disturbances. Consider the example of multi-shot MRI, one of the widely used imaging modalities for acquiring high-resolution medical images. This modality is highly sensitive to motion, and even slight movement of the subject's head or respiration can cause undesirable artifacts in the resultant image [14], thereby increasing the risk of misdiagnosis [83].
Unqualified Personnel: Healthcare ecosystems are extremely interdisciplinary, comprising technical and non-technical personnel, and often lack qualified workers who can develop and maintain ML/DL systems. The efficient application of data-driven healthcare requires workers with strong statistical and computational backgrounds, e.g., engineers and data scientists. At the same time, the clinical usability of ML/DL-based systems is extremely important. Considering this aspect, hospitals tend to rely solely on physician-researchers who lack computational expertise to develop such systems [84].
2) Vulnerabilities Due to Data Annotation: Most applications of ML/DL in healthcare systems are supervised ML tasks, which require an abundance of labelled training data. The process of assigning a label to each data sample (e.g., a medical image) is known as data annotation. Ideally, this task should be performed by experienced clinicians (physicians or radiologists) to prepare domain-enriched datasets, which are crucial to the development of useful ML/DL models in healthcare systems. The literature has shown that training ML/DL models without a sound grasp of the domain can be disastrous [85]. However, clinicians such as expert radiologists are scarce and hard to engage in secondary tasks like data annotation. As a result, trainee staff (with little domain expertise) or automated ML/DL algorithms are usually employed for data labelling, which often leads to problems such as coarse-grained labels, class imbalance, label leakage, and misspecification. Some specific annotation-based vulnerabilities are discussed below.
Ambiguous Ground Truth: In medical datasets, the ground truth is often ambiguous, e.g., in medical image classification tasks [19], and even expert clinicians disagree on well-defined diagnostic tasks [86]. This problem is aggravated by the presence of malicious users who want to perturb the data, which makes diagnosis difficult and makes the influence of the perturbation hard to detect even with human expert review.
Improper Annotation: The annotation of data samples for life-critical healthcare applications should be informed by proper guidelines and by various privacy and legal considerations [87]. Most widely used healthcare datasets are annotated with coarse-grained labels, whereas the real-life utility of ML/DL lies in highlighting rare, fine-grained, and hidden strata within the clinical environment. This inability to label data appropriately can lead to various efficiency challenges that are discussed next.
Efficiency Challenges: The collections of healthcare data on which ML/DL models are built suffer from various issues that give rise to several efficiency challenges. A few major problems impacting the quality of data are described next.
(a) Limited and Imbalanced Datasets: The datasets used for training ML/DL models are often not of the required scale. In particular, one major limitation to the efficient application of DL approaches in healthcare is the unavailability of large-scale datasets, as health data is often small in size. Notably, most life-threatening health conditions are naturally rare and are diagnosed once in many (thousands to millions of) patients. Therefore, most ML/DL algorithms cannot be efficiently trained and optimized for such life-threatening healthcare tasks.
(b) Class Imbalance and Bias: Class imbalance is yet another problem that arises in supervised ML/DL; it refers to the fact that the distribution of samples among classes is not uniform. If a class-imbalanced dataset is used to train a model, the imbalance will be reflected in the model's outcomes as a bias towards certain categories. Biases in models' predictions in healthcare settings can have profound consequences and should, therefore, be mitigated. Various approaches have been proposed in the literature to address class imbalance problems. These approaches are discussed in the next section.
(c) Sparsity: Data sparsity, i.e., missing values, is common in real-world data and arises for various reasons (e.g., unmeasured and unreported samples). Missing values and observations significantly affect the performance of ML/DL techniques.
3) Vulnerabilities in Model Training: The vulnerabilities regarding model training include improper or incomplete training, privacy breaches, model poisoning, and model stealing. Improper or incomplete training refers to situations in which the ML/DL model is trained with improper parameters, e.g., learning rate, number of epochs, or batch size. Moreover, ML/DL models have been found to be highly vulnerable to various security and privacy threats such as adversarial attacks [16], model poisoning [88], and data poisoning attacks [89].
The vulnerabilities of ML/DL systems hinder their efficient deployment in security-critical applications (such as digital forensics and biometrics) as well as in life-critical applications (such as self-driving cars and healthcare). Therefore, ensuring the security and integrity of ML/DL systems is of paramount importance for such critical applications. Various security threats associated with ML/DL systems are thoroughly described in the next section.
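To make the class imbalance issue in item (b) above concrete, the following minimal sketch computes inverse-frequency class weights, one common heuristic for letting rare classes count more during training. It is only an illustration: the function name `inverse_frequency_weights` and the toy label counts are assumptions and are not taken from any of the surveyed works.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight class c by N / (K * n_c): N samples, K classes, n_c samples of class c."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Hypothetical toy labels: 0 = healthy (common), 1 = rare condition.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights)  # the rare class receives the larger weight
```

Such weights would typically multiply the per-sample loss, so that misclassifying a rare positive case costs more than misclassifying a common negative one.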
4) Vulnerabilities in Deployment Phase: The deployment of ML/DL techniques in a clinical environment essentially involves human-centric decisions. Therefore, ensuring the robustness of the system while considering fairness and accountability is necessary in the deployment phase. The following are the major vulnerabilities that can be encountered in the deployment phase of ML/DL systems; security issues (e.g., adversarial attacks) are discussed in the next section.
Distribution Shifts: Distribution shifts are to be expected in realistic healthcare settings. For example, a DL model trained on images from one domain (imaging center) may subsequently be deployed on images from a different domain, and in such settings the performance of the underlying DL model degrades significantly. Moreover, in predictive healthcare, ML models are developed using historical patient data and are usually tested on new patients, which raises questions about the efficacy of the ML predictions. Such differences can also be exploited for generating adversarial examples [90].
Incomplete Data: In realistic settings, data collected for providing patient care (e.g., EHRs) may contain missing observations or variables. The simplest way to handle missing values is to ignore them completely during analysis, but this cannot be done safely without knowing their relationship to the observed and unobserved data. Training ML/DL models on data with missing observations, on the other hand, leads to two well-known problems, i.e., false positives (a healthy person is diagnosed with a disease) and false negatives (a patient is identified as healthy). Both problems can have severe outcomes in actual healthcare settings; therefore, healthcare data should be complete and compact in all aspects to ensure accurate predictions of outcomes.
5) Vulnerabilities in Testing Phase: Vulnerabilities in the testing phase concern the interpretation of the results of the underlying ML/DL systems and include misinterpretation, false-positive outcomes, and false-negative outcomes. False-positive and false-negative outcomes are due to incomplete or inefficient training of the model, or to incomplete data being fed to the model for inference, as discussed above. Finally, the true essence of ML-empowered healthcare is not just about turning a crank; it demands the cautious application of analytical methods [91].
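One simple way to check for the kind of shift described above is to compare the empirical distribution of a feature at training time with the same feature at deployment time. The sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand; the choice of this statistic, the function name `ks_statistic`, and the synthetic "site" data are illustrative assumptions rather than a method prescribed by the works cited here.

```python
import random


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 means identical CDFs)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the two CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


# Synthetic stand-ins for one scalar feature (e.g., mean image intensity) per scan.
rng = random.Random(0)
site_a_train = [rng.gauss(0.0, 1.0) for _ in range(500)]
site_a_new = [rng.gauss(0.0, 1.0) for _ in range(500)]  # same imaging center
site_b = [rng.gauss(1.0, 1.0) for _ in range(500)]      # different imaging center

d_same = ks_statistic(site_a_train, site_a_new)
d_shift = ks_statistic(site_a_train, site_b)
print(d_same, d_shift)
```

A markedly larger statistic for the second comparison would suggest that the deployment data no longer resembles the training data and that model performance should be re-validated.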
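The false-positive and false-negative outcomes mentioned above are usually summarized as sensitivity and specificity. The following sketch shows the arithmetic on hypothetical labels (1 = disease, 0 = healthy); the function names and the toy predictions are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels where 1 = disease and 0 = healthy."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # healthy person diagnosed with the disease
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1  # patient identified as healthy
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of actual patients that are detected."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


def specificity(tn, fp):
    """Fraction of healthy people that are correctly cleared."""
    return tn / (tn + fp) if (tn + fp) else float("nan")


# Hypothetical ground truth and model predictions for ten people.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(sensitivity(tp, fn), specificity(tn, fp))
```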

B. The Security of ML: An Overview
In this section, we provide an overview of ML security, particularly from the perspective of healthcare, and highlight the security challenges associated with the use of ML.
1) Security Threats: The security threats on ML systems can be broadly categorized along three dimensions, i.e., influence attacks, security violations, and attack specificity [92]. A taxonomy of these security threats on ML systems is depicted in Figure 5.
(a) Influence: Influence attacks can be of two types: (1) causative: attacks that attempt to gain control over the training data; (2) exploratory: attacks that exploit the misclassification of the ML model without intervening in model training.
(b) Security Violation: It is concerned with the availability and integrity of the services and can be categorized into [16], [93]-[96]. In adversarial attacks, the key goal of an adversary is to generate adversarial examples by adding a small, carefully crafted (unnoticeable) perturbation to the actual (non-modified) input samples in order to compromise the integrity of the ML/DL system. In general, there are two types of adversarial attacks, which are described next.
(a) Poisoning Attacks: Adversarial attacks that affect model training, i.e., that manipulate the training data to mislead the learning of the ML/DL model, are known as poisoning attacks [88].
(b) Evasion Attacks: Adversarial attacks on the inference phase are known as evasion attacks [97]. In such attacks, an attacker manipulates the test data to compromise the integrity of the ML/DL model so that harmful inputs are misclassified.
In healthcare applications, poisoning attacks are highly relevant: although direct manipulation of existing training data may be difficult or even impossible in some cases, the addition of new samples might be relatively easy, and the consequences of such additions hinder the applicability of ML/DL systems. Therefore, the detection of poisoning attacks is critical for the robust application of ML/DL in healthcare. For instance, systematic poisoning attacks against six conventional ML models developed for hypothyroid diagnosis are presented in [98], where the objective of the attacker was to prevent hypothyroid diagnosis.
Similarly, a few researchers have highlighted the threat of these attacks to ML/DL models in healthcare settings, and we provide insights from such articles in this section. Unlike adversarial examples created for evading ML/DL models in other settings, the concept of adversarial patients for healthcare applications is introduced in [20]. The authors argue that, rather than intentional adversarial examples, the greater concern should be unintentional adversarial patients, which can lead to severe ethical issues. They identified a subgroup of adversarial patients and empirically validated that patients with identical predictive features can have significantly different individual treatment effects. In recent studies, white-box and black-box adversarial attacks have been demonstrated against three clinical applications, namely fundoscopy, dermoscopy, and chest X-ray analysis [19], [99]. Furthermore, the authors of [99] highlighted various potential incentives for adversaries to mount adversarial attacks in clinical trials, incentives that will grow with the increasing use of ML in the future, particularly with the emergence of computer-aided diagnosis and decision support systems.
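As a toy illustration of the evasion attacks described above, the sketch below perturbs the input of a logistic-regression classifier using the fast gradient sign method, a well-known attack chosen here purely for illustration. The weights, the input, and the budget `eps` are hypothetical; none of the attacks in [19], [98], or [99] is reproduced.

```python
import math


def predict_proba(w, b, x):
    """Probability of the positive class under a logistic-regression model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Move each feature by eps in the direction that increases the cross-entropy loss."""
    p = predict_proba(w, b, x)
    grad_x = [(p - y) * wi for wi in w]  # gradient of the loss with respect to the input
    return [xi + eps * sign(g) for xi, g in zip(x, grad_x)]


# Hypothetical model and a correctly classified positive sample.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.3, -0.2, 0.4], 1
x_adv = fgsm(w, b, x, y, eps=0.3)
print(predict_proba(w, b, x), predict_proba(w, b, x_adv))
```

In this toy setting, a bounded change to every feature is enough to push the predicted probability across the 0.5 decision threshold.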

C. ML for Healthcare: Challenges
In this section, we discuss various challenges that hinder the applicability of ML/DL systems in practical healthcare applications.
1) Safety Challenges: Excellent performance in a controlled lab environment (which is a common ML community practice) is not evidence of safety. The safety of ML/DL is the determination of how safe the ML/DL system is for patients, and it should be considered continuously throughout the ML/DL lifecycle. The majority of clinicians' routine tasks are mundane, and the patients they encounter mostly have common health conditions. Their critical role, however, is diagnosing rare, subtle, and hidden health conditions that occur once in millions. Enabling ML/DL to perform well on hidden strata, outliers, edge cases, and subtle cases is key to ensuring the safety of current AI systems.
2) Privacy Challenges: Privacy is one of the major challenges in data-driven healthcare which is concerned with the use of users' data by the ML/DL systems for making predictions. The users (i.e., patients) expect that their healthcare service providers are following necessary safety measures to safeguard their inherent right to the privacy of their confidential information, e.g., age, sex, date of birth, and health data. Potential privacy threats can be of two types, i.e., unveiling confidential information and malicious use of data (potentially by unauthorized agents).
Privacy depends upon the characteristics and nature of the data being collected, the environment in which it was created, and patients' demographics. Therefore, mitigating privacy breaches using appropriate techniques is critical, as such breaches can directly harm patients. Confidential data should be anonymized to prevent privacy breaches such as the (re-)identification of individuals [100]. Moreover, attention should be paid to privacy concerns at each stage of data processing, and the transfer of data among different departments within a hospital should take place in a secure environment.
3) Ethical Challenges: In user-centric applications of ML such as healthcare, it is important to ensure the ethical use of data. Explicit measures should be taken to understand the targeted user population and their sociological aspects before collecting data for building ML models. Moreover, understanding how data collection can harm a patient's well-being and dignity is an important consideration in this regard. If ethical concerns are not taken into account then the application of ML in realistic settings will have adverse results. Furthermore, to ensure fair and ethical operation of automated systems, it is imperative to have a clear understanding of the AI system in uncertain and complex scenarios [101].

4) Causality is Challenging:
Understanding causality is important in healthcare because most crucial healthcare problems require causal reasoning, i.e., answering "what if?" questions [102]. An example is asking what would happen if a doctor prescribed treatment A instead of treatment B. Such questions cannot be answered by classical learning algorithms; to answer them, we need to analyze the data through the lens of causality [103]. In healthcare, learning is often based solely on observational data, and answering causal questions from observational data is quite challenging and requires building causal models.
DL models are black boxes that lack a fundamental underlying theory, and they essentially work by exploiting patterns and correlations without considering any causal link [104]. In general, this cannot be deemed a limitation, since prediction does not require any causal relation. In predictive healthcare, however, the absence of a causal relation can raise questions about the conclusions that can be drawn from the outcomes of DL models. Furthermore, fairness in decision making can be better enforced through the lens of causal reasoning [105], [106]. Estimating the causal effect of some variable(s) on a target output (e.g., the target class in a multi-class classification problem) is important to ensure fair predictions.
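The gap between a correlational and a causal reading of observational data can be seen in a small worked example. The counts below are entirely hypothetical, and the sketch assumes that disease severity is the only confounder; under that assumption, standardizing over severity strata gives a different answer to the "treatment A instead of treatment B" question than a naive comparison does.

```python
# Hypothetical observational counts: (recovered, treated) per treatment and severity.
data = {
    "A": {"mild": (18, 20), "severe": (56, 80)},
    "B": {"mild": (68, 80), "severe": (13, 20)},
}


def naive_rate(strata):
    """Recovery rate that ignores severity (purely correlational)."""
    recovered = sum(r for r, _ in strata.values())
    treated = sum(n for _, n in strata.values())
    return recovered / treated


def adjusted_rate(strata, stratum_prob):
    """Recovery rate standardized over the population-wide severity distribution."""
    return sum(stratum_prob[s] * r / n for s, (r, n) in strata.items())


# Severity distribution in the whole population (both treatment groups pooled).
totals = {s: sum(data[t][s][1] for t in data) for s in ("mild", "severe")}
n_all = sum(totals.values())
stratum_prob = {s: n / n_all for s, n in totals.items()}

naive = {t: naive_rate(data[t]) for t in data}
adjusted = {t: adjusted_rate(data[t], stratum_prob) for t in data}
print(naive)     # B looks better because it was given mostly to mild cases
print(adjusted)  # A is better once severity is held fixed
```

The reversal arises because treatment A was given mainly to severe cases, which is exactly the kind of structure a purely predictive model trained on these records would not account for.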

5) Regulatory and Policy Challenges:
The full potential of ML/DL systems (which essentially constitute software as a medical device) in actual healthcare settings can only be realized by addressing regulatory and policy challenges. The literature suggests that regulatory guidelines are needed both for medical ML/DL systems and for their integration in actual clinical settings [107]. Therefore, the integration of AI-empowered ML/DL systems in the actual clinical environment should comply with the policies and regulations defined by governments and regulatory agencies. However, existing regulations are not suited to certifying ever-evolving systems such as those empowered by ML/DL: a key challenge with the use of ML/DL algorithms in clinical practice is determining how these models should be implemented and regulated, since they will keep learning from new patient data [108]. In addition, the objective clinical evaluation of ML/DL systems for particular clinical settings is crucial to ensure safe, effective, and robust operation that does not harm patients in any way. Data scientists and AI engineers should be employed in hospitals to assess AI systems regularly and ensure that they remain safe, relevant, and functional.
6) Availability of Good Quality Data: The availability of representative, diverse, and high-quality data is one of the major challenges in healthcare. For instance, the amount of data available to the research community is very small in size and limited in scope compared to the heterogeneous collections of large-scale multi-modal patient data being generated on a daily basis by healthcare institutions of different sizes. However, the development of good-quality data that resembles real clinical settings is, on the other hand, very challenging and requires resources for management and maintenance.
The availability of high-quality data can effectively serve the intended purpose of disease prediction and decision making for planning treatment.
The data collected in practice suffers from different issues such as subjectivity, redundancy, and bias. Because ML/DL models perform inference solely by learning the latent factors of the data on which they are trained, the effect of data generated by undesirable past practices of hospitals will be reflected in the outcomes of the algorithm. For example, most people with no health insurance are denied healthcare services, and if an AI system learns from that data, it will do the same. It has been shown that a model can exhibit racial bias by producing varying outcomes for different subpopulations [115], and the training data can also introduce its own modeling challenges [116], [117].
7) Lack of Data Standardization and Exchange: Medical ML/DL systems should facilitate a deep understanding of the underlying healthcare task, which (in most cases) can only be achieved by utilising other forms of patient data. For example, radiology is not all about clinical imaging: other patient EMR data is crucial for radiologists to derive a precise conclusion from an imaging study. This calls for integration and data exchange between all healthcare systems. Despite extensive research on data exchange standards for healthcare, these standards are widely disregarded in healthcare IT systems, which broadly affects the quality and efficacy of the healthcare data accumulated through these systems. There are numerous guidelines for performing specific medical interventions such as imaging studies (i.e., with defined exposure and positioning) to ensure that the data is clinically significant. However, current healthcare IT systems largely ignore standards, and clinicians barely follow well-established guidelines. As a result, data integration and exchange efforts across different specialities and organisations fail. Data integration to match diverse patients' medical records is crucial to deliver high-value patient care.
The lack of appetite for implementing data exchange standards in the wider healthcare industry hinders the efficacy of ML/DL systems: multi-modal data is vital for algorithms to gain a deep understanding of the task, and it would enhance the performance of physicians in making clinical decisions using data-driven insights.
8) Distribution Shifts: The problem of data distribution shifts is yet another major challenge and perhaps one of the most challenging problems to solve [118]. In clinical practice, training and testing data distributions can diverge due to many reasons, e.g., medical data is generated by different institutions using different devices for patients having complicated cases. Due to this issue, ML/DL models developed using available

TABLE I (fragment; only the following rows are recoverable here, with fields listed in the order in which they appear):
[110]: Proposed an algorithm named TRIM to defend against poisoning attacks; linear regression; anticoagulant drug Warfarin.
Liu et al. [111]: XMPP server and several mobile devices; proposed a DL framework; human activity recognition.
Malathi et al. [112]: Paillier homomorphic encryption; Naive Bayesian, SVM, neural network, and FKNN-CBR; Indian Liver Patient.
Takabi et al. [113]: Homomorphic encryption; DNN; 15 datasets from the UCI repository.
Kim et al. [114]: Security; homomorphic encryption based secure logistic regression.

Healthcare IT systems are mostly proprietary and operate in silos, which results in the revision, fixing, and update of software being costly and time-consuming. It has been reported in the literature that in 2013, the majority of hospitals were using the ninth version of the international classification of disease (ICD) system even though a revised version (i.e., ICD-10) was released as early as 1990 [19]. The difficulties in updating hospital software infrastructure can raise many vulnerabilities with the use of modern tools like ML/DL systems.

IV. SECURE, PRIVATE, AND ROBUST ML FOR HEALTHCARE: SOLUTIONS
In this section, we present an overview of various methods proposed to ensure secure, private, and robust ML for healthcare applications. A summary of articles focused on the topic of "secure and privacy-preserving ML for healthcare" is presented in Table I. In addition, a taxonomy of commonly used approaches for secure, private, and robust ML is presented in Figure 6, and these approaches are described individually next.

A. Privacy-Preserving ML
Preserving the privacy of the user is paramount in healthcare, as it is a user-centric application that involves the collection of personal data, and any breach of privacy can lead to unavoidable consequences. Preserving privacy means that ML model training and inference should not reveal any additional information about the subjects from whom the data was collected. In general, ML/DL requires training data to be stored in a central repository (e.g., the cloud), which may include users' private data and therefore raises various threats; data anonymization techniques are used to address such concerns. However, it has been reported in the literature that meaningful information about individuals' private data can be inferred even when the data is anonymized [119].
Various efforts in the literature have addressed the privacy issues that arise with the use of ML. Three different protocols for the two-server model are presented in [120], where the data owners distribute their private data among two non-colluding servers, which then train ML models on the joint data using secure two-party computation (2PC). Furthermore, different techniques have been proposed to perform secure arithmetic operations in a secure multi-party computation setting, and alternatives to the nonlinear activation functions used in ML models, such as softmax and sigmoid, have also been proposed. Similarly, various techniques for privacy-preserving ML, such as cryptographic and differential privacy approaches, are discussed in [100]. Here we briefly discuss the widely used methods for preserving privacy.
1) Cryptographic Approaches: Cryptographic approaches are used in the scenarios where the ML model requires encrypted data (for training and testing purposes) from multiple parties. The widely used methods include homomorphic encryption, secret sharing, garbled circuits, and secure processors which are briefly described next.
(a) Homomorphic Encryption: It enables computations on encrypted data with operations such as addition and multiplication, which can be used as a basis for computing complex functions. Typically, the data is encrypted under the public keys of the original data owners, and the computation is carried out directly on the resulting ciphertexts.
(b) Garbled Circuits: Garbled circuits are used in cases where two parties (let's assume Alice and Bob) want to obtain a result computed using their private data. Alice sends the function in the form of a garbled circuit along with her garbled input. After obtaining the garbled version of his input from Alice in an oblivious fashion, Bob uses his garbled input with the garbled circuit to get the result of the required function and can share it with Alice, if required. The use of homomorphic encryption and garbled circuits to build cryptographic blocks for developing three classification techniques, namely Naive Bayes, decision trees, and hyperplane decision, is presented in [121], where the goal is to protect the ML models and the new samples submitted for inference.
(c) Secret Sharing: The strategy of distributing a secret among multiple parties, each of which holds a "share" of the secret, is known as secret sharing. The secret can only be reconstructed when the individual shares are combined; otherwise, they are useless. In some settings, the secret can be reconstructed from t shares (where t is a threshold value), so that not all shares need to be combined. A secret sharing paradigm for computing privacy-preserving parallelized principal component analysis (PCA) is presented in [122]. In a similar study [123], a protocol is developed using the secret sharing strategy for aggregating model updates received from multiple input parties; the updates are used for training the ML model. A privacy-preserving emotion recognition framework is presented in [124]. The authors used a multi-secret sharing scheme for transmitting audio-visual data collected from users with edge devices to the cloud, where a CNN and a sparse autoencoder were applied for feature extraction and a support vector machine (SVM) was used for emotion recognition.
(d) Secure Processors: Secure processors were originally developed to protect the confidentiality and integrity of sensitive code against unauthorized access by rogue software running at higher privilege levels. However, these processors, e.g., the Intel SGX processor, are now being utilized for privacy-preserving computation. For instance, Ohrimenko et al. developed an SGX-processor-based data-oblivious system for k-means clustering, decision trees, SVM, and matrix factorization [125]. The key idea was to enable collaboration between multiple data owners running the ML task on an SGX-enabled data center. All communication between the data owners and the enclave was performed over independently established secure channels (i.e., an individual channel for each data owner).
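The additive property described under homomorphic encryption in item (a) can be illustrated with a toy version of the Paillier cryptosystem, which also appears in Table I. The sketch below is for intuition only: the primes are far too small to be secure, the helper names are assumptions, and it does not reproduce the protocol of any cited work.

```python
import math
import random


def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1 (p and q are small demo primes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, m, rng=random):
    """Encrypt an integer 0 <= m < n under public key n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def scale_encrypted(n, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(c, k, n * n)


n, private_key = keygen(293, 433)  # insecure demo primes
c1, c2 = encrypt(n, 17), encrypt(n, 25)
total = decrypt(n, private_key, add_encrypted(n, c1, c2))
scaled = decrypt(n, private_key, scale_encrypted(n, c1, 3))
print(total, scaled)
```

A server holding only `c1` and `c2` can thus compute an encrypted sum or a scaled value without ever seeing 17 or 25, which is the building block behind encrypted linear models.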
2) Differential Privacy: Differential privacy refers to the mechanism of adding perturbation to datasets to protect private data. The idea of adding adequate noise to a database to preserve privacy was first introduced by C. Dwork in 2006 [126]. Differential privacy constitutes a strong standard for guaranteeing privacy for algorithms that perform analysis on aggregate databases, and it is defined in terms of the application-specific concept of neighboring datasets [127]. Differential privacy is particularly useful for applications like healthcare because of properties such as group privacy, composability, and robustness to auxiliary information. Group privacy implies graceful degradation of the privacy guarantees when datasets contain correlated samples. Composability enables modular algorithm design: when the individual components are differentially private, so is their composition. Robustness to auxiliary information means that the privacy of the system is not affected by any side information known to the adversary. To avoid privacy breaches, researchers can also explore encrypted and noisy datasets for building ML-empowered healthcare applications [128].
Various approaches for differential privacy have been proposed in the literature, e.g., private aggregation of teacher ensembles (PATE) for private ML [129], the differentially private stochastic gradient descent (DP-SGD) algorithm [127], the moments accountant [130], hyperparameter selection [131], and the Laplace [132] and exponential noise differential privacy mechanisms [133], [134]. For instance, privacy-preserving distributed DL for clinical data using differential privacy, which incorporates the idea of cyclical weight transfer, is presented in [135].
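As a minimal sketch of the noise-adding idea, the code below answers a counting query with the Laplace mechanism. The patient ages, the query, and the function names are hypothetical; the only facts relied upon are that a counting query changes by at most 1 when one record is added or removed (sensitivity 1) and that the noise scale is sensitivity divided by epsilon.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: the difference of two i.i.d. exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Epsilon-differentially private count of the records satisfying `predicate`."""
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # one patient changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical patient ages; the query is "how many patients are 65 or older?".
ages = [34, 67, 71, 45, 80, 59, 62, 90, 28, 66]
rng = random.Random(0)
noisy = private_count(ages, lambda a: a >= 65, epsilon=0.5, rng=rng)
print(noisy)  # the true count is 5; the released value is perturbed
```

Smaller values of epsilon give stronger privacy but noisier answers, which is the usual trade-off when tuning such a mechanism.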
3) Federated Learning: The idea of federated learning (FL) has been recently proposed by Google Inc. [136]. In FL, a shared ML model is built using distributed data from multiple devices: each device trains the model on its local data and then shares the model parameters, rather than its actual data, with the central model. An FL-based decentralized scheme that uses the iterative cluster primal-dual splitting (cPDS) algorithm to predict which patients require hospitalization from large-scale EHRs of heart-related diseases is presented in [137]. In [138], simple vanilla, U-shaped, and vertically partitioned data-based configurations for split learning of DL models are presented. The proposed framework, named SplitNN, does not require sharing patients' critical data with the server. A framework for federated autonomous deep learning (FADL) using distributed EHRs is presented in [139].
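The train-locally-then-share-parameters loop described above can be sketched in a few lines. The example below fits a one-dimensional linear model across three simulated clients and combines their parameters with a size-weighted average, in the spirit of federated averaging. The synthetic data, learning rate, and function names are assumptions, and the sketch omits everything a real deployment needs (secure aggregation, communication, client sampling).

```python
import random


def local_update(weights, data, lr=0.3, epochs=5):
    """A few epochs of full-batch gradient descent on one client's private data."""
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(((w * x + b) - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b) - y for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return [w, b]


def federated_average(client_weights, client_sizes):
    """Server step: average client parameters, weighted by local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [sum(cw[k] * n for cw, n in zip(client_weights, client_sizes)) / total
            for k in range(n_params)]


def make_client_data(rng, n):
    """Hypothetical local dataset drawn from y = 2x - 1 plus small noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x - 1.0 + rng.gauss(0.0, 0.1)))
    return data


rng = random.Random(42)
clients = [make_client_data(rng, n) for n in (30, 50, 20)]  # e.g., three hospitals
global_weights = [0.0, 0.0]
for _ in range(40):  # communication rounds
    updates = [local_update(global_weights, data) for data in clients]
    global_weights = federated_average(updates, [len(d) for d in clients])
print(global_weights)  # should approach the underlying slope 2 and intercept -1
```

Only the two model parameters ever leave a client in this loop; the raw (x, y) pairs stay local.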

B. Countermeasures Against Adversarial Attacks
In the recent literature, countermeasures against adversarial attacks are categorized into three classes: (1) modifying the model; (2) modifying the data; and (3) adding auxiliary model(s) [140]. A taxonomy of such methods is presented in Figure 7, and the methods are discussed next.
1) Modifying Model: This class includes methods that modify the parameters or features of the trained ML model. Widely used methods include the following:
• Defensive Distillation: The distillation of neural networks was first introduced by Hinton et al. as a method for transferring knowledge from a larger model to a smaller one [141]. The notion of network distillation was then adopted by Papernot et al. to defend against adversarial attacks, an approach known as defensive distillation [142]. The authors used the labels predicted by a first model as the training labels for the input samples of the original DL model. This strategy increases the robustness of the DL model to considerably small perturbations. However, in a later study, Carlini and Wagner demonstrated that their proposed adversarial attack (named the C&W attack) evaded the defensive distillation method [143].
• Network Verification: Techniques that verify certain properties of DL models in response to input samples are known as network verification methods. The key goal is to restrain adversarial examples by checking whether the input satisfies or violates certain properties. In [144], such a method is proposed that uses ReLU activation and satisfiability modulo theory (SMT) to make deep models resilient against adversarial attacks.
• Gradient Regularization: The idea of using input gradient regularization to defend against adversarial examples was proposed by Ross et al. [145]. They trained differentiable models while penalizing the variation of the output with respect to changes in the input, so that small adversarial perturbations were not able to affect the output of DL models. However, this method increases the complexity of the training process by a factor of two.
• Classifier Robustifying: In this method, classification models are developed that are robust to adversarial attacks, rather than building a detection strategy for such attacks. In [146], the authors exploited the uncertainty around adversarial examples and proposed a hybrid model that places Gaussian processes (GPs) with RBF kernels on top of DNNs to make them robust against adversarial attacks. In a similar study, a robust model is proposed for MNIST classification that uses analysis by synthesis through a learned class-conditional data distribution.
• Interpretable ML: It includes methods that aim at explaining and interpreting the outcomes of ML/DL models in order to make them robust against adversarial attacks. An approach that utilizes the interpretability of deep models to detect adversarial examples for a face recognition task is presented in a recent study [147]. The key aspect of this method is that it identifies the neurons critical for the individual task by initiating a bi-directional correspondence reasoning between the model's parameters and its attributes. The activation values of the identified neurons are then increased to augment the reasoning part, and the activation values of other neurons are decreased to mask the uninterpretable part. However, Nicholas Carlini demonstrated that this method is not resilient to untargeted adversarial examples generated using the L∞ norm [148].
• Masking ML Model: In a recent study [149], a method for secure learning is presented in which the problem of adversarial ML is formulated as a learning and masking problem. The deep model was masked by introducing noise in the logit output, which successfully defended against attacks with low distortions.
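The soft labels that underlie distillation can be illustrated with a temperature-scaled softmax, the device used in network distillation to soften a model's predictions. The logits and temperature values below are hypothetical, and the sketch shows only the label-softening step, not the full defensive distillation training procedure of [142].

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits produced by a first (teacher) model for one input.
logits = [8.0, 2.0, 0.5]
hard = softmax(logits, temperature=1.0)   # nearly one-hot
soft = softmax(logits, temperature=20.0)  # soft labels used to train the second model
print(hard, soft)
```

Training the second model on the softer distribution conveys how similar the classes are to one another, rather than only which class is the most likely.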
2) Modifying Data: This category includes methods that modify either the data or its features. Commonly used methods are described next:
• Adversarial (Re-)training: This basic method was originally proposed by Goodfellow et al. for making deep models robust to adversarial examples [93]. In this method, the ML/DL models are trained (or retrained) using an augmented training set that includes adversarial examples. Various studies have used this method for evaluating the robustness of DL classifiers using different datasets, e.g., MNIST [150] and ImageNet [144]. However, it has been reported in the literature that this method fails to defend against iterative adversarial perturbation generation methods such as the basic iterative method (BIM) [151].
• Input Reconstruction: The method of transforming adversarial examples into legitimate ones by cleaning the adversarial noise is known as input reconstruction. The transformed samples have no harmful effect on the inference of deep models. In [152], a denoising autoencoder is used for cleaning adversarial examples.
• Feature Squeezing: Xu et al. [153] proposed feature squeezing as a defense against adversarial examples that squeezes the input feature space that an adversary can exploit to construct adversarial examples. To reduce the feature space available to an adversary, the authors combined heterogeneous feature vectors in the original feature space into a single space. The feature squeezing was performed at two levels: (1) smoothing the spatial domain using local and non-local operations and (2) minimizing the color bit depth. The proposed defense was evaluated against eleven state-of-the-art adversarial perturbation generation methods on three benchmark datasets (i.e., CIFAR10, MNIST, and ImageNet). However, in a later study, this defense method was found to be less effective [154].
• Features Masking: The method of feature masking, proposed by Gao et al. [155], aims at masking the most sensitive features of the input that are susceptible to adversarial perturbations. The authors added a masking layer right before the classification layer (i.e., softmax) that sets the corresponding weights of the sensitive neurons to zero.
• Developing Adversarially Robust Features: To develop adversarially robust features, the connections between the metric of interest and the natural spectral geometrical properties of the dataset have been leveraged in [156]. Furthermore, the authors provided empirical evidence for the effectiveness of a spectral approach to developing adversarially robust features.
• Manifold Projection: The method of projecting input samples onto the manifold learned by generative models is known as manifold projection. Song et al. [157] used generative models to clean adversarial noise (perturbations) from adversarial images; the cleaned images are then used as the input to the non-modified model. In a similar study [158], generative adversarial networks (GANs) are used for cleaning adversarial noise.
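The adversarial (re-)training idea from the list above can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not the procedure of [93] as applied to deep networks: it uses a logistic-regression model, synthetic two-class data, the fast gradient sign method (FGSM) to craft perturbations, and hypothetical values for the step size `eps`, learning rate, and number of epochs.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(w, b, X, y):
    """Per-sample binary cross-entropy of a logistic-regression model."""
    p = np.clip(sigmoid(X @ w + b), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))


def fgsm(w, b, X, y, eps):
    """FGSM: shift each input by eps along the sign of the loss gradient w.r.t. x."""
    grad_x = (sigmoid(X @ w + b) - y)[:, None] * w[None, :]
    return X + eps * np.sign(grad_x)


def adversarial_training(X, y, eps=0.1, lr=0.1, epochs=200):
    """Gradient descent on the union of clean and FGSM-perturbed samples."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        X_adv = fgsm(w, b, X, y, eps)          # regenerate against the current model
        X_aug = np.vstack([X, X_adv])
        y_aug = np.concatenate([y, y])
        err = sigmoid(X_aug @ w + b) - y_aug   # derivative of the loss w.r.t. the logit
        w -= lr * X_aug.T @ err / len(y_aug)
        b -= lr * err.mean()
    return w, b


# Toy, synthetic two-class data (an assumption made purely for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (100, 2)), rng.normal(1.0, 0.5, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = adversarial_training(X, y)
X_adv = fgsm(w, b, X, y, eps=0.1)
print("mean clean loss:      ", log_loss(w, b, X, y).mean())
print("mean adversarial loss:", log_loss(w, b, X_adv, y).mean())
```

Because the perturbations are recomputed against the current parameters in every epoch, the augmented set tracks the model as it changes. A single-step attack such as FGSM is used here, which is consistent with the caveat above that this defense may still fail against iterative attacks such as BIM.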

3) Adding Auxiliary Model(s): In these methods, additional auxiliary ML/DL models are integrated to robustify the mainstream model. Commonly used methods that fall into this class are described below:
• Adversarial Detection: In this method, an additional binary classifier, which can be regarded as the detector model, is trained to distinguish between adversarial and original samples [159], [160]. In [161] [162]. The PixelDefend method is an example of an ensemble defense method in which the authors used an ensemble of two methods, i.e., adversarial detection and input reconstruction [157]. However, it has been shown that an ensemble of weak defenses does not necessarily increase the robustness of DL models to adversarial attacks [154].
• Using Generative ML Models: The idea of defending against adversarial attacks by utilizing generative models was first presented by Goodfellow et al. [93]; however, in the same study the authors presented an alternative hypothesis of ensemble training and argued that generative training is not sufficient. In [163], adversarial examples are cleaned using a GAN that was trained on the same dataset. In a similar study [164], a framework named Defense-GAN is presented that is trained on the distribution of legitimate samples. During the testing phase, Defense-GAN finds a similar output without adversarial perturbations, which is then given as input to the original DL model.

C. Causal Models for Healthcare
Asking causal questions in healthcare is challenging yet important, and ideally causal inference requires experiments. In healthcare, however, this is not always possible: e.g., if we want to find out what would happen if a person takes drug A instead of drug B, we cannot experiment directly on the patient, as doing so is unethical and can have unintended consequences. Alternatively, retrospective observational data is leveraged to train models for making counterfactual predictions of what we would have observed if we had run an experiment [165]. Causality can be framed in two foundational ways, i.e., potential outcomes and causal graphical models that require manipulating reality. In predictive healthcare, potential outcomes can be defined with respect to treatments, actions, and interventions. If the total number of possible treatments is T, then each patient has T potential outcomes, and the unit of observation is a patient who receives one of the T treatments.
In the literature, different approaches have been presented for providing causal inference and reasoning in healthcare using classical models. For instance, a Gaussian process based counterfactual causal model has been presented in [165], and in a similar study the authors introduced the counterfactual Gaussian process (CGP) for predicting counterfactual future progression and argued that a counterfactual model can provide reliable decision support [102]. The use of probabilistic graphical models to analyze causality in health conditions for the identification of sleep apnea, Alzheimer's disease, and heart diseases is presented in [166]. A comprehensive review of graphical causal models can be found in a recent study [167].
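The potential-outcomes view described above can be made concrete with a small simulation for T = 2 treatments. This is a sketch on synthetic data and does not reproduce the models of [165] or [102]: the confounder (disease severity), the effect size of 1.0, and the assignment probabilities are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical confounder: disease severity (0 = mild, 1 = severe).
severity = rng.integers(0, 2, size=n)

# Two potential outcomes per patient; the simulated effect of drug A
# (treatment 1) relative to drug B (treatment 0) is +1.0 for everyone.
y0 = 1.0 + 2.0 * severity + rng.normal(0.0, 1.0, size=n)
y1 = y0 + 1.0

# Observational assignment: severe patients receive drug A more often.
p_treat = np.where(severity == 1, 0.8, 0.2)
treated = rng.random(n) < p_treat

# Only the potential outcome of the treatment actually received is observed.
y_obs = np.where(treated, y1, y0)

true_ate = (y1 - y0).mean()
naive = y_obs[treated].mean() - y_obs[~treated].mean()

# Adjust for the confounder: average the within-stratum differences,
# weighted by how common each stratum is in the population.
adjusted = 0.0
for s in (0, 1):
    in_s = severity == s
    diff = y_obs[in_s & treated].mean() - y_obs[in_s & ~treated].mean()
    adjusted += in_s.mean() * diff

print(f"true effect {true_ate:.2f}, naive {naive:.2f}, adjusted {adjusted:.2f}")
```

The naive comparison of treated and untreated patients mixes the drug effect with the effect of severity, whereas the stratified estimate approximates the simulated effect. In real observational data the relevant confounders are not known with certainty, which is one reason counterfactual prediction remains difficult.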

D. Solutions to Address Distribution Shifts
To address the data distribution shift problem, various techniques have been proposed in the literature (e.g., transfer learning and domain adaptation), which are described next.
1) Transfer Learning: The requirement of the availability of a large-scale dataset for training DL models capable of providing high performances can be partially mitigated using transfer learning. Transfer learning is a technique in which a model trained on a larger dataset is re-trained (fine-tuned) on the application-specific dataset (relatively smaller in size to the first one). The aim is to transfer knowledge learned by the model from one domain (data distribution) to the other domain [168]. However, transfer learning can be problematic for healthcare applications due to the requirement of sufficiently large data for first training and good quality data annotated by expert clinicians such as radiologists for domain-specific training.
2) Domain Adaptation: Domain adaptation is the method of learning a DL model while accounting for a shift between the training (often called the source domain) and test (often called the target domain) data distributions, i.e., the source domain and target domain distributions are different. Domain adaptation is a special case of transfer learning that can be particularly useful for medical image analysis tasks such as MRI segmentation [118], [169], chest X-ray classification [170], and multi-class Alzheimer's disease classification [171]. Different variants of domain adaptation have been proposed in the literature and can be broadly categorized as supervised, unsupervised, semi-supervised, and self-supervised domain adaptation methods, which are described below. Note that this terminology is used ambiguously, since a category name may refer to labeled data being available in the source or in the target domain; the definitions provided below are those most commonly used in the literature [172].
(a) Supervised Domain Adaptation: This method is similar to a supervised learning strategy, with the only difference being that the source domain and target domain data follow different distributions. Supervised domain adaptation is particularly useful when labeled data is available for the target domain; generally, the source domain also has labeled data.
(b) Unsupervised Domain Adaptation: In unsupervised domain adaptation, the source domain data is labeled and the target domain data is unlabeled. An unsupervised domain adaptation method using reverse flow and adversarial training for generating synthetic medical images is presented in [173]. In addition, the authors used self-regularization for preserving clinically relevant features.
(c) Semi-supervised Domain Adaptation: In semi-supervised domain adaptation, the source domain data is labeled and the target domain data is only partially labeled.
(d) Self-supervised Domain Adaptation: Self-supervised domain adaptation methods aim at learning visual models without manual labeling by training generic models on relatively simple auxiliary tasks (known as pretext tasks). The supervision is provided by modifying the original visual content (e.g., a set of images) according to known transformations (e.g., rotation); the model is then trained to predict these transformations, which serve as labels for the pretext tasks [174].
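The rotation pretext task mentioned under self-supervised domain adaptation can be sketched as a data-generation step. The snippet below is a minimal illustration, not the method of [174]: the helper name `make_rotation_pretext` and the random stand-in images are assumptions, and the model that would be trained on the generated pairs is omitted.

```python
import numpy as np


def make_rotation_pretext(images):
    """Return rotated copies of each image and the rotation index as label.

    images: array of shape (n, h, w) holding square images (h == w).
    Label k means the image was rotated by k * 90 degrees.
    """
    rotated, labels = [], []
    for img in images:
        for k in range(4):
            rotated.append(np.rot90(img, k))
            labels.append(k)
    return np.stack(rotated), np.array(labels)


# Stand-in for unlabeled target-domain images (random values, illustration only).
rng = np.random.default_rng(0)
unlabeled = rng.random((5, 8, 8))

x_pretext, y_pretext = make_rotation_pretext(unlabeled)
print(x_pretext.shape, y_pretext[:8])
```

The labels come for free from the known transformation, so no clinician annotation is needed for this stage; a classifier trained to predict `y_pretext` from `x_pretext` would then provide features for the downstream task.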

E. Towards Responsible ML
In this section, we describe different methods for ensuring responsible ML, starting with a list of general responsible AI practices.
1) General Responsible AI Practices: The following are some recommended AI practices to ensure effective and reliable AI systems 2 .
• Consider human-centered design approach: To have a large impact on the system being developed, it is important to consider the characteristics of the users for true recommendations.
• Evaluate training and monitoring using suitable metrics: Instead of using multiple metrics for evaluation of model training, ensure that the metric is appropriate for the context and goals of the systems and consider users' feedback in terms of surveys.
• Examine your raw data: The biases and abnormalities in the datasets (e.g., missing values, class imbalance, and incorrect labels) are directly reflected in the learned ML models. To ensure the efficacy of the learning process, careful examination of the raw dataset is necessary while respecting the privacy concerns.
• Understand limitations of the model and dataset: It is crucial to understand the capability and limitations of the ML model and dataset, e.g., a model trained for detecting correlations cannot be used for inferences.
• Repetitive Testing: Once developed, ML systems should be tested again and again to ensure that they are working as intended. Rigorous tests should be performed to understand how the individual components of the ML system interact with each other. Other similar tests include testing for input drifts, using gold standard datasets, incorporating a larger sample base, and using quality checking mechanisms.
• Continuous Monitoring and Updating: To ensure the efficient performance of the ML systems deployed in real-time settings, continued monitoring and updating are required to identify and fix various issues encountered in realistic settings.
2) Responsible ML for Healthcare: ML/DL techniques have great potential for clinical applications (e.g., radiologist-level pneumonia detection [11] and dermatologist-level classification of skin cancer [13]), but their limited adoption in actual clinical settings indicates that these methods are not yet optimal and not ready for clinical deployment.
In a recent study [175], Wiens et al. have provided a roadmap towards safe, meaningful, and responsible ML for healthcare and argued that ML deployment in any field should be carried out by an interdisciplinary team that may include different stakeholders from multiple disciplines, i.e., knowledge experts, decision-makers, and users. Examples of an interdisciplinary team having different stakeholders in the healthcare ecosystem are presented in Table II. In addition, the authors also identified critical steps to be followed when designing, testing, and deploying ML solutions for healthcare applications, which include: (1) choosing the right problems; (2) developing a useful solution; (3) considering ethical implications; (4) rigorously evaluating the model; (5) thoughtfully reporting results; (6) deploying responsibly; and (7) making it to market.
2 https://ai.google/responsibilities/responsible-ai-practices/
The main strength of ensuring secure ML relies on the development of security tools and algorithms. To ensure the security and privacy of ML models and data, various tools and libraries have been released so far. For example, TensorFlow Federated 3 is an open-source framework for distributed ML/DL that enables training of a global shared model in a federated environment without sharing clients' local data. CrypTen 4 is a framework for secure and privacy-preserving ML built on PyTorch that provides secure computing techniques for ML/DL model training and inference using encrypted data, and PyTorch-DP 5 is a PyTorch framework for training DL models with differential privacy. Similarly, OpenMined 6, an open-source community, offers various tools and libraries for building privacy-preserving ML models, which are briefly described below.
• PySyft 7 is a Python library for encrypted and privacy-preserving ML. It extends PyTorch, TensorFlow, and Keras and supports differential privacy, federated learning, multi-party computation, and homomorphic encryption.
• PyGrid 8 is a platform built on PySyft that provides a peer-to-peer network to collectively train ML models.
• SyferText 9 is a privacy-preserving framework for NLP tasks.
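To give a sense of what training with differential privacy involves, the following NumPy sketch shows the gradient clip-and-noise step commonly used in differentially private SGD. It is an illustration under stated assumptions and does not use or reproduce the API of any of the libraries listed above: the function name `dp_average_gradient`, the clipping norm, and the noise multiplier are hypothetical, and the privacy accounting needed to report an actual privacy guarantee is omitted.

```python
import numpy as np


def dp_average_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, and average."""
    grads = np.asarray(per_sample_grads, dtype=float)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    # Scale down only the gradients whose L2 norm exceeds clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = grads * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=grads.shape[1])
    return (clipped.sum(axis=0) + noise) / len(grads)


rng = np.random.default_rng(0)
# Hypothetical per-sample gradients for a model with three parameters.
grads = np.array([[3.0, 4.0, 0.0],    # norm 5.0, rescaled to norm 1.0
                  [0.3, 0.0, 0.4]])   # norm 0.5, left unchanged

noisy = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
noiseless = dp_average_gradient(grads, clip_norm=1.0, noise_multiplier=0.0, rng=rng)
print(noisy, noiseless)
```

Clipping bounds the influence any single patient record can have on the update, and the added noise masks the remaining contribution; setting the noise multiplier to zero recovers the plain average of the clipped gradients.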

V. OPEN RESEARCH ISSUES
In this section, various open research issues related to the domain of secure, robust, and private ML for healthcare that require further research attention are presented.

A. Interpretable ML
Although advances in ML/DL research have provided significant improvements over previous state-of-the-art methods in terms of performance metrics such as accuracy, precision, recall, and F1-measure, these advances have made the learning process of modern models very complex, and such models are usually deployed as black boxes. These black-box methods fail to provide a rationale or insights, and fail to explain their learning behavior and decision process for making predictions [176]. This problem is termed the interpretability problem of ML in the literature, where interpretability is defined as the ability to describe the internal processes of an ML system in a human-understandable manner.
Moreover, interpretability of ML/DL techniques is required to ensure algorithmic fairness, robustness, and generalization based on potentially dispersed data collected from a heterogeneous population. This can eventually help in the smooth deployment and functionality of ML/DL systems in realistic settings. For a critical application like healthcare, the ML/DL model is expected to be highly accurate and understandable at the same time. Moreover, it has been argued that clinical integration of AI models will require interpretability [177].
To interpret ML models, questions about the fairness of the model's predictions, transparency, and accountability are considered, and interpretation is performed using explanation methods that justify the predictions of the model using visual, textual, or feature information. Various methods have been proposed in the literature for explaining ML/DL models for general applications [178]-[180]; however, more research that is specifically focused on the interpretation of ML/DL systems used in healthcare applications is required.

B. Machine Learning on the Edge
The advancements in ML research have revolutionized traditional healthcare (as discussed in earlier sections). Healthcare services will increasingly adopt IoT devices and wearable sensors in the future, particularly with the evolution of smart cities and portable medical devices, e.g., portable MRI scanners. With such proliferation, there is a pressing need to push ML model training and inference onto edge devices. This introduces unique challenges such as limited hardware and processing capabilities. Moreover, this is crucial for portable medical devices that are used for patients in critical care, as such patients cannot be moved to fixed medical equipment in the hospital. The research on enabling ML on edge devices (a.k.a. fog) is in the early stages of development and requires further attention from the research community. The development of this field will enable the monitoring of patients in critical situations and will eventually enable continuous behavioral monitoring for improving individuals' lifestyle and the timely detection of diseases.

C. Handling Dataset Annotation
To increase the performance of ML/DL models, one natural strategy is to acquire more labeled training data. This requires that radiologists and medical experts spend their valuable time manually annotating medical data, e.g., medical images, signals, and reports. Another important aspect is devising true validation sets that will evaluate the performance of the ML/DL models and expose the limitations of these models. However, manual annotation of samples into their respective categories is a time-consuming, costly, and tedious process. Automatic approaches should be developed to address this issue; one such technique is active learning, which can be used to annotate unlabeled data samples.
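One common form of active learning, uncertainty sampling, can be sketched in a few lines. The example below is an assumption-laden illustration rather than a method taken from the cited literature: the helper name `select_for_annotation` and the example class probabilities are hypothetical, and predictive entropy is only one of several possible uncertainty scores.

```python
import numpy as np


def select_for_annotation(probs, k):
    """Return indices of the k samples with the highest predictive entropy."""
    p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return np.argsort(-entropy)[:k]


# Hypothetical class probabilities from the current model on unlabeled samples.
probs = np.array([[0.98, 0.02],
                  [0.55, 0.45],
                  [0.80, 0.20],
                  [0.50, 0.50]])

query = select_for_annotation(probs, k=2)
print(query)  # indices of the samples to send to an expert annotator
```

Only the samples the model is least certain about are forwarded to the clinician, so the annotation budget is spent where a label is expected to be most informative.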
Data from multiple sources should be considered when performing annotation for specific clinical applications because single-source data might lack precise structured labels [103]. The integration of multiple source data is an important application of ML in healthcare [181], which is known as phenotyping [182]. NLP techniques and recurrent deep models can be used for extracting and integrating rich information from unstructured clinical notes to augment the capacity of data annotators.

D. Distributed Data Management and ML
In healthcare settings, data is generated in a distributed fashion, i.e., across different departments within a hospital and even across different hospitals. This necessitates the efficient management and sharing of distributed data for clinical analysis purposes, particularly when using ML/DL models. In general, for developing ML/DL models, it is assumed that complete training and validation datasets are centrally available and easily accessible, an assumption that does not hold when the data is spread across departments and institutions. Therefore, there is an increasing demand to develop methods for distributed data management and ML.
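One widely used aggregation rule for learning from such distributed data is federated averaging, in which each site trains locally and only model parameters are combined. The sketch below shows the aggregation step alone, under stated assumptions: the function name `federated_average`, the three hypothetical hospitals, and their parameter vectors and sample counts are illustrative, and local training, communication, and secure aggregation are omitted.

```python
import numpy as np


def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local sample counts."""
    weights = np.asarray(client_weights, dtype=float)
    sizes = np.asarray(client_sizes, dtype=float)
    return (weights * sizes[:, None]).sum(axis=0) / sizes.sum()


# Hypothetical parameters trained locally at three hospitals; only these
# vectors, not the patient records, are sent to the coordinating server.
client_weights = [[1.0, 0.0],
                  [3.0, 2.0],
                  [5.0, 4.0]]
client_sizes = [100, 100, 200]

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)
```

Weighting by sample count lets larger sites contribute proportionally more to the shared model; in practice this step would be repeated over many communication rounds.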

E. Fair and Accountable ML
The literature on analyzing the security and robustness of ML/DL approaches reveals that the outcomes of these models lack fairness and accountability [140]. Ensuring the fairness and accountability of predictions in life-critical applications like healthcare is of paramount importance. The fairness property requires that the ML model does not favor certain cases over others. Such discrimination mainly arises due to biases in the training data. The accountability property, on the other hand, is concerned with the interpretation of the predictions. Fairness and accountability will assist in developing models that are robust to biases and imperfections such as past clinical practices.

F. Model-Driven ML
Although ML, AI, and big data are immensely useful tools for healthcare, these tools are not a panacea, and it is important to be aware of the associated caveats and pitfalls [176]. Failing to realize this, one can easily fall prey to the dangerous dogma that data, once available in abundance, must and will speak for itself and can handle hypothesis generation as well, which in clinical terms would mean that data mining is sufficient and independent of the need for clinical interpretation, external validation, and an understanding of the data's provenance [183]. To avoid the various problems that can arise from improper use of ML in healthcare, it is important to combine data-driven methods with hypothesis-driven or model-based methods (based on subject matter knowledge) and to bring scientific rigor to these studies. Properly designed experiments are also necessary for deriving causal explanations. Avenues for developing secure and robust ML solutions for healthcare that are scientifically robust and rigorous require further attention from the community.

VI. CONCLUSIONS
The use of machine learning (ML)/deep learning (DL) models for clinical applications has great potential to transform traditional healthcare service delivery. However, to ensure a secure and robust application of these models in clinical settings, different privacy and security challenges should be addressed. In this paper, we provided an overview of such challenges by formulating the ML pipeline in healthcare and by identifying different sources of vulnerabilities in it. We also discussed potential solutions to provide secure and privacy-preserving ML for security-critical applications like healthcare. Finally, we presented different open research problems that require further investigation.