A Systematic Review on Federated Learning in Medical Image Analysis

Federated Learning (FL) has attracted considerable attention from academic and industrial stakeholders since its introduction. Its distinguishing feature is that data are handled in a decentralized manner, which creates a privacy-preserving environment for Artificial Intelligence (AI) applications. Medical data include highly sensitive private information about patients, which demands strong protection against disclosure to unintended parties. In this paper, we performed a Systematic Literature Review (SLR) of published research articles on FL-based medical image analysis. First, we collected articles from different databases following the PRISMA guidelines; then we synthesized data from the selected articles; and finally we provided a comprehensive overview of the topic. To do so, we extracted from the articles the core information associated with the implementation of FL in medical imaging. In our findings, we briefly present the characteristics of federated data and models, the performance achieved by the models, and, in particular, a comparison of results with traditional ML models. In addition, we discuss the open issues and challenges of implementing FL and give our recommendations for future directions in this research field. We believe this SLR summarizes the state-of-the-art FL methods for medical image analysis using deep learning.


I. INTRODUCTION
Image processing and image analysis are different tasks that often depend on each other when classifying image data. The history of image processing goes back to 1973, when an image of the Swedish model Lena became one of the first images used for image processing. Since then, image processing has been applied in dozens of research fields, and medical imaging is one of them. An image is essentially a 2D signal (vertical and horizontal) composed of a number of pixels [1]. Different types of images have different pixel parameters, and during analysis these parameters help to extract the relevant information from the image. The task of the analysis part, in turn, is to understand the processed images through different techniques, e.g., Machine Learning (ML), which comprises a range of algorithms. In the beginning, classical ML algorithms (e.g., SVM, naive Bayes, decision tree) were used broadly in image processing research. Later, with the introduction of deep learning, the field shifted to neural-network-based modeling, which is now an integral part of any image analysis task, including medical imaging. Every year the use of medical imaging for diagnostics increases worldwide. The image data mainly comprise radiological images such as X-ray, computed tomography (CT), and magnetic resonance imaging (MRI), as well as ophthalmology images, and so on. In addition, other data from the eye, skin, and cells contribute significantly to clinical imaging for detecting, diagnosing, and treating diseases [2]. It is becoming increasingly important that medical images taken by different devices can be sent from one system to another, which requires a computer network. A large collection of such images forms a dataset, which is typically stored and processed on cloud servers under an ML approach.

The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.
In the era of AI, collaborative learning, and more specifically sharing data among different institutions and multiple sources, can be very effective for building robust AI models. Since models in traditional ML are trained at individual centralized locations, collaboration between models is difficult. Moreover, X-ray, CT, and MRI images are personal data pertaining to individual patients and must be protected from the risk of disclosure to any unauthorized third party. In addition, even where data sharing is possible, storing, processing, and analyzing the data in a centralized manner remain difficult tasks. In such a scenario, data encryption and decryption could be a potential solution for exchanging information between participants; however, the process can be complex, time-consuming, and unsustainable [3]. So, instead of bringing the data to the location where the model is trained, why not bring the model to the data (the institutions and hospitals) and train it there in-house? This allows collaborative learning without centralizing the dataset itself, and it is called FL. It was first introduced in 2016 [4] and has gained a lot of traction in the healthcare domain within the last couple of years. It addresses the privacy and data protection concerns that are currently an important problem in developing medical AI. In FL, the participants train models locally and estimate the parameters of their respective models, then share the parameters with a centralized server that aggregates them. Therefore, the focus is not on which data are used or which algorithms are trained; the concept is to manage the data in a different way so that data privacy is preserved.

A. OBJECTIVE AND CONTRIBUTION
Since medical images are sensitive data, they need to be protected in a way that preserves users' rights over their personal information. As discussed, FL emerged to solve the data privacy issue in collaborative ML, and within a short time the concept has been applied in different fields, including medical imaging. Many articles have already been published on FL-oriented medical image analysis, and they successfully applied this data management technique in their research.
At this stage, it is time to look back and to review and assess what has been done so far and what impact FL has had on medical imaging. Some SLRs have been published on the topic; however, they covered healthcare applications overall rather than the medical image analysis context in particular. An SLR was presented in [5], which considered all articles that used any form of medical data to train FL models. Similarly, in [3], [9], and [6] the authors surveyed and reviewed papers from the whole healthcare area. Some review articles addressed specific medical domains; for example, Naeem et al. [10] worked specifically on brain tumor diagnosis using MRI images. Since FL is a comparatively new concept, most of the review articles emphasized design and implementation. Secondly, they discussed the privacy and security opportunities, which are the fundamental characteristic of FL. Some of them ([5], [7], and [10]) were built around different research questions; a common question concerned the state-of-the-art FL methods, and data properties, impact, gaps, and future research were also investigated. In addition, several survey articles have been published on FL for healthcare informatics. Xu et al. [11] surveyed papers that focus on FL in the biomedical area. Their aim was to summarize the privacy, statistical, and system challenges that exist in this specific domain. In a well-known article in this field [12], the authors discussed the prime factors related to FL in digital health, along with challenges and solutions.
This study is an SLR in which we exclusively investigated FL in medical image analysis and examined every component of the considered articles, especially the performance analysis and the comparison with conventional ML, which is the main distinction between our study and previously published review papers. Our study comprises several research questions, and by answering them we illustrate the current research landscape in the field of medical image processing using FL. In addition, several observations are discussed based on the findings extracted from the literature. Table 1 shows a comparative analysis of our contribution and related review articles; our study explored the demographic data, FL architecture, privacy-preserving concerns, federated data management, and performance of FL models. We did not find any review article that focuses specifically on medical images. Consequently, this study can serve as an outline for future research on FL applications in medical imaging. The following are the key contributions of our paper:
• We surveyed the insights of FL solely in medical image research in a systematic way.
• We provided the latest implementations, advancements, and tendencies in medical image analysis research using FL from different aspects.
• We presented and compared the performance of the different FL architectures used in the reviewed articles with traditional ML models, which is the first comparison of its kind.
• For incoming contributors, we discussed open issues, challenges, and future directions of the research field.
The rest of the article is structured in six sections. The basic FL concept is introduced in Section II. Section III describes the procedures of this review. The results of this investigation are presented through different research questions in Section IV. Open issues and challenges are discussed in Section V. Section VI covers the limitations of this study. Lastly, the conclusion and future directions are provided in Section VII.

II. FEDERATED LEARNING
In this section we give an overview of the FL architecture. The concept of FL is not directly tied to the ML components; it is a data management process that lets multiple clients collaborate in a privacy-preserving manner. As a practical example, consider a hospital that produces some data and also has a model and some computing resources with which it would like to tackle a specific problem using an AI system. Suppose, moreover, that the dataset in the institution is not sufficient to train a model able to address this problem. Another hospital facing a similar difficulty wants to work together on this premise, since both have a common goal and can solve a common task. However, the hospitals hold different data locally, and they need to benefit from each other's data without sharing the data directly. This collaborative model training without sharing the data is exactly the purpose of FL.
In Fig. 1, we present FL on the left and the traditional ML framework on the right to illustrate the fundamentals of both in a hospital environment. As supplementary information, Table 2 lists the necessary keywords related to decentralized FL implementation together with their explanations. Since FL involves multiple sources of data, we show four clients in the figure. The clients share a few common duties: they collect the data from the hospitals, train local ML models on them, and estimate some parameters. Every client sends these parameters, not the data itself, to the central server. Once the central server has received all the local models' parameters, it aggregates them by taking the weighted average; the result is known as the global model and is sent back to all of the clients. This completes one learning round, and the process is repeated for the next round.
A well-known federated averaging algorithm is FedAvg [13], proposed by Google in 2016, which calculates a weighted average over the individual clients. The quantity of data is not expected to be the same across clients; sources with larger datasets receive correspondingly larger weights on their losses, and the individual client losses are combined into an overall global loss in the form of a weighted average. Under FedAvg, every client trains a model for a defined number of epochs using the Stochastic Gradient Descent (SGD) algorithm and transmits the learned parameters to the central server, which performs the aggregation in the form of averaging. The mathematical presentation of FedAvg is:

$$\min_{w} f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad n = \sum_{k=1}^{K} n_k$$

In this function there are $K$ clients, indexed by $k$, and each client has its own loss function $F_k(w)$. Each of the losses is weighted by the size of the client's dataset $n_k$ relative to the total number of samples $n$. Hence, the overall objective is to minimize a global loss that is a weighted combination of the local losses; each local loss is computed on private data that are never shared, and only model updates are shared. Apart from FedAvg, there are many ongoing research directions and varieties of FL, such as SecAgg [54]. Although different combinations exist in FL implementations, two characteristics are expected to hold: the datasets are distributed and remain local rather than centralized, and the clients train a collaborative model towards the same goal.
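The server-side averaging step described above can be illustrated with a minimal sketch. It assumes each client sends its parameters as a flat list of floats together with its dataset size; the function name `fedavg` and the example values are hypothetical and are not taken from any of the reviewed articles.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Return the dataset-size-weighted average of client parameter vectors."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)          # n = sum of all n_k
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, n_k in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Each client contributes in proportion to n_k / n.
            global_weights[i] += (n_k / total) * w
    return global_weights


# Two hypothetical clients with different amounts of local data.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]
global_model = fedavg(clients, sizes)
print(global_model)
```

In this toy setting the second client holds three times as much data, so the global parameters lie closer to its values. A full learning round would additionally send `global_model` back to the clients for the next round of local training.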

III. RESEARCH METHOD
Several types of review article are available in the literature for in-depth research, such as narrative reviews and systematic reviews. As mentioned at the beginning, we conducted a systematic review for our investigation.
Two SLR methods are mainly popular in practice: one is PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) and the other is Kitchenham's guidelines; the latter is mainly used in computer science and software engineering research [14]. To conduct this review we followed the PRISMA procedures, which are the most common way of performing an SLR in the healthcare sector [6]. For an SLR, we first need to identify relevant articles that focus on a very specific research area and question(s), then appraise the quality of the studies and the strength of the evidence in the papers, and lastly synthesize the findings to draw conclusions. Fig. 2 shows, in sequence, all of the steps taken to conduct this review.

A. RESEARCH QUESTION
Our first step in this review was to establish a group of questions that describe the literature in the most effective way. Table 3 shows the five contexts and their associated 12 research questions. The first context is the overview, which covers the applications of FL and the problems it solves; the second gives a broad explanation of the datasets; the third covers the ML framework; the fourth discusses the implementation of FL, including the privacy methods and the types of FL; and the last presents the experimental aspects, especially the performance comparison.
VOLUME 11, 2023

B. SEARCH PROCESS
Since FL was first presented in 2016, the search was limited to the period from 1 January 2017 to 30 June 2022. We searched all of the common databases considered by previous researchers, for example, Science Direct, IEEE Xplore digital library, Springer Link, Wiley Online Library, SPIE digital library, ACM digital library, Multidisciplinary Digital Publishing Institute (MDPI), Nature Portfolio, Taylor & Francis, and Google Scholar. The search criteria differ across the platforms, so we used the advanced options of each database to search for articles with Boolean ''AND'' and ''OR'' expressions. Our study focused on the implementation of FL in healthcare image processing, so we carefully avoided other applications. The search phrases were matched against the titles, abstracts, and keywords in each of the databases. Fig. 3 depicts the PRISMA flow diagram, which presents the full statistics of the articles considered in this review. After the search, the initially collected articles went through a selection process, which is described in the next sections.

C. INCLUSION AND EXCLUSION CRITERIA
A literature search is challenging when it returns too many papers; in an SLR this is handled by predefined inclusion and exclusion criteria. This may include limiting the search to certain types of studies. These criteria help ensure that the task is carried out properly, reduce the possibility of bias, and protect the selection process from irrelevant research documents. We applied the inclusion and exclusion criteria to the articles collected from the databases to arrive at the material that readers are looking for. We emphasized the following points to include articles in the final analysis:
• Articles that studied medical image datasets.
• ML models developed within an FL environment.
• FL was the main focus of the findings (result analysis/comparison).
Since we performed a keyword search, articles were collected based on the words present in the paper, even if a keyword was mentioned only once. Therefore, we excluded the articles that were not relevant and did not fall within our scope, based on the following criteria:
• Articles that used private dataset(s) for the ML model.
• Studies that are not mainly focused on FL and medical image data.
• Hybridizations or modifications of the FL theme, e.g., federated reinforcement learning.
• Abstracts, short articles, pre-prints, and books or book parts.
• Articles that do not clearly present results using ML-based performance measures (e.g., [85], [86]).
The effect of the inclusion and exclusion criteria can be seen in Fig. 3. It shows that 161 articles were initially collected from the different databases. After removing duplicates, 138 articles remained for the further steps. We then screened the articles twice under two different conditions: first, we examined the title and abstract, which removed 96 articles; then we thoroughly investigated the full text of the remaining 42, which disqualified another 25 papers. Finally, 17 of the 161 articles were retained for our review.
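The selection counts reported above can be checked with a few lines of arithmetic. The following sketch uses only the numbers stated in the text; the variable names are illustrative.

```python
# Counts as reported for the PRISMA selection process.
identified = 161               # articles initially collected from all databases
after_deduplication = 138      # articles left once duplicates were removed
excluded_title_abstract = 96   # removed in the title/abstract screening
excluded_full_text = 25        # removed in the full-text screening

duplicates_removed = identified - after_deduplication
full_text_assessed = after_deduplication - excluded_title_abstract
included = full_text_assessed - excluded_full_text

print(duplicates_removed, full_text_assessed, included)  # 23 42 17
```

The derived values agree with the 42 full-text articles and the 17 included articles reported in the text, and imply that 23 duplicates were removed.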

D. DATA EXTRACTION
Data collection was driven mostly by the research questions of our study; we extracted the information needed to answer the questions thoroughly. First, we created a spreadsheet with the respective information headers at the top. We worked through the 17 articles individually, recording the information for each article separately in the spreadsheet, and these records were used as our findings. The following data were extracted from every article:
1) Document title, publication year, and journal/conference name.
2) Datasets used and their federated settings.
3) The security or privacy protocol used for FL.
4) The algorithms used to train ML models.
5) Performance of the FL model.

IV. RESULTS
We organized this section according to the research questions described in Section III. In the following sections, we first present the demographic analysis (also known as numerical analysis) data along with the key contributions and limitations of each reference work in Table 4; thereafter we answer the 12 questions in turn.

RQ1 What are possible applications of FL?
We found applications of FL in different research fields, such as Diabetic Retinopathy (DR), MRI classification, cancer, pneumonia, COVID-19 detection, and a few more. These topics are popular in medical image processing research with conventional ML. FL opens a new scope of research here because of the privacy protection it offers, which is essential for this kind of imaging research.
Coronavirus disease emerged in 2019, spread all over the world, and created a crisis regarding the identification of COVID-19 samples. The RT-PCR test is the most reliable method for diagnosing the disease, but owing to a shortage of testing kits and some technical limitations, researchers explored alternative ways of COVID screening. As a result, hundreds of ML-based automated and time-saving COVID-19 detection models have been presented within the last two years [33]. ML-based COVID analysis is mostly carried out on radiological chest images, i.e., X-ray and CT images. Among these contributions, several detection models were also implemented with FL, since data privacy was a major concern. In this study, we found that six of the 17 articles worked specifically on COVID-19 detection. Feki et al. [18] proposed a collaborative FL approach for COVID-19 screening from chest X-ray images, in which multiple medical institutions cooperated without sharing their data. Similarly, Zhang et al. [24] and Yan et al. [29] used X-ray and CT image data with different Convolutional Neural Network (CNN) architectures in FL settings. References [21], [25], and [32] also contributed to COVID-19 detection in a multinational way. However, such artificial intelligence tools were not used clinically to any significant extent to diagnose COVID-19 during the pandemic; all of them were experimental, and hopefully these contributions will help future initiatives.
Millions of patients worldwide suffer from fatal diseases, and cancer is foremost among them. Researchers have shown that early detection of cancer can save a large number of lives [34]. Consequently, deep learning has emerged as a promising tool for early cancer detection with the help of medical images. It extracts features from the raw images and provides decisions regarding cancer detection with notable performance. As an ML technique, FL has been considered in several cancer diagnosis studies; Fig. 4 shows that 29.4% of the articles in this review (five out of 17) addressed cancer detection. Połap and colleagues published three research papers [17], [19], [22], all of which focused on skin cancer detection in an FL environment. They used seven different skin marks (classes) to train the detection models and successfully implemented privacy-protected FL. Moreover, Hashmani et al. [30] applied FL to a series of dermoscopy images to classify nine different skin diseases. Nowadays, cancers of important internal organs of the human body, such as the lung and breast, are among the leading causes of cancer death. An FL-oriented lung cancer detection model was proposed by Adnan et al. [28]. They demonstrated that their model achieved acceptable performance when a decentralized data configuration was applied.
Another domain is Diabetic Retinopathy (DR) analysis. Diabetes is a chronic disease that affects millions of people globally, and uncontrolled diabetes can lead to serious damage to the body's systems, including the eyes. DR is a common diabetic eye disease and a leading cause of vision loss and blindness in the world. It occurs when diabetes damages the small blood vessels of the retina. In a primary care clinic, retinal images can be transmitted to an eye care specialist, who examines the images and then provides a consultation. These days, however, deep learning algorithms can detect DR within seconds with high accuracy. Lo et al. [16] analyzed retinal images to classify DR-positive and non-DR samples using the FL approach. In another article, Zhou et al. [31] introduced an FL framework that classifies five severity categories of DR, 0 to 4 (No DR to Proliferative DR).
Linardos et al. [26] considered FL for diagnosing Hypertrophic Cardiomyopathy (HCM), i.e., determining whether subjects suffer from HCM or are normal. In addition, a multilabel cardiac disease classification was proposed by Chakravarty et al. [23], in which 14 classes were examined. Other applications include Autism Spectrum Disorder (ASD) detection. Li et al. [20] applied deep learning in an FL environment to classify MRI images; their model identifies ASD using MRI analysis. We also found that FL is used in pneumonia detection: Kaissis et al. [27] proposed a model that is able to detect different pneumonia samples.
RQ2 What problems were solved?
Almost all of the articles considered in our investigation addressed one universal problem: ensuring the security of private data. Data are always a key factor when training an ML model, and protecting the data from potential security and privacy threats is a challenge. These threats are even more critical in Electronic Health Record (EHR) data analysis. Shared EHR data include patients' private information, and above all their identity could be at risk of public exposure. Similarly, in medical image analysis, maintaining the privacy of users' data such as X-ray, CT, and MRI images is difficult with the traditional ML layout. FL is a privacy-preserving way of training AI algorithms: it moves the model to the data rather than the data to the model, which makes it very useful in cases where sensitive data cannot be shared. Since researchers have long been working on the application domains discussed in the previous section, they have now applied the same approaches with privacy-preserving FL in their experimental research.

B. DATASET
RQ3 What type of dataset used?
We divided the datasets used in the 17 investigated articles into several categories based on image type. The data type varies from model to model and depends on the domain in which the model will be applied; for example, skin images are used to detect skin cancer. Fig. 5 displays eight different types of medical images collected from the various datasets used in the research field.
Lung X-ray image: As mentioned, severe cases of some diseases affect particular organs of the body, and the lung is one of them. The literature shows that complications of COVID-19 and pneumonia include lung damage, which is the reason for using lung X-ray and CT images in detection models for such diseases. Likewise, smoking is known to be harmful, particularly to the lungs, and is a key cause of lung cancer. According to our investigation, six of the 17 articles [18], [23], [25], [27], [29], [32] used chest X-ray images. The X-ray datasets considered in the articles are Cohen JP, TB x-ray, CheXpert, Mendeley data, COVIDx, Chest X-ray (CXR), and the COVID 2019 dataset. Moreover, Zhang et al. [24] proposed an FL-oriented COVID-19 detection model in which chest X-ray and CT images from three datasets were considered: Qatar-Dhaka data, COVID-CT, and the Figure 1 dataset.
Skin image data: MNIST: HAM10000 is one of the leading datasets used in skin cancer detection research with deep learning techniques. This repository contains 10,015 dermatoscopic images divided into seven different classes. Połap and colleagues used the dataset in a series of articles [17], [19], [22] in an FL environment. Similar data were used in [30]. Adnan et al. [28] used tissue image data; more specifically, they proposed a privacy-guaranteed ML model in which lung tissue images were used to classify cancer. In addition, Li et al. [20] and Linardos et al. [26] both used MRI images for their models, namely brain and heart MRI data, respectively. For easy access, all of the dataset names are listed with their references in Table 5.
RQ4 Are the number of data samples sufficient?
In ML research, it is well established that the more data we have for training, the better the predictions we can expect from the models. The risk of overfitting also increases with a smaller dataset, so it is always advisable to use a larger one. For our study, we analyzed the 17 articles by the number of data samples used in each paper. First, we discuss the articles that used fewer than 1,000 samples. As Table 5 shows, four papers [16], [18], [20], and [26] used very small amounts of data, with 153, 216, 370, and 180 samples, respectively. Since larger datasets offer better potential for insightful analysis, a dataset of only 153 samples can hardly be considered sufficient. Next, within the range of up to 10,000 samples, seven papers [21], [24], [25], [27], [28], [31], and [32] used between 2,109 and 6,284 samples, which is reasonably good. Finally, we found six papers [17], [19], [22], [23], [29], [30] that each used more than 10,000 images. CheXpert, the largest dataset found in our investigation (223,414 CXR images overall), was considered by Chakravarty et al. [23].
RQ5 Are non-IID data distribution considered?
Federated frameworks take two forms according to the data distribution, IID and non-IID. IID stands for independent and identically distributed. The term has two parts, independence and identical distribution. Independence means that the value (data) of one example does not affect the value of another. This scenario is commonly illustrated by coin flipping: each time a coin is flipped, the result of one flip does not depend on the result of any other. Identically distributed means that the probability of any specific outcome is the same; for example, every time a coin is flipped there is a 50% chance of getting heads and a 50% chance of getting tails, and these probabilities do not change from flip to flip. Non-IID data violate one or both of these conditions. While the feature distribution of IID data is the same across clients, it differs across clients in the non-IID case. The problem is quite common in real life; for example, the appearance of medical image samples acquired with different machines at different hospitals may not align because of different imaging protocols. In non-IID settings, therefore, values may depend on each other and exhibit overall trends. Generally in FL, local models are trained independently and the data distributions are hidden from one another; as a result, data types and features can vary from client to client [6], [53], and this variation makes the consideration of non-IID data important in FL research. In this study, we investigated FL as used in medical image analysis. We observed that the FL data structure is complicated, especially when the local clients' data differ significantly from each other. Our results show that only four papers considered non-IID data along with IID data (we did not find sufficient explanation in [26] and [21]), and the remaining 13 did not discuss the issue. In [18], Feki et al. divided the collected dataset into four parts to form the clients' data; for the IID setting, they used an equal number of images per client and per class. For non-IID data, they allocated the samples among classes unequally, by a ratio of 66% and 44%. Likewise, Adnan et al. [28] performed FL with IID and non-IID data separately, where the number of samples differed between clients in the non-IID scenario.
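The difference between the two settings can be made concrete with a small partitioning sketch. It assumes a hypothetical two-class dataset of 400 samples and four clients, and it uses label skew as one simple way of producing non-IID clients; this is an illustration only and not the exact scheme used in [18] or [28].

```python
import random
from collections import Counter


def iid_partition(labels, num_clients, seed=0):
    """Shuffle the sample indices and deal them round-robin, so every
    client receives a similar mix of classes."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    return [indices[c::num_clients] for c in range(num_clients)]


def label_skew_partition(labels, num_clients):
    """Sort the indices by label and cut contiguous shards, so each client
    is dominated by one or a few classes (a simple non-IID setting)."""
    indices = sorted(range(len(labels)), key=lambda i: labels[i])
    shard = len(indices) // num_clients
    parts = [indices[c * shard:(c + 1) * shard] for c in range(num_clients)]
    parts[-1].extend(indices[num_clients * shard:])  # leftovers, if any
    return parts


labels = [0] * 200 + [1] * 200  # hypothetical two-class dataset
iid_clients = iid_partition(labels, 4)
skewed_clients = label_skew_partition(labels, 4)

for name, parts in (("IID", iid_clients), ("non-IID", skewed_clients)):
    print(name, [dict(Counter(labels[i] for i in part)) for part in parts])
```

In the IID split every client sees both classes in roughly equal proportion, whereas in the label-skewed split each client sees a single class. Real-world non-IID data may instead differ in feature distribution or in sample count per client.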

RQ6 Which ML algorithms are used to train local models?
Although FL is the main focus of this investigation, the ML techniques make the actual difference when it comes to the overall performance of the models. As usual in the FL framework, the data of each client are used to train a model with ML algorithms. Our review is based on medical image data, and such image analysis or computer vision tasks are mostly conducted with CNN-oriented deep learning models. To answer the question, we examined each of the considered articles and found that a variety of established CNN models were used, such as VGG16, Inception, ResNet18, and many more. VGG16 is a widely used, reliable, pre-trained model; five of the 17 surveyed papers considered this CNN model. The model consists of 16 layers: 13 convolutional and 3 fully connected layers. Likewise, VGG19 is a 19-layer CNN model and was used by Lo et al. [16]. Residual Network (ResNet) is also a commonly used architecture that can be built with different numbers of layers, e.g., ResNet18 ([23], [26], [27], [29]), ResNet50 ([18], [24]), and ResNet101 ([24]). Other pre-trained CNN models are Inception ([17], [19], [22]) and AlexNet ([17], [19]). In addition, customized CNN-based deep learning models were used in several articles, which are listed in Table 6. Li et al. [20] used a multi-layer perceptron (MLP) classifier, a deep neural network constructed from one input, one hidden, and one output layer. Adnan et al. [28] performed image segmentation using a supervised learning approach called Multiple-Instance Learning (MIL) to train the local models.

RQ7 Are any additional security methods implemented?
Data privacy and data security are not the same in practice: privacy covers the use of data (control, access, and regulation), whereas security concerns the potential threats of unauthorized access and malicious attacks. FL mainly addresses the privacy concern, since the trained models of the stakeholders are shared instead of the data themselves. Still, sharing models can be vulnerable when parameters are exchanged between clients and servers, which poses a possible threat to system security [28]. Several additional privacy-preserving methods have been described in a systematic review article [83]. However, we found only a few articles that considered additional security measures in FL-based medical imaging research. Of the four articles that did so, three used Differential Privacy (DP). DP allows companies to collect information about their users without compromising the privacy of any individual; the ultimate goal is to be able to share information about a dataset with other people without revealing individuals' Personally Identifiable Information (PII) from the dataset [9], [84]. Li et al. [20] used two different DP mechanisms, Gaussian and Laplace. They defined a noise level α, which varied from 0.001 to 1. Similarly, Kaissis et al. [27] applied both techniques, and Adnan et al. [28] used only Gaussian noise in their experiments. In addition, Połap et al. [17] used encryption and blockchain techniques to make their FL model more secure. They proposed three different learning agents, with the blockchain technique applied in the Data Management Agent (DMA). According to their description, every patient's data (images) must have a unique ID; when a request for analysis arrives, the agent checks whether the ID exists in the database, and if not, it creates a unique ID and a block in the blockchain, then transfers the ID to the database along with the image.
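The two noise mechanisms mentioned above can be sketched as follows. This is a minimal illustration that only shows how Gaussian or Laplace noise may be added to a client's parameters before they are shared; the parameter values and the `noise_level` variable are hypothetical, and a formal DP guarantee would additionally require bounding the sensitivity of the update (e.g., by clipping) and calibrating the noise to a privacy budget, which is omitted here.

```python
import random


def add_gaussian_noise(params, sigma, rng):
    """Perturb each parameter with zero-mean Gaussian noise of
    standard deviation sigma."""
    return [p + rng.gauss(0.0, sigma) for p in params]


def add_laplace_noise(params, scale, rng):
    """Perturb each parameter with zero-mean Laplace noise of the given scale.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace(0, scale) distribution.
    """
    rate = 1.0 / scale
    return [p + rng.expovariate(rate) - rng.expovariate(rate) for p in params]


rng = random.Random(42)
local_update = [0.5, -1.2, 3.0]  # hypothetical parameters of one client
noise_level = 0.01               # hypothetical value, for illustration only

noisy_gaussian = add_gaussian_noise(local_update, noise_level, rng)
noisy_laplace = add_laplace_noise(local_update, noise_level, rng)
print(noisy_gaussian)
print(noisy_laplace)
```

A larger noise level hides individual contributions more strongly but also distorts the shared parameters more, which is the usual trade-off between privacy and model utility.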
RQ8 What types of federated data partitioning are used?
Previous literature mainly describes three categories of FL, based on how the training data is distributed across the models. Among the three types, Federated Transfer Learning (FTL) and Vertical FL (VFL) are rarely considered in medical research, while Horizontal FL (HFL) is widely used. In a horizontal partition, the clients' databases hold different customers but collect the same type of data on those customers; in other words, ''same features, different samples''. In vertical FL, the clients' customers overlap, but each client collects different features on them; more specifically, ''different features, same samples'' [3], [9], [84]. In this investigation we focused on medical image research and found that most of the articles were based on HFL. For example, Feki et al. [18] utilized HFL: they used a chest X-ray image dataset in which the features are the same for all clients but the samples are different. Interestingly, Kaissis et al. [27] used two different datasets for training and testing their FL models; both datasets contain X-ray images (same features) but different data. We classified only two articles as VFL. In [24], the authors took three datasets, two based on X-ray and one on CT images, combined both types of images, and used them to train and test the models. In [25], the authors used X-ray and ultrasound images for their federated models. X-ray images are technically different from CT or ultrasound images, so their features also differ; because different data features were used in different clients, we regard these as VFL scenarios.
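The difference between the two partitioning schemes can be sketched on a toy table, following the customer-based description above (horizontal: different samples with the same features; vertical: overlapping samples with different features). The patient IDs, feature names, and helper functions below are hypothetical and only serve the illustration.

```python
# Toy table: keys are patient IDs (samples), inner keys are features.
records = {
    "p1": {"xray_mean": 0.41, "xray_std": 0.12, "ct_mean": 0.63},
    "p2": {"xray_mean": 0.38, "xray_std": 0.10, "ct_mean": 0.59},
    "p3": {"xray_mean": 0.52, "xray_std": 0.15, "ct_mean": 0.71},
    "p4": {"xray_mean": 0.47, "xray_std": 0.11, "ct_mean": 0.66},
}


def horizontal_split(table, sample_groups):
    """Each client receives different samples but all features."""
    return [{pid: dict(table[pid]) for pid in group} for group in sample_groups]


def vertical_split(table, feature_groups):
    """Each client receives the same samples but a different feature subset."""
    return [
        {pid: {name: row[name] for name in group} for pid, row in table.items()}
        for group in feature_groups
    ]


hfl_clients = horizontal_split(records, [["p1", "p2"], ["p3", "p4"]])
vfl_clients = vertical_split(records, [["xray_mean", "xray_std"], ["ct_mean"]])
```

In the horizontal case the clients can train the same architecture directly, because every local table has the same columns. In the vertical case the records have to be aligned by sample ID before any joint training is possible.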
RQ9 What are the federated frameworks used?
Table 6 presents the deep learning architectures that were used for training the local models (discussed in RQ6), followed by the federated framework, which is mainly the approach used to aggregate the collected local models in the central server. We observed that federated mechanisms are executed in two ways: some articles relied on previously proposed, built-in FL algorithms, and others implemented the basic concepts of aggregation themselves. FedAvg (discussed in Section II) is a commonly used method for federated aggregation; as Table 6 shows, six articles considered this algorithm. Likewise, [20] and [27] used two different federated algorithms, named Fed and secure aggregation (SecAgg), respectively. SecAgg is a secure model aggregation method for FL, also proposed by Google in 2016. Połap and Woźniak [19] proposed a federated model based on meta-heuristic search: first they calculated the average loss of all local models, and then they selected for aggregation in the server only the models that scored higher than the average loss. All of these approaches pursue the fundamental concepts of FL but implement them in different ways. However, the federated aggregation process described above has no impact on the model performance; it is all about engineering the data distribution in a decentralized and collaborative manner.
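As an illustration of the aggregation step, the following is a minimal sketch of the FedAvg weighted average for a single communication round. It assumes each client sends a flat list of parameters together with its number of training samples; the function name and the toy values are hypothetical and do not reproduce the setup of any reviewed article.

```python
def fed_avg(client_weights, client_sizes):
    """Average client parameters, weighted by each client's sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


# Two hypothetical clients with two parameters each.
client_weights = [[1.0, 2.0], [3.0, 4.0]]
client_sizes = [100, 300]          # number of local training samples
global_weights = fed_avg(client_weights, client_sizes)
print(global_weights)              # [2.5, 3.5]
```

The client with 300 samples pulls the global parameters three times as strongly as the client with 100 samples, which is the design choice that distinguishes FedAvg from a plain unweighted mean.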

E. EXPERIMENTAL
RQ10 What are the performance measures used in the studies?
The final and essential step of any ML setting is to assess how good the model is through performance evaluation. The basic idea is to develop an ML model using some training samples and to test the trained model on other, unseen data. The training error is not very useful for the actual evaluation, because it is easy to overfit the training data with complex models that do not generalize well to future samples. In contrast, the testing error is the key metric, since it is a better approximation of the true performance of the model on future samples. Therefore, we only considered testing performance throughout our review. As we found in this investigation, both classification and segmentation tasks were addressed, which is why their performance was evaluated in different ways. Fig. 6 presents the number of articles using each performance metric. Most of the experiments (14 out of 17) were evaluated by accuracy. Recall was the second most commonly used measure, considered by five articles. The Area Under the Curve (AUC) score was used three times and precision twice.
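For reference, the threshold-based measures named above can be computed from the confusion counts of a binary classifier, as in the following minimal sketch. The labels and predictions are made-up test data, and the function name is hypothetical; AUC is omitted because it requires ranking scores rather than hard predictions.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical held-out test labels and model predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)   # {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```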

RQ11 How is the performance of the FL frameworks reported?
This question provides an overview of the performance achieved by the FL-based models in the 17 articles. Performance assessment is the final part of any ML experiment, where the model is evaluated by different metrics. Our investigation revealed that 14 articles worked on data classification (binary and multi-class), one article worked on data segmentation, and the remaining two considered both (listed in Table 4). The performance of classification tasks is usually assessed by accuracy, which represents the proportion of correctly identified samples among all of the data [14]. We divided the performance of the 17 studies into three categories according to the achieved accuracy: high (>=90%), medium (80%-89%), and low (<80%). Table 7 summarises the performance scores of all articles.
High: We found eight articles with an accuracy of 90% or more. Feki et al. [18] performed binary classification and achieved the highest accuracy: 94.4% for the FL+VGG16 model with data augmentation and 93.57% for FL+VGG16. Połap and Woźniak [19] used the inception91 classifier for the FL model and obtained an accuracy of 91%. The scores of [24], [25], and [30] are not stated clearly; they report accuracy between 90% and 95%. Articles [32] and [27] achieved an accuracy of 90.61% and 90%, respectively. Yan et al. [29] presented their results using sensitivity; their highest score was 91.26%.
Medium: In [22], the author classified the images as diseased or not diseased; the proposed VGG-based FL model achieved 89.82% accuracy. Lo et al. [16] performed both classification and segmentation on different datasets: the classification and segmentation accuracy for the SFU dataset were 88% and 85%, respectively, and the classification accuracy for the OHSU dataset was 89%. In [26], Linardos et al. considered AUC; the highest score achieved by the FL model was 89%. Adnan et al. [28] conducted binary classification with an accuracy of 85%.
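The three-way banding used in this question can be written as a small helper, sketched below. One assumption is made explicit: scores that fall between 89% and 90% are treated as medium, which is consistent with the 89.82% result being listed under Medium. The function name is hypothetical.

```python
def performance_band(score_percent):
    """Map a reported score in percent to the high / medium / low band."""
    if score_percent >= 90.0:
        return "high"
    if score_percent >= 80.0:     # assumption: 89 < score < 90 counts as medium
        return "medium"
    return "low"


# Scores taken from the text of this review.
for score in (94.4, 89.82, 85.0, 70.0):
    print(score, performance_band(score))
```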
RQ12 How does the FL approach perform compared to the conventional models?
The last research question explores the comparative performance of FL and traditional ML in medical image processing research. This query is important for discussing the effect, contribution, and drawbacks of using FL in medical image analysis. To answer it, we collected experimental results from both areas: the 17 FL articles and their relevant conventional counterparts. We already described the performance of the FL models in the previous question; here we present the results of the usual ML models and then the comparative analysis. Table 7 summarizes the performance of all articles in this review, and opposite each FL article it lists the results of one or more similar ML articles to form a comparison chart. To do so, we investigated dozens of research papers that analyzed medical images with traditional ML in order to find the best matching options, which was essential for a reliable comparison. The matching criteria were based on the structural and experimental similarity between the ML and FL papers; that is, we considered papers that used similar datasets, algorithms, and performance measures. We expect that maintaining this condition ensures an accurate comparison between the two groups. Table 7 shows that all of the ML models have higher accuracy than their respective FL models in the existing literature; more specifically, we found better ML results for every FL article. For instance, Połap et al. [17] achieved an accuracy of 70% with federated VGG16 and 67% with Inception, whereas on the ML side Jain et al. [56] achieved 79.23% accuracy with Inception and Liu et al. [57] 87% with ResNet50; all three considered the MNIST: HAM10000 dataset. Likewise, Chakravarty et al. [23] reported an AUC score of 80% in the FL environment, but with the same dataset and ML algorithm, articles [67] and [68] reported 86% and 87% AUC, respectively.

V. OPEN ISSUES AND CHALLENGES
FL is still a young research field, so it is difficult to draw a conclusion on its rejection or acceptance. Here we discuss the issues and challenges found in the reviewed articles regarding the application of FL to medical images. FL was invented to address the privacy concerns of private data; unfortunately, it does not cover all potential privacy threats [93]. Below, we describe the privacy, data heterogeneity, model performance, and federated architecture issues found in the review.

A. PRIVACY AND SECURITY
Medical image data contains personal information of patients, and no one can share this data for AI applications without reliable data protection. FL enables collaboration between different institutions with some privacy guarantees through an advanced data management and model construction process, as described in Section II. Unlike centralized ML, FL exposes the training process to multiple parties. We do not know the motive of every participant, so trust among them is an issue, and the additional communication increases the risk of data leakage via reverse engineering. We observed two further privacy measures used in federated medical image processing: differential privacy and secure aggregation. Differential privacy involves adding carefully selected noise to the outputs and can be applied either at the individual clients or at the server level. Secure aggregation is a cryptographic technique (e.g., blockchain technology) that ensures the server can only see the aggregate of thousands of updates rather than individual model updates. In reality, however, every privacy mechanism imposes a significant computational cost on the federation.
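The idea that a server sees only the aggregate of the updates can be illustrated with a toy sketch of pairwise additive masking. This is a simplification for illustration, not the SecAgg protocol itself: it assumes integer-quantized updates, generates the pairwise masks in one place instead of deriving them from a key agreement between clients, and ignores client dropouts. All names and values are hypothetical.

```python
import random

MODULUS = 2 ** 32   # all arithmetic is done modulo a fixed integer


def mask_updates(updates, rng):
    """Add pairwise masks that cancel out when all masked updates are summed."""
    n_clients, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            for k in range(dim):
                mask = rng.randrange(MODULUS)
                masked[i][k] = (masked[i][k] + mask) % MODULUS   # client i adds
                masked[j][k] = (masked[j][k] - mask) % MODULUS   # client j subtracts
    return masked


def aggregate(masked):
    """Server-side sum; the pairwise masks cancel, leaving the true total."""
    dim = len(masked[0])
    return [sum(u[k] for u in masked) % MODULUS for k in range(dim)]


updates = [[3, 10], [5, 20], [7, 30]]          # hypothetical quantized updates
masked = mask_updates(updates, random.Random(42))
total = aggregate(masked)
print(total)                                   # [15, 60]
```

Each masked update looks like random noise on its own, yet the sum equals the sum of the original updates. The extra mask generation and exchange is one source of the computational and communication cost mentioned above.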

B. DATA HETEROGENEITY
Our investigation shows that data heterogeneity can occur in two ways: the number of samples differs among the clients (non-IID data), or the data features differ among the clients (VFL). The amount of data produced in hospitals is usually not identical, and in FL the clients can have different data distributions. This uneven distribution of data on the client side might send opposing gradient updates to the server, which is challenging to tackle. Furthermore, in practice the features of federated datasets are not the same in many cases. For instance, X-ray and CT image data can be used in two different clients, which causes trouble when the model parameters are aggregated centrally in an FL setting.

C. OVERALL MODEL PERFORMANCE
The first impression of an AI model is its performance, that is, how accurately the model accomplishes the task. A model with high accuracy is more acceptable than a model that achieves a lower score. We previously discussed the performance of the federated models and compared them with traditional ML models (RQ12). Our findings show that FL failed to perform better than ML with similar model structures; this drawback calls on us to reevaluate the usefulness of FL in medical imaging.

D. FEDERATED ARCHITECTURE
Training a personalized model on each client is not difficult in FL; problems emerge when all of the model outputs are transferred to the central server and pass through an aggregation process. We observed that the federated models presented in the reviewed articles are mostly theoretical and rarely implemented in practice, and few articles published their open source code. Since research in the field started only a couple of years ago, the research methods and materials need to be more easily accessible to future researchers. Besides, research is usually conducted in a very controlled setting, and questions arise when we aim to simulate a real-world scenario with huge datasets.

VI. LIMITATIONS
In this section we acknowledge the limitations of this study. First, we searched all prominent databases for article collection, but some journals and conference proceedings were available only by subscription. In some of these cases, we could not obtain the papers from the sources. We tried an alternative way, sending emails to the corresponding authors to request the full text of the required article. However, we still failed to obtain some of them ([94] and [95]), which limits the range of this survey. In addition, our inclusion and exclusion process removed articles from the initial pool, and preprint articles were not included; besides, we could not explore all available search databases, so it is possible that we missed some relevant articles on the topic. We did not reproduce the experiments of the 17 articles ourselves, which would have made the review more precise. Overall, it is difficult to conclude this study with strong and tested historical evidence, because FL was introduced only recently and our review was conducted within very limited time and with insufficient resources.

VII. CONCLUSION AND FUTURE DIRECTIONS
Imaging is one of the most popular and effective diagnostic methods in the medical sector. Its use is increasing day by day and produces large amounts of image data. AI has many opportunities in medical imaging using this data, but the clinical use of AI and ML is currently very limited. For research, creating a publicly shareable image dataset is very difficult in the medical domain. The major hurdle to data sharing and collaboration is privacy, which receives less priority in typical centralized models. Federated or distributed learning is different: a data-driven learning model is shared, not the data itself. In this study, we systematically reviewed the articles that considered FL in their ML-based medical image research. We discussed the topic from several perspectives, including demographic data, privacy aspects, datasets, FL characteristics, model implementation, and performance comparison. We noticed in one of our previous articles [33] that deep learning-oriented COVID-19 detection using X-ray and CT images has high accuracy, with most studies achieving more than 95%. We observed a similar trend in this study, where COVID-19 detection articles are the top scorers with the FL mechanism, although the scores under FL are comparatively lower than those of general models, as listed in Table 7. The performance of FL models in other application domains was also not notable. Besides, previous articles point out that the implementation of federated models is relatively complex and requires extra communication and maintenance effort. However, it is encouraging that the research field has received a lot of attention and many publications within a very short time, which is why we can hope for promising progress of FL in medical image analysis in the future.
At this stage, we summarize our findings below as future directions for researchers who are interested in contributing to the field:
• The privacy concern is not fully solved in FL; however, we cannot deny the importance of decentralized concepts. FL could be effective for collaborative ML in medical image research, so researchers should emphasize the implementation of additional privacy protection in a cost-effective way.
• Datasets in the research are collected from various sources and for various purposes, so experimental results could differ enormously. There is no benchmark dataset available in federated medical imaging research; standard datasets need to be built to avoid biased data and data heterogeneity problems.
• Similarly, no benchmark FL model has been presented yet in this field; such an initiative would help to build robust AI models for further research.
• Data in collaborative models is prone to be heterogeneous, since various classes of data are combined. However, our results show that the accuracy of multi-class classification is very low (as described in RQ11), which needs to be addressed in future research.
• Federated models achieved satisfactory performance in some cases, but we cannot yet describe them as an alternative to ML models in terms of accuracy.
• Many weaknesses were observed in the current publications of this field (the papers investigated in this review); we have included the article quality checklist and results in Appendix A. Future research could use the quality analysis questionnaire to improve article quality.

No doubt FL is something that might be on the future horizon. However, there are still some technical problems and challenges that need to be tackled before FL can be applied widely. To the best of our knowledge, this is the first SLR on the topic, and we believe this review is a reflection of FL research in the area of medical imaging.

Table 8 shows the 12 Quality Questions (QQ) and scores, mostly motivated by our previous article [14]. The goal of this inquiry was to check the basic quality of the articles published in FL-oriented medical imaging. Each question contributes a maximum score of one per article, giving a maximum total score of 12 for an individual article. We scored the QQ answers in three forms: ''Yes (1)'', ''Partially Yes (0)'', and ''No (−1)''. An article that clearly supports the question is scored Yes, one that partially supports it or gives no clear answer is scored Partially Yes, and one that fully disagrees is scored No. We investigated each of the articles to find the answers and assigned the scores in the respective columns. As the table shows, most of the articles failed to fulfill the quality requirements. The highest score is 9 out of 12, obtained by [27], followed by a score of six each for [20] and [28]. The scores indicate that quality has been maintained poorly in some areas of the research papers; one reason could be that the large amount of attention created a rush among contributors to FL research.
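The scoring scheme for the quality questions can be expressed as a short helper, sketched below. The answer list is a made-up example and does not reproduce the answers recorded for any reviewed article; the names `ANSWER_SCORES` and `quality_score` are hypothetical.

```python
# Score assigned to each answer form, as described in the text.
ANSWER_SCORES = {"Yes": 1, "Partially Yes": 0, "No": -1}
N_QUESTIONS = 12


def quality_score(answers):
    """Sum the per-question scores of one article (maximum is 12)."""
    if len(answers) != N_QUESTIONS:
        raise ValueError("expected one answer per quality question")
    return sum(ANSWER_SCORES[answer] for answer in answers)


# Hypothetical answers for a single article.
example_answers = ["Yes"] * 7 + ["Partially Yes"] * 3 + ["No"] * 2
print(quality_score(example_answers))   # 5
```

Because a No answer subtracts a point, an article can score well below its count of Yes answers, which partly explains why totals far from the maximum of 12 are common under this scheme.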