Performance Analysis of Federated Learning Algorithms for Multilingual Protest News Detection Using Pre-Trained DistilBERT and BERT

Data scientists in the Natural Language Processing (NLP) field confront the challenge of reconciling the necessity for data-centric analyses with the imperative to safeguard sensitive information, all while managing the substantial costs linked to the collection process of training data. In a Federated Learning (FL) system, these challenges can be alleviated by the training of a global model, eliminating the need to centralize sensitive data of clients. However, distributed NLP data is usually Non-Independent and Identically Distributed (Non-IID), which leads to poorer generalizability of the global model when trained with Federated Averaging (FedAvg). Recently proposed extensions to FedAvg promise to improve the global model performance on Non-IID data. Yet, such advanced FL algorithms trained on multilingual Non-IID texts have not been studied in industry and academia in detail. This paper compares, for the first time, the FL algorithms: FedAvg, FedAvgM, FedYogi, FedAdam and FedAdagrad for a binary text classification task using 12078 tailored real-world news reports in English, Portuguese, Spanish and Hindi. For this objective, pre-trained DistilBERT and BERT models fine-tuned with these texts are used. The paper results show that FedYogi is the most stable and robust FL algorithm when DistilBERT is used, achieving an average macro F1 score of 0.7789 for IID and 0.7755 for Non-IID protest news. The study also exhibits that BERT models trained with weighted FedAvg and FedAvgM can achieve a similar prediction power as centralized language models, demonstrating the potential of leveraging FL in the NLP domain without the need to collect data centrally.


I. INTRODUCTION
Over the past decade, the field of Natural Language Processing (NLP) has witnessed a transformation, primarily driven by the emergence of Large Language Models (LLMs) and the paradigm shift towards decentralized Machine Learning (ML) methods. A major reason for the success of LLMs can be attributed to the introduction of transformer models with multi-head self-attention mechanisms for capturing complex text data relationships [1]. In 2018, a milestone in the history
of NLP was marked by the introduction of the Bidirectional Encoder Representation from Transformers (BERT), a pre-trained model capable of comprehending the context of words in a sentence by considering both the words that precede and follow each word, utilizing a vector space [2], [3]. With the advent of more recent transformer models, such as Large Language Model Meta Artificial Intelligence (LLaMA) [4] and Generative Pre-trained Transformer (GPT) [5], [6], modern NLP applications have been revolutionized. Pre-training on large text corpora was another reason for the breakthrough success of LLMs [6]. Multimodal models such as Contrastive Language-Image Pre-training (CLIP) [7] additionally combine language understanding models with computer vision models, which has opened new avenues for AI applications.
However, with the advancing development of LLMs, data privacy is becoming an important issue for all organizations creating NLP applications that use sensitive data that must remain within their area of responsibility. When a centralized ML system with sensitive data is implemented, organizations must be cognizant of the potential for data breaches, such as those caused by adversarial ML attacks [8]. Moreover, training data in NLP applications are often personal, such as health records in medical document classification systems. Ensuring data privacy is crucial in these cases, because the disclosure of sensitive information may result in heavy fines [9], [10]. Additionally, in centralized ML approaches, a large amount of training data is stored in a single central location (e.g., a data lake) to train the model. Many NLP applications are only possible with high-volume data, which must first be collected centrally at substantial cost. For instance, developing an LLM for a personalized AI assistant usually involves a costly data acquisition process beforehand, resulting in high financial costs for an organization. In fact, considerable effort can be required to create and maintain central data storage for NLP applications.
As LLMs grew in size, the challenges of centralized training, including environmental impact and computational costs, led to research on decentralized and more energy-efficient training methods. A novel approach to mitigate the data privacy and cost issues associated with centralized data-driven learning methods is provided by Federated Learning (FL) [11], [12]. Leveraging FL in the NLP domain to federate LMs can lead to a privacy-friendly and cost-saving data-driven model development by eliminating the need for a complex data-centric acquisition process. However, there are several key challenges to consider when training an ML model in a federated manner, including statistical heterogeneity, effective model aggregation and hyperparameter selection, privacy concerns, communication costs, and heterogeneity of the systems [11], [13], [14], [15], [16]. The analysis of statistical heterogeneity and effective model aggregation by training and comparing federated LMs using different data distributions and FL aggregation algorithms is the focus of this study. To the best of our knowledge, these two key challenges are the least explored aspects of FL research in the NLP field.

A. MOTIVATION
In real-world scenarios, distributed clients deal with training data that are Non-Independent and Identically Distributed (Non-IID). In dialog systems, for instance, the data is typically Non-IID because the distribution of words and phrases in a conversation can change over time, depending on language, conversational context, and interactions between users [17]. Another example is a collaborative word prediction task [18], where cell phone users may use different languages, which inevitably leads to Non-IID data because such cell phones usually have uneven data distributions. In both cases, Non-IID data can negatively influence the global model quality in an FL system. To address this challenge, researchers have proposed a set of novel FL algorithms [19], [20], [21], [22], [23].
Although the utilization of FL in the NLP domain is discussed in some papers, there is almost no research work connecting advanced FL algorithms with LMs and LLMs. Moreover, a privacy-preserving multilingual analysis of IID and Non-IID protest news texts with federated transformer models, investigating the impact of data distributions and the effectiveness of FL algorithms on federated LMs, is missing so far. The dataset employed in this study was derived from an NLP competition [24], serving the purpose of accurately discerning multilingual news texts (i.e., English, Portuguese, Spanish, and Hindi) into two distinct categories: non-political events and political events such as protests (see also Section III-B on page 5). A comparative analysis of pre-trained transformer models trained with advanced FL algorithms on IID and Non-IID protest news can shed light on how the data distribution and the choice of an FL algorithm affect the global model quality in a federated NLP application.

B. CONTRIBUTIONS
In this study, the application of FL for a binary classification task with news texts in English, Portuguese, Spanish, and Hindi is investigated. Pre-trained multilingual BERT [25] and DistilBERT [26] models are fine-tuned on these data, as they show promising results for similar NLP classification tasks in centralized learning settings [26], [27]. DistilBERT is used to study the behavior and performance of a distilled LM in a federated environment. To simulate real-world scenarios, the texts are partitioned into IID and Non-IID settings. The model prediction quality of the federated transformer models is compared with centrally trained ones, and the effectiveness of the FL aggregation algorithms is investigated in both partitioning settings. To achieve this, a comparative analysis of these federated models trained with weighted Federated Averaging (FedAvg), Federated Averaging with Server Momentum (FedAvgM), FedYogi, FedAdam, and FedAdagrad is conducted (see Section IV on page 9). The federated transformer models are also compared to centrally fine-tuned transformers to observe whether the federated models can achieve a similar performance or even outperform the baseline models.
The major contributions of this paper can be summarized as follows:
• Use a customized partitioning strategy for splitting protest news reports into IID and Non-IID client settings to perform federated experiments.
• Fine-tune the pre-trained DistilBERT and BERT models with the partitioned protest news reports in a federated manner.
• Evaluate and compare the fine-tuned DistilBERT and BERT models trained with advanced FL algorithms to identify the most stable and effective FL algorithm that improves the robustness of the global model in IID and Non-IID settings for NLP tasks. The conducted comparative analysis provides an overview of the macro F1 scores of federated DistilBERT and BERT models trained with advanced FL algorithms in both client data distribution settings. The results of this study can be used as a foundation for determining a suitable FL algorithm for a cost-effective analysis of distributed multilingual text data, as well as to elucidate the differences between novel FL algorithms trained on IID and Non-IID data, which contributes to the explainability of FL.
Note that the scope of this study is limited to the federated training of DistilBERT and BERT LMs, since the memory requirements are enormous when using pre-trained LLMs such as LLaMA 2 [4] and GPT-3 [5]. Also, other experimental FL algorithms beyond those already mentioned are not considered in this paper due to the lack of implementation details.
The experimental results and findings from the comparative analysis are organized into a series of Research Questions (RQs), accompanied by their corresponding answers. In this study, the following RQs are answered:
• RQ 1: How do weighted FedAvg, FedAvgM, FedAdagrad, FedAdam, and FedYogi perform using transformer models trained on IID and Non-IID text data?
• RQ 2: How generalizable are federated transformer models for texts in languages that are not part of the fine-tuning process?
• RQ 3: Which FL algorithms perform best and are the most stable in detecting protest news in all four languages?
The RQs are addressed and answered in ascending order in Section IV on page 9. The remainder of the paper is organized as follows: Section II presents relevant background information on related works. Section III provides an overview of FL, details about the text data under study, the selected partitioning strategy to split the data into IID and Non-IID settings, the transformer model architectures considered, a set of novel federated model aggregation algorithms, and the experimental settings used for the comparative analysis. Section IV presents and discusses the experimental results of comparing federated DistilBERT and BERT models trained with the FL algorithm candidates on the partitioned data, and compares them with centralized baseline models. Section V concludes this paper with a short summary and outlook.

II. RELATED WORK
There is ample work in the field of FL that is directly related to this paper. The most prominent pros and cons of these works are listed in Table 1. The original concept of FL and weighted FedAvg were introduced to train Deep Learning (DL) models on sensitive training data in a distributed, privacy-preserving manner [11]. However, FedAvg does not address the challenge of Non-IID data; only the total amount of training data from the clients is taken into account [28]. FedDist is a computationally intensive FL algorithm that can identify dissimilarities between neurons in federated Neural Networks (NNs) among clients using a Euclidean distance matrix [29]. The obtained results of FedDist were compared with those of FedAvg, but could not improve on them in terms of accuracy, as both achieved an accuracy of 96.96% on proxy datasets [29].
Concerning LMs, the work on TextCNN proposed a federated NLP model designed for sentence-level classification tasks and optimized for imbalanced data distributions on clients, with additional privacy mechanisms (see Table 1) [31]. For clients with highly Non-IID data, TextCNN still achieved worse test accuracies than FedAvg (with a performance drop of up to 10%) [30]. FedBERT introduced pre-training of the BERT model in a federated setting [35]. In the work on FedNLP, the federated fine-tuning process of LMs was investigated on a set of NLP benchmark tasks, where the pre-training data was publicly available and the fine-tuning data was privately distributed [34]. Based on FedNLP, AdaFL promises to reduce the convergence delay by inserting small bottleneck modules at a variety of model layers [37]. However, non-optimal data distributions and the heterogeneity of devices can negatively influence the final model [37].
In several experiments on the CIFAR-10, FEMNIST, Stack Overflow, and Reddit proxy datasets, it was shown that the impact of federating a pre-trained model is mainly influenced by the distribution of the training data [32], [33]. In these experiments, up to 40% higher accuracy could be achieved by federated pre-trained models compared to federated models trained from scratch with random initialization. It has been observed that an adverse data distribution (e.g., Non-IID) can be counteracted by pre-trained LMs, resulting in an improved global model performance [34], [35].
In FedOpt, a collection of federated optimization algorithms tailored for handling Non-IID data is presented, where a superior model performance over the standard weighted FedAvg is promised for Non-IID data when FedOpt algorithms are used instead [22]. The FedOpt suite includes FedYogi, FedAdam, and FedAdagrad, with experiments showing that FedAdam and FedAdagrad are outperformed by FedYogi [22]. In another work, experiments with the CIFAR-10 dataset showed a test accuracy equivalent to the centralized baseline model (86%) for training with FedAvgM [36].
The aforementioned related work has served as inspiration for this study, because it utilizes pre-trained LMs in FL settings. However, no comparison of different FL algorithms over multilingual text data using pre-trained LMs has been conducted to date.

III. METHODOLOGY
In this section, relevant background information on FL systems, the underlying study data, the LM architectures used, the considered FL algorithms, and the experimental setup for the comparative analysis are provided. A summary of the research workflow performed in this paper is depicted in Figure 1.
Given the semi-distributed nature of FL, the initial step involves the partitioning of the protest news text data into two distinct partitions: an IID setting (optimal partitioning) and a Non-IID setting (non-optimal partitioning). This partitioning strategy is employed to enable an investigation into the impact of IID and Non-IID data on the global model performance. In real-world scenarios, the distribution of data is inherently assumed to be Non-IID [38]. The partitioned text data is fed into a two-step process of tokenization and subsequent token embedding, as shown in Figure 1. BERT and DistilBERT are also fine-tuned using centralized news reports, which serve as baseline models for the comparison. Both LMs are federated with the IID and Non-IID partitions using the FL algorithms. Subsequently, a performance analysis is conducted to assess the extent of divergence in model performance between the federated models and their baseline counterparts. The macro F1 metric is used for the model evaluations in this study because of its capability to accurately gauge model performance in scenarios characterized by imbalanced data [39].

A. FEDERATED LEARNING SYSTEM
In this subsection the overall FL system used for the performance analysis is described in more detail, as illustrated in Figure 2.
In FL-driven systems, DL models are jointly trained by K clients while keeping their local training data decentralized and isolated from other clients [12]. Inherently, data privacy is provided by design in FL systems, as there is no need to move the training data [28]. In the federated training process, a global model is first randomly initialized by a coordinator server and distributed to the K clients holding sensitive training data [11] (see Figure 2). Subsequently, the K clients use the received global model for local training. Once the local training is completed, the local model updates are sent back to the server by the clients for the model aggregation phase (see Figure 2). In this step, the model updates are aggregated into a new single global model state by the coordinator server using an average-based FL aggregation algorithm (e.g., FedAvg) [29]. Afterwards, the new global model is distributed back to the K clients for the next federated training round. This training routine is continued until a predefined number of federated training rounds is completed (a hyperparameter to be tuned), ideally when the federated model has converged [29]. In this paper, an FL system with three clients on a single device is simulated (see also Subsection III-E on page 8).
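To make this round structure concrete, the following minimal sketch simulates the described training routine on a single device. The helpers `local_train` and `aggregate` are hypothetical placeholders for the client-side fine-tuning step and the server-side FL algorithm (e.g., FedAvg); they are not part of any library and only illustrate the protocol described above.

```python
# A minimal sketch of the federated training loop: broadcast -> local training
# -> aggregation, repeated for a fixed number of rounds on a single device.
import copy

def run_federated_training(global_model, client_loaders, local_train, aggregate, num_rounds=5):
    """Simulate `num_rounds` of broadcast, local training, and aggregation."""
    for _ in range(num_rounds):
        client_states, client_sizes = [], []
        for loader in client_loaders:                  # each client trains locally
            local_model = copy.deepcopy(global_model)  # start from the broadcast global model
            local_train(local_model, loader)           # e.g., one local epoch with AdamW
            client_states.append(local_model.state_dict())
            client_sizes.append(len(loader.dataset))
        # the server aggregates the local updates into a new global model state
        new_state = aggregate(client_states, client_sizes)
        global_model.load_state_dict(new_state)
    return global_model
```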

B. MULTILINGUAL PROTEST NEWS
The underlying study data and the partitioning strategy used for the comparative analysis of advanced FL algorithms are described in this subsection.

1) DATASET DESCRIPTION
In this study, the dataset from the task of detecting socio-political and crisis events from the Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021) workshop is used [24]. Text data on socio-political events and crises is utilized for national and international policy and decision-making [24]. Therefore, the validity and reliability of the analyses of this dataset are crucial. For the training set, the data contains news reports in English, Spanish, and Portuguese. However, the CASE 2021 data also contains texts in the Hindi language, which are not included in the training set and are used only for testing the generalization capability of the federated LMs. Also, the provided test data from the evaluation phase of the CASE 2021 challenge task is used in this paper [24]. Unfortunately, there are no German protest news in the CASE 2021 dataset. Therefore, this study is limited to the four languages mentioned from the CASE 2021 dataset, which should be sufficient to address the objectives and RQs.
Whether an event (either a crisis or a socio-political protest) has occurred or not is indicated by the training data labels, as shown in Table 2. Thus, a binary classification task is performed in this study. The label 0 means that no political event took place, and 1 means that a political event (e.g., a protest or crisis) occurred (see Table 2). It should be noted that the study data encompasses a total of four languages: English, Spanish, and Portuguese (which belong to the Germanic and Italic language families) and Hindi (which belongs to the Indo-Iranian language family). By employing the testing set, the performance of the federated LMs can be evaluated when using a language different from the training languages, for instance, when comparing languages from distinct language families, such as Hindi and Spanish.
The distribution of the text samples in the dataset for each language can be found in Table 3. The texts in the CASE 2021 dataset exhibit an imbalance, with the highest number of samples attributed to English, followed by Portuguese and Spanish. For the Hindi language, no training data was provided by the workshop organizers [24]. Therefore, in this study, news reports in Hindi are only used for testing the model inference capability of the federated LMs. This could provide insights into the model performance behaviour of the federated LMs when classifying texts in a language on which they have not been explicitly fine-tuned.

2) DATASET PARTITIONING
In this study, the training and validation sets for the centralized learning setting are created by combining all four languages and employing a 4:1 split ratio based on the number of samples. However, in practical FL settings, the clients' data predominantly exhibit Non-IID characteristics [23], [40]. As a result, the statistical assumptions for IID data cannot be applied to Non-IID data. Thus, it is necessary to examine both forms of data distribution in order to ensure a fair comparison that simulates real-world conditions. Since news reports in the Hindi language are exclusively part of the testing set, they are not incorporated into the data partitioning process. The data is partitioned into IID (see Figure 3) and Non-IID partitions (see Figure 4 on page 6) as follows:

a: IID PARTITIONING
News reports in English, Portuguese, and Spanish are combined into a single dataset. Subsequently, the dataset is split into three distinct partitions with equal proportions of each language through random sampling. This results in three clients holding the same amount of texts in each language (see Figure 3 on page 5).

b: NON-IID PARTITIONING
The Non-IID partitions selected in this study for the three clients are based on the number of training samples per language (see Table 3 on page 5). In this partitioning strategy (see Figure 4), the news reports in English, Portuguese, and Spanish are assigned to individual clients to create a Non-IID data setting. Furthermore, the clients' data distributions are highly imbalanced in terms of the number of samples, which is representative of a more realistic scenario for FL.
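The following minimal sketch illustrates the two partitioning strategies, assuming the training corpus is available as a list of (text, label, language) triples; the variable names, function names, and ISO language codes are illustrative and not taken from the actual experiment code.

```python
# A sketch of the IID and Non-IID partitioning strategies described above.
import random
from collections import defaultdict

def partition_iid(samples, num_clients=3, seed=42):
    """IID: shuffle each language and deal out equal shares to every client."""
    rng = random.Random(seed)
    by_language = defaultdict(list)
    for sample in samples:                         # sample: (text, label, lang)
        by_language[sample[2]].append(sample)
    clients = [[] for _ in range(num_clients)]
    for lang_samples in by_language.values():
        rng.shuffle(lang_samples)
        for i, sample in enumerate(lang_samples):
            clients[i % num_clients].append(sample)   # equal share of every language
    return clients

def partition_non_iid(samples):
    """Non-IID: each client holds exactly one language (assuming 'en', 'pt', 'es' codes)."""
    by_language = defaultdict(list)
    for sample in samples:
        by_language[sample[2]].append(sample)
    return [by_language["en"], by_language["pt"], by_language["es"]]
```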

C. MODEL ARCHITECTURES
In this subsection, the pre-trained multilingual model architectures considered for the protest news detection task are briefly described. The multilingual versions of the transformer models BERT and DistilBERT are used and federated in this study. A WordPiece tokenizer is used by BERT (see Figure 5) and DistilBERT (see Figure 6) during the tokenization process, where the input data is prepared for the word embedding layers to generate feature vector representations [25].
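As an illustration, the snippet below shows how such multilingual texts could be tokenized with a publicly available Hugging Face checkpoint; the checkpoint names ("distilbert-base-multilingual-cased" / "bert-base-multilingual-cased") are assumptions, since the paper does not state the exact model identifiers.

```python
# A minimal sketch of the WordPiece tokenization step feeding the embedding layers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

batch = tokenizer(
    ["Protesters gathered in front of the parliament.",
     "Los manifestantes se reunieron frente al parlamento."],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
# batch["input_ids"] and batch["attention_mask"] are the model inputs.
```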

1) BERT
The bidirectional transformer model BERT has been pre-trained on a large amount of unlabeled text data to learn a language representation [3], [25]. BERT can be fine-tuned for specific ML tasks and has demonstrated superior performance compared to state-of-the-art approaches in various NLP tasks, including Multi-Genre Natural Language Inference (MultiNLI) and the Stanford Question Answering Dataset (SQuAD) v1.1 and v2.0 [2], [41], [42]. Notably, BERT achieved significant results on the General Language Understanding Evaluation (GLUE) benchmark and in the corresponding F1 scores [25]. The improved performance of BERT can be attributed to its bidirectional transformer architecture, which is complemented by pre-training tasks such as masked language modeling and next sentence prediction, as well as a large number of parameters (110 million) [27].
In this study, a multilingual version of BERT is employed, which supports 104 languages and was released alongside the monolingual BERT [25]. Considering the vast text corpus utilized during the pre-training of BERT, we assume that the model has acquired some inference knowledge about the text data from the socio-political and crisis events dataset. This model architecture can be federated because it consists of NN components whose local model updates can be leveraged by federated algorithms to train a global model.

2) DistilBERT
DistilBERT is a distilled and approximated version of BERT that retains 97% of its performance but uses only half the number of parameters (66 million) [26]. The distillation flow shown in Figure 6 on page 6 reduces the transformer layers of BERT to a smaller set of transformer layers for DistilBERT without any significant information being lost. Figure 7 shows the architectural details of such a (distilled) transformer layer. In this block of layers, the multi-head attention layer (K = Key, Q = Query, and V = Value) with addition and normalization layers is deployed. The output is fed into a feed-forward NN and normalized. DistilBERT provides a computationally less demanding but still powerful multilingual transformer model [43], [44]. Specifically, no token-type embeddings or pooler are included, and only half of the layers from the larger BERT model (see Figure 5 on page 6) are retained [26]. For this study, a pre-trained multilingual version of DistilBERT is fine-tuned for the comparative analysis with BERT. Similar to BERT, DistilBERT can be federated by sending the model updates (i.e., the trained gradient vectors from the NN components) to the coordinator server for the aggregation phase, as shown in Figure 2 on page 4.

FIGURE 6. Model architecture overview of DistilBERT [26].

D. FEDERATED ALGORITHMS
In this subsection, a set of federated aggregation strategies used for the federated training of the LMs with the multilingual protest news dataset is briefly described.

1) FedAvg
The first FL algorithm, FedAvg, was introduced by Google to solve FL optimization problems [11].FedAvg is used by the central coordinator server of an FL system to compute a weighted average over a set of local weight updates sent by the participating clients to create a new global model state.
FedAvg can be formalized as follows:
$$w_t = \sum_{k=1}^{K} \frac{n_k}{n}\, w_t^k \qquad (1)$$
where $K$ is the total number of clients, $n$ is the total number of training samples, $n_k$ is the number of training samples at client $k$, and $w_t^k$ are the model weights from client $k$ for aggregation in federated round $t$. Thus, $w_t$ represents the aggregated global model update that is distributed back from the server to the clients. Afterwards, client $k$ calculates a cumulative local gradient with $w_{t+1}^k \leftarrow w_{t+1}^k - w_{t+1}$, where $w_{t+1}^k$ are the updated weights of the newly trained local model [11], [28]. In this study, FedAvg is used because it is the baseline algorithm for FL optimization problems and should be included in any fair comparison of federated algorithms.
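A minimal sketch of this weighted aggregation step (Eq. 1) over PyTorch state dictionaries could look as follows; the function and variable names are illustrative and not taken from an existing implementation.

```python
# Server-side weighted average of client model weights, as in Eq. (1).
import torch

def fedavg(client_states, client_sizes):
    """Aggregate client state_dicts with weights n_k / n."""
    total = float(sum(client_sizes))
    aggregated = {}
    for key in client_states[0].keys():
        aggregated[key] = sum(
            (n_k / total) * state[key].float()
            for state, n_k in zip(client_states, client_sizes)
        )
    return aggregated
```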

2) FedAvgM
In FedAvgM, momentum is added on the server side, and a velocity component $v_t$ together with the global server momentum parameter $\beta$ is computed. Thus, Eq. (1) can be rewritten for FedAvgM as follows:
$$v_{t+1} = \beta v_t + \sum_{k=1}^{K} \frac{n_k}{n}\,\big(w_t - w_t^k\big), \qquad w_{t+1} = w_t - v_{t+1} \qquad (2)$$
where the global estimate of the first moment is denoted by $\beta v_t$. FedAvgM stores the global history of gradients, which improves and accelerates the convergence of the model [36].
It incorporates the global gradient information into the weight updates by accelerating the gradient calculation of the current model with the global momentum parameter $\beta$. According to the authors of FedAvgM [36], the algorithm outperforms standard FedAvg for most Non-IID problems. Therefore, FedAvgM is included in the comparison suite.
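A corresponding server-side sketch of the momentum update in Eq. (2) is shown below, assuming the pseudo-gradient is the weighted average of the per-client deltas and the velocity buffer is kept on the server between rounds; names and default values are illustrative.

```python
# Server momentum step for FedAvgM, following Eq. (2).
def fedavgm_step(global_state, client_states, client_sizes, velocity, beta=0.9):
    total = float(sum(client_sizes))
    new_state, new_velocity = {}, {}
    for key in global_state:
        delta = sum(                                   # weighted average of (w_t - w_t^k)
            (n_k / total) * (global_state[key].float() - state[key].float())
            for state, n_k in zip(client_states, client_sizes)
        )
        new_velocity[key] = beta * velocity.get(key, 0.0) + delta   # v_{t+1} = beta * v_t + delta_t
        new_state[key] = global_state[key].float() - new_velocity[key]
    return new_state, new_velocity
```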

3) FedOPT
The collection of federated optimization algorithms FedOpt was introduced and proposed by [22]. It is designed especially for non-convex settings. The main idea behind FedOpt is to use the negative values of the global model differences $w_t^k - w_t$ as pseudo-gradients. FedOpt algorithms are applied in the model weight aggregation phase with the gradient momentum update method from Stochastic Gradient Descent (SGD). This reduces the negative effect of Non-IID data and accelerates the convergence speed to some extent. Reference [22] stated in their work that the standard FedAvg lacks adaptability and may be ineffective for settings with heavy stochastic gradient noise distributions, which is often the case when training LMs. The FedOpt collection contains FedAdagrad, FedAdam, and FedYogi, which are described in more detail below.
FedAdagrad: The federated version of the SGD optimizer Adagrad, FedAdagrad, was introduced by [20]. Similar to vanilla Adagrad, it tracks the sum of the squared federated gradients to adapt to gradients in different directions. Mathematically, FedAdagrad can be expressed as follows:
$$e_t = e_{t-1} + w_t^2 \qquad (3)$$
where $e_t$ is the Exponential Moving Average (EMA) of the gradients, $e_{t-1}$ is the EMA of the past gradients, and $w_t^2$ is the squared weight update from the aggregation phase (see Subsection III-D1 on page 7). A general problem with the Adagrad optimizer is the rapid monotonic decay of the learning rates, which can lead to poor model performance [21]. However, it is a commonly used SGD optimizer in various DL tasks. Therefore, FedAdagrad is considered as an FL algorithm candidate for the comparative analysis.
FedAdam: FedAdam is a federated adaptive DL version of the popular Adam optimizer [19], [22]. Similar to FedAdagrad, in FedAdam the EMA is used to calculate the gradients. It also maintains an exponentially decaying average of past gradients [45]. However, the FedAdagrad problem described above is addressed by downscaling the weights using the EMA of past squared gradients [21]. Formally, Eq. (3) can be extended as follows:
$$e_t = \beta_2 e_{t-1} + (1 - \beta_2)\, w_t^2 \qquad (4)$$
where $e_t$ is the EMA of the gradients and $\beta_2$ is the exponential decay rate for the 2nd-moment estimates. The squared weight update from the aggregation phase is denoted as $w_t^2$. FedAdam gains the speed of momentum and differs from FedAdagrad in that a decay rate for the 2nd moment is used. However, non-convergence with zero gradients can be caused relatively quickly by FedAdam, especially in sparse settings, where past gradients may be forgotten [19], [21]. Nevertheless, FedAdam is used for the comparative analysis because acceptable results were achieved in the experiments of [22] on proxy datasets such as CIFAR-10 [46] and Shakespeare [11].

FedYogi: Changes in the stochastic learning rate are adapted by FedYogi to improve the convergence guarantee in the presence of Non-IID data. This results in a more stable updated global model. According to [19] and [47], the adaptive optimizer Yogi outperforms Adam in most non-convex settings. Formally, FedYogi can be expressed as follows:
$$e_t = e_{t-1} - (1 - \beta_2)\,\operatorname{sign}\!\big(e_{t-1} - w_t^2\big)\, w_t^2 \qquad (5)$$
where $e_t$ is the EMA of the gradients, $e_{t-1}$ is the EMA of the past gradients, $\beta_2$ is the exponential decay rate for the 2nd-moment estimates, and $w_t^2$ is the squared weight update from the federated aggregation. Hence, similar to FedAdam (see Eq. (4)), the gradients are updated using an EMA with 2nd-moment estimates. A notable distinction is that FedAdam uses multiplicative updates, while FedYogi employs additive updates [19]. Additionally, FedYogi transforms negative averaged gradient updates by inverting their sign to positive values (see Eq. (5)). The best performance on proxy datasets is demonstrated by FedYogi, based on the comparison results of [22], surpassing FedAvg, FedAvgM, FedAdagrad, and FedAdam. As a result, FedYogi is a worthwhile FL algorithm candidate for our comparative analysis.
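For illustration, the following simplified sketch contrasts the three second-moment updates of Eqs. (3)-(5) and a generic adaptive server step. The full FedOpt algorithms in [22] additionally maintain a first-moment estimate and further hyperparameters, which are omitted here, so this should be read as an illustration of the accumulator differences rather than a reference implementation; `delta` denotes the averaged client update (pseudo-gradient).

```python
# Simplified second-moment accumulators for FedAdagrad / FedAdam / FedYogi.
import torch

def second_moment(e_prev, delta, variant, beta2=0.99):
    d2 = delta * delta                                    # squared pseudo-gradient w_t^2
    if variant == "adagrad":                              # Eq. (3): running sum
        return e_prev + d2
    if variant == "adam":                                 # Eq. (4): EMA with decay beta2
        return beta2 * e_prev + (1.0 - beta2) * d2
    if variant == "yogi":                                 # Eq. (5): additive, sign-controlled
        return e_prev - (1.0 - beta2) * torch.sign(e_prev - d2) * d2
    raise ValueError(variant)

def adaptive_server_step(w, delta, e_prev, variant, eta=1e-1, tau=1e-3):
    """Generic adaptive update of the global weights with the chosen accumulator."""
    e_t = second_moment(e_prev, delta, variant)
    w_new = w + eta * delta / (torch.sqrt(e_t) + tau)     # adaptive learning-rate step
    return w_new, e_t
```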

E. EXPERIMENTAL SETUP
This subsection provides further information regarding the chosen hyperparameters, the loss function and metric, as well as the hardware and framework used. The fine-tuned LMs are stored in Google Drive. An overview of the parameters and functions considered for training and evaluation is listed in Table 4.

1) HYPERPARAMETERS
AdamW is used as the local gradient optimizer on the client side to fine-tune the LMs in both the federated and centralized learning settings, because it is commonly used as a default optimizer for LLM tasks [48]. The adjustment of the (hyper-)parameters was conducted through a tuning process, wherein different hyperparameter configurations were tested using a grid-search approach. Based on this result, a learning rate of $\eta = 5 \times 10^{-6}$ and a weight decay of $\lambda = 1 \times 10^{-2}$ are set for the comparison experiments. A fixed value of $T = 5$ is used for the number of federated aggregation rounds, as pre-trained transformer models usually converge more quickly in the federated space than LMs without pre-training [35]. In each federated aggregation round $t_i \in \{1, \ldots, T\}$, a single epoch is first trained locally by each client, before its local model is sent as an update to the coordinator server for the aggregation phase (see Figure 2 on page 4). A mini-batch size of $b = 4$ is used in all experiments, as lower or higher batch sizes did not improve the performance of the global model.
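A client-side fine-tuning sketch with these reported hyperparameters might look as follows; the checkpoint name and the dataset/DataLoader construction are assumptions made purely for illustration.

```python
# Local fine-tuning step with AdamW (lr = 5e-6, weight decay = 1e-2, batch size 4,
# one local epoch per federated round), as reported above.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification

# Pre-trained multilingual checkpoint with a 2-label classification head
# (checkpoint name is an assumption).
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=2)

def local_train(model, dataset, device="cuda"):
    """One local epoch per federated round."""
    model.to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6, weight_decay=1e-2)
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    for batch in loader:                  # batch holds input_ids, attention_mask, labels
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss        # cross-entropy over the two labels
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    return model
```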

2) LOSS AND EVALUATION METRIC
A binary cross-entropy loss function is used to distinguish crisis news from non-event texts. The macro F1 score is used to evaluate the models, since the prediction results can only be obtained from the CASE 2021 workshop competition website [24]. In fact, the macro F1 scores from the best federated training round $t_i$ are considered in all conducted experiments. This best-model strategy requires that, after each training round, the global model is exported and saved for further evaluation [49], [50]. The macro F1 score can be expressed as follows:
$$\text{Macro F1} = \frac{1}{C}\sum_{c=1}^{C} 2 \cdot \frac{\text{precision}_c \cdot \text{recall}_c}{\text{precision}_c + \text{recall}_c} \qquad (6)$$
where precision and recall are performance metrics from the model inference on the test data and $C$ is the number of classes (here $C = 2$). Thus, Eq. (6) shows that a harmonic mean of precision and recall is calculated per class in the macro F1 score, where a high F1 score close to 1 is better than one close to 0 [39].
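For reference, the metric in Eq. (6) can be computed with scikit-learn as in the following toy example.

```python
# Macro F1: unweighted mean of the per-class F1 scores.
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1]   # toy ground-truth labels
y_pred = [0, 1, 0, 0, 1]   # toy model predictions
print(f1_score(y_true, y_pred, average="macro"))
```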

3) HARDWARE AND FRAMEWORK
All experiments are performed on Google Colab Pro+ with a high runtime tier (52 GB RAM) and an NVIDIA Tesla P100 GPU. PyTorch [51] and the Flower framework [52] are used to train the LMs and to simulate an FL system on a single device (see Figure 2 on page 4).
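A configuration sketch of such a three-client simulation with Flower is given below. The exact constructor arguments differ between Flower versions, `make_client` is a hypothetical factory returning a `NumPyClient` that wraps the local fine-tuning step, and the initial weights shown are only placeholders; treat the snippet as an illustration of how the pieces fit together, not as the authors' setup.

```python
# Single-device FL simulation with Flower (three clients, five rounds).
import numpy as np
import flwr as fl

# Placeholder initial weights; in practice these would be the NumPy ndarrays
# of the pre-trained global DistilBERT/BERT state_dict.
initial_weights = [np.zeros((2, 2))]

strategy = fl.server.strategy.FedYogi(            # or FedAvg, FedAvgM, FedAdam, FedAdagrad
    initial_parameters=fl.common.ndarrays_to_parameters(initial_weights),
)

fl.simulation.start_simulation(
    client_fn=lambda cid: make_client(cid),       # make_client: hypothetical NumPyClient factory
    num_clients=3,                                # three simulated clients on one device
    config=fl.server.ServerConfig(num_rounds=5),  # T = 5 federated rounds
    strategy=strategy,
)
```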

IV. EXPERIMENTAL RESULTS AND ANALYSIS
This section discusses the prediction results (i.e., macro F1 scores) obtained from the comparisons of the federated DistilBERT and BERT models trained with weighted FedAvg, FedAvgM, FedYogi, FedAdam, and FedAdagrad in IID and Non-IID settings on the protest news dataset.
In addition, results are presented for the DistilBERT and BERT models trained with centralized data, serving as baseline references (see Table 5). In the following subsections, the RQs defined at the beginning (see Subsection I on page 3) are addressed and answered based on the comparison results and findings.
A. HOW DO WEIGHTED FedAvg, FedAvgM, FedAdagrad, FedAdam, AND FedYogi PERFORM USING TRANSFORMER MODELS TRAINED ON IID AND NON-IID TEXT DATA? (RQ 1)
A large number of text samples in both the training and test data are present in the English language (see Table 3 on page 5). Therefore, a performance bias towards English is shown by our results for all compared FL algorithms, with the highest baseline macro F1 scores of 0.8192 for DistilBERT and 0.8233 for BERT, as can be seen in Table 5. Because DistilBERT is a distilled version of BERT, it should not perform better than BERT [26]. However, the macro F1 score for the baseline BERT model (0.6885) is still slightly lower than that of the DistilBERT model for the classification of news reports in Spanish. Statistical measurement tolerance is assumed, because the difference between the two baseline models in terms of accuracy is relatively small (0.0054). Moreover, the centralized baseline DistilBERT model appears to have a similar model performance to the larger BERT model for all languages except Hindi. This is because DistilBERT maintains up to 95% of BERT's performance with 40% fewer parameters [26], [53].
For the federated case, as shown in Table 6, stable global models can be obtained for English news reports in both data distribution settings when the LMs are trained with any FL algorithm. This is because the transformers were trained on a large English text corpus during the pre-training of DistilBERT and BERT [35]. A macro F1 score of 0.8168 is obtained for the IID setting when DistilBERT is trained with FedAvg. The Non-IID experiments for the DistilBERT model perform slightly worse, except when trained with FedAdagrad (see Table 6). The reason for this is that FedAdagrad adapted the gradients in the direction of a better local minimum faster in the Non-IID space than in the IID setting. In the BERT experiments, it can be observed that the model trained with FedAvg shows the highest performance for the downstream task, with a score of 0.8228. Again, the best model performance for Non-IID texts is achieved with FedAdagrad.

The macro F1 results of the comparison experiments for the Portuguese news reports are shown in Table 7. It can be observed that DistilBERT models federated with FedAvg and FedAvgM achieve the same model performance (0.8173) for the IID case. Interestingly, FedAvg also achieves the highest macro F1 score of 0.7809 for the Non-IID setting. One reason for this is that the other FL algorithms still form an arithmetic weighted average over the local models at their core. For instance, FedAdagrad merely adds an EMA to FedAvg to track past gradients (see Eq. 3 on page 8) and still performs FedAvg-style model aggregation. However, it is worth mentioning that the macro F1 scores in Table 7 for the Non-IID settings are all very close and differ only slightly. The same is the case for the federated BERT models, with FedYogi performing best by a small margin in the IID setting, achieving a macro F1 score of 0.7948.

For Spanish news reports, on the other hand, larger differences in model performance between experiments are observed, as displayed in Table 8. Although DistilBERT trained with FedAvg again shows the highest model performance, with a macro F1 score of 0.7128 for the IID case, for the Non-IID setting FedAdam clearly beats FedAvg (0.7270 vs. 0.6833). The reason could be that FedAdam uses the history of the gradients, giving more weight to recent gradients while down-weighting past gradients in the federated space, ensuring that the global model converges to an optimal solution. However, for the federated BERT models, FedAvgM shows the highest model performance (0.7326 for IID; 0.7385 for Non-IID). It is also conspicuous that, for FedAvgM, the differences between IID and Non-IID are marginal for BERT and identical for DistilBERT. For this special case, the underlying data distribution seems to play a rather negligible role.
It is worth noting that the highest macro F1 score is never achieved by LMs trained with FedAdam, except in the case of Spanish Non-IID texts (see Table 8). This is because, in non-convex settings, a local minimum is found more quickly by FedAdam, but a worse one than with FedYogi (see Eq. 4 on page 8). Quite high and stable model performances are shown across all languages by the LMs trained with FedYogi in the IID and Non-IID settings. This is mainly due to the additive model updates and the EMA with 2nd-moment estimates introduced with FedYogi. This was also confirmed by the comparative analysis of the FedOpt study [22], where it was found that a more stable global model can be created when FedYogi is used instead of FedAdam.
In general, the global LM performance seems to be more significantly impacted by the data distributions themselves. For instance, a macro F1 difference of 0.0658 between the IID and Non-IID settings for English news reports can be observed in the evaluation of the BERT model trained with FedAdam (see Table 6 on page 8). An improved global model performance is seemingly achieved by FedAdam in almost all Non-IID settings compared to FedAdam applied in IID settings. This is somewhat surprising, as the assumption is that better results should always be attained in IID settings than in Non-IID settings when using the same FL algorithm and model architecture. However, it is likely that the Non-IID data allows the FedOpt algorithms to converge faster due to adaptive optimization [22], thereby enabling them to find a suitable local minimum closer to the global minimum earlier than FedAvg and FedAvgM.

B. HOW GENERALIZABLE ARE FEDERATED TRANSFORMER MODELS FOR TEXTS IN LANGUAGES THAT ARE NOT PART OF THE FINE-TUNING PROCESS? (RQ 2)
It is known from the CASE 2021 dataset (see Subsection III-B on page 5) that the protest news in Hindi are not included in the training dataset but are present in the test data. Thus, the event texts in Hindi are used in our test data to validate the generalization capability of the federated model when the language is not part of the LM fine-tuning process. Acceptable results for Hindi texts with DistilBERT are achieved only when DistilBERT is trained with FedYogi, whereas DistilBERT trained with the other FL algorithms exhibits significantly inferior performance compared to federated BERT (see Table 9). For example, a macro F1 score of only 0.4834 is achieved by DistilBERT trained with FedAdagrad for Non-IID texts in Hindi, which is worse than random guessing on average. In this case, the BERT model exhibits superior performance compared to the DistilBERT model across all FL algorithms. From Table 9, it can be observed that the other FL algorithms are outperformed by FedYogi for the federated DistilBERT in IID and Non-IID settings, with macro F1 scores of 0.7296 and 0.7147, respectively. The baseline DistilBERT (see Table 5 on page 9) is outperformed by more than a factor of two by the federated DistilBERT trained with FedYogi in the IID and Non-IID settings in terms of macro F1 scores (0.3108 vs. 0.7296). Under certain conditions, the baseline LM can thus be outperformed by the global LM if a suitable FL algorithm is used. It is possible that the semantic space of the news texts in Hindi has been learned by the federated DistilBERT with FedYogi through zero-shot learning, a concept where samples of unseen classes that were not part of the training process are correctly classified [54]. However, this should be taken with a grain of salt, because some general Hindi texts were already included in the text corpus during the pre-training of the multilingual DistilBERT and BERT models [55], and no explicit zero-shot learning technique was applied in this work. In some experiments, DistilBERT trained with FedAvgM achieved a macro F1 score of less than 0.5 (e.g., 0.3651 for IID texts, see Table 9 on page 10). This was because no suitable local minima were discovered in the federated search space during training, which can lead to a poorer model inference quality in FL. Moreover, it is worth mentioning that a different data distribution of the clients can lead to different prediction results as well.
Significantly smaller macro F1 score differences between the individual FL algorithms for the Hindi test data are observed for the federated BERT model (see Table 9 on page 10). The highest macro F1 score of 0.7859 for predictions on Hindi texts in the Non-IID setting (baseline 0.6871, see Table 5 on page 9) is achieved by the BERT model when trained with weighted FedAvg. The best federated BERT model performance in the IID setting for Hindi texts is achieved with FedYogi and a macro F1 score of 0.7420. Therefore, in classifying Hindi texts, federated DistilBERT is consistently outperformed by the federated BERT model. Acceptable model performance for the Hindi language can be achieved by both federated LMs when they are trained with advanced FL algorithms such as FedYogi. Considering the significantly higher communication cost associated with the federated BERT model, it can be concluded that opting for the federated DistilBERT model is more cost-efficient for federated NLP applications.

C. WHICH FL ALGORITHMS PERFORM BEST AND ARE THE MOST STABLE IN DETECTING PROTEST NEWS IN ALL FOUR LANGUAGES? (RQ 3)
As shown in Table 10, weighted FedAvg and FedYogi are the most stable FL algorithms across all four languages on the protest news dataset. The baseline DistilBERT (see Table 5 on page 9) is outperformed by the federated DistilBERT model, with an averaged macro F1 score of 0.7524, when FedYogi is used as the aggregation algorithm (improvement over the baseline model for IID: 0.0265, and for Non-IID: 0.0231). For the federated BERT model, it is observed that weighted FedAvg outperforms the other FL algorithms in the IID setting, with an average macro F1 score of 0.8015. Conversely, the highest averaged macro F1 score for the Non-IID setting (0.7903) is achieved by BERT trained with FedAvgM, which even outperforms the BERT model trained with FedYogi. One reason for this is that a more suitable local minimum is found by FedAvgM than by the standard FedAvg and FedYogi due to the presence of the global momentum parameter β (see Eq. 2 on page 7).

TABLE 10. Weighted averaged macro F1 scores from all four language comparisons. The weighting is based on the number of evaluation samples in each language (see Table 3 on page 5).
However, it can be observed from the comparison tables that, for the news reports in English, Portuguese, and Spanish, in most cases the best or similar macro F1 scores are achieved by weighted FedAvg [11] compared to the more advanced FL algorithms (e.g., see Table 7 on page 9). This observation can be justified by the fact that all FL algorithms perform an arithmetic average of model updates as a primary component, and the advanced FL algorithms are essentially small extensions of the FedAvg equation (see Eq. 1 on page 7). Other FL algorithms, such as Ditto [38] or FedNova [56], which are not the focus of this study, incorporate additional techniques (e.g., regularization or normalization). Nevertheless, a weighted average of model updates is also used as the main component by these FL algorithms [38], [56]. This shows that the advanced FL algorithms investigated in this study are mainly optimizations of average-based model aggregation, and therefore the prediction quality does not differ much in most experiments.
In a similar study [34], a test accuracy of 55.11% on the 20news proxy dataset [57] was achieved by training federated DistilBERT and BERT models with FedOpt algorithms. However, the results of this study show that a significantly higher accuracy can be achieved by federated training of these LMs with FedYogi, despite the highly Non-IID training data of the clients (e.g., 77.89%, an improvement of 22.78 percentage points, as seen in Table 10). The global model quality can also be significantly influenced by the choice of federated hyperparameters, such as the number of aggregation rounds and the number of clients in an FL system [28].
For the protest news classification in this study, a similar or partially improved model performance compared to centralized baseline LMs can be achieved by leveraging FL, as shown by the obtained prediction results (see Table 10 on page 11). These results indicate that FL in the NLP domain is a cost-effective and privacy-preserving analysis method for distributed high-volume text data. Furthermore, our comparative analysis reveals that the negative impact on model performance, compared to centralized learning, is mitigated by the use of FL on pre-trained LMs, irrespective of the selected LM architectures. Moreover, when considering an FL setting with a small number of clients and sufficient balanced training data in each client, there is also minimal disparity observed between weighted FedAvg and the more novel FL algorithms. However, for clients with highly Non-IID texts (see Figure 4 on page 6), the general recommendation based on the study results is that FedYogi should be used for the federated training of LMs and LLMs, as it shows the most stable model performance among the compared FL algorithms for both data distribution settings.
Thus, the comparative performance analysis shows that there are two main factors in an FL system that primarily influence the global LM inference quality: the underlying distribution of the training data between the clients and the proper choice of an FL algorithm. However, based on the results of this study, the data distribution has a bigger impact than the FL algorithm. Another reason for the relatively small macro F1 score differences can be attributed to the pre-training of DistilBERT and BERT [35].
This work also demonstrates that FedYogi is the most robust of the compared FL algorithms and provides the highest utility on Non-IID texts in federated NLP applications. However, all FL algorithms have in common that an average function is used as an aggregation strategy. We think that a different aggregation function, such as a min-max or rotation approach between local models, could lead to comparable or even improved model performance.
Regarding the convergence behavior, which was not the main focus of this study, it was noticeable that the global model in the test FL system reached a state of equilibrium quite quickly after a few federated training rounds. Here, we also assume that this can be attributed to the extensive pre-training of BERT and DistilBERT, which is also confirmed in other works [32], [33]. However, further research on the effects and use of pre-trained LMs in FL systems is required.

V. CONCLUSION AND FUTURE WORK
The challenges of statistical heterogeneity in adverse training data distributions between clients and the effective aggregation of local models represent pivotal issues within the context of Federated Learning (FL), which need to be addressed to guarantee an optimal global model performance in an FL system. This article investigates both challenges and compares the standard weighted Federated Averaging (FedAvg) strategy with novel FL algorithms for solving a binary text classification task using socio-political and crisis event news texts in English, Spanish, and Portuguese distributed over three individual clients. The aim of this task is to detect whether the multilingual news reports describe political events, such as protests and crises, or non-political events. A customized partitioning strategy to split these protest news into Independent and Identically Distributed (IID) and Non-IID texts is applied. Additional news reports in the Hindi language are used for evaluating the generalizability of the global model for predictions with texts in a language that was not part of the federated fine-tuning process.
For this study, the pre-trained Bidirectional Encoder Representation from Transformers (BERT) and DistilBERT Language Models (LMs) are fine-tuned with FedAvg, Federated Averaging with Server Momentum (FedAvgM), FedYogi, FedAdam, and FedAdagrad in IID and Non-IID settings on 9447 individual news reports. Both fine-tuned LMs are federated in an FL system with three clients and compared against centralized baseline LMs on a test dataset of 2631 news reports. The comparison results show that both federated LMs can achieve a similar prediction quality as the baseline LMs. For instance, the baseline DistilBERT achieves a macro F1 score of 0.8192 for classifying English news reports, whereas DistilBERT trained with FedAvg achieves a similar score of 0.8168 in the IID setting (0.8039 for Non-IID). While standard FedAvg generally achieves a stable model prediction quality, particularly in IID settings, FedYogi shows the most stability when dealing with Non-IID news reports across all four languages, achieving an average macro F1 score of 0.7755 for Non-IID settings. This demonstrates that FedYogi is a worthy FL algorithm candidate for any federated Natural Language Processing (NLP) task with adverse distributions of the training data between the clients.
In future work, we plan to focus on exploring the properties and generalizability of the identified stable FL algorithms on other larger LMs, and how adverse distributions of training data between clients can be mitigated to increase the robustness and stability of the global model.This could support the development of efficient FL frameworks specifically tailored for the federated training of (large) LMs, thus advancing the limited FL field in the NLP domain.

FIGURE 1. Overview of the conducted research steps.

FIGURE 2. FL system with clients holding sensitive training data.

FIGURE 3. Data partitioning for the IID setting in three clients based on the number of training samples.

FIGURE 4. Data partitioning for the Non-IID setting in three clients based on the number of training samples.

TABLE 4. Parameters and functions used for model training and evaluation.
PASCAL RIEDEL received the bachelor's degree in business informatics from the University of Applied Sciences at Neu-Ulm, Ulm University of Applied Sciences, and the master's degree in information systems from the Ulm University of Applied Sciences. He is currently pursuing the Ph.D. degree with Ulm University. He is involved in topics related to the examination of federated learning systems at Ulm University. His research interests include federated learning, deep learning, natural language processing, and computer vision.

MANFRED REICHERT received the Ph.D. degree in informatics from Ulm University. He is currently a Professor of databases and information systems with Ulm University, where he is the Director and the Dean of Studies with the Computer Science Department, Faculty of Engineering and Computer Science. His research interests include the development of innovative technologies in business process management, mobile processes and services, and knowledge-intensive processes.

REINHOLD VON SCHWERIN received the Ph.D. degree in computational science from Heidelberg University. He is currently a Professor of data science, machine learning, and artificial intelligence with the Ulm University of Applied Sciences. He is on the Board of Directors of DASU-Transferzentrum für Digitalisierung, Analytics and Data Science Ulm, which strives to strengthen the ties between academia and industry in the fields of data science, ML, and AI.

ALEXANDER HAFNER received the bachelor's degree in information systems and computing from the Ulm University of Applied Sciences and Edinburgh Napier University, and the master's degree in computer science from Leipzig University. He is currently pursuing the Ph.D. degree, with a focus on the impact of trust on machine learning algorithms. His research interests include bio-inspired algorithms and heuristic optimization methods.

DANIEL SCHAUDT received the B.Sc. degree in economics from the Karlsruhe Institute of Technology (KIT), and the M.Sc. degree in business informatics from the Aalen University of Applied Science. He is currently pursuing the Ph.D. degree with the Ulm University of Applied Sciences, where he is involved in AI topics related to medical imaging, synthetic data generation, and explainability.

GAURAV SINGH received the B.Tech. degree in computer science and engineering from the Indian Institute of Information Technology Vadodara, India, in 2022. He has been a Data Scientist with S&P Global, Gurgaon, since August 2022. His research interests include federated learning, deep learning, natural language processing, and graphical neural networks.

TABLE 1. Overview of contributions and limitations of the related work.

TABLE 2. Two English news text excerpts with labels.

TABLE 3. Sample counts of data in training and testing sets for each language.

TABLE 5. Macro F1 comparison of centralized multilingual DistilBERT and BERT models for all four languages.

TABLE 6. Macro F1 comparison of federated multilingual DistilBERT and BERT models for English protest news in IID and Non-IID settings.

TABLE 7. Macro F1 comparison of federated multilingual DistilBERT and BERT models for Portuguese protest news in IID and Non-IID settings.

TABLE 8. Macro F1 comparison of federated multilingual DistilBERT and BERT models for Spanish protest news in IID and Non-IID settings.

TABLE 9. Macro F1 comparison of federated multilingual DistilBERT and BERT models for Hindi protest news in IID and Non-IID settings.