Empirical Measurement of Client Contribution for Federated Learning With Data Size Diversification

Client contribution evaluation is crucial in federated learning (FL) to effectively select influential clients. In contrast to data valuation in centralized settings, client contribution evaluation in FL faces a lack of data accessibility and, consequently, difficulty in stably quantifying the impact of data heterogeneity. To address this instability, we introduce an empirical method, Federated Client Contribution Evaluation through Accuracy Approximation (FedCCEA), which exploits data size as a tool for client contribution evaluation. After several FL simulations, FedCCEA approximates the test accuracy from the sampled data sizes and extracts the client contribution from the trained accuracy approximator. In addition, FedCCEA grants data size diversification, which reduces the massive accuracy variation that game-theoretic strategies suffer from. Several experiments show that FedCCEA strengthens robustness to diverse heterogeneous data environments and the practicality of partial participation.


I. INTRODUCTION
Federated Learning (FL) [1], [2], [3], [4] is an emerging field in distributed machine learning. It aggregates the different models of clients in distributed systems without accessing their data. Research on FL focuses on various approaches that attempt to reach performance similar to that of centralized, optimal models.
In a data-centric approach, FL considers data quality by measuring contribution at the client level. Client contribution is generally defined as the impact of each client's dataset on the federated model performance. Measuring client contribution serves two specific aspects of improving the federated model.

A. CLIENT SELECTION
Regarding deep learning models, not all data have equal value [5]. Thus, preserving high-quality data and discarding low-quality data is a prerequisite to training a high-performing deep learning model [6]. Similarly, not all clients contribute equally in federated settings [7], [8], [9], [10]. Closely monitoring these clients and measuring the contribution of each client is necessary to select influential clients and remove unneeded ones.

(The associate editor coordinating the review of this manuscript and approving it for publication was Thomas Canhao Xu.)

B. INCENTIVE ALLOCATION
Economically, client contribution is a suitable standard for allocating incentives fairly, while maximizing profit [11], [12], [13], [14], [15]. A proper incentive allocation with client contribution may motivate high-contributors to actively participate in FL, where the amount of high-quality data in each client affects the model accuracy. This incentive mechanism may facilitate the efficient management of revenues and costs in a business system with a highly-performing federated model by the central server (or coordinator).
Then the question is, "how do we evaluate the client contribution in the FL setting?" Unfortunately, a view different from data valuation in centralized learning is required.
In centralized data environments, the data value is measured based on data characteristics, such as the presence of samples with uncommon features [16], [17] and the presence of data corruptions [6], [18]. In contrast to centralized data environments, evaluating the contribution in FL environments with the actual data characteristics is impossible.
In particular, the central server cannot measure the exact impact of the data characteristics (or data heterogeneity) in FL environments. First, the blockage of accessing data restricts the server from analyzing the data heterogeneity of each local dataset [19]. This indicates that the server cannot directly seek and estimate the data distribution among the clients or the fraction of noisy data. Only local gradients, weights, and data sizes can be obtained from individual clients during the aggregation. Thus, the server can evaluate contribution only by this limited information.
Moreover, the impact of data distribution, noise, and data quantity is not as clear as in centralized settings because it strongly depends on the combinations with other clients. As shown in Fig. 1, a single client can be a low contributor in some combinations (Fig. 1(a)) that update the global weight away from the optimal weight. On the contrary, it can also be a high contributor in different combinations (Fig. 1(b)) that update the global weight closer to the optimal weight. This double-faced combinatorial result causes an unstable impact of data heterogeneity on the federated model performance.
From early explorations, the Shapley Value [8], [20], a game-theoretic evaluation method, predicts the overall combinatorial impact of clients on performance by averaging the marginal test accuracy over all possible client subsets including and excluding a client, as shown in Fig. 2. Although it is a theoretically well-structured evaluation method, client contribution measurement by the Shapley Value faces extreme accuracy fluctuations for some combinations in heterogeneous data environments. These drastic combinatorial effects result in unstable client contribution estimates.

To make a stable and precise quantification of the impact of data heterogeneity, we introduce a novel, empirical evaluation of client contribution using data size. Federated Client Contribution Evaluation through Accuracy Approximation, or FedCCEA, predicts client contribution through a deep learning model named the accuracy approximation model (AAM). Contrary to previous studies that only considered full or no data use, FedCCEA diversifies the proportion of data used in every round to stabilize the client contribution measurement. This data size diversification strengthens robustness in real-world decentralized settings and even allows contribution measurement under partial participation with a free choice of data size. We demonstrate these strengths through experiments on three public image sets [21], [22], [23] and different data distribution settings.
This study provides the following three main contributions:
1) To the best of our knowledge, this is the first empirical method that allows partial participation of clients and exploits data size sets for client contribution evaluation in FL.
2) We empirically measure client contribution through deep learning models with diversified data size combinations to make a stable contribution evaluation in any data setting.
3) We conduct extensive experiments on three public image sets in real-world environments, such as non-IID and data-corrupted settings, and empirically analyze the robustness to diverse heterogeneous data situations and the practicality of data size selection.

II. RELATED WORKS
A. DATA VALUATION

Data Valuation, a concept similar to Client Contribution Evaluation, has been widely studied recently to improve centralized machine learning models and to explain black-box predictions. The leave-one-out (LOO) method [30], [31] and the influence function [30], [31], [32] measure the counterfactual of a batch and verify whether the performance changes owing to that batch. However, these perturbation-based methods perform poorly; for example, two identical but influential points each receive a low value, because removing one leaves the other in place. The Shapley Value [33], a classical concept in cooperative game theory, is on the rise in machine learning to tackle the poor performance of LOO. In contrast to LOO, Data Shapley [20] compares all possible training data combinations in which a single datum is included and excluded. Moreover, several efficient approximations of the actual Shapley Value, such as Monte-Carlo SV and gradient-based SV [20], attempt to reduce the computational inefficiency that exact Data Shapley suffers from. However, the high computational complexity of data valuation remains an issue: exact Data Shapley costs O(2^N), and Monte-Carlo SV costs O(N log N).
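As a rough illustration of the permutation-based Monte-Carlo approximation described above, the sketch below averages marginal utilities over random permutations instead of enumerating all 2^N subsets. The `utility` function here is a toy additive game standing in for a trained model's test accuracy; all names are illustrative assumptions.

```python
import random

def monte_carlo_shapley(players, utility, num_permutations=200, seed=0):
    """Approximate Shapley values by averaging each player's marginal
    utility over random permutations of the player set."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        perm = players[:]
        rng.shuffle(perm)
        coalition = []
        prev_u = utility(coalition)
        for p in perm:
            coalition.append(p)
            cur_u = utility(coalition)
            values[p] += cur_u - prev_u   # marginal contribution of p
            prev_u = cur_u
    return {p: v / num_permutations for p, v in values.items()}

# Toy additive utility: each player's weight equals its exact Shapley value.
weights = {"A": 0.5, "B": 0.3, "C": 0.2}
est = monte_carlo_shapley(list(weights), lambda c: sum(weights[p] for p in c))
```

For an additive game the estimate matches the exact values; for a real federated utility, `utility(coalition)` would require retraining, which is the computational burden the approximations try to reduce.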
Subsequently, empirical methods of data valuation have been introduced as alternatives to theory-based data valuation. Data valuation using reinforcement learning (DVRL) [34] is a meta-learning framework that jointly learns the data value and trains the primary model using reinforcement learning. This method robustly approximates data values, even for low-quality datasets or samples from other domains. Moreover, it achieves high performance in machine learning tasks by removing low-valued samples.

B. CLIENT CONTRIBUTION EVALUATION FOR FL
In addition to model-centric approaches that focus on FL optimization [35], [36], [37], [38], the client contribution evaluation in our study is a data-centric solution to the client-drift problem of FedAvg [2]. The server provides more credit to major clients and less credit to minor clients.
Despite advances in data valuation methods in centralized machine learning, only a few techniques can be applied in federated environments owing to data blockages. Local gradients, local weights, and local data sizes are the only information that the server can use as tools for client contribution evaluation [2], [19]. In particular, LOO [14], [24] and the Shapley Value [8] are applicable valuation methods in distributed systems that use local weights or gradients for client contribution evaluation. While these game-theoretic methods are time-consuming, simple approximations of LOO [24] and the Shapley Value [20], [25], [26], [39] make the client contribution calculation feasible with a theoretical base.
Subsequently, evaluation methods using weight or gradient differences were introduced. RRAFL [27] considered the directional difference between the global model and local weight vectors, assuming that a lower angle contributes more. Empirically, F-RCCE [28] and FAVOR [29] applied REINFORCE and DQN models to local weights and gradients to find the best client selection strategies for optimizing the federated model with respect to the measured client contribution.
On the other hand, the local data size has rarely been exploited for client contribution evaluation, because a large amount of data does not clearly lead to a higher contribution in federated learning when data heterogeneity exists. Previously, the local data size was defined as the client contribution for the simple construction of a DRL-based incentive mechanism [12], under strong assumptions. However, in addition to the local data size, quantification of the impact of data heterogeneity (e.g., data corruption and non-IID) is required to correctly measure the client contribution in any data environment.

III. PROPOSED METHOD
FedCCEA consists of two phases: the simulator and the evaluator. The simulator is the preparation step of the evaluation: it simulates the FL procedures to obtain the inputs (sampled data sizes) and targets (round-wise accuracies) of the AAM in the evaluator. After all FL simulations are completed, the evaluator predicts the contribution of each client through the AAM.

A. SIMULATOR
We denote n ∈ N, R ∈ N, and S ∈ N as the number of participating clients, the number of rounds per FL simulation, and the number of FL simulations, respectively. Moreover, we construct a training dataset D^(i) for each client i and denote |D^(i)| as the total data size of each client. Nevertheless, D^(i) is not used during the testing step; the simulator uses a separate test set D_t. (The location of the test set depends on the federated system design: the server side or the client side. The global model is tested wherever the test set is located.) The main task of the simulator is to implement FL simulations to obtain sets of sampled data sizes (x_{r,s}) and round-wise accuracies (acc_{r,s}). Therefore, we execute the simulator in three steps: data size diversification, a single FL iteration, and testing. The entire routine of these three steps is a single round of one FL simulation, and we repeat the R-round FL simulations S times as initially indicated.

1) DATA SIZE DIVERSIFICATION
Considering the FL environment, each client freely selects the size of its local training data for FL in each round. To observe all possible actions of clients, we expand the cases of data size selection by randomly drawing each client's proportion from the uniform distribution between zero and one: p^(i)_{r,s} ~ U(0, 1). We turn the proportion vector into a real data size vector d_{r,s} = (p^(1)_{r,s}|D^(1)|, ..., p^(n)_{r,s}|D^(n)|) for use in the single FL iteration step. Then, to store the data size vector in a normalized form for evaluation, we calculate the standard data size |D̄| = (1/n) Σ_i |D^(i)| and determine the scaled data size vector x_{r,s}:

x_{r,s} = d_{r,s} / |D̄| = (d^(1)_{r,s}/|D̄|, ..., d^(n)_{r,s}/|D̄|)   (1)

This scaled data size vector x_{r,s}, also defined as the sampled data size, accelerates the convergence of the AAM and allows comparison of data sizes between clients.
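One round of data size diversification can be sketched as follows; the function name and the integer rounding of the drawn sizes are illustrative assumptions.

```python
import random

def diversify_data_sizes(full_sizes, rng):
    """One round of data size diversification: each client draws a
    proportion p ~ U(0, 1) of its full local data size, and the drawn
    sizes are scaled by the mean full data size |D-bar| to obtain the
    normalized sampled data size vector."""
    n = len(full_sizes)
    proportions = [rng.random() for _ in range(n)]             # p_i ~ U(0, 1)
    d = [int(p * s) for p, s in zip(proportions, full_sizes)]  # real sizes
    standard = sum(full_sizes) / n                             # |D-bar|
    x = [di / standard for di in d]                            # scaled sizes
    return d, x

rng = random.Random(42)
d, x = diversify_data_sizes([600, 600, 600, 600], rng)
```

Each entry of `x` is comparable across clients because all sizes are divided by the same standard data size.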

2) SINGLE FL ITERATION
The subsequent step is a one-round FL classification task. The global model is a neural network, such as an MLP or a CNN, as hypothesized in this study. During this step, the central server renews the global model parameter θ^G_r based on the sampled data sizes of the clients. Each client i updates its local model weights θ^(i)_{r,s} using d^(i)_{r,s} samples of its dataset, which is its actual training size in this round. Thereafter, the central server aggregates the local model weights using the FedAvg algorithm. Based on the information obtained from the clients, FedAvg is reformulated as follows:

θ^G_r = Σ_i (d^(i)_{r,s} / Σ_j d^(j)_{r,s}) θ^(i)_{r,s}   (2)
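A minimal pure-Python sketch of the aggregation step, weighting each client's local weight vector by its sampled data size for the round; the flat weight vectors are a simplification of real model parameters.

```python
def fedavg(local_weights, sampled_sizes):
    """FedAvg aggregation: the global weight vector is the average of
    the local weight vectors, weighted by each client's sampled data
    size d_i / sum_j d_j for this round."""
    total = sum(sampled_sizes)
    dim = len(local_weights[0])
    global_w = [0.0] * dim
    for w, d in zip(local_weights, sampled_sizes):
        coef = d / total
        for j in range(dim):
            global_w[j] += coef * w[j]
    return global_w

# Two clients with sampled sizes 300 and 100: the first gets weight 0.75.
g = fedavg([[1.0, 0.0], [0.0, 1.0]], [300, 100])
```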

3) TESTING
After a single FL iteration, we evaluate the current performance of the federated model using the separate test set D_t. The metric obtained in this step is acc_{r,s}, defined as the round-wise accuracy up to round r. These three steps are repeated until the final round R, at which point a single FL simulation ends. The sets of round-wise accuracies obtained from the S FL simulations are used, together with the sampled data sizes, for accuracy approximation in the evaluator phase.

B. EVALUATOR
The evaluator phase is independent of the simulator; nonetheless, it plays a substantial role in predicting client contribution from a learned accuracy approximation model. This phase begins after the simulator phase is completed to ensure that the evaluator can obtain all stored results from the simulator.

1) ACCURACY APPROXIMATION
The AAM is a regression model that predicts round-wise accuracy using the sampled data size. This model approximates the federated test accuracy of the current round r using the sampled data size sets until the current round. In this model, the sampled data size in the previous rounds also affects the round-wise accuracy. For instance, the accuracy in round r is also affected by the data size set in rounds 1 to r − 1. Therefore, the sampled data size set can be analyzed as time series data.
We design the AAM as a combined framework of linear regressions and a time series model using several distinctive tools, forming g : R^{n×R} → [0, 1], as shown in Fig. 4. This model allows the sampled data sizes to be considered sequentially by organizing static-sized inputs with zero-padding of future actions. In addition, shared weights enable quantification of the averaged impact of each local dataset, considering a certain data size for each client over the overall rounds.

a: ZERO-PADDING
An input vector of the AAM, Ξ_r ∈ R^{n×R}, is constructed from the experienced sampled data size sets. Considering the static n × R shape, we list the sampled data size sets before the current round r and then zero-pad the rest of the space, which corresponds to the actions after round r. For each sample, we assume that the federated model is continually trained until round R with clients not participating in FL after round r.

b: CLIENT CONTRIBUTION MODULE

The focal point of the architecture is the first layer of the AAM, which contains shared weights ω ∈ R^n. Shared weights are widely used in CNN architectures as convolution filters to extract the local features of image data. Similarly, the client contribution module is structured with round-wise shared weights to extract the averaged impact of the data heterogeneity of each client.
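The zero-padding and shared-weight steps can be sketched together as below; `build_input` and `client_contribution_module` are hypothetical names for illustration, and the list-of-lists matrix stands in for the n × R input tensor.

```python
def build_input(size_history, n, R):
    """Zero-pad the observed per-round sampled data size vectors into a
    static n x R matrix: columns 1..r hold observed sizes, columns
    r+1..R are zeros (clients assumed not to participate after round r)."""
    xi = [[0.0] * R for _ in range(n)]
    for r, sizes in enumerate(size_history):   # r < R
        for i in range(n):
            xi[i][r] = sizes[i]
    return xi

def client_contribution_module(xi, omega):
    """Round-wise shared linear weights: X_r = sum_i omega_i * xi[i][r],
    the same weight vector omega applied to every round (column)."""
    n, R = len(xi), len(xi[0])
    return [sum(omega[i] * xi[i][r] for i in range(n)) for r in range(R)]

# Two clients observed for 2 of 4 rounds; rounds 3-4 are zero-padded.
xi = build_input([[1.0, 0.5], [0.2, 0.8]], n=2, R=4)
X = client_contribution_module(xi, omega=[0.6, 0.4])
```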

c: APPROXIMATOR MODULE
The remaining layers of the AAM are constituted as a many-to-one time series architecture, f : X → [0, 1], which returns a single approximated accuracy. The concatenated vector X = (X_1, X_2, ..., X_R), originating from the linear regressions X_r(Ξ_r; ω), is the input of these layers. While X indicates the latent impact of all clients in each round with a given data size set, the approximator module may be closely related to the round-wise impact with the given latent variables. Any type of sequence model is possible in this module; nonetheless, the model must approximate the test accuracy well. For example, regarding task difficulty, we use a simple MLP [40] for the MNIST and CIFAR-10 classification tasks and an LSTM [41] framework for the EMNIST classification task. Hence, we formulate the corresponding optimization problem of the AAM as:

min_Θ sqrt( (1/(S·R)) Σ_s Σ_r ( acc_{r,s} − f(X_1(Ξ_1; ω), ..., X_R(Ξ_R; ω)) )² )   (4)

In Eq. 4, the latent values (X_1, ..., X_R) are the outputs of the client contribution module (X_1(Ξ_1; ω), ..., X_R(Ξ_R; ω)) with the given data size samples Ξ, where ω ∈ Θ refers to the shared weight vector in the client contribution module. In addition, f(X_1, ..., X_R) is the approximator module with the remaining weight vector Θ∖ω, which predicts the round-wise accuracy acc. We use the root mean squared error as the loss term of the AAM.
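The RMSE objective can be sketched as follows. The clipped-mean `f` is a hypothetical placeholder for the approximator module (not the MLP/LSTM used here), and `samples` pairs each zero-padded input matrix with its observed round-wise accuracy.

```python
import math

def aam_rmse(samples, omega, f):
    """Root mean squared error between observed round-wise accuracies
    and the AAM's predictions f(X_1, ..., X_R), where each latent X_r
    is the shared-weight regression over the input matrix xi."""
    errs = []
    for xi, acc in samples:
        n, R = len(xi), len(xi[0])
        X = [sum(omega[i] * xi[i][r] for i in range(n)) for r in range(R)]
        errs.append((acc - f(X)) ** 2)
    return math.sqrt(sum(errs) / len(errs))

# Toy approximator: mean of the latent round values, clipped to [0, 1].
f = lambda X: min(1.0, max(0.0, sum(X) / len(X)))
loss = aam_rmse([([[1.0, 1.0]], 0.5)], omega=[0.5], f=f)
```

In practice both ω and the approximator's own weights would be trained jointly by minimizing this loss over all (r, s) samples.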

2) CLIENT CONTRIBUTION MEASUREMENT
Because quantifying the exact impact of data heterogeneity for each client is unfeasible in FL, we indirectly predict the latent impact of data heterogeneity in the client contribution module. The shared weight vector ω represents the importance of the data size set to the latent variables X_r. We interpret ω^(i) as the averaged impact of data heterogeneity for client i when x^(i) = 1. This index may indirectly include the combinatorial impact of the data distribution among clients and of the noise fraction on the federated model performance. In addition, the data size is an essential element that affects the performance of the federated model. Thus, the client contribution is formulated as the predicted latent impact of data heterogeneity (ω^(i)) weighted by the given data size (x^(i)):

v_i = ω^(i) · x^(i)   (5)
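The contribution rule described above (learned latent impact weighted by the given data size) can be sketched as a one-liner; the function name is an illustrative assumption.

```python
def client_contributions(omega, scaled_sizes):
    """Client contribution as the learned latent impact of data
    heterogeneity (shared weight omega_i) scaled by the client's
    given data size: v_i = omega_i * x_i."""
    return [w * x for w, x in zip(omega, scaled_sizes)]

# A negative shared weight yields a negative raw contribution,
# later clipped to zero by the CCI normalization.
v = client_contributions([0.8, 0.2, -0.1], [1.0, 1.0, 1.0])
```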

IV. EXPERIMENTS
In this section, we want to answer the following questions: 1) How does accuracy variation occur in the Shapley Value evaluation and how does FedCCEA address this problem? 2) Is FedCCEA evaluation accurate even in the strong non-IID and noisy environments? 3) Is FedCCEA evaluation accurate even with partial participation? To answer each question, we design (i) an accuracy variation comparison, (ii) a client removal test, and (iii) a client removal test for partial participation. Moreover, we conduct additional client removal tests and experiments for complexity analysis with different numbers of clients.

A. BASIC EXPERIMENTAL SETTINGS

1) BASELINE EVALUATION METHODS
We answer the above questions and demonstrate the strengths of FedCCEA by comparing it to three baseline evaluation methods from recent studies.
• RoundSV [8] approximates the Shapley Value of each client in every round and accumulates the round-wise values into an overall contribution.
• Fed-Influence in Accuracy (FIA) [24] is a Fed-Influence measurement metric that measures influence by investigating the effect of removing a single client. The actual FIA value can be obtained from the results of the leave-one-out test.
• RRAFL [27] measures contribution by the cosine similarity between the final global weight vector and the current local weight vectors.

2) DATA DISTRIBUTION SETTINGS
We design diverse data distribution settings to answer these three questions. We diversify the degree of data heterogeneity based on the number of classes contained in each client and the presence of label noise. The detailed statistics of the data distribution settings are presented in Table 2. Specifically, we define the Earth Mover's Distance (ρ) [42], [43] between the distribution over classes on each client and the population distribution as the overall degree of IIDness. For Experiment 1 in Section IV-B, we construct setting C.1 by assigning all data of the selected classes (|P| = 2) to five clients (|C| = 5). This results in a high mean of ρ among the clients. Moreover, 40% label noise is injected into client A for more extreme data heterogeneity. This setting is used to empirically observe the unstable combinatorial effect of a client and the extreme accuracy variations of game-theoretic methods under non-IID.
For Experiments 2 and 3 in Sections IV-C and IV-D, 20 clients are constructed with a limited number of classes (|P|) from the MNIST, EMNIST, and CIFAR-10 datasets. Settings in which each client has all classes in an identical distribution are defined as IID. On the contrary, settings in which each client has half or a few of the classes are defined as weak and strong non-IID, respectively. The mean of ρ increases as the degree of non-IID increases. Furthermore, for the weak and strong non-IID settings of each dataset, we assign 40% label noise (ξ) to four clients (|C_ξ| = 4).
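The non-IID degree and the label-noise injection can be sketched as below. The L1 distance between class distributions is one common instantiation of the EMD over discrete class labels and is an assumption here, as are the function names.

```python
import random

def emd(client_dist, population_dist):
    """Degree of non-IIDness: distance between a client's class
    distribution and the population distribution (L1 form)."""
    return sum(abs(p - q) for p, q in zip(client_dist, population_dist))

def inject_label_noise(labels, num_classes, fraction, rng):
    """Flip a given fraction of labels to a different random class."""
    noisy = labels[:]
    idx = rng.sample(range(len(labels)), int(fraction * len(labels)))
    for i in idx:
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy

# A client holding only 2 of 4 classes versus a uniform population.
rho = emd([0.5, 0.5, 0.0, 0.0], [0.25] * 4)
noisy = inject_label_noise([0] * 10, num_classes=10, fraction=0.4,
                           rng=random.Random(0))
```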

3) FEDERATED LEARNING SIMULATION SETTINGS
Regarding the proper operation of FedCCEA, the configurations of the federated learning simulations and the federated model should be set before the simulator phase. Three- and two-layer MLPs are constructed for the MNIST and EMNIST classification tasks, respectively, while two-layer CNNs are constructed for the CIFAR-10 classification task. In addition, we implement S = 100 federated learning simulations. While 50 rounds are implemented per simulation for the MNIST and EMNIST datasets, we implement 100 rounds for CIFAR-10 to improve the model. To reduce computational costs and enhance model performance in the EMNIST and CIFAR-10 classification tasks, we increase the initial learning rate and batch size compared with the MNIST classification task.

4) CLIENT CONTRIBUTION INDEX
The evaluation methods extract client contribution values in different ranges, so we standardize these values into a unified index known as the client contribution index (CCI). The CCI measures the relative importance between clients, ranging from zero to one. Negative values outside the boundary are initially set to zero, indicating that the client does not contribute to the federated model. Denoting v_i as the client contribution value measured by a given evaluation method, we calculate the CCIs as follows:

CCI_i = max(v_i, 0) / Σ_j max(v_j, 0)   (6)
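The clip-and-normalize step above can be sketched as:

```python
def cci(values):
    """Client Contribution Index: clip negative contributions to zero
    and normalize so that indices are relative importances in [0, 1]."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [c / total for c in clipped] if total > 0 else clipped

# A negative raw contribution maps to a CCI of zero.
indices = cci([2.0, 1.0, -0.5])
```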

B. EXPERIMENT 1. ACCURACY VARIATION COMPARISON
Although the IIDness and noise proportion affect the contribution of each client, the impact of these data heterogeneity elements differs based on the companions with which the client collaborates. As shown in Fig. 6, noisy client A strongly supports the enhancement when collaborating with client C: it increases the accuracy by 9.09% compared to the combination with client B. Depending on which clients participate, data corruption and different data distributions, which are the main threats to a centralized model, can significantly enhance the federated model. Therefore, client contribution evaluations that consider only full data usage (including RoundSV and FIA) face a critical challenge of extreme performance variations in non-IID and data-corrupted environments. The actual Shapley Value and FIA are calculated as the marginal accuracy of combinations including and excluding a client. These estimates can detect the combinatorial influence of each client. Owing to the double-faced combinatorial impact, client A in Fig. 7 suffers from a wide variation in accuracy samples between combinations, with a standard deviation of 0.4883. However, the data size diversification of FedCCEA squeezes the range of accuracy samples to a standard deviation of 0.0754. The contribution of client A is measured more stably with FedCCEA than with the other methods in Setting C.1.

[Fig. 8 caption: Client removal test in the non-IID settings (W.1, W.2). Ideally, the straight line should maintain high performance while the dashed line should drop dramatically after removing clients. For simplicity, only FedCCEA and RoundSV are shown; the overall evaluation metrics are described in Table 4.]

C. EXPERIMENT 2. CLIENT REMOVAL TEST
In a federated setting, a direct precision test of client contribution is challenging: there are no exact ground-truth values of client contribution in federated environments to compare against. In addition, the combinatorial impact distracts the measurement of ground-truth values owing to the unclear linearity with data heterogeneity.

[Table 4 caption: AbC: Area between the Curves; AR: Accuracy Reversal. A higher AbC indicates a better evaluation, and an AR mark of (×) represents a good evaluation method. The last column (Best) gives the number of settings in which each method achieves the best result.]
Alternatively, the client removal test [8], [24], [34] is commonly used for precision testing of client contribution measurements. As shown in Fig. 8, we incrementally remove clients in descending and ascending order of CCIs and retrain the model. When the clients to be removed are selected correctly, the model with low-CCI clients removed (straight line) consistently retains high test accuracy, whereas the performance after eliminating the highest contributors (dashed line) decreases substantially. The two possible evaluation metrics are as follows:
• Accuracy Reversal (AR): For any proportion of client removal, the accuracy after high-CCI removal should not exceed the accuracy after low-CCI removal of the same proportion. AR should not occur for any dataset or setting.
• Area between the Curves (AbC): If the client contribution is properly measured, the difference between the accuracy after removing low contributors (acc_{low,frac}) and that after removing high contributors (acc_{high,frac}) should be large. Therefore, we measure AbC as:

AbC = Σ_frac (acc_{low,frac} − acc_{high,frac})   (7)

Fig. 8 shows that FedCCEA makes a more precise evaluation of client contribution than RoundSV. While RoundSV experiences accuracy reversals in the MNIST and CIFAR-10 settings, FedCCEA produces the expected results of a correct client contribution measurement with no accuracy reversal.
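The two metrics above can be sketched as follows, given the two accuracy curves sampled at the same removal fractions.

```python
def abc_and_ar(acc_low_removed, acc_high_removed):
    """AbC: summed gap between the low-CCI-removal and high-CCI-removal
    accuracy curves over removal fractions. AR: True if the high-removal
    curve ever exceeds the low-removal curve (an accuracy reversal)."""
    abc = sum(lo - hi for lo, hi in zip(acc_low_removed, acc_high_removed))
    ar = any(hi > lo for lo, hi in zip(acc_low_removed, acc_high_removed))
    return abc, ar

# Low-CCI removal keeps accuracy high; high-CCI removal degrades it.
abc, ar = abc_and_ar([0.9, 0.88, 0.85], [0.7, 0.6, 0.4])
```

A large positive AbC with no AR is the signature of a correct contribution measurement in this test.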
Specifically, even though RoundSV achieves a clear gap between the 'Least' and the 'Most' in the EMNIST setting (Fig. 8(b)), FedCCEA achieves a much wider gap, resulting in a higher AbC than RoundSV. In addition, evaluating client contributions for the CIFAR-10 dataset (Fig. 8(c)) is challenging; however, FedCCEA achieves consistent results with a positive AbC and no AR, whereas RoundSV clearly experiences an accuracy reversal in both settings W.1 and W.2. Table 4 presents the outstanding results of the proposed method. FedCCEA is the only method for which accuracy reversal does not occur in any of the data distribution settings we construct, whereas the accuracy of removed high contributors exceeds that of removed low contributors in some cases for the other evaluation methods. Furthermore, FedCCEA obtains the highest AbC in nine out of 13 data settings, whereas FIA and RRAFL obtain the highest AbC in three and one setting, respectively. Overall, FedCCEA shows robust measurements in most heterogeneous data environments, whereas the other baseline methods fail to achieve robustness to specific types of data heterogeneity.

D. EXPERIMENT 3. CLIENT REMOVAL TEST FOR PARTIAL PARTICIPATION
Another advantage of FedCCEA is that it can measure client contributions even if clients participate in FL partially, using only part of their local datasets. By performing the same experiment as in Section IV-C, we demonstrate the precision of client contribution even when partial participation is allowed, and thus its practicality. We (1) randomly assign the data size of each client in every round, (2) rank the CCIs in both descending and ascending order, and (3) retrain the federated model by removing a given proportion of the highest and lowest contributors in every round. Finally, (4) we compare the results to the case of FedCCEA with full data use. As shown in Fig. 9, all datasets exhibit outstanding results in the partial use case: it does not lead to accuracy reversal (AR) and provides a similar or superior area between the curves (AbC) compared to the full use case. In addition to the settings in Fig. 9, we confirm that all other data distribution settings reach a similar superiority in the partial use cases. Thus, without reevaluating client contributions, FedCCEA can obtain precise results for partial participation. This widens clients' options for data size selection in distributed systems.

E. FURTHER EXPERIMENTS

1) CONVERGENCE ANALYSIS FOR CLIENT SELECTION
From a client selection perspective, choosing influential clients and removing unnecessary ones are crucial challenges in federated learning with high data heterogeneity [44], [45]. Partial client participation in a highly heterogeneous data environment can result in slow convergence if inappropriate clients are selected for federated learning. On the other hand, efficient selection of influential clients can result in faster convergence and higher performance. Many studies have introduced client selection strategies for partial client participation using local losses [44], clustering through gradient diversity [46], and linear speedup [47].
With the same aggregation algorithm (FedAvg), contribution-based client selection strategies have similar convergence rates to each other. However, when the low and high contributors are measured correctly, client selection or exclusion strategies based on contributions can achieve faster convergence in a highly heterogeneous environment compared to uniformly random selection. As shown in Fig. 10, we exclude the four low contributors measured by each evaluation method and investigate the convergence of both the training loss and the test accuracy. The MNIST S.1 and S.2 settings, which are strong non-IID environments with and without label noise, are used in this experiment.
As a result, we find that the contribution-based client selection strategies show trivial differences in convergence speed. However, FedCCEA-based client selection consistently achieves faster convergence than random selection in both heterogeneous settings. Furthermore, the gap in convergence speed between FedCCEA and random selection becomes more substantial in the S.2 setting (Fig. 10(b)), where FedCCEA correctly captures the four poisoning attackers and excludes them from participation in federated learning.

2) CLIENT REMOVAL TEST WITH DIFFERENT NUMBER OF CLIENTS
In addition, we implement client removal tests for FedCCEA with different numbers of clients. In this experiment, we incrementally remove four clients for the 10-client FL and eight clients for the 20- and 50-client FL. Subsequently, we compare the resulting average AbC and the presence of AR.

[Table 5 caption: AbC: Area between the Curves; AR: Accuracy Reversal. As a standard for a robust evaluation method, AbC should remain positive, and AR should be marked as (×).]
Clearly, as the number of clients participating in federated learning increases, the contribution of each client becomes more difficult to evaluate. The increased number of client combinations causes more extreme accuracy variations and makes the evaluation unstable. However, as shown in Table 5, FedCCEA provides robust results for evaluation metrics with positive averaged AbC and no AR for any number of clients.

3) COMPLEXITY ANALYSIS
FedCCEA needs to repeat numerous federated learning simulations (S) to obtain sufficient samples for accuracy approximation in the evaluator phase. Thus, FedCCEA requires O(S) complexity for contribution evaluation. In contrast, the evaluation complexity of the other baseline methods depends heavily on the number of clients participating in FL: RoundSV costs O(N log N) [8], FIA costs O(N²) [24], and RRAFL costs O(N).
The empirical experiment for the complexity analysis in Table 6 shows that RoundSV, FIA, and RRAFL incur a massive cost of contribution evaluation when the number of clients increases. In contrast, the time cost of FedCCEA remains nearly constant. Therefore, when the number of clients in each round is large, the evaluation of FedCCEA is remarkably faster than that of the other baselines. However, when the number of clients is small, FedCCEA, with numerous FL simulations(S), evaluates the client contribution much slower than the other baseline methods.

V. CONCLUSION
In addition to model-centric approaches that focus on FL optimization, client contribution evaluation is another effective approach for improving FL performance. In this study, we proposed FedCCEA, an empirical measurement of client contributions that does not access local datasets. We built an accuracy approximation model to distinctively exploit the data size for accuracy approximation and to extract stable client contributions given a data size. Contrary to other evaluation methods, the data size diversification of FedCCEA reduces the accuracy variation of FL simulations and strengthens both the robustness to diverse data settings and the practicality of partial participation.