A Novel Joint Dataset and Incentive Management Mechanism for Federated Learning over MEC

In this study, to reduce the energy consumption incurred by the federated learning (FL) participation of mobile devices (MDs), we design a novel joint dataset and incentive management mechanism for FL over mobile edge computing (MEC) systems. We formulate a Stackelberg game to model and analyze the behaviors of the FL participants, referred to as MDs, and the FL service providers, referred to as MECs. In the proposed game, each MEC is a leader, whereas the MDs are followers. As a leader, to maximize its own revenue by considering the trade-off between the cost of providing incentives and the estimated accuracy attained from an FL operation, each MEC announces the total incentives for participation in each FL task, as well as the target accuracy level for each MD. The announced total incentives are allocated across the MDs in proportion to the amount of dataset each applies for local training, which indirectly affects the global accuracy of the FL. Based on the announced incentives, the MDs determine the amount of dataset used for the local training of each FL task to maximize their own payoffs, defined as the trade-off between the expected incentives and the energy consumed from FL participation. We study the economic benefits of the joint dataset and incentive management mechanism by analyzing its hierarchical decision-making scheme as a multi-leader multi-follower Stackelberg game. Using backward induction, we prove the existence and uniqueness of the Nash equilibrium among the MDs, and then examine the Stackelberg equilibrium by analyzing the leader game. As a practical concern, we also discuss extensions of the proposed mechanism in which the MDs are unaware of explicit information on the other MDs' profiles, such as the weights of the revenue, which can be redesigned as a Stackelberg Bayesian game. Finally, we show that the Stackelberg equilibrium solution maximizes the utility of all MDs and MECs.


I. INTRODUCTION
Federated learning (FL), a term coined in 2016 by McMahan et al., has emerged as a promising approach that enables mobile devices (MDs) to collaboratively build a shared global model while maintaining the privacy of their decentralized sensitive data. In FL, each MD trains a deep learning model using its own dataset and then transmits the trained model parameters (e.g., gradients or weights) to the cloud for global aggregation. The global model generated from this aggregation is returned to each MD for further updates while being used for inference. FL benefits not only the FL participants (e.g., MDs) but also the FL service provider. From the perspective of an MD, FL helps reduce privacy concerns and improve inference capabilities. From the perspective of the service provider, FL facilitates large-scale data collection from various MDs for model training without requiring a huge volume of storage and computing resources [1], [2]. Despite these benefits, however, FL participation entails energy consumption at the MDs for i) local training and ii) transmission of the trained model parameters (e.g., gradients or weights) from the MDs to the cloud. This challenges current FL approaches because such significant energy consumption makes MDs with limited batteries reluctant to participate in the FL procedure, which in turn degrades the accuracy of the model [3].
Several studies have addressed the energy consumption problem in FL by considering radio and computation resource management for MDs [4], [5]. Specifically, to reduce the energy consumption and latency caused by the long transmission distance to the cloud server, FL over mobile edge computing (MEC) systems, which utilize computing capabilities near the mobile network edge, has been actively investigated in terms of radio and computation (i.e., CPU/GPU) resource management [6], [7]. Recently, in [8], a joint computation and radio resource allocation problem was formulated by considering the energy consumption and latency trade-off, and a convex optimization-based algorithm was provided. Because the energy consumption of local training in FL increases with the amount of dataset used, dataset management at the MD is crucial for energy-efficient FL. Nevertheless, there has been surprisingly little work on dataset management for FL. To the best of our knowledge, the efforts closest to our approach are limited in scope because most conventional FL studies confine management to the MEC server, and thus the MDs fully use their given datasets without management. For instance, in [9], the Astraea framework was designed to adaptively perform data augmentation and down-sampling to handle the local imbalance problem of FL. In [10], a joint dataset management and radio resource allocation algorithm for both CPU and GPU scenarios was formulated by considering the trade-off between loss decay and latency. Because dataset management is at an early stage in FL research, contributions to energy-efficient dataset management are limited, and the orchestration of dataset management is conducted at the FL service provider under the assumption that the MDs cooperatively follow such orchestration. In practice, however, the dataset and the local training are located and conducted at the MDs, not at the FL service provider.
Because such cooperation is difficult to realize without incentives, an alternative is to formulate an incentive mechanism that captures the competition between the FL service provider and the MDs using game theory, which is extensively used to model competitive situations. A few studies have considered incentive mechanisms for the competition between FL service providers and MDs [11], [12]. However, they do not consider the dataset management issues associated with the energy consumption problem inherent to FL. Correspondingly, to the best of our knowledge, no existing study has considered joint dataset and incentive management using game theory for FL over MEC.
We propose a novel joint dataset and incentive management mechanism for FL over MEC systems to overcome these shortcomings. Specifically, we study the amount of dataset used by each MD (i.e., data down-sampling as in [9]) together with the incentive and target accuracy level set by the FL service provider. To do so, we formulate a Stackelberg game to model and analyze the behaviors of the FL participants, referred to as MDs, and the FL service providers, referred to as MECs. Under our approach, each MEC leads the competition by determining the total amount of incentives as well as the target accuracy level for the MDs, subject to a trade-off between the attained global loss decay from the local updates (e.g., the accuracy estimator used in [10]) and the payment. As the followers, the MDs then react to the actions of each MEC by independently determining the amount of dataset used, considering a trade-off between the energy consumption and the expected incentives. Proportional sharing applies to the competition among the MDs, where the total incentives provided by each MEC are allocated to them in proportion to the amount of dataset used for local training.¹ This is a reasonable assumption because the contribution to the accuracy increase of the global model is known to be proportional to the amount of dataset applied. To study the economic benefits of such an FL framework with a hierarchical decision-making structure, we conduct an analysis using the Stackelberg game framework. As a practical concern, we also discuss extensions of the proposed mechanism in which the MDs do not know the profiles of the other MDs, such as the weights of the revenue, which can be formulated as a Stackelberg Bayesian game.
The contributions of this paper are as follows:
‚ We provide proof of the existence and uniqueness of the Nash equilibrium (NE) for the follower game (i.e., among the MDs).
‚ We derive a closed-form solution of the NE for symmetric followers.
‚ Given the closed-form solution of the NE, we analyze the optimal incentive and accuracy-level strategies for the leader game. A unique Stackelberg equilibrium (SE) point is given in closed form, and the proposed approach is guaranteed to converge to the unique SE solution.
‚ As a practical concern, we discuss an extension of our scenario as a Stackelberg Bayesian game in which the MDs are assumed not to know the private profiles of the other MDs (i.e., the weights of the revenue), and address the existence of the Bayesian Nash equilibrium (BNE) for the follower game.

¹ Since information regarding the amount of dataset applied is private information held by each MD, it is important to verify and certify the amount of dataset used for reliable incentive distribution (i.e., checking that the data are not fake). Integrity mechanisms such as blockchain exist to solve this issue [13]. In addition, free-riding attacks related to this issue in FL have been actively investigated, and such countermeasures can be applied to the proposed mechanism [14], [15]. However, developing and applying these mechanisms within the proposed scheme is beyond the scope of this study.
‚ We show that a joint dataset and incentive management mechanism for the FL framework based on a well-defined utility function can maximize the payoff for all game players at the equilibrium of the game.
‚ Moreover, we evaluate the price of anarchy, which measures the inefficiency of the NE in terms of social welfare in a non-cooperative game owing to the selfish behavior of the MDs. The results confirm that the proposed NE solution of the non-cooperative game incurs only a small performance loss in terms of social welfare.

The remainder of this paper is organized as follows. In Section II, we discuss related studies. In Section III, we provide the system model of the proposed mechanism and the problem formulation used to design the utility functions of both the MDs and the MECs. In Section IV, we analyze the proposed two-level Stackelberg game and discuss extensions of the proposed scenarios as well as the implementation of the proposed approach. In Section V, we provide a performance evaluation of the proposed mechanism. Finally, Section VI gives some concluding remarks.

II. RELATED WORK
The heterogeneous nature of distributed MDs participating in the FL procedure (e.g., heterogeneous dataset, heterogeneous computing power, and networking status) makes FL research somewhat different from that of other machine learning fields [1], [2]. By taking such considerations into account, the resource management for FL should be carefully designed for optimal FL operation.

A. RESOURCE MANAGEMENT OF FL IN MEC
To alleviate the aforementioned issues, by adopting help from MEC, various resource management studies on FL have been suggested by considering three resource aspects: radio, computing, and dataset resources.
The authors of [16] first provided analytical studies on the trade-offs between computation and communication latency determined by the learning accuracy level. Based on this analytical model, an energy-efficient CPU-cycle and uplink power control problem was formulated and solved. As an extension of [16], the authors of [17] proposed an energy-efficient FL scheme that solves a joint computation and communication resource optimization problem. The proposed scheme aims to minimize the energy consumption by managing the time allocation, bandwidth allocation, power control, computation frequency, and learning accuracy simultaneously as control variables. In addition, the study in [8] introduced a novel hierarchical FL structure by leveraging both edge and cloud servers. In this novel architecture, the joint edge association and resource allocation (e.g., computation and radio resource) problems were formulated and solved to balance the trade-off between energy consumption and latency. Similar to [8], the authors of [18] proposed a joint radio resource allocation and user association algorithm to minimize the overall FL costs, including the energy consumption and latency. They formulated the problem as an integer linear optimization problem under synchronous FL participation, whereas most FL studies consider asynchronous FL participation as a baseline. The authors of [19] proposed a data-aware MD scheduling scheme that considers the data characteristics of the MDs as the key component. Recently, manipulation of the local dataset distribution and size has been actively investigated together with the management of radio resources (e.g., bandwidth, transmission power, and beam-forming configuration) and computing resources (e.g., CPU and GPU resources) [9], [10], [20]. For instance, the authors of [10] proposed a joint dataset size selection and radio resource allocation algorithm for GPU and CPU computation scenarios, respectively.
Here, the MEC server orchestrates the local dataset size and radio resources for FL participants by considering the trade-off between loss decay for global model accuracy and latency.

B. INCENTIVE MECHANISMS FOR FL
The above-mentioned resource management studies of FL in MEC assume that all MDs are willing to provide their computation resources and follow the coordination decided by the FL service provider (i.e., the MEC server or cloud). In reality, the training and transmission involved in FL participation require significant energy consumption, such that battery-limited mobile devices become reluctant to participate in the FL procedure. To alleviate this issue, there have been recent studies on incentive mechanisms to efficiently motivate MDs and compensate their contributions to FL [11]. The study in [12] formulated an incentive-based interaction between a crowdsourcing platform and MDs, where, based on the accuracy level suggested by each FL participant, the crowdsourcing platform optimally sets the incentives to maximize its utility. In [21], the authors first introduced the concept of reputation as a novel metric to quantify the trustworthiness of MDs in FL by using a multi-weight subjective logic model. A reputation-based worker selection scheme and an effective incentive mechanism based on the proposed reputation metric and contract theory were then designed and evaluated. In [22], the authors designed a deep reinforcement learning (DRL)-based incentive mechanism in an edge-cloud interworking scenario, which allows the cloud to optimally determine the incentive strategy. Here, the incentive strategy was optimized with respect to the reactions of the edge servers (e.g., the participation level for local model training).
Despite the popularity of FL, only a limited number of studies jointly consider incentive mechanisms and resource management in MEC environments. Specifically, most resource management approaches assume cooperative FL participants. In this article, we stay within the theoretical framework and analyze the behaviors of both the FL service providers and the MDs in FL over MEC under varying system parameters.

III. SYSTEM MODEL AND PROBLEM FORMULATION

A. NOTATIONS
We denote [n] = {1, 2, . . . , n} for n ∈ N. The symbol \ denotes set subtraction. For a matrix A ∈ R^{n×m}, we write A_i = (A_{i,1}, . . . , A_{i,m}) for its i-th row and A^{[j]} = (A_{1,j}, . . . , A_{n,j}) for its j-th column, for i ∈ [n] and j ∈ [m].

B. SYSTEM MODEL
We consider multiple MEC servers as FL service providers deployed in a shared region, where the MEC servers share multiple connected MDs as the FL participants, i.e., workers. A single MEC server is referred to as an "MEC" in the remainder of this study for ease of reference. An MEC, which is equipped with a base station (BS), is capable of coordinating the FL process to achieve distributed machine learning. As illustrated in Fig. 1, the MDs are directly connected to the multiple MECs, and the multiple MECs provide independent FL processes to the MDs. For instance, in a smart factory, one MEC may provide an equipment anomaly detection model as an FL task to the MDs, whereas another MEC provides an image detection model for context awareness of the smart factory environment as a different FL task to the same MDs, with both FL tasks independent of each other. Then, local training for each independent FL process can be conducted on a large group of decentralized data across multiple MDs using a possible architecture, as discussed in [1], [2].
In view of a single MEC, let time be divided into consecutive fixed-length rounds. In each round, in step 1, the MEC conducts task initialization by providing the incentive information and a shared global model with a target accuracy level for local training to the multiple MDs. As in [2], the shared global model might include a TensorFlow graph and instructions on how to execute it. In step 2, each MD then determines the amount of dataset applied by extracting sample data from its entire dataset with reference to the incentive information. Correspondingly, in step 3, each MD conducts local training on the extracted local data using the shared global model, and the updates are transferred to the MEC. Finally, in step 4, the MEC merges the local model updates from the MDs and generates a global model.²
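The four steps above can be sketched as a toy, self-contained round. Here the "model" is a single scalar, and the MD and aggregation logic are deliberately simplified placeholders; all class and function names are illustrative and not part of the paper's system.

```python
class MD:
    """A toy FL participant holding a small local dataset."""
    def __init__(self, data):
        self.data = data

    def choose_dataset_size(self, gamma):
        # Step 2: in the paper this is the utility-maximizing D_{i,j};
        # here, for illustration, each MD simply uses all of its local data.
        return len(self.data)

    def local_train(self, model, num_samples):
        # Step 3: move the scalar "model" halfway toward the local data mean
        # (a stand-in for real gradient-based local training).
        sample = self.data[:num_samples]
        return model + 0.5 * (sum(sample) / len(sample) - model)

def fl_round(model, mds, gamma):
    # Step 1 is the announcement of (gamma, model); steps 2-3 run per MD.
    updates = [md.local_train(model, md.choose_dataset_size(gamma)) for md in mds]
    # Step 4: merge the local updates into a new global model (plain averaging).
    return sum(updates) / len(updates)

mds = [MD([1.0, 2.0]), MD([3.0, 5.0])]
new_model = fl_round(0.0, mds, gamma=10.0)
```

Real deployments would replace the scalar update with actual model training and a secure aggregation step, but the control flow of one round is the same.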
To model this, we define I and J as the sets of MDs and MECs, respectively, where |I| = N and |J| = M denote the total numbers of MDs and MECs. For each i ∈ I and j ∈ J, we define D_{i,j} as the amount of dataset (i.e., the number of items in the dataset) determined by MD i for MEC j. To simplify the scenario as in [23], we assume that each MD holds a sufficiently large amount D_max of raw data (e.g., sensing information including GPS) such that D_{i,j} << D_max. The set of available D_{i,j} for MD i corresponding to MEC j, denoted as D_{i,j}, is a continuum given by D_{i,j} = [D_min, D_max], where D_min can be provided by the MEC as the minimum requirement for participating in the FL task. According to [10], for each j ∈ J, to estimate the accuracy of FL, the global loss decay L_decay,j, which represents the difference in the global loss function across each round of the FL procedure, can be evaluated as a function of the summation Σ_{i=1}^{N} D_{i,j} of the dataset sizes of the MDs used for local training as

L_decay,j = α_1 √( Σ_{i=1}^{N} D_{i,j} ),    (1)

where α_1 is a coefficient determined by the specific deep learning model. Then, because L_decay,j is an increasing function with respect to D_{i,j}, in this study, following the principle of proportional sharing, to motivate active FL participation, the total incentive γ_j provided by MEC j is allocated to the MDs in proportion to the amount of dataset D_{i,j} contributed to local training. Thus, the expected incentive γ_{i,j} allocated to MD i by FL service provider j is

γ_{i,j} = γ_j D_{i,j} / Σ_{l=1}^{N} D_{l,j}.    (2)

² To generate the shared global model, various mechanisms (e.g., the federated averaging algorithm and the secure aggregation algorithm) can be applied. Following the steps shown in Fig. 1, once a global training period is established, the MEC sends the current shared global model parameters and any other necessary state as a checkpoint (i.e., the serialized state of a TensorFlow session).

C. ENERGY CONSUMPTION ANALYSIS
During FL participation in multiple FL tasks over the MECs, as specified in step 3 of Fig. 1, for each j ∈ J, the total energy consumption at an MD comprises the transmission energy consumption E^t_{i,j} for uploading local model updates and the computing energy consumption E^c_{i,j} for local model training. Specifically, to model E^t_{i,j}, let l^t_{i,j} denote the latency for uploading local model updates to MEC j through a wireless channel. It should be noted that, in terms of transmission energy consumption, the global model download can be ignored in step 3 of Fig. 1. In this study, we adopt an orthogonal frequency-division multiple access (OFDMA) protocol³ for the MDs, where each MEC has a BS with residual bandwidth B_j (some of the bandwidth might already be occupied by other services), and B_j is allocated over the MDs for FL as B_{i,j}. Correspondingly, as in [8], the achievable transmission rate r_{i,j} of MD i is given by

r_{i,j} = B_{i,j} log_2( 1 + h_{i,j} p_{i,j} / N_0 ),    (3)

where N_0 is the background noise, h_{i,j} is the channel gain between the BS belonging to MEC j and MD i, and p_{i,j} is the transmission power of the MD to the BS belonging to MEC j. Then, assuming that the size w_j of a local model update for the FL task provided by MEC j is a constant with the same value for all MDs, the transmission time of a local model update of size w_j (from the MDs to MEC j) is expressed as

l^t_{i,j} = w_j / r_{i,j}.    (4)

Finally, the transmission energy consumption E^t_{i,j} is given by

E^t_{i,j} = p_{i,j} l^t_{i,j}.    (5)

Then, to model E^c_{i,j}, let c_i denote the CPU cycles⁴ required for MD i to process one data sample, so that the total number of CPU cycles for the FL task of MEC j is c_i D_{i,j}.
Thus, the latency l^c_{i,j} for local model training for the FL task of MEC j by MD i can be formulated as

l^c_{i,j} = I_l(θ_{i,j}) c_i D_{i,j} / f_i,    (6)

where f_i is the CPU frequency of MD i, and I_l(θ_{i,j}) is the number of local iterations that MD i needs to achieve accuracy θ_{i,j} ∈ (0, 1) for the FL task of MEC j, which is approximated as α_2 log(1/θ_{i,j}) in [8], where α_2 is a constant coefficient that depends on the machine learning task. Then, the computing energy consumption E^c_{i,j} for local model training is given by

E^c_{i,j} = I_l(θ_{i,j}) (β_i/2) c_i D_{i,j} f_i²,    (7)

where β_i/2 is the effective capacitance coefficient of the chipset of MD i. For notational convenience, we denote η_{i,j} = I_l(θ_{i,j}) (β_i/2) f_i² c_i. Finally, the total energy consumption E^tot_{i,j} with respect to D_{i,j} and θ_{i,j} is defined as

E^tot_{i,j} = E^t_{i,j} + η_{i,j} D_{i,j}.    (8)
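The energy model can be checked numerically. The sketch below mirrors the paper's symbols with assumed, purely illustrative parameter values; the rate expression is the standard Shannon-capacity form, which is our reading of the model adopted from [8].

```python
import math

def transmission_energy(p, h, B, w, N0=1e-9):
    """Upload energy E^t: power times upload time w / r."""
    r = B * math.log2(1 + h * p / N0)   # achievable rate, Shannon-capacity form
    return p * (w / r)                   # E^t = p * l^t with l^t = w / r

def computing_energy(theta, D, c, f, beta, alpha2=1.0):
    """Training energy E^c = I_l(theta) * (beta/2) * c * D * f^2."""
    I_l = alpha2 * math.log(1 / theta)   # local iterations to reach accuracy theta
    eta = I_l * (beta / 2) * f**2 * c    # per-sample training energy coefficient
    return eta * D

# Hypothetical MD: 0.1 W transmit power, 1 MHz bandwidth, 100 kbit update,
# 1000 samples, 20 cycles/sample, 1 GHz CPU, capacitance coefficient 1e-28.
E_tot = transmission_energy(0.1, 1e-6, 1e6, 1e5) + \
        computing_energy(0.1, 1000, 20, 1e9, 1e-28)
```

As the model predicts, the training term grows linearly in the dataset size D, which is exactly the coupling that makes dataset management matter for energy efficiency.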

D. UTILITY FUNCTION OF MEC (FL SERVICE PROVIDER)
Our design of the MEC utility function considers two components. The first component represents the global loss decay L_decay,j obtained by aggregating the local models from the MDs with Σ_{i=1}^{N} D_{i,j}, which captures the accuracy gain from the FL framework during each round. The second component represents the cost of providing the incentives γ_j to compensate the MD participants in the FL framework. Correspondingly, the proposed utility function aims for a balance between the attained accuracy gain and the cost of the FL framework because the two components conflict with each other. Thus, the utility function of MEC j is defined as

U_MEC,j(θ^{[j]}, γ_j, D^{[j]}) = L_decay,j − τ_j γ_j,    (9)

where τ_j ≥ 0 is the weighting factor of the second component.

E. UTILITY FUNCTION OF MD (FL PARTICIPANT)
For i ∈ I, we consider the following utility function:

U_i(D_i, D_{−i}, γ, θ_i) = Σ_{j=1}^{M} [ t_i γ_{i,j} − E^tot_{i,j} ],    (10)

where t_i ≥ 0 is the weighting factor of the revenue for MD i and γ = (γ_1, γ_2, . . . , γ_M).
The utility function of the MD comprises the revenue and the energy consumption from participating in the FL procedure; thus, the MD can attain optimal welfare from FL participation by maximizing the utility function. Here, t i is controllable such that the MD reflects its interest in its utility function. If the MD wants to reduce energy consumption, the MD sets t i to a lower value. This could be the case when the MD has a small amount of battery power. By contrast, if the MD wants to obtain revenue, the MD sets t i to a higher value. Finally, the utility function designed in this way aims to balance the attained utility from the revenue and operating cost in terms of energy consumption. Accordingly, through the proposed utility function, each MD can participate in the FL framework in a cost-effective manner.
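The trade-off encoded in the MD utility can be seen numerically: for a single MEC and fixed strategies of the other MDs, the proportional revenue term is concave in D_i while the energy cost grows linearly, so the utility first rises and then falls. A minimal sketch under hypothetical parameters:

```python
# MD utility for one MEC with the linearized energy model E^tot = E^t + eta * D.
# All parameter values are illustrative.

def md_utility(D_i, D_others, gamma, t_i, eta, E_t):
    revenue = t_i * gamma * D_i / (D_i + sum(D_others))  # proportional share
    cost = E_t + eta * D_i                               # energy consumption
    return revenue - cost

# One opponent committing 50 samples; sweep the MD's own dataset size.
us = [md_utility(d, [50.0], 100.0, 1.0, 0.5, 0.1) for d in (10.0, 50.0, 200.0)]
```

The interior maximum is what the best-response analysis of the next section characterizes in closed form; a smaller t_i flattens the revenue term and pushes the optimum toward less data (battery saving), while a larger t_i does the opposite.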

IV. ANALYSIS OF THE PROPOSED TWO-LEVEL GAME
In this study, the joint dataset and incentive management mechanism for FL between the MECs and MDs is designed as a hierarchical non-cooperative decision-making problem that can be formulated as a Stackelberg game, because the Stackelberg game is a special type of non-cooperative game in which a hierarchy exists among the players. Specifically, these players are classified as leaders and followers [24]-[26].
In our problem, we consider the MECs and MDs as multiple leaders and multiple followers, respectively. In the MD-level game, each MD chooses its own strategy independently. In particular, MD i determines the amount of dataset for local training based on its FL participation. Knowing the strategies of the MECs, that is, the incentive vector γ and the accuracy level θ_i, each MD aims to maximize the utility function U_i(D_i, D_{−i}, γ, θ_i) defined in (10). In this competition, the MDs do not know the strategies of the others, and the Nash equilibrium (NE) provides a set of strategies with the property that no MD can increase its own utility by choosing a different strategy given the strategies of the other MDs and the MECs. Subsequently, each MEC also independently chooses its own strategy. In particular, MEC j determines the value of the incentive γ_j and the accuracy level vector θ^{[j]}, aiming to maximize its utility U_MEC,j(θ^{[j]}, γ_j, D^{[j]}) defined in (9).
Finally, to obtain the Stackelberg equilibrium (SE), which is the solution of this game and includes the optimal strategies of both the MDs and MECs, we adopt the backward induction principle introduced in [26]. This principle can be summarized as follows. First, in the lower MD-level game, each MD determines its strategy (i.e., the amount of dataset for local training, called dataset management) to maximize the utility function (10) for a given incentive and accuracy level profile provided by the MECs. Subsequently, in the upper MEC server-level game, combining the strategy of each MD with the utility function (9), MEC j independently optimizes its strategy (i.e., the value of the incentive γ_j and accuracy level vector θ^{[j]}) to maximize the utility function (9).

A. NON-COOPERATIVE GAME AMONG MDS : MD-LEVEL GAME
In this section, the dataset management mechanism is provided as an MD-level game over a strategy matrix D = (D_1; D_2; . . . ; D_N) of the MDs. Here, the MDs compete with each other to maximize their utilities U_i(D_i, D_{−i}, γ, θ_i) under the given γ and θ_i by determining the amount of dataset D_i for local training.

Definition 1. The best response (BR) function B_i(D_{−i}, γ, θ_i) of MD i as a follower is the best strategy of MD i over the MECs given the other MDs' strategies D_{−i} and the MEC strategies γ and θ_i. The BR function is given by

B_i(D_{−i}, γ, θ_i) = argmax_{D_i ∈ D_i} U_i(D_i, D_{−i}, γ, θ_i),    (12)

for i ∈ [N], where γ = (γ_1, γ_2, . . . , γ_M).

Definition 2.
A Nash equilibrium of the non-cooperative game among the MDs is a profile of strategies D* = (D*_1, D*_2, . . . , D*_N) with the property that, given the MEC strategies γ and θ, we have D*_i = B_i(D*_{−i}, γ, θ_i) for i ∈ [N].
Theorem 1. The utility function U_i(D_i, D_{−i}, γ, θ_i) of MD i is strictly concave in D_i.
Proof. From (7), (8), and (10), we obtain

U_i(D_i, D_{−i}, γ, θ_i) = Σ_{j=1}^{M} [ t_i γ_j D_{i,j} / Σ_{l=1}^{N} D_{l,j} − E^t_{i,j} − η_{i,j} D_{i,j} ].    (13)

Taking the first and second derivatives of U_i(D_i, D_{−i}, γ, θ_i) with respect to D_{i,j}, we obtain

∂U_i/∂D_{i,j} = t_i γ_j Σ_{l∈I\{i}} D_{l,j} / ( Σ_{l=1}^{N} D_{l,j} )² − η_{i,j}    (14)

and

∂²U_i/∂D_{i,j}² = −2 t_i γ_j Σ_{l∈I\{i}} D_{l,j} / ( Σ_{l=1}^{N} D_{l,j} )³ < 0.    (15)

It follows from (14) and (15) that the Hessian matrix of U_i is negative definite. This implies that the utility function U_i(D_i, D_{−i}, γ, θ_i) is strictly concave on ∏_{j∈[M]} D_{i,j}.

Theorem 2. A Nash equilibrium exists in the non-cooperative game among the MDs.
Proof. Let D_i = ∏_{j∈[M]} D_{i,j}. Because the utility function U_i(D_i, D_{−i}, γ, θ_i) is strictly concave on the compact and convex set D_i, from the result of [27], the proposed non-cooperative MD-level game has a Nash equilibrium.

Definition 3. A function f(p) = (f_1(p), . . . , f_N(p)), where p = (p_1, . . . , p_N), is said to be standard if the following properties are satisfied for all p ≥ 0 [28]:
‚ Positivity: f(p) > 0;
‚ Monotonicity: if p ≥ p', then f(p) ≥ f(p');
‚ Scalability: for all μ > 1, μ f(p) > f(μp).
To prove the uniqueness of the Nash equilibrium for the proposed MD-level game, it suffices to show that the BR function of each MD defined in (12) is a standard function [28].

Theorem 3. Under the condition t_i γ_j / (4η_{i,j}) > Σ_{l∈I\{i}} D_{l,j}, the BR function B_i(D_{−i}, γ, θ_i) of MD i is a standard function of D_{−i}.
Proof. By equating the first derivative (14) of U_i(D_i, D_{−i}, γ, θ_i) to zero, we obtain

B_i(D_{−i}, γ, θ_i) = √( t_i γ_j Σ_{l∈I\{i}} D_{l,j} / η_{i,j} ) − Σ_{l∈I\{i}} D_{l,j}.    (16)

From the strict concavity of the utility function U_i(D_i, D_{−i}, γ, θ_i) on D_i, the BR function B_i(D_{−i}, γ, θ_i) of MD i is obtained by solving the above equation.
We then obtain the following:
‚ Positivity: From (16), B_i(D_{−i}, γ, θ_i) > 0 holds whenever t_i γ_j / η_{i,j} > Σ_{l∈I\{i}} D_{l,j}, which is implied by the stated condition.
‚ Monotonicity: Under the same condition, by taking the first derivative of B_i(D_{−i}, γ, θ_i) with respect to D_{l,j} for l ∈ I\{i}, we obtain

∂B_i/∂D_{l,j} = (1/2) √( t_i γ_j / ( η_{i,j} Σ_{l∈I\{i}} D_{l,j} ) ) − 1 > 0.

‚ Scalability: From (16), for μ > 1, we have

μ B_i(D_{−i}, γ, θ_i) − B_i(μD_{−i}, γ, θ_i) = (μ − √μ) √( t_i γ_j Σ_{l∈I\{i}} D_{l,j} / η_{i,j} ).

Because μ > 1, we obtain μ B_i(D_{−i}, γ, θ_i) > B_i(μD_{−i}, γ, θ_i).

Under the condition t_i γ_j / (4η_{i,j}) > Σ_{l∈I\{i}} D_{l,j}, we can therefore conclude that the BR function B_i(D_{−i}, γ, θ_i) is a standard function of D_{−i}.
It should be noted that the physical meaning of this condition is that, to guarantee a unique Nash equilibrium for the MD-level game, the MECs should provide incentives γ to the MDs that are sufficiently large in proportion to the summation of the dataset over the MDs (i.e., the estimate of the global accuracy).
Theorem 4. The non-cooperative game among the MDs has a unique Nash equilibrium.
Proof. Based on Theorem 3, the BR function B_i(D_{−i}, γ, θ_i) is a standard function. Correspondingly, by [28], its fixed point D* = (D*_1, D*_2, . . . , D*_N) must be unique.
Theorem 5. For the non-cooperative game among symmetric MDs, the unique Nash equilibrium has the closed-form expression

D*_j = t γ_j (N − 1) / (η_j N²),    (17)

where N denotes the number of MDs. Here, symmetric MDs means that all MDs have the same t_i and η_{i,j} during FL participation belonging to MEC j; thus, we simply denote them as t and η_j, respectively.
Proof. Symmetric MDs imply that the BR function B_i(D_{−i}, γ, θ_i) has the same form for all i ∈ I, so that D*_i = D*_l for all l ∈ I\{i}. Then, (16) is rewritten as

D*_j = √( t γ_j (N − 1) D*_j / η_j ) − (N − 1) D*_j.

Solving this equation yields (17), as desired.
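Because the best response is a standard function, repeatedly applying it converges to the unique NE of Theorem 4, and for symmetric MDs the limit should match the closed form (17). A numerical sketch under illustrative parameters:

```python
import math

def best_response(S_minus_i, t, gamma, eta):
    # B_i = sqrt(t * gamma * S / eta) - S, obtained by zeroing the
    # first derivative of the MD utility (S = sum of the others' datasets).
    return math.sqrt(t * gamma * S_minus_i / eta) - S_minus_i

t, gamma, eta, N = 1.0, 100.0, 0.5, 2
D = [1.0] * N                           # arbitrary positive starting point
for _ in range(200):                    # simultaneous best-response updates
    D = [best_response(sum(D) - D[i], t, gamma, eta) for i in range(N)]

closed_form = t * gamma * (N - 1) / (eta * N**2)  # Theorem 5's NE
```

With these numbers the iteration settles at D_i = 50 for both MDs, agreeing with the closed form; the standard-function property is what guarantees this convergence from any positive starting point satisfying the stated condition.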
Remark 1. It is extremely difficult to find a closed-form solution of the Nash equilibrium in the general case. Thus, to provide insight into the Nash equilibrium formulation, a symmetric MD case is studied and presented in Theorem 5. Symmetric MDs can be realized when a group of prearranged MDs participates in the FL. In other words, the MEC (i.e., the FL service provider) sets all configurations as requirements for FL participation, excluding the dataset management.

B. EFFICIENCY OF NASH EQUILIBRIUM FOR MDS
To check the efficiency of the proposed non-cooperative game among the MDs, the price of anarchy is a widely used tool, defined as the ratio between the optimal centralized solution and the worst equilibrium of a game with multiple equilibrium solutions. The price of anarchy can be used to measure the inefficiency of the NE in a non-cooperative game caused by the selfish behavior of its players. Hence, under our scenario, because of the uniqueness of the NE, we first define the social welfare of the multiple MDs and then check the ratio of this social welfare between the optimal centralized solution and the NE. For given γ and θ, the social welfare of the MDs is defined as

U(D, γ, θ) = Σ_{i=1}^{N} U_i(D_i, D_{−i}, γ, θ_i).    (18)

Let PoA denote the price of anarchy, which is defined as

PoA = max_D U(D, γ, θ) / U(D*, γ, θ).    (19)

Here, because U(D, γ, θ) is a decreasing function of D, the maximum social welfare is achieved by letting D = D_min. Then, for N symmetric MDs, PoA can be simplified from (10), (17), and (18) as

PoA = Σ_{j=1}^{M} ( t γ_j / N − E^t_j − η_j D_min ) / Σ_{j=1}^{M} ( t γ_j / N² − E^t_j ),    (20)

where E^t_j is the transmission energy consumption for each symmetric MD over MEC j.
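The simplified price of anarchy for symmetric MDs can be evaluated directly. The sketch below assumes a single MEC (M = 1) and hypothetical parameter values, comparing the social welfare at D_min with the welfare at the NE of (17):

```python
def welfare(D, N, t, gamma, eta, E_t):
    # N identical MDs: each receives the equal share t*gamma/N of the
    # incentive and spends E_t + eta * D in energy.
    return N * (t * gamma / N - E_t - eta * D)

# Illustrative parameters: 4 symmetric MDs, one MEC.
t, gamma, eta, E_t, N, D_min = 1.0, 100.0, 0.5, 0.1, 4, 1.0

D_star = t * gamma * (N - 1) / (eta * N**2)   # symmetric NE dataset size
poa = welfare(D_min, N, t, gamma, eta, E_t) / welfare(D_star, N, t, gamma, eta, E_t)
```

Since the centralized optimum pins every MD at D_min while the NE over-provisions data to chase incentive shares, the ratio exceeds one; how close it stays to one depends on the parameter regime.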

C. UTILITY MAXIMIZATION FOR MECS
For all j ∈ J, the optimization problem for MEC j as the FL service provider is given by

maximize_{θ^{[j]}, γ_j}  U_MEC,j(θ^{[j]}, γ_j, D^{[j]})
subject to  I_l(θ_{i,j}) ≤ I_max,j, for i ∈ [N],    (21)

where I_max,j is the maximum number of local iterations allowed by MEC j. This constraint guarantees that FL participation does not require excessive processing at the MDs. The choice of the optimal incentive γ*_j and accuracy level vector θ^{[j],*} is affected by the optimal amount of dataset D^{[j],*} applied by the MDs for MEC j, where θ^{[j],*} = (θ*_{1,j}, θ*_{2,j}, . . . , θ*_{N,j}) and D^{[j],*} = (D*_{1,j}, D*_{2,j}, . . . , D*_{N,j}). Accordingly, we can obtain the optimal incentive and accuracy level vector in (21) using the optimal solution described in (17). Then, from (1), (9), and (17), we have

U_MEC,j(θ^{[j]}, γ_j, D^{[j],*}) = α_1 √( t γ_j (N − 1) / (η_j N) ) − τ_j γ_j.    (22)

Theorem 6. The optimal accuracy level and optimal incentive of the MEC for (22) are determined as follows:

θ*_{i,j} = exp( −I_max,j / α_2 ), for i ∈ [N],    (23)

γ*_j = min{ α_1² t (N − 1) / (4 τ_j² η_j N), γ^max_j },    (24)

where γ^max_j is the maximum incentive budget provided by MEC j.

Proof. It should be noted that because we consider the symmetric-MD case, β_i, f_i, and c_i are identical for all MDs. From (22), U_MEC,j(θ^[j], γ_j, D^[j],*) is a strictly increasing function with respect to θ_i,j, so the optimal θ*_i,j lies on the boundary of the constraint I_l(θ_i,j) ≤ I_max,j, for i ∈ [N]. Accordingly, the optimal θ*_i,j must satisfy α_2 log(1/θ*_i,j) = I_max,j, and thus

θ*_i,j = e^(−I_max,j / α_2).

Moreover, to obtain the optimal incentive γ*_j, let X = t η_j (N−1)/N in (22); then

U_MEC,j(θ^[j], γ_j, D^[j],*) = α_1 √(γ_j X) − τ_j γ_j.

Taking the first derivative with respect to γ_j, we have

∂U_MEC,j(θ^[j], γ_j, D^[j],*)/∂γ_j = α_1 √X · (1/2) γ_j^(−1/2) − τ_j.

An optimal value of γ_j is obtained by setting this derivative to zero, i.e., α_1 √X (1/2) γ_j^(−1/2) − τ_j = 0. Solving this equation and accounting for the budget constraint, we obtain γ*_j = min( α_1² X / (4 τ_j²), γ_j^max ). □

Theorem 7. A unique Stackelberg equilibrium exists in the proposed two-level FL game.
Proof. Because the non-cooperative game among the MDs has a unique Nash equilibrium D* = (D*_1, D*_2, ..., D*_N), as shown in Theorem 4, and each MEC can always find its optimal strategy γ*_j owing to the concavity of U_MEC,j(θ^[j], γ_j, D^[j],*), we conclude that a unique Stackelberg equilibrium exists in the proposed two-level FL game. □
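The closed-form leader solutions derived in the proof above (θ* from the boundary condition and γ* from the first-order condition) can be sketched as follows. All constants (α_1, α_2, τ_j, X, I_max,j, γ_j^max) are illustrative placeholders, not values from the paper.

```python
import math

# Hedged sketch of Theorem 6's closed-form solutions.

def optimal_accuracy(alpha_2, I_max):
    # Boundary condition alpha_2 * log(1/theta*) = I_max
    # => theta* = exp(-I_max / alpha_2)
    return math.exp(-I_max / alpha_2)

def optimal_incentive(alpha_1, X, tau, gamma_max):
    # FOC: alpha_1 * sqrt(X) * (1/2) * gamma^(-1/2) - tau = 0
    # => gamma* = alpha_1^2 * X / (4 * tau^2), clipped to the budget
    gamma_star = (alpha_1 ** 2) * X / (4.0 * tau ** 2)
    return min(gamma_star, gamma_max)

def mec_utility(gamma, alpha_1, X, tau):
    # Shape of U_MEC in the proof: alpha_1 * sqrt(gamma * X) - tau * gamma
    return alpha_1 * math.sqrt(gamma * X) - tau * gamma

theta_star = optimal_accuracy(alpha_2=4.0, I_max=8.0)
g_star = optimal_incentive(alpha_1=2.0, X=4.0, tau=0.5, gamma_max=100.0)

# Numerical FOC check: the derivative of U_MEC vanishes at the interior optimum.
eps = 1e-6
deriv = (mec_utility(g_star + eps, 2.0, 4.0, 0.5)
         - mec_utility(g_star - eps, 2.0, 4.0, 0.5)) / (2 * eps)
print(theta_star, g_star, deriv)
```

The check confirms that the interior stationary point of the concave utility α_1 √(γX) − τγ coincides with the closed-form γ*, before the budget cap γ^max is applied.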

D. DISCUSSIONS
In this subsection, for a more practical consideration, we first discuss extensions of the MD-level game to scenarios in which the MDs are unaware of explicit information about other MD profiles, such as the weights of the revenue; such scenarios can be redesigned as a Stackelberg Bayesian game. This is more reasonable in a practical scenario because the MDs weigh revenue against energy consumption differently and do not share these interests with one another while determining their optimal strategies. We conclude the discussion by addressing the existence of the Stackelberg Bayesian Nash equilibrium (BNE) for the MD-level game in the extended scenario. In the extended scenario, the revenue weight t_i of each MD is private information, and only its probability distribution is commonly known. Thus, for a given θ and γ, the MD-level game can be extended and formulated as a Bayesian game with incomplete information under the following conditions:
• The action of MD i, denoted by D_i, is the amount of dataset applied; accordingly, D_i is the action set of MD i.
• The type of MD i is the revenue weighting factor t_i, which represents the interest of MD i in the revenue relative to the energy consumption.
• The strategy of MD i is a function s_i : T_i × R^(N+1) → D_i such that s_i(t_i, θ, γ) ∈ D_i; that is, the amount of dataset is determined by the type of MD i, the target accuracy level, and the total incentive provided by the MEC.
The expected utility function of MD i when playing action s_i(t_i, θ, γ), extended from (10), is determined by taking the expectation over the types of the other MDs:

Ū_i(D_i, s_−i; t_i) = ∫ U_i(s_i(t_i, θ, γ), s_−i(t_−i, θ, γ); t_i) f_−i(t_−i) dt_−i,

where f_−i(t_−i) = ∏_{j≠i} f_j(t_j) is the joint probability density of t_−i = (t_1, t_2, ..., t_{i−1}, t_{i+1}, ..., t_N). A vector (s_1, s_2, ..., s_N) is said to be a pure BNE if, for every MD i and every type t_i, s_i(t_i, θ, γ) maximizes Ū_i given the strategies s_−i of the other MDs. We show that the expected utility function admits at least one pure BNE as follows. Taking the partial derivatives of the expected utility function with respect to t_i and D_i, and because γ > 0 and s_i(t_i, θ, γ) ≥ D_min for i ∈ {1, 2, ..., N}, we obtain ∂²Ū_i(D_i, s_−i; t_i, t_−i)/(∂t_i ∂D_i) > 0. This implies that the expected utility function Ū_i satisfies a single crossing of the incremental return in (D_i, t_i). Thus, the game has at least one pure BNE [29].^5
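The positivity of the cross-partial can be checked numerically on an illustrative MD utility. The utility shape below (a t_i-weighted proportional share of the total incentive minus a linear energy cost) follows the mechanism's qualitative description; the exact expression in (10) and all constants are placeholders, not taken from the paper.

```python
# Hedged numerical check of the single-crossing property:
# d^2 U / (dt_i dD_i) > 0 for an illustrative MD utility.

def md_utility(D_i, t_i, D_others_sum, gamma, energy_per_sample):
    # t_i-weighted share of the total incentive gamma (proportional to the
    # MD's dataset contribution), minus a linear energy cost.
    return t_i * gamma * D_i / (D_i + D_others_sum) - energy_per_sample * D_i

def cross_partial(D_i, t_i, D_others_sum, gamma, energy_per_sample, h=1e-4):
    # Central finite difference for the mixed partial in (D_i, t_i).
    f = lambda D, t: md_utility(D, t, D_others_sum, gamma, energy_per_sample)
    return (f(D_i + h, t_i + h) - f(D_i + h, t_i - h)
            - f(D_i - h, t_i + h) + f(D_i - h, t_i - h)) / (4 * h * h)

val = cross_partial(D_i=15.0, t_i=1.0, D_others_sum=60.0,
                    gamma=100.0, energy_per_sample=0.2)
print(val)  # positive: incremental return in D_i increases with t_i
```

Analytically, for this stand-in utility the cross-partial equals γ·D_−i/(D_i + D_−i)², which is positive whenever γ > 0, mirroring the argument in the text.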
Furthermore, for practical concerns, we discuss the signaling complexity. For each MEC, the proposed mechanism requires that the total incentive information, the incentive distribution information, and the target accuracy information be provided to all MDs. Thus, it requires O(N) signals, where N is the number of MDs; that is, the signaling overhead increases linearly with N. Nevertheless, all signaling messages can be piggybacked on general FL procedures, such as step 1, so no additional latency is imposed. The linear complexity and message piggybacking together show that the proposed mechanism can be practically deployed with acceptable signaling overhead.
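The linear signaling growth can be illustrated with a trivial per-round message count. The three fields per MD are taken from the description above; treating each field as one signal is a simplifying assumption.

```python
# Hedged sketch: downlink signaling from one MEC to N MDs per FL round.
# Each MD receives three fields: the total incentive, its incentive
# distribution, and its target accuracy level -- a constant number of
# fields per MD, so the total signal count grows as O(N).

def signaling_messages(num_mds, fields_per_md=3):
    return num_mds * fields_per_md

counts = [signaling_messages(n) for n in (1, 10, 100)]
print(counts)  # → [3, 30, 300], i.e., linear in N
```

Because the count is linear and each field can be piggybacked on the existing model-distribution message of step 1, no extra communication round is needed.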

V. PERFORMANCE EVALUATION
In this section, the effectiveness of the proposed mechanism is analyzed using quantitative results. We are interested in the effects of the various system parameters described in Table 1 on the utilities of the MDs and MECs, as well as on their strategies, under our mechanism.

In Figs. 2 and 3, we illustrate the utilities of the MD and the MEC with respect to the total incentive γ_j, obtained from (10) and (17). We see that the utility of the MD strictly increases as the γ_j provided by the MEC increases. This is because, with the same amount of dataset applied, the MDs obtain more incentives from participating in the FL procedure. By contrast, in Fig. 3, there is a maximum utility point beyond which the MEC cannot increase its own utility by increasing the incentive γ_j. Owing to excessively high incentives for compensating the participating MDs in the FL framework, the MEC incurs excessive costs compared to the accuracy gain attained over the MDs, so its utility decreases beyond a certain point.

^5 The proof of uniqueness and the development of an algorithm for obtaining the BNE solution are not simple and straightforward, and can be considered as a future research direction; thus, the discussion is confined to showing the existence of the solution.
In Figs. 4 and 5, we demonstrate the utilities of the MD and the MEC with respect to the number of MDs under the Stackelberg equilibrium. Owing to the increasing competition among the MDs, the utility of each MD monotonically decreases as the number of MDs in the FL increases. By contrast, as the leader, the MEC has the advantage of choosing its strategies first; because the global loss decay L_decay,j grows with the number of MDs faster than the marginal increase in the total incentive, the MEC achieves a higher utility as the number of MDs increases.
In Figs. 6 and 7, we illustrate the strategies of the MDs (D*_i,j) and of each MEC (γ*_j) under the Stackelberg equilibrium with respect to the number of MDs, where there are three MECs with different values of I_max. As shown in Fig. 6, the optimal strategy of the MD (D*_i,j) decreases as the number of MDs in the FL framework increases. This is because, owing to the increasing competition among the MDs for sharing the total incentive, the expected incentive for each MD decreases, as indicated in the previous figures; therefore, the MDs choose a smaller D*_i,j, as proved in Theorem 4. In addition, as I_max,j, the maximum number of local iterations provided by each MEC, increases, more energy is consumed by the MDs for local model training with a given amount of data. Correspondingly, the MDs tend to reduce the amount of dataset applied (D*_i,j) to maximize their utility while reducing the energy consumption. By contrast, as shown in Fig. 7, the optimal strategy of each MEC (γ*_j) increases as the number of MDs increases. This is because, to motivate the MDs to contribute to the global loss decay L_decay,j, the MEC should sustain the per-MD incentive under increasing competition by raising the total incentive according to (23). Moreover, as the I_max,j determined by each MEC increases, owing to the reduction of D*_i,j by the MDs, the total incentive γ_j determined by the MEC is reduced to maximize its utility.
In addition, the results shown in Fig. 8 confirm that the optimal target accuracy (θ*_i,j) decreases as I_max,j increases. As proved in Theorem 6, the optimal target accuracy (θ*_i,j) is independent of the number of MDs and is affected only by I_max,j. With a higher value of I_max,j, the MEC can set a more stringent accuracy requirement (i.e., a target accuracy level θ*_i,j closer to zero).
In order to verify the effectiveness of the proposed mechanism, we set three benchmarks with different settings of D_i,j. Specifically, we consider that the MDs set their local datasets to D_min, D_mid, and D_max, respectively, where D_mid is newly defined as the medium value between D_min and D_max. It should be noted that the maximum social welfare is achieved by letting D = D_min, as discussed in the previous section. For the example in Fig. 9, we fix N = 5 and assume that the D_min required for FL participation is set to 0.5·D_max. As shown in Fig. 9, the proposed mechanism outperforms the two benchmarks D_max and D_mid. Furthermore, the social-welfare gap between the proposed mechanism and the centralized solution (i.e., D_min) decreases as the ratio D_min/D*_i,j approaches 0.9. Finally, based on the findings in Fig. 9, in Fig. 10 we demonstrate the efficiency of the NE by illustrating the PoA, defined in (20), with respect to the ratio D_min/D*_i,j. The results show that the PoA is at most approximately 1.5 and approaches 1 as the ratio D_min/D*_i,j increases toward 0.9. Thus, if D_min is set to an acceptable value for FL participation in a historical manner, the proposed mechanism can perform as efficiently as the centralized mechanism, as shown through this PoA analysis.
VOLUME 4, 2016

VI. CONCLUSION
In this paper, we proposed a novel joint dataset and incentive management mechanism for FL over MEC systems. We provided a rigorous game-theoretic analysis to obtain the closed-form Stackelberg equilibrium solution of the proposed mechanism. We also discussed an extension of our scenario to a Stackelberg Bayesian game in which the MDs are unaware of the explicit profile information of other MDs. The numerical analysis demonstrated that our proposed approach is guaranteed to have a unique equilibrium solution, and the results showed that the proposed mechanism can maximize the payoffs of all participating players. This paper thus provides evidence that incentive-based FL dataset management is beneficial to both the MDs and the FL service providers.
DUSIT NIYATO (M'09-SM'15-F'17) is currently a professor in the School of Computer Science and Engineering at Nanyang Technological University, Singapore. He received the B.Eng. degree from King Mongkut's Institute of Technology Ladkrabang (KMITL), Thailand, in 1999 and the Ph.D. degree in Electrical and Computer Engineering from the University of Manitoba, Canada, in 2008. His research interests are in the areas of energy harvesting for wireless communication, the Internet of Things (IoT), and sensor networks.