Dynamic Spectrum Access using Stochastic Multi-User Bandits

A stochastic multi-user multi-armed bandit framework is used to develop algorithms for uncoordinated spectrum access. In contrast to prior work, it is assumed that rewards can be non-zero even under collisions, thus allowing for the number of users to be greater than the number of channels. The proposed algorithm consists of an estimation phase and an allocation phase. It is shown that if every user adopts the algorithm, the system-wide regret is order-optimal, i.e., $O(\log T)$ over a time horizon of duration $T$. The regret guarantees hold both when the number of users is greater than the number of channels and when it is smaller. The algorithm is extended to the dynamic case where the number of users in the system evolves over time, and is shown to lead to sub-linear regret.


I. INTRODUCTION
Dynamic spectrum access has emerged to address the problem of spectrum under-utilization caused by treating the frequency spectrum as a fixed commodity. We study the spectrum sharing paradigm in which all users are treated equally, i.e., there is no distinction between primary and secondary users. We model the system as a stochastic multi-user multi-armed bandit (MAB) problem [1], where the channels correspond to the arms of the bandit, similar to the model considered in [2]-[11]. The interference in the system is captured through the reward observed by each user. We propose a decentralized algorithm that leads to efficient channel access and achieves regret that is sub-linear in time when employed by each user independently.
Stochastic multi-armed bandits have been used extensively in the literature to model dynamic spectrum access. Multi-armed bandits with coordination between users were studied in [9], [12]. We are more interested in the uncoordinated stochastic multi-armed bandit model investigated in [3], [4] and [5]. The algorithm in [3] achieves optimal regret but restricts the number of users to be less than the number of channels. The algorithms in [4] and [5] provide only high-probability bounds on the expected regret.
All of these approaches assume that when more than one user tries to transmit on the same channel simultaneously (commonly referred to as a collision), the colliding users receive zero reward, due to which the number of users in the system is constrained to be less than the number of channels. Hence, these approaches are not applicable to the case where the number of users is greater than the number of channels. In our model, each user receives a reward depending on which channel they choose and the number of other users that choose the channel at the same time. We consider a more general setting where the users can receive a non-zero reward when more than one user accesses the same channel, with the reward for each user decreasing as a function of the total number of users on the channel. The work in [13] also considers a setting with non-zero rewards on collisions, and provides guarantees for the expected time to converge to an optimal allocation when there is no explicit communication among the players. However, they assume that users have knowledge of the total number of users occupying their channel at any given time.
On any given channel, we assume that the reward obtained is a random variable drawn from a distribution that depends on the number of users on the channel. For example, the instantaneous reward could be the rate achieved by the user on the channel, which may decrease due to interference from other users accessing the channel. The decrease in the reward observed by the user as a function of the number of users depends on the system parameters, e.g., the distance between the users and the transmission protocol (e.g., hybrid ARQ). In our model, the users do not communicate with each other. However, we do make the mild assumption that a low-bandwidth broadcast channel is available to the users for time synchronization (see also [5], [6], [14]).
A preliminary version of this work was considered in [2], in which an algorithm was presented with guarantees of constant regret with high probability. Upon a more careful examination of the assumptions on the reward distributions of the arms and the clustering algorithm (Algorithm 2 in [2]), we believe that the results of Theorem 2 in [2] do not hold under the assumptions stated. In our current work, we have been able to remove these assumptions and avoid the clustering approach altogether, and we present an algorithm achieving the order-optimal regret of $O(\log T)$. We show that when each user employs our algorithm, the accumulated system-wide regret is $O(\log T)$, where $T$ is the time horizon. We then consider the more realistic scenario where the number of users in the system changes over time, with minor restrictions on the rate at which users can enter and leave the system. For this dynamic setting, we show that our algorithm can be easily extended to achieve sub-linear regret.

II. SYSTEM MODEL AND NOTATION
Let $K$ be the number of users in the system. We initially assume that the users have unlimited data for transmission. In a more realistic setting, users may become active or inactive depending on their transmission needs; our dynamic setting (see Section V) covers this scenario. Each user can choose one among $M$ channels for transmission, where we allow for the possibility that $K \ge M$. We assume that each user has prior knowledge of $M$. The assumption of known $M$ is reasonable if the spectrum partition is enforced and fixed. We model the system as a stochastic multi-user multi-armed bandit (MAB) system with $K$ users and $M$ arms (channels).
In each time slot $t$, let $\mathcal{A}_{t,j}$ denote the set of channels available to user $j$. User $j$ chooses a channel $a_{t,j} \in \mathcal{A}_{t,j}$ based on the reward history according to a certain policy and receives a reward. The reward on each arm depends on the number of users who have chosen the arm. Let $\mathbf{k}_t = [k_t(1), \ldots, k_t(M)]$, with $k_t(m)$ denoting the number of users on channel $m$ at time $t$, and $\sum_{m=1}^{M} k_t(m) = K$. The reward received by user $j$ at time $t$ is a function of the channel chosen, $a_{t,j}$, and the number of users on that channel, $k_t(a_{t,j})$, and is denoted by $r_j(a_{t,j}, k_t(a_{t,j}))$. Note that the reward $r_j(a_{t,j}, k_t(a_{t,j}))$ depends on the channels chosen by all users, and this dependence is captured through $k_t(a_{t,j})$. The reward is normalized to lie in the interval $[0, 1]$.
As mentioned earlier, we assume that the reward observed decreases with the number of users transmitting on the same channel. Let $\mu(m, k(m))$ denote the mean reward on channel $m$ when the number of users on the channel is $k(m)$, i.e., $\mu(m, k(m)) = \mathbb{E}[r(m, k(m))]$. We assume that $\mu(m, k(m))$ is 0 for $k(m) = N + 1$, where $N$ is a constant that depends on the system. This imposes a restriction on the number of users in the system, since $K$ cannot be greater than $MN$.
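We define the expected regret in the system as
$$R(T) = T \sum_{m=1}^{M} k^*(m)\,\mu(m, k^*(m)) - \mathbb{E}\left[\sum_{t=1}^{T} \sum_{j=1}^{K} r_j(a_{t,j}, k_t(a_{t,j}))\right],$$
where
$$\mathbf{k}^* \in \arg\max_{\mathbf{k}} \sum_{m=1}^{M} k(m)\,\mu(m, k(m))$$
over all feasible $\mathbf{k}$ such that $\mathbf{k} \in \{\mathbb{N} \cup 0\}^{1 \times M}$ and $\sum_{m=1}^{M} k(m) = K$. Note that $\mathbf{k}^*$ corresponds to the optimal number of users on each channel.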
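To make the notion of an optimal configuration concrete, the following Python sketch searches by brute force over all feasible occupancy vectors and returns the one with the largest system reward. It is only an illustration under stated assumptions: the table layout `mu[m][k]` (mean per-user reward on channel `m` with `k` users, index 0 unused), the function names, and the numerical values in the usage note are hypothetical and are not taken from the paper.

```python
from itertools import product


def system_reward(mu, config):
    """System reward of an occupancy vector: sum over channels of k(m) * mu[m][k(m)]."""
    return sum(k * mu[m][k] for m, k in enumerate(config) if k > 0)


def optimal_configuration(mu, num_users):
    """Brute-force search for an occupancy vector that maximizes the system reward.

    mu[m][k] is the (assumed known) mean per-user reward on channel m when
    k users share it, for k = 0..N; the entry at index 0 is a placeholder.
    Only vectors whose entries sum to num_users are considered feasible.
    """
    num_channels = len(mu)
    max_per_channel = len(mu[0]) - 1
    best, best_value = None, float("-inf")
    for config in product(range(max_per_channel + 1), repeat=num_channels):
        if sum(config) != num_users:
            continue
        value = system_reward(mu, config)
        if value > best_value:  # ties keep the first vector found
            best, best_value = config, value
    return best, best_value
```

For instance, with the hypothetical table `mu = [[0.0, 0.9, 0.5], [0.0, 0.6, 0.2]]` and three users, the search compares the two feasible vectors `(1, 2)` and `(2, 1)` and selects `(2, 1)`. The search is exponential in the number of channels, so it is suitable only for small illustrative instances.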
Since the reward distributions of the arms do not vary across the users, the optimal configuration (users occupy channels according to $\mathbf{k}^*$) does not depend on the channel allocated to any particular user. The mean reward of one channel may be greater than the others, and in order to ensure that one user does not monopolize a channel for an extended period of time, we impose the following condition. For each user, transmission on a particular channel takes place for a bounded number of consecutive time slots, after which the user releases the channel for a minimum number of time slots before attempting to access the same channel. This notion of fairness does not interfere with the optimality of the system. Let $J_1 = \sum_{m=1}^{M} k^*(m)\,\mu(m, k^*(m))$ be the system reward for the optimal configuration, and $J_2$ the system reward for the configuration that achieves the next lower value of the system reward. In our algorithm we assume that we have access to a lower bound on the value of the gap parameter $\Delta$, which is defined through the difference $J_1 - J_2$. Note that $\Delta > 0$, even though there might be multiple optimal configurations $\mathbf{k}^*$ achieving the system reward $J_1$. Such an assumption is usually required for the analysis of multi-user MABs when communication between users is not allowed.
In the case where a bound on $\Delta$ is not known, the method of increasing exploration phases and eliminating sub-optimal matchings [7] can be used to develop an algorithm that does not require the knowledge of $\Delta$.
In order for the users to get estimates of the mean rewards of the arms, we assume that the users have unique IDs from 1 to $K$ at the beginning of the algorithm. This assumption is required in order to devise a simple exploration phase when no assumptions on the reward distributions of the arms are made. Since the users have access to a low-bandwidth broadcast channel, unique IDs for users from 1 to $K$ and the value of $K$ can be broadcast to the users at the beginning of the algorithm. In the dynamic case where users can enter and leave the system, the value of $K$ can be broadcast at the beginning of each epoch.
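As an illustration of how unique IDs allow users to occupy channels according to a given configuration without communicating, and to rotate over time so that no user holds the same position indefinitely, consider the following Python sketch. It is not the allocation procedure of the paper; the function names, the 1-based `user_id`, and the `hold` parameter (the number of slots a user keeps a position before moving on) are assumptions made for this example.

```python
def slot_list(config):
    """Expand an occupancy vector into a list of channel indices.

    Example: (2, 1) becomes [0, 0, 1], i.e. two positions on channel 0
    and one position on channel 1.
    """
    return [m for m, k in enumerate(config) for _ in range(k)]


def channel_for_user(user_id, t, config, hold):
    """Channel used by user `user_id` (1-based) in time slot t (0-based).

    Every user computes the same position list from the agreed configuration
    and picks the position given by its ID, shifted by one every `hold` slots.
    Because the shift is common to all users, the positions taken in any slot
    form a permutation of the list, so the occupancy always matches `config`.
    """
    slots = slot_list(config)
    shift = t // hold
    return slots[(user_id - 1 + shift) % len(slots)]
```

Note that when a channel holds several positions in the configuration, a user may land on the same channel in consecutive rotation steps, so this simple rotation only approximates the fairness condition stated above.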

III. POLICY FOR DECENTRALIZED MULTI-USER MULTI-ARMED BANDITS
The decentralized policy for each user (Algorithm 1) proceeds in epochs, with $L$ denoting the number of epochs over a horizon of length $T$. Each epoch consists of two phases. The first is an estimation phase during which each user estimates the mean reward as a function of the number of users $k(m)$ on each channel $m$. Using these estimates, each user then computes an optimal configuration of the number of users on the channels. The second is an allocation phase where the users align themselves according to the optimal system configuration. We show that our algorithm leads to a regret of $O(\log T)$, where $T$ is the time horizon.
In the estimation phase, user $j$ obtains estimates of the mean rewards (denoted by $\hat{\mu}_j(m, k)$) of the arms $m \in [M]$ for all $k \in [N]$. This phase proceeds for a fixed number of time units in every epoch. Since every user has a unique ID, and the total number of users is known, the users simply sample each arm $m$, for each value of $k$ from 1 to $N$, for a fixed number $T_0$ of time slots. Note that if there is more than one optimal configuration in the system, the algorithm can dictate how the users break the tie. For example, in the event of multiple configurations with the same reward, the users choose the one with increasing number of users on the channels.
We use Algorithm 3 to construct an efficient allocation for which the regret does not grow with time when all the users have the correct estimate of the optimal configuration. Lemma 1 bounds the probability that some estimate $\hat{\mu}_j(m, k)$ deviates from $\mu(m, k)$ by more than $\Delta$; its proof follows from Hoeffding's inequality [15]. We now present the upper bound on the expected regret incurred by the users employing Algorithm 1.
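To see how Hoeffding's inequality translates an accuracy requirement into a number of samples, the Python sketch below uses the standard two-sided form for i.i.d. rewards in $[0, 1]$: the empirical mean of $n$ samples deviates from the true mean by more than a gap with probability at most $2\exp(-2n \cdot \text{gap}^2)$. The function names and the target failure probability are assumptions for illustration; the constants used in the paper's estimation phase may differ.

```python
import math


def hoeffding_bound(num_samples, gap):
    """Two-sided Hoeffding bound on P(|empirical mean - true mean| > gap)
    for num_samples i.i.d. rewards taking values in [0, 1]."""
    return 2.0 * math.exp(-2.0 * num_samples * gap ** 2)


def hoeffding_samples(gap, failure_prob):
    """Smallest number of samples for which the bound above is at most failure_prob.

    Obtained by solving 2 * exp(-2 * n * gap**2) <= failure_prob for n.
    """
    return math.ceil(math.log(2.0 / failure_prob) / (2.0 * gap ** 2))
```

For example, `hoeffding_samples(0.1, 0.05)` evaluates to 185, and halving the gap roughly quadruples the required number of samples, which is why the estimation length scales as $1/\Delta^2$.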
Theorem 1: The expected regret incurred by employing Algorithm 1 is $O(\log T)$.
Proof: Let $L$ denote the number of complete epochs in the time horizon $T$. By construction of the algorithm, we have that $L < \log T$. Let the regret incurred during the estimation phases of all epochs be denoted by $R_{\mathrm{est}}$ and the regret incurred during the allocation phases be denoted by $R_{\mathrm{alloc}}$. The estimation phase in each epoch proceeds for a fixed number of time units that depends on $\Delta$, so $R_{\mathrm{est}}$ grows at most in proportion to $L$, i.e., as $O(\log T)$. Note that regret is incurred in the allocation phase only in the case when there exists some user $j \in [K]$, some channel $m \in [M]$ and some $k \in [N]$ such that $|\hat{\mu}_j(m, k) - \mu(m, k)| > \Delta$. We have from Lemma 1 that the probability of this event in epoch $\ell$ is upper bounded by a term that decays exponentially in $\ell$. Combining this with the length of the allocation phases yields a bound on $\mathbb{E}[R_{\mathrm{alloc}}]$. Therefore, the expected regret for Algorithm 1 for a time horizon $T$ is $O(\log T)$.
We now extend the results to a dynamic system with a changing number of users. The key idea is to run Algorithm 1 repeatedly in super-epochs, each consisting of the epochs described in Section III. In order to obtain a sub-linear regret bound, we restrict the number of users entering and leaving the system until time $T$, denoted by $V_T$, to be sub-linear. Let $V_T$ be $O(T^{\alpha})$, where $\alpha < \frac{1}{2}$. We note that this is different from [5], where the time horizon is fixed and known, and there is also a restriction on when users can enter or leave the system. Each user considers some known time $\tau$, which is greater than the duration of the estimation phase, and runs Algorithm 1. After time $\tau$, the user continues to use Algorithm 1 with a super-epoch of length $2\tau$, then $3\tau$, and so on. Let $K_t$ denote the number of active users at time $t$, where $K_t \le MN$. The resulting algorithm is given in Algorithm 4.
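The logarithmic growth of the number of epochs used in the proof of Theorem 1 can be checked numerically. The Python sketch below assumes that epochs are indexed from 1, that each epoch has an estimation phase of a fixed length `est_len`, and that the allocation phase of epoch $\ell$ lasts $2^{\ell}$ slots; the indexing, the function names, and the value of `est_len` are assumptions for illustration.

```python
import math


def complete_epochs(horizon, est_len):
    """Number of complete epochs that fit in `horizon` time slots.

    Epoch l (l = 1, 2, ...) is assumed to consist of est_len estimation
    slots followed by 2**l allocation slots.
    """
    used, epoch = 0, 0
    while used + est_len + 2 ** (epoch + 1) <= horizon:
        epoch += 1
        used += est_len + 2 ** epoch
    return epoch


def estimation_slots(horizon, est_len):
    """Total number of slots spent in estimation phases of complete epochs."""
    return est_len * complete_epochs(horizon, est_len)


def epoch_bound(horizon):
    """Base-2 logarithm of the horizon, an upper bound on the epoch count."""
    return math.log2(horizon)
```

Because the allocation phases double in length, the number of complete epochs stays below `epoch_bound(horizon)`, and the total time spent estimating, which is where regret is necessarily incurred, grows only logarithmically with the horizon.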

Algorithm 4 Dynamic Allocation
We now show that if all the users employ Algorithm 4, the system-wide regret is sub-linear in $T$ when $\alpha < \frac{1}{2}$. We emphasize that the users do not need to know the time horizon $T$ to achieve sub-linear regret.
Proof: Note that the number of complete super-epochs until time $T$ is at most $\sqrt{2T/\tau}$, and recall that $K_t \le MN$. Let $\mathcal{E}_T$ denote the set of super-epochs until $T$ during which at least one user enters or leaves the system. Note that $|\mathcal{E}_T| \le V_T$. Let $R_s$ denote the regret accumulated in super-epoch $s$. In super-epochs where no users enter or leave the system, the regret is bounded according to Theorem 1 by a constant multiple of $\log T$, and in super-epochs in $\mathcal{E}_T$, the regret accumulates through the entire super-epoch. Summing the two contributions over all super-epochs up to time $T$ gives $\mathbb{E}[R(T)] \sim O(T^{1/2} \log T + V_T T^{1/2})$, and if $V_T$ is $O(T^{\alpha})$ with $\alpha < \frac{1}{2}$, we have sub-linear regret.
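The square-root scaling in the dynamic regret bound comes from the super-epoch schedule, whose lengths grow linearly. The Python sketch below counts the complete super-epochs within a horizon, assuming the $i$-th super-epoch lasts $i$ times a base length, called `tau` here; the function names and numerical values are assumptions for illustration.

```python
import math


def complete_super_epochs(horizon, tau):
    """Number of complete super-epochs within `horizon` slots when the
    i-th super-epoch (i = 1, 2, ...) lasts i * tau slots."""
    count, used = 0, 0
    while used + (count + 1) * tau <= horizon:
        count += 1
        used += count * tau
    return count


def longest_super_epoch(horizon, tau):
    """Length of the longest complete super-epoch within `horizon` slots."""
    return complete_super_epochs(horizon, tau) * tau


def super_epoch_count_bound(horizon, tau):
    """Bound sqrt(2 * horizon / tau), which follows from
    tau * n * (n + 1) / 2 <= horizon for n complete super-epochs."""
    return math.sqrt(2.0 * horizon / tau)
```

With `horizon = 10000` and `tau = 50`, there are 19 complete super-epochs, below the bound of 20, and the longest one lasts 950 slots, below $\sqrt{2 \cdot 10000 \cdot 50} = 1000$. Both the number of super-epochs and the length of any single one therefore grow as the square root of the horizon.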

VI. EXPERIMENTS
We consider a system with $K = 10$ users and $M = 6$ channels with $N = 3$. The reward distributions are chosen to be uniform with a variance of 0.01, and means between 0 and 1. The performances of Algorithm 1 and an algorithm where the users choose channels uniformly at random are compared in Fig. 1. It can be seen from the figure that the regret incurred by the naive random selection algorithm is linear, whereas the regret incurred by Algorithm 1 is sub-linear. Algorithm 1 performs worse initially because the allocation phase in each early epoch is short compared to the estimation phase. Note that the allocation phase of epoch $\ell$ proceeds for $2^{\ell}$ time units, and we can see from the flat regions of the plot that the regret incurred during the allocation phase is zero with high probability.
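The linear regret of the random-selection baseline can be reproduced with a short simulation. The Python sketch below uses the stated system size (10 users, 6 channels, at most 3 rewarded users per channel) and uniform rewards with variance 0.01. The mean rewards are not given in the text, so the sketch draws hypothetical means that decrease with occupancy; the function names, the seed, and the way the means are generated are all assumptions, and the resulting numbers are not those of Fig. 1.

```python
import random
from itertools import product

K, M, N = 10, 6, 3              # users, channels, max rewarded occupancy
HALF_WIDTH = (3 * 0.01) ** 0.5  # uniform on [mean - w, mean + w] has variance w**2 / 3


def make_means(rng):
    """Hypothetical means, decreasing in occupancy, kept inside [w, 1 - w]
    so that every sampled reward lies in [0, 1]. Index 0 is unused."""
    means = []
    for _ in range(M):
        vals = sorted((rng.uniform(HALF_WIDTH, 1 - HALF_WIDTH) for _ in range(N)),
                      reverse=True)
        means.append([0.0] + vals)
    return means


def mean_reward(means, m, k):
    """Mean per-user reward on channel m with k users (zero beyond N users)."""
    return means[m][k] if 1 <= k <= N else 0.0


def best_system_reward(means):
    """Largest expected system reward over feasible occupancy vectors."""
    return max(sum(k * mean_reward(means, m, k) for m, k in enumerate(c))
               for c in product(range(N + 1), repeat=M) if sum(c) == K)


def random_access_regret(horizon, seed=0):
    """Cumulative system regret when every user picks a channel uniformly at random."""
    rng = random.Random(seed)
    means = make_means(rng)
    best = best_system_reward(means)
    regret, curve = 0.0, []
    for _ in range(horizon):
        counts = [0] * M
        for _ in range(K):
            counts[rng.randrange(M)] += 1
        reward = 0.0
        for m, k in enumerate(counts):
            mu = mean_reward(means, m, k)
            if mu > 0:  # overcrowded or empty channels contribute nothing
                reward += sum(rng.uniform(mu - HALF_WIDTH, mu + HALF_WIDTH)
                              for _ in range(k))
        regret += best - reward
        curve.append(regret)
    return curve
```

Plotting the list returned by `random_access_regret(2000)` against the slot index should show an approximately straight line, since random selection incurs a roughly constant expected loss in every slot.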

VII. CONCLUSION
We developed algorithms for uncoordinated spectrum access within the framework of stochastic multi-armed bandits. We allowed for the users to receive non-zero rewards on collisions, and for the number of users to be greater than the number of channels. In this setup, we presented an algorithm that achieves order-optimal system regret of $O(\log T)$. We also presented an algorithm that achieves sub-linear regret for the dynamic case where the number of users evolves over time. It is of interest to extend the results in this paper to the case of heterogeneous reward distributions across arms; some initial results in this direction are explored in [8], [11].