Neural Episodic Control-based Adaptive Modulation and Coding Scheme for Inter-satellite Communication Link

Inter-satellite links (ISLs) play an important role in the global navigation satellite system (GNSS) and are regarded as one of the key technologies for the next generation of navigation satellite systems. Deep reinforcement learning algorithms have achieved significant improvements across various wireless communication systems. However, it has been reported that the deep Q network (DQN) algorithm requires an enormous number of trials. To resolve this problem, in this paper we propose an adaptive modulation and coding scheme based on the neural episodic control (NEC) algorithm, a deep reinforcement learning algorithm. The proposed scheme adjusts the modulation and coding scheme region boundaries with the differentiable neural dictionary of the NEC agent, which enables effective integration of previous experience. In addition, we propose a step-size varying algorithm that encourages the NEC agent to approach the suboptimal state efficiently. We confirm that the proposed scheme reduces the number of trials to less than 1/8 of that of the DQN-based adaptive modulation scheme. To further evaluate the proposed scheme, we employ an online learning loss evaluation algorithm that calculates the per-time-step loss based on the interaction records of the reinforcement learning agent and the derived modulation and coding scheme region boundaries.

The maximum throughput algorithm chooses the MCS region boundaries following the envelopes of the maximum throughput. These MCS region boundaries, given in the form of signal-to-noise ratio (SNR), are then recalculated into the distance between satellites. In addition, a continuous phase modulation-based AMC scheme is proposed in [12] within the framework of [11]. It has been reported that the scheme in [12] improves the spectral efficiency of the inter-satellite communication link at the same transmit power level.

A. BACKGROUND AND RELATED WORKS
The reinforcement learning algorithm, one of the machine learning algorithms, is based on interactions between an environment and a learning model [13], called an agent. Reinforcement learning algorithms have been effectively employed in various decision-making areas of communication systems [14]–[20]. In [14], a fuzzy Q-learning-based MCS selection and MIMO configuration scheme for high-speed access evolution systems is proposed. The proposed scheme observes the channel quality indication and block error rate between a base station and user equipment. By combining fuzzy logic with the Q-learning algorithm, an appropriate MCS and MIMO configuration are selected via a closed-loop iteration of the reinforcement learning agent. It is confirmed that the scheme outperforms the conventional adaptive threshold selection scheme and the Q-learning-based hybrid automatic repeat request scheme [15] under the given block error rate requirement.
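The closed-loop iteration above rests on the classic tabular Q-learning update. The following minimal sketch shows that update rule in isolation; the two-state, two-action toy problem and the reward are illustrative placeholders, not the fuzzy Q-learning scheme of [14].

```python
# Classic tabular Q-learning update:
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# The states, actions, and reward below are illustrative only.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    best_next = max(Q[s_next].values())  # greedy bootstrap from the next state
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
    return Q[s][a]

# Two states, two actions, all Q-values initialized to zero.
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}
v = q_update(Q, s=0, a=1, r=1.0, s_next=1)  # first update moves Q(0,1) toward r
```

Repeating such updates over many interactions is what drives the threshold selection toward the block-error-rate target.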
However, due to the enormous dimension of the state-action space in practical environments, deep reinforcement learning algorithms have been proposed, bringing significant advances to the decision-making process [16], [17]. Deep reinforcement learning algorithms combine deep neural networks with the reinforcement learning framework to improve automated decision-making in an online manner. Compared with conventional heuristic approaches, deep reinforcement learning algorithms adapt easily to environmental changes by learning from the agent's new experiences [18]. Moreover, it is known that reinforcement learning-based approaches can provide optimal decisions at each time step without any prior knowledge of the environment [13].
In [19], a deep Q network (DQN) algorithm-based adaptive modulation scheme with outdated channel state information is proposed. To overcome the limitation of outdated channel state information that changes non-linearly, the DQN agent selects a proper modulation scheme from the received signal strength. It has been reported that this DQN-based adaptive modulation scheme outperforms conventional outdated channel state information measurement schemes. Moreover, in [20], a DQN algorithm-based adaptive modulation scheme with a trial-efficient strategy is proposed. The scheme divides the SNR range into 4 regions with different bits-per-symbol rates, which reduces the computational complexity compared to the scheme in [21]. The DQN algorithm is known to have the following advantages [22]. Firstly, deep neural networks are employed to encourage the agent to explore efficiently among its action candidates by setting the outputs of the networks as the action options, which can be viewed as a form of virtual exploration [23], [24]. Secondly, the correlation between input data is removed to stabilize the learning stage with a replay memory scheme [22], [25]. Lastly, the main and target networks are separated to stabilize the update process of the neural network. With these advantageous properties, the dilemma between computational complexity and throughput loss incurred by the quantized states in [21] can be relieved.
The DQN algorithm has various improved forms with advantageous properties. The double DQN (DDQN) [26] and dueling DDQN [27] algorithms were proposed to improve the DQN algorithm. The DDQN algorithm mitigates the effect of overestimation in the DQN algorithm: over-optimistic selection and evaluation of actions are effectively reduced by decomposing the max operation in the DQN target value into separate action selection and evaluation. Meanwhile, the dueling DDQN algorithm adopts a distinctive neural network structure, termed the dueling architecture [27]. In the dueling architecture, the last hidden layer is split into two streams that generate the state value and the action advantage value, which are combined into the Q-values [27, Fig. 1]. It is reported that, via the dueling-structured network, the agent can learn which states are valuable without observing the effect of each action at each state.
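The dueling aggregation step described above can be sketched in a few lines: the state-value head V(s) and the advantage head A(s, a) are recombined as Q = V + (A − mean(A)), with the mean subtraction making the decomposition identifiable [27]. The head outputs below are illustrative numbers, not trained values.

```python
import numpy as np

# Dueling aggregation: combine a scalar state value and a vector of
# per-action advantages into Q-values, Q = V + (A - mean(A)).
def dueling_q(value, advantages):
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

# With a mean advantage of 2.0, the centered advantages shift Q around V.
q = dueling_q(value=2.0, advantages=[1.0, 3.0, 2.0])
```

Subtracting the mean advantage keeps V(s) interpretable as the average Q-value of the state, which is the design choice that lets the agent rank states without probing every action.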

B. MAIN CONTRIBUTIONS
It has been reported that the DQN algorithm requires a large number of trials and training processes of the reinforcement learning agent [28]. Motivated by this fact, in this paper we propose a neural episodic control (NEC) algorithm-based AMC scheme for ISLs. The NEC algorithm, one of the deep reinforcement learning algorithms, integrates previous experience faster than the DQN algorithm. By employing the distinctive memory structure of the NEC algorithm, the agent is able to make optimal decisions among the action candidates. Further, it has been reported that the adaptive modulation scheme in [20] can be viewed as a gradient-based approach. In this sense, a variable step-size algorithm [29] can be applied to our scheme to minimize the number of trials of the agent and the convergence time of the algorithm.
The main contributions of this paper are listed in the followings.
• The NEC-based AMC scheme is proposed. By employing the NEC algorithm, the number of trials the agent needs to reach the suboptimal state can be significantly reduced, while the NEC algorithm requires a smaller volume of memory than the DQN algorithm.
• To minimize the required number of trials to the suboptimal state, a step-size varying algorithm is employed in conjunction with the NEC-based AMC scheme to encourage the agent to reach the suboptimal point faster.
• We employ an online learning loss evaluation algorithm to calculate the spectral utilization loss based on the interaction records of the reinforcement learning agent and the derived MCS region boundaries.

The remainder of this paper is organized as follows. Section II presents the system models of inter-satellite communication and reinforcement learning. In Section III, the NEC-based AMC scheme and the online learning loss evaluation algorithm are described. Simulation settings and results of the proposed scheme are shown in Section IV. Lastly, concluding remarks are given in Section V.

II. SYSTEM MODEL

A. INTER-SATELLITE COMMUNICATION MODEL
It has been reported that the loss for satellite-to-satellite communication links contains various loss factors such as antenna directivity, antenna polarization loss, etc. [11]. However, free-space loss is the primary factor of the ISL, which is dominantly dependent on the distance between satellites [12]. The SNR per bit of the ISL at the receiver can be obtained as [11]

E_b/N_0 = EIRP · (G/T) · (λ/(4πd))² / (k · R · M_0),    (1)

where E_b denotes the energy per bit, N_0 is the one-sided noise power spectral density, EIRP is the effective isotropic radiated power, G/T is the figure of merit of the receiving antenna, k represents the noise (Boltzmann) constant of the system, d is the distance between satellites, λ is the wavelength of the carrier, R denotes the data rate, and M_0 is the system margin, respectively.
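The link budget in (1) is often evaluated in dB form, where the free-space loss term 20·log10(4πd/λ) makes the distance dependence explicit. The sketch below assumes the standard dB-form budget consistent with the symbols of (1); all parameter values (EIRP, G/T, rate, margin) are illustrative, not the settings of [11].

```python
import math

# Eb/N0 link budget in dB:
#   Eb/N0 = EIRP + G/T - 20*log10(4*pi*d/lambda)
#           - 10*log10(k) - 10*log10(R) - M0.
BOLTZMANN = 1.380649e-23  # Boltzmann constant, J/K

def ebn0_db(eirp_dbw, g_over_t_db, d_m, wavelength_m, rate_bps, margin_db):
    free_space_loss_db = 20.0 * math.log10(4.0 * math.pi * d_m / wavelength_m)
    return (eirp_dbw + g_over_t_db - free_space_loss_db
            - 10.0 * math.log10(BOLTZMANN)
            - 10.0 * math.log10(rate_bps) - margin_db)

# 30 GHz carrier (lambda = c/f = 0.01 m), 40,000 km ISL, 1 Mbps link.
snr = ebn0_db(eirp_dbw=20.0, g_over_t_db=5.0, d_m=4.0e7,
              wavelength_m=3.0e8 / 30.0e9, rate_bps=1.0e6, margin_db=3.0)
```

Doubling the distance d lowers Eb/N0 by 20·log10(2) ≈ 6 dB, which is exactly the free-space dependence the AMC scheme exploits when recalculating SNR boundaries into distances.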
From the perspective of the constellation of the satellites, the standard Walker 24/3/1 constellation [11] is considered in this paper. It consists of 24 medium earth orbit (MEO) satellites, 3 geostationary orbit (GEO) satellites, and 3 inclined GEO (IGEO) satellites. The maximum and minimum distances between the satellites in the constellation are provided in [11, Table 1]. The GNSS satellites know the distances to the other satellites from the ephemeris. Thus, a control channel between transmitter and receiver is not required for the AMC schemes of inter-satellite communication systems. The architecture of the adaptive communication model for the inter-satellite communication link is shown in Figure 1 [11], [12]. As shown in Figure 1, the transmitter selects a proper MCS by calculating the distance between the transmitter and the receiver. Then, the receiver demodulates and decodes the transmitted signals via the pilot signal and the distance information.

B. REINFORCEMENT LEARNING MODEL
The reinforcement learning algorithm aims to make the agent learn a task by observing the environment and interacting with it [13]; the observation and the interaction are called a state and an action, respectively. The agent evaluates its action by a reward, which is given by the environment, and the primary goal of the agent is to maximize the cumulative reward. In this paper, the reinforcement learning agent has the following components:
■ S = {s_1, s_2, s_3} represents the state values, which are the rate region boundaries of the SNR range [20]. In this paper, 4 different modulation schemes are allocated with the 3 state values. Moreover, a convolutional coding scheme [30] with rates 1/2, 2/3, 3/4, and 5/6 is employed in the respective rate regions.
■ A = {a_1, ⋯, a_6} denotes the action candidates of the agent. Each pair of actions increases or decreases the corresponding state value [20]. For example, in the pair of a_5 and a_6, a_5 decreases the value of s_3 by a step size and a_6 increases s_3 by the same amount.
■ R = {r} is the reward function, where r denotes the reward unit earned by the agent, which is assumed to be [21]

r = (1/N) Σ_{i=1}^{N} log_2(M_i) · c_i · (1 − BER̄_i),    (2)

where M_i denotes the modulation order of the i-th MCS region, c_i is the code rate of the i-th MCS region, BER̄_i is the average BER of the i-th MCS region, and N is the number of MCS regions, respectively. The reward unit is set to represent the average number of bits successfully transmitted for each MCS region.
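Assuming the reward unit takes the averaged form r = (1/N)·Σ log2(M_i)·c_i·(1 − BER̄_i) described above (the exact expression of [21] is garbled in the extraction), it can be computed directly from the per-region modulation orders, code rates, and average BERs. The BER values below are illustrative.

```python
import math

# Reward unit: average number of bits successfully transmitted per
# MCS region, r = (1/N) * sum_i log2(M_i) * c_i * (1 - avg_BER_i).
def reward_unit(mod_orders, code_rates, avg_bers):
    n = len(mod_orders)
    return sum(math.log2(m) * c * (1.0 - b)
               for m, c, b in zip(mod_orders, code_rates, avg_bers)) / n

# The four regions of this paper: QPSK, 16-QAM, 64-QAM, 256-QAM with
# code rates 1/2, 2/3, 3/4, 5/6; the BER values are illustrative.
r = reward_unit(mod_orders=[4, 16, 64, 256],
                code_rates=[1 / 2, 2 / 3, 3 / 4, 5 / 6],
                avg_bers=[1e-4, 1e-4, 1e-4, 1e-4])
```

A region whose BER approaches 1 contributes nothing to the reward, so the agent is pushed toward boundaries that keep every region's BER low while maximizing the bits per symbol.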

C. NEURAL EPISODIC CONTROL MODEL
Deep reinforcement learning algorithms have shown significant performance improvements over various environments with neural networks. However, it has been reported that DQN-based approaches require an enormous number of trials of the agent [28], and the NEC algorithm has been proposed to resolve this problem. The NEC algorithm has a distinctive memory unit called the differentiable neural dictionary (DND), which can be represented for each action a as D_a = (K_a, V_a), where k_i and v_i are the i-th elements of the key array K_a and the corresponding value array V_a, respectively. The NEC agent keeps one DND for each action that the agent can take. The DNDs are utilized for two operations of the agent: 1) write and 2) lookup [28].
In the operation of writing, a new key-value pair will be added to the DND of the corresponding action that the agent takes.
The key vector h can be obtained from the embedding, which is the output of the neural network. The corresponding value can be obtained from the N-step Q-learning value [28] as

Q^(N)(s_t, a) = Σ_{j=0}^{N−1} γ^j r_{t+j} + γ^N max_{a'} Q(s_{t+N}, a'),    (4)

where γ denotes the discounting factor, N is the number of steps, and the subscript of r is the time step, respectively. The value of Q^(N)(s_t, a) is stored as the corresponding value v_i. However, if the embedded key vector already exists in the DND, the corresponding value is updated with the classic Q-learning rule as [28]

v_i ← v_i + α(Q^(N)(s_t, a) − v_i),

where α is the learning rate of the Q-learning algorithm. The agent appends or updates the key-value pairs of the corresponding DND through its interactions. The agent estimates the best action choice with the lookup process. After the embedding is made, the agent compares the Q-function value of each action DND, which can be obtained as [28]

Q(s, a) = Σ_i w_i v_i,  with  w_i = κ(h, k_i) / Σ_j κ(h, k_j),

where κ(·, ·) denotes the inverse kernel function [28, Eq. 5], k_i is the i-th element of K_a, and h is the embedding, respectively. The output of each DND can be viewed as a weighted sum of values, where the proper key vectors are selected as the sets most similar to h. In this paper, the k-nearest neighbor algorithm is considered to pick the proper vectors. In Figure 2, the architecture of the DND of each action is shown. The NEC agent selects the optimal action at each time step via the lookup operation on each action DND. To encourage the agent to explore the action space, an ε-greedy algorithm-based exploration and exploitation strategy is considered [13].
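The write and lookup operations above can be sketched as a small dictionary class. The inverse kernel is assumed to be κ(h, k_i) = 1/(||h − k_i||² + δ) as in [28], and a brute-force nearest-neighbour search stands in for the kd-tree approximation used later in the paper; the learning rate and the stored key-value pairs are illustrative.

```python
import numpy as np

# One action's differentiable neural dictionary (DND): keys are
# embeddings, values are N-step Q estimates; lookup is a kernel-weighted
# sum over the k nearest keys.
class DND:
    def __init__(self, alpha=0.5, delta=1e-3):
        self.keys, self.values = [], []
        self.alpha, self.delta = alpha, delta  # Q-learning rate, kernel offset

    def write(self, key, value):
        for i, k in enumerate(self.keys):
            if np.allclose(k, key):  # key already present: Q-learning update
                self.values[i] += self.alpha * (value - self.values[i])
                return
        self.keys.append(np.asarray(key, dtype=float))  # otherwise append
        self.values.append(float(value))

    def lookup(self, h, k=2):
        h = np.asarray(h, dtype=float)
        d2 = np.array([np.sum((h - key) ** 2) for key in self.keys])
        nearest = np.argsort(d2)[:k]                # k nearest keys
        w = 1.0 / (d2[nearest] + self.delta)        # inverse kernel weights
        w /= w.sum()
        return float(np.dot(w, np.array(self.values)[nearest]))

dnd = DND()
dnd.write([0.0, 0.0], 1.0)
dnd.write([1.0, 1.0], 3.0)
q = dnd.lookup([0.0, 0.1], k=2)  # dominated by the value of the closer key
```

Because the lookup interpolates over stored experience rather than relying solely on slowly trained network weights, a single good episode immediately influences later Q-estimates, which is the source of the NEC agent's fast integration of experience.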

III. NEC-BASED AMC SCHEME

A. AMC SCHEME
Modulation schemes of Gray-coded QPSK, 16-QAM, 64-QAM, and 256-QAM are employed for the proposed scheme. The modulation schemes with different code rates are allocated to the MCS regions divided by S. If the instantaneous SNR between the transmitter and receiver falls into an MCS region, the assigned modulation and coding scheme is selected [7]. We consider BER approximation functions for each modulation scheme. The BER approximation for Gray-coded QPSK can be obtained as [31, Eq. 8.32]

BER_QPSK ≈ Q(√(2γ_b)),

where γ_b denotes the SNR per bit and Q(·) is the Gaussian Q-function. We also consider the convolutional coding scheme for the proposed AMC scheme. The coding gain of the convolutional coding scheme can be obtained as [32, Eq. 7.23]

G_c ≤ 10 log_10(r_c d_f),

where r_c denotes the code rate and d_f represents the free distance of the coding scheme, respectively.
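The two quantities above can be evaluated numerically: the Gaussian Q-function via the complementary error function, and the coding-gain bound directly from the rate and free distance. The free distance used below (d_f = 10 for a rate-1/2 code, typical of the common constraint-length-7 convolutional code) is an assumption for illustration, not a value taken from [32].

```python
import math

# Gaussian Q-function, Q(x) = 0.5 * erfc(x / sqrt(2)).
def q_function(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Gray-coded QPSK BER approximation, Q(sqrt(2 * Eb/N0)) with Eb/N0 linear.
def qpsk_ber(ebn0_linear):
    return q_function(math.sqrt(2.0 * ebn0_linear))

# Convolutional coding gain upper bound, Gc <= 10*log10(rc * df) in dB.
def coding_gain_db(code_rate, free_distance):
    return 10.0 * math.log10(code_rate * free_distance)

ber = qpsk_ber(10.0)               # Eb/N0 = 10 dB expressed in linear scale
gain = coding_gain_db(1 / 2, 10)   # assumed df = 10 for the rate-1/2 code
```

In the AMC scheme, the coding gain effectively shifts each region's uncoded BER curve left by G_c dB before the boundary check against the target BER.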
The spectral utilization rate of each modulation and coding scheme can be represented as [12] η_i = log_2(M_i) · c_i [bps/Hz].
In this paper, a weighted average of the spectral utilization rate is considered to reflect the range of each MCS region, which can be derived as

η̄ = Σ_{i=1}^{N} D_i η_i,    (11)

where D denotes the range of the SNR region in dB scale, assumed to be 25 dB in this paper, and D_i is the range of the i-th MCS region. However, it has been observed that the overall spectral utilization rate of the proposed scheme does not vary significantly because the range term of each MCS region is dominant in (11). To compensate for this, the overall SNR range is normalized into the range of [0, 1]. The spectral utilization rate with the normalized range can be obtained as

η̄_n = Σ_{i=1}^{N} D_{n,i} η_i,    (12)

where D_{n,i} denotes the normalized range of each MCS region.
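The normalized form (12) can be sketched as follows: each region's width is rescaled by the overall 25 dB range so that the widths sum to 1, and the per-region rates η_i = log2(M_i)·c_i are averaged with those weights. The boundary values below are illustrative, not the derived ones.

```python
import math

# Weighted spectral utilization over normalized MCS region ranges:
# eta_n = sum_i (D_i / D) * log2(M_i) * c_i, with D the overall SNR range.
def weighted_spectral_utilization(mod_orders, code_rates, boundaries,
                                  snr_range=25.0):
    edges = [0.0] + list(boundaries) + [snr_range]
    widths = [(b - a) / snr_range for a, b in zip(edges, edges[1:])]
    return sum(w * math.log2(m) * c
               for w, m, c in zip(widths, mod_orders, code_rates))

# Illustrative state values s1, s2, s3 (dB) splitting the 25 dB range.
eta = weighted_spectral_utilization(
    mod_orders=[4, 16, 64, 256],
    code_rates=[1 / 2, 2 / 3, 3 / 4, 5 / 6],
    boundaries=[8.0, 14.0, 20.0])
```

Pushing a boundary down widens the next higher-order region and raises η̄_n, which is exactly the pressure the reward function applies until the BER constraint bites.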

B. NEC-BASED AMC SCHEME
The proposed NEC-based AMC algorithm is shown in Algorithm 1. It is assumed that the DNDs for all actions are initialized by random actions of the agent before the interaction of the agent starts. Then, the agent obtains an embedding vector from S and observes the spectral utilization rate and the BER of each region before and after taking an action to calculate the reward. The agent receives a penalty when at least one of the maximum BERs of each region exceeds the target BER, which is assumed to be 10^−3 in this paper. The agent loads the previous optimal state at the beginning of each episode and whenever the BER constraint of a region is violated, to encourage the agent to converge efficiently to the suboptimal state [20]. Then, if the spectral utilization rate has increased, the agent receives the reward via the reward function. The agent also receives a penalty if S is invalid. The invalid conditions are 1) S exceeds the bounds of the simulation range, and 2) two state values come within a 3 dB margin of each other. The penalty decreases the cumulative reward, and it is expected that the agent will follow the valid conditions. After the reward is given to the agent, the key-value pair is written to the corresponding DND, which can be expressed as (e, Q^(N)(s, a)). Replay memory is employed in the NEC algorithm, as in the DQN algorithm [20]. The transition information is saved in the replay memory for use in the training process of the neural network. The training minimizes the L2 loss between the predicted output value of an action and the Q-value of mini-batches randomly sampled from the replay memory [28], which can be obtained as

L(θ) = (Q^(N)(s_j, a_j) − Q(s_j, a_j; θ))²,    (13)

where θ denotes the trainable variables of the neural network, and the terms with the subscript j represent the data sampled from the replay memory.
A back propagation-based update of θ in the neural network is considered, which can be viewed as training the network to produce effective embedding representations that reflect the context and features of the environment.
The proposed scheme can be viewed as a gradient-based approach, which is based on steps toward the suboptimal state of the agent. Inspired by this fact, the step-size varying algorithm can be applied. The step size of the algorithm is set to 1/(n+1) in dB scale, which is inversely proportional to the episode index n. The required number of trials to reach the suboptimal state of the proposed algorithm can be described as

T = |S* − S_0| / Δs,

where S* denotes the suboptimal state, S_0 refers to the initial state of the agent, Δs is the step size per action, and T is the required number of trials. Because S* and S_0 are assumed to be the same in the step-size varying algorithm and in the fixed step-size case, T depends only on the step size. It can be interpreted that efficient convergence to the suboptimal state can be achieved with the step-size varying algorithm because of its faster approach in the range of lower numbers of episodes and its precise tuning in the range of higher numbers of episodes.
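The trade-off above can be illustrated with a one-dimensional toy model in which a single boundary must close a gap |S* − S_0|: a fixed small step needs many actions, while the 1/(n+1) schedule closes the gap quickly at first and refines later. The gap of 3 dB and the simplified "one step size per episode" model are assumptions for illustration, not the paper's exact schedule.

```python
# Step-size varying rule: step in dB shrinks as 1/(n+1) with episode n.
def step_size(episode):
    return 1.0 / (episode + 1)

# Toy count of actions needed to close a gap of gap_db with the decaying
# step (one step size per episode, as many actions as needed per episode).
def trials_to_reach(gap_db, episodes=50):
    trials, remaining = 0, gap_db
    for n in range(episodes):
        step = step_size(n)
        while remaining > step:
            remaining -= step
            trials += 1
        if remaining <= step:   # final partial step closes the gap
            remaining = 0.0
            trials += 1
            break
    return trials

fixed = round(3.0 / 0.025)      # fixed 0.025 dB steps for a 3 dB gap
varying = trials_to_reach(3.0)  # decaying steps close the same gap faster
```

The contrast (hundreds of fixed steps versus a handful of decaying ones) is the intuition behind the reduced trial counts reported in Section IV, even though the real agent moves three boundaries under BER constraints.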
The error between the optimal Q-function value and the Q-function value of the i-th iteration can be obtained as [33]

Δ_i = Q(s, a; θ_i) − Q*(s, a) = Z_i + R_i + D_i,

where Q*(·) denotes the optimal Q-function value, Z_i is the target approximation error, R_i refers to the overestimation error, and D_i denotes the optimality difference. The three components of the overall error can be described as [33]

Z_i = Q(s, a; θ_i) − y_i(s, a),
R_i = y_i(s, a) − ŷ_i(s, a),
D_i = ŷ_i(s, a) − Q*(s, a),

where y_i(s, a) is the target value of the network and ŷ_i(s, a) denotes the true target value; the target value can be expressed as [33]

y_i(s, a) = E[r + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a].    (19)

Because the N-step Q-function can be viewed as the target value of the DQN algorithm [28], it can be expected that the target value of the NEC network can be derived by substituting (4) into (19). The overestimation error represents a bias in the overall error term, with an upper bound proportional to the number of actions in a state [34]. It can be assumed that the overestimation error is not dominant because the selectable actions in all states are assumed to be the same in the NEC and DQN algorithms. Besides, the optimality difference represents the difference between the optimal Q-function value and the true target value. Because the state and action space of the true target value remains within the boundaries of the environment, it can be expected that the optimality difference is not a dominant term.

[Algorithm 1, steps recoverable from the extraction — 9: Get an embedding vector e from S. 10: Evaluate the spectral utilization rate via (12). 11: Evaluate the BER of each rate region via (7)–(9). 12: Take an action with the ε-greedy algorithm and update S. 13–14: Re-evaluate the spectral utilization rate and the BER of each rate region. … 22: Write (e, Q^(N)(s, a)) to the DND D_a. 23: Store the transition (s, a, Q^(N)(s, a)) in the replay memory. 24: if step == J then store arg max in B end if. 25: Sample random mini-batches from the replay memory. 26: Train the network with the mini-batches via (13). 27: end for. 28: end for.]
On the other hand, the target approximation error, which can be dominant among the error terms above, comprises the following components [33]: 1) suboptimality of the trainable network parameters due to inexact minimization by the optimizer, 2) the modeling error of the neural network, and 3) the generalization error over unexperienced state-action pairs. The first and second factors can be assumed to be equal for the NEC and DQN algorithms because both are assumed to employ the same neural network and optimizer. Thus, it can be interpreted that the NEC algorithm can effectively reduce the error between the optimal Q-function value and its Q-function value through the integration of previous experiences, by employing the DNDs and the N-step Q-function.
In addition, the NEC algorithm can be expected to have the following benefits compared to the DQN algorithm. Firstly, the NEC algorithm can integrate the previous experience, which is the record of adjusting the MCS region boundaries, in the form of a dictionary. The DND architecture for each action candidate allows the agent to take the optimal action much faster. Secondly, the N-step Q-function in (4) can be viewed as a mixture of on-policy and off-policy back-ups, and it can act as the target network of the DQN algorithm [28], which stabilizes the training process by separating the target from the main network. Due to this advantageous property, it has been reported that the replay memory for the NEC agent can be smaller than that of the DQN algorithm. Lastly, the slowly varying embedding vectors produced by the neural network stabilize the DND. In contrast to the DND, some memory architecture-based approaches empty their memory at the beginning of each episode [35], and such memories can be expected to be unstable over the episodes.

C. ONLINE LEARNING
The online learning loss evaluation algorithm, which evaluates the real-time loss of the AMC model, is shown in Algorithm 2. In this algorithm, the MCS region boundaries and the interaction records of the NEC and DQN models are utilized. Firstly, the interaction records of the AMC model are loaded to reflect the exact interactions of the reinforcement learning model. Then, an SNR value γ and a probability p are randomly sampled, where p̄ denotes the threshold probability of the current link being maintained. If p > p̄, the SNR between satellites is maintained within a range of 1 dB. The spectral utilization rate loss due to a misplaced MCS occurs in the case of s_i* < γ < s_i for each MCS region, where s_i* refers to the optimal state value. For example, if the SNR value falls into the region s_1* < γ < s_1, 16-QAM with code rate 2/3 should be employed; however, QPSK with code rate 1/2 will be assigned in this area until the agent reaches the suboptimal state. The spectral utilization rate loss is evaluated with the L1 loss, which can be obtained as

L_t = |η* − η_t|,

where η* is the spectral utilization rate with the optimal MCS region boundaries and η_t denotes the spectral utilization rate with the MCS region boundaries at time step t. The time step t is set to the unit time step in Algorithm 2, where a unit time step is assumed to include the time of a single adjusting step of the agent, the generation of the random SNR and link-maintenance probability, and the loss evaluation process.
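The loss accumulation above can be sketched as follows: at each unit time step an SNR sample is drawn, the MCS region is located under both the agent's current boundaries and the optimal ones, and the L1 spectral-utilization gap |η* − η_t| is accumulated. The link-maintenance probability and the agent's interaction records are omitted here for brevity, and the boundary values are illustrative.

```python
import bisect
import random

# Per-region rates eta_i = log2(M_i) * c_i for QPSK .. 256-QAM.
ETAS = [1.0, 8 / 3, 4.5, 20 / 3]

def region_eta(snr, boundaries):
    # bisect returns the index of the MCS region the SNR falls into.
    return ETAS[bisect.bisect(boundaries, snr)]

def online_loss(current, optimal, steps=1000, snr_range=25.0, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(steps):
        snr = rng.uniform(0.0, snr_range)
        total += abs(region_eta(snr, optimal) - region_eta(snr, current))
    return total

# Agent boundaries 1 dB above the optimal ones: only SNR samples landing
# in the misplaced strips contribute to the loss.
loss = online_loss(current=[9.0, 15.0, 21.0], optimal=[8.0, 14.0, 20.0])
```

As the agent's boundaries converge, the misplaced strips shrink and the per-step loss goes to zero, which is the behaviour the cumulative-loss curves in Section IV visualize.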

IV. SIMULATION RESULTS
A. NEC-BASED AMC SCHEME
The simulation parameter settings are listed in Table 1, and further assumptions on the simulation are made as follows: 1) a 4-layered fully connected network is employed [20]; 2) the dimensions of the input and output of the neural network are set to 3 for the NEC algorithm, which are the dimensions of S and of the embedding vector; 3) the output dimension is set to 6 for the DQN algorithm; 4) the RMSprop optimizer [36] is considered for the training process with a learning rate of 0.4; 5) the DND of each action has the least recently used (LRU) property [28]; 6) the k-nearest neighbor search of the lookup operation is approximated using kd-trees [37]; and 7) the initial state is set to the lowest point of the spectral utilization rate, with the simulation range equally divided, except for the highest MCS region [20].
In Figure 3, the simulation results on the maximum spectral utilization rate, which indicates the maximum spectral utilization rate achieved by the agent within one episode, are compared among the AMC schemes with four base algorithms: the NEC, dueling DDQN, DDQN, and DQN algorithms. The maximum volume of replay memory and the mini-batch size of the dueling DDQN and DDQN algorithms are set to be the same as those of the DQN algorithm. As seen in Figure 3, the proposed scheme approaches the suboptimal state with the smallest number of episodes among the four algorithms, and the average difference in the maximum spectral utilization rate over the episodes between the NEC-based scheme and the DQN-based scheme is 0.4879 bps/Hz. The proposed scheme reaches the suboptimal state within 24 episodes, while the other schemes require 48 episodes for the dueling DDQN, 68 episodes for the DDQN, and 74 episodes for the DQN, respectively. It is observed that the time needed to reach the suboptimal state can be reduced by more than half compared with the DQN-based scheme, while using less memory. In addition, the steep improvement of the proposed scheme can be attributed to the optimal decisions on actions made through the lookup process of the DNDs. Due to the trade-off between the spectral utilization rate and the BER of each MCS region, all algorithms approach the suboptimal point by adjusting the MCS region boundaries while sacrificing BER. Thus, the four base algorithm-based AMC schemes in Figure 3

achieve the same performance under the BER constraint after a sufficient number of adjustments, while the required number of adjustments varies significantly with the base algorithm. In addition to the simulation results in Figure 3, we also investigate a dynamic programming approach. In the proposed scheme, the NEC agent controls the three state values to approach the suboptimal state. In this sense, the formulated problem of the proposed scheme can be viewed as finding the shortest path over three-dimensional points. Dijkstra's shortest path algorithm [39] is employed to find the shortest path from the initial point to the suboptimal point, where each point is set to (s_1, s_2, s_3). In Table 2, the simulation parameters and the evaluation result are given.

TABLE 2. Dijkstra-based approach: parameters and result
Parameter | Value
Step-size | 0.5 dB
Number of nodes | 2,400
Number of node explorations | 9,992

Path weights are assumed only between adjacent nodes that differ by a single step. Due to the enormous volume of the search space with Δs = 0.05 dB, the step size is set to Δs = 0.5 dB. Deriving the shortest path requires 9,992 node explorations. If a node exploration is considered as an action of the reinforcement learning agent, it takes 99 episodes to approach the suboptimal state, which is significantly slower than the four reinforcement learning-based approaches in Figure 3. It can also be expected that more episodes would be needed to approach the final point with the same step size as the reinforcement learning models.
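The baseline above can be sketched as Dijkstra's algorithm over boundary triples (s_1, s_2, s_3) on a coarse grid, with unit-cost edges between states differing by one step in one coordinate. The tiny grid and the start/goal points below are illustrative; the paper's search space (0.5 dB steps over 25 dB) is far larger.

```python
import heapq

# Dijkstra over a 3-D grid of boundary triples with unit-cost edges
# between states that differ by one step in a single coordinate.
def dijkstra_steps(start, goal, lo=0, hi=6):
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist[node]:
            continue  # stale heap entry
        for axis in range(3):
            for delta in (-1, 1):
                nxt = list(node)
                nxt[axis] += delta
                nxt = tuple(nxt)
                if all(lo <= v <= hi for v in nxt) and \
                        d + 1 < dist.get(nxt, float("inf")):
                    dist[nxt] = d + 1
                    heapq.heappush(heap, (d + 1, nxt))
    return None

steps = dijkstra_steps(start=(1, 2, 3), goal=(2, 4, 5))
```

With unit edge costs the shortest path length equals the Manhattan distance between the triples, but Dijkstra still has to expand a large neighbourhood of nodes to prove it, which is why the node-exploration count dwarfs the step count of the learned agents.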
The simulation results for various k values are shown in Figure 4. The k value is the parameter of the k-nearest neighbor algorithm for the lookup operation of the proposed NEC scheme. For instance, if k is set to 10, the agent calculates the Q-function value based on the 10 nearest elements of the DND for each action candidate. In the simulation results of the NEC models, all cases yield a rapid improvement in the spectral utilization rate in the range of lower numbers of episodes. It is observed that as the k value increases, the number of episodes required to reach the suboptimal state decreases. However, significant degradation in convergence occurs with the model of k = 40. It can be interpreted that if k is set unnecessarily high, the NEC agent looks up irrelevant embeddings in the DNDs, which can be a primary factor in the degradation. In Figure 5, the simulation results are compared between the fixed step-size approaches and the step-size varying algorithm. It is clearly seen that the step-size varying algorithm yields the fastest convergence to the suboptimal state. The required number of episodes to reach the suboptimal state for the case of Δs = 0.025 dB is observed to be 47. In contrast, the step-size varying algorithm reaches the suboptimal state within 9 episodes. Besides, it is observed that the NEC model with the step-size varying algorithm requires less than 1/8 of the episodes of the DQN model in Figure 3. It can be interpreted that the proposed step-size varying algorithm efficiently approaches the suboptimal state and effectively reduces the number of trials of the agent. Moreover, the step-size varying algorithm can be expected to relieve the dilemma of selecting a proper step size: slow convergence with a small step size versus room left to the target BER with a large step size.
In Table 3, a comparison of the MCS boundaries provided by the target BER algorithm in [11, Figure 3] and the proposed scheme is shown. Because the target BER in [11] is set to 10^−4, approximations of the MCS region boundaries from their simulation results are used to compensate for the difference. It is confirmed that the proposed scheme improves the spectral utilization rate by 0.7916 bps/Hz, which can be interpreted as the proposed scheme allocating the MCS and adjusting the MCS region boundaries more properly than the scheme in [11]. Moreover, the region boundaries in distance of the proposed AMC scheme are derived in Table 4, where d denotes the distance between the satellites. The distances are calculated via (1). The carrier frequency is assumed to be 30 GHz [11]. From the simulation results, it is confirmed that the NEC algorithm achieves faster convergence to the suboptimal state than the DQN-based algorithms. The improvement in convergence time can be attributed to the following advantages of the NEC algorithm: 1) efficient integration of the previous experience of the agent through the DND structure, 2) a stabilized training process from employing the N-step Q-function, and 3) stable DNDs with slowly varying embedding vectors.

B. ONLINE LEARNING ANALYSIS
The simulation results on the online learning loss of the proposed AMC scheme and the other DQN-based AMC schemes are shown in Figures 6 and 7. Each simulation result includes the interaction records of each agent and the derived optimal MCS region boundaries. Moreover, due to the randomness of the sampled probability, the simulation results were averaged over 1,000 runs of the loss evaluation process. As seen in Figure 6, the proposed scheme reduces the online learning loss compared with the scheme in [20] by employing the advantageous properties of the NEC algorithm.
In Figure 7, the simulation results of the online learning loss evaluation are shown. It is observed that the NEC-based AMC scheme with the step-size varying algorithm reaches the suboptimal state at 1,900 unit time steps, with a cumulative online loss of 410.08 bps/Hz, outperforming the scheme in [20].
From the simulation results, it is observed that the maximum difference between the cases of p̄ = 0.1 and p̄ = 0.3 is 8.04 bps/Hz, which can be interpreted as the online learning loss being more dependent on the interaction records of the AMC agent than on the threshold probability. Moreover, it is confirmed that the proposed NEC agent with the step-size varying algorithm can effectively reduce the online learning loss compared with the DQN agent with the fixed step-size scheme in [20]. Thus, we expect that the proposed scheme will provide implementation advantages in an inter-satellite communication environment with limited energy and communication resources.

V. CONCLUSION
In this paper, we have proposed the NEC-based AMC scheme with the step-size varying algorithm. The proposed scheme adjusts the MCS region boundaries through the interactions of the NEC agent. The lookup and write operations of each DND enable the agent to make optimal decisions on actions by employing its previous experience. Compared with the work in [20], the proposed scheme adopts an advanced deep reinforcement learning algorithm for practical inter-satellite communication systems to significantly reduce the number of trials the agent needs to derive the optimal MCS region boundaries, while using a smaller volume of memory. Further, the proposed scheme effectively accelerates this process with the step-size varying algorithm. It was confirmed that the proposed scheme reduces the number of trials to less than 1/8 of that of the DQN-based adaptive modulation scheme in [20], and that it achieves the fastest convergence to the suboptimal state compared with the dueling DDQN, DDQN, and DQN algorithm-based schemes. Moreover, it was observed that the proposed AMC scheme yields a higher spectral utilization rate by assigning the proper MCS. We also employed the online learning loss evaluation algorithm, which can be viewed as a loss evaluation process in a practical environment, to calculate the spectral utilization rate loss during convergence to the suboptimal state. From the simulation results, it was confirmed that the online learning loss can be effectively reduced by employing the NEC-based AMC scheme with the step-size varying algorithm. Thus, the proposed scheme can be expected to have advantages in the resource-limited environment of satellite communications, considering the smaller number of trials of the agent, the smaller volume of memory, and the lower online loss.
In particular, the proposed scheme can be effective in 6G mobile communication systems and next-generation wireless communication systems, which are expected to form massive and dense networks where links change in real time and an optimal MCS region is required.
As future work, the proposed deep reinforcement learning framework can be extended to other domains of adaptive communication systems, such as adaptive transmit power control and spectrum resource allocation schemes. Moreover, to optimize the training process and to transfer the knowledge domain of the reinforcement learning agent, a transfer learning algorithm can be considered.