Reputation-Based Opportunistic Routing Protocol Using Q-Learning for MANET Attacked by Malicious Nodes

Irrespective of whether the environment is wired or wireless, routing is an important challenge in networks. Since mobile ad hoc networks (MANETs) are flexible and decentralized wireless networks, routing in them is particularly difficult. Furthermore, malicious nodes in a MANET can damage the routing performance of the network. Recently, reinforcement learning has been proposed to address these problems. As a reinforcement learning algorithm, the Q-learning mechanism is suitable for an opportunistic routing approach because it not only adapts to changing networks but also mitigates the effect of malicious nodes on packet transmission. In this study, we propose a new reinforcement learning routing protocol for MANETs called reputation opportunistic routing based on Q-learning (RORQ). Using this protocol, which is based on game theory, a reputation system can detect and exclude malicious nodes in a network for efficient routing. Thus, our method can find a routing path more effectively in an environment attacked by malicious nodes. The simulation results showed that the proposed method achieved superior routing performance compared with other state-of-the-art routing protocols. Concretely, the proposed method demonstrated significant performance gains over the other algorithms in terms of packet loss, average end-to-end delay, and energy efficiency in both the blackhole and gray hole attack scenarios.


I. INTRODUCTION
A mobile ad hoc network (MANET) comprises various types of independent mobile nodes, such as tablets, smart watches, and smartphones. Each of these nodes displays both dynamic and decentralized characteristics. Since MANETs can communicate without an infrastructure network, they are useful in emergency situations, such as natural disasters or warfare. Fast and accurate routing in such situations has been considered deeply important because information must be received quickly when sending a rescue signal or searching for the enemy in a hostile area [1].
However, path disconnection occurs frequently in MANETs owing to the mobility of each node; this also causes the network structure to change frequently, which makes it difficult to guarantee a good quality of service (QoS) in terms of path reliability or network lifespan [2]. Owing to this complexity, finding the optimal path that guarantees a good QoS in a MANET is known to be an NP-complete problem [3]. Moreover, traditional routing protocols for MANETs, such as the Ad hoc On-Demand Distance Vector (AODV) [4] and Dynamic Source Routing (DSR) [5] protocols, first determine the optimal path and then start transmitting packets. Owing to this additional calculation, energy consumption increases in mobile network environments wherein the power supply is limited [6].
To overcome this drawback, opportunistic routing has been proposed as a routing method for MANETs. Unlike the traditional routing protocols for MANETs, opportunistic routing considers shared wireless media as an opportunity. The key idea behind opportunistic routing is to effectively use broadcasts. Instead of preselecting a specified routing path, opportunistic routing broadcasts data packets to multiple neighbors that later form the set of candidate relays. The actual packets are then forwarded to the final destination.
Opportunistic routing exploits the reception of the same packet at multiple nodes to improve network performance, significantly reducing the number of packet retransmissions caused by link failures. Clearly, opportunistic routing can be applied to all types of wireless multihop networks. In addition, compared with traditional MANET routing protocols, opportunistic routing has a lower transmission cost. Therefore, it is used for emergency evacuation and recovery, nature surveys in rural areas, communication in vehicular networks, flying ad hoc networks (FANETs), and underwater sensor networks (UWSNs) [7], [8], [9], [10].
Along with these advantages, opportunistic routing has a primary disadvantage: the neighboring nodes consume too much buffer memory to transmit a single packet. Frequent message transmission quickly saturates the buffer, thereby preventing other messages from being sent. This is a fatal disadvantage for mobile terminals that do not have sufficient memory space.
To solve this problem, opportunistic routing algorithms are required to reduce the buffer burden of each node. Extremely Opportunistic Routing (ExOR) [11] is considered the most typical opportunistic routing algorithm. It allows transmission to all neighboring nodes; thus, they can all participate in forwarding the packet. ExOR uses a scheduler to allow only one forwarder to transmit packets at a time. However, if the number of forwarders participating in routing increases, all the forwarders, except the currently transmitting node, need to wait until their turn arrives. As the number of forwarders increases, the waiting time increases. Therefore, ExOR is not suitable for large networks because using it can degrade the performance.
Another problem to overcome in MANET routing is that various obstacles may be encountered during the routing process. A typical example is the blackhole attack [12], wherein a malicious node included in a routing path intentionally drops packets. The network needs to detect these malicious nodes and operate based on reliability. In a variant attack known as the gray hole attack [13], a node initially participates in routing like a regular node but transforms into a malicious node over time. This means that a routing protocol for a MANET should detect not only consistently malicious nodes, but also adaptively malicious ones.
Thus, in this paper, to route efficiently in a situation wherein nodes can be attacked, we propose a new opportunistic routing method called reputation opportunistic routing based on Q-learning (RORQ). RORQ does not broadcast a packet to all neighboring nodes but forwards it only to some trusted nodes, thereby reducing the memory wastage that is a problem in traditional opportunistic routing methods. The proposed routing method adapts to a flexible mobile environment using reinforcement learning. Additionally, by managing the trust of each node using an incentive technique, every node is induced to forward packets for other nodes, and malicious nodes are naturally isolated.
The main contributions of this paper are as follows: (1) We developed a routing algorithm that performs Q-learning based on the feedback received from the forwarding result of each node as the packet travels to its destination. Thus, efficient routing is possible in a MANET without any prior knowledge of the entire network structure. Additionally, because the algorithm considers the residual energy of each node, the lifetime of the entire network is expected to be extended even in a network containing malicious nodes. (2) The trust of nodes is managed through an incentive technique based on game theory. The trust of a node is judged by its reputation, a metric used in this study; for malicious nodes, this value is naturally lowered. The proposed method can reduce the packet loss rate by excluding malicious nodes whose reputation falls below a certain value. (3) We conducted various simulations to compare our proposed method with both a traditional routing algorithm and other state-of-the-art routing methods. We assumed the presence of blackhole and gray hole attacks and compared the performance of all methods using various indicators; our method delivered superior performance under both types of attacks.
The remainder of this paper is organized as follows. Related work regarding this study is introduced in Section II. In Section III, we explain our method, including the reputation system and how reputation is calculated during the routing process; and then, we thoroughly describe the working of our proposed routing algorithm based on reputation and how to select relay nodes. In Section IV, we evaluate the performance of the proposed method and compare it with the performance of other existing Q-learning-based routing protocols. Finally, in Section V, we conclude the paper.

II. RELATED WORK
Here, we briefly review opportunistic routing methods that use reinforcement learning, game theory, and comprehensive methods. These techniques are related to the proposed scheme. A summary of the related works is presented in Table 1.

A. REINFORCEMENT LEARNING-BASED OPPORTUNISTIC ROUTING
Opportunistic routing methods that use reinforcement learning have been extensively studied. First, Kexin et al. [14] proposed a reinforcement-learning-based opportunistic routing method (RLORa) for live video streaming. They proposed a new path-cost metric called the expected anypath delay (EAD) to estimate the end-to-end delay more accurately. EAD includes the average waiting time of a packet in a node queue, the time it takes for a packet to be successfully transmitted to at least one node in the candidate forwarder set, and the total time it takes for a packet to travel from the candidate forwarder set to its destination. These authors performed numerical simulations showing that the proposed method balanced network traffic, guaranteeing a better average viewing quality in multihop wireless networks. However, since the EAD metric assumes that node locations are fixed, it is not suitable for MANETs wherein nodes move frequently.
Zhu et al. [15] designed a reinforcement-learning-based opportunistic routing protocol (ROEVA) for underwater acoustic sensor networks. ROEVA solves the void-routing problem by using a new method called two-hop availability checking to detect void nodes. This method filters void nodes by checking whether there is a node closer to the destination before transmitting the candidate forwarder node set. The simulation results demonstrated that the retransmission rate could be reduced because the packet transfer rate was higher than that under other existing methods. However, ROEVA does not consider situations wherein nodes can move.
Zhang et al. [16] introduced a relay-node selection method using Q-learning in an underwater sensor network environment. They proposed a new reinforcement-learning-based opportunistic routing protocol (RLORb) that reduces the possibility of selecting a void node as a relay by entering a recovery mode and finding a bypass path when a void node is encountered. Extensive simulations showed that RLORb was superior to other methods in various aspects. Although the performance of RLORb was excellent, similar to other methods, it did not consider situations involving malicious nodes.
Bhorkar et al. [17] proposed a reinforcement-learning-based routing scheme called the distributed adaptive opportunistic routing algorithm (d-AdaptOR). d-AdaptOR finds the best relay node by using a probabilistic model called the estimated best score (EBS), which is based on the ACKs received for transmitted packets. d-AdaptOR is guaranteed to find the optimal routing path without any knowledge of the network, or even with incorrect knowledge. However, this algorithm is not suitable for the environment assumed in this paper because it cannot distinguish whether an ACK is lost due to network instability or a malicious node.
Additionally, many similar studies [18], [19] have been conducted, but these studies all assume that the nodes participating in the routing process are altruistic, without considering the problem that malicious nodes may exist in the network [20]. Therefore, in this study, we propose a new opportunistic routing algorithm that can reduce packet loss in a non-ideal network environment that includes malicious nodes.

B. GAME THEORY-BASED OPPORTUNISTIC ROUTING
The opportunistic routing methods discussed thus far are effective when all the nodes are cooperative, but it cannot be guaranteed that all the nodes in a network will be altruistic. Mobile terminals consume energy from a limited power supply to forward packets; hence, every mobile terminal has an incentive to minimize its energy use. Routing will not work properly if all the nodes in the network behave in this manner; game theory is widely used to solve this problem [21].
Zhang et al. [22] proposed an auction incentive mechanism (AIM) to encourage cooperation between nodes for opportunistic routing. The AIM uses an auction game based on game theory, wherein the source node pays incentives to the receiving node. When forwarding, the source node solicits bids from its neighboring nodes, and the winning bidder becomes the next forwarder. These authors provided a Bayesian Nash equilibrium solution that enables all the nodes to obtain the maximum benefit. Through simulations, they found that the AIM could also reduce the energy consumption of the entire network. However, although the AIM could make selfish nodes participate in forwarding, it was not designed for networks with malicious nodes that intentionally drop packets.
Wu et al. [23] presented COMO, the first cooperation-optimal protocol for multirate opportunistic routing and forwarding. COMO manipulates the input/output metrics of each node using an incentive technique, ensuring that each node maximizes its payoff by following the COMO routing protocol. Additionally, COMO guarantees the faithfulness of each player and maximizes the end-to-end throughput. COMO operates effectively under the assumption that all nodes are rational but cannot handle malicious nodes that behave irrationally. This is because irrational malicious nodes do not care about their own utility; their intention is mostly to degrade the system performance [24].
Zhong et al. [25] studied energy and trust-aware opportunistic routing (ETOR) in the cognitive radio social Internet of Things (CR-SIoT). They defined a forwarding candidate set using a new routing metric that included trust and energy efficiency. ETOR was designed based on this routing metric and interference factor. It was demonstrated that the packet transmission rate and network lifespan increased compared to those under existing methods when misbehaving nodes were included. However, ETOR could not deal with malicious nodes that changed their pattern of behavior, such as nodes that carried out a gray hole attack.
Zhong et al. [26] introduced traffic-differentiated secure opportunistic routing (DSOR) from a game-theoretic perspective. DSOR considers the trust value of nodes, available resources, and the condition of each flow. According to these metrics, DSOR selects the forwarder using an auction system to guarantee QoS. DSOR has the advantage of being able to cope with blackhole attacks by using a trust system but has the disadvantage of causing delays in the auction process, such as bidding delays in large-scale networks.
Su et al. [27] suggested a trust-model-based opportunistic routing (BTOR). BTOR filters forwarding nodes based on their behavior to combat malicious nodes; it calculates the trust value as the weighted sum of direct and indirect trust. Like DSOR, trust-based routing methods such as BTOR can effectively cope with blackhole attacks but find it difficult to respond to irregular attacks such as gray hole attacks.
As mentioned above, some existing opportunistic routing protocols offer novel ideas for coping with malicious nodes based on game theory. These methods can reduce packet loss and energy consumption to some extent. However, they operate effectively only when certain conditions hold, such as all nodes, including malicious ones, being rational. Additionally, these methods are vulnerable to changes in the network. In this study, we propose a new opportunistic routing method to overcome these shortcomings.

C. COMPREHENSIVE ROUTING PROTOCOLS
With the development of routing protocols, many studies have been conducted on routing methods that combine various technologies [31], [32], [33]. Since these integration research studies are closely related to our method, they will be introduced in detail here.
Rovira-Sugranes et al. [28] suggested a simulated-annealing-based fully-echoed Q-routing (SAQ) protocol for flying ad hoc networks (FANETs). Simulated annealing is a meta-heuristic method used to solve various real-world problems based on thermodynamic principles [34]. SAQ uses the temperature parameter T to reflect the mobility of nodes. It has the advantage of fast convergence for networks with high node speeds, such as FANETs. In addition, SAQ can avoid routing loops by ensuring that the same node is not included in the routing path more than once. However, since SAQ has no provision for congestion control, a routing hole problem may occur, in which a specific node consumes more energy and dies quickly.
Chen et al. [29] presented a Q-learning-based multihop cooperative routing protocol (QMCR) for underwater acoustic sensor networks (UASNs). QMCR can reduce energy consumption when forwarding packets to relay nodes through cooperative communication, a method in which one of the nodes between the transmitting and receiving nodes is selected to amplify the signal sent by the transmitter. Moreover, since QMCR is based on Q-learning, it can be applied to dynamic network environments such as MANETs. However, since QMCR assumes that all neighboring nodes cooperate in order to achieve the best performance, ideal performance cannot be expected in an environment containing malicious nodes.
Li et al. [30] studied a probability-prediction-based opportunistic routing algorithm (PRO) for vehicular ad hoc networks (VANETs). PRO predicts the signal-to-interference-plus-noise ratio (SINR) and packet queue length (PQL) of neighboring nodes. By calculating a utility function from these two parameters, PRO selects the best forwarding node. PRO showed excellent routing performance, in terms of packet delivery ratio and end-to-end delay, in a fast-moving VANET environment; however, it is difficult to apply to MANETs because it does not consider the residual energy of the nodes. In addition, PRO has no countermeasure against malicious nodes either.
Although these studies have produced noteworthy results, they are difficult to apply to our hypothesized environment owing to various obstacles. In the next section, we clearly define the network environment and propose a new routing protocol suitable for it.

III. REPUTATION OPPORTUNISTIC ROUTING BASED ON Q-LEARNING
In this section, we describe the proposed method in six parts: the network model, energy model, reputation model, selection of the candidate forwarding set, selection of the relay node using reinforcement learning, and packet model.

A. NETWORK MODEL
First, we review the environment and mobile network-based terminology used in this study. As shown in Fig. 1, we have assumed that the mobile network structure can have both normal and malicious nodes.
A MANET is defined as an undirected graph G = (N, E), where N is the set of nodes included in the network, and E is the set of edges. In this graph, each node represents a mobile terminal and each edge indicates a wireless link. If two nodes u and v share an edge e_t = (u, v) ∈ E, then u and v can communicate directly with each other at time t. Each node can indicate its location through a pair of X and Y coordinates. Additionally, it is assumed that each node can determine its location using a global positioning system (GPS) [35]. Further, we assume that all nodes broadcast a hello packet at each time t and identify neighboring nodes through the resulting feedback. Therefore, each node knows which neighboring nodes it can communicate with. Formally, the set of nodes with which node u can communicate at time step t is defined as neighbor_t(u). In the opportunistic routing process, a source node first broadcasts a packet to its neighboring nodes. One of the nodes that receives this packet is selected as the relay node, and the process is repeated until the packet reaches the destination. In Fig. 1, the relay nodes are represented by green nodes.
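To make the network model concrete, the following minimal Python sketch derives neighbor_t(u) from node coordinates; the Node class, the COMM_RANGE value, and the function names are our own illustrative assumptions rather than part of the protocol specification.

```python
import math

COMM_RANGE = 250.0  # assumed communication radius in meters (illustrative)

class Node:
    """Illustrative node with an identifier and GPS-derived coordinates."""
    def __init__(self, node_id, x, y):
        self.node_id = node_id
        self.x = x
        self.y = y

def distance(u, v):
    """Euclidean distance between two nodes."""
    return math.hypot(u.x - v.x, u.y - v.y)

def neighbors(u, all_nodes):
    """neighbor_t(u): the nodes currently within communication range of u."""
    return [v for v in all_nodes if v is not u and distance(u, v) <= COMM_RANGE]
```

In a simulation, each node would recompute this set every time step as nodes move, which corresponds to the hello-packet exchange described above.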
As mentioned previously, we assumed that the network could contain malicious nodes. In Fig. 1, malicious nodes are shown in red. These nodes deliberately disrupt the network; their typical behavior is to drop packets. Sending a packet to a malicious node wastes energy and reduces the lifespan of the network. Therefore, other nodes should send as few packets as possible to a node suspected of being malicious. In the following sections, we propose a method to detect malicious nodes.

B. ENERGY MODEL
In a MANET, it is important to find a routing path that consumes the energy of the wireless nodes efficiently because the total energy of these nodes determines the lifespan of the network. In this section, we present the energy models used in our energy management method.
In data transmission between nodes, the transmission power increases as the distance between the nodes increases. We calculated this energy using the Friis free-space equation [36]:

$$P_r = P_t \left( \frac{\lambda}{4 \pi d} \right)^2, \tag{1}$$

where P_t and P_r are the transmitted and received powers in W, respectively; d is the distance between the transmitter and receiver; and λ is the wavelength of the radio frequency. According to (1), the energy used by a node to transmit data to another node is proportional to the square of the distance between the two nodes. Additionally, because the available transmission energy for each node is different, this energy needs to be normalized. In this case, the normalized energy consumption C_uv can be expressed as

$$C_{uv} = \frac{P_{t_{\min}}(u, v)}{P_{\max}(u)}, \tag{2}$$

where P_max(u) represents the maximum transmission power of node u, and P_tmin(u, v) represents the minimum transmission power from node u to node v. Every node consumes energy when transmitting data, as shown in (2). Therefore, the amount of residual energy in all the nodes decreases over time. Because the battery capacity of each node may be different, it is necessary to normalize the battery capacity. We define the normalized battery capacity as the energy level. The energy level of node u can be calculated as

$$e_{level}(u) = \frac{e_{res}(u)}{e_{init}(u)}, \tag{3}$$

where e_res(u) and e_init(u) denote the residual and initial energy of node u, respectively.
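For illustration, a minimal Python sketch of this energy model follows, implementing (1)-(3) under the assumption of unity antenna gains in the Friis model; the function names are ours.

```python
import math

def friis_received_power(p_t, d, wavelength):
    """Eq. (1): received power under the free-space (Friis) model,
    assuming unity antenna gains."""
    return p_t * (wavelength / (4 * math.pi * d)) ** 2

def required_tx_power(p_r_min, d, wavelength):
    """Invert (1): the minimum transmit power for the receiver to see
    p_r_min; it grows with the square of the distance d."""
    return p_r_min * (4 * math.pi * d / wavelength) ** 2

def normalized_consumption(p_tmin_uv, p_max_u):
    """Eq. (2): normalized energy consumption C_uv."""
    return p_tmin_uv / p_max_u

def energy_level(e_res, e_init):
    """Eq. (3): normalized residual battery capacity of a node."""
    return e_res / e_init
```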

C. REPUTATION MODEL
As mentioned in Section III-B, when a mobile terminal forwards data received from another mobile terminal, it consumes energy; hence, every node tries to consume as little energy as possible. Accordingly, in an attempt to save energy, a node will not necessarily forward packets received from other nodes. Malicious nodes that intentionally drop packets to degrade the performance of the network may even exist.
To solve this problem, we created a method to punish selfish and malicious nodes using a reputation system based on an incentive mechanism. In this system, the reputation of all nodes starts with the initial value Rep_init. For any node u ∈ N that forwards a packet successfully, the reputation of node u increases; otherwise, its reputation decreases.
Additionally, the higher the reputation of a node, the more willing the surrounding nodes are to forward its messages, creating a high probability that its messages will arrive safely at the destination. Conversely, the lower the reputation of a node, the lower the probability that its messages will be delivered and thus reach the destination. In this case, a node with a low reputation needs to improve its reputation by eagerly forwarding the messages of other nodes.
The reputation of a node u is denoted Rep(u), and it is updated iteratively [37]:

$$Rep(u) = \alpha \, Rep_{old}(u) + (1 - \alpha) \, Rep_{new}(u), \tag{4}$$

where Rep_old(u) denotes the old reputation value of node u, Rep_new(u) is the reputation that node u receives based on the results of forwarding, and α ∈ [0, 1] is the weight between the old and new reputations. When node u succeeds in forwarding to node v, Rep_new for node u is calculated as

$$Rep_{new}(u) = \beta \, \frac{D(u, v)}{D(u, v) + D(dest, v)} + (1 - \beta) \, \frac{TTL - T_{uv}}{TTL}, \tag{5}$$

where D(u, v) is the Euclidean distance between nodes u and v, D(dest, v) is the Euclidean distance between the destination and node v, TTL is the time-to-live value of the packet, and T_uv is the time required to transmit a packet from node u to node v. In (5), the first term indicates how close node v, to which node u transmits the packet, is to the destination, and the second term indicates how fast node u transmits the packet; β ∈ [0, 1] is the weight of these two terms. If node u fails to forward, then Rep_new is 0; therefore, the reputation of node u is reduced through the update in (4).
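A short Python sketch of this reputation bookkeeping follows, mirroring the forms of (4) and (5) as reconstructed above; a failed forward is handled by passing rep_new = 0 into the update. The function names are illustrative.

```python
def update_reputation(rep_old, rep_new, alpha):
    """Eq. (4): exponentially weighted reputation update; pass
    rep_new = 0 when the node failed to forward the packet."""
    return alpha * rep_old + (1 - alpha) * rep_new

def forwarding_reputation(d_uv, d_dest_v, ttl, t_uv, beta):
    """Eq. (5): reputation earned for a successful forward, combining
    geographic progress toward the destination with transmission speed."""
    proximity = d_uv / (d_uv + d_dest_v)
    speed = (ttl - t_uv) / ttl
    return beta * proximity + (1 - beta) * speed
```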

D. SELECTION OF CANDIDATE FORWARDING SET
Next, we focus on the selection of the forwarding set. In Fig. 1, the candidate forwarding set for node u at time step t can be viewed as neighbor_t(u). Among these nodes, some that are not suitable for forwarding, such as isolated nodes and malicious nodes, may be included.
In this study, we filtered out unsuitable neighbors using the reputation model presented in the previous section. The set of neighbors of node u remaining after filtering is called CNODE_t(u), and the algorithm for selecting this set is as follows.

Algorithm 1 Selection of Candidate Forwarding Set by Filtering Neighbor Set
1: Input: neighbor_t(u), Rep_threshold
2: Output: CNODE_t(u)
3: CNODE_t(u) ← ∅
4: for each node v ∈ neighbor_t(u) do
5:     if Rep(v) ≥ Rep_threshold then
6:         CNODE_t(u) ← CNODE_t(u) ∪ {v}
7:     end if
8: end for
9: return CNODE_t(u)

In Algorithm 1, Rep_threshold denotes the reputation threshold. If CNODE_t(u) is empty, node u does not forward at time t.
In the worst case for Algorithm 1, all nodes are densely packed within communication range of one another. In this case, the set neighbor_t(u) becomes N − {u}, where N is the set of all nodes. Therefore, the time complexity of Algorithm 1 is O(N). The space complexity of Algorithm 1 is also O(N), reached when CNODE_t(u) = N − {u}.
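A minimal Python rendering of Algorithm 1 makes the O(N) filtering explicit; the threshold value and the names here are illustrative assumptions.

```python
REP_THRESHOLD = 0.3  # Rep_threshold; illustrative value, not taken from the paper

def candidate_forwarding_set(neighbor_t_u, reputation):
    """Algorithm 1: keep only the neighbors whose reputation meets the
    threshold. O(N) time and space when neighbor_t(u) = N - {u}."""
    return {v for v in neighbor_t_u if reputation[v] >= REP_THRESHOLD}
```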

E. SELECTION OF RELAY NODE BY Q-LEARNING
The proposed routing scheme is thoroughly explained in this section. We exploited reinforcement learning to select relay nodes to enhance QoS-based routing. Reinforcement learning is a type of machine learning that learns the best behavior in each situation through trial and error in a specific environment [38]. As shown in Fig. 2, for discrete times t = 0, 1, 2, . . . , the agent recognizes the current state s_t ∈ S from the given environment and selects an action a_t ∈ A to obtain a reward R_{t+1} ∈ R and the next state s_{t+1}; S is a set of states, and A is a set of actions.
The probability of choosing action a_t in a certain state s_t is called a policy, and it is defined as π(s_t, a_t). The goal of reinforcement learning is to determine the best policy for obtaining the maximum reward in the long run. The long-run payoff is defined as the sum of the discounted rewards R, which is expressed by (6):

$$R = \sum_{t=0}^{\infty} \gamma^{t} R_{t+1}, \tag{6}$$

where 0 ≤ γ ≤ 1 is a discount factor, a parameter used to discount values farther away from the present. Several algorithms are available for determining the optimal policy π(s_t, a_t) in reinforcement learning, of which the Q-learning algorithm is considered representative [39]. Q-learning guarantees convergence to the optimal policy through iterative calculations, as shown in (7):

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \eta \left[ R_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]. \tag{7}$$

However, applying Q-learning to routing differs slightly. Typically, the Q-routing method suggested by Boyan et al. [40] modifies Q-learning to fit the routing problem. In Q-routing, the Q-value is defined as Q_x(d, y), the Q-value when node x, forwarding a packet toward destination d, forwards it to node y. After node x sends a packet to node y, it receives the remaining transmission time t expected by node y as follows:

$$t = \min_{z \in neighbor(y)} Q_y(d, z), \tag{8}$$

where neighbor(y) is the set of neighboring nodes of node y. As mentioned in the previous section, we consider that neighbor(y) = CNODE_t(y). When node x receives this value from node y, node x updates its Q-table as follows:

$$Q_x(d, y) \leftarrow Q_x(d, y) + \eta \left( q + s + t - Q_x(d, y) \right), \tag{9}$$

where η ∈ [0, 1] is the learning rate of Q-routing, q is the waiting time until a packet is transmitted from the queue of node x, and s is the time it takes to send a packet from node x to node y. The update formula used in our proposed protocol modifies Q-routing to suit the environment assumed above, as follows:

$$Q_x(d, y) \leftarrow (1 - \eta) \, Q_x(d, y) + \eta \left( e_{level}(y) + Rep(y) + \frac{B_{idle}}{B_{max}} + \gamma \max_{z \in CNODE_t(y)} Q_y(d, z) \right), \tag{10}$$

where B_idle and B_max are the idle buffer space and maximum buffer space, respectively. The lower q, s, and t in (9) are, the better the link they represent; thus, the lowest Q-value represents the best action. Contrastingly, the higher e_level, Rep, and B_idle/B_max in (10) are, the better the relay; thus, the highest Q-value is the best action. The energy level and the remaining buffer are included when calculating the Q-value in (10) to prevent problems such as a routing hole, in which packets flock to a specific node.
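The following Python sketch illustrates one plausible realization of the update in (10) as reconstructed above; the tuple-keyed Q-table and the function signature are our own assumptions.

```python
def q_update(q_table, x, dest, y, e_level_y, rep_y, b_idle, b_max,
             t_feedback, eta, gamma):
    """One step of the modified Q-routing update sketched in (10): the
    immediate reward favors relays with a high energy level, high
    reputation, and free buffer space; t_feedback is the best downstream
    Q-value, max_z Q_y(dest, z), reported back by relay y."""
    reward = e_level_y + rep_y + b_idle / b_max
    old_q = q_table.get((x, dest, y), 0.0)
    q_table[(x, dest, y)] = (1 - eta) * old_q + eta * (reward + gamma * t_feedback)
    return q_table[(x, dest, y)]
```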
An action selection strategy is also important in reinforcement learning. A representative method for selecting an action during learning is ε-greedy [38]. This strategy selects the action with the highest Q-value with probability 1 − ε and selects a random action with probability ε. As time passes, the states and actions are repeatedly visited and the value of ε approaches 0; therefore, after a long time, the best action is selected based on previous experience. In our proposed scheme, we adopted the ε-greedy rule to select the next relay node. The ε value for node u at time t is adjusted as follows:

$$\varepsilon_t(u) = \frac{c}{c + n_t(u)}, \tag{11}$$

where n_t(u) is the number of times node u has been visited, and c is an arbitrary constant. Initially, there is no experience with node u; therefore, the probability that the agent will choose a random action is high. The more node u is visited, the higher the probability of choosing the action with the maximum Q-value.
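A minimal Python sketch of this visit-count-based ε-greedy selection over the candidate forwarding set follows; the names are illustrative.

```python
import random

def epsilon(visits_u, c):
    """Eq. (11): the exploration rate decays as node u is visited more often."""
    return c / (c + visits_u)

def choose_relay(candidates, q_of, visits_u, c):
    """Epsilon-greedy relay selection: explore a random candidate with
    probability epsilon, otherwise pick the candidate with the highest
    Q-value (q_of maps a candidate node to its Q-value)."""
    if random.random() < epsilon(visits_u, c):
        return random.choice(list(candidates))
    return max(candidates, key=q_of)
```

The entire pseudocode of the proposed algorithm is as follows.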
Algorithm 2 Reputation Opportunistic Routing Based on Q-Learning (RORQ)
1: Input: Network graph G = (N, E)
2: Output: Optimal routing policy π
3: Q ← initialize Q-table for all s ∈ S, a ∈ A(s)
4: π ← initialize arbitrary policy
5: loop for each packet do
6:     Get the coordinates of the source node and destination node
7:     while packet has not reached destination do
8:         Calculate CNODE using Algorithm 1
9:         Send packet to CNODE
10:        Choose action a for relay node using (11)
11:        Update Q-table using (10)
12:        Update reputation using (4), (5)
13:    end while
14: end loop

The worst case in terms of space complexity occurs when all nodes can communicate with each other, as in Algorithm 1, when they are densely packed. In this case, since each node must hold Q-values and reputation values for all other nodes, the space complexity is O(N²).

F. PACKET MODEL
This section details the structure of packets used in our routing protocol. The packet structure of RORQ is illustrated in Fig. 3.
In Fig. 3, each row occupies 48 bits, except the data field. The first row includes the packet ID and TTL. The packet ID is the packet identifier and occupies 32 bits; it is obtained using the 32-bit Cyclic Redundancy Check (CRC-32). TTL refers to the time-to-live of the packet and occupies 16 bits. The second and third rows contain the MAC addresses of the source and destination nodes, respectively. The remaining rows, except the data field, contain the MAC address of each CNODE member of the current node.
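As a rough illustration, this header layout could be serialized as follows in Python; the function name and byte ordering are assumptions, and the variable-length data field is omitted.

```python
import struct
import zlib

def build_header(payload, ttl, src_mac, dst_mac, cnode_macs):
    """Sketch of the Fig. 3 header layout: a 48-bit first row holding the
    CRC-32 packet ID (32 bits) and the TTL (16 bits), followed by one
    48-bit (6-byte) MAC address row each for the source, the destination,
    and every CNODE member."""
    packet_id = zlib.crc32(payload)                 # 32-bit packet identifier
    first_row = struct.pack('>IH', packet_id, ttl)  # 4 + 2 bytes = 48 bits
    rows = [first_row, src_mac, dst_mac] + list(cnode_macs)
    return b''.join(rows)

# Example: build_header(b'data', 64, b'\x00' * 6, b'\xff' * 6, [b'\x01' * 6])
```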

IV. PERFORMANCE EVALUATION
In this section, we discuss the evaluation of the proposed scheme through extensive simulations. The simulation results are presented here to validate the benefits of our approach using various metrics, such as the packet loss rate, end-to-end delay, and energy level. The packet loss rate is defined as the ratio of the number of lost packets to the number of packets sent by the source nodes. The end-to-end delay is the time a packet takes to travel from its source to its destination. The energy level is given by (3), as discussed in Section III-B. For comparison, we used the traditional Q-routing method [40]. In addition, we selected trust-based opportunistic routing (BTOR) [27] and Q-learning-based routing protocols, namely SAQ [28] and Q-learning-based multihop cooperative routing (QMCR) [29], which were recently published with state-of-the-art results.

A. SIMULATION CONFIGURATIONS
The simulation network area was a square of 1000 × 1000 m. Each node was randomly placed in the given area and moved every second in a random direction at a speed of 0-10 m/s. The battery capacity of each node was the same, and the initial battery level was 100%. Additionally, the battery power of each node was consumed without recharging. Finally, we assumed that packet loss occurred only because of the presence of a malicious node or when the time-to-live was exceeded. The other key parameters of the simulation are summarized in Table 2.
We used Python to simulate the performance of our routing algorithm, Q-routing, BTOR, SAQ, and QMCR. We chose Python because it has libraries such as NumPy, SciPy, and Pandas that were useful for solving complex problems such as reinforcement learning.

B. SIMULATION RESULTS
In this simulation, we tested the two types of attacks discussed in Section I: blackhole and gray hole attacks. Accordingly, we assumed the following two scenarios and performed simulations for each. a) Scenario 1 (blackhole attack): malicious nodes involved in the routing path drop all passing data packets with probability 1. b) Scenario 2 (gray hole attack): malicious nodes behave like normal nodes in the initial stage; however, after some time, they drop all passing data packets with probability 1.

First, we considered Scenario 1, wherein the network was subjected to a blackhole attack. The results are presented below. Fig. 4 shows the packet loss rate versus the number of packets sent in Scenario 1. QMCR is the most vulnerable to the blackhole attack because it is designed to perform optimally when all nodes are cooperative. BTOR initially has the highest packet loss rate, but this rate rapidly decreases because BTOR identifies malicious nodes using its trust system. The traditional Q-routing method reduces the packet loss rate over time, but the rate does not fall below approximately 0.4 because Q-routing implements no provision for malicious nodes. By contrast, the packet loss rate under SAQ increased initially but gradually stabilized as learning progressed. Our method initially had a higher packet loss rate than SAQ because the reputation learning in our method was incomplete in the early stage. However, once the reputation learning was sufficient, the packet loss rate decreased rapidly; overall, the proposed method had the lowest packet loss rate.

Fig. 5 shows the average end-to-end delay versus the number of packets sent in Scenario 1. Under all methods except SAQ, the average end-to-end delay gradually decreased as learning progressed. In particular, the average end-to-end delay under BTOR dropped quickly thanks to its trust system. However, when a malicious node intentionally dropped a packet, the traditional Q-routing, SAQ, and QMCR routing algorithms simply assumed that the packet was lost and continued sending packets to the malicious node. In contrast, our method reduced the reputation of nodes suspected of being malicious and no longer considered selecting them as relay nodes. Therefore, the average end-to-end delay under the other four algorithms was higher than that under our proposed method.

Fig. 6 illustrates the energy level versus the number of packets sent in Scenario 1. When a packet was lost during transmission, the rate of packet retransmission increased, resulting in higher power consumption. The proposed method reduced unnecessary power consumption by excluding nodes suspected of being malicious from the forwarding targets. As a result, the energy level under the proposed method remained the highest, and our algorithm outperformed the other algorithms.
Next, we performed the same experiments for Scenario 2. In this simulation, we assumed that malicious nodes behaved like normal nodes until the network had sent 5000 packets (that is, for the first 5000 s) and then started attacking. The results are presented below.

Fig. 7 shows the packet loss rate versus the number of packets sent in Scenario 2. Before the number of packets reached 5000, all the nodes were cooperative in the routing process; therefore, there was no loss except for that due to the TTL timeout. However, once 5000 packets had been sent, the loss rate increased rapidly owing to the attacks by the malicious nodes. Because of their accumulated experience, the other methods took a long time to reflect the attacks in their Q-values: even when a malicious node dropped packets, they mistakenly judged this to be an accidental phenomenon. In the proposed method, however, if packet loss occurs continuously, the reputation of even a highly reputed node decreases rapidly. As a result, although the slope of the loss rate decreased over time under all methods, the proposed method had the lowest loss rate in response to the attack by malicious nodes because of our reputation system.

Fig. 8 shows the average end-to-end delay versus the number of packets sent in Scenario 2. As in Scenario 1, the initial end-to-end delay was slightly high owing to the exploration process. Before the number of sent packets reached 5000, there was no significant difference in the end-to-end delay among the algorithms. Since our routing algorithm is designed under the assumption that the network is under attack, its end-to-end delay during this period was higher than that of QMCR and SAQ. However, after 5000 packets had been sent, the end-to-end delay of all algorithms increased significantly because of the attack; this is related to the rapid increase in the packet loss rate shown in Fig. 7. Although the shape of the graph representing our method resembles those of the other four methods, our method had the lowest end-to-end delay. This was because we assumed that a node that continuously dropped packets was malicious. Therefore, the proposed method adaptively selected effective paths that bypassed the malicious nodes, alleviating the increase in end-to-end delay caused by the gray hole attack.

Fig. 9 shows the energy levels under the five algorithms versus the number of packets sent in Scenario 2. Until the attack occurred, our algorithm showed slightly higher power consumption than the SAQ and QMCR algorithms. In particular, QMCR showed the lowest power consumption because it is specialized in reducing power consumption through cooperative routing. However, after the malicious nodes started attacking, the number of lost packets increased significantly, increasing the power consumption under all methods. As discussed for Figs. 7 and 8, our routing protocol was less affected by the malicious nodes because it had a lower transmission failure rate and lower end-to-end delay than the other routing methods in Scenario 2. Therefore, fewer packet retransmissions occurred under the proposed method than under the other four methods, and the proposed method maintained the best energy level.

V. CONCLUSION
In this paper, we proposed a new opportunistic routing method to overcome the vulnerability of existing opportunistic routing in MANETs to malicious nodes. In our proposed method, the reputation of a node is calculated based on its forwarding results, and the candidate forwarding set is selected based on this reputation. This enables efficient routing in MANETs that contain malicious nodes under blackhole and gray hole attacks. In addition, the proposed method is suitable for a flexible MANET environment because it is based on Q-learning, a type of reinforcement learning. The experimental results showed that, for networks containing malicious nodes, the proposed method was more efficient than traditional Q-routing and state-of-the-art routing protocols such as BTOR, SAQ, and QMCR in terms of various metrics, including the packet loss ratio, average end-to-end delay, and energy level.
Our method yields significant results on two key obstacles in MANETs: a highly flexible network structure and routing disruption due to packet drop attacks. In future work, we aim to focus not only on more sophisticated topics such as energy harvesting and energy sharing in routing, but also on other attacks such as the wormhole attack. Moreover, we plan to study ways to apply our routing method not only to MANET environments but also to flying ad hoc networks (FANETs) and vehicular ad hoc networks (VANETs).