A Time-Slotted Data Gathering Medium Access Control Protocol Using Q-Learning for Underwater Acoustic Sensor Networks

Contention-based medium access control (MAC) protocols for underwater acoustic sensor networks are designed to handle packet collisions that are caused by long propagation delays. However, existing protocols are known to suffer from a relatively high collision rate, which decreases system performance. To enhance system performance, we propose a contention-based MAC protocol that employs a widely used machine learning technique, namely, Q-learning. Using Q-learning, the proposed protocol allows the sensor nodes to intelligently select the back-off slots and accordingly schedule the transmission of data packets such that collisions are minimized at the receiver. Unlike in existing protocols, the sensor nodes are not required to exchange scheduling information, which implies that the proposed protocol has low complexity and overhead. Under varying traffic loads and node numbers, the proposed protocol is compared with the state-of-the-art ALOHA-Q for underwater environments (UW-ALOHA-Q), multiple access collision avoidance for underwater (MACA-U), and exponential increase exponential decrease (EIED) protocols. Results demonstrate the effectiveness of the proposed protocol in terms of energy efficiency, channel utilization, and latency.


I. INTRODUCTION
With recent technological advances, it is now possible to explore and monitor the ocean through the application of underwater acoustic sensor networks (UWASNs) [1]. Generally, UWASNs contain a large number of battery-powered acoustic sensors [2] that execute collaborative tasks such as oceanographic environment monitoring, data collection, disaster prevention, and tactical surveillance [3]. Unlike terrestrial wireless sensors, underwater acoustic sensors use acoustic waves to communicate with each other; these propagate approximately five orders of magnitude more slowly than the radio frequency waves used by terrestrial wireless sensors [4]. Despite being slower, acoustic waves are preferable to radio waves because they are relatively robust to attenuation, whereas radio waves suffer from high attenuation in underwater environments [5]. However, due to the lower propagation speed of the underwater acoustic channel, some distinct features such as limited bandwidth, low channel capacity, high bit error rate, and high dynamics of channel quality are observed in underwater acoustic sensor networks [3].
The associate editor coordinating the review of this manuscript and approving it for publication was Miguel López-Benítez.
Medium access control (MAC) protocols are pertinent for UWASNs as they provide high system performance in an energy efficient manner by coordinating access to a shared medium [6]. The aforementioned characteristics of the underwater channel make direct application of terrestrial MAC protocols in UWASNs an inefficient process; thus, the process must be rethought [7]. Generally, the major task of both underwater and terrestrial MAC protocols is resolving collisions of data packets at the receiver. However, for underwater MAC protocols, resolving data packet collision at the receiver must include consideration of transmission time along with the distance between the senders and the receiver, while terrestrial MAC protocols require consideration of transmission time only as the propagation delay is negligible in the terrestrial domain [8].
Consider a small underwater network with one receiver and two senders, where sender-1 and sender-2 lie at different distances from the receiver. Suppose that each sender wants to transmit one data packet to the receiver and that both select the same sending time (Fig. 1(a)). Even though the sending times coincide, a collision can be avoided because the two data packets reach the receiver at different times. Conversely, suppose that sender-1 and sender-2 transmit at different sending times; the two data packets may nevertheless reach the receiver at the same time, in which case a collision occurs (Fig. 1(b)). This two-dimensional uncertainty is better known as space-time uncertainty [9].
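The two cases can be illustrated with a minimal sketch. The sound speed of 1500 m/s, the distances, the packet duration, and the function names are illustrative assumptions rather than values taken from this paper.

```python
SOUND_SPEED_M_S = 1500.0  # assumed nominal speed of sound in seawater


def arrival_time(send_time_s, distance_m, speed_m_s=SOUND_SPEED_M_S):
    """Time at which the start of a packet reaches the receiver."""
    return send_time_s + distance_m / speed_m_s


def overlaps(arrival_a_s, arrival_b_s, packet_duration_s):
    """True if two packets of equal duration overlap at the receiver."""
    return abs(arrival_a_s - arrival_b_s) < packet_duration_s


# Case (a): same sending time, different distances, so the arrivals differ.
a1 = arrival_time(0.0, 300.0)    # sender-1, 300 m away
a2 = arrival_time(0.0, 1500.0)   # sender-2, 1500 m away
same_send_collides = overlaps(a1, a2, packet_duration_s=0.5)

# Case (b): different sending times, but the arrivals coincide.
b1 = arrival_time(0.8, 300.0)
b2 = arrival_time(0.0, 1500.0)
diff_send_collides = overlaps(b1, b2, packet_duration_s=0.5)

print(same_send_collides, diff_send_collides)
```

Whether a collision occurs therefore depends on both the sending time and the sender-receiver distance, not on the sending time alone.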
Considering the dynamicity and complexity of underwater acoustic channels, applying artificial intelligence (AI) techniques has been found to be an effective solution. Hence, various works on UWASNs use AI techniques to solve a range of problems. For example, a combination of three AI techniques, i.e., ant colony optimization, artificial fish swarm, and dynamic coded operation, is designed to reduce energy consumption and increase robustness [10]. A boosted regression tree technique is used to classify the modulation and coding scheme levels by investigating the underwater channel characteristics [11]. A modulation selection method, which is based on a conventional neural network and a random forest technique to ensure reliable underwater acoustic communication in the time-varying underwater acoustic channel, is proposed in [12]. In [13], [14], and [16], Q-learning is used to determine the optimal routing path, to select adaptive underwater channels, and to schedule the transmission of data packets, respectively.
In the present study, we focused on reducing the collision probability at the receiver by designing a novel Q-learning-based MAC protocol for UWASNs. Q-learning is a popular model-free reinforcement learning technique that is based on the state-action pair value denoted $Q(s,a)$, where $s$ denotes a state and $a$ denotes an action. Briefly, Q-learning estimates the $Q(s,a)$ value without any prior knowledge of the system parameters of a given environment. The $Q(s,a)$ value for taking an action in a particular state is defined by a policy $\pi$, which determines how an agent behaves at a given time [15]. The agent is an entity that learns and makes decisions independently, while everything outside the agent is considered the environment [11]. For learning, the agent explores its environment by selecting a particular action in each state based on the $Q(s,a)$ values. Depending on the action performed, the agent receives a positive or negative reward. The ultimate goal is to determine the action that generates the maximum reward.
Q-learning, if applied to underwater acoustic sensor networks, may increase the performance of these networks by reducing collisions at the receiver. Additionally, as Q-learning works on a trial-and-error basis, it does not require signaling information exchange to schedule the transmission of data packets and its algorithm has a relatively low computational complexity.
The main contributions of the current study are as follows.
• An underwater Q-learning-based MAC protocol, which has the major advantage of low-signaling overhead and low complexity, is proposed to optimize the selection of back-off slots while offering low packet collisions.
• A newly designed reward function is introduced that is based on the outcome of transmitted data packets, i.e., success or collision. Moreover, for the first time, the normalized received power level, which is not considered in the existing contention-based MAC protocols, is used to determine the number of collisions.
• Through a system-level simulator, the energy efficiency, latency, and channel utilization of the proposed protocol are investigated under varying traffic loads and numbers of nodes.
The remainder of this paper is organized as follows. Section II provides a discussion on the related works. In Section III, the assumptions and conditions are presented, whereas in Section IV the proposed scheme is introduced. Numerical results are presented in Section V and conclusions are drawn in Section VI.
VOLUME 9, 2021

II. RELATED WORKS
Previous studies have shown that the two general classes of MAC protocol, contention-based and contention-free MAC protocols, attempt to complete a common task, namely, collision minimization at the receiver [16]. Contention-based underwater MAC protocols can use full underwater channel bandwidth [17]; thus, most efforts to design underwater MAC protocols have focused on the contention-based class.
Handshaking and random-access MAC are the two classifications of contention-based MAC protocols. In handshaking-based MAC protocols, sender and receiver nodes exchange small control packets before data packet transmission to reserve the channel and avoid collisions. In random access-based MAC protocols, collision avoidance is a probabilistic approximation [6] because the senders' data transmissions have no prior coordination. Hence, handshaking-based MAC protocols outperform random access-based protocols in terms of collision avoidance, but they are less efficient in terms of energy and latency.
Fang et al. proposed the carrier sense multiple access with collision avoidance for underwater (UW-CSMA/CA) [18], an asynchronous underwater MAC protocol in which a combination of carrier sensing and handshaking is applied to reduce collision probability. The main principle of the UW-CSMA/CA algorithm is similar to that of the original CSMA/CA algorithm but with enhancements in the state transition rules, transmission deferment rules, and waiting slot time. These three improvements have been introduced into UW-CSMA/CA to accommodate the features of the underwater environment.
Similar to the UW-CSMA/CA algorithm, the handshaking-based MAC protocol slotted floor acquisition multiple access (slotted FAMA) was proposed by Molins and Stojanovic [19]; this was adapted from the FAMA protocol with the addition of an ARQ technique. Although similar, slotted FAMA differs from UW-CSMA/CA because it is synchronized and channel access is divided into fixed time slots where all the packets (RTS, CTS, DATA, or ACK) must be delivered at the beginning of the time slot. The length of the time slot is the sum of the maximum propagation delay, the length of the RTS/CTS packet, and the guard time. The introduction of the time-slotting technique in slotted FAMA minimizes the chance of collisions and eliminates the requirement for long control packets.
Peleato and Stojanovic [20] proposed an improvement to the slotted FAMA protocol that minimizes overheads and increases energy efficiency. The improved protocol also utilizes the handshaking technique, but it includes the addition of a waiting time period between the time when a CTS packet is received and the time the sender node starts sending its data packet, which avoids potential packet collisions.
Another improvement to slotted FAMA, proposed by Dong et al. [21], uses an adaptive slot duration strategy that avoids packet collisions through the handshaking technique while aiming to offer good MAC throughput in UWASNs.
Multiple access collision avoidance for underwater (MACA-U) [22], proposed by Ng et al., is a collision-avoidance underwater MAC protocol adapted from the terrestrial MACA protocol, with changes to the control rules and packet forwarding approach. Although MACA-U is a handshaking-based MAC protocol like the other abovementioned MAC protocols, it differs by not using carrier sensing before sending an RTS, which improves energy efficiency, and by not including an ARQ technique, which eliminates extra overheads.
These handshaking-based underwater MAC protocols guarantee collision avoidance but degrade the channel utilization because a significant amount of idle time is introduced into the network. Additionally, their exchange of control packets can introduce high energy consumption, latency, and extra overheads into the network.
For UWASNs, random-access MAC protocols, i.e., different variants of Aloha-based protocols, have been proposed in [8], [23], and [24]. One of these protocols uses a slot guard time to mitigate packet collisions [8], while others, such as the protocol proposed in [17], add control packets for collision avoidance. A receiver-synchronized approach is used in the protocol proposed in [18] to minimize packet collisions at the receiver. However, under high traffic conditions or with a large number of sensor nodes, random-access protocols suffer from low channel utilization together with high latency and energy consumption.
Apart from the above-mentioned approaches, several other procedures that can reduce the collision probability at the receiver, such as back-off algorithms, are applied to contention-based MAC protocols. One example, exponential increase and exponential decrease (EIED), is applied in many existing works [25]-[28]. In EIED, the back-off value is exponentially incremented or decremented based on the status of a packet transmission, i.e., success or collision. The fixed tuning of EIED introduces a considerable amount of idle time into the network when the traffic load is high or the number of sensor nodes is large, which affects the network's channel utilization.
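The EIED rule described above can be sketched as follows. The factor of 2 matches the setting κ_i = κ_d = 2 used later in the simulation section; the bounds `b_min` and `b_max` and the function name are assumptions introduced only for illustration.

```python
def eied_update(backoff, success, k_inc=2.0, k_dec=2.0, b_min=1, b_max=256):
    """Return the next back-off value under an EIED-style rule.

    The back-off is divided by k_dec after a success and multiplied by
    k_inc after a collision, then clamped to [b_min, b_max] (assumed bounds).
    """
    if success:
        backoff = backoff / k_dec
    else:
        backoff = backoff * k_inc
    return int(min(max(backoff, b_min), b_max))


# A node that collides twice and then succeeds once.
b = 8
for outcome in (False, False, True):
    b = eied_update(b, outcome)
```

Note that the rule reacts only to the binary outcome of a transmission; it does not use any information about how many packets collided.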
Contention-free MAC protocols, that is, TDMA-based MAC protocols, are well known for scheduled data packet transmission that can mitigate data packet collisions at the receiver. In [29], [30], and [31], TDMA-based MAC protocols are proposed in which collision reduction at the receiver comes at the cost of a large scheduling delay. The reason is that the central node first has to collect the information needed to determine the transmission/reception schedule, then has to send the schedule information to the sensor nodes, and finally the sensor nodes, on receiving the schedules, transmit their data packets accordingly.
Owing to its low computational requirements, Q-learning is widely studied in terrestrial MAC protocols to solve issues such as time-slot scheduling [32], data transmission scheduling [33], and active and sleep time adjustment for duty cycling [34]. At present, very few studies have been conducted on Q-learning-based protocols for underwater networks. Most existing research focuses on routing, while only two protocols focus on the MAC layer. One extends the lifetime of underwater acoustic sensor networks [35] and the other focuses on improving the channel utilization of underwater acoustic sensor networks [16].

III. ASSUMPTIONS & CONDITIONS
A data-gathering star network topology is considered wherein the sink is located at the center and sensor nodes are uniformly distributed within the transmission range of the sink [24], [36], [38]. In this study, the sensor nodes are considered to be the last-hop nodes of a multi-hop data-gathering network. These last-hop relay nodes collect data either from the environment or from sensor nodes that are more than two hops away from the sink and send the data to the sink. Such networks can be useful in real-world applications, such as military surveillance, oceanographic data collection, and underwater environment monitoring [1], [3], [5]. Time is slotted, and the slot duration equals the packet length plus a guard time.
We assume that sensor nodes are synchronized with the sink and that the propagation delays are known. Synchronization in this study means that each sensor node knows the exact transmission time that corresponds to the data packet arrival time at the sink [24], [37]. Based on synchronization, a data packet is transmitted such that it arrives at the beginning of a slot at the sink regardless of the sensor-sink distance. The associated events are illustrated in Fig. 2, where the n-th data packet of Node $i$, denoted by $D_i(n)$, and the m-th data packet of Node $j$, denoted by $D_j(m)$, are generated at arbitrary times and backed off before being transmitted to alleviate possible collisions at the sink. The transmission time is calculated by considering the targeted slot at the sink and the distance. The transmitted data packets are then received at the beginning of the targeted slot at the sink, e.g., in Fig. 2, the k-th and (k + 2)-th slots for $D_i(n)$ and $D_j(m)$, respectively. Sensor nodes are considered static; hence, the movement of sensor nodes caused by water currents and waves, for example, is ignored.
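A minimal sketch of this timing calculation is given below. It assumes a sound speed of 1500 m/s and interprets the back-off as a number of slots added to the earliest slot boundary that the packet can still reach; both the interpretation and the function name are illustrative assumptions, not the authors' exact procedure.

```python
import math


def transmit_time(ready_time_s, backoff_slots, distance_m, slot_s,
                  speed_m_s=1500.0):
    """Return (target_slot, send_time_s) for slot-aligned arrival at the sink.

    The packet is sent early by exactly the propagation delay so that it
    arrives at the beginning of the target slot.
    """
    delay_s = distance_m / speed_m_s
    # Earliest slot boundary reachable if the node sent immediately.
    earliest_slot = math.ceil((ready_time_s + delay_s) / slot_s)
    target_slot = earliest_slot + backoff_slots
    send_time_s = target_slot * slot_s - delay_s
    return target_slot, send_time_s


# Two nodes at different distances aim at different slots of the sink.
slot_i, send_i = transmit_time(0.0, 0, 750.0, slot_s=1.0)
slot_j, send_j = transmit_time(0.0, 2, 1500.0, slot_s=1.0)
```

Because each node compensates for its own propagation delay, arrivals at the sink are aligned to slot boundaries, and two packets collide only if they target the same slot.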
Based on Thorp's underwater channel model, we assume that, by using pre-acquired information on distance and frequency, each sensor node is capable of controlling its transmission power such that the received power level, $P_R$, at the sink is constant regardless of the location of the sensor node [38].
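As a rough sketch of such power control, the widely used Thorp absorption formula (frequency in kHz, result in dB/km) can be combined with a spreading loss term to estimate the path loss and hence the transmit level needed for a fixed received level. The spreading factor of 1.5, the dB-domain bookkeeping, and the function names are assumptions made for illustration; the paper's exact power-control expression is not shown in this section.

```python
import math


def thorp_absorption_db_per_km(f_khz):
    """Thorp's empirical absorption coefficient in dB/km (f in kHz)."""
    f2 = f_khz ** 2
    return (0.11 * f2 / (1 + f2) + 44 * f2 / (4100 + f2)
            + 2.75e-4 * f2 + 0.003)


def path_loss_db(distance_m, f_khz, spreading=1.5):
    """Spreading loss plus absorption loss, in dB (assumed model form)."""
    spreading_loss = spreading * 10 * math.log10(distance_m)
    absorption_loss = (distance_m / 1000.0) * thorp_absorption_db_per_km(f_khz)
    return spreading_loss + absorption_loss


def tx_level_db(target_rx_db, distance_m, f_khz, spreading=1.5):
    """Transmit level that yields the target received level at the sink."""
    return target_rx_db + path_loss_db(distance_m, f_khz, spreading)


# Nodes at 500 m and 1000 m choose different transmit levels so that
# the sink observes the same received level from both.
tx_near = tx_level_db(100.0, 500.0, f_khz=20.0)
tx_far = tx_level_db(100.0, 1000.0, f_khz=20.0)
```

With every node received at the same level, the total power observed in a slot scales with the number of overlapping packets, which is what the NACK mechanism of the proposed scheme relies on.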

IV. PROPOSED SCHEME
When only one data packet arrives at the beginning of a given slot of the sink, the transmission is considered successful. Otherwise, the reception of more than one data packet at the beginning of the same time slot is considered a collision.
After the reception of data packets, the sink responds by broadcasting an ACK or NACK at the beginning of the next slot depending on whether the data packet transmission was successful; the corresponding sender node then decides either to move forward to the next data packet transmission or to retransmit the failed data packet. Unlike the ACK, the NACK contains the received power level, denoted by $nP_R$, where $n$ is the number of collided packets. Fig. 4 illustrates an associated event in which the (n + 1)-th packet of Node $i$, denoted by $D_i(n+1)$, and the (m + 1)-th packet of Node $j$, denoted by $D_j(m+1)$, collide at the (k + 25)-th slot of the sink. Since the number of packets received by the sink at the (k + 25)-th slot is 2, a NACK that contains twice the received power level, $2P_R$, is broadcast at the beginning of the (k + 26)-th slot. This $2P_R$ value informs Node $i$ and Node $j$ that two data packets collided at the sink; based on this information, both nodes perform back-off and retransmit the collided data packets at a later time. As the proposed system works in full-duplex mode, a separate downlink channel is used for broadcasting the ACK or NACK.
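The feedback rule can be sketched as below. The tuple representation of the feedback, the behavior for an empty slot, and the function names are hypothetical choices made for illustration.

```python
def sink_feedback(num_arrivals, p_r):
    """Feedback broadcast by the sink at the beginning of the next slot.

    Returns None for an empty slot (assumed: nothing is broadcast),
    ("ACK", None) for a single arrival, and ("NACK", n * p_r) when n > 1
    packets arrive in the same slot.
    """
    if num_arrivals == 0:
        return None
    if num_arrivals == 1:
        return ("ACK", None)
    return ("NACK", num_arrivals * p_r)


def collided_count(nack_power, p_r):
    """Number of collided packets a sender infers from the NACK power."""
    return round(nack_power / p_r)


# Two packets collide in one slot, as in the Fig. 4 example.
kind, power = sink_feedback(2, p_r=0.5)
n = collided_count(power, p_r=0.5)
```

A sender therefore learns not only that a collision happened but also how many packets were involved, without any exchange of scheduling information between sensor nodes.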
The decision to select a back-off value before the transmission of generated data packets or before the retransmission of collided data packets is made by the sensor nodes using the Q-learning technique. Each sensor node operates as a Q-learning agent and manages a Q table. Initially, all the Q values are set to 0; they are then updated according to the reward obtained after a sensor node takes an action. In general, the following equation is considered for updating Q values:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \theta \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right] \tag{1}$$
where $Q(s_t, a_t)$ is the current Q value obtained when action $a_t$ is performed in state $s_t$. After action $a_t$ is taken and reward $r_t$ is obtained, the agent finds the maximum possible Q value in the next state $s_{t+1}$, and the current Q value is updated. The discount factor, $\theta$, which ranges between 0 and 1, gives more weight to immediate rewards when it is close to 0 and more weight to future rewards when it is close to 1, whereas the learning rate, $\alpha$, which also ranges between 0 and 1, is used to tune the learning speed.
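For readers who prefer code, the standard Q-learning update rule can be sketched as follows. The nested-dictionary Q table and the function name are illustrative assumptions; `theta` is the discount factor and `alpha` the learning rate, as in the text.

```python
def q_update(q, state, action, reward, next_state, alpha, theta):
    """Apply one standard Q-learning update in place and return the new value.

    q is a dict of dicts: q[state][action] -> Q value.
    """
    best_next = max(q[next_state].values())
    td_error = reward + theta * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Tiny two-state example with hypothetical values.
q_table = {
    "s0": {"a": 0.0, "b": 0.0},
    "s1": {"a": 1.0, "b": 2.0},
}
new_q = q_update(q_table, "s0", "a", reward=1.0, next_state="s1",
                 alpha=0.5, theta=0.9)
```

Setting `theta` to 0 removes the contribution of the next state, which is the situation that arises when only a single state exists.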
In the proposed protocol, an action is defined as the decision made by each sensor node when selecting the back-off value, denoted as $B$. The value of $B$ lies in the range $[B_{min}, B_{max}]$, where $B_{min}$ and $B_{max}$ represent the minimum and maximum back-off values, respectively. The state is defined as the receiver of data packets.
In the deployed single-hop network, only one receiver exists, i.e., the sink. Since the sink is the sole data packet receiver, the applied Q-learning technique is essentially a single-state Q-learning technique.
In single-state Q-learning, only the immediate reward is considered because there is no new or future state and, accordingly, no future reward. Differing from the update rule shown in (1), the following expression is used in the proposed protocol for the update of Q values [11]:
$$Q_{t+1}(a) = Q_t(a) + \alpha \left[ r - Q_t(a) \right] \tag{2}$$
where $Q_t(a)$ and $Q_{t+1}(a)$ denote the current and new Q values at time steps $t$ and $t + 1$, respectively, and $r$ is the earned reward. The reward is the main element that controls the decrement or increment of the Q values and determines not only the behavior but also the performance of the system. After an action is performed, a sensor node transmits a data packet to the sink and receives a certain amount of reward as a consequence of the performed action. The reward settings differ depending on whether a collision occurred at the sink. If data packet reception is acknowledged by the sink, then the positive earned reward, denoted by $r_{suc}$, is defined as follows:
$$r_{suc} = \beta_1 r_1 + \beta_2 r_2 + \beta_3 r_3 \tag{3}$$
where $r_1$, $r_2$, and $r_3$ are the fixed reward factor, back-off-related reward factor, and success-related reward factor, respectively, and $\beta_1$, $\beta_2$, and $\beta_3$ are the respective weights of the reward factors. The reward factors are interpreted as follows. The fixed reward factor $r_1$ represents a punishment given to the sensor nodes because energy is consumed whenever a packet is transmitted. The back-off-related reward factor $r_2$ depends on the performed action $B$. A higher value of $B$ introduces more idle time and delay into the network, which degrades channel utilization. Therefore, a lower value of $B$ is desirable and provides a higher reward for a successful data packet transmission. The success-related reward factor $r_3$ is given when the ACK broadcast is received by the sensor nodes.
On the other hand, if a collision occurs at the sink, a negative earned reward, denoted by $r_{col}$, is defined as follows:
$$r_{col} = \gamma_1 r_1 + \gamma_2 r_2 + \gamma_3 r_3^{*} + \gamma_4 r_4 \tag{7}$$
where $\gamma_1$, $\gamma_2$, $\gamma_3$, and $\gamma_4$ are the respective reward weight terms. The reward factors $r_1$ and $r_2$ are the same in both $r_{suc}$ and $r_{col}$. The other two reward factors are the collision-related reward factor $r_3^{*}$, which is awarded when the NACK broadcast is received by the sensor nodes, and the reward factor $r_4$, which is related to the received power level. The collision-related reward factor is set to $r_3^{*} = -10$.
Based on the obtained reward, the Q values of the sensor nodes are updated. Sensor nodes use these updated Q values to select $B$ values. If the reward at time step $t + 1$ is $r_{suc}$, the value $B_{t+1}$ is calculated using (10); if the reward is $r_{col}$, it is calculated using (11). In both expressions, the result is rounded, $Q_t(a)$ and $Q_{t+1}(a)$ represent the Q values at time steps $t$ and $t + 1$, respectively, and $B_t$ is the back-off value at time step $t$. Initially, when the Q values are set to 0, the sensor nodes select $B$ values randomly from the range $[B_{min}, B_{max}]$. Later on, the $B$ values are calculated using (10) or (11) based on the success or collision of a packet.
A simple example of how the Q values of sensor nodes help to select $B$ values after receiving a broadcast from the sink is given below and illustrated in Fig. 5. Suppose that, at the beginning of the process when all Q values are 0, Node $i$ randomly chooses a $B$ value of 25 and transmits data packet $D_i(n)$ after back-off at the known transmission time. The transmission is successful and the earned reward is positive for Node $i$. According to (2), the Q value of Node $i$ is updated and the new Q value is 0.253. To simplify the calculation, it is assumed that $\beta_1 = 0.1$, $\beta_2 = 0.7$, $\beta_3 = 0.2$, $\alpha = 0.1$, and $B_{max} = 256$. The new $B$ value for Node $i$ is now 19 according to (10). Based on this obtained $B$ value, Node $i$ performs back-off and transmits data packet $D_i(n+1)$, but this time a collision occurs at the sink with data packet $D_j(m+1)$ of Node $j$. Again, the Q value of Node $i$ is updated. This time the earned reward is negative for Node $i$, and the new Q value according to (2) becomes −0.30. Here, $\gamma_1$, $\gamma_2$, $\gamma_3$, and $\gamma_4$ are assumed to be 0.05, 0.05, 0.45, and 0.45, respectively, with the same values of $\alpha$ and $B_{max}$ as previously mentioned. Based on (11), the new $B$ value for Node $i$ is now 24. Node $i$ performs back-off according to this obtained $B$ value and retransmits the collided data packet $D_i(n+1)$. If the same packet of Node $i$ exceeds the limit for the maximum number of packet retransmissions, it is discarded. As with Node $i$, Node $j$ also performs back-off according to its updated Q value and retransmits the collided data packet $D_j(m+1)$ after selecting $B$ with the aim of achieving a successful packet retransmission.
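The first step of this example can be checked numerically with a short sketch. The single-state update below is the common stateless form (the general rule with the future-state term dropped). The reward factor values `r1 = -1`, `r2 = 1 - B / B_max`, and `r3 = 10` are assumptions: the paper's own definitions of these factors are not visible in this section, and these values are used here only because they reproduce the reported Q value of 0.253. The back-off update expressions are not reproduced.

```python
def single_state_update(q, reward, alpha):
    """Stateless Q update: move q toward the reward by a fraction alpha."""
    return q + alpha * (reward - q)


def success_reward(backoff, b_max, betas, r1=-1.0, r3=10.0):
    """Weighted sum of reward factors after an ACK.

    r1 (transmission penalty), r2 = 1 - backoff / b_max, and r3 (success
    bonus) are ASSUMED forms chosen to match the worked example.
    """
    r2 = 1.0 - backoff / b_max
    return betas[0] * r1 + betas[1] * r2 + betas[2] * r3


# Node i: Q starts at 0, B = 25, B_max = 256, weights 0.1 / 0.7 / 0.2.
r_suc = success_reward(25, 256, betas=(0.1, 0.7, 0.2))
q_new = single_state_update(0.0, r_suc, alpha=0.1)
print(round(q_new, 3))
```

Under these assumptions a smaller back-off value yields a larger reward after a success, which matches the qualitative description of the back-off-related reward factor.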

V. SIMULATION MODEL AND RESULTS
To evaluate the proposed protocol, we provide simulation results obtained using MATLAB (R2020a). In the simulation, 20 sensor nodes were randomly deployed within the transmission range of the sink. The traffic generated by each sensor node followed a Poisson distribution with a rate of 0.05-0.15 packets/s. To further show the effectiveness of the proposed protocol, additional simulations were performed with the generated traffic load set to 0.1 packets/s and the number of sensor nodes varied from 10 to 30. Because the proposed protocol is learning-based, it is also useful to show how its performance varies over simulation time; for this purpose, the proposed protocol was simulated with the traffic generation rate and the number of nodes fixed at 0.125 packets/s and 18, respectively. Thorp's underwater acoustic channel model was used to characterize the underwater channel [39]. For RX/TX power and data rate, a Teledyne Benthos ATM-903 commercial underwater modem was considered [40]. The proposed protocol was compared with MACA-U [22], EIED, and UW-ALOHA-Q [16]. MACA-U is a collision-avoidance underwater MAC protocol in which RTS-CTS control packets are exchanged before the transmission of data packets to mitigate data packet collisions at the receiver. EIED is a random back-off algorithm. To lower the collision probability at the receiver, EIED increases the back-off value of a sensor node by a back-off factor $\kappa_i$ if a data packet transmitted from that sensor node is involved in a collision; otherwise, the back-off value is reduced by a back-off factor $\kappa_d$ after a successful data packet transmission, where $\kappa_i = \kappa_d = 2$. UW-ALOHA-Q is a learning-based underwater MAC protocol adapted from the terrestrial ALOHA-Q protocol, in which the Q-learning technique is implemented in a repeating frame structure to achieve collision-free scheduling by finding an optimal data packet transmission time slot in each frame. Three improvements, namely, asynchronous operation, refinement of the frame length, and uniform random back-off, are incorporated in UW-ALOHA-Q to address the key limitations of the underwater environment.
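As a minimal sketch of the traffic model, Poisson packet generation at a node can be simulated by drawing exponentially distributed inter-arrival times. The simulations in this paper were run in MATLAB; the Python code below, including the function name, the seed, and the duration, is only an illustrative assumption.

```python
import random


def poisson_arrivals(rate_pkts_per_s, duration_s, rng):
    """Return packet generation times of a Poisson process on [0, duration)."""
    times = []
    t = 0.0
    while True:
        # Inter-arrival times of a Poisson process are exponential.
        t += rng.expovariate(rate_pkts_per_s)
        if t >= duration_s:
            return times
        times.append(t)


rng = random.Random(1)  # fixed seed for repeatability
arrivals = poisson_arrivals(0.1, 10_000.0, rng)
print(len(arrivals))
```

At a rate of 0.1 packets/s, roughly 1000 packets are expected per node over 10 000 s, with random variation from run to run.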
Three performance metrics, i.e., energy efficiency, latency, and channel utilization, were considered for performance evaluation.
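A rough sketch of how these three metrics can be computed from simulation logs is given below, following the verbal definitions in this section. The function names, the argument units (bits, joules, seconds), and the sample numbers are assumptions made for illustration.

```python
def energy_efficiency(packet_bits, n_success, energy_j):
    """Successfully delivered bits per joule of total consumed energy."""
    return packet_bits * n_success / energy_j


def mean_latency(gen_times_s, arrival_times_s):
    """Average delay between generation and successful arrival at the sink."""
    delays = [a - g for g, a in zip(gen_times_s, arrival_times_s)]
    return sum(delays) / len(delays)


def channel_utilization(n_success, t_data_s, n_slots, t_slot_s):
    """Fraction of simulation time carrying successfully received data."""
    return n_success * t_data_s / (n_slots * t_slot_s)


# Hypothetical log values.
ee = energy_efficiency(packet_bits=1000, n_success=50, energy_j=200.0)
tau = mean_latency([0.0, 1.0], [2.0, 5.0])
eta = channel_utilization(n_success=30, t_data_s=0.8, n_slots=100, t_slot_s=1.0)
```

Only successfully delivered packets enter these calculations; discarded packets contribute to energy consumption but not to the delivered data.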
First, energy efficiency is defined as
$$\text{energy efficiency} = \frac{L_d N}{\varepsilon_c}$$
where $L_d$ is the data packet size, $N$ is the number of data packets received successfully by the sink, and $\varepsilon_c$ is the total energy consumption for the duration over which the network was active. Second, latency, $\tau$, is defined as the average time interval between the generation of a data packet and the successful delivery of that data packet at the sink. It is obtained as follows:
$$\tau = \frac{1}{N} \sum_{k=1}^{N} \left( T_{a,k} - T_{g,k} \right)$$
where $T_{g,k}$ is the generation time of the k-th generated data packet and $T_{a,k}$ is the successful arrival time of the k-th generated data packet at the sink. Finally, channel utilization, $\eta$, is defined as the ratio of the duration of successfully received data packets at the sink to the total simulation time:
$$\eta = \frac{N \, T_{data}}{N_{slot} \, T_{slot}}$$
where $T_{data}$ is the data packet duration, $T_{slot}$ is the slot duration, and $N_{slot}$ is the number of slots during the total simulation time. Discarded data packets are not considered in this case.
Fig. 6 shows the energy efficiency for various average traffic loads. The energy efficiency of the proposed, EIED, and MACA-U protocols decreased as the average traffic load increased, whereas that of UW-ALOHA-Q saturated. UW-ALOHA-Q shows the highest energy efficiency, in part because the number of collisions is greatly reduced as each node learns to determine a time slot that has a low collision probability and in part because the time-slot duration is set to the maximum round-trip time. However, this causes an unavoidably long delay and, subsequently, poor channel utilization. For the proposed protocol and the remaining two comparison protocols, on the other hand, energy efficiency decreases as the traffic load increases because collisions occur at the sink more frequently when the number of data packet transmissions increases.
The proposed protocol shows a higher energy efficiency than MACA-U and EIED because more rewards are obtained as a result of increasing data packet transmission, which helps the sensor nodes to learn the back-off value that reduces data packet collisions at the sink. In EIED, more collisions occur at the sink than in the proposed protocol because, based on the received ACK or NACK, sensor nodes simply increase or decrease their back-off values exponentially while ignoring how many packets have collided. MACA-U shows the lowest energy efficiency among the four protocols, which may be attributed to sensor nodes exchanging control packets before data packet transmission or to the propagation delay. Because of the propagation delay, there remains a possibility that control packets will collide in the process of handshaking. Both of these reasons can account for high energy consumption in the network.
Fig. 7 shows the energy efficiency for various numbers of nodes, which decreased with increasing node number. This occurs because, with an increasing number of sensor nodes, the sink faces a greater number of collisions. Consequently, data packet retransmission occurs more frequently in the network and energy consumption therefore increases. Greater energy consumption is an indication of lower energy efficiency. The energy efficiency of the proposed protocol is close to that of UW-ALOHA-Q for the reasons mentioned in relation to Fig. 6. For EIED and MACA-U, the trend illustrated in Fig. 6 is responsible for the poor performance compared with the proposed protocol and UW-ALOHA-Q.
FIGURE 7. Energy efficiency versus number of nodes.
Fig. 8 shows the latency for various average traffic loads, which increased for all four protocols as the average traffic load increased.
An increase in average traffic load implies a greater number of data packet retransmissions due to a higher number of collisions at the sink; consequently, high latency is introduced into the network. Among the four schemes, MACA-U shows the highest latency. This is due to the handshaking procedure between the sender and receiver that takes place before data packet transmission, which causes a significant amount of waiting time. It is also due to the additional collisions induced by control packets during the handshaking. When a control packet collision occurs, the whole handshaking process starts again from the beginning, which introduces a significant delay into the network. UW-ALOHA-Q shows higher latency than the proposed protocol and EIED for two reasons. The first is the large frame duration, which introduces high idle time. The second is the increase in the number of buffered data packets that are queued for transmission as the traffic load increases. Additionally, as total elimination of collisions is impossible, a certain number of retransmissions exist that increase network latency. The proposed protocol and EIED perform data packet transmission without applying the handshaking technique; thus, the latency of each protocol is much lower than that of MACA-U. As shown in Fig. 8, the lowest latency was achieved by the proposed protocol. This can be explained by the sensor nodes using Q values to adjust their back-off values, which results in a lower number of data packet retransmissions.
Fig. 9 shows the latency for various numbers of nodes, which increased as the number of nodes in the network increased. As shown, this latency trend was apparent for all four protocols. Generally, more nodes lead to more collisions at the sink, which increases the average delay. The proposed protocol has the lowest average latency, for the reasons given in relation to Fig. 8. The discussion of Fig. 8 also explains why the performance of the other three protocols, UW-ALOHA-Q, MACA-U, and EIED, is poor compared with that of the proposed protocol.
Fig. 10 shows channel utilization, which increased as traffic load increased up to a certain point and then decreased for the proposed protocol, EIED, and MACA-U. This is a general feature of channel utilization versus traffic load for contention-based protocols: at low traffic loads, channel utilization increases in proportion to the traffic load, but beyond a certain threshold, packet collisions rather than traffic load become the dominant factor that determines channel utilization. In the case of UW-ALOHA-Q, channel utilization increased and then saturated. Fig. 10 shows that the proposed protocol has the highest channel utilization, which reflects the ability of the Q-learning technique to select back-off values that reduce collisions at the sink and thereby mitigate the average delay of the network compared with UW-ALOHA-Q, EIED, and MACA-U. MACA-U had the lowest channel utilization because it performs handshaking to avoid possible collisions; handshaking introduces a significant amount of latency into the network. UW-ALOHA-Q has lower channel utilization than the proposed protocol and EIED because it has a higher transmission delay due to the large frame duration. However, as the average traffic load increases, the channel utilization of EIED decreases more rapidly than that of the proposed protocol because more retransmissions are performed in EIED. The advantage of fewer retransmissions in the proposed protocol arises from the Q-learning-based adjustment of the back-off value.
Fig. 11 shows the channel utilization for various numbers of nodes in the network. Among the four protocols, the proposed protocol had the highest channel utilization for the same reasons described in relation to Fig. 10.
Fig. 12 demonstrates the performance of the protocols over the simulation period.
It can be seen that the proposed protocol and UW-ALOHA-Q show some variations while EIED is almost steady. This is because both the proposed protocol and UW-ALOHA-Q have a learning process, while EIED does not, which results in a relatively steady performance. Moreover, the reasons behind the proposed protocol's better performance compared with the other two protocols have already been described in relation to Fig. 10. The description given for Fig. 12 also applies to Fig. 13, except that, in the case of energy efficiency, UW-ALOHA-Q shows better performance for the same reasons described in relation to Fig. 6.

VI. CONCLUSION
Herein, a contention-based underwater MAC protocol was proposed in which a single-state Q-learning technique was integrated to optimize the back-off slots and thereby reduce the collision probability at the sink in underwater acoustic sensor networks. Through trial-and-error learning, the proposed protocol permitted sensor nodes to intelligently predict the number of back-off slots that were needed to mitigate collision probability.
Through extensive simulations, the performance of the proposed protocol was compared with that of the UW-ALOHA-Q, MACA-U, and EIED protocols in terms of energy efficiency, latency, and channel utilization. In single-hop networking, the proposed protocol outperformed UW-ALOHA-Q in terms of latency and channel utilization, and it outperformed MACA-U and EIED in terms of energy efficiency, latency, and channel utilization.
In future work, multi-state Q-learning will be investigated to improve underwater contention-based MAC protocols in multi-hop networks.