Overcoming Obstacles With a Reconfigurable Robot Using Reinforcement Learning

The reconfigurable robot RSTAR (Rising Sprawl-Tuned Autonomous Robot) is a newly developed crawling robot that can reconfigure its shape and shift the location of its center of mass. These features endow the RSTAR with inherent robustness and an enhanced ability to overcome obstacles and crawl on a variety of terrains for a vast range of applications. However, defining the trajectories that fully exploit the robot's capabilities is challenging, especially when complex maneuvers are required. Here, we show how reinforcement learning can be used to determine the optimal strategies to overcome three typical obstacles: squeezing through two adjacent obstacles, ducking underneath an obstacle, and climbing over an obstacle. We detail the implementation of the Q-learning algorithm in a simulation environment with a physics engine (Unity™) to learn a feasible path in a minimum number of steps. The results show that the algorithm successfully charted a feasible trajectory in all cases. Comparing the trajectory found by the algorithm to trajectories devised by 12 human experts with discrete or continuous control showed that the algorithm's trajectories were shorter than the experts'. Experiments showing how the physical RSTAR robot can overcome different obstacles using the trajectories found in the simulation by the Q-learning algorithm are described and presented in the attached video.


I. INTRODUCTION
Miniature search and rescue ground robots that can be used in disaster areas have numerous advantages. Palm-sized search and rescue ground robots include the Mini-Whegs [1], Dyna-RoACH [2], DASH [3], iSprawl [4], OctoRoACH [5], RHex [6], Sprawlita [7], STAR [8], 1STAR [9], TAYLRoACH [10], and several reconfigurable swarm robotic systems [11]-[13]. These robots can crawl, run and climb over obstacles. They can be deployed in large numbers to scout vast areas while keeping operators out of harm's way. Their low weight and small size are crucial to many tasks but also complicate their maneuverability. Although non-orthodox mechanisms can provide advanced solutions to various maneuverability needs, executing trajectories in real-world environments remains a challenge [14]. Here, we describe a way to overcome this problem based on autonomous learning.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Alshabi .
In a previous work [15], we developed the Rising STAR robot (RSTAR - Figure 1) a palm-size search and rescue ground robot. The RSTAR is part of the STAR robot family [8]- [10] which is characterized by its ability to rotate the axis of its wheels. RSTAR is fitted with sprawl (which tilts the wheels' rotating axes) and four bar extension mechanisms. It can change its height, width, and length and shift its center of mass (COM). This considerably enhances the RSTAR's abilities to overcome obstacles and traverse a wide range of terrains.
Teleoperating a reconfigurable robot like the RSTAR requires a highly trained operator, since different mechanisms often need to be actuated simultaneously. In many applications, it may be imperative to reduce human intervention by facilitating autonomous robot operations. One approach is to teach the robot to autonomously perform common operations; in off-road missions, these include climbing over obstacles and maneuvering around or underneath them. Machine learning, and specifically reinforcement learning (RL), can be used to teach robots how to perform such maneuvers offline by running repetitive experiments in a physical or simulated environment.

FIGURE 1. The Rising STAR (RSTAR) robot is a highly reconfigurable robot for search and rescue purposes (see video [12]).
Considerable efforts have been devoted over the years to devising methods for the autonomous learning of robotic motion trajectories. One of the most popular is reinforcement learning (RL) [17], [18]. However, most researchers have concentrated on devising ways for the robot to learn how to avoid obstacles [21]-[25]. For example, Lee et al. applied RL to navigate between obstacles with a quadruped robot. Similarly, Zhang et al. [22] used deep RL to address navigation in cluttered environments on previously unknown rough terrains.
Overcoming an obstacle (e.g., climbing over or ducking under it) rather than avoiding it can lead to more efficient operations and, in some cases, be critical for mission success. Successful operation typically requires a combination of kinematic and dynamic maneuvers. The motion plan must provide a feasible path as a function of the robot's dynamic capabilities. The autonomous learning of such trajectories requires taking complex dynamic considerations into account. It must also be done in setups that preserve the mechanical and electronic hardware of the robot during the learning process. To date, very few projects have successfully met this challenge. Notable exceptions include Totani et al. [26], who used RL to learn a step-climbing method for a crawler-type rescue robot, and Dutta et al. [27], who used multi-agent learning for locomotion learning in modular self-reconfigurable robots (MSRs).
One of the main drawbacks of RL is that learning times can be prohibitively long when the problem dimensionality is high. In robotic systems, problem dimensionality is a function of the number of controlled dimensions (e.g., actuators). Given the need for multiple learning sessions and the exposure of the robotic system to potential hazards during learning epochs, many researchers have opted to use simulation platforms for the learning process.
A major challenge in using simulation environments for learning policies (that must be implemented in the physical world) is to reliably ensure the transferability of the learning from the simulated to the physical environment. Much research effort has been devoted to improving the transferability of simulation results to the physical world (sim2real). The methods suggested largely fall into three categories: bringing the simulation closer to physical reality, e.g., [28], [29]; basing training on data from the physical world, e.g., [30]; and improving the efficiency of integrating experiments in simulation and in the physical world, e.g., [31], [32]. However, robots notoriously learn policies that take advantage of simulation intricacies, and sim2real transfer remains an open research question [28].
In many systems that apply RL, multiple degrees of mobility interact tightly with the environment, e.g., multi-rotor drones [34], where complex aerodynamic effects are produced by the interactions between the multi-rotor airflow and the environment. Other systems require intricate parameter tuning, e.g., underactuated legged robots [35], [36] performing highly dynamic movements involving balance. In such cases, feasible solutions must be determined extremely precisely. This necessitates searching for solutions in continuous action spaces, which constrains the transferability of simulation results. Considerable effort has been devoted in recent years to developing efficient RL algorithms suitable for continuous action environments that are not sensitive to hyperparameters, need only a manageable number of learning sequences, and whose results are transferable from simulated to physical environments [34]-[37].
Overcoming obstacles (rather than circumventing them) is rarely discussed in the literature, partly because it requires highly advanced and complex mechanics. Robots with such capabilities typically require intricate coordination of their degrees of freedom to perform motion. Engaging an obstacle (especially climbing over an obstacle that is at the mechanical limits of the robot) with no preliminary solution at hand is extremely challenging. It is a problem with a binary outcome (success/fail) that requires performing multiple mechanical maneuvers the algorithm must figure out.
The unique design of the RSTAR robot gives it very high maneuverability with a relatively small number of actuators [4]. This makes its solution search space more manageable. The inherent stability of the RSTAR and its decoupled degrees of freedom (the mechanisms and wheels can be actuated practically independently) allow for control policies with relatively large tolerances. This makes it possible to use much simpler discrete RL methods. It also increases the robustness of the solution to small discrepancies in the modeled dynamics, thus enhancing the transferability of the solutions to the physical world. Nevertheless, because motion dynamics are key to mission success, a physics simulation engine is required for meaningful learning. Beyond simplifying the learning process, discretizing the action space simplifies run-time control, since policies can be represented as discrete sequences of actions.

FIGURE 2. A kinematic definition of the RSTAR, including its characteristic dimensions. The RSTAR can vary its sprawl and FBEM angles to increase its height and width threefold and move its COM along the fore-aft direction (dimensions in mm).
In this article, we show how to apply RL to exploit RSTAR's advanced reconfiguration capabilities to climb over, duck underneath, and squeeze through obstacles. We apply the Q-learning algorithm, an off-policy, temporal-difference, model-free RL algorithm [19], for autonomous learning. This model-free algorithm was selected because the formulation of the dynamic equations of the RSTAR is complex and prone to errors that may impede learning. For deterministic Markov decision processes (MDPs), as is the case in the currently simulated environment, Q-learning is guaranteed to converge asymptotically to the optimal solution if each state-action pair is visited infinitely often [33]. Q-learning also has good convergence properties for general MDPs, although in the general case convergence is guaranteed only under additional constraints on the learning rate [Sutton & Barto, 2018], and it generalizes across different environments. System operation is also compared to operation by human operators under a similar setup, in which the artificial learning mechanism has no more information than the human operator, such that the results hold in the general case.

II. KINEMATICS OF THE RSTAR
The RSTAR is composed of a main rigid body and a pair of arms holding two sets of wheels and whegs (wheeled legs) [1]. The main body houses a microcontroller, the drivers and the batteries. The RSTAR is fitted with a sprawl mechanism and a four-bar extension mechanism, which increase the height and width of the robot and rotate the rotation axis of the wheels. Each of these mechanisms is actuated by a single motor, and both sides of the robot are phased together and move symmetrically relative to its center. The wheels on each side are actuated by a single motor. In total, the robot has four motors.

A. SPRAWL AND THE FOUR BAR EXTENSION MECHANISMS
The sprawl mechanism makes it possible to change the relative angle between the legs and the main body in the range of [−90°, 90°] (FIGURE 2). The sprawl angle is used to move the center of mass (COM) of the robot along the fore-aft direction. This feature allows the robot to pitch upwards, by moving its COM backward and accelerating, or to avoid falling on its back, by moving the COM forward when climbing over obstacles. The sprawl also allows the robot to reduce its height to crawl underneath obstacles. The four-bar extension mechanism (FBEM) lengthens or shortens the distance between the body and the legs by changing the FBEM angle in a range of [−72°, 72°]. The FBEM is attached to the sprawl mechanism and rotates together with it. Thus, the combination of the sprawling capabilities and the FBEM allows the RSTAR to raise and lower its COM and move its COM in the fore-aft direction. It also allows the RSTAR to increase or decrease its height and width and operate in a variety of planar, upright, or inverted configurations.

B. DRIVING MECHANISM
The RSTAR has a differential driving mechanism, where each side is actuated by a single motor. To improve stability and reduce energy consumption, the RSTAR can be fitted with wheels, spoked legs, or a combination of the two, which give it a greater ability to engage different terrains. The round wheels are primarily adapted to smooth surfaces, whereas the spoked wheels are more efficient in unstructured environments. The RSTAR can flip its body upside down and drive inverted, using dynamic maneuvers to change the type of wheels contacting the ground [15].

C. CONTROL SYSTEM
The RSTAR can be controlled by either a human operator (direct mode) or a microcontroller (program mode). In program mode, the robot can be programmed to perform a consecutive set of actions. The configuration mechanisms (sprawl and FBEM) are controlled in a closed loop with angular potentiometers, whereas the driving motors of the wheels are controlled in a closed loop with magnetic encoders. The main components of the control system are illustrated in FIGURE 3.

III. LEARNING PROCESS
This section details the simulated virtual environment used for the learning process, the different obstacles the robot learned to overcome, the discretization of the state space, and the learning algorithm.

A. SIMULATION ENVIRONMENT AND OBSTACLE DEFINITION
The simulations were conducted using a Unity® software environment (a real-time engine development platform) [39] and included a 1:1 model of the RSTAR and different obstacles. We defined the RSTAR's kinematics (see Section II) and tuned its speed and contact properties with the surface to mimic those of the physical robot. Three common obstacle cases were learned in three separate simulations: (1) a narrow (180 mm wide) channel in which the robot has to reduce its width to pass through (FIGURE 4-a); (2) a low entry (55 mm high) where the robot has to lower its body to crawl underneath (FIGURE 4-b); and (3) a 50 mm step obstacle the robot needs to climb over by shifting its center of mass (FIGURE 4-c). The geometric properties were modifiable so that the simulation results could be tested on obstacles of different sizes. The simulations were performed on an Intel® Core™ i7-3632QM 2.2 GHz processor with 8 GB of RAM running 64-bit Windows 10.

B. LEARNING ALGORITHM
The Q-learning algorithm was applied using the RSTAR Unity® model. FIGURE 5 illustrates the iterative procedure of the learning process: the RSTAR observes its state, s, performs an action, a, and receives a reward, r. Its next state, s', is produced by the environment, and finally the Q value of the related state-action pair is updated.
Each learning iteration ends when one of the following conditions is met:
1) The RSTAR successfully reached its target.
2) The number of actions performed exceeded the maximal number of actions allotted.
3) The RSTAR reached a ''dead end'' by moving sideways past the obstacle borders.
4) The RSTAR flipped over.
All the intermediate learning data, including states, actions, rewards, and Q matrix values, were retained for convergence analysis and validation.
The algorithm is composed of four main parts:
1. Initialize the Q matrix.
2. Place the agent at its initial position in the physics engine and initialize ε.
3. While (the agent has not reached the target) && (it has performed fewer actions than the number allotted) && (it has not flipped over):
3.1 Extract the current state, s_n, of the agent from the physics engine.
3.2 Draw a random number.
3.3 If (random number < ε), choose a random action a_n; else, choose the action a_n with the highest Q value.
3.4 Perform action a_n in the physics engine.
3.5 Extract the subsequent state, s_{n+1}, from the physics engine.
3.6 Update (decrease) ε.
3.7 Calculate the reward, r_n.
3.8 Update the appropriate Q value.
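The iteration above can be sketched in Python. This is a minimal sketch, not the paper's implementation: the environment interface (`state`/`step`/`done`/`flipped`), the integer action labels, and the `MAX_ACTIONS` budget are assumptions, while the learning rate and discount factor follow the values reported in Section III-F (α = 0.6, γ = 0.9).

```python
import random

ACTIONS = [0, 1, 2, 3, 4]   # five discrete actions (Section III-D)
MAX_ACTIONS = 50            # allotted actions per iteration (assumed value)

def run_episode(env, Q, eps, alpha=0.6, gamma=0.9):
    """Run one learning iteration (steps 3.1-3.8).

    Q maps (state, action) pairs to values, e.g. a defaultdict(float).
    """
    n = 0
    while not env.done() and n < MAX_ACTIONS and not env.flipped():
        s = env.state()                                  # 3.1 current state
        if random.random() < eps:                        # 3.2-3.3 eps-greedy
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])
        r = env.step(a)                                  # 3.4 act, 3.7 reward
        s_next = env.state()                             # 3.5 subsequent state
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])  # 3.8 update
        n += 1
    return n
```

In the paper, `env` is the Unity physics engine and the episode also terminates on the ''dead end'' condition, omitted here for brevity.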

C. STATES
The state of the agent (the RSTAR) was based on its position, orientation, and configuration. To reduce the dimensionality of the state space, the position of the robot was limited to the advancing direction (z) alone, and roll was neglected. Altogether, the agent's state was composed of five parameters: a) the position in the advancing direction, b) the yaw, c) the pitch, d) the FBEM angle, and e) the sprawl angle, as illustrated in FIGURE 6. The partition of the state and action spaces into discrete values differed slightly between cases. The partition of the advancing direction and the FBEM was constant in all three simulations, at 18 mm and 12°, respectively. The yaw angle, α, was divided into three ranges in all simulations: 0 ≤ α < 5°, 5° ≤ α < 10°, and α ≥ 10°. The sprawl angle was partitioned into step sizes of 9° or 15°, depending on the use case and the required sensitivity (TABLE 1). Because the first two problems (squeezing through a channel or under a low opening) did not involve climbing, the pitch was ignored. In the third problem, in which the robot needed to climb over the step, the pitch angle, β, was divided into two ranges: 0 ≤ β < 5° and 5° ≤ β < 180°.
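The discretization above can be expressed as a small state-binning function. This is an illustrative sketch: the function name and the exact bin-edge conventions are assumptions, while the bin sizes (18 mm along z, 12° for the FBEM, three yaw ranges, two pitch ranges, and 9° or 15° sprawl steps) follow the text.

```python
def discretize_state(z_mm, yaw_deg, pitch_deg, fbem_deg, sprawl_deg,
                     sprawl_step=9.0):
    """Map continuous robot readings to the discrete five-parameter state."""
    z_bin = int(z_mm // 18)                      # 18 mm bins along z
    yaw = abs(yaw_deg)
    yaw_bin = 0 if yaw < 5 else (1 if yaw < 10 else 2)
    pitch_bin = 0 if pitch_deg < 5 else 1        # used only when climbing
    fbem_bin = int(fbem_deg // 12)               # 12 deg FBEM bins
    sprawl_bin = int(sprawl_deg // sprawl_step)  # 9 or 15 deg sprawl bins
    return (z_bin, yaw_bin, pitch_bin, fbem_bin, sprawl_bin)
```

For the first two use cases the pitch component would simply be dropped from the tuple.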

D. ACTIONS
The action space consisted of five actions. Note that the robot might be unable to perform a specific action if prevented from doing so by an obstacle. For example, with Action type 1, the wheels simply slid if the robot could not advance forward.
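A discrete action set of this kind might be encoded as follows. Only ''drive forward'' (Action type 1) is confirmed by the text; the other four action names and the per-action step sizes are hypothetical placeholders for the set listed in the paper's table.

```python
from enum import Enum

class Action(Enum):
    """Hypothetical labels for the five discrete actions."""
    DRIVE_FORWARD = 1     # confirmed: wheels slip if the robot is blocked
    SPRAWL_UP = 2
    SPRAWL_DOWN = 3
    FBEM_FORWARD = 4
    FBEM_BACKWARD = 5

def apply_action(state, action, dz=18, d_sprawl=9, d_fbem=12):
    """Advance a (z, sprawl, fbem) tuple by one discrete action step."""
    z, sprawl, fbem = state
    if action is Action.DRIVE_FORWARD:
        z += dz
    elif action is Action.SPRAWL_UP:
        sprawl += d_sprawl
    elif action is Action.SPRAWL_DOWN:
        sprawl -= d_sprawl
    elif action is Action.FBEM_FORWARD:
        fbem += d_fbem
    else:
        fbem -= d_fbem
    return (z, sprawl, fbem)
```

In the simulation, the physics engine decides whether an action actually changes the state (e.g., sliding wheels leave z unchanged).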

E. Q MATRIX INITIATION
Each action-state pair was rated based on the expected value of taking that action in that state. The Q matrix was 5-dimensional for the use cases of squeezing through a channel and ducking underneath an obstacle, and 6-dimensional for the climbing-over-the-obstacle use case. The number of Q matrix cells was on the order of 10^5 (TABLE 1). Unlike more common projects, in our work the robot is expected to overcome the obstacle, rather than circumvent it. Interacting with an obstacle is inherently dynamically different from driving around it. The goal state is not a point or a configuration, as is typical in many RL studies; rather, it is a zone. With a goal-biased Q value initiation [40] and rewards, we direct the robot towards the obstacle, but we do not dictate how to overcome it. Therefore, to encourage the robot to advance towards the obstacle and overcome it, the initial Q matrix values were based on the advancing direction and the yaw angle. The values increased as a function of the advancing direction and decreased with the absolute value of the yaw. In all use cases, the workspace was divided into four zones (i-iv) along the advancing direction (z). FIGURE 7-a presents the specific zone borders of the climbing-over-an-obstacle use case.
The other use cases had a similar pattern but different border values.
Zone (i): located behind the agent's initial position (z ≤ 3), characterized by equal negative initial values for all yaw angles (Eq. (1)), where Z_tot is the total number of partitions in the z (advancing) direction. This zone could be reached if the RSTAR moved its COM backwards using the FBEM.
Zone (ii): located between the agent's initial position and the proximity line (4 bins before the success line, 3 ≤ z < 15). In this zone, the initial Q values were positive and increased monotonically as the robot advanced towards the success zone (Eq. (2)). Lower values were assigned to yaw angles between 5° and 10°, and even lower values beyond 10°, to encourage the robot to maintain a straight orientation.
Zone (iii): in the vicinity of the success zone, with a length of 4 partitions in the advancing direction (15 ≤ z < 19), and Q values substantially higher than in zone (ii). In this zone, the Q values were independent of the yaw (Eq. (3)).
Zone (iv): the success zone, characterized by the highest initial Q values (identical to Eq. (3)) to ensure that the robot was attracted to the success zone and remained in it.
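The goal-biased initialization can be sketched as a function of the z bin and yaw bin alone. The zone borders follow the climbing use case above; the specific magnitudes are illustrative placeholders, not the paper's Eqs. (1)-(3).

```python
def initial_q(z_bin, yaw_bin):
    """Goal-biased initial Q value (illustrative magnitudes).

    Zone (i): z <= 3, equal negative values for all yaw bins.
    Zone (ii): 3 < z < 15, positive, growing with z, penalized by yaw.
    Zones (iii)-(iv): z >= 15, high values independent of yaw.
    """
    if z_bin <= 3:
        return -10.0                     # zone (i)
    if z_bin < 15:
        return z_bin - 2.0 * yaw_bin     # zone (ii)
    return 100.0                         # zones (iii) and (iv)
```

The key properties (monotone growth along z, yaw penalty in zone (ii), yaw independence near the goal) match the text even though the numbers do not.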

F. REWARD AND UPDATE
As in the Q matrix initialization, the workspace was divided into the same four zones (FIGURE 7-b). The rewards r(s_n, a_n) were negative in the first three zones to minimize the number of steps and ensure convergence to an optimal solution:
Zone (i): a highly negative reward zone.
Zone (ii): the reward in this zone was also negative but increased monotonically to ''encourage'' the agent to leave the zone and advance forward. The reward was more negative if the yaw angle was larger than 5°, to ensure that the agent did not attempt to rotate sideways.
Zone (iii): the reward function in this zone was similar to zone (ii), and was designed to ensure that the agent continued attempting to advance forward.
Zone (iv): the reward in the success zone was highly positive, to ensure that the agent advanced towards it and completed its learning.
The initial values of the matrix Q i and rewards are summarized in Appendix B.
The Q matrix was updated using the standard Q-learning rule:

Q(s_n, a_n) ← Q(s_n, a_n) + α[r_n + γ max_a' Q(s_{n+1}, a') − Q(s_n, a_n)]

where r_n was the reward for taking action a_n at state s_n, and s_{n+1} was the state of the agent after taking action a_n. Action a' is the one associated with the highest possible Q(s_{n+1}, a') value. The learning rate α was 0.6, and the discount factor γ was 0.9.

G. POLICY
Learning was conducted with an ε-greedy policy, in which the probability of taking a random action, ε, was decreased at every step by a fixed amount:

Δε = (ε_i − ε_min)/N_max

where ε_i is the initial probability, ε_min is its minimum value, and N_max is the total number of steps over which random actions are taken. In all simulations, ε_i = 1, ε_min = 0 and N_max = 30,000.
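Under these values, ε ramps from 1 to 0 over 30,000 steps. A sketch of the resulting schedule, assuming the fixed decrement is applied once per step (i.e., a linear ramp between the stated endpoints):

```python
def epsilon(n, eps_i=1.0, eps_min=0.0, n_max=30_000):
    """Exploration probability after n steps of the fixed-decrement schedule.

    Applying the decrement (eps_i - eps_min) / n_max once per step is
    equivalent to a linear ramp, clipped at eps_min.
    """
    return max(eps_min, eps_i - n * (eps_i - eps_min) / n_max)
```

With these defaults, the policy is fully random at the start of learning and fully greedy from step 30,000 onward.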

IV. SIMULATION RESULTS
This section presents the convergence rate of the algorithm followed by an analysis of the solution and its validity over a range of parameters. The last section compares the simulation to human operators.

A. CONVERGENCE RATE
The convergence rate of each simulation was tested using three parameters: the sum of rewards over the path at each iteration, the number of actions per iteration, and the success rate of the iterations. The results of all three use cases were qualitatively similar. FIGURE 8 presents the results of squeezing between the walls of a channel. The algorithm started to find feasible solutions after 100 iterations (60% success) (FIGURE 8, top). Subsequently, the number of actions in the found solution decreased, and after 600 iterations the success rate was nearly 100%. The quality of the solution (the number of actions) improved with the number of iterations and reached an average of 20 actions per solution. The best solution (the smallest number of actions) involved 17 actions. Section IV-E discusses the fact that some of the solutions with a larger number of actions had advantages when overcoming challenging obstacles not previously learned. The iterations started with a highly negative sum of rewards (FIGURE 8, bottom). As the number of iterations increased, the sum of rewards increased (its absolute value decreased); the robot found feasible solutions after nearly 200 iterations and then continued to improve slowly.

B. SQUEEZING THROUGH A CHANNEL
The simulation converged to a successful solution (with 17 actions) after about 600 iterations (24 hours of running time). The simulation presented in FIGURE 9 shows that, starting in (a) with a width of 240 mm, the simulated robot performed four actions to lower its sprawl from 48° to 18° and decrease its width from 240 mm to 163 mm. Once the robot was narrower than the channel (180 mm), it continued advancing to reach its target (see video).

C. DUCKING UNDERNEATH AN OBSTACLE
In this use case, the simulation converged after 700 iterations to a solution with 14 actions (24 hours of running time).

FIGURE 9. The simulated RSTAR crawling between two walls. Starting at (a), the robot increased the sprawl to reduce its width (b-c) and continued advancing toward its target (see video).

Starting from an initial configuration in which the simulated RSTAR's height was 100 mm (compared to the 55 mm clearance of the obstacle), the simulated RSTAR first moved its FBEM forward (b), then lowered its sprawl until reaching a height of 51 mm (c), and then continued to advance (d) to reach its target.

D. CLIMBING ON TOP OF AN OBSTACLE
The RSTAR was trained to climb from the ground, as in FIGURE 11 (a), to the top of a step, as in FIGURE 11 (f). Because the obstacle was substantially higher than the radius of the wheels, this use case was complex (even for experienced human operators). It required more complex dynamic maneuvers than the previous two use cases, and its solution required more actions. To reduce the learning time and speed up convergence, the learning process was divided into two epochs. In the first epoch, the agent was initially placed with its wheels on the tip of the step (FIGURE 11 (d)) and learned to climb as illustrated in FIGURE 11 (d)-(f). After the convergence of the first epoch, the Q values were saved and set as the initial values of the second learning epoch (the random action factor was reset to 0.6). In the second learning epoch, the agent learned how to start climbing from the ground (a) until it placed its wheels on the tip of the step (c). Together with the first learning epoch, the robot thus learned how to fully climb over the step starting from the ground. In total, the algorithm required 26 hours to converge to the solution.
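The two-epoch scheme amounts to carrying the learned Q values across epochs while resetting the exploration factor. A minimal sketch, where the `train` callable and the epoch labels are placeholders for the actual Unity learning runs:

```python
def curriculum(train, epoch_starts, eps_reset=0.6):
    """Chain learning epochs: each epoch starts from the Q values learned in
    the previous one, with the random-action factor reset to eps_reset.

    `train(Q, start, eps0)` is assumed to run one full learning epoch from
    the given start condition and return the updated Q table.
    """
    Q = {}                              # empty (or goal-biased) initial Q
    for start in epoch_starts:          # e.g. ["step_tip", "ground"]
        Q = train(Q, start, eps_reset)
    return Q
```

Training the harder sub-task (final climb) first gives the second epoch an informative value function for the states near the goal.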
The algorithm produced multiple solutions for climbing over the obstacle, ranging from 24 to 27 actions. The differences between these solutions are analyzed in Section IV-E. FIGURE 11 illustrates how the simulated RSTAR climbed over a step obstacle. The step was 50 mm high, compared to the 58 mm diameter of the RSTAR's wheels. Starting in (a), the simulated robot moved its COM backward using its FBEM (b) and advanced forward to pitch upwards and place its front wheels on the edge of the obstacle (c). Note that, dynamically, moving the COM was essential to allow it to pitch upwards. After placing its front wheels and leaning on the step, it sprawled down to lower its COM (d) and then advanced its COM forward using the FBEM (e). Both actions were also critical to climbing, since otherwise the robot would pitch onto its back when it attempted to advance forward. The robot continued advancing using its wheels to finish climbing over the obstacle (f) (see video).

E. SOLUTION SUITABILITY FOR UNTRAINED OBSTACLE SIZES
To examine the compatibility of the solutions with variations in obstacle size, the learned solutions were tested on obstacles of different sizes (TABLE 2). For squeezing through a channel, the solution learned for the 180 mm wide channel proved suitable for channels as narrow as 164 mm. For ducking underneath an obstacle, the solution for a clearance of 55 mm was appropriate down to a 51 mm clearance. For climbing over an obstacle, the solution with the smallest number of actions (24) was only suitable for climbing over the obstacle it was trained on (50 mm). However, solutions with more actions were more versatile. Two solutions, with 26 and 27 actions, allowed the robot to climb on top of obstacles that were 51 and even 55 mm high.

F. OUTPERFORMING HUMAN EXPERTS
To estimate the quality of the learning process, we compared the algorithm's results to human solutions. This comparison was conducted for the most challenging case of climbing over a step. Human solutions were proposed by two groups of six M.Sc. and fourth-year undergraduate students from the Bio-Inspired and Medical Robotics Lab who were familiar with the physical RSTAR robot and its kinematics. Both groups were asked to solve the climbing use case using the simulation. Each student was given an explanation of the action possibilities and then allotted one full hour to find a solution. The students worked separately and were not allowed to collaborate. The first group was asked to find a solution by listing a sequence of actions to perform. The students were told how many actions the simulation needed (24) but not the action set.
Interestingly, and surprisingly at least to the authors of this manuscript, the Q-learning algorithm outperformed all the ''human experts''. None of the students was able to find a solution with a smaller number of actions. Two students were not able to find a climbing solution at all, and the other four found solutions ranging from 25 to 28 actions, compared to the Q-learning solution of 24 actions.
The students in the second group were asked to control the simulated RSTAR using a joystick (continuous operation) and their performance was recorded, using the Unity software, in terms of motor actuation. They were asked to stop when they believed they had found a good solution but were not given any feedback on the quality of their solution during the trial. All the students of this group succeeded in climbing over the step using the joystick but their solutions demanded more time (+47%), and more actuation of the FBEM (+220%) and driving motors (+188%). However, on average, they used less sprawling (-40%). A summary of their solutions is presented in TABLE 3. For this small number of samples (6), we compared the results of the algorithm versus the students for three different parameters, the time required for the operation, the sum of sprawl and FBEM actuation, and the total amount of driving forward. In all three measures, the algorithm outperformed the students with a 95% confidence probability.
Given the robust and parsimonious design of the RSTAR, a user can quickly learn a feasible control policy. However, optimizing the steps appears to be very challenging since the user must make two concurrent control decisions with different objectives (driving the robot or changing the robot's configuration using the sprawl or the FBEM).

V. HARDWARE IMPLEMENTATION AND VALIDATION
This section presents the implementation of the algorithm's results on an actual RSTAR prototype. In each experiment, the initial conditions of the RSTAR (position and configuration) and the obstacle's geometric properties were (nearly) identical to the learned dimensions.
Although the robot can calculate its own configuration using its encoders and potentiometers, it does not have a camera or range sensor (for example) to detect its location or orientation. In each case, a sequence of actions was uploaded to the RSTAR's control system. The uploaded sequence was the set learned using the Q-learning algorithm. To squeeze through a channel, the robot reduced its width and advanced between the two walls (FIGURE 12). The robot also successfully lowered its body and ducked underneath the obstacle (FIGURE 13). In both cases, the robot successfully performed its task in the physical environment on the first attempt.
The robot was able to climb over obstacles measuring 45 mm in height, but not obstacles 50 mm high, using its minimal 24-action sequence. This 10% difference between the simulation and the actual robot can be attributed to minor differences in the defined mechanical and geometric properties and to the accuracy of the physics engine's solution.
Interestingly, the robot successfully climbed over the 50 mm high obstacle using the 27-action solution (which, according to the simulation, would allow it to climb over 55 mm steps). This result is consistent with the Q-learning solution, which suggested that the 27-action solution would increase its climbing capability by 5 mm. The results of the three experiments and their comparison to the simulation are summarized in TABLE 4. In all the experiments, the physical robot repeatedly performed its task successfully. There were no unsuccessful attempts.

VI. CONCLUSION
In this article, we applied a Q-learning algorithm to teach a reconfigurable robot to autonomously overcome three typical obstacle use cases. These included squeezing through a channel, ducking underneath an obstacle, and climbing over an obstacle (which requires a very skilled operator). To simplify the process, safeguard the robot, and reduce the learning time, all the learning took place in a simulated environment using Unity software. The algorithm found solutions to all the use cases. Given that our problem is overcoming an obstacle rather than circumventing it, we used a goal-biased Q matrix initiation and rewards to direct the robot to advance towards the obstacle, and we rewarded the robot for successfully overcoming the obstacle but did not dictate how to overcome it. We used a uniform discretization along all the DOFs, which was sufficient to find solutions for each problem. A finer discretization may lead to better results, but would require a longer learning time.
The results were implemented on a physical RSTAR. The robot was placed in an environment reminiscent of the simulated environment and preprogrammed to perform the learned set of actions. With no human intervention, the RSTAR successfully overcame all three obstacles on its first attempt.
These learned solutions could also be implemented on similar obstacles with different geometries. For instance, the solution of squeezing through a channel was developed for a width of 180 mm but was also shown to be suitable for a narrower channel 164 mm wide. The same sequence for climbing over a 50 mm obstacle was also proven compatible for steps with lower heights. To climb over an obstacle, the algorithm provided multiple solutions with different numbers of actions. An analysis of these solutions showed that a larger number of actions were suitable for climbing over higher obstacles. For example, the simulated robot climbed over a 50 mm step with 24 actions, and over a 55 mm step with 27 actions. This result was also validated in the experiments.
The climbing use case was especially challenging. Although the algorithm had no prior knowledge of the kinematics of RSTAR or the obstacles, it generated mechanically intelligent results for climbing on the step. The result included moving the COM backward and then forward and changing its height to overcome the obstacle. This solution outperformed solutions devised by a group of human experts, and used fewer actions. Based on these encouraging results, we are now working on including perception capabilities in the setup to learn action sequences directly based on perceptual input. Figure 16 compares the algorithm to human experts. Human experts needed more time, and activated the driving wheels and the FBEM more often than the algorithm. By contrast, they tended to actuate the sprawl less often than the algorithm. One expert did not actuate the sprawl at all.

APPENDIX B
See TABLE 5.