Highlighted Map for Mobile Robot Localization and Its Generation Based on Reinforcement Learning

This article proposes a new kind of map for mobile robot localization and a method for generating it. We call the map a highlighted map: a map on which uniquely shaped objects (landmarks) in monotonous environments are highlighted. By using this map, robots can use such landmarks as clues for localization, so their localization performance can be improved without additional sensors or extra online computation. Furthermore, this map can easily be combined with many existing localization algorithms. We formulate the problem of making a highlighted map and propose a numerical optimization method based on reinforcement learning. This optimization method automatically identifies and emphasizes the important landmarks on the map. The generated highlighted map is adapted to the situation at hand, such as the sensor characteristics and robot dynamics, because the method uses actual sensor measurement data. It is proven that the optimization converges under certain technical assumptions. A numerical simulation and a real-world experiment show that the highlighted map provides better localization accuracy than a conventional map.


I. INTRODUCTION
Autonomous navigation technology has been applied to various robots, including automated guided vehicles (AGVs) and cars. For robots, estimation of their pose, namely localization, is essential.
Monte Carlo localization (MCL) is one of the most popular localization methods [1]-[3]. It is a particle-filter-based algorithm, which scatters a large number of particles (candidates for the pose estimate) over the environment map and calculates the likelihood of each particle (the likelihood of the corresponding pose estimate) to estimate a probability distribution of the robot's pose. Since MCL can handle complex multimodal probability distributions, it works robustly in various scenes. Thus, it has been widely studied and used; e.g., [4] improved MCL by adding a process that injects random particles, and [5] proposed Kullback-Leibler divergence (KLD) sampling, which dynamically adapts the number of particles. In [6], a pre-caching technique was applied to MCL to reduce the computational burden and sample particles efficiently. The method in [7] combined MCL with a laser scan matching algorithm to inherit the stability of MCL and the high accuracy of scan matching. A particle initialization method based on a convolutional neural network was proposed in [8]. The major target of MCL is wheeled robots, but it has also been applied to an underwater robot [9], an unmanned aerial vehicle [10], and the field of wireless sensor networks [11]. (The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Ayoub Khan.)
As described above, many robots utilize MCL, however, MCL often fails in monotonous environments, e.g., a long straight corridor, a large vacant room, and a warehouse where the same shelves are equally spaced.
The details of MCL are described here in order to explain the problem. MCL includes a measurement model, which gives the likelihood of a pose estimate by matching the map against measurements from the robot's sensor, such as a laser range finder (LRF), ultrasonic sensor, or camera. The measurement model is often built on a likelihood field (see, for instance, [12]-[17]). A likelihood field is a map that accounts for the sensor's measurement noise. Fig. 1 shows an example: Fig. 1(a) is the original map, and Fig. 1(b) is the corresponding likelihood field, generated by blurring the original map by the magnitude of the measurement noise. In Fig. 1(b), the darker a location is, the more likely it is to be observed by the sensor. Fig. 2 demonstrates how the likelihood of a pose estimate is calculated from a likelihood field when an LRF sensor is used. Fig. 2(a) shows the LRF measurement data, and Fig. 2(b) and (c) show the projection of the measurement data onto the likelihood field at each pose estimate. The measurement data in Fig. 2(c), shown as red points, overlap the dark locations more than in Fig. 2(b); thus, the pose estimate of Fig. 2(c) is judged to be more likely. This likelihood calculation does not work well in monotonous environments. For example, in the case of the long straight corridor shown in Fig. 3, the pose estimate of Fig. 3(b) is greatly shifted to the left from the actual robot pose, whereas that of Fig. 3(c) is only slightly rotated clockwise from the actual pose. Nevertheless, the measurement in Fig. 3(b) overlaps the dark locations more than in Fig. 3(c). This means that the pose estimate of Fig. 3(b) deviates more, but is judged to be more likely than that of Fig. 3(c).
(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
If the robot does not have enough computational resources, it can get lost because of this issue.
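As a concrete illustration of the projection-based likelihood scoring described above, the following minimal sketch projects LRF points from the sensor frame into map coordinates and accumulates the likelihood-field values they hit. All names (`score_pose`, the flat-list field layout) are illustrative assumptions, not from the article:

```python
import math

def score_pose(points, pose, field, resolution, width, height):
    """Score a pose estimate by projecting sensor points (sensor frame)
    onto a likelihood field stored as a flat list, summing the cell
    values the points land on (darker cell = higher value)."""
    x, y, theta = pose
    score = 0.0
    for px, py in points:
        # transform the measurement from sensor to map coordinates
        mx = x + px * math.cos(theta) - py * math.sin(theta)
        my = y + px * math.sin(theta) + py * math.cos(theta)
        col, row = int(mx / resolution), int(my / resolution)
        if 0 <= col < width and 0 <= row < height:
            score += field[row * width + col]
    return score
```

On a toy field with a dark "wall", a pose aligned with the wall scores higher than a shifted one, mirroring the comparison in Fig. 2.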
The most common conventional remedy for localization in such environments is to add sensors with characteristics different from those of an LRF (e.g., a GPS [18], [19] or a camera [20], [21]) and fuse their data. The sensor fusion method in [22] adopted fault detection and isolation techniques to exclude the effects of faulty sensors. Another approach is localization based on a moving horizon estimation for such environments [23]. Methods that prevent a robot from entering such monotonous areas were presented in [24], [25].
(Fig. 3: (b) a pose estimate whose error is large but likelihood is high; (c) a pose estimate whose error is small but likelihood is low.)
In this article, we consider how to make full use of a few uniquely shaped objects in a monotonous environment (hereinafter referred to as landmarks) to increase the localization accuracy. Our approach is to generate a map on which the landmarks are emphasized and to use that map for localization instead of a likelihood field. For example, if the landmarks are darkened, the likelihood of a pose estimate increases when the measurement overlaps the landmarks; as a result, the localization accuracy improves. Fig. 4 demonstrates the idea. The measurement in Fig. 4(a) overlaps only the light locations, whereas the measurement in Fig. 4(b) also overlaps the dark locations. Thus, the pose estimate of Fig. 4(b) is judged to be more likely than that of Fig. 4(a).
(Fig. 4: (a) a pose estimate whose error is large and likelihood is low; (b) a pose estimate whose error is small and likelihood is high.)
We call a map whose shading is modified to enable more accurate localization a highlighted map. The advantage of the highlighted map approach is that it requires no additional sensor or online computation for localization (only an offline pre-computation to generate the highlighted map is needed). Therefore, the performance of many robots, especially low-cost robots, can be improved at no additional cost. In addition, the highlighted map approach can easily be combined with other existing algorithms based on MCL.
Now let us consider how to generate a highlighted map. The simplest method is to find landmarks directly based on the shape of the map and modify their shading. However, it is difficult to automate the process of finding them, because there is no rule regarding the shapes of the landmarks. For example, there are several protrusions of the same shape in Fig. 1, but, as indicated in Fig. 5:
• The protrusion at the center of the upper corridor (in a long straight corridor) is a landmark.
• The protrusion at the upper right (in a corner) is not an important landmark compared to the above one, because it is easier to estimate pose in a corner than in a long straight corridor.
• The protrusions in the lower corridor are not landmarks, whereas the place without protrusions is a landmark, because the most unique shape in that corridor is the flat area, not the protrusions.
Moreover, with such an approach, it is also difficult to make a map that reflects the sensor characteristics and robot dynamics, and to guarantee the optimality of localization.
This article presents the reverse approach, where landmarks are searched for from the localization results instead of searching for them directly according to the shape of the map. In this approach, the following two steps are repeated.
1) Estimate the pose with the map and evaluate the accuracy.
2) Modify the shading of the map based on the result of Step 1).
Here, we use reinforcement learning (RL), which has attracted much attention in various fields including robotics [26], [27], to optimize the shading of the map. Our idea is to regard the randomness of MCL (a particle filter) as the randomness required for RL and to modify MCL to fit the RL framework. The features of this approach are as follows:
• Landmarks useful for localization can be automatically identified and emphasized.
• The map is optimized using actual robot data (e.g., control data and measurement data). Nevertheless, unlike many other RL applications, we only need to drive the actual robot once.
• Since the optimization uses actual robot data, it is possible to obtain a highlighted map specialized for the situation, such as the type and specifications of the sensor, the robot's velocity, the robot's route, and the MCL settings.
Note that our study is different from simultaneous localization and mapping (SLAM). SLAM is the problem of associating the measurement data at each position to construct a map of the environment [1]. The constructed map is generally used for path planning as well as localization. Various SLAM methods, such as EKF-SLAM [28], [29], particle-filter-based SLAM [30], [31], and graph-based SLAM [32], [33], have been studied, and their main topic has been loop closure, the technology for creating a consistent map by adjusting the positional relationships of the measurement data when the robot revisits a known area. In contrast, this article considers modifying the shading of each location in an existing map (e.g., a map created by SLAM) for the purpose of improving localization. Therefore, the construction of a highlighted map is a process performed after SLAM.
To summarize, the main contributions of this article are as follows:
• Highlighted maps, on which uniquely shaped objects are highlighted, are newly proposed. Such a map can improve the accuracy of the MCL algorithm without additional sensors or online computation. Moreover, the idea can be combined with various versions of MCL.
• A method of generating a highlighted map based on RL is proposed.
This article is organized as follows. After explaining the MCL algorithm with the likelihood field, Section II formulates the problem of generating a highlighted map, and Section III proposes a solution for it. The numerical simulation described in Section IV and the experiment described in Section V show the effectiveness of using a highlighted map. Section VI concludes the article.
Notation: Let R, R_+, R_{0+}, N, N_0, and P^n_{0+} denote the real number field, the set of positive real numbers, the set of nonnegative real numbers, the set of natural numbers, the set of nonnegative integers, and the set of n × n positive semidefinite matrices, respectively. We use ∅ to represent the empty set. For N data x^(1), x^(2), ..., x^(N), the notation {x^(i)}_{i=1}^{N} denotes the set of them, i.e., {x^(1), x^(2), ..., x^(N)}. For time-varying data x_t, the notation x_{t_1:t_2} denotes the sequence of data from time t_1 to time t_2, i.e., x_{t_1}, x_{t_1+1}, ..., x_{t_2}. For a scalar function f of a vector v = (v_1, v_2, ..., v_n)^T, the gradient is ∇_v f := (∂f/∂v_1, ∂f/∂v_2, ..., ∂f/∂v_n)^T.

II. PROBLEM FORMULATION
Although the highlighted map can be applied to various localization algorithms related to MCL, we consider the combination with the most basic MCL for simplicity. This section describes the MCL algorithm and formulates the problem of generating a highlighted map.

A. GRID MAP
A grid map is a grid array that discretizes the environment with a certain resolution, where each grid cell has a value. Let the map be denoted by m := (m^(1), m^(2), ..., m^(M)), where M ∈ N is the number of grid cells and m^(µ) ∈ R is the value of the µ-th cell. Fig. 6 is an example of a grid map.
Several kinds of maps can be represented as such a grid by setting the cell values m^(µ) in an appropriate manner. In the case of Fig. 1(a), which is called an occupancy grid map [1], m^(µ) indicates the probability that the µ-th cell is occupied by an object. In the case of Fig. 1(b), which is called a likelihood field, m^(µ) denotes the probability that the µ-th cell is observed by a sensor with measurement noise. Although Fig. 1 shows a 2D map, a 3D map can be expressed in the same form by regarding m^(µ) as a voxel value.
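The flat representation m = (m^(1), ..., m^(M)) can be sketched with a pair of index helpers; the 0-based indexing, helper names, and toy map below are illustrative assumptions, not from the article:

```python
def cell_index(row, col, width):
    """Flat index of the cell at (row, col) in a grid of the given width,
    matching the flat representation m with M = width * height values."""
    return row * width + col

def cell_coords(mu, width):
    """Inverse mapping: (row, col) of the mu-th cell (0-based)."""
    return divmod(mu, width)

# A hypothetical 3x4 occupancy grid stored as a flat list of M = 12
# values (1.0 = occupied, 0.0 = free).
width, height = 4, 3
m = [0.0] * (width * height)
m[cell_index(1, 2, width)] = 1.0  # mark one cell occupied
```

The same flat layout works for a 3D map by adding a layer stride, which is why the article can treat 2D cells and 3D voxels uniformly as m^(µ).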

B. MONTE CARLO LOCALIZATION
Let x_t ∈ R^n denote the true pose of the robot at time t ∈ N_0. In the case of a robot moving on a plane, x_t is given by x_t := (x_t, y_t, θ_t)^T, where (x_t, y_t) ∈ R^2 is the position and θ_t ∈ R is the orientation.
Algorithm 1 illustrates MCL for estimating x_t, and corresponds to Table 8.2 in [1]. Algorithm 1 is based on a particle filter, where x̂_t^(i) ∈ R^n is the state of the i-th particle, N ∈ N is the number of particles, and χ_t := {x̂_t^(i)}_{i=1}^{N} is the particle set.
Each particle x̂_t^(i) represents a candidate for the pose estimate, and the set χ_t approximates the probability distribution of the pose. The purpose of Algorithm 1 is to update χ_t at each time step.
Algorithm 1 has inputs u_t, z_t, and m in addition to χ_{t-1}, where u_t ∈ R^{n_u} is control data such as velocity commands or wheel odometry, z_t := {z_t^(ℓ)}_{ℓ=1}^{L} is data measured by a sensor (e.g., an LRF), and m ∈ R^M is a map. The measurement data z_t is a point cloud consisting of L ∈ N points z_t^(ℓ) ∈ R^{n_z}, each an n_z-dimensional vector giving an object's position measured by the sensor in the sensor coordinates.
In Algorithm 1, f : R^{n_u} × R^n → R^n is a function that applies the control data u_t to the state of a particle in accordance with the robot motion model. The output x̂_{t|t-1}^(i) ∈ R^n is the state of the particle after the update.
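A simplified stand-in for the motion model f is sketched below: it applies a translation/rotation increment perturbed by Gaussian noise. The (d_trans, d_rot) parameterization and the noise scaling are assumptions, much simpler than the sample_motion_model_odometry algorithm in [1] that the article actually uses:

```python
import math
import random

def motion_model(u, pose, alpha=0.05):
    """Apply an odometry increment u = (d_trans, d_rot) to one particle
    pose (x, y, theta), with Gaussian noise proportional to the motion.
    A toy stand-in for the motion model f, not the article's f."""
    x, y, theta = pose
    d_trans = u[0] + random.gauss(0.0, alpha * abs(u[0]))
    d_rot = u[1] + random.gauss(0.0, alpha * abs(u[1]))
    x += d_trans * math.cos(theta)
    y += d_trans * math.sin(theta)
    theta += d_rot
    return (x, y, theta)
```

With alpha = 0 the update is deterministic, which is convenient for checking the geometry of the particle prediction step.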
The function g : R^{n_z × L} × R^n × R^M → R_{0+} corresponds to a measurement model, which provides the likelihood of a pose estimate by matching the measurement data z_t to the map m. The output w_t^(i) ∈ R_{0+} is the likelihood of the i-th particle. Note that g differs from an output equation in control theory, which outputs measurement data.
The process of Algorithm 1 is outlined below. After the set χ̄_t := {(x̂_{t|t-1}^(i), w_t^(i))}_{i=1}^{N} is calculated (lines 3-7), the particles are randomly sampled in proportion to the likelihoods w_t^(i); this process is called resampling. Finally, χ_t and χ̄_t are returned. Note that the algorithm in [1] returns only χ_t, whereas Algorithm 1 also returns χ̄_t. This is because we use χ̄_t for generating the highlighted map, as described later.
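The predict-weight-resample cycle just outlined, returning both χ_t and χ̄_t, might be sketched as follows; the 1-D states, helper names, and the all-zero-weight guard are illustrative assumptions:

```python
import random

def mcl_step(particles, u, move_fn, weight_fn):
    """One cycle of Algorithm 1: predict each particle with the motion
    model, weight it with the measurement model, then resample in
    proportion to the normalized likelihoods. Returns the resampled set
    chi_t and the pre-resampling set chi_bar_t = [(state, weight), ...],
    which the article reuses when generating the highlighted map."""
    predicted = [move_fn(u, p) for p in particles]
    weights = [weight_fn(p) for p in predicted]
    total = sum(weights) or 1.0  # guard against all-zero weights
    chi_bar = list(zip(predicted, [w / total for w in weights]))
    chi = random.choices(predicted, weights=[w for _, w in chi_bar],
                         k=len(particles))
    return chi, chi_bar
```

With a weight function that gives all mass to one particle, resampling deterministically duplicates that particle, which makes the behavior easy to verify.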

C. MEASUREMENT MODEL AND LIKELIHOOD FIELD
The measurement model simulates the probability distribution p(z_t | x_t, m). The function g in Algorithm 1 calculates the likelihood of a pose estimate based on the measurement model.
There are several kinds of measurement models, e.g., the beam model and the likelihood field model [1]. The likelihood field model is often used, not only for 2D maps [12]- [15] but also for 3D maps [16], [17], because it tends to produce smoother distributions even in cluttered areas and it is computationally efficient.
Here, we describe the details of the likelihood field model. An example of the calculation based on it is presented as algorithm likelihood_field_range_finder_model in Table 6.3 of [1]. This algorithm includes a costly operation: searching for the nearest object on the map for each measurement point. It is therefore useful to pre-compute a lookup table that implements a likelihood field, so that most of the likelihood calculation can be skipped by referring to it [1], [17].
In the case of a 2D map, the likelihood field m_lf ∈ R^M is generated from an occupancy grid map m_occ ∈ R^M by

m_lf^(µ) := κ_1 N(dist((x_µ, y_µ), m_occ), σ),   (1)

where κ_1 ∈ R_+ is a constant, (x_µ, y_µ) ∈ R^2 is the map coordinates of the µ-th cell, dist((x_µ, y_µ), m_occ) is the distance from (x_µ, y_µ) to the nearest occupied cell of m_occ, and N : R_{0+} × R_+ → R_+ is a function such that N(d, σ) is the probability density at d ∈ R_{0+} under a zero-centered Gaussian distribution with standard deviation σ ∈ R_+. Note that the likelihood field model assumes that the sensor measurement noise is Gaussian. The likelihood field implemented by the lookup table simplifies the calculation in the likelihood field model. When algorithm likelihood_field_range_finder_model in [1] is used as the function g, g includes only the following operation:

g(z_t, x_t, m) = ∏_{ℓ=1}^{L} (κ_2 m^(λ(z_t^(ℓ), x_t)) + κ_3),   (2)

where the ℓ-th factor is replaced by a constant if the ℓ-th sensor beam does not find an object, m is the likelihood field, and κ_2, κ_3 ∈ R_+ are the model parameters. Also, λ(z_t^(ℓ), x_t) is the index of the cell obtained by converting z_t^(ℓ) (the measurement data in the sensor coordinates) into map coordinates based on the robot pose x_t. For example, in Fig. 7, the measurement data are located on the 14th, 20th, and 23rd cells; thus, the likelihood of the pose estimate is

(κ_2 m^(14) + κ_3)(κ_2 m^(20) + κ_3)(κ_2 m^(23) + κ_3).   (3)

(Fig. 7: example of likelihood calculation with a likelihood field. The 14th, 20th, and 23rd cells are measured, and the likelihood is calculated with m^(14), m^(20), and m^(23).)
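A minimal sketch of (1) and (2) under the assumptions above: a Gaussian field built from the distance to the nearest occupied cell, and the likelihood as a product of κ₂m + κ₃ factors over the measured cells. The function names and the brute-force nearest-cell search are illustrative, not the article's implementation:

```python
import math

def likelihood_field(occ, width, height, resolution, sigma, kappa1):
    """Build a likelihood field from an occupancy grid (sketch of (1)):
    each cell stores kappa1 * N(d, sigma), where d is the distance to the
    nearest occupied cell. Brute-force search; assumes at least one
    occupied cell."""
    occupied = [(r, c) for r in range(height) for c in range(width)
                if occ[r * width + c] > 0.5]
    field = []
    for r in range(height):
        for c in range(width):
            d = min(math.hypot(r - ro, c - co) * resolution
                    for ro, co in occupied)
            # zero-centered Gaussian density at distance d
            field.append(kappa1 * math.exp(-d * d / (2 * sigma ** 2))
                         / (sigma * math.sqrt(2 * math.pi)))
    return field

def g(cells, field, kappa2, kappa3):
    """Sketch of (2): likelihood of a pose as the product of
    kappa2 * m^(mu) + kappa3 over the cells hit by the measurements."""
    w = 1.0
    for mu in cells:
        w *= kappa2 * field[mu] + kappa3
    return w
```

In practice the field would be pre-computed once (e.g., via a distance transform) and used as the lookup table described above, so that g reduces to table reads and multiplications.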

D. HIGHLIGHTED MAP GENERATION PROBLEM
Here, we consider modifying the map m consisting of the M values m^(1), m^(2), ..., m^(M). If we obtain a suitably modified map m and use it in the function g, the performance of Algorithm 1 may improve. We call the modified map a highlighted map. We define the objective function for constructing the highlighted map m as

J(m) := Σ_{t=1}^{T} ρ(x_t, χ_t),   (4)

where T ∈ N is the operating time and ρ : R^n × R^{n×N} → R computes the estimation accuracy at each time from the true pose x_t and the particle set χ_t (the probability distribution of the pose estimate). For example, ρ can be defined from the error between x_t and the average of χ_t as

ρ(x_t, χ_t) := −(x_t − x̄_t)^T A (x_t − x̄_t),  x̄_t := (1/N) Σ_{i=1}^{N} x̂_t^(i),   (5)

where A ∈ P^n_{0+} is a weight coefficient matrix. Another choice of ρ is given in (6); note that (6) does not use the true pose x_t.
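A toy 1-D sketch of the objective (4) with a quadratic-error ρ as in the example above; the scalar poses and unit weight A are simplifications for illustration, not the article's setting:

```python
def rho(x_true, particles, A=1.0):
    """Per-step accuracy (sketch of a quadratic-error rho): negative
    weighted squared error between the true pose and the particle mean.
    Scalar poses and scalar weight A are toy simplifications."""
    mean = sum(particles) / len(particles)
    return -A * (x_true - mean) ** 2

def J(x_true_seq, particle_sets):
    """Objective (4): accumulated accuracy over the operating time T."""
    return sum(rho(x, ps) for x, ps in zip(x_true_seq, particle_sets))
```

A perfectly centered particle set contributes 0 to J, and any bias contributes a negative term, so maximizing J drives the particle mean toward the true trajectory.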
Our problem is formulated as follows.
Problem 1: For Algorithm 1 and given f, g, and ρ, assume that an initial particle set χ_0, control data u_{1:T}, measurement data z_{1:T}, and true poses x_{1:T} are given. Then, find the optimal highlighted map

arg max_m J̄(m),   (8)

where J̄ : R^M → R is the expected value of (4).
Note that the solution of (8) is optimal only for the past data u_{1:T}, z_{1:T}, x_{1:T}. Optimality is not guaranteed for future data, but it is expected that the map provides the same performance when the robot moves on a trajectory similar to the one it followed when the data u_{1:T}, z_{1:T}, x_{1:T} were measured. In fact, we verified that the map is effective for future data by conducting a simulation and an experiment (see Sections IV and V).

III. HIGHLIGHTED MAP GENERATION
This section presents the solution to Problem 1 based on RL, which is an approach for finding the optimal policy through trial and error [34].

A. REDUCTION OF HIGHLIGHTED MAP GENERATION PROBLEM TO REINFORCEMENT LEARNING
RL deals with the system in Fig. 8, composed of an agent and an environment. The behavior is as follows:
1) The agent observes a state s_t of the environment.
2) The agent selects and executes an action a_t according to s_t and the policy π.
3) The environment outputs a reward r_t according to s_t and a_t.
4) The environment state transitions from s_t to s_{t+1} according to a_t, and the process returns to Step 1).
For this system, the purpose of RL is to determine the optimal policy π that gives the highest cumulative reward.
Let us associate RL with MCL. RL is characterized by a mechanism to search for the optimal policy by using a probabilistic policy that selects an action at random. MCL is the algorithm that estimates the pose by resampling particles at random. Therefore, by regarding the resampling process of MCL as a policy π and the resampling result as an action a t , the highlighted map, which is the design parameter used in MCL, can be optimized in the framework of RL.
There are various RL methods; we use the method in [35], a kind of policy gradient method, which directly optimizes the parameters of the policy [36], [37]. First, we convert MCL into the RL framework by the following correspondence:
• State s_t: the states of all particles of MCL, x̂_{t|t-1}^(i) (i = 1, 2, ..., N).
• Action a_t: the resampling result, where a_t^(i) ∈ N_0 denotes how many times the i-th particle is selected by resampling.
• Policy π: the resampling process of MCL. The highlighted map m is a parameter that determines the behavior of these processes.
• Reward r_t: the output of the function ρ in (4).
Fig. 9 shows the relationship between MCL and RL in this case. In Fig. 9, D is the system that computes the frequency distribution of the N samples drawn from the probability distribution represented by the likelihoods w_t^(i), i.e., the resampling result. Under this correspondence,

p(a_t | s_{1:t}, a_{1:t}, π) = p(a_t | s_t, m),  E[r_t | s_{1:t}, a_{1:t}, π] = E[r_t | s_t, a_t],   (13)

are satisfied, and the whole system can be regarded as a Markov decision process, which is the target system of many RL algorithms [34], [35].
Next, we derive the characteristic eligibility [38], because many policy gradient methods, including [35], need it [36], [37]. The characteristic eligibility is given by

e_t := ∇_m ln π̃(a_t | s_t, m),   (14)

where π̃ is the probability mass function of the policy π. The characteristic eligibility e_t represents the value of the executed action evaluated from an information-theoretic viewpoint: the less frequent the action and the more sensitive it is to m, the larger e_t becomes.
First, consider π̃. Since w_t^(i) expresses the probability that a particle is resampled to the state x̂_{t|t-1}^(i), the probability that the resampling process results in one particular pattern a_t is

π̃(a_t | s_t, m) = (N! / (a_t^(1)! a_t^(2)! ⋯ a_t^(N)!)) ∏_{i=1}^{N} (w_t^(i))^{a_t^(i)},   (16)

that is, (16) expresses the probability that the N (indistinguishable) particles are distributed into N groups with capacities a_t^(1), a_t^(2), ..., a_t^(N) by resampling. From (14) and (16), the characteristic eligibility is derived as

e_t = Σ_{i=1}^{N} a_t^(i) ∇_m ln w_t^(i).   (17)

Equation (17) can be rewritten as a function ζ of z_t, m, χ_t, and χ̄_t, which are the inputs and outputs of Algorithm 1. In fact, the values of a_t^(i) and w_t^(i) in (17) are obtained from χ_t and χ̄_t, and ∇_m g can be calculated from z_t, m, and χ̄_t. For example, if g is given by (2), the gradient involves β^(i,µ) ∈ N_0, the order of the factor κ_2 m^(µ) + κ_3 in g(z_t, x̂_{t|t-1}^(i), m), which corresponds to the number of measurements z_t^(ℓ) projected onto the µ-th cell. In the case of Fig. 7 and (3), β^(i,µ) takes the value 1 for µ ∈ {14, 20, 23} and 0 for the other cells.
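For the g of (2), one plausible way to evaluate the eligibility gradient numerically is sketched below, using the β^(i,µ) counts and normalized weights; the function name and the explicit normalization term are assumptions for illustration:

```python
def eligibility(a, w, beta, m, kappa2, kappa3):
    """Sketch of the characteristic eligibility for the g of (2).
    a[i]    : resampling count of particle i,
    w[i]    : normalized likelihood of particle i,
    beta[i][mu]: number of measurements of particle i on cell mu,
    m[mu]   : map cell values. Returns one gradient entry per cell."""
    N, M = len(a), len(m)
    # d ln g_i / d m^(mu) for the product form (kappa2*m + kappa3)^beta
    dlng = [[beta[i][mu] * kappa2 / (kappa2 * m[mu] + kappa3)
             for mu in range(M)] for i in range(N)]
    e = []
    for mu in range(M):
        # normalization: w_i = g_i / sum_j g_j, so subtract the mean slope
        norm = sum(w[j] * dlng[j][mu] for j in range(N))
        e.append(sum(a[i] * (dlng[i][mu] - norm) for i in range(N)))
    return e
```

Cells that are hit by the measurements of over-selected particles receive positive eligibility, which is exactly the signal the map update uses to darken or lighten landmarks.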

B. METHOD OF GENERATING HIGHLIGHTED MAP
We propose the following method for generating a highlighted map based on RL:
1) Prepare the initial value of the highlighted map m(0), e.g., a likelihood field constructed from the occupancy grid map.
2) Drive the robot to record control data u_{1:T}, measurement data z_{1:T}, and true poses x_{1:T}. Set k ← 0.
3) Estimate the robot poses by performing MCL with m(k) and the u_t, z_t of Step 2) sequentially from t = 1 to T. Construct the updated map m(k + 1) based on the result of the estimation and the x_{1:T} of Step 2).
4) If k + 1 = k_max, stop. Otherwise, set k ← k + 1 and return to Step 3).
Note that m(k) ∈ R^M is the highlighted map at iteration k ∈ N_0 and k_max ∈ N is the maximum number of iterations. In Step 2), for example, the true poses x_{1:T} can be obtained by any of the following procedures:
• Use a device that can measure the robot pose directly, such as a GPS or an indoor positioning system (IPS) [39].
• After performing Step 2), estimate the pose from u_{1:T}, z_{1:T} with a more complex algorithm than usual (e.g., MCL with a larger number of particles N, running at a higher frequency), and use the result as the approximate true pose. Note that real-time processing is not required for this computation.
• Temporarily add a higher-performance sensor (e.g., a 3D sensor) to the robot, perform Step 2) with it in addition to the original sensor to obtain a high-precision pose estimate, and use that estimate as the approximate true pose.
The specific method we propose for Step 3) is described in Algorithm 2, where b ∈ R is a reinforcement baseline [38] and η(k) ∈ R_+ is a learning rate for updating the map.
In this algorithm, the following calculation is performed sequentially from t = 1 to T. At each time, MCL is executed first, i.e., (χ_t, χ̄_t) = MCL(χ_{t-1}, u_t, z_t, m(k)); then, the evaluation value r_t of the estimation and the characteristic eligibility e_t are computed. Next, the cumulative sum of e_t, denoted by v_t ∈ R^M, and the cumulative sum of (r_t − b) v_t, denoted by V_t ∈ R^M, are updated. After the calculation at time T, the map m(k) is modified by adding V_T scaled by the learning rate η(k), and the result is returned as m(k + 1).
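The accumulation loop of Algorithm 2 described above might be sketched as follows; the episode-long accumulation without resets is an assumption based on the text, with r_t, e_t taken as precomputed inputs:

```python
def update_map(m, rewards, eligibilities, eta, b):
    """One iteration of Step 3) (sketch of Algorithm 2): accumulate the
    eligibility sums v_t and the baseline-corrected sums
    V_t = sum over t of (r_t - b) * v_t, then add eta * V_T to the map."""
    M = len(m)
    v = [0.0] * M  # cumulative eligibility v_t
    V = [0.0] * M  # cumulative (r_t - b) * v_t
    for r_t, e_t in zip(rewards, eligibilities):
        for mu in range(M):
            v[mu] += e_t[mu]
            V[mu] += (r_t - b) * v[mu]
    return [m[mu] + eta * V[mu] for mu in range(M)]
```

Cells whose eligibility keeps the same sign as the above-baseline rewards accumulate the largest corrections, which is how repeated iterations darken useful landmarks.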

C. THEORETICAL ANALYSIS
We present two theorems for the proposed method. Let X ⊂ R^n be a bounded set, which is used in both theorems. The first theorem concerns the gradient of the objective function.
Theorem 1 implies that Algorithm 2 is equivalent to the stochastic approximation method [40], [41], which updates the parameter m according to the gradient ofJ . The second theorem proves that the method converges.

IV. SIMULATION
To clarify the effectiveness of the highlighted map, we first compare it with a conventional map (likelihood field) under several MCL-based algorithms. Furthermore, comparisons with other existing MCL algorithms are performed. These comparisons are carried out under harsh conditions, i.e., low computational resources and a low-performance sensor, because our method is especially effective for low-cost robots.
In this section, our method is evaluated in the 3D dynamic simulator Gazebo [42]. A robot model and an environment model are shown in Fig. 10. The robot had a 2D LRF and could obtain wheel odometry data u_t and LRF measurement data z_t. Since the robot moved on a plane, its pose could be expressed as x_t := (x_t, y_t, θ_t)^T.

A. GENERATION OF HIGHLIGHTED MAP
The (locally) optimal highlighted map m* was generated under the settings described below. The initial map m(0) is shown in Fig. 11 (represented on a color scale); it is a likelihood field generated by (1) with κ_1 = 0.1, σ = 0.15, and an occupancy grid map m_occ. The data u_{1:T}, z_{1:T}, and x_{1:T} used for generating the map (denoted by u^gen_{1:T}, z^gen_{1:T}, and x^gen_{1:T}) were recorded while the robot was driven in the environment. Note that the true poses could have been obtained from the simulator; however, we did not use them for generating the map, in order to simulate a realistic situation. The initial particle set χ_0 was constructed from the initial true pose x_ini ∈ R^3. The function f is algorithm sample_motion_model_odometry in Table 5.6 of [1].
(FIGURE 11. Initial map m(0) (conventional likelihood field map) in the simulation. This is the same map as in Fig. 1(b), except that it is represented on a color scale.)
The highlighted map m* generated by the method in Section III-B is shown in Fig. 12. In the upper corridor, the cargo at the center is highlighted most strongly (red spot). In contrast, the cargo at the upper right of the corridor is highlighted only weakly. This is because it is more difficult to estimate pose in a long straight corridor than in a corner; therefore, the central cargo (in the long straight corridor) is more important as a landmark. In the lower corridor, conversely, the location without cargo is highlighted (purple spot). Note that a red spot indicates that the corresponding object is an important landmark, and a purple spot denotes that the absence of an object is such a landmark. The way of highlighting thus differs depending on the surrounding environment, even for objects of the same shape, which implies that landmarks useful for localization can be automatically identified and emphasized. Fig. 13 illustrates the change in the objective function and indicates that convergence is achieved. Note that a similar highlighted map would be generated if ρ were given by (6) instead of (29). In that case, the process of obtaining the true poses x_{1:T} could be omitted because (6) does not require x_t.

B. PERFORMANCE OF HIGHLIGHTED MAP
Next, the generated highlighted map m* was verified. The data u_{1:T}, z_{1:T}, and x_{1:T} for the verification (denoted by u^test_{1:T}, z^test_{1:T}, and x^test_{1:T}) were obtained as follows:
• The control data u^test_{1:T} and measurement data z^test_{1:T} were measured while the robot followed the same route at the same velocity as in Section IV-A. Note that these values differ from u^gen_{1:T} and z^gen_{1:T} because the robot was operated manually and traced a route slightly different from that of Section IV-A.
• The true poses x^test_{1:T} were obtained from the simulator.
The performance of the map was evaluated by:
• the objective function J with (29);
• the estimation error ε_t with (7);
• the computation time Δt ∈ R_+ for one estimation cycle.
The values of J and ε_t were calculated from x^test_{1:T} and χ_{1:T}, where χ_{1:T} was computed by MCL with u^test_{1:T}, z^test_{1:T}, the parameters in Table 1, and the same f and g as in Section IV-A. The computation time was measured on a Broadcom BCM2837 system on a chip (SoC) with a 1.2 GHz 64-bit quad-core ARM Cortex-A53 processor. Since MCL contains randomness and the estimation result changes every run, we tested the map 20 times using the same u^test_{1:T}, z^test_{1:T}, and x^test_{1:T}.
The following two comparisons were carried out. First, we compared the highlighted map with the conventional map in several MCL algorithms to verify the effectiveness of the highlighted map and the validity of combining it with various MCL-based algorithms. Second, the combination of the basic MCL and the highlighted map was compared with the combinations of improved MCL-based algorithms and the conventional map to verify the significance of the improvement achieved by the highlighted map. As the improved algorithms, we chose KLD-sampling-based MCL (KLD-MCL) [1], [5] and the algorithm introduced in [7], which we call scan-matching-based MCL (SM-MCL). KLD-MCL is the most widely used MCL-based algorithm and has been actively studied in recent years, and SM-MCL is one of the state-of-the-art methods.
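The weighted estimation error ε_t of (7) can be sketched as follows, assuming a diagonal weight matrix A for simplicity (the general quadratic form uses the full matrix A):

```python
import math

def estimation_error(x_true, x_mean, A_diag):
    """Weighted pose error (sketch of (7)): sqrt((x - x̄)^T A (x - x̄)),
    restricted to a diagonal weight matrix A for illustration."""
    return math.sqrt(sum(a * (xt - xm) ** 2
                         for a, xt, xm in zip(A_diag, x_true, x_mean)))
```

In practice the weights would trade off position error (in meters) against orientation error (in radians) before averaging over the 20 test runs.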
For the above two comparisons, we performed six simulations, denoted by S1-S6, whose configurations and results are shown in Table 2. Table 2 summarizes the values of the objective function J and the mean and maximum values of the estimation error ε_t and computation time Δt. All these values are averages over the 20 tests. To show how the highlighted map improved the performance, the estimation errors ε_t of the 20 tests in the cases of S1 and S2 are shown in Figs. 14 and 15, respectively. Furthermore, we evaluated the statistical significance of the improvement by the box plots presented in Fig. 16 and by t-tests. The mean and maximum values of ε_t of S2, S4, and S6 had statistically significant differences from those of S1, S3, and S5, respectively (t-test, p < 0.001). The highlighted map (optimized for use in MCL) improved KLD-MCL and SM-MCL as well as MCL because of the similarity between these improved MCL algorithms and the basic MCL algorithm. These results confirm the effectiveness of the highlighted map and the validity of combining it with other MCL-based algorithms. In addition, by comparing the mean and maximum values of ε_t of S2 with those of S3 and S5, we find that the improvement achieved by the highlighted map was comparable to or greater than that of the other existing algorithms, even though the highlighted map approach required less computation time.
Note that we generated the highlighted map from the actual robot data u^gen_{1:T}, z^gen_{1:T}, x^gen_{1:T}; thus, the map is adapted to the situation, such as the sensor characteristics and robot dynamics. For example, since u^gen_{1:T}, z^gen_{1:T}, x^gen_{1:T} were measured while the robot followed a clockwise route, the generated highlighted map was especially effective for clockwise driving, as shown in this section. By instead using data recorded while the robot traced a counterclockwise route, we would obtain a highlighted map specialized for counterclockwise driving, slightly different from the m* in Fig. 12. As described here, our method can generate various highlighted maps adapted to each situation.

V. EXPERIMENT
To verify the performance of the highlighted map in a practical situation, we performed real-world experiments in two environments: the corridor shown in Fig. 17 and the laboratory shown in Fig. 18. Fig. 19 shows the robot system. The 2D LRF mounted on the robot was an RPLIDAR (SLAMTEC), whose scanning angle, number of data points, and maximum detection range were 360 degrees, 360 data/rev, and 12 m, respectively. However, we used only 1/20 of the data (18 data/rev) and excluded readings beyond 5 m in order to simulate a low-cost sensor.
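The low-cost sensor simulation described above amounts to subsampling the beams and clamping the range. A minimal sketch of that preprocessing (the function name and the synthetic scan are assumptions for illustration):

```python
def simulate_low_cost_sensor(scan, keep_every=20, max_range=5.0):
    """Reduce a full 360-beam LRF scan to mimic a low-cost sensor:
    keep every 20th beam (360 -> 18 beams/rev) and discard readings
    beyond 5 m, as described for the RPLIDAR data in the experiment.
    Returns (beam_index, range_m) pairs; dropped ranges become None.
    """
    out = []
    for i in range(0, len(scan), keep_every):
        r = scan[i]
        out.append((i, r if r <= max_range else None))
    return out

# Synthetic full-resolution scan with ranges cycling between 1 and 12 m.
full_scan = [(i % 12) + 1.0 for i in range(360)]
reduced = simulate_low_cost_sensor(full_scan)
print(len(reduced))  # 18 beams per revolution
```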
First, we generated the highlighted maps m*, with the map size and parameter set to 220 × 180 and 0.9 in the laboratory experiment. All other parameters and functions were the same as in Section IV-A. The generated highlighted maps and the changes in the objective function in the corridor and laboratory experiments are shown in Figs. 22, 23, 24, and 25, respectively. As in Section IV-A, it can be seen that the landmarks useful for localization are highlighted and that convergence is achieved.
Next, we evaluated the generated highlighted maps m*. The measurement procedure for u^test_{1:T} and z^test_{1:T} and the evaluation method were the same as in Section IV-B. The true poses x^test_{1:T} were approximately calculated from u^test_{1:T}, m(0), and the complete RPLIDAR data (360 data/rev, maximum detection range 12 m) instead of z^test_{1:T} (18 data/rev, maximum detection range 5 m). The results of the corridor experiment, denoted by C1-C6, are shown in Table 3 and Figs. 26, 27, and 28; Table 3 summarizes the values of J, the estimation error, and the computation time, and Fig. 28 illustrates box plots of the errors. Similarly, the results of the laboratory experiment, denoted by L1-L6, are presented in Table 4 and Figs. 29, 30, and 31. The mean and maximum estimation errors of C2, C4, C6, L2, L4, and L6 had statistically significant differences from those of C1, C3, C5, L1, L3, and L5, respectively (t-test, p < 0.001). As in Section IV-B, we confirmed that replacing the conventional maps with the highlighted maps improved these localization algorithms. Comparing C2 with C3 and C5, and L2 with L3 and L5, we find that the accuracy of the highlighted map approach was comparable to or better than that of the other existing algorithms, even though our approach took less computation time. These results show the effectiveness of the highlighted map in a practical situation. The degree of improvement differs between the experiments mainly because of the difference in the shapes of the environments.

VI. CONCLUSION
This article proposed a highlighted map for mobile robot localization and a method for generating such a map. The highlighted map can improve localization accuracy without requiring sensor upgrades or additional online computation, and it can be combined with various versions of MCL. Although this article only generated 2D highlighted maps, the proposed method can also create 3D highlighted maps.
Furthermore, the proposed method can be easily applied to other particle filter applications in addition to localization. For example, it can be used to optimize template images in object tracking that uses template matching.
In future work, we will consider the robustness of a highlighted map against disturbances. A highlighted map generated by our proposed method may be vulnerable to uncertainties because it is generated from a single experiment. To address this, we will develop an RL algorithm that targets environments with disturbances.

APPENDIX A PROOF OF THEOREM 1
To prove Theorem 1, several variables have to be defined first. From (11)-(13) and (15), the state transition probability under π is derived as

P^m_{s_t, s_{t+1}} := p(s_{t+1} | s_t, π) = Σ_{a ∈ A} π(a, s_t, m) p(s_{t+1} | s_t, a),

where A ⊂ N_0^N is the action set, and the expected reward under π is denoted by R^m_{s_t}. The probability distribution of the initial state is denoted by P^{χ_0}_{s_1}. Then, from (4) and (9), the gradient of J is represented by

∇_m ∫_S ⋯ ∫_S P^{χ_0}_{s_1} P^m_{s_1, s_2} ⋯ P^m_{s_{t-1}, s_t} R^m_{s_t} ds_t ⋯ ds_1.    (32)
Note that, from (A1), the integrals here are taken over the space of all possible states S := X^N (except for the space of the state c_t, the current time). Denoting the integrand of (35) by F_t(s_{1:t}, m), the following properties hold:
• F_t is integrable, that is, ∫_S ⋯ ∫_S |F_t(s_{1:t}, m)| ds_t ⋯ ds_1 < ∞ (because S is a bounded set, and |F_t(s_{1:t}, m)| is bounded from (A2) and (A4)).
• The derivative ∇_m F_t(s_{1:t}, m) exists for all s_1, s_2, …, s_t ∈ S (because F_t is a polynomial in π, and ∇_m π exists from (A2) and (A3)).
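The marginalization over actions that defines the transition probability under π can be illustrated with a discrete toy example. The paper's state space is continuous (hence the integrals above); this finite version, with made-up matrices, only shows the construction P^m[s][s'] = Σ_a π(a|s) P_a[s][s'].

```python
def policy_transition(policy_probs, per_action_P):
    """Discrete toy version of the marginalization in Appendix A:
    P^m[s][s'] = sum_a pi(a|s) * P_a[s][s'].
    policy_probs[s][a] is the policy's action distribution per state;
    per_action_P[a] is the transition matrix of action a.
    """
    n = len(per_action_P[0])
    P = [[0.0] * n for _ in range(n)]
    for a, Pa in enumerate(per_action_P):
        for s in range(n):
            pa = policy_probs[s][a]
            for s2 in range(n):
                P[s][s2] += pa * Pa[s][s2]
    return P

# Two actions, two states; each row of each matrix sums to 1.
P0 = [[0.9, 0.1], [0.2, 0.8]]
P1 = [[0.5, 0.5], [0.6, 0.4]]
pi = [[0.7, 0.3], [0.4, 0.6]]   # pi(a|s)
Pm = policy_transition(pi, [P0, P1])
print(Pm[0])  # [0.78, 0.22]
```

Mixing stochastic matrices with a probability vector again yields a stochastic matrix, which is why P^m is a valid transition probability under π.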

APPENDIX B COMPLEMENT TO PROOF OF THEOREM 2
This appendix provides lemmas about conditions (B1) and (B2) in the proof of Theorem 2.
Proof: Define Ψ_k := Σ_{k'=0}^{k} η(k') ξ(k'). Then Ψ_k is a martingale sequence, and we obtain the martingale inequality of Doob [40]. The claim then follows from (46) and (48).
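The martingale construction used in this proof can be illustrated numerically: with zero-mean noise ξ(k'), the partial sums Σ_{k'} η(k') ξ(k') form a martingale whose endpoint mean stays near zero, and Doob's inequality bounds the probability of a large excursion by the second moment of the endpoint. The sketch below uses a constant step size and uniform noise, which are assumptions for illustration, not the paper's actual quantities.

```python
import random

random.seed(0)

def martingale_path(K, eta=0.1):
    """Partial sums psi_k = sum_{k'=0}^{k} eta * xi(k') with zero-mean
    noise xi; E[psi_k | past] = psi_{k-1}, so the sequence is a
    martingale (toy version of the construction in Appendix B).
    """
    psi, path = 0.0, []
    for _ in range(K):
        xi = random.uniform(-1, 1)   # zero-mean noise term
        psi += eta * xi              # step size eta taken constant here
        path.append(psi)
    return path

# Monte Carlo check: the endpoint mean over many runs stays near 0,
# and the excursion frequency P(max_k |psi_k| >= lam) can be compared
# with Doob's bound E[psi_K^2] / lam^2.
runs = [martingale_path(50) for _ in range(2000)]
mean_end = sum(p[-1] for p in runs) / len(runs)
lam = 0.5
second_moment = sum(p[-1] ** 2 for p in runs) / len(runs)
exceed = sum(1 for p in runs if max(abs(v) for v in p) >= lam) / len(runs)
print(round(mean_end, 3), round(exceed, 3), round(second_moment / lam ** 2, 3))
```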