Deep Reinforcement Learning-Based Multi-Objective Adaptive Route Planning for Travel Recommender Systems

In travel recommender systems, a key business requirement is to make adaptive route planning and navigation decisions based on computational intelligence. Existing research mostly lacks the ability to adaptively suggest route planning schemes for users according to contextual features. To address this problem, this paper proposes a deep reinforcement learning (DRL)-based multi-objective adaptive route planning method for travel recommender systems. In particular, the technical framework combines a value iteration network (VIN) module with a positioning module to construct a DRL algorithm tailored to this task. It feeds the agent's environment map into the VIN module, obtains the value distribution corresponding to the map through iterated convolutions, and extracts features through self-positioning. This framework is expected to improve the generalization ability of navigation strategies. Empirically, this work selects several popular scenic spots in a certain region as the research object for a case study, and collects road network traffic data and scenic spot parameter data through two public travel service platforms. The experimental results show that the suburban tourism route planning model based on deep reinforcement learning scene perception is reasonable and feasible, and can effectively improve tourist satisfaction.


I. INTRODUCTION
In recent years, with the continuous development of tourism, people's consumption concepts and travel demands have undergone profound changes [1]. At present, more and more tourists choose self-service travel. Independent travel is popular because its freedom, flexibility, and economy appeal to many tourists [2]. When tourists choose self-guided travel, the first problem to solve is route planning. A reasonably designed tour route not only helps travelers choose and arrange their activities purposefully, but also keeps them from ''roaming'' aimlessly in the scenic area, saving time and expense [3], [4]. However, for self-guided travelers facing numerous attractions, limited travel time, and unfamiliar surroundings, it is difficult to draw up a reasonable itinerary that serves their purpose [5]. Therefore, it is necessary and urgent to provide a system platform that can both meet users' travel planning needs and track the scenic area environment in a timely manner.
With the development of technology and breakthroughs in computer hardware performance, big data analysis has become a handy tool for researchers [6]. Applying data mining to the tourism market is an important technological breakthrough in tourism research, and tourism route planning is an indispensable part of smart tourism research and travel recommender system development [7], [8]. Relying on machine learning and deep learning, together with the vigorous development of their supporting technology and hardware, models have greatly improved the accuracy of mining and predicting tourist interests [9]. The core of a travel recommender system is the tourist. Thus, system and algorithm design should start from tourists' interests and motivations [10]. Tourists' satisfaction with the routes planned by the system directly affects their subjective evaluation of the scenic spots in a tourist city, and indirectly affects their travel planning as well as their adherence to the system [30]. On a tourist-centered basis, the optimal route is planned around tourists' interests, time, budget, experience, and other factors. A personalized travel route can bring tourists the best travel experience [11].
As an important branch of machine learning, deep learning has produced many classical neural network architectures and algorithm models. Faced with huge volumes of tourism data, humans can hardly discover rules and knowledge in a short time, but deep learning models can mine much invisible knowledge and useful information that humans can scarcely detect, thereby solving the problem of accurately recommending tourism preference information [12]. Aiming at the problem of navigation generalization in random environments, this paper designs a neural network structure based on a VIN module and a cross-information positioning module, and uses an overhead map and first-person-view input for scene-aware autonomous route planning. The network uses a convolutional neural network for visual feature extraction and the VIN module for map feature extraction, and treats the positioning task as an attention mechanism that filters effective information, greatly reducing the number of parameters.
For the Markov navigation task, the A3C algorithm is used to train the agent's policy and value function in the environment. To address training efficiency, experience replay is used to exploit historical information. A tourism route planning model that considers only a single factor can neither describe the actual situation of tourists faithfully nor effectively improve the satisfaction of self-driving tourists. Therefore, considering the impact of comprehensive road resistance, self-driving tour cost, and scenic spot visit satisfaction on overall tourist satisfaction, an evaluation index system of tourist satisfaction is constructed. The AHP-entropy weight method is selected to analyze the weight contribution of different indicators to the tourist satisfaction evaluation, establishing the weight of each level of indicators. On this basis, a suburban tourism route planning model based on tourist satisfaction is established, and deep reinforcement learning is used to solve it. The effectiveness and feasibility of the proposed method are demonstrated through a case analysis, providing decision support for tourists' self-driving travel.
The rest of this paper is organized as follows. Section II discusses related work. Section III proposes a scene-aware autonomous route planning method based on deep reinforcement learning. Section IV presents and analyzes the experimental results. Section V concludes the paper.

II. RELATED WORK
Zhang et al. [13] use context-aware methods based on collaborative filtering to make recommendations. This work defines a context-dependent recommendation system aimed at suggesting relevant points of interest (POIs) to visitors. First, user and POI data are collected and sorted, and a user fuzzy classifier is obtained with the fuzzy C-means algorithm. A POI fuzzy classifier is then built with a POI clustering method. Association rule mining is carried out by combining the user fuzzy classifier, user history selection data, and the POI fuzzy classifier. When a user requests a recommendation, the fuzzy classifier first classifies the user. It then analyzes the context of the user profile, searches the user's association rules, computes the categories of attractions and the specific attractions the user is interested in, and finally outputs a recommended arrangement for the tourist to choose from.
The synergy between context-aware computing and collaborative filtering improves the accuracy of the recommendation system, bringing its functionality close to user requirements [14]. Yu et al. [15] adopt a model-based collaborative filtering approach. Bayesian networks are used to calculate the probability of a tourist attraction, which represents the extent of visitors' interest in it. Recommended routes and attractions are provided via Google Maps. Based on the Engel-Blackwell-Miniard (EBM) model, the Google Maps interactive interface serves as a front end for displaying routes and attractions. By combining the EBM model with a Bayesian network, the study proposes the Intelligent Tourist Attraction System (ITAS), which predicts tourist attractions well and helps tourists make travel decisions.
Reyes-Rubiano et al. [16] combined particle swarm optimization, local search, and the ant colony algorithm into a hybrid ant colony-particle swarm optimization-2-Opt algorithm, with satisfactory test results. A local search operator is proposed and combined with the ant colony algorithm; it deletes and inserts cities to improve solution quality. The algorithm is efficient and effective on the traveling salesman problem. Ma et al. [17] discussed parameter setting for the ant colony algorithm on the traveling salesman problem, with results of reference value; they proposed a multi-pheromone dynamic-updating ant colony algorithm, verified on travel service recommendation system data with good performance. Liu et al. [18] proposed an improved ant colony algorithm based on a ''survival of the fittest'' evolutionary strategy, which works well on large-scale traveling salesman problems.
Building on the ant colony algorithm, Hung et al. [19] put forward a chance encounter algorithm that enlarges the solution space and applied it to tourism route planning, showing a degree of effectiveness and practicability. Lim et al. [20] combined the ant colony algorithm with simulated annealing to propose an improved ant colony algorithm; simulated annealing selects and iterates the paths produced by the ant colony algorithm to reach a global optimum, and the method is applied to planning routes among tourist attractions. Yu et al. [21] use deep reinforcement learning to quickly find feasible solutions to the traveling salesman problem and convert them into pheromones for the ant colony algorithm, which then solves the problem and is applied to scenic spot route planning.
Ruch et al. [22] combine a classic travel recommendation algorithm with a personalized recommendation algorithm and introduce the concept of key users according to the number of trips users take. Clustering users around key users reduces the time consumed by clustering all users. Yha et al. [23] improved traditional item-based collaborative filtering, set up a scenic spot evaluation index system based on attribute values, and then calculated the similarity between scenic spots. Qiu et al. [24] calculate users' preferences for each topic through an LDA model and the Bayesian formula, complete the user-scenic spot rating matrix through a clustering algorithm, and then generate the recommendation list.
Recommendation credibility and quality credibility formulas are introduced to combine with the rating credibility of scenic spots, and top-N recommendations are made for users. The Apriori algorithm is used to mine the travel patterns of users belonging to a class. By calculating the change in visitor flow at different scenic spots over 24 hours, Lin et al. [25] capture people's demand for scenic spots at different times and recommend more suitable scenery to users. Park [26] fused the PB-CF algorithm by calculating the similarity of user check-in labels and the similarity of distance, and then incorporated similar users, trusted users, and historical geographical locations in social relations into the traditional H-CF algorithm to form a new SL-CF algorithm, which is significant for recommendation in unknown services and social networks.

III. METHODOLOGY

A. RESEARCH IDEAS
The main research task of this paper is to realize autonomous route planning in unknown environments through first-person visual observation and a global overhead map in a 3D visual environment, with reinforcement learning training over a large number of maps. The first difficulty is feature extraction from visual input. Perceiving the environment and making decisions through vision is a complex and important task. Lidar and similar sensors have been widely used in SLAM, but they have disadvantages such as high cost and insufficient information. Compared with such sensors, visual information is richer: it supports judging not only distance but also semantics, with stronger robustness. Convolutional neural networks have achieved good results in image recognition, classification, and other tasks, although their decisions are often hard to interpret. For end-to-end training, images, with their rich information, can often be put to better use.
The second difficulty is cross-source positioning. In a random environment, it is often hard for an agent to estimate its own position effectively, and positioning cannot be achieved from first-person information alone. At the same time, the agent's position is not always observable in the top view because of occlusion by walls. Therefore, fusing first-person information with the top view is more conducive to the agent estimating its own position.
The third difficulty lies in feature extraction from the map input. In this experiment, the top view is used as the map input, which is harder to convert into corresponding features than a 2D map. Mapping the map input to features that guide the agent's navigation is therefore a major difficulty in map understanding. The VIN module is a differentiable neural network module that maps the reward feature map of the environment to value-related features of the map. The map is mapped to reward features through a convolutional neural network, and features are then extracted through the VIN module, which benefits the extraction of map-related features.
The fourth difficulty lies in the low training efficiency of deep reinforcement learning. Borrowing the experience replay technique of the DQN algorithm, this paper stores historical data in an experience pool and randomly samples historical data from the pool for training alongside the current data, improving sampling efficiency.

B. PROBLEM DESCRIPTION AND ALGORITHM
In this paper, the A3C (Asynchronous Advantage Actor-Critic) algorithm is used to train the agent. A3C is an improvement on the Actor-Critic algorithm. The Actor-Critic algorithm combines value-function-based and policy-gradient-based methods and enjoys the advantages of both. It consists of two parts, the actor and the critic: the actor makes decisions through action probabilities and is optimized via the policy gradient, while the critic evaluates the actor's behavior and is optimized via the value function. The agent learns the actor's policy and makes decisions guided by the critic module.
The agent obtains a state from the environment and makes decisions through the policy network. The agent interacts with the environment through these decisions, producing state transitions and rewards [27]. The value network evaluates the current state, and the policy is updated according to the evaluation; at the same time, the value network updates itself according to the rewards the agent obtains, making its evaluation more accurate. The critic steers policy adjustment through the advantage function, defined as

A(s_t, a_t) = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t).

The policy gradient update is

\nabla_{\theta} J(\theta) = \mathbb{E}\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t) \right].

A3C adds the policy entropy as a regularization term to promote exploration, so the overall objective of the policy update is

\nabla_{\theta} J(\theta) = \mathbb{E}\left[ \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, A(s_t, a_t) + \beta \nabla_{\theta} H(\pi_{\theta}(\cdot \mid s_t)) \right],

where H is the entropy and \beta its coefficient. Meanwhile, the cumulative return R_t = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) is used to update the value function by minimizing

L_V = \left( R_t - V(s_t) \right)^2.

A3C trains the Actor-Critic algorithm by starting multiple threads simultaneously. The main parameters are stored in a global network; each process copies the parameters from the global network to make decisions, computes gradients every t steps, and sends the gradients back to the global network for updating. In this way, each independent agent interacts with an independent environment, breaking data correlation and improving efficiency.
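To make the update above concrete, the following is a minimal sketch of the combined A3C loss in PyTorch for a single rollout of length T; the tensor shapes, the bootstrap from the final value estimate, and names such as a3c_loss and beta are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def a3c_loss(logits, values, actions, rewards, gamma=0.99, beta=0.01):
    """Combined A3C loss for one rollout of length T.

    logits:  (T, num_actions) policy logits
    values:  (T,) state-value estimates V(s_t)
    actions: (T,) actions taken
    rewards: (T,) rewards received
    """
    T = rewards.shape[0]
    # n-step discounted returns, bootstrapped from the last value estimate
    returns = torch.zeros(T)
    R = values[-1].detach()
    for t in reversed(range(T)):
        R = rewards[t] + gamma * R
        returns[t] = R
    advantages = returns - values.detach()             # A(s_t, a_t)

    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    chosen = log_probs[torch.arange(T), actions]

    policy_loss = -(chosen * advantages).mean()        # policy-gradient term
    entropy = -(probs * log_probs).sum(dim=-1).mean()  # exploration bonus
    value_loss = F.mse_loss(values, returns)           # critic regression

    return policy_loss - beta * entropy + 0.5 * value_loss
```

In an asynchronous setup, each worker would compute this loss on its own rollout and push the resulting gradients to the shared global network, as described above.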

C. FEATURE EXTRACTION MODULE OF MODEL STRUCTURE
This algorithm mainly uses the VIN module to extract map features, uses cross-source information to predict the agent's location distribution and extract features, and uses an LSTM module to give the model memory [28]. The policy and value function are trained with a deep reinforcement learning algorithm; the specific structure is shown in Figure 1. The VIN module estimates the value function over the map feature f_map, and a location-based attention mechanism extracts the relevant information. This is concatenated with the image feature f_image processed by the recurrent neural network module and fed into the policy and value network module, which outputs the policy and the value estimate.
Since the agent's input is image input, we use a convolutional neural network to process it. For the first-person observation, the visual input is 84 x 84 pixels; we use convolution kernels for feature processing and adjust the kernel sizes according to the size of the map. For the top view, since the output of the VIN module is filtered by an attention mechanism built on the location probability distribution, the value function distribution of the VIN must match the actual number of states in the map; that is, the feature map size must equal the actual map size. Depending on the map size, the map input is scaled accordingly. In a convolutional neural network, the relation between the input and output sizes of a convolution is generally

N_{out} = \frac{N - F + 2P}{S} + 1,

where N is the input size, F is the convolution kernel size, P is the padding, and S is the stride. Taking a 13 x 13 map as an example, three convolution kernels of sizes 8 x 8, 6 x 6, and 4 x 4 were selected to extract map information in this experiment, with a 352 x 352 map visual input.
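As a quick check of the size relation above, the snippet below computes the chain of feature map sizes; the strides (4, 3, 2) are an assumption chosen so that the 352 x 352 top view reduces to a 13 x 13 feature map, since the paper does not state its strides.

```python
def conv_out(n, f, p=0, s=1):
    """Output size for input N, kernel F, padding P, stride S."""
    return (n - f + 2 * p) // s + 1

n = 352
for f, s in [(8, 4), (6, 3), (4, 2)]:
    n = conv_out(n, f, s=s)
    print(f"kernel {f}x{f}, stride {s} -> {n}x{n}")
# 352 -> 87 -> 28 -> 13, matching the 13 x 13 map grid
```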

D. VIN MODULE AND PRINCIPLES
First, we assume that the agent's map observation state is s, and that the reward distribution and state transition distribution corresponding to the map environment are \phi_R(s) and \phi_P(s), respectively. Both depend on the map observation state. As feature maps, the reward function and state transition function can be learned from the map state. For example, in a 2D map, the target's location can be mapped to the part of the feature map with higher reward, while obstacles can be mapped to parts with negative reward. For 3D maps, we assume this reward distribution does not necessarily correspond to the real rewards in the map, but reward- and obstacle-related features can still be extracted.
In a 2D grid map, each location is usually regarded as a state, and the value of the current state depends on the values of the surrounding states:

V^*(s) = \max_{a} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s') \right],

where V^*(s) is the optimal value of the current state, V^*(s') is the optimal value of an adjacent state, R(s, a) is the reward at the current state, and P is the state transition function. This process can be regarded as a Markov process: the value of a state depends only on the values of its neighbors and the reward obtained at the current position, and not on more distant states. We can therefore use the reward feature map to approximate the value distribution of the map by iterating over time. The state-action value function can be described in the same iterative manner.
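The following toy numpy sketch illustrates this Bellman backup on a small 2D grid; the 5 x 5 reward layout and deterministic four-neighbour moves are illustrative assumptions, not the paper's environment.

```python
import numpy as np

H, W, gamma, iters = 5, 5, 0.9, 50
R = np.full((H, W), -0.1)   # small step penalty everywhere
R[4, 4] = 1.0               # goal cell carries positive reward
V = np.zeros((H, W))        # value initialised to zero

for _ in range(iters):
    # value of each neighbour, with out-of-bounds moves staying in place
    up    = np.vstack([V[:1], V[:-1]])
    down  = np.vstack([V[1:], V[-1:]])
    left  = np.hstack([V[:, :1], V[:, :-1]])
    right = np.hstack([V[:, 1:], V[:, -1:]])
    # V*(s) = max_a [ R(s) + gamma * V*(s') ]
    V = R + gamma * np.max(np.stack([up, down, left, right]), axis=0)
```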
For MDP planning, this iterative process resembles the application of a convolution kernel in a convolutional layer, so a convolution operation can be used instead, with the convolution kernel acting as the learnable weights [29].
The basic principle of the VI module is to realize the Markov process through learnable weights, since the iterative algorithm resembles a convolution. For convolutional networks, the general operation can be written as

\bar{Q}_{a,i,j} = \sum_{l,i',j'} W_{l,i',j'}^{a} R_{l,\,i-i',\,j-j'},

where i and j are coordinates, a indexes the channel, W is the convolution kernel, and R is the input feature map. The iterative algorithm in the VIN module can thus be viewed as the reward and value functions operating together through convolution and pooling layers: the convolution kernel plays the role of the transition function, the value function is the target output, and the reward function is the reward feature learned from visual input. Each channel of the convolution layer corresponds to the Q function of a specific action, and the pooling operation computes the optimal value:

V_{i,j} = \max_{a} \bar{Q}_{a,i,j}.
We design the VI module as follows: a multi-layer convolutional neural network approximates the value function distribution and is optimized through backpropagation. After several iterations, the value distribution V over the map and the distribution Q over the corresponding actions are obtained. Since the initial value distribution is unknown to the agent, it is initialized to 0 and refined by iterating with the reward distribution, which is learned from the map observation by the neural network. We use a convolutional neural network to process the output of the feature extraction module to estimate the reward distribution, and then combine it with the VI module; the combination is called the VIN module. Its overall structure is shown in Figure 2.
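A compact sketch of such a VIN module, following the standard value iteration network design, is given below in PyTorch; the layer widths, number of iterations K, and number of Q channels are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VIN(nn.Module):
    def __init__(self, in_channels=3, hidden=64, q_channels=10, K=20):
        super().__init__()
        # map observation -> reward feature map (one reward channel)
        self.reward_net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )
        # shared transition kernel applied to [reward, value]
        self.q_conv = nn.Conv2d(2, q_channels, 3, padding=1, bias=False)
        self.K = K

    def forward(self, map_obs):
        r = self.reward_net(map_obs)               # (B, 1, H, W) reward map
        v = torch.zeros_like(r)                    # value initialised to 0
        for _ in range(self.K):
            q = self.q_conv(torch.cat([r, v], 1))  # one Q channel per action
            v, _ = torch.max(q, dim=1, keepdim=True)  # max-pool over actions
        return q, v                                # Q and V distributions
```

Here each iteration performs one convolutional Bellman backup, so K roughly bounds how far value information can propagate across the map.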

E. POSITIONING MODULE AND ATTENTION MECHANISM

1) NETWORK DESIGN OF POSITIONING MODULE
In this module, the outputs of the feature extraction module are concatenated, and the agent's position is estimated through two fully connected layers. The agent only needs a rough estimate of the region it occupies rather than precise location information, and the resulting probability distribution over locations is convenient to use as an attention mechanism for filtering.
This paper divides the map into grid cells and treats estimating the agent's region as a classification task, rather than regressing the agent's horizontal and vertical coordinates directly, which reduces training complexity. The optimization objective is

L_{loc} = -\sum_{i} P(x_i) \log p(x_i),

where P(x_i) is the true probability that the agent is at location x_i, and p(x_i) is the softmax output of the neural network, representing the predicted probability distribution of the agent's location.
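A minimal sketch of this objective is shown below, assuming a two-layer fully connected head over fused features and a 13 x 13 cell grid; all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

grid_cells = 13 * 13                      # assumed number of map cells
loc_head = nn.Sequential(                 # two fully connected layers
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, grid_cells),
)
criterion = nn.CrossEntropyLoss()         # -sum_i P(x_i) log p(x_i)

features = torch.randn(8, 512)            # fused map + first-person features
true_cell = torch.randint(0, grid_cells, (8,))
loss = criterion(loc_head(features), true_cell)
```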

2) ATTENTION MECHANISM BASED ON LOCATION
Common attention mechanisms fall into two categories: soft and hard. Soft attention is differentiable and can be trained by backpropagation, while hard attention attends to information at a single position and is usually realized by sampling, i.e., a single position is 1 and all others are 0, which cannot be trained by backpropagation. The soft attention mechanism can be defined as

att(q, X) = \sum_{i} a_i x_i, \quad a_i = \mathrm{softmax}\big(s(x_i, q)\big),

where X is the input, q is the query vector, a_i is the attention distribution computed from the query vector by the scoring function s, and att(q, X) is the output extracted by the attention mechanism. In this experiment, the agent has an auxiliary positioning task that estimates the agent's position distribution p(x_i) from the map observation and the first-person observation. We therefore use this location distribution in place of a learned attention distribution to extract the value estimate Q output by the VIN.
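The sketch below shows how the predicted location distribution can stand in for a learned attention distribution over the VIN output; all shapes and names are illustrative assumptions.

```python
import torch

B, A, H, W = 8, 10, 13, 13
q_map = torch.randn(B, A, H, W)            # VIN output: Q per action per cell
loc_logits = torch.randn(B, H * W)         # positioning module output

loc_prob = torch.softmax(loc_logits, dim=1).view(B, 1, H, W)
# att(q, X) = sum_i a_i * x_i, with a_i given by the location distribution
q_at_agent = (q_map * loc_prob).sum(dim=(2, 3))   # (B, A) filtered Q features
```

Because the positioning head is already trained as a supervised task, this reuses its output as free attention weights instead of learning a separate query.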

F. RECURRENT NEURAL NETWORKS AND POLICY AND VALUE NETWORKS
For the agent, the environment is partially observable: the observation o_t that the agent obtains from the environment is the first-person image I_t at time t, which often cannot fully reflect the agent's state. Rather than mapping features extracted from single observations to corresponding states, it is better to learn the policy directly from the sequence of observed features. The Long Short-Term Memory (LSTM) network is a common recurrent neural network that alleviates the long-term dependency problems of plain Recurrent Neural Networks (RNNs), facilitating sequence modeling.
We use long short-term memory networks to implicitly approximate the environmental state from a series of observed features.
To speed up training, this module uses a single-layer LSTM: the features obtained from first-person observations over the history are fed into the LSTM layer, under the assumption that these features represent the state conveyed by the first-person view. The LSTM output is concatenated with the feature extracted by the positioning module, and the policy probability distribution and value function are output through separate fully connected layers. For the value function V, a fully connected layer outputs a state-value estimate, supervised with an L2-norm loss. For the policy distribution, a fully connected layer outputs an array the size of the action space, and a softmax makes the entries sum to 1, so that each output is the probability of performing the corresponding action.
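A minimal sketch of these recurrent policy and value heads is given below, assuming a single-layer LSTM and illustrative feature sizes.

```python
import torch
import torch.nn as nn

class PolicyValueHead(nn.Module):
    def __init__(self, feat_dim=256, loc_dim=128, hidden=256, num_actions=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.policy = nn.Linear(hidden + loc_dim, num_actions)
        self.value = nn.Linear(hidden + loc_dim, 1)

    def forward(self, img_feats, loc_feats, state=None):
        # img_feats: (B, T, feat_dim) first-person features over time
        # loc_feats: (B, T, loc_dim) features from the positioning module
        h, state = self.lstm(img_feats, state)
        x = torch.cat([h, loc_feats], dim=-1)
        logits = self.policy(x)                      # softmax -> pi(a|s)
        return torch.softmax(logits, -1), self.value(x), state
```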

G. EXPERIENCE REPLAY MECHANISM
A3C is an on-policy algorithm: the target policy and the exploration policy are the same, only current data is used to update parameters, and historical data is discarded. This module uses an experience replay mechanism to make use of historical information.
Experience replay is a mechanism applied in DQN that can effectively improve training efficiency: historical experience is stored in a cache and sampled by some algorithm to update network parameters. Applying it to the on-policy A3C algorithm, however, requires some adaptation.
The positioning task is a supervised task that depends only on the current map observation. Whether the training data for positioning is historical or current experience therefore does not affect the stability of parameter updates, while it improves data utilization efficiency and convergence speed.
For updating the deep reinforcement learning components, namely the policy network and the value function network, historical experience differs substantially from current experience. For the policy gradient, using historical experience makes the target policy inconsistent with the actual exploration policy, which is called an off-policy update. By the importance sampling theorem, an expectation under one probability density can be estimated by sampling under another:

\mathbb{E}_{x \sim f}[g(x)] = \mathbb{E}_{x \sim q}\left[ g(x)\, \frac{f(x)}{q(x)} \right],

where f(x) is the target distribution and q(x) is the sampling distribution. When sampling under f(x) is impossible, one can sample under q(x) and correct with the density ratio. For the policy update, however, one must not only compute the historical and current policy distributions but also estimate the values of historical states accurately and guarantee enough samples to keep the estimate of the current policy unbiased. Here, we use historical information only to update the value function and the positioning parameters; that is, the replay update target is

L_{replay} = \left( R_t - V(s_t) \right)^2 + L_{loc}.

For the value function, the bootstrapping in its estimation is also affected by differences in policy distribution. Instead of complex corrections such as V-trace, this module uses a queue as the buffer, ensuring that the policy distribution of stored experience does not drift too far from the current policy distribution.
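A minimal sketch of this queue-based replay is shown below; the capacity, batch size, and transition layout are assumptions, and sampled batches feed only the value and positioning losses, as the text specifies.

```python
import random
from collections import deque

replay = deque(maxlen=2000)   # old transitions fall off the queue, so stored
                              # experience stays close to the current policy

def store(transition):
    """transition: (map_obs, first_person_obs, true_cell, n_step_return)"""
    replay.append(transition)

def sample_batch(batch_size=32):
    """Sampled batches update only V and the positioning head,
    never the policy-gradient term."""
    return random.sample(list(replay), min(batch_size, len(replay)))
```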

IV. RESULTS AND ANALYSIS

A. EXPERIMENTAL DATA COLLECTION
The case analysis in this paper is based on road traffic data for a certain region, all obtained from an intelligent transportation public service platform.
The platform provides average road speed, road congestion delay index, and road length data for the target road network. Through unified processing, traffic indexes such as average travel time, congestion delay index, and road length can be obtained, providing reliable and real data support for the analysis of tourism route planning. The experiment selects data from three characteristic traffic periods, the off-peak, morning peak, and evening peak periods, for model verification, which reflects changes in road network traffic status across periods more comprehensively. To show the collected traffic data types, variable units, and numerical forms directly, we take the traffic data of various road sections on the Autonavi platform on February 15, 2023 as an example; statistics for some of the data at different time periods are shown in Figure 3.
The collected tourist destination data includes required visiting time, scenic spot ratings, service time, and geographical location information, which describe the basic attributes of scenic spots in some detail and provide complete basic data for route planning. The basic attribute information of the target attractions is shown in Figure 4.

B. EXPERIMENTAL DATA PREPROCESSING
Given the differences in type and order of magnitude among the influencing parameters of traffic sections, the obtained data on section length, congestion index, and travel time are preprocessed to provide a high-quality data source for computing the comprehensive road resistance and for determining index weights with the entropy weight method.
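The paper does not give its exact preprocessing formula; the min-max standardisation below is one common choice before the entropy weight method, shown here as an assumed sketch.

```python
import numpy as np

def min_max(x, positive=True):
    """Scale a 1-D indicator to [0, 1]; cost-type (negative) indicators are
    inverted so that larger is always better."""
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    scaled = (x - x.min()) / rng if rng else np.zeros_like(x)
    return scaled if positive else 1.0 - scaled

travel_time = [12.0, 18.5, 9.3]        # minutes: a cost-type indicator
print(min_max(travel_time, positive=False))
```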
Different scenic spots require different visiting and service times, and each scenic spot has its own service time window. The attribute data of scenic spots is therefore preprocessed to better match their actual service situations. Based on the acquired basic information, the expected service time windows of the scenic spots are shown in Figure 5.

C. DETERMINING THE SATISFACTION WEIGHT INDEX USED IN TOURISM RECOMMENDATION SYSTEMS
After standardizing the tourist satisfaction evaluation indexes, the hierarchical judgment matrices built from the quantified satisfaction indexes were checked first: the consistency ratios (CR) of the first-level and second-level index layers over the different time periods were 0.0043, 0.0007, and 0.0128, all below 0.1. The judgment matrices of the tourist satisfaction evaluation indexes therefore have satisfactory consistency, and the AHP weights of the evaluation indexes are determined. Next, entropy weights are determined from the degree of variation in the evaluation index data. Finally, the Lagrange multiplier method is used to combine them into the hierarchical weights of the tourist satisfaction evaluation indexes. Table 1 shows the judgment matrices of the index layers constructed with the analytic hierarchy process.
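The sketch below illustrates the AHP consistency check and entropy weighting on a toy 3 x 3 judgment matrix (not Table 1); the geometric combination at the end is one common form of the Lagrange-multiplier combination, assumed here for illustration.

```python
import numpy as np

A = np.array([[1, 3, 5],
              [1/3, 1, 2],
              [1/5, 1/2, 1]])              # toy pairwise judgment matrix
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
w_ahp = np.abs(eigvecs[:, k].real)
w_ahp /= w_ahp.sum()                       # AHP weights (principal eigenvector)

n = A.shape[0]
CI = (eigvals.real[k] - n) / (n - 1)       # consistency index
RI = {3: 0.58, 4: 0.90, 5: 1.12}[n]        # Saaty's random index values
CR = CI / RI                               # must be < 0.1 for consistency

# entropy weights from a (samples x indicators) data matrix X in [0, 1]
X = np.random.rand(20, 3)
P = X / X.sum(axis=0)
E = -(P * np.log(P + 1e-12)).sum(axis=0) / np.log(len(X))
w_entropy = (1 - E) / (1 - E).sum()

w = np.sqrt(w_ahp * w_entropy)
w /= w.sum()                               # combined weight (geometric form)
```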

D. PARAMETER SETTINGS
To further verify the dynamic inter-attraction path selection model based on comprehensive road resistance and the tourism route planning model based on tourist satisfaction, and to keep the computed results close to the real situation of self-driving travel, the relevant parameters of the model are set and analyzed; the parameter values are listed in Table 2.
Since the model in this paper is solved by deep reinforcement learning, its running parameters need to be set. If the population size is too small, inbreeding occurs easily and competitive offspring are less likely to be generated; if the population size is too large, the algorithm struggles to converge and its robustness decreases. After several experiments and rounds of debugging, the population size was set to N = 60.
Second, following the principle of population diversity, and to reduce the probability of destroying excellent individuals, the crossover probability p_c and mutation probability p_m are set to 0.85 and 0.06, respectively.
Finally, to ensure full convergence of the algorithm while avoiding wasted time and resources, the maximum number of generations is set to 200.

E. RESULTS OF TOURISM ROUTE PLANNING
Based on the road network traffic parameters at different periods, this paper combines deep reinforcement learning with the Dijkstra algorithm to obtain the optimal suburban tourism route, according to the constructed suburban tourism route planning model that maximizes tourist satisfaction and the dynamic inter-attraction route selection model that minimizes comprehensive road resistance. The whole tour route forms a closed loop. From the dynamic optimal paths between scenic spots, the arrival time at each node and the actual visiting time of self-driving tourists at each scenic spot were further calculated, yielding the tourist satisfaction of each target scenic spot. The tour times in the city suburbs are shown in Figure 6.

F. RESULT ANALYSIS
To verify the validity of the variable selection in the tourism route planning model, models based on the shortest travel time and the shortest travel distance were used as controls. The control models were also solved with deep reinforcement learning, and the regional road network traffic parameters and basic scenic spot parameters were kept consistent with those in the satisfaction-based method. To further verify the superiority of the satisfaction-based model, the route planning results of the different models are compared and analyzed: the routes planned for shortest time and shortest distance are compared against the satisfaction-based planning model proposed in this paper. The time and distance index results of the different planning models are shown in Figure 7.
The route planning model based on deep reinforcement learning can obtain the shortest path between any pair of target nodes. Among the compared models, the route selected by the shortest-distance model has the shortest travel distance, while the routes planned by the satisfaction-based and shortest-travel-time models cover longer distances. In terms of road travel time, the shortest-travel-time model obtains the route with the shortest travel time, while the routes planned by the satisfaction-based and shortest-distance models take longer on the road.
According to the scenic spot waiting time index, route planning based on the tourist satisfaction model effectively reduces tourists' waiting time at scenic spots compared with the shortest-travel-time and shortest-distance models. As for the visiting time index, routes planned with the satisfaction model increase the visiting time at scenic spots compared with routes planned by the two control models. The satisfaction index results of the different planning models are shown in Figure 8.
To demonstrate the superiority of the proposed route planning model for time-varying road networks, a static satisfaction-based suburban tourism route planning model is used as a comparison, based on the pre-established partial road network of the region and the acquired model parameter data set. Specifically, the static deep learning route planning method based on tourist satisfaction takes only the morning-peak road network traffic data as model input and plans routes according to those input characteristics. From the static optimal paths between scenic spots, the actual arrival time of tourists at each target node and the actual visiting time at each scenic spot under the time-varying road network were then calculated, yielding the tourist satisfaction of each target scenic spot. The scene perception results of the planning model are shown in Figure 9.
The time and distance index results of the deep reinforcement learning method in this paper are generally better than those of the deep learning method. Compared with the deep learning route planning method, the satisfaction-based deep reinforcement learning method reduces both travel distance and road travel time. In terms of visiting time at scenic spots, the deep reinforcement learning route planning method proposed in this paper also outperforms the deep learning method.
Overall, the deep reinforcement learning travel route planning method in this paper outperforms the deep learning method: scenic spot tour satisfaction under the proposed planning model is improved, and the tourist dissatisfaction index is reduced.
Regarding model performance under a dynamic road network, this paper accounts for the time-varying characteristics of road network traffic and establishes a dynamic route planning model, whose results are better than those of the static planning method. This indicates that the proposed satisfaction-based travel route planning model is reliable under dynamic road networks.

V. CONCLUSION
Aiming at the generalization ability of deep reinforcement learning for navigation in random environments, this paper takes a top-view map and first-person images as input and proposes a model structure based on a VIN module and a positioning module. The VIN module extracts map features, cross-source information is used for positioning, and these are combined with first-person image features. The navigation performance of deep reinforcement learning in random environments is improved, and experience replay is used to raise data efficiency. The experimental results show that the optimal inter-attraction path based on minimum comprehensive road resistance is more reliable under time-varying conditions, providing effective path information support for suburban tourism route planning. According to the influence of comprehensive road resistance, self-driving tour cost, and scenic spot visit satisfaction on overall tourist satisfaction, an evaluation index system of tourist satisfaction was constructed, and the weights of the indexes at each level were established with the AHP-entropy weight method. A satisfaction-based tourism route planning model is established, and deep reinforcement learning is used to solve it.
The case analysis shows that the satisfaction-based route planning method can effectively reduce tourists' travel delays and improve satisfaction. However, for the multi-objective constrained mathematical model of scenic spots, more scientific data collection methods should be adopted to improve data accuracy. The scenic spot data used in this paper's experiments is small, so the extracted scenic spot information is relatively simple. To generalize the model to more tourist attractions, data mining and similar technologies can be adopted to obtain more, and more detailed, tourism data, making the research results more universal.
This article assumes a one-day travel route. In fact, on many short trips, tourists visit a city for more than one day. Future work should, while increasing the number of tourist attractions, ensure that tourists obtain fresh travel experiences beyond their existing interests and preferences. The next step could include more attractions and venues, expanding the attraction search buffer neighborhood; the algorithm could also let tourists set their own number of travel days, providing more humane services.

FIGURE 2. Basic structure of VIN module.

FIGURE 3. Partial traffic data at different time periods.

FIGURE 4. Statistics of scenic spot attribute information.

FIGURE 8. Satisfaction index results based on different planning models.

FIGURE 9. Scene perception results of the planning model.

TABLE 1. Hierarchical judgment matrix of first-level index.