Learning-Based Human Segmentation and Velocity Estimation Using Automatic Labeled LiDAR Sequence for Training

In this paper, we propose an automatic labeled sequential data generation pipeline for human segmentation and velocity estimation with point clouds. Following the impact of deep neural networks, state-of-the-art network architectures have been proposed for human recognition using point clouds captured by Light Detection and Ranging (LiDAR). However, legacy datasets may only cover the image domain and lack important label information, and this limitation has hindered research progress to date. Therefore, we develop an automatic labeled sequential data generation pipeline in which we can control any parameter or data generation environment, and which provides pixel-wise, per-frame ground truth segmentation and pixel-wise velocity information for human recognition. Our approach uses a precise human model and reproduces a precise motion to generate realistic artificial data. We present more than 7K video sequences, each consisting of 32 frames, generated by the proposed pipeline. With the proposed sequence generator, we confirm that human segmentation performance improves when using the video domain compared to using the image domain. We also evaluate our data by comparing it with data generated under different conditions. In addition, we estimate pedestrian velocity with LiDAR using only data generated by the proposed pipeline.


I. INTRODUCTION
Robot navigation depends on real-time, precise, and robust sensing. A robot should recognize its surrounding environment and objects such as pedestrians and other robots. The robot is also required to be robust in various kinds of situations. Therefore, LiDAR is often employed to acquire accurate 3D information with a high sampling frequency. For example, LiDAR has been utilized for mapping processes [1] and robotics applications including robot navigation [2]. Human recognition with LiDAR is a very important task in robot navigation. In many LiDAR systems, several sensors scan to acquire 3D information, and the scanning frequency is very high. Therefore, the data from a single scan can be treated as a frame, in the same way that video data are processed. This is advantageous because it lets us exploit high-performance computer vision algorithms.
In the computer vision field, videos have been researched with various approaches including action recognition [3], video retrieval [4], and irregularity detection [5]. Because deep neural networks have had a positive impact on computer vision, especially in the image domain [6]-[8], neural networks for the video domain are now being actively researched [9]-[11]. Another topic in this field of research is semantic video object segmentation, which focuses on the detection and segmentation of object-like areas in video with predefined class labels. Since deep neural networks have achieved high performance in the image segmentation field, various network architectures have also been proposed for handling the video domain [12], [13]. However, the performance of a learning-based approach strongly depends on the training dataset, and available video datasets are not labeled at the pixel level over the entire frame. Therefore, many learning-based semantic video segmentation methods adopt weakly supervised learning to overcome the lack of ground truth [12], [13]. In this study, on the other hand, we tackled data collection by generating sequential data. We also focused on human segmentation with sequential data collected by LiDAR. In this paper, we define a 'frame' as the data from a single LiDAR scan, and a 'sequence' as the sequential data from continuous LiDAR scanning.
Human recognition with LiDAR data has been researched vigorously [8], [14], [15]. According to [15], human recognition performance decreases drastically as the distance between a human and the LiDAR increases. Because the number of points is inversely proportional to the square of the distance between a human and the LiDAR, distant humans are difficult to recognize with only the shape information detected from a single frame. On the other hand, a sequence provides not only shape but also motion information, so the recognition accuracy for distant humans can be improved by utilizing this motion information. We have confirmed that the sequence-based approach improves the accuracy of human detection compared to the frame-based approach.
Collecting a sufficient amount of labeled data requires significant investment in terms of both time and money. In this study, we develop an automatic 3D LiDAR data sequence generation pipeline for human detection and velocity estimation. In comparison to single frame data generation, data sequence generation is more challenging because we need to take into account the temporal consistency of the sequence. For example, all objects should be temporally continuous. In addition, the generated data sequence should be spatially and temporally realistic to ensure the network is accurately trained.
In [14], a method to generate a single frame of realistic 3D LiDAR data for human detection was proposed. In that method, a statistical human shape model [16] is used to build different types of precise human shapes. That model is very useful for generating a single frame of LiDAR data. However, this information is insufficient for generating a realistic LiDAR data sequence. For sequence generation, it is necessary to consider 1) the human walking model, 2) the human trajectory, and 3) the sensor position and pose. In this study, we incorporated a walking model and an observed walking trajectory. By combining this information with the LiDAR trajectory, the 3D LiDAR data sequence is generated together with human labels.
One of our goals is to accelerate research on learning-based video human segmentation with point cloud data. For that purpose, we have generated a large amount of labeled sequential data (more than 10K sequences) with the proposed data generation pipeline. We have trained several neural networks with different training policies. The trained networks have been evaluated on actual data collected by a real LiDAR sensor. All generated data and labeled real data are available at the following URL, together with the trained network weights and test sample code.
http://www.ok.sc.e.titech.ac.jp/res/LHD/
The remainder of this paper is organized as follows. We briefly review related work in Section II. In Section III, each step in the data generation procedure is explained in detail. The network architecture for sequence training is described in Section IV. The specific training policy and the experimental results are then discussed in Section V. Finally, we conclude our research with an outline of further improvements and potential future applications in Section VI.

II. RELATED WORK
Dataset of depth map. After the release of the Microsoft Kinect in 2010 [17], several RGB-D datasets have been published. RGB-D datasets for human recognition have also been provided, such as that for the re-identification of a person with RGB-D sensors [18], the BIWI RGBD-ID dataset [19], and the UPCV Gait dataset [20]. As the Kinect cannot measure depths greater than 10 [m], LiDAR sensors have been employed to handle depths over 10 [m]. In addition, LiDAR sensors are used in autonomous driving technology. In this field, the KITTI dataset [21] is widely used by many researchers [22], [23]. However, KITTI only provided 93K+ of depth data without labeling. Collecting labeled depth maps is still challenging, because pixel-wise or point-wise labeling of 3D depth data usually involves significant costs. Under these circumstances, the video game Grand Theft Auto has been used to collect data [8], [24], [25]. This approach may reduce the cost of data construction, but limitations remain. Grand Theft Auto is not designed for research purposes; therefore, we cannot control specific properties of the simulation, such as the human body type and the model deployment location. Automatic Labeled LiDAR Data [14], [15] have been released with 1M+ data including depth, xyz coordinates, and pixel-wise human labels. Automatic Labeled LiDAR Data may cover demands in the image domain; however, they are not sufficient to address the requirements of the video domain. In addition, depth map datasets cannot handle velocity information because they only contain single frames, and not sequences on the time axis.
Datasets for RGB video segmentation. Following the increase in demand for video datasets, several datasets have been published. For example, the Freiburg-Berkeley Motion Segmentation dataset [26] has been proposed for motion segmentation. In addition, SegTrack v1 [27] and v2 [28] have been published for video segmentation specialized in fast motion and complex changing shapes. In the case of video object segmentation, the DAVIS challenge [29] has been presented and updated since 2016. Accordingly, RGB video datasets have been vigorously proposed and updated. On the other hand, to the best of our knowledge, no large-scale, fully labeled video segmentation dataset for LiDAR has been proposed. Our research focuses on filling this gap.
To address these problems, we constructed a sequential data generation pipeline that enables us to change any parameters and environments. Further, the proposed generator creates human labels and velocity information in the process, thereby converting this normally manual task into computation cost. As a result, as long as sufficient computational resources are available, we can generate sequential data continuously.
FIGURE 1: Overview of our data generation pipeline. Bold words represent controllable parameters. (Green boxes: Human walking model) Human models are built by weight and height, then they are combined with a human walking motion. Thereafter, they are deployed following the sampled trajectory. (Blue boxes: Background sequence) Sensor trajectory and background LiDAR data are refined with a frame length of sequence generation. (Yellow boxes: LiDAR data generation) Human walking models are synthesized with depth map sequentially. The synthesized pixel-wise depth maps are labeled based on the information from the human model deployed location. Then, pixel-wise velocity information of sensor origin is generated according to human velocity, velocity command, and labeled information.

III. SEQUENTIAL DATA GENERATION PIPELINE
One of the main contributions of this paper is the automatic generation of labeled sequences of LiDAR data considering the precise human model and motion, without involving manual labeling. Unlike image generation, all of the object data must be connected in time-series for generating a sequence. The sequence generation pipeline comprises three steps: 1) background sequence collection, 2) human walking model generation, and 3) sequential LiDAR data generation. The details are described in the following subsections, and an overview of the pipeline is illustrated in Fig. 1.

FIGURE 3: Coordinate systems. An uppercase character indicates that its origin is the ground and a lowercase character indicates that its origin is the sensor.

A. SENSOR TRAJECTORY AND BACKGROUND SEQUENCE EXTRACTION
From the given sensor trajectory and background sequence, we extract a specified length of data. First, the start time is randomly selected. Then, we cut out the sensor trajectory and the associated background sequence from that start time with the specified time length. Those data are used in the sequential LiDAR data generation process described in Section III-C. The extracted sensor trajectory is also used for the velocity calculation and the human depth map generation process detailed in Section III-C. We can employ any LiDAR data as background LiDAR data, whether obtained from simulations [8], [30], from real-world recordings [31], [32], or captured by ourselves. The sensor trajectory is determined automatically once the LiDAR data are determined.
FIGURE: Example frames of a generated sequence in point cloud. Blue points denote the background and red points denote a human. Notice that human models are walking following their individual trajectory and the background scene is also changing following the LiDAR trajectory.
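The extraction step of Section III-A can be sketched as follows. This is a minimal illustration, assuming that the trajectory and the background sequence are stored as equally long, index-aligned Python sequences; the function name `extract_window` and this data layout are hypothetical and are not taken from the released code.

```python
import random

def extract_window(trajectory, background, length, rng):
    """Cut a random, time-aligned window out of one recording.

    trajectory: sequence of sensor poses, one entry per LiDAR scan (assumed layout).
    background: sequence of background depth maps with the same indexing.
    length:     number of consecutive frames to extract.
    rng:        a random.Random instance, so the draw is reproducible.
    """
    n = len(trajectory)
    if len(background) != n:
        raise ValueError("trajectory and background must be index-aligned")
    if length > n:
        raise ValueError("requested window is longer than the recording")
    # Draw the start index uniformly so that the whole window fits.
    start = rng.randrange(n - length + 1)
    stop = start + length
    return start, trajectory[start:stop], background[start:stop]
```

For example, `extract_window(poses, depth_maps, 32, random.Random(0))` would return one 32-frame window of the kind used for sequence generation.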

B. HUMAN WALKING MODEL
For the human walking model, we first build a human model based on the human body database [16]. The walking sequence is also observed. We can generate the human walking model by combining the observed walking sequence with the human model built from height and weight. We observed the human trajectory independently. Assuming that the human model, the walking model, and the human trajectory can be sampled independently, we can construct various kinds of human walking models with associated human trajectories. However, since we considered them independently, it is necessary to take into account the relationship between human velocity and stride length in order to construct a precise human walking model. For the human model database, we employed 'DhaibaWorks' [16] to build precise, functional 3D human models. DhaibaWorks supports editing and visualizing basic models such as 3D meshes and skeletal structures, including human models with motion [33]. Using DhaibaWorks, we can easily generate a specific human model by setting human parameters such as height, weight, and action status [34].

C. SEQUENTIAL LIDAR DATA GENERATION
In the sequential LiDAR data generation step, random human models, walking trajectories, and a LiDAR trajectory are sampled. Then, the depth map of the constructed human model is synthesized. Thereafter, the synthesized depth map of the human model and the background depth map associated with the LiDAR trajectory are combined to generate the training depth map for human segmentation. After a depth map is generated, the human models are resampled according to the relationship between stride length and the observed velocity in the trajectory information. Thereafter, the updated human models and the LiDAR position are relocated based on their trajectory information. By repeating this loop, we can generate the LiDAR sequence.
The entire coordinate system in this study is illustrated in Fig. 3. To synthesize the human model depth map, we virtually located the LiDAR sensor at $(X_s, Y_s, Z_s, Q_s)$ with respect to the ground origin, where $Q_s$ is the sensor quaternion. Then, the human model depth map was synthesized by virtually inserting the human model at its position on the sampled trajectory. Once the human model depth map is synthesized, it is simply combined with the background depth map by pixel-wise minimum depth selection. The depth map taken by the LiDAR sensor usually includes holes, that is, missing pixels whose depth could not be measured. We leave these holes as they are in the synthesis process because such holes also occur in a real sensing process. In addition, the human labeling task can be performed simultaneously because we know which pixels correspond to the human model depth map.
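The pixel-wise minimum depth selection and the simultaneous labeling can be sketched as below. This is an illustrative sketch rather than the released implementation: the hole encoding (`HOLE = 0.0`), the function name `compose_depth`, and the choice to keep a background hole as a hole even where a human surface is rendered are all assumptions made here.

```python
import numpy as np

HOLE = 0.0  # assumed encoding of a pixel without a depth measurement

def compose_depth(background, human):
    """Merge a rendered human depth map into a background depth map.

    Both inputs are (H, W) arrays of depths; HOLE marks 'no return' in the
    background and 'no human surface' in the rendered human map.
    Returns the composed depth map and a boolean human label map.
    """
    background = np.asarray(background, dtype=float)
    human = np.asarray(human, dtype=float)
    bg_valid = background != HOLE
    hu_valid = human != HOLE
    # Minimum depth selection: the human wins where it is in front of the
    # background. Background holes are left untouched (assumption).
    is_human = bg_valid & hu_valid & (human < background)
    depth = np.where(is_human, human, background)
    return depth, is_human
```

Because the label map falls out of the same comparison that builds the depth map, no separate labeling pass is needed.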
In the velocity calculation process, pixel-wise velocity information with respect to the sensor origin is generated. The human velocity with respect to the ground origin, $V_h$, and the sensor velocity with respect to the ground origin, $V_s$, can be calculated from the trajectory information. In addition, the transform function from the ground origin to the sensor origin can also be calculated using the sensor trajectory. Therefore, the velocity information for each pixel $p$ in a frame can be generated: pixels labeled as human are assigned the human velocity transformed to the sensor origin, and the remaining pixels are assigned the background velocity that results from the sensor motion, as detailed in Section V-A. Figure 4 shows examples of the generated sequence. The settings used to generate Fig. 4 are described in Section V-A.

IV. NETWORK ARCHITECTURE FOR SEQUENCE TRAINING
To utilize the generated sequences as training data, we designed a network architecture for human segmentation and velocity estimation with depth images, as shown in Fig. 5. Because convolutional layers can represent any architecture, we used a typical convolutional neural network architecture for feature extraction.
The architecture for segmentation is drawn in black in Fig. 5. First, we sampled depth images from the sequence. Thereafter, each input was passed through weight-shared convolutional layers to produce features. Next, we concatenated the features along the channel axis. Then, we applied another convolutional layer to the concatenated features as a decoder. For activation, we employed softmax for the segmentation task. With this architecture, we can tune the network by feeding the sampled depth sequence and the label of the last frame in the sampled sequence. We used categorical cross-entropy as the loss function of the segmentation task.
The architecture for velocity estimation is drawn in red in Fig. 5. We concatenated the segmentation result and the concatenated features along the channel axis. Then, we applied further convolutional layers to the re-concatenated features as a decoder. For activation, we employed a linear function for the velocity estimation task. We used the Mean Square Error (MSE) as the loss function of the velocity estimation. In addition, we excluded defective pixels in the input scene from the MSE calculation.
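The data flow of the segmentation branch can be illustrated with a toy forward pass. This is only a shape-level sketch under strong simplifying assumptions: each convolutional stage is replaced by a single 1 x 1 convolution (a per-pixel linear layer), a ReLU is assumed after the encoder, and the weights are random. The actual layer configuration is given in Fig. 5 and is not reproduced here.

```python
import numpy as np

def conv1x1(x, w, b):
    """Per-pixel linear layer: (H, W, Cin) with weights (Cin, Cout) -> (H, W, Cout)."""
    return x @ w + b

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def segment(frames, enc_w, enc_b, dec_w, dec_b):
    """frames: (T, H, W) depth images sampled from one sequence."""
    # Weight sharing: the same encoder parameters are applied to every frame.
    feats = [np.maximum(conv1x1(f[..., None], enc_w, enc_b), 0.0) for f in frames]
    # Concatenate the per-frame features along the channel axis.
    stacked = np.concatenate(feats, axis=-1)        # (H, W, T * C)
    # Decoder stand-in followed by softmax over {background, human}.
    return softmax(conv1x1(stacked, dec_w, dec_b))  # (H, W, 2)

rng = np.random.default_rng(0)
T, H, W, C = 4, 32, 1024, 8                         # C is an arbitrary feature width
frames = rng.random((T, H, W))
probs = segment(frames,
                rng.normal(size=(1, C)), np.zeros(C),
                rng.normal(size=(T * C, 2)), np.zeros(2))
```

The output holds one class distribution per pixel, which is compared against the label of the last frame during training.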
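Excluding defective pixels from the MSE amounts to a masked mean. The sketch below assumes that a boolean validity mask is available for the input scene; the function name `masked_mse` and the convention of returning 0 when no pixel is valid are choices made for this illustration.

```python
import numpy as np

def masked_mse(pred, target, valid):
    """Mean squared error over valid pixels only.

    pred, target: (H, W, 2) velocity maps (x and y components).
    valid:        (H, W) boolean mask, False where the input depth is defective.
    """
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        return 0.0  # convention chosen here for an all-defective scene
    diff = (np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))[valid]
    return float(np.mean(diff ** 2))
```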
For sequential data learning, we randomly selected one piece of sequential data from the training dataset at every step. Then, a certain frame length of sequentially sampled training data could be generated from the selected sequence.
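The per-step sampling can be sketched as follows. The dataset layout (a list of dictionaries with `depth` and `label` lists) and the function name `sample_clip` are hypothetical; the sketch only shows that one sequence is drawn at random, a contiguous run of frames is cut from it, and the label of the last sampled frame is used as the target, as described in the segmentation paragraph above.

```python
import random

def sample_clip(dataset, n_frames, rng):
    """Draw one training example: n_frames consecutive depth frames and
    the label of the last of those frames.

    dataset: list of sequences, each a dict with equally long 'depth' and
             'label' lists (assumed layout).
    rng:     a random.Random instance.
    """
    seq = rng.choice(dataset)
    total = len(seq["depth"])
    if n_frames > total:
        raise ValueError("clip is longer than the selected sequence")
    start = rng.randrange(total - n_frames + 1)
    clip = seq["depth"][start:start + n_frames]
    target = seq["label"][start + n_frames - 1]
    return clip, target
```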

V. EVALUATION
For evaluation, we generated 1,108 sequences as described in Table 1; 1,000 sequences were used for training and 108 sequences were used for validation and estimation. As a result, 32,000 frames were used for training. For estimation, we used only the final frame of each of the 108 sequences. We also prepared 0.1K of manually labeled real data for evaluation.
We employed the Intersection over Union (IoU) as the evaluation metric in this experiment. The IoU was calculated with human as the positive class. Counting True Negatives (TN), False Negatives (FN), True Positives (TP), and False Positives (FP), we computed the IoU as follows:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
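For reference, the IoU with human as the positive class can be computed from two boolean masks as in the sketch below. The function name `human_iou` and the choice to return NaN when neither mask contains a human pixel are assumptions of this illustration.

```python
import numpy as np

def human_iou(pred, truth):
    """IoU of the human class: TP / (TP + FP + FN).

    pred, truth: arrays of the same shape, nonzero where a pixel is human.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.count_nonzero(pred & truth)
    fp = np.count_nonzero(pred & ~truth)
    fn = np.count_nonzero(~pred & truth)
    denom = tp + fp + fn
    # True negatives do not enter the IoU; an empty union is undefined.
    return tp / denom if denom else float("nan")
```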

A. DATA GENERATION SETTING
The background sequence was collected on the 3rd floor of the Miraikan using an HDL-32E LiDAR. In the background sequence collection step, the velocity command to the LiDAR-equipped robot and the coordinate transformation matrix from the ground origin to the LiDAR origin were also recorded. The LiDAR trajectory was estimated using 'Real-Time 6DoF Monte-Carlo Localization' [37]. The estimated LiDAR trajectory is shown in Fig. 6.
In a real-world environment, human beings may take multiple postures including standing, walking, and running. However, we assume that the human is always walking during the sequence generation in this study. Walking motion data are required to build an artificial human walking model. We used one period of walking data that was collected in [38]. This walking motion data consists of 230 frames for a single walking motion. We estimated the walking stride length based on the distance between the left and right heels. For constructing the human model, we took fifteen typical combinations of height and weight as summarized in Table 3. We believe that these combinations cover a variety of relevant scenarios. For gathering human trajectories, the HOKUYO UTM-30LX sensor [39] was installed at a fixed laser sensor position as shown in Fig. 6. With the HOKUYO UTM-30LX, we can obtain the sequential human location, direction, and velocity. We therefore used visitors' trajectories collected on September 21 and 22 and December 6 and 7, 2018 [40]. As a result, 70,300 different walking trajectories were utilized for sequence generation. Fig. 6 shows an example of the walking trajectories collected. By observing real walking trajectories, we can also avoid deploying human models into unreachable areas. With this information, we now have 230 frames of walking motion data, fifteen combinations of the human model, and 70,300 kinds of trajectories.
As such, a total of 242,535,000 (230 × 15 × 70,300) different types of human walking models can be constructed.
For generating velocity information, we used the human velocity, the velocity command sent to the LiDAR-equipped robot, and the coordinate transformation matrix from the ground origin to the sensor origin. The human velocity is recorded in xy coordinates with respect to the ground origin. Therefore, we express the human velocity in the xy coordinates of the sensor origin using the coordinate transformation matrix. Then, we can obtain the pixel-wise human velocity by assigning each human's velocity to all pixels in that human's label. For the background velocity, we assumed that the background velocity with respect to the sensor origin equals the velocity command with respect to the ground origin in the opposite direction. Therefore, the pixel-wise background velocity can be calculated from the velocity command. As a result, we can obtain the pixel-wise velocity map in xy coordinates by composing the human and background velocities with respect to the sensor origin. Note that the x-axis of the sensor origin points in the forward direction of the LiDAR.
The parameters for the LiDAR data generation are summarized in Table 1. The generated data contain depth, xyz coordinates, human labels, and velocity maps in HDF5 format. We also provide further scene-specific information in the form of an XML file. Each XML file contains the number of humans in the depth scene and the location, weight, and height of each human model. The sampling rates of both the human trajectory and the LiDAR trajectory are 10 [Hz].
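The composition of the pixel-wise velocity map can be sketched as follows. Several simplifications are assumed here and are not taken from the paper: the ground-to-sensor transformation is reduced to a planar rotation by the sensor yaw angle, humans are distinguished by integer instance labels, and the function name `velocity_map` is hypothetical.

```python
import numpy as np

def velocity_map(label, human_vel_ground, yaw, command_xy):
    """Build an (H, W, 2) velocity map in sensor-origin xy coordinates.

    label:            (H, W) int map, 0 = background, k > 0 = k-th human (assumed).
    human_vel_ground: dict {k: (vx, vy)} of human velocities, ground origin.
    yaw:              sensor heading in the ground frame, in radians (planar assumption).
    command_xy:       velocity command (vx, vy) of the LiDAR-equipped robot.
    """
    label = np.asarray(label)
    c, s = np.cos(yaw), np.sin(yaw)
    # Rotation that expresses a ground-frame vector in the sensor frame.
    ground_to_sensor = np.array([[c, s], [-s, c]])
    vmap = np.empty(label.shape + (2,), dtype=float)
    # Background: opposite direction of the velocity command.
    vmap[...] = -np.asarray(command_xy, dtype=float)
    # Humans: every pixel of a human label receives that human's velocity.
    for k, v in human_vel_ground.items():
        vmap[label == k] = ground_to_sensor @ np.asarray(v, dtype=float)
    return vmap
```

With a yaw of 90 degrees, a human moving along the ground y-axis appears to move along the sensor x-axis, which is the forward direction of the LiDAR.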

B. TRAINING PARAMETERS
We used only generated data for training in this study. As listed in Table 1, the size of the input image is 32 × 1024. We employed Adam [41] as the optimizer with a learning rate of 0.001 and a decay of 0.001. For each scene, we set the weight of the human label to the ratio of the number of background pixels to the number of human pixels. In addition, we set the weight of the categorical cross-entropy loss for segmentation to 10,000, the weight of the MSE for background velocity estimation to 1, and the weight of the MSE for human velocity estimation to 1,001.
In (a), blue points denote the background and red points denote a human label. In (b), blue points denote that the estimated result is the background and red points denote that the estimated result is a human.
... and improves the overall network performance. According to these experimental results, we conclude that using multiple frames rather than a single frame improves the recognition of distant humans in a human segmentation task with LiDAR sensing. Note that the 16-frame results are used for the examples in Figs. 7 and 8 because the 16-frame network shows the best score in Table 4.
To confirm the effect of the sequence features, we compared four different datasets as shown in Table 5. In the data that exclude both the trajectory and the walking model, all human models are fixed at a specific point and in a specific state. In the data without a walking model, all human models slide along their trajectories in a fixed state. In the data without a trajectory, all human models walk in a fixed position. We trained the 16-frame network with 100 sequences from each dataset and evaluated the IoU on real data. The results are shown in Table 5. According to Table 5, the network trained on data with all information shows the best score. We can also observe that the scores of the networks trained on data excluding either the walking model or the trajectory are higher than the score of the network trained on data excluding both. We therefore infer that considering the trajectory and the walking model contributes to the performance improvement. In a recent study [15], an ensemble network was proposed and evaluated using the same real data. Therefore, the proposed network was compared with the results reported in [15], as shown in Table 6. We can confirm that the proposed network shows a significantly better score than the other networks.
From the human segmentation results, we assume that the network can extract difference features along the time axis from sequential data, and that these features can improve the segmentation of distant humans. In addition, we also assume that a network within the image domain is capable of learning the human shape only, whereas a network within the video domain is capable of learning both the human shape and its movement. Accordingly, we conclude that the problem of performance decreasing with distance can be mitigated by taking time-series information into consideration. An example of human segmentation is illustrated in Fig. 7.
Estimation of human velocity. As shown in Fig. 5, the network can estimate the human segmentation and the pixel-wise velocity map. Then, we can derive the velocity map of the segmented area. Examples of human velocity estimation are illustrated in Fig. 8. According to Fig. 8, the estimated velocities show a tendency similar to the ground truth. Consequently, we conclude that velocity estimation with LiDAR only is a feasible task.

VI. CONCLUSIONS
In this paper, we propose a fully automated sequence generation pipeline using a precise human model and motion for human detection with velocity estimation using LiDAR. With this process, we can easily generate labeled data with any properties for LiDAR. Following this result, we conclude that sequential data can improve the performance of human segmentation when compared to data collected from the image domain only. Furthermore, we also confirm the possibility of velocity estimation with LiDAR only. We present 0.1K of labeled real data, and more than 7K of generated sequences with human labels and velocity maps. With these sequences, we were able to confirm the effectiveness of using sequential data over image domain data.
We have considered two main points with regard to future improvements of our work. The first relates to pipeline improvement. Although we used a confirmed method to produce the human model, this is not entirely representative of the real world. As such, we will take into consideration pose, fashion, and other conditions for more accurate simulations. Because we made the human model walk, we will also try to apply different walking poses, stride lengths and other conditions. In addition, it should be noted that networks were only trained with generated data in this study. A comparison of network performance in manually labeled training data and generated training data would be beneficial to this study. However, based on our investigations, this particular experiment is expected to incur significant costs. The second point for improvement relates to utilizing sequences. Because generated data is sequential, this can be applied to many tasks such as human tracking and trajectory prediction. Future work will elaborate on how these points for improvement can be achieved.