Significance of Image Features in Camera-LiDAR Based Object Detection

In autonomous cars, accurate and reliable detection of objects in the proximity of the vehicle is necessary in order to perform further safety-critical actions that depend on it. Many detectors have been developed in the last few years, but there is still demand for more reliable and more robust detectors. Some detectors rely on a single sensor, while others are based on the fusion of data from multiple sources. The main aim of this paper is to show how image features can improve the performance of detectors that rely on point cloud data only. In addition, it is shown how lidar reflectance data can be substituted by low-level image features without degrading the performance of detectors. This might be important when the same pretrained model is to be used on data generated by a lidar with a different reflectance encoding scheme and when, due to the lack of training data, retraining is not possible. Three different approaches are proposed to fuse image features with point cloud data. The extended networks are compared with the original network and tested on the KITTI dataset as well as on our own data acquired by a different lidar sensor. The networks augmented with image features achieved a recall increase of a few percent for occluded objects.


I. INTRODUCTION
In the field of autonomous driving and intelligent infrastructures, environment perception is a safety-critical task in which different types of static and dynamic objects must be detected and localized reliably and robustly under various circumstances, such as different weather conditions, limited sensing resolution of the applied sensors, partial occlusions, etc.
Efficient sensing under different weather conditions can be handled by using different types of sensors jointly (lidar, radar, camera, thermal vision). Most often, camera and lidar sensors are used in sensor fusion algorithms [1], [2]. For various applications, camera and radar pairing is also common [3], [4], while there are also cases where radar and lidar data are considered for fusion [5].
Lidar sensors are not affected by day and night lighting conditions, and they can also operate reliably under various limited visibility conditions. Radar sensors are likewise unaffected by light and by weather conditions such as fog and dust, and they can sense longer distances than lidars; however, their resolution is considerably lower than that of lidars. Cameras perform poorly under limited visibility conditions, although thermal imaging cameras can compensate for such limitations [6]. The individual application of certain sensors might be strongly limited by their low spatial resolution, as in the case of lidars (depending on the displacement, the number of channels and the field of view, different sparse patterns can be observed in the generated pointcloud). Even the most advanced lidars are not able to capture objects at longer distances (> 150 m) with good enough resolution, which makes the detection task in such cases even more difficult. At distances of more than 150 m the number of rays crossing the body of an average sized vehicle is too low for its reliable detection (even in the case of lidars having the highest available resolution).
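The sparsity argument above can be made concrete with a small back-of-the-envelope calculation. The sketch below counts how many beams fall on the rear face of a car at two distances. The angular resolutions and the car dimensions are assumed values chosen for illustration only; they are not taken from any sensor used in this paper, and `beams_across` is a hypothetical helper.

```python
import math


def beams_across(extent_m, distance_m, angular_res_deg):
    """Approximate number of beams that hit an object of a given extent."""
    # Angle subtended by the object as seen from the sensor
    angle_deg = math.degrees(2.0 * math.atan(extent_m / (2.0 * distance_m)))
    return int(angle_deg / angular_res_deg)


# Assumed sensor and target values, for illustration only
V_RES_DEG, H_RES_DEG = 0.35, 0.2      # vertical / horizontal angular resolution
CAR_HEIGHT_M, CAR_WIDTH_M = 1.5, 1.8  # rear face of an average car

for distance in (25.0, 150.0):
    rows = beams_across(CAR_HEIGHT_M, distance, V_RES_DEG)
    cols = beams_across(CAR_WIDTH_M, distance, H_RES_DEG)
    print(f"{distance:5.0f} m: about {rows} x {cols} = {rows * cols} returns")
```

Under these assumptions the car yields on the order of a hundred returns at 25 m but only a handful at 150 m, which illustrates why lidar only detection becomes unreliable at long range.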
There are numerous cases in which long range detection of vehicles is required in order to perform the given task efficiently. Examples are the prediction of potentially dangerous traffic situations in order to avoid accidents and increase road safety [7], and the digital twin generation of longer road sections, where the range of detectability also influences the minimal number of sensors that must be deployed to cover a given road section. This factor plays a significant role in future intelligent infrastructure and road networks, first of all for cost, energy consumption and maintenance related reasons [8].
Multi-modal approaches are a promising alternative to handle (to some extent) problems caused by the sparsity of lidar pointclouds. For example by combining information from camera images (which obviously have much larger resolution than lidars) with pointcloud data (which on the other hand have good depth resolution) the detection performance as well as the reliability might be improved (compared to camera only or lidar only approaches).
The main contribution of this paper is represented by the proposed pointcloud augmentation techniques incorporated into a selected baseline lidar based detector model and by the evaluation and analysis of the impact of certain image features on 3D object detection performance compared to the lidar only detection case where the training of neural networks as well as the inference is solely performed on pointclouds. It is also examined how certain types of image features under different conditions (various scenarios including partially occluded, close and distant vehicles, data acquired by various lidar types) contribute to the performance improvement of lidar only based solutions by transforming the pointcloud into a "clever" pointcloud by applying the proposed augmentation techniques.
The paper is organized as follows:
• Section II describes the related work, including a brief overview of the state-of-the-art solutions of sensor fusion;
• Section III summarises the problem addressed by the presented point cloud augmentation algorithms;
• Section IV presents the proposed point cloud augmentation algorithms;
• Section V analyses the results achieved by using the augmented networks;
• finally, Section VI reports the conclusions.

II. RELATED WORKS
Many methods appeared in the literature in the last few years (first of all machine learning based approaches) to tackle the detection problem especially in lidar pointclouds. Let us categorize the methods developed for object detection into two classes, i.e. approaches which operate on lidar pointclouds only and approaches utilizing camera images together with lidar pointclouds.

A. LIDAR ONLY APPROACHES
Lidar only approaches are efficient for short range detection; however, at longer distances the density of lidar points is significantly reduced, which makes it difficult to detect objects reliably. By utilizing lidars, the vehicle or pedestrian detection task might be performed efficiently under various weather conditions. Building on the PointNet design developed by Qi et al. [9], VoxelNet [10] was one of the first methods to perform true end-to-end learning in this area. VoxelNet creates voxels and applies a PointNet to each voxel, followed by a 3D convolutional middle layer to consolidate the vertical axis, after which a 2D convolutional detection architecture is applied. While the performance of VoxelNet is robust, its inference time is too slow for real-time deployment. SECOND [11] later improved the inference speed of VoxelNet, but 3D convolutions remained a bottleneck. This bottleneck was removed by PointPillars [12], which is still one of the most computationally efficient architectures (according to the KITTI benchmark site [13]) designed for the 3D object detection task in lidar pointclouds. In PointPillars the 3D points are organized into columns (pillars) and transformed into a sparse tensor of learnt abstract features, which are then processed by further convolutional layers to obtain detections in the form of 3D bounding boxes. A different concept for object detection in pointclouds is proposed by the authors of the so-called Self-Ensembling Single-Stage object Detector (SE-SSD) [14], who focus on exploiting both soft and hard targets by introducing two Single-Stage object Detector (SSD) networks in a "student"-"teacher" relation. The Semantic Point Generation (SPG) method proposed in [15] aims to recover missing parts of foreground objects by generating semantic points, which can be used directly by pointcloud based object detectors to enhance detection.

B. CAMERA AND LIDAR BASED APPROACHES
In order to extend the range of detectability of objects and increase reliability, the joint application of different sensor types is highly welcome. The authors in [1] proposed a multimodal approach by fusing information from lidar pointclouds and semantic-rich stereo images. They bridge the resolution gap between the lidar and the camera by introducing so-called virtual points. Another multi-modal approach is proposed in [2], where the lidar points are augmented by semantic information extracted from images in the form of pixel categories obtained by semantic segmentation of the image. In the so-called EPFNet [16] the authors enhance lidar points with semantic image features in a point-wise manner without any image annotations. In [17] the pointcloud of occluded objects is handled by learning object shape priors, based on which the shape of the complete object might be estimated. The authors in [18] consider geometric consistency between detections in the image and the pointcloud, meaning that the 2D bounding boxes and the projected 3D bounding boxes of detections must be consistent, as well as the so-called semantic consistency, which is related to the category of objects. The RPN model proposed in [19] performs multimodal fusion on high resolution feature maps in order to generate more reliable 3D object proposals for multiple object classes.

This article has been accepted for publication in IEEE Access. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2022.3181137. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this paper a camera-lidar fusion for object detection is proposed, which is based on augmenting the lidar points with corresponding image patterns as well as by individual pixel data. We will show how fusion strategies of this kind affect the performance of the baseline detector. The proposed fusion models enable real-time application (the frame rate of the detector is kept at 20 fps which is the frame rate of lidar sensors available today). The effectiveness of the proposed augmentation techniques is evaluated on the KITTI dataset as well as on further real world data collected by the authors. It is also shown how such augmentations may improve the performance of the selected baseline network when performing the inference on pointclouds generated by different lidar sensors.

C. FUSION OF RADAR WITH CAMERAS OR LIDAR
In sensor fusion, radars are also frequently considered, first of all due to their longer range, cost efficiency and applicability even under limited visibility conditions. The so-called AssociationNet [3] generates a pseudo-image from radar pins, 2D bounding boxes and the original RGB camera image, which is fed into a neural network to learn high-level semantic representations. Camera-radar fusion might be applied at object level as well [4].
Thermal imaging cameras operate well even under limited visibility conditions, and they can be used jointly with lidars to achieve more accurate object detection. In [20], for instance, the authors use such a combination of sensors, while in [21] a radar sensor is also included. Researchers at the University of Berlin have presented a solution aimed at highway applications in which radar and lidar detections are fused [22].

III. PROBLEM DESCRIPTION
Many object detectors operating on various types of sensory data (lidar pointcloud, camera image, radar pointcloud, etc.) have been proposed during the last few years. Our main goal here is to show how the performance of an object detector operating on pointclouds only might be further improved by low level fusion with camera images. We will also show how a network trained on data acquired by a specific sensor performs on pointcloud data acquired by comparable sensors of other vendors, and what improvement in detection performance might be expected when fusion is applied. By fusion we mean camera-lidar fusion, i.e. fusing image pixels with pointcloud data.
We would like to point out the impact of sensor specific data patterns (produced by different lidar sensors) on detector performance, which is due to beam angles, resolution and sensitivity varying from sensor to sensor. Some performance degradation of detectors might be expected due to sensor specific pointcloud patterns and reflectivity profiles representing the objects. We would like to show how camera-lidar fusion performed at a lower level of abstraction may contribute to the reduction of this performance degradation. Since there are many types of lidar sensors and many setups exist (each causing different pointcloud patterns to appear on the surfaces of objects), collecting training data for each specific setup and sensor type individually is energy and time consuming. Instead of retraining the network on sensor or setup specific training datasets, we aim at improving its robustness by applying lower level fusion of pointclouds with image pixel data.
As the baseline model we have chosen the PointPillars [12] object detector (operating on lidar pointclouds), which has remarkable performance considering its speed and the precision of its detections (according to the KITTI 3D object detection benchmark). Although some newer detectors achieve higher precision, they still have a much lower frame rate than PointPillars. We have trained the baseline model as well as the fusion capable networks on the KITTI training dataset [23].

A. DIFFERENCE IN POINTCLOUD PATTERNS
The pointcloud patterns formed on object surfaces differ from manufacturer to manufacturer of lidar sensors (for the same scenario and sensor placement), which may strongly influence the performance of networks trained for a specific lidar sensor but applied to data acquired by a different one. Fig. 1 shows two pointcloud patterns corresponding to two different lidar sensors (both sensors were modeled in accordance with their specification sheets by the dSpace SensorSim sensor simulator). Let us call these sensors sensor-A and sensor-B. In Figs. 1a and 1b the vehicles are 25 m away from the sensor origin, while in Figs. 1c and 1d the vehicles were set to be 15 m away from the sensor origin. The height of the lidar sensor for both scenarios was set to 1.73 m (according to the test vehicle of the Karlsruhe Institute of Technology [13], [23]). The orientation of the vehicles was 45° with respect to the longitudinal axis of the lidar. The aim of this simulation is to point out the differences between pointcloud patterns. One may observe that the density of points as well as the formed pointcloud patterns differ in both cases. The performance of the trained detector obviously degrades when running on data acquired by a different lidar sensor. Another important aspect here is the intensity (reflectivity) profile of lidars, which may also differ from vendor to vendor and therefore stands for an additional limiting factor for the usability of pretrained neural networks (trained on specific lidar data) in the case of different lidars. Each manufacturer handles the reflection of the laser beam differently, from which the reflectivity value is calculated by the sensor.
In the upcoming sections we will show how the lidar reflectivity information influences the performance of the baseline model and how image pixel information may contribute to the performance improvement of detectors compared to lidar reflectivity values.

IV. PROPOSED POINTCLOUD AUGMENTATION
In order to combine data from different sensors and generate higher level features that enhance the performance of detectors, we propose a data driven sensor fusion approach in which the fusion itself is done by neural network architectures, as in the IMF-DNN architecture [24]. The data acquired by the lidar is combined with image features (see later in this section), and this combination is applied during training as well as inference. The other class of fusion algorithms, in which mathematical models are used to generate detections, is called model-based fusion [25].

A. THE SELECTED BASELINE MODEL
We selected the PointPillars convolutional neural network proposed by the authors in [12] as the baseline model in order to apply our proposed augmentation techniques and evaluate their impact on detection performance. The main components of PointPillars are the so-called Pillar Feature Network, the Backbone, and the Single Shot Detector (SSD) head [26]. The network converts the raw pointcloud to a stacked pillar tensor and a pillar index tensor. A feature encoder then uses the stacked pillars to learn a set of features that form a so-called 2D pseudo-image, which serves as input for the Backbone convolutional neural network. Based on the generated features, the detection head predicts the 3D bounding boxes of the objects present in the scene [12]. Starting from this baseline model, our aim was to include image pixel information in the process of pseudo-image creation in order to force the network to learn higher level features from pointcloud and image data jointly. To transform the augmented input into a higher level feature vector (see Fig. 2), a fully connected layer has been applied, similarly to [9], [10]. The next section (IV-B) gives a deeper insight into the extended architecture as well as the alternatives used for image-pointcloud fusion.
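The first step of the baseline, grouping the raw points into pillars, can be illustrated by a minimal sketch. It is not the PointPillars implementation: the function names, the grid origin and the pillar size used in the example are assumptions, and a real pipeline would additionally cap the number of pillars and of points per pillar and work on tensors.

```python
import math


def pillar_index(x, y, x_min, y_min, pillar_size):
    """Grid cell (column, row) of the pillar that contains point (x, y)."""
    return (math.floor((x - x_min) / pillar_size),
            math.floor((y - y_min) / pillar_size))


def group_into_pillars(points, x_min, y_min, pillar_size):
    """Map each pillar index to the list of points falling into that pillar."""
    pillars = {}
    for p in points:
        key = pillar_index(p[0], p[1], x_min, y_min, pillar_size)
        pillars.setdefault(key, []).append(p)
    return pillars


# Points as (X, Y, Z, reflectivity); grid origin and cell size are assumed
points = [(0.1, 0.2, 0.0, 0.5), (0.3, 0.4, 0.1, 0.6), (1.2, -0.3, 0.2, 0.1)]
pillars = group_into_pillars(points, x_min=0.0, y_min=-1.0, pillar_size=0.5)
```

Only non-empty pillars appear in the dictionary, which mirrors the sparsity that the stacked pillar tensor and the pillar index tensor exploit.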

B. EXTENDED PILLAR FEATURE NET
The idea of using image features comes from the fact that when using different brands of lidars, we cannot fully align the reflectivity values. Another problem that arises is that pretrained models may be sensitive to internal sensor parameters, such as the angle or pitch of the beams. As extensions to the original baseline model, three different architectures have been proposed to increase the robustness against the influence of varying sensor parameters.
Let $p_i = [u_i, v_i]^T$ stand for the pixel coordinates of the projection of a 3D point $P_i = [X_i, Y_i, Z_i]^T$ from the lidar pointcloud onto the camera image plane, obtained with the pinhole camera model as follows:

$$\lambda_i \tilde{p}_i = K \, [R \mid t] \, \tilde{P}_i, \qquad K = \begin{bmatrix} f_x & s & x_0 \\ 0 & f_y & y_0 \\ 0 & 0 & 1 \end{bmatrix},$$

where $\tilde{p}_i$ and $\tilde{P}_i$ stand for the homogeneous coordinates of $p_i$ and $P_i$, respectively, $\lambda_i$ is the projective scale factor, $K$ denotes the camera matrix (which contains the focal lengths $f_x$ and $f_y$ expressed in terms of pixel width and height, respectively, the principal point coordinates $x_0$, $y_0$ and the axis skew $s$), $R$ the rotation matrix and $t$ the translation vector corresponding to the transformation from the lidar frame to the camera frame. Let $I^L_i$ and $I^{cam}_i$ stand for the reflectivity of the reflected laser beam at $P_i$ and the image pixel intensity at its projection $p_i$, respectively.
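The pinhole projection described above can be illustrated by a minimal sketch in plain Python. The function name `project_point`, the example intrinsics and the identity extrinsics are illustrative assumptions and not values from the paper. Rejecting points behind the camera is likewise an assumed convention.

```python
def project_point(P, K, R, t):
    """Project a 3D lidar point P = (X, Y, Z) to pixel coordinates (u, v).

    K is a 3x3 camera matrix, R a 3x3 rotation and t a 3-vector mapping
    lidar coordinates to camera coordinates (all given as nested lists).
    Returns None when the point lies behind the camera.
    """
    # Rigid transform into the camera frame: Pc = R * P + t
    Pc = [sum(R[r][c] * P[c] for c in range(3)) + t[r] for r in range(3)]
    if Pc[2] <= 0:
        return None
    # Homogeneous image coordinates: K * Pc
    ph = [sum(K[r][c] * Pc[c] for c in range(3)) for r in range(3)]
    # Perspective division by the third (depth) component
    return ph[0] / ph[2], ph[1] / ph[2]


# Hypothetical intrinsics: fx = fy = 700 px, principal point (620, 190), zero skew
K = [[700.0, 0.0, 620.0],
     [0.0, 700.0, 190.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # assumed identity
t = [0.0, 0.0, 0.0]

uv = project_point([1.0, 0.5, 10.0], K, R, t)
```

In practice the projected coordinates also have to be checked against the image bounds before a pixel value is read.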
Let us point out that in the baseline model [12] each lidar point $P_i$ is augmented, within the pillar that contains it, to a point $P^*_i$ by using $M^j = [M^j_x, M^j_y, M^j_z]$ and $C^j = [C^j_x, C^j_y, C^j_z]$, which denote the mean of the points falling in the $j$th pillar and the center of the pillar, respectively. Starting from this original augmentation, we have incorporated image pixel information into $P^*_i$. Let $P^{**}_i$ denote the reduced version of the augmented point $P^*_i$, in which the lidar reflectivity $r_i$ is not included. The following cases have been considered:
1) Each $P^{**}_i$ is augmented by $v_i$ (1P1P)
2) Each $P^{**}_i$ is augmented by the intensity vector formed from an $N \times N$ neighborhood of $p_i$ (1P25P)
3) Each $P^{**}_i$ is augmented by the normalized intensity vector formed from an $N \times N$ neighborhood of $p_i$ (1P25PN)
4) Each $P^*_i$ is augmented by $r_i$ and $v_i$ (1P1P)
5) Each $P^*_i$ is augmented by $r_i$ and the intensity vector formed from the intensities of an $N \times N$ neighborhood of $p_i$ (1P25P)
6) Each $P^*_i$ is augmented by $r_i$ and the normalized intensity vector formed from the intensities of an $N \times N$ neighborhood of $p_i$ (1P25PN)

During our experiments we set $N = 5$. Together with the original baseline models (with and without considering $r_i$), eight networks corresponding to the above cases were trained, evaluated and tested. Each of these networks was trained and tested on the same splits of the KITTI [13] dataset. The original training data (7481 snapshots) was split by random selection into 3212 training, 3269 validation and 1000 test samples. After evaluating the networks on the test set, we tested their performance on KITTI RAW [23] data as well as on data collected by us using a lidar different from the one used in KITTI. Unfortunately, there is no ground truth for the raw and custom datasets, so we cannot determine the accuracy of the detections in those cases, but we can draw useful conclusions from the number of true and false detections.
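The per-point augmentation described above can be sketched as follows. The sketch assumes the nine-dimensional layout of the PointPillars paper [12], i.e. the point, its reflectivity, the three offsets from the pillar mean and the two ground-plane offsets from the pillar center. Whether the networks in this paper use exactly this layout is an assumption, and `decorate_pillar` is a hypothetical helper.

```python
def decorate_pillar(points, center_xy, image_feats=None, use_reflectance=True):
    """Augment the points of one pillar.

    points:      list of (X, Y, Z, r) tuples that fall into the pillar
    center_xy:   (Cx, Cy) center of the pillar in the ground plane
    image_feats: optional list holding one list of image intensities per point
    """
    n = len(points)
    # Arithmetic mean of the points in the pillar (M^j in the text)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    decorated = []
    for i, (X, Y, Z, r) in enumerate(points):
        feat = [X, Y, Z]
        if use_reflectance:
            feat.append(r)
        # Offsets from the pillar mean and from the pillar center
        feat += [X - mean[0], Y - mean[1], Z - mean[2]]
        feat += [X - center_xy[0], Y - center_xy[1]]
        if image_feats is not None:
            feat += list(image_feats[i])
        decorated.append(feat)
    return decorated


# Two points in one pillar, each with a single (made-up) pixel intensity
points = [(1.0, 2.0, 0.5, 0.3), (3.0, 2.0, 1.5, 0.7)]
pixel_values = [[0.4], [0.6]]
with_r = decorate_pillar(points, (2.0, 2.0), pixel_values)
without_r = decorate_pillar(points, (2.0, 2.0), pixel_values, use_reflectance=False)
```

With one pixel intensity per point the feature vector has 9 + 1 entries when the reflectivity is kept and 8 + 1 entries when it is omitted. A 5 × 5 neighborhood adds 25 entries instead of one.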
In the following subsections we describe the structure of the extended feature network in detail.
The modified architectures extend the Pillar Feature Network (PFN) of the baseline model. The modified layer is of size (9+K, 64): the 9 features of the original input are augmented by K = 1 or K = 25 image features, while the output size is 64. The augmentation is performed as follows:
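As a rough illustration of the layer shape stated above, the following sketch applies a (9+K) to 64 linear map with a ReLU to every point of a pillar and then takes the maximum over the points. It is a simplification under stated assumptions: batch normalization is omitted, the weights are random placeholders rather than trained values, and `pfn_layer` is a hypothetical name.

```python
import random


def pfn_layer(pillar_feats, weights, bias):
    """Simplified pillar feature layer: linear map, ReLU, max over points.

    pillar_feats: list of per-point feature vectors of length 9 + K
    weights:      (9 + K) x 64 matrix as nested lists
    bias:         list of 64 values
    """
    out_dim = len(bias)
    per_point = []
    for feat in pillar_feats:
        row = [max(0.0, bias[j] + sum(f * weights[i][j] for i, f in enumerate(feat)))
               for j in range(out_dim)]
        per_point.append(row)
    # Max pooling over the points yields one feature vector per pillar
    return [max(col) for col in zip(*per_point)]


random.seed(0)
K_IMG = 25               # 25 image features per point (5 x 5 neighborhood)
IN_DIM, OUT_DIM = 9 + K_IMG, 64
weights = [[random.gauss(0.0, 0.1) for _ in range(OUT_DIM)] for _ in range(IN_DIM)]
bias = [0.0] * OUT_DIM
feats = [[random.random() for _ in range(IN_DIM)] for _ in range(3)]  # 3 points
pillar_feature = pfn_layer(feats, weights, bias)
```

The output length stays at 64 regardless of K, so the rest of the baseline network can remain unchanged when image features are appended to the input.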

1) The 1P1P Network
First, the original network was modified by attaching to each point $P_i$ in the pointcloud the intensity value (taken from the HSV color space) of the pixel corresponding to the projection of $P_i$ in the camera image (see Fig. 3). In order to project a 3D point onto the camera image plane, the camera and the lidar must be calibrated first, i.e. the intrinsics and extrinsics must be estimated. For this purpose the calibration approaches in [14], [27] have been used.
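A minimal sketch of how one intensity value per projected point could be read from the image is given below. It interprets "intensity from the HSV color space" as the V channel and uses nearest-pixel sampling. Both choices, as well as the helper names and the handling of projections that fall outside the image, are assumptions.

```python
import colorsys


def hsv_value(rgb):
    """Return the V (value) channel of an 8-bit RGB pixel, scaled to [0, 1]."""
    r, g, b = (c / 255.0 for c in rgb)
    _, _, v = colorsys.rgb_to_hsv(r, g, b)
    return v


def sample_intensity(image, u, v):
    """Intensity of the pixel nearest to (u, v); None if outside the image.

    image is a list of rows, each row a list of (R, G, B) tuples;
    u is the column coordinate and v the row coordinate.
    """
    col, row = int(round(u)), int(round(v))
    if 0 <= row < len(image) and 0 <= col < len(image[0]):
        return hsv_value(image[row][col])
    return None


# Tiny 2 x 2 test image with made-up colors
image = [[(255, 128, 0), (0, 102, 51)],
         [(10, 20, 30), (0, 0, 0)]]
intensity = sample_intensity(image, 0.6, 0.2)   # nearest pixel: row 0, column 1
```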

2) The 1P25P(N) Network
In the second approach, a vector of 25 pixel intensities is attached to each lidar point $P_i$. Let us denote this vector by $v_i$; it contains the intensity values of the 5 × 5 neighborhood of the projection $p_i$. Denoting this neighborhood by $U_i$, the vector can be expressed as $v_i = \mathrm{Vec}(U_i)$. In order to ensure accurate comparison across neighborhoods, we normalized the elements of $U_i$ to have zero mean and unit variance (see Fig. 5). We have also tested the case in which non-normalized neighborhood intensities are used for augmentation (see Fig. 4). By including neighborhood related information in the features of each 3D point, the network may also "utilize" spatial image information during training.
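The neighborhood vector and its normalized variant can be sketched as follows. Clamping at the image border and the small constant `eps` are assumptions, since the text does not say how borders or flat patches are handled, and `neighborhood_vector` is a hypothetical helper.

```python
from statistics import fmean, pstdev


def neighborhood_vector(gray, u, v, n=5, normalize=False, eps=1e-6):
    """Flatten the n x n neighborhood of pixel (u, v) into a vector.

    gray is a list of rows of intensities. Coordinates are clamped at the
    image border (an assumed convention).
    """
    half = n // 2
    h, w = len(gray), len(gray[0])
    col0, row0 = int(round(u)), int(round(v))
    vec = []
    for dr in range(-half, half + 1):
        for dc in range(-half, half + 1):
            row = min(max(row0 + dr, 0), h - 1)
            col = min(max(col0 + dc, 0), w - 1)
            vec.append(float(gray[row][col]))
    if normalize:
        mu, sigma = fmean(vec), pstdev(vec)
        # eps guards against division by zero on flat patches
        vec = [(x - mu) / (sigma + eps) for x in vec]
    return vec


# Synthetic 7 x 7 intensity ramp used only to exercise the function
gray = [[r * 7 + c for c in range(7)] for r in range(7)]
v_raw = neighborhood_vector(gray, 3, 3)                   # 1P25P style
v_norm = neighborhood_vector(gray, 3, 3, normalize=True)  # 1P25PN style
```

Normalizing each patch removes its overall brightness and contrast, so that mainly the local pattern remains.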

A. EVALUATION OF RESULTS ON A SEPARATED TEST SET
The performance of the detectors was tested on a separated test set of 1000 samples taken from the training data of the KITTI 3D benchmark. The metrics used for comparison here are the precision, the recall and the mean Average Precision (mAP). The latter is calculated by averaging AP values over the multiple Intersection over Union (IoU) thresholds used by COCO [28], [29].
Two groups of detectors (each using the same baseline model but a different augmentation) were compared. The first group uses all data from the lidar sensor, i.e. the pointcloud as well as the reflectance value of each lidar point. In the second group of networks the reflectance was omitted in order to eliminate the influence of the different reflectance encoding schemes used across lidar manufacturers. In this case the network is forced to learn from reduced data; however, our goal here is to substitute the reflectance value by image pixel intensities and to show their impact on the performance of the trained detectors.
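The averaging step of the mAP metric can be sketched as below, using the COCO convention of ten IoU thresholds from 0.50 to 0.95 in steps of 0.05. The AP values in the example are made-up placeholders and not results of this paper. Computing the AP at a single threshold is outside the scope of the sketch.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 as used by the COCO evaluation."""
    return [round(0.50 + 0.05 * k, 2) for k in range(10)]


def mean_average_precision(ap_by_threshold):
    """Average AP values over the COCO IoU thresholds.

    ap_by_threshold maps an IoU threshold to the AP measured at that threshold.
    """
    thresholds = coco_iou_thresholds()
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)


# Placeholder AP values that simply decrease with the IoU threshold
example_ap = {t: 1.0 - t for t in coco_iou_thresholds()}
example_map = mean_average_precision(example_ap)
```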
During training, the weights of all considered networks were saved every 5000 steps, and for each saved set of weights the mAP metric was calculated on the test set (with available ground truth) for each category (easy, moderate and hard) according to the KITTI benchmark site [13] (see Fig. 6). One can see that when the reflectance is included, the image based augmentation has no remarkable effect on the mAP (less than 1% difference). On the other hand, when the reflectance is omitted, the image augmentation causes an observable increase in the mAP. The largest contribution of image pixel intensities to the mAP improvement can be observed for hard objects, i.e. when the number of rays reflected from the surface of the objects is small.
The training of the detectors was stopped after a certain number of steps, which was roughly 300000 steps for the original and 1P1P detectors and roughly 600000 steps for the 1P25P(N) networks. We expected that more steps would be required for training a more complex network, but none of the considered networks produced a remarkable improvement after 300000 steps. Each network was trained on the same splits of the KITTI dataset. To train the models, the hyperparameters used by the baseline model have been adopted. The values of the most relevant hyperparameters are given in Table 1.

B. TEST RESULTS ON KITTI RAW DATA SCENARIO
In this section, we selected the weights of the detectors that performed best in the evaluation process. Depending on whether the reflectance was included or omitted, the 1P1P and 1P25P networks showed the best performance (see Figs. 6 and 12). The selected weights were used to run the "0104" drive data of the KITTI RAW dataset, recorded on 26.09.2011 [23], through the networks. Due to the absence of ground truth data, the previously applied metrics could not be computed for this sequence. In Fig. 8 a sequence of 5 frames can be seen. The top row shows the detections produced by the original architecture, while the bottom row shows the detections obtained by the 1P25P network. Here the lidar reflectivity values have also been taken into account. One may observe that there is no significant difference between the number of detected objects for this sequence, thus the contribution of image features to the overall detection performance is negligible in this particular case. Fig. 9 shows the same sequence of 4 frames. The top row shows the detections of the original architecture and the bottom row shows the detections of the 1P1P network. In this series, the lidar reflectivity values were omitted from both the original and the 1P1P network. As one may observe, the 1P1P network was able to detect more distant or occluded cars with a confidence larger than 70%, thus in this case the contribution of image features to the performance improvement is remarkable. Fig. 10 shows another sequence, also consisting of 5 frames. The top row shows the detections of the original architecture and the bottom row shows the detections of the 1P25P network. In this series, the lidar reflectivity values have been taken into account. There is no remarkable difference between the performance of the two networks. The same set of vehicles is detected by both networks, and the detections are of nearly the same quality in terms of orientation and location accuracy. Thus, when image pixel intensity is used in addition to the lidar reflectivity, the performance improvement of the network is negligible.
In Fig. 11 the top row shows the detections produced by the original architecture, while the bottom one shows the detections yielded by the 1P1P network. Here the lidar reflectivity values were omitted. The number of objects detected by the 1P1P network has increased significantly compared to the original one (with reflectance omitted). The confidence limit was set to 70%. The original network was able to detect nearby vehicles confidently, but in the case of distant or occluded cars it did not perform as reliably as the 1P1P network did with image features. A significant increase in the number of true positive detections can be observed for the 1P1P network.
This section compared the original and the 1P25P network for the case in which the lidar reflectivity values were taken into account, and the original and the 1P1P network for the case in which the reflectivity was omitted. We considered only these two networks here because, according to Fig. 12, they performed remarkably better than 1P25PN.
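The comparison by number of confident detections can be expressed as a small filter. The sketch uses made-up scores that only illustrate the counting and are not detections from the paper. The dictionary layout and the function name are assumptions.

```python
def confident(detections, threshold=0.70):
    """Keep detections whose confidence score reaches the threshold."""
    return [d for d in detections if d["score"] >= threshold]


# Hypothetical detections of two networks on the same frame
original = [{"score": 0.91}, {"score": 0.64}, {"score": 0.42}]
with_image = [{"score": 0.93}, {"score": 0.78}, {"score": 0.55}]

n_original = len(confident(original))
n_with_image = len(confident(with_image))
```

Without ground truth such counts cannot separate true from false positives automatically, so each surviving detection still has to be checked visually.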

C. DETECTOR PERFORMANCE ON OUR CUSTOM RECORDED DATA
We recorded our custom dataset with a different type of lidar sensor and camera than those used by the KITTI vision benchmark suite. From numerous recordings, two groups were selected to test the contribution of image features to the detection performance. The confidence limits of detections were set to 70% and 75%. The networks were tested on the same snapshots. Fig. 13 shows the detections obtained on the same short series of recordings when the lidar reflectivity was taken into account. The confidence limit of the detections was set to 75%. The results show no difference in the number of detected objects in this case. There are some cases in which the detectors (original, 1P1P, 1P25P, 1P25PN) recognize different vehicles, but the overall performance has not improved.

1) The First Scenario From Our Custom Dataset
The detectors were also tested on these recordings with the lidar reflectivity omitted (see Fig. 14). The results show that the modified networks detect more vehicles in these frames. The reason behind this might be the low number of points on each vehicle due to occlusions. As can be seen, the 1P25P and the 1P25PN networks detected most of the vehicles, while the original network (with the lidar reflectance omitted) provided fewer detections. None of the detectors was able to detect all vehicles in this scene.

2) The Second Scenario From Our Custom Dataset
Similarly to the first scenario, the detectors were evaluated with and without considering the lidar reflectance (see Fig. 15 and Fig. 16). The same phenomenon can be observed as in the first scenario. Including the reflectivity values did not change the performance. On the other hand, when the reflectivity values were omitted, the modified architecture proved to be more effective.

HARDWARE SETUP AND CALIBRATION
For recording the stream of image-pointcloud pairs, a Hikvision DS-2CD2063G0-I camera with 6 MP resolution and an Ouster OS-1 Uniform 64 channel lidar sensor were used. The calibration of the camera was performed by the method proposed by Zhang [30]. The camera-lidar extrinsics have been estimated by the method proposed in [27].
The detector works by projecting the lidar points onto the camera plane, thus in addition to an accurate calibration it is essential to precisely synchronize the acquisition of data in order to determine the correct pixel intensity value corresponding to a given 3D point. The importance of time synchronisation is illustrated by Liu et al. in "Matter of time" [31]. Inaccurate synchronization may affect the performance of the detector significantly.

FIGURE 12. KITTI 3D object detection evaluation metric for each network architecture. The individual rows depict the recall-precision curves for the original PointPillars, the 1P1P, the 1P25P and the 1P25PN networks, respectively. The 1st column corresponds to the recall-precision curve for the case when the lidar reflectivity was also considered, while the 2nd column reflects the case when the lidar reflectivity was omitted. The 3rd and 4th columns correspond to cases when the detection threshold was set to 70% with lidar reflectivity included and omitted, respectively.
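One simple way to pair each lidar sweep with a camera frame is nearest-timestamp matching, sketched below. This is an illustration only and not the synchronization method used for the recordings. The function name, the tolerance of 25 ms (half the period of a 20 Hz stream) and the example timestamps are assumptions.

```python
from bisect import bisect_left


def nearest_frame(frame_times, sweep_time, max_offset=0.025):
    """Index of the camera frame closest in time to a lidar sweep.

    frame_times must be sorted (seconds). Returns None when the best match
    is further away than max_offset, so that the pair can be skipped.
    """
    if not frame_times:
        return None
    pos = bisect_left(frame_times, sweep_time)
    # Only the neighbors around the insertion point can be the closest
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(frame_times)]
    best = min(candidates, key=lambda i: abs(frame_times[i] - sweep_time))
    if abs(frame_times[best] - sweep_time) > max_offset:
        return None
    return best


frame_times = [0.00, 0.05, 0.10, 0.15]          # made-up camera timestamps
matched = nearest_frame(frame_times, 0.06)      # sweep shortly after frame 1
unmatched = nearest_frame(frame_times, 0.30)    # no frame close enough
```

Rejecting pairs with a large time offset avoids attaching pixel values from a scene that has already changed, which matters most for moving objects.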

VI. CONCLUSION
Reliable environment sensing is one of the most important tasks for self-driving vehicles. The most common types of available object detectors are the lidar only, camera-only and the camera-lidar based detectors. In this paper a low level camera-lidar fusion was proposed based on augmentation of pointcloud data by image features to improve the performance of lidar only based detectors. It was shown how pixel intensity patterns (compared to 3D spatial data) contribute to the reliability of detections especially in those cases when distant objects (represented by lower number of points in the pointcloud) have to be detected. The augmentation is performed by attaching reshaped image intensity patterns to each projected 3D point in the pointcloud. The network retains 20 FPS, which corresponds to the highest frame rate of available lidar sensors. The accuracy of the detector was evaluated and tested on the KITTI dataset as well as on custom data.

[24] Jian Nie, Jun Yan, Huilin Yin, Lei Ren, and Qian Meng. A multimodality