Proactive Anomaly Detection for Robot Navigation with Multi-Sensor Fusion

Despite the rapid advancement of navigation algorithms, mobile robots often produce anomalous behaviors that can lead to navigation failures. The ability to detect such anomalous behaviors is a key component in modern robots to achieve high levels of autonomy. Reactive anomaly detection methods identify anomalous task executions based on the current robot state and thus lack the ability to alert the robot before an actual failure occurs. Such an alert delay is undesirable due to the potential damage to both the robot and the surrounding objects. We propose a proactive anomaly detection network (PAAD) for robot navigation in unstructured and uncertain environments. PAAD predicts the probability of future failure based on the planned motions from the predictive controller and the current observation from the perception module. Multi-sensor signals are fused effectively to provide robust anomaly detection in the presence of sensor occlusion as seen in field environments. Our experiments on field robot data demonstrate failure identification performance superior to that of previous methods, and show that our model can capture anomalous behaviors in real time while maintaining a low false detection rate in cluttered fields. Code, dataset, and video are available at https://github.com/tianchenji/PAAD


I. INTRODUCTION
Mobile robots are playing an important role in creating intelligent, productive, and easy-to-operate modern farms. Small and low-cost robots (Figure 1a) deployed under crop canopies can increase agricultural sustainability by performing tasks that cannot be accomplished by large over-canopy equipment [1]. Although recent research efforts have made noteworthy progress on developing trustworthy autonomy for robot navigation [2]-[5], robots may still fail out in the field due to the environmental complexity, terrain variability, and sensor uncertainty in field environments. Without a system that detects anomalous behaviors before failures occur, collisions may damage both robots and plants. Detecting such anomalous behaviors can stop the robot from entering failure modes, thus providing opportunities for executing recovery maneuvers and proceeding with the task.
Deep-learning-based anomaly detection (AD) algorithms have been widely adopted in robotic applications [6]. Many previous works approached the AD problem in a reactive manner [7]-[10] (i.e., an anomaly is detected when the current system state reveals a different pattern from that of past successful experiences). Such reactive anomaly detectors make an inference merely based on the current sensory signals (e.g., velocity, torque, LiDAR readings) and lack the ability to predict potential failures in the future. As a result, due to the alert delay, the robot may still be damaged by collisions (Figure 1b) or enter critical states (Figure 1c) from which recovery is beyond the robot's autonomy. An alternative solution is proactive anomaly detection, which predicts the probability of future failure based on the planned actions and the current sensory observation. Such a predictive model has been explored in LaND [11] and BADGR [12] to choose optimal actions for outdoor navigation. However, the AD problem for robot navigation through natural field environments introduces challenges which are usually not considered when deploying autonomous systems in common outdoor environments. First, the perception system and control system both exhibit high uncertainty during operation. Useful features for AD tasks (e.g., relative position of the robot with respect to the crop rows) are usually buried in noisy sensory signals due to weeds, lodged plants, and low-hanging leaves (Figure 2a). Meanwhile, the actions (e.g., linear and angular velocity) executed by the robot are constantly corrupted by varying wheel-terrain interactions [2], introducing high variance in control signals that can be problematic for pattern recognition. Second, the frequent sensor occlusion makes it difficult for the robot to perceive the environment (Figure 2b). Anomaly detectors relying on a single sensor modality [7], [11], [12] can be easily fooled due to the lack of a robust perception system.
In this paper, we approach the proactive anomaly detection problem by identifying anomalous behaviors conditioned on the current observations. Formally, we define an anomalous behavior for robot navigation as a sequence of future motions which contains at least one time step with failure within the prediction horizon. Such future motion can be represented as a set of control actions or a planned path. We introduce a ProActive Anomaly Detection network, which we call PAAD, that reasons about the probability of failure at each time step within the future time horizon by leveraging the planned motions from the predictive controller and the current observation from the perception system. Features from different modalities are extracted independently and fused in two stages to generate the final probability of failure. We train PAAD with a mixed cost function, consisting of a prediction task and a reconstruction task, to improve the generalization capability and increase the robustness against noisy sensory signals.
Fig. 2 (a, b): Field environment. The robot perceives the environment through a forward-facing camera and a 2D LiDAR. The blue triangle in the 2D point cloud denotes the robot. Weeds and low-hanging leaves introduce high uncertainty in sensory signals and can block the sensor view as the robot navigates under canopy.
Our contributions can be summarized as follows: (1) We propose a novel deep neural network architecture called PAAD, which effectively fuses multi-sensor signals for robust perception in unstructured and uncertain environments. (2) We employ a low-variance image representation of planned motions, as opposed to raw control actions, to realize proactive anomaly detection and to facilitate efficient feature extraction from noisy signals. (3) Our proposed detector outperforms existing methods in failure identification performance on an offline real-world navigation dataset and is able to catch anomalous behaviors online while maintaining a low false detection rate in a real-time test.

II. RELATED WORK
Anomaly detection, also known as outlier detection or novelty detection, is an important problem that has been studied within diverse research areas and application domains [6], [13]. In robotics, AD has been used to detect failures of manipulation tasks [14], [15] and navigation tasks [7], [16].
Recent research efforts have made noteworthy progress in developing learning-based AD algorithms. Malhotra et al. introduce an LSTM-based encoder-decoder scheme for multi-sensor anomaly detection (EncDec-AD) that uses reconstruction error to detect anomalies [10]. Park et al. propose a multimodal LSTM-based variational autoencoder (LSTM-VAE) that fuses sensory signals and reconstructs their expected distribution. A reconstruction-based anomaly score is then used to detect anomalies [8]. Our previous work casts the AD problem as a multi-class classification problem and proposes the use of a supervised variational autoencoder model (SVAE) [7]. However, these reactive methods lack the ability to detect anomalous behaviors before failures, and thus safety is not necessarily enhanced [17].
In the domain of proactive anomaly detection / predictive collision avoidance, the most similar work to PAAD is the predictive model for future navigational events (e.g., collision) proposed in LaND [11] and BADGR [12]. The neural network takes as input an image and a sequence of future control actions, and predicts the probabilities of collision for each time step within the prediction horizon. The model has been shown to have reliable anomaly detection performance on sidewalks and off-road environments with large open space. However, such a network suffers from sensor occlusion due to the unimodal input and can struggle with learning useful features due to the input uncertainty. In this work, we make use of both camera and LiDAR data to improve the robot's perception capability, and use the image representation of the planned path rather than noisy control actions to facilitate the efficient training of the model.
Another widely explored research area that is relevant to our work is traversability analysis in unstructured environments. Terrain traversability analysis can be referred to as the problem of estimating the difficulty of driving through a terrain for a ground vehicle [18]. Bekhti et al. use terrain images and acceleration signals to train a Gaussian process regressor in order to predict vibrations using only image texture features [19]. Maturana et al. propose a real-time mapping strategy that provides a 2.5D grid map centered on the vehicle frame, encoding both geometry and semantic information of the environment [20]. Despite their similarity in methodology, traversability estimation and anomaly detection aim at fundamentally different tasks: navigating over a traversable terrain does not imply that the robot behavior is not anomalous. In field environments, for example, a trajectory that drives off the trail from one to another due to large gaps between crops can be collision free and incur no additional traversal cost; however, such behavior should be classified as an anomaly as the robot is deviating from the specified navigation task.
As an emerging research theme, camera-lidar fusion has been applied to diverse research areas in robotics and autonomous driving [21]. Typical applications include depth completion [22], object detection [23], object tracking [24], and simultaneous localization and mapping (SLAM) [25]. However, one typical assumption that these application domains make is that the camera and LiDAR data are consistent (i.e., the perceived worlds from the two modalities can be matched with each other). The cluttered environment in fields breaks such an assumption, as one of the sensors can be occluded frequently, and thus poses extra challenges in applying previous techniques. In agricultural settings, the perception error from sensor occlusion is often viewed as noise and is handled by a Kalman filter [4], [5]. Although such a filtering approach can overcome the problem of occlusion to some extent, the required assumption (e.g., the center line is free of obstacles) does not always hold in the real world. In this work, we develop a novel sensor fusion mechanism to combat sensor occlusion in cluttered environments.

III. METHOD
Our goal is to develop an AD module that enables a mobile robot to detect anomalous behaviors proactively during navigation in the field.
We assume that the sensor observations at time $t$, $o_t$, are multi-modal, consisting of an RGB image $x_c \in \mathbb{R}^{H \times W \times C}$ from the camera and range readings $x_l \in \mathbb{R}^{L}$ from the 2D LiDAR. The robot employs a predictive controller which plans a sequence of actions for the next $T$ time steps in a receding horizon manner. We further retrieve the current planned path from subsequent actions and represent the resulting path as a separate image $p_t \in \mathbb{R}^{H \times W \times 1}$, in which the path is projected onto a blank front-view image plane. At each time step, the task for the AD module is to map from a set of current sensor observations and a planned path $(o_t, p_t) \in \mathbb{R}^{H \times W \times C} \times \mathbb{R}^{L} \times \mathbb{R}^{H \times W \times 1}$ to a sequence of probabilities of navigation failure $\hat{y}_{t:t+T} := (\hat{y}_t, \hat{y}_{t+1}, \ldots, \hat{y}_{t+T-1})$ for each time step along the path, as shown in Figure 3.
Compared to the existing reactive anomaly detection method, PAAD is able to make use of the modality from the planning module to detect anomalous behaviors. Such a proactive nature of PAAD alerts the robot before entering critical states from which human interventions are required to recover the robot. Furthermore, the effective fusion of multimodal perception signals provides robustness against uncertainty and sensor occlusion in complex field environments. By contrast, false detection of an anomaly can be triggered frequently due to camera occlusion in anomaly detectors that use unimodal perception signals [11], [12]. Lastly, the adopted image representation of the planned path possesses less variance than the raw control actions, thus leading to a more efficient training procedure.
In the following sections, we will describe the data collection process, model architecture, and training procedure.

A. Data Collection
The TerraSentia robot is an ultra-compact 4-wheeled skid-steering mobile robot designed to drive through crop rows for automated phenotyping [2]. The robot is equipped with a forward-facing monocular camera sensor (OV2710) and a 2D horizontal-scanning LiDAR (Hokuyo UST-10LX) which covers a 270° range with 0.25° angular resolution. The observation $o_t$ is defined by a 240×320 RGB image and a vector of LiDAR ranges of dimension 1081. The image representation of the planned path $p_t$ is generated from the output of the onboard model predictive controller using perspective projection. The ground truth probability of failure $y_t$ is a binary number indicating whether the robot is in a normal state or fails the navigation task.
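As a quick consistency check on the sensor description, the sketch below recomputes the LiDAR vector dimension from the stated field of view and angular resolution. It assumes that both end angles of the scan are included and that the scan is symmetric about the robot heading; the function names and the sign convention are illustrative only.

```python
FIELD_OF_VIEW_DEG = 270.0        # from the sensor description
ANGULAR_RESOLUTION_DEG = 0.25    # from the sensor description


def num_lidar_beams(fov_deg: float, resolution_deg: float) -> int:
    """Number of range readings when both end angles of the scan are included."""
    return int(round(fov_deg / resolution_deg)) + 1


def beam_angle_deg(index: int,
                   fov_deg: float = FIELD_OF_VIEW_DEG,
                   resolution_deg: float = ANGULAR_RESOLUTION_DEG) -> float:
    """Beam angle relative to straight ahead (0 deg), assuming a symmetric scan."""
    return -fov_deg / 2.0 + index * resolution_deg


L = num_lidar_beams(FIELD_OF_VIEW_DEG, ANGULAR_RESOLUTION_DEG)
print(L)  # 1081
```

Under these assumptions the count matches the 1081-dimensional range vector used as the LiDAR observation.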
During data collection, the robot executes an autonomous control policy: in our case, the LiDAR-based navigation algorithm for agricultural mobile robots [26]. Once the robot enters a failure mode, the human disengages the autonomy, repositions the robot to the center line, and then reactivates the autonomy. We define the failure mode as any state upon entering which the robot is not able to continue the specified navigation task (e.g., following a crop row) without human intervention.
The robot collects the observations, planned paths, and drive modes $(o_t, p_t, y_t)$ at each time step $t$. We note that PAAD does not require any additional data beyond what is typically collected for testing the robot autonomy. In fact, the data collection process described above is not dedicated to PAAD but to the testing of LiDAR-based autonomy for agricultural robots [4], [26].

B. Model Architecture
We denote PAAD as a function $g : (o_t, p_t) \to \hat{y}_{t:t+T}$, which takes as input a set of current observations and a planned path $(o_t, p_t)$ and outputs a sequence of probabilities of failure $\hat{y}_{t:t+T}$ within the prediction horizon.
The network structure is shown in Figure 3. Feature generators (FGs) are designed independently for each modality to extract robust features from different inputs. To strengthen the perception capability in harsh and cluttered agricultural fields, we adopt feature-level camera-lidar fusion, as opposed to signal-level fusion, which can struggle with inconsistency in perception signals due to frequent occlusion of one of the sensors. As the final output, the probabilities of navigation failure in the next $T$ time steps are evaluated on the planned path, conditioned on the current observations.
1) Feature Generator: The planned path and RGB image are processed by two separate convolutional pipelines to generate path features $f_{path}$ and camera features $f_{camera}$, respectively. Each CNN module is followed by a flattening operation. For the path image, we crop according to the region of interest (ROI) so that the model is not given non-essential pixels that never contain the actual path (e.g., the pixels above the horizon line are always black).
To extract features from the LiDAR point cloud, we borrow the idea from SVAEs [7]. The reconstruction task in the LiDAR pipeline serves as a regularization [27]-[29], which forces the encoder to learn representative features of high-dimensional LiDAR data that are critical to both the downstream inference and generative model. With additional attention on the reconstruction task, the model tends to improve the generalization performance on the inference task [7]. We approximate the posterior distribution of the latent variable $z \in \mathbb{R}^{d}$ as a Gaussian with variational parameters $\phi$: where $\mu_\phi(x_l)$ is a mean vector, $\sigma_\phi(x_l)$ is a variance vector, and the nonlinear transformations $\mu_\phi : \mathbb{R}^{L} \to \mathbb{R}^{d}$ and $\sigma_\phi : \mathbb{R}^{L} \to \mathbb{R}^{d}$ are parameterized by multilayer perceptrons (MLPs) in the encoder. For the downstream prediction task, we choose LiDAR features as: For the reconstruction task¹, the decoder uses a generative model of the form: where $\mathrm{MLP}(z; \theta)$ is a mean vector formed by a nonlinear transformation of the latent variable $z$, and $\sigma$ is a hyperparameter.
Here, we choose the nonlinear transformation to be an MLP parameterized by $\theta$. Note that the reconstruction branch in the LiDAR pipeline follows the structure of a vanilla variational autoencoder (VAE).
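The encoder described above can be illustrated with a minimal, dependency-free sketch. It assumes a single hidden layer, a softplus activation to keep the variance vector positive, and tiny toy dimensions; the helper names (`init_params`, `encode`, `sample_latent`) and the random weights are hypothetical and are not taken from the released code.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output per (row of weights, bias entry)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softplus(v):
    """Numerically stable softplus, used here to keep variances positive."""
    return [max(a, 0.0) + math.log1p(math.exp(-abs(a))) for a in v]


def init_params(n_in, n_hidden, n_latent, rng):
    """Random toy weights; a trained model would learn these."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_h": matrix(n_hidden, n_in), "b_h": [0.0] * n_hidden,
        "W_mu": matrix(n_latent, n_hidden), "b_mu": [0.0] * n_latent,
        "W_var": matrix(n_latent, n_hidden), "b_var": [0.0] * n_latent,
    }


def encode(x_l, params):
    """Map LiDAR ranges to the mean and variance vectors of a diagonal Gaussian."""
    hidden = relu(linear(x_l, params["W_h"], params["b_h"]))
    mean = linear(hidden, params["W_mu"], params["b_mu"])
    var = softplus(linear(hidden, params["W_var"], params["b_var"]))
    return mean, var


def sample_latent(mean, var, rng):
    """Reparameterized sample: z = mean + sqrt(var) * eps with eps ~ N(0, 1)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, var)]


rng = random.Random(0)
params = init_params(n_in=4, n_hidden=3, n_latent=2, rng=rng)  # toy sizes only
mean, var = encode([0.4, 0.5, 0.3, 0.6], params)
z = sample_latent(mean, var, rng)
```

Writing the sample as a deterministic function of the mean, the variance, and an external noise source is what allows gradients to flow through the sampling step during training.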
2) Fusion Module: To form observation features from sensors, we employ a feature-level camera-lidar fusion by using a multi-head attention (MHA) with a residual connection [30]: which corresponds to the attention module in Figure 3. The query, key, and value are chosen identically to be the concatenation of $f_{camera}$ and $f_{lidar}$, which can be viewed as a sequence of length 2. We choose an MHA over an MLP for camera-lidar fusion because we expect the model to generate observation features based on the signal quality of each sensor. For example, in cases where the camera is blocked by leaves while the LiDAR view is clear, the point cloud should contribute more to observation features than the image.
¹ During test time, the reconstruction branch is abandoned and LiDAR features $f_{lidar}$ are forwarded to the fusion module for the prediction task.
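To make the idea of attention over a two-element sequence concrete, the following sketch applies self-attention with a residual connection to a pair of feature vectors. It is a deliberate simplification: a single head, no learned query/key/value projections, and no normalization layers, so it is not the MHA module of the paper. The feature values are made up for illustration.

```python
import math


def softmax(scores):
    """Softmax with max-subtraction for numerical stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention_with_residual(tokens):
    """Scaled dot-product self-attention over equal-length feature vectors.

    Query, key, and value are all the input tokens themselves; the attended
    result is added back to each token (residual connection).
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [dot(query, key) / math.sqrt(dim) for key in tokens]
        attn = softmax(scores)
        mixed = [sum(a * value[i] for a, value in zip(attn, tokens)) for i in range(dim)]
        outputs.append([q + m for q, m in zip(query, mixed)])
        weights.append(attn)
    return outputs, weights


# Hypothetical features: a sequence of length 2, [f_camera, f_lidar].
f_camera = [0.1, 0.0, 0.1, 0.0]
f_lidar = [1.0, 0.5, -0.5, 1.0]
fused, attention = self_attention_with_residual([f_camera, f_lidar])
```

Each row of `attention` sums to one, so the output for every token is a convex combination of the camera and LiDAR features plus the token itself; in a trained model, learned projections would shape how strongly each sensor contributes.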
The final fusion of observation features and path features at time t produces the predicted probability of failure in the next T time steps: A sigmoid function is used to ensure that the final output probabilities are scaled into the valid range.

C. Training
The ImageCNN in the camera pipeline uses a ResNet-18 backbone pretrained on a visual navigation task, in which the network learns to predict robot heading and placement in a crop row using a front-view RGB image [5]. We construct the ImageCNN module by truncating the visual navigation model right before the fully connected layers. The weights of the ImageCNN are fixed after pretraining.
Denoting the dataset collected in Section III-A by $D$, we specify the overall loss function for PAAD as: where $\mathcal{L}_{BCE}$ is the binary cross-entropy loss, $\alpha$ is a hyperparameter controlling the relative weight between the discriminative and generative learning, and $p_\theta(z)$ is a prior distribution over the latent variable $z$. As in SVAEs, we choose $p_\theta(z)$ to be a standard Gaussian distribution $z \sim \mathcal{N}(0, I)$.
The training objective consists of a prediction task and a reconstruction task. The first term in equation (6) penalizes the prediction error. We set $\alpha = 0.1 \cdot N$, where $N$ is the total number of datapoints, as in [7]. The last two terms in the loss function, which together form the negative of the evidence lower bound (ELBO) in vanilla VAEs, penalize the reconstruction error of LiDAR data. The final KL divergence term can be viewed as a regularization.
The inference model and the generative model can be optimized jointly by stochastic gradient descent of the overall objective function (6). To enable the backpropagation through the sampling layer within the network, a common reparameterization trick is used to move the sampling process to a stochastic input layer [31].
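The structure of the mixed objective can be sketched for a single datapoint as follows. This is only an illustration under stated assumptions: the weight `alpha` multiplies the prediction term, the decoder likelihood is an isotropic Gaussian with standard deviation `sigma`, the expected reconstruction term is approximated with one latent sample, and the posterior is a diagonal Gaussian. The exact form and normalization of equation (6) may differ, and all function names are hypothetical.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the prediction horizon."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def gaussian_nll(x, x_mean, sigma):
    """Negative log-likelihood of x under N(x_mean, sigma^2 I)."""
    n = len(x)
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_mean))
    return 0.5 * squared_error / sigma ** 2 + n * math.log(sigma) + 0.5 * n * math.log(2.0 * math.pi)


def kl_to_standard_normal(mean, var):
    """KL( N(mean, diag(var)) || N(0, I) ) in closed form."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))


def mixed_loss(y_true, y_prob, x_l, x_recon_mean, z_mean, z_var, alpha, sigma):
    """Weighted prediction loss plus a one-sample estimate of the negative ELBO."""
    prediction = binary_cross_entropy(y_true, y_prob)
    negative_elbo = gaussian_nll(x_l, x_recon_mean, sigma) + kl_to_standard_normal(z_mean, z_var)
    return alpha * prediction + negative_elbo
```

The KL term vanishes when the posterior equals the standard Gaussian prior, which is why it acts as a regularizer pulling the latent code toward the prior.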

IV. EXPERIMENTAL RESULTS
In our experiments, we evaluate the anomaly detection performance of PAAD on 4.1 km of real-world navigation data collected with the TerraSentia robot in corn fields from September to October 2020. The robot navigates between rows of crops under cluttered canopy without damaging the plants. Depending on the environmental conditions, the robot may or may not enter a failure mode in a run. The reference speed for the robot is set to 0.6 m/s, and two consecutive points on the planned path from the onboard MPC are 0.2 meters apart. After the data collection, we subsample the data to 3 Hz so that the ground truth probability of failure is aligned in time with the predicted one along the planned path. For all proactive anomaly detectors, we use a prediction horizon of $T = 10$ time steps (i.e., a lookahead distance of 1.8 meters). A subset of our dataset² is visualized in Figure 4. To alleviate the negative effect on the evaluation of different models introduced by the covariance between datapoints closely related in time, we construct the training set and test set from experiments on independent days. The training set consists of 29292 datapoints and contains 2258 anomalous behaviors collected over five days, while the test set consists of 6869 datapoints and contains 689 anomalous behaviors from data collected on two additional days. We perform under-sampling of normal cases and over-sampling of anomalous cases on the training set to balance the learning of both types of behaviors while keeping the test set unchanged.
² The data were collected in part at the Illinois Autonomous Farm.
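The numbers in this setup can be cross-checked with a few lines of arithmetic. The sketch below assumes that the robot travels at the reference speed, that the lookahead distance is measured from the first to the last of the T path points, and that each anomalous behavior corresponds to one datapoint when computing class fractions.

```python
REFERENCE_SPEED_MPS = 0.6      # reference speed from the text
PATH_POINT_SPACING_M = 0.2     # spacing between consecutive planned path points
HORIZON_STEPS = 10             # prediction horizon T

# One path point is reached every spacing / speed seconds, which gives the
# rate at which ground-truth labels line up with the planned path points.
sampling_rate_hz = REFERENCE_SPEED_MPS / PATH_POINT_SPACING_M

# T points span T - 1 intervals.
lookahead_m = (HORIZON_STEPS - 1) * PATH_POINT_SPACING_M

# Class imbalance, assuming one datapoint per anomalous behavior.
train_anomalous_fraction = 2258 / 29292
test_anomalous_fraction = 689 / 6869

print(f"{sampling_rate_hz:.1f} Hz, {lookahead_m:.1f} m")                      # 3.0 Hz, 1.8 m
print(f"{train_anomalous_fraction:.3f}, {test_anomalous_fraction:.3f}")       # 0.077, 0.100
```

Under that assumption, fewer than one in ten datapoints is anomalous, which motivates the resampling applied to the training set.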
In experiments using PAAD, we construct the PathCNN with 3 convolutional layers with filter numbers {8, 16, 32}, filter size 3 × 3, and stride 2. Each convolutional layer is followed by a max pooling layer. We implement the ImageFC with one hidden layer of 64 hidden units. As in SVAEs, the encoder in the LiDAR pipeline is constructed with one hidden layer of 128 hidden units, and the decoder mirrors the structure of the encoder. We choose a latent space of dimension 32 ($z \in \mathbb{R}^{32}$). In the fusion module, the MHA has 8 attention heads and the FusionFC has 2 hidden layers with {128, $T$} hidden units. ReLU activation functions are applied, and an Adam optimizer with a constant learning rate of 0.0005 is used to train the network.
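For reference, the hyperparameters listed above can be gathered into a single configuration object. The sketch below only restates values given in the text; the class name and field names are hypothetical, and details that the text does not specify (padding, pooling size, ROI crop size) are intentionally left out.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class PAADConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    horizon: int = 10                                  # prediction horizon T
    path_cnn_filters: Tuple[int, ...] = (8, 16, 32)    # filters per conv layer
    path_cnn_kernel: int = 3                           # 3 x 3 filters
    path_cnn_stride: int = 2
    image_fc_hidden: int = 64
    lidar_dim: int = 1081                              # LiDAR range vector size
    lidar_hidden: int = 128                            # encoder hidden units
    latent_dim: int = 32
    attention_heads: int = 8
    learning_rate: float = 5e-4                        # constant, Adam

    @property
    def fusion_fc_units(self) -> Tuple[int, int]:
        """FusionFC layer widths {128, T}."""
        return (128, self.horizon)


config = PAADConfig()
print(config.fusion_fc_units)  # (128, 10)
```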

A. Baselines and Numerical Evaluation
We evaluate the performance of the proposed method on the test set, along with the following baseline methods:
• CNN-LSTM: An image-based, action-conditioned convolutional recurrent deep neural network introduced in LaND [11] and BADGR [12]. An LSTM unit, initialized by image features generated by a backbone convolutional network, sequentially processes each of the future $T$ control actions and outputs the corresponding predicted probability of failure.
• Cui et al. [32]: A feedforward convolutional neural network processing an image and the robot's actions for behavior prediction.
• NMFNet [33]: A multimodal fusion network for robot navigation in complex environments. To evaluate the future probability of failure, we take the two branches that handle LiDAR data and 2D images to process sensor observations and replace the branch of 3D point cloud with an MLP that processes the robot's actions.
To our knowledge, our work is the first to experiment with sensor fusion of raw camera and LiDAR data for proactive anomaly detection, and the above baselines are state-of-the-art methods for either anomaly detection tasks using unimodal perception signals or related tasks using multimodal perception signals. For a fair comparison, we implement all the backbone convolutional neural networks used across different methods for the camera image as the ResNet-18 pretrained on the visual navigation task, as described in Section III-C. All methods are trained on the same dataset.
Quantitatively, we compare different methods using the following two metrics:
• F1-score: A comprehensive threshold-dependent index considering precision $P$ and recall $R$, which can be expressed as $2PR/(P + R)$. We set the threshold to 0.5, i.e., we declare a navigation failure if the predicted probability of failure is greater than that of being "normal" at a point in time.
• PR-AUC: A threshold-independent index indicating the area under the Precision-Recall curve. PR-AUC describes the ability of anomaly detection models to distinguish between positive and negative samples.
We further employ kernel density estimation [34] to fit probability density functions (pdfs) for normal and failure samples on the test set, respectively. We use a Gaussian kernel and apply the transformation trick [35] to make sure that the estimated pdfs have support on [0, 1].
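The two metrics can be computed with the short sketch below. The F1-score follows the definition above with a 0.5 threshold. For PR-AUC, the sketch uses average precision, a common step-wise estimator of the area under the precision-recall curve; this estimator, the lack of tie handling, and the toy labels and scores are assumptions for illustration and may differ from the evaluation code used for the reported results.

```python
def f1_score(y_true, y_prob, threshold=0.5):
    """F1 = 2PR / (P + R), declaring a failure when the probability exceeds the threshold."""
    tp = fp = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)


def pr_auc(y_true, y_prob):
    """Average precision: mean of the precision values at each recalled positive."""
    ranked = sorted(zip(y_prob, y_true), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(y_true)
    if n_positive == 0:
        return 0.0
    tp = 0
    area = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            tp += 1
            area += tp / rank  # precision at this recall step
    return area / n_positive


# Toy labels (1 = failure) and predicted failure probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.6, 0.7, 0.4, 0.2, 0.55]
print(round(f1_score(labels, scores), 4))  # 0.5714
print(round(pr_auc(labels, scores), 4))    # 0.8667
```

In the toy example, two of the three failures are caught at the 0.5 threshold at the cost of two false alarms, which gives precision 0.5 and recall 2/3.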
The results are presented in Table I and Figure 5. As shown, PAAD achieves the best F1-score and highest PR-AUC with a large margin over other baselines. Although the CNN-LSTM model has been shown to have reliable anomaly detection performance for navigation tasks on sidewalks and off-road environments with large free space [11], [12], the method has not been shown to generalize well to harsh and cluttered field environments with limited open space. We argue that this is because the control actions in such uncertain environments have high variance, making the network struggle to distinguish truly anomalous actions from noise. In fact, all three baselines, which take the future control actions as input, make overconfident predictions for false positives and false negatives, as shown in Figure 5. As a result, these three models in general show inferior F1-score and PR-AUC compared to PAAD, which makes use of the image representation of the planned path. Despite an additional sensor modality from LiDAR, NMFNet fails to provide a solid improvement over unimodal approaches, which highlights the importance of a robust feature generator and fusion mechanism in highly uncertain environments.
Table I: F1-score and PR-AUC of different methods on the test set.
Method                 F1-score   PR-AUC
CNN-LSTM [11], [12]    0.5352     0.6988
Cui et al. [32]        0.5748     0.7468
NMFNet [33]            0.5651     0.7554
PAAD (ours)            0.6453     0.8281
Figure 6 shows the anomaly detection results of different methods in several challenging scenarios. In the first row, the LiDAR-based navigation algorithm falsely predicts the orientation of the crop rows, making the robot take a left turn. As is further illustrated in Figure 7, CNN-LSTM and NMFNet make the prediction of navigation failure merely based on the image without considering the future behavior, thus refusing to declare failures in such a clear image near the center line. Cui et al. [32] successfully detects a failure at the end of the path; however, the failure alert is too late to prevent the catastrophic collision.
By contrast, the start time of the collision is more accurately predicted by PAAD. The second row shows a near-miss case where the robot manages to recover to the center line from the edge. Although PAAD falsely predicts a failure at the last point with a score of 0.52, most of the path is correctly classified as normal. However, all the other three methods generate overconfident scores for the entire path. The last row shows a normal case where the robot is tracking the center line while the camera is occluded by low-hanging leaves. The three baselines all fail, while PAAD successfully distinguishes such normal behavior from an anomalous one.
To further verify our hypothesis that noisy actions, as opposed to planned paths, hinder the network from learning useful features of the robot's behavior, we feed an image and several sequences of actions / paths sampled from the test set through different models to predict the probability of failure within the horizon. As shown in Figure 7a, the three networks based on control actions always predict normal behaviors no matter what the future motion looks like, which indicates that the models are only making use of the image for anomaly detection. By contrast, PAAD can predict navigation failures based on the planned path, thus producing more promising results, as shown in Figure 7b.
We further conduct an ablation study to reveal the benefit of different components in PAAD. The ablated versions of PAAD that we consider include:

B. Real-Time Test
To test the ability of PAAD to alert the robot before executing an anomalous behavior, we further perform a real-time anomaly detection task on additional data³. In this experiment, the robot was driven by the vision-based navigation algorithm [5]   common field environment and 550 m of densely weedy environment. Three and eight human interventions were required to reset the robot after an anomaly occurred in the common and weedy environments, respectively. We define the current anomaly score as a linear combination of probabilities of failure within the prediction horizon: $$s_t = \beta \sum_{k=0}^{T-1} \gamma^{k} \hat{y}_{t+k},$$ where $\gamma$ is a discount factor compensating for the uncertainty in the future, and $\beta$ is a scaling factor ensuring that the summation $\beta \sum_{k=0}^{T-1} \gamma^{k}$ equals 1. At each time step $t$, we declare an anomaly if $s_t$ is greater than 0.5.
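A discounted anomaly score of this kind can be sketched as below. The sketch assumes the score is a normalized, geometrically discounted sum of the predicted failure probabilities, with the scaling factor chosen so that the weights sum to one; the discount value 0.9 is a placeholder, not a value reported in the text.

```python
def anomaly_score(failure_probs, gamma):
    """Discounted, normalized combination of per-step failure probabilities.

    failure_probs[k] is the predicted probability of failure k steps ahead.
    """
    weights = [gamma ** k for k in range(len(failure_probs))]
    beta = 1.0 / sum(weights)  # scaling so that the weights sum to 1
    return beta * sum(w * p for w, p in zip(weights, failure_probs))


def is_anomaly(failure_probs, gamma=0.9, threshold=0.5):
    """Declare an anomaly when the score exceeds the threshold."""
    return anomaly_score(failure_probs, gamma) > threshold


# With gamma < 1, a failure predicted soon weighs more than one predicted late.
near = anomaly_score([1.0] + [0.0] * 9, gamma=0.9)
far = anomaly_score([0.0] * 9 + [1.0], gamma=0.9)
```

Because the weights are normalized, the score stays in [0, 1] whenever the inputs are probabilities, so the same 0.5 threshold remains meaningful for any horizon length.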
To calibrate the difficulty of the task, we implement a LiDAR baseline for the real-time test. Given range measurements within the forward-facing 90° field of view, we declare an anomaly if 85% of the view is blocked by objects within 0.3 meters. We also compare PAAD against a unimodal approach, Cui et al. [32], and a multimodal approach, NMFNet [33], from Section IV-A. To increase the robustness against frequent occlusions of camera and LiDAR sensors in cluttered field environments, all the anomaly detectors declare an anomaly only when 3 consecutive anomaly scores are over 0.5. We implement all the methods at a frequency of 10 Hz.
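The LiDAR baseline rule and the three-consecutive-scores filter can be sketched as follows. The sketch assumes that the middle beam of the scan points straight ahead, that "blocked" means a range reading of at most 0.3 meters, and that invalid readings need no special handling; the function and class names are hypothetical.

```python
def lidar_baseline_alarm(ranges, fov_deg=270.0, forward_fov_deg=90.0,
                         max_range_m=0.3, blocked_fraction=0.85):
    """True if enough of the forward field of view is blocked at close range."""
    n = len(ranges)
    resolution_deg = fov_deg / (n - 1)
    center = (n - 1) // 2  # assumed to point straight ahead
    half_width = int(round(forward_fov_deg / 2.0 / resolution_deg))
    window = ranges[center - half_width: center + half_width + 1]
    blocked = sum(1 for r in window if r <= max_range_m)
    return blocked / len(window) >= blocked_fraction


class ConsecutiveAlarm:
    """Declares an anomaly only after `required` consecutive scores over the threshold."""

    def __init__(self, required=3, threshold=0.5):
        self.required = required
        self.threshold = threshold
        self.count = 0

    def update(self, score):
        self.count = self.count + 1 if score > self.threshold else 0
        return self.count >= self.required


alarm = ConsecutiveAlarm()
decisions = [alarm.update(s) for s in [0.6, 0.7, 0.4, 0.6, 0.6, 0.6]]
print(decisions)  # [False, False, False, False, False, True]
```

At 10 Hz, requiring three consecutive scores delays an alert by roughly 0.2 seconds after the first score crosses the threshold, in exchange for suppressing single-frame spikes caused by momentary occlusion.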
Table III summarizes the results. As shown, PAAD is able to detect anomalies reliably in both environments while maintaining a low false detection rate. By contrast, the three baselines struggle with sensor occlusions and noisy actions in such cluttered and uncertain environments, and thus frequently interrupt the navigation system during the normal operation of the robot. Furthermore, we observe that PAAD is able to capture some rare failure modes, such as driving off the trail due to large gaps between crops. Scenarios in which PAAD failed usually contain dense weeds on the path and/or the robot executing near-miss maneuvers (see video). The detection of these anomalies could potentially be improved with additional data. Lastly, the reliable anomaly detection performance of PAAD shown in the LiDAR-based navigation system (Section IV-A) and the vision-based navigation system (Section IV-B) indicates that our method is agnostic to the underlying controller and can be applied to general systems that employ predictive control.

V. CONCLUSION
In this work, we presented a proactive anomaly detection method for robot navigation in challenging field environments using multi-sensor signals. Our approach predicts the probability of future failure based on the planned path and the current sensor observation. By introducing a feature-level camera-lidar fusion, the detector successfully detected navigation failures in agricultural environments with higher F1-score and PR-AUC than previous state-of-the-art methods. We also demonstrated the reliable anomaly detection performance of PAAD with few false alarms in the real-time test. Although our method showed robustness in uncertain environments, false detection is unavoidable when both camera and LiDAR are blocked. Active perception, which encourages the robot to collect richer sensory signals through additional interaction with the environment, could decrease perception uncertainty in such cases of full sensor occlusion and is a direction for future work.