Benchmark Analysis of Semantic Segmentation Algorithms for Safe Planetary Landing Site Selection

This paper presents an in-depth analysis of state-of-the-art semantic segmentation algorithms applied to spacecraft safe planetary landing via hazard detection and avoidance. Several architectures are trained from binary safety maps and the rich dataset of the High-Resolution Imaging Science Experiment (HiRISE) on board the Mars Reconnaissance Orbiter, for the sake of realism. The study compares several metrics, such as recognition accuracy, computational complexity, model complexity, and inference time. The proposed performance indices and their combinations are analyzed and discussed. The experiments were performed using a Raspberry Pi 4B, a relevant commercial-off-the-shelf microcontroller surrogate of NASA's High-Performance Spaceflight Computer (HPSC), which is expected to play a central role in space exploration over the next decades. This paper allows researchers to know what has been tested on the subject and serves as a catalog for users to pick the most relevant architecture for their own application.


I. INTRODUCTION
Safe landing is by far the most critical part of every space mission aiming at conducting experiments on the ground. First performed by humans during the Apollo program with Apollo 11 [1], it switched to fully autonomous with the exploration of Mars as distances increased. Engineers even call the entry, descent, and landing phase on Mars the seven minutes of terror: it takes the spacecraft 7 minutes to land on the planet, and because of the communication delay, the outcome can only be known after the landing has occurred.
At first, these landers, developed and managed by NASA's Jet Propulsion Laboratory (JPL), had a predefined trajectory in their flight computer, and diversion was not possible. This was the case for all landers before the year 2020, with Viking 1 & 2 [2], [3], Pathfinder and its Sojourner rover [4], Spirit [5], Opportunity [6], Phoenix [7], Curiosity [8], and InSight [9]. On February 18, 2021, Terrain Relative Navigation (TRN), a revolutionary technique able to land safely and autonomously between hazards, was first tested in real conditions during the Mars 2020 mission carrying the Perseverance rover and the Ingenuity helicopter [11], as shown in Fig. 1. The lander would take photos during its descent, compare them to its orbital map, and divert if necessary.
The associate editor coordinating the review of this manuscript and approving it for publication was Kumaradevan Punithakumar.
Limited by the capabilities of space-rated flight computers, continuous analysis of the terrain and computation of new trajectories combined with increasing degrees of safety are becoming computationally too demanding and new technologies are now required to overcome this challenge. This is why NASA's JPL, along with industry partners, is developing the High-Performance Spaceflight Computer (HPSC) [12]. A hundred times more efficient than its monolithic counterpart, it will push the boundaries of space exploration, and allow missions that were thought impossible to happen in the next decades.
On the other hand, artificial intelligence, and especially machine learning and computer vision, have been greatly explored and are today two of the most active fields studied by researchers around the world [13]. Their ability to learn optimal decisions that can be retrieved orders of magnitude faster than with classical approaches has opened the way to breakthroughs in various disciplines of modern science and engineering. Specifically, deep neural networks have enabled semantic segmentation, a computer vision technique that classifies each pixel of an image into one of a predefined set of classes, which is now starting to play a role in spacecraft safe planetary landing.
Throughout the landing of the spacecraft, many technical challenges have to be anticipated to make mission success more robust. The Martian surface, and more specifically the regions of scientific interest, are full of obstacles such as craters, cliffs, cracks and jagged boulders, which in turn set high requirements on the safety level prediction accuracy of the landing sites [14]. Furthermore, as during Hayabusa's first rehearsal descent, the original landing site may turn out to be surrounded by small but fatal hazards that cannot be detected until the vehicle is close enough to the surface [15]. This in turn constrains the algorithms to handle increased resolutions. Further, these vision processes have to fit on space-rated hardware and satisfy mission requirements such as inference time or prediction accuracy, and must therefore be memory and computationally efficient. Finally, these algorithms must be trained on realistic data and be robust to sensor noise.
The main contributions of this paper can be summarized as follows:
• Choice of the HPSC surrogate microcontroller for realistic performance predictions.
• Training on realistic and noisy data to obtain performance metrics during inference time of state-of-the-art semantic segmentation algorithms for binary safety map generation.
• Benchmark of all the algorithms and architectures in several metric spaces such as accuracy, memory consumption and inference time to find the most suitable one for the landing problem.
The structure of this paper is as follows: Section III finds a relevant commercial-off-the-shelf surrogate of the HPSC. Section IV presents the algorithms and the metrics being compared. Section V introduces and discusses the numerical results. Section VI finally draws conclusions and perspectives on this work.

II. LITERATURE REVIEW
In prior studies, the necessity of providing safe landing for autonomous systems started with the emergence of Unmanned Aerial Vehicles (UAVs). The computer vision community played an essential role with primal developments in visual odometry ranging from monocular vision [16] to stereo vision [17], [18], providing depth estimation, or feature-based methods [19]. Moreover, Simultaneous Localization and Mapping (SLAM) [20], mostly using LiDAR-based point clouds, allowed the generation of 3D maps, and thus to estimate the topology of the terrain to land on. Safe landing was also thoroughly studied from the perspective of LiDAR-based ground filtering algorithms based on the process of Digital Elevation Maps (DEMs) [21]. Many other techniques were studied such as Stereo-Ranging [22], using two cameras with different angles to analyze pixel disparity and estimate depth, Structure from Motion (SFM) [23] to reconstruct 3D terrain from 2D-image sequences, Homography Estimation and Adaptive Control (HEAC) for image rectification and registration [24], but also Color Segmentation [25] and Optical Flow [26].
Pixel classification from semantic segmentation has been used for numerous applications ranging from autonomous driving [27] and robotic navigation and localization [28] to scene understanding [29]. Semantic segmentation based on deep learning has also been applied to UAV visual landing site detection, for instance with the PSPNet architecture [30]. Recently, semantic segmentation has started playing a role in space engineering, in particular for spacecraft safe planetary landing, with networks like U-Net [31], [32]. Another main technical challenge of this task is that the state-of-the-art algorithm, namely the Autonomous Landing and Hazard Avoidance Technology (ALHAT) [33], [34], only considers local features, whereas a global understanding of the entire image could provide more insight; this is what motivates the use of semantic segmentation in this work. During the descent phase, every time the LiDAR-based cameras retrieve a point cloud of the ground, a digital elevation map is created. The semantic segmentation algorithm is then able to classify every pixel and predict its safety level, namely identify the nature of each landing site. This method has already been shown to provide better results than ALHAT [35]. The benchmark study of this paper covers the base architectures without uncertainty quantification; adding it would increase the performance, as shown in [36].

III. NASA's HIGH-PERFORMANCE SPACEFLIGHT COMPUTER's SURROGATE
To make the experiments realistic, a relevant substitute for NASA's High-Performance Spaceflight Computer (HPSC) must be chosen. This section gives the latest specifications of the HPSC and identifies a surrogate for it.
To do so, the comparison is based on FLOPS, a measure of how many floating-point operations a microcontroller is able to perform per second. It is a function of the number of cores, their clock frequency, the number of floating-point operations performed at each cycle, and the computing method that enables processing of multiple data with a single instruction, also known as SIMD capabilities. The FLOPS are then defined by the following equation:
$$\mathrm{FLOPS} = \mathrm{cores} \times \frac{\mathrm{cycles}}{\mathrm{second}} \times \frac{\mathrm{FLOPs}}{\mathrm{cycle}},$$
where the number of FLOPs per cycle accounts for the SIMD capabilities.
In the case of NASA's HPSC, some information could be found from the preliminary design [12]. The paper's assumptions about the expected HPSC specifications are listed below.
• Cores: A minimum of eight 64b CPU cores on the HPSC System-on-Chip (SoC) die, each of which contains integer, double-precision floating-point, and vector processing capability. It is desired that all eight cores be fully coherent.
• Clock: All of the CPU cores, floating-point engines, and vector units shall operate at a minimum frequency of 1 GHz. Note that the microcontroller core used for boot, health, and configuration can operate at less than 1 GHz.
Taking into account the heavy use of parallelization in neural networks (e.g. matrix multiplications and additions), the HPSC is taken to reach 32 GFLOPS. In comparison, the Raspberry Pi 4B [37] achieves a higher theoretical maximum. This means that, to compare effectively with commercial off-the-shelf (COTS) hardware, the easiest approach is to reduce the clock frequency of the Raspberry Pi 4B until it matches the 32 GFLOPS of the HPSC. The frequency applied to the Raspberry Pi 4B is reduced accordingly.
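As a rough illustration of the FLOPS bookkeeping used in this section, the sketch below computes a theoretical peak and the clock frequency at which a surrogate would match a target. The function names are hypothetical. The value of 4 FLOPs per cycle per core for the HPSC is only inferred from the 8 cores, 1 GHz, and 32 GFLOPS quoted in the text, and the surrogate figures (4 cores, 8 FLOPs per cycle per core) are placeholder assumptions, not values taken from this paper.

```python
def peak_flops(cores: int, clock_hz: float, flops_per_cycle: float) -> float:
    """Theoretical peak: cores x cycles/second x FLOPs/cycle (SIMD included)."""
    return cores * clock_hz * flops_per_cycle


def matched_clock_hz(target_flops: float, cores: int, flops_per_cycle: float) -> float:
    """Clock frequency at which a device reaches exactly `target_flops`."""
    return target_flops / (cores * flops_per_cycle)


# HPSC: 8 cores at 1 GHz; 4 FLOPs/cycle/core is inferred from the 32 GFLOPS figure.
hpsc_flops = peak_flops(cores=8, clock_hz=1.0e9, flops_per_cycle=4)

# ASSUMPTION: placeholder surrogate with 4 cores and 8 FLOPs/cycle/core.
surrogate_clock = matched_clock_hz(hpsc_flops, cores=4, flops_per_cycle=8)

print(f"HPSC peak: {hpsc_flops / 1e9:.1f} GFLOPS")
print(f"Surrogate clock to match: {surrogate_clock / 1e9:.2f} GHz")
```

With other core counts or SIMD widths, only the two placeholder arguments need to change; the matching rule itself stays the same.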

IV. ARCHITECTURES AND METRICS
This section presents the semantic segmentation architectures and the metrics that are compared.

A. ARCHITECTURES
The following architectures are all composed of convolutional layers, which govern the overall time complexity of the network. They are based on the 2D cross-correlation operation and lead, following [38], for square inputs and kernels, to:
$$C_t = O\left(\sum_{l=1}^{d} n_{l-1} \cdot s_l^2 \cdot n_l \cdot m_l^2\right)$$
with $C_t$ the time complexity, $l$ the index of the layer, $d$ the number of convolutional layers, $n_{l-1}$ the number of input channels of layer $l$, $n_l$ the width, or number of filters, $s_l$ the spatial size of the filter, and $m_l$ the spatial size of the output.
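To make the dependence on each term concrete, the following minimal sketch evaluates the sum of n_{l-1} · s_l² · n_l · m_l² over a list of layers. It counts operations only up to a constant factor, and the layer shapes in the example are hypothetical rather than those of any benchmarked network.

```python
def conv_time_complexity(layers):
    """Sum of n_in * s^2 * n_out * m^2 over all convolutional layers.

    Each layer is a tuple (n_in, n_out, s, m):
      n_in  - number of input channels
      n_out - number of filters (width)
      s     - spatial size of the square filter
      m     - spatial size of the square output map
    """
    return sum(n_in * s ** 2 * n_out * m ** 2 for n_in, n_out, s, m in layers)


# Hypothetical two-layer stack on a 128 x 128 input (illustration only).
example_layers = [
    (1, 16, 3, 128),   # 1 -> 16 channels, 3x3 kernel, 128x128 output
    (16, 32, 3, 64),   # 16 -> 32 channels, 3x3 kernel, 64x64 output
]
print(conv_time_complexity(example_layers))
```

Because the output size enters squared, halving the spatial resolution of a layer divides its cost by four, which is why early downsampling is so effective.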

1) SegNet: 5, 4, AND 3 ENCODING-DECODING LAYERS
The first semantic segmentation model considered in this paper is SegNet [39], depicted in Fig. 2. This is an encoder-decoder network with a pixel-wise classification layer. Its encoder is similar to the VGG16 network. Unlike other models, SegNet stores the pooling indices and retrieves them during the upsampling step. In this study, three types of SegNet are benchmarked. The first one is the original, with 5 layers for the encoding and 5 others for the decoding step. It is also tested with 4 and 3 layers on each side of the low-dimensional (or latent) space to see the effect of the parameter reduction on the accuracy.

2) FCN (FULLY CONVOLUTIONAL NETWORKS)
Fully Convolutional Networks [40], as shown in Fig. 3, share the same encoding process as SegNet, but the indices are not stored to perform the upsampling. For the specific task of recovering the input size, there are several options. The first one is called FCN-32s. From the latent space, the output is directly upsampled or interpolated to the original image size. A lot of information is then lost. The second one is FCN-16s. In that case, the last step is upsampled by 2, the scores are summed with those of the step before, and the result is upsampled to the original image size.
In the final case, FCN-8s, the result of FCN-16s is upsampled by 2, the scores are added to those of the step before, and the result is upsampled. The accuracy should increase from FCN-32s, to FCN-16s, to FCN-8s, as the upsampling loses less and less information. However, it is more computationally heavy.
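The three FCN upsampling paths can be sketched on plain score grids as follows. This is only an illustration of the fusion order: nearest-neighbor repetition stands in for the interpolation or learned upsampling of an actual FCN, and all function names are hypothetical.

```python
def upsample_nearest(grid, factor):
    """Repeat every score `factor` times along both axes."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def add_scores(a, b):
    """Element-wise sum of two score grids of identical size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fcn_32s(score32):
    # Direct upsampling from the coarsest scores (stride 32).
    return upsample_nearest(score32, 32)


def fcn_16s(score32, score16):
    # Upsample by 2, fuse with the stride-16 scores, then upsample by 16.
    fused16 = add_scores(upsample_nearest(score32, 2), score16)
    return upsample_nearest(fused16, 16)


def fcn_8s(score32, score16, score8):
    # Repeat the fusion once more with the stride-8 scores, then upsample by 8.
    fused16 = add_scores(upsample_nearest(score32, 2), score16)
    fused8 = add_scores(upsample_nearest(fused16, 2), score8)
    return upsample_nearest(fused8, 8)
```

Each additional fusion step injects scores from a finer layer before the final upsampling, which is why less information is lost, at the price of more operations.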

3) ICNet (IMAGE CASCADE NETWORK)
The Image Cascade Network (ICNet) [41], represented in Fig. 4, works in several steps. First, the low-resolution images reach the end of the network and a first mapping is obtained. On top of that, medium- and high-resolution features refine it with a cascade feature fusion unit and a cascade label guidance strategy.

4) U-Net AND TransUNet
U-Net [42] is based upon FCN but should in theory provide a better precision with fewer training images. The main idea is to replace pooling operations by upsampling operators to improve the resolution of the output. Also, the skip connections (copy and crop) technique in U-Net copies the image matrix from the earlier layers (left-hand side of Fig. 5) and uses it as a part of the later layers (right-hand side layers). With this, the model preserves rich information, thus reducing information loss. The many channels in the decoder part help propagate context information to layers with higher resolution, and this symmetry leads to the U-shaped architecture.
FIGURE 5. U-Net architecture [42]. Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer Nature MICCAI 2015 [42], (2015).
The TransUNet architecture [43] adds Vision Transformers (ViT, built on the concept of self-attention) to record comprehensive localization information at all network stages while conveying the context over a long range within the network.

5) ENet (EFFICIENT NEURAL NETWORK)
ENet [44] also uses an encoder-decoder scheme and is better suited to real-time applications. The idea is to downsample early in the encoder to reduce the cost of processing large input frames. ENet also takes advantage of PReLUs as activation functions, dilated convolutions, and spatial dropout.

6) ConvDeconv
ConvDeconv [45] is a semantic segmentation network following ideas similar to those of SegNet, but with far fewer layers and smaller kernel sizes.

7) DeepLabV3 WITH A ResNet BACKBONE
The last architecture that was benchmarked is DeepLabv3 [46]. Atrous convolutions are used in cascade or in parallel to segment objects at multiple scales. State-of-the-art networks attach a ResNet (Residual Network) backbone to it, acting as its main feature extractor. There exist several variants of this tandem, obtained by changing the number of layers (ResNet-N), the number of groups and the width per group (ResNeXt), or by doubling the number of bottleneck channels in every block (Wide).

B. PERFORMANCE METRICS
This subsection presents the performance metrics used to benchmark the different algorithms and architectures on the same basis.

1) RECOGNITION ACCURACY
In this study, the accuracy is the pixel accuracy, i.e. the fraction of pixels that were correctly classified as safe or unsafe.
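A minimal sketch of this metric on binary safety maps is given below. The encoding (1 for safe, 0 for unsafe) and the function name are assumptions made for the example.

```python
def pixel_accuracy(prediction, ground_truth):
    """Fraction of pixels whose predicted label matches the ground truth.

    Both arguments are 2D lists of equal shape holding 1 (safe) or 0 (unsafe).
    """
    total = 0
    correct = 0
    for pred_row, true_row in zip(prediction, ground_truth):
        for pred, true in zip(pred_row, true_row):
            total += 1
            correct += int(pred == true)
    if total == 0:
        raise ValueError("empty safety map")
    return correct / total


# Toy 2 x 2 safety map: three of the four pixels are correctly classified.
print(pixel_accuracy([[1, 0], [1, 1]], [[1, 0], [0, 1]]))
```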

2) COMPUTATIONAL COMPLEXITY
The computational complexity is represented by the number of floating-point operations (in GFLOPs) needed for one inference. The requirement is that it must be below the capabilities of the HPSC, namely 32 GFLOPS.

3) MODEL COMPLEXITY
The model complexity is obtained by counting the number of learnable parameters. This is important with regard to the memory capabilities of the hardware.
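For convolutional layers, the count can be sketched as below. The presence of one bias per filter and the 4 bytes per parameter (32-bit floats) are assumptions made for illustration, and the layer shapes are hypothetical.

```python
def conv_layer_parameters(n_in, n_out, kernel, bias=True):
    """Learnable parameters of one 2D convolution with a square kernel."""
    weights = n_in * n_out * kernel * kernel
    return weights + (n_out if bias else 0)


def model_memory_bytes(layers, bytes_per_parameter=4):
    """Memory needed to store the parameters of a list of (n_in, n_out, kernel) layers.

    ASSUMPTION: parameters are stored as 32-bit floats (4 bytes each).
    """
    return bytes_per_parameter * sum(conv_layer_parameters(*layer) for layer in layers)


# Hypothetical shallow network (illustration only).
shallow = [(1, 16, 3), (16, 32, 3), (32, 2, 1)]
print(sum(conv_layer_parameters(*layer) for layer in shallow), "parameters")
print(model_memory_bytes(shallow), "bytes")
```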

4) INFERENCE TIME
The inference time is the duration the network needs to perform one forward pass. For statistical purposes, it was averaged over 100 runs.
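One possible way to obtain such an average is sketched below with the standard-library timer. The `forward` callable stands for any model's forward pass, and the optional warm-up runs are an assumption of this sketch, not a step described in the paper.

```python
import time


def mean_inference_time(forward, sample, runs=100, warmup=0):
    """Average wall-clock duration, in seconds, of `forward(sample)` over `runs` calls."""
    for _ in range(warmup):
        forward(sample)  # optional untimed calls, e.g. to fill caches
    start = time.perf_counter()
    for _ in range(runs):
        forward(sample)
    return (time.perf_counter() - start) / runs


# Stand-in for a network: sums the values of a 128 x 128 map.
dummy_map = [[0.5] * 128 for _ in range(128)]
print(mean_inference_time(lambda dem: sum(map(sum, dem)), dummy_map))
```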

V. NUMERICAL RESULTS

A. INPUT DATASET AND TRAINING

1) DATASET
The data used in this study is directly imported from the Mars HiRISE camera on board the Mars Reconnaissance Orbiter (MRO). Each map has a resolution of 1 m/pixel and a size of 100 × 100, and is upsampled with bilinear interpolation to 128 × 128 before being fed to the network.
For the ground-truth labels, the ALHAT probabilistic algorithm is not used. Slope and roughness are instead measured just as ALHAT does, but the safety maps are deterministically generated from the noise-free Digital Elevation Maps.
The dataset contains a total of 1000 normalized Digital Elevation Maps (DEMs), with 800 for training, 100 for validation, and the remaining 100 for testing. Examples can be seen in Fig. 6. All images are independent from one another, as no data augmentation strategy was involved.

2) TRAINING
Implemented in the PyTorch framework, the models were trained using the Adaptive Moment Estimation (Adam) stochastic optimizer. A batch size of 8 was used for the 1 million epochs of training, with a learning rate of 0.00001. The training took several days on an NVIDIA Quadro RTX 6000 on the Georgia Institute of Technology's High-Performance PACE Cluster. The benchmark was however performed online on the chosen HPSC surrogate, namely the Raspberry Pi 4B, as discussed in Section III.

B. BENCHMARK

1) ACCURACY VS COMPUTATIONAL COMPLEXITY VS MODEL COMPLEXITY
The ball chart in Fig. 7 gives the accuracy of each model with respect to its computational complexity on the X-axis, and its model complexity (number of parameters) as the size of the ball. The highest pixel accuracy, 95%, is reached by ConvDeconv, which also has the smallest number of parameters. There seems to be no direct relationship between accuracy and GFLOPs: ICNet has a computational complexity similar to that of ENet, but an accuracy 10 percentage points lower. Furthermore, the same example also shows that having a large number of parameters does not imply a better accuracy. The best performance actually occurs when the number of learnable parameters and computational operations is reduced.

2) ACCURACY-RATE VS LEARNING POWER
The accuracy density is the ratio of the recognition performance to the number of parameters needed to achieve that result. It is first observed in Fig. 8 that TransUNet and the DeepLabv3 variants have the lowest results. On the other hand, architectures such as ConvDeconv, ENet, and SegNet 3 make better use of their model complexity. This accuracy density is compared with the actual accuracy of each model. ConvDeconv, which is the most accurate model, uses its parameters 5 times more efficiently than SegNet 3 and is also 2.5 percentage points more accurate.
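The accuracy density can be computed and used for ranking as sketched below. The two entries are made-up placeholders chosen only to show the mechanics; they are not results from this benchmark.

```python
def accuracy_density(accuracy_percent, n_parameters):
    """Recognition performance per learnable parameter."""
    return accuracy_percent / n_parameters


def rank_by_density(models):
    """Sort model names from the most to the least parameter-efficient.

    `models` maps a name to a tuple (accuracy in percent, number of parameters).
    """
    return sorted(models, key=lambda name: accuracy_density(*models[name]), reverse=True)


# HYPOTHETICAL numbers, for illustration only.
models = {
    "large_net": (92.0, 40_000_000),
    "small_net": (90.0, 1_000_000),
}
print(rank_by_density(models))
```

In this toy case the larger network is slightly more accurate, yet the smaller one extracts far more accuracy per parameter, which is the kind of trade-off the metric is meant to expose.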

3) ACCURACY-RATE VS INFERENCE TIME
The inference time of each model computed on the Raspberry Pi 4B is finally shown in Fig. 9. The grid plot shows 3 classes. The first one regroups the architectures that perform 5 to 10 forward passes in less than 2 seconds, namely ConvDeconv and ICNet. Then, 5 models perform their inference between 0.2 and 1 second. This group contains all the SegNets, as well as U-Net and ENet. Finally, all the Fully Convolutional Networks, TransUNet, and the DeepLabv3 variants take more than 1 second to perform the prediction. Among the architectures that produce their safety map in less than 1 second, ConvDeconv gives the best accuracy. It is the fastest, but also the most accurate.
It can also be noted that among the SegNets, SegNet 3, the smallest one in terms of computational and model complexity, achieves the fastest and most accurate performance.
Since the accuracy and the inference time are the two most important factors, the model that gives the best performance with respect to those metrics is ConvDeconv. Following Table 1, this architecture is the best in terms of almost all the computed metrics, namely inference time, specificity (true unsafe rate), fall-out (false safe rate), computational and model complexity. Only SegNet 3, which is also a shallow architecture with respect to the others, shows approaching results, even producing better sensitivity (true safe rate) and miss rate (false unsafe rate).
FIGURE 7. Ball chart of the accuracy of each model with respect to its computational complexity (X-axis) and its model complexity (number of parameters, size of the ball).
FIGURE 8. Accuracy density of each model, i.e. the ratio of the recognition performance to the number of parameters, compared with its actual accuracy.
FIGURE 9. Inference time of each model computed on the Raspberry Pi 4B.
TABLE 1. Results for every model obtained with the Raspberry Pi 4B. mIoU is the mean intersection over union, T-S/S the fraction of true safe pixels (sensitivity), F-US/S the percentage of pixels that were predicted as unsafe when they were actually safe (miss rate), T-US/US (specificity) the ratio of pixels that were correctly labeled as unsafe, F-S/US those which were incorrectly labeled as safe when they were unsafe (fall-out), GFLOPs the number of billions of floating-point operations used for one inference (computational complexity), and # of parameters the number of parameters (model complexity). Best values are written in boldface. It can be observed that the shallower the architecture, such as with SegNet 3 or ConvDeconv, the better the performance in terms of inference time, accuracy, computational complexity, and memory usage on this specific task.
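The rates named in the caption of Table 1 can be derived from the four confusion counts of a binary safety map, as sketched below. The label encoding (1 for safe, 0 for unsafe), the function name, and the assumption that both classes are present in the ground truth are choices made for this example.

```python
def safety_metrics(prediction, ground_truth):
    """Confusion-based rates for flat sequences of labels (1 = safe, 0 = unsafe).

    ASSUMPTION: both classes occur in `ground_truth`, so no denominator is zero.
    """
    pairs = list(zip(prediction, ground_truth))
    t_s = sum(1 for p, t in pairs if p == 1 and t == 1)    # true safe
    f_us = sum(1 for p, t in pairs if p == 0 and t == 1)   # false unsafe
    t_us = sum(1 for p, t in pairs if p == 0 and t == 0)   # true unsafe
    f_s = sum(1 for p, t in pairs if p == 1 and t == 0)    # false safe
    iou_safe = t_s / (t_s + f_us + f_s)
    iou_unsafe = t_us / (t_us + f_s + f_us)
    return {
        "sensitivity": t_s / (t_s + f_us),    # T-S/S
        "miss_rate": f_us / (t_s + f_us),     # F-US/S
        "specificity": t_us / (t_us + f_s),   # T-US/US
        "fall_out": f_s / (t_us + f_s),       # F-S/US
        "mIoU": (iou_safe + iou_unsafe) / 2,
    }


print(safety_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]))
```

For a landing application, the fall-out is the most safety-critical of these rates, since it measures unsafe terrain that would be reported as safe.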

4) RESULTS FOR ConvDeconv
The study finally zooms in on ConvDeconv, the architecture chosen for the landing problem. The training results are shown in Fig. 10. The loss function decreases and finally plateaus at the same step as the Mean Intersection over Union (mIoU) and the pixel accuracy. As stated in Section V, the accuracy of ConvDeconv is 95%, and its mIoU is 89%. Overfitting did not occur, as the accuracy on the testing set did not decrease after a given iteration. Fig. 11 shows the evolution of the prediction throughout the training process. The algorithm is able to predict the biggest unsafe areas, but struggles with the finest details in this setup. However, as the lander continues its descent, details will eventually become large, but this paper wants to stress that direct conclusions should not be drawn, since the algorithm was only trained at a certain altitude. Nonetheless, as depicted in Fig. 12, the inference time follows a quadratic law with respect to the relative resolution. For instance, an increase from 128 × 128 pixels to 1280 × 1280 will multiply the inference time by a factor of 100. Thus, increasing the resolution of the overall process will be costly, and trade-offs will have to be made between accuracy and inference time to satisfy the requirements.
FIGURE 10. Training results for the testing set of ConvDeconv: loss function, Mean Intersection over Union (mIoU), and pixel accuracy.
FIGURE 11. Evolution of the prediction of ConvDeconv throughout the training process.
FIGURE 12. Inference time as a function of the relative resolution, following a quadratic law.
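The quadratic law can be turned into a quick estimate, as in the sketch below. The baseline time of 0.1 s is a placeholder assumption, not a measured value; only the scaling rule comes from the text.

```python
def scaled_inference_time(base_time_s, base_side, new_side):
    """Inference time extrapolated with a quadratic law in the relative resolution."""
    relative_resolution = new_side / base_side
    return base_time_s * relative_resolution ** 2


# ASSUMPTION: a placeholder baseline of 0.1 s for a 128 x 128 input.
for side in (128, 256, 512, 1280):
    print(side, scaled_inference_time(0.1, 128, side))
```

Going from 128 to 1280 pixels per side multiplies the relative resolution by 10 and therefore the estimated time by 100, matching the factor quoted above.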

VI. CONCLUSION AND PERSPECTIVES
In this study, after choosing a relevant surrogate of NASA's High-Performance Spaceflight Computer, several semantic segmentation models and architectures were benchmarked to find the most suitable one for the landing problem, taking into account hardware and mission specifications.
The key findings of this paper are the following:
• The Raspberry Pi 4B is a relevant surrogate for NASA's High-Performance Spaceflight Computer. It makes it possible to reproduce the physical limits of a 64-bit 8-core architecture with a computational limit of 32 GFLOPS.
• There is no correlation between most of the different metrics. However, shallow architectures seem to have a prominent impact on the performance.
• ConvDeconv is the architecture that achieves the best accuracy (>95%). It also provides the fastest inference time, with more than 10 predictions per second. It is finally the one requiring the least computational power and memory usage, and is thus the chosen solution for the given problem.
Further studies would have to vary the resolution of the input Digital Elevation Maps, but also the slant distances and angles of the lander with respect to the ground, with a view to finding the most robust architecture. Another idea would be to make use of different models in concert depending on the lander's configuration.
THOMAS CLAUDET is currently pursuing the dual master's degree in engineering (control of dynamical systems) with the IMT Atlantique, France, and in mechanical engineering with the Georgia Institute of Technology, writing his master's thesis on safe planetary landing using semantic segmentation. He has been working with the Jet Propulsion Laboratory as a Visiting Student Researcher on the stabilization of planetary-explorer balloons, deep space network scheduling, spacecraft swarm trajectory optimization, and autonomous robotics with team CoSTAR (JPL, Caltech, and MIT) for the DARPA subterranean challenge finals. He is the Founder, a Project Manager, and a Lead Scientist of his CubeSat club.
KENTO TOMITA is currently pursuing the Ph.D. degree with the Georgia Institute of Technology advised by Koki Ho. His specializations are space systems modeling and optimization, and decision making under uncertainty. He is investigating online hazard detection problems for planetary landing missions. He was working as an ADCS Engineer for EQUULEUS, a 6U piggyback satellite of NASA's Artemis-1 Mission, at The University of Tokyo. His research interests include space systems applications under the intersection between artificial intelligence, combinatorial optimization, and probabilistic optimization.
KOKI HO is currently the Director of the Space Systems Optimization Group and an Assistant Professor with the Daniel Guggenheim School of Aerospace Engineering, Georgia Tech. His research interests include space logistics systems design for applications including human space exploration campaigns, on-orbit servicing assembly and manufacturing (OSAM), and large-scale satellite constellations. Motivated by the growing complexity of space missions, he has pioneered a new research direction of logistics-based space systems modeling, substantially improving the efficiency and rigor of the space mission formulation process. His work has been sponsored by NASA, NSF, DARPA, USSF, and industry (including ULA, Maxar Technologies, and so on). He was a recipient of the NSF CAREER Award, the NASA Early Career Faculty Award, and the DARPA Young Faculty Award. He also serves as the Chair for the AIAA Space Logistics Technical Committee.