METER: A Mobile Vision Transformer Architecture for Monocular Depth Estimation

Depth estimation is a fundamental capability for autonomous systems that need to assess their own state and perceive the surrounding environment. Deep learning algorithms for depth estimation have gained significant interest in recent years, owing to their potential to overcome the limitations of active depth sensing systems. Moreover, due to the low cost and small size of monocular cameras, researchers have focused their attention on monocular depth estimation (MDE), which consists in estimating a dense depth map from a single RGB video frame. State of the art MDE models typically rely on vision transformer (ViT) architectures that are very deep and complex, making them unsuitable for fast inference on devices with hardware constraints. In this paper, we therefore address the problem of exploiting ViT in MDE on embedded devices. These systems are usually characterized by limited memory and low-power CPUs/GPUs. We propose METER, a novel lightweight vision transformer architecture capable of achieving state of the art estimations and low latency inference on the considered embedded hardware: NVIDIA Jetson TX1 and NVIDIA Jetson Nano. We provide a solution consisting of three alternative configurations of METER, a novel loss function to balance pixel estimation and reconstruction of image details, and a new data augmentation strategy to improve the overall final predictions. The proposed method outperforms previous lightweight works on the two benchmark datasets: the indoor NYU Depth v2 and the outdoor KITTI.


I. INTRODUCTION
Acquiring accurate depth information from a scene is a fundamental and important challenge in computer vision, as it provides essential knowledge in a variety of vision applications, such as augmented reality, salient object detection, visual SLAM, video understanding, and robotics [1]-[3]. Depth data is usually captured with active depth sensors such as LiDARs, depth cameras, and other specialised sensors capable of perceiving such information by perturbing the surrounding environment, e.g. through time-of-flight or structured light technologies. These sensors have several disadvantages, including unfilled depth maps and restricted depth ranges, as well as being difficult to integrate into low-power embedded devices. In addition, power consumption must be considered when the hardware has limited resources.
On the contrary, passive depth sensing systems based on deep learning (DL) could potentially overcome all the active depth sensor limitations. Moreover, in some settings such as indoor or hostile environments, where the use of small robots and drones could introduce additional constraints, the presence of a single RGB camera offers an effective and low-cost alternative to such traditional setups. The monocular depth estimation (MDE) task consists in the prediction of a dense depth map from a video frame with the use of DL algorithms, where the estimation is computed for each pixel.

The authors are with the Department of Computer, Control and Management Engineering, Sapienza University of Rome, Italy (e-mail: [papa, paolo.russo, amerini]@diag.uniroma1.it).
Recent MDE models aim at enabling depth perception from single RGB images with deep vision transformer (ViT) architectures [4]-[6], which are generally unsuitable for fast inference on low-power hardware. Instead, well-established convolutional neural network (CNN) architectures [7], [8] have been successfully exploited on embedded devices with the goal of achieving accurate and low latency inference. However, ViT architectures demonstrate the advantage of global processing by obtaining significant performance improvements over fully convolutional networks. In order to balance computational complexity and hardware constraints, we propose to integrate the two architectures by fusing transformer blocks and convolutional operations, as successfully done in classification and object detection tasks [9], [10].
This paper presents METER, a MobilE vision TransformER architecture for MDE that achieves state of the art results with respect to previous lightweight models on two benchmark datasets, i.e. NYU Depth v2 [11] and KITTI [12]. METER inference speed is evaluated on two embedded devices, the 4GB NVIDIA Jetson TX1 and the 4GB NVIDIA Jetson Nano. To improve the overall estimation performance, we focus on three fundamental components: a specific loss function, a novel data augmentation policy and a custom transformer architecture. The loss function is composed of four independent terms (quantitative and similarity measurements) to balance the reconstruction capabilities of the architecture while highlighting the high-frequency details of the image. Moreover, the data augmentation strategy employs a simultaneous random shift over both the input image and the dense ground truth depth map to increase model resilience to tiny changes of illumination and depth values.
The proposed network exploits a hybrid encoder-decoder structure characterized by a ViT encoder, which was inspired by [9] due to its fast inference performance. We focus on the transformer structure in order to identify and improve the blocks with the highest computational cost while optimizing the model to extract robust features. In addition, we designed a novel lightweight CNN decoder to limit the number of operations while improving the reconstruction process. Furthermore, we propose three different METER configurations; for each variant, we reduce the number of trainable parameters at the expense of a slight increase of the final estimation error. Figure 1 shows several METER depth estimations for both indoor and outdoor environments.

Fig. 1: METER depth map predictions (third row) over the KITTI and NYU Depth v2 datasets. GT depth maps are resized to match METER output resolution. For a better view, the depth maps are converted to RGB format with a perceptually uniform colormap (Plasma-reversed) extracted from the ground truth (second row).
Moreover, to the best of our knowledge, METER is the first model for the MDE task that integrates the advantages of ViT architectures in such lightweight DL structures under low-resource hardware constraints. The main contributions of the paper are summarized as follows:
• We propose a novel lightweight ViT architecture for monocular depth estimation able to infer at high frequency on low-resource (4GB) embedded devices.
• We introduce a novel data augmentation method and loss function to boost the model estimation performance.
• We show the effectiveness and robustness of METER with respect to related state of the art MDE methods over two benchmark datasets, i.e. NYU Depth v2 [11] and KITTI [12].
• We validate the models through quantitative and qualitative experiments on the data augmentation strategies and the loss function components, highlighting their effectiveness.

This paper is organized as follows: Section II reviews some previous works related to the topics of interest. Section III describes the proposed method and the overall architecture in detail. Experiments and hyper-parameters are discussed in Section IV, while Section V reports the results and a quantitative analysis of METER with respect to other significant works. Some final considerations and future applications are provided in Section VI.

II. RELATED WORKS
In this section, we report state of the art related works on monocular depth estimation, grouped as follows: fully CNN-based methods are covered in Section II-A, ViT-based approaches in Section II-B and lightweight (CNN) MDE methods in Section II-C.

A. CNN-based MDE methods
Fully convolutional neural networks based on encoder-decoder structures are commonly used for dense prediction tasks such as depth estimation and semantic segmentation. The seminal work of Eigen et al. [13] presents a CNN model that handles the MDE task by employing two stacked deep networks to extract both global and local information. Cao et al. [14], [15] present two works based on deep residual networks that treat MDE as a classification task over absolute and relative depth maps, respectively. Alhashim et al. [16] propose DenseDepth, a network which exploits transfer learning to produce high-resolution depth maps. The architecture is composed of a standard encoder-decoder with a pre-trained DenseNet-169 [17] as backbone and a specifically designed decoder. Gur et al. [18] present a variant of the DeepLabV3+ [19] model in which the encoder is composed of a ResNet [20] and an atrous spatial pyramid pooling module, and which introduces a Point Spread Function convolutional layer to learn depth information from defocus cues. Recently, Song et al. [21] propose LapDepth, a Laplacian pyramid-based architecture composed of a pre-trained ResNet-101 encoder and a Laplacian pyramid decoder that combines the reconstructed coarse and fine scales to predict the final depth map.
However, those methods, which often rely on deep pre-trained encoders and high-resolution input images, are unsuitable for inference on low-resource hardware. In contrast, we propose a lightweight architecture that takes advantage of transformer blocks to balance global feature extraction capabilities and the overall computational complexity of convolutional operations.

B. ViT-based MDE methods
Vision Transformers [22] have gained popularity for their accuracy thanks to the attention mechanism [23], which simultaneously extracts information from the input pixels and their inter-relations, going beyond the translation-invariant property of convolution. In dense prediction tasks, ViT architectures share the same encoder-decoder structure that has contributed significantly to solving many vision problems with CNNs. Bhat et al. [5] were the first to handle the MDE task with ViT architectures by proposing Adabins, which uses a minimized version of a vision transformer structure to adaptively calculate bin widths. Ranftl et al. [4] investigate the application of ViT by proposing DPT, a model composed of a transformer-CNN encoder and a fully convolutional decoder. The authors show that ViT encoders provide finer-grained predictions with respect to standard CNNs, especially when trained with a large amount of data. Yun et al. [24] improve 360° monocular depth estimation methods with a joint supervised and self-supervised learning strategy that takes advantage of non-local DPT. Recently, Li et al. [25] design MonoIndoor++, a framework that takes into account the main challenges of indoor scenarios. Kim et al. [26] propose GLPDepth, a global-local transformer network to extract meaningful features at different scales and a Selective Feature Fusion CNN block for the decoder. The authors also integrate a revisited version of the CutDepth data augmentation method [27], which is able to improve the training process on the NYU Depth v2 dataset without needing additional data. Li et al. propose DepthFormer [6] and BinsFormer [28]; the first is composed of a fully-transformer encoder and a convolutional decoder interleaved by an interaction module to enhance transformer encoded and CNN decoded features. In BinsFormer, instead, the idea of the authors is to use a multi-scale transformer decoder to generate adaptive bins and to recover spatial geometry information from the encoded features.
Instead of following the recent trend of high-capacity models, we propose a novel lightweight ViT architecture that is able to achieve accurate, low latency depth estimations on embedded devices.

C. Lightweight MDE methods
The models reported so far are not suitable for embedded devices due to their size and complexity. For this reason, developing lightweight architectures could be a solution to perform inference on constrained hardware, as shown in [29], [30]. To provide a clearer overview of those approaches, we also report the frames per second (fps) published in the original papers that focus on inference frequency, remarking that they are not comparable because they were measured on different hardware. Poggi et al. [31] propose PyD-Net, a pyramidal network to infer on CPU devices. The authors use the pyramidal structure to extract features from the input image at different levels, which are afterwards upsampled and merged to refine the output estimation. This model achieves less than 1 fps on an ARM CPU and almost 8 fps on an Intel i7 CPU. Spek et al. [32] present CReaM, a fully convolutional architecture obtained through a knowledge-transfer learning procedure. The model is able to achieve real-time frequency performance (30 fps) on the 8GB NVIDIA Jetson TX2 device. Wofk et al. [8] develop FastDepth, an encoder-decoder architecture characterized by a pre-trained MobileNet [33] network as backbone and a custom decoder. Furthermore, the authors show that pruning the trained model boosts the inference frequency at the expense of a small increase of the final estimation error. FastDepth achieves 178 fps on the 8GB NVIDIA Jetson TX2 device. Recently, Yucel et al. [34] propose a small network composed of MobileNet v2 [33] as encoder and FBNet x112 [35] as decoder, trained with an altered knowledge distillation process; the model achieves 37 fps on a smartphone GPU. Papa et al. [7] design SPEED, a separable pyramidal pooling architecture characterized by an improved version of MobileNet v1 [36] as encoder and a dedicated decoder. This architecture exploits depthwise separable convolutions, achieving real-time frequency performance on the embedded 4GB NVIDIA Jetson TX1 and 6 fps on the Google Dev Board Edge TPU.
As previously mentioned, all those lightweight MDE works are designed over fully convolutional architectures. In contrast to previous methodologies, METER exploits a lightweight transformer module in three different configurations, achieving state of the art results over the standard evaluation metrics.

III. PROPOSED METHOD
This section outlines the design of METER, the proposed lightweight monocular depth estimator. In particular, in Section III-A we provide a detailed architecture analysis for both encoder and decoder modules, in Section III-B we describe the proposed loss function, and in Section III-C the employed augmentation policy.

A. METER architecture
The vision transformer architecture has demonstrated outstanding performance in a variety of computer vision tasks, usually relying on deep and heavy structures. On the other hand, to reduce the computational cost of such models, lightweight CNNs usually rely on convolutional operations with small kernels (i.e. 3 × 3, 1 × 1) or on particular techniques such as depthwise separable convolution [37]. Based on these observations, we design a hybrid lightweight ViT characterized by convolutions with small kernels and as few transformer blocks as possible, reducing their computational impact on the overall structure. In the following, we present METER: a MobilE vision TransformER architecture characterized by a lightweight encoder-decoder model designed to infer on embedded devices. The METER encoder redesigns the computationally demanding operations of [9] to improve the inference performance while maintaining the feature extraction capabilities. The high-level features extracted from the encoder are then fed into the decoder through the skip-connections to recover the image details. The proposed fully convolutional decoder has been structured to upsample the compact set of encoder high-level features while enhancing the reconstruction of the image details to obtain the desired output depth map (i.e. a per-pixel distance map). A graphical overview of the architecture is reported in Figure 2, while the number of channels employed in the different METER configurations, METER S, METER XS, and METER XXS, is reported in Table I. The three proposed networks have 3.29M, 1.45M, and 0.71M trainable parameters, respectively.

The METER encoder exploits a modified version of the MobileViT network due to its light structure, as demonstrated in [9]. As can be noticed in Figure 2, METER presents a hybrid network composed of convolutional MobileNetV2 blocks (red) and transformer blocks (green). The MobileViT blocks with the highest computational cost, i.e. the ones composed of cascaded transformers and convolution operations, have been identified and replaced with new modules (METER blocks). Such modules are able to guarantee low latency inference while tuning the entire structure to minimize the final estimation error. Along the lines of [9], we propose three variants of the same encoder architecture with decreasing complexity and computational cost, namely S, XS, and XXS.
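To make the cost argument for small kernels and depthwise separable convolutions concrete, the following minimal sketch counts multiply-accumulate (MAC) operations for a standard convolution and for its depthwise separable counterpart. The function names and the layer sizes in the usage lines are hypothetical and are not taken from the METER configurations; stride 1, "same" padding and no bias are assumed.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def dw_separable_macs(h, w, c_in, c_out, k):
    """MACs of a depthwise k x k convolution followed by a pointwise 1 x 1 one."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 8 x 8 feature map, 16 -> 32 channels, 3 x 3 kernel.
standard = conv_macs(8, 8, 16, 32, 3)
separable = dw_separable_macs(8, 8, 16, 32, 3)
ratio = separable / standard  # equals 1 / c_out + 1 / k**2
```

For this hypothetical layer the separable variant needs roughly one seventh of the MACs, which is the kind of saving that motivates its use in lightweight networks.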
The proposed METER block (green in Figure 2) is composed of three feature extraction operations: two Convolutional blocks composed of a 3 × 3 convolution and a pointwise one (purple), and a second 1 × 1 convolution (yellow), interleaved by a single transformer block (blue). The module applies an unfold operation so that the transformer attention acts on the flattened input patches, and it reconstructs the output feature map with the opposite fold operation, as described in [9]. Moreover, in order to apply an attention mechanism to the encoded features, the input of the METER block (gray) is concatenated with the output of the transformer and fed to the previously mentioned 1 × 1 convolution layer. Compared with the MobileViT architecture, which is characterized by four convolution operations and a number of cascaded transformer blocks, the proposed design reduces the computational cost of the overall model while producing an accurate estimation of the depth (as will be shown in Section V-B).
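The unfold and fold operations mentioned above can be illustrated with a minimal sketch on a single-channel feature map stored as a nested list. This is only an illustration of the patch bookkeeping, assuming non-overlapping square patches; the helper names are hypothetical, and the transformer that would process the patches between the two calls is omitted.

```python
def unfold(fmap, p):
    """Split an H x W map into non-overlapping p x p patches.

    Returns a list of patches in row-major order, each flattened to p*p values.
    """
    h, w = len(fmap), len(fmap[0])
    if h % p or w % p:
        raise ValueError("H and W must be multiples of the patch size")
    patches = []
    for r0 in range(0, h, p):
        for c0 in range(0, w, p):
            patches.append(
                [fmap[r0 + dr][c0 + dc] for dr in range(p) for dc in range(p)]
            )
    return patches


def fold(patches, h, w, p):
    """Inverse of unfold: rebuild the H x W map from the flattened patches."""
    fmap = [[None] * w for _ in range(h)]
    idx = 0
    for r0 in range(0, h, p):
        for c0 in range(0, w, p):
            patch = patches[idx]
            idx += 1
            for dr in range(p):
                for dc in range(p):
                    fmap[r0 + dr][c0 + dc] = patch[dr * p + dc]
    return fmap


# Hypothetical 4 x 4 map with 2 x 2 patches.
fmap = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = unfold(fmap, 2)           # 4 patches of 4 values each
restored = fold(tokens, 4, 4, 2)   # identical to fmap when tokens are unchanged
```

Because fold exactly inverts unfold, any change in the reconstructed map comes only from what the transformer does to the tokens in between.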
Finally, we halved the number of output encoder features (channel $C_6$) and we replaced the MobileViT SiLU non-linearity with ReLU. Although the SiLU activation function is differentiable at every point, it does not ensure better performance, likely due to the depth-data distribution.
The METER decoder is designed with a fully convolutional structure to enhance the estimation accuracy and the reconstruction capabilities while keeping a limited number of operations. As can be seen in Figure 2, the decoder consists of a sequence of three cascaded upsampling blocks (light blue) and two convolutional layers (yellow) located at the beginning and at the end of the model. Each upsampling block is composed of a sequence of upsampling, skip-connection and feature extraction operations. The upsampling operation is performed by a transposed convolutional layer (orange) which doubles the spatial resolution of the input. Then, a Convolutional block (purple) is used for feature extraction; the skip-connection (dashed blue arrow) linking the METER encoder and decoder modules makes it possible to recover image details from the encoded feature maps.
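As a rough illustration of how three cascaded upsampling blocks change the spatial resolution, the sketch below applies the standard output-size formula of a transposed convolution (dilation 1). The kernel size, stride and padding are hypothetical values chosen only because they double the resolution; the text states the doubling but not these hyper-parameters, and the 6 × 8 input size is likewise an assumption.

```python
def transposed_conv_out(size, kernel, stride, padding, output_padding=0):
    """Output size along one axis of a transposed convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def decoder_resolution(h, w, n_blocks=3, kernel=4, stride=2, padding=1):
    """Resolution after n_blocks upsampling blocks, each doubling H and W."""
    for _ in range(n_blocks):
        h = transposed_conv_out(h, kernel, stride, padding)
        w = transposed_conv_out(w, kernel, stride, padding)
    return h, w


# Hypothetical encoder output of 6 x 8 pixels.
out_h, out_w = decoder_resolution(6, 8)
```

With these assumed settings each block doubles both dimensions, so three blocks give an overall upsampling factor of eight.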

B. The balanced loss function
The standard monocular depth estimation formulation considers as loss function the per-pixel difference between the $i$-th ground truth pixel $y_i$ and the predicted one $\hat{y}_i$. However, as reported in the literature [16], [38], [39], several modifications have been proposed to improve the convergence speed and the overall depth estimation performance. In particular, additional loss components focus on the refinement of fine details in the scene, such as object contours.
Derived from [38], [39], we propose a balanced loss function (BLF) that weights the reconstruction loss, given by the $L_{depth}(y_i, \hat{y}_i)$ and $L_{SSIM}(y_i, \hat{y}_i)$ components, against the high-frequency features taken into account by the $L_{grad}(y_i, \hat{y}_i)$ and $L_{norm}(y_i, \hat{y}_i)$ losses. The mathematical formulation of the BLF $L(y_i, \hat{y}_i)$ is reported in Equation 1, where $\lambda_1$, $\lambda_2$, $\lambda_3$ are used as scaling factors.
In detail, the loss $L_{depth}(y_i, \hat{y}_i)$ in Equation 2 is the pointwise L1 loss computed as the per-pixel absolute difference between the ground truth $y_i$ and the predicted image $\hat{y}_i$.
The $L_{grad}(y_i, \hat{y}_i)$ and $L_{norm}(y_i, \hat{y}_i)$ losses, reported respectively in Equation 3 and Equation 4, are designed to penalize the estimation errors around the edges and on small depth details. The $L_{grad}(y_i, \hat{y}_i)$ loss uses the Sobel gradient function to extract edges and object boundaries.
We denote by $\nabla$ the spatial derivative of the absolute estimation error with respect to the $x$ and $y$ axes. The $L_{norm}(y_i, \hat{y}_i)$ loss, reported in Equation 4, calculates the cosine similarity [40] between the ground truth and the prediction.
We denote by $\langle n_{y_i}, n_{\hat{y}_i} \rangle$ the inner product of the surface normal vectors $n_{y_i}$ and $n_{\hat{y}_i}$ computed for each depth map.
The last component, the $L_{SSIM}(y_i, \hat{y}_i)$ loss in Equation 5, is based on the mean structural similarity (SSIM) [41]. Similarly to [16], [39], we add this function to improve the depth reconstruction and the overall final estimation.
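Since the four terms are described here only in words, the block below sketches one plausible way to write them. It should be read as an assumption-laden illustration rather than as the exact Equations 1-5: the averaging over the $n$ pixels, the surface normal definition $n_d = [-\nabla_x d, -\nabla_y d, 1]^\top$, the $(1 - \mathrm{SSIM})/2$ form (a commonly used choice), and in particular the pairing of each scaling factor $\lambda_k$ with a specific term are all assumptions.

```latex
\begin{align*}
L(y_i, \hat{y}_i) &= L_{depth}(y_i, \hat{y}_i) + \lambda_1 L_{grad}(y_i, \hat{y}_i)
                   + \lambda_2 L_{norm}(y_i, \hat{y}_i) + \lambda_3 L_{SSIM}(y_i, \hat{y}_i) \\
L_{depth}(y_i, \hat{y}_i) &= \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \\
L_{grad}(y_i, \hat{y}_i) &= \frac{1}{n} \sum_{i=1}^{n}
    \left| \nabla_x \left( \left| y_i - \hat{y}_i \right| \right) \right|
  + \left| \nabla_y \left( \left| y_i - \hat{y}_i \right| \right) \right| \\
L_{norm}(y_i, \hat{y}_i) &= \frac{1}{n} \sum_{i=1}^{n}
    \left| 1 - \frac{\langle n_{y_i}, n_{\hat{y}_i} \rangle}
    {\sqrt{\langle n_{y_i}, n_{y_i} \rangle}\,\sqrt{\langle n_{\hat{y}_i}, n_{\hat{y}_i} \rangle}} \right| \\
L_{SSIM}(y_i, \hat{y}_i) &= \frac{1 - \mathrm{SSIM}(y_i, \hat{y}_i)}{2}
\end{align*}
```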
In conclusion, the proposed BLF balances the image reconstruction $L_{depth}$, the image similarity $L_{SSIM}$, the edge reconstruction $L_{grad}$ and the edge similarity $L_{norm}$ losses. The impact of each loss will be quantitatively evaluated in Section V-C.
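As a hedged illustration of how such a balanced loss could be assembled, the following pure-Python sketch computes the depth, gradient and normal terms on small depth maps stored as nested lists and combines them in a weighted sum. It is not the authors' implementation: the unnormalised 3 × 3 Sobel kernels applied without padding, the normal definition (-dx, -dy, 1), the externally supplied SSIM term, and the weight names `w_grad`, `w_norm`, `w_ssim` are all assumptions made for the example.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel(img, kernel):
    """3 x 3 correlation without padding; the output is (H-2) x (W-2)."""
    h, w = len(img), len(img[0])
    return [[sum(kernel[a][b] * img[r + a][c + b]
                 for a in range(3) for b in range(3))
             for c in range(w - 2)] for r in range(h - 2)]


def mean(grid):
    vals = [v for row in grid for v in row]
    return sum(vals) / len(vals)


def abs_error(y, y_hat):
    return [[abs(a - b) for a, b in zip(ry, rp)] for ry, rp in zip(y, y_hat)]


def l_depth(y, y_hat):
    """Mean per-pixel absolute difference (L1)."""
    return mean(abs_error(y, y_hat))


def l_grad(y, y_hat):
    """Mean magnitude of the Sobel derivatives of the absolute error."""
    err = abs_error(y, y_hat)
    gx, gy = sobel(err, SOBEL_X), sobel(err, SOBEL_Y)
    return mean([[abs(a) + abs(b) for a, b in zip(rx, ry)]
                 for rx, ry in zip(gx, gy)])


def l_norm(y, y_hat):
    """Mean of |1 - cosine similarity| between assumed surface normals."""
    gx, gy = sobel(y, SOBEL_X), sobel(y, SOBEL_Y)
    px, py = sobel(y_hat, SOBEL_X), sobel(y_hat, SOBEL_Y)
    out = []
    for r in range(len(gx)):
        row = []
        for c in range(len(gx[0])):
            n1 = (-gx[r][c], -gy[r][c], 1.0)  # assumed normal of the ground truth
            n2 = (-px[r][c], -py[r][c], 1.0)  # assumed normal of the prediction
            dot = sum(a * b for a, b in zip(n1, n2))
            norms = (math.sqrt(sum(a * a for a in n1))
                     * math.sqrt(sum(b * b for b in n2)))
            row.append(abs(1.0 - dot / norms))
        out.append(row)
    return mean(out)


def balanced_loss(y, y_hat, l_ssim=0.0, w_grad=1.0, w_norm=1.0, w_ssim=1.0):
    """Weighted sum of the four terms; l_ssim is computed elsewhere."""
    return (l_depth(y, y_hat) + w_grad * l_grad(y, y_hat)
            + w_norm * l_norm(y, y_hat) + w_ssim * l_ssim)


# Hypothetical 4 x 4 maps: a horizontal ramp versus a flat prediction.
ramp = [[float(c) for c in range(4)] for _ in range(4)]
flat = [[0.0] * 4 for _ in range(4)]
loss_value = balanced_loss(ramp, flat)
```

A constant offset between prediction and ground truth is penalised only by the depth term, whereas a missed slope, as in the ramp example, also activates the gradient and normal terms; this is the balancing effect the section describes.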

C. The data augmentation policy
Deep learning architectures, and especially Vision Transformers, need a large amount of input data to avoid overfitting on the given task. Those models are typically trained on large-scale labelled datasets with a supervised learning strategy [4].
However, gathering annotated images is time-consuming and labour-intensive; as a result, data augmentation (DA) is a typical solution for expanding the dataset by creating new samples. In the MDE task, the use of DA techniques based on geometric and photometric transformations is standard practice [5], [16]. However, not all geometric and image transformations are appropriate, because they introduce distortions and aberrations in the image domain that are also reflected on the ground truth depth maps.
With METER we propose a data augmentation policy based on commonly used DA operations while introducing a novel approach named shifting strategy. In particular, we consider as default augmentation policy the use of vertical flip, mirroring, random crop and channel swap of the input image, as in [16], to make the network invariant to specific color distributions. The key idea is to combine the default augmentation policy with the shifting strategy augmentation, based on two simultaneous transformations applied respectively to the input image and to the ground truth depth map. The first one applies a color (C) shift to the RGB input images, while the second one is a depth-range (D) shift, which consists of adding a small, random positive or negative value to the depth ground truth. The mathematical formulation of these transformations is reported in the following; we denote by $rgb_{un}$ and $rgb_{aug}$ the unmodified and the augmented RGB input images, respectively, and by $d_{un}$ and $d_{aug}$ the unmodified and the augmented depth maps.
The C shift augmentation, applied to RGB images, is composed of two consecutive steps. In the first step we apply a gamma-brightness transformation ($rgb_{gb}$), as reported in Equation 6, where $\beta$ and $\gamma$ are respectively the brightness and gamma factors, randomly chosen in the experimentally defined range [0.9, 1.1].
Then, the color augmentation transformation reported in Equation 7 is applied, where $I$ is an identity matrix of $H \times W$ resolution and $\eta$ is a scaling factor randomly chosen in the empirically set range [0.9, 1.1].
The D shift augmentation, Equation 8, consists of a random positive or negative value added to the ground truth depth maps ($d_{un}$). The random value, with a range of [−10, +10] centimeters for the indoor dataset and [−10, +10] decimeters for the outdoor one, is applied uniformly to the whole depth map.
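The shifting strategy can be sketched in a few lines of pure Python. The block below is a hedged reading of the description above, not the authors' code: the gamma-brightness step is assumed to be `beta * value ** gamma`, the color step is assumed to be a multiplication by a scalar `eta`, pixel values are assumed to lie in [0, 1] and are clipped to that range, and all function and parameter names are hypothetical.

```python
import random


def c_shift(rgb, beta, gamma, eta):
    """Color shift on a list of (r, g, b) pixels with values in [0, 1]."""
    out = []
    for pixel in rgb:
        # Assumed gamma-brightness step: beta * value ** gamma.
        gb = [beta * (v ** gamma) for v in pixel]
        # Assumed color step: scale by eta, then clip to [0, 1].
        out.append(tuple(min(1.0, max(0.0, g * eta)) for g in gb))
    return out


def d_shift(depth, offset):
    """Depth-range shift: add the same offset to every depth value."""
    return [d + offset for d in depth]


def shifting_strategy(rgb, depth, rng, max_depth_shift=10.0, p=0.5):
    """Apply the C shift and the D shift together with probability p."""
    if rng.random() < p:
        beta = rng.uniform(0.9, 1.1)
        gamma = rng.uniform(0.9, 1.1)
        eta = rng.uniform(0.9, 1.1)
        offset = rng.uniform(-max_depth_shift, max_depth_shift)
        rgb = c_shift(rgb, beta, gamma, eta)
        depth = d_shift(depth, offset)
    return rgb, depth


# Hypothetical two-pixel sample with depths in centimeters.
rng = random.Random(0)
image = [(0.5, 0.5, 0.5), (0.2, 0.4, 0.6)]
depth_cm = [150.0, 300.0]
aug_image, aug_depth = shifting_strategy(image, depth_cm, rng, p=1.0)
```

Drawing a single offset per sample keeps the relative depth structure of the scene intact, which matches the statement that the random value is applied uniformly to the whole depth map.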
In Figure 3 we report a sample frame before and after the application of the proposed strategy with the minimum and the maximum shift values.To emphasise the impact of the D shift, we focus on a narrow portion of the original depth map (in a distance range between 150 and 300 centimeters) by applying a perceptually uniform colormap and highlighting the minimum and maximum depth intervals through the associated color bars.The reported frames show that the depth with the positive displacement (+10 centimeters) has a lighter colormap, while the depth with the negative displacement (−10 centimeters) has a darker one; this effect is emphasised by the colormap of the original distance distribution.The introduced depth-range shift augmentation, along with the color and brightness shift and the commonly used transformations, leads to better final estimations as will be shown in Section V-D providing also invariance to color and illumination changes.

IV. EXPERIMENTAL SETUP
This section gives a detailed description of the experimental setup, including training hyper-parameters, benchmark datasets and evaluation metrics respectively in Sections IV-A, IV-B, and IV-C.

A. Training hyper-parameters
METER has been implemented using the PyTorch deep learning API, randomly initializing the weights of the architectures. All the models have been trained from scratch using the AdamW optimizer [42] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay $wd = 0.01$ and an initial learning rate of 0.001 with a decrement of 0.1 every 20 epochs. We use a batch size of 128 for a total of 60 epochs. For the balanced loss function we empirically choose the scaling factors $\lambda_1 = 0.5$ and $\lambda_2, \lambda_3 = \{1, 10, 100\}$ depending on the unit of measure used for the predicted depth map, i.e. meters, decimeters or centimeters. We apply a probability of 0.5 for all the random transformations set in the data augmentation policy.
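The learning rate schedule described above can be written as a one-line function. This sketch assumes that "a decrement of 0.1 every 20 epochs" means multiplying the learning rate by 0.1 at each step, which is the usual step-decay reading; the function name is hypothetical.

```python
def step_lr(epoch, base_lr=1e-3, gamma=0.1, step=20):
    """Learning rate at a given epoch under multiplicative step decay."""
    return base_lr * gamma ** (epoch // step)


# Learning rate at the first epoch of each 20-epoch stage of the 60-epoch run.
schedule = [step_lr(e) for e in (0, 20, 40)]
```

Under this reading the 60 training epochs are split into three stages with learning rates of about 1e-3, 1e-4 and 1e-5. In PyTorch, the same behaviour would typically be obtained by pairing `torch.optim.AdamW` with `torch.optim.lr_scheduler.StepLR`, although the text does not state which scheduler class was used.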

B. Benchmark datasets
The datasets used to show the performance of METER are NYU Depth v2 [11] and KITTI [12], two popular MDE benchmark datasets for indoor and outdoor scenarios.
The NYU Depth v2 dataset provides RGB images and corresponding depth maps in several indoor scenarios captured at a resolution of 640 × 480 pixels. The depth maps have a maximum distance of 10 meters. The dataset contains 120K training samples and 654 testing samples; for training we used the 50K subset, as done in previous works [5], [16]. The input images have been downsampled to a resolution of 256 × 192.
The KITTI dataset provides stereo RGB images and corresponding 3D laser scans in several outdoor scenarios. The RGB images are captured at a resolution of 1241 × 376 pixels. The depth maps have a maximum distance of 80 meters. We train our network at an input resolution of 636 × 192 on the Eigen et al. [13] split, which is composed of almost 23K training and 697 testing samples. Similarly to [21], due to the low density of the depth maps, we evaluate the compared models in the cropped area where point-cloud measurements are reported.

C. Performance evaluation
We quantitatively evaluate the performance of METER using common metrics [13] in the monocular depth estimation task: the root-mean-square error (RMSE, in meters [m]), the relative error (REL), and the accuracy value $\delta_1$, respectively reported in Equations 9, 10, and 11. Recall that $y_i$ is the ground truth depth for the $i$-th pixel while $\hat{y}_i$ is the predicted one, $n$ is the total number of pixels for each depth image, and $thr$ is a threshold commonly set to 1.25.

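$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{9}$$

$$\mathrm{REL} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|y_i - \hat{y}_i\right|}{y_i} \tag{10}$$

$$\delta_1 = \frac{1}{n}\left|\left\{\, i : \max\left(\frac{y_i}{\hat{y}_i}, \frac{\hat{y}_i}{y_i}\right) < thr \,\right\}\right| \tag{11}$$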
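The three metrics can be computed directly from their standard definitions. The sketch below works on flat lists of valid, strictly positive depth values; the function names are hypothetical, and masking of invalid pixels (needed, for instance, with sparse ground truth) is left out.

```python
import math


def rmse(y, y_hat):
    """Root-mean-square error between ground truth and prediction."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def rel(y, y_hat):
    """Mean absolute error relative to the ground truth depth."""
    return sum(abs(a - b) / a for a, b in zip(y, y_hat)) / len(y)


def delta1(y, y_hat, thr=1.25):
    """Fraction of pixels whose ratio max(y/y_hat, y_hat/y) is below thr."""
    hits = sum(1 for a, b in zip(y, y_hat) if max(a / b, b / a) < thr)
    return hits / len(y)


# Hypothetical two-pixel example (depths in meters).
gt = [2.0, 4.0]
pred = [2.0, 2.0]
scores = (rmse(gt, pred), rel(gt, pred), delta1(gt, pred))
```

Lower values are better for RMSE and REL, while a higher $\delta_1$ indicates that more pixels fall within the accepted ratio.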
Moreover, we compare the different models through the number of multiply-accumulate (MAC) operations and trainable parameters. METER has been tested on the low-resource embedded 4GB NVIDIA Jetson TX1 and the 4GB NVIDIA Jetson Nano, which have a power consumption of 10W and 5W respectively. Those devices are equipped with an ARM CPU and an NVIDIA Maxwell GPU, with 256 cores on the TX1 and 128 cores on the Nano. The inference speeds reported in Section V are computed in frames per second (fps) on a single image and averaged over the entire test dataset.
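A minimal way to obtain an fps figure of this kind is to time single-image inference over all test samples and divide the number of samples by the elapsed time. The sketch below is a generic timing harness, not the measurement code used for the reported numbers; `infer` stands for any callable that runs the model on one image, and the warm-up count is an assumption.

```python
import time


def measure_fps(infer, samples, warmup=0):
    """Average frames per second of `infer` over single-image calls."""
    for sample in samples[:warmup]:
        infer(sample)  # warm-up runs are excluded from the timing
    start = time.perf_counter()
    for sample in samples:
        infer(sample)  # one image at a time (batch size 1)
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed


# Hypothetical stand-in for a model: a small CPU-bound function.
def dummy_infer(sample):
    return sum(range(1000)) + sample


fps = measure_fps(dummy_infer, list(range(50)), warmup=5)
```

When the model runs on a GPU, the device should be synchronised before reading the clock (for example with `torch.cuda.synchronize()` in PyTorch), otherwise asynchronous kernel launches make the measured time too short.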

V. RESULTS
In this section, we report the results obtained with METER on the two evaluated datasets, NYU Depth v2 and KITTI, described in Section IV-B. In Section V-A METER is compared with lightweight, state of the art related works in terms of the metrics described in Section IV-C; then, we report multiple ablation studies to emphasize the individual contribution of each METER component. In particular, Section V-B is related to the architecture structure, while Sections V-C and V-D analyze respectively the effect of each element of the proposed balanced loss function and of the shifting strategy used for data augmentation. Finally, in Section V-E, we provide an example of METER application in a real-case scenario.

A. Comparison with state of the art methods
In this section, METER is compared with state of the art lightweight models such as [7], [8], [31], [32], [34], which are designed to infer at high speed on embedded devices while keeping a small memory footprint (lower than 3GB). This choice is due to the limited amount of available memory in the chosen platforms. Usually a portion of the available RAM is reserved for the operating system, thus lowering the overall amount of space available for the model allocation. In particular, METER and its variants allocate less than 2.1GB of memory, a value that does not saturate the hardware's memory and leaves room to perform other operations on the same device. Moreover, for each compared architecture we also report the number of trainable parameters (in millions [M]) and the number of multiply-accumulate (MAC) operations (in giga [G]).
The results can be found in Table II; as can be noticed, METER outperforms all the other methods on both datasets. When compared with [7], METER S achieves a boost of 17%, 15%, and 6% respectively for the RMSE, REL and $\delta_1$ metrics on the NYU Depth v2 dataset and of 11%, 30% and 7% on KITTI. Likewise, METER XS achieves superior performance, with a boost of 9%, 10% and 5% on the NYU Depth v2 dataset and of 10%, 29% and 7% on KITTI. The last configuration, METER XXS, still obtains good predictions compared with state of the art models while using just 0.7M trainable parameters and 0.186G MAC operations.
Moreover, in order to assess the frequency performance of such architectures, we choose as baseline models SPEED, due to its accuracy, and FastDepth, which is one of the most popular techniques. When tested on the NVIDIA Jetson TX1, such models achieve 30.9 fps and 18.8 fps, while METER S, XS and XXS achieve respectively 16.3 fps, 18.3 fps and 25.8 fps. From these results we can remark that our most accurate model shows fps values similar to FastDepth with a considerably lower estimation error, while the lightweight XXS variant exhibits estimation performance and fps comparable to SPEED.

TABLE III: Comparison between the MobileViT [9] and METER encoders over different activation functions (ReLU, SiLU) keeping the METER decoder fixed. The fps are measured on the two benchmark devices, the NVIDIA Jetson TX1 and the NVIDIA Jetson Nano. In bold the best results for each configuration in terms of RMSE, REL and $\delta_1$.
Regarding MAC operations, the SPEED MAC value is on par with METER XS, while the FastDepth MAC value is considerably higher than that of all METER architectures. Furthermore, a qualitative comparison of the proposed METER variants is reported in Figure 4 for an indoor and an outdoor scenario. The estimated depths and their associated difference (Diff) maps, which are per-pixel differences between the ground truth depth maps (GT Depth) and the predicted (Pred Depth) ones, show how the estimation error is distributed along the frame. Specifically, we notice an error increase that is fairly evenly distributed over the frame as the number of trainable parameters of the model is reduced.

B. Ablation study: the encoder-decoder architecture
In this subsection we compare the performance of the encoder and decoder components of METER; results are reported in Table III and Table IV, respectively. In particular, the first analysis highlights the contribution of the novel METER block for each configuration (S, XS, and XXS) while keeping METER decoder fixed. The second analysis focuses on the use of alternative decoders with respect to the default METER decoder, such as NNDSConv5, NNConv5 [8] and MDSPP [7], using METER S encoder since it is the encoder that shows the best performance in the evaluated metrics.
Encoder architectures are compared in Table III, reporting a one-to-one comparison between METER encoder and the MobileViT and evaluating the effects of two different activation functions (ReLU, SiLU). From the obtained results, we highlight that METER encoder (in bold) achieves better depth estimation in all the proposed variants, as well as when compared with the same activation function, using fewer trainable parameters and a reduced number of MAC operations. In particular, when compared with the MobileViT, METER achieves an average improvement of 10%, 14% and 6% on RMSE, REL, and δ_1 metrics on the indoor dataset and of 2%, 7% and 2% respectively on the outdoor dataset. Based on those findings, the overall estimation contribution of the proposed encoder over the three configurations is equivalent to 7%, of which almost 3% is due to the use of the ReLU activation function instead of SiLU. Moreover, regarding MAC operations we obtain a reduction of 20%, 29%, and 60% with respect to the corresponding MobileViT variants (S, XS, XXS), while the fps improvements are respectively 16%, 22%, and 32% on the NVIDIA Jetson TX1 and 16%, 15%, and 28% on the NVIDIA Jetson Nano.
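For readers less familiar with the three metrics used in these comparisons, the sketch below computes RMSE, REL and δ_1 under their standard definitions in the MDE literature (root mean squared error, mean absolute relative error, and the fraction of pixels whose ratio to the ground truth stays below 1.25). It is an assumption that the paper uses exactly these definitions; the function name `depth_metrics` and the toy values are illustrative.

```python
import math


def depth_metrics(gt, pred, threshold=1.25):
    """Return (RMSE, REL, delta_1) over paired ground-truth and predicted depths.

    Pixels with a non-positive ground truth or prediction are skipped as invalid.
    """
    pairs = [(g, p) for g, p in zip(gt, pred) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    n = len(pairs)
    # Root mean squared error: lower is better.
    rmse = math.sqrt(sum((p - g) ** 2 for g, p in pairs) / n)
    # Mean absolute relative error: lower is better.
    rel = sum(abs(p - g) / g for g, p in pairs) / n
    # Fraction of pixels with max(gt/pred, pred/gt) below the threshold: higher is better.
    delta_1 = sum(1 for g, p in pairs if max(g / p, p / g) < threshold) / n
    return rmse, rel, delta_1


# Flattened toy depth maps in metres (made-up values).
gt = [1.0, 2.0, 4.0, 5.0]
pred = [1.0, 2.2, 4.0, 3.0]
rmse, rel, delta_1 = depth_metrics(gt, pred)
print(f"RMSE={rmse:.3f}  REL={rel:.3f}  delta_1={delta_1:.2f}")
```

Note that RMSE and REL are errors while δ_1 is an accuracy, so an "improvement" means a decrease for the first two and an increase for the third.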
In light of the previous experiments, we can state that all METER variants show good accuracy and inference frequency on NYU Depth v2, while in the case of the KITTI dataset the METER XXS variant should be preferred in order to get a reasonable inference speed. Focusing on the timings, the METER XXS variant shows the fastest inference speed, with reasonable results also on the high resolution images of the KITTI dataset, avoiding the need to crop or downscale the original images.

NNDSConv5 is a variant of NNConv5 that takes advantage of depthwise separable convolution to reduce the computational cost. Our encoder-decoder architecture is able to achieve higher speed and a significant improvement in all the estimation metrics with comparable MAC operations with respect to NNDSConv5. Finally, when compared with the NNConv5 decoder, ranked second in our analysis, the proposed structure is able to achieve an overall improvement equal to 12% over the two scenarios. Moreover, it can be noticed that the decoder has little influence on the inference frequency; however, METER decoder still shows the best fps on the two hardware platforms (e.g. 11% for METER S compared to NNConv5 on the TX1 hardware and NYU Depth v2 dataset). The overall reduction in MAC operations with respect to the NNConv5 and MDSPP decoders is equal to 15% on the same configuration as before, suggesting that the optimized METER decoder is able to produce more accurate estimations while using fewer operations.

C. Ablation study: loss function
In this subsection we analyze the impact of the different components of the proposed balanced loss function introduced in Section III-B. The METER S architecture is used as a baseline model. The quantitative and qualitative comparisons are provided in Table V and Figure 5 respectively, while Figure 6 shows the convergence trends of each introduced component, referring to L_depth (blue), L_grad (orange), L_norm (green) and L_SSIM (red).
The shape of the curves shows that the initial loss contribution is mostly attributed to L_depth and L_grad.

The L_depth component proved to be fundamental for the training convergence, thus it is applied in every experiment of Table V. The obtained results demonstrate that each loss component is crucial to get the final METER performance, balancing the reconstruction of the entire image and of edge details. In fact, the loss formulation in the second row focuses only on the overall image, failing to reach satisfying results. At the same time, the third row shows a typical loss exploited in [38] that focuses on edge details but does not take into account the image structure similarity, thus producing an unbalanced loss that achieves a worse result than the proposed one, which is able to obtain the lowest estimation error by balancing all the components. In detail, the balanced loss function (BLF) achieves an improvement of 10%, 12%, and 5% for RMSE, REL and δ_1 metrics on the NYU dataset, and a boost of 13%, 24%, and 10% on the KITTI dataset compared to [38].

Moreover, to better show the qualitative contribution of each loss component, we provide in Figure 5 the estimated depth under the four analyzed configurations given an input sample from the KITTI dataset. Based on this example, we can observe a behaviour similar to the one found in Figure 6 and Table V: the L_depth component is fundamental for a correct image reconstruction, while the weighted addition of specific loss components (λ_1 L_grad, λ_2 L_norm, λ_3 L_SSIM) can quantitatively and qualitatively improve the final estimation. This improvement may also be noticed by observing the predicted frames from left to right, where the object details and the overall estimation quality increase significantly as the difference maps darken.
Therefore, we can conclude that the proposed balanced loss function can successfully enhance the training process, while each component can effectively contribute to more accurate estimations, hence enhancing the entire framework. Precisely, the overall quantitative contribution of the balanced loss function over the two scenarios is equal to 25% when compared with L_depth, and 13% with respect to the loss formulation used in [38].
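The structure of the balanced loss function and of the four ablation configurations can be sketched as a weighted sum in which individual weights are set to zero. This is only an illustration of the combination step: the per-component loss values and the weights below are made up, the names `balanced_loss` and `ablation_configs` are hypothetical, and the actual component definitions are those of Section III-B.

```python
def balanced_loss(l_depth, l_grad, l_norm, l_ssim, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum L_depth + λ_1·L_grad + λ_2·L_norm + λ_3·L_SSIM."""
    lam1, lam2, lam3 = lambdas
    return l_depth + lam1 * l_grad + lam2 * l_norm + lam3 * l_ssim


def ablation_configs(lambdas):
    """Weight triples reproducing the four loss formulations of the ablation."""
    lam1, lam2, lam3 = lambdas
    return {
        "L_depth": (0.0, 0.0, 0.0),
        "L_depth + L_SSIM": (0.0, 0.0, lam3),
        "L_depth + L_grad + L_norm": (lam1, lam2, 0.0),
        "BLF": (lam1, lam2, lam3),
    }


# Hypothetical weights and per-component loss values, for illustration only.
LAMBDAS = (0.5, 0.5, 0.5)
components = {"l_depth": 0.8, "l_grad": 0.4, "l_norm": 0.2, "l_ssim": 0.6}

losses = {
    name: balanced_loss(**components, lambdas=weights)
    for name, weights in ablation_configs(LAMBDAS).items()
}
for name, value in losses.items():
    print(f"{name}: {value:.2f}")
```

Expressing the ablation as weight triples keeps L_depth in every configuration, mirroring the observation that this term is needed for training convergence.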

D. Ablation study: data augmentation
In this ablation study, we evaluate the performance of the proposed data augmentation strategy in comparison with standard MDE data augmentation. We report in Table VI the quantitative results of the shifting strategy (C shift, D shift), the default DA (flip, random crop and channel swap), and the combinations of the two. The proposed shifting strategy (last row) achieves, on the METER S architecture, an improvement of 8%, 6%, and 2% over the RMSE, REL and δ_1 on the NYU Depth v2 dataset, and of 6%, 2%, and 1% on the KITTI dataset. On the other hand, using only the C shift or only the D shift with the default augmentation does not lead to an improvement in the final estimation, resulting in equivalent or slightly worse final predictions. Overall, the improvement of the shifting strategy over the two scenarios is equal to 4% with respect to the default data augmentation policy.
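To make the default augmentation policy concrete, the following sketch applies a flip, a channel swap and a random crop to an RGB image and its depth map. It is a minimal illustration under several assumptions: the flip is taken to be horizontal, each transform is applied with probability 0.5, images are nested lists of (R, G, B) tuples, and all function names are hypothetical. The key point it shows is that geometric transforms must be applied identically to the image and the depth map, while the channel swap affects the RGB image only. The shifting strategy itself (C shift, D shift) is described in Section III-C and is not reproduced here.

```python
import random


def horizontal_flip(rgb, depth):
    """Mirror both the image and the depth map left to right."""
    return [row[::-1] for row in rgb], [row[::-1] for row in depth]


def channel_swap(rgb, order):
    """Permute the colour channels of every pixel; the depth map is unaffected."""
    return [[tuple(px[i] for i in order) for px in row] for row in rgb]


def random_crop(rgb, depth, out_h, out_w, rng):
    """Cut the same randomly placed window from the image and the depth map."""
    h, w = len(depth), len(depth[0])
    top = rng.randint(0, h - out_h)    # randint is inclusive on both ends
    left = rng.randint(0, w - out_w)

    def crop(img):
        return [row[left:left + out_w] for row in img[top:top + out_h]]

    return crop(rgb), crop(depth)


def default_augment(rgb, depth, out_h, out_w, rng, p=0.5):
    """Flip and channel swap with probability p each, then random crop."""
    if rng.random() < p:
        rgb, depth = horizontal_flip(rgb, depth)
    if rng.random() < p:
        order = [0, 1, 2]
        rng.shuffle(order)
        rgb = channel_swap(rgb, order)
    return random_crop(rgb, depth, out_h, out_w, rng)


# Toy 4x4 sample in which every pixel (v, v, v) has depth v, so alignment is easy to check.
rgb = [[(v, v, v) for v in range(r * 4, r * 4 + 4)] for r in range(4)]
depth = [[float(v) for v in range(r * 4, r * 4 + 4)] for r in range(4)]
aug_rgb, aug_depth = default_augment(rgb, depth, 2, 3, random.Random(0))
print(aug_rgb, aug_depth)
```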

E. Real-case scenario
One of the main objectives of exploring lightweight deep learning solutions is to close the gap between computer vision and practical applications, where the proposed models may be integrated as perception systems, such as robotic systems, thus taking into account possible hardware limitations. Therefore, in this subsection, we present an example of a real-case application in which METER is used to estimate the scene depth from a generic camera image. We used a Kinect V2 to measure the reference depth of the scene. The resulting acquisition is reported in Figure 7.
Qualitatively comparing the reference depth and the estimated one, we can notice a less sharp prediction, which can be mainly attributed to the lower working resolution that ensures high frame rates on edge devices.However, the object shapes are still adequately defined, and the overall estimation is visually comparable with the reference frame.
Moreover, in order to perform a quantitative analysis, we compute the average error on three salient objects that appear in the input frame (RGB Input): point A for the armchair, point B for the box and point C for the curtain. The estimation error for the first two points (A and B) is almost equal to 0.79m, respectively. The obtained value is related to the fact that we are working in a challenging open-set scenario whose statistics differ from those of the training set. On the other hand, by analyzing point C, we can identify one of the main drawbacks of active depth sensing, i.e. missing or incorrect depth measurements under particular lighting conditions. In this scenario, although the estimation error is unknown because the reference measurement is missing, most likely due to the intense light source directed towards the camera sensor, our model can still correctly identify and estimate the area as a single surface.
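The point-wise comparison described above, including the case in which the reference sensor returns no measurement, can be sketched as follows. The encoding of missing reference readings as 0.0, the function name `point_error`, the pixel coordinates and all depth values are assumptions made for illustration; they are not taken from the acquisition in Figure 7.

```python
import math


def point_error(reference, predicted, point, invalid=0.0):
    """Absolute depth error in metres at pixel `point` = (row, col).

    Returns NaN when the reference sensor has no valid reading at that pixel,
    so that an unknown error is not mistaken for a zero error.
    """
    row, col = point
    ref = reference[row][col]
    if ref == invalid or math.isnan(ref):
        return math.nan
    return abs(ref - predicted[row][col])


# Toy 2x2 depth maps in metres; 0.0 marks a missing reference measurement.
reference = [[2.0, 0.0], [3.5, 1.0]]
predicted = [[2.5, 1.0], [3.0, 1.0]]

# Hypothetical pixel coordinates for the three salient objects.
points = {"A": (0, 0), "B": (1, 0), "C": (0, 1)}
errors = {name: point_error(reference, predicted, p) for name, p in points.items()}
print(errors)
```

In this toy example the error at point C is reported as unknown, while the model still provides a depth estimate there.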

VI. CONCLUSIONS
In this work, we propose METER, an MDE architecture characterized by a novel lightweight vision transformer model, a multi-component loss function and a specific data augmentation policy. Our method exploits a lightweight encoder-decoder architecture characterized by a transformer METER block, which is able to improve the final depth estimation with a small number of computed operations, and a fast upsampling block employed in the decoder. METER achieves high inference speed on low-resource embedded hardware such as the NVIDIA Jetson TX1 and the NVIDIA Jetson Nano. Moreover, the METER architecture in its three configurations is able to outperform previous state of the art lightweight works. Thanks to the obtained inference frequency and estimation accuracy, the proposed architectures are good candidates for multiple MDE scenarios and real-world embedded applications. Precisely, METER S outperforms the accuracy of state of the art lightweight methods on the two datasets, METER XS represents the best trade-off between inference speed and estimation error, and METER XXS reaches a high inference frequency, up to 25.8 fps, on the two hardware platforms at the cost of a small increment in the estimation error.
The obtained results and the limited MAC operations of the proposed network demonstrate that our framework could be valuable in a variety of resource-constrained applications, such as autonomous systems, drones, and IoT. Moreover, we also test METER in a real-case scenario with a frame captured by a generic camera, achieving a reasonable estimation error.
Finally, the METER architecture could be a valuable starting point for future studies aimed at real-time inference frequency on high resolution images, as well as at building transformer architectures that take advantage of the attention mechanism in both the encoder and decoder structures.

Fig. 2: Overview of METER encoder-decoder network structure. The processing flow, i.e. the sequence of operations and the skip-connection, is represented with a blue dashed arrow. The (H, W, C) format refers to the input-output spatial dimensions, while ↑ and ↓ refer to the feature resolution upsampling and downsampling.

Fig. 3: Illustration of an augmented sample with the proposed shifting strategy. The shifting factors (β, γ, η, and S) are set to their minimum and maximum values, i.e. {0.9, −10} and {1.1, +10} respectively. The min/max depth ranges for the regions of interest are given through the respective colored bars.

Fig. 4: A graphical comparison among METER (S, XS, XXS) configurations. For a better visualization, we apply to depth images and difference maps uniform colormaps with the same depth range. Precisely, in the ground truth (GT) and predicted depth maps (Pred) a lower color intensity corresponds to farther distances, while in the difference map (Diff = |GT − Pred|) a lower color intensity corresponds to a smaller error.

Fig. 5: Qualitative comparison of a predicted frame taking into account different loss components. For a better visualization, we apply to the depth images and to the difference maps uniform colormaps with the same depth range. Precisely, in the ground truth (GT) and predicted depth maps (Pred) a lower color intensity corresponds to farther distances, while in the difference map (Diff = |GT − Pred|) a lower color intensity corresponds to a smaller error.

Fig. 6: Plot of the individual loss components composing the balanced loss function in the first ten epochs, i.e. almost 3600 iterations.

Fig. 7: METER application in a real-case scenario. Missing depth measurements of the reference (Ref) depth are shown as yellow pixels. A uniform colormap, with the same depth range, has been applied to the depth maps. Points A (armchair), B (box), and C (curtain) on the RGB frame indicate the objects used for the quantitative comparison.

TABLE I: Number of channels (C_i) used in METER configurations.

TABLE II: Comparison with state of the art lightweight methods on the two benchmark datasets. The best scores are in bold and the second best are underlined; the "-" symbol represents a value which is not reported in the original paper.

TABLE IV: Comparison between lightweight decoder architectures keeping METER S encoder fixed. The best scores are in bold.
Table IV compares METER decoder with those of other lightweight models; we used the METER S encoder as baseline. METER decoder achieves an RMSE improvement of 16% and 19% on the NYU Depth v2 dataset and of 6% and 11% on the KITTI dataset with respect to the NNConv5 and MDSPP models. Furthermore, we compare METER decoder with NNDSConv5 [8], a variant of NNConv5.

Loss configurations compared in the loss ablation: L_depth; L_depth + λ_3 L_SSIM; L_depth + λ_1 L_grad + λ_2 L_norm; L_depth + λ_1 L_grad + λ_2 L_norm + λ_3 L_SSIM.

TABLE V: The effect of each component of the balanced loss function on METER S over the considered metrics. The best scores are in bold. BLF denotes L_depth + λ_1 L_grad + λ_2 L_norm + λ_3 L_SSIM.

TABLE VI: Comparison between different augmentation strategies. The default policy comprises the flip, random crop and channel swap, while the others represent the different components of the shifting strategy described in Section III-C. The reference model is METER S. The best scores are in bold.