Learning Depth Estimation From Memory Infusing Monocular Cues: A Generalization Prediction Approach

Depth estimation from a single image is a challenging task, yet the field has promising prospects in automatic driving and augmented reality. However, prediction accuracy degrades significantly when a trained network is transferred from the training dataset to real scenarios. To address this issue, we propose MonoMeMa, a novel deep architecture based on the human monocular cue: humans can perceive depth information with one eye through the relative size of objects, light and shadow, etc., based on previous visual experience. Our method simulates the formation and utilization of human monocular visual memory in three steps. First, MonoMeMa perceives and extracts feature vectors of real-world objects (encoding). Then, it maintains and replaces the extracted feature vectors over time (storing). Finally, MonoMeMa combines query object feature vectors and memory to infer depth information (retrieving). Experiments show that our model achieves state-of-the-art results on the KITTI driving dataset. Moreover, MonoMeMa exhibits remarkable generalization performance when migrated to other driving datasets without any fine-tuning.


I. INTRODUCTION
Obtaining accurate depth from images is one of the most important tasks in computer vision. In recent years, depth estimation has found a wide range of applications in fields such as automatic driving, robot navigation, 3D reconstruction, and augmented reality. Although LIDAR technology is quite popular, attaining depth from images is often preferable: compared with LIDAR sensors, using a camera to collect depth information has several potential advantages, including low cost, easy installation, and adaptability to various environments.
The popular solution for visual depth prediction so far is stereo estimation, which infers disparity (i.e., the inverse of depth) using two or more cameras at different points of view. However, these binocular approaches are limited by calibration errors and synchronization problems. Therefore, predicting depth from a single image has become a hot area in depth estimation. Monocular depth estimation is very challenging, as the image is a projection of the 3-D scene and the projection captures only 2-D information. Different from binocular methods, monocular depth estimation regards depth prediction as a regression problem and focuses on finding a relationship between pixel values and depth values [1]. To achieve this goal, early methods [1]-[4] used machine learning techniques to build monocular depth estimation models. With the development of deep learning in recent years, monocular depth estimation approaches based on deep neural networks [5]-[8] have become popular. In these studies [5], [7], [9], experimental results for monocular depth estimation exhibit excellent performance, which indicates that deep neural networks are suitable for the task of mapping pixel values to depth values. However, the critical challenge is generalization performance: although deep learning methods display excellent results on a single dataset such as KITTI [10], they rarely show equally good performance in generalization tasks where the input data has various aspect ratios, different camera settings, and distinctive vehicle poses.
Therefore, approaches based on deep neural networks generally lack generalization ability. Although [11] develops tools that enable mixing multiple datasets with incompatible annotations, it is unrealistic and impractical to obtain datasets covering all real-world scenes [12]. Hence, generalization ability is important for supporting modern intelligent applications such as automatic driving.
Human beings can predict the depth of pictures taken by various cameras with different configurations. That is to say, people can estimate object depth from an image even without prior knowledge of the camera specifications. The reason, as shown in [13], is that humans perform well at monocular depth estimation by exploiting monocular cues such as perspective and scaling relative to the known size of familiar objects. To perceive depth in new scenes, humans use these monocular cues by comparing the size of unfamiliar objects with the memorized size of familiar ones. It is precisely because we have formed a rich and structured understanding of the world through past visual experience that we humans can model real-world scenes well [14].
Inspired by the monocular cues in human depth perception, in this work we propose the Monocular Memory Matching (MonoMeMa) architecture to estimate object depth from a single image. In the first stage, we use an encoder to perceive and extract feature vectors of real-world objects. Then we use the extracted feature vectors to search for empirical information in the memory storage, which holds monocular cues extracted from past experience, such as the size and type of objects together with their depth labels (the storage maintains and replaces the extracted feature vectors over time). Finally, we use the matching information obtained from the memory, together with a decoding network, to infer the depth.
Our main contributions are threefold: (1) An external memory storage: the memory simulates humans' past visual experience and stores monocular cues for target objects in order to recover depth in new scenes. (2) An encoder-decoder architecture: the architecture cooperates with the external storage to recover depth; it extracts the features of target objects from a single picture and sends them to the decoder, which recovers the object depth by combining similar past-experience information from the external storage. (3) A novel memory storage control mechanism: the mechanism determines whether the current training data is valuable for future prediction tasks and learns to write as little information as possible while maintaining considerable accuracy.

II. RELATED WORK
In this section, we review the literature relevant to our work on stereo and monocular depth estimation.

A. STEREO DEPTH ESTIMATION
Given a pair of rectified stereo images, the goal of stereo depth estimation is to compute the disparity $d$ for each pixel in the reference image. Disparity refers to the difference in horizontal location of a pixel between the left and right images: a pixel at position $(x, y)$ in the left image appears at position $(x - d, y)$ in the right image. The depth of this pixel is then calculated as $z = \frac{f \cdot B}{d}$, where $f$ is the camera's focal length and $B$ is the distance between the two camera centers.
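As a concrete illustration, the sketch below converts disparity to depth with this relation; the default focal length and baseline are rough KITTI-like values chosen for the example, not parameters taken from this paper.

```python
# A minimal sketch of the stereo relation z = f * B / d. The default values
# are illustrative approximations for a KITTI-like camera, not parameters
# specified by this paper.
def depth_from_disparity(d, f=721.5, B=0.54):
    """Convert disparity d (pixels) to depth z (meters), given focal length f
    (pixels) and baseline B (meters)."""
    return f * B / d

print(depth_from_disparity(32.0))  # ~12.2 m for the assumed camera
```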
Most conventional dense stereo algorithms calculate disparity following the four steps summarized in [15]: matching cost computation, cost (support) aggregation, disparity computation/optimization, and disparity refinement; these methods rely on two-frame stereo correspondence. Current state-of-the-art studies focus on computing the matching cost accurately and refining the disparity map. With the rapid development of deep learning, convolutional neural networks (CNNs) have been applied to learn how to match corresponding points: a deep network trained to match 9 × 9 image patches was shown in [16] to produce then-state-of-the-art results; [17] regards the correspondence problem as a multi-scale task and proposes a notably faster Siamese network; and [18] designs a deep network to compute disparity directly from the images. A popular and effective approximation for refining the disparity map is the Semi-Global Matching (SGM) of [19], where dynamic programming optimizes a pathwise form of the energy function in many directions.
Recently, end-to-end networks have been developed to predict whole disparity maps without post-processing. Mayer et al. [20] created a large synthetic dataset to train end-to-end networks for disparity estimation (DispNet) and optical flow (FlowNet), improving the state of the art. Kendall et al. [21] introduce GC-Net, an end-to-end network that efficiently learns context in the disparity cost volume using 3-D convolutions. Chen [22] proposed the pyramid stereo matching network (PSMNet) to exploit global context information in stereo matching; PSMNet uses spatial pyramid pooling (SPP) [23] and dilated convolution [24] to enlarge the receptive fields and improve the utilization of global context information.
The methods above rely on a large amount of ground-truth disparity data. However, the image pairs used for training are hard to obtain in the real world, and calibration errors and synchronization problems can also reduce the accuracy of the training data.

B. MONOCULAR DEPTH ESTIMATION
Monocular depth estimation refers to the problem setup where only a single image is available at test time. Before the deep learning era, monocular depth estimation methods [1]-[4] were based on machine learning techniques. Saxena et al. [1] treat the task of recovering depth from pixels as a regression problem: they utilize a Markov Random Field (MRF) and hand-designed multi-scale texture features to incorporate local and global image features, modeling both the depth at individual points and the relation between depths at different points. With the increasing availability of ground-truth data, supervised approaches have outperformed these earlier works.
Eigen et al. [5] propose a model employing two deep network stacks, where one makes a coarse global prediction based on the entire image and the other refines this prediction locally. Liu et al. [7] present a deep convolutional neural field model for estimating depth from single monocular images and design a deep structured learning scheme that learns the unary and pairwise potentials of a continuous conditional random field (CRF) in a unified deep CNN framework, avoiding hand-crafted features. Li et al. [25] combine deep learning features on image patches with hierarchical CRFs defined on a superpixel segmentation of the image. Work by Laina et al. [26] models the ambiguous mapping between monocular images and depth maps using a fully convolutional architecture with residual learning; they also introduce the reverse Huber loss, which is particularly suited to the value distributions commonly present in depth maps. Some methods combine depth map prediction with semantic segmentation: Ladický et al. [27] simplify depth prediction to a classification problem and propose a pixel-wise classifier that jointly predicts a semantic class and a depth label from a single image, while Liu et al.'s method [28] first performs semantic segmentation of the scene and then uses the semantic labels to guide the 3D reconstruction.
Recently, some unsupervised depth estimation methods have also been proposed. Compared to general supervised learning, these methods do not need vast amounts of manually labelled training data. Garg et al. [29] propose a stereopsis-based encoder-decoder architecture, which predicts depth by training on an image reconstruction loss. Zhou et al. [14] present an unsupervised learning framework for monocular depth and camera motion estimation from unstructured video sequences. Guo et al. [30] propose a framework that makes full use of cross-domain synthetic data, using stereo matching networks as a proxy to learn depth from synthetic data and using the predicted stereo disparity maps to supervise the training of monocular depth estimation networks. In [31], a monocular camera moves through an unknown indoor environment acquiring continuous image sequences, and depth estimation and object detection are implemented through an FCN and Faster R-CNN, respectively.
Finally, most closely related to our work is the work by J. Konrad et al. [32]. Their approach regards the depth estimation task as a matching problem: instead of relying on a deterministic scene model for the input 2D image, they propose to ''learn'' the model from a large dictionary of stereo pairs such as YouTube 3D. Based on the assumption that two stereo pairs whose left images are photometrically similar are likely to have similar disparity fields, they predict the depth information by matching the input image with the stereo-pair images in the dictionary. Inspired by their work, we design a network with a memory that contains useful past experience collected during the training step; thus, our model can predict depth based on the feedback obtained by querying the memory with the input image features.

III. MONOCULAR MEMORY MATCHING
In this section, we describe the MonoMeMa architecture, which is designed to infer accurate depth for an object from a single image in a supervised manner. We begin with the encoder-decoder structure of our model, then describe the memory control strategy used to accumulate useful memories, and finally present the training loss for each part of our model. Figure 2 shows an overview of our framework, depicting an input frame and the outcome of MonoMeMa.

A. MODEL ARCHITECTURE
Our model focuses on the depth of specific target objects instead of per-pixel depth, as object depth estimation is more useful and practical for real-world applications like auxiliary driving, where detecting critical objects such as cars and pedestrians and obtaining their depth is an efficient way to parse the scene.
The structure of the proposed MonoMeMa is shown in Figure 2; it consists of an object encoder, a depth decoder, and a memory storage.
The purpose of our object detection network is to detect and encode specific targets, such as vehicles, from an image. The object encoder consists of a CNN-based feature extractor network and the region proposal network (RPN) proposed in [33], which enables object detection via bounding-box regression [33], [34] and non-maximum suppression (NMS). First, the image is processed by convolution and pooling layers to obtain the feature map. The encoder then extracts a fixed-length feature vector from the feature map using the ROI pooling layer. After that, each feature vector is fed into two parallel output layers. One layer performs classification over eight object classes (Car, Van, Truck, Pedestrian, Person_sitting, Cyclist, Tram, Misc) and outputs the probability distribution of each RoI over these classes. The other layer outputs four real values (bounding-box regression) for each of the eight classes; these values represent the bounding-box locations for each class.
Inspired by monocular depth cues, we design an external memory storage that collects high-dimensional knowledge of the depth cues seen during training, such as the class, size, and distance of different samples. By matching the target object with historical objects in memory, the external memory storage returns the matching memory vectors and their ground-truth depth labels.
The decoder then predicts the depth based on the vectors and labels obtained from the object encoder network and the memory. Since an image may contain multiple target objects, we make a prediction for each query feature vector separately and choose a weight-sharing recurrent LSTM as the decoder. For a specific query vector, we use the KNN algorithm to find multiple similar historical feature vectors in the memory storage. We use an MLP network to map the ground-truth depth labels corresponding to the historical vectors from low dimension to high dimension. The high-dimensional vector and the memory feature vector are then concatenated, and the resulting vector is used as the input at each time step of the LSTM. The LSTM is followed by a fully connected network that outputs the depth prediction for the target object.
Our design requires that the decoder be able to recover the depth of the target object from memory. A recurrent decoder such as an LSTM has internal memory, and its hidden activations are similar to registers: the decoder can mix information across multiple time steps and select different weights for different memories to recover the depth information of the query vector. We also compare other decoders in the ablation study; comparisons between several more complex feed-forward decoders and LSTMs are also reported in [35].
Suppose that the capacity of the memory, which is determined during the training process, is $m$. Each memory segment $M_i$ involves two elements, the feature vector $E_i$ and the corresponding depth label $L_i$, so that the set $M = \{(E_i, L_i) \mid i = 1, \ldots, m\}$. The depth label can be obtained from the disparity value and the camera parameters provided in the dataset.
The goal of the inference process is to obtain the depth of each specific object from a given image $I$. Assume that the CNN function $f_C$, the RPN function $f_R$, and the LSTM function $f_L$ have been optimized during the training process. In the object encoder network, the extracted feature vectors $e = \{e_1, e_2, \ldots, e_N\}$ ($N$ represents the number of specific objects in a single image) are obtained in four steps, as shown in Figure 2: first, a feature map $F$ of the raw input image $I$ is computed by the CNN as $F = f_C(I)$; then the RPN [33] predicts the boundaries $B_j$ and classes $C_j$ (background or foreground) as $\{B_j, C_j\} = f_R(F)$; after that, we use the boundary and class information to extract the regions of interest (ROIs) $R_j$ and reshape them to a uniform size through a pooling layer; finally, the feature vectors are obtained by stretching the ROIs into one-dimensional vectors as $e_j = \mathrm{MLP}(R_j)$, $j = 1, 2, \ldots, N$.
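For illustration, the following sketch walks through the four encoding steps in TensorFlow. The backbone, the layer sizes, the hard-coded boxes standing in for the RPN output, and the 7 × 7 crop size are all hypothetical stand-ins, since the actual encoder follows Faster R-CNN [33].

```python
import tensorflow as tf

# Hypothetical two-layer backbone standing in for f_C.
backbone = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(128, 3, strides=2, activation="relu"),
])

image = tf.random.uniform([1, 375, 1242, 3])        # raw input I
feature_map = backbone(image)                       # step 1: F = f_C(I)

# Step 2 (the RPN f_R) is stubbed: assume it returned N normalized boxes.
boxes = tf.constant([[0.2, 0.3, 0.6, 0.5],
                     [0.4, 0.1, 0.9, 0.4]])         # {B_j}, [y1, x1, y2, x2]
box_idx = tf.zeros([2], dtype=tf.int32)

# Step 3: extract ROIs R_j and reshape to a uniform size (ROI pooling analogue).
rois = tf.image.crop_and_resize(feature_map, boxes, box_idx, crop_size=[7, 7])

# Step 4: stretch each ROI into a one-dimensional query vector e_j.
mlp = tf.keras.Sequential([tf.keras.layers.Flatten(),
                           tf.keras.layers.Dense(256)])
query_vectors = mlp(rois)                           # e = {e_1, ..., e_N}
print(query_vectors.shape)                          # (2, 256)
```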
As shown in Figure 2, the depth decoding step for each feature vector involves two parts: (1) searching for the matching memory vectors and (2) decoding through the LSTM. We use the Euclidean distance to measure the similarity between the $j$th query feature vector and the $i$th memory vector, given as $d_i^j = \lVert e_j - E_i \rVert$. Based on this similarity, we select the $k$ vectors with the smallest Euclidean distance as the output of the memory storage; $k$ denotes the amount of data output by the memory each time, and its value depends on the LSTM size (the detailed value is given in Section V(C)). Finally, we obtain the predicted depth for the $j$th object as $\hat{D}_j = f_F\big(f_L\big(E_{i_1} \oplus \mathrm{MLP}(L_{i_1}), \ldots, E_{i_k} \oplus \mathrm{MLP}(L_{i_k})\big)\big)$, where $i_1, \ldots, i_k$ index the selected memory segments, $\oplus$ denotes concatenation, and $f_F$ and $f_L$ represent the fully connected layer and the LSTM, respectively.
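A minimal sketch of this retrieval-and-decode step is given below, assuming 256-d feature vectors, a memory of 100 entries, and k = 10; the concatenation order, the MLP width, and whether the query vector itself also conditions the LSTM are our assumptions rather than details fixed by the text.

```python
import numpy as np
import tensorflow as tf

def retrieve_k_nearest(query, mem_vectors, mem_labels, k):
    """Select the k memory entries with the smallest distance ||e_j - E_i||."""
    dists = np.linalg.norm(mem_vectors - query, axis=1)
    idx = np.argsort(dists)[:k]
    return mem_vectors[idx], mem_labels[idx]

# Hypothetical sizes: 256-d features, memory of 100 entries, k = 10 time steps.
feat_dim, mem_size, k = 256, 100, 10
memory_E = np.random.randn(mem_size, feat_dim).astype("float32")
memory_L = (np.random.rand(mem_size, 1) * 80.0).astype("float32")  # depths (m)

label_mlp = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
lstm = tf.keras.layers.LSTM(128)   # f_L
head = tf.keras.layers.Dense(1)    # f_F

def decode_depth(query_vec):
    E_k, L_k = retrieve_k_nearest(query_vec, memory_E, memory_L, k)
    high_dim_labels = label_mlp(L_k)                  # map labels to high dim
    steps = tf.concat([E_k, high_dim_labels], axis=1) # E_i (+) MLP(L_i)
    steps = tf.expand_dims(steps, 0)                  # (1, k, feat_dim + 64)
    return head(lstm(steps))                          # predicted depth D_j

print(decode_depth(np.random.randn(feat_dim).astype("float32")))
```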
To summarize, our model consists of two main parts: an object encoder that extracts the depth features of the specific object, and a depth decoder network that recovers the depth.

B. MEMORY CONTROL
In order to make our model work more efficiently, the memory should learn to write as little information as possible while maintaining considerable accuracy. To this end, we design a memory control strategy that judges whether the current training data is valuable for later prediction tasks. It is based on two assumptions: (1) a model is improved by the labels of data that it does not predict accurately (we call such data valuable information), and (2) there is no need for a model to store data that it already predicts precisely (we call such data valueless information). We expect the memory to provide our model with valuable information: when the predictions for the current training data are sufficiently precise, we skip the data; when our model performs badly on the current training data, we assume the data contains information the model has not yet learned well, and we write it into the memory to help with further predictions.
Based on these considerations, we design a controller to measure the value of each training sample and set a memory control threshold $\zeta_a$ to adjust the capacity of the memory.
In the controller, we calculate the absolute relative error $\sigma_a = \frac{|D^* - D|}{D^*}$, where $D$ is the predicted depth and $D^*$ is the ground-truth label. If $\sigma_a > \zeta_a$, there is an unacceptable difference between the predicted depth and the ground truth, so this valuable vector should be added to the memory; the stored ground-truth labels can then greatly assist the decoder in correcting its predictions the next time a similar query vector is encountered. Conversely, if $\sigma_a < \zeta_a$, our model has already learned the sample well and the existing memory and decoder suffice to predict its depth, so there is no need to store it and its labels. By varying the value of $\zeta_a$ in the ablation study, we show not only the positive effect of memory on depth estimation tasks but also the robustness of our approach. Figure 3 shows the data flow during the training step. Noticeably, backpropagation is realized only in the process marked by the red line, while the process marked by the black line works only in the forward pass. This means the memory selection process during training needs no backpropagation, which reduces the computational complexity.
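A minimal sketch of the controller follows; the FIFO eviction used when the capacity is full is our assumption, as the paper only states that entries are maintained and replaced over time.

```python
# A minimal sketch of the memory write controller. FIFO replacement when the
# capacity m is reached is an assumption, not a rule stated by the paper.
def maybe_write_memory(memory, feature_vec, depth_pred, depth_gt,
                       zeta_a=0.1, capacity=100):
    sigma_a = abs(depth_gt - depth_pred) / depth_gt  # absolute relative error
    if sigma_a > zeta_a:                 # poorly predicted: valuable information
        if len(memory) >= capacity:      # assumed FIFO eviction
            memory.pop(0)
        memory.append((feature_vec, depth_gt))
    return memory

memory = []
memory = maybe_write_memory(memory, [0.1] * 4, depth_pred=18.0, depth_gt=25.0)
print(len(memory))  # 1: |25 - 18| / 25 = 0.28 > 0.1, so the sample is stored
```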

C. LOSS FUNCTION
As our model consists of two main networks, the loss function is composed of the object encoder loss $E_O$ and the depth decoder loss $E_H$.

1) OBJECT ENCODER LOSS
This part measures the model's ability to demarcate specific objects in a single image. It consists of a regression loss for finding the bounding boxes and a classification loss for recognizing whether a region is foreground or background. The object encoder loss $E_O$ is given as

$E_O = \frac{1}{N_{cls}} \sum_i E_{cls}(C_i, C_i^*) + \tau \frac{1}{N_{bbox}} \sum_i C_i^* E_{bbox}(B_i, B_i^*),$

where $i$ is the index of a bounding box, $N_{cls}$ represents the batch size, and $N_{bbox}$ represents the number of bounding boxes. $C_i$ and $C_i^*$ represent the predicted class and the corresponding label, where 1 stands for foreground and 0 stands for background. $B_i$ and $B_i^*$ are the four-dimensional predicted bounding boxes and their labels. $E_{bbox} = R(B_i - B_i^*)$, where $R$ is the robust loss proposed in [34], and $E_{cls}$ is the cross-entropy function. $\tau$ is a constant factor that weights the two parts. The complete reference can be found in [33], equation 1.
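The sketch below implements one reading of this loss, with binary cross-entropy for $E_{cls}$ and the smooth-L1 robust loss of [34] for $E_{bbox}$; the batch handling and normalization are simplified relative to the reference [33].

```python
import tensorflow as tf

# A sketch of E_O under our reading of the formula above; normalization details
# are simplified relative to the Faster R-CNN reference [33].
def object_encoder_loss(C_pred, C_star, B_pred, B_star, tau=1.0):
    # Classification term: binary cross-entropy over foreground/background.
    e_cls = tf.keras.losses.binary_crossentropy(C_star, C_pred)
    # Regression term: smooth-L1 (the robust loss R of [34]) on box offsets,
    # counted only for foreground anchors (C*_i = 1).
    diff = tf.abs(B_pred - B_star)
    e_bbox = tf.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    e_bbox = tf.reduce_sum(e_bbox, axis=-1) * tf.squeeze(C_star, axis=-1)
    return tf.reduce_mean(e_cls) + tau * tf.reduce_mean(e_bbox)

C_star = tf.constant([[1.0], [0.0]])   # one foreground, one background anchor
C_pred = tf.constant([[0.9], [0.2]])
B_star = tf.zeros([2, 4])
B_pred = tf.constant([[0.1, 0.2, 0.0, 0.3], [2.0, 0.0, 0.0, 0.0]])
print(object_encoder_loss(C_pred, C_star, B_pred, B_star))
```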

2) DEPTH DECODER LOSS
This part measures the model's ability to decode depth from feature vectors. To compensate for the inaccuracy caused by small differences between predictions and the ground truth, as well as by the wide distribution of distances in the dataset, we adopt the Huber loss [36], which is particularly suited to this task [26]:

$E_H = \begin{cases} \frac{1}{2}(D - D^*)^2, & |D - D^*| \le c, \\ c\,|D - D^*| - \frac{1}{2}c^2, & |D - D^*| > c, \end{cases}$

where $D$ and $D^*$ are the predicted depth and the corresponding labels, and $c$ is a constant.
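A short sketch of this loss follows; tf.keras.losses.Huber(delta=c) implements the same piecewise definition, so the explicit version is shown only for clarity.

```python
import tensorflow as tf

# A sketch of the depth decoder loss as written above (c is a constant).
def depth_decoder_loss(D_pred, D_star, c=1.0):
    err = tf.abs(D_pred - D_star)
    quadratic = 0.5 * err ** 2               # |D - D*| <= c
    linear = c * err - 0.5 * c ** 2          # |D - D*| > c
    return tf.reduce_mean(tf.where(err <= c, quadratic, linear))

print(depth_decoder_loss(tf.constant([10.0, 30.0]), tf.constant([10.5, 26.0])))
# 0.5 * 0.5^2 = 0.125 and 4 - 0.5 = 3.5, so the mean is 1.8125
```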

IV. EXPERIMENT
In this section, we describe the datasets, implementation details, and metrics, and then present exhaustive evaluations of MonoMeMa on various training/testing configurations, showing that our method outperforms supervised state-of-the-art approaches. As is standard, we assess the performance of monocular depth estimation techniques following the protocol of Eigen et al. [5], extracting data from the KITTI [10], CityScapes [37], and ApolloScape [38] datasets. Note that we extract the depth labels at the 2D level and take the label depth from the center pixel of each object, so that the depth value is not disturbed by background regions. Additionally, we perform an exhaustive ablation study showing that the LSTM-based decoder and the memory enable our strategy to improve depth prediction accuracy.

A. DATASETS
1) KITTI
The KITTI dataset [10] is a collection of outdoor scenes concerning driving scenarios. It consists of 61 scenes containing about 42,382 stereo frames; the standard image size is 1242 × 375 pixels. Each image contains up to 15 vehicles and 30 pedestrians with different degrees of occlusion, and a LIDAR device measures the depth information. Since the encoder needs to be trained, we chose the images in the KITTI dataset that contain detection-box labels; this dataset also provides true depth labels for the target objects, so we can directly use the detection-box labels and the depth labels to train the encoder and decoder. We take 7481 images from the KITTI object detection dataset and select 6058 of them for the training set, 674 for the validation set, and 749 for the test set.

2) CITYSCAPES
The CityScapes dataset [37] includes stereo pairs (about 22,973 frames) covering 50 cities in Germany, captured by a moving vehicle in various weather conditions. Its standard image size is 2048 × 1024 pixels. In our experiment, we select 1525 of the 5000 finely annotated images from three main cities as the test set for the generalization task. Note that the CityScapes dataset does not provide target detection-box labels or disparity maps; since we only perform the depth-prediction generalization task on this dataset, detection-box labels are not needed. First, we use the SGM algorithm to compute the disparity-based depth map of the CityScapes dataset as the ground truth. Then we use the trained encoder to find the target objects in the original image. Finally, we calculate the depth label of each target object by average pooling the depth map over the target object mask.

3) APOLLOSCAPE
The ApolloScape dataset [38] is a large-scale dataset for autonomous driving. It is composed of 140K images of 3130 × 960 pixels captured in three Chinese cities under various traffic conditions; the number of moving objects per image ranges from tens to over one hundred. We select 1000 images from four distinctive regions in two cities as the test set for the generalization task. The depth labels of the ApolloScape dataset are obtained in the same way as for CityScapes.

B. IMPLEMENTATION DETAILS
The network, implemented in TensorFlow [39], contains 132 million trainable variables (126 million for the encoder and 6.5 million for the decoder) and takes around 20 hours to train (16 hours for the object encoder and 4 hours for the decoder) using a single GTX 1080 GPU on the dataset of 6 thousand images. Inference takes less than 160 ms (more than 6 frames per second) for a 1242 × 375 image, including transfer times to and from the GPU. Figure 4 shows the results on the KITTI dataset. Please see our code for more details.
During training, we set the learning rate of the object encoder to $\alpha_{obj} = 10^{-3}$ and the learning rate of the decoder to $\alpha_{dep} = 10^{-4}$. For the memory, we set the maximum capacity to $m = 100$ and the threshold to $\zeta_a = 0.1$. When calculating metrics, we regard the object as the minimal unit of evaluation: for example, our model generates 2926 objects with corresponding depths from the 749 images of the KITTI test set, so we assess the average performance over objects instead of images. Moreover, for a fair comparison with other methods, we use the bounding boxes obtained from the object encoder to crop the depth maps produced by other methods and calculate the object depth by average pooling [40]. Generally, the depth labels provided in datasets are disparities; thus, we obtain the depth maps using the formula $D = b \cdot f / d$, where $D$ is the depth, $b$ is the baseline, $f$ is the focal length of the camera, and $d$ is the disparity.

C. EVALUATION METRICS
The main evaluation indicator for the object detection network is mean Average Precision (mAP). For the depth estimation task, we evaluate our model using the metrics proposed in prior work [5]:

Abs Rel: $\frac{1}{N}\sum_{x_i} |\rho(x_i) - g(x_i)| / g(x_i)$; Sq Rel: $\frac{1}{N}\sum_{x_i} (\rho(x_i) - g(x_i))^2 / g(x_i)$; RMSE: $\sqrt{\frac{1}{N}\sum_{x_i} (\rho(x_i) - g(x_i))^2}$; RMSE log: $\sqrt{\frac{1}{N}\sum_{x_i} (\log\rho(x_i) - \log g(x_i))^2}$; Accuracy: the fraction of objects with $\max(\rho(x_i)/g(x_i),\, g(x_i)/\rho(x_i)) = \delta < thr$ for $thr \in \{1.25, 1.25^2, 1.25^3\}$,

where $x_i$ represents the index of the objects, $\rho(x_i)$ represents the predicted depth, $g(x_i)$ represents the ground-truth depth, and $N$ is the number of objects.
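These per-object metrics can be computed directly, as in the sketch below, which follows the notation above.

```python
import numpy as np

# A sketch of the per-object metrics of Eigen et al. [5] in the notation above.
def depth_metrics(rho, g):
    rho, g = np.asarray(rho, float), np.asarray(g, float)
    abs_rel = np.mean(np.abs(rho - g) / g)
    sq_rel = np.mean((rho - g) ** 2 / g)
    rmse = np.sqrt(np.mean((rho - g) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(rho) - np.log(g)) ** 2))
    ratio = np.maximum(rho / g, g / rho)                 # delta per object
    accuracies = [np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)]
    return abs_rel, sq_rel, rmse, rmse_log, accuracies

print(depth_metrics([9.0, 22.0, 41.0], [10.0, 20.0, 40.0]))
```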

V. RESULTS
A. COMPARISON WITH THE STATE-OF-THE-ART
In this section, we compare our framework with state-of-the-art approaches for monocular depth estimation on the KITTI dataset. For a fair comparison, we adopt the code officially published on the authors' GitHub repositories and evaluate the metrics using the pre-trained models provided by the authors. Figure 5 shows the statistical results of our model compared with the others. The predicted depth values of our model converge well onto the diagonal from 1 m to 80 m, while other, traditional pixel-by-pixel methods tend to diverge noticeably in long-distance prediction, which indicates that our algorithm outperforms the existing methods especially when estimating the depth of objects far from the camera. The mAP measures the accuracy of our method in the feature extraction stage, and Table 1 shows quantitative results on all the metrics. In terms of Abs Rel, our method surpasses [41] by 42%, [6] by nearly 53%, [8] by 63%, and [42] by 58%, which shows that the depth estimation ability of our model in supervised learning tasks on the KITTI dataset is clearly better than that of the other methods.

B. GENERALIZATION TO OTHER DATASETS
To illustrate that our model generalizes efficiently to other datasets, we compare our approach with several methods on the CityScapes and ApolloScape datasets. Traditional methods generally need fine-tuning when generalizing to other datasets, but this is not practical in real scenes because it is difficult to capture large amounts of ground-truth data. In contrast, the memory module provides powerful depth cues, and the LSTM in the decoding phase acts as a valid reasoning element that strengthens the network's ability to exploit what has been internally stored; as Table 2 shows, our approach outperforms its competitors in the generalization task without any fine-tuning. Figure 6 shows qualitative results on both datasets, and a quantitative comparison is displayed in Table 2. In terms of Abs Rel, our result exceeds the work of Ibrahhem [8] by 37% and 47%, of Godard [6] by 44% and 38%, of Fabio [41] by 47% and 23%, and of Godard et al. [42] by 44% and 27% on CityScapes and ApolloScape, respectively. We also note that increasing the storage memory by even one slot can markedly improve model performance. Although the mAP value is sufficient to support reliable object detection, some scenes may still contain missed or wrong detections; this characteristic is captured by Average Recall (AR). In our experiment, the AR value is 0.62, which approaches the best case for CNN detectors. Besides, objects are detected continuously during self-driving, so there is always a moment at which a previously missed object can be detected. If missed detections occur frequently, the IOU threshold (0.5 in our setting) can be lowered to obtain more detection bounding boxes.

C. ABLATION STUDY
To verify the function of the LSTM and the memory in the decoder, we replace the depth decoder module with three other structures: (1) a non-memory model that ablates both the LSTM and the memory, where a linear layer is added behind the object encoder; (2) a non-parameter model that ablates the LSTM and predicts the depth by averaging the depth labels obtained from the k nearest memory vectors; and (3) a basic-parameter model that ablates the LSTM, where we concatenate the k nearest memory vectors and feed the concatenated vector into a fully connected layer that predicts the depth. The three decoder structures are shown in Figure 7. The parameters of the encoder are fixed when training the decoder, as the encoder is trained before the decoder.

1) LSTM ANALYSIS
We consider the choice of LSTM important because a recurrent decoder can mix information across multiple time steps and select different weights for different memories to recover the depth information of the query vector. Figure 8 shows the comparison results on KITTI. We can see from the figure that: (1) the basic-parameter model outperforms the non-parameter model, which means that learned parameters can more effectively extract the monocular depth cues from the memory and use them for prediction; (2) the basic-parameter model performs worse than our full model, which means the LSTM contributes to accuracy by improving the ability to match the current data with the historical information. In other words, our model can better exploit relative-size and familiar-size cues and can use the LSTM to recover depth information more accurately. Ramalho et al. [35] compared a relational self-attention feed-forward decoder, a relational working-memory decoder, and an LSTM decoder for the classification case and found that they perform equally well.

2) MEMORY ANALYSIS
One of the key reasons why humans can accurately predict depth is that we have formed a rich understanding of the world through past visual experience and have stored a large number of past experiences. Human memory selection is not random: we are more likely to remember failure cases. Similarly, our model pays special attention to imprecisely predicted cases during training. We evaluate the contribution of the memory by ablating its capacity: in the training phase, we vary the memory capacity from 0 to 100 by adjusting the memory control threshold $\zeta_a$. Figure 9 shows that accuracy is only slightly affected by decreasing the memory size, which indicates that our model is robust. When the memory size drops below the number of LSTM time steps (set to 10 in this paper), accuracy decreases sharply because the model no longer has enough prior information to use as references. Note that when the memory size decreases to zero, the model degenerates into the non-memory model (shown in Figure 9), where the prediction relies only on the linear layer.

VI. CONCLUSION
In this paper, we proposed MonoMeMa, a novel framework for monocular depth estimation for specific objects. It combines (1) an object encoder for extracting object features and (2) a recurrent neural network decoder that predicts the depth based on the memory mechanism. To use the memory efficiently, we also design a memory control strategy that writes into the memory only the input data that brings additional information. Notably, our model not only outperforms existing approaches in supervised tasks on the KITTI dataset but also shows state-of-the-art results in generalization tasks on the CityScapes and ApolloScape datasets. Through exhaustive experiments, we show that the LSTM-based decoder network is flexible for depth prediction and that the memory leads to a more accurate network. In addition, the task we consider is novel in that it transforms the pixel-by-pixel depth prediction problem into depth estimation for target objects, and we propose an original and innovative strategy to combine object knowledge and depth estimation with the aim of taking full advantage of their strong connection in the real world.
In future work, we will consider migrating our model to real-time auxiliary driving tasks. In the real world, road scenes vary rapidly; hence, we expect to decrease the number of parameters in our model for higher processing speed. One possible scheme is to infuse knowledge from meta-learning and use lightweight object detection networks.