Introduction
Learning 3D information about the environment from images is a challenging task in computer vision, with applications in robot navigation and autonomous driving. Humans can naturally estimate the complete geometric shape of an object from images observed from only a single viewpoint and build a 3D model of the environment, whereas robots remain comparatively weak at this task [1]. Accurately perceiving the surrounding 3D scene is crucial for autonomous vehicles because it directly affects downstream tasks such as path planning and map reconstruction. Unfortunately, due to the lack of accurate depth information and the occlusion of the line of sight [2], it is difficult to obtain complete and reliable 3D information about the real world. To address this issue, the 3D semantic scene completion task [3] was developed. This task jointly infers the geometric occupancy and semantic information of the surrounding environment from a limited observational viewpoint, representing the physical world with voxel grids carrying semantic labels.
Most existing semantic scene completion methods use LiDAR [4], [5], [6], [7], [8] as the main sensor because it provides accurate depth information. However, LiDAR sensors are usually expensive and the resulting point clouds are sparse. In contrast, camera sensors are cheaper and provide rich semantic information. Therefore, researchers have proposed several solutions that complete the 3D scene from images alone [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. MonoScene [9] was the first method to address the task using only RGB images. It employs the classic 2D UNet to extract image features. However, when faced with complex traffic scenes, this traditional network struggles to extract accurate features.
In this paper, to improve the network's ability to extract visual features and thereby enhance semantic scene completion, we introduce a monocular semantic scene completion method based on the Contextual Transformer (CoT) [20] and the Recurrent Residual Convolutional Neural Network (RRCNN) [21]. The encoder of the image feature extraction network uses the EfficientNetB7 backbone. A Contextual Transformer module is placed after the encoder to gather contextual information, and the decoder is combined with a recurrent residual convolution module to accumulate image features. Experimental results on NYUv2 [22] and Semantic KITTI [23] show that the proposed method improves the accuracy of semantic scene completion.
The main contributions of this paper can be summarized as follows. First, the Contextual Transformer module is introduced between the encoder and the decoder, using contextual information to guide the learning of dynamic attention matrices and thus improving the visual representation. Second, we add the recurrent residual convolution module after the up-sampling stages of the image decoder to help accumulate features over different time steps. Experiments on two datasets show that our method achieves better semantic scene completion performance.
The rest of this paper is organized as follows. Section II discusses existing methods for monocular semantic scene completion. Section III introduces the overall framework, the Contextual Transformer and the recurrent residual convolution module in detail. Section IV presents quantitative and qualitative experimental results and ablation studies on two datasets. Section V concludes this paper.
Related Work
The challenge of monocular semantic scene completion has attracted extensive research in recent years. According to whether explicit depth information is required, existing work can be divided into two categories.
A. Depth-Based Methods
Early methods relied on RGB-D images, which required explicit depth information and were limited to small indoor scenes. Song et al. [24] proposed a solution that uses an enhanced truncated signed distance function to encode the depth map into 3D voxel features. A 3D convolutional neural network was then used to extract geometric and contextual information and generate voxel occupancy probabilities and semantic categories. This method involves a series of 3D convolutions, which results in a heavy computational load. To reduce the number of parameters, Guo and Tong [25] replaced some of the 3D convolutions with ordinary 2D convolutions. The input depth image was first processed by 2D convolutions, and the resulting feature maps were then projected into 3D and fed to a 3D convolutional neural network, which effectively reduced the computational cost of the network.
B. Image-Based Methods
When the scene is vast and far away, it is challenging to collect accurate depth information. Researchers therefore suggested completing the semantic scene from monocular RGB images, mainly targeting large-scale outdoor traffic scenes. Cao and de Charette [9] provided the pioneering method for completing the 3D semantic scene from a single RGB image without relying on any depth information. This approach employed a 2D UNet to extract image features and generated 3D features through the proposed dense line-of-sight projection. 3D geometric features and contextual information were then extracted by a 3D UNet. The authors also proposed a context relation prior layer, inserted between the encoder and decoder of the 3D UNet, to learn the semantic relationships between voxels. Finally, a 3D segmentation head was applied to generate the semantic scene completion results. Although this method achieves competitive results, it can struggle to distinguish objects with similar semantic information. Building on this work, this paper proposes an image feature extraction network that helps the model discriminate similar objects more effectively.
Yao et al. [27] introduced a novel semantic completion network based on normalized device coordinates (NDC), which gradually restored the depth dimension instead of directly expanding to the 3D world space. In this way, most of the computation was carried out in NDC space. The authors also developed a depth-adaptive dual decoder for up-sampling and merging the 2D and 3D feature maps, which greatly improved overall performance. Li et al. [2] argued that using dense feature projection directly could easily lead to the assignment of inaccurate semantic labels to occluded areas. Therefore, they proposed a two-stage framework. In the first stage, a depth estimation network was used to predict the depth value of each pixel in the image, which was then back-projected into 3D point cloud space to obtain a sparse query set of occupied voxels in the visible area. In the second stage, deformable cross-attention and self-attention mechanisms were utilized to generate a dense 3D voxel grid. This strategy was effective for short-range areas and small targets. To improve the efficiency of 3D convolution, Zhang et al. [28] proposed a dual-path transformer network to extract 3D features. They extracted 2D features from a single image and then converted them into 3D using the LSS paradigm [29]. In the proposed dual-path transformer encoder, 3D features were extracted along local and global paths. The two sets of features were then fused and sent to the transformer. Finally, a mask classification model was utilized for 3D semantic occupancy prediction. The proposed method decomposes 3D feature processing into two paths, which has clear efficiency advantages over traditional 3D convolution.
Methods
A. Framework
Our method is generally based on [9]. The overall framework is shown in Figure 1. Given a single RGB image, image features are extracted by the EfficientNetB7 backbone network. The CoT module processes the lowest-level features and sends them to the decoder. The extracted multi-scale features are then fed into the FLoSP (Features Line of Sight Projection) module, which uses back-projection to transform the 2D features into 3D features. Specifically, the center of each 3D voxel is projected onto the 2D feature maps, the corresponding features are obtained by sampling the multi-scale feature maps, and the features sampled at all scales are aggregated into a 3D feature volume that is processed by the subsequent 3D network.
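To make the projection step concrete, the following is a minimal PyTorch sketch of a FLoSP-style back-projection for a single scale; the function name, argument names, and tensor shapes are illustrative assumptions rather than the implementation of [9].

```python
import torch
import torch.nn.functional as F

def flosp_backproject(feat_2d, voxel_centers, K, image_size):
    """Sample 2D features at the projected locations of 3D voxel centers.

    feat_2d:       (C, H, W) 2D feature map from one decoder scale.
    voxel_centers: (N, 3) voxel centers expressed in the camera frame.
    K:             (3, 3) camera intrinsic matrix.
    image_size:    (img_h, img_w) resolution the intrinsics refer to.
    Returns an (N, C) tensor, zeroed for voxels projecting outside the image.
    """
    img_h, img_w = image_size

    # Project voxel centers to pixel coordinates: p = K X, then divide by depth.
    proj = (K @ voxel_centers.T).T                    # (N, 3)
    depth = proj[:, 2]
    u = proj[:, 0] / depth.clamp(min=1e-6)
    v = proj[:, 1] / depth.clamp(min=1e-6)

    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid_u = 2.0 * u / (img_w - 1) - 1.0
    grid_v = 2.0 * v / (img_h - 1) - 1.0
    grid = torch.stack([grid_u, grid_v], dim=-1).view(1, 1, -1, 2)  # (1, 1, N, 2)

    # Bilinearly sample the 2D features at the projected locations.
    sampled = F.grid_sample(feat_2d.unsqueeze(0), grid,
                            mode='bilinear', align_corners=True)    # (1, C, 1, N)
    sampled = sampled.squeeze(0).squeeze(1).T                       # (N, C)

    # Discard voxels that fall behind the camera or outside the image.
    valid = (depth > 0) & (u >= 0) & (u <= img_w - 1) & (v >= 0) & (v <= img_h - 1)
    return sampled * valid.unsqueeze(1).float()
```

Features sampled at the different decoder scales can then be summed to obtain the 3D feature volume described above.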
B. Context Transformer
The image information of outdoor traffic scenes is usually complex. Due to the influence of illumination, distance and other factors, the pixel features of objects with the same semantic label may differ. To this end, this paper introduces the CoT module after the lowest-level features of the encoder, as illustrated in Figure 2, using contextual information to improve feature extraction. Figure 3 shows the structure of the CoT module.
The input keys are first contextually encoded by a $3\times 3$ convolution over their spatial neighborhood, producing the static context $K_{1}$, which captures the local relationships among neighboring keys.
Figure 4 depicts the detailed computation of the CoT module. Given the 2D feature map, the static context $K_{1}$ is combined with the query $Q$ and transformed by two consecutive $1\times 1$ convolutions $W_{\theta }$ and $W_{\delta }$ to obtain the attention matrix $A$: \begin{equation*} A=(K_{1}+Q)W_{\theta }W_{\delta } \tag {1}\end{equation*}
The dynamic context $K_{2}$ is then obtained by aggregating the values $V$ with the attention matrix $A$: \begin{equation*} K_{2}=V\ast A \tag {2}\end{equation*} The output of the CoT module fuses the static context $K_{1}$ with the dynamic context $K_{2}$.
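To make the module concrete, the following is a minimal PyTorch sketch of a CoT-style block following Eqs. (1) and (2); the kernel sizes, normalization layers, and the simple additive fusion of $K_{1}$ and $K_{2}$ are simplifying assumptions rather than the implementation of [20], which additionally uses grouped convolutions and a local matrix multiplication for the dynamic context.

```python
import torch
import torch.nn as nn

class SimplifiedCoT(nn.Module):
    """A simplified Contextual Transformer block following Eqs. (1)-(2)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # 3x3 convolution over the keys -> static context K1.
        self.key_conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # 1x1 convolution producing the values V.
        self.value_conv = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Two consecutive 1x1 convolutions W_theta and W_delta (Eq. 1).
        self.attention = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        q = x                       # queries are the input features
        k1 = self.key_conv(x)       # static context from neighboring keys
        v = self.value_conv(x)      # values
        a = self.attention(k1 + q)  # Eq. (1): A = (K1 + Q) W_theta W_delta
        k2 = v * a                  # Eq. (2): dynamic context K2 = V * A
        return k1 + k2              # fuse static and dynamic contexts
```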
C. Recurrent Residual Convolution
The typical decoder structure directly applies convolutions to generate multi-scale feature maps after up-sampling the image features. Unfortunately, this design usually does not perform well in distinguishing objects with similar semantic information, such as tables and furniture, cars and trucks, or bicycles and motorcycles. To address this issue, this paper introduces a recurrent residual convolution module after the up-sampling steps of the decoder, as shown in Figure 2. Figure 5 shows the structure of the RRCNN module, which consists of stacked recurrent convolutional layers, whose responses are accumulated over several time steps, and a residual connection that adds the block input to the output.
Let $x_{l}$ denote the input of the $l$-th layer of the RRCNN block, and let $x_{l}^{f(i,j)}(t)$ and $x_{l}^{r(i,j)}(t-1)$ denote the inputs of the standard feed-forward convolution and of the recurrent convolution at pixel $(i,j)$, respectively. The output of the $k$-th feature map at time step $t$ is \begin{align*} O_{ijk}^{l}\left ({{ t }}\right )& = \left ({{ w_{k}^{f} }}\right )^{T}\ast x_{l}^{f\left ({{ i,j }}\right )}\left ({{ t }}\right ) \\ & \quad +\left ({{ w_{k}^{r} }}\right )^{T}\ast x_{l}^{r\left ({{ i,j }}\right )}\left ({{ t-1 }}\right ) +b_{k} \tag {3}\end{align*} where $w_{k}^{f}$ and $w_{k}^{r}$ are the weights of the feed-forward and recurrent convolution kernels and $b_{k}$ is the bias.
This output is passed through the ReLU activation $f(\cdot )$: \begin{equation*} F\left ({{ x_{l},w_{l} }}\right )=f\left ({{ O_{ijk}^{l}\left ({{ t }}\right ) }}\right )=\max \left ({{ 0,O_{ijk}^{l}\left ({{ t }}\right ) }}\right ) \tag {4}\end{equation*}
Finally, the residual connection adds the block input to the output of the recurrent convolution unit: \begin{equation*} x_{l+1}=x_{l}+F(x_{l},w_{l}) \tag {5}\end{equation*} where $x_{l+1}$ is the input of the next layer.
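The following is a minimal PyTorch sketch of a recurrent residual block implementing Eqs. (3)-(5); the number of stacked recurrent layers and the number of time steps are assumptions rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

class RecurrentConv(nn.Module):
    """One recurrent convolutional layer: the convolution response is
    accumulated with the (fixed) feed-forward input over t time steps (Eq. 3)."""

    def __init__(self, channels: int, t: int = 2):
        super().__init__()
        self.t = t
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=True),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),     # Eq. (4): ReLU activation
        )

    def forward(self, x):
        state = self.conv(x)           # t = 0: purely feed-forward response
        for _ in range(self.t):
            state = self.conv(x + state)  # recurrent input from step t-1
        return state


class RRCNNBlock(nn.Module):
    """Recurrent residual block: stacked recurrent layers plus a residual path."""

    def __init__(self, in_channels: int, out_channels: int, t: int = 2):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, 1)  # match channel widths
        self.body = nn.Sequential(
            RecurrentConv(out_channels, t),
            RecurrentConv(out_channels, t),
        )

    def forward(self, x):
        x = self.proj(x)
        return x + self.body(x)        # Eq. (5): residual connection
```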
D. Loss Function
This paper uses the multi-class cross-entropy loss $\mathcal {L}_{ce}$ to supervise the voxel-wise semantic predictions, together with the auxiliary losses described below.
The relation prior loss supervises the 3D CRP module. The features extracted by the 3D network are sent to ASPP convolutions to enlarge the receptive field and then split into $M$ relation matrices. The relation prior loss is a weighted binary cross-entropy between the predicted relation matrices $\hat {A}^{m}$ and the ground truth $A^{m}$: \begin{align*} {\mathcal {L}}_{rel}&=\sum _{m\in M,i}\bigg [\left (1-A_{i}^{m}\right )\log \left (1-\hat {A}_{i}^{m}\right ) \\ & \quad +\frac {\sum _{i}\left (1-A_{i}^{m}\right )}{\sum _{i}A_{i}^{m}}\,A_{i}^{m}\log \hat {A}_{i}^{m}\bigg ] \tag {6}\end{align*}
The scene-class affinity loss extends the 2D binary affinity loss [26] to multiple classes, optimizing both the semantic categories and the geometric occupancy through class-wise precision, recall and specificity. Assuming that $\hat {p}_{i,c}$ is the predicted probability that voxel $i$ belongs to class $c$ and $p_{i}$ is its ground-truth label, these three terms are defined as \begin{align*} P_{c}(\hat {p},p)&=\log \frac {\sum \nolimits _{i}\hat {p}_{i,c}\,[\![ p_{i}=c ]\!]}{\sum \nolimits _{i}\hat {p}_{i,c}} \tag {7}\\ R_{c}(\hat {p},p)&=\log \frac {\sum \nolimits _{i}\hat {p}_{i,c}\,[\![ p_{i}=c ]\!]}{\sum \nolimits _{i}[\![ p_{i}=c ]\!]} \tag {8}\\ S_{c}(\hat {p},p)&=\log \frac {\sum \nolimits _{i}\left (1-\hat {p}_{i,c}\right )\left (1-[\![ p_{i}=c ]\!]\right )}{\sum \nolimits _{i}\left (1-[\![ p_{i}=c ]\!]\right )} \tag {9}\end{align*} where $[\![ \cdot ]\!]$ denotes the Iverson bracket.
The scene-class affinity loss averages these terms over all $C$ classes, \begin{equation*} {\mathcal {L}}_{scal}\left (\hat {p},p\right )=-\frac {1}{C}\sum \limits _{c=1}^{C}\left (P_{c}\left (\hat {p},p\right )+R_{c}\left (\hat {p},p\right )+S_{c}\left (\hat {p},p\right )\right ) \tag {10}\end{equation*} and is applied to both the semantic predictions ($\mathcal {L}_{scal}^{sem}$) and the geometric occupancy predictions ($\mathcal {L}_{scal}^{geo}$).
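As an illustration, the following is a minimal PyTorch sketch of the scene-class affinity loss of Eqs. (7)-(10); the handling of unlabeled voxels and of classes absent from the scene is simplified compared with a practical implementation.

```python
import torch

def scene_class_affinity_loss(pred, target, num_classes, eps=1e-6):
    """pred:   (N, C) per-voxel class probabilities (after softmax).
    target: (N,) ground-truth class indices."""
    loss = 0.0
    for c in range(num_classes):
        p_c = pred[:, c]                     # predicted probability of class c
        t_c = (target == c).float()          # Iverson bracket [[p_i = c]]
        if t_c.sum() == 0:
            continue                         # skip classes absent from the scene
        precision = torch.log((p_c * t_c).sum() / (p_c.sum() + eps) + eps)
        recall = torch.log((p_c * t_c).sum() / (t_c.sum() + eps) + eps)
        specificity = torch.log(((1 - p_c) * (1 - t_c)).sum()
                                / ((1 - t_c).sum() + eps) + eps)
        loss = loss - (precision + recall + specificity)   # Eq. (10)
    return loss / num_classes
```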
From a single observation perspective, the influence of occlusion on the prediction results cannot be eliminated, since visible objects are often expected to extend into the occluded area. To mitigate this effect, the frustum proportion loss $\mathcal {L}_{fp}$ divides the camera frustum into $\ell ^{2}$ local frustums and aligns the predicted class distribution $\hat {P}_{k}$ of each local frustum with the ground-truth distribution $P_{k}$ through the Kullback-Leibler divergence: \begin{equation*} {\mathcal {L}}_{fp}=\sum \limits _{k=1}^{\ell ^{2}}D_{KL}\left (P_{k}\,\|\,\hat {P}_{k}\right )=\sum \limits _{k=1}^{\ell ^{2}}\sum \limits _{c\in C_{k}}P_{k}(c)\log \frac {P_{k}(c)}{\hat {P}_{k}(c)} \tag {11}\end{equation*} where $C_{k}$ is the set of classes present in the $k$-th frustum.
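A corresponding sketch of the frustum proportion loss of Eq. (11) is given below; the assignment of voxels to local frustums (the frustum_index argument) is left abstract and is an assumption of this sketch.

```python
import torch

def frustum_proportion_loss(pred, target, frustum_index, num_frustums,
                            num_classes, eps=1e-6):
    """pred: (N, C) class probabilities; target: (N,) labels;
    frustum_index: (N,) index of the local frustum each voxel belongs to."""
    loss = 0.0
    for k in range(num_frustums):
        mask = frustum_index == k
        if mask.sum() == 0:
            continue
        # Ground-truth class proportions P_k within the local frustum.
        p_k = torch.bincount(target[mask], minlength=num_classes).float()
        p_k = p_k / (p_k.sum() + eps)
        # Predicted proportions: mean predicted probability per class.
        p_hat_k = pred[mask].mean(dim=0) + eps
        present = p_k > 0                 # sum only over classes in the frustum
        loss = loss + (p_k[present]
                       * torch.log(p_k[present] / p_hat_k[present])).sum()
    return loss
```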
The total loss is the sum of all the above terms: \begin{equation*} { {\mathcal {L}}_{total}={\mathcal {L}}_{ce}+{\mathcal {L}}_{rel}+{\mathcal {L}}_{scal}^{sem}+{\mathcal {L}}_{scal}^{geo}+{\mathcal {L}}_{fp} } \tag {12}\end{equation*}
Experiments
A. Training Setup
All experiments in this paper use the following environment: Ubuntu 18.04 operating system, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00 GHz, NVIDIA RTX A5000 (24 GB) GPU, PyTorch 1.8 framework and Python 3.7. The model is trained for 30 epochs with the AdamW optimizer, a batch size of 4, and a learning rate of 0.0004.
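A minimal sketch of this training configuration is shown below; the model, train_loader, and compute_loss names are placeholders rather than the actual training code of this paper.

```python
import torch

def train(model, train_loader, epochs=30, lr=4e-4):
    """Train with AdamW for 30 epochs at learning rate 0.0004; `model` is
    assumed to expose a compute_loss(batch) helper returning the total loss
    of Eq. (12) -- a placeholder interface, not this paper's actual code."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in train_loader:        # DataLoader built with batch_size=4
            optimizer.zero_grad()
            loss = model.compute_loss(batch)
            loss.backward()
            optimizer.step()
```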
B. Datasets and Metrics
NYUv2 is an indoor dataset with a total of 1449 scenes captured by a Kinect camera, and the image resolution is $640\times 480$.
Semantic KITTI is a large public dataset for road driving scenarios, built on the famous KITTI Odometry benchmark.
In these experiments, we use IoU (Intersection over Union) to assess the completion of each category, and the mean IoU (mIoU) over all categories to evaluate the overall performance of semantic scene completion, which is the most important metric for this task.
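For reference, a minimal sketch of the per-class IoU and mIoU computation over voxel labels is given below; in semantic scene completion the geometric IoU is additionally computed by treating all occupied classes as a single class, which this sketch omits.

```python
import numpy as np

def evaluate_voxels(pred, target, num_classes, ignore_index=255):
    """Compute per-class IoU and mIoU from voxel-wise labels.
    pred, target: integer arrays of the same shape; ignore_index marks
    unlabeled/invalid voxels that are excluded from evaluation."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = {}
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:
            ious[c] = inter / union
    miou = float(np.mean(list(ious.values()))) if ious else 0.0
    return ious, miou
```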
C. Performance
The comparison results between our method and other available algorithms on the NYUv2 and Semantic KITTI datasets are shown in Table 1 and Table 2 respectively. Higher IoU and mIoU values indicate better performance. The baseline method is denoted by *, the best values are shown in bold, and the second best are underlined. The experimental results show that, compared with the baseline on both datasets, the proposed strategy improves both IoU and mIoU.
Figure 7 and Figure 8 display the visual results on the NYUv2 and Semantic KITTI datasets. The first column shows the input RGB images, the second and third columns show the results of other methods, the fourth and fifth columns show the results of the baseline and of our method, and the last column shows the ground truth.
The first row in Figure 7 demonstrates that our method correctly recognizes a corner of the bed, even though only part of it is visible in the image, whereas the baseline incorrectly identifies it as a sofa and other objects. The second row shows that our method effectively recognizes the shape of the photo on the wall, while the baseline mixes it with nearby objects. The third row indicates that our method accurately recognizes the table, whose texture is close to that of furniture, while the baseline mistakenly recognizes it as furniture.
The first row in Figure 8 shows that our method correctly recognizes the car that is largely occluded at the intersection, while the baseline method does not detect it. The second row demonstrates that our method effectively distinguishes the trucks and cars parked together in the distance, while the baseline only predicts cars. The third row shows that our method effectively differentiates tree trunks from poles, while the baseline incorrectly conflates them. The fourth row shows that the traffic sign on the pole is detected correctly, whereas the baseline treats the sign and the pole as a single object.
D. Ablation Study
1) Loss Function Ablation:
We carried out ablation experiments on the NYUv2 and Semantic KITTI datasets to verify the effectiveness of each part of the loss function. The model's performance was assessed by removing each loss term in turn while keeping the other conditions unchanged. The results are shown in Table 3 and Table 4.
2) Module Ablation:
In order to verify the effectiveness of the CoT module and the RRCNN module, we carried out ablation experiments. Table 5 and Table 6 show the detailed results on the NYUv2 and Semantic KITTI datasets respectively; the best values are shown in bold and the second best are underlined. Adding only the CoT module increases IoU and mIoU by 1.24 and 0.87 respectively on the indoor dataset, and by 0.16 and 0.26 on the outdoor dataset. Adding only the RRCNN module increases IoU and mIoU by 1.18 and 0.83 respectively on the indoor dataset; on the outdoor dataset it decreases IoU by 0.37 but increases mIoU by 0.33, exceeding the mIoU gain of the CoT module.
We also present the visualization results of the ablation experiments. Figure 9 and Figure 10 show the results of the baseline method and of adding the CoT module and the RRCNN module, respectively, on the same RGB images. As shown in the two figures, after adding the CoT module the completion of the scene is improved: compared with the baseline method there are fewer large blank areas, as confirmed by the first, third and fourth rows in Figure 9 and the first and third rows in Figure 10. As shown in the second and fourth rows of Figure 9 and the first, second and fourth rows of Figure 10, after adding the RRCNN module the accuracy of object detection is improved, and some objects with inconspicuous features can be detected, such as transparent windows, small objects on the desktop, poles and traffic signs.
Conclusion
In this paper, we propose a monocular 3D semantic scene completion method based on the Contextual Transformer and the recurrent residual convolution module to improve the image feature extraction network. To improve the visual representation, the Contextual Transformer module was introduced between the encoder and the decoder, using contextual information to guide the learning of dynamic attention matrices. After the up-sampling stages of the image decoder, the recurrent residual convolution module was added to accumulate features over different time steps and improve the network's ability to distinguish similar objects. The experimental results show that the evaluation metrics on the NYUv2 and Semantic KITTI datasets surpass those of the baseline method and that the semantic scene completion results are superior. Although the proposed method improves semantic scene completion, the prediction accuracy for some small targets remains relatively low. Future work will focus on improving the completion of small targets in indoor and outdoor environments.