
Semantic Scene Completion Through Context Transformer and Recurrent Convolution


Overview of the proposed semantic scene completion method based on a monocular image.

Abstract:

The purpose of monocular semantic scene completion is to predict a detailed 3D scene with semantic information from a single image. To improve the image feature extraction ability of the classical network and achieve better semantic scene completion, we propose a monocular semantic scene completion method based on a context transformer and recurrent residual convolution. A context transformer module is added between the encoder and decoder of the image feature extraction network; it uses context information to guide the learning of a dynamic attention matrix and improves the visual representation ability. We also introduce a recurrent residual convolution module into the decoder to accumulate features at different time steps, which helps to distinguish similar objects. Experimental results show that, on the indoor dataset NYUv2 and the outdoor traffic scene dataset Semantic KITTI, the mIoU metric of the semantic scene completion task is improved by 5% and 8% respectively compared with the baseline method.
Published in: IEEE Access ( Volume: 12)
Page(s): 69700 - 69709
Date of Publication: 15 May 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Learning 3D information about the environment from images is a challenging task in computer vision, with applications in robot navigation and autonomous driving. Humans can naturally estimate the complete geometric shape of an object observed from only a single viewpoint and build a 3D model of the environment; however, robots are still relatively weak in this respect [1]. Accurately perceiving the surrounding 3D scene is crucial for autonomous vehicles because it directly affects downstream tasks such as path planning and map reconstruction. Unfortunately, due to the lack of accurate depth information and occlusions of the line of sight [2], it is difficult to obtain complete and reliable 3D information about the real world. To address this issue, the 3D semantic scene completion task [3] was developed. This task jointly infers the geometric occupancy and semantic labels of the surrounding environment from a limited observation viewpoint, representing the physical world as a voxel grid with semantic labels.

Most existing semantic scene completion methods use LiDAR [4], [5], [6], [7], [8] as the main sensor because it provides accurate depth information. However, LiDAR sensors are usually expensive and the resulting point clouds are sparse. In contrast, camera sensors are cheaper and provide rich semantic information. Therefore, researchers have proposed several solutions that complete the 3D scene from images alone [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19]. MonoScene [9] provided the first method for this task using only RGB images. That work utilizes the classic 2D UNet to extract features from images. However, when facing complex traffic scenes, it is difficult for this traditional network to extract accurate features.

In this paper, in order to improve the network's ability to extract visual features and thus enhance semantic scene completion, we introduce a monocular semantic scene completion method based on the Contextual Transformer (CoT) [20] and the Recurrent Residual Convolutional Neural Network (RRCNN) [21]. The encoder of the image feature extraction network uses the EfficientNetB7 backbone. A context transformer module is placed after the encoder to collect contextual information, and the decoder incorporates a recurrent residual convolution module to accumulate image features. Experimental results on NYUv2 [22] and Semantic KITTI [23] show that the proposed method improves the accuracy of semantic scene completion.

The main contributions of this paper can be summarized as follows. First, a context transformer module is introduced between the encoder and decoder to use contextual information to guide the learning of dynamic attention matrices, which improves the visual representation ability. Second, we add a recurrent residual convolution module after the up-sampling stages of the image decoder to accumulate features over different time steps. Experiments are conducted on two datasets, and the results indicate that our method achieves better semantic scene completion.

The rest of this paper is organized as follows. Section II discusses existing methods for monocular semantic scene completion. Section III introduces the overall framework, the context transformer, and the recurrent residual convolution module in detail. Section IV presents quantitative and qualitative experimental results and ablation studies on two datasets. Section V concludes the paper.

SECTION II.

Related Work

Monocular semantic scene completion has attracted considerable research attention in recent years. According to whether depth information is required, existing work can be divided into two categories.

A. Depth-Based Methods

Early methods relied on RGB-D images, which require explicit depth information and are limited to small indoor scenes. Song et al. [24] proposed a solution that utilizes an enhanced truncated signed distance function to encode the depth map into 3D voxel features; a 3D convolutional neural network then extracts geometric and contextual information to generate voxel occupancy probabilities and semantic categories. This method involves serial 3D convolutions, which leads to a heavy computational load. In order to reduce the number of parameters, Guo and Tong [25] replaced some of the 3D convolutions with ordinary 2D convolutions: the input depth image is first processed by 2D convolutions, and the resulting feature maps are projected into 3D and sent to a 3D convolutional neural network, which effectively reduces the computational load of the network.

B. Image-Based Methods

When the scene is vast and far away, it is challenging to collect accurate depth information. Therefore, researchers have suggested completing the semantic scene from monocular RGB images. These studies mainly target large-scale outdoor traffic scenes. Cao and de Charette [9] provided the pioneering method for completing the 3D semantic scene from a single RGB image without relying on any depth information. This approach employs a 2D UNet to extract image features and generates 3D features through the proposed dense line-of-sight projection; 3D geometric features and contextual information are then extracted by a 3D UNet. The authors also proposed a context prior layer, inserted between the encoder and decoder of the 3D UNet, to learn the semantic relationships between voxels. Finally, a 3D segmentation head is applied to generate the semantic scene completion results. Although this method achieves competitive results, it can struggle to distinguish objects with similar semantic information. Building on this work, this paper proposes an image feature extraction network that helps the model discriminate similar objects more effectively.

Yao et al. [27] introduced a novel semantic completion network based on normalized device coordinates (NDC), which gradually restores the depth dimension instead of directly expanding into 3D world space, so that most of the computation is carried out in NDC space. The authors also developed a depth-adaptive dual decoder for up-sampling and merging 2D and 3D feature maps, which greatly improves overall performance. Li et al. [2] argued that directly using dense feature projections can easily lead to the assignment of inaccurate semantic labels to occluded areas. They therefore proposed a two-stage framework. In the first stage, a depth estimation network predicts the depth value of each pixel in the image, which is back-projected into 3D point cloud space to obtain a sparse query set of occupied voxels in the visible area. In the second stage, deformable cross-attention and self-attention mechanisms are utilized to generate a dense 3D voxel grid. This strategy is effective for short-range areas and small targets. To improve the efficiency of 3D convolution, Zhang et al. [28] proposed a dual-path transformer network to extract 3D features. They extract 2D features from a single image and convert them into 3D using the LSS paradigm [29]. In the proposed dual-path transformer encoder, 3D features are extracted along local and global paths, and the two sets of features are fused and sent to the transformer. Finally, a mask classification model is utilized for 3D semantic occupancy prediction. This method decomposes 3D feature processing into two paths; compared with traditional 3D convolution, it has a clear advantage in efficiency.

SECTION III.

Methods

A. Framework

This paper is generally based on [9]. The overall framework is shown in Figure 1. Given a single RGB image, image features are extracted by the EfficientNetB7 backbone network. The CoT module processes the lowest-level features and sends them to the decoder. The extracted multi-scale features are then fed into the FLoSP (Features Line of Sight Projection) module, which uses back-projection to transform the 2D features into 3D features. Specifically, the center of each 3D voxel is projected onto the 2D feature maps by the perspective projection \rho (\cdot ), and the corresponding features are obtained by the sampling operation \phi (\cdot ) applied to the multi-scale feature maps. These sampled features are accumulated to obtain the output 3D feature volume, which is used as the input of the 3D network. The 3D Context Relation Prior (3D CRP) module is inserted between the 3D encoder and decoder to learn the relationships between different voxels. Concretely, ASPP convolutions enlarge the receptive field of the input 3D features; 1\times 1 convolutions with sigmoid activations then generate the relation matrices of the 3D features, trained with a cross-entropy loss. The relation matrices are multiplied by the reconstructed voxel features to collect global context. Finally, a 3D segmentation head produces the semantic output. The overall architecture of the image feature extraction network proposed in this paper is shown in Figure 2; the given resolution corresponds to images in the Semantic KITTI dataset.
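As a rough illustration of the FLoSP back-projection described above, the following PyTorch sketch projects voxel centers onto the image with assumed camera intrinsics and samples the multi-scale 2D features with grid_sample. The function and tensor names are our own, and the feature maps are assumed to share a common channel dimension; this is a sketch of the idea, not the released implementation.

```python
import torch
import torch.nn.functional as F

def flosp(feats_2d, voxel_centers, K, img_size):
    """Accumulate multi-scale 2D features into 3D voxel features (sketch).

    feats_2d:      list of (1, C, H_s, W_s) feature maps at several scales
    voxel_centers: (N, 3) voxel centers in the camera coordinate frame
    K:             (3, 3) camera intrinsic matrix (assumed known)
    img_size:      (H, W) of the full-resolution image
    """
    H, W = img_size
    # Perspective projection rho(.): 3D centers -> pixel coordinates.
    uvw = voxel_centers @ K.T                        # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)    # (N, 2) pixel coords
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack(
        [2.0 * uv[:, 0] / (W - 1) - 1.0,
         2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1)    # (N, 2)
    grid = grid.view(1, 1, -1, 2)                    # (1, 1, N, 2)

    feat_3d = 0
    for fmap in feats_2d:
        # Sampling phi(.): bilinear lookup of 2D features at projected points.
        sampled = F.grid_sample(fmap, grid, align_corners=True)  # (1, C, 1, N)
        feat_3d = feat_3d + sampled.squeeze(2)       # accumulate over scales
    return feat_3d                                   # (1, C, N) voxel features
```

Voxels projecting outside the image simply receive zero features from grid_sample's default padding, which is one common way to handle out-of-view voxels.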

FIGURE 1. The overall framework.

FIGURE 2. Structure of the proposed image feature extraction network in this paper.

B. Context Transformer

The image content of outdoor traffic scenes is usually complex. Due to illumination, distance, and other factors, the pixel features of objects with the same semantic label may differ. To this end, this paper introduces the CoT module after the encoder's lowest-level features, as illustrated in Figure 2, which uses contextual information to improve feature extraction. Figure 3 shows the structure of the CoT module.

FIGURE 3. Structure of the context transformer.

The input key is contextually encoded by a 3\times 3 convolution to generate the static context representation. The static features and the input query are then concatenated and passed through two successive 1\times 1 convolutions to learn a dynamic attention matrix, which is multiplied by the input value to obtain the dynamic context representation. Ultimately, the static and dynamic context representations are combined to produce the output. This strategy makes full use of the context information carried by the input keys to guide the learning of the dynamic attention matrix, improves the visual representation ability of the network, and allows accurate visual features to be extracted from complex traffic scenes.

Figure 4 depicts the detailed computation of the CoT module. Given the 2D feature map X\in R^{H\times W\times C} , the key, query, and value are denoted by K, Q, and V respectively. The CoT module first applies a spatial convolution with a k\times k kernel over all adjacent keys in the k\times k grid surrounding each position of the input X to obtain the static contextual features K_{1}\in R^{H\times W\times C} . Then K_{1} and Q are concatenated and convolved with two successive 1\times 1 convolutions W_{\theta } and W_{\delta } to obtain the attention matrix A:\begin{equation*} A=\left [{{ K_{1},Q }}\right ]W_{\theta }W_{\delta } \tag {1}\end{equation*}

where W_{\theta } uses the ReLU activation function and W_{\delta } does not. The dynamic context representation K_{2} is obtained by multiplying A and V:\begin{equation*} K_{2}=V\ast A \tag {2}\end{equation*}
Finally, the static context representation K_{1} and the dynamic context representation K_{2} are fused through an attention mechanism to produce the final output Y.
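To make this computation concrete, the following PyTorch sketch gives a minimal re-implementation of the CoT block under our own assumptions (kernel size k=3, batch-normalized convolutions, and a simple additive fusion standing in for the attention-based fusion of Figure 3); it is not the authors' released code.

```python
import torch
import torch.nn as nn

class CoTBlock(nn.Module):
    """Simplified sketch of a Contextual Transformer block."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # Static context K1: k x k convolution over neighbouring keys.
        self.key_embed = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True))
        # Value embedding V.
        self.value_embed = nn.Sequential(
            nn.Conv2d(dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim))
        # Two successive 1x1 convolutions (W_theta with ReLU, W_delta without)
        # mapping the concatenation [K1, Q] to the attention matrix A, Eq. (1).
        self.attn = nn.Sequential(
            nn.Conv2d(2 * dim, dim, 1, bias=False),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),          # W_theta
            nn.Conv2d(dim, dim, 1))         # W_delta (no activation)

    def forward(self, x):
        k1 = self.key_embed(x)              # static context representation
        v = self.value_embed(x)             # values
        a = self.attn(torch.cat([k1, x], dim=1))  # x plays the role of Q
        k2 = a * v                          # dynamic context, Eq. (2)
        # Additive fusion used here as a simplification of the
        # attention-based fusion described in the paper.
        return k1 + k2
```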

FIGURE 4. Detailed calculation procedure for the CoT module.

C. Recurrent Residual Convolution

The typical decoder structure directly applies convolutions to the up-sampled image features to generate multi-scale feature maps. Unfortunately, this design usually does not perform well in distinguishing objects with similar semantic information, such as tables and furniture, cars and trucks, or bicycles and motorcycles. To address this issue, this paper adds a recurrent residual convolution module after the up-sampling steps of the decoder, as shown in Figure 2. Figure 5 shows the structure of the RRCNN module, which consists of a 1\times 1 convolution and two Recurrent Convolutional Layers (RCL) connected by a residual path. The accumulation of RCL features over different time steps yields a better feature representation and helps to improve the network's ability to distinguish similar objects. Figure 6 gives the time-step-unfolded structure of the RCL.

FIGURE 5. Structure of the recurrent residual convolution.

FIGURE 6. Recurrent convolutional layer.

Let x_{l} be the input of the RRCNN module at layer l, and let O_{ijk}^{l}(t) denote the output of the network at time t:\begin{align*} O_{ijk}^{l}\left ({{ t }}\right )& = \left ({{ w_{k}^{f} }}\right )^{T}\ast x_{l}^{f\left ({{ i,j }}\right )}\left ({{ t }}\right ) \\ & \quad +\left ({{ w_{k}^{r} }}\right )^{T}\ast x_{l}^{r\left ({{ i,j }}\right )}\left ({{ t-1 }}\right ) +b_{k} \tag {3}\end{align*}

where x_{l}^{f(i,j)}(t) and x_{l}^{r(i,j)}(t-1) are the inputs of the standard convolution and the RCL at layer l respectively, (i,j) denotes the pixel coordinates, w_{k}^{f} and w_{k}^{r} are the weights of the standard convolution and the RCL for the k-th feature map, and b_{k} is the bias. The output F(x_{l},w_{l}) of the RCL is obtained by passing O_{ijk}^{l}(t) through the ReLU activation function f:\begin{equation*} F\left ({{ x_{l},w_{l} }}\right )=f\left ({{ O_{ijk}^{l}\left ({{ t }}\right ) }}\right )=\max \left ({{ 0,O_{ijk}^{l}\left ({{ t }}\right ) }}\right ) \tag {4}\end{equation*}
where w_{l} represents the weights of layer l. The output x_{l+1} of the RRCNN module is then:\begin{equation*} x_{l+1}=x_{l}+F(x_{l},w_{l}) \tag {5}\end{equation*}
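The recursion of Eqs. (3)–(5) can be sketched in PyTorch as below; the number of time steps (two) and the layer widths are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class RCL(nn.Module):
    """Recurrent convolutional layer: accumulate features over time steps."""
    def __init__(self, channels, steps=2):
        super().__init__()
        self.steps = steps
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True))

    def forward(self, x):
        out = self.conv(x)              # t = 0: feed-forward response
        for _ in range(self.steps):
            out = self.conv(x + out)    # Eq. (3): feed-forward + recurrent input
        return out

class RRCNNBlock(nn.Module):
    """1x1 convolution followed by two RCLs with a residual connection, Eq. (5)."""
    def __init__(self, in_channels, out_channels, steps=2):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, out_channels, 1)
        self.rcls = nn.Sequential(RCL(out_channels, steps),
                                  RCL(out_channels, steps))

    def forward(self, x):
        x = self.reduce(x)
        return x + self.rcls(x)         # x_{l+1} = x_l + F(x_l, w_l)
```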

D. Loss Function

This paper uses multi-class cross-entropy loss {\mathcal {L}}_{ce} , relation prior layer loss {\mathcal {L}}_{rel} , semantic category loss {\mathcal {L}}_{scal}^{sem} , geometric occupancy loss {\mathcal {L}}_{scal}^{geo} , and frustum proportion loss {\mathcal {L}}_{fp} to train the network end-to-end.

The loss of the relation prior layer is used in the 3D CRP module. The features extracted by the 3D network are sent to ASPP convolutions to enlarge the receptive field and are then divided into M relationship matrices \hat {A}^{m}, m\in M , supervised by the ground truth A^{m} . The weighted multi-label binary cross-entropy loss is:\begin{align*} {\mathcal {L}}_{rel}& =\sum _{m\in M,i}\bigg [\left ({{ 1-A_{i}^{m} }}\right )\log \left ({{ 1-\hat {A}_{i}^{m} }}\right ) \\ & \quad + \frac {\sum _{i} \left ({{ 1-A_{i}^{m} }}\right )}{\sum _{i} A_{i}^{m}}A_{i}^{m}\log \hat {A}_{i}^{m}\bigg] \tag {6}\end{align*}

where i loops through each element of the relationship matrices.

The semantic category and geometric occupancy losses extend the 2D binary affinity loss [26] to optimize multiple semantic categories and the geometric scene, using class-wise precision, recall, and specificity as the optimized terms. Let p_{i} be the true category of voxel i and \hat {p}_{i,c} the predicted probability that voxel i belongs to category c. The following terms are defined:\begin{align*} P_{c}(\hat {p},p)& =\log \frac {\sum \nolimits _{i} {\hat {p}_{i,c}\left [\!\left [{{ p_{i}=c }}\right ]\!\right ]}}{\sum \nolimits _{i} \hat {p}_{i,c}} \tag {7}\\ R_{c}(\hat {p},p)& =\log \frac {\sum \nolimits _{i} {\hat {p}_{i,c}\left [\!\left [{{ p_{i}=c }}\right ]\!\right ]}}{\sum \nolimits _{i} \left [\!\left [{{ p_{i}=c }}\right ]\!\right ]} \tag {8}\\ S_{c}(\hat {p},p)& =\log \frac {\sum \nolimits _{i} {(1-\hat {p}_{i,c})\left ({{ 1-\left [\!\left [{{ p_{i}=c }}\right ]\!\right ] }}\right )}}{\sum \nolimits _{i} {\left ({{ 1-\left [\!\left [{{ p_{i}=c }}\right ]\!\right ] }}\right )}} \tag {9}\end{align*}

where \left [\!\left [{ \cdot }\right ]\!\right ] denotes the Iverson bracket, which equals one if the condition inside holds and zero otherwise. The loss {\mathcal {L}}_{scal} is defined so that minimizing it maximizes the terms above:\begin{equation*} {\mathcal {L}}_{scal}\left ({{ \hat {p},p }}\right )=-\frac {1}{C}\sum \limits _{c=1}^{C} \left ({{ P_{c}\left ({{ \hat {p},p }}\right )+R_{c}\left ({{ \hat {p},p }}\right )+S_{c}\left ({{ \hat {p},p }}\right ) }}\right ) \tag {10}\end{equation*}
In practice, we use the semantic loss {\mathcal {L}}_{scal}^{sem}={\mathcal {L}}_{scal}(\hat {y},y) and the geometric loss {\mathcal {L}}_{scal}^{geo}={\mathcal {L}}_{scal}(\hat {y}^{geo},y^{geo}) to optimize the semantic and geometric information respectively, where \left \{{{ y,y^{geo} }}\right \} are the ground-truth semantic and geometric labels and \left \{{{ \hat {y},\hat {y}^{geo} }}\right \} are the predictions.
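For illustration only, a PyTorch sketch of the {\mathcal {L}}_{scal} term of Eq. (10) is given below; the tensor shapes (flattened voxels, softmax probabilities) are assumptions, and masking of unlabeled voxels is omitted.

```python
import torch

def scal_loss(pred, target, num_classes, eps=1e-6):
    """Sketch of the precision/recall/specificity loss of Eq. (10).

    pred:   (N, C) per-voxel class probabilities (after softmax)
    target: (N,)   ground-truth class index for each voxel
    """
    total, counted = 0.0, 0
    for c in range(num_classes):
        mask = (target == c).float()     # Iverson bracket [[p_i = c]]
        if mask.sum() == 0:              # skip classes absent from the batch
            continue
        p_c = pred[:, c]
        precision = torch.log((p_c * mask).sum() / (p_c.sum() + eps) + eps)
        recall = torch.log((p_c * mask).sum() / (mask.sum() + eps) + eps)
        specificity = torch.log(((1 - p_c) * (1 - mask)).sum()
                                / ((1 - mask).sum() + eps) + eps)
        total = total - (precision + recall + specificity)
        counted += 1
    return total / max(counted, 1)
```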

From a single observation viewpoint, the influence of occlusion on the prediction cannot be eliminated, since visible objects are typically assumed to extend into the occluded area. To mitigate this effect, the frustum proportion loss {\mathcal {L}}_{fp} is used. The input image is divided into l\times l local blocks of equal size, and the loss is applied in the view frustum corresponding to each local block. For a frustum k, let P_{k} denote the true class distribution and P_{k,c} the proportion of class c within k; the corresponding predictions are \hat {P}_{k} and \hat {P}_{k,c} . {\mathcal {L}}_{fp} is defined as the sum of the KL divergences over the local frustums:\begin{equation*} {\mathcal {L}}_{fp}=\sum \limits _{k=1}^{l^{2}} {D_{KL}(P_{k}\vert \vert \hat {P}_{k})} =\sum \limits _{k=1}^{l^{2}} \sum \limits _{c\in C_{k}} {P_{k}(c)\log \frac {P_{k}(c)}{\hat {P}_{k}(c)}} \tag {11}\end{equation*}
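A rough PyTorch sketch of the per-frustum KL term of Eq. (11) follows; the precomputed voxel-to-frustum assignment (frustum_ids) and the tensor shapes are assumptions for illustration, not the paper's implementation.

```python
import torch

def frustum_proportion_loss(pred, target, frustum_ids, num_frustums,
                            num_classes, eps=1e-6):
    """Sum of KL divergences between true and predicted class
    distributions inside each local view frustum, Eq. (11).

    pred:        (N, C) per-voxel class probabilities
    target:      (N,)   ground-truth class indices
    frustum_ids: (N,)   index of the frustum each voxel falls into
    """
    loss = 0.0
    for k in range(num_frustums):
        in_k = frustum_ids == k
        if in_k.sum() == 0:
            continue
        # True class proportions P_k and predicted proportions P_hat_k.
        p_k = torch.bincount(target[in_k], minlength=num_classes).float()
        p_k = p_k / p_k.sum()
        p_hat_k = pred[in_k].mean(dim=0)
        p_hat_k = p_hat_k / p_hat_k.sum()
        present = p_k > 0                # sum only over classes present in C_k
        loss = loss + (p_k[present]
                       * torch.log(p_k[present] / (p_hat_k[present] + eps))).sum()
    return loss
```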

The total training loss is the sum of the standard cross-entropy loss and the above loss functions:\begin{equation*} { {\mathcal {L}}_{total}={\mathcal {L}}_{ce}+{\mathcal {L}}_{rel}+{\mathcal {L}}_{scal}^{sem}+{\mathcal {L}}_{scal}^{geo}+{\mathcal {L}}_{fp} } \tag {12}\end{equation*}

SECTION IV.

Experiments

A. Training Setup

All experiments in this paper use the following environment: Ubuntu 18.04, an Intel Xeon Gold 6330 CPU @ 2.00 GHz, an NVIDIA RTX A5000 (24 GB) GPU, the PyTorch 1.8 framework, and Python 3.7. The network is trained for 30 epochs with the AdamW optimizer, a batch size of 4, and a learning rate of 0.0004.
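For reference, a minimal configuration reflecting these hyperparameters might look as follows; the model below is only a placeholder, not the actual completion network.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the full completion network.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())

# Optimizer configuration matching the reported hyperparameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)
num_epochs, batch_size = 30, 4
```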

B. Datasets and Metrics

NYUv2 is an indoor dataset with 1449 scenes captured by a Kinect camera; the image resolution is 640\times 480 . The scene is represented by a 240\times 144 \times 240 voxel grid labeled with 11 semantic categories. 795 samples are used for training and the remaining 654 for testing.

Semantic KITTI is a large public dataset of road driving scenes based on the well-known KITTI Odometry benchmark; the image resolution is 1226\times 370 . The scene is represented by a 256\times 256 \times 32 voxel grid labeled with 19 semantic classes. The training set consists of 3834 scenes, and evaluation is performed on the 815 scenes of the validation set.

In these experiments, we use the IoU (Intersection over Union) to assess the scene completion of each category, and the mean IoU (mIoU) over all categories to evaluate the overall semantic scene completion performance, which is the most important metric for this task.
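As a simple illustration of how these metrics are computed from voxel labels (real evaluations additionally mask out unlabeled and out-of-view voxels), consider the sketch below.

```python
import torch

def per_class_iou(pred, target, num_classes):
    """Compute IoU for each semantic class and their mean (mIoU).

    pred, target: (N,) integer class labels per voxel
    """
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().item()
        union = ((pred == c) | (target == c)).sum().item()
        if union > 0:
            ious.append(inter / union)
    miou = sum(ious) / max(len(ious), 1)
    return ious, miou
```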

C. Performance

The comparison between our method and other available algorithms on the NYUv2 and Semantic KITTI datasets is shown in Table 1 and Table 2 respectively. Higher IoU and mIoU values indicate better results. The baseline method is denoted by *, the best values are shown in bold, and the second best are underlined. The experimental results show that, compared with the baseline, the proposed strategy improves both IoU and mIoU on the two datasets.

TABLE 1 Results of Different Methods on the NYUv2 Dataset (%)
TABLE 2 Results of Different Methods on the Semantic KITTI Dataset (%)

Figure 7 and Figure 8 display the visual results on the NYUv2 and Semantic KITTI datasets. The first column shows the input RGB images, the second and third columns show the results of other methods, the fourth and fifth columns show the results of the baseline and of our method, and the last column shows the ground truth.

FIGURE 7. Comparison of results on the NYUv2 dataset.

FIGURE 8. Comparison of results on the Semantic KITTI dataset.

The first row of Figure 7 demonstrates that our method correctly recognizes the corner of the bed even though only part of it is visible in the image, whereas the baseline method incorrectly identifies it as a sofa and other objects. The second row shows that our method effectively recognizes the shape of the photo on the wall, while the baseline method incorrectly merges it with nearby objects. The third row indicates that our method accurately recognizes the table, whose texture is close to that of the furniture class, while the baseline method mistakes it for furniture.

The first row of Figure 8 shows that our method correctly recognizes the car that is largely occluded at the intersection, whereas the baseline method fails to detect it. The second row demonstrates that our method distinguishes the trucks and cars parked together in the distance, while the baseline result shows only cars. The third row shows that our method effectively differentiates tree trunks from poles, while the baseline method incorrectly conflates them. The fourth row shows that the traffic sign on the pole is detected correctly, whereas the baseline treats the sign and the pole as the same object.

D. Ablation Study

1) Loss Function Ablation:

We carried out ablation experiments on NYUv2 and Semantic KITTI datasets to verify the effectiveness of each part of the loss function. The model’s performance was assessed by removing each loss function while keeping the other conditions unchanged. The results are shown in Table 3 and Table 4.

TABLE 3 Results of Loss Function Ablation on the NYUv2 Dataset (%)
TABLE 4 Results of Loss Function Ablation on the Semantic KITTI Dataset (%)

2) Module Ablation:

To verify the effectiveness of the CoT module and the RRCNN module, we carried out ablation experiments. Table 5 and Table 6 show the detailed results on the NYUv2 and Semantic KITTI datasets respectively; the best values are shown in bold and the second best are underlined. Adding only the CoT module increases IoU and mIoU by 1.24 and 0.87 respectively on the indoor dataset, and by 0.16 and 0.26 on the outdoor dataset. Adding only the RRCNN module increases IoU and mIoU by 1.18 and 0.83 respectively on the indoor dataset; on the outdoor dataset it decreases IoU by 0.37 but increases mIoU by 0.33, a larger mIoU gain than that of the CoT module.

TABLE 5 Results of Ablation Study on the NYUv2 Dataset (%)
TABLE 6 Results of Ablation Study on the Semantic KITTI Dataset (%)

We also give visualization results for the ablation experiments. Figure 9 and Figure 10 show the results of the baseline method and of applying the CoT module and the RRCNN module, respectively, to the same RGB images. As shown in the two figures, adding the CoT module improves the completion of the scene: compared with the baseline, there are fewer large blank areas, as confirmed by the first, third, and fourth rows of Figure 9 and the first and third rows of Figure 10. As shown in the second and fourth rows of Figure 9 and the first, second, and fourth rows of Figure 10, adding the RRCNN module improves detection accuracy, allowing objects with inconspicuous features to be detected, such as transparent windows, small objects on the desktop, poles, and traffic signs.

FIGURE 9. Results of ablation experiments on the NYUv2 dataset.

FIGURE 10. Results of ablation experiments on the Semantic KITTI dataset.

SECTION V.

Conclusion

In this paper, we propose a monocular 3D semantic scene completion method based on a context transformer and a recurrent residual convolution module that improves the image feature extraction network. To improve visual representation, the context transformer module is introduced between the encoder and decoder to use contextual information to guide the learning of dynamic attention matrices. After the up-sampling stages of the image decoder, the recurrent residual convolution module accumulates features over different time steps to improve the network's ability to distinguish similar objects. Experimental results show that the evaluation metrics on the NYUv2 and Semantic KITTI datasets are better than those of the baseline method, yielding superior semantic scene completion. Although the proposed method improves semantic scene completion, the prediction accuracy for some small targets is still relatively low. Future work will aim to improve the completion of small targets in indoor and outdoor environments.
