Compact Depth-Wise Separable Precise Network for Depth Completion

Predicting a dense depth map from synchronized LiDAR scans and RGB images using compact deep neural networks presents a significant challenge. While most state-of-the-art models enhance prediction accuracy by increasing the number of parameters, leading to substantial memory consumption, depth completion tasks in areas such as autonomous driving primarily run on edge devices powered by embedded GPUs. In this paper, we introduce a methodology for creating an efficient, high-fidelity depth completion model derived from a base model. Our proposed compact model replaces conventional convolutional encoder layers with depth-wise separable convolutions, and transposed convolutional decoders with up-sampling plus depth-wise separable convolution. We further employ random layer pruning as a stability test, guiding the design of our architecture and preventing over-parameterization. Additionally, we introduce a straightforward yet robust knowledge distillation method to enhance network performance and improve model scalability to meet higher quality requirements. Our experimental results demonstrate substantial improvement over existing compact models, achieving state-of-the-art accuracy in this class while requiring significantly fewer parameters than larger models.


I. INTRODUCTION
Generating a precise depth map can be a fundamental prerequisite for various downstream visual recognition tasks, including autonomous driving, 3D object detection, and augmented reality. Such applications necessitate a detailed understanding of the spatial depth of their environments. Depth completion, achieved through the fusion of sparse LiDAR scans and RGB camera images, is an effective method for generating a precise dense depth map. This multisensor approach leverages the rich information from RGB imagery and the direct, discrete quantitative depth measurements from LiDAR sensors, yielding benefits over single sensor type implementations.
However, the sparsity and non-uniform distribution of LiDAR depth scans within a 2D viewpoint make depth completion a challenging task. Consequently, this area has attracted significant research attention in recent years.
Numerous studies have selected the KITTI dataset [1] as the benchmark. This dataset captures 3D point clouds via the Velodyne LiDAR HDL-64E, projected onto an image plane synced with RGB cameras mounted on the same data collection platform.
Following the success of convolutional neural networks in computer vision, deep convolutional neural networks have become a popular choice for state-of-the-art models. Multiscale feature extraction using Hourglass deep convolutional neural networks, along with a refinement module for additional processing of the coarse depth map, is a common approach. However, such networks often comprise multiple layers and branches with complex interconnections. As a result, these networks tend to be resource-heavy, with more than a hundred million parameters and a relatively slow inference time.
Despite the effectiveness of these methods, the nature of depth completion means they are typically deployed on embedded devices in production. These devices face hardware constraints in terms of space and cost, and real-time inference is a fundamental requirement. Therefore, there is a significant demand for compact, efficient, and accurate depth completion implementations.
In response, we propose a methodology for constructing a Compact Depth-wise Separable Precise Network (CDSPN) from a base model. Our approach leverages the highly efficient depth-wise separable technique to significantly reduce the parameters and utilizes selective fine-tuning via pruning to further compact the model without sacrificing precision. We also employ black-box knowledge distillation to boost accuracy, along with model scaling that does not require knowledge of the teacher models. We have demonstrated this methodology on two popular networks, PE-Net [2] and DeepLiDAR [3], to illustrate its simplicity and general applicability to different base models.
In particular, we have replaced the convolutional layers, including the encoder and decoder, with highly efficient depth-wise separable convolution. During experimentation, we found that attention modules, such as the squeeze-and-excitation in MBConv [4], had limited utility. Through the application of fully grouped convolution and point-wise convolution, we have significantly reduced the number of parameters required, as well as the inference time. Additionally, we replaced the transposed convolution with up-sampling combined with convolutional operations to eliminate undesirable checkerboard artifacts.
After integrating the depth-wise separable convolution, we implemented random layer pruning as a stability test on our backbone model. This approach guided the selective fine-tuning of the model, preventing over-parameterization and eliminating redundant connections. As a result, after several iterations of pruning, our redesigned backbone model excelled due to its simple architecture and improved ratio of quality to the number of parameters. Overall, our model outperforms existing depth-wise separable convolution models and other compact models, while training on significantly fewer parameters than the base model. The model's performance and speed also make it suitable for deployment on embedded devices.
To mitigate precision loss from these architectural changes, and to imbue the network with scalable capabilities, we introduced a straightforward but effective knowledge distillation method. This method solely relies on the predictions of the black-box teacher model, eliminating the need for access to its intermediate layers. By using distillation on the prediction and its gradient, our model can easily scale to meet higher accuracy requirements without complicated adjustments, fine-tuning of the network architecture, or retraining of the teacher. This approach ensures that distillation remains viable in real-world applications where only the API to the teacher is accessible.
The contributions of this work are as follows:
1) We introduce a comprehensive workflow for creating compact models from existing state-of-the-art architectures. This process aims to produce flexible and efficient agents capable of performing visual recognition tasks at various scales, and it provides a practical method for compressing and improving many large-scale models.
2) We highlight depth-wise separable convolution as a promising approach for designing compact models suitable for model compression and operation within resource-constrained environments. This technique can make a significant contribution to the field of depth completion.
3) Our models outperform existing depth-wise separable implementations [5] across all metrics of accuracy, parameter count, and inference speed. Furthermore, our models compete with the state-of-the-art compact model [6] in root mean square error (RMSE), both with knowledge distillation [7] and without, while maintaining similar parameter counts and inference speeds. When scaled using our knowledge distillation method, our models match the performance of the teacher PE-Net model while requiring over 50 times fewer parameters and delivering faster inference speeds.

II. RELATED WORK
A. DEPTH-WISE SEPARABLE CONVOLUTION
Depth-wise separable convolution seeks to represent a regular dense convolution using two distinct convolutional operations. Initially, a group convolution is applied independently to each channel of the input feature. Typically, the output channels are equivalent to the input channels, although the spatial properties of each channel may change. Subsequently, a point-wise convolution (with 1 × 1 kernels) is conducted, projecting the output feature into a new channel space. The parameter costs of regular convolution ($P_{\mathrm{conv}}$) and depth-wise separable convolution ($P_{\mathrm{DS}}$) can be described using the equations
$$P_{\mathrm{conv}} = K^2 \cdot C_{\mathrm{in}} \cdot C_{\mathrm{out}}$$
and
$$P_{\mathrm{DS}} = K^2 \cdot C_{\mathrm{in}} + C_{\mathrm{in}} \cdot C_{\mathrm{out}}.$$
Here, $K$ represents the kernel size, and $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ the input and output channel numbers. Depth-wise separable convolution ($P_{\mathrm{DS}}$) requires far fewer parameters than regular convolution ($P_{\mathrm{conv}}$), especially when the channel numbers are large. As a result, this technique and its variations have been widely used in many previous works [8], [9], [10], [11], and [12], including depth completion [5]. However, certain studies [13] have indicated that depth-wise separable convolution can lead to higher memory access consumption, which in turn can result in longer inference times than dense convolution with an equivalent number of parameters. Our implementation strives to achieve a balance between accuracy and speed.
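To make the cost comparison concrete, the following minimal PyTorch snippet counts the parameters of a regular 3 × 3 convolution and of its depth-wise separable counterpart; the kernel size and channel numbers are illustrative values, not taken from our architecture.

```python
import torch.nn as nn

# Illustrative sizes only: K=3, C_in=C_out=128.
K, C_in, C_out = 3, 128, 128

regular = nn.Conv2d(C_in, C_out, kernel_size=K, padding=1, bias=False)

separable = nn.Sequential(
    nn.Conv2d(C_in, C_in, kernel_size=K, padding=1, groups=C_in, bias=False),  # depth-wise (group) conv
    nn.Conv2d(C_in, C_out, kernel_size=1, bias=False),                          # point-wise 1x1 conv
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(regular))    # K^2 * C_in * C_out        = 147,456
print(count(separable))  # K^2 * C_in + C_in * C_out =  17,536
```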

B. MULTI-BRANCH DEPTH COMPLETION NETWORKS
Many state-of-the-art works [2], [3], [14], and [15] have adopted a multi-branch design for their network architecture, as it has a stronger capacity to exploit contextual features. PE-Net and DeepLiDAR are two such networks and have been selected as the base models in our experiments. Our main work, based on PE-Net, employs a two-branch structure that exploits RGB information and spatial depth information respectively. Geometric information is fused into the convolutional layers to enhance the capability of the model, and the Convolutional Spatial Propagation Network (CSPN++) [16] acts as a refinement module, further recovering structural details. DeepLiDAR leverages additional training data sourced from a synthetic dataset created using the open urban driving simulator Carla [17]. This simulator generates ground normal data, which is used to train a normal prediction branch within the network.

FIGURE 1. The methodology when PE-Net is employed as the base model. The backbone's predicted depth is further processed by a refinement module called DA-CSPN++ [2]. For more detailed information on the layer architecture of the backbone, refer to Table 1. The process of layer pruning is explained in Section III-B. Offline knowledge distillation can be applied, treating the teacher model as a black box.

FIGURE 2. The methodology when DeepLiDAR is employed as the base model. After the pruning test, the normal prediction branch, denoted as I, is removed and its output is replaced with a zero tensor. The details of the layer architecture for the active branches, II and III, can be found in Table 2. As with the PE-Net example illustrated in Fig. 1, we apply the same knowledge distillation method to the final output.

C. MODEL PRUNING
Model pruning represents a set of compression techniques that allow for the fine-tuning of a model to enhance its efficiency and reduce its size. There are two primary pruning paths: weight pruning and layer pruning. Weight pruning [18] targets the increase of model sparsity by nullifying weights of lower magnitude. As a result, the pruned model retains the original architecture, but becomes sparser as weights of insignificant magnitude are zeroed out.  Consequently, large yet pruned models (large-sparse) can outperform their smaller, but denser (small-dense) counterparts in terms of accuracy, albeit with the same memory footprint.
The second pruning method involves filter pruning [19] and layer pruning. These methods eliminate filters and layers in the model, as opposed to individual weights. Contrasted with weight pruning, which might not always result in clear memory saving and speed improvements when deployed on edge devices due to irregular network connections, filter and layer pruning are more straightforward and effective. Furthermore, these methods allow for entire filters or layers to be removed, facilitating the exportation of the resulting model for production purposes.
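As a brief illustration of the two pruning styles, the following sketch uses PyTorch's built-in pruning utilities; the layer sizes and pruning ratios are placeholders and do not correspond to our models.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

conv_a = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv_b = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Weight pruning: zero the 75% smallest-magnitude weights. The layer keeps
# its dense shape, so the memory footprint is unchanged (large-sparse).
prune.l1_unstructured(conv_a, name="weight", amount=0.75)

# Filter (structured) pruning: remove half the output filters by L2 norm
# along dim=0, which maps directly to memory and speed savings on devices.
prune.ln_structured(conv_b, name="weight", amount=0.5, n=2, dim=0)

# Bake the masks into the weights before exporting for production.
prune.remove(conv_a, "weight")
prune.remove(conv_b, "weight")
```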

D. KNOWLEDGE DISTILLATION
Knowledge distillation involves the transfer of knowledge from a larger, more accurate ''teacher'' network to a smaller ''student'' network. This allows the student model to achieve better convergence without incurring additional overhead during inference. The student model, by learning from the teacher network's patterns, can avoid certain learning pitfalls, thereby achieving improved performance compared to training solely with ground truth. This technique has become popular for enhancing the performance of compact models without adding extra weights or costs.
In the context of depth completion tasks, Wang et al. [7] proposed a self-paced knowledge distillation method. They suggested a progressive learning approach, gradually introducing more challenging pixels to distill depth and structure (gradient) knowledge from the teacher to the student network. Not only is the final teacher model utilized, but snapshots of the teacher model at intermediate steps are also captured for the purpose of identifying challenging pixels.
Furthermore, Liu et al. [20] demonstrated a knowledge distillation method that uses multiple teacher models. This approach allows the student model to avoid learning the error modes of a single teacher, as different error modes appear in different teachers across different depth ranges. Consequently, the distillation loss is calculated by selecting the prediction from the teacher that minimizes the error most effectively. In regions where all teachers yield high errors, a default loss is chosen instead of the distillation loss.

III. PROPOSED METHOD
In this work, we select two influential multi-branch hourglass depth completion models, PE-Net and DeepLiDAR, to illustrate the process of building our Compact Depth-wise Separable Precise Network (CDSPN) models. An illustration of the method with PE-Net as the base model is shown in Fig. 1, and Fig. 2 illustrates the corresponding workflow with DeepLiDAR.

A. DEPTH-WISE SEPARABLE CONVOLUTION
The cornerstone of our approach is the use of lightweight depth-wise separable convolutions, replacing the regular convolutions within our chosen models. Depth-wise separable convolutions separate standard convolutions into depth-wise and point-wise convolutions. This transformation significantly reduces the total number of parameters, thereby reducing the computational complexity (FLOPs) and enhancing the efficiency of our models. We initially applied MBConv [4] blocks, but later discovered that the squeeze-and-excitation attention mechanism did not contribute significantly to our models.
We also experimented with Coordinate Attention [21]; however, we found it to be computationally expensive, consuming a large portion of the inference time. Therefore, we opted for a simpler architecture, as shown in Fig. 3, which only includes depth-wise separable convolutions. These consist of a group convolution and a point-wise convolution, along with a residual connection that performs element-wise addition when the input and output channels are the same. Instead of transposed convolutions [22], we used bilinear up-sampling in the decoder to avoid creating checkerboard artifacts. This method also allowed us to use the same layer design as the encoder across all decoders.
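A minimal PyTorch sketch of this block design is given below; the module names, normalization, and activation choices are our assumptions for illustration rather than the exact CDSPN implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class DSConvBlock(nn.Module):
    """Depth-wise separable block: group (depth-wise) conv + point-wise conv,
    with an element-wise residual connection when input/output channels match.
    A sketch of the block in Fig. 3; widths and norm/activation are illustrative."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride=stride, padding=1,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        self.use_residual = (in_ch == out_ch and stride == 1)

    def forward(self, x):
        out = self.act(self.bn(self.pointwise(self.depthwise(x))))
        return out + x if self.use_residual else out

class UpBlock(nn.Module):
    """Decoder stage: bilinear up-sampling followed by the same separable block,
    replacing transposed convolution to avoid checkerboard artifacts."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DSConvBlock(in_ch, out_ch)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        return self.conv(x)
```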
After replacing the encoder and decoder elements of the model, we employed a stability test through random layer pruning to guide the refinement of the model architecture.

B. RANDOM LAYER PRUNING STABILITY TEST
A stability test is carried out through the process of random layer pruning, specifically by deactivating a percentage of a layer's filters. Typically, this results in a reduction in model accuracy. If the pruned layer is integral to the model, a significant drop in accuracy can be observed, indicating the layer's large influence on the prediction. Conversely, an increase in accuracy following pruning may indicate over-parameterization of the model, allowing us to eliminate over-parameterized layers, reduce the number of filters in less significant layers, and consider increasing the number of filters in key layers.

TABLE 3. An example of the pruning test conducted on the layers. In each layer, 75% of the filters are pruned individually and the accuracy is subsequently evaluated. In this specific example, despite pruning most filters near the bottleneck layers (e10 and d1), the accuracy remains close to the baseline, which is notable because the baseline parameter count is several times larger.
Detailed implementation can be found in Algorithm 1, with the choice of ϵ being dependent on the base accuracy. An example is provided in Table 3.
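As a rough illustration of the stability test in Algorithm 1, the following sketch prunes 75% of each convolutional layer's filters in turn and compares validation RMSE against the unpruned baseline; the `evaluate` function, the pruning ratio, and the value of ϵ are placeholders rather than our exact procedure.

```python
import copy
import torch

@torch.no_grad()
def stability_test(model, evaluate, prune_ratio=0.75, eps=5.0):
    """Randomly deactivate a fraction of each conv layer's filters, one layer
    at a time, and compare accuracy with the unpruned baseline.
    `evaluate` is assumed to return validation RMSE; `eps` is chosen
    relative to the base accuracy."""
    baseline = evaluate(model)
    report = {}
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Conv2d):
            continue
        pruned = copy.deepcopy(model)
        target = dict(pruned.named_modules())[name]
        n_filters = target.weight.shape[0]
        idx = torch.randperm(n_filters)[: int(prune_ratio * n_filters)]
        target.weight[idx] = 0.0            # deactivate the selected filters
        if target.bias is not None:
            target.bias[idx] = 0.0
        rmse = evaluate(pruned)
        # Layers whose pruning barely moves the metric (within eps) are
        # candidates for shrinking; large degradations mark key layers.
        report[name] = (rmse, rmse - baseline < eps)
    return baseline, report
```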
Moreover, with careful pruning [23], the model can be further compressed for production use, albeit with a slight decrease in accuracy.
In the case of the PE-Net, we employed several iterations of fine-tuning cycles. As a result, as shown in Fig. 1, the final architecture of our model is no longer symmetrical in the encoder and decoder parts. This is primarily because we found that the decoder requires fewer layers and filters than the encoder.
When adapting the DeepLiDAR model, we noticed that the normal prediction path contributed minimally to the final output. Thus, we replaced the output of the normal prediction with a tensor filled with zeros and pruned this branch from the model.
After obtaining a fine-tuned compact model, we devised a straightforward way to adjust the model to different requirements: a simple channel multiplier that tweaks the number of channels in each layer to match diverse scale requirements.
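A minimal sketch of such a width multiplier is shown below; the base channel widths and the rounding divisor are hypothetical values for illustration, not the exact CDSPN configuration.

```python
def scale_channels(base_channels, multiplier, divisor=8):
    """Apply a simple width multiplier to per-layer channel counts,
    rounding to a hardware-friendly multiple of `divisor`."""
    return [max(divisor, int(round(c * multiplier / divisor)) * divisor)
            for c in base_channels]

encoder_channels = [32, 64, 128, 256, 512]       # hypothetical base widths
print(scale_channels(encoder_channels, 0.5))      # [16, 32, 64, 128, 256]
print(scale_channels(encoder_channels, 1.5))      # [48, 96, 192, 384, 768]
```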

C. A SIMPLE DISTILLATION METHOD
Knowledge distillation can often serve as an effective remedy for the poor local minima encountered when training light or sub-optimal models. In this work, we employ a simple approach to distill knowledge from a pre-trained network, such as E-Net and PE-Net. Unlike other techniques [7], [24], only the predictions of these models are incorporated into the distillation process as part of the loss function.

Algorithm 2 Add Knowledge Distillation in Training
The distillation losses are calculated as
$$L_{\mathrm{depth}} = \mathrm{MSE}(D_s, D_t)$$
and
$$L_{\mathrm{grad}} = \mathrm{MSE}\big(g(D_s), g(D_t)\big),$$
where MSE refers to the mean squared error, $D_t$ and $D_s$ are the depth maps predicted by the teacher and student models, respectively, and $g(x)$ calculates the spatial gradient of $x$. Our overall loss function is then formulated as
$$L = \mathrm{MSE}\big(m(D_g) \odot D_s,\ m(D_g) \odot D_g\big) + \lambda_1 L_{\mathrm{depth}} + \lambda_2 L_{\mathrm{grad}}.$$
In these equations, $D_g$ represents the ground-truth depth map, $m(x)$ is a binary mask that is one where $D_g^{i,j}$ is not zero, and $\lambda_1$, $\lambda_2$ are hyperparameters, both set to 0.1 in our case.
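A minimal PyTorch sketch of this loss formulation follows; the function names, the validity-mask definition, and the finite-difference gradient are our assumptions for illustration, not the exact code behind Algorithm 2.

```python
import torch.nn.functional as F

def spatial_gradient(d):
    """Finite-difference spatial gradient g(x) of a depth map (N, 1, H, W)."""
    gx = d[..., :, 1:] - d[..., :, :-1]
    gy = d[..., 1:, :] - d[..., :-1, :]
    return gx, gy

def total_loss(d_student, d_teacher, d_gt, lambda1=0.1, lambda2=0.1):
    """Supervised L2 loss on valid ground-truth pixels plus the two black-box
    distillation terms (depth and spatial gradient)."""
    mask = (d_gt > 0).float()                      # m(x): valid ground-truth pixels
    sup = F.mse_loss(d_student * mask, d_gt * mask)
    depth_kd = F.mse_loss(d_student, d_teacher.detach())
    gx_s, gy_s = spatial_gradient(d_student)
    gx_t, gy_t = spatial_gradient(d_teacher.detach())
    grad_kd = F.mse_loss(gx_s, gx_t) + F.mse_loss(gy_s, gy_t)
    return sup + lambda1 * depth_kd + lambda2 * grad_kd
```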
According to prior studies [25], stronger teacher models tend to result in more effective distillation. Our approach allows for easy utilization of more powerful models to achieve enhanced performance without the need for retraining or modification of the teacher model.
Our straightforward distillation approach significantly improves the student model's performance. In our experiments, we observed that larger or less fine-tuned models benefited more from distillation. We surmise that smaller models may lack sufficient parameters for representation and well-tuned models may have already learned more optimal weights.
By employing a combination of a simple channel multiplier and knowledge distillation, the resulting models can deliver satisfactory performance across different parameter sizes.

IV. EXPERIMENTS
A. DATASET
The KITTI Depth Completion Dataset [1] is utilized for evaluating our proposed methods. The KITTI dataset, a renowned benchmark dataset for outdoor depth completion tasks, provides RGB images alongside aligned sparse depth maps. It offers semi-dense depth maps generated via LiDAR, with consistency checks performed using stereo camera pairs as ground truths.
The dataset is comprised of 85,895 training samples, 1,000 validation samples, and 1,000 test samples, the latter lacking ground truth data for benchmarking purposes. Although we report our results on the test dataset, most of our experiments, including ablation studies, are conducted on the selected validation dataset.

B. EXPERIMENTAL SETUP
For models utilizing PE-Net, most of our training hyperparameters are consistent with those of PE-Net. Tweaking these hyperparameters is not the focus of our work. We follow the same multi-stage training strategy by initially training the backbone model, subsequently freezing it while the DA-CSPN++ module is being trained, and finally training the entire network. A fourth stage has been introduced for knowledge distillation, where we incorporate the distillation loss, guided by a pretrained PE-Net, to further enhance performance.
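As a rough illustration of this staged schedule, the sketch below encodes the four stages as a configuration list; the stage and submodule names ("backbone", "dacspn") are placeholders rather than the identifiers used in our code.

```python
# Hypothetical sketch of the four training stages for the PE-Net-based model.
STAGES = [
    {"name": "backbone",   "train": ["backbone"],           "distill": False},
    {"name": "refinement", "train": ["dacspn"],             "distill": False},  # backbone frozen
    {"name": "full",       "train": ["backbone", "dacspn"], "distill": False},
    {"name": "distill",    "train": ["backbone", "dacspn"], "distill": True},   # teacher: pretrained PE-Net
]

def set_trainable(model, submodule_names):
    """Freeze every parameter, then unfreeze only the listed submodules."""
    for p in model.parameters():
        p.requires_grad = False
    for name in submodule_names:
        for p in getattr(model, name).parameters():
            p.requires_grad = True
```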
For models based on DeepLiDAR, our model is constructed upon a simplified, scaled-down implementation of DeepLiDAR, developed in Python 3 and available at https://github.com/cyshih704/DeepLiDAR. In this implementation, the base model parameter count has been reduced from 143M to 47M, the training image size has been reduced, and fewer training data are used. The extra training data for the normal path is also omitted. Consequently, the results from this implementation are less satisfactory than those of the published version. While training our modified model, we simply train the entire network, and later add the distillation loss, guided by a pretrained PE-Net, since our modified model is in fact more accurate than the base model.

TABLE 4. Results on the KITTI test set, for which there is no open access to ground truth. The primary evaluation metric is RMSE, with lower values indicating better performance. The first runtime column (Runtime1) is sourced directly from the benchmark site, while the second (Runtime2) is determined using our testing hardware. Overall, our series of models deliver competitive results, even though they utilize only a fraction of the total parameters employed by the base models. Our model, based on PE-Net, outperforms MSGCHN, which is among the most efficient models at various scales, with or without distillation (as represented by SKPD_MSGCHN64). Furthermore, it provides comparable or superior accuracy, coupled with faster inference speeds, when set against other considerably larger models.

C. RESULTS
In this section, we evaluate our series of models on the KITTI dataset, comparing our results with those of contemporary approaches. Among compact models in this category, our compact PE-Net-based model stands out as an exceptional performer, particularly with respect to the RMSE metric employed by KITTI. We have also significantly enhanced our compact DeepLiDAR model, demonstrating the generalizability of our methodology across various models. This improvement underlines the robustness of our approach, indicating its applicability to diverse model architectures. A key advantage of our models is their scalability and flexibility, permitting customization across a broad spectrum of sizes to meet diverse requirements. This adaptability makes our models highly competitive across various conditions. For a comprehensive understanding of the test results, please refer to Table 4.

D. ABLATION STUDIES
1) USE OF ATTENTION MODULE
Attention mechanisms are frequently used in inverted residual blocks (MBConv [26]) and many depth completion networks [27], [28] to enhance performance. Our base block only incorporates depth-wise convolution and point-wise projection, to offset latency and further reduce the number of parameters. Our studies indicate that although attention can potentially improve the MAE, it significantly increases latency due to the presence of too many small operators, which reduce parallelism [13], and it typically requires a greater number of parameters. We conducted tests on classic squeeze-and-excitation [29] and coordinate attention [21].
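For reference, a standard squeeze-and-excitation block of the kind we ablated is sketched below; the reduction ratio and layer choices are illustrative and not our exact test configuration.

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Classic squeeze-and-excitation block as used in MBConv: global average
    pooling followed by two 1x1 convolutions that rescale each channel.
    The many small operators here are what reduce parallelism on embedded GPUs."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(self.pool(x))
```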

2) IMPACT OF KNOWLEDGE DISTILLATION
Our approach to knowledge distillation incorporates two loss components. To ascertain the effect of each loss term, we present the results in Table 6 and Fig. 4. While selecting the PE-Net as the base model, instead of training the backbone and the entire network separately, we conduct the training in a single stage to better demonstrate the impact of distillation. In the context of depth distillation, we solely draw upon the depth predictions from the teacher model. In contrast, gradient distillation relies on the spatial gradient of the teacher model's depth predictions. When both distillation methods are used concurrently, each one contributes to half of the total distillation loss.
FIGURE 5. Initially, our larger-scaled models perform worse than their PE-Net counterparts at similar scales. However, after the application of knowledge distillation, these models demonstrate an improvement that surpasses that of the larger-scaled PE-Net models. This shows the efficacy of our proposed methodology in enhancing the performance of depth completion models, especially those at larger scales.
From our experimental results, we find that both types of distillation losses aid in reducing the primary metric RMSE, with the structure gradient loss demonstrating superiority over direct distillation using the teacher's depth prediction. However, a trend of overestimation is observed in PE-Net when employing only the gradient loss, which may be undesirable in practical applications such as autonomous driving.
Incorporating depth distillation can transfer some behaviors from the teacher model, including potential checkerboard patterns when PE-Net is used as the teacher. Nevertheless, a mixture of depth distillation loss can help mitigate overestimation while maintaining improved accuracy. Thus, we opt for a combination of both loss types in our subsequent experiments.
For nearly all distillation methods, an increase in MAE is observed, likely caused by a shift in the local minimum where the model converges when distillation losses are added to the base L2 loss.
Our distillation method proves advantageous when scaling the model. We compare differently scaled models, with and without distillation, to various scaled PE-Net models. The results, found in Table 7 and Fig. 5, demonstrate that our distillation method significantly enhances the performance of larger models, thereby offering developers the flexibility to scale a compact, lightweight model without requiring fine-tuning.
Nevertheless, the distillation does not prove as effective on fine-tuned compact models. We theorize that if the model is already highly compact and has converged closely to the global minimum given its parameter count, then adding a teacher does not necessarily guide the network towards better convergence.

V. CONCLUSION
In this paper, we introduce a systematic workflow for the generation of compact models, specifically for the depth completion problem. The presented methodology is adaptable to various base models and begins with the replacement of standard convolution layers with more efficient depth-wise separable convolutions, resulting in a significant reduction in parameter size.
Further compression of the network is achieved through strategic pruning via a random layer pruning test. This allows for the removal or reduction of over-parameterized layers. To offset any loss in precision, we apply a black-box knowledge distillation technique to enhance the performance of the compact model across various scales.
Benchmarked on the KITTI dataset, the series of models generated through our methodology are not only competitive compared to other compact models but also demonstrate performance on par with some larger implementations.
While our approach delivers promising results, it's not without its limitations. Our straightforward knowledge distillation method, although effective in improving RMSE, tends to adversely affect MAE. Thus, a more sophisticated distillation method that can balance these two metrics could potentially offer better performance. Additionally, the success of our model is inherently tied to the quality of the base model; a more advanced base model could likely yield an even more robust and efficient compact model. In future work, exploring various base models will be an exciting avenue to potentially elevate the efficacy of our compact model even further.
Overall, this paper represents a significant contribution to the existing body of research in the field of depth completion.