SDDS-Net: Space and Depth Encoder-Decoder Convolutional Neural Networks for Real-Time Semantic Segmentation

In this paper, we propose novel convolutional encoder-decoder architectures for real-time semantic segmentation based on an image-to-image translation approach via the space-to-depth and depth-to-space modules. We present architectures that compress the spatial information of the image using the space-to-depth (SD) module instead of the commonly used pooling methods (max-pooling and average-pooling) or strided convolution. The SD module reduces the image size while preserving the spatial information of the image in the form of extra depth information. This avoids the loss of information and image detail that the pooling approaches introduce. We also propose a lightweight and simple decoder stage using the depth-to-space (DS) module, which constructs a high-resolution dense prediction map from a large number of low-resolution feature maps. The proposed architectures are efficient in learning image classification and semantic segmentation with high accuracy and average processing speed. We trained and tested our proposed architectures on image classification benchmarks (CIFAR10 and Tiny ImageNet) and on indoor and outdoor benchmarks for semantic segmentation, specifically NYU-depthV2 and CITYSCAPES. The proposed architectures attain high accuracy in classification (94.28% on CIFAR10 and 72.25% on Tiny ImageNet) and high mean average precision and pixel accuracy values in semantic segmentation (pixel accuracy of 78.55% on NYU-depthV2 and 87.9% on CITYSCAPES) while maintaining a real-time frame processing speed, outperforming recent state-of-the-art methods in semantic segmentation.


I. INTRODUCTION
The general architecture of convolutional neural networks (CNNs) uses a down-sampling method (e.g., pooling or strides in convolution layers) to compress the representation into a more informative one and to make the training process faster and more efficient. Most CNN architectures use max-pooling (MP) or average pooling (AP) to compress the feature space into a reduced representation. However, MP and AP introduce information loss, because they assume that the important information exists only in the maximum value or the average value of the window that slides over the input data. These pooling approaches give a lossy compressed representation, which negatively affects the overall learning process of the network. Other research [1], [2], [3] showed that strided convolution is able, in some architectures, to learn the best down-sampling parameters and is better than max-pooling, which is a non-learnable mathematical operation. However, strided convolution adds more parameters and complexity to the model compared with pooling-based models. The recent progress in CNN architectures has shown the superior ability of CNNs in many computer vision tasks, yet most CNN models use inefficient feature compression methods.

Semantic segmentation is one of the critical recent tasks in computer vision. It is employed in robotics [4], 3D image understanding [5], medical diagnosis [6], virtual/augmented reality [7], video coding (region-of-interest coding) [8], [9], and self-driving vehicles [10]. Performing this task in real time is therefore extremely beneficial for those applications. Many models have achieved strong semantic segmentation performance using CNN architectures. Some research [13], [14] showed that a single architecture can perform multiple computer vision tasks. Almost all existing semantic segmentation architectures use MP or strided convolution for feature compression, which are inefficient as they introduce a loss of information.

We address the problem of down-sampling the features without losing major or minor information. The proposed down-sampling method reduces the spatial size of the input features but adds the spatial reduction as extra depth channels through a convolutional learnable technique that preserves the same amount of information. The proposed method uses the space-to-depth (SD) module [11] and the depth-to-space (DS) module [12], which were originally proposed for the image/video super-resolution task. Our proposed method, the Space-to-Depth encoder and Depth-to-Space decoder network (SDDS-Net), can perform the task of semantic segmentation with high accuracy. Our contributions in this paper can be summarized as follows:
• We propose new convolutional encoder-decoder architectures based on the robust SD and DS learnable modules, which can learn the dense prediction (semantic segmentation) task efficiently.
• We compare the performance of different encoder architecture configurations: a convolutional architecture with MP, a convolutional architecture with strided convolution, a convolutional architecture with SD, and a depth-wise separable convolutional architecture with residual connections and SD.
• We show that our proposed method can perform semantic segmentation with high speed of processing (∼25).

We first experiment with the SD down-sampling-based architecture on the task of image classification to test the robustness of the SD module in image and feature down-sampling and its performance compared with traditional max-pooling and strided convolution. Then, we extend the classification architecture to perform semantic segmentation based on an image-to-image translation approach. The remainder of this paper is organized as follows: Section II presents the work related to our proposed method, Section III presents the details of the proposed method, Section IV presents the experiments, Section V presents the results obtained by the proposed methods and comparisons with other state-of-the-art (SOTA) methods in semantic segmentation, and Section VI states the conclusion of our work.

The associate editor coordinating the review of this manuscript and approving it for publication was Bo Shen.

II. RELATED WORK
The two key ideas behind our architecture are the SD layer [11], which was proposed by Wang et al. to down-sample a high-resolution optical flow map to a low-resolution map with extra depth channels for the video super-resolution task, and the DS layer [12], which was proposed by Shi et al. under the name "efficient sub-pixel CNN" and used to construct a high-resolution image from many low-resolution images. The SD layer is employed in the encoder stage as a down-sampling module similar to the pooling methods, with an output depth that depends on the spatial size of the input data. The DS module is used as the decoder stage to construct the high-resolution dense prediction map from the small feature maps generated by the encoder stage.
In image classification, recent convolutional neural networks have shown outstanding performance, especially on ImageNet classification. Most of those architectures [15], [19], [20], [21], [22], [23], [30], [37], [49] are based on max-pooling for the down-sampling of the images or features. Other research such as Inception [16] presented a hybrid down-sampling approach that uses both strided convolution with different kernel sizes and max-pooling and then concatenates the outputs of all operations. Springenberg et al. [3] proposed the all-convolution network, which depends exclusively on strided convolution for down-sampling. Xie et al. [24] proposed ResNeXt, which depends mainly on 3 × 3 convolution with a stride of 2 for down-sampling. Liu et al. [25] proposed ConvNeXt, which also depends on strided convolution, in addition to an image patching approach instead of taking the whole image as an input. Although the previous methods are efficient in learning the image classification task, they also introduce some information loss due to their dependence on inefficient down-sampling techniques. The proposed down-sampling approach using convolution and the SD module retains the largest possible amount of feature information compared with max-pooling and strided convolution.
Semantic segmentation is the dense prediction task that aims to predict the label of each pixel in an image. Most of the recent research on semantic segmentation [28], [29], [31], [32], [35], [36] employs encoder-decoder CNN architectures to perform this task. The fully convolutional network (FCN) [29] was the first encoder-decoder architecture; it used the VGG16 [30] classification network for segmentation after removing the final fully connected layers and adding an up-sampling layer as a decoder. SegNet [31] and U-Net [32] are the most popular encoder-decoder architectures. They employ encoding architectures to compress the input image to a latent vector and then construct the semantic segmentation predictions using a deconvolution decoder stage, with additional techniques such as sharing pooling locations between encoder and decoder in SegNet and skip connections between the encoder and decoder layers in U-Net. Chen et al. [33], [34], [35], [36] proposed four versions of their approach, DeepLab, which aims to perform semantic segmentation efficiently. In DeepLabV1 [33], the authors proposed the atrous algorithm to increase the receptive field of the convolution, and they also used a Conditional Random Field (CRF) to enable the model to learn the fine details of the objects in the image. In DeepLabV2 [34], they proposed multi-scale processing using Atrous Spatial Pyramid Pooling (ASPP) and replaced the VGG16 architecture with ResNet101 [37]. In DeepLabV3 [35], they improved the ASPP by adding different sampling rates and batch normalization layers, and they also showed that using 1 × 1 convolution is better than 3 × 3 convolution for eliminating the image boundary effect. Finally, in DeepLabV3+ [36], they proposed a depth-wise separable convolutional encoder using Aligned-Xception [38] and optimized the decoder stage to improve the accuracy of the segmentation learning process.

The recent state-of-the-art methods on semantic segmentation use transformers to model the task. Transformers deal with the image in a similar way to text, in which there is an inter-dependency between the words in a phrase: the transformer treats the image as a sequence of patches with inter-relations between those patches. Zheng et al. [39] proposed a sequence-to-sequence transformer-based method that performs semantic segmentation using an image patch encoder to model the global context of the input image and a simple decoder to provide the segmentation. Wu et al. [40] proposed fully transformer networks (FTN) for semantic segmentation using a pyramid group transformer as a convolutional transformer encoder. All the previously mentioned methods employ a complex implementation of the encoder and decoder networks, while the proposed method employs a simple encoder and decoder implementation and still outperforms the SOTA methods in semantic segmentation. The DS module has proved to be superior in feature decoding for dense prediction tasks (semantic segmentation and depth estimation), as applied in recent research [41], [42], [43].

III. PROPOSED METHOD
The proposed method depends on two main blocks: the SD layer, used as a down-sampling module similar to the pooling layers, and the DS layer, used as the decoder stage to merge the feature depth and up-sample the feature maps into a dense map of the same size as the input.

A. SPACE-TO-DEPTH AS AN ENCODING LAYER
The Space-to-Depth (SD) module was first proposed by Wang et al. [11] as a way of obtaining a dense representation of the optical flow for video super-resolution. In our proposed method, we employ it as a learnable spatial down-sampling layer similar to the pooling method. The SD module differs from pooling in that no feature compression is applied to the input feature maps; instead, the reduction in spatial size is converted to depth data via a pixel aggregation technique. This pixel aggregation converts input feature maps of shape rW × rH × C into feature maps of shape W × H × Cr² through a learnable aggregation process, which can be stated mathematically as follows:
where Y and X are the low-resolution output with extended depth channels and the high-resolution input of the SD layer, respectively, W_L and b_L are the weights and biases of the SD layer, W is the image width, H is the image height, C is the number of image channels, r is the down-sampling factor, and f is the activation function of the layer. This layer is applied five times in our proposed architectures. Each time, it reduces the spatial size by r = 2 in both the width and the height and increases the depth four times (r² = 4). Each time the input is down-sampled, convolutional layers, ReLU, and batch normalization are applied in an order that depends on the architecture.
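The rearrangement at the core of the SD layer can be illustrated with a minimal NumPy sketch. It covers only the parameter-free pixel aggregation, not the learnable convolution, activation, or batch normalization that the paper places around it. The function names `space_to_depth` and `max_pool`, the height-width-channel axis order, and the ordering of values inside the depth axis are assumptions made for illustration.

```python
import numpy as np

def space_to_depth(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange an (r*H, r*W, C) array into (H, W, C*r*r), keeping every value."""
    rh, rw, c = x.shape
    if rh % r or rw % r:
        raise ValueError("height and width must be divisible by r")
    h, w = rh // r, rw // r
    # Split each spatial axis into (coarse index, offset inside the r x r block).
    x = x.reshape(h, r, w, r, c)
    # Bring the two block offsets next to the channel axis.
    x = x.transpose(0, 2, 1, 3, 4)
    # Fold the block offsets into the depth axis.
    return x.reshape(h, w, r * r * c)

def max_pool(x: np.ndarray, r: int) -> np.ndarray:
    """Non-overlapping r x r max-pooling, shown only for comparison."""
    rh, rw, c = x.shape
    return x.reshape(rh // r, r, rw // r, r, c).max(axis=(1, 3))

x = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
y = space_to_depth(x, 2)   # shape (2, 2, 12): same number of values as x
pooled = max_pool(x, 2)    # shape (2, 2, 3): one value kept per 2 x 2 window
print(y.shape, y.size, pooled.shape, pooled.size)
```

With r = 2, both operations halve the width and height, but max-pooling keeps a quarter of the values while the SD rearrangement keeps all of them in the depth axis.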

B. DEPTH-TO-SPACE AS A DECODER NETWORK
The Depth-to-Space (DS) module was first proposed by Shi et al. [12] as a way of aggregating the pixels of the input features to obtain a higher-resolution image for the image super-resolution task. In our proposed method, we employ it as a one-stage learnable up-sampling decoder. This pixel aggregation converts the input feature maps of shape W × H × r² obtained from the encoder into a dense map of shape rW × rH through a learnable process, which can be stated mathematically as follows:
This layer is applied as the decoding stage in our proposed architectures. It up-samples the final feature maps obtained from the encoder stage by a factor of 32 (2⁵), which undoes the five down-sampling steps of the encoder, to obtain a dense prediction map of the same size as the input image.

119364 VOLUME 11, 2023. Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
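A minimal NumPy sketch of the DS rearrangement follows, again without the learnable 1 × 1 convolution that precedes it in the decoder. The function names, the height-width-channel axis order, and the single-channel output are assumptions for illustration. The sketch also checks that the DS rearrangement is the exact inverse of the SD rearrangement for the same factor r.

```python
import numpy as np

def space_to_depth(x: np.ndarray, r: int) -> np.ndarray:
    """(r*H, r*W, C) -> (H, W, C*r*r), with no value discarded."""
    rh, rw, c = x.shape
    h, w = rh // r, rw // r
    return x.reshape(h, r, w, r, c).transpose(0, 2, 1, 3, 4).reshape(h, w, r * r * c)

def depth_to_space(y: np.ndarray, r: int) -> np.ndarray:
    """(H, W, C*r*r) -> (r*H, r*W, C): spread each depth vector over an r x r block."""
    h, w, d = y.shape
    if d % (r * r):
        raise ValueError("depth must be divisible by r*r")
    c = d // (r * r)
    # Unfold the depth axis into (row offset, column offset, channel).
    y = y.reshape(h, w, r, r, c)
    # Interleave the offsets with the coarse spatial indices.
    y = y.transpose(0, 2, 1, 3, 4)
    return y.reshape(h * r, w * r, c)

# Toy encoder output: 2 x 3 feature maps with r*r = 16 channels, so r = 4.
features = np.random.default_rng(0).normal(size=(2, 3, 16))
dense = depth_to_space(features, 4)            # shape (8, 12, 1)
restored = space_to_depth(dense, 4)            # back to shape (2, 3, 16)
print(dense.shape, np.array_equal(restored, features))
```

In the paper's setting the same idea is used with r = 32, so 32² = 1024 low-resolution channels are needed to form one full-resolution map.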

C. PROPOSED ARCHITECTURES
We propose four architectures with different CNN configurations, compare their performance, and highlight the advantages of each one. The first is a simple CNN that applies max-pooling to reduce the spatial size of the input features, with blocks of two or three convolutional layers with ReLU activation followed by batch normalization (BN). The feature depths through the down-sampling stages are 3, 16, 64, 256, and 1024, and a global average pooling layer followed by a fully connected layer is added in the case of image classification.
In the case of semantic segmentation, the final dense map of size 32W × 32H is constructed by the DS decoder from 1024 low-resolution features of size W × H obtained by a 1 × 1 convolutional layer. This first convolutional architecture is shown in Figure 3-a. The second architecture is a CNN with the same configuration as the first, but with max-pooling replaced by a 3 × 3 strided convolution: we remove the final convolution of each of the three final blocks and set the stride to 2 in the final convolution of each block, as shown in Figure 3-b. The third architecture is SD-Net (SDDS-Net for segmentation), which has the same architecture as the first but applies the SD layer instead of MP to down-sample the spatial size of the input and extend the depth of the output features, as shown in Figure 3-c. The fourth architecture, SD-Net-DW (SDDS-Net-DW for segmentation), uses depth-wise separable convolution (DW) [49] and residual connections. Depth-wise separable convolution, proposed by Chollet [49], consists of a depth-wise convolution (a convolution applied to each channel separately) and a point-wise convolution (a 1 × 1 convolution that projects the depth of the features onto a smaller number of channels). DW convolution is much faster than normal convolution because it has fewer parameters: a fraction 1/D² + 1/N of those of a conventional convolution, where D is the filter size and N is the number of filters. We built this architecture using the depth-wise block (DW-block), which consists of repeated ReLU + DW convolution + BN as shown in Figure 3-e, gradually decreasing the spatial resolution and increasing the number of filters using SD as shown in Figure 3-d. As in the other architectures, a global average pooling layer followed by a fully connected layer is added at the end of the architecture in the case of classification. In the case of semantic segmentation, the final features are fed to a 1 × 1 convolution with 1024 filters and then to the DS decoder to construct a dense map of the same size as the input image. We compare the performance of the four proposed architectures in Section V (Results), showing that SDDS-Net and SDDS-Net-DW have much better accuracy than the MP-CNN and Strided-CNN. The proposed architectures for image classification and semantic segmentation are shown in Figure 3 and Figure 4, respectively. The difference is that for semantic segmentation, the decoder network (a 1 × 1 convolution followed by a depth-to-space layer) replaces the global average pooling and fully connected layers used for image classification.
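The parameter saving of depth-wise separable convolution can be checked with a few lines of arithmetic. The sketch below uses the standard weight counts (biases ignored) for a k × k convolution from `c_in` to `c_out` channels. The function names and the example layer sizes are hypothetical and are not taken from the paper's architectures.

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights of a depth-wise k x k convolution plus a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 projection across channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 256 input channels, 1024 output channels.
k, c_in, c_out = 3, 256, 1024
ratio = separable_params(k, c_in, c_out) / conv_params(k, c_in, c_out)
expected = 1 / k**2 + 1 / c_out
print(f"separable/standard = {ratio:.4f}, 1/k^2 + 1/c_out = {expected:.4f}")
```

For a 3 × 3 kernel the ratio is dominated by the 1/9 term, so the separable layer needs roughly one ninth of the weights of the standard layer when the number of output filters is large.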

D. LOSS FUNCTION
The loss function used for image classification is the categorical cross-entropy loss, as follows:
where q is the ground truth label, p is the predicted label, and i is an iterator over classes. The loss function used for learning semantic segmentation is the Huber loss (a function that operates like either the L1 loss or the L2 loss depending on a threshold value t), which is stated mathematically as:
where I is the ground truth pixel value and Ĩ is the predicted pixel value. The threshold value t is set to 1, since empirically this speeds up the training process. The L1 and L2 losses were also tested separately for training the proposed method in two different experiments; however, each one suffered from slow loss improvement at some point during training.
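For orientation, the two losses can be sketched in plain Python using their common textbook forms. This is an assumption: the exact constants and reduction (sum or mean over pixels) used in the paper are not recoverable from the text above. The function names and the small `eps` guard against log(0) are illustrative choices.

```python
import math

def categorical_cross_entropy(q, p, eps=1e-12):
    """-sum_i q_i * log(p_i) for a one-hot (or soft) label q and prediction p."""
    return -sum(qi * math.log(max(pi, eps)) for qi, pi in zip(q, p))

def huber(i_true: float, i_pred: float, t: float = 1.0) -> float:
    """Textbook Huber loss: quadratic (L2-like) below t, linear (L1-like) above."""
    d = abs(i_true - i_pred)
    if d <= t:
        return 0.5 * d * d
    return t * (d - 0.5 * t)

# One-hot label for class 1 and a prediction that puts 0.8 on it.
ce = categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
small_error = huber(0.0, 0.5)   # quadratic branch
large_error = huber(0.0, 3.0)   # linear branch
print(ce, small_error, large_error)
```

With t = 1, errors below one unit are penalized quadratically and larger errors linearly, which matches the description of the loss switching between L2-like and L1-like behavior.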

IV. EXPERIMENTS
We trained and tested our proposed method on image classification and semantic segmentation. For image classification, we trained and evaluated the proposed method on the CIFAR10 and Tiny ImageNet benchmarks. For semantic segmentation, we trained and evaluated the proposed models on the challenging NYU depth V2 and CITYSCAPES benchmarks to test the performance of the model on both indoor and outdoor scenes.

A. BENCHMARKS FOR IMAGE CLASSIFICATION EVALUATION
To evaluate the proposed encoding architectures using MP, strided convolution, and SD, we train and test the architectures on CIFAR10 [44] and Tiny ImageNet [45].

B. BENCHMARKS FOR SEMANTIC SEGMENTATION EVALUATION
The first benchmark for semantic segmentation evaluation on which we trained our models is NYU depth V2 [50]. The second benchmark is CITYSCAPES [51]. It consists of urban street scenes in Germany with their corresponding semantic segmentation maps in 19 different categories. The dataset contains 5000 fine-labeled images and 20,000 coarse-labeled images for semantic segmentation. We train and test our model on the fine-labeled images only, since we aim to predict fine and clear segmentation maps. The RGB images and the corresponding segmentation maps have a size of 2048 × 1024. We down-sample the images to 1024 × 512 to speed up the training process while keeping high-resolution predictions. The final dense prediction map is constructed at the same size as the input image using feature maps of size 32 × 16 × 1024.
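The feature-map size quoted above follows directly from the five SD stages with r = 2 described in Section III. The short sketch below traces the spatial size through the encoder and checks how many channels a single-stage DS decoder needs to recover the input resolution. The function name `trace_encoder` is hypothetical, and a single-channel dense output is assumed.

```python
def trace_encoder(width: int, height: int, stages: int = 5, r: int = 2):
    """Return the (width, height) after each of `stages` down-sampling steps."""
    shapes = [(width, height)]
    for _ in range(stages):
        if width % r or height % r:
            raise ValueError("spatial size must be divisible by r at every stage")
        width, height = width // r, height // r
        shapes.append((width, height))
    return shapes

cityscapes = trace_encoder(1024, 512)   # ends at (32, 16)
nyu = trace_encoder(640, 480)           # ends at (20, 15)

# One DS step with factor 2**5 = 32 needs 32 * 32 channels per output channel.
factor = 2 ** 5
channels_needed = factor ** 2           # 1024, matching the 32 x 16 x 1024 feature maps
print(cityscapes[-1], nyu[-1], channels_needed)
```

The 1024 channels produced by the 1 × 1 convolution are therefore exactly what a factor-32 depth-to-space step needs to produce a 1024 × 512 map.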

C. COMPARISON BETWEEN DIFFERENT CNN ARCHITECTURES
We compare our proposed architectures (SDDS-Net and SDDS-Net-DW) with architectures of similar configuration that use MP or strided convolution for down-sampling instead of the SD layer. The MP-based CNN architecture is similar to the SDDS-Net architecture, but with the SD down-sampling module replaced by MP, as shown in Figure 3-a. The strided convolution-based architecture is also similar, but we replace the SD module with a 3 × 3 convolution with a stride of 2 on both the horizontal and vertical axes, as shown in Figure 3-b. In the results section, we show that our proposed architecture with SD for down-sampling attains much better accuracy in dense predictions than those using MP and strided convolution.

D. TRAINING AND TEST CONFIGURATIONS
We train and test all models on a desktop computer with the same hardware configuration: an Nvidia RTX 3090 GPU, which has the Ampere architecture and 24 GB of high-speed GDDR6X memory, an Intel Core i7-8700 CPU with a 3.20 GHz clock speed, and 64 GB of RAM. All the proposed architectures were trained in the TensorFlow Keras environment for 1500 epochs with the Adam optimizer and standard image/mask augmentation. The training and test input image sizes for CIFAR10 and Tiny ImageNet are 32 × 32 and 64 × 64, respectively. The image size for NYU depth V2 is 640 × 480 and for CITYSCAPES is 1024 × 512.

V. RESULTS
In this section, we show the results obtained using the proposed method on CIFAR10, Tiny ImageNet, NYU depthV2, and CITYSCAPES benchmarks and we compare the obtained results with SOTA methods in image classification and semantic segmentation.

A. EVALUATION METRICS
We evaluate the classification performance using the classification accuracy (Acc.), given by the following equation:
where N is the number of classes, true positive (TP) is the number of pixels correctly predicted as the class, true negative (TN) is the number of pixels correctly predicted as not belonging to the class, false positive (FP) is the number of pixels incorrectly predicted as the class, and false negative (FN) is the number of pixels incorrectly predicted as not belonging to the class. We evaluate the semantic segmentation task using the mean intersection over union (mIOU), which is the area of intersection between the predicted mask P and the ground truth mask G over the area of the union of the two masks, as shown in the following equation:
We also evaluate the segmentation performance using the pixel accuracy (Pix. acc.), using equation (4) with the difference that accuracy is additionally measured for each pixel:
where P is the number of pixels in the segmentation mask.
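The segmentation metrics can be sketched in plain Python on flattened label masks. The definitions below are the common ones (per-class intersection over union averaged over the classes that occur, and the fraction of correctly labeled pixels); whether the paper averages in exactly this way is an assumption. The function names and the toy masks are illustrative.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label equals the ground truth label."""
    correct = sum(p == g for p, g in zip(pred, truth))
    return correct / len(truth)

def mean_iou(pred, truth, num_classes):
    """Average of |P ∩ G| / |P ∪ G| over the classes present in either mask."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred, truth))
        union = sum(p == c or g == c for p, g in zip(pred, truth))
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# Toy flattened masks with three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
acc = pixel_accuracy(pred, truth)    # 4 of 6 pixels correct
miou = mean_iou(pred, truth, 3)      # mean of 1/3, 2/3 and 1/2
print(acc, miou)
```

The example shows why the two metrics can differ: pixel accuracy counts every pixel equally, while mIOU weights every class equally regardless of how many pixels it covers.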

B. IMAGE CLASSIFICATION EVALUATION RESULTS
TABLE 3. Comparison between the proposed method with the different architectures and the SOTA methods on the CIFAR10 test set for image classification in terms of parameter count and top-1 accuracy. Note that the accuracies were copied from previous research [65].

We evaluated the proposed method on the CIFAR10 and Tiny ImageNet test sets. We trained the proposed models using two different training strategies. The first strategy uses Mixup [46] data augmentation (mixing two images using the alpha channel and reflecting the mixing values in the image labels by percentage) and Cutout [47] data augmentation (replacing a random patch of the image with a fixed value or random Gaussian noise), in addition to some standard augmentations (random horizontal flipping, random crop and resize, and random rotations). The second strategy uses RandAugment [48] (a sequential probabilistic policy of various augmentations with predefined probabilities for each transformation, where the transformations include translation, rotation, scale, shear, contrast, brightness, and others) and Cutout [47], in addition to the previously mentioned standard augmentations. The evaluation results on CIFAR10 show that SDDS-Net attains the best top-1 accuracy (94.28%) using the second augmentation strategy, and SDDS-Net-DW attains the second-best top-1 accuracy (91.81%). The evaluation results on Tiny ImageNet show that SDDS-Net-DW attains the best top-1 accuracy (72.25%) with the second augmentation strategy, and SDDS-Net attains the second-best top-1 accuracy (71.49%). SDDS-Net performs better than the other models on CIFAR10, which has a small number of labels, and SDDS-Net-DW performs better than the other models on Tiny ImageNet, which has a larger number of labels (200 labels).
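The two augmentations of the first strategy can be sketched with NumPy as follows. This is a simplified illustration, not the training pipeline used in the paper: the function names are hypothetical, the mixing weight `lam` is passed explicitly (in the original Mixup formulation it is drawn from a Beta distribution), and the Cutout patch position is given by the caller rather than sampled at random.

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam: float):
    """Blend two images and their one-hot labels with the same weight lam."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

def cutout(image: np.ndarray, top: int, left: int, size: int, fill: float = 0.0):
    """Return a copy of the image with a size x size patch replaced by `fill`."""
    out = image.copy()
    out[top:top + size, left:left + size, :] = fill
    return out

# Two toy 32 x 32 RGB images (CIFAR10 input size) with one-hot labels for 2 classes.
img_a, label_a = np.zeros((32, 32, 3)), np.array([1.0, 0.0])
img_b, label_b = np.ones((32, 32, 3)), np.array([0.0, 1.0])

mixed_img, mixed_label = mixup(img_a, label_a, img_b, label_b, lam=0.75)
patched = cutout(img_b, top=4, left=4, size=8)
print(mixed_img[0, 0], mixed_label, patched.sum())
```

The mixed label carries the same proportions as the mixed image, which is what "reflecting the mixing values in the image labels by percentage" refers to.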

C. SEMANTIC SEGMENTATION EVALUATION RESULTS
We tested our proposed architectures on both NYU depth V2 and CITYSCAPES benchmarks for the task of semantic segmentation.

D. COMPARISON WITH SOTA METHODS ON IMAGE CLASSIFICATION
We compare the proposed classification models with space-to-depth down-sampling (SD-Net and SD-Net-DW) with the SOTA methods on CIFAR10 and Tiny ImageNet classification. On CIFAR10 (Table 3), the proposed models outperform most of the SOTA methods except for VGG16, ResNet50, and InceptionV3, which demonstrates the strong performance of the proposed architectures. Table 4 shows a comparison between the proposed models (SD-Net and SD-Net-DW) and the SOTA methods on Tiny ImageNet classification. On this dataset, SD-Net-DW outperforms the SOTA methods with a top-1 accuracy of 72.25%. This differs from the results obtained on CIFAR10, where SD-Net surpassed SD-Net-DW in accuracy. These results suggest that each architecture has an advantage in a specific setting: SD-Net-DW learns a larger number of classes with high accuracy (Tiny ImageNet), whereas SD-Net learns a smaller number of classes (CIFAR10) with much better accuracy than SD-Net-DW.

E. COMPARISON WITH SOTA METHODS ON SEMANTIC SEGMENTATION
We compared the proposed method with the SOTA methods on semantic segmentation. SDDS-Net outperforms the SOTA methods on the NYU depth V2 semantic segmentation task in terms of Pix. acc., as shown in Table 5, even though our proposed architectures are much simpler than those of the SOTA methods. SDDS-Net-DW outperforms almost all of the SOTA methods, the exception being the method proposed by Wang et al. [58]. SDDS-Net-DW also outperforms the SOTA methods on the CITYSCAPES semantic segmentation task, including attention-based methods such as EANet [62] and HRNetV2 [61] and transformer-based methods such as Trans4Trans [63] and SETR-PUP [39] (transformers are currently among the most effective architectures in deep learning), as shown in Table 6, with an acceptable processing speed (∼12 fps). Both SDDS-Net and SDDS-Net-DW attain higher speeds (12.5 fps for SDDS-Net and 11.63 fps for SDDS-Net-DW) than all other methods in the comparison.

VI. CONCLUSION
The proposed architectures can efficiently learn image classification as a result of using the powerful SD module for lossless image down-sampling instead of the traditional pooling and strided convolution methods. They can also learn the dense prediction task of semantic segmentation as a result of using the SD module in the encoder stage and the DS module for up-sampling in the decoder stage. The evaluation results on the CIFAR10 and Tiny ImageNet classification tasks show the superior performance of the proposed SD module for down-sampling (SD-Net attains 94.28% and 71.49% on CIFAR10 and Tiny ImageNet, respectively, and SD-Net-DW attains 91.81% and 72.25% on the same benchmarks), in addition to the efficient design of the architectures. The proposed SD-Net and SD-Net-DW outperform the SOTA methods in image classification while using a relatively small number of parameters. Additionally, the proposed SDDS-Net and SDDS-Net-DW perform segmentation at high speed, which is convenient for real-time applications. The strength of the proposed method is shown by the evaluation results on NYU depthV2 and CITYSCAPES semantic segmentation (SDDS-Net attains a Pix. acc. of 78.55% on NYU depthV2 and 85.01% on CITYSCAPES, and SDDS-Net-DW attains a Pix. acc. of 75.26% on NYU depthV2 and 87.9% on CITYSCAPES), as the proposed architectures based on the SD and DS modules outperform the SOTA methods on both benchmarks. In this work, we focused on optimizing the encoder of the encoder-decoder model for semantic segmentation. In future work, we aim to design an architecture that uses multiple stages of the depth-to-space module in the decoder instead of the single stage applied in the architectures proposed in this paper. We think this approach can improve the performance of semantic segmentation models.

FIGURE 1. The general architecture of the proposed method with the Space-to-Depth (SD) encoder and Depth-to-Space (DS) decoder for the task of semantic segmentation.

FIGURE 2. The two main modules in our proposed method. (a) Space-to-Depth (SD), which is used to down-sample the input feature map of size rW × rH × C to a lower-resolution map of size W × H × Cr² via a learnable process. (b) Depth-to-Space (DS), which is used as the decoder stage in our method to up-sample the input low-resolution feature map of size W × H × r² to a higher-resolution map of size rW × rH through a learnable process.

FIGURE 3. The proposed architectures for image classification: (a) CNN architecture with max-pooling (MP) for down-sampling, (b) CNN architecture with strided convolution for down-sampling (s refers to both the vertical and the horizontal strides), (c) CNN architecture with Space-to-Depth (SD) for down-sampling, (d) CNN with depth-wise separable convolution and SD for down-sampling, and (e) the depth-wise block (DW-block) used in architecture (d), which consists of a single repetition of ReLU followed by depth-wise separable convolution and batch normalization.

FIGURE 4. The proposed architectures for semantic segmentation: (a) CNN architecture with max-pooling (MP) for down-sampling and DS decoder, (b) CNN architecture with strided convolution for down-sampling (s refers to both the vertical and the horizontal strides) and DS decoder, (c) CNN architecture with Space-to-Depth (SD) for down-sampling and DS decoder, (d) CNN with depth-wise separable convolution, SD for down-sampling, and DS decoder, and (e) the depth-wise block (DW-block) used in architecture (d), which consists of a single repetition of ReLU followed by depth-wise separable convolution and batch normalization.

FIGURE 5. Comparison between the outputs obtained from each architecture: (a) Input RGB image.(b) The ground truth semantic segmentation mask.(c), (d), (e), and (f) are the predicted semantic segmentation masks from the following configurations: max-pooling (MP) based CNN architecture, the strided convolution-based CNN architecture, SDDS-Net (space-to-depth (SD) based CNN Architecture), and SDDS-Net-DW (SD and DW based architecture), respectively.

FIGURE 6. Sample results obtained from the proposed architectures (SDDS-Net and SDDS-Net-DW). The columns from left to right show the input image, the ground truth segmentation map, the SDDS-Net predicted segmentation map, and the SDDS-Net-DW predicted segmentation map. The first to fourth rows show samples from the NYU depthV2 test dataset, and the fifth and sixth rows show samples from the CITYSCAPES test dataset.

TABLE 1. Comparison between the obtained accuracy between the proposed architectures (MP-based CNN, Strided Conv. based CNN, SDDS-Net, and SDDS-Net-DW) on both CIFAR10 and Tiny ImageNet benchmarks reporting the model's parameters count (PC) in Millions for each model. Note that Strategy 1 is Mixup+Cutout and Strategy 2 is Randaug+Cutout.

TABLE 2. Comparison between the obtained mIOU and speed in fps between the proposed architectures (MP-based CNN, Strided Conv. based CNN, SDDS-Net, and SDDS-Net-DW) on both NYU depthv2 and Cityscapes benchmarks reporting the model's parameters count (PC) in Millions for each model.

SDDS-Net-DW attained the best mIOU (82.35%) at a speed of 11.63 fps on CITYSCAPES, and SDDS-Net attained an mIOU of 80.12% at 12.5 fps. These results suggest that SDDS-Net is better at learning a smaller number of segmentation classes (the 14 classes of NYU) than SDDS-Net-DW, which can efficiently learn the 19 classes of the CITYSCAPES benchmark.

TABLE 4. Comparison between the proposed method with the different architectures and the SOTA methods on the Tiny ImageNet validation set for image classification in terms of parameter count and top-1 accuracy. Note that the accuracies were copied from previous research [66].

TABLE 5. Comparison between the proposed method with the different architectures and SOTA methods on NYU depth V2 semantic segmentation test benchmark.

TABLE 6. Comparison between the proposed method with the different architectures and the SOTA methods on CITYSCAPES semantic segmentation validation benchmark.