A Semantic Segmentation Network Simulating the Ventral and Dorsal Pathways of the Cerebral Visual Cortex

To address the problem of spatial information loss in the semantic segmentation process, we propose a semantic segmentation network, termed the ventral and dorsal network (VDNet), which simulates the ventral and dorsal pathways of the cerebral visual cortex. The ventral pathway network focuses on extracting semantic information, and the dorsal pathway network focuses on extracting spatial information. We use the semantic enhancement module (SEM) in the ventral pathway network to fuse information of different scales to enhance the extraction of semantic information, and we use the spatial attention module (SAM) in the dorsal pathway network to assign weights to different locations in space to enhance the extraction of spatial information. By fusing the information of the two pathways, the final semantic segmentation result is obtained. Since the dorsal pathway network specifically enhances the extraction of spatial information, the problem of spatial information loss during segmentation is effectively alleviated, and higher segmentation accuracy can be achieved using only a small backbone network. On the CamVid, Cityscapes and PASCAL VOC 2012 datasets, we achieve mean intersection over union (mIoU) scores of 82.1%, 77.8%, and 81.0%, respectively, which verifies the effectiveness of the proposed method.


I. INTRODUCTION
Semantic segmentation refers to classifying the objects in an image at the pixel level to segment different objects. Semantic segmentation is one of the basic tasks of computer vision, and it has a wide range of applications in medical image processing, remote sensing classification, defect detection, autonomous driving, video surveillance, etc.
Deep convolutional neural networks (DCNNs) [1] have powerful feature extraction capabilities. DCNNs gradually acquire rich semantic information through the superposition of convolutional layers and a continuous downsampling process. This is very useful for image classification tasks, and the abilities of current classification models [2] based on DCNNs have even surpassed those of humans. However, the semantic segmentation task not only needs to obtain the semantic information of each pixel but also needs to obtain spatial information such as the positions and shapes of objects. DCNNs inevitably lose spatial information during the downsampling process, which affects the segmentation accuracy.
(The associate editor coordinating the review of this manuscript and approving it for publication was Hossein Rahmani.)
To obtain higher segmentation accuracy, how to maintain spatial information while obtaining semantic information becomes an important consideration. There are currently three main ways to maintain spatial information: (1) Limiting the downsampling factor. Compared with common models that downsample by a factor of up to 32, DeepLabV3 [3] limits downsampling to a factor of 16, and ENet [4] limits it to a factor of 8. This method is beneficial for preserving spatial information, but it also prevents the network from obtaining a larger receptive field, thereby reducing the richness of semantic information. Although DeepLabV3 and ENet both use atrous convolution to compensate for the lack of large receptive fields, its effect is limited due to the sparsity of the atrous convolution kernel.
(2) U-shaped structure. The FCN [5] is a pioneering work applying deep neural networks to semantic segmentation tasks. In the decoding stage, FCN-8s gradually merges 16 times and 8 times downsampled feature maps in the encoding stage to retrieve lost spatial information. U-Net [6] has made further improvements on the basis of the FCN. The network is designed as a symmetrical and complete U-shaped structure. In each stage of the decoder, the feature maps from the corresponding stages of the encoder are integrated. ExFuse [7] improves the U-shaped structure and proposes the semantic embedding branch (SEB) to involve more semantic information from high-level features to guide the feature fusion.
(3) Multipath structure. BiSeNet [8] uses an independent, shorter path to obtain 8 times downsampled feature maps to supplement the missing spatial information. ICNet [9] uses three processing paths that take the original-resolution, 1/2-resolution and 1/4-resolution images as input, obtaining feature maps with 1/8, 1/16, and 1/32 resolution, respectively, which can also supplement the spatial information. However, using the U-shaped structure or multipath structure alone has a limited ability to recover spatial information.
From a biological point of view, according to the research of Mishkin et al. [10], the visual cortex of monkeys is divided into two pathways. One leads to the ventral side of the brain, termed the ventral pathway, which is mainly responsible for object recognition; the other leads to the dorsal side of the brain, termed the dorsal pathway, which is mainly responsible for sensing the spatial positions of objects. We found that the functions of these two pathways correspond, respectively, to the semantic information and spatial information needed in the semantic segmentation task. Therefore, inspired by the ventral and dorsal processing pathways of the cerebral visual cortex, we design a novel semantic segmentation model with a two-pathway structure, which we call the ventral and dorsal network (VDNet). The experimental results show that the model can effectively alleviate the problem of spatial information loss while acquiring rich semantic information, obtaining more precise segmentation results than other methods.
Our contributions mainly include the following aspects: (1) We design two pathways, one long and one short, to simulate the ventral and dorsal pathways of the cerebral visual cortex, respectively, to extract semantic information and spatial information and combine the U-shaped structure to obtain richer spatial information. Finally, the two pathways of information are fully fused to obtain high-precision segmentation results.
(2) We propose the semantic enhancement module (SEM), which enhances the extraction of semantic information from the ventral pathway by combining information of different scales.
(3) We propose the spatial attention module (SAM), which enhances the extraction of spatial information from the dorsal pathway by assigning weights to different positions in space.

II. RELATED WORK
A. SIMULATION OF THE BRAIN'S VISUAL MECHANISM
The biological visual system includes the retina, the lateral geniculate nucleus, the visual cortex, etc. The visual cortex includes the ventral pathway, composed of functional areas such as V1, V2, V4, and IT, and the dorsal pathway, composed of functional areas such as V1, V2, and V5. Kubilius et al. [11] proposed CORnet-S, a deep convolutional neural network that simulates the ventral pathway of the cerebral visual cortex and the recurrence of some of its functional areas for image classification. Shibuya and Hotta [12] proposed a neural network model called Feedback U-Net for semantic segmentation. They recurrently run data through the entire network and use convolutional LSTM (long short-term memory) to fuse the data of two adjacent recurrences. The experimental results of these two models demonstrate that simulating the visual mechanism of the brain is effective.

B. SEMANTIC INFORMATION EXTRACTION
Semantic information is one of the two types of basic information required by semantic segmentation tasks; it refers to the correct class to which each pixel in the image belongs. Current semantic segmentation models mostly use transfer learning: classification models that have been trained on large-scale datasets such as ImageNet [13] serve as pretrained models and, after the last fully connected layers are removed, are used as backbone networks. This method enables better initialization of the network parameters, which is very helpful for the extraction of semantic information. Generally, the better the classification performance of the backbone network, the higher the accuracy of semantic segmentation. Commonly used backbone networks include VGG [14], ResNet [2], Xception [15], ResNeXt [16], EfficientNet [17], etc. These models have been proven to achieve good results on image classification tasks, and many semantic segmentation models [18]–[31] that use them as backbone networks have also obtained good results.

C. SPATIAL INFORMATION EXTRACTION
Spatial information, which refers to the position and shape of each object in the image, is another type of basic information required by semantic segmentation tasks. Convolutional neural networks gradually lose spatial information in the downsampling process. At present, the main methods used to recover the missing spatial information are the following: (1) Limiting the number of downsampling operations, as in ENet, DeepLabV3, etc. However, this approach is not conducive to the extraction of semantic information.
(2) Using the U-shaped structure to integrate the high-resolution feature maps, such as FCN-8s, U-Net, ExFuse, etc.
(3) Using additional pathways to extract spatial information, such as BiSeNet, ICNet, etc. However, using only a U-shaped structure or additional paths provides a limited ability to restore spatial information.

D. ATTENTION MECHANISM
As we all know, instead of paying equal attention to the entire field of vision, human eyes always focus on a certain part when observing things, which helps us quickly obtain more important visual information. Hu et al. [32] proposed the squeeze-and-excitation (SE) block, which is a channel attention mechanism. The weight of each channel is obtained through global average pooling and the nonlinear calculation of the feature map, thus paying more attention to the more important channels. Woo et al. [33] proposed a two-dimensional spatial attention method, which assigns weights to each spatial position of the input feature map by calculating a two-dimensional feature map so that more attention can be paid to important positions in space. Wang et al. [34] proposed an integrated attention mechanism to distribute the probability weights of the key features, thereby enhancing the selection of features in the model. Cai and Wei [35] proposed a cross-attention mechanism that assigns horizontal weight coefficients to each row of features through a horizontal attention mechanism and assigns vertical weight coefficients to each column of features through a vertical attention mechanism. In addition, the differences between features are enlarged through the weight multiplication and maximum matching strategies. You et al. [36] constructed a graph convolutional network (GCN) under the attention mechanism such that the features that have high-probability weights are related to each other. The application of these attention mechanisms has brought significant performance improvements to the models.
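To make the channel attention idea described above concrete, the following is a minimal PyTorch sketch of an SE-style block; the reduction ratio and layer sizes are illustrative assumptions, not the exact configuration of [32]:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention (sketch).
    Global average pooling squeezes each channel to a scalar; two small
    fully connected layers then produce per-channel weights in (0, 1)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pooling
        return x * w.view(b, c, 1, 1)     # excite: re-weight each channel
```

The spatial attention of [33] follows the same pattern but produces one weight per spatial position instead of one per channel.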
The above studies clearly show that semantic information and spatial information are two types of basic information indispensable in the semantic segmentation process, and it is difficult for existing semantic segmentation models to take both into account. The model we propose simulates the visual mechanism of the brain using two pathways to extract semantic information and spatial information, respectively, thereby reducing the information loss. In addition, we applied the channel attention mechanism and the spatial attention mechanism in the two pathways, respectively, to strengthen the extraction of semantic information and spatial information. Experimental results show that the model proposed in this paper can extract both semantic information and spatial information.

III. VENTRAL AND DORSAL NETWORK
A. NETWORK ARCHITECTURE
The structure of VDNet, which is an encoder-decoder structure, is shown in Fig. 1. The encoder contains two pathways. We use the names of the two pathways in the cerebral visual cortex and call them the ventral pathway and the dorsal pathway. The two pathways input images independently. The ventral pathway has a longer path, which is mainly responsible for extracting semantic information and uses the SEM to enhance the extraction of information. The dorsal pathway has a shorter path and retains feature maps with a larger resolution, is mainly responsible for extracting spatial information, and uses the SAM to enhance the extraction of information. The decoder gradually merges the feature maps of the ventral pathway and the dorsal pathway from top to bottom to obtain the final segmentation result.

B. VENTRAL PATHWAY
The ventral pathway is composed of a pretrained backbone network and the SEM. VP-1, VP-2, VP-3, and VP-4 are different stages of the backbone network, and the 4 times, 8 times, 16 times, and 32 times downsampled feature maps are obtained, respectively. The ventral pathway is longer, and rich semantic information can be gradually obtained through more convolution and downsampling operations.
To enhance the semantic information extraction capability of the ventral pathway, we propose the SEM. The SEM is an improvement on the atrous spatial pyramid pooling (ASPP) module [19]. The ASPP uses convolution kernels with different atrous rates to capture information at different scales; the larger the atrous rate of the convolution kernel, the larger the receptive field. A small receptive field that falls on a small target can extract useful information, but it cannot extract enough information when it falls on a large target. A large receptive field can extract more complete information from large targets, but it may also cover multiple targets of different classes at the same time, resulting in an inability to extract useful information. In Fig. 2, we use boxes to represent receptive fields of different sizes. The small solid box in Fig. 2a and the large solid box in Fig. 2b each cover most of the area of a single object, so effective semantic information can be extracted. However, the small dashed box in Fig. 2a covers only a small part of the building, and the large dashed box in Fig. 2b covers multiple types of objects, such as trees, pedestrians, buildings, sky, and poles. As a result, it is difficult to extract effective semantic information in both cases.
To mitigate the shortcomings of each, it is necessary to make a trade-off between receptive fields of different sizes. Therefore, we introduce the channel attention mechanism into the ASPP. Through continuous learning, different weights are assigned to the feature maps from convolutional layers with different receptive fields so that the network can pay more attention to the important receptive fields and achieve the best information extraction effect.
The structure of the SEM is shown in Fig. 3. The SEM includes five parallel processing paths: a 1 × 1 convolution, global average pooling, and three atrous convolutions with different atrous rates, which obtain receptive fields of different scales, from local to global. ''1 × 1 Conv'' represents the convolution operation with the 1 × 1 convolution kernel, and ''3 × 3 Conv'' represents that with the 3 × 3 convolution kernel. ''BN'' represents batch normalization [37]. ''Rate'' represents the atrous rate of the convolution. ''GAPooling'' represents global average pooling. ''Upsample'' represents bilinear interpolation upsampling. ''Concat'' represents the concatenation operation in the channel dimension. We use the Swish [38] activation function. The structure of the channel attention module (CAM) is shown in Fig. 4, where the ''×'' symbol represents the element-wise product of two feature maps, and the ''+'' symbol represents the sum of two feature maps.
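A minimal PyTorch sketch of an SEM-style module follows. The atrous rates (6, 12, 18), the SE-style channel attention, and the channel widths are our own illustrative assumptions, since the exact hyperparameters appear only in the figures:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEM(nn.Module):
    """Semantic enhancement module (sketch): ASPP-style parallel branches
    followed by channel attention that re-weights the concatenated features."""
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, r):  # conv -> BN -> Swish (Swish == SiLU in PyTorch)
            pad = 0 if k == 1 else r
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch), nn.SiLU())
        self.conv1x1 = branch(1, 1)
        self.atrous = nn.ModuleList([branch(3, r) for r in rates])
        self.gap = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(in_ch, out_ch, 1, bias=False), nn.SiLU())
        n = out_ch * 5
        # SE-style channel attention over the five concatenated branches.
        self.cam = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                 nn.Conv2d(n, n // 4, 1), nn.SiLU(),
                                 nn.Conv2d(n // 4, n, 1), nn.Sigmoid())
        self.project = nn.Conv2d(n, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [self.conv1x1(x)] + [b(x) for b in self.atrous]
        feats.append(F.interpolate(self.gap(x), size=(h, w),
                                   mode='bilinear', align_corners=False))
        y = torch.cat(feats, dim=1)
        y = y * self.cam(y) + y   # weight the branch channels, keep a residual
        return self.project(y)
```

In this sketch, the learned channel weights let the network emphasize whichever receptive-field branch is most informative, which is the trade-off described above.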

C. DORSAL PATHWAY
The dorsal pathway is composed of the pretrained ResNet18 and the SAM. We use only the part of ResNet18 up to 8 times downsampling, represented by DP-1 and DP-2. The dorsal pathway is shorter, so it can retain richer spatial information, such as the boundaries of objects. To enable the network to learn the importance of different locations in space and thereby enhance the representation of spatial information by the dorsal pathway, we improve the original spatial attention module [33] and propose our SAM.
As shown in Fig. 5, we visualize the intermediate feature maps before and after the SAM in the dorsal pathway. Fig. 5b shows the ability of the dorsal pathway to extract and retain spatial information such as object boundaries; the boundaries of objects such as cars, buildings, traffic lights, and poles are effectively extracted. From Fig. 5c, it can be seen that the boundary information has been enhanced, which reflects the SAM's ability to strengthen spatial information.
The structure of the SAM is shown in Fig. 6, where ''Mean'' represents calculating the average value in the channel dimension to obtain a single-channel feature map, and ''Max'' represents taking the maximum value in the channel dimension to obtain a single-channel feature map. On the basis of the original spatial attention module, we add a branch that uses a 3 × 3 convolution to output a feature map with two channels to enhance the learning ability of the module. After concatenating the feature maps of these three branches, a 3 × 3 convolution is used to output a single-channel feature map. After being passed through the sigmoid activation function, this feature map is multiplied with the input feature map of the module, and the result is then added to the input to obtain the final output feature map of the module.
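The SAM described above can be sketched in PyTorch as follows (channel widths are illustrative):

```python
import torch
import torch.nn as nn

class SAM(nn.Module):
    """Spatial attention module (sketch). Three branches produce per-position
    statistics; a 3x3 conv fuses them into a single-channel attention map."""
    def __init__(self, in_channels):
        super().__init__()
        # Added branch: a 3x3 conv producing a learned 2-channel descriptor.
        self.branch = nn.Conv2d(in_channels, 2, kernel_size=3, padding=1)
        # Fuse mean, max, and the 2-channel descriptor (4 channels in total).
        self.fuse = nn.Conv2d(4, 1, kernel_size=3, padding=1)

    def forward(self, x):
        mean_map = x.mean(dim=1, keepdim=True)   # channel-wise mean
        max_map, _ = x.max(dim=1, keepdim=True)  # channel-wise max
        learned = self.branch(x)                 # learned 2-channel map
        attn = torch.sigmoid(
            self.fuse(torch.cat([mean_map, max_map, learned], dim=1)))
        return x * attn + x                      # multiply, then add the input
```

The final `x * attn + x` mirrors the figure: positions with high attention are amplified while the residual addition preserves the original features.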

D. DECODER
We believe that the ability of the decoder to recover spatial information through learning is limited, so we focus on fusion with high-resolution feature maps rich in spatial information. The decoder includes three fusion modules (FMs) and four segmentation heads (SegHeads). The feature maps of the ventral pathway and the dorsal pathway are gradually merged from top to bottom, thereby effectively fusing the semantic information and spatial information of the image. The structure of an FM is shown in Fig. 7. FM-1 has two inputs, while FM-2 and FM-3 each have three. The SegHead follows FM-3. We use a simple design for the SegHead, and its structure, with only two convolution layers, is shown in Fig. 8. After the SegHead, the segmentation result at the input image size is obtained through 4 times bilinear interpolation upsampling. In addition to the SegHead after FM-3, we also add SegHeads after the SEM, the SAM, FM-1, and FM-2 to obtain segmented images for calculating the auxiliary losses.
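A possible sketch of the two-convolution SegHead follows; the intermediate channel width and the normalization/activation choices are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Segmentation head (sketch): two convolution layers, as in Fig. 8,
    followed by bilinear upsampling to the target resolution."""
    def __init__(self, in_ch, num_classes, mid_ch=64):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.SiLU())
        self.conv2 = nn.Conv2d(mid_ch, num_classes, 1)  # per-class logits

    def forward(self, x, out_size):
        x = self.conv2(self.conv1(x))
        # Bilinear interpolation back to the input image size (4x in the paper).
        return F.interpolate(x, size=out_size, mode='bilinear',
                             align_corners=False)
```

Each auxiliary SegHead would be used the same way, only attached to a different feature map.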

E. LOSS FUNCTION
Next, we design the loss function. As shown in equation (1), we use a joint loss to train the network:

L = λ_1 L_1 + λ_2 L_2 + λ_3 L_3 + λ_4 L_4 + λ_5 L_5    (1)

L_1, L_2, L_3, L_4, and L_5 are the respective losses calculated using the feature maps output by the SegHeads after FM-3, FM-2, FM-1, the SEM, and the SAM. As shown in equation (2), each of these losses uses the focal loss [39]:

L_j = -Σ_{i=1}^{N} p(x_i) (1 - q(x_i))^β log(q(x_i)),  j = 1, ..., 5    (2)
N represents the total number of classes contained in the image. p(x_i) indicates whether a pixel belongs to the ith class: if it does, p(x_i) equals 1; otherwise, it is 0. q(x_i) represents the probability, predicted by the network, that the pixel belongs to the ith class. The value of β is set to 2.
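As a concrete illustration, the per-pixel focal loss described above can be computed as follows (a NumPy sketch for a single pixel, not the training implementation):

```python
import numpy as np

def focal_loss(q, p, beta=2.0, eps=1e-8):
    """Focal loss for one pixel (sketch).
    p: one-hot ground-truth vector over N classes (p(x_i) in the text);
    q: predicted class probabilities (q(x_i)); beta: focusing parameter."""
    q = np.clip(q, eps, 1.0)  # avoid log(0)
    return float(-np.sum(p * (1.0 - q) ** beta * np.log(q)))
```

The (1 - q)^β factor down-weights well-classified pixels (q near 1), so training focuses on hard pixels.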
The later a SegHead is located in the network, the richer the information its features contain, the greater its importance, and thus the greater the weight of its loss. In the formula, λ_1, λ_2, λ_3, λ_4, and λ_5 are the weights of the respective losses, and we set them to 1, 0.8, 0.6, 0.4, and 0.2, respectively.
Using the joint loss function shortens the gradient transfer distance, alleviates the vanishing gradient problem, accelerates the convergence of the model, and improves the final segmentation accuracy.
IV. EXPERIMENTS
A. DATASETS AND EVALUATION METRICS
The CamVid dataset is a road scene semantic segmentation dataset with 701 images in total, including 367 images in the training set, 101 images in the validation set, and 233 images in the test set. We use 11 segmentation classes, including road, pedestrian, car, building, etc. The remaining classes are all grouped into a 12th class and are ignored when calculating the segmentation accuracy. The resolution of the images is 720 pixels × 960 pixels. We use the training set and validation set to train our model and the test set for testing.
The Cityscapes dataset is a semantic segmentation dataset with 5000 finely annotated images, including 2975 images in the training set, 500 images in the validation set, and 1525 images in the test set. We use 19 segmentation classes including roads, buildings, sky, person, traffic light, etc. The remaining classes are all classified into the 20th class and are ignored when calculating the segmentation accuracy. The resolution of the images is 1024 pixels × 2048 pixels. We use the training set to train the model, the validation set to verify the model, and the test set to test the model. We upload the prediction results on the test set to the evaluation server of Cityscapes to obtain the segmentation accuracy.
The PASCAL VOC 2012 dataset involves 20 object classes and one background class. The original training set has 1464 images. The augmented training set with 10582 images is provided by [44]. The validation and test sets have 1449 and 1456 images, respectively. We employ the augmented training set and the validation set for training, and the test set for testing.
We use random horizontal flips and randomly scale the input images in training to perform data augmentation. We use 7 scales {0.5, 0.75, 1, 1.25, 1.5, 1.75, and 2} for all datasets. In addition, we employ multiscale and random horizontal flip testing to improve the segmentation accuracy. We use 6 scales {0.75, 1, 1.25, 1.5, 1.75, and 2}.
We use the pixel accuracy (pixAcc) and the mean intersection over union (mIoU) as the evaluation metrics. The mIoU is the more stringent metric since it penalizes false positive predictions; therefore, we use the mIoU as the main evaluation metric. The pixAcc is calculated as shown in equation (3), and the mIoU as shown in equation (4).

pixAcc = Σ_i x_ii / Σ_i T_i    (3)

mIoU = (1/N) Σ_i x_ii / (T_i + Σ_j x_ji - x_ii)    (4)

N represents the number of classes contained in the image; T_i represents the total number of pixels of the ith class; x_ii represents the number of pixels whose actual class is i and whose predicted class is also i, that is, the pixels that are predicted correctly; and x_ji represents the number of pixels whose actual class is j but whose predicted class is i, that is, the pixels incorrectly predicted as class i.
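Both metrics can be computed from a confusion matrix, as in this sketch:

```python
import numpy as np

def pixacc_miou(pred, gt, num_classes):
    """pixAcc and mIoU from label maps (sketch of equations (3) and (4)).
    pred, gt: integer arrays of class indices with the same shape."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, g in zip(pred.ravel(), gt.ravel()):
        conf[g, p] += 1                    # rows: ground truth, cols: prediction
    correct = np.diag(conf)                # x_ii
    total = conf.sum(axis=1)               # T_i, ground-truth pixels per class
    pixacc = correct.sum() / conf.sum()
    # union_i = T_i + (pixels predicted as i) - x_ii
    union = total + conf.sum(axis=0) - correct
    iou = correct / np.maximum(union, 1)   # guard against empty classes
    return pixacc, iou.mean()
```

A false positive enlarges the union of some class without enlarging its intersection, which is why mIoU is the stricter metric.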

B. IMPLEMENTATION DETAILS
We use the deep learning framework PyTorch to train and test the model on an NVIDIA GeForce RTX 3090 card and an NVIDIA GeForce GTX 1080 Ti card. We use Adam as the optimizer with a weight decay of 1 × 10^-4. We combine the ''warmup'' and ''cosine annealing'' [45] strategies, and the learning rate is updated according to equation (5).
lr = lr_max · iters / iters_warm,  T < T_warm
lr = lr_min + (lr_max - lr_min) · (1 + cos(π(T - T_warm)/(T_max - T_warm))) / 2,  T ≥ T_warm    (5)

lr_max represents the maximum value of the learning rate; lr_min represents the minimum value of the learning rate and is set to 1/100 of lr_max; iters represents the current number of iterations; T represents the current epoch number; T_max represents the total number of training epochs; T_warm represents the number of ''warmup'' epochs and is set to 5; and iters_warm represents the number of iterations corresponding to the warmup epochs.
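The warmup-plus-cosine-annealing strategy can be sketched as follows. This is a standard formulation consistent with the parameters described in the text; iters_per_epoch and the default total epoch count are illustrative assumptions:

```python
import math

def learning_rate(T, iters, lr_max, T_warm=5, T_max=200, iters_per_epoch=100):
    """Warmup + cosine annealing schedule (sketch).
    Linear warmup over the first T_warm epochs, then cosine decay from
    lr_max down to lr_min = lr_max / 100 over the remaining epochs."""
    lr_min = lr_max / 100.0
    iters_warm = T_warm * iters_per_epoch
    if T < T_warm:
        # Linear warmup: ramp from 0 to lr_max over the warmup iterations.
        return lr_max * iters / iters_warm
    # Cosine annealing from lr_max (at T_warm) to lr_min (at T_max).
    progress = (T - T_warm) / (T_max - T_warm)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

At the end of warmup the schedule reaches lr_max exactly, and at the final epoch it decays to lr_min.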
On the CamVid dataset, we use the original images with a resolution of 720 pixels × 960 pixels and conduct 200 epochs of training. We use ResNet18, ResNet50 and ResNeXt50 as the respective backbone networks. The training batch size is set to 4. The maximum value of the learning rate is set to 4 × 10^-5.
On the Cityscapes dataset, we use ResNeXt50 as the backbone network. We first scale the images to a resolution of 512 pixels × 1024 pixels and conduct 200 epochs of training. The training batch size is set to 12, and the maximum value of the learning rate is set to 6 × 10^-5. Then we use the original images with a resolution of 1024 pixels × 2048 pixels and conduct 100 epochs of training. The training batch size is set to 3, and the maximum value of the learning rate is set to 6 × 10^-6.
On the PASCAL VOC 2012 dataset, we resize the original images to a resolution of 513 pixels × 513 pixels and conduct 200 epochs of training. We use ResNeXt50 as the backbone network. The training batch size is set to 8, and the maximum value of the learning rate is set to 5 × 10^-6. Table 1 shows the size and number of channels of the output feature maps of each module of the network when the backbone network of the ventral pathway is ResNet18 and the CamVid dataset is used. In addition, the SegHeads used to obtain the auxiliary losses, which are not listed in the table, output feature maps of 720 pixels × 960 pixels with 12 channels.

C. ABLATION STUDY
To prove the effectiveness of our design, we use ResNet18 as the backbone network and use CamVid for the following ablation experiments. We use the CamVid training set and validation set to train the model and use the CamVid test set to test the model.
(1) We remove the dorsal pathway and retain only the ventral pathway. As shown in Table 2, the pixAcc is 92.2%, which decreased by 2.6%; and the mIoU is 69.8%, which decreased by 6.5%. The table shows that the dorsal pathway's ability to extract spatial information helps improve the overall network performance.
(2) We remove the SAM, and FM-2 is changed to fuse the feature maps from FM-1, VP-2, and DP-2. As shown in Table 2, the pixAcc is 92.9%, which decreased by 1.9%; and the mIoU is 71.2%, which decreased by 5.1%. The table shows that the SAM is useful for enhancing the spatial information in the dorsal pathway.
(3) We remove the SEM, and FM-1 is changed to fuse the feature maps from VP-4 and VP-3. As shown in Table 2, the pixAcc is 91.8%, which decreased by 3.0%; and the mIoU is 71.0%, which decreased by 5.3%. The table shows that the enhancement of semantic information by the SEM is beneficial to improving the final segmentation accuracy.
(4) We replace the SEM with the ASPP. As shown in Table 2, the pixAcc is 92.5%, which decreased by 2.3%; and the mIoU is 72.6%, which decreased by 3.7%. The table shows that our improvement to the ASPP is effective: assigning learned weights to receptive fields of different sizes can indeed achieve better information extraction results.
Table 3 shows the segmentation accuracy of VDNet on the CamVid test set, including the intersection over union (IoU) of each class, the pixAcc, and the mIoU of the 11 classes. ''Backbone'' indicates the backbone network pretrained on the ImageNet dataset.

D. EXPERIMENTAL RESULTS AND COMPARISON
It can be seen that VDNet has high segmentation accuracy for large-scale objects such as the sky, roads, sidewalks, and cars, but it has low segmentation accuracy for small objects such as signs and poles. Compared with ResNet18, using ResNet50 and ResNeXt50 improves the mIoU by 1.97% and 2.67%, respectively.
As shown in Table 4, we compare our models with some other models, including ENet [4], DFANet B [20], DFANet A [20], SegNet [21], RTA [22], Dilation8 [23], ICNet [9], BiSeNet [8], PSPNet [24], DenseDecoder [25], SwiftNet [26] and VideoGCRF [27]. ''*'' indicates that the model was pretrained on Cityscapes. ''-'' means that the metric was not given in the corresponding paper. Our model achieves a pixAcc of 95.7% and an mIoU of 82.1%, which are better than those of the compared models, indicating that our method is effective.
Table 5 shows the segmentation accuracy of VDNet on the Cityscapes test set, including the IoU of each class and the mIoU of the 19 classes. Our test results can be accessed on the evaluation server of Cityscapes (https://www.cityscapes-dataset.com/anonymous-results/?id=3d390bb00184b8aa30680644f8719fa870e9e078dcd946d3b53443138aad0e7d).
As shown in Table 6, we compare our model with some other models, including DeepLab [18], Dilation10 [23], LRR [28], DFANet B [20], DeepLabv2 [19], DFANet A [20], FRRN [46], RefineNet [29], SwiftNet [26], DeepLabv3 [3], and DeepLabv3+ [47]. Here we have an additional evaluation metric called category IoU, which is given by the evaluation server of Cityscapes; it is the mIoU calculated after grouping the 19 classes into 7 major categories. The pixAcc is not given here because the evaluation server of Cityscapes does not include this metric. ''-'' means that the metric was not given in the corresponding paper. Our model achieves 90.6% category IoU and 77.8% mIoU on the Cityscapes test set.
Table 7 shows the segmentation accuracy of VDNet on the PASCAL VOC 2012 test set, including the IoU of each class and the mIoU of the 21 classes. Our test results can be accessed on the evaluation server of PASCAL VOC 2012 (http://host.robots.ox.ac.uk:8080/anonymous/WNPHHF.html).
As shown in Table 8, we compare the segmentation accuracy with those of FCN-8s [5], ESPNetv2 [48], DeepLab [18], CRF-RNN [30], BoxSup [31], Dilation [23], DeepLabv2 [19], DeepLabv3 [3], and DeepLabv3+ [47]. Only the mIoU is used because the PASCAL VOC 2012 evaluation server only provides this metric. Our model achieves 81.0% mIoU on the PASCAL VOC 2012 test set.
Fig. 9 visualizes part of the segmentation results on the CamVid test set. It can be seen that the overall segmentation effect of VDNet is good, and the segmentation results for most objects, such as pedestrians, cars, roads, sidewalks and even large poles, are relatively fine. However, the segmentation of small objects such as thin poles is still lacking.
Fig. 10 shows partial segmentation results of VDNet on the Cityscapes validation set and the comparison with the segmentation results of DeepLabv2. We use white dashed boxes to show where VDNet achieves a notable improvement compared to DeepLabv2. Fig. 11 shows partial segmentation results of VDNet on the PASCAL VOC 2012 validation set and the corresponding comparison of segmentation results.

V. DISCUSSION
According to research [10], the cerebral visual cortex contains a ventral pathway and a dorsal pathway, which are respectively responsible for object recognition and spatial position perception. This exactly corresponds to the semantic information and spatial information required by the semantic segmentation task. Therefore, we propose a novel semantic segmentation model that uses two branches to simulate the ventral and dorsal pathways of the cerebral visual cortex. We achieve 82.1%, 77.8%, and 81.0% mIoU on the CamVid, Cityscapes, and PASCAL VOC 2012 datasets, respectively, proving that our method is effective. In addition, in order to strengthen the feature extraction capabilities of the two pathways, we respectively propose the SEM and SAM. The SEM, which can extract semantic information of different scales more efficiently, improves upon the ASPP. The SAM improves upon the original spatial attention module and can learn the importance of different spatial positions. Our ablation studies verify the effectiveness of the design of the two modules.
However, from the segmentation results of each class, we can see that the segmentation of some objects, such as poles, fences, and signs in CamVid, walls and fences in Cityscapes, and chairs and sofas in PASCAL VOC 2012, is still lacking. These are mainly small objects or objects with few training samples. Small objects are more likely to lose information during computation and are therefore more difficult to segment, while objects with few training samples are easily misclassified. This is what we need to improve in our future work.
In addition, there are still some gaps between VDNet and DeepLabv3 and DeepLabv3+. However, VDNet is much smaller: the multiply-adds of VDNet total 130.4B, while those of DeepLabv3+ total 177.1B. Moreover, because DeepLabv3 and DeepLabv3+ limit the number of downsampling operations without reducing the depth of the model, more video memory is required to store the larger intermediate feature maps. Finally, DeepLabv3 and DeepLabv3+ are pretrained on additional datasets such as MS-COCO, JFT-300M, and the Cityscapes train_extra set, whereas our model is not.

VI. CONCLUSION
The novel semantic segmentation network we propose simulates the ventral and dorsal pathways of the cerebral visual cortex and achieves good segmentation results on the CamVid, Cityscapes and PASCAL VOC 2012 datasets. Through experimental verification, our proposed SEM and SAM can respectively strengthen the extraction of semantic information and spatial information, thereby effectively improving the segmentation accuracy. In future work, we will continue to seek inspiration from the visual mechanism of the brain; explore and simulate biological structures such as binocular input, recurrence structures, and the interaction of the ventral and dorsal pathways; and apply them to computer vision tasks such as semantic segmentation.