Fully Convolutional Pyramidal Networks for Semantic Segmentation

Semantic segmentation networks focus on parsing unrestricted open scenes. Typical segmentation architectures are stacks of convolutional layers that extract semantic features. In most networks, the feature map dimension changes sharply at the sampling units, which ensures effective propagation of the gradient in deep nets. In this article, we propose a state-of-the-art network model named Fully Convolutional Pyramidal Networks (FC-PRNet), which employs a pyramidal residual structure to change the feature map dimension at all convolutional layers. This design is an effective way of improving generalization ability and optimizing parameters, and it gives FC-PRNet an excellent capability for semantic extraction. We tested our network on the urban scene benchmarks CamVid and KITTI; the experimental results show that FC-PRNet achieves better results than the compared networks without any pre-training or post-processing module. Moreover, owing to the construction of the pyramidal residual structures, FC-PRNet has fewer parameters than other existing networks trained on these datasets.


I. INTRODUCTION
In 2012, Krizhevsky, Sutskever, and Hinton proposed AlexNet [1], which occupies an important historical position in the field of convolutional neural networks (CNNs). Today, CNNs are driving advances in vision tasks such as image classification, style transfer, object detection, and local recognition. Scene parsing is a fundamental topic in local recognition tasks. Its goal is to assign a category label to each pixel in the image. Scene parsing frameworks are mostly based on Fully Convolutional Networks (FCNs) [2], a natural extension of CNNs that tackles the per-pixel prediction problem of semantic segmentation. FCNs add an up-sampling path after the CNN and introduce skip connections that compensate for the feature loss in the pooling layers. Owing to the up-sampling path, FCNs can process input images of any resolution and meet the per-pixel prediction requirements of most images.
The associate editor coordinating the review of this manuscript and approving it for publication was Yilun Shang.
Many CNNs have been extended to FCNs. In particular, Deep Residual Networks (ResNet) [3] implement hundreds of convolutional layers by introducing a new building block that consists of two convolutional layers and a shortcut. The block sums its input with a non-linear transformation of that input. ResNet suffers from the problem of diminishing feature reuse: in deep networks, gradients are not forced through the convolutional layers [6], [7]. Many researchers have studied this problem from the perspectives of network structure and training process [6]-[8]. FCNs extended from ResNet have achieved very good results [4], [5].
Recently, Han et al. proposed Deep Pyramidal Residual Networks (PyramidNet) [9], which use a new method of dimension growth: the feature map dimension grows strictly linearly with the number of convolutional layers. PyramidNet performs well in addressing the diminishing feature reuse problem. The linear increase of dimension leads to fewer parameters in the deep convolutional layers, and this structure uses its parameters efficiently to improve accuracy [9].
In this article, we introduce the PyramidNet architecture to FCNs for semantic segmentation and propose a new network named Fully Convolutional Pyramidal Residual Networks (FC-PRNet). We design residual blocks for the up-sampling and down-sampling paths, and the two paths form a complete semantic segmentation network by connecting the sampling layers with several skip connections [10]. FC-PRNet achieves good segmentation results on the CamVid and KITTI datasets. In part II, we introduce pyramidal residual blocks and the construction of FC-PRNet in detail. After introducing the different kinds of FC-PRNet in part III, we show the results on two urban scene benchmark datasets. Part IV summarizes this article and outlines future work.

A. PYRAMIDAL RESIDUAL NETWORKS
Most CNNs [7], [10]-[13] change both the feature map dimension and the feature map size at the down-sampling layers. In the original ResNet, the feature size is halved and the number of dimensions is doubled at each down-sampling layer. PyramidNet is derived from ResNet and proposes a new method of dimension growth: the dimension is increased by a small amount at every feature extraction layer, while the feature size still decreases only at the down-sampling layers. To achieve this dimensional variation, PyramidNet designs a dedicated feature extraction unit referred to as a pyramidal residual block (PR-block). There are two kinds of PyramidNet, the additive mode and the multiplicative mode, as shown in Figure 1. PR-blocks, as shown in Figure 2, are the basic building blocks of PyramidNet. Y denotes the output, whose dimension is n bigger than that of the input x. Because the dimensions differ between individual PR-blocks, an identity-mapping shortcut is unusable. Therefore, only a zero-padded shortcut or a projection shortcut is available. Given that a projection shortcut hampers feature propagation and leads to optimization problems [14], PyramidNet adopts a zero-padded shortcut, which does not introduce additional non-zero parameters. Moreover, each zero-padded shortcut provides a mixture of the residual network and the plain network. As the dimension increases at each unit, the mixture effect becomes more marked.
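The zero-padded shortcut described above can be illustrated with a minimal sketch. It assumes a channel-major feature map stored as nested Python lists; the function names `zero_padded_shortcut` and `pr_block_output` are hypothetical and are not taken from the paper. The residual branch is passed in as precomputed values rather than computed by real convolutions.

```python
def zero_padded_shortcut(x, out_channels):
    """Pad a channel-major feature map with all-zero channels up to out_channels."""
    in_channels = len(x)
    if out_channels < in_channels:
        raise ValueError("zero padding can only increase the channel count")
    height, width = len(x[0]), len(x[0][0])
    zeros = [[[0.0] * width for _ in range(height)]
             for _ in range(out_channels - in_channels)]
    return x + zeros  # new list; the input is not modified


def pr_block_output(x, residual):
    """Add the zero-padded input to the residual branch output, element-wise."""
    shortcut = zero_padded_shortcut(x, len(residual))
    return [[[s + r for s, r in zip(s_row, r_row)]
             for s_row, r_row in zip(s_ch, r_ch)]
            for s_ch, r_ch in zip(shortcut, residual)]


# Toy example (illustrative values): 2 input channels, 3 output channels, so n = 1.
x = [[[1.0, 2.0]], [[3.0, 4.0]]]                        # shape (2, 1, 2)
residual = [[[0.5, 0.5]], [[0.5, 0.5]], [[7.0, 8.0]]]   # shape (3, 1, 2)
y = pr_block_output(x, residual)
print(y)
```

In this toy case the first two output channels receive the shortcut plus the residual branch, while the added third channel carries only the residual branch output. That is the sense in which a zero-padded shortcut mixes residual and plain behavior.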
The variation of dimensions between adjacent PR-blocks is called the growth-rate. It can be constructed in two different ways: additive mode, expressed as (1), and multiplicative mode, expressed as (2), where α and β are the respective growth-rates and k is the number of PR-blocks. The main difference between additive and multiplicative networks is that the dimension of additive networks increases linearly, whereas the dimension of multiplicative networks increases geometrically. The multiplicative network behaves like the original deep network architectures, in which the dimension increases slowly at the input side and sharply at the output side. As a result, multiplicative networks have more parameters than additive networks as the network gets deeper. When the networks are shallow, the two kinds of PyramidNet perform similarly because their structures differ only slightly. As the networks get deeper, they show differences in feature extraction capability. The feature map dimensions of multiplicative networks tend to be much larger at the output side than those of additive networks, and the redundant parameters make the network harder to train and degrade its performance. Comparative experiments show that the additive network performs better than the multiplicative network.
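Equations (1) and (2) are not reproduced in this text. One form that would be consistent with the description above (linear versus geometric growth) and with the dimensions reported in the architecture description of Section III is sketched below. The symbols D_k (the feature map dimension after the k-th PR-block) and D_0 (the dimension entering the first PR-block) are introduced here for illustration only, and the exact form used by the authors may differ, for example in rounding.

```latex
\begin{align*}
\text{additive:}       \quad D_k &= D_{k-1} + \alpha = D_0 + \alpha k, \\
\text{multiplicative:} \quad D_k &= D_{k-1} \cdot \beta = D_0 \, \beta^{k}.
\end{align*}
```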

B. FULLY CONVOLUTIONAL PYRAMIDAL RESIDUAL NETWORKS
The PyramidNet architecture described in Section II-A builds the down-sampling path of our FC-PRNet. To recover the high-dimensional features, FC-PRNet introduces a corresponding up-sampling path, which is composed of PR-blocks, up-sampling layers and skip connections. We design a basic PR-block and a bottleneck PR-block, as shown in Figure 3. The basic PR-block consists of two 3 × 3 Conv layers, whereas the bottleneck PR-block uses a combination of 1 × 1 Conv, 3 × 3 Conv and 1 × 1 Conv, which reduces the number of parameters effectively.
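The parameter saving of the bottleneck PR-block can be made concrete with a rough count of convolution weights. The sketch below is an illustration under stated assumptions, not the authors' configuration: biases and batch normalization parameters are ignored, the inner width of the bottleneck is assumed to be one quarter of the output dimension (the convention of the ResNet bottleneck, which the source does not confirm for FC-PRNet), and the channel counts 124 and 128 are example values.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution, ignoring biases."""
    return kernel * kernel * c_in * c_out


def basic_block_params(c_in, c_out):
    """Two 3x3 convolutions: c_in -> c_out -> c_out."""
    return conv_params(3, c_in, c_out) + conv_params(3, c_out, c_out)


def bottleneck_block_params(c_in, c_out, reduction=4):
    """1x1 -> 3x3 -> 1x1 convolutions.

    The inner width c_out // reduction is an assumption made for this sketch.
    """
    c_mid = c_out // reduction
    return (conv_params(1, c_in, c_mid)
            + conv_params(3, c_mid, c_mid)
            + conv_params(1, c_mid, c_out))


# Example values only: a block that grows the dimension from 124 to 128.
basic = basic_block_params(124, 128)
bottleneck = bottleneck_block_params(124, 128)
print(basic, bottleneck, round(basic / bottleneck, 1))
```

Under these assumptions the bottleneck block needs more than an order of magnitude fewer convolution weights than the basic block at the same input and output dimensions, because the expensive 3 × 3 convolution operates on the reduced inner width.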
FC-PRNet adopts two kinds of sampling layers to change the feature size. In the down-sampling path, a transition down layer (TD) reduces the feature size. In the up-sampling path, a transition up layer (TU) recovers the feature size. Note that TD includes a pooling operation, which loses some of the information from earlier PR-blocks. Nevertheless, this information is still available in the down-sampling path of the network and can be passed on via skip connections. Besides the two kinds of transition layers, the two paths also use different PR-block structures. Unlike the PR-blocks in the down-sampling path, the PR-blocks in the up-sampling path gradually reduce the feature map dimensions. Figure 4 shows the schematic

III. EXPERIMENTS
A. ARCHITECTURE
In this section, we introduce the architectures of FC-PRNet with additive and multiplicative modes used in the subsequent experiments. Firstly, in Figure 5, we define the 6 kinds of building blocks used in the network. Unlike the PR-block in the down-sampling path, the PR-block in the up-sampling path uses 1 × 1 convolutional layers to adjust the dimensions. TD consists of BN, ReLU, 1 × 1 Conv and max pooling. TU contains only one 3 × 3 transposed Conv.
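How TD and TU change the feature size, and why the sizes line up for skip connections, can be traced with a small sketch. The pooling window and the transposed convolution stride are not stated in this text, so a factor of 2 is assumed for both. The number of transitions (3) is also a hypothetical value, and the 360 × 480 input matches the CamVid frame size used later in the experiments.

```python
def transition_down(height, width, pool=2):
    """Feature size after max pooling (assumed pool x pool window, same stride)."""
    return height // pool, width // pool


def transition_up(height, width, stride=2):
    """Feature size after a transposed convolution.

    Assumes a stride of 2 and padding chosen so that the output is exactly
    stride times the input size.
    """
    return height * stride, width * stride


def trace_sizes(height, width, num_transitions):
    """Return the feature sizes along the down-sampling and up-sampling paths."""
    down = [(height, width)]
    for _ in range(num_transitions):
        down.append(transition_down(*down[-1]))
    up = [down[-1]]
    for _ in range(num_transitions):
        up.append(transition_up(*up[-1]))
    return down, up


# Hypothetical depth of 3 transitions on a 360 x 480 input.
down, up = trace_sizes(360, 480, 3)
print(down)
print(up)
```

With these assumed factors, each stage of the up-sampling path has the same feature size as the mirrored stage of the down-sampling path, which is what allows a skip connection to join the two. If the input size were not divisible by the total pooling factor, the floor division in TD would make the recovered size smaller than the input, and cropping or padding would be needed.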
Secondly, we define additive FC-PRNets and multiplicative FC-PRNets. Both kinds of networks are built with either basic PR-blocks or bottleneck PR-blocks. We summarize all kinds of FC-PRNet in Table 3 and take FC-PRNet94 with basic PR-blocks as an example to introduce the networks. FC-PRNet94 with basic PR-blocks is built from 94 convolutional layers: a layer with 48 convolution kernels to process RGB images, 48 layers in the down-sampling path, 44 layers in the up-sampling path, and a final layer that processes the output data, followed by a SoftMax non-linearity to predict the class of each pixel. If the growth-rate α is 4, the biggest dimension is 128. Compared with additive FC-PRNets, multiplicative FC-PRNets with basic PR-blocks are much wider: the biggest dimension is 1840 when the growth-rate β is 1.2.
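The two reported maximum dimensions can be checked against simple additive and multiplicative schedules. The sketch below assumes that the dimension starts at 48 (the width of the first layer) and that it grows once per PR-block over 20 blocks. The block count is inferred here because it reproduces both reported figures; it is not stated in the text. Rounding down to an integer is also an assumption.

```python
def additive_dims(d0, alpha, num_blocks):
    """Dimensions under linear growth: d0 + alpha * k for k = 0..num_blocks."""
    return [d0 + alpha * k for k in range(num_blocks + 1)]


def multiplicative_dims(d0, beta, num_blocks):
    """Dimensions under geometric growth: d0 * beta**k, rounded down (assumed)."""
    return [int(d0 * beta ** k) for k in range(num_blocks + 1)]


# Assumed values: starting dimension 48 and 20 growth steps (inferred, not stated).
add = additive_dims(48, 4, 20)
mul = multiplicative_dims(48, 1.2, 20)
print(add[-1], mul[-1])
print(add[10], mul[10])
```

Under these assumptions the additive schedule ends at 128 and the multiplicative schedule at 1840, matching the figures above. The intermediate values also show the pattern described in Section II: the multiplicative dimension stays close to the additive one over the early blocks and then grows sharply toward the output side.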
Thirdly, we test our models on a desktop computer with an Intel i7-4790K CPU and a TITAN Xp GPU. We train by minimizing the pixel-wise cross-entropy with Adam (Adaptive Moment Estimation). The learning rate is initially set to 0.001 and reduced by 5% per epoch, and the batch size is 4. We monitor the mean intersection over union (MIoU) and the global accuracy.
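The learning rate schedule and the two monitored metrics can be written down as a short sketch. It reads "reduced by 5% per epoch" as a multiplicative decay by 0.95 after every epoch, which is an interpretation rather than something the text states. MIoU and global accuracy are computed from a confusion matrix using their standard definitions, and the function names and the 2-class example matrix are hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay=0.95):
    """Learning rate after `epoch` completed epochs (assumed multiplicative decay)."""
    return base_lr * decay ** epoch


def segmentation_metrics(confusion):
    """Return (MIoU, global accuracy) from a square confusion matrix.

    confusion[t][p] counts pixels of true class t predicted as class p.
    Classes absent from both labels and predictions are skipped in the mean.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                         # missed pixels of class c
        fp = sum(confusion[t][c] for t in range(n)) - tp    # pixels wrongly labeled c
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious), correct / total


# Illustrative 2-class confusion matrix (not data from the paper).
confusion = [[8, 2],
             [1, 9]]
miou, accuracy = segmentation_metrics(confusion)
print(learning_rate(0), learning_rate(1))
print(miou, accuracy)
```

Global accuracy weights every pixel equally, so it is dominated by large classes such as road or sky, whereas MIoU averages over classes and therefore penalizes poor results on small classes.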

B. CAMVID DATASET
CamVid [15] is the first video collection with semantic labels, providing each pixel with an associated label from 11 semantic classes. Unlike other videos recorded from a fixed position, this dataset is captured from the perspective of a driving automobile. We use the data of CamVid Group III, which includes 367 frames for training, 101 frames for validation and 233 frames for testing, with a resolution of 360 × 480 per frame. We train and predict on full-size images without any post-processing or pre-training module. Table 1 and Figure 6 compare FC-PRNet with other networks. The experimental results show that the additive FC-PRNet94 with basic PR-blocks (α = 4) gets the best results. The pyramidal residual structure yields the highest scores and improves the MIoU of all classes by 15%-20%, most notably for trees, bicycles and road signs. It is noteworthy that the images are video frames, so the dataset contains temporal information. If advanced temporal video processing methods were introduced, the performance of the network could be improved further. Figure 7 shows some segmentation results of FC-PRNet94 with basic PR-blocks (α = 4) on the CamVid dataset. The qualitative results are in good agreement with the quantitative results, showing clear segments that capture many details, such as cars, pedestrians, trees and the other labels of the dataset. Among the poorly segmented cases, there are some misidentifications of road signs, columns, buildings and cars.

C. KITTI DATASET
The KITTI [18], [19] road benchmark is a comprehensive dataset that is very popular with road detection researchers. We compare the performance of FC-PRNet with other state-of-the-art methods on the KITTI road benchmark. The compared algorithms include StixelNet [21], SPRAY [22], Up_Conv_Poly [23] and RBNet [24]. Table 2 shows the results of the different algorithms in the evaluation. It is worth noting that the KITTI road benchmark includes LiDAR data, while our algorithm uses only image data from visual cameras. Therefore, the comparison contains only algorithms that use the same image data. It can be seen that FC-PRNet outperforms its competitors on all these metrics in general road detection. Some qualitative results of FC-PRNet are shown in Figure 8.

IV. DISCUSSION AND CONCLUSION
This article focuses on segmenting objects in scene parsing. By introducing pyramidal residual blocks, FC-PRNet can avoid the diminishing feature reuse problem. The dimensions are forced to grow gradually in order to reduce the number of parameters. To analyze the effectiveness of the proposed algorithm, we tested it on the well-known CamVid and KITTI datasets. FC-PRNet achieves good semantic recognition results for segmenting objects in scene parsing without additional post-processing or pre-training.
At present, FC-PRNet processes only a single color image at a time. However, the datasets also contain time series and LiDAR information, which is important for improving the results of scene parsing. In follow-up work, we will try to incorporate the time series and LiDAR information into the training process to obtain better results. Meanwhile, we will design optimized models with more layers to improve segmentation performance.
In conclusion, we study a new semantic segmentation network named Fully Convolutional Pyramidal Residual Networks (FC-PRNet). By designing pyramidal residual blocks and sampling modules in the down-sampling and up-sampling paths, the network achieves an excellent capability for semantic recognition with few parameters. On the CamVid dataset, FC-PRNet obtained 75.4% MIoU and 93.6% Pacc, higher than Seg-Net, DeepLab V3 and FC-Densenet.

MINGFU ZHAO is a professor, doctor of engineering, and doctoral supervisor, and Vice Dean of the School of Electrical and Electronic Engineering, Chongqing University of Technology. His current research interests include optical fiber biochemical sensing theory and application, intelligent optical fiber sensing theory and technology, bionic multisensing fusion technology, information acquisition and processing, and intelligent information processing.
BIN TANG is mainly engaged in the research work of spectroscopic water quality detection, pesticide residue analysis, environmental spectroscopy analysis, and digital signal processing. He is also mainly responsible for the design of water quality parameter detection schemes by spectroscopy, the establishment of suspended solids scattering models in water, the research of spectral information processing algorithms, and the design and implementation of key algorithm modules.