Spatial Attention Based Real-Time Object Detection Network for Internet of Things Devices

Object detection algorithms for Internet of Things (IoT) devices must offer both high real-time performance and low computational complexity. The real-time object detection network You Only Look Once version 3 (YOLOv3) makes full use of multi-scale features through its feature pyramid structure and achieves good accuracy while guaranteeing fast detection. The feature pyramid network of YOLOv3 comprises bottom-up feature extraction, top-down up-sampling, and lateral connections between low-level detail features and high-level semantic features. However, not all features are useful for object detection. In this article, a novel object detection network, Spatial Attention based YOLOv3 (SA-YOLOv3), is proposed. The proposed method adds a spatial attention network to the top-down up-sampling process. The spatial attention network computes a feature weight matrix from the up-sampled feature map, and SA-YOLOv3 uses this matrix to filter the low-level features, retaining the more valuable ones. Finally, the selected low-level feature map is concatenated with the high-level feature map, yielding feature maps that carry both spatial detail and rich semantic information. Experimental results on the PASCAL VOC2012 and RSOD datasets show that SA-YOLOv3 outperforms YOLOv3.


I. INTRODUCTION
Object detection is a key step for IoT devices to realize intelligent perception and recognition of objects and scenes, for example in autonomous driving [1], [2] and robot vision [3]. Early object detection methods require the following steps. First, the entire image is traversed with a sliding window to mark the locations of candidate objects. Second, features are extracted from the marked areas; handcrafted features such as Histogram of Oriented Gradients (HOG) [4], Haar-like features [5], and the Scale-Invariant Feature Transform (SIFT) [6] are commonly used. Finally, a trained classifier judges whether the windows corresponding to these features contain objects and, if so, classifies them. However, these early methods produced many redundant windows, and the handcrafted features were not robust to variable backgrounds, which resulted in low detection efficiency.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhenyu Zhou .
Krizhevsky et al. proposed AlexNet [7], based on a Convolutional Neural Network (CNN), in 2012. It showed that the features extracted by deep neural networks are richer and more accurate than handcrafted features. Since then, researchers have focused on object detection methods based on deep neural networks. These methods fall into two categories: proposal-based approaches and end-to-end approaches.
Proposal-based approaches comprise two stages. First, a series of region proposals is generated from the image; then the targets in the candidate regions are classified using a CNN. In [8] and [9], the Selective Search (SS) algorithm is used to generate region proposals in the first stage. SS is faster and more accurate than traditional sliding-window methods. However, it extracts about 2,000 region proposals, which results in a huge amount of computation and a low detection speed. To further improve detection speed, a Region Proposal Network (RPN) [10] is applied in Faster R-CNN (Region-based Convolutional Neural Network) to simplify the generation of region proposals. In the following years, several detection methods were proposed, including Region-based Fully Convolutional Networks (R-FCN) [11], Rotation Region Proposal Networks (RRPN) [12], and R-FCN-3000 [13]. These methods improved detection accuracy to varying degrees. However, generating region proposals makes this kind of algorithm consume considerable time and memory, which hurts real-time performance. Lin et al. [14] analyzed why end-to-end methods lagged behind two-stage detection methods, after which end-to-end detection methods began to develop rapidly. In 2016, Redmon et al. [15] proposed the first end-to-end detection algorithm, You Only Look Once (YOLO), which predicts the categories and coordinates of objects in an image simultaneously with a single convolutional network. The detection speed of YOLO is much higher than that of proposal-based approaches, but it performs poorly on smaller objects in the image.
Subsequent work improved the performance of end-to-end detectors on small targets. In [16], the Single Shot MultiBox Detector (SSD) was proposed. SSD borrowed the idea of anchors from Faster R-CNN and fused feature maps of different resolutions. Fu et al. [17] replaced the VGG16 backbone [18] of SSD with the deeper ResNet-101 [19] to improve feature extraction, and added deconvolution layers to extract richer context information. Shen et al. [20] designed a network structure that requires no pre-trained weights; it borrows the dense connections of DenseNet [21] and applies a stem-block structure to SSD, improving accuracy while reducing model parameters. In [22], Redmon and Farhadi proposed YOLOv2, which is both more accurate and faster than YOLO. However, the feature extraction ability of YOLOv2's Darknet-19 backbone is limited, and YOLOv2 does not make full use of multi-scale features, so its performance remains slightly inferior to that of R-CNN-style detectors. Lin et al. [23] proposed the Feature Pyramid Network (FPN), with a top-down pathway and lateral connections; it fuses multi-layer feature information, makes predictions on feature maps of different scales, and improves detection performance. In 2018, Redmon et al. proposed YOLOv3 [24], which introduced Darknet53, a backbone with strong feature extraction ability. The results on FPN [23] show that high-level features contain richer semantic information and suit object classification, while low-level features contain detailed spatial structure and suit localizing object boundaries.
YOLOv3 draws on the idea of FPN and combines high-level semantic features with low-level detail features, improving classification and localization of small objects. However, YOLOv3 fuses unselected low-level features directly with high-level features, and in fact not all low-level detail features benefit detection. To solve this problem, we propose the Spatial Attention based YOLOv3 (SA-YOLOv3) network. Our contributions in this work are as follows: (1) We propose the Spatial Attention based YOLOv3 (SA-YOLOv3) network. The proposed SA-YOLOv3 achieves higher detection accuracy than YOLOv3 at a similar speed.
(2) The spatial attention network is used to filter the low-level detail features, so as to retain the effective features and eliminate the redundant features.
(3) We combine the filtered shallow detail features with the deep semantic features, making full use of the advantages of both kinds of features and achieving better detection results.
The remainder of this article is organized as follows. Section II briefly reviews the spatial attention network and the YOLOv3 network; the Spatial Attention based YOLOv3 network is proposed in Section III; Section IV demonstrates the effectiveness of SA-YOLOv3; Section V concludes this article.

II. RELATED WORKS
A. SPATIAL ATTENTION NETWORK
Zhao et al. [25] proposed a spatial attention network (SA) and used it to calculate the saliency of pictures. SA can better locate the boundary of the object in the picture. The structure of SA network is shown in Fig. 1.
The network has two branches. Their inputs are the high-level semantic feature map (G in Fig. 1, where H, W, and C denote the height, width, and number of channels of the feature map, respectively) and the low-level detail feature map (L in Fig. 1). The upper branch, called the location network in Fig. 1, is the core of the SA network. First, the location network extracts spatial features from the high-level semantic feature map using asymmetric 1 × k and k × 1 convolutions, producing the two feature maps U and U' in Fig. 1, each with half the input's channels, that is, C/2. Next, the location network reduces the channel dimension of U and U' to 1 using k × 1 and 1 × k convolutions, respectively, yielding the feature maps V and V'. Finally, the location network fuses V and V' by element-wise addition followed by a sigmoid operation, producing a weight matrix Z of size W × H. The other branch multiplies the low-level detail feature map L by the weight matrix Z extracted by the location network to generate the spatial attention map Y. The function of the spatial attention network is thus to select the features in the low-level detail feature map L using the weight matrix Z.
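The final fusion step can be sketched in a few lines. The snippet below is our own minimal illustration, not code from [25]: it assumes the two single-channel branch outputs V and V' (each of size H × W) have already been produced by the asymmetric convolutions, and shows how the weight matrix Z reweights every channel of L by broadcasting.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(low_level, v, v_prime):
    """Fuse the two single-channel branch outputs into a weight
    matrix Z and use it to reweight the low-level feature map.

    low_level : (C, H, W) detail feature map L
    v, v_prime: (H, W) outputs of the two asymmetric-conv branches
    returns   : attention map Y of shape (C, H, W)
    """
    z = sigmoid(v + v_prime)          # weight matrix Z, shape (H, W)
    return low_level * z[None, :, :]  # broadcast Z over all C channels

# Zero branch outputs give Z = sigmoid(0) = 0.5 everywhere,
# so every feature value is simply halved.
L = np.ones((256, 52, 52))
Y = spatial_attention(L, np.zeros((52, 52)), np.zeros((52, 52)))
```

The broadcast over the channel axis is the whole trick: one scalar weight per spatial position modulates all C channels of L at that position.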

B. YOLOv3 NETWORK
The YOLOv3 network consists of two parts: a bottom-up feature extraction network and a top-down multi-scale detection network. The former extracts low-level detail features from the input image; the latter extracts high-level semantic features from those detail features. The high-level features are then concatenated with the corresponding low-level features to detect objects in the detection network. As shown in Fig. 2, Darknet53 (orange dotted frame on the left) serves as the backbone of the feature extraction network. Darknet53 is 75 layers deep, formed by stacking 53 convolution layers and 22 shortcut connection layers. It can be divided into a single DBL module and five residual blocks: res1, res2, res8, res8, and res4. DBL, a basic component of YOLOv3, consists of a convolution layer, a Batch Normalization (BN) operation, and a Leaky ReLU. Resn denotes a residual block containing n residual units, each consisting of two DBL modules connected in series. For convenience of analysis, we divide Darknet53 into three stages. The first stage is the blue box in Fig. 2: three blue residual blocks (res1, res2, and res8) extract features from the 416 × 416 × 3 input image and produce the feature map C1 of size 52 × 52 × 256. In the second stage, Darknet53 uses the res8 module in the purple box to downsample C1 and obtain the feature map B1 of size 26 × 26 × 512. In the third stage, the feature map A1 is generated by the res4 block in the orange box; its size is 13 × 13 × 1024.
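The three stage resolutions follow directly from Darknet53's cumulative strides of 8, 16, and 32 (each stage halves the resolution of the previous one). A small sanity check, using a helper of our own naming:

```python
def stage_sizes(input_size=416):
    """Spatial size of the three Darknet53 stage outputs.

    C1, B1, and A1 sit at cumulative strides 8, 16, and 32
    relative to the input image.
    """
    return {name: input_size // stride
            for name, stride in (("C1", 8), ("B1", 16), ("A1", 32))}
```

With the 416 × 416 input used here this gives 52, 26, and 13; the same rule explains the feature-map sizes for any stride-32-divisible input, e.g. the 608 × 608 images used later for RSOD.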
In Fig. 2, the dotted line box (red) on the right is a top-down multi-scale detection network. This network is mainly used to predict the output feature maps of three different scales. The prediction process is as follows.
The output feature map A1 of Darknet53 passes through the DBL × 5 module in the multi-scale detection network to produce the feature map A2, of size 13 × 13 × 512. The DBL and convolution operations in the predict1 module (blue dotted box in Fig. 2) then turn A2 into the first-scale output, output1 (size 13 × 13 × 75).
The block1 module (the first green dotted-line frame in Fig. 2) performs a DBL and an up-sample operation on feature map A2 and concatenates the result with feature map B1. The concatenated feature map is then processed by the second DBL × 5 module in the multi-scale detection network to obtain the feature map B2, of size 26 × 26 × 256. The second-scale output, output2 (size 26 × 26 × 75), is obtained via the predict2 module (whose operation is the same as predict1's).
The process of obtaining feature map C2 and output3 of the third scale is the same as that of B2 and output2.
The outputs of the above three scales (output1, output2, and output3) are used to detect large, medium, and small objects in the input image, respectively.
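The depth of 75 in each output map is not spelled out above, but it follows from the standard YOLOv3 encoding: each of the 3 anchors per scale predicts 4 box offsets, 1 objectness score, and one confidence per class, and PASCAL VOC has 20 classes. A quick check, assuming this standard encoding:

```python
def output_depth(num_anchors=3, num_classes=20):
    """Channel depth of one YOLOv3 output scale.

    4 box offsets + 1 objectness score + per-class confidences,
    repeated for every anchor at each grid cell.
    """
    return num_anchors * (4 + 1 + num_classes)
```

For VOC this yields 3 × (4 + 1 + 20) = 75; for the 4-class RSOD dataset used later the same formula would give 27.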

III. SPATIAL ATTENTION BASED YOLOv3 (SA-YOLOv3) NETWORK
A. THE LOCATION OF THE SPATIAL ATTENTION NETWORK IN SA-YOLOv3
We use a simplified structure of the YOLOv3 network to illustrate the location of the SA network in SA-YOLOv3. As shown in Fig. 3, we simplify the network structure of YOLOv3 into the form of FPN [23]. The left half of Fig. 3 is the bottom-up feature extraction process; each layer on the left represents the output feature map of one stage of the Darknet53 network in Fig. 2. Input corresponds to the input of YOLOv3, that is, the yellow-boxed input of the Darknet53 network in Fig. 2. C1 corresponds to the output feature map of the first stage of Darknet53 (blue box in Fig. 2: the DBL, res1, res2, and res8 modules). B1 corresponds to the output feature map of the second stage (purple box in Fig. 2: the res8 module). A1 corresponds to the output feature map of the third stage (orange box in Fig. 2: the res4 module).
The right half of Fig. 3 shows the top-down up-sampling process. In this process, the A2, B2, and C2 layers correspond to the output feature maps of the three DBL × 5 modules in the multi-scale detection network in Fig. 2, respectively. The two blocks with green dotted lines correspond to the two blocks in Fig. 2, and Predict1, Predict2, and Predict3 in Fig. 3 correspond to the predict1, predict2, and predict3 modules in the multi-scale detection network in Fig. 2, respectively.
The three lateral connections in Fig. 3 fuse the low-level feature maps directly with the high-level feature maps. To select the low-level detail features before fusion, we introduce the SA network [25] into the two blocks in Fig. 3. We call a block with SA added an SA-block. The proposed SA-block selects the low-level detail features, enabling the high-level semantic features in the detection network to be combined with more effective detail features.

B. SA-BLOCK
The structure of the SA-block is shown in Fig. 4. The high-level semantic feature map (A2 or B2) from the upper layer is processed through two DBL modules and an up-sample operation to serve as the input of the SA network (red solid frame in Fig. 4). The SA network obtains the weight matrix W by passing this input through the location network. The location network extracts spatial features from the semantic feature map using asymmetric 1 × k and k × 1 convolutions, then reduces the channel dimension of the spatial features to 1 using k × 1 and 1 × k convolutions, respectively. Finally, the location network fuses the two branches' outputs by element-wise addition to obtain the weight matrix W. The weight matrix W is multiplied by the corresponding detail feature map (B1 or C1 in Fig. 3) to obtain the filtered low-level feature map SA_out, the output of the SA network. Finally, we fuse the high-level semantic feature map with the selected feature map SA_out, combining the high-level semantic features with the more effective detail features.

C. ARCHITECTURE OF SA-YOLOv3
The proposed SA-YOLOv3 network structure is shown in Fig. 5. The location of the SA-block in the SA-YOLOv3 network is shown by the red boxes of SA1 and SA2 in Fig. 5.
In SA-block1, the feature map A2 is passed through two DBL modules (the second of which increases the channel dimension to obtain richer features) and an up-sample operation to serve as the input of the SA1 network; its size is 26 × 26 × 512. The SA1 network has two branches. The left branch applies the conv1 operation with a 1 × k kernel (k = 3) and the conv2 operation with a k × 1 kernel, obtaining a feature map of size 26 × 26 × 1. The right branch applies conv2 and then conv1, also obtaining a feature map of size 26 × 26 × 1. The weight matrix W1 is obtained by adding the outputs of the two branches and applying a sigmoid activation. W1 is multiplied by the low-level detail feature map B1 from Darknet53 to obtain the map SA1_out (size 26 × 26 × 512), the output of the SA1 network. Finally, we concatenate SA1_out with the high-level semantic feature map.
In SA-block2, the feature map B2 is passed through two DBL modules and an up-sample operation to serve as the input of the SA2 network; its size is 52 × 52 × 256. The SA2 location network outputs the weight matrix W2. W2 is multiplied by the low-level detail feature map C1 to obtain the output feature map SA2_out (size 52 × 52 × 256) of the SA2 network. Finally, SA2_out and the high-level semantic feature map are concatenated to generate the output feature map C2 of the SA-block2 module.
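The shape bookkeeping of SA-block1 can be checked with dummy tensors. This sketch assumes, as stated above, that the up-sampled semantic map and SA1_out both have shape 26 × 26 × 512; the concatenated depth of 1024 is our own inference from those shapes, not a figure given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dummy tensors with the shapes stated for SA-block1 (channels last)
semantic = rng.standard_normal((26, 26, 512))   # up-sampled input of SA1
b1 = rng.standard_normal((26, 26, 512))         # low-level detail map B1
# Stand-in weight matrix W1: one sigmoid weight per spatial position
w1 = 1.0 / (1.0 + np.exp(-rng.standard_normal((26, 26, 1))))

sa1_out = b1 * w1                               # filtered detail features
fused = np.concatenate([semantic, sa1_out], axis=-1)
```

Because W1 has a single channel, multiplying it with B1 leaves the 512-channel depth unchanged; only the concatenation grows the depth.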

IV. EXPERIMENTS
In this section, we carry out experiments on the PASCAL VOC2012 [26] and RSOD [27], [28] datasets to evaluate the effectiveness of the proposed SA-YOLOv3 network. The PASCAL VOC2012 dataset includes 17,125 images in 20 categories, including airplanes, bicycles, and birds. Fig. 6(a) shows some sample images from the PASCAL VOC2012 dataset. The RSOD dataset includes 936 images in 4 categories: aircraft, oiltank, overpass, and playground. Fig. 6(b) shows some sample images from the RSOD dataset.
The training parameters of the SA-YOLOv3 network were set as follows: the SGD optimizer was adopted to update the network weights, with momentum 0.9, weight decay 0.0005, an initial learning rate of 0.001, and a batch size of 2. The input images were resized to 416 × 416 for the PASCAL VOC2012 dataset and to 608 × 608 for the RSOD dataset. The experiments were performed on Ubuntu 18.04 LTS with an NVIDIA GeForce GTX 1050Ti GPU.
In the experiments, we use the mIoU (mean Intersection-over-Union), mAP (mean Average Precision), and fps (frames per second) to evaluate network performance. The IoU of a detection is the ratio of the area of overlap to the area of union between the bounding box of the object marked in the dataset and the bounding box predicted by the network. The mIoU is the average of all IoU values above the threshold (IoU > IoU_thresh, with IoU_thresh = 0.5), computed over detections whose category the network predicts correctly. The mAP is the average of the average precision (AP) values over all categories. The AP and mAP are computed by Eqs. (1) and (2):

AP = Σ_n (r_{n+1} − r_n) · p_interp(r_{n+1}),  where p_interp(r_{n+1}) = max_{r̃ ≥ r_{n+1}} p(r̃)   (1)

mAP = (1/Q) · Σ_{q=1}^{Q} AP(q)   (2)

In (1), p = |Object ∩ Detected| / |Detected| is the precision, i.e., the proportion of the network's detections that are correct, and r = |Object ∩ Detected| / |Object| is the recall, i.e., the proportion of the ground-truth objects of each category that the network correctly recognizes; p_interp(r_{n+1}) is the maximum precision over recalls greater than or equal to r_{n+1}. In (2), Q is the total number of categories. Fps is the detection rate of the network (frames per second).
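These metrics can be implemented directly. The sketch below uses helper functions of our own naming: an axis-aligned box IoU, the all-point interpolated AP of Eq. (1), and the mAP of Eq. (2).

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_precision(recalls, precisions):
    """AP via Eq. (1): sum of (r_{n+1} - r_n) * p_interp(r_{n+1}).

    p_interp is enforced by taking the running maximum of the
    precision curve from the right (highest recall backwards).
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))

def mean_average_precision(ap_per_class):
    """mAP via Eq. (2): average of the per-category AP values over Q classes."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, two unit-offset 2 × 2 boxes overlap in a 1 × 1 region, giving IoU = 1/7, and a detector that reaches recall 0.5 at precision 1.0 and nothing beyond scores AP = 0.5.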

A. EXPERIMENTS ON PASCAL VOC2012 DATASETS
In the experiments, we randomly selected 60% of the images in the PASCAL VOC2012 dataset as the training set (10,275 images) and the remaining 40% as the test set (6,850 images). Tables 1 and 2 show the results of YOLOv3 and SA-YOLOv3 on the VOC2012 test set. Fig. 7 shows detection results of the SA-YOLOv3 network on the PASCAL VOC2012 test set.
From the results shown in Tables 1 and 2, we can draw the following conclusions: (1) The mAP of SA-YOLOv3 is 1.5% higher than that of YOLOv3 while maintaining the same speed. At the same time, the mIoU of SA-YOLOv3 is 1.4% higher than that of YOLOv3.
(2) The accuracy of SA-YOLOv3 is slightly lower than that of YOLOv3 in the chair, cow, horse, and train categories. In the other categories, however, the AP of SA-YOLOv3 matches or even substantially exceeds that of YOLOv3 (e.g., in the bike, bus, car, and tv categories).
The results of the above two experiments indicate that the detail features filtered by the spatial attention network allow the network to better locate object boundaries in the image.

B. EXPERIMENTS ON RSOD DATASETS
Target detection in remote sensing images [27], [28] is an important application of object detection. We use the RSOD dataset to evaluate the effectiveness of SA-YOLOv3 on this task. In the experiments, we used 60% of the RSOD dataset as the training set (655 images) and the remaining 40% as the test set (281 images). The experimental results are shown in Tables 3 and 4. Fig. 8 shows detection results of the SA-YOLOv3 network on the RSOD test set.
From the results shown in Tables 3 and 4, we can draw the following conclusions: (1) Although the fps of SA-YOLOv3 is slightly lower than that of YOLOv3, it retains real-time detection capability. The mAP of SA-YOLOv3 is 1.4% higher than that of YOLOv3, and its mIoU is 3.1% higher.
(2) The performance of SA-YOLOv3 in the aircraft and playground categories is similar to that of YOLOv3, but its AP in the oiltank and overpass categories is significantly improved. The results of these two experiments again indicate that the detail features filtered by the spatial attention network allow the network to better locate object boundaries in the image.

V. CONCLUSION
In this article, we proposed the Spatial Attention based YOLOv3 (SA-YOLOv3) network. SA-YOLOv3 uses a spatial attention network to filter the low-level detail features during the top-down up-sampling process, retaining the more valuable detail features. The filtered detail features are then concatenated with the high-level semantic features to obtain feature maps with both spatial detail and rich semantic information. Experimental results show that SA-YOLOv3 outperforms YOLOv3 while maintaining a similar speed.