Video Object Detection Guided by Object Blur Evaluation

In recent years, excellent image-based object detection algorithms have been transferred directly to video object detection. These frame-by-frame processing methods are suboptimal owing to degenerate object appearance such as motion blur, defocus and rare poses. Existing works on video object detection mostly focus on feature aggregation at the pixel level and the instance level, but the impact of blur on the aggregation process has not been well exploited so far. In this article, we propose an end-to-end blur-aid feature aggregation network (BFAN) for video object detection. The proposed BFAN focuses on the aggregation process as influenced by blur, including motion blur and defocus, and achieves high accuracy with little additional computation. In BFAN, we evaluate the object blur degree of each frame and use it as the weight for aggregation. Notably, the background is usually flat, which has a negative impact on the evaluation of the object blur degree. Therefore, we introduce a light saliency detection network to alleviate the background interference. Experiments conducted on the ImageNet VID dataset show that BFAN achieves state-of-the-art detection performance of 79.1% mAP, an improvement of 3 points over the video object detection baseline.


I. INTRODUCTION
Deep learning networks have achieved significant progress in object detection [1], [2]. Compared to still image object detection, video object detection is more challenging because drastic appearance variation occurs across video frames. A common appearance variation is blur, which can decrease the detection accuracy to a great extent. As shown in Figure 1, both motion blur and defocus hinder accurate inference by the still image detector [3]. Therefore, exploiting the blur information is beneficial to video object detection.
The state-of-the-art video object detection algorithms can be categorized into two types: box-level post-processing and feature aggregation. In the early stage, the box-level post-processing methods combined a CNN-based still image detector with a tracker operating on the detected bounding boxes [4]. These methods first applied the still image object detector, and then manipulated the detected bounding boxes across the temporal dimension as a dedicated post-processing step. T-CNN [5] and D&T [6] both improved the detection accuracy over the still image object detector by optimizing the detected bounding boxes. However, these methods were not trained end-to-end, since the generation of the proposal boxes and the box-level post-processing are independent. Although these methods achieved promising results compared to the still image detectors, they were computationally expensive. To tackle the misalignment among adjacent frames [7], the other solution, based on feature aggregation, became the mainstream. These methods construct connections among frames at the feature level, including the pixel and instance levels. Owing to end-to-end training, the feature aggregation based methods achieve higher accuracy with higher efficiency than the former box-level post-processing methods. The proposed method belongs to the latter framework.
The associate editor coordinating the review of this manuscript and approving it for publication was Byung-Gyu Kim.
The existing feature aggregation based methods compensate for the misalignment among frames by aggregating the features of many adjacent frames. One critical issue is whether these frames should be treated equally. There are two existing solutions to this issue. One solution is to treat each frame equally and assign all frames the same weight. The other is to adopt a light network that learns the weights in the training process. Both solutions lack special consideration of the blur influence. The blur influence has been considered to treat consecutive frames discriminatively for saliency detection [8]. To address the limitation of the existing methods, we focus on the blur influence in the feature aggregation process and develop a blur-aid feature aggregation network (BFAN) for video object detection. In our opinion, frames in which the object appearance deteriorates due to motion blur or defocus should be assigned low weights. Conversely, frames whose object appearance is clear should be assigned high weights. Specifically, we evaluate the object blur degree of each frame and use it as the weight of the frame. In this way, the frames whose object appearance is clear contribute more to the result than those whose object appearance is blurry. Moreover, we are only concerned with the blur degree of the objects, not the whole frame. The background is usually sophisticated, which could disturb the object blur degree evaluation. A light saliency detection network is introduced to alleviate the negative impact of the background [9]. We thereby only evaluate the blur degree of the objects to be detected in each frame.
FIGURE 1. The detection results from Faster R-CNN (the top row) and BFAN (the bottom row) on two blurry frames. (a) and (b) are frames suffering from motion blur and defocus, respectively. Faster R-CNN fails to produce a prediction for the motion blur case, and it gives a wrong prediction for the defocus case. In comparison, the proposed method succeeds in the inference with high confidence.
The main contributions of this article can be summarized as follows:
• We advocate a novel BFAN focusing on the blur influence in video object detection. The frames whose object appearance is clear contribute more to the result than those whose object appearance is blurry. In this way, blurry objects are easier to detect under the guidance of other clear frames.
• Because we only care about the blur degree of the objects, the background interference could decrease the detection accuracy. We adopt a light saliency detection network to alleviate the background interference.
• The experiments on the VID dataset show that the proposed method achieves state-of-the-art performance of 79.1% mAP, an increase of 3% over the adaptive weight video object detection baseline.
The remainder of this article is organized as follows. Section II gives a brief review of the existing object detection algorithms and blur estimation methods. Section III describes our implementation in detail. The experimental results and the ablation analysis are shown in Section IV. Finally, the conclusion is provided in Section V.

II. RELATED WORK
In this section, we first review the still image and video object detection algorithms, and then give a short review of the blur mapping methods.

A. STILL IMAGE OBJECT DETECTION
We have witnessed the great success of CNNs in various domains such as image classification [10], object detection [11] and image restoration [12]. Object detection for still images has been one of the most successful domains so far. R-CNN [13] first used a CNN to extract features, followed by SVM classification, and it greatly improved the accuracy compared to traditional detection methods such as DPM [14]. The subsequent Fast R-CNN [15] and Faster R-CNN [3] introduced ROI pooling and put forward the region proposal network, respectively, to accelerate detection and improve accuracy. These three detection algorithms belong to the two-stage framework: they generate proposal bounding boxes and then classify the corresponding features. In contrast, YOLO [16] and SSD [17] are representatives of the one-stage framework. They predict the categories and the locations without proposal bounding boxes, which is more efficient.
VOLUME 8, 2020

B. VIDEO OBJECT DETECTION
In contrast to the datasets for still image object detection such as PASCAL VOC [18] and COCO [19], the first dataset for video object detection, ImageNet VID [20], was introduced in 2015. The early algorithms paid attention to box-level post-processing. T-CNN [5] and D&T [6] introduced tracking on the detected bounding boxes. Because the object detection on each frame and the subsequent tracking are independent, these methods were difficult to train end-to-end and were computationally expensive. Afterwards, Zhu et al. put forward the first two feature aggregation methods, DFF [21] and FGFA [22]. These two methods dug into the detection network and were more efficient because of end-to-end training. DFF and FGFA both introduced flow-guided warping to optimize the detection process, but they pursued high efficiency and high accuracy, respectively. DFF improved the feature extraction process: it utilized the flow to obtain the features of the non-key frames from the key frame, which saved much time. FGFA, in contrast, focused on improving accuracy by aggregating the features of adjacent frames via flow. The subsequent methods MANet [23] and MFCN [24] took instance-level features into consideration to boost the performance. STMM [25] and LWDN [26] adopted memory networks such as LSTM to balance accuracy and speed. STSN [27] and SSVD [28] considered the sampling stream rather than the motion stream via deformable convolution [29]. The above video object detection algorithms exploited the motion stream and the sampling stream at the pixel level and the instance level for compensation. However, no design dedicated to the blur influence has been proposed so far.

C. BLUR MAPPING
One important step of our algorithm is to evaluate the blur degree of the objects, not the whole frame. Therefore, the existing works on the blur degree of the whole frame are not suitable for our work. We utilize a blur mapping method to label every pixel as either blurry or non-blurry, and then extract the object parts guided by the saliency detection. The first representative dataset for blur mapping was proposed by Shi et al. [30], and it contains two types of blur: motion blur and out-of-focus blur. The subsequent blur mapping methods focused on either out-of-focus blur [31] or motion blur [32]. Most existing blur mapping algorithms are based on hand-crafted features, and thus they are not robust enough to discriminate truly blurry regions from naturally flat regions such as the sky. Furthermore, the proposed method in this article is based on CNNs, so a CNN-based blur mapping method is essential for end-to-end training. Ma et al. proposed a deep blur mapper (DBM) [33] to robustly separate the truly blurry regions, including motion blur and out-of-focus blur, from the whole image. Moreover, DBM is a fully convolutional network, which can be utilized as one part of our whole network in the end-to-end training process.

III. PROPOSED METHOD

A. OVERVIEW
As shown in Figure 2, the overall architecture of BFAN is based on the pixel-level feature aggregation framework [22]. Our model contains a feature extraction backbone N_feat that generates the deep intermediate features, a flow network N_flow that calibrates the features from the supporting frames such as f_{t−τ} and f_{t+τ}, a module that produces the weights for the frames as shown in Figure 2 (b), and a detection head N_det that gives the final detection results including categories and locations. The input frames are partitioned into the reference frame I_t and the supporting frames I_{t−τ} and I_{t+τ}. The supporting frames provide information to boost the detection performance for the reference frame.
Our main contribution is a feature aggregation based detection framework guided by the object blur evaluation. We illustrate the proposed model in three stages: 1. the pixel-level feature aggregation for detection; 2. the weight calculation guided by the object blur evaluation; 3. the object blur evaluation calibrated by the saliency detection. In the following subsections, we describe these three parts in detail.

1) PIXEL LEVEL AGGREGATION
It is well accepted that there is movement among the frames of a video. To efficiently utilize the information of the adjacent frames, a flow network [34] is used to compensate for the misalignment due to the movement. Given a reference frame I_t and a supporting frame I_{t−τ}, the flow network N_flow estimates the flow field M_{t−τ→t} = N_flow(I_{t−τ}, I_t). The flow field M_{t−τ→t} predicts the displacement from each pixel in I_{t−τ} to the corresponding pixel in I_t. Therefore, the feature of the supporting frame is warped to the reference frame according to the flow field as follows:
f_{t−τ→t} = W(f_{t−τ}, M_{t−τ→t})    (1)
where f_{t−τ→t} is the warped feature from frame I_{t−τ} to I_t and W(·) denotes the bilinear warping function.
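The warping step can be illustrated with a minimal pure-Python sketch on a single-channel feature map. It is an illustration under stated assumptions, not the implementation used in BFAN: the function names `bilinear_sample` and `warp` are hypothetical, the flow is assumed to be stored per reference location as an offset `(dx, dy)` pointing into the supporting feature map, and samples falling outside the map are assumed to contribute zero.

```python
import math


def bilinear_sample(feat, x, y):
    """Bilinearly sample a 2-D map (list of rows) at real-valued (x, y).

    Assumption: locations outside the map contribute zero.
    """
    height, width = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    # Visit the four integer neighbours surrounding (x, y).
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= xx < width and 0 <= yy < height:
                weight = (1 - abs(x - xx)) * (1 - abs(y - yy))
                value += weight * feat[yy][xx]
    return value


def warp(feat, flow):
    """Warp a supporting-frame feature map onto the reference frame.

    Assumed convention: flow[y][x] = (dx, dy) means that reference location
    (x, y) reads the supporting feature at (x + dx, y + dy).
    """
    height, width = len(feat), len(feat[0])
    return [
        [
            bilinear_sample(feat, x + flow[y][x][0], y + flow[y][x][1])
            for x in range(width)
        ]
        for y in range(height)
    ]


if __name__ == "__main__":
    feature = [[1.0, 2.0], [3.0, 4.0]]
    half_pixel_right = [[(0.5, 0.0)] * 2 for _ in range(2)]
    print(warp(feature, half_pixel_right))
```

In a full network the same sampling is applied to every channel of the feature tensor, and the flow field is first resized to the feature resolution.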
With the warped features of the supporting frames, we have accumulated the information from nearby frames for the reference frame. These features from adjacent frames provide much useful information to make up for the weaknesses of the reference frame such as rare poses and blurry appearance. For aggregation, there are two common solutions. One is to assign each feature the same weight, i.e., to treat all features equally. The other is to assign each feature a different weight. One representative of the second solution adopts a tiny network to predict the weight for each frame, with the parameters of the tiny network optimized in the training process. Different from this adaptive weight, we introduce the object blur evaluation to guide the weight, as illustrated in the next subsection. Finally, the aggregated feature is fed into the detection network N_det to produce the categories and locations of the objects.
FIGURE 2. The proposed BFAN architecture. We take three frames t−τ, t, t+τ as an example for illustration. (a) is the whole architecture. Each frame is fed into the feature extraction network N_feat to obtain its own deep convolutional feature f_{t−τ}, f_t, f_{t+τ}. The flow network N_flow is utilized to get the flow map between two frames, which is used to compensate for the motion misalignment by WARP, yielding f_{t−τ→t} and f_{t+τ→t}. The warped features f_{t−τ→t}, f_{t+τ→t} and the feature of the reference frame f_t are assigned the weights ω_{t−τ}, ω_{t+τ}, ω_t, respectively. (b) generates the weights ω_{t−τ}, ω_t, ω_{t+τ} for frames I_{t−τ}, I_t, I_{t+τ}, respectively. Each frame is fed into the blur mapping network N_blur and the saliency detection network N_saliency simultaneously, as shown in (c). The blur map M_blur is dot multiplied (⊗) by the saliency map M_saliency to obtain the calibrated blur map M_blur_cali. Then a step function with threshold 0.5 is utilized for binarization. The sum of the whole binarized blur map is used as the calibrated blur value Vcb of the frame. Lastly, all calibrated blur values are normalized and mapped into [0,1] by the softmax function to obtain the weights ω_{t−τ}, ω_t, ω_{t+τ}.
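The aggregation itself is a weighted sum of the warped feature maps. The sketch below illustrates it for single-channel maps stored as nested lists; the helper names `aggregate` and `equal_weights` are hypothetical, and the weights are assumed to be already computed (equal weights for the first solution, blur-guided weights for BFAN).

```python
def aggregate(warped_features, weights):
    """Return the weighted sum of equally sized 2-D feature maps."""
    if len(warped_features) != len(weights):
        raise ValueError("one weight is required per feature map")
    height, width = len(warped_features[0]), len(warped_features[0][0])
    aggregated = [[0.0] * width for _ in range(height)]
    for feat, weight in zip(warped_features, weights):
        for y in range(height):
            for x in range(width):
                aggregated[y][x] += weight * feat[y][x]
    return aggregated


def equal_weights(k):
    """Weights for the 'treat all frames equally' solution over 2K + 1 frames."""
    count = 2 * k + 1
    return [1.0 / count] * count


if __name__ == "__main__":
    clear_frame = [[3.0, 3.0]]
    blurry_frame = [[1.0, 1.0]]
    # The clearer frame receives the larger weight and dominates the result.
    print(aggregate([blurry_frame, clear_frame], [0.25, 0.75]))
```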

2) WEIGHT GUIDED BY THE OBJECT BLUR EVALUATION
As shown in Figure 2 (b), each frame is fed into the combine network N_combine (described in detail in Figure 2 (c)) to obtain a value that stands for the blur degree of the frame. Because the raw values are too large to be mapped into [0,1] directly by the softmax function, we first normalize all values, where Vcb denotes the calibrated blur value and VcbNorm is the normalization result for Vcb. VcbNorm is then converted by the softmax function to the weight ω for each frame.
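The conversion from calibrated blur values to weights can be sketched as follows. The normalization shown here (division by the largest value in the supporting set) is an assumption made only for illustration, and the function names are hypothetical. Without such a step, raw pixel counts in the hundreds or thousands would make the softmax saturate towards a one-hot weight vector.

```python
import math


def normalize(values):
    """Scale values into [0, 1] by the largest value (assumed normalization)."""
    peak = max(values)
    if peak == 0:
        return [0.0] * len(values)
    return [value / peak for value in values]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)
    exponentials = [math.exp(value - shift) for value in values]
    total = sum(exponentials)
    return [e / total for e in exponentials]


def blur_weights(vcb_values):
    """Map the calibrated blur values Vcb of the frames to weights ω."""
    return softmax(normalize(vcb_values))


if __name__ == "__main__":
    # Hypothetical Vcb values for frames t−τ, t and t+τ.
    print(blur_weights([1200, 300, 1200]))
```

The resulting weights sum to one, and frames with a larger Vcb receive a larger share.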

3) OBJECT BLUR EVALUATION CALIBRATED BY THE SALIENCY DETECTION
The details of N_combine are shown in Figure 2 (c). Each frame is fed into the blur mapping network N_blur and the saliency detection network N_saliency simultaneously. The blur mapping network N_blur labels each pixel as either blurry or non-blurry. However, we only care about the blur degree of the objects, so the background interference should be excluded. Therefore, a saliency network N_saliency is adopted to extract the region of interest, which alleviates the background interference to a great extent. The blur map is calibrated via dot multiplication with the saliency map.
With the calibrated blur map M_blur_cali, we utilize a step function with threshold 0.5 (see Figure 2) for binarization. Finally, all pixels of the binarized map are accumulated over the whole map to obtain Vcb in Figure 2 (b).
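The three steps of this subsection (dot multiplication, binarization and accumulation) can be sketched as below. Several details are assumptions made for illustration: both maps are taken to hold values in [0, 1], a pixel is counted when the product reaches the threshold (the choice of ≥ rather than > is ours), and the function name `calibrated_blur_value` is hypothetical.

```python
def calibrated_blur_value(blur_map, saliency_map, threshold=0.5):
    """Compute Vcb: the count of pixels whose calibrated blur value passes the step.

    blur_map and saliency_map are equally sized 2-D lists with values in [0, 1]
    (assumed). Their element-wise product is the calibrated blur map.
    """
    total = 0
    for blur_row, saliency_row in zip(blur_map, saliency_map):
        for blur, saliency in zip(blur_row, saliency_row):
            calibrated = blur * saliency  # dot multiplication
            if calibrated >= threshold:   # step function (assumed inclusive)
                total += 1                # accumulation over the whole map
    return total


if __name__ == "__main__":
    blur = [[0.9, 0.9], [0.2, 0.8]]
    saliency = [[1.0, 0.1], [1.0, 0.7]]
    # The top-right pixel has a high blur-map value but lies in the background,
    # so the saliency map suppresses it.
    print(calibrated_blur_value(blur, saliency))
```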

B. MODEL ARCHITECTURE
The proposed BFAN contains five essential subnetworks: the feature extraction network, the flow estimation network, the blur mapping network, the saliency detection network and the detection network. Firstly, the feature extraction network extracts the deep features from the input frames. Secondly, the flow estimation network estimates the flow field between two arbitrary frames to obtain the warped features. Thirdly, the blur mapping network and the saliency network extract the blur map and the saliency map, respectively. The saliency map is utilized to alleviate the background interference in the object blur evaluation, which yields the weights for the frames. Finally, the warped features multiplied by the corresponding weights are aggregated and fed into the detection network, and the objects of interest are obtained. Designing these five subnetworks is out of the scope of this article, and there are many existing works focusing on each field. We hence employ existing networks directly, and describe them below.

1) FEATURE EXTRACTION NETWORK
We choose ResNet-101 [35] as the feature extraction network. To extract the features for the subsequent process, we remove the last average pooling and fully-connected layers. Following the same strategy as in [22], we enlarge the resolution of the feature maps by changing the stride of the first convolutional layers in conv5 from 2 to 1. Furthermore, the dilation of these convolutional layers is set to 2 to keep the receptive field.

2) FLOW ESTIMATION NETWORK
There are many existing works focusing on flow estimation, such as FlowNet [34], FlowNet2 [36] and PWC-Net [37]. Since the state-of-the-art video object detection algorithms mostly use FlowNet (the simple version), we follow the same strategy for fairness. As the resolution of the output flow field does not match the resolution of the feature maps from the feature extraction network, we resize the flow field to match the feature maps.

3) BLUR MAPPING NETWORK
Our goal is to evaluate the blur degree of the object, where the blur includes motion blur and out-of-focus blur. The existing blur mapping algorithms are mostly designed for either motion blur or out-of-focus blur, which cannot meet our demand. We choose the deep blur mapper (DBM) [33] as the blur mapping network, because DBM is able to detect motion blur and out-of-focus blur at the same time, and it is robust enough to distinguish out-of-focus regions from flat regions. Furthermore, DBM is an end-to-end fully convolutional network, which is convenient for training. Therefore, DBM is the best choice for the proposed model.

4) SALIENCY DETECTION NETWORK
To alleviate the impact of the background, we introduce a saliency detection network. Most saliency detection networks are computationally expensive [38]-[40], and thus they are unsuitable for the proposed method. We choose a light saliency detection network, CSNet [41]. CSNet reduces the representation redundancy with a flexible convolutional module, gOctConv, and it achieves comparable performance with only 0.2% of the parameters. The experimental results show that CSNet improves the detection performance of BFAN with very little additional computation. In Figure 3, the original images and their corresponding blur maps, saliency maps and calibrated blur maps are listed from the top row to the bottom row. The cars in (b) and (d) both contain motion blur, unlike the cars in (a) and (c). Accordingly, the blur maps in (b) and (d) are darker than those in (a) and (c). The saliency maps in the third row alleviate the background interference, so the calibrated blur maps in the bottom row only reflect the object blur degree. Therefore, the frames in (a) and (c) are assigned higher weights in feature aggregation.

5) DETECTION NETWORK
We mainly use Faster R-CNN [3] as our default detection network. Different from the original setting in Faster R-CNN, we choose 12 anchors for each position in the Region Proposal Network (RPN). The 12 anchors include 3 aspect ratios {1:2, 1:1, 2:1} and 4 scales {64², 128², 256², 512²}. We choose 300 anchors for each frame with an NMS threshold of 0.7 in the training and inference processes. Finally, the ROI-Align layer followed by a 1024-D fully-connected layer after the conv5 stage is utilized for classification.
For the supporting frames range K, BFAN sequentially processes the frames with a 2K + 1 range as the supporting frame set. We construct a buffer to store the feature maps and the value of the calibrated blur map (Vcb) of each frame in the supporting frame set. For the first K frames and the last K frames, we replicate the first frame and the last frame, respectively, to fill the buffer. At the beginning, we extract the feature maps and Vcb of the first K + 1 frames to initialize the buffer (L2-L6 in Algorithm 1). Moreover, we replicate the feature maps and Vcb of the first frame K times so that the buffer contains 2K + 1 frames (L8-L9 in Algorithm 1). With the initialized buffer, BFAN sequentially processes the video frames (L11-L17 in Algorithm 1) and updates the buffer (L18-L23 in Algorithm 1). For the i-th reference frame, the feature maps of the supporting frames are warped to the reference frame (L12 in Algorithm 1). The warped features are aggregated with the corresponding weights (L14-L16 in Algorithm 1). Finally, the aggregated feature is fed into the detection network to obtain the categories and locations of the objects (L17 in Algorithm 1).
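The buffer handling described above can be sketched with a fixed-length queue. For clarity the sketch stores frame indices only; in the real pipeline each entry would hold the feature maps and Vcb of that frame. The generator name `supporting_windows` is hypothetical, and the sketch is a reading of the description rather than a transcription of Algorithm 1.

```python
from collections import deque


def supporting_windows(num_frames, k):
    """Yield (reference index, buffer contents) for every frame of a video.

    The buffer always holds 2K + 1 entries. The first frame is replicated at
    the start and the last frame at the end, as described in the text.
    """
    last = num_frames - 1
    # Initialise with the first K + 1 frames, preceded by K copies of frame 0.
    first_entries = [min(i, last) for i in range(k + 1)]
    buffer = deque([0] * k + first_entries, maxlen=2 * k + 1)
    for reference in range(num_frames):
        yield reference, list(buffer)
        # Slide the window: push the next frame, or repeat the last frame
        # once the end of the video has been reached.
        buffer.append(min(reference + k + 1, last))


if __name__ == "__main__":
    for reference, window in supporting_windows(num_frames=5, k=2):
        print(reference, window)
```

Because `deque(maxlen=...)` discards the oldest entry on every append, each frame is pushed into the buffer only once, which is what makes caching its features and Vcb worthwhile.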

D. COMPLEXITY ANALYSIS
According to Algorithm 1, we analyze the complexity of BFAN. Aside from the feature extraction network N_feat, BFAN contains the following modules: the flow estimation network N_flow, the bilinear warping function W, the blur mapping network N_blur, the saliency detection network N_saliency, the weight calculation (dot multiplication, step function, accumulation, normalization and softmax) and the detection network N_det. For the supporting frames range K, the complexity of the proposed method is given in (4), where O measures the complexity. Compared to the still image detector, the ratio of the complexity of BFAN to that of Faster R-CNN can then be formed. Typically, the complexity of N_det, W and the weight calculation can be ignored compared to N_feat, and the ratio is approximated accordingly. Therefore, the increased computational cost mainly comes from the blur mapping network, the saliency detection network and the flow estimation network. The blur mapping network and the flow estimation network are both fully convolutional networks of nearly the same complexity. The complexity of these two networks is much lower than that of N_feat in general [22]. As for N_saliency, it is a very light network whose complexity is much lower than that of N_feat. As shown by the execution times in Table 2, the increased computational time is affordable.
Algorithm 1 (excerpt):
12: Warp the feature to the reference frame
13: end for
14: Calculate ω for each frame
15: Select the corresponding weight for each frame

A. EXPERIMENT SETUP
We evaluate the proposed method on the prevalent large-scale dataset for video object detection, the ImageNet VID dataset. It contains 3862 training, 555 validation and 937 test video snippets, all of which have been fully annotated. The frame rate of most video snippets is 25 or 30 fps. Following the strategy in [22], we train on the combination of the DET training set and the VID training set. The DET training set contains 200 classes. The VID training set contains 30 classes, which are a subset of the categories in the DET training set. We hence only extract the annotations of the same 30 classes from the DET training set for training. The validation set is used for mean average precision (mAP) evaluation.
We train the proposed model using the PyTorch [42] framework on a PC with one Xeon E5-25678 v2 @2.50GHz CPU and four NVIDIA 2080Ti GPUs. The input images are all resized so that the shorter side is 600 pixels. The model is trained on 4 GPUs. Each GPU holds one mini-batch, and each mini-batch contains one set of images for one reference frame. We utilize SGD to optimize the network for 120K iterations in total. The learning rate is set to 1 × 10^−3 for the first 80K iterations and 1 × 10^−4 for the last 40K iterations. In the training phase, we take two random frames in the 2K + 1 range to increase the robustness. In the inference phase, we set K = 9, that is, the features of 19 frames are aggregated for the detection on the reference frame. For simplicity, we do not use post-processing methods such as Seq-NMS to refine the detection results, because they are not our emphasis.
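The training schedule and the sampling of supporting frames can be sketched as follows. The step schedule follows the numbers stated above with zero-based iteration indices. The sampling helper is an assumption: the text does not state whether the two frames are drawn with replacement or whether the reference frame itself may be drawn, so the sketch allows both. Both function names are hypothetical.

```python
import random


def learning_rate(iteration):
    """Step schedule: 1e-3 for the first 80K iterations, then 1e-4 up to 120K."""
    return 1e-3 if iteration < 80_000 else 1e-4


def sample_supporting_frames(reference, num_frames, k, rng):
    """Draw two random frame indices from the 2K + 1 range around the reference.

    Assumptions: the range is clipped to the video boundaries, and the draw is
    with replacement and may return the reference frame itself.
    """
    low = max(0, reference - k)
    high = min(num_frames - 1, reference + k)
    return [rng.randint(low, high) for _ in range(2)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(learning_rate(0), learning_rate(100_000))
    print(sample_supporting_frames(reference=50, num_frames=300, k=9, rng=rng))
```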

B. RESULTS
We compare the proposed method with the state-of-the-art video object detection algorithms, as shown in Table 1. The algorithms listed in Table 1 all use ResNet-101 as the feature extraction backbone. Moreover, no post-processing steps such as Seq-NMS are utilized, for fairness. The proposed method is built on FGFA [22]. Compared to the baseline FGFA, BFAN makes progress in 25 categories and improves the mAP over all classes by 2.8%. D&T [6] combines the detection and tracking algorithms, and it falls behind the other end-to-end algorithms except DFF [21]. DFF is designed for high speed, and thus it sacrifices precision. MANet [23] is also based on FGFA and introduces instance-level feature aggregation. SCNet [43] achieves 77.9% with the scale-aware module and the coupling-structure ROI module. However, BFAN still outperforms MANet and SCNet with only the object blur evaluation.
TABLE 1. Quantitative results on the ImageNet VID validation set. The mAP for each class in the VID dataset is listed, as well as the mAP for all classes. The feature extraction backbone is denoted as N_feat. R101 is short for ResNet-101.

C. EXECUTION TIME
We test the execution time on the ImageNet VID dataset. We select 100 images with a resolution of 1280 × 720, and calculate the average execution time for processing the 100 images. Different from the training phase, we test the execution time on a PC with an Intel i5-9600K @3.7GHz CPU, 16GB RAM and one NVIDIA 1080 GPU. The execution time only includes the process of running the network, without other processes such as decoding the input images. As shown in Table 2, Faster R-CNN and DFF only consider the reference frame, and thus they are faster. BFAN and FGFA both take 19 frames as the supporting frames for the reference frame. The process of feature aggregation takes more time compared to the still image detector. BFAN is built on FGFA by adding the blur evaluation guided weight calculation. As a result, BFAN improves the mAP by 2.8% with only 3% more running time, which is worthwhile.

D. ABLATION STUDY
Table 3 lists the comparison between the proposed method and its variants. Because we modify the network based on FGFA, we list the performance of FGFA as Method (b). Moreover, we replace the adaptive weight with average accumulation as Method (a) for comparison. Note that the item ''end-to-end training'' only refers to whether the loaded pre-trained parameters of the blur mapping network and the saliency detection network are optimized in the training process.
Method (a) is a variant of FGFA. We remove the adaptive weight part and assign each feature the weight 1/(2K + 1). We find that the mAP for all categories decreases by only 0.2%, which indicates that the adaptive weight has a limited effect. The adaptive weight improves the detection precision significantly for the fast motion cases, but it brings a small drop for the slow and medium motion cases. Therefore, some existing video object detection algorithms [23] adopt the simple average accumulation instead of the adaptive weight for simplicity.
Method (b) is FGFA. Compared to Method (a), FGFA increases the detection precision for all categories by only 0.2%. This indicates that the most important part of FGFA is the flow motion guided feature aggregation, and that the adaptive weight has a limited effect. We hence preserve the flow motion guided feature aggregation backbone, and propose a more effective method for weight assignment.
Method (c) introduces the blur mapping network to calculate the blur degree of each frame. The pre-trained parameters of the blur mapping network are optimized in the training process. Although Method (c) underperforms FGFA for fast motion, it improves the mAP for all categories by 2.4% compared to the adaptive weight in FGFA.
Method (d) introduces the saliency detection network to alleviate the background interference. With the help of the saliency detection network, Method (d) is able to focus on the region of interest, i.e., the objects. Method (d) also improves the performance compared to Method (a) and Method (b), which indicates that the background interference is harmful to detection.
Method (e) adopts both the blur mapping network and the saliency detection network. However, the pre-trained parameters of these two sub-networks are frozen in the training process. Although the parameters of these sub-networks cannot be optimized for the video object detection task, Method (e) still outperforms Method (c) and Method (d), which use only the blur mapping network or only the saliency detection network. It can be inferred that the combination of blur mapping and saliency detection achieves the goal of evaluating the object blur degree without background interference.
Method (f) is the proposed BFAN, which unfreezes the pre-trained parameters of the blur mapping network and the saliency detection network on the basis of Method (e). It increases the mAP by 3% to 79.1% compared to Method (a). The improvements for the slow motion and fast motion cases are both significant, which indicates that BFAN is more balanced than FGFA.
To sum up, aggregating the feature maps from adjacent frames guided by the object blur degree evaluation is more effective than the adaptive weight module in FGFA. The combination of the blur mapping network and the saliency detection network achieves the goal of alleviating the background interference. Through the above modules, the mAP for all categories is improved by 2.8% to 79.1%.

E. AGGREGATION FRAMES ANALYSIS
We explore the influence of the number of supporting frames in the testing phase, as shown in Table 4. We tried 3, 7, 11, 15, 19 and 23 frames in inference, using 2 frames in training and ResNet-101 as the backbone. As expected, the detection accuracy improves as more frames are aggregated in inference. However, the execution time also increases as more frames are taken into consideration. The results in Table 4 show that the improvement saturates at 23 frames while much more time is taken. We hence select 19 frames for the balance between accuracy and running speed.
Figure 4 shows visual examples on the ImageNet VID dataset. We compare three methods: Faster R-CNN (still image detector), FGFA (video detector baseline) and the proposed BFAN. In principle, FGFA introduces the flow motion compensation and the adaptive weight module into Faster R-CNN, and the proposed BFAN replaces the adaptive weight module with the weight guided by object blur evaluation. Faster R-CNN detects an incorrect ''bicycle'' in the first two frames, and FGFA also fails in (b) and (c). The proposed BFAN not only detects the correct ''motorcycle'' in all five frames, but also gives very high confidence scores compared to Faster R-CNN and FGFA.

G. LIMITATION
The proposed method may fail when the object in the input frame is too blurry to be recognized. As shown in Figure 5, the dog in the top row becomes more and more blurry from left to right. Although the proposed BFAN succeeds in the first frame, it detects a squirrel by mistake in the second frame. The dog in the third frame is too blurry for BFAN to give the correct detection result. The core idea of BFAN is to adopt the strong features of the clear object appearance in adjacent frames to make up for the weak features of the current frame. However, BFAN may fail to give the correct results when the object appearance in most adjacent frames is too blurry to provide strong features. Another failure case is due to severe occlusion, as shown in the bottom row of Figure 5. The zebra in the first frame is detected successfully, but the zebras in the latter two frames are occluded by the pillar. BFAN fails to detect the zebra due to the weak features affected by the occlusion. A possible solution is to train the network with more blurry and occluded object cases to improve the robustness.

V. CONCLUSION
In this article, we propose a video object detection algorithm guided by the object blur degree evaluation. We improve the weight assignment for the aggregated frames with the blur prior. In particular, a blur mapping network is introduced to label each pixel as either blurry or non-blurry. Because we only care about the blur degree of the objects and not the background, a saliency detection network is adopted to focus on the objects. The blur map calibrated by the saliency map, which focuses on the object blur degree, is used to calculate the weight for each frame. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art video object detection algorithms with an affordable increase in computation. However, the blur mapping and saliency networks may fail in some unusual cases where the objects are too small to be distinguished, which can be improved in future work. Furthermore, another important degenerate factor in video object detection is rare poses. We will design a special module to tackle rare poses in the future, which would be beneficial to the accuracy of video object detection.