Constrained Image Splicing Detection and Localization With Attention-Aware Encoder-Decoder and Atrous Convolution

Constrained image splicing detection and localization (CISDL) is a newly formulated image forensics task and plays an important role in verifying the generating process of a forged image. CISDL conducts dense matching between two investigated images and detects whether one image has forged regions pasted from the other. In this work, we introduce a novel attention-aware encoder-decoder deep matching network named as AttentionDM for CISDL. An encoder-decoder with atrous convolution is newly designed for hierarchical features dense matching and fine-grained masks generation. A novel attention-aware correlation computation module is built on normalization operations and informative features recalibration with channel attention blocks. Last but not least, VGG and ResNets are respectively formulated as feature extractors for comprehensive comparisons in CISDL. Extensive experiments demonstrate the superior performance of AttentionDM over the state-of-the-art methods.


I. INTRODUCTION
Malicious image forgery has become a global epidemic in recent years, owing to the rapidly declining cost of digital cameras and the quick development of sophisticated image editing tools [1]. Forgers may use forged images to produce fake news, spread rumors or give false testimony, which results in negative social impacts [2]. Image forensics, which seeks to distinguish forged images and prevent forgers from using them for unscrupulous business or political purposes [3], has attracted great attention in the research and industrial communities [4].
A variety of image forensics methods investigate an individual image and detect its high-level [5]-[8] or low-level inconsistencies caused by image manipulation [1], [2], [9]. However, it is still challenging to accurately distinguish forged images, due to advanced image manipulation techniques and the limited information provided by a single image [2], [3]. Moreover, these methods identify forged images or regions without providing the source of the forged regions or the specific tampering process, although such auxiliary evidence can provide more clues and make results more convincing in real applications [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Napoletano.
Considering the aforementioned limitations, constrained image splicing detection and localization (CISDL) was newly formulated in the Media Forensics Challenge [10], [11]. Different from ''conventional'' splicing detection, ''constrained'' means that the inputs are two images: one is a probe image and the other is a potential donor image. CISDL is depicted in Figure 1: given a probe image P and a potential donor image D, CISDL aims to detect whether a region of D has been spliced into P, and consequently to provide mask images P_m and D_m indicating the region(s) of P that were spliced from D. In [12], Wu et al. proposed the pioneering CISDL approach, i.e., Deep Matching and Validation Network (DMVN). DMVN generates correlation maps by comparing high-level low-resolution feature maps of VGG [13], and constructs an inception-based mask deconvolution module [14] to locate suspected regions. However, low-resolution feature maps restrict DMVN's ability to detect accurate boundaries and find small suspected regions. In [10], we proposed a deep matching network based on atrous convolution (DMAC) to generate high-quality candidate masks from high-resolution feature maps. The basic DMAC architecture achieves significant improvements over DMVN, and a hybrid adversarial learning framework was proposed to further optimize the pretrained DMAC. Although massive computations are needed in the adversarial learning procedure, the performance cannot be dramatically improved. Besides, the simple scalar product in the correlation computation of DMVN and DMAC still limits the discriminative capability of deep matching.
FIGURE 1. Overview of the proposed AttentionDM. In Encoder, three pairs of feature maps with the same size are generated by integrating atrous convolution. Each pair of feature maps is processed by L2 normalization and channel attention. The correlation computation with pre/post-normalization and attention blocks is called attention-aware correlation computation. Then, we use Decoder with ASPP (Atrous Spatial Pyramid Pooling) to generate fine-grained masks.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
In this work, we propose a novel attention-aware encoder-decoder deep matching network named AttentionDM, as shown in Figure 1. AttentionDM adopts an encoder-decoder architecture with atrous convolution for fine-grained mask generation. Different from our previous work [10], we construct a decoder with alternating convolutional and upsampling operations to recover the spatial information, instead of the single bilinear upsampling layer at the end of DMAC. During correlation computation, normalization operations are adopted to limit feature values to certain ranges and filter redundant features. A novel channel attention block is proposed to highlight channel-wise informative features, so that a weighted scalar product can be conducted in the correlation computation procedure. To the best of our knowledge, we are the first to use an attention mechanism [15] in CISDL and to adjust it to fit our task, e.g., by constructing an embedding network to extract channel-wise high-order features and integrating it with our skip architecture [10], [16].
The main contributions of the proposed AttentionDM are fourfold:
• An encoder-decoder deep matching network with atrous convolution is newly designed for CISDL.
• Attention-aware correlation computation is proposed based on hierarchical feature normalization operations and channel attention blocks.
• ResNets are formulated as the feature extractor in the CISDL task for the first time. Extensive comparisons are conducted between VGG and ResNets.
• Extensive experiments on public datasets demonstrate the superior performance of our AttentionDM.

II. RELATED WORK
Image Splicing Detection and Localization have been widely studied in recent years. Because the forger generically tries to satisfy both high-level and low-level consistency constraints during image manipulation, tampering detection and localization can also be conducted from two levels.
In terms of high-level constraints, different cues can be investigated, e.g., blur type inconsistency [6], shadow and lighting inconsistency [7], and traces of perspective and geometry [8]. In certain contexts, forgery detection and localization methods based on high-level traces can achieve excellent performance; however, they cannot adapt well to complex and practical scenarios [7]. Low-level signatures are exploited in a more general way, e.g., photo-response non-uniformity noise [17], color filter array artifacts [18], JPEG coding traces [19], and steganalysis features [20]. There are also many studies targeted at one specific type of forgery, e.g., copy-move [1], [21] and seam carving [22]. Despite the tremendous progress so far, much potential and many more discoveries lie ahead: owing to the breakthrough in deep learning [23], [24], many CNN-based methods have been investigated and achieve significant improvements [2], [5], [9], [25]. However, as described in Section I, these methods all investigate individual images and cannot provide the source of the spliced regions.
Image Matching using global features extracted by Convolutional Neural Networks (CNNs) has attracted much attention [26], [27]. These methods conduct dense comparisons on high-level features extracted by CNNs. A variety of networks have been proposed for estimating inter-frame motion in videos [28] or for instance-level homography estimation [29]. These methods attempt to find high-precision correspondences between images, but only need to search surrounding areas with limited appearance variation and background clutter. In addition, some deep matching methods were proposed for long-range category-level matching [27], [30]. These methods aim at finding objects of the same category with similar appearance, which is quite different from CISDL [10]. Wu et al. were the first to utilize deep matching techniques to solve the CISDL problem, and proposed DMVN [12]. In [10], we took advantage of advanced techniques in fully convolutional neural networks [31] and proposed the DMAC network. Both DMVN and DMAC adopt a naive scalar product operation to compute the correlation maps between the two investigated images, and we make use of the attention mechanism to enhance this procedure.
Attention Mechanism can be viewed as a strategy to bias the allocation of available processing resources to the most informative components of an input signal [32]. It has been widely applied to recurrent neural networks (RNN) and long short term memory (LSTM) [33] to tackle sequential decision tasks. In [34], a sequence-to-sequence task was formulated as an encoder-decoder network, in which a source sentence was encoded into a fixed-length vector and then fed into a decoder network. To solve the bottleneck problem of the fixed-length vector, Bahdanau et al. [35] proposed to utilize the attention mechanism to dynamically generate the vectors. Consequently, a variety of attention-based models were proposed to solve the sequence-to-sequence tasks [15]. The attention mechanism is also applicable to image and video problems in computer vision [36], e.g. image classification [37], object detection [38], video classification [32], etc.

III. ATTENTIONDM
A. ENCODER-DECODER WITH ATROUS CONVOLUTION
1) BASIC ATTENTIONDM WITH VGG
The encoder-decoder with atrous convolution in AttentionDM is motivated by its great success in the semantic segmentation task [31], [39]. In our work, we investigate a decoder architecture with ASPP to capture multi-scale features and gradually generate fine-grained masks from correlation maps. The correlation maps are computed from hierarchical large feature maps of an encoder with atrous convolution and a skip architecture.
Our detailed encoder-decoder architecture and parameter settings are presented in Figure 2. The encoder is a variant of VGG [13] with atrous convolution. Let $y(i_s, j_s)$ denote the output of the atrous convolution of a 2-D input signal $x(i_s, j_s)$; the atrous convolution can be computed as:
$$y(i_s, j_s) = \sum_{k_1, k_2} x(i_s + r_s \cdot k_1,\; j_s + r_s \cdot k_2)\, \varphi(k_1, k_2), \tag{1}$$
where $k_1, k_2 \in [-\lfloor K/2 \rfloor, \lfloor K/2 \rfloor]$ ($\lfloor \cdot \rfloor$ is a floor function), $\varphi(k_1, k_2)$ denotes a $K \times K$ filter, and the rate $r_s$ denotes the sampling stride of the input signal. Different from the original VGG, we remove the last two max-pooling layers, and adopt atrous convolution with $r_s = 2$ in the last convolutional block to keep the original field-of-view. A skip architecture is used to capture hierarchical information, and three groups of feature maps with the same scale are generated, i.e., $F_3$, $F_4$, $F_5$, which are all used for correlation computation. In the decoder, the atrous rates are set to {6, 12, 18}, and the resulting feature maps are concatenated and fed into the subsequent layers, which are constructed from alternating convolutional and upsampling layers to gradually recover high-resolution fine-grained masks.
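To make the atrous convolution above concrete, the following is a minimal pure-Python sketch rather than the network's actual implementation. It assumes an odd filter size K, zero padding at the borders, and nested lists as the signal representation; the function names `atrous_conv2d` and `effective_extent` are hypothetical.

```python
def atrous_conv2d(x, phi, rate):
    """Atrous convolution of a 2-D nested list `x` with a K x K filter `phi`.

    Filter offsets k1, k2 run over [-floor(K/2), floor(K/2)] and are scaled
    by `rate` before indexing the input. Samples that fall outside the input
    are treated as zero (zero padding is an assumption of this sketch).
    """
    height, width = len(x), len(x[0])
    half = len(phi) // 2  # floor(K / 2); K is assumed to be odd
    y = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for k1 in range(-half, half + 1):
                for k2 in range(-half, half + 1):
                    ii, jj = i + rate * k1, j + rate * k2
                    if 0 <= ii < height and 0 <= jj < width:
                        acc += x[ii][jj] * phi[k1 + half][k2 + half]
            y[i][j] = acc
    return y


def effective_extent(filter_size, rate):
    """Number of input samples spanned by one filter axis at a given rate."""
    return rate * (filter_size - 1) + 1


# A 3 x 3 filter at rate 2 spans 5 input samples per axis, which is how the
# field-of-view can be kept after max-pooling layers are removed.
signal = [[float(5 * i + j) for j in range(5)] for i in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
dilated = atrous_conv2d(signal, ones, rate=2)
```

With rate 1 the function reduces to an ordinary (zero-padded) convolution, so the rate is the only thing that changes between the standard and atrous cases.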

2) ATTENTIONDM BASED ON ResNet
It is well known that deeper convolutional neural networks can significantly improve performance in computer vision tasks, e.g., image classification, object detection and semantic segmentation [40]. However, previous CISDL methods [10], [12], [41] all use VGG16 as the basic feature extractor, and it is unknown whether deeper networks can improve the deep matching performance. Thus, we formulate the popular ResNet50 and ResNet101 [42] as the feature extractor, and utilize their hierarchical features by integrating atrous convolution. The detailed architectures with atrous convolution are presented in Table 1. In this formulation, we still get three sets of feature maps with the same size, i.e., $F_3$, $F_4$ and $F_5$. As in the basic VGG architecture, these features are fed into the correlation layers and mask generation layers.

B. ATTENTION-AWARE CORRELATION COMPUTATION
As shown in Figure 1, the key component of our AttentionDM is the attention-aware correlation computation module. In fact, it consists of three parts, i.e., attention blocks, normalization operations and correlation computation. We first introduce our channel attention block. Then, we present the whole correlation computation procedure with normalization and attention operations.

1) CHANNEL ATTENTION BLOCK
Suppose we have two groups of c-channel h×w feature maps $\bar{F}^{(k)} \in \mathbb{R}^{h \times w \times c}$ ($k \in \{1, 2\}$), and we flatten these feature maps to d-dimensional ($d = h \times w$) feature vectors and get $d \times c$ feature matrices $\bar{F}^{(k)}_{flat}$. As these feature vectors have high dimensions and contain strong spatial information, we propose to use a two-layer embedding network to extract high-order low-dimensional features. Following the definition in attention-based sentence embedding in natural language processing [33], our feature extraction network is called the embedding network, and it extracts embedded features:
$$E^{(k)} = W_{E_2}\, \delta\big(W_{E_1} \bar{F}^{(k)}_{flat} + b_{E_1}\big) + b_{E_2}, \tag{2}$$
where $W_{E_1}$, $W_{E_2}$ denote the parameter matrices of two linear layers with corresponding bias terms $b_{E_1}$ and $b_{E_2}$, and $\delta$ refers to a ReLU function. $r$ is a reduction ratio, and is set to 4 in our work. By constructing an embedding network, we get $d_e \times c$ embedded features $E^{(k)}$, with $d_e = d / r^2$. Channel attention then measures channel responses from $E^{(k)}$ (Eq. (3)), producing $a^{(k)}$, a c-dimensional vector which is designed to indicate channel relations. By multiplying $a^{(k)}$ and $\bar{F}^{(k)} \in \mathbb{R}^{h \times w \times c}$, we can recalibrate informative channels to improve the discriminative capability of features. Details are presented as follows.
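The shape bookkeeping of the two-layer embedding network can be sketched as follows. This is an illustrative pure-Python sketch under stated assumptions, not the authors' code: the weights are random stand-ins, the hidden width d/r is inferred from the reduction ratio and d_e = d/r^2, each bias is added per output row, and all function names are hypothetical. The channel response step (Eq. (3)) is intentionally left out.

```python
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(a_row[t] * b[t][col] for t in range(len(b)))
             for col in range(len(b[0]))] for a_row in a]


def linear(weight, bias, x):
    """Compute weight @ x and add bias[row] to every entry of that output row."""
    out = matmul(weight, x)
    return [[value + bias[row] for value in out_row]
            for row, out_row in enumerate(out)]


def relu(x):
    """Element-wise ReLU on a nested-list matrix."""
    return [[max(value, 0.0) for value in row] for row in x]


def embed(f_flat, w1, b1, w2, b2):
    """Two linear layers with a ReLU in between: (d x c) -> (d / r^2 x c)."""
    return linear(w2, b2, relu(linear(w1, b1, f_flat)))


def random_matrix(rows, cols, rng):
    """Random stand-in values; real weights would be learned."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
h, w, c, r = 4, 4, 3, 4        # toy sizes; r = 4 follows the text
d = h * w                      # flattened spatial dimension
f_flat = random_matrix(d, c, rng)
w1, b1 = random_matrix(d // r, d, rng), [0.0] * (d // r)
w2, b2 = random_matrix(d // r ** 2, d // r, rng), [0.0] * (d // r ** 2)
embedded = embed(f_flat, w1, b1, w2, b2)   # shape: (d / r^2) x c
```

Each of the c channels keeps its own column throughout, so the embedding compresses the spatial dimension from d to d/r^2 while leaving the channel axis untouched.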

2) CORRELATION COMPUTATION
Let $F^{(1)}$, $F^{(2)}$ denote feature maps extracted by the encoder, and $f^{(1)}(i_1, j_1) \in F^{(1)}$, $f^{(2)}(i_2, j_2) \in F^{(2)}$ denote c-dimensional descriptors at specific coordinates. Before the attention and correlation computation, L2-normalization is conducted:
$$\bar{f}^{(k)}(i, j) = \frac{f^{(k)}(i, j)}{\big\| f^{(k)}(i, j) \big\|_2}. \tag{4}$$
By adopting L2-normalization, we restrict the value ranges of descriptors and obtain two normalized feature maps $\bar{F}^{(1)}$, $\bar{F}^{(2)}$. Then, we use $\bar{F}^{(k)}$ ($k \in \{1, 2\}$) to get channel attention weights $a^{(k)}$ based on Eq. (2) and Eq. (3). Thus, we can get channel attention weighted feature maps as follows:
$$\tilde{F}^{(k)} = a^{(k)} \otimes \bar{F}^{(k)}, \tag{5}$$
where $\otimes$ denotes channel-wise multiplication. By constructing attention blocks, we can recalibrate informative features and improve the discriminative capability of features. Channel attention has been adopted for enhancing image classification in [37], where an SE block is constructed for each convolutional block, whereas we only modulate three groups of hierarchical features for the following correlation computation. Thus, we can afford to construct a more complex architecture as in Eq. (2) and Eq. (3), instead of the lightweight gating mechanism in [37].
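The normalization and channel reweighting steps can be illustrated with a short sketch. It is a minimal example under assumptions: feature maps are nested lists indexed as `feature_map[i][j][channel]`, the attention weights are taken as given (how they are computed is not reproduced here), a small `eps` guards against division by zero, and all function names are hypothetical.

```python
import math


def l2_normalize(descriptor, eps=1e-12):
    """Scale a c-dimensional descriptor to unit L2 norm."""
    norm = math.sqrt(sum(value * value for value in descriptor))
    return [value / (norm + eps) for value in descriptor]


def normalize_feature_map(feature_map):
    """Apply L2 normalization to the descriptor at every spatial location."""
    return [[l2_normalize(desc) for desc in row] for row in feature_map]


def apply_channel_attention(feature_map, weights):
    """Multiply channel m of every descriptor by weights[m]."""
    return [[[value * a for value, a in zip(desc, weights)] for desc in row]
            for row in feature_map]


def dot(u, v):
    """Scalar product of two descriptors."""
    return sum(a * b for a, b in zip(u, v))


# Toy 1 x 1 feature maps with c = 2 channels and assumed attention weights.
f1 = normalize_feature_map([[[3.0, 4.0]]])
f2 = normalize_feature_map([[[4.0, 3.0]]])
g1 = apply_channel_attention(f1, [1.0, 0.5])
g2 = apply_channel_attention(f2, [1.0, 0.5])
weighted_similarity = dot(g1[0][0], g2[0][0])
```

Because both descriptors are reweighted before the scalar product, channel m contributes with weight a1[m] * a2[m], which is the sense in which the correlation becomes a weighted scalar product.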
The correlation maps $C^{(12)}$ are generated by comparing $\tilde{f}^{(1)}$ and $\tilde{f}^{(2)}$ under strong spatial restrictions:
$$C^{(12)}(i_1, j_1, m_{12}) = \big\langle \tilde{f}^{(1)}(i_1, j_1),\, \tilde{f}^{(2)}(i_2, j_2) \big\rangle, \tag{6}$$
in which the channel index $m_{12}$ identifies the translation $(i_t, j_t)$ that maps $(i_1, j_1)$ to $(i_2, j_2)$. All the compared feature locations in the same channel $m_{12}$ are under the same translation $(i_t, j_t)$, with $i_1, i_2, i_t \in [0, h)$ and $j_1, j_2, j_t \in [0, w)$. With the correlation maps $C^{(12)}$ at hand, match pooling operations are conducted to suppress uncorrelated information in $C^{(12)}$. Average, maximum and sorted correlation maps are generated:
$$C^{(12)}_{avg}(i_{12}, j_{12}, 0) = \frac{1}{h \cdot w} \sum_{p_{12}} C^{(12)}(i_{12}, j_{12}, p_{12}), \tag{8}$$
$$C^{(12)}_{max}(i_{12}, j_{12}, 0) = \max_{p_{12}} \big\{ C^{(12)}(i_{12}, j_{12}, p_{12}) \big\}, \tag{9}$$
$$C^{(12)}_{srt}(i_{12}, j_{12}, p) = C^{(12)}(i_{12}, j_{12}, p_t), \quad p_t \in \mathrm{Top\_T\_index}\big(\mathrm{sort}_{p_{12}}(\mathrm{sum}(C^{(12)}(:, :, p_{12})))\big), \tag{10}$$
where $\mathrm{Top\_T\_index}(\cdot)$ denotes the function which selects the indexes of the top-T values. Finally, we get the output feature maps $\hat{C}^{(12)} = \{C^{(12)}_{avg}, C^{(12)}_{max}, C^{(12)}_{srt}\}$, with $\hat{C}^{(12)} \in \mathbb{R}^{h \times w \times (T+2)}$, in which 2 dimensions are the average and max correlation maps, and the other T dimensions are the sorted maps. The aforementioned procedure (Eq. (6)-(10)) is denoted as:
$$\hat{C}^{(12)} = \mathrm{Corr}(\tilde{F}^{(1)}, \tilde{F}^{(2)}).$$
Three pairs of feature maps (at different levels as shown in Figure 2; ResNets are shown in Table 1), i.e., $F_3$, $F_4$ and $F_5$, are used as inputs. The generated raw correlation maps $C^{(k)}$ are followed by a ReLU layer to zero out negative values. The consideration is that features of closely correlated regions should have the same sign, thus the correlation values should be positive. We zero out negative values to discard weakly correlated regions and reduce computational costs. Then these maps are processed by L2-normalization (Eq. (4)) to get the normalized correlation maps, i.e., $\bar{C}^{(k)} = \mathrm{L2\_norm}(\max(C^{(k)}, 0))$, which gives $\bar{C}^{(1)}$ and $\bar{C}^{(2)}$ for the two input images $I^{(1)}$ and $I^{(2)}$. Finally, $\bar{C}^{(k)}$ are fed into the decoder based on ASPP introduced in Section III-A to generate the final mask.
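The correlation and match pooling steps can be sketched in pure Python as below. This is a hedged illustration, not the paper's implementation: it assumes that translations wrap around cyclically and that the channel index is the row-major index of the translation (`m = i_t * w + j_t`), since the exact index formula is not reproduced here. Feature maps are nested lists indexed as `f[i][j][channel]`, and the function names are hypothetical.

```python
def correlate(f1, f2):
    """Scalar-product correlation between two h x w x c feature maps.

    Channel m of the output collects all location pairs related by the same
    translation (it, jt); cyclic wrap-around and m = it * w + jt are
    assumptions of this sketch.
    """
    h, w = len(f1), len(f1[0])
    corr = [[[0.0] * (h * w) for _ in range(w)] for _ in range(h)]
    for it in range(h):
        for jt in range(w):
            m = it * w + jt
            for i in range(h):
                for j in range(w):
                    a = f1[i][j]
                    b = f2[(i + it) % h][(j + jt) % w]
                    corr[i][j][m] = sum(x * y for x, y in zip(a, b))
    return corr


def match_pool(corr, top_t):
    """Keep the average, the maximum, and the top-T channels ranked by sum."""
    h, w, n = len(corr), len(corr[0]), len(corr[0][0])
    channel_sums = [sum(corr[i][j][p] for i in range(h) for j in range(w))
                    for p in range(n)]
    top = sorted(range(n), key=lambda p: channel_sums[p], reverse=True)[:top_t]
    pooled = []
    for i in range(h):
        row = []
        for j in range(w):
            values = corr[i][j]
            row.append([sum(values) / n, max(values)]
                       + [values[p] for p in top])
        pooled.append(row)
    return pooled


# Toy 2 x 2 maps with c = 2 channels; each output location has T + 2 values.
f_a = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
f_b = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]
pooled = match_pool(correlate(f_a, f_b), top_t=1)
```

The raw correlation volume has h * w channels per location, so match pooling is what keeps the decoder input at a fixed depth of T + 2 regardless of the feature map size.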

IV. EXPERIMENTS
A. STEP-BY-STEP ANALYSES OF LOCALIZATION PERFORMANCE
Localization performance is evaluated on our released synthetic testing foreground pairs [43]. According to the ratios $r_{pa}$ of the pasted areas, the image pairs are divided into three sets, namely Difficult (1% ≤ $r_{pa}$ < 10%), Normal (10% ≤ $r_{pa}$ < 25%), and Easy (25% ≤ $r_{pa}$ < 50%). For each set, 3000 image pairs are generated with annotated ground-truth masks. Localization performance is evaluated by the pixel-level IoU (Intersection over Union) [40], MCC (Matthews Correlation Coefficient), and NMM (Nimble Mask Metric) [11]. We compute the average IoU, MCC and NMM of all the tested image pairs. Note that since the state-of-the-art CISDL methods [10], [12], [41] all adopt VGG as the basic feature extractor, the default AttentionDM adopts VGG and is directly denoted as ''AttentionDM''. Models using ResNet50/ResNet101 are denoted as ''AttentionDM-ResNet50/ResNet101''.
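For reference, the pixel-level IoU and MCC can be computed from the confusion counts of a predicted binary mask against its ground truth. The sketch below uses the standard definitions of both metrics; the handling of degenerate cases (two empty masks, a zero denominator) is an assumption of this example, and NMM is omitted. Masks are nested lists of 0/1 values and the function names are hypothetical.

```python
import math


def confusion(pred, truth):
    """Count true/false positives and negatives over all pixels."""
    tp = fp = fn = tn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def iou(pred, truth):
    """Intersection over Union of the two foreground regions."""
    tp, fp, fn, _ = confusion(pred, truth)
    union = tp + fp + fn
    return tp / union if union else 1.0  # two empty masks: assumed perfect


def mcc(pred, truth):
    """Matthews Correlation Coefficient of the pixel-wise classification."""
    tp, fp, fn, tn = confusion(pred, truth)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


predicted = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [1, 0]]
scores = (iou(predicted, ground_truth), mcc(predicted, ground_truth))
```

Averaging these per-pair values over a test set gives dataset-level scores of the kind reported for the Difficult, Normal and Easy sets.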
Compared with our previous work [10], three major improvements are made. In this section, we conduct a detailed step-by-step analysis on the synthetic testing sets:
• Firstly, we test the effectiveness of the proposed encoder-decoder architecture with atrous convolution, i.e., ''Encoder-Decoder'' in Table 2. ''Encoder-Decoder'' directly utilizes convolutional features extracted from the encoder for correlation computation without L2-normalization and channel attention. ''Encoder-Decoder'' can already achieve better performance than DMVN and DMAC. In particular, a huge leap is achieved on the Difficult set, which demonstrates its great ability to detect small regions.
• Then, we attempt to normalize input hierarchical features, and process computed correlation maps using ReLU and L2-normalization, i.e., ''Encoder-Decoder-Norm''. The localization scores are further improved, and the basic encoder-decoder architecture with normalization builds a solid foundation for our outstanding performance.
• Finally, we evaluate our channel attention block by building it on ''Encoder-Decoder-Norm'', i.e., ''AttentionDM''. Figure 3 provides IoU scores across training iterations on the Difficult set. Our channel attention block yields consistent improvements over the basic architecture both in the training procedure (Figure 3) and in the final scores (Table 2). The proposed channel attention block can measure channel relations and recalibrate channel-wise informative features, and it integrates tightly with our correlation computation and skip architectures. AttentionDM only builds three channel attention blocks (Figure 1), with a slight increase in the number of parameters, while achieving a steady improvement in localization scores.
TABLE 2. Step-by-step analyses on the synthetic testing sets.

1) SLIDING WINDOW BASED MATCHING STRATEGY EVALUATION
In [10], we proposed a sliding window based matching strategy to process high-resolution images. At the cost of efficiency (refer to Table 4 in [10]), DMAC with this strategy can achieve results similar to our Encoder-Decoder-Norm model. We therefore also evaluate the effectiveness of this strategy on our AttentionDM model. Since this strategy contains two stages, i.e., sliding matching and resizing matching, the methods adopting this strategy are annotated with the postfix ''-SR256/128'', where ''256/128'' denotes the sliding stride. Since the computational complexity grows rapidly as the sliding stride decreases, we only test strides 256 and 128, following the experiments in [10]. The results are shown in Table 3. With the help of this strategy, the localization scores are clearly improved on the Difficult set. On the Normal and Easy sets, the sliding window based versions also achieve comparable performance. Because AttentionDM already achieves superior performance and has a great ability to detect small regions and accurate boundaries, the score improvements brought by sliding window based matching are not as large as for DMAC, e.g., IoU gains of 0.6911 − 0.5433 = 0.1478 vs. 0.7608 − 0.7228 = 0.0380 on the Difficult set. Even so, our sliding window based version achieves the best performance in detecting small regions.

2) ResNet FEATURE EXTRACTOR EVALUATION
AttentionDM with different feature extractors and the corresponding sliding window versions are evaluated on the synthetic testing sets, as shown in Table 4. With more complicated feature extractors, the localization performance improves slightly, but the number of parameters and the computing time increase noticeably. In particular, AttentionDM-ResNet50 and AttentionDM-ResNet101 achieve comparable performance, while AttentionDM-ResNet101 has many more parameters, so we do not recommend the use of ResNet101 for CISDL. In image classification and related tasks, deeper networks can provide richer high-order semantic information. In the CISDL task, however, this high-order information is not very useful, and discriminative features with more spatial information are more helpful. For comprehensive comparisons, we compare the IoU scores of CISDL methods and our variants on the synthetic testing sets in Figure 4. It can be clearly seen that AttentionDM achieves a significant performance leap, while there is no big gap between the VGG version and the ResNet versions.

3) COMPLEXITY ANALYSES
The testing time, numbers of parameters and implementation frameworks are reported in Table 5. All the experiments are conducted on a machine with an Intel(R) Core(TM) i7-5930K CPU @ 3.50GHz, 64GB RAM and a single GPU (TITAN X). With slightly more parameters, AttentionDM is slightly slower than DMAC, but it achieves significant performance improvements. With more complicated feature extractors and more parameters, the ResNet versions are slower than the VGG version, i.e., AttentionDM.

B. DETECTION PERFORMANCE COMPARISONS
We evaluate the detection performance of AttentionDM on three datasets. (1) The paired CASIA dataset: In [12], the authors generated 3,642 positive samples by pairing the 1,821 spliced images in the CASIA TIDEv2.0 dataset with their true donor images, and collected 5,000 negative samples by randomly pairing 7,491 color images from the same CASIA-defined content category. Because it lacks ground-truth masks, this dataset is used for evaluating detection performance only [12]. (2) The MFC2018 dataset: There are 1,327 positive image pairs and 16,673 negative pairs in the evaluation dataset of Media Forensics Challenge 2018 [11]. Detection performance is quantitatively evaluated, while the localization performance is evaluated by visual comparisons because of the large number of negative pairs and some imperfect ground truths. (3) The PS-Battles dataset: It has 11,142 subsets, each consisting of an original image and several corresponding derivative fake images, for a total of 102,028 images collected from the online Reddit community of Photoshop battles [44]. The detection performance is measured by the precision, recall, F1-score, AUC (Area Under Curve), EER (Equal Error Rate) and detection rate [10].
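As a reference for the threshold-based detection metrics listed above, the following sketch computes precision, recall and F1-score from per-pair forged probabilities using their standard definitions. The decision threshold of 0.5, the toy labels and scores, and the function name are assumptions of this example, not values from the experiments.

```python
def precision_recall_f1(labels, scores, threshold=0.5):
    """Standard precision, recall and F1 for binary pair-level decisions.

    `labels` holds 1 for positive (spliced) pairs and 0 for negative pairs;
    a pair is predicted positive when its score reaches `threshold`.
    """
    tp = fp = fn = 0
    for label, score in zip(labels, scores):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: three positive pairs and two negative pairs.
toy_labels = [1, 1, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
toy_metrics = precision_recall_f1(toy_labels, toy_scores)
```

On a heavily imbalanced set, such as one with far more negative than positive pairs, precision is the metric most sensitive to false alarms.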

1) CASIA
In [12], the authors compared DMVN with copy-move forgery detection methods [45]-[48] because of the lack of other CISDL methods. In the comparison on CASIA, we directly borrow their scores from [12]. Using our AttentionDM, the forged probabilities are computed as follows: for each generated mask, we compute the average score $s^{(k)}$ ($k \in \{1, 2\}$) of the detected regions, and the final forged probability is computed as their mean value $(s^{(1)} + s^{(2)})/2$. As shown in Table 6, detection scores on CASIA have been lifted to a new level by AttentionDM. All the scores of AttentionDM, i.e., precision, recall, F1-score, and AUC, are greater than 0.9. AttentionDM achieves the highest recall, F1-score and AUC. Visual comparisons are provided in Figure 5, in which AttentionDM achieves clearly better performance. It has a strong ability to detect small regions and accurate boundaries, and it is robust to deformation and rotation changes. While the ResNet50 version achieves higher precision with lower recall, its F1-score and AUC are slightly lower than those of the VGG version. The detection performance of the ResNet101 version is even worse. Since the majority of images in CASIA are smaller than 512 × 512, we do not test the sliding strategy on this dataset [10].
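The forged probability described above can be sketched in a few lines. This is a minimal illustration under assumptions: masks are nested lists of per-pixel scores in [0, 1], a pixel counts as detected when its score exceeds a threshold of 0.5, and a mask with no detected pixel contributes a score of 0. The threshold, the empty-mask convention and the function names are hypothetical choices, not taken from the paper.

```python
def region_score(mask, threshold=0.5):
    """Average score over the detected pixels of one mask.

    A pixel is treated as detected when its score exceeds `threshold`
    (an assumed rule); an empty detection yields 0.0 by convention.
    """
    detected = [value for row in mask for value in row if value > threshold]
    return sum(detected) / len(detected) if detected else 0.0


def forged_probability(probe_mask, donor_mask, threshold=0.5):
    """Mean of the two per-mask scores, i.e. (s1 + s2) / 2."""
    s1 = region_score(probe_mask, threshold)
    s2 = region_score(donor_mask, threshold)
    return (s1 + s2) / 2


probe = [[0.9, 0.1], [0.7, 0.2]]   # detected pixels: 0.9 and 0.7
donor = [[0.6, 0.0]]               # detected pixel: 0.6
probability = forged_probability(probe, donor)
```

Averaging over detected pixels only, rather than over the whole mask, keeps the score from shrinking when the spliced region is small.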

2) MFC2018
Since experiments on the CASIA dataset have demonstrated the superiority over conventional copy-move forgery detection methods, which have high computational complexity [49], we compare AttentionDM with DMVN and DMAC in the experiments on MFC2018. AUC and EER scores on MFC2018 are shown in Table 7. AttentionDM achieves the highest AUC score, and AttentionDM-ResNet50 achieves the lowest EER score. Visual comparisons are provided in Figure 6. AttentionDM can generate more accurate boundaries and detect small regions. Besides, since the majority of images in MFC2018 have high resolutions, we also compare the sliding window matching versions. The results show that, with the help of sliding window matching, the performance of AttentionDM is further improved.
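AUC and EER can both be computed directly from the forged probabilities of positive and negative pairs. The sketch below is illustrative only: AUC uses the pairwise ranking formulation, and EER is approximated at the observed threshold where the false negative and false positive rates are closest, which is a simplification compared with interpolating the ROC curve. The function names and the toy scores are hypothetical.

```python
def auc(pos_scores, neg_scores):
    """Probability that a positive pair scores above a negative pair.

    Ties count as half a win (pairwise ranking formulation of AUC).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def eer(pos_scores, neg_scores):
    """Approximate Equal Error Rate over the observed score thresholds."""
    best_gap, best_rate = None, None
    for thr in sorted(set(pos_scores) | set(neg_scores)):
        fnr = sum(p < thr for p in pos_scores) / len(pos_scores)
        fpr = sum(n >= thr for n in neg_scores) / len(neg_scores)
        gap = abs(fnr - fpr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fnr + fpr) / 2
    return best_rate


toy_pos = [0.9, 0.4]
toy_neg = [0.6, 0.1]
toy_auc, toy_eer = auc(toy_pos, toy_neg), eer(toy_pos, toy_neg)
```

A higher AUC and a lower EER both indicate better separation between positive and negative pairs, which is why the two can favor different models when their score distributions differ in shape.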

3) PS-BATTLES
Images in the PS-Battles dataset are elaborately designed and edited by amateur or professional digital artists, and have been uploaded to the online Reddit community. There are many challenging image pairs which can evaluate the effectiveness of CISDL methods. Since all the image pairs are correlated, i.e., in each pair one image is a fake image and the other is its source image, we use the detection rate to evaluate the compared methods, as shown in Table 8. Visual comparisons are provided in Figure 7. It can be seen that AttentionDM and its sliding-window version achieve the highest detection rates, while the ResNet versions have lower detection rates. Although its detection rate is slightly lower, the ResNet50 version has better localization performance according to Figure 7, whereas AttentionDM-ResNet101 has a lower detection rate, higher computational complexity and unremarkable localization performance. According to our experiments on four datasets, we can conclude that AttentionDM and AttentionDM-ResNet50 achieve comparable detection performance, while AttentionDM-ResNet50 achieves better localization performance. We do not recommend the use of ResNet101 in the CISDL task.

C. IMPLEMENTATION DETAILS
AttentionDM is implemented in PyTorch and is trained using a single spatial cross-entropy loss. The parameters in the encoder are initialized using VGG16/ResNet50/ResNet101 models pretrained for image classification. Three epochs of training are conducted with a batch size of 24, amounting to more than 129,000 iterations. The Adadelta optimizer is adopted with PyTorch default settings. AttentionDM is trained on synthetic training image pairs.
We automatically generate over one million synthetic training image pairs from the MS COCO dataset [50]. We randomly select one annotated region in one image, apply different transformations, and paste it into another randomly selected image. Five types of transformations are adopted, i.e., shift U(−256, 256), rotation U(−30, 30), scale U(0.5, 4), luminance U(−32, 32), and deformation U(0.5, 2) changes. Specifically, all pasted regions undergo the shift change, and each of the other types of transformations is applied with a probability of 50%. The forged regions in synthetic images may therefore undergo several types of transformations. The areas of the selected regions are all larger than 1% and smaller than 50% of the image, because extremely small regions are too difficult to detect and excessively large regions are meaningless. Finally, over one million training pairs are generated, consisting of 1/3 foreground pairs, 1/3 background pairs and 1/3 negative pairs.
FIGURE 7. Visual comparisons on the PS-Battles dataset.
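The sampling scheme for the transformations can be sketched as follows. This is a hedged illustration rather than the actual data generation code: it assumes the shift is drawn independently for the horizontal and vertical directions, it does not interpret the units of each range, and the names `OPTIONAL_RANGES`, `sample_transform` and `area_is_valid` are hypothetical. Only the parameter sampling is shown, not the image warping or pasting.

```python
import random

# Uniform ranges from the text for the transformations applied with
# probability 0.5 each; the shift is handled separately because it is
# always applied.
OPTIONAL_RANGES = {
    "rotation": (-30, 30),
    "scale": (0.5, 4),
    "luminance": (-32, 32),
    "deformation": (0.5, 2),
}


def sample_transform(rng):
    """Draw one set of transformation parameters for a pasted region."""
    # Sampling a separate shift per axis is an assumption of this sketch.
    params = {"shift": (rng.uniform(-256, 256), rng.uniform(-256, 256))}
    for name, (low, high) in OPTIONAL_RANGES.items():
        if rng.random() < 0.5:
            params[name] = rng.uniform(low, high)
    return params


def area_is_valid(region_area, image_area):
    """Accept regions covering more than 1% and less than 50% of the image."""
    ratio = region_area / image_area
    return 0.01 < ratio < 0.5


rng = random.Random(0)
samples = [sample_transform(rng) for _ in range(200)]
```

Because each optional transformation is drawn independently, a single pasted region receives on average two of the four optional changes in addition to the shift.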

V. CONCLUSION
In this work, an attention-aware encoder-decoder deep matching network named AttentionDM is proposed for CISDL. An encoder-decoder architecture with atrous convolution is constructed for dense matching of hierarchical features and fine-grained mask generation. Normalization operations are designed to reinforce the convolutional features and correlation features. A channel attention block is proposed for channel-wise feature recalibration to enhance correlation computation. Extensive experiments verify that AttentionDM achieves a significant performance leap. To the best of our knowledge, AttentionDM achieves the best performance among all the published CISDL methods.
Although AttentionDM achieves excellent performance, it still has some limitations. For example, similar to its forerunners [10], [12], [41], it can only process fixed-size images, and the remedy offered by our sliding window based matching strategy results in high computational complexity. Besides, although AttentionDM achieves the lowest EER score on MFC2018, there are still many false alarms, which matters especially because the majority of investigated images are uncorrelated or unforged in real applications [10]. Objects with similar appearance can mislead our detection. There is still a long way to go before real application.