A Full-Image Full-Resolution End-to-End-Trainable CNN Framework for Image Forgery Detection

Due to limited computational and memory resources, current deep learning models accept only rather small images in input, calling for preliminary image resizing. This is not a problem for high-level vision problems, where discriminative features are barely affected by resizing. On the contrary, in image forensics, resizing tends to destroy precious high-frequency details, impacting heavily on performance. One can avoid resizing by means of patch-wise processing, at the cost of renouncing whole-image analysis. In this work, we propose a CNN-based image forgery detection framework which makes decisions based on full-resolution information gathered from the whole image. Thanks to gradient checkpointing, the framework is trainable end-to-end with limited memory resources and weak (image-level) supervision, allowing for the joint optimization of all parameters. Experiments on widespread image forensics datasets prove the good performance of the proposed approach, which largely outperforms all baselines and all reference methods.

A Full-Image Full-Resolution End-to-End-Trainable CNN Framework for Image Forgery Detection

Francesco Marra, Diego Gragnaniello, Luisa Verdoliva and Giovanni Poggi
Abstract-Due to limited computational and memory resources, current deep learning models accept only rather small images in input, calling for preliminary image resizing. This is not a problem for high-level vision problems, where discriminative features are barely affected by resizing. On the contrary, in image forensics, resizing tends to destroy precious high-frequency details, impacting heavily on performance. One can avoid resizing by means of patch-wise processing, at the cost of renouncing whole-image analysis.
In this work, we propose a CNN-based image forgery detection framework which makes decisions based on full-resolution information gathered from the whole image. Thanks to gradient checkpointing, the framework is trainable end-to-end with limited memory resources and weak (image-level) supervision, allowing for the joint optimization of all parameters. Experiments on widespread image forensics datasets prove the good performance of the proposed approach, which largely outperforms all baselines and all reference methods.
Index Terms-Digital image forensics, CNN, forgery detection.

I. INTRODUCTION
In this work, we propose a new framework for image forgery detection based on convolutional neural networks (CNN). This may not look particularly exciting: deep learning is bynow common practice to solve all kinds of vision-related problems. However, image forensics has some peculiarities that set it apart from standard computer vision problems. We can summarize them in the need to look, at the same time, at the whole image but also at its tiniest details. Consider the example of Fig.1. This well-crafted splicing does not show obvious artifacts that allow detection by visual inspection, but a suitable structural analysis reveals differences that may be due only to the insertion of alien material in the host image. Indeed, many state-of-the-art forensic tools rely on the statistical analysis of local micro-patterns. However, local analyses alone are necessarily suboptimal. Clues emerging from the whole image, and at multiple scales, should be combined and processed jointly to make a reliable decision. Therefore, our goal is to design CNN-based forensic tools that, overcoming current technological limitations, meet the contrasting requirements of full-resolution and full-image training and analysis.
It should be realized that this problem is indeed peculiar of multimedia forensics. Typical CNN classifiers for computer vision problems rely on macroscopic features, which bear high-level semantic clues on the scene. For example, F.Marra, D.Gragnaniello and G.Poggi are with the DIETI, L.Verdoliva is with the DII, Università degli Studi di Napoli Federico II, Naples, Italy. Email: {francesco.marra, diego.gragnaniello, verdoliv, poggi}@unina.it. a face detector may look for the presence of specific facial features with suitable spatial relationships. Such large-scale information persists nicely after resizing the image. And in fact, target images of wildly different sizes are routinely resized to match the input CNN layer. Actually, resizing is even used on purpose, during training, to gain robustness to scale changes. In the context of image forensics, instead, resizing may destroy the very same information classifiers rely upon, the pixel-level micro-patterns that characterize different digital histories. By analyzing such patterns one can identify camera models, individual devices, or discover the traces of out-camera processing. A huge scientific literature testifies on the importance of such high-frequency features. Hence, image resizing and resampling should be definitely avoided when performing forensic tasks.
So, one could naively think of using a network with an input size as large as the target image. Besides the lack of generality (images can be of any size) a more fundamental issue concerns computational and memory resources. Acquisition devices are continuously improving their resolution, with commercial smart-phone cameras delivering photos with many millions of pixels. Deep learning hardware capabilities do not increase at the same rate. Due to computation and memory limitations, state-of-the-art architectures accept only small images in input, especially when very deep networks are used. Therefore, the highly informative image samples cannot be directly fed to a network and analyzed as a whole.
Eventually, when high-resolution must be preserved, a simple solution is to perform patch-wise feature extraction, followed by some forms of feature aggregation to exploit the full-image information. This approach makes full sense, and largely predates deep learning. Yet, even with good CNN-based feature extractors and classifiers, it is inherently arXiv:1909.06751v1 [cs.CV] 15 Sep 2019 suboptimal for several reasons: i) poor feature extraction; ii) poor global decision; iii) need of over-detailed ground truth.
First of all, since the patch-wise feature extractor is trained without taking into account whole-image information, the best it can do is to learn good features for local decisions, which are not necessarily the best ones in view of future aggregation. Then, the global classifier, trained after freezing the patch-level processing, operates only on intermediate features, hence is necessarily suboptimal with respect to a classifier trained endto-end on the original data. Last, patch-wise training requires a detailed, handcrafted, ground truth. Therefore, the large datasets necessary to train deep learning models require a huge man-power and are inevitably affected by errors, with a sure impact on the eventual performance.
All these considerations motivate our work, and allow us to define the final goal more clearly. We want to design deep learning models for image forgery detection which are: 1) full-image: make decisions based on information gathered from all over the image; 2) full-resolution: do not perform any harmful image resizing; 3) end-to-end trainable: optimize jointly all model parameters for image-level classification, based only on imagelevel (weak) supervision. To achieve this goal, we propose a framework comprising three blocks in cascade performing, respectively, patch-wise feature extraction, image-wise feature aggregation, and global decision. All blocks are fully trainable, based on imagewise labels, allowing information to flow backward through the whole network. The global decision takes into account features extracted from the whole image, whatever its size, and based on local micro-patterns. Memory problems are solved by means of gradient checkpointing, with a very limited increase of computational costs.
With these solutions, the proposes framework allows one to optimize jointly the local information extraction, the global feature aggregation, and the whole-image classification, whatever the input image size. We implemented several versions of this general framework, through appropriate selection of the major architectural blocks. After training on suitable synthetic datasets, we performed extensive experiments on realistic datasets widespread in the image forensics community, focusing on local manipulations, such as splicings, copy-moves, and inpainting, likely indicators of malicious attacks. Results fully support our approach which largely outperforms both baseline methods and state-of-the-art references, including methods requiring strong supervision.
In the following, we analyze related work (Section II), describe the proposed approach (Section III), report on the results of numerical experiments (Section IV), and finally draw conclusions (Section V).

II. RELATED WORK
Forgery detection is a central topic in image forensics, and there is a large bulk of relevant literature. In addition, it is necessary to consider both forgery detection and localization, since these tasks are tightly related. Indeed, detection methods can be used for localization through sliding-window analysis, and localization method may allow detection by suitable postprocessing. So, to limit the scope, in the following analysis we take a historical perspective, but focus especially on recent CNN-based methods. Moreover, we neglect global manipulations, such as histogram equalization or gamma correction, as they can be hardly regarded as malicious forgeries.
Early contributions were mostly model-based, looking for statistical anomalies related to the color filter array (CFA) [1], [2], double JPEG compression [3], [4], or sensor noise [5], [6]. Most of these methods assume a priori the presence of a forgery and pursue localization through pixel-level analysis, generating a heat-map. Then, a global score can be easily computed from the latter and used for detection. Model-based approaches are elegant and do not require extensive training, but work only in quite restrictive hypotheses.
The advent of data-driven solutions granted a quantum leap in performance and ensured higher generality. Methods based on machine learning extract suitable hand-crafted features from the image, both in the spatial domain [7], [8], [9], [10], [11] and in the transform (DCT, wavelet) domain [12], [13], [14], which are used to train a classifier. Extracting features from the whole image allows direct and reliable image forgery detection. Instead, localization can be obtained by working in sliding-window modality and using a suitable local score. The most discriminative features rely on high-order image statistics which help revealing spatial inconsistencies originated by the presence of forgeries. To this end, high-pass residual images are often used, obtained by means of derivative filters [15] or image denoisers.
In recent years, methods based on deep learning have become dominant. Some early papers, inspired by the success of residual-based machine learning methods, propose CNN architectures with a first layer of high-pass filters, either fixed [16], [17], or trainable [18], meant to extract residual feature maps. In [19] it is even shown that successful methods based on hand-crafted features can be recast as CNNs and fine tuned for improved performance. In [20] these low-level features are augmented with high-level ones in a two-stream CNN architecture. Recent findings [21], [22], however, show that such constrained first layer is only useful with small networks and datasets. Given a suitably large training set, general-purpose very deep architectures provide the same good results in favourable cases, but ensure higher robustness to compression and training/test misalignments.
Several papers, to begin with [16], followed more recently by [23] and [24], train explicitly the net to distinguish between homogeneous and heterogeneous patches, the latter characterized by the presence of both pristine and forged areas. The rationale is to catch the patterns that characterize transitions regions, anomalous with respect to the background, so as to localize possible forgeries. This idea is followed also in [25], where an hybrid CNN-LSTM architecture is trained end-toend to produce a binary mask for forgery localization. These methods, however, require detailed ground truth maps to train the net, which may not be available or precise.
For architectural constraints, most of these methods carry out a patch-based analysis, working on relatively small patches, with further steps needed to compute a global score at image-level. In [16], for example, the CNN extract features patch-wise and later aggregates them in a global feature vector used to feed a SVM classifier. This may impact on detection performance. A more fundamental limit concerns the need of strongly aligned training and test sets. Some methods, e.g., [24], [25], carry out experiments on a single database split into training and test, others [20] require fine-tuning on target data. All this highlights the limited generalization ability of supervised learning, as also shown in [26].
A more promising line of research is to revisit the anomaly detection approach under a data-driven paradigm. Anomalies are detected by means of single-image analyses, with a sort of blind source identification. In [27] this was accomplished in a fully unsupervised fashion by using an autoencoder architecture. More recent proposals [28], [29], [30] use camera-model features, gathered off-line by dedicated CNNs, or leverage metadata information [31] for direct detection. A strong pro of this approach is that training is performed only on pristine images, with no need of aligned datasets and ground truths, which ensures good robustness and adaptability to unseen manipulations. In [29] and [31], in particular, this is achieved by using a Siamese training on pairs of patches extracted from pristine images, with a suitable consistency metric.
Besides its technical content, this short review of ideas makes clear that there is high and growing interest for new solutions in this field, to face the threats posed by increasingly sophisticated fake multimedia tools.
III. PROPOSED METHOD Our aim is to design a deep network to detect the presence of localized forgeries in a target image, irrespective of the image size and the forgery size. Of course, images can have wildly different sizes, depending also on the context, but the trend is towards higher and higher resolutions. Today's smartphones feature cameras with resolutions of 10 Mpixels and more. On the other hand, due to computation/memory bottlenecks, deep networks accept rather small images in input, for example 256×256-pixel. Hence, a strong size mismatch typically occurs between target image and network input. For most image analysis applications, this mismatch is not a big problem and two solutions can be considered: 1) images are rescaled to fit the network input, or 2) images are processed patch-wise, and results are fused off-line to make a global decision. In the following paragraphs, we first explain why such solutions are not viable for image forgery detection, then describe the proposed architecture, and finally show how it can be trained end-to-end based on the gradient checkpointing method.
A. The need for full-image full-resolution processing The first solution listed before is to rescale the image to fit the network first layer. However, this is not advisable when dealing with forgery detection. In some cases, the forged region could be so small to become practically undetectable after strong downsampling. A more fundamental problem, however, is that some sophisticated forgeries may only be detected based on the statistical analyses of micro-textures. However, these precious high-frequency components are strongly corrupted when the image is resized or resampled. Fig.2 shows a clear example, in which the markedly different textures highlighted in Fig.1, after resizing become very similar to one another and basically useless for forensic analyses.
The second solution is to perform patch-level detection, with no resampling, followed by some form of information fusion to make a global decision. Indeed, given an ideal patchlevel classifier, the fusion problem has an obvious solution, and the presence of a forgery can be declared if at least one forged patch is detected. However, real-world detectors are far from ideal, they always have non-zero missing-detection and false-alarm rates. For example, assuming a rather optimistic 1% patch-level false-alarm rate, and independent decisions, a 100-patch pristine image would present a false-alarm rate beyond 63%. Therefore, the fusion problem is not at all trivial with real-world detectors, as our experiments will confirm. In addition, the patch-level detector itself should be designed taking into account image-level performance.
These considerations motivate the need for a full-image fullresolution detector. In this way, precious microtextures can be preserved and, at the same time, information coming from all patches can be processed jointly to make a reliable decision. A naive implementation of this idea, with a CNN input size matching the image size, would require huge computational and memory resources, not to speak of the number of images needed for reliable training. Instead, we propose a suitable architecture that, through reasonable structural constraints, satisfies the needs of forensics detection with limited resources.

B. Proposed architecture
The proposed framework is represented pictorially in Fig.3. It consists of three blocks performing, respectively, patch-level feature extraction, feature aggregation, and decision. Note that, although we propose a specific implementation of such blocks, this is not the core of our proposal, which is instead the whole framework.
1) Patch-level feature extraction: after dividing the image in overlapping patches, these are processed to extract discriminative features. As feature extractors, we adopt some stateof-the-art deep networks, taking the output of the penultimate layer as feature vector, and discarding the final class probabilities. However, considering the peculiarities of image forgery detection, we modify the input layer to accommodate some additional inputs, the image noiseprint [30], besides the image color bands. Noiseprints are high-pass image residuals, extracted through a dedicated network, in which camera-related artifacts are emphasized. Therefore, they highlight possible spatial anomalies and may help detecting local manipulations. 2) Feature aggregation: the feature extractor produces a large number of features, which are aggregated image-wise to obtain a single descriptor for the classification task. To this end, we consider several forms of pooling, maximum, minimum, average, and average of squares: is the C-component feature extracted from the i-th patch, N p is the number of (possibly overlapping) patches, and all operations on features are componentwise. The most appropriate type of pooling depends on the problem of interest. When the information is spread over the whole image, an average pooling is reasonable, while min or max pooling are more appropriate when the discriminative information is concentrated in a localized region. In any case, we also use the combination of multiple types of pooling, leaving the final choice to experiments. After aggregation all explicit spatial dependencies are discarded.
Note that the type of pooling impacts on how information back-propagates from the output to update the parameters of the feature extractor. In more detail, let F agg denote the aggregated feature, L the loss function of the framework, and θ a generic parameter of the CNN. Then, the gradient of L with respect to θ reads In the above equation, δ i,j equals 1 when i=j and 0 otherwise, while i max (c) and i min (c) point to the feature vectors with the largest, respectively smallest, c-th component. Therefore, with max or min pooling, only some "active" patches contribute to the gradient, and are updated during training. Instead, with mean and msq pooling all patches are involved. Of course, when multiple forms of pooling are used at the same time, the gradient is obtained as the weighted sum of the individual terms.
3) Decision: after aggregating the local information in a single descriptor F for the whole image, this is classified by means of a few fully-connected layers. This is the typical classifier used in deep networks, and usually two layers provide a good trade-off between complexity and accuracy.

C. End-to-end training
If we focus only on the post-training operations, the proposed architecture does not look much different from conventional approaches based on patch-wise feature extraction, pooling, and classification. Contrary to such approaches, however, our framework is trainable end-to-end. This means that we do not train the feature extractor on individual labeled patches and, afterwards, train the classifier on the features extracted by a fixed net. Instead, we train the whole framework, top (a) ----------j j j j j j j j j r r r r r r r r r j j j j j j j j j r r r r r r r p p ? ? ? j j j j j j j j j r r r to bottom, on full-size images, with a single label associated with each one: forged or pristine. The loss back-propagates through the net up to to individual patches, allowing the feature extractor to learn which information is the most discriminative for the final decision, and adapting the classifier jointly with the extractor itself.
To better underline the difference with respect to patch-wise training, consider that in a large image with a localized forgery most patches are actually pristine, and only a few ones truly forged. In our end-to-end training, all these patches share the same image-level label (forged). Therefore, the net is forced to learn how to manage such contrasting indications to make the correct decision. As a side benefit, there is no need to have a pixel-wise ground truth for training, since the only relevant label applies to the whole image. Also, images of any size can be used for training, with forgeries of any size (especially if max/min pooling is used).
Going into technical details for each training batch of images, the framework performs i) an inner loop on the patches of each image, computing the back-propagation at the end of the loop, and ii) an outer loop on the images of the batch, that sums up gradients computed for each inner loop and finally updates the weights once at the end of the batch. Due to the arbitrary size of input images, each inner loop involves a different number of patches, impacting on the computational effort, which may vary significantly from batch to batch. This is a minor issue, though, with respect to memory requirements. In fact, to back-propagate the loss, gradients must be computed for all processed patches, causing an increase of the occupied memory, which grows linearly with the image size. For deep networks and large images, this memory is simply unavailable.
The situation is described pictorially in Fig.4, where a circle represents a layer, and a black dot at the center indicates that activations are stored. In the forward pass (a), in fact, all activations at each layer are computed and stored. Then, in the backward pass (b), they are used to propagate gradients from the last layer, where the loss is computed, to the input. After usage, they are erased (small dots). It should be realized that deep nets can include hundreds of layers, with several feature maps at each layer, whose size is typically proportional to the input size. Therefore, to process a large input image at once, a huge number of variables should be stored, exceeding the available memory.
To manage this problem we resort to the gradient checkpointing strategy, originally proposed in [32], which trades off memory for computation. This solution is described pictorially in Fig.5. During the forward pass (a), all activations are deleted (a) ----------j j j j j j j j j r r r (b) ----j j j j j j j j j r r r r r r (c) j j j j j j j j j r r r r p p ? ? ? j j j j j j j j j r r r (d) ----j j j j j j j j j r r r r r r (e) j j j j j j j j j r r r p p p p p p ? ? ? ? ? ? ? j j j j j j j j j r r r p p p p With a judicious choice of the number of checkpoints, memory occupation can be significantly reduced and become manageable. Of course, each activation is computed twice, but the computational overhead is limited, because the forward pass is lighter than the backward pass. Note that gradient checkpointing has been recently made available in PyTorch as well as in other platforms. With this solution, we were able to train our network end-to-end seamlessly, with a increase of the training time that never exceeded 20%.

IV. EXPERIMENTAL ANALYSIS
In this Section we design and perform numerical experiments to assess the performance of the proposed approach.
In the following subsections, we first describe the training procedure, then present the results of some preliminary experiments carried out to make key design choices, and finally compare the proposed method with both baselines and state-ofthe-art references on several challenging datasets widespread in the community.

A. Training
In order to train our networks, we generated a suitable synthetic dataset. Background images are taken from the Vision dataset, proposed originally [33] for camera model identification, which comprises 7565 images acquired by 35 different devices with the native high-quality JPEG compression. To generate manipulated images, we spliced on them objects drawn from a set of 81 objects manually cropped from the uncompressed images of the UCID dataset [34]. Details on all datasets used in this work are reported in Tab.I. We used all images from 25 devices of the Vision dataset for training, and kept the others for validation, with an approximate 70%-30% split, so as to avoid any possible bias. For each pristine image, we created on the fly a manipulated image by inserting in a random position one of the UCID objects, selected at random, with random scaling and rotation. Scaling is such that the size of spliced objects goes from about 1% to about 10% of the image size. Eventually, both pristine and manipulated images are flipped or rotated, and JPEG compressed with QF going from 75 to 100, obtaining a significant augmentation. Fig.6 shows a few examples of manipulated Vision/UCID images (without rotations).
In the training procedure we used the Adam optimizer with minibatches of 10+10 images and a learning rate of 0.001. Training took about three days with an Nvidia Tesla P100 GPU. With the same hardware, testing takes about half a second for a 3072×4096-pixel image, including the noiseprint extraction, which decreases to 0.01 seconds if the image tiles are already stored in the GPU memory.

B. Preliminary experiments
The proposed framework aims at the detection of localized manipulations, such as splicing, copy-move, and object removal through content-aware inpainting. Towards this goal, we instantiated the proposed framework by means of some key design choices. In particular, we • augmented input RGB bands with the corresponding noiseprint bands; • used Xception [35] as feature extractor; • performed aggregation by including all types of pooling; • used two fully connected layers, of size FC1=512 and FC2=256, to perform the final classification. We arrived at these choices as a result of a large number of preliminary experiments, whose description would be dispersive and tedious. However, we can study experimentally the impact of each individual choice on the performance of the proposed architecture. To this end, we generated a new dataset, with the same modalities used for the training set, but completely separated from it. Background images were taken from the Dresden dataset, originally proposed [36] for camera model identification, and manipulated images were created by splicing on them 13 objects taken from the FAU dataset [37] (see again Tab.I for details on datasets). After performing the splicing, images were JPEG compressed at high (QF≥95), medium (90≥QF≥85), or low quality (80≥QF≥75), and eventually resized at scale=0.75 or left unchanged. Spliced objects can be classified as large, medium, or small, depending on the largest dimension of their bounding box (after image resizing), set to 1024, 384, or 128, respectively. Note that, to carry out the large number of tests required by this analysis, we use a small training set, here, and results indicate main trends but can be improved by a more accurate training.
To assess performance, here and in all subsequent experiments, we classify the whole test set, compute false positive rate (FPR) and true positive rate (TPR) as a function of the detection threshold, going from 0 to 1, and obtain the corresponding receiver operating characteristic (ROC) curve. Eventually, we compute the area under the ROC curve (AUC) as a synthetic measure of performance. In Tab.II we report the results of our ablation study. Row 2 refers to the selected architecture, which uses Xception, takes in input both RGB and noiseprint bands, concatenates vectors given by all pooling types, and use a size-512 FC1 layer. In all other rows, we modified a single item of this reference architecture. A number of non-trivial results appear. First of all, Xception is a much better feature extractor than the two alternatives, Resnet101 [38] and InceptionV4 [39]. We had already observed a similar edge in other applications [22] although never so sharp. The likely reason is Xception's better use of resources, with a much smaller number of parameters to optimize for a given network depth. It also clearly emerges that using 4 types of pooling together ensures a significant improvement w.r.t. using only one of them. Using only max-pooling, as suggested by the nature of the problem, is even worse than using average pooling, probably because of its lower robustness to noise. As for the size of the first FC layer, 512 appears to be the best choice, although just slightly. The only controversial choice concerns the input. In fact, using only the RGB bands or only the noiseprint (NP) bands provides results very close to those of RGB+NP, with a statistically insignificant gap. Therefore, we refrain from sharp decisions on the input, and will keep testing several options in real-world cases.
We now study the impact of compression, resizing, and splicing size on the performance of the proposed method by collecting results for specific relevant subsets. A quick look at the numbers of Tab.III makes clear that only minor variations occur across such subsets, with all AUC's in the 0.84-0.89 range. The largest performance gap is observed between original-size and resized images. Also JPEG compression affects somewhat the detection performance, although no significant difference emerges between the medium-QF and low-QF cases. The size of the spliced area, instead, seems to have a minor impact and, contrary to expectation, relatively small-size splicings are detected more easily that large-size ones. Note that, on the average, the AUC on specific subsets is larger than the global AUC, but this is a consequence of the higher homogeneity of the tested images.

C. Comparative performance analysis
Having justified our design choices, we now move to compare the performance of the proposed framework with those of suitable baselines and state-of-the-art methods, using not only our relatively simple Dresden/FAU synthetic dataset, but also several realistic and challenging datasets widespread in the forensic community.
1) Reference methods: first of all, we consider two natural baselines, both relying on Xception, given its good performance. The first one, Xception-resize, consists simply in resizing the target image to fit the CNN input, with straightforward training procedure. Xception-patchwise, instead, works by analyzing the image patch-by-patch, with no resizing and some spatial overlapping, and finally fusing results. Accordingly, the net is trained to perform binary patch classification. Since the detector will look for anomalies, we decided to label only boundary patches as forged, that is, patches including a significant fraction of both background and manipulated areas. Eventually, the output probabilities are collected in a heatmap, from which a suitable statistic is extracted (after some tests, we chose the max statistic) and compared with a threshold to make the image-level decision.
Just like our two baselines, methods proposed in the literature can be grouped in two classes. A few ones work at image level, while the majority work at patch-level, as they pursue forgery localization, and are converted into image-level detectors through some simple post-processing.
For the first category, we selected the SPAM+SVM method [8], winner of the First IEEE Forensic Challenge and based on the SPAM steganalytic features [15], the CNN+SVM method of [16], which extract features through a constrained CNN, and LSTM-EnDec 1 [25], which uses a long-short term memory recurrent neural network to detect pristine/forged spatial transitions. For the second category, we consider several forgery localization methods converted into image-level detectors. In particular, we selected the best performing methods resulting from the analysis carried out in [29], that is, CFA [2], which exploits features related to the color-filter array, DCT [3], based on the analysis of double-quantized DCT coefficients, NOI [1], looking for spatial inconsistencies in the noise level, EXIF-SC [31], looking for anomalies in the image leveraging the EXIF metadata during the training phase, and Noiseprint [29], which extracts and analyzes an image fingerprint where camera model-related artifacts are emphasized. All these methods compute a heatmap representing the probability that a certain patch has been manipulated. To make the image-level decision we extract several statistics from such heatmaps: mean, maximum, and q-quantile, with q ∈ {5, 10, . . . , 95}, selecting the best one in terms of AUC performance separately for each method. Note that all these latter methods are blind, that is, they require no training on forged images or patches.
2) Datasets: for performance assessment, besides our synthetic Dresden/FAU dataset, we consider several more datasets, widely used in the forensics community, with markedly different characteristics. DSO-1 [40] features only splicings, with little or no post-processing. In Korus [41], instead, both splicings and copy-moves are present. Both datasets include only large-size high-quality images, not even compressed in the case of Korus. A very different, and much more challenging, scenario is given by the NC2017, MFC2018, and the very recent MFC2019 datasets [42], developed by NIST 2 in the context of the Medifor initiative. Images of these datasets have been manually doctored, often with multiple and possibly overlapping manipulations of various types. In addition, they 2 https://www.nist.gov/itl/iad/mig/media-forensics-challenge-2019-0 have wildly different sizes and quality levels, and have been subject to several anti-forensics measures to prevent easy detection and localization of forgeries. For our tests, we kept all images with splicing, copy-move, inpainting, or computergenerated material. The reader is referred to Tab.I and to the original papers for more details, while some example images are shown in Fig.7 3) Numerical results: in Tab.IV we report the detection AUC for all reference and proposed methods on all test datasets. Next to each method, in column 2, we give the level of supervision it requires, strong (pixel-wise ground truth), weak (only image label), or -(none) for blind methods. In the upper part of the table we group all reference methods, including our two baselines, and in the lower part all version of the proposed method with end-to-end (E2E) training. Best results are highlighted in red for reference methods and in blue for our proposal. In Fig.8 we also show ROC curves for a subset of methods (for readability) and datasets (for space) characterized by very different features. On the Dresden/FAU dataset, disjoint from the training Vision/UCID dataset, but well-aligned with it, the proposed method (E2E-RGB+NP) largely outperforms all references, with a gain of almost 20 percent points over the best one, the strongly supervised Xception-patchwise. Guided by the outcomes of preliminary experiments, together with the "best" version, with RGB+NP input, we consider also the versions with only RGB and only NP inputs. To our surprise, E2E-RGB provides a further significant performance improvement. Our explanation for this phenomenon is the strong eterogeneity of the input: since RGB bands and noiseprints have quite different statistics, the net may have a hard time processing them jointly. To confirm such hypothesis, we considered a further versions of the proposed method, where the networks trained on RGBonly, NP-only, and RGB+NP inputs are fused afterwards by a trivial average of scores. This strategy proved successful, with the new version, E2E-Fusion, providing almost perfect detection (see also the top-left ROC in Fig.8), thus confirming our conjecture.
Moving to the DSO-1 dataset, we observe again a large gain, more than 10 percent points, of the best E2E method over the best reference. On this dataset, Noiseprint provides an especially good performance. a phenomenon already observed in [29], and likely related to all images being JPEG compressed at high-quality. Accordingly, also E2E works best with only noiseprints as input, with no fusion. Images of the Korus dataset, instead, are uncompressed. This removes a major source of forensic traces, which impacts all methods, some of which exhibit a 0.5 AUC, equivalent to coin tossing. CFA (relying on color filter array properties) and Noiseprint, keep providing decent results, however they trail all E2E versions, featuring AUC's between 0.60 and 0.66. Turning to the more challenging NIST datasets, the general behavior does not change, with E2E working generally better than reference methods. The best reference method is not always the same for all such datasets: CNN+SVM for NC2017, Xception-patchwise for MFC2018, NOI for MFC2019. On the contrary, E2E-Fusion is always the best version of proposed method, and the best overall, with a significant performance gain over the best reference, going from 0.048 (NC2017) to 0.127 (MFC2018).
The final column shows the average over all datasets, which confirms all above observations. We only underline, in passing, that the Xception-resize baseline, as expected, performs quite poorly due to the loss of precious high-frequency details, while the Xception-patchwise baseline is among the best references, although it is fair to recall that it requires strong supervision. A general observation is that the performance of E2E is consistently good in all cases (with a small dip on Korus), including the NIST datasets, despite their great variety and the abundance of counter-forensic measures. This is all the more remarkable, considering that the network was trained on a dataset, Vision/UCID, lacking such a diversity. Therefore, we carried out a further experiment on NC2017 and MFC2018, in which the E2E methods are fine-tuned on their respective development sets, provided by NIST together with the test sets. Results are reported in Tab.V and Tab.VI, while Fig.9 shows the corresponding ROC curves. It is clear that fine-tuning on the development set, certainly more aligned with the test set than Vision/UCID, grants further performance gains. Over the whole dataset ("all" column) the best AUC, obtained always with E2E-Fusion, grows from 0.846 to 0.932 on NC2017, and from 0.838 to 0.902 on MFC2018. The larger improvement on NC2017 can be attributed to better development-test alignment In the tables, taking advantage of the auxiliary information provided with these datasets, we also provide analytic results for each type of forgery. Although E2E is trained only on splicing, it works well also on all other localized manipulations. The most interesting phenomenon we could spot from these data is the performance drop on computer-generated fakes. In NC2017 the AUC for these manipulations was very high, above 0.93 without fine-tuning, lowering dramatically (0.80) in MFC2018. Probably, this is the effect of the fast pace of progress in the quality of such manipulations.

D. Towards forgery localization
The E2E framework was conceived and trained with the goal of making global decisions, leaving the problem of localization to other tools. However, just like localization tools can be used for detection through suitable fusion, the proposed detection framework can be recast to provide also some localization information. In the following subsections we provide some insight into how the system exploits and combines local information coming from all over the image, and how this can be exploited towards forgery localization.
1) Activation Maps: first of all, we try to investigate the impact of each patch of the image on the final decision. To this end, we consider a simplified framework in which only the max pooling is used. Given this hard selection rule, we can easily compute a spatial activation map which counts how many features each patch contributes to the overall feature vector. Such a map, however, would be extremely coarse, due the low resolution of patch-wise analysis. Therefore, we combine it with the a high-resolution map, the Grad-CAM (guided gradient weighted class activation map) obtained by backpropagating the loss gradient to the full-resolution input [43]. In Fig.10 we show some results for images of the Dresden/FAU dataset (hand-made to look more realistic). For this synthetic dataset, we have the pristine version of each manipulated image, so we can analyze the network behavior in both circumstances. In all cases, the network focuses on highactivity regions, often corresponding to object boundaries. When there is no manipulation, the salient regions are scattered all over the image. On the contrary, when a splicing takes place, they tend to concentrate on the boundaries of the spliced object, proving that the system has learned to look at these patches to make its decisions. Therefore, when a forged image is detected, this activation provides hints about the possible site of the manipulation.
2) ROI-based Analysis: moving towards forgery localization, we can obtain some interesting results by leveraging the flexibility of the proposed framework. Indeed, since the system can analyze images of any size, it can also analyze regions of interest (ROI) selected by the user based on the previous activation map or any other criterion. If the ROI contains manipulated material, the system will likely provide a large probability of manipulation (score, from now on). Therefore, the system can be used in supervised modality to test suspicious objects. Also, it can be recast to perform automatic box-like localization. In fact, once features have been computed and stored for all patches, the aggregation and classification phases are extremely simple, with light-speed processing. Therefore, one can easily test a large number of boxes and select automatically as ROI those with the largest scores, obtaining a rough but effective form of localization. Fig.11 shows some examples taken from the MFC2018 dataset. Together with the original images (top) and activations maps (middle) it also shows (bottom) the scores obtained over the whole image (white number in the top-left corner) and on selected boxes (colored numbers). The green boxes have been selected manually around possible subjects of interest, while the magenta boxes are selected by our automatic procedure around the local maxima of the score. In the first image, the man on the right has been spliced on the host background. Here, the activation map provides strong hints on the possible manipulation, confirmed by a large image-level score (0.935). However, an even larger score (1.000) is obtained when a ROI is correctly placed around the splicing. The automatic procedure also selects a ROI roughly covering the splicing, with unitary score. Another ROI is selected automatically in a pristine area in correspondence of a local maximum, but is has a rather low score (0.428). In the second image, a further splicing has been added, the woman in the center. Neither the activation map nor the automatic ROI selection procedure highlight this new subject. So, we selected a ROI manually around this splicing, obtaining a rather low score. Exploiting the side information provided with the NIST datasets, we investigated on this splicing, to discover that the inserted object had been acquired with the same camera model as the host image. This fact reduces the discriminating power of the noiseprint input, justifying in hindsight such result. In the third image, the only manipulation is a tiny inpainted region. Here, a supervised selection makes no sense, since the manipulated region does not correspond to any semantic object. However, the manipulation is nicely localized through the automatic procedure, with unitary score, unlike other candidate ROIs characterized by low scores. The last image shows an opposite case, with many large, semantically relevant, objects spliced on the host image. To avoid cluttering the image, we now show only the supervised ROIs and the corresponding scores, which are very large in all cases.
To complete this visual inspection of results, it is fair to show, in Fig.12, some counter-examples where the proposed framework fails to detect the manipulation. Reasons for failure are not always obvious. In these cases, they may be related to the absence of texture in the spliced object (right) or the strongly textured host image (right) which may hide the discriminating information. Note that in the image on the right, a well-placed ROI would allow detection, but there is no semantic hint to select it.

V. CONCLUSION
We proposed a new CNN-based framework for image forgery detection. Thanks to suitable architectural solutions, it allows one to process jointly information gathered at fullresolution from the whole image. Moreover, the framework can be trained end-to-end based only on weak (image-level) supervision. We proved the effectiveness of this solution by extensive performance analysis on forensic datasets widespread in the community. A large performance gain is observed in all cases with respect to all reference methods. In addition, the framework can be also recast to provide localization information, both in supervised and unsupervised modality.
Despite the very promising results, there is still much room for improvement. In particular, better forms of pooling should be considered to preserve long-range spatial relationships in the aggregation phase. Moreover, image and object semantics should be taken into account to complement the low-level information analyzed by the current framework. Work is already under way along these paths.

ACKNOWLEDGMENT
This material is based on research sponsored by the Air Force Research Laboratory and the Defense Advanced Re-  search Projects Agency under agreement number FA8750-16-2-0204. The U.S.Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory and the Defense Advanced Research Projects Agency or the U.S. Government.