Amplitude SAR Imagery Splicing Localization

Synthetic Aperture Radar (SAR) images are a valuable asset for a wide variety of tasks. In the last few years, many websites have been offering them for free in the form of easy-to-manage products, favoring their widespread diffusion and research work in the SAR field. The drawback of these opportunities is that such images might be exposed to forgeries and manipulations by malicious users, raising new concerns about their integrity and trustworthiness. Up to now, the multimedia forensics literature has proposed various techniques to localize manipulations in natural photographs, but the integrity assessment of SAR images has never been investigated. This task poses new challenges, since SAR images are generated with a processing chain completely different from that of natural photographs. This implies that many forensic methods developed for natural images are not guaranteed to succeed. In this paper, we investigate the problem of amplitude SAR imagery splicing localization. Our goal is to localize regions of an amplitude SAR image that have been copied and pasted from another image, possibly undergoing some kind of editing in the process. To do so, we leverage a Convolutional Neural Network (CNN) to extract a fingerprint highlighting inconsistencies in the processing traces of the analyzed input. Then, we examine this fingerprint to produce a binary tampering mask indicating the pixel region under splicing attack. Results show that our proposed method, tailored to the nature of SAR signals, provides better performance than state-of-the-art forensic tools developed for natural images.


I. INTRODUCTION
Due to the lively development of Internet-based communication systems, the diffusion and sharing of multimedia content (i.e., digital images, videos or audio clips) have become part of our daily life. At the same time, we have become extremely accustomed to using tools for editing these objects. As a consequence, doubts regarding whether the content we are enjoying is genuine or not arise frequently. Indeed, from politics [1] to everyday life [2], the areas where fake media can cause harm are many.
In this vein, multimedia forensics researchers aim at developing techniques to retrieve information about the multimedia object at hand. For instance, they are interested in verifying the integrity and trustworthiness of data, spotting manipulated multimedia content and localizing possible forgeries. Forensic researchers typically tackle these problems by considering a simple principle: during the data life-cycle, various non-invertible operations are performed. Each operation leaves a peculiar trace, or footprint, that can be exploited to expose and localize a specific editing operation.
The main efforts of the community have been historically directed towards the analysis of digital images [3]. Many techniques have been developed to detect traces left by specific operations executed on the whole picture. Furthermore, many works have aimed at spatially localizing traces left by editing operations applied locally on the image (i.e., splicing localization). A few examples of local image splicing are the insertion of a portion of an image into another one, or the deletion of a pixel area from the sample under attack.
In addition to classical digital photographs, overhead imagery is recently becoming more accessible than before. This is probably due to the increased availability of satellites equipped with imaging sensors and the widespread diffusion of public websites [4] sharing this kind of imagery. This imagery represents data in a wide variety of modalities, from optical (e.g., panchromatic, RGB) to thermal and Synthetic Aperture Radar (SAR).
Despite the great availability of tools and techniques for analyzing the integrity of natural images, the potential malicious editing of overhead images is a growing concern. Indeed, as any other type of digital imagery, overhead data can be easily manipulated through editing software suites (e.g., Photoshop, GIMP, etc.) as well as through synthetic image generation tools [5], [6], and examples of malicious modifications have been worrying the public opinion and media [7], [8].
Unfortunately, the footprints characterizing the overhead image life-cycle are different from those of digital photographs, and state-of-the-art methods suited for digital photographs are likely bound to perform poorly if blindly applied to satellite data. Therefore, developing techniques to localize potential manipulations applied to overhead imagery is becoming a task of paramount importance.
While the forensics community has started developing techniques specifically tailored to satellite data, to the best of our knowledge the problem of forgery localization on SAR imagery has never been investigated in the literature. However, since SAR products, especially those based on amplitude only, are easy to handle and modify even without specific expertise, their possible manipulation by malicious users is concerning. This paper investigates the problem of splicing localization in amplitude SAR images. Specifically, we consider the situation in which a region of an amplitude SAR image has been substituted with another region coming from a different image, and some editing might have been applied to hinder this manipulation.
Our goal is to localize the manipulated region (i.e., performing splicing localization), providing a binary mask highlighting the manipulated area. To do so, we rely on Convolutional Neural Networks (CNNs) to first extract a fingerprint reporting information on the forensic traces found in the analyzed image. The fingerprint extraction stage is inspired by existing state-of-the-art multimedia forensic methods, but we reformulate it to best suit the context of SAR imagery. Then, we exploit the extracted fingerprint to generate a tampering mask showing which pixels have undergone splicing. We propose three different methods, one supervised approach relying on CNNs and two unsupervised approaches leveraging clustering techniques.
To validate our findings, we construct a custom dataset of spliced amplitude SAR images by applying forgeries of different sizes and considering various editing operations performed on the manipulated data. We compare with state-of-the-art algorithms for splicing localization on natural images, always achieving better localization performance. Our results suggest that the forensic analysis of manipulated amplitude SAR images is feasible, as long as the splicing localization is performed being aware of the distinct nature of SAR imagery with respect to natural photographs.
To summarize, the main contributions of our paper are listed in the following:
• We analyze the problem of splicing localization in amplitude SAR imagery, i.e., we propose a solution to localize regions of amplitude SAR images that have been copied and pasted from another sample to alter the original image content, considering also that the spliced region might have undergone some editing to conceal this manipulation. To the best of our knowledge, this is the first contribution on the matter proposed in the literature;
• The proposed solution is tailored to amplitude SAR images and has been developed analyzing in detail the life-cycle of SAR data;
• We demonstrate the viability of the forensic analysis of amplitude SAR imagery, with the proposed method reaching better performance than state-of-the-art techniques developed for natural photographs.
The rest of the paper is organized as follows. In Section II we present an overview of forensics methods developed for both natural and overhead imagery. In Section III, we provide some useful background on the deep learning tools employed in our proposed splicing localization pipeline, and on the SAR imagery generation process. In Section IV, we formulate the splicing localization problem on amplitude SAR images. In Section V, we illustrate our proposed method in detail. In Section VI, we provide all the information regarding the setup used for our experiments. In Section VII, we discuss our experimental findings. Finally, in Section VIII, we draw the final considerations on our work.

II. RELATED WORKS
The presence of peer-to-peer file sharing systems in the early '90s, and now of social media and chat services, has dramatically increased the amount of multimedia objects we enjoy every day. At the same time, concerns regarding the genuineness of these objects have risen, pushing the multimedia forensics community to tackle the problem of verifying the integrity and trustworthiness of these data. Historically, the main contributions focused on the analysis of digital pictures. The work by Stamm et al. [3] provides a detailed overview of all the techniques and tasks undertaken in recent years. For instance, different methods have been proposed to detect forensic footprints left by operations executed on the entire image. This is the case of Popescu and Farid [9], Kirchner [10], or Vázquez-Padín and Pérez-González [11], who present different techniques to expose resampling operations. Other contributions, such as Cao et al. [12] and Kirchner and Fridrich [13], focus on the detection of the use of median filters. The works by Bianchi and Piva [14], Thai et al. [15] and Mandelli et al. [16] try instead to identify the execution of multiple image compressions.
Still on the subject of digital picture analysis, another thoroughly explored line of research is the localization of splicing attacks. Splicing refers to the insertion of a portion of an image into another one, with the possible execution of further editing aiming at the concealment of a specific pixel area. Splicing localization means spatially identifying (i.e., at a pixel level) which areas of an image have been attacked. To spot the tampering traces, many works in the literature such as Lyu et al. [17], Cozzolino et al. [18] and Cozzolino and Verdoliva [19] rely on the information carried by the so-called noise residual. This is a picture obtained by removing the high-level semantic content from the image, for instance through high-pass filtering.
In recent years, thanks to the automatic extraction of forensic traces by data-driven methods, techniques from the deep learning area have gained great popularity in the forensics field. CNNs especially have been exhaustively explored for the task of image tampering localization. For instance, Bondi et al. [20] propose a framework for image splicing localization by exploiting CNN-based descriptors developed for camera model identification. Interestingly, some works combine CNNs with the idea of noise residuals: Rao and Ni [21], as well as Liu et al. [22], suppress the high-level image content by using a fixed high-pass convolutional filter, while Bayar and Stamm [23], [24] adopt the same approach but rely on a learned filter. More recent contributions further elaborate on this idea, providing tools that greatly improved the state-of-the-art performance on the splicing localization task. A notable example is the Noiseprint by Cozzolino and Verdoliva [25].
Due to the substantial difference in the footprints characterizing the life-cycle of overhead imagery, the forensics community has developed techniques specifically tailored to satellite data, as state-of-the-art methods suited for digital photographs are likely bound to perform poorly if blindly applied to them. In this vein, Ho et al. [26] propose a method based on a watermarking technique to detect doctored image regions in overhead imagery. Yarlagadda et al. [27] show a tool for the localization of general overhead image manipulation combining a Generative Adversarial Network (GAN) with a one-class Support Vector Machine (SVM). Bartusiak et al. [28] rely on GANs as well for detecting and localizing RGB image forgeries. Horváth et al. [29] formulate the forgery detection problem on RGB data as an anomaly detection one. Similarly, Mas-Montserrat et al. [30] localize splicing attacks as deviations of the image pixel values from pristine distributions using generative autoregressive models. Horváth et al. [31] rely on Vision Transformers [32] to build an autoencoder and localize splicing attacks as deviations from the learned latent-space distribution of pristine images. Moving to panchromatic images, Cannas et al. [33] localize splicing attacks relying on an ensemble of CNNs trained for sensor attribution tasks.

III. BACKGROUND
Despite the great effort put by the multimedia forensics community, the problem of splicing localization on SAR imagery has never been studied in the literature. To facilitate the discussion on the forensic analysis of amplitude SAR imagery, this section provides the reader with some useful background on SAR imaging and on the deep learning tools employed in the proposed solution.

A. DEEP LEARNING TOOLS
Deep learning is a prosperous study field that greatly improved state-of-the-art solutions for a wide range of applications, including multimedia forensics [34] as well as SAR image classification [35] and automatic target recognition [36]. Among the most used deep learning tools, CNNs have had great success, as they have proved handy in managing data with an intrinsic regular grid structure [37]. This is the case of digital images as well as remote sensing data, which are stored as multi-dimensional arrays.
In a nutshell, a CNN can be seen as an operator that applies a series of parametric functions (e.g., linear filtering, nonlinear saturation, matrix multiplications, etc.) to its input, in order to obtain a processed output. The parameters of the applied functions are optimized (or "learned") during a preliminary stage called "training". Depending on the task, the output of the CNN can be a label (e.g., for classification problems), a heatmap (e.g., for segmentation or localization tasks), or any other processed version of the input (e.g., for denoising purposes).
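As a minimal illustration of this view of a CNN as a chain of parametric functions, the sketch below applies a single convolution followed by a nonlinearity using plain numpy; the high-pass kernel is fixed by hand purely for illustration, whereas in an actual CNN its values would be learned during training:

```python
import numpy as np

def conv2d(x, k):
    """'Valid' 2-D correlation of image x with kernel k (no padding)."""
    kh, kw = k.shape
    H, W = x.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def relu(x):
    """Nonlinear saturation applied element-wise."""
    return np.maximum(x, 0.0)

# A hand-picked high-pass kernel (in a real CNN this would be learned):
# it suppresses flat content and responds to local variations.
kernel = np.array([[0, -1, 0],
                   [-1, 4, -1],
                   [0, -1, 0]], dtype=float)

x = np.ones((8, 8))          # a flat, content-free input patch
y = relu(conv2d(x, kernel))  # one "layer": linear filtering + nonlinearity
# On a flat patch the high-pass response is zero everywhere.
```

Stacking many such filtering and saturation stages, each with learnable parameters, yields the deep architectures used in the remainder of this paper.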
CNNs therefore allow a processing of SAR data which is adherent to their semantic. In the following sections, we illustrate some applications of these tools related to the use we make of them in our proposed method.

1) Denoising Forensics
CNNs have been increasingly exploited in the forensics field in the last few years [38]. In this work, we are particularly interested in the use of the Denoising Convolutional Neural Network (DnCNN) [39], a CNN developed for image denoising that has been successfully exploited for forensics tasks [25], [40]. For instance, Cozzolino and Verdoliva proposed to use a DnCNN to extract the so-called Noiseprint [25]. This is a noise-like pattern that suppresses the vast majority of the image content and exposes editing-related artifacts due to local image forgery. To extract the Noiseprint, the authors employ a particular training procedure which can be roughly summarized as follows:
1) Consider a dataset of images coming from different devices.
2) Apply the DnCNN to patches extracted from different images to obtain a series of noise patterns.
3) Keep training the DnCNN to extract similar patterns for patches coming from the same pixel region (e.g., top-left, bottom-right, etc.) of the same device, and different patterns for patches coming from different regions and/or devices.
The last constraint is motivated by the idea of exploiting the spatial periodicity of camera-related artifacts, so that operations like image shift or rotation can be easily detected [25]. In the end, the trained DnCNN is able to extract a noise-like heatmap that is self-consistent when analyzing pristine images, whereas in the case of spliced images it clearly highlights the edited regions. This solution achieves state-of-the-art results on many image forensics datasets, and also proves useful in the analysis of remote sensing images, in particular overhead RGB data [25].

2) Segmentation Forensics
A wide variety of CNN-based methods have been proposed for the task of image segmentation [41]. A notable example, due to its simplicity and accuracy, is the U-Net by Ronneberger et al. [42]. This network is characterized by a "U"-shaped architecture. It consists of a contracting path, i.e., a series of convolutional layers each followed by pooling operations, and an expanding path specular to the contracting one but containing upsampling operators rather than pooling. Skip connections are employed to concatenate and combine the output of the contracting path layers with the input of the mirrored layers of the expanding path. In addition to image segmentation, the U-Net has also found general appreciation in the forensics field. Kniaz et al. [43] used a U-Net-like architecture to train a GAN to conceal tampering artifacts in spliced images. Bi et al. [44] combined the idea of residual connections with the U-Net architecture to realize an end-to-end trainable network able not only to detect image manipulation attacks, but also to localize them precisely.
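To give an intuition of the U-shaped information flow, the following toy sketch (ours, not the actual U-Net) mirrors the contracting path, the expanding path and the skip connections in plain numpy; average pooling stands in for the convolution-plus-pooling blocks, nearest-neighbour upsampling for the up-convolutions, and an average for the concatenation-plus-convolution merge:

```python
import numpy as np

def downsample(x):
    """Contracting step: 2x2 average pooling (stands in for conv + pool)."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Expanding step: nearest-neighbour upsampling (stands in for up-conv)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def toy_unet(x, depth=2):
    """Skeleton of a U-shaped network: pool down `depth` times, then
    upsample back, combining each level with its mirrored skip connection."""
    skips = []
    for _ in range(depth):            # contracting path
        skips.append(x)
        x = downsample(x)
    for skip in reversed(skips):      # expanding path with skip connections
        x = upsample(x)
        x = 0.5 * (x + skip)          # stands in for concatenation + conv
    return x

x = np.arange(16.0).reshape(4, 4)
y = toy_unet(x)
assert y.shape == x.shape  # output resolution matches the input
```

The key property preserved by this skeleton is that the output has the same resolution as the input, which is what makes U-shaped architectures convenient for pixel-level tasks such as segmentation and tampering localization.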

B. SAR IMAGING
SAR imagery has been widely adopted for a variety of tasks thanks to its ability to provide high-resolution images independently of cloud coverage, weather conditions and daylight [45]-[47]. Earth monitoring, 2D and 3D Earth surface mapping and change detection are just a few examples of successful exploitation of SAR data [48].
A SAR system is an imaging radar mounted on a platform moving in one direction (e.g., a satellite, an aircraft, etc.). While moving, the system emits sequential high-power electromagnetic waves through its antenna. Waves interact with the objects they hit (i.e., the Earth surface) and are backscattered with modified amplitude and phase according to the objects' permittivity and physical properties (e.g., geometry, roughness). The antenna then collects these backscattered echoes, which can be processed for SAR image formation.
A simplified schematic representation of this process is provided in Figure 1. The coordinates of SAR data are related to the motion of the platform at acquisition time. As we can see in Figure 1, the first dimension corresponds to the range (or fast time), i.e., the direction perpendicular to platform flight along which the electromagnetic beam travels. The second dimension corresponds instead to the azimuth (or slow time), which is the actual trajectory of the platform.
SAR systems use frequency modulated pulse waveforms called chirps. Chirps are characterized by constant amplitude and instantaneous frequency that is linearly modulated over time. Depending on the application, different frequency bands are used for modulation, with the most popular being L (i.e., from 1 GHz to 2 GHz), C (i.e., from 3.75 GHz to 7.5 GHz) and X (i.e., from 7.5 GHz to 12 GHz) [48].
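As an illustration, a linear FM chirp can be synthesized in a few lines of numpy; the pulse duration, bandwidth and sampling frequency below are arbitrary illustrative values, not those of any specific SAR system:

```python
import numpy as np

# Illustrative chirp parameters (not a specific SAR system)
fs = 100e6        # sampling frequency [Hz]
T = 10e-6         # pulse duration [s]
B = 30e6          # swept bandwidth [Hz]
K = B / T         # chirp rate [Hz/s]

t = np.arange(0, T, 1 / fs)
# Complex linear FM pulse: constant amplitude, quadratic phase
chirp = np.exp(1j * np.pi * K * t**2)

# The amplitude is constant over the whole pulse...
assert np.allclose(np.abs(chirp), 1.0)
# ...while the instantaneous frequency f(t) = K*t grows linearly with time:
inst_freq = np.diff(np.unwrap(np.angle(chirp))) * fs / (2 * np.pi)
```

The two assertions capture exactly the two defining properties mentioned above: constant amplitude and a linearly modulated instantaneous frequency.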
Unlike data from optical sensors, data coming from echo signals is not interpretable as is. Additional processing called focusing (i.e., a double convolution along both the range and azimuth directions) is needed to obtain a visually interpretable image [48]. The resulting SAR image is a complex 2D matrix, usually displayed in terms of intensity so that pixel values approximate the reflectivity of points on the ground. This 2D matrix can be further processed, and different kinds of processing determine the existence of different so-called SAR products [47]. As an example, this can be done to ensure calibration (i.e., each pixel value represents the correct value of reflectivity) and geocoding (i.e., associate the location of each pixel with a position on the ground).
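A minimal sketch of the range-direction part of focusing, assuming a matched-filter formulation (correlation of the received echo with the transmitted chirp) and purely illustrative parameters of our own choosing:

```python
import numpy as np

# Illustrative transmitted chirp (same form as in the previous snippet)
fs, T, B = 100e6, 10e-6, 30e6
K = B / T
t = np.arange(0, T, 1 / fs)
ref = np.exp(1j * np.pi * K * t**2)   # reference (transmitted) pulse

# Simulated received signal: the chirp delayed by 200 samples, buried in noise
rng = np.random.default_rng(0)
rx = 0.05 * (rng.standard_normal(3000) + 1j * rng.standard_normal(3000))
delay = 200
rx[delay:delay + len(ref)] += ref

# Range compression: correlate the echo with the reference chirp.
# The spread-out echo collapses into a sharp peak at the target delay.
compressed = np.abs(np.correlate(rx, ref, mode="valid"))
peak = int(np.argmax(compressed))
```

Azimuth compression follows the same matched-filtering idea along the other direction, which is why focusing is described as a double convolution.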
Nowadays, a wide range of SAR products can be downloaded from online platforms. Among these platforms, the Copernicus Open Access Hub [49] is the online portal provided by the European Space Agency for downloading Copernicus Sentinel-1 Mission products. The Sentinel-1 Mission products are generated according to different acquisition modes. The simplest one, the Stripmap mode, senses single continuous strips of Earth surface with a fixed antenna pattern (as Figure 1 depicts). Other acquisition modes instead acquire more than one measurement: for instance, the Interferometric Wide Swath (IW) emits three different pulses steering the antenna in the azimuth direction [50]. This operation results in the generation of three different complex images (i.e., one per pulse in the azimuth direction), or subswaths, provided altogether in the SAR product. SAR products usually depict very large geographic areas. Many companies allow users to select an area of interest to be imaged by their systems. To do so, as the Earth surface coverage of a single echo is often insufficient, multiple signals are collected and concatenated in a single continuous image. This is also the case of the Sentinel-1 Mission, where this operation is denoted as product slicing [51]. Figure 2 provides a graphical representation of it. Product slicing needs a resizing of all slices to a common grid in order to concatenate them.
Among the different Sentinel-1 products and acquisition modes, Ground Range Detected (GRD) products are probably the most common and accessible for a direct inspection. Indeed, they present scene reflectivity in ground-range coordinates, which are the azimuth-range coordinates projected on the Earth ellipsoid model [52]. This transformation makes it possible to reduce the range geometric distortion and have each pixel placed in the correct position with respect to a reference plane [53]. Moreover, in case multiple sub-swaths are available, all the available signals are fused together to obtain a single continuous image. This is the case of IW acquired products. Figure 3 provides an example of such process, which is based on a resizing pipeline [54]. Finally, GRD products represent detected amplitude only, without carrying any phase information. All these elements make GRD products easy to handle, but also easy to manipulate with common image editing software tools.

IV. PROBLEM FORMULATION
SAR products differ from natural photographs in a variety of ways. For instance, the concept of a single shot is hardly defined. Indeed, SAR signals are continuously acquired through moving sensors. Individual products are then generated, for manageability reasons, by merging several acquisitions. Moreover, some SAR products like GRD are obtained through a very specific chain of operations that has nothing in common with the ones usually employed in natural photography.
However, from the perspective of an end-user with no specific experience on overhead imagery, amplitude SAR products can be considered close to natural photographs when it comes to their manipulation. Indeed, as long as they are provided in single polarization, since they present amplitude information only they can be processed as a matrix of real numbers with any common image editing tool. This is the case of GRD products for instance.
Since GRD products, especially those acquired in IW mode, are popular [55], [56] yet easy to manipulate, it is reasonable to consider them a vulnerable asset from a forensics perspective. Given these premises, in this work we focus on GRD products. Specifically, we consider images derived from GRD products in C-band in single vertical polarization, all acquired in IW mode. From now on, we refer to them as GRD images, or GRD tiles as they are typically mentioned in the overhead field.
In this work, we are interested in assessing the integrity of a GRD image tile at a local level and at a small granularity. In particular, given a manipulated tile, we want to localize which pixels have been affected by the editing. As manipulations, we consider image splicing attacks, i.e., the insertion in a target tile of a portion coming from a different source tile. Moreover, we consider that the spliced region may have undergone optional editing with image processing operations (e.g., blurring, resizing, noise addition, etc.) in order to make the attack more credible and visually appealing. For instance, a resizing might be needed to match the source and target tile resolutions and avoid making the splicing easily detectable at visual inspection.
More formally, we define the coordinates of a pixel of a U × V resolution tile as (u, v), where u ∈ [1, . . . , U] and v ∈ [1, . . . , V]. U and V are the number of pixels per row and column, respectively. Let T_D and T_T be two pristine tiles.

FIGURE 4. Example of a splicing operation, with donor tile T_D on the left, spliced tile T_S at the center and tampering mask M on the right. In the spliced tile, the outskirts of an urban area are covered.
T_D is the donor tile, whereas T_T is the target tile. Defining S as the region of T_T under splicing attack, the resulting spliced tile T_S is defined as

T_S(u, v) = e(T_D)(u, v)  if (u, v) ∈ S,
T_S(u, v) = T_T(u, v)     otherwise,        (1)

with e(·) being a suitable editing function (e.g., blurring, resizing, noise addition, rotation, shearing, affine transforms, etc.). The pixel-by-pixel integrity of the tile T_S can be represented by a tampering mask M with the same resolution of T_S, where each pixel takes a binary value 0 or 1 depending on the pixel being pristine or manipulated, respectively. Formally, the tampering mask M has pixel values equal to

M(u, v) = 1  if (u, v) ∈ S,
M(u, v) = 0  otherwise.        (2)

The goal of this paper is the localization of the spliced region S by estimating a tampering mask M̂ as close as possible to M from the sole analysis of the tile T_S. Figure 4 provides a graphical representation of the splicing operation together with a tampering mask M.
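The splicing model above can be sketched in numpy as follows; the Rayleigh-distributed tiles and the box-blur editing function e(·) are our own illustrative choices, not data from actual GRD products:

```python
import numpy as np

rng = np.random.default_rng(0)
U = V = 64
T_T = rng.rayleigh(1.0, (U, V))   # target tile (amplitude-like values)
T_D = rng.rayleigh(2.0, (U, V))   # donor tile

def e(x):
    """Illustrative editing function: a crude 3x3 box blur, standing in
    for blurring / resizing / noise addition applied by the attacker."""
    out = x.copy()
    out[1:-1, 1:-1] = sum(
        x[1 + di:U - 1 + di, 1 + dj:V - 1 + dj]
        for di in (-1, 0, 1) for dj in (-1, 0, 1)) / 9.0
    return out

# Spliced region S: here, a square block of the tile
S = np.zeros((U, V), dtype=bool)
S[16:48, 16:48] = True

# Spliced tile: edited donor content inside S, target content elsewhere
T_S = np.where(S, e(T_D), T_T)
# Tampering mask: 1 on spliced pixels, 0 on pristine ones
M = S.astype(np.uint8)
```

Estimating a mask M̂ close to this ground-truth M from T_S alone is exactly the task addressed in the next section.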

V. AMPLITUDE SAR IMAGE SPLICING LOCALIZATION
In the forensics literature, it is well known that both the acquisition device and processing operations leave peculiar traces on digital photographs. These traces can be exploited to expose forgeries [25], [57]. As the considered amplitude SAR products undergo a wide variety of operations from their acquisition to the final production (e.g., re-sampling, de-ramping, ground-range projection, etc.), it is reasonable to assume that different products may contain different processing traces. Due to the nature itself of the SAR signal and of the non-linear operations employed, even amplitude SAR products coming from the same satellite might present different traces relative to the processing executed for generating them.
Leveraging this idea, we propose a splicing localization method that exposes and highlights inconsistencies in the analyzed spliced tile T_S due to the different processing that target and donor tiles have undergone, and to any editing trace left by the attacker in the splicing operation. This is done by extracting from the tile under analysis a fingerprint inspired by Noiseprint [25].

FIGURE 5. Schematic illustration of the proposed processing pipeline. A fingerprint F is extracted from the spliced tile T_S under investigation. Then, a binary tampering mask M̂ can be estimated through three different methods. For the sake of clarity, we also report the ground-truth tampering mask M.

This fingerprint, which suffices to spot the spliced region S at a visual inspection, is then further processed to estimate a binary tampering mask M̂ as close as possible to M.
To summarize, our splicing localization process follows a two-stage pipeline (see Figure 5):
1) Fingerprint extraction - Using a properly designed CNN, a fingerprint F with the same resolution of T_S is obtained, highlighting any local inconsistencies due to splicing attacks.
2) Tampering mask estimation - Starting from the fingerprint F, a tampering mask M̂ is estimated using either unsupervised or supervised approaches.
In the following, we provide additional details about each step of the proposed method.

A. FINGERPRINT EXTRACTION
The goal of this step is the extraction of a fingerprint F that visually highlights the spliced region S in an analyzed tile. To do so, we leverage the recent forensics literature. In particular, the Noiseprint [25] method shows promising results in highlighting editing traces even on data distant from natural photographs.
For our fingerprint extractor, we exploit the characterization capability offered by Noiseprint and further adapt it to the context of SAR imagery. In particular, given a spliced tile T_S, we extract a fingerprint F with the same pixel resolution. Our goal is to make this fingerprint clearly highlight spliced regions just by visual inspection. Formally, we define F as

F = f(T_S),        (3)

where f(·) represents the fingerprint extraction operator, i.e., the DnCNN network after it has been trained.
For the DnCNN training, we adopt the following pipeline:
1) We collect a number of tiles coming from M different amplitude SAR products, all generated by the same satellite. These tiles are pristine, i.e., they have not been tampered with in any way.
2) From the tiles of each product, we extract a number of patches. The i-th patch extracted from the tiles of the m-th amplitude SAR product is referred to as P_m^i.
3) We iteratively update the DnCNN weights by processing small batches of patches. In particular, given a mini-batch of patches, we process their extracted fingerprints with the Distance Based Logistic (DBL) loss presented by Võ and Hays [58]. This loss function computes the pairwise squared Euclidean distance between all the analyzed fingerprints. 4) The objective is to make the fingerprints self-consistent if and only if they are extracted from the same amplitude product. Consistent fingerprint pairs (i.e., coming from the same SAR product) are associated with a desired low value of the Euclidean distance, while non-consistent fingerprint pairs (i.e., coming from different products) are associated with a desired high Euclidean distance. More formally, we define the fingerprint pair (F i m, F j n) as consistent if n = m, ∀i, j, and as non-consistent if n ≠ m, ∀i, j. To do so, we assign label 1 to all consistent pairs of fingerprints and label 0 otherwise. 5) We process the training patches by continuously updating the DnCNN weights until we reach some desired performance metrics. For clarity's sake, Figure 6 depicts a sketch of the training pipeline of the proposed fingerprint extractor.
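The pairwise labeling and loss computation described above can be sketched as follows in NumPy. The function name and the logistic mapping p = 2/(1 + exp(d)), turning a distance into a match probability, are our illustrative assumptions, not necessarily the authors' exact implementation:

```python
import numpy as np

def dbl_loss(fingerprints, product_ids):
    """Sketch of a Distance Based Logistic (DBL) loss.
    fingerprints: (B, D) flattened patch fingerprints.
    product_ids:  (B,) SAR product index of each patch."""
    B = fingerprints.shape[0]
    # pairwise squared Euclidean distances between all fingerprints
    sq = np.sum(fingerprints ** 2, axis=1)
    d = np.maximum(sq[:, None] + sq[None, :] - 2.0 * fingerprints @ fingerprints.T, 0.0)
    # label 1 for pairs from the same product, 0 otherwise
    y = (product_ids[:, None] == product_ids[None, :]).astype(float)
    # logistic mapping: distance 0 -> p = 1, large distance -> p -> 0
    p = 2.0 / (1.0 + np.exp(d))
    eps = 1e-12
    # average binary cross-entropy over off-diagonal pairs
    mask = ~np.eye(B, dtype=bool)
    bce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return bce[mask].mean()
```

Minimizing this quantity pushes same-product fingerprints together and different-product fingerprints apart, as required by the training objective.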
When training is finished, f(·) implements the desired fingerprint extraction function defined in (3). This function allows extracting a fingerprint F that captures traces relative to the processing pipeline of the acquired product, highlighting splicing attacks as inconsistencies in these traces. It is worth noticing that f(·) scales with the input resolution, so that tiles of different pixel dimensions can be processed seamlessly.
With respect to the original training procedure outlined in Section III, our pipeline has been modified taking into consideration the differences between natural images and SAR products. For instance, the presence of coherent artifacts in specific positions of the pixel grid can hardly be demonstrated for amplitude SAR products. On top of that, pixels in natural images are temporally coherent: since they are approximately acquired at the same time instant, the signal they represent has basically no spatial nor temporal discontinuities. This is unfortunately not true for SAR products. As a matter of fact, we have seen that SAR images are generated by concatenating different measurements.
This implies that operations such as product splicing, together with other processing characteristic of amplitude SAR products, might alter and hinder the presence of generation artifacts with a regular spatial distribution. We summarize the main elements of difference with respect to [25] as follows: 1) Noiseprint requires the collection of images coming from different devices (i.e., individual cameras). We instead exploit a number of tiles from M different amplitude SAR products all coming from the same satellite. Our assumption is that tiles of the same amplitude product underwent the same processing pipeline, whereas tiles of different products present different processing traces. 2) Noiseprint trains the DnCNN by comparing pairs of patches, giving a positive label in the DBL only if the patches come from the same pixel region and device. Reasonably, this constraint should not hold for SAR images, thus we relax it and give a positive label whenever the patches come from the same amplitude product, regardless of the pixel region of extraction. 3) Noiseprint does not employ any data-augmentation strategy during training. We propose to include resizing as data augmentation. The reason behind this choice is twofold: on one hand, resizing might improve the extractor robustness to editing operations applied to hinder the localization of S. On the other hand, we have seen in Section III that each SAR product is characterized by a number of resizing operations leaving peculiar traces. For this reason, we apply resizing and propose to consider all resized tiles as coming from new SAR products. This is a good solution to enlarge the number and variety of products at our disposal. Such modifications, even if at a first glance they might appear negligible, will later prove to improve the performance of the pipeline with respect to adopting the baseline training procedure as is.
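The resizing augmentation of point 3 can be sketched as below. The nearest-neighbor interpolation, the function names, and the product relabeling scheme (a resized tile gets a new product identifier) are our illustrative assumptions:

```python
import numpy as np

def resize_nearest(tile, scale):
    """Nearest-neighbor resize (a stand-in for the unspecified interpolation)."""
    h, w = tile.shape
    nh, nw = int(h * scale), int(w * scale)
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    return tile[np.ix_(rows, cols)]

def augment_tile(tile, product_id, n_products, scale=1.5, crop=1024, rng=None):
    """Resize a pristine tile, crop it back to the working resolution,
    and label it as coming from a *new* product."""
    rng = rng or np.random.default_rng(0)
    big = resize_nearest(tile, scale)
    r = rng.integers(0, big.shape[0] - crop + 1)
    c = rng.integers(0, big.shape[1] - crop + 1)
    # resized tiles are treated as an additional, separate product
    return big[r:r + crop, c:c + crop], product_id + n_products
```

With 10 original products, this scheme yields product identifiers 0-9 for the originals and 10-19 for their resized versions, doubling the set of "products" seen by the DBL loss.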
Figure 7 reports some examples of spliced tiles, their tampering masks and the fingerprints extracted with function f (·). As we can see, even though the fingerprints F 1 and F 2 are not binary images yet, spliced areas are easily recognizable at this stage of the pipeline already.

B. TAMPERING MASK ESTIMATION
Once the fingerprint F is extracted, the next step in the pipeline is tampering mask estimation. This stage segments the fingerprint to generate a binary mask M̂ representing the integrity of the spliced tile T S given as input.
Many forensics methods in the literature provide a binary heatmap highlighting the spliced region. The most common approaches rely on automatic thresholding or two-class clustering. However, such procedures may lead to a non-efficient binary partition of the feature space [59], as the presence of scene content might still appear in the generated tampering mask. This phenomenon is noticeable also in the fingerprints shown in Figure 7. For instance, the fingerprint F 1 presents some texture related to the sea and the urban areas surrounding S.
Having a binary heatmap, however, is of paramount importance for supporting the work of forensic analysts. For this reason, the second step of our pipeline is dedicated to providing the most accurate possible tampering mask M̂. To do so, we propose different methods that deeply analyze the fingerprint F to provide an efficient binary partition of it. To this end, we have considered three different techniques, which can be divided into two families: • Unsupervised approaches. With these techniques, we first partition F into different clusters. Then, starting from these partitions, different candidate masks are compared to choose the most appropriate one. We propose two unsupervised methods: the first one is based on the K-means clustering algorithm [60]; the second one is based on Gaussian Mixture Models (GMMs) [61]. As unsupervised methods, both techniques do not require a preliminary stage of training. • Supervised approaches. In this scenario, we estimate the tampering mask M̂ using classic CNNs adopted in the image segmentation field. Specifically, we propose a supervised strategy based on the well-known U-Net architecture [42]. This method requires a preliminary stage of training.

1) Unsupervised K-means-based mask estimation
The K-means algorithm [60] is well known in the signal processing community and has historically been used to perform clustering operations. Given a number of observations, the algorithm partitions them into a finite set of groups, called clusters, assigning each observation to the group with the nearest mean (e.g., in terms of Euclidean distance).
The assumption behind its use in our pipeline is that the spliced region S is well localized in the fingerprint F. A good estimate of the tampering mask will reasonably present a well localized cluster of pixels. In a nutshell, we propose to look for different clusters of pixels in F, compare their compactness, and then choose the most compact one to estimate the final tampering mask M̂.
To do so, we first divide F into a set of non-overlapping patches P n , n = 1, . . . , N , with N being the total number of patches in F. These patches are the observations used by the K-means algorithm to cluster F based on their Euclidean distances. After the algorithm converges, the fingerprint F is divided into C clusters. We define the set of coordinates of pixels belonging to the c-th cluster as P c = [U c , V c ]. U c and V c are the sets of row and column coordinates, respectively.
It is worth noticing that, if the pixels belonging to a cluster are close to each other, this might be indicative of the presence of a well localized spliced area in the fingerprint F. We therefore need a measure of proximity of the pixels. As a metric, we propose to compute the variance of the coordinate values of the pixels belonging to each cluster. The smaller the variance, the more compact the cluster. Thus, the best localized cluster can be estimated as

ĉ = argmin_c (σ²(U_c) + σ²(V_c)),    (4)

where σ² is the variance, computed with respect to the arithmetic mean µ of the coordinate values.
Starting from the pixel coordinates of Pĉ = [Uĉ, Vĉ], i.e., the coordinates of the best localized cluster, we finally create a binary segmentation mask of the fingerprint F. We do so by assigning a positive label to all the pixels belonging to that cluster, and a negative label to all those not belonging to it. Following the convention introduced in (2), we assign 1 as positive label and 0 as negative one. The final tampering mask M̂ can be formally defined as

M̂(u, v) = 1 if (u, v) ∈ Pĉ, M̂(u, v) = 0 otherwise.    (5)

2) Unsupervised GMM-based mask estimation
Mixture distributions are a powerful statistical tool consisting in describing data by linearly combining basic distributions, such as Bernoulli, Dirichlet, or Gaussian [61]. They have been studied for years also for the task of data clustering, and GMMs in particular have proven extremely handy and simple to use when adopting the Expectation Maximization (EM) algorithm [62]. The basic functioning of the EM algorithm is not too dissimilar from that of K-means. Given a set of observations, the EM algorithm fits a mixture of C different Gaussian distributions to the data. The mixture is such that each c-th component groups together observations that have likely been generated by the same Gaussian distribution. This performs a clustering of the data based on the probability of each cluster having generated a data point.
Inside our pipeline, the use of GMMs closely mirrors that of K-means. In this case as well, we propose to divide the fingerprint F into non-overlapping patches P n, n = 1, . . . , N, with N being the total number of patches in F. These are the observations used by the EM algorithm. After the algorithm converges, the fingerprint F is again divided into C clusters, where the elements of the clusters are grouped based on how well their values can be described by the same Gaussian distribution. We then look for the most compact cluster ĉ, following the same methodology applied for the K-means mask generation method. Using (4), we look for the cluster whose pixels' coordinate values show the smallest variance. The final estimated tampering mask is defined as in (5).
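A minimal NumPy sketch of the unsupervised mask estimation, in its K-means variant, is given below. The center initialization (spreading centers over the range of patch energies) and the function names are our own assumptions; the GMM variant would only replace the clustering step with EM-fitted Gaussians:

```python
import numpy as np

def kmeans_mask(F, patch=8, C=7, iters=50):
    """Cluster non-overlapping patches of fingerprint F with K-means,
    then keep the spatially most compact cluster as the tampering mask."""
    H, W = F.shape
    ph, pw = H // patch, W // patch
    # observations: flattened non-overlapping patches
    obs = (F[:ph * patch, :pw * patch]
           .reshape(ph, patch, pw, patch)
           .transpose(0, 2, 1, 3)
           .reshape(ph * pw, patch * patch))
    # spread initial centers over the range of patch energies (assumption)
    order = np.argsort(np.linalg.norm(obs, axis=1))
    centers = obs[order[np.linspace(0, len(obs) - 1, C).astype(int)]].astype(float)
    for _ in range(iters):
        d = ((obs[:, None, :] - centers[None]) ** 2).sum(-1)
        lab = d.argmin(1)
        for c in range(C):
            if np.any(lab == c):
                centers[c] = obs[lab == c].mean(0)
    # compactness: variance of row/column coordinates, as in (4)
    rows, cols = np.divmod(np.arange(ph * pw), pw)

    def spread(c):
        idx = lab == c
        return np.var(rows[idx]) + np.var(cols[idx]) if idx.any() else np.inf

    best = min(range(C), key=spread)
    # binary mask as in (5): 1 on the chosen cluster, 0 elsewhere
    M = np.zeros((H, W))
    for n in np.flatnonzero(lab == best):
        r, c = divmod(n, pw)
        M[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 1.0
    return M
```

With the paper's settings, `patch=8` and `C=7` would be used on a 1024 × 1024 fingerprint.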

3) Supervised CNN-based mask estimation
We propose a supervised mask estimation strategy relying on the U-Net architecture [42]. Our choice fell on this network as it is easy and fast to train, while achieving highly competitive performance in the segmentation of a variety of imagery data, from SAR [63], to overhead RGB imagery for road extraction [64], to salt segmentation in seismic images [65], and of course medical imagery [66].
In the context of our proposed method, the use of the U-Net translates into using the network to generate a tampering mask estimate M̂ starting from an input fingerprint F. The proposed method first extracts a probability mask M̂_u defined as

M̂_u = u(F),    (6)

where u(·) is the fingerprint segmentation function implemented by the U-Net. M̂_u has the same pixel resolution of F, and each pixel presents values close to 0 when there is a low probability for that pixel of being spliced, and values close to 1 when there is a high probability of manipulation.
In order for the function u(·) to correctly implement a coherent segmentation, the deployment of the U-Net needs a stage of training, in which the network learns to retain only the information regarding the localization of the spliced region S. Such training can be done with a dataset of K spliced tiles T S k , k = 1, . . . , K. For each of them, a corresponding ground-truth tampering mask M k together with an extracted fingerprint F k needs to be provided.
During the training phase, every fingerprint F k is processed according to (6) to obtain the tampering mask estimate M̂_u,k. The network performance is then evaluated comparing the ground-truth tampering masks M k and the U-Net estimates M̂_u,k. We propose to do so by minimizing the sum of the Dice loss [67] and the Focal loss [68]. For a complete definition and a more comprehensive discussion of both losses, we refer the reader to the original papers. For the sake of our discussion, it suffices to say that both have proven to help reduce the overly thick boundaries in segmented objects that usually appear when training segmentation networks with a simple binary cross-entropy loss [67].
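The combined loss can be sketched as follows. These are common NumPy formulations of the soft Dice and Focal losses, not necessarily the exact variants of [67], [68] used by the authors:

```python
import numpy as np

def dice_loss(pred, gt, eps=1e-6):
    """Soft Dice loss: 1 - 2|X∩Y| / (|X|+|Y|)."""
    inter = (pred * gt).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def focal_loss(pred, gt, gamma=2.0, eps=1e-6):
    """Focal loss: binary cross-entropy down-weighted on easy pixels."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pt = np.where(gt == 1, pred, 1.0 - pred)  # probability of the true class
    return (-((1.0 - pt) ** gamma) * np.log(pt)).mean()

def segmentation_loss(pred, gt):
    # the paper minimizes the sum of the two losses
    return dice_loss(pred, gt) + focal_loss(pred, gt)
```

`pred` corresponds to the probability mask M̂_u and `gt` to the ground-truth mask M.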
At deployment stage, we propose to impose a threshold τ on the pixel values of the estimated mask M̂_u derived from a query fingerprint F. The final estimated mask is equal to

M̂(u, v) = 1 if M̂_u(u, v) ≥ τ, M̂(u, v) = 0 otherwise.    (7)

It is worth noticing that the training of the fingerprint extractor (presented in Section V-A) and the training of the U-Net for the forgery mask estimation do not happen simultaneously: we first need to train the fingerprint extractor to generate the fingerprint F, and then, in a second stage, the U-Net. Furthermore, notice that the fingerprint extractor is trained on pristine tiles only; the U-Net instead needs to be trained on spliced tiles.

VI. EXPERIMENTAL SETUP
In this section, we describe the details regarding our experimental setup, including the dataset collection procedure, the training strategy together with the hyperparameters for the CNN-fingerprint extractor and mask estimation methods, and finally the metrics used for our method evaluation.

A. DATASET
As introduced in Section V, in our work we considered SAR GRD products in single vertical polarisation. More specifically, we downloaded from the Copernicus Open Access Hub 20 products acquired in IW mode coming from the Sentinel-1 mission. All products have been sensed by the same satellite (S1-B), present high spatial resolution, and have overall dimensions in pixels of roughly 20000 × 20000. Given the size of these acquisitions, each of them has been divided into non-overlapping tiles T, 1024 × 1024 pixels wide. This operation allowed us to work at a local level with small granularity, making the input easily processable by our networks. From each product, we extracted 300 − 400 tiles, resulting in a total of approximately 8000 tiles. These data constituted the basis for all the steps of our experiments. Indeed, starting from these samples we created the following datasets: • Fingerprint Extraction Dataset (FED). This is the dataset of pristine tiles used for training the fingerprint extraction function f(·) defined in Section V-A; • Spliced Dataset 1 (SD1). This is a dataset of spliced tiles used for training the U-Net segmentation function u(·) defined in Section V-B; • Spliced Dataset 2 (SD2). This dataset is again constituted by spliced samples, which however have never been seen during the training nor the validation of the U-Net. We used these tiles to test the performances of the complete pipeline with all its tampering mask estimation methods. Table 1 reports a summary of the different datasets used in the paper. In the following paragraphs, we provide further details on the creation and usage of each set. Fingerprint Extraction Dataset (FED). For creating this dataset, we took only tiles from the first 10 SAR products we downloaded. More specifically, we took 50% of the tiles from each GRD product, reserving the remaining 50% for creating the Spliced Dataset 1 (SD1).
In this way, we assured that the training of the U-Net happened on samples never seen by the fingerprint extractor during its own training. In the end, we created a dataset of approximately 4000 pristine tiles for training the fingerprint extractor. Spliced Dataset 1 (SD1). The SD1 is a dataset of splicing attacks created for training the U-Net. The tiles have been taken from the first 10 GRD products downloaded for our experiments. More specifically, we took the 50% of the tiles not used for training the fingerprint extractor.
The splicing attacks have been realized in four different scenarios: 1) with the donor tile T D having undergone no editing; 2) with the donor tile T D having undergone a rotation with a randomly chosen angle; 3) with the donor tile T D having undergone resizing; 4) with the donor tile T D having undergone both rotation and resizing. For all four scenarios, we always considered the case where the donor tile T D and the target tile T T come from different products. We generated spliced tiles T S with a spliced region S contained inside a 128 × 128 or 256 × 256 pixel area. More specifically, we proceeded as follows: 1) we applied a selected editing operation to T D; 2) we randomly cropped a pixel region from T D, imposing it to have a maximum resolution of either 128 × 128 or 256 × 256 pixels; 3) we selected a random position in the target tile T T and pasted the spliced region S on it.
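The three-step generation procedure above, in the scenario without donor editing, can be sketched as follows; the function name and interface are hypothetical:

```python
import numpy as np

def make_splice(donor, target, size=128, rng=None):
    """Create a spliced tile: crop a region S from the (possibly edited)
    donor and paste it at a random position of the target.
    Returns the spliced tile and its ground-truth tampering mask M."""
    rng = rng or np.random.default_rng(0)
    H, W = target.shape
    # 1) any editing of the donor (rotation/resizing) would happen here
    # 2) random crop of the spliced region S from the donor
    r0 = rng.integers(0, donor.shape[0] - size + 1)
    c0 = rng.integers(0, donor.shape[1] - size + 1)
    S = donor[r0:r0 + size, c0:c0 + size]
    # 3) paste S at a random position of the target tile
    r1 = rng.integers(0, H - size + 1)
    c1 = rng.integers(0, W - size + 1)
    spliced = target.copy()
    spliced[r1:r1 + size, c1:c1 + size] = S
    mask = np.zeros((H, W), dtype=np.uint8)
    mask[r1:r1 + size, c1:c1 + size] = 1  # ground-truth mask M
    return spliced, mask
```

The inter-splicing scenario corresponds to `donor` and `target` coming from different GRD products; intra-splicing to both coming from the same one.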
We considered different combinations of parameters for the editing, resulting in a final number of 1600 samples. For clarity's sake, Table 2 reports all the considered editing operations with their parameters. Spliced Dataset 2 (SD2). The SD2 is a second dataset of splicing attacks designed to test the performance of the complete pipeline. The SD2 has been composed starting from the tiles of the last 10 GRD products at our disposal. These products have been reserved for this task to avoid any possible overlap between the data used for training the data-driven components of our pipeline and the data used for testing them.
For the generation of the SD2, we wanted a more challenging dataset with respect to the SD1. To do so, we considered both the cases where the donor tile T D and the target tile T T come from different or the same products. We define the first scenario as inter-splicing and the latter one as intra-splicing. Moreover, we extended the number of editing operations applied on T D using processing never seen by the U-Net. For this last aspect, we tried to simulate an attacker perspective and considered operations that could make the tampering more plausible in the SAR imaging context.
We used noise addition with two different distributions (Gaussian and Laplacian), two typologies of blurring (average, median), a similarity transformation comprising rotation and scaling, a speckle-like multiplicative noise degradation and, finally, we also considered the case where no editing is applied to T D. The parameters used for executing the editing are all reported in Table 2.
We also varied the dimensions of the spliced region S. Starting from the previous maximum pixel resolutions of 128×128 and 256×256 pixels, we included the intermediate areas of 160×160, 192×192 and 224×224. In the end, 7000 tiles compose the SD2, 3500 realized in the inter-splicing scenario, and 3500 realized in the intra-splicing one. These numbers account for 100 spliced tiles per area and operation, multiplied by 2 accounting for the inter and intra-splicing modalities.

B. TRAINING
Here we briefly illustrate the training procedures followed for the data-driven components of our pipeline, i.e., the fingerprint extractor and the U-Net mask estimator.

1) Fingerprint Extraction
The training set was constituted by the pristine tiles coming from the Fingerprint Extraction Dataset (FED). For the fingerprint extractor, we relied on the mini-batch construction procedure originally employed by Cozzolino and Verdoliva [25]. However, due to the limited amount of products available with respect to the original forensic task, we exploited more patches for the construction of the mini-batches. Specifically, in each batch we accounted for 4 GRD products at a time, inserting 10 tiles per product. From each tile, we randomly extracted 6 patches of 48 × 48 pixels, ending up with 240 patches per mini-batch.

(Table 2 reports the parameters used for the editing operated in the SD1 and SD2, together with the total number of samples per set; parameters for noise-based editing refer to samples with values between 0 and 1.)
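The 4 products × 10 tiles × 6 patches = 240-patch mini-batch composition can be sketched as follows; `sample_minibatch` and the in-memory data layout are our own assumptions:

```python
import numpy as np

def sample_minibatch(tiles_by_product, rng, n_products=4, tiles_per=10,
                     patches_per=6, patch=48):
    """Build one mini-batch: 4 products x 10 tiles x 6 patches = 240 patches.
    tiles_by_product: list of lists of 2-D tile arrays, one list per product."""
    prods = rng.choice(len(tiles_by_product), n_products, replace=False)
    patches, labels = [], []
    for p in prods:
        tiles = tiles_by_product[p]
        for t in rng.choice(len(tiles), tiles_per, replace=False):
            tile = tiles[t]
            for _ in range(patches_per):
                r = rng.integers(0, tile.shape[0] - patch + 1)
                c = rng.integers(0, tile.shape[1] - patch + 1)
                patches.append(tile[r:r + patch, c:c + patch])
                labels.append(p)  # the product id drives the DBL pair labels
    return np.stack(patches), np.array(labels)
```

The returned product labels are what the DBL loss uses to decide which fingerprint pairs are consistent.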
In executing the training, we investigated three scenarios, leading to three different fingerprint extractors: • Baseline Extractor (BE): this extractor corresponds to training the fingerprint extractor proposed in [25] off-the-shelf on amplitude SAR images. Specifically, we trained the DnCNN without relaxing the constraint on the position of the patches used to compute the DBL loss (see Section V-A for details). This extractor served as a baseline to evaluate the goodness of our proposed fingerprint extraction method. We randomly selected 5 products for training and 5 for validation, corresponding to 2000 tiles for training and 2000 for validation. • SAR Adapted Extractor (SAE): in this scenario, we relaxed the patch position constraint following the motivations reported in Section V-A. This translated into assigning a positive label in the DBL loss to every patch pair coming from tiles of the same product, regardless of the position from which the patches have been extracted. As in BE, we randomly picked 5 products for training and 5 for validation, ending up with 2000 tiles for training and 2000 for validation. • Augmented SAR Adapted Extractor (ASAE): for this extractor, we relaxed the patch position constraint as done for the SAE, but we also applied data augmentation. More specifically, we resized all the tiles using a 1.5 scaling factor, and then randomly cropped them to 1024 × 1024 pixels. As previously explained in Section V-A, we considered all resized tiles as coming from separate GRD products. We exploited 20 pristine products, i.e., 10 original products and 10 new products corresponding to their resized versions, randomly using 10 for training (corresponding to 4000 tiles) and 10 (other 4000 tiles) for validation.
All the extractors have been trained for a maximum of 500 epochs, with each epoch consisting of 128 batch iterations, using the Adam optimizer [69] with a learning rate of 10^−4. We stopped the training if the validation loss did not improve for 30 consecutive epochs. Then, we kept the model showing the best validation loss.
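The early-stopping policy can be sketched framework-agnostically as below; `step_fn` and `val_fn` are placeholders standing for one training epoch and one validation pass, respectively:

```python
def train_with_early_stopping(step_fn, val_fn, max_epochs=500, patience=30):
    """Keep the model state with the best validation loss; stop after
    `patience` epochs without improvement."""
    best_loss, best_state, stale = float("inf"), None, 0
    for epoch in range(max_epochs):
        state = step_fn(epoch)   # one epoch (e.g., 128 batch iterations)
        loss = val_fn(state)     # validation loss of the resulting model
        if loss < best_loss:
            best_loss, best_state, stale = loss, state, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_state, best_loss
```

The same wrapper applies unchanged to the U-Net training described next, which uses identical patience and epoch budget.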

2) U-Net Mask Estimator
Since the goal of the mask estimation task is highlighting potential forged areas in the fingerprint extracted from the query tile, the U-Net training dataset must consist of fingerprints. Therefore, we extracted the fingerprints of the samples in the SD1 by exploiting the three extractors listed in Section VI-B1. Then, we trained a U-Net on each set of fingerprints, creating a separate pipeline for each extractor. We relied on the U-Net model reported in [70], which allows using various CNNs as backbones for the encoder-decoder structure. Our choice fell on the EfficientNetB0 model [71], a network of the EfficientNet family which proved extremely handy and has recently found considerable success both in multimedia forensics [72]-[74] and in overhead imagery analysis [75]. We considered an EfficientNetB0 as encoder, and another one as decoder.
We randomly split the fingerprints extracted from the SD1 dataset into 50% for training and 50% for validation. We trained the networks for 500 epochs, using as loss function the one described in Section V-B and resorting to Adam optimization with a learning rate of 10^−4. We reduced the learning rate by a factor of 0.1 when the validation loss plateaued for 10 consecutive epochs, and early-stopped the training if the validation loss did not improve for 30 consecutive epochs. We kept the best validation model for all the networks trained with the three different fingerprint extractors.

C. TAMPERING MASK ESTIMATION PARAMETERS
In order for our unsupervised methods (i.e., K-means and GMM) to be successfully deployed, the dimension of the extracted patches P n, as well as the number C of clusters in which the fingerprint F is partitioned, are crucial aspects. Too large a resolution of each P n, or too small a number of clusters C, might lead to under-partitioning the fingerprint, which is generally far less preferable than over-segmentation.
For this reason, we spent a preliminary part of our work determining the right number of clusters and the right patch resolution, finding a good trade-off in dividing the fingerprint F into non-overlapping patches 8 × 8 pixels wide, and using 7 clusters. The U-Net also needs a correct threshold τ applied to the probability mask M̂_u. In this case, we found an optimal value with τ = 0.5.

D. EVALUATION METRICS
For evaluating our performance in correctly estimating the tampering mask, we relied on two metrics: the balanced accuracy and the Jaccard index, or Intersection Over Union (IOU). Given an estimated tampering mask M̂, we can divide its pixels based on the correctness of the tampering localization. Specifically, we can assign them to four different categories: true positives (TP), i.e., spliced pixels correctly detected as spliced; true negatives (TN), i.e., pristine pixels correctly detected as pristine; false positives (FP), i.e., pristine pixels wrongly detected as spliced; false negatives (FN), i.e., spliced pixels wrongly detected as pristine. The balanced accuracy is defined as

Balanced accuracy = 1/2 (TP/(TP + FN) + TN/(TN + FP)).

This quantity measures how well our pipeline performed in correctly assigning each pixel in the estimated tampering mask M̂, taking into account the disproportion between pristine and spliced pixels. The higher the balanced accuracy, the better the splicing localization. The IOU is defined as

IOU = TP/(TP + FP + FN) = |M̂ ∩ M| / |M̂ ∪ M|.

This measure is popular in computer vision for object detection tasks, where it is used to quantify how well a predicted bounding-box for an object overlaps with the actual object's position. For our task, this translates into the IOU accurately quantifying how well the area localized in the estimated tampering mask M̂ overlaps with the one indicated in the original mask M. Figure 8 shows some examples. Values close to 1 are better, but an IOU equal to or greater than 0.5 is good too.
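Both metrics follow directly from the pixel categories; a minimal sketch (the function name is ours):

```python
import numpy as np

def localization_metrics(M_hat, M):
    """Balanced accuracy and IOU between estimated and ground-truth masks."""
    tp = np.sum((M_hat == 1) & (M == 1))  # spliced pixels found
    tn = np.sum((M_hat == 0) & (M == 0))  # pristine pixels kept
    fp = np.sum((M_hat == 1) & (M == 0))  # pristine flagged as spliced
    fn = np.sum((M_hat == 0) & (M == 1))  # spliced missed
    # balanced accuracy: mean of recall on spliced and pristine pixels
    bal_acc = 0.5 * (tp / max(tp + fn, 1) + tn / max(tn + fp, 1))
    # IOU: overlap of the two positive regions
    iou = tp / max(tp + fp + fn, 1)
    return bal_acc, iou
```

Note that the balanced accuracy of an all-pristine prediction on a mostly pristine tile stays at 0.5, which is why it is preferred here over plain pixel accuracy.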

VII. RESULTS
In this section, we report the results of our experimental campaign. In particular, we describe the results achieved on the splicing localization task and compare our performances with the state of the art.
To have a fair comparison among the different fingerprint extraction and mask estimation methods, we resort to dataset SD2 for evaluating all the results. It is worth noticing that SD2 comprises acquisitions never seen by any of the data-driven blocks of our pipeline, together with new unseen editing operations. We show our results by considering separately the two scenarios of donor and target tiles coming from the same (i.e., intra-splicing) or different (i.e., inter-splicing) acquired products.
Tables 3 and 4 report the localization results obtained by combining the two proposed fingerprint extractors (i.e., SAE and ASAE) and the baseline fingerprint extractor (i.e., BE) with the mask estimation methods. Specifically, Table 3 depicts the results achieved in the inter-splicing scenario, while Table 4 shows the results for the intra-splicing scenario. Finally, Figures 9 and 10 report some examples of splicing attacks together with the outputs generated by our proposed pipeline. In the following, we report the major findings from these results.

A. FINGERPRINT EXTRACTORS COMPARISON
The best fingerprint extractor is always the ASAE, with the mask estimation methods GMM and U-Net alternating in providing the best performances. Moreover, the SAE showed on average better results than the baseline extractor, on 4 editing operations out of 7 in the inter-splicing scenario and on 6 operations out of 7 in the intra-splicing scenario.
Notice that, while the relaxation of the patch position constraint proposed in Section V-A (i.e., the SAE configuration) provided better average metrics with respect to the baseline, the additional insertion of a simple data augmentation like resizing (i.e., the ASAE configuration) gave an even greater performance boost. From this point of view, it is worth mentioning that the ASAE-based detectors showed better results also on editing operations which are not strictly related to resizing, for instance noise addition and blurring.

B. THE "NO EDITING" SCENARIO
On the "No editing" operation, all the pipelines presented the worst results. Moreover, we observed significant differences in performance depending on the donor and target tiles coming from different (i.e., inter-splicing) or the same (i.e., intra-splicing) acquired products. In the first scenario, with best IOU and balanced accuracy of 0.25 and 0.69 respectively, performances were still fairly good. In the second scenario, the evaluation metrics depicted instead an almost random decision for the estimation of the tampering mask.
This different behavior was somehow expected. As explained in Section V, our proposed pipeline has been designed with the objective of capturing inconsistencies related to the generation process of amplitude SAR products. The fingerprint extractor has been trained to provide a globally incoherent fingerprint only if the spliced region and its surrounding areas have undergone different processing. In intra-splicing attacks, when no editing is applied, splicing inconsistencies are absent as T D and T T come from the same GRD product. For this reason, we expected our detectors not to be able to localize such attacks.

C. GENERALIZATION ON EDITING OPERATIONS
With the only exception of the "No editing" scenario, in Table 3 the results achieved on each editing operation always depict an IOU greater than 0.66 and a balanced accuracy exceeding 0.91. In Table 4, we exceeded 0.60 and 0.85 for IOU and balanced accuracy, respectively. While intra-splicing results are lower than inter-splicing ones, we expected such a behavior for the reasons reported above: spliced tiles in intra-splicing modality do not present inconsistencies in the forensic traces related to the pipeline that generated their original products, making them a more difficult asset to analyze. However, the overall good performances also in the intra-splicing scenario suggest that the proposed pipeline was useful in finding inconsistencies associated with general editing operations executed on S.
It is interesting to notice that all the methods performed consistently across the different types of editing considered (i.e., noise addition, blurring, rotation, resizing, noise multiplication). In particular, the results achieved by the U-Net, which showed the best balanced accuracy for noise-based attacks, were quite surprising, considering that the U-Net was trained only on resizing-based attacks.

D. SUPERVISED VERSUS UNSUPERVISED APPROACHES
Comparing the different mask estimation methods, we can notice that GMM and U-Net alternated in providing the best performances. The U-Net represented the best method for all editing operations in terms of balanced accuracy, while the GMM provided best results in terms of IOU on 5 operations out of 7.
The K-means, while being the worst method of the three, showed nevertheless fairly good results. For instance, it was better than the U-Net in 5 operations out of 7 in terms of IOU in both inter and intra-splicing scenarios. Moreover, with respect to the GMM-based method, the K-means presented shorter computational times. Figure 11 reports a small performance study where we computed, varying the resolution of the analyzed sample, the time needed by both algorithms to generate a tampering mask. We can see that for samples close to 1024 × 1024 pixels the two methods showed similar performances but, starting from resolutions equal to or greater than 2000 × 2000 pixels, K-means was considerably faster. We argue this behavior was due to the EM algorithm requiring more iterations than K-means to reach similar convergence, with each iteration requiring more computations [76]. Therefore, at deployment stage an end user might prefer to trade the localization performance of the GMM method for the faster execution times of K-means.
Finally, while the U-Net-based strategy might seem the most promising one on average, we must be aware that, being a supervised technique, it requires a preliminary training stage. Since training CNNs takes time and computational resources, the GMM and K-means methods may be appealing as well, depending on the final needs and the available resources. Being unsupervised, they are faster to deploy while still providing good performances.

E. COMPARISON WITH STATE-OF-THE-ART
Concerning the comparison with the state of the art, Tables 5 and 6 summarize the best localization results achieved by the two proposed fingerprint extractors for each editing operation, along with the results obtained by the Noiseprint method [25] and the Splicebuster method by Cozzolino et al. [18]. We selected these techniques as they are widely used as baselines in the forensics literature. Moreover, they proved robust to standard editing operations and do not require many adaptations to the domain of the data under investigation.
Since Noiseprint produces real-valued heatmaps (without binarization), all the tampering masks have been created starting from the extracted fingerprint and following our proposed mask estimation methods (i.e., the K-means, the GMM and the U-Net). The achieved results exactly correspond to the BE results that we previously showed. Splicebuster returns real-valued heatmaps as well, but in this case we estimated binary tampering masks following the methodology suggested by the same authors in [19].
The state-of-the-art techniques always performed worse than our best localization method. Nonetheless, especially on the editing operations involving noise addition or multiplication (i.e., the Speckle-like noise), Splicebuster performed even better than some of the methods based on the BE and SAE extractors. Despite the clear differences between natural images and amplitude SAR products, from the nature of the signals they depict to the different processing that leads to their formation, these results indicate that forensic tools based on generic footprints might prove useful also in the SAR context. This is especially true if the attacker relies on common editing operations (e.g., resizing, blurring, etc.), where features like the high-pass frequency co-occurrences used by Splicebuster are robust enough to enable effective splicing localization.

VIII. CONCLUSIONS
In this paper, we analyzed the problem of splicing localization in amplitude SAR imagery. The forensic analysis of these objects is becoming of paramount importance, as amplitude SAR products are relatively easy to handle and process, even with general-purpose editing software such as GIMP or Photoshop.
To the best of our knowledge, no solution tailored to this kind of signals has been proposed yet in the forensics literature. As a matter of fact, amplitude SAR products present a completely different nature with respect to natural imagery, and therefore pose new and different challenges in assessing their integrity.
Inspired by a state-of-the-art method developed for natural images, we proposed a new splicing localization technique specifically designed for amplitude SAR products. Our method extracts a fingerprint localizing spliced regions in SAR tiles. The fingerprint can then be analyzed using three different methods, one supervised and two unsupervised, to generate a final tampering mask, i.e., a binary map indicating whether each pixel underwent splicing or not. We generated different datasets of spliced GRD tiles to train and test our proposed method, considering different kinds of manipulations applied to the tiles, from noise-based attacks to blurring and resizing.
All the proposed techniques showed encouraging results in the localization of splicing attacks, providing better performances than state-of-the-art solutions developed for natural images. The supervised approach reported the best numbers in terms of balanced accuracy, though it requires a preliminary training stage. The unsupervised approaches showed instead better performances in terms of the IOU metric and, while less accurate in terms of balanced accuracy, they do not require a training phase.
These results prove the feasibility of the forensic analysis of amplitude SAR imagery, paving the way to further investigations on methods tailored to this kind of signals. Possible research directions include the evaluation of the proposed pipeline on more elaborate splicing attacks (e.g., GAN-generated inpainting), which should in principle be detected by our technique; a specific exploitation of traces related to the generation pipeline of SAR images; the use of physics-based clues linked to the scene represented in the data; and, finally, the adaptation of splicing localization methods to electro-optical imagery.