Forensic Analysis of Synthetically Generated Western Blot Images

Abstract-The widespread diffusion of synthetically generated content is a serious threat that calls for urgent countermeasures. Indeed, the generation of synthetic content is not restricted to multimedia data like videos, photographs, or audio sequences, but covers a considerably broader area that includes biological images as well, such as western blot and microscopy images. In this paper, we focus on the detection of synthetically generated western blot images. These images are widely used in the biomedical literature, and it has already been shown that they can be easily counterfeited, with little hope of spotting the manipulations by visual inspection or with standard forensic detectors. To overcome the lack of publicly available data for this task, we create a new dataset comprising more than 14K original western blot images and 24K synthetic western blot images, generated with four different state-of-the-art generation methods. We investigate different strategies to detect synthetic western blots, exploring binary classification methods as well as one-class detectors. In both scenarios, we never exploit synthetic western blot images at the training stage. The achieved results show that synthetically generated western blot images can be spotted with good accuracy, even though the exploited detectors are not optimized on synthetic versions of these scientific images. We also test the robustness of the developed detectors against post-processing operations commonly performed on scientific images, showing that detection can be robust to JPEG compression and that some generative models remain easily recognizable, even though editing may alter the artifacts they leave.

I. INTRODUCTION
Synthetically generated multimedia content has been flooding the web lately, catching people's attention mainly thanks to the entertainment and artistic possibilities that arise from these new technological advancements. State-of-the-art methods for synthetic content generation allow one to synthesize incredibly realistic images and audio sequences [1], [2], [3], [4], [5]. It is also possible to transfer the identity of a person [6], or even their body movements [7], from one video to another. The majority of these innovative tools owe their birth to Generative Adversarial Networks (GANs) and probabilistic generative models, which are the leading technologies for synthesizing multimedia data. Most of these tools offer easy-to-use free interfaces, such that any amateur without particular experience in digital arts can use them.
In spite of these evident new artistic opportunities, the vast production of synthetic content inevitably introduces serious threats to data trustworthiness and integrity. Novel technologies can be maliciously exploited for data counterfeiting. This phenomenon is not limited to digital multimedia content but has been spreading worldwide over a significantly larger area, potentially including images reported in scientific publications [8], [9], [10].
In particular, western blot images are widely used in the biomedical literature on molecular biology and immunogenetics. They enable the analysis of proteins at a high level of sensitivity and precision [11]. The scientific community started questioning their authenticity in 2016, when the authors of [12] began to scan images from more than 20K scientific papers, eventually discovering an alarmingly high manipulation rate (around 4%), with several duplicated or tampered-with images.
Nowadays, the most common procedure to spot manipulations of western blot images is visual inspection. As a matter of fact, forensic techniques aimed at spotting local image tampering have a hard time detecting manipulations applied to scientific images. This is often due to their reduced pixel resolution and the numerous processing operations applied to create realistic forgeries [13].
Visual observation by an expert is still the most widespread approach, although it requires one important hypothesis: the investigated images are supposed not to be synthetic. The manipulated region is assumed to derive from an already existing image, suitably processed to hide tampering traces. If the western blot image under analysis has been synthetically generated, either entirely or only in specific pixel regions, there would be essentially no hope of spotting such traces by visual inspection [10]. Indeed, in preliminary experiments, the authors of [10] verified that standard GAN-based image generation techniques [14], [15] can synthesize western blots that are almost indistinguishable from real ones, even to expert eyes.
In this paper, we tackle the detection of western blot images that have been synthetically generated through GANs and probabilistic generative models. Our goal is to explore forensic methodologies to automatically classify synthetic and real western blots. We investigate how different forensic strategies developed for natural images perform on scientific images. In doing so, we simulate the realistic scenario in which synthetic versions of western blot images are not available to the analyst, who therefore cannot develop a forensic detector specifically tailored to them. We experiment with two main approaches: 1) a binary classification approach, borrowed from a recently proposed method for detecting synthetic versions of natural images [16]. This method relies on a Convolutional Neural Network (CNN) purposely designed to tell real and synthetic images apart. In particular, we never train the detector on western blot images, thus testing its robustness on images of a different nature such as western blots; 2) a one-class classification approach, in which we train a detector only on original western blots, looking for any anomalies or inconsistencies appearing in the synthetic images.
To compensate for the absence of a publicly available dataset of real and synthetic western blots, we create a new one comprising 14K real and 24K synthetic images, generated by means of three different GANs and one probabilistic generative model. To the best of our knowledge, the detection of synthetic images generated through probabilistic generative models has not yet been addressed in the literature. Since these models have recently proven to synthesize images with fidelity and diversity comparable to GANs, we offer a first insight into their detectability.
We extensively evaluate the proposed techniques, comparing various binary detectors and one-class detectors over the generated dataset. The achieved results demonstrate that the currently available strategies developed for natural images can be a valid option for identifying synthetic western blots.
Moreover, we test the detector robustness to common post-processing operations like image compression and resizing. We show that the proposed one-class classification approach can be robust to JPEG compression and can detect the synthetic images generated through some generative models almost independently of the post-processing applied.
To summarize, the main contributions of this paper are:
• We create a new dataset of western blot images including 14K real images and 24K synthetic images, generated by means of four different generative models, including probabilistic models, whose detectability has never been investigated in the state of the art.
• We investigate forensic strategies for the detection of synthetically generated western blots, proposing both binary and one-class detection approaches that never exploit synthetic western blots at the training stage.
• We explore the robustness of the proposed approaches in case the scientific images are post-processed with common editing operations.
• Results demonstrate the validity and generalization of the proposed methods, although additional research is needed to enhance robustness against standard processing operations and unseen generative models.
The rest of the paper is organized as follows. In Section II, we describe the generation process of synthetic western blots and present the created dataset. In Section III, we provide details on the proposed detection methods to distinguish real from synthetic western blots. In Section IV, we describe the experimental setup and discuss the obtained results. Eventually, in Section V, we draw our conclusions.

II. SYNTHETIC WESTERN BLOT GENERATION
In this section, we provide details about the generation process of synthetic western blot images. We start with a brief description of the methods used for synthetic image generation, then we illustrate the original images employed as reference for the creation of synthetic samples. Eventually, we present the synthetic generation process and the generated dataset, providing some examples and highlighting the differences among the generation strategies.

A. Architectures
To generate synthetic western blot images, we adopt well-known CNN architectures from the natural image generation literature.
Three of the proposed CNNs belong to the family of GANs, which have been extensively used to generate synthetic images of human faces, animals, and various objects. We first illustrate GANs dealing with the image-to-image translation problem. Among the various methods presented in the literature, we focus on the Pix2pix [14] and CycleGAN [17] models, two of the best performing and most widespread generation methods. We also consider style-based GANs [1], [2], [18], [19]; in particular, we employ StyleGAN2 with Adaptive Discriminator Augmentation (StyleGAN2-ADA), one of the newest and most promising models [1] for the task.
The last considered technique is based on probabilistic generative models, which have recently been proposed as an alternative to GANs for creating synthetic data with high fidelity [20], [21], [22], [23]. In particular, we select the Denoising Diffusion Probabilistic Model (DDPM) proposed in [23].
1) Image-to-image translation GANs: Image-to-image translation GANs cover the vast area of generative networks that learn a mapping between two image categories and translate one category into the other. To perform image-to-image translation, we need to train GAN architectures with multiple images selected from the two distinct groups.
Pix2pix. Pix2pix [14] is an image-to-image translation GAN inspired by conditional adversarial networks. It follows the typical paradigm of image-to-image translation models, as it requires a training set of aligned image pairs in which there exists a correspondence between two images of distinct categories. For instance, an aligned image pair could be composed of a color image and its grayscale version, or of an edge-map and the corresponding photograph. Specifically, Pix2pix exploits a conditional GAN that conditions on an input image and generates an output translated image [14].
CycleGAN. CycleGAN [17] belongs to a particular class of image-to-image translation GANs known as unpaired image-to-image translation models. Since finding paired training data is not always possible and can be difficult and expensive [17], CycleGAN is trained to translate between images of distinct domains without exploiting aligned image pairs. The main feature of CycleGAN is its "cycle-consistency" property, which translates an input image into a meaningful output synthetic image belonging to a different category [17].
2) Style-based GANs: Style-based GANs were born in 2019 as an alternative to traditional generation models [24]. The generator of StyleGAN [25] introduces a mapping of the latent code into an intermediate latent code, which is transformed into different "styles" that control the layers of the synthesis network. The proposed architecture has been further improved with the StyleGAN2 [19], StyleGAN2-ADA [26] and StyleGAN3 [2] models, which remove undesired blob artifacts and achieve outstanding synthesis quality while training on only a few samples. The main difference between image-to-image translation GANs and style-based ones lies in the input data to be provided for training and for synthesizing new images. While the former need pairs of input images selected from two distinct categories for training and one image category for synthesizing, the latter require images of a single category for training and synthesize new images with the same style starting from the latent code provided to the generator.
3) Probabilistic generative models: Probabilistic generative models are a class of generative models that sequentially disturb the training data with slowly increasing noise, and then learn to reverse this corruption in order to build a generative model of the original clean data [22]. Among them, DDPMs [27], [20], [21], [23] were introduced in 2015 by [27]. They model the "noising" data process as a forward diffusion process that gradually converts any complex data distribution into a simple and tractable noise distribution [27]. Then, they learn the backward process (i.e., how to pass from the noise distribution back to the data distribution), which allows them to generate new synthetic data. In the last few years, DDPMs have proven to generate data with high fidelity and diversity, often matching or outperforming state-of-the-art GANs.
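As an illustration of the forward diffusion process described above, the following sketch (our own simplification, not the exact implementation of [23]) applies the closed-form noising step x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε with a linear variance schedule:

```python
import numpy as np

def forward_diffusion(x0, t, betas, rng=None):
    """Sample x_t ~ q(x_t | x_0) using the closed-form DDPM forward process."""
    rng = rng or np.random.default_rng(0)
    alphas = 1.0 - betas                   # alpha_t = 1 - beta_t
    alpha_bar = np.cumprod(alphas)[t]      # cumulative product up to step t
    noise = rng.standard_normal(x0.shape)
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# Linear schedule from 1e-4 to 0.02 over 1000 steps, common for DDPMs
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.zeros((256, 256))                  # a toy "clean" image
xT = forward_diffusion(x0, t=999, betas=betas)
```

As t grows, ᾱ_t approaches 0 and x_t approaches pure Gaussian noise; the generative model is trained to invert this process step by step.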

B. Original Images
We collected almost 300 original RGB images of different resolutions depicting multiple western blots. Every western blot image may contain several bands, which can have multiple shapes. The final shape depends on the operations performed by the biologist who processed the protein, on the protein itself, and on the properties of the processing apparatus [11]. The images usually present irregularities like spots, scratches, and bubbles. All these factors make every western blot image almost unique, and even the single bands it contains can show small variations among them [11].
Our dataset includes 284 original images downloaded from the web or selected from scientific publications. Since all the images are small (i.e., usually less than 256 pixels on the smallest dimension), we resize them, keeping the aspect ratio of the initial image, so that the minimum dimension is always equal to 256 pixels. A few examples of the original western blot images are depicted in Figure 1.
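The aspect-ratio-preserving resize can be sketched as follows (a minimal illustration; the rounding and interpolation choices are our assumptions, since they are not specified):

```python
def min_side_resize_dims(width, height, target=256):
    """New (width, height) such that the smaller side equals `target`,
    keeping the aspect ratio of the input image."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

# Example: a 300x180 image is scaled so its 180-pixel side becomes 256
new_w, new_h = min_side_resize_dims(300, 180)
```

The resampled image itself can then be produced with, e.g., Pillow's `Image.resize((new_w, new_h))`.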

C. Synthetic Images
We first show how to generate synthetic images with GANs, then we present probabilistic generative models. Eventually, we illustrate the final generated dataset used in our experiments.
1) Image-to-image translation models: We propose to generate synthetic western blots by feeding image-to-image translation GANs with images selected from the following two categories:
• original western blot images;
• images containing information on the position of the western blot bands inside the original images.
In particular, samples belonging to this last category have the same resolution as the original images they refer to, but consist of binary values, being 0 in pixels corresponding to a detected blot band and 1 elsewhere. We refer to these images as blot-masks. A given original western blot image I is in one-to-one correspondence with its blot-mask M. For example, Figure 2 depicts the blot-masks corresponding to the original images shown in Figure 1. We build the blot-masks through a semi-automatic segmentation process. For each image, we exploit Otsu's image thresholding [28] and Watershed segmentation [29] algorithms to automatically obtain candidate blot-masks associated with the image, then we pick the best mask by visual inspection.
Pix2pix. We generate synthetic western blots by training Pix2pix with images belonging to the two classes reported above. Pix2pix requires these images to be aligned with respect to each other. In other words, each original image and the related blot-mask should be included within the same input pair. The network is trained to learn the mapping between the position of the western blots (i.e., the information carried by the blot-masks) and their related representation (i.e., the information carried by the original images). As Pix2pix requires square input images, we randomly extract 50 square patches of size 256 × 256 from each original image and an equal amount of square patches from the related blot-mask. Figure 3 sketches the training setup required by Pix2pix.
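The thresholding step of the semi-automatic blot-mask construction can be illustrated with a from-scratch Otsu threshold (a simplified sketch; the Watershed refinement and the final visual selection used in the paper are omitted here):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level maximizing between-class variance (Otsu's method).
    `gray` is a 2D array of uint8 gray levels."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class-0 probability up to each level
    mu = np.cumsum(prob * np.arange(256))    # cumulative mean
    mu_t = mu[-1]                            # global mean
    # Between-class variance for every candidate threshold
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    return int(np.argmax(sigma_b))

def blot_mask(gray):
    """Binary blot-mask: 0 on (dark) band pixels, 1 elsewhere."""
    t = otsu_threshold(gray)
    return (gray > t).astype(np.uint8)
```

On a bimodal western blot crop (dark bands on a light background), the threshold falls between the two modes and the mask isolates the bands.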
In the generation phase, we provide as input only binary masks according to the desired western blot location. Pix2pix generates new synthetic images containing western blot bands in these positions.
CycleGAN. To generate synthetic western blots with CycleGAN, we propose to feed it with the same images exploited for training the Pix2pix model. However, we can remove the alignment constraint and train with unpaired images. Like Pix2pix, CycleGAN also requires square input images; therefore, we use the same square patches extracted for training Pix2pix. Figure 4 depicts a sketch of the training setup required by CycleGAN. Notice the relaxation of the alignment constraint with respect to the Pix2pix model reported in Figure 3.
In the generation phase, we provide again as input only binary masks according to the desired western blot location. CycleGAN, similarly to Pix2Pix, generates new synthetic images containing western blots in these positions.
2) Style-based generation models: Among the style-based generative models, we exploit StyleGAN2-ADA, which has proven to generate highly realistic images and needs fewer samples for training with respect to StyleGAN2 and StyleGAN3. Differently from image-to-image translation models, we can feed the network with single square input patches. During training, the network learns to generate new images with the same style as the training dataset. Figure 5 depicts a sketch of the training setup required by StyleGAN2-ADA.
In the generation phase, we can provide different seeds to the synthesis network, each one corresponding to a new synthetic western blot image.
3) Probabilistic generative models: We select the DDPM proposed in [23], which recently improved the generation performance of diffusion models both in terms of data fidelity and diversity. As done for StyleGAN2-ADA, we directly feed the generative model with square input patches from the training data (see Figure 5). We use these images to implement a diffusion-based noising process and then learn how to reverse it. In the generation phase, samples from the noise distribution can be randomly drawn. Starting from them, the DDPM gradually removes the noise and returns new synthetic western blots.

4) Final dataset:
The final dataset that we use to evaluate our experimental setup consists of original and synthetic square images with a common size of 256 × 256 pixels.
The original samples are derived from the data described in Section II-B. Specifically, we randomly extract 50 square patches per original western blot image. We end up with 14,200 real images of size 256 × 256 pixels.
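The patch extraction step can be sketched as follows (a minimal illustration; drawing patch positions uniformly at random is our assumption, since the sampling scheme is not detailed):

```python
import numpy as np

def extract_patches(image, n_patches=50, size=256, rng=None):
    """Randomly crop `n_patches` square patches of `size` x `size` pixels."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape[:2]
    patches = []
    for _ in range(n_patches):
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        patches.append(image[top:top + size, left:left + size])
    return np.stack(patches)

# 284 source images x 50 patches each = 14,200 real patches
patches = extract_patches(np.zeros((256, 640, 3)), n_patches=50)
```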
The synthetic samples include:
• 6,000 square images of size 256 × 256 generated by the Pix2pix model, providing as input to the generator the same blot-masks seen in the training phase;
• 6,000 square images of size 256 × 256 generated by the CycleGAN model, providing as input to the generator the same blot-masks seen in the training phase;
• 6,000 square images of size 256 × 256 generated by the StyleGAN2-ADA model, providing as input to the generator a different seed for each new image to be synthesized;
• 6,000 square images of size 256 × 256 generated by DDPM, providing as input to the generator different noisy samples, corresponding to an equal number of new images to be synthesized.
Figures 6-9 depict a few examples of synthetic western blot images generated by the four proposed models. If we provide the same blot-mask to Pix2pix and CycleGAN, the generated western blot varies according to the generation model. Nonetheless, in both situations, the synthetic images are plausible and realistic. In the case of StyleGAN2-ADA and DDPM, the generated samples present high quality and photo-realism. The complete dataset is available at https://www.dropbox.com/sh/nl3txxfovy97b1k/AABqb-gkGBEfjS6pjke3a-d7a?dl=0.

III. SYNTHETIC WESTERN BLOT DETECTION
In this section, we present the investigated methods for synthetic western blot detection. Given a query image, we investigate two kinds of classification setups: (i) a binary setup, in which we train a binary classifier on both original and synthetic natural images; (ii) a one-class setup, in which we train a one-class classifier only on the original western blot dataset. We always consider the challenging scenario in which the synthetic dataset of western blots is never seen during the detectors' training phase. In the binary classification framework, the training dataset does not even include the original western blot images, but only natural images. In the one-class detector configuration, we only see a reduced subset of the original western blot images during training.

A. Binary detection
We investigate the challenging scenario in which we never see western blots, pristine nor synthetic, during the training phase. We consider the realistic situation in which we have available some binary classifiers trained to distinguish original from synthetic images, which however do not belong to the western blot image category. For instance, we may have available binary classifiers trained to detect original and synthetic versions of human faces, animals or objects.
To this purpose, we borrow some of the GAN-image detectors recently proposed in [16], which performs a critical state-of-the-art analysis of the GAN-image detection task. The backbone architecture is a ResNet50, modified to avoid down-sampling in the first network layer, as suggested by [30]. In [16], this architectural modification proves to be robust to compression and resizing operations performed on the test images.
At deployment stage, each classification is associated with a positive score for the images belonging to the synthetic category and a negative score for the original category.

B. One-class detection
In this scenario, we remove the possibility of training the detector over synthetic images of any category, i.e., we consider training only on original images. We propose to train a one-class classifier over a reduced set of the original western blot images.
To describe the texture characteristics of the training images, we propose to extract some features that are then fed to the classifier. Following a common state-of-the-art procedure [31], [32], we convert each color image to grayscale and apply high-pass filtering by subtracting a low-pass version of the grayscale image from itself, where the low-pass filter is a 3 × 3 spatial kernel. We report some examples of high-pass filtered images in Figure 10. At visual inspection, there are no significant traces to tell real (top row) and synthetic images (bottom row) apart. Then, we convert the pixel values to 8-bit unsigned integers and compute the gray-level co-occurrence matrix, a 2D matrix reporting the histogram of co-occurring grayscale values at a given offset over the input image. Indeed, the co-occurrence matrix has been widely exploited in the forensics literature both for binary and one-class detection tasks. For instance, co-occurrences have been used for spotting subtle differences (usually not visible at human inspection) in the textural features of real versus manipulated (through splicing) natural images [32] and of real versus synthetic natural images [33], [34].
We define the co-occurrence matrix as C, with size 256 × 256. Every element [C]_ij, with i, j ∈ {0, ..., 255}, corresponds to the number of times the gray-level j occurs at a certain distance from the gray-level i, along a certain direction. To compute the co-occurrence matrix C, we investigate four different distances d for the gray-level comparison, testing d ∈ {4, 8, 16, 32}, along both the horizontal and vertical directions. We normalize each co-occurrence matrix C by the sum of its elements, defining C̃ = C / Σ_ij [C]_ij. To motivate our choice, we randomly select 25 original western blots and 25 synthetic ones and investigate the behavior of co-occurrences on this reduced dataset. We apply gray-scale conversion and high-pass filtering to each image and compute the co-occurrence matrix corresponding to every considered gray-level distance and direction. Then, we aggregate the results through a pixel-wise arithmetic mean, ending up with one average co-occurrence matrix per image, of size 256 × 256. To focus on the main differences between original and synthetic samples, we propose to compute the Principal Component Analysis (PCA) of their average co-occurrences. In this way, we can extract compact descriptors of original and synthetic samples and evaluate their differences visually.
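The high-pass residual and co-occurrence computation can be sketched as follows (a simplified illustration: a 3 × 3 box filter stands in for the unspecified low-pass kernel, the residual is shifted and clipped to [0, 255] as our own quantization assumption, and only the horizontal direction is shown):

```python
import numpy as np

def highpass_residual(gray):
    """Subtract a 3x3 box-filtered (low-pass) version of the image from itself."""
    padded = np.pad(gray.astype(float), 1, mode="edge")
    lowpass = sum(padded[r:r + gray.shape[0], c:c + gray.shape[1]]
                  for r in range(3) for c in range(3)) / 9.0
    return gray - lowpass

def to_uint8(residual):
    """Shift and clip the residual into 8-bit range (assumed quantization)."""
    return np.clip(residual + 128, 0, 255).astype(np.uint8)

def cooccurrence(img8, d=4):
    """Normalized 256x256 matrix of gray-level pairs at horizontal distance d."""
    left, right = img8[:, :-d].ravel(), img8[:, d:].ravel()
    C = np.zeros((256, 256))
    np.add.at(C, (left, right), 1)   # count each co-occurring pair
    return C / C.sum()               # elements sum to 1
```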
In more detail, we extract the PCA of the co-occurrence matrices of the original images. Then, we compute the projection on the extracted principal components for both original and synthetic samples. Figure 11 shows the ability of the resulting descriptors to separate the two classes. Considering that we are extracting a high-pass filtered version of the western blots, Figure 11 confirms the findings reported in [35], [36], i.e., that original natural images show a more homogeneous behavior in the high-frequency components than synthetically generated ones.
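This visual check can be sketched with scikit-learn (our own minimal illustration; flattening each matrix and keeping two components are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_projections(real_mats, synth_mats, n_components=2):
    """Fit PCA on real co-occurrence matrices only, then project both classes."""
    X_real = np.stack([m.ravel() for m in real_mats])    # flatten each matrix
    X_synth = np.stack([m.ravel() for m in synth_mats])
    pca = PCA(n_components=n_components).fit(X_real)
    return pca.transform(X_real), pca.transform(X_synth)
```

Plotting the two projected point clouds then reveals how well the compact descriptors separate original from synthetic samples.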
Given these premises, we process the co-occurrence matrices to extract several texture properties that enable distinguishing real from synthetic samples. In particular, we explore 5 different processing methods that extract one scalar feature from every image:
• Contrast-weighted feature: given C̃, we weight each element by the squared difference of its coordinates, and we sum over all the matrix elements. We define the contrast-weighted feature as f_c = Σ_ij (i − j)² [C̃]_ij.
• Homogeneity-weighted feature: given C̃, we divide each element by the squared difference of its coordinates shifted by 1, and we sum over all the matrix elements. We define the homogeneity-weighted feature as f_h = Σ_ij [C̃]_ij / (1 + (i − j)²).
• Dissimilarity-weighted feature: given C̃, we weight each element by the absolute difference of its coordinates, and we sum over all the matrix elements. We define the dissimilarity-weighted feature as f_d = Σ_ij |i − j| [C̃]_ij.
• Energy feature: given C̃, we compute the square root of its energy, defining the energy feature as f_e = sqrt(Σ_ij [C̃]_ij²).
• Correlation-weighted feature: given C̃, we weight each element by a cross-correlation measure of its coordinates, and we sum over all the matrix elements. Precisely, we compute the correlation-weighted feature as f_ρ = Σ_ij [R]_ij [C̃]_ij, where R is a square matrix with the same size as C̃ which emulates a normalized cross-correlation between row and column coordinates, weighted by the matrix C̃. For the sake of clarity, we define [R]_ij = (i − μ_r)(j − μ_c) / (σ_r σ_c), where μ_r = Σ_ij i [C̃]_ij and σ_r² = Σ_ij (i − μ_r)² [C̃]_ij (μ_c and σ_c are defined analogously over the column coordinate j).
Considering that we extract 5 textural features (i.e., f_c, f_h, f_d, f_e and f_ρ) for each co-occurrence matrix version, we finally end up with 40 different features per query image.
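A numpy sketch of these five scalar features, written from the verbal definitions above (treat the exact normalizations as assumptions):

```python
import numpy as np

def glcm_features(C):
    """Five scalar texture features from a normalized co-occurrence matrix."""
    C = C / C.sum()                            # ensure elements sum to 1
    i, j = np.indices(C.shape)
    f_c = np.sum((i - j) ** 2 * C)             # contrast-weighted
    f_h = np.sum(C / (1.0 + (i - j) ** 2))     # homogeneity-weighted
    f_d = np.sum(np.abs(i - j) * C)            # dissimilarity-weighted
    f_e = np.sqrt(np.sum(C ** 2))              # energy
    mu_r, mu_c = np.sum(i * C), np.sum(j * C)  # marginal means
    sd_r = np.sqrt(np.sum((i - mu_r) ** 2 * C))
    sd_c = np.sqrt(np.sum((j - mu_c) ** 2 * C))
    f_rho = np.sum((i - mu_r) * (j - mu_c) * C) / (sd_r * sd_c)  # correlation
    return f_c, f_h, f_d, f_e, f_rho
```

As a sanity check, a purely diagonal co-occurrence matrix has zero contrast and dissimilarity and a correlation of 1, since identical gray levels always co-occur.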
We propose to feed every single feature to a one-class classifier, investigating both the well-known and widely exploited One-Class Support Vector Machine (OCSVM) [37], which we consider as a baseline reference, and the more recent Isolation Forest (IF) [38], which has been proposed as an efficient strategy for anomaly detection. Both algorithms are trained to detect outlier samples that are not distributed as the original training data.
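A minimal scikit-learn sketch of this one-class setup (hyperparameters here are illustrative, except for `max_samples`, which follows the paper's choice of using all training samples per IF estimator):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

def fit_one_class(train_features, use_if=True):
    """Fit a one-class detector on features of original images only."""
    X = np.asarray(train_features).reshape(-1, 1)  # one scalar feature per image
    if use_if:
        model = IsolationForest(max_samples=len(X), random_state=0)
    else:
        model = OneClassSVM(gamma="scale")
    return model.fit(X)

# Train on feature values of "original" images; inliers score higher
rng = np.random.default_rng(0)
model = fit_one_class(rng.normal(0.0, 1.0, 500))
scores = model.decision_function(np.array([[0.0], [10.0]]))
```

At test time, `decision_function` plays the role of the deployment score: positive for images resembling the training (original) category, negative for outliers.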
At deployment stage, each classification is associated with a positive score for the images belonging to the training category (i.e., original images) and a negative score for outlier images (i.e., synthetically generated images).

IV. RESULTS
In this section, we report the experimental setup and the achieved results in the detection of synthetically generated western blot images. First, we report the performance of binary detection methods, then we show the results achieved by the one-class detector approach.
A. Experimental setup
1) Binary detection: We train three binary detectors following the findings reported in [16]. In the first detector, the modified ResNet50 is trained over the training image dataset provided in [39], comprising 362K real images extracted from the LSUN dataset [40] and 362K generated images obtained from 20 ProGAN [24] models, each trained on a different LSUN object category. In the second detector, the modified ResNet50 is trained using 720K StyleGAN2 images and 552K real images selected from different public datasets, i.e., LSUN [40], AFHQ [41], AnimalWeb [42], BreCaHAD [43], FFHQ [44] and MetFaces [45]. The synthetic images were generated by training StyleGAN2 with real images selected from these datasets. In the third detector, we explore the situation in which the modified ResNet50 is trained to distinguish all the considered real images from both ProGAN and StyleGAN2 synthetic images. It is worth noticing that none of the considered detectors exploits western blot images, real or synthetic, during training.
2) One-class detection: The IF detector is trained by setting the number of samples used to train each embedded IF estimator to the maximum possible value, i.e., the total number of training images. The remaining detector parameters are those suggested in [46].
We train the two proposed one-class detectors over the features extracted from half of the available real images depicting western blots. To avoid possible bias in evaluating the results, we split the pristine dataset of patches according to the original western blot images they have been extracted from, as described in Section II-B. In doing so, all the patches extracted from the same original image belong to the same dataset split. We end up with 142 training western blot images, corresponding to 7,100 original patches with resolution 256 × 256 pixels.
Since the one-class approach is trained over original western blots, this inevitably reduces the number of original images to include in the test set. Therefore, to have a fair comparison with the binary detection results, evaluated over the final dataset and thus including all the original western blot patches, we apply a 2-fold cross-validation approach: we exploit each of the two dataset splits once as training set and once as testing set. Then, we average the achieved results.
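The image-wise split described above can be sketched as follows (a minimal illustration with hypothetical bookkeeping: `source_ids` maps each patch to the western blot image it was cropped from):

```python
import numpy as np

def two_fold_group_split(source_ids, rng=None):
    """Split patch indices into two folds so that patches from the same
    source image never end up in different folds."""
    rng = rng or np.random.default_rng(0)
    images = rng.permutation(np.unique(source_ids))
    half = len(images) // 2
    fold_a_imgs = set(images[:half].tolist())
    fold_a = [k for k, s in enumerate(source_ids) if s in fold_a_imgs]
    fold_b = [k for k, s in enumerate(source_ids) if s not in fold_a_imgs]
    return fold_a, fold_b

# 284 source images x 50 patches: each fold holds the patches of 142 images
source_ids = np.repeat(np.arange(284), 50)
fold_a, fold_b = two_fold_group_split(source_ids)
```

Each fold is then used once for training and once for testing, and the results are averaged, as in the 2-fold cross-validation described above.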

B. Binary detection results
Binary classification results are shown in Tables I and II. We always report results by keeping the images synthesized through the four investigated generation methods separated; thus, the real images are compared four times, each time against a different synthetic dataset. Table I reports the achieved Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve built for the binary classification task, while Table II reports the achieved balanced accuracy in correctly classifying real and synthetic images. Notice that we also include the classification results achieved by the state-of-the-art GAN detector proposed in [39].
The best detector is almost always the one proposed in [16] in the last configuration, i.e., trained on both ProGAN and StyleGAN2 synthetic images. This result confirms the experiments performed in the original paper [16]: the bigger the training dataset, the better the generalization capability of the detector. Overall, Pix2pix and DDPM synthetic images are the most detectable ones. For Pix2pix, this might be expected, as it is the oldest of the four generation methods and reasonably introduces generation artifacts that are easier to spot. DDPM images, despite their recency and high-quality realism, still present more generation artifacts than current state-of-the-art GANs. Evaluations on the CycleGAN and StyleGAN2-ADA datasets achieve similar AUCs; however, the results on CycleGAN samples report an accuracy more than 4 percentage points below the one reached on StyleGAN2-ADA western blot images. We investigate this behaviour in Figure 12, which depicts the distribution of the logit scores achieved by the best detector on synthetic images generated through Pix2pix, CycleGAN, StyleGAN2-ADA and DDPM, respectively. A significant number of CycleGAN images are associated with a negative logit score, especially around scores ≈ −1.8. This phenomenon is much less pronounced for Pix2pix, StyleGAN2-ADA and DDPM synthetic images. When computing the AUC, the negative CycleGAN scores do not strongly impact the performance, as the ROC curve is built by considering all the possible thresholds for the binary decision problem. The balanced accuracy, instead, is computed by thresholding the logit scores with a fixed threshold equal to 0. This fixed thresholding inevitably assigns the wrong label to a large number of synthetic images, thus lowering the detection performance.
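The gap between AUC and balanced accuracy can be reproduced on synthetic logit scores: positive-class samples whose scores drift below 0 barely affect the threshold-free AUC, while the fixed threshold at 0 misclassifies them (the distributions below are toy assumptions, not the detector's actual scores):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

rng = np.random.default_rng(0)
# real images: strongly negative logits; synthetic images: scores shifted
# toward negative values, mimicking the CycleGAN score distribution
real = rng.normal(-3.0, 0.5, 500)
fake = rng.normal(-0.5, 0.5, 500)  # many scores fall below 0

y_true = np.r_[np.zeros(500), np.ones(500)]
logits = np.r_[real, fake]

auc = roc_auc_score(y_true, logits)                 # threshold-free: stays high
bacc = balanced_accuracy_score(y_true, logits > 0)  # fixed threshold at 0
```

Here the two score distributions barely overlap, so the AUC is near perfect, yet most synthetic scores sit below 0 and the balanced accuracy collapses toward chance.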
C. One-class detection results

1) Single feature analysis: As reported in Section IV-A, we investigate 40 different features per query image, which correspond to an equal number of classification scores per image for each detector. For brevity's sake, we report only the best classification results for each of the 5 proposed textural features, i.e., f_c, f_h, f_d, f_e, f_ρ. In reporting results, we follow the same approach employed for binary classification, that is, we separately evaluate our performance on the four datasets of synthetic western blot images. Tables III and IV show the best AUC achieved on each selected feature by exploiting OCSVM and IF, respectively.
The features related to contrast and dissimilarity never report the best results. For image-to-image translation models, the energy and correlation features are often the most discriminative ones, while on the StyleGAN2-ADA generated samples the AUC approaches 0.9 only for the correlation feature. Over DDPM images, we achieve excellent results with f_h and f_e.
We further investigate how performance varies with the exploited features in Figure 13, where we report the histogram of the AUCs achieved considering all the 40 investigated features. It is noticeable that, for any GAN-based generation method, there are a few features which allow achieving high AUCs.
To provide insight into the nature of these features, Figure 14 shows the parameters characterizing the best 8 features for each detector, i.e., the selected gray-level distance (4, 8, 16 or 32), the direction of computation (horizontal (H) or vertical (V)) and the textural metric used (f_c, f_h, f_d, f_e or f_ρ) providing the best performance. In Figure 14, the left column reports the parameters related to OCSVM, the right column those related to IF; (a) corresponds to real versus Pix2pix synthetic western blot images, (b) to real versus CycleGAN, (c) to real versus StyleGAN2-ADA, (d) to real versus DDPM, with marker colors encoding the metric (blue for f_c, orange for f_h, green for f_d, red for f_e, purple for f_ρ). The best features characterizing the OCSVM detector are the same as those of IF, except for the images generated through StyleGAN2-ADA, for which OCSVM and IF differ in only one feature. From Figure 14(a)-(b), the best results on Pix2pix and CycleGAN images are driven by one single metric among f_e, f_h and f_ρ, and any combination of gray-level distance and direction achieves acceptable results. StyleGAN2-ADA images (see Figure 14(c)) present stronger artifacts along the vertical direction: none of the 8 best AUCs is found over the horizontal direction. Moreover, both f_d and f_ρ report acceptable results, even though f_ρ proves to be more accurate, as reported in Tables III and IV.
2) Combined feature analysis: We also explore the scenario in which the proposed one-class detectors are trained not on a single feature per image but on a combination of multiple features. At deployment stage, we extract the feature combination from the query image and feed it to the detectors. For the sake of brevity, for each detector and each generative method, we investigate only the combinations among the features returning the best 8 AUC values, i.e., the features described in Figure 14. Thus, we explore 3 different scenarios: (i) training on the combination of two features; (ii) training on the combination of three features; (iii) training on the combination of four features. We investigate all the 28 possible combinations for the first scenario, all the 56 for the second and all the 70 for the third. We report the best AUC achieved by OCSVM and IF in Tables V and VI, respectively. In this scenario, we also show the best balanced accuracy achieved by OCSVM and IF in Tables VII and VIII, respectively.
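The combination counts follow from choosing k of the 8 best features: C(8,2) = 28, C(8,3) = 56, C(8,4) = 70. A sketch of enumerating the combinations and fitting a one-class detector on one of them (data and hyperparameters are illustrative):

```python
import numpy as np
from itertools import combinations
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # toy matrix: the 8 best features per image

# all combinations of 2, 3 and 4 features out of the best 8
combos = {k: list(combinations(range(8), k)) for k in (2, 3, 4)}

# fit a one-class detector on one example pair of features
cols = list(combos[2][0])
ocsvm = OneClassSVM().fit(X[:, cols])
scores = ocsvm.decision_function(X[:, cols])
```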
It is worth noticing that selecting combinations of multiple features may improve the results, but does not bring a significant boost to the performance. Indeed, combining more features may even lead to worse results. In more detail, the performance change when exploiting more than one feature does not always represent an improvement and, whenever results are improved, the gain is limited to a maximum of +2.22% in the AUC (see Table VI, first row) and to a maximum of +4.34% in the balanced accuracy (see Table VII, third row). Moreover, in the worst scenarios, exploiting four features can lead to a performance loss of −0.51% in the AUC (see Table V, last row) and of −5.24% in the balanced accuracy (see Table VIII, third row). On average, exploiting more than one feature returns an AUC gain of 0.5% and a balanced accuracy gain of 1.2%. Thus, training the classifiers on more than one feature might not be the preferred option, as it is more expensive than the single feature scenario in terms of both computation and time. For a further comparison with a standard feature extraction procedure followed in the literature [32], [33], [34], we also extract the co-occurrence based local features proposed in [31]. We train the one-class detectors on these features extracted from the training images; in the testing phase, for each query image, we feed these features to the detectors. To provide a clear comparison with the proposed methodology, we report the achieved AUC in Tables V and VI, and the achieved balanced accuracy in Tables VII and VIII. In none of the considered scenarios do the features of [31] outperform the proposed methodology.
3) One-class detector comparison: In general, we achieve the best results by means of the IF classifier. When comparing the AUCs of the two detectors (see Tables V and VI), OCSVM reports accurate performance, comparable to that of IF. On the contrary, the accuracy achieved by OCSVM is significantly lower than IF's (see Tables VII and VIII). This discrepancy between the reported AUC and accuracy can be explained with the same considerations made in Section IV-B. The IF detector proves to be more stable and less prone to errors when exploiting a fixed thresholding strategy, i.e., selecting a threshold equal to 0 to discriminate images when solving the binary decision problem.

D. Binary vs One-class results
For clarity's sake, we summarize the best results of the binary and one-class classification approaches in Table IX. Interestingly, the one-class approach outperforms the binary one on the CycleGAN and DDPM datasets, even if on DDPM only in terms of the achieved AUC. In this scenario, learning textural properties of real western blot images brings a significant improvement with respect to a binary classifier trained on real and synthetic natural images not depicting western blots. As a matter of fact, in all the considered situations, the one-class classifier reports results comparable to those achieved by the binary one, which is remarkable considering that it is trained only on original western blot images, never looking at synthetic data.

E. Robustness to post-processing operations
As a last experiment, we investigate scenarios where western blots underwent some post-processing operations. In doing so, we simulate realistic situations in which images to be included in a scientific publication might be resized and/or compressed due to limited resources dedicated to the manuscript in terms of maximum number of pages, Byte count, etc. The applied post-processing also simulates operations that might be performed by malicious users who tampered with scientific images or generated completely synthetic versions of them. Indeed, to create realistic forgeries, it is common to apply various post-processing operations to the modified images to conceal the tampering traces.
In this vein, we investigate three kinds of post-processing that might undermine the performance of our synthetic image detectors:
• an upscaling post-processing, in which images are enlarged by factors 1.25 and 1.5, and then randomly cropped to fit the 256 × 256 pixel resolution;
• a down-upscaling post-processing, in which images are downscaled by factors 0.5, 0.75 and 0.9, and then upscaled back to their original resolution of 256 × 256 pixels;
• a JPEG compression with different quality factors (i.e., 80, 90 and 100), corresponding to increasing visual quality.
We resort to Albumentations [47] as data augmentation library. Table X reports the AUCs achieved in classifying the post-processed images with the binary detector proposed in [16] trained on ProGAN and StyleGAN2 images. We pick this binary detector as it reports the best results in the experimental analysis on non post-processed images. At training time, we do not apply any post-processing to the images. Unfortunately, the binary detector reports a consistent performance loss in almost all scenarios, especially on Pix2pix and DDPM images.
For the one-class detection, we investigate two possible training scenarios corresponding to realistic situations that forensic analysts commonly deal with:
• training on the post-processed images: in this scenario, all data underwent some known editing operation and a portion of them is available for the training phase;
• training on the original non post-processed images: in this scenario, we have no information about the potential editing operations applied to the testing data.
In both scenarios, we follow the same 2-fold cross-validation approach reported in Section IV-A to allow fair comparisons with the binary detection approach. Tables XI and XII report the AUCs achieved in the first and second training scenarios, respectively. Notice that the average performances of the two scenarios are similar. By training on the original non post-processed images, i.e., simulating a scenario agnostic to the applied post-processing, we achieve almost the same results as in the perfect knowledge situation.
Contrary to the binary detection approach, the one-class classifier is significantly more robust against JPEG-based post-processing. As long as the compression quality factor does not excessively reduce the image quality, the performance loss can be contained around 12%, with the only exception of the Pix2pix synthetic images.
It is also worth noticing that DDPM synthetic images can be spotted with a high accuracy level independently of the kind of post-processing applied, almost approaching the results achieved in the experiments on the non post-processed dataset.

V. CONCLUSIONS
In this paper, we performed a forensic analysis of synthetically generated western blot images. Previous works have already shown that western blots can be tampered with or totally synthesized in a relatively easy way, with expert inspectors having a hard time spotting forgeries.
We were not able to find in the literature a sufficiently vast dataset of original and synthetic western blot images to perform scientific experiments. Therefore, we created a new dataset containing more than 14K original and 24K fully synthetic western blots, generated through four state-of-the-art generation methods based on GANs and Denoising Diffusion Probabilistic Models (DDPMs).
Regarding the detection, we investigated the realistic scenario in which the analyst does not have available any synthetic versions of western blot images. To do so, we explored how forensics detectors purposely developed for binary classification of real versus synthetic natural images perform in distinguishing original and synthetic western blots. We also explored one-class classification approaches, in which we learned textural feature properties of original western blots and looked for any anomalies occurring in the synthetic data.
We extensively evaluated the proposed detectors on the collected dataset. Our results showed that synthetic western blots can be distinguished from real ones with a high accuracy in all the considered experimental scenarios. This is noteworthy, considering that we never exploited synthetic western blot images to optimize the detectors. Up to now, forensics detectors trained only on natural images or on original western blots represent valid solutions to identify fully synthetic versions of them.
Our experiments also highlighted that the one-class approach is robust to JPEG compression applied to the images to be analyzed, even if the training is performed on original non compressed images. Robustness to resizing operations is still a challenging issue for spotting the majority of synthetic images. Nonetheless, the one-class approach can easily detect the synthetic images generated through DDPMs, no matter the resizing applied.
Future work will include additional generative models, particularly adapting the binary detection approach for the attribution of the model used to create synthetic samples and investigating open-set recognition techniques to identify images generated with methods never seen during training. Furthermore, we will focus on improving detectors' robustness to common post-processing operations to make them more suitable for in-the-wild scenarios where an unknown processing chain might hinder traces of the synthetic generation process.