Synthbuster: Towards Detection of Diffusion Model Generated Images

Synthetically-generated images are getting increasingly popular. Diffusion models have advanced to the stage where even non-experts can generate photo-realistic images from a simple text prompt. They expand creative horizons but also open a Pandora's box of potential disinformation risks. In this context, the present corpus of synthetic image detection techniques, primarily focused on older generative models such as Generative Adversarial Networks, finds itself ill-equipped to deal with this emerging trend. Recognizing this challenge, we introduce a method specifically designed to detect synthetic images produced by diffusion models. Our approach capitalizes on the inherent frequency artefacts left behind during the diffusion process. Spectral analysis is used to highlight the artefacts in the Fourier transform of a residual image, which are then used to distinguish real from fake images. The proposed method can detect diffusion-model-generated images even under mild JPEG compression, and generalizes relatively well to unknown models. By pioneering this novel approach, we aim to fortify forensic methodologies and ignite further research into the detection of AI-generated images.


I. INTRODUCTION
How can we assess the validity of an image as proof of its content? Photographic images used to be considered the most reliable evidence possible, as they were difficult to realistically modify. With the proliferation of digital photography and the development of sophisticated image editing tools, this status of absolute proof is unfortunately long gone. It is increasingly easy to alter an image, not only to make it more aesthetically appealing, but also to change its semantic content and give it a different meaning than the truth.
In the fight against disinformation, the role of image forensics has thus been to analyse whether an image is authentic or has been maliciously and locally altered to hide or distort the truth. However, a new source of disinformation has now appeared. Thanks to the advent of diffusion models [40], [48], [49], [50] and text-to-image joint embeddings, it is now possible and easy to generate images from scratch with nothing more than a text prompt describing the intended image, as seen in Fig. 1. Although the resolution of generated images remains limited, these images have achieved a high level of photorealism that can make them visually indistinguishable from real photographs.
This progress has enabled many innovations, for instance in the arts, in movie-making, or even in architecture. However, it also brings the risk of people pretending that a synthetic image they created is in fact an actual photograph representing a real scene, for instance to incriminate or ridicule someone or, more globally, to spread disinformation.
A cardinal question thus arises: how can such images be distinguished from real ones? Until very recently, synthetic images were mainly generated using Generative Adversarial Networks (GANs) [23], [30], [31], [32], [33]. Methods to detect synthetic images have thus also focused on this architecture, while the literature on detecting images synthesized by the newer diffusion-based methods is still lacking.
It has been noted [21], [25], [38], [55] that GAN-generated images feature frequency artefacts. This is also true, to some extent, of DM-generated images [14], [15]. Can these artefacts be used to identify synthetic images? Such an enterprise is challenging. These artefacts are subtle and not immediately visible; they must be revealed with suitable filters. While previous works [14], [29] reveal these artefacts, they could only do so by aggregating a large number of images together. To identify whether an image is synthetic, these artefacts must be extracted from a single image, which is a much more challenging undertaking. Furthermore, the frequency artefacts of DM-generated images lie at the same frequency spots as the artefacts caused by a common JPEG compression. It is thus crucial to be able to distinguish the artefacts that come from diffusion-based generation from those coming from JPEG compression, lest natural but JPEG-compressed images be mistakenly detected as synthetic.
In this paper, we propose a method based on spectral analysis to detect synthetic images generated by diffusion models. We set up a simple method to highlight and analyse the frequency artefacts in images, distinguishing DM-generated images from authentic ones. Experiments show that the proposed method can reliably detect artefacts even under mild JPEG compression, and can distinguish the artefacts caused by compression from those caused by diffusion processes. The method adapts well to unseen architectures, a gap that is yet to be overcome by existing models.
Our main contribution is four-fold:
1) We show that the cross-difference, a simple high-pass filter, can outperform the state of the art in highlighting frequency artefacts in images, to the point that they can be detected on individual images;
2) Based on this, we introduce a spectral method to detect AI-generated images from diffusion models;
3) We design a database of synthetic images to compare the existing methods. The dataset includes the most recent generation methods available to date;
4) We study the ability of the proposed method and of the state of the art to distinguish real from fake images under JPEG compression and on unseen models.

II. RELATED WORKS

A. SYNTHETIC IMAGE GENERATION
Recently, the domain of image generation has undergone profound transformations, predominantly fueled by the triad of Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Diffusion Models (DMs). These advancements have revolutionized image synthesis, paving the way towards crafting photorealistic synthetic images.
While GANs [28] have deeply influenced the landscape of image generation [9], [32], they have recently been surpassed by diffusion models [51]. These models conceptualize data distribution as a diffusion process, iteratively distorting the image towards a simple prior and gradually converting it back into the target distribution. Notably, the Ablated Diffusion Model (ADM) [20] has exceeded the capabilities of GANs and VAEs in image generation, marking a vital inflection point in the evolution of diffusion models.
In parallel with diffusion models, Transformer Models [52] have witnessed expanding applications in computer vision, primarily fueled by the advent of CLIP [47], a model adept at embedding images and text into a shared space.
Capitalizing on this capability, latent diffusion models [49] such as Stable Diffusion (SD) [50] and DALL•E [48] have extended diffusion models to synthesize images from text prompts in a latent feature space, resulting in a leap forward in the realm of image generation capabilities, both in terms of variety and photorealism.
Nonetheless, the swift advances in image generation have birthed societal apprehensions, primarily the threat of deepfakes, which poses substantial security risks. The necessity to devise robust methods for synthetic image detection and potential misuse mitigation cannot be overstated.

B. SYNTHETIC IMAGE DETECTION
The central thrust of this paper lies in the authentication of synthetic images, an area where the existing literature remains sparse. AutoGAN [55] utilizes a classifier in the spectral domain to identify synthetic images by their frequency artefacts. PatchForensics [10] investigates the unique properties of fake images, particularly face images, that render them detectable, and discerns what generalizes across varying model architectures, datasets, and training alterations. McCloskey and Albright [39] take advantage of the fact that the intensity values of synthetic images are rarely saturated, while Wang et al. [53] and Gragnaniello et al. [29] train CNNs to differentiate real and GAN-generated images. However, these studies largely predate the prevalence of diffusion models and text-to-image techniques; hence, they are primarily trained and evaluated on GAN-generated images sampled from specific classes. Two methods have already been proposed to detect DM-generated images. Corvi et al. [15] retrain the existing architecture of Gragnaniello et al. [29] on DM-generated images, while Ojha et al. [46] train a network to distinguish real and fake images in the latent domain of a CLIP-trained architecture [22]. However, neither method achieves good generalizability to models unseen during training.

III. PROPOSED METHOD
As seen in Fig. 1, the frequency artefacts of DM-generated images lie at very specific frequencies, corresponding to components of periods 2, 4, and 8. We propose to use a cross-difference filter on the image to highlight its frequency artefacts, and to extract the magnitude of the points corresponding to components of periods 0, 2, 4, or 8, in both directions. A simple classifier is then trained to distinguish real from generated images.

A. HIGH-PASS RESIDUAL TO REVEAL THE ARTEFACTS
The cross-difference filter [11] has been introduced and used to reveal periodic artefacts coming from JPEG compression [11], [43] and image demosaicing [4]. The cross-difference is a high-pass filter defined as the absolute difference between the two diagonals of a 2 × 2 block of an image. Let I be a 2-dimensional image; the cross-difference at location (x, y) is defined as

D(x, y) = |I(x, y) + I(x+1, y+1) − I(x+1, y) − I(x, y+1)|.    (1)

We propose to use the cross-difference to dampen the low frequencies and highlight the frequency artefacts of DM-generated images, as shown in Fig. 2. For each colour channel of the image to analyse, the cross-difference filter defined in (1) is used to extract a simple fingerprint of the image. As the cross-difference acts as a high-pass filter, the high-frequency artefacts we expect to find in synthetic images are much more prominent on the cross-difference than on the original image.
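As an illustration, the residual of (1) can be computed for a whole channel with a few vectorized NumPy operations; this is a minimal sketch under our own naming, not the authors' reference implementation:

```python
import numpy as np

def cross_difference(channel: np.ndarray) -> np.ndarray:
    """Cross-difference residual of Eq. (1): absolute difference between
    the two diagonals of every 2x2 block of the channel."""
    c = channel.astype(np.float64)
    return np.abs(c[:-1, :-1] + c[1:, 1:] - c[1:, :-1] - c[:-1, 1:])

# A perfectly smooth (linear) channel leaves a zero residual, while a
# 2-periodic checkerboard, the strongest possible high frequency, does not.
ramp = np.add.outer(np.arange(8.0), np.arange(8.0))
checker = np.indices((8, 8)).sum(axis=0).astype(float) % 2
```

The ramp illustrates why the filter is high-pass: any locally linear intensity variation cancels out exactly, so only high-frequency content survives.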
The Fast Fourier Transform (FFT) of the cross-difference is then computed. To avoid any bias linked to the image size, the FFT is normalized by the image size. Peaks representative of DM-generated images occur at components of periods 0, 2, 4, and 8, in both directions. We extract the magnitude of the 45 peaks from each of the three colour channels, leading to 135 extracted magnitudes.
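Concretely, the per-channel feature extraction can be sketched as follows. For brevity, this sketch keeps only the 16 axis-aligned bins obtained by combining periods 0, 8, 4, and 2 along each dimension, rather than the full set of 45 peaks used in the paper:

```python
import numpy as np

def peak_magnitudes(residual: np.ndarray) -> np.ndarray:
    """Size-normalized FFT magnitudes of a residual at the bins of
    periods 0 (DC), 8, 4 and 2 along each axis (simplified 16-peak sketch)."""
    h, w = residual.shape
    spectrum = np.abs(np.fft.fft2(residual)) / (h * w)
    fy = [0, h // 8, h // 4, h // 2]   # vertical bins: period 0, 8, 4, 2
    fx = [0, w // 8, w // 4, w // 2]   # same horizontally
    return np.array([spectrum[y, x] for y in fy for x in fx])
```

Dividing by `h * w` is the size normalization mentioned above: a constant residual then always yields a DC magnitude of 1, regardless of the image dimensions.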

B. ANALYSIS OF THE EXTRACTED PEAKS
Using only the 135 potential magnitude peaks as features, we then train a classifier to distinguish real from fake images. We use a histogram-based gradient boosting tree classifier (HBGB) [34], [35]. This variant of the traditional Gradient Boosting Trees leverages the concept of gradient boosting with a histogram-based approach to accelerate the tree-growing process. The algorithm maintains the robustness of gradient boosting, while the histogram-based technique enhances its scalability, making it suitable for large-scale datasets. It is able to handle the dimensionality of the data, as 135 features are to be considered, while maintaining a high accuracy. Although it has historically been overshadowed by neural networks, this much simpler model is sufficient as we only use a relatively small number of features. Trained with both real and synthetic images from different diffusion models, our classifier learns to distinguish authentic and generated images using only the magnitude of these peaks.
The model is trained on both natural images and diffusion-model-generated ones, to detect whether the analysed features correspond to a natural or a synthetic image. Different training schemes are presented and discussed in the Experiments section.

C. ROBUSTNESS TO JPEG COMPRESSION
The proposed method analyses FFT peaks corresponding to periods 0, 2, 4, and 8, in both directions. However, JPEG compression also leaves strong artefacts at these periods [2], [7], [8], [44]. To train the network to only detect artefacts coming from diffusion methods, we apply JPEG compression to the training images.
One model is trained for each JPEG quality factor, as well as one without compression. At inference, the potential JPEG quantization table of the tested image is estimated using a quantization table estimator [45], and the appropriate model is selected, a strategy that has already proved its efficiency in the forensic literature [16].
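The per-quality dispatch can be sketched as below. The set of trained qualities and the `select_model` helper are hypothetical illustrations; the quantization-table estimator of [45] would supply `estimated_quality`:

```python
# Hypothetical per-quality model dispatch: one detector is trained for
# each JPEG quality factor (None = uncompressed), and at inference the
# model trained at the closest quality is selected.
def select_model(models: dict, estimated_quality):
    if estimated_quality is None:        # no JPEG compression detected
        return models[None]
    return models[min((q for q in models if q is not None),
                      key=lambda q: abs(q - estimated_quality))]

# Toy registry standing in for the trained classifiers.
models = {None: "clean-model", 95: "q95-model", 80: "q80-model"}
```

Choosing the nearest trained quality is a reasonable fallback when the estimated table does not exactly match one of the training qualities.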

IV. DATABASE
Training and evaluating the proposed method requires sets of real and fake images. Real images are plentiful. In particular, the Raise dataset [17] and the Dresden dataset [27] contain 8156 and 1488 uncompressed photographs, respectively. Using uncompressed images is particularly useful, as we can then apply various post-processing operations such as JPEG compression on a clean image.
On the other hand, as diffusion models are quite recent, the available data on such images is scarce. To the best of our knowledge, the only such published database is proposed by Corvi et al. [15], consisting of 1000 images generated with different GAN and diffusion models, including DALL•E 2 [48], Glide [41], and Latent Diffusion [50].
To address this scarcity, we propose our own dataset of DM-generated images. This enables us to provide a way to evaluate existing methods on the most recent diffusion models. While the synthetic images are generated from a text prompt, we use an existing database of real images as a guideline for the generated images: the Raise-1k dataset, a varied subset of the full Raise [17] dataset. This dataset contains one thousand high-quality, uncompressed photographs of diverse categories: indoor, outdoor, landscape, nature, people, objects, and buildings. While using an existing dataset of natural images is not strictly needed, it provides several advantages: 1) Being of the same categories, the natural images themselves provide a fair comparison point for the methods, to check both their ability to detect fake images and to avoid false positives; 2) As already established in the literature [21], it is crucial to evaluate synthetic image detection methods on varied image classes. Using an already-diverse dataset as a guideline ensures the generated images are varied. Note that the original images are not used as image prompts to try to recreate a similar image or modify it. The original images are only used as a guideline to write the text prompt, to ensure the resulting image broadly belongs to the same category as the original one.
For each of the 1000 images, descriptions of the images are generated using the Midjourney descriptor [40] and CLIP Interrogator [13]. These descriptions are used as a basis to manually write a text prompt to generate a photo-realistic image loosely based on the original image. The objective is not to recreate a perfectly similar image, but rather to obtain an image from the same category, so as to keep the variety of the images.
The parameters that are used to guide the methods are selected randomly, within reasonable bounds. Examples of generated images can be seen in Fig. 3.

V. EXPERIMENTS
We now have a database of synthetic and authentic images tailored to evaluating detection methods, as the synthetic images are matched with real images from the Raise [17] database. We train our model on a separate fake image dataset [1] and on real images from the Dresden database [27], guaranteeing a fair evaluation on a challenging case where the fake, but also the real, images from the training [27] and testing [17] datasets are wildly different.
For evaluation, we compare our results to the state of the art on the proposed database, naturally combined with the Raise-1k [17] real images on which the dataset is based. Three scenarios are initially considered for training, depending on whether the tested synthetic image is generated by a model seen during training (generic), whether the diffusion model is exactly known (specific), or, in the worst case, whether the diffusion model is entirely unseen during training (generalization):
1) Generic training: the proposed method is trained generically on images coming from all known diffusion models in the augmented Corvi et al. database. This is the most realistic case, as fake images for disinformation are usually created with existing, publicly-available diffusion models, but it is rarely known which specific model was used. The generic-trained method constitutes the final method proposed in this paper, whereas the specific and generalization scenarios should be viewed as experiments to test the strengths and limits of the method.
2) Specific training: the method is trained specifically on the diffusion model used for the images. While this approach can be seen as unrealistic, it enables us to know the limits of the method in the ideal case where the exact model used to generate an image is known.
3) Generalization training: conversely, the method is trained on all known diffusion models except the one used to generate the image, to assess whether the method can generalize to unknown models.
Results of this experiment are reported in Fig. 4 and in Table 1. Under the generic training, the proposed method yields consistently good results across diffusion models and beats the state of the art on all of them, even against Stable Diffusion images, which are already well detected by Corvi et al. [15]. Knowing the specific model used slightly enhances the results, although this is only significant on Midjourney [40] and DALL•E 2 images. The model also shows great generalization ability, although the results are expectedly worse than when the model has been seen during training. Generalization results are significantly worse on Midjourney images, and even more so on DALL•E 2 images, suggesting that these models' architectures are dissimilar to the other known ones. We also note that, surprisingly and seemingly inexplicably, generalization results on Glide images are better than those obtained when the model belongs to the training set.

A. ROBUSTNESS TO JPEG COMPRESSION
It was stated earlier that DM artefacts and JPEG compression artefacts lie at the very same frequencies, potentially rendering their distinction difficult. To assess this, we test the proposed model on images at different JPEG compression levels, as seen in Table 2.
The test images, both real and synthetic, are JPEG-compressed at the mentioned quality factor. As can be seen, the model is very robust even against mild JPEG compression, and can still distinguish real from fake images even at JPEG quality 70, albeit with reduced performance.
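The compression step applied to the test images amounts to a JPEG round-trip at the chosen quality factor, e.g. with Pillow; this is a sketch of the evaluation setup, not necessarily the authors' exact tooling:

```python
import io

import numpy as np
from PIL import Image

def jpeg_round_trip(img: np.ndarray, quality: int) -> np.ndarray:
    """Compress an RGB uint8 image to JPEG at `quality` and decode it back,
    as done to the test images before evaluation."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))
```

Applying the same round-trip to both real and synthetic test images ensures that the detector is scored on the compression artefacts it will actually face, rather than on pristine pixels.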

B. ABLATION STUDY
Frequency artefacts on synthetic images were previously highlighted using denoising with DnCNN [14], [15], [21], [54]. However, this could only be performed by aggregating numerous images to reveal the artefacts, rather than on a single image. We propose the use of a cross-difference filter, which can highlight the frequency artefacts on single images. Table 3 shows that this filter indeed improves performance over using DnCNN denoising.

VI. DISCUSSION AND LIMITATIONS
Despite its simplicity, the proposed method is indeed able to detect synthetic images better than the existing state of the art. It shows some generalization ability, as well as robustness to JPEG compression. Despite that, those two points remain an important challenge. Indeed, while the proposed method performs better than the existing ones both on JPEG-compressed images and on unseen architectures, false positives are still impossible to avoid in these complex situations. Yet, simple Bayesian reasoning shows that even a small number of false positives can be sufficient to drown true detections in false alarms, due to the high proportion of authentic images in the wild. In addition, wrongly accusing someone of fraud can have disastrous consequences. As a consequence, current synthetic image detection methods, including the proposed one, should still be considered a research artefact, and not be used as proof that an image is actually forged. For practical usability, setting an automatic threshold would be crucial, for instance with a contrario analysis [18], [19], a promising approach in forensics [2], [3], [4], [5], [6], [26], [36], [42], [44].
Finally, we note that the proposed method is trained on diffusion-model-generated, photorealistic images. It is not trained to work on GAN images, for which numerous tools already exist. Given that the frequency artefacts are usually stronger on GAN images than on DM images, it would be easy to adapt the proposed method to GANs should the need arise. We also note that our method has only been tested on photo-realistic images; it remains untested on, and thus not suited for, digital art examination. Indeed, not only is the method not trained on such images, it is also likely that they would be more challenging, as digital art usually presents flatter textures than natural photographs, and thus fewer opportunities for frequency artefacts to be revealed.

VII. CONCLUSION
In this paper, we have presented a simple method to detect synthetic images generated by diffusion models. The method reveals the frequency artefacts using a high-pass filter, then distinguishes real and fake images from the presence of these artefacts, with a simple classifier on the FFT magnitude peaks.
This method performs well even in difficult situations such as JPEG compression and unseen models. Still, the risk of false positives and their consequences should above all drive future work towards preventing and controlling the risk of false alarms.

FIGURE 1. The proposed method detects synthetic images generated by diffusion models in the spectral domain. It computes a high-pass residual of a suspect image, and analyses suspected peaks in the Fourier transform of the residual to detect whether an image is synthetic or authentic.

FIGURE 2. FFT of the averaged cross-difference of the models and of real images, computed on the proposed database. For a given model, we compute the cross-difference of each of the 1000 images, then average the computed cross-differences as well as the colour channels. We then display the magnitude of the Fourier transform of the averaged result. For better visual legibility at display size, the magnitude is augmented by a morphological dilation, which increases the size of the peaks in the images. For models whose images vary in size, only those of the most frequent size are used. We can see that most diffusion models feature traces at periods of 2, 4, and 8. Firefly even features a 16-periodic artefact component, possibly due to a higher number of upsampling steps. Glide images feature fewer, but more visible, artefacts, possibly because it performs only one small super-resolution step, which is less than the other models. Curiously, DALL•E 2 images only feature artefacts on the horizontal axis of the Fourier transform, hinting at a strongly different treatment of the two axes in the weights of the model.
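The aggregation behind Fig. 2 can be sketched as follows, assuming same-size single-channel inputs and inlining the cross-difference of (1):

```python
import numpy as np

def averaged_fingerprint(images):
    """Average the cross-difference residuals of many same-size images,
    then return the FFT magnitude of the average (the Fig. 2 procedure,
    without the display-only morphological dilation)."""
    residuals = [np.abs(im[:-1, :-1] + im[1:, 1:] - im[1:, :-1] - im[:-1, 1:])
                 for im in images]
    return np.abs(np.fft.fft2(np.mean(residuals, axis=0)))
```

Averaging over many images suppresses the content-dependent part of each residual, so only the periodic traces shared by a model's outputs survive in the final spectrum.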

FIGURE 3. Examples of the generated images in the database. The images are generated with different diffusion models using a text prompt that is loosely based on a natural image from Raise-1k [17]. As the goal of the database is to evaluate methods that can distinguish natural photographs from synthetic images, attention is paid in the prompts to generate images with photorealistic styles and textures rather than artistic styles.

FIGURE 4. Comparative ROC curves of the proposed method and the existing state of the art, on the different detection models. The F1 and MCC scores are computed using the standard thresholds (0 or 0.5, depending on the method). We can see that the proposed method consistently gets excellent results, even when not trained on the specific model to be tested. It thus shows decent generalization ability, except on the DALL•E 2 and Firefly models, which seem to yield slightly different artefacts.
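The fixed-threshold F1 and MCC scores of Fig. 4 correspond to the standard scikit-learn metrics; a toy sketch with hypothetical detector scores:

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# Hypothetical detector scores; a fixed 0.5 threshold labels "synthetic".
y_true = np.array([0, 0, 1, 1, 1, 0])            # ground truth (1 = synthetic)
scores = np.array([0.1, 0.6, 0.9, 0.8, 0.4, 0.2])
y_pred = (scores > 0.5).astype(int)

f1 = f1_score(y_true, y_pred)            # 2TP / (2TP + FP + FN)
mcc = matthews_corrcoef(y_true, y_pred)  # balanced even under class skew
```

MCC is a useful complement to F1 here because authentic images dominate in the wild, and MCC accounts for true negatives while F1 does not.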

TABLE 2. Study of the Robustness of the Proposed Method (Generic Training) Against JPEG Compression on the Different Models

TABLE 3. Ablation of the Proposed Method (Generic Training), With the Cross-Difference and the DnCNN Denoiser Proposed in Existing Works [14], [21], [38], Which Used DnCNN [54] to Reveal Frequency Artefacts on Synthetic Images, but Had to Aggregate the Results Over a Large Number of Images
Our dataset covers current diffusion models, such as Stable Diffusion [50] 1.3, 1.4, 2, and XL, Midjourney [40], Adobe Firefly [24], and DALL•E [48] 2 and 3, for which no publicly-available datasets exist yet. This newly-constructed dataset is also useful to train and test models on independently-generated data, ensuring a fair evaluation.