
Deepfake Detection: Analyzing Model Generalization Across Architectures, Datasets, and Pre-Training Paradigms




Abstract:

As deepfake technology gains traction, the need for reliable detection systems is crucial. Recent research has introduced various deep learning-based detection systems, yet they exhibit limitations in generalising effectively across diverse data distributions that differ from the training data. Our study focuses on understanding the generalisation challenge by exploring different aspects such as deep learning model architectures, pre-training strategies and datasets. Through a comprehensive comparative analysis, we evaluate multiple supervised and self-supervised deep learning models for deepfake detection. Specifically, we evaluate eight supervised deep learning architectures and two transformer-based models pre-trained using self-supervised strategies (DINO, CLIP) on four different deepfake detection benchmarks (FakeAVCeleb, CelebDF-V2, DFDC and FaceForensics++). Our analysis encompasses both intra-dataset and inter-dataset evaluations, with the objective of identifying the top-performing models, datasets that equip trained models with optimal generalisation capabilities, and assessing the influence of image augmentations on model performance. We also investigate the trade-off between model size, efficiency and performance. Our main goal is to provide insights into the effectiveness of different deep learning architectures (transformers, CNNs), training strategies (supervised, self-supervised) and deepfake detection benchmarks. Following an extensive empirical analysis, we conclude that Transformer models surpass CNN models in deepfake detection. Furthermore, we show that FaceForensics++ and DFDC datasets equip models with comparably better generalisation capabilities, as compared to FakeAVCeleb and CelebDF-V2 datasets. Our analysis also demonstrates that image augmentations can be beneficial in achieving improved performance, particularly for Transformer models.
Published in: IEEE Access ( Volume: 12)
Page(s): 1880 - 1908
Date of Publication: 29 December 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Deepfakes, or deepfake media, are digital media that have been generated or modified using deep learning algorithms [1]. They have gained notoriety in recent years due to their potential to manipulate and deceive by producing fraudulent and deceptive media content. While deepfakes can serve innocent or even entertaining purposes, they also harbor substantial dangers when harnessed for malicious intentions, like crafting convincing fraudulent media to sway public opinion, manipulate electoral outcomes, or incite violence [2], [3], [4]. Also, given the prevalence of powerful and budget-friendly computing resources along with the widespread accessibility of paid, as well as open-source software, the creation of deepfakes has become increasingly straightforward [5]. This accessibility extends to individuals with limited technical knowledge, facilitating the production of convincing deepfakes that closely resemble genuine content.

The research community has been actively proposing novel AI-based automated deepfake detection models, trying to address these issues posed by deepfake media [6], [7], [8], [9], [10], [11]. However, a significant issue associated with current deepfake detection models is their lack of generalisation capability [1], [7], [12]. This means that the detection systems work excellently when dealing with deepfakes that come from the same data distribution as they were trained on. However, they struggle to perform well when exposed to deepfakes generated using different methods than the ones used for training.

Previous research efforts have introduced a multitude of carefully designed deep learning models for deepfake detection, accompanied by novel techniques for training (e.g., multi-modal features [6], novel augmentations [13], etc.) and evaluating these models on well-known deepfake datasets. However, given the vast volume of research publications, it has become increasingly challenging to discern which kinds of architectures yield optimal results and which datasets are most effective in facilitating robust model performance, thus enhancing generalisation to unseen data. In light of these considerations, we contend that a comprehensive analysis that unites a diverse range of deep learning architectures, trained and assessed across multiple prominent deepfake datasets in a unified manner, is imperative to gain a deeper understanding of this issue. We also believe that such a comparative analysis has the potential to uncover valuable insights for identifying the most suitable architecture and dataset(s) to enhance the effectiveness of deepfake detection systems. Consequently, we believe that this analysis will also contribute significantly to understanding the challenge of generalisation in deepfake detection.

In this study, we carry out a comprehensive comparative analysis of several widely recognised deep neural network architectures for image and video recognition, aiming to assess their efficacy in detecting deepfakes. Our primary goal is to determine which among these models achieves superior performance on unseen, out-of-distribution data, i.e., exhibits strong generalisation capability. The models selected for our study comprise both Convolutional Neural Networks (CNNs) and Transformer models. The rationale behind incorporating transformer models is rooted in their recent notable achievements across a spectrum of computer vision tasks such as image classification [14], [15], [16], object detection [17], [18], image segmentation [19], video classification [20], multi-modal learning [21], [22], 3D analysis [23], [24] and beyond [25].

For our analysis we train all participating models on four deepfake detection datasets (one-by-one) and evaluate them in both intra-dataset and inter-dataset configurations. Additionally, we evaluate the difficulty level of each benchmark and investigate whether a more challenging benchmark leads to better generalisation performance on unseen data. We train participating models on all four datasets twice: first, without any image augmentations and then with various image augmentations, to find out if they improve the models' performance.

Recently, transformer models trained using self-supervised methods have demonstrated the ability to generate robust representations from both textual and visual data [26], [27], [28]. Resulting models have been shown to achieve excellent performance on new tasks often without the need for additional training or with minimal training on downstream tasks [19], [26], [27], [29], [30]. Owing to this, we also analyse Vision Transformer (ViT) architecture pre-trained using two well-known self-supervised learning strategies: DINO [26] and CLIP [27]. To study these models and find out how good the self-supervised feature representations are, we use self-supervised ViT-Base models (DINO and CLIP) as feature extractors and train a classification head on top of them. It is important to note that we only train the classification head and freeze the weights of the feature extractors.

In summary, our study aims to provide insights into various aspects, including: (1) identifying the most effective model architectures for detecting deepfakes among those being tested, (2) pinpointing the model with the highest ability to adapt to new and unseen data, (3) assessing the difficulty of different datasets for model training, (4) determining the dataset that best facilitates generalisation to unseen data, (5) evaluating the performance of self-supervised training strategies and (6) examining the impact of augmentations on enhancing model performance.

This paper is organised as follows. In Section II we present a literature review on the topic of deepfake detection. Section III presents the proposed framework. In Section IV we present the results and a discussion of our findings and finally Section V concludes this study by summarising our analysis and presenting future research directions.

SECTION II.

Literature Review

Recently, a large number of research studies focused on deepfake media detection have been proposed. Most studies employ CNN models trained on large amounts of data in order to detect deepfake media. The proposed studies also employ different strategies, e.g., novel augmentation techniques [13], hybrid models [9], [31], biological features [32], multi-modal features [6], [9], temporal features along with spatial information [9], [10], [33], recurrent networks and transformer models [8], [9], to detect deepfake images/videos while trying to increase the models' generalisation capabilities. Below we present some well-known, as well as some of the recently proposed, deepfake detection studies. We chose to review studies in this section that share similarities with ours, focusing on common aspects such as the selection of detection models and the datasets used to train and evaluate the proposed models.

A. CNN-Based Detection Models

In 2019, Rossler et al. released FaceForensics++, a dataset for deepfake detection [34]. The dataset, containing over 1.8 million manipulated images, was made publicly available. Using the proposed dataset, the authors also conducted an extensive analysis of several data-driven deepfake detection methods. The methods included traditional machine learning models (SVMs) trained on handcrafted features, as well as deep learning architectures including MesoNet [35] and XceptionNet [36]. Through this thorough analysis, the authors discovered that a vanilla deep CNN, XceptionNet [36], significantly outperforms the other participating models on the detection task with compressed, low-quality data. The authors additionally demonstrated via experiments and surveys that the data-driven models even outperform humans in detecting deepfakes. However, the paper lacks a cross-dataset analysis of the models, which could have been beneficial in understanding the generalisation performance across diverse and unseen domains.

In [33] Sabir et al. proposed a deepfake detection system focusing on the temporal information present in video streams to exploit temporal discrepancies across multiple frames in fake videos. In order to analyse temporal data, the authors employed a recurrent convolutional architecture [37], [38] comprising a CNN for feature extraction and a Bidirectional Recurrent Neural Network (BiDir RNN) to analyse temporal information present in videos. Specifically, the authors studied two different CNN architectures, ResNet [39] and DenseNet [40], for feature extraction. The authors also employed a carefully crafted pre-processing regime to pre-process facial frames before inputting them into the models. The models were evaluated using the renowned FaceForensics++ deepfake detection benchmark [34], showing excellent results in an intra-dataset evaluation regime. The authors did not carry out a cross-dataset analysis in their study.

Ciftci et al. [32] shifted away from traditional image features and proposed to employ biological signals (i.e., photoplethysmography or PPG signals, which capture subtle alterations in color and motion within RGB videos) to train their models. The proposed model comprised a CNN as well as a Support Vector Machine (SVM). The CNN and SVM models made their individual classifications on the provided feature sets, which were then fused together in order to get a final classification score. The proposed deepfake detection scheme achieved promising results when tested using both intra-dataset as well as inter-dataset configurations on multiple different deepfake detection benchmarks, including the CelebDF [41], FaceForensics [42] and FaceForensics++ [34] datasets.

In study [6], Zhu et al. introduced a deepfake detection framework that leveraged 3D face decomposition features for detecting deepfakes. The authors demonstrated that the fusion of 3D identity texture and direct light features notably enhanced the detection performance, simultaneously promoting the model’s generalisation ability when assessed across different datasets. The training of the detection model involved both a cropped facial image and its corresponding 3D features. Authors employed XceptionNet [36] for feature extraction. The study also provides an extensive analysis of various feature fusion strategies. The proposed model was trained on the FaceForensics++ [34] benchmark and subsequently evaluated on three datasets: (1) FaceForensics++, (2) Google Deepfake Detection Dataset [43] and (3) DFDC [44] dataset. The reported evaluation statistics showed promising results across all three datasets, highlighting the model’s robust generalisation capability in comparison to previously proposed deepfake detection methods.

B. Transformer-Based Detection Models

In study [9], Khan et al. introduced the utilisation of transformer architecture for the purpose of deepfake detection, presenting two novel models: (1) Image Transformer and (2) Video Transformer. Both models were trained using 3D face features [45] in addition to standard cropped face images. Incorporating 3D face features was intended to obtain aligned facial details, thereby enhancing the learning process. The combination of these aligned features with conventional cropped face data contributed to the acquisition of pertinent facial details. To harness temporal information within videos, authors modified the standard Vision Transformer (ViT) [14] to accommodate multiple successive face frames. Notably, the proposed model exhibited incremental learning capabilities, accommodating new data without catastrophically forgetting prior knowledge. Authors conducted comprehensive evaluations of their models across prominent deepfake detection benchmarks, including FaceForensics++ [34], DFDC [44] and Google DFD [43]. Their models showed impressive performance across all these datasets, underscoring their efficacy in deepfake detection.

Wang et al. [8] introduced a Multi-modal Multi-scale TRansformer (M2TR) model, which was capable of processing patches of multiple sizes to identify local abnormalities in a given image at multiple different spatial levels. M2TR also utilised frequency-domain information along with RGB information, using a sophisticated cross-modality information fusion block to better detect forgery-related artifacts. Through extensive experiments the authors established the effectiveness of M2TR and showed that their model outperforms other participating deepfake detection models by a clear margin.

Coccomini et al. in [31] proposed a video deepfake detection model based on a hybrid transformer architecture. The authors used an EfficientNet-B0 to extract image features. The extracted features were then used to train two different types of Vision Transformer models in their study, i.e., (1) Efficient ViT and (2) Convolutional Cross ViT. The latter comprised two branches, i.e., an S-branch for processing images with smaller patch sizes (7×7) and an L-branch for processing images using a larger patch size (64×64), thereby possessing a wider receptive field. Through experimentation on the DFDC [44] dataset, the authors established that the hybrid model comprising the EfficientNet-B0 feature extractor and the Convolutional Cross ViT achieved the best performance scores as compared to the other models that they tested in their study. They also carried out a cross-dataset analysis on FaceForensics++ [34], showing encouraging results.

Zhao et al. [10] proposed an Interpretable Spatial-Temporal Video Transformer (ISTVT) for deepfake detection. The proposed model incorporates a novel decomposed spatio-temporal self-attention as well as a self-subtract mechanism to learn forgery-related spatial artifacts and temporal inconsistencies. ISTVT can also visualise the discriminative regions along both spatial and temporal dimensions by using the relevance propagation algorithm [10]. Extensive experiments on large-scale datasets were conducted, showing strong performance of ISTVT in both intra-dataset and inter-dataset deepfake detection and establishing the effectiveness and robustness of the proposed model.

Through this literature review it becomes apparent that the research community actively employs deep learning models along with other techniques to develop robust and efficient deepfake detectors. However, on a careful reading of these research studies it also becomes noticeable that the models do not always perform as expected when tested on unseen, out-of-distribution data. In addition to this, there is a lack of comparative studies which aim to identify which specific family of deep learning architectures is better suited for the task of deepfake detection. Furthermore, it is not easy to determine, without thorough experimentation, which of the well-known datasets offer improved generalisation potential to the models, i.e., allow models to better handle new and unseen data.

To address this, in this research we study some of the most frequently used architectures (EfficientNets, XceptionNet, Vision Transformers) in the deepfake detection literature. We also employ widely known datasets for experimentation and try to identify the datasets offering the best generalisation capabilities to the models. We also analyse some of the understudied approaches for deepfake detection, i.e., we train and evaluate the performance of self-supervised models on deepfake detection and compare their performance with that of the supervised models.

SECTION III.

The Proposed Framework

The workflow followed in this study for training and evaluating the models is illustrated in Figure 1. The top of the figure shows the training pipeline, where we start by extracting and cropping faces from videos. The cropped face frames are then augmented, normalised and resized before being fed to the model for training. We load pre-trained models as feature extractors, i.e., we remove the last layer from the loaded models and add a new classification head (linear layer) on top. For supervised models, during training we update the weights of both the feature extractor and the classification head.

FIGURE 1. The proposed framework. The process involves several steps, starting with the extraction and cropping of face frames from videos, followed by augmentation, normalisation and resizing. The pre-trained models are then used as feature extractors, with a new classification head (linear layer) added on top. During training, the weights of both the feature extractor and the classification head are updated for supervised models, while only the newly added classification head is updated for self-supervised models. All models are evaluated through both intra-dataset and inter-dataset evaluation strategies to test their performance and generalisation capabilities. For image models, the input data is a single cropped face image, while for video models, it is a tensor containing eight consecutive cropped face images from a given video.

For self-supervised models, our objective is to assess the quality of the representations they produce, since they were initially trained through self-supervised training strategies. Consequently, for these models we only update the weights of the newly added classification head while keeping the weights of the feature extractor backbone frozen. This strategy enables us to directly compare the self-supervised feature representations with those obtained from supervised models. Since we use DINO and CLIP as feature extractors, we follow the guidelines provided in their respective code repositories to extract features.

For DINO, we extract features from the last four encoder blocks, as this configuration yielded optimal results. On the other hand, for CLIP, we exclusively extract features from the last encoder block. We then feed these features into the classification head.
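The following sketch illustrates this linear-probing setup. It is not the exact code used in our experiments: the get_intermediate_layers call mirrors the official DINO repository, encode_image mirrors the OpenAI CLIP package, and the class name and feat_dim values are illustrative.

    import torch
    import torch.nn as nn

    class LinearProbe(nn.Module):
        """Frozen self-supervised ViT backbone with a trainable linear head."""
        def __init__(self, backbone, feat_dim, num_classes=2, n_last_blocks=4, use_blocks=True):
            super().__init__()
            self.backbone = backbone
            self.use_blocks = use_blocks            # True for DINO, False for CLIP
            self.n_last_blocks = n_last_blocks
            for p in self.backbone.parameters():    # freeze the feature extractor
                p.requires_grad = False
            self.head = nn.Linear(feat_dim, num_classes)

        def forward(self, x):
            with torch.no_grad():
                if self.use_blocks:
                    # DINO: concatenate the [CLS] token of the last n encoder blocks
                    outs = self.backbone.get_intermediate_layers(x, self.n_last_blocks)
                    feats = torch.cat([o[:, 0] for o in outs], dim=-1)
                else:
                    # CLIP: use the image embedding from the final block only
                    feats = self.backbone.encode_image(x).float()
            return self.head(feats)

For a ViT-Base DINO backbone with 768-dimensional tokens, concatenating four blocks gives feat_dim = 4 × 768 = 3072, whereas the CLIP image encoder yields a single embedding vector that is fed directly to the head.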

For intra-dataset evaluation we evaluate each model on the test set of the same dataset it was trained on, e.g., a model trained on dataset D1 is evaluated on the test set of D1. The primary objective of intra-dataset evaluation is to discern which model achieves the highest performance score as compared to the other participating models on each of the datasets. Moreover, this evaluation offers insights into which dataset presents the greatest learning challenge for the models and which dataset is comparatively easier to learn.

In the context of inter-dataset evaluation, we evaluate models that were trained on one dataset across the remaining three datasets. For instance, a model trained on dataset D1 is tested on D2, D3 and D4 datasets. The objective of inter-dataset evaluation is two-fold: first, to investigate the models’ ability to generalise across datasets and second, to understand how effectively the training dataset empowers models to generalise well on unseen data.
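Schematically, the two evaluation settings can be summarised as follows; trained_models, load_test_loader and evaluate are hypothetical helpers used only for illustration.

    # Sketch of the intra-/inter-dataset evaluation protocol.
    DATASETS = ["FaceForensics++", "CelebDF-V2", "DFDC", "FakeAVCeleb"]

    results = {}
    for train_ds, model in trained_models.items():    # one trained model per dataset
        for test_ds in DATASETS:
            loader = load_test_loader(test_ds)        # test split of test_ds
            accuracy = evaluate(model, loader)
            setting = "intra" if test_ds == train_ds else "inter"
            results[(train_ds, test_ds)] = (setting, accuracy)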

The input data for training and evaluating image models is a single cropped face image ([3, 224, 224]), whereas the input data for training and evaluating video models is a tensor containing eight consecutive cropped face images ([8, 3, 224, 224]) from a given video.

A. Datasets

In this study we train and evaluate several different deep learning models on four deepfake detection datasets/benchmarks: FaceForensics++ [34], CelebDF-V2 [41], DFDC [44] and FakeAVCeleb [46]. All four datasets comprise real and fake videos, where the fake videos are generated using different deepfake generation methods. In the subsequent paragraphs, we present a brief description of these datasets.

  • FaceForensics++ [34] is one of the most widely studied deepfake detection benchmarks. It comprises 1000 real video sequences (mostly from YouTube) of mostly frontal faces without occlusions. These real videos were manipulated using four different face manipulation methods: (1) FaceSwap [47], (2) Deepfakes [48], (3) Face2Face [49] and (4) NeuralTextures [50], yielding four subsets of 1000 videos each. In total, the dataset contains 5000 videos, i.e., 1000 real and 4000 fake videos. FaceForensics++ provides videos at three different quality levels: (1) Raw, (2) High-Quality and (3) Low-Quality. In our study, we experimented with the high-quality videos. The FaceSwap and Deepfakes subsets contain videos generated using face swapping: as the name suggests, the face of the target person is replaced with the face of the source person, transferring the identity of the source onto the target. The Face2Face and NeuralTextures subsets are generated by a different process called face re-enactment. In contrast to face swapping, face re-enactment transfers the expressions of the source onto the target while keeping the original identity of the target face.

  • CelebDF-V2 [41] contains 5639 fake and 590 real videos. The real videos are collected from YouTube and contain interview footage of 59 celebrities of diverse ethnic backgrounds, genders and age groups. The fake videos in CelebDF-V2 are generated using encoder-decoder models. Post-processing operations are also employed to mitigate color mismatch, temporal flickering and inaccurate face masks.

  • The Deepfake Detection Challenge (DFDC) dataset [44] comprises around 128k videos, of which around 104k are fake. Similar to FaceForensics++, DFDC also comprises videos generated using more than one face manipulation algorithm. Five different methods were employed to generate fake videos, namely, (1) Deepfake Autoencoder [44], (2) MM/NN [51], (3) NTH [52], (4) FSGAN [53] and (5) StyleGAN [54]. In addition, a random selection of videos also underwent a simple sharpening post-processing operation, which increases the videos' perceptual quality. Unlike FaceForensics++, the DFDC dataset also contains videos that have undergone audio swapping; however, in this study we do not use audio features to train and evaluate our models. Since the DFDC dataset is much larger than the other participating datasets, we only use a subset of it to train and evaluate our models, keeping the amount of training, validation and test data roughly comparable across datasets. For training we use roughly 19.5k randomly selected videos (around 16.5k fake and 3k real), from which we obtain 100k cropped face images (50k real and 50k fake). We use 20k images as the validation set. For testing the models we use 4000 face frames randomly selected from 3500 (3200 fake and 300 real) videos.

  • FakeAVCeleb [46] is the most recently released of the four datasets used in this study. The FakeAVCeleb dataset contains 19.5k fake and 500 real videos. This dataset also contains an audio modality and manipulates audio as well as video content to generate deepfake videos. For video manipulation, the FaceSwap [55] and FSGAN [53] algorithms are used. For audio manipulation, a real-time voice cloning tool called SV2TTS [56] and Wav2Lip [57] are used. The dataset is divided into four subsets, i.e., (1) FakeVideo/FakeAudio, (2) RealVideo/RealAudio, (3) FakeVideo/RealAudio and (4) RealVideo/FakeAudio. In this study, we only employ two of these subsets to train our models, i.e., (1) FakeVideo/FakeAudio and (2) RealVideo/RealAudio.

B. Dataset Preparation

The data preparation process was notably time-consuming due to two main factors: firstly, the datasets being substantial in size and secondly, some selected datasets lacking clear dataset preparation guidelines. For instance, FakeAVCeleb does not provide predefined train/validation/test splits. Consequently, we had to manually develop a strategy to effectively partition the dataset into train, validation and test sets. Ensuring that a single identity didn’t appear in multiple splits added another layer of complexity to this task.

Additionally, all the datasets exhibit an imbalance, with a significantly higher number of “fake” videos compared to “real” ones. To address this, we ensured that the resulting datasets of cropped face images are balanced by extracting more frames from real videos than from fake videos. We also made sure to include at least one frame from every video selected for training and evaluation. Table 1 provides detailed information regarding the number of face frames used for training, validation and model evaluation for each dataset.

TABLE 1. Number of Real and Fake Images Used to Train, Validate and Test Our Image Models

The data provided in Table 1 clearly illustrates that the training and validation sets for FakeAVCeleb contain a relatively smaller number of frames. This discrepancy can be attributed to the dataset containing a limited count of real videos (only 500), while the number of fake videos is substantially larger (19500). As the video clips are of shorter duration, this translates to 47,808 frames being extracted from the chosen 300 real videos for the training set and 5360 frames from 100 videos for the validation set. Despite this slight variance in the size of the training and validation sets, we assume that it has a minimal impact on the models’ performance. This assumption is supported by our observations from training and evaluating the models using even fewer frames (approximately 25k real and 25k fake frames), which resulted in no significant deviations in the final test scores.

In addition, the test set for CelebDF-V2 contains fewer frames for the same underlying reason - the test set of the dataset includes only 50 real and 50 fake videos. Due to this, we were only able to extract a total of 2000 frames from this set of 100 test videos.
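As an illustration of the balancing principle described above, the per-video frame budget can be derived as follows; the helper name and the exact numbers are ours for illustration, the text above only fixes the resulting totals.

    # Illustrative frame-budget calculation for balancing real/fake face crops.
    def frames_per_video(n_real_videos, n_fake_videos, target_per_class):
        # at least one frame is taken from every selected video
        real_budget = max(1, target_per_class // n_real_videos)
        fake_budget = max(1, target_per_class // n_fake_videos)
        return real_budget, fake_budget

    # e.g. ~50k crops per class from roughly 3k real and 16.5k fake DFDC training videos
    real_per_vid, fake_per_vid = frames_per_video(3000, 16500, 50000)
    print(real_per_vid, fake_per_vid)   # -> 16 frames per real video, 3 per fake video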

C. Pre-processing and Augmentations

We adopt two distinct approaches to train our models in this study. Initially, we train models without applying any image augmentations. Afterwards, we train the models using a range of randomly chosen image augmentations, such as horizontal flips, affine transformations and random cut-out augmentations. All the cropped face images are then normalised according to the same method which was originally used for pre-training on ImageNet [58]. For augmentations, we utilise the imgaug library.
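A minimal sketch of such a pipeline is shown below; the augmentation probabilities and parameter ranges are placeholders, not the exact values used in our experiments.

    import imgaug.augmenters as iaa
    import numpy as np
    import torch
    from torchvision import transforms

    # Randomly chosen augmentations: horizontal flip, affine transform, cut-out.
    augmenter = iaa.Sequential([
        iaa.Fliplr(0.5),
        iaa.Sometimes(0.5, iaa.Affine(rotate=(-10, 10), scale=(0.9, 1.1))),
        iaa.Sometimes(0.3, iaa.Cutout(nb_iterations=1)),
    ])

    # Resize to the model input size and apply the ImageNet normalisation statistics.
    to_tensor = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((224, 224)),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    def preprocess(face_crop: np.ndarray, augment: bool = True) -> torch.Tensor:
        """face_crop: HxWx3 uint8 cropped face image."""
        image = augmenter(image=face_crop) if augment else face_crop
        return to_tensor(image)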

D. Models

We opt to explore six supervised image classification models, equally divided into three CNNs and three transformer-based models. Furthermore, we assess two variations of transformer models trained via self-supervised methods, namely (1) DINO [26] and (2) CLIP [27]. In addition to the image classification models, our study encompasses the training and evaluation of two distinct video classification models: (1) ResNet-3D [59], a CNN model for video classification and (2) TimeSformer [20], a transformer model tailored for video classification.

We choose models based on their performance on the ImageNet benchmark [58], their parameter count and, in the case of certain models like Xception [36] and EfficientNet [65], their established performance in deepfake detection, as reported by previous studies [34], [44].

1) Image Models

Deepfake detection is typically treated as an image classification problem. In this context, deep learning models are trained and evaluated on images independently, dealing with each image on its own. This differs from video-based deepfake detection, where models are trained and tested on consecutive video frames to capture temporal discrepancies between frames along with spatial cues within each frame.

Below, we provide a brief introduction to the image models employed in this study.

  • Xception [36] is a convolutional neural network (CNN) architecture that builds upon Google’s Inception CNN architecture [60]. It distinguishes itself by using depth-wise separable convolutions in place of conventional Inception modules. Unlike standard convolutions, which are applied across all N channels at once, depth-wise convolutions operate sequentially on individual image channels. This characteristic reduces Xception’s trainable parameters compared to other prominent deep CNN models. Despite this reduction, Xception’s performance remains on par with models having more parameters, as evidenced on the ImageNet benchmark [58]. Furthermore, its smaller parameter count enhances resistance to overfitting on unseen data and decreases computational load, making it an efficient choice. Figure 2A illustrates the concept of depth-wise convolution, the fundamental building block of this network (a minimal code sketch of a depth-wise separable convolution is also given after this list of image models). Xception not only demonstrates excellent performance on the ImageNet benchmark but also boasts significant achievements in previous deepfake detection studies [6], [34], [44]. Based on its proven track record in this domain, we include Xception for analysis in this study.

  • Res2Net [61] is a CNN architecture which is built upon the widely adopted ResNet architecture [39]. Res2Net introduces a new building block named the “Res2Net Block,” which replaces the conventional bottleneck residual blocks utilised in ResNet models. By operating at a granular level, the Res2Net architecture captures multi-scale features and extends the receptive field range for every network layer. As a result, the network becomes more potent and efficient, leading to enhanced performance across diverse computer vision tasks, including image classification, segmentation and object detection [61]. The innovative Res2Net block can be seamlessly integrated into other leading-edge backbone CNN models, such as ResNet [39], DLA [62], BigLittleNet [63] and ResNeXt [64]. We visualise the Res2Net block in Figure 2B. In this study, we employ Res2Net-101 to explore whether multi-scale CNN features contribute to improved deepfake detection performance. Additionally, we investigate whether these enhancements extend to cross-dataset performance, gauging the model’s generalisation capability.

  • EfficientNet [65] refers to a family of CNN architectures. In their paper, the authors propose a scaling technique that uniformly adjusts depth, width and resolution using a compound coefficient. The central concept revolves around systematically scaling the model’s architecture and parameters to achieve better efficiency. Unlike the conventional approach of arbitrarily scaling individual dimensions, the proposed strategy employs a consistent set of scaling coefficients across all dimensions. Consequently, the architecture offers a family of seven models spanning various scales [65]. Impressively, EfficientNet achieves top-notch performance across several image classification benchmarks, while maintaining computational efficiency that surpasses other architectures like ResNet and Inception [65]. In a manner similar to Xception, a specific variant of EfficientNet, namely EfficientNet-B7, has also demonstrated remarkable prowess in deepfake detection tasks. Notably, the winning solution of the Deepfake Detection Challenge (DFDC) was built upon EfficientNet-B7 models [44]. Given this notable track record, we employ this model for analysis in our study.

  • The Vision Transformer (ViT) [14] belongs to the family of transformer models, which were initially designed for natural language processing tasks [66]. In the realm of computer vision, ViT emerged as a pioneering transformer-based architecture designed specifically for image classification tasks [14]. ViT harnesses self-attention mechanisms to process visual data. Its methodology employs a simple yet powerful strategy: it divides images into smaller patches, which are then flattened. These flattened patches are enriched with learnable positional embeddings, enabling them to retain their spatial context within the original image. A special classification token is added at the beginning of this input to enable the transformer model to perform image classification. The input is fed to the transformer for processing; at the end of processing, the added classification token contains the representation of the entire image, which is then fed to a multilayer perceptron that outputs probability scores for each class. This process enables the network to effectively capture contextual nuances and interrelationships across distinct segments of the image, achieving performance comparable to state-of-the-art CNN models on the ImageNet dataset, especially when trained on giant datasets like ImageNet-21k or JFT-300M [14]. The ViT architecture is visually depicted in Figure 2E. In our analysis, we use the ViT-Base model for experimentation and subsequently compare its performance against the other models participating in the study.

  • The Swin Transformer [15] is a class of Vision Transformer models with a hierarchical structure, utilising a shifted-windows approach for computing image representations. The shifted windowing strategy enhances efficiency by confining self-attention computation to non-overlapping local windows, while still enabling cross-window connections. This hierarchical design offers flexibility for modeling at different scales and maintains linear computational complexity with respect to image size. Swin Transformers achieve competitive performance, comparable to other state-of-the-art image classification models like EfficientNets [15], [65], and even outperform Vision Transformers and ResNets [14], [39]. Not limited to image classification, Swin Transformers also excel in tasks such as image segmentation and object detection [15]. Figure 2G provides an illustration of the window generation and attention calculation process in Swin Transformers. Because of the excellent performance Swin Transformers achieve on ImageNet [58], we use this architecture in this study. Specifically, we employ the Swin-Base variant for experimentation.

  • The Multiscale Vision Transformer (MViT) [16] is another class of ViT models. Unlike traditional ViTs, MViTs have multiple stages that vary in both channel capacity and resolution. These stages create a hierarchical pyramid of features, where initial shallow layers focus on capturing low-level visual information at high spatial resolution, while deeper layers extract complex, high-dimensional features at a coarser spatial resolution. This approach allows the network to better capture the context and relationships between different parts of the image, resulting in improved performance on a broad range of computer vision tasks including image classification and image segmentation [16]. A broad overview of the MViT architecture is shown in Figure 2C. Since MViTs are relatively new and achieve excellent performance on different vision tasks, we employ them in our study to analyse how well they perform on the task of deepfake detection. We use the MViT-V2-Base variant in this study.

  • DINO [26] is a self-supervised training method, whose name stands for self-DIstillation with NO labels. The authors show that transformer models trained using DINO exhibit interesting properties, e.g., (1) self-supervised ViT (DINO) features incorporate explicit visual information within an image that is useful for computer vision tasks such as semantic segmentation, and which does not emerge as evidently with supervised ViTs or with CNNs; (2) self-supervised ViT features are also shown to achieve excellent performance when tested as k-NN classifiers, attaining 78.3% top-1 accuracy on ImageNet with a ViT-Small architecture. For further details, we point readers towards the original DINO paper [26]. The DINO training strategy is shown in Figure 2I. Inspired by these findings, we also employ the ViT-Base [14] architecture trained using DINO [26]. In our study, we use the ViT-Base as a feature extractor and add a classification head on top. We only train the added classification head on the participating deepfake detection datasets, while freezing the weights of the ViT-Base feature extractor.

  • Contrastive Language-Image Pre-Training (CLIP) [27] is a neural network that has been trained on a diverse set of (image, text) pairs in a self-supervised contrastive manner. It has the ability to infer the most suitable text excerpt for a given image using natural language, without explicit supervision for this task. It exhibits zero-shot capabilities similar to the ones exhibited by GPT-2/GPT-3 [67], [68]. In CLIP’s original research paper, the authors show that it achieves performance equivalent to the original ResNet50 [39] CNN model when evaluated on ImageNet [58] in a zero-shot fashion, i.e., even though CLIP does not use any of the 1.28 million labelled examples from the original dataset, it achieves performance comparable to a ResNet50 model trained on ImageNet in a supervised manner. CLIP is illustrated in Figure 2H. For more details on CLIP, we refer readers to [27]. We employ a ViT-Base model trained using CLIP as a feature extractor for our study. Similar to DINO, we add a classification head on top of the CLIP-trained ViT-Base. For our analysis, we only train the classification head and keep the CLIP ViT-Base features frozen, i.e., we do not update its weights during training.
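As referenced in the Xception entry above, the following is a minimal sketch of a depth-wise separable convolution (a depth-wise convolution followed by a point-wise 1×1 convolution); it illustrates the operation only and is not taken from the Xception implementation.

    import torch
    import torch.nn as nn

    class DepthwiseSeparableConv(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
            super().__init__()
            # depth-wise: one filter per input channel (groups=in_ch)
            self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                       padding=padding, groups=in_ch, bias=False)
            # point-wise: 1x1 convolution mixes information across channels
            self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

        def forward(self, x):
            return self.pointwise(self.depthwise(x))

    x = torch.randn(1, 64, 56, 56)
    print(DepthwiseSeparableConv(64, 128)(x).shape)   # torch.Size([1, 128, 56, 56])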

FIGURE 2. Visual representation of the models used for analysis in this study. Due to space limitations, only the basic, key concepts for each model are illustrated instead of the whole model. For optimal understanding of the essential components of each model, we recommend viewing this figure in color and at a higher magnification.

FIGURE 3. Performance (accuracy) comparison of participating models on all datasets. The reported scores result from an intra-dataset evaluation. Results in this figure are obtained by evaluating each model separately on each dataset and averaging the resulting scores. In addition, the figure presents the performance of each model trained with and without augmentations, along with their parameter counts.

FIGURE 4. t-SNE visualisations of the participating detection models. We chose the best performing models on all datasets (with/without image augmentations).

FIGURE 5. Performance (accuracy) comparison of two self-supervised ViT models and one supervised ViT. The reported scores result from an intra-dataset evaluation. Results in this figure are obtained by evaluating each model separately on each dataset and averaging the resulting scores. In addition, the figure presents the performance of each model trained with and without augmentations. All three models have the same number of trainable parameters since they are all ViT-Base models; the only difference is the pre-training scheme used to train them.

FIGURE 6. CBAM visualisations of the supervised image models.

FIGURE 7. Performance (accuracy) comparison of participating models evaluated using the inter-dataset scheme. Results in this figure are obtained by (1) evaluating each model trained on one dataset on each of the remaining datasets and (2) averaging the achieved scores, i.e., adding the three accuracy scores and dividing by three.

FIGURE 8. ROC curves of each model when evaluated on each of the four participating datasets in an intra-dataset evaluation setting.

FIGURE 9. DET curves of each model when evaluated on each of the four participating datasets in an intra-dataset evaluation setting.

FIGURE 10. ROC curves of self-supervised models trained and evaluated on each dataset using the intra-dataset evaluation scheme.

FIGURE 11. DET curves of self-supervised models trained and evaluated on each dataset using the intra-dataset evaluation scheme.

2) Video Models

We examined two different video classification models in this paper: (1) ResNet-3D [59], a CNN-based video classifier and (2) TimeSformer [20], a transformer-based video classification model. We assess the performance of both models in intra-dataset and inter-dataset contexts across the four renowned deepfake detection benchmarks. Our decision to include video-based models alongside image-based detection models stems from our curiosity about the potential impact of the temporal information present in videos on the deepfake detection task.

  • ResNet-3D [59] is based on the same principles as the original ResNet architecture [39], but is specifically designed to work with 3D data, such as videos and volumetric medical images. These models use 3D convolutions, instead of 2D layers, for feature extraction (a short sketch contrasting 2D and 3D convolutions is given after this list). In addition, ResNet-3D models generally use a large number of layers, which allows them to learn complex and abstract features in the data. ResNet-3D models have been utilised for a variety of computer vision tasks, including video classification, action recognition and medical image segmentation [59], [69]. For reference, we illustrate both 2D and 3D convolutions in Figure 2F. We choose to employ the ResNet-3D model for our study because (1) it is widely studied for video recognition and (2) pre-trained models are easily available. We chose a ResNet-3D model pre-trained on 8 frames per video for our experiments in this study.

  • TimeSformer [20] is a video recognition model based on the transformer architecture. TimeSformer utilises self-attention over space and time, instead of traditional convolutional layers or the purely spatial attention employed by ViT for image classification. The TimeSformer model modifies the transformer architecture, generally used for image classification, by directly learning spatio-temporal features from a sequence of frame-level patches. This is accomplished by extending the self-attention mechanism from the image space to the 3D space-time volume. Similar to the Vision Transformer (ViT) model, TimeSformer employs linear mapping and positional embeddings to interpret the ordering of the resulting sequence of features. In the TimeSformer paper [20], the authors experimented with different self-attention techniques. Of those, the “divided attention” technique, which calculates temporal and spatial attention separately within each block, was found to perform better than the other self-attention variants, and thus we choose to analyse the same architecture in this study. Divided space-time attention is illustrated in Figure 2D. We opt to evaluate TimeSformer on the task of deepfake detection and compare it with the convolutional video classification network, ResNet-3D. We also chose the 8-frame-per-video version of the TimeSformer model, the same as the ResNet-3D model described above.
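As referenced in the ResNet-3D entry above, the sketch below contrasts a 2D convolution on a single face crop with a 3D convolution on an 8-frame clip; the channel-first clip layout shown is a permutation of the per-sample [8, 3, 224, 224] tensor described earlier in this section.

    import torch
    import torch.nn as nn

    frame = torch.randn(1, 3, 224, 224)       # single face crop: [B, C, H, W]
    clip = torch.randn(1, 3, 8, 224, 224)     # 8-frame clip: [B, C, T, H, W]

    conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)
    conv3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)   # also slides over time

    print(conv2d(frame).shape)   # torch.Size([1, 64, 224, 224])
    print(conv3d(clip).shape)    # torch.Size([1, 64, 8, 224, 224])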

E. Evaluation Metrics

In order to analyse the performance of our models in a comprehensive way, we employ multiple widely used classification metrics, e.g., (1) LogLoss, (2) AUC and (3) Accuracy. Below we briefly introduce the chosen evaluation metrics.

1) LogLoss

LogLoss, also known as logarithmic loss or cross-entropy loss, is used to measure the classification performance of machine/deep learning models. LogLoss measures the dissimilarity between the predicted probability score and the true label (0 or 1 in the case of binary classification). The LogLoss score is computed as the negative logarithm of the likelihood of the true labels given a set of predicted probabilities. The range of the LogLoss function is from 0 to infinity, with 0 representing the ideal outcome and higher values representing worse outcomes.
\begin{equation*} L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_{i} \log(p_{i}) + (1-y_{i}) \log(1-p_{i}) \right] \tag{1}\end{equation*}
where $L$ is the LogLoss, $N$ is the total number of samples in the dataset, $y_{i}$ is the true label of the $i$-th sample and $p_{i}$ is the predicted probability for the $i$-th sample.
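Equation (1) translates directly into code; the snippet below is a plain NumPy transcription for illustration, with clipping added to avoid taking the logarithm of zero.

    import numpy as np

    def log_loss(y_true, y_prob, eps=1e-15):
        # clip predictions away from 0 and 1 so the logarithm stays finite
        y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
        y_true = np.asarray(y_true, dtype=float)
        return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    print(log_loss([1, 0, 1, 1], [0.9, 0.1, 0.8, 0.65]))   # ~0.216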

It is worth noting that LogLoss is a widely used evaluation metric in machine learning competitions such as Kaggle competitions, as it gives a general idea of how good the predictions of a model are. We use LogLoss as one of the evaluation metrics in this study because previously proposed deepfake detection studies often use it as their evaluation metric, which allows us to compare our results with theirs.

2) Area Under the Curve (AUC)

AUC is another widely known metric used to evaluate classification models. AUC refers to the entire two-dimensional area under the Receiver Operating Characteristic (ROC) curve. AUC indicates how well a model separates the classes: the higher the area under the ROC curve, the better the model is at discriminating between “real” and “fake” samples in our case. Most recently proposed deepfake detection studies employ AUC as the evaluation metric to study the performance of their models.

Note that the ROC curve is created by varying the threshold used to make predictions from 0 to 1, so the AUC provides a summary of the model’s performance across all possible thresholds.
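The following sketch computes the ROC curve and its AUC with Scikit-Learn, which we also use for our evaluation; the labels and scores shown are made up for illustration.

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = [0, 0, 1, 1, 1, 0]                      # 1 = fake, 0 = real (toy labels)
    y_score = [0.10, 0.40, 0.35, 0.80, 0.90, 0.20]   # model probability of "fake"

    # roc_curve sweeps the decision threshold; roc_auc_score summarises it as one number
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    print(roc_auc_score(y_true, y_score))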

3) Accuracy

Accuracy is another prominent classification metric. The accuracy score is the proportion of correct predictions made by a model relative to all the predictions it made. Unlike LogLoss and AUC, accuracy does not indicate how confident a model is in a given classification. The accuracy score is obtained by dividing the number of correct predictions by the total number of predictions.

Accuracy is a common evaluation metric in binary classification tasks; however, it can be misleading in cases where the classes (real, fake) are imbalanced, or where the costs associated with false positives and false negatives differ. In such cases, other evaluation metrics like F1 score, precision, recall or AUC may provide a more accurate evaluation of the classification model’s performance. In our study, however, since we have a balanced number of samples for the real and fake classes, accuracy is a suitable evaluation metric.

4) Efficiency Comparison

To gain a comprehensive understanding of the models’ performance in deepfake detection, we conduct an in-depth analysis using the three classification performance metrics outlined in the earlier sections. Additionally, we provide efficiency metrics (see Table 2) for each model to offer insights into the trade-off between a model’s effectiveness in detecting deepfakes and its efficiency in real-world deployment. This analysis highlights the financial implications of deploying detection models on cloud services, emphasising the trade-off between efficiency and detection performance. For example, while models like Xception or ViT demonstrate high efficiency (on GPU), the forthcoming sections show that slower, heavier models often outperform faster, lighter models in deepfake detection. For a visual depiction of these efficiency scores, please see Figures 12 and 13.

TABLE 2. Efficiency Metrics of All the Participating Supervised Models, Including the Parameter Count, Inference Times on Both CPU and GPU and the Number of Floating-Point Operations (FLOPs)
FIGURE 12. This bar chart highlights the efficiency of the supervised models in terms of inference time on both GPU and CPU devices. It reveals that CNN models outperform transformer models, taking nearly half the time to process a single image frame on CPU. On GPU, the figure illustrates that all models achieve inference in less than 45 milliseconds at most. The ViT and Xception models are the fastest on GPU, taking less than 10 milliseconds to process a single frame.

FIGURE 13. This figure illustrates the efficiency of the supervised image models, showing both total parameters and the number of floating-point operations (GFLOPs). The results align with the preceding bar chart, emphasising the superior efficiency of CNN models as compared to transformer models. It is important to note that video models, although not depicted here, exhibit a significantly higher number of floating-point operations, acting as outliers in the figure and slightly affecting its visual coherence. This disparity arises from the nature of video models, which process more data at once, specifically 8 image frames, compared to image models that handle only one image at a time.

We employ the fvcore library to compute GFLOPs for our models. While various libraries exist that allow GFLOPs measurement, it is important to acknowledge that results may exhibit slight variations.
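A minimal sketch of this measurement with fvcore is given below; the model name is illustrative, and FlopCountAnalysis reports the operation count for a single forward pass.

    import torch
    import timm
    from fvcore.nn import FlopCountAnalysis

    model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=2)
    dummy_input = torch.randn(1, 3, 224, 224)       # one cropped face image

    flops = FlopCountAnalysis(model, dummy_input)
    print(f"{flops.total() / 1e9:.2f} GFLOPs")      # total operations for one forward pass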

To determine CPU and GPU inference times, we execute inference on 300 random images and then calculate the average time spent on each image in milliseconds. For GPU, a warm-up phase precedes inference, involving the processing of 10 images to ensure optimal GPU performance before the actual inference on 300 images commences. Our machine is equipped with an RTX 3080 GPU, Ryzen 5800X CPU and 32GB RAM.
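The timing protocol can be sketched as follows; model and images (a list of 1×3×224×224 tensors) are assumed to exist, and the synchronisation calls matter only for GPU timing.

    import time
    import torch

    @torch.no_grad()
    def avg_inference_ms(model, images, device="cuda", n_warmup=10):
        model = model.to(device).eval()
        for img in images[:n_warmup]:          # GPU warm-up before timed inference
            model(img.to(device))
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for img in images:                     # e.g. 300 random images
            model(img.to(device))
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / len(images) * 1000.0   # ms per image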

F. Implementation Details

We use the PyTorch framework to facilitate the training and testing of our models. In our training approach, we employ a batch size of 16 for image models and 4 for video models. The learning rate remains constant at $3\times 10^{-3}$ for both image and video models. Our chosen loss function is CrossEntropyLoss, and we use Stochastic Gradient Descent (SGD) as the optimiser. Each model is trained for 5 epochs, and the checkpoint with the lowest validation loss is selected for subsequent testing and evaluation. For the evaluation stage, we use the Scikit-Learn library [70] to calculate and report LogLoss, AUC and Accuracy scores, as well as ROC and DET (Detection Error Tradeoff) curves.
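The recipe above can be condensed into the following sketch (SGD at $3\times 10^{-3}$, cross-entropy loss, five epochs, checkpoint selection by validation loss, Scikit-Learn metrics). Here model, train_loader, val_loader, test_images and test_labels are placeholders assumed to be defined elsewhere, and device handling and augmentation are omitted for brevity.

# Condensed training/evaluation sketch; the loaders and test tensors are
# placeholders, not our exact data pipeline.
import copy
import torch
from sklearn.metrics import log_loss, roc_auc_score, accuracy_score

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3)

best_state, best_val_loss = None, float("inf")
for epoch in range(5):
    model.train()
    for images, labels in train_loader:      # batch size 16 for image models
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # Keep the checkpoint with the lowest validation loss.
    model.eval()
    val_loss, n = 0.0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            val_loss += criterion(model(images), labels).item() * labels.size(0)
            n += labels.size(0)
    if val_loss / n < best_val_loss:
        best_val_loss, best_state = val_loss / n, copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)

# Fake-class probabilities feed LogLoss/AUC; thresholded predictions feed accuracy.
with torch.no_grad():
    probs = torch.softmax(model(test_images), dim=1)[:, 1].cpu().numpy()
print(log_loss(test_labels, probs),
      roc_auc_score(test_labels, probs),
      accuracy_score(test_labels, probs > 0.5))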

To facilitate our model implementations and leverage pre-trained weights, we rely heavily on the PyTorch Image Models (timm) repository by Ross Wightman. Additionally, we adapt certain code snippets from [26] to train linear classification heads on top of self-supervised feature extractors such as DINO and CLIP. We augment training images using the imgaug library.
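In simplified form, the two model set-ups look as follows: a supervised model created through timm and trained end-to-end, and a frozen ViT-Base backbone with only a linear classification head trained on top, mirroring the scheme used for the DINO and CLIP feature extractors. The timm identifiers shown are illustrative assumptions and do not necessarily match the exact checkpoints used.

# Sketch of the two set-ups; model identifiers are illustrative.
import torch
import torch.nn as nn
import timm

# (1) Supervised model, trained end-to-end for two classes (real/fake).
supervised_model = timm.create_model("swin_base_patch4_window7_224",
                                     pretrained=True, num_classes=2)

# (2) Frozen ViT-Base feature extractor with a trainable linear head.
backbone = timm.create_model("vit_base_patch16_224", pretrained=True,
                             num_classes=0)   # num_classes=0 -> pooled features
for p in backbone.parameters():
    p.requires_grad = False                   # freeze the feature extractor

class LinearProbe(nn.Module):
    def __init__(self, backbone, num_classes=2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.num_features, num_classes)  # only trained part

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)

probe = LinearProbe(backbone)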

SECTION IV.

Results

We conducted extensive experimentation and evaluation on six image classification models and two video classification models, which we specifically trained for deepfake detection. These evaluations are conducted across four different datasets, as outlined in Section III. The analysis includes evaluating all models under both intra-dataset conditions (trained and evaluated on the same dataset) and inter-dataset conditions (trained on one dataset and evaluated on other datasets, excluding the training dataset). Subsequent sections present the performance outcomes of all participating models within both intra-dataset and inter-dataset contexts.

In addition to the supervised models, our investigation includes two vision transformer (ViT-Base) models pre-trained using the self-supervised techniques DINO [26] and CLIP [27], as previously outlined in Section III. We then compare these two self-supervised models against a supervised Vision Transformer (ViT) [14]. It is important to note that all three models (DINO, CLIP and the supervised ViT) are ViT-Base models. By training a classification head on top of each of these three models, our goal is to discern whether self-supervised features offer superior representations compared to supervised features.

A. FakeAVCeleb

FakeAVCeleb [46] is a recently released deepfake detection dataset containing four different categories of videos, as described in Section III-A. Since we focus only on visual deepfakes in this study, we do not use the audio data (real and fake) for training and evaluating the models. Thus, out of the four subsets of the FakeAVCeleb dataset, we use only two for our experiments: (1) FakeVideo/FakeAudio and (2) RealVideo/RealAudio.

We present the intra-dataset evaluation scores in Table 3, which show that all models perform well in distinguishing between fake and real faces. From Table 3, we can see that all participating models achieve nearly 99% AUC and very low LogLoss scores when tested in an intra-dataset configuration. These numbers suggest that the FakeAVCeleb dataset is relatively easy, allowing the models to accurately distinguish between real and fake samples.

TABLE 3 Intra-Dataset Performance Comparison of Image Models. The Table Below Presents Scores Achieved by Image Models When Trained and Evaluated on FakeAVCeleb [46] Dataset. Best Results are Highlighted in Yellow
TABLE 4 Intra-Dataset Comparison of Image Models. The Table Below Presents Scores Achieved by Image Models When Trained and Evaluated on CelebDF-V2 [41] Dataset
TABLE 5 Intra-Dataset Comparison of Image Models. The Table Below Presents Scores Achieved by Image Models When Trained and Evaluated on FaceForensics++ [34] Dataset
TABLE 6 Intra-Dataset Comparison of Image Models. The Table Below Presents Scores Achieved by Image Models When Trained and Evaluated on DFDC [44] Dataset
TABLE 7 This Table Compares the Performance of All the Participating (Supervised) Models. We Present Scores After Averaging the Scores (LogLoss, AUC, Accuracy) Achieved by Each Model When Evaluated in an Intra-Dataset Setting
TABLE 8 This Table Compares the Performance of All the Participating (Self-Supervised) Models When Evaluated in an Intra-Dataset Setting. The Statistics of This Table are Illustrated in Figure 10
TABLE 9 This Table Compares the Performance of the Self-Supervised Models. We Present Scores After Averaging the Scores (LogLoss, AUC, Accuracy) Achieved by Each Model on the Four Datasets, When Evaluated in an Intra-Dataset Setting. In This Table, Supervised Refers to the ViT-Base Model Pre-Trained Using a Supervised Training Scheme, DINO Refers to the ViT-Base Model Pre-Trained Using the Self-Supervised Scheme Proposed in [26], and CLIP Refers to the ViT-Base Model Pre-Trained Using the Self-Supervised Scheme Proposed in [27]. All of These ViT-Base Models are Used as Feature Extractors, Where We Only Train a Classification Head on Top of Each Feature Extractor and Freeze the Weights of the Feature Extractors
TABLE 10 This Table Compares the Performance of All Participating (Supervised) Models Evaluated in an Inter-Dataset Setting. Results are Obtained by (1) Evaluating Each Model Trained on One Dataset on Each of the Remaining Datasets and (2) Averaging the Achieved Scores, i.e., Adding the Three Accuracy Scores and Dividing by Three. Figure 7 Illustrates the Statistics of This Table

Table 11 reports the results achieved by all models when trained on FakeAVCeleb and evaluated on the remaining three datasets. From the numbers reported in Table 11, it is apparent that almost all models perform poorly on the other datasets: in terms of accuracy, the models are essentially making random guesses, and the LogLoss and AUC scores are likewise unimpressive in the inter-dataset evaluation.

TABLE 11 Inter-Dataset Evaluation Scores of Models Trained on FakeAVCeleb [46] Dataset and Evaluated on the Remaining Three Datasets

For the self-supervised models, the intra-dataset evaluation scores are not as high as those achieved by the supervised models, yet they remain reasonable. This is understandable, as these models are not trained in an end-to-end manner; rather, only the classification heads are trained on frozen features, as previously mentioned. On this dataset, DINO outperforms the other two models, i.e., CLIP and the supervised ViT, by a significant margin, as indicated in Table 8.

In an inter-dataset evaluation setting, the self-supervised models provide intriguing insights. Notably, DINO, trained on the FakeAVCeleb dataset and evaluated on the CelebDF-V2 and FaceForensics++ datasets, demonstrates results comparable to the supervised image models. It is worth highlighting that DINO achieves this performance while only training the classification head, in contrast to the supervised models that undergo full training. The results also suggest that training more complex models on easier datasets does not yield good performance on out-of-distribution data, i.e., the models overfit the training data. Table 12 presents these results.

TABLE 12 Inter-Dataset Evaluation Scores of Self-Supervised Models Fine-Tuned on FakeAVCeleb [46] Dataset and Evaluated on the Remaining Three Datasets

To sum up, from the results given in Tables 3, 8, 10, 11 and 12 we can infer that the FakeAVCeleb dataset is not challenging enough: both supervised and self-supervised models find it fairly easy to distinguish between its fake and real samples. In addition, this dataset does not enhance the models’ ability to learn robust features for distinguishing real from fake faces; in other words, it fails to equip the models with generalisation capability, as is apparent from Tables 10, 11 and 12.

B. CelebDF-V2

Table 4 presents the performance of the supervised models when trained and evaluated on the CelebDF-V2 [41] dataset. As was the case with the FakeAVCeleb dataset, almost all participating models achieve excellent scores, i.e., more than 97% accuracy and more than 99% AUC, with very small LogLoss. We can thus infer that, as with FakeAVCeleb, the models comfortably learnt to discriminate between the real and fake samples of CelebDF-V2.

To gauge the extent to which this dataset helps models acquire robust features for enhanced generalisation, we carry out an extensive inter-dataset evaluation involving all participating models trained on CelebDF-V2. The outcomes are presented in Table 13. Surprisingly, and similar to the observations for models trained on the FakeAVCeleb dataset and assessed on other datasets, the models trained on CelebDF-V2 also display suboptimal performance under inter-dataset evaluation. This outcome can likely be attributed to CelebDF-V2 not being particularly challenging for the models, as they classify almost every real/fake sample flawlessly. Nonetheless, this ease of classification also renders the models less adept at handling unseen data, as evidenced by the performance metrics detailed in Table 13.

TABLE 13 Inter-Dataset Evaluation Scores of Models Trained on CelebDF-V2 [41] Dataset and Evaluated on the Remaining Three Datasets

The evidence that CelebDF-V2 is less challenging to learn is further substantiated by the outcomes of the self-supervised models, as illustrated in Table 8. The numbers clearly demonstrate that even when merely training a classification head on frozen feature extractors, the models still achieve commendable results. Under inter-dataset evaluation, self-supervised models trained on CelebDF-V2 and tested on the other datasets yield outcomes similar to those of the supervised models, although in some cases, e.g., on DFDC, the self-supervised models show a considerable performance drop. For additional details, consult Tables 13 and 14.

TABLE 14 Inter-Dataset Evaluation Scores of Self-Supervised Models Fine-Tuned on CelebDF-V2 [41] Dataset and Evaluated on the Remaining Three Datasets

C. FaceForensics++

The performance metrics for all supervised models when trained and evaluated on the FaceForensics++ [34] dataset are presented in Table 5. These results are noticeably less favourable than those achieved on the previous datasets, FakeAVCeleb and CelebDF-V2. Only a few models managed to exceed 95% accuracy, and the LogLoss scores are also less impressive in comparison. These metrics imply that this dataset poses a comparatively difficult challenge for the models in differentiating between real and fake samples. The self-supervised models also struggle to achieve good scores on the FaceForensics++ dataset, as evident from the numbers in Table 8. This reaffirms the notion that accurately distinguishing between fake and real faces in FaceForensics++ is a formidable task, and prompts us to ask whether a more demanding dataset corresponds to enhanced generalisation capabilities.

Consequently, we evaluate all supervised models trained on the FaceForensics++ dataset within an inter-dataset evaluation framework. The insights from this evaluation are outlined in Table 15. The models now exhibit satisfactory performance even when confronted with data originating from previously unseen domains. This contrasts with models trained on the FakeAVCeleb and CelebDF-V2 datasets, which exhibit comparatively poor generalisation capabilities. To illustrate, MViT trained on FaceForensics++ and evaluated on the FakeAVCeleb dataset achieves an accuracy exceeding 80% and an AUC score exceeding 90%. Beyond FakeAVCeleb, we also observe encouraging performance from all models trained on this dataset and evaluated on the others. The results in Table 15 support the statement that more challenging datasets yield better generalisation capability, a statement we further reinforce after evaluating the models trained on the DFDC [44] dataset in the upcoming section.

TABLE 15 Inter-Dataset Evaluation Scores of Models Trained on FaceForensics++ [34] Dataset and Evaluated on the Remaining Three Datasets

D. DFDC

DFDC is one of the largest and most widely adopted deepfake detection benchmarks. We present the intra-dataset evaluation scores of our models trained and evaluated on DFDC in Table 6. Res2Net-101 turned out to be the best model in this evaluation, achieving more than 84% accuracy and a 93% AUC score on the DFDC dataset. The self-supervised models also achieve relatively low scores when trained and evaluated on DFDC, as apparent from Table 8. This establishes DFDC as the most challenging of the four datasets in this study.

In Table 17, we present the inter-dataset evaluation scores achieved by the supervised models trained on the DFDC dataset. The numbers show that models trained on DFDC still achieve acceptable performance on unseen data, in contrast to the models trained on FakeAVCeleb and CelebDF-V2. These results affirm the statement that models trained on more challenging datasets tend to generalise better, as is evident from Tables 10, 15, 17 and 18.

TABLE 16 Inter-Dataset Evaluation Scores of Self-Supervised Models Fine-Tuned on FaceForensics++ [34] Dataset and Evaluated on the Remaining Three Datasets
TABLE 17 Inter-Dataset Evaluation Scores of Models Trained on DFDC [44] Dataset and Evaluated on the Remaining Three Datasets
TABLE 18 Inter-Dataset Evaluation Scores of Self-Supervised Models Fine-Tuned on DFDC [44] Dataset and Evaluated on the Remaining Three Datasets

E. Discussion

1) Supervised Models

In Figure 3, we illustrate a comparison of all participating supervised models based on their attained accuracy scores (averaged) in an intra-dataset evaluation context. The visualisation clearly indicates that there exists minimal performance difference among the models. Across the majority of cases, the models achieve accuracy levels ranging from approximately 92% to 94%.

Notably, the figure underscores that image augmentations do not always yield significant performance gains. For instance, XceptionNet, Res2Net-101, MViT-V2-Base and EfficientNet-B7 display superior performance when trained without image augmentations, as compared to their counterparts trained with augmentations. Nonetheless, the divergence in accuracy scores between models trained with and without image augmentations is generally modest, except in the case of ViT: the ViT trained with image augmentations achieves an accuracy of 91.62%, whereas the ViT trained without augmentations records an accuracy of 88.63%. Figure 3 also highlights that transformer models, as well as video models, consistently perform better when trained with augmentations. An important insight is that the best-performing model, Swin-Base, attains its peak accuracy when trained with image augmentations, further advocating for the incorporation of augmentations in training protocols.

Furthermore, it is worth noting that the transformer models (Swin-Base and MViT-V2-Base, TimeSformer) demonstrate superior performance compared to their CNN counterparts. Interestingly, the Res2Net-101 model also achieves remarkable numbers in the intra-dataset evaluation context, despite having roughly half the number of parameters (43 million parameters) compared to the top-performing Swin-Base model (87 million parameters). Figure 3 and Table 7 collectively indicate a valuable observation: models equipped with multi-scale feature processing capabilities, such as Res2Net, MViT-V2 and Swin Transformer, exhibit the best performance among all the models.

Moving to the inter-dataset analysis, Figure 7 presents the outcomes attained by the supervised models when assessed in an inter-dataset context. The figure shows that the models exhibit noticeably lower performance in inter-dataset evaluation than in intra-dataset evaluation. This discrepancy is reasonable, since detection models tend to experience performance degradation when confronted with data from unseen distributions. However, Figure 7 reveals a useful finding: across all datasets, the transformers consistently emerge as the top-performing models compared to their CNN counterparts. We refer readers to Table 10 to examine the inter-dataset scores achieved by the models on each of the datasets.

We also visualise t-SNE plots of all the supervised models in Figure 4. These visualisations illustrate how the models cluster faces from the same dataset close together while separating them from faces of other datasets. The t-SNE plots also offer insights into the relative difficulty of the datasets, with a clear distinction between the easier datasets (FakeAVCeleb and CelebDF-V2) and the more challenging ones (FaceForensics++ and DFDC), as evident in Figure 4.
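As a rough illustration, these plots can be produced along the following lines, where features is an N×D array of penultimate-layer embeddings and dataset_ids an array of dataset names for each face crop; both are assumed to come from a separate feature-extraction pass.

# t-SNE sketch: embed model features in 2-D and colour points by source dataset.
# `features` and `dataset_ids` are assumed inputs from a feature-extraction pass.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)

for name in ["FakeAVCeleb", "CelebDF-V2", "FaceForensics++", "DFDC"]:
    mask = dataset_ids == name
    plt.scatter(embedded[mask, 0], embedded[mask, 1], s=4, label=name)
plt.legend()
plt.title("t-SNE of extracted features, coloured by dataset")
plt.show()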

Another notable observation is that image models tend to perform the separation task more effectively than video models. This is expected, considering our earlier mention that video models typically require larger amounts of training data (we trained both image and video models on the same datasets in this study). As part of our future research, we aim to explore video models on larger datasets to further validate this hypothesis. Despite this, the t-SNE visualisations reveal an interesting insight: while the video model ResNet-3D may struggle to distinguish between real and fake faces within the same dataset, it excels at separating data from different datasets.

In addition, for a better diagnosis of the models, we also visualise the predictions using Gradient-weighted Class Activation Mapping (Grad-CAM) [71]. Figure 6 presents Grad-CAMs of the supervised image models on all datasets. It is interesting to observe that the models, to varying degrees, concentrate on different facial regions when making predictions.
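For readers who wish to reproduce such visualisations, the library-agnostic sketch below computes a Grad-CAM heat map from the gradients of the fake-class logit at a chosen convolutional layer. The choice of target_layer is model-specific, and this function is an illustrative implementation rather than the exact code used in our experiments.

# Minimal Grad-CAM: weight the activation maps of `target_layer` by the
# spatially averaged gradients of the selected class logit.
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx=1):
    acts, grads = {}, {}

    def fwd_hook(module, inputs, output):
        acts["value"] = output                      # (1, C, H', W') feature maps

    def bwd_hook(module, grad_input, grad_output):
        grads["value"] = grad_output[0]             # gradients w.r.t. the feature maps

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    model.eval()
    logits = model(image.unsqueeze(0))              # image: (3, H, W)
    model.zero_grad()
    logits[0, class_idx].backward()                 # gradient of the "fake" logit

    h1.remove()
    h2.remove()

    weights = grads["value"].mean(dim=(2, 3), keepdim=True)          # GAP over H', W'
    cam = F.relu((weights * acts["value"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[1:], mode="bilinear",
                        align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam[0, 0]                                # heat map in [0, 1], shape (H, W)

The resulting heat map can then be overlaid on the input face crop to inspect which facial regions drive the prediction.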

Furthermore, we provide the ROC and DET curves for the participating models assessed in an intra-dataset context, as illustrated in Figures 8 and 9, respectively. The corresponding AUC scores reinforce the notion that the FakeAVCeleb and CelebDF-V2 datasets present less of a challenge to the models than the FaceForensics++ and DFDC datasets. This underscores the idea that training models on more challenging datasets, rather than easier ones, enhances their generalisation capabilities for deepfake detection.
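Both sets of curves can be generated directly from the stored test scores with Scikit-Learn, as in the short sketch below; y_true and y_score denote the ground-truth labels and fake-class probabilities collected during evaluation, and the model name is a placeholder.

# ROC and DET curves from stored evaluation scores (placeholder inputs).
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay, DetCurveDisplay

fig, (ax_roc, ax_det) = plt.subplots(1, 2, figsize=(10, 4))
RocCurveDisplay.from_predictions(y_true, y_score, name="Swin-Base", ax=ax_roc)
DetCurveDisplay.from_predictions(y_true, y_score, name="Swin-Base", ax=ax_det)
plt.show()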

The scores (LogLoss, AUC, ACC) reported in Tables 7 and 9 for each model are calculated by averaging the individual scores achieved by that model on each dataset. For example, if s1, s2, s3 and s4 are the scores a model achieved on datasets d1, d2, d3 and d4, the reported score is (s1 + s2 + s3 + s4)/4.

2) Self-Supervised Models

In Figure 5, we show a similar comparison involving the self-supervised models. It is clear that DINO outperforms the other two models. A careful examination of the outcomes in Tables 8 and 9 enables us to deduce that self-supervised features, particularly those of DINO, yield superior representations compared to CLIP and the supervised ViT. To strengthen this finding, we illustrate the corresponding ROC and DET curves in Figures and 11, respectively.

3) The Outcome

Answering the six questions that we posed at the beginning of this study in Section I:

  • identifying the most effective model architectures for detecting deepfakes among those being tested - Ans: Models equipped with multi-scale feature representation capabilities, such as MViT-V2-Base, Res2Net-101 and Swin-Base (hierarchical representations).

  • pinpointing the model architectures with the highest ability to adapt to new and unseen data - Ans: Upon examining the tables, it becomes evident that transformer models, including Swin-Base, MViT-V2-Base and TimeSformer, achieve superior performance compared to the other models in the majority of cases.

  • assessing the difficulty of different datasets for model training - Ans: DFDC and FaceForensics++ datasets pose greater challenges for the models to learn in comparison to CelebDF-V2 and FakeAVCeleb datasets.

  • determining the dataset that best facilitates generalisation to unseen data - Ans: Table 10 confirms that the FaceForensics++ dataset promotes strong generalisation of models to unseen data, with the DFDC dataset ranking second in this regard.

  • evaluating the performance of self-supervised training strategies - Ans: From Tables 8 and 9, it is evident that DINO outperforms the other two competing strategies in intra-dataset evaluation across all datasets.

  • examining the impact of augmentations on enhancing model performance - Ans: Within the scope of this study, the augmentations we employed have a modest overall effect on the models’ performance, i.e., in some cases augmentations help models achieve better performance, while in other cases they do not.

SECTION V.

Conclusion

We conducted a comprehensive study to assess the effectiveness of various image and video classification architectures for deepfake detection. Models were initially pre-trained using both supervised and self-supervised approaches and then evaluated on four prominent deepfake detection datasets. Our extensive experiments revealed that models adept at processing multi-scale features, such as Res2Net-101, MViT-V2 and Swin Transformer, consistently outperformed others in intra-dataset comparisons. Notably, MViT-V2-Base and Res2Net-101 achieved this strong performance with approximately half the parameters of the Swin-Base transformer model. Regarding generalisation across datasets, transformer models consistently outperformed CNN models, and training on FaceForensics++ [34] and DFDC [44] provided the strongest generalisation capabilities.

Our investigation into models pre-trained using self-supervised strategies showed that the ViT-Base model, pre-trained using DINO [26], outperformed both supervised ViT-Base and self-supervised CLIP [27] ViT-Base models. Additionally, our findings indicate that the selected image augmentations lead to improved performance for Transformer models, while offering comparably less notable benefits for CNN models.

ACKNOWLEDGMENT

The authors acknowledge the use of ChatGPT [72] for checking and correcting the grammar of this article. It is important to note that they did not use ChatGPT to generate new text; rather, they used it only to check and correct the grammar of the provided text.
