Convolutional Neural Network Based on Diverse Gabor Filters for Deepfake Recognition

Media synthesis and manipulation have reached unprecedented levels of realism owing to the proliferation of deep learning. Deepfake has become the de facto tool for media manipulation. Although this technology has potential in the entertainment industry, its threats include political manipulation and bypassing biometric security systems. As a result, deepfake detection has garnered widespread attention among research communities. The intuition is to use deep learning to fix the problems created by deep learning. Although convolutional neural networks have shown their dominance in the field of pattern recognition, the receptive field-model size dilemma still persists, along with the lack of interpretability of such models. While the traditional Gabor function was proposed to fix these problems, it can only generate linear Gabor filters, which restricts it to limited data and applications. The contribution of this paper is fourfold: (i) proposing a unified Gabor function capable of generating linear, elliptical, and circular Gabor filters; (ii) leveraging the back-propagation learning framework to incorporate the proposed function in convolutional neural networks and generate adaptive Gabor filters; (iii) presenting a dual scale large receptive field network for deepfake image recognition; and (iv) demonstrating where the proposed model stands in terms of performance and architecture size compared to state-of-the-art models. The proposed model is evaluated on four benchmark datasets: Celeb-DF (v2), DeepFake Detection Challenge Preview, FaceForensics++ and WildDeepfake. Experimental results show that the proposed adaptive Gabor filters reduce the model size by 64.9% compared to adaptive weighted filters without performance reduction.


I. INTRODUCTION
The recent developments in deep generative models (DGMs), particularly variational autoencoders [1] and Generative Adversarial Networks [2], have enabled media synthesis and manipulation to reach unprecedented levels of realism. DGMs have impacted different fields including medical imaging [3], digital forensics [4] and art production [5]. However, the dark side of DGMs has been perceived with the emergence of 'deepfake', an infamous technology that employs DGMs to superimpose face images of a target person over those of a source person, as shown in Fig. 1. Although public figures were the first targets of deepfake due to the abundant availability of their images online, it is currently possible for attackers to digitally impersonate any individual by acquiring a single image. The threats of deepfake include fake pornographic production, political manipulation, and bypassing biometric security systems. Since the risks of deepfake outweigh its benefits, deepfake detection models have become indispensable tools for distinguishing fabricated from authentic media.
The associate editor coordinating the review of this manuscript and approving it for publication was Zahid Akhtar.
While a deepfake image could fool human eyes, videos have noticeable cues that distinguish deepfake from genuine footage. In [7], the detection model is based on out-of-sync audio. Another identifier for videos is the inconsistencies between consecutive frames [8]. Furthermore, the average blinking time was found to be longer in deepfake videos than in real videos [9], [10]. While sequence-based models based on the aforementioned elements achieved promising results, they suffer from two drawbacks. First, current and upcoming deepfake generators tend to reduce these artifacts to come up with more robust models that could fool established detectors. Second, sequence-based detectors cannot be applied to deepfake images due to the lack of temporal information. Conversely, image-based detectors are applicable to manipulated videos through frame analysis and score fusion.
FIGURE 1. Deepfake frames from Celeb-DF (v2) dataset [6]. Frames from left to right belong to target, source and fake, respectively.
Although early image-based methods focused on salient artifacts for deepfake image detection, it was found that such methods tend not to generalize as well as convolutional neural networks (CNNs) to samples spawned from unknown generators with latent artifacts [11], [12]. CNNs have shown their dominance in the field of pattern recognition, with Adaptive Weighted Filters (AWFs) being the fundamental component. However, the number of learnable weights for a single AWF of size k × k is $k^2$. As a result, the number of learnable parameters grows quadratically with the filter size for a constant number of filters. In order to alleviate this problem, Simonyan et al. proposed to use consecutive filters with small receptive fields (i.e., filter size) instead of a filter with a large receptive field [13]. While this replacement was necessary to avoid the explosive increase in architecture size, it was not suitable for all recognition problems [14]. In addition to the increase in the number of parameters with receptive fields, another problem is that AWFs lack interpretability.
Despite the particularity of each problem, they both stem from the foundation of CNNs that resides in the AWFs of convolutional layers. Prior to deep learning, visual descriptors for image classification were extracted through hand-crafted methods characterized by their effectiveness and interpretability. Hence, incorporating these methods into deep learning can address both problems concurrently. While traditional predefined filters such as Sobel, Schmid and Gabor were considered in CNNs, Gabor filters have become a common choice lately [15]-[17]. The preference for Gabor filters comes from the following points. From a neurophysiological perspective, studies have revealed that the response of Gabor filters is equivalent to that of receptive fields of simple cells in the primary visual cortex [15]. From a signal processing perspective, Gabor filters are capable of extracting informative and discriminative joint spatial-spectral features. Furthermore, it was shown that low-level layers of CNNs tend to redundantly learn Gabor filters [17].
In this paper, we introduce a parameter α to the Gabor function that controls the geometrical shape of Gabor stripes (i.e., axial ratio). Therefore, the Gabor function is capable of generating diverse filters including linear, circular and elliptical Gabor filters. Note that we consider both the real and imaginary components of the proposed function by considering the phase-induced form of the Gabor function [17]. Moreover, we adopt a back-propagation learning framework to enable the generation of adaptive Gabor filters (AGFs). While AGFs produced from the proposed function could be applied in a variety of vision-related applications, we propose a compact architecture based on dual scale large receptive fields and self-attention for deepfake image recognition to demonstrate the effectiveness of AGFs. Three well-known deepfake datasets are used to evaluate the proposed architecture. In addition to comparing the proposed architecture to state-of-the-art image recognition models in terms of performance and model size, we further show that the utilization of AGFs instead of AWFs in the proposed architecture reduces the architecture size by 64.9%. The main contributions of this paper are outlined as follows:
• Different from previous work [16], [17] that leverages linear Gabor filters for pattern recognition, we develop a unified Gabor function capable of producing linear, circular and elliptical filters.
• In order to incorporate the proposed function in data-driven models, we utilize the back-propagation framework to enable the learnability of function parameters.
• We present a compact architecture for deepfake image recognition. The architecture leverages dual scale convolution with high receptive fields and a self-attention mechanism.
• We evaluate the proposed architecture on three datasets and compare it to other state-of-the-art models in terms of performance and architecture size.
The remainder of this paper is organized as follows. Section II provides a brief review of the relevant related work in both Gabor-based CNNs and deepfake image recognition. Section III introduces the proposed Gabor function, provides the learning framework to generate AGFs, and presents the proposed architecture for deepfake image recognition. Section IV demonstrates the experimental results. Finally, Section V concludes the paper.

II. RELATED WORK
A. GABOR-BASED CNNs
Initially, predefined Gabor filters with fixed parameters were introduced in CNNs based on the observation that some weighted filters in AlexNet redundantly learn Gabor filters [19]. The objective was to modulate the learnable weighted filters, aiming to enhance the deep feature representations with steerable orientation and scale capacities. This approach has proven to enhance the recognition performance with a perceptible reduction in the architecture size. Motivated by [19], Jiang et al. explored different architectures with varying depth for fast and efficient facial expression recognition [20]. In order to extract distinctive features at different scales and orientations from limited training data, a combination of a fixed Gabor ensemble filter and AWFs was proposed for hyperspectral image classification [21]. In [22], a series of Gabor filters replaced the weighted filters in the first layer of a CNN to enhance the overall classification performance. Furthermore, the spatial frequency and scale parameters of the filters are obtained through a coarse search in a small predefined subset of the parameter space, and backward propagation is applied for fine tuning.
VOLUME 10, 2022
Meng et al. proposed a training procedure for Gabor filters in the first layer of CNNs based on the multipopulation genetic algorithm. The proposed procedure showed a reduction in computational time and storage requirements [23]. In [24], Yuan et al. developed a regularizer loss function for a learnable Gabor convolution module. The module is used as a pre-processing tool and the features are passed to the ResNet-50 architecture for person re-identification. In [15], Zhang et al. addressed the difficulty of adjusting the parameters of Gabor filters via adaptive learning. In addition, the relation between the scale and frequency of the filter was leveraged to converge to the optimal values. The model demonstrated superior performance in finger-vein recognition. Stimulated by the fact that Gabor features assist in mitigating the negative effects introduced by the lack of training data, Liu et al. introduced naive Gabor networks [16]. In naive Gabor networks, CNNs strictly learn traditional Gabor filters to reduce the number of involved parameters and constrain the solution space. In addition, the phase offset parameter was retained in order to extract both local low-frequency and high-frequency features. The naive Gabor network was applied in hyperspectral image classification and showed superior performance with a small training set. Despite the work done in Gabor-based CNNs, only linear (i.e., traditional) Gabor filters were incorporated in CNNs, offering architectures with low diversity and adaptability to highly complex data.

B. DEEPFAKE IMAGE RECOGNITION
One of the first attempts to detect deepfake images involved exploiting the mesoscopic properties of images [25]. Two architectures were proposed to detect tampering: Meso-4 and MesoInception-4. Meso-4 consists of four layers of successive convolutions and pooling. These layers are followed by a dense network with one hidden layer. MesoInception-4, which is based on Meso-4, is formulated by replacing the first two convolutional layers by a variant of the inception module. Despite the lack of physical and mathematical interpretation for the proposed solutions, MesoNets and their variations have shown promising results in deepfake detection. In [26], capsule networks were proposed to detect forged images. Faces are detected and scaled to 128 × 128 in the pre-processing phase and a segment of VGG-19 extracts latent features. These features are distributed to three primary capsules and statistical pooling is utilised for forgery detection. Finally, outputs of the three capsules are dynamically routed to the output capsules. The two output capsules, one for real images and one for fake images, indicate the authenticity of images. Li et al. observed that the DeepFake generation pipeline produces face warping artifacts [12]. These artifacts result from resolution inconsistency. Pre-existing architectures such as VGG16, ResNet50, ResNet101 and ResNet152 were then trained to detect face warping artifacts. In [27], Kim et al. proposed a combination of content and trace feature extractors to expose deepfake images. The content feature extractor utilizes a pre-trained ResNet-18 model, while the trace feature extractor employs multi-channel constrained convolution. Furthermore, the features from both extractors are aggregated and connected to a fully connected layer to produce the classification result.
Owing to the fact that face manipulation methods share a common blending step, Li et al. proposed an image representation called face X-ray [28]. This representation was able to show the blending boundary for fake images without relying on specific facial artifacts, making it effective in image forgery detection. On the other hand, in [29], a patch-wise consistency learning approach was proposed for deepfake image detection. The module requires minor modification depending on the utilized backbone architectures. In [30], Feng et al. incorporated a triplet loss function in the feature extraction stage of the deep learning model, followed by a linear classification network to discriminate the learned contrastive features between real and fake face images. Due to the growth in the number of face manipulation methods, octave convolution and an attention-based fusion module were proposed for mining intrinsic clues in the channel difference image and spectrum image [31]. In addition, an alignment module was designed to enhance generalizability. In order to detect the convolutional traces left by GANs in fake images, Guarnera et al. developed an approach based on the Expectation-Maximization algorithm to detect the fingerprints of fake images [32].
Multiple methods started to leverage the frequency domain for deepfake recognition more recently. Durall et al. utilized discrete Fourier transform and applied azimuthal averaging to reduce the amount of features without losing relevant information [33]. Additionally, three different classifiers were used for classification comparison: SVM, logistic regression, and K-means clustering. In [34], it was shown that frequency representation can be used to easily identify severe artifacts. In addition, they demonstrated that transformed images via discrete cosine transform were linearly separable, while classification on raw pixel images required nonlinear models. Motivated by frequency-aware forgery clues, Qian et al. developed a Frequency-Aware Decomposition (FAD) for adaptive partition of input images according to a set of learnable frequency filters [35]. Moreover, Local Frequency Statistics (LFS) were extracted to describe the statistical discrepancy between real and fake samples. Both clues from FAD and LFS were learned by a cross-attention powered two-stream network. In [36], it was observed that cumulative up-sampling in face forgery techniques resulted in plain changes in the phase spectrum. Therefore, they designed a novel spatial-phase shallow learning approach that utilized the spatial image and phase spectrum to capture the up-sampling artifacts of face forgery. Wang et al. managed to capture the subtle artifacts at different scales through the utilization of transformer models [37]. In addition to the multi-scale transformer that detects local inconsistency at different spatial levels, frequency information is leveraged to enhance the robustness of the model to image compression.

III. METHOD
A. DIVERSE GABOR FUNCTION
The conventional 2-D Gabor function is defined as a Gaussian function multiplied by a sinusoidal plane wave. Therefore, filters produced from the traditional function have the shape of parallel linear stripes, which represent the sinusoidal wave, encompassed by a Gaussian envelope. Hereafter, traditional Gabor filters are referred to as linear Gabor filters (LGFs). Other variants of Gabor filters such as Circular Gabor Filters (CGFs) and Elliptical Gabor Filters (EGFs) have been designed for hand-crafted texture segmentation and analysis of ring-like shapes [38], [39]. However, a single image could contain different shapes. For example, while the outline of the human face is linear, the outline of the eyelid is elliptical and that of the iris is circular. In order to fully utilize all these shapes for effective feature extraction, we develop a unified phase-induced diverse Gabor function G(x, y) by introducing the novel parameter α. Here ψ = {γ, σ, θ, ω, α, φ} defines the set of parameters described in Table 1, K defines the Gaussian function, P is the argument of the sinusoid, and $x_r$ and $y_r$ refer to the axis transformation

$$x_r = x\cos\theta + y\sin\theta, \qquad y_r = -x\sin\theta + y\cos\theta. \tag{4}$$

The real and imaginary components of the complex function can be expressed by

$$\mathrm{Re}(G) = K\cos P, \qquad \mathrm{Im}(G) = K\sin P.$$

Generally, the phase offset is ignored and only the real component is considered for practical applications. However, it has been proven that both real and imaginary components are needed for low and high frequency analysis [40]. Therefore, we consider the real cosine component with phase offset φ, since the imaginary sine component could be generated from the real cosine component at φ = −π/2. Hereafter, we refer to the phase-induced real component by G = K cos P. The incorporation of parameter α provides greater diversity, making LGF and CGF special cases of the proposed function when α = 0 and α = 1, respectively. Furthermore, weighted EGFs can be generated in addition to EGFs by changing the values of α and γ. The difference is that the former has varying amplitude while the latter has constant amplitude along the ellipse. However, we refer to both as EGFs for convenience and the same goes for CGFs. Fig. 2 illustrates special cases of the diverse Gabor function by varying the values of α and γ.
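As a rough illustration of how α moves a filter between the linear and circular cases, the sketch below samples G = K cos P on a square grid. The exact expressions for K and P are not reproduced here, so the forms used in the code (a Gaussian envelope with aspect ratio γ, and a phase ω·sqrt(x_r² + α²·y_r²) + φ) are assumptions chosen only so that α = 0 gives straight stripes and α = 1 gives rings. The function name `diverse_gabor` is hypothetical.

```python
import math

def diverse_gabor(size, gamma, sigma, theta, omega, alpha, phi):
    """Sample G = K * cos(P) on a size x size grid (size assumed odd).

    ASSUMED forms, for illustration only (not taken from the paper):
      K = exp(-(x_r^2 + gamma^2 * y_r^2) / (2 * sigma^2))
      P = omega * sqrt(x_r^2 + alpha^2 * y_r^2) + phi
    """
    half = size // 2
    rows = []
    for j in range(size):
        y = j - half
        row = []
        for i in range(size):
            x = i - half
            # Axis transformation: rotate the coordinates by theta.
            x_r = x * math.cos(theta) + y * math.sin(theta)
            y_r = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope (assumed form).
            k = math.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2 * sigma ** 2))
            # Phase term (assumed form): alpha sets the axial ratio of the stripes.
            p = omega * math.sqrt(x_r ** 2 + (alpha * y_r) ** 2) + phi
            row.append(k * math.cos(p))
        rows.append(row)
    return rows

# alpha = 0: stripes depend on x_r only; alpha = 1: concentric rings.
linear = diverse_gabor(11, gamma=1.0, sigma=3.0, theta=0.0, omega=1.0, alpha=0.0, phi=0.0)
circular = diverse_gabor(11, gamma=1.0, sigma=3.0, theta=0.0, omega=1.0, alpha=1.0, phi=0.0)
```

Under these assumed forms, intermediate values of α between 0 and 1 produce elliptical stripes, and changing γ independently of α changes how the envelope weights the stripes.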

B. LEARNING FRAMEWORK
A distinct difference between AWFs and the proposed AGFs is the reduced hypothesis space. This can be inferred from the range of values for each parameter. Since the objective function used in the deepfake image recognition model is the differentiable binary cross-entropy loss, the parameter set ψ can be optimized via back-propagation: each parameter is updated using the gradient of the loss, which is obtained by applying the chain rule through G. In order to simplify the derivation for θ, the relation in Eq. (4) is leveraged as follows:

$$\frac{\partial x_r}{\partial \theta} = y\cos\theta - x\sin\theta = y_r \tag{14}$$

$$\frac{\partial y_r}{\partial \theta} = -(x\cos\theta + y\sin\theta) = -x_r \tag{15}$$

$$\cdots\,\sin P \; x_r \, y_r \, K \tag{16}$$

The advantage of incorporating the proposed function in convolutional layers is threefold. First, it significantly reduces the number of parameters should the need for a large receptive field (i.e., filter size) arise. Traditionally, the input to a convolutional layer of N weighted filters, each of size k × k, is assumed to consist of C channels. As a result, the number of parameters for a convolutional layer based on AWFs is $k^2 N(C + 1)$. For constant N and C, the number of parameters grows quadratically with k. However, for a convolutional layer incorporating the diverse Gabor function, the cardinality of its parameter set |ψ| is 6. As a result, the number of learnable parameters is $6N(C + 1)$ for a convolutional layer based on AGFs. In this case, the number of parameters is independent of the filter size, which allows the utilization of large receptive field filters without an increase in the number of learnable parameters.
However, it is noteworthy that AGFs will be very inefficient for convolutional models used in attention mechanisms that utilize 1 × 1 filters [41]. Second, AGFs have higher interpretability compared to AWFs since they are generated from a mathematical function with a predefined set of parameters ψ. Third, while the proposed function can be used to construct deep architectures that solely consist of convolutional layers based on AGFs, it can also be used alongside other convolutional layers that are based on AWFs within a single architecture, as in the proposed dual scale large receptive field network (DSLRFN) for deepfake image recognition.
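The parameter counts discussed above can be checked with a few lines of arithmetic. The sketch below uses the two expressions as stated in the text, k²N(C + 1) for AWFs and 6N(C + 1) for AGFs. The function names are hypothetical, N = 32 matches the filter count mentioned for DSLRFN, and C = 3 (an RGB input) is an assumption.

```python
def awf_layer_params(k, n_filters, in_channels):
    """Parameter count of an AWF-based layer, using the text's k^2 * N * (C + 1)."""
    return k * k * n_filters * (in_channels + 1)

def agf_layer_params(n_filters, in_channels, n_gabor_params=6):
    """Parameter count of an AGF-based layer, using the text's 6 * N * (C + 1)."""
    return n_gabor_params * n_filters * (in_channels + 1)

# Illustrative values: N = 32 filters, C = 3 input channels (assumed RGB input).
for k in (1, 3, 11, 15):
    print(k, awf_layer_params(k, 32, 3), agf_layer_params(32, 3))
```

With these values the AGF count stays fixed while the AWF count grows with k², and for k = 1 the AWF layer is the smaller of the two, which is consistent with the remark that AGFs are inefficient for 1 × 1 filters.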

C. DEEPFAKE RECOGNITION
Although the proposed function and learning framework could be applied to any visual application, we focus on the pervasive problem of deepfake image recognition. The overall model is shown in Fig. 3. Since deepfake manipulates the facial region, face detection is an essential step in deepfake recognition. A multi-task cascaded convolutional network (MTCNN) is adopted since it provides a satisfactory performance-runtime trade-off [42]. The extracted facial region is then passed to DSLRFN for classification. The proposed DSLRFN consists of four main blocks. The face image goes through a dual scale convolutional block that utilizes the proposed AGFs with large receptive fields. If small receptive fields were to replace the large receptive fields, 5 and 7 convolution layers of 3 × 3 filters would replace the 11 × 11 and 15 × 15 layers, respectively. Therefore, AGFs not only reduce the architecture size but also reduce its depth. The objective of the dual scale block is the extraction of features at different scales. Furthermore, features are aggregated using an element-wise maximum instead of a conventional concatenation layer for feature space reduction. Note that padding is used in the 15 × 15 convolutional layer to produce feature maps of the same size. The second block represents a self-attention mechanism. The block takes feature maps produced from a max pooling layer, produces attention maps, and multiplies the attention maps with the feature maps in order to refine the most germane segments of the feature maps. The third block is a high-level embedding block. It takes feature maps from the second pooling layer in order to produce highly abstract features used for classification. The final block is the classifier block, which takes high-level features and uses a single fully connected layer to produce the classification result. A batch normalization layer is used after every convolutional layer for training stabilization by reducing the covariate shift [43]. Note that AGFs are only used in the dual scale convolution block. Furthermore, the number of filters in each convolutional layer is 32, resulting in a compact network that consists of 17,013 parameters.
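The layer equivalences and the padding remark above can be verified with simple arithmetic. The following sketch assumes stride-1, undilated convolutions and an unpadded 11 × 11 branch, neither of which is stated explicitly in the text. All function names are hypothetical.

```python
def stacked_receptive_field(n_layers, k=3):
    """Receptive field of n stacked k x k convolutions (stride 1, no dilation)."""
    return n_layers * (k - 1) + 1

def conv_output_size(in_size, k, padding=0):
    """Spatial output size of a stride-1 convolution with symmetric padding."""
    return in_size + 2 * padding - k + 1

def padding_to_match(k_small, k_large):
    """Symmetric padding for the larger filter so both branches give equal sizes,
    assuming the smaller-filter branch is unpadded."""
    diff = k_large - k_small
    assert diff % 2 == 0, "filter sizes must differ by an even amount"
    return diff // 2

def elementwise_max(a, b):
    """Fuse two equally sized 2-D feature maps by taking the larger value per position."""
    return [[max(u, v) for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

print(stacked_receptive_field(5), stacked_receptive_field(7))  # 11 15
print(padding_to_match(11, 15))                                # 2
```

Under these assumptions, a 224 × 224 input yields 214 × 214 maps from both branches once the 15 × 15 branch is padded by 2, so the element-wise maximum is well defined and keeps the channel count at 32 rather than doubling it as concatenation would.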

IV. EXPERIMENTAL RESULTS
A. DATASETS
The experiments were conducted on four benchmark datasets: Celeb-DF (v2) (CD2) [6], DeepFake Detection Challenge Preview (DFDC) [44], FaceForensics++ (FF++) [45] and WildDeepfake (WDF) [46]. CD2 contains 5639 fake videos generated from 890 genuine videos collected from the internet. The standard testing split consists of 520 videos (179 real and 341 fake). The fake videos in CD2 were produced by a single generator. DFDC consists of 1131 real videos acquired from paid actors and 4105 fake videos produced by two unknown generators. The standard testing split of DFDC is 775 videos (276 real and 499 fake). FF++ is a superset that consists of 1000 real videos that were manipulated to produce 4000 tampered videos by four different methods: Deepfake, FaceSwap, Face2Face, and NeuralTextures. FF++ has a standard train:validation:test split of 720:140:140. Finally, WDF is one of the most recent datasets and consists of 707 videos. Note that all the videos, whether real or fake, were found in the wild. Hence, the number of manipulation methods in this database is unknown. Therefore, WDF is only used to test the generalizability of the models against samples found in the wild.

B. IMPLEMENTATION DETAILS
Conventionally, image-based deepfake detectors exploit a number of still frames from a given video in order to avoid redundancy and high computational complexity, especially since videos in the considered datasets are short and take place in stationary environments. Existing models select key frames, the first few frames, or sample one frame per second of the video. This work selects equally separated frames from each video for the sake of variety, following [31]. Additionally, MTCNN is used to detect 224 × 224 facial regions in the selected frames. Furthermore, DSLRFN is implemented in PyTorch using the binary cross-entropy loss and the Adam optimizer with an initial learning rate of 0.001, trained for 120 epochs with a batch size of 128. The learning rate is reduced on plateau by a factor of 0.1 with a patience of 8.
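Selecting equally separated frames can be sketched as follows. The number of frames taken per video is not stated here, so `n_select` is a free parameter, and the choice to take the midpoint of each equal segment is an assumption of this sketch rather than the procedure of [31]. The function name is hypothetical.

```python
def equally_spaced_frames(n_total, n_select):
    """Return n_select frame indices spread evenly over a video of n_total frames.

    The video is split into n_select equal segments and the frame at the
    midpoint of each segment is taken (an assumed convention).
    """
    if n_total <= 0 or n_select <= 0:
        return []
    n_select = min(n_select, n_total)  # cannot pick more frames than exist
    step = n_total / n_select
    return [int(step * i + step / 2) for i in range(n_select)]

print(equally_spaced_frames(100, 4))  # [12, 37, 62, 87]
```

The selected frames would then be passed to the face detector, and the per-frame scores can be fused (for example, averaged) when a video-level decision is needed.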

C. EVALUATION
Three widely used metrics for deepfake detection are considered for evaluation: accuracy (ACC), Area Under the receiver operating characteristic Curve (AUC) and Equal Error Rate (EER). Note that image-level evaluation is performed since the model operates on images [46]. For comparison, we consider the following state-of-the-art architectures: ResNet-18 [47], DenseNet-121 [48], MobileNetV2 [49], EfficientNetB0 [50], and MesoNets [25]. In addition, we consider the replacement of AGFs by conventional AWFs in the proposed DSLRFN. Note that the pre-trained ImageNet weights were not used and all these models were trained on the same data as the proposed network for fair evaluation. Tables 2, 3 and 4 show the performance on FF++, CD2 and DFDC, respectively. Furthermore, we evaluate model generalizability on WDF. Since DFDC has shown the best generalizability results out of the three datasets used for training, the scores reported in Table 6 are those obtained when DFDC is used for training. An ablation study, with results shown in Table 5, is conducted to show the importance of the dual-scale block and the self-attention mechanism.
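For reference, the three image-level metrics can be computed from per-image scores as in the sketch below. It assumes label 1 denotes a fake image and that higher scores mean "more likely fake". The 0.5 accuracy threshold and the threshold sweep used for EER are illustrative choices, not details taken from the evaluation protocol.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of images whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """AUC as the probability that a fake image outscores a real one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def equal_error_rate(labels, scores):
    """Approximate EER: error rate at the threshold where FPR and FNR are closest."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_gap, best_eer = None, None
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)  # real images flagged as fake
        fnr = sum(p < t for p in pos) / len(pos)   # fake images missed
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(accuracy(labels, scores), auc(labels, scores), equal_error_rate(labels, scores))
```

On large test sets, library routines that compute the ROC curve by sorting the scores would be preferable to the quadratic pairwise AUC used here, which is kept only for clarity.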

D. DISCUSSION
In addition to the satisfactory performance of DSLRFN compared to state-of-the-art architectures, it is the most compact, using only a small number of parameters, as shown in Fig. 5. Furthermore, AGFs reduced the architecture size by 64.9% compared to AWFs. It is noteworthy that this reduction is due to the large receptive field of filters in the first layer. If larger receptive fields are used, the size will be further reduced. In contrast to architectures based on AWFs, AGFs can be directly interpreted since all of the filters are based on a single mathematical function with a parameter set ψ. Moreover, AGFs showed greater generalizability compared to AWFs, as shown in Table 6. Since we introduced the parameter α that controls the shape of Gabor stripes, we show the distribution of α for AGFs for each dataset in Fig. 6. It is clear that EGFs are more widely utilized than LGFs and CGFs. Note that previously proposed Gabor models such as naive Gabor networks [21] and deep Gabor networks [24] can be viewed as special cases of the proposed AGFs. Therefore, AGFs provide a more generalizable framework compared to previous Gabor-based CNNs.

V. CONCLUSION
In this paper, we proposed a unified Gabor function capable of producing linear, elliptical, and circular Gabor filters. The proposed function is applicable to images that have diverse shapes, in contrast to the limited traditional Gabor function. A back-propagation learning framework was adopted to allow the adaptability of the proposed function in CNNs. In contrast to conventional adaptive weighted filters, adaptive Gabor filters enable the utilization of large receptive fields without a quadratic increase in the number of learnable parameters. While deep architectures could be designed solely using the proposed function, it could also be used with adaptive weighted filters within the same architecture. While the underlying function could be applied to a variety of visual pattern recognition problems, a dual-scale large receptive field network (DSLRFN) was developed for deepfake image recognition. DSLRFN consists of a dual scale convolution block, a self-attention mechanism, a high-level embedding block, and a simple classifier. The proposed architecture demonstrated its performance compared to other state-of-the-art models on the CD2, DFDC, FF++ and WDF datasets with a substantially smaller number of parameters.