Detection Enhancement for Various Deepfake Types Based on Residual Noise and Manipulation Traces

As deepfake techniques become more sophisticated, the demand for fake facial image detection continues to increase. Various deepfake detection techniques have been introduced, but detecting all types of deepfake images with a single model remains challenging. We propose a technique for detecting various types of deepfake images using three common traces generated by deepfakes: residual noise, warping artifacts, and blur effects. We adopt a network designed for steganalysis to detect pixel-wise residual-noise traces. We also consider landmarks, the primary parts of the face where unnatural deformations often occur in deepfake images, to capture high-level features. Finally, because the effect of a deepfake is similar to that of blurring, we apply features from various image quality measurement tools that can capture traces of blurring. The results demonstrate that each detection strategy is effective and that the performance of the proposed network is stable and superior to that of existing detection networks on datasets of various deepfake types.


I. INTRODUCTION
Deepfake is a technique for creating synthetic content by naturally changing the human face in the original content using an autoencoder and a generative adversarial network (GAN) [1,2,3]. In a broad sense, deepfake refers to deformed or created content produced with deep learning methods (audio deepfakes [4], imaginary-people generation [5], etc.) to trick people. In a narrower sense, deepfake refers to an image or video of a human face that has been generated using deep learning methods and can cause malicious effects. Deepfakes in this narrow sense can be classified into three types according to the type of manipulation: face-swap, puppet-master, and attribute-change.
In face-swap deepfakes [6,7], the most common type, a person's face is pasted onto that of another person while maintaining the original person's expression. Previous face-swap methods considered only the shape, direction, and skin color of the face, regardless of the original facial expression. In deepfakes, however, face swapping is synthesized by imitating the expression. Nowadays, because anyone can easily use deepfake face-swap methods on the Internet, they are already being exploited to naturally synthesize the faces of famous celebrities into pornography. Until recently, only celebrities were targeted, because deepfake training requires many pictures of the same person. However, deepfake synthesis techniques that use only a few photos have been proposed; thus, anyone on the street can now be a victim of deepfakes. Currently, most of the harmful effects of deepfakes fall under this category, in which personal rights can be violated. Puppet-master deepfakes [8,9], also called reenactment, manipulate a target image to follow the movements of the face, head, and upper body of a source image. As this type of deepfake does not require the appearance of other faces, a more sophisticated synthesis is possible. This technique is primarily used to create fake news that synthesizes the face of a key person (president, prime minister, famous news anchor, etc.) and can cause mass confusion in society. Finally, attribute-change techniques [10,11] can manipulate a wide range of visual traits in facial images (hair color, beard, aging signs, etc.). This type of deepfake can be exploited to manipulate evidence, such as changing the facial traits of a criminal captured on camera, causing social confusion. Many studies are currently in progress to detect deepfakes that can adversely affect society, but more sophisticated deepfake creation and detection-avoidance methods are emerging [12,13].
A competitive race between more elaborate deepfake generations and more accurate detection is underway.
Deepfake images differ in the form and degree of traces because of the diversity of generating algorithms, facial characteristics, and postprocessing methods. However, some trace forms are commonly observed: fine noise generated while passing through a GAN or autoencoder, blurring caused by resizing and postprocessing, and warping caused by failures of facial geometry and illuminance prediction.
In this study, we propose a generalized detection method using traces to detect three types of deepfake (face-swap, puppet-master, and attribute-change). To improve detection performance, we developed a network based on image quality measurement (IQM) features and warping artifacts extracted from facial landmarks. Instead of using a general network of recent algorithms, such as XceptionNet [14], we propose the use of a network designed for steganalysis to capture residual noise traces in deepfake images. The experiments were performed using different types of deepfakes with public databases, and we demonstrate that the proposed network achieves performance stability and is superior to existing detection networks on datasets of various deepfake types.
Our contributions are summarized as follows:
• We propose a generalized detection method using traces to detect three types of deepfake: face-swap, puppet-master, and attribute-change.
• We developed a network based on image quality measurement (IQM) features and warping artifacts extracted from facial landmarks.
• We propose using a network designed for steganalysis to capture residual noise traces in deepfake images.

A. DEEPFAKE GENERATION METHODS
1) Face-swap
Korshunova et al. [15] suggested training a multi-scale architecture convolutional neural network (CNN) to paste faces from one image to another. RSGAN [16] performs natural face swapping by separating the hair and face in a latent space. In addition, in combination with existing 3D analysis technology, the face-swap has become more sophisticated [17,18]. Li et al. [6] suggested techniques (mask area adjustment, additional layer in auto-encoder, and effective post-processing) to obtain better quality face-swap content. Using these techniques, they built the 'Celeb-DF' deepfake dataset. The DeepFaceLab team [7] released the deepfake face-swap application. They used a GAN with an autoencoder and set of attention masks to improve the details of output images.

2) Puppet-master
Suwajanakorn et al. [8] proposed a synthesis technique for manipulating lip shape. By learning the mapping of audio features to mouth shapes, they created a fake version of a video of a Barack Obama speech using the target audio. Tripathy et al. [9] proposed a two-stage GAN using a facial attribute vector consisting of the head pose and action unit (AU). This model generates a neutral image with a central pose and neutral expression from a source image and transforms it to follow the target image's attributes. Rössler et al. [19] introduced a face manipulation dataset generated by Face2Face [20], which is a technique for facial reenactment manipulation.

3) Attribute-change
Choi et al. [10] proposed StarGAN, which uses only a single model for multiple attribute domains, and Pumarola et al. [21] used facial AU labels that allow the generation of detailed and continuous facial expression transformations. Kingma et al. [11] proposed Glow, which uses a flow-based generative model using invertible 1 × 1 convolution. Glow allows various attribute changes and exhibits a high quality.

B. DEEPFAKE DETECTION METHODS
In addition to the image forensics of general image modifications [22,23,24], several deepfake detection techniques have been proposed [25,26,27,28]. Matern et al. [29] used the color mismatch between the two eyes and noise owing to inaccurate geometric and light predictions. Yan et al. [30] detected deepfakes using the inconsistency of the 3D direction between the narrow face area and the overall head. Afchar et al. [25] proposed two simple fake-face detection networks (Meso-4 and MesoInception4) that exploit mesoscopic features. Because training is performed with a distribution in the RGB color space, Li et al. [31] changed the color space to HSV or YCbCr and detected deepfakes using the statistical differences between the color spaces. Koopman et al. [32] detected deepfake videos using photo response non-uniform noise patterns that disappear when the facial area is modified. Li et al. [33] identified deepfakes by representing the blending boundary, determined using the inconsistencies of the underlying image statistics, as grayscale images. Rössler et al. [27] constructed and published a dataset for face manipulation detection and presented experimental results using existing detection techniques. They showed that extracting only the facial region and using it as the input image improves performance, and demonstrated that XceptionNet [14] exhibited the highest performance among the detection networks used in their experiment.
Most existing methods detect specific types of deepfake or use only one strategy to detect deepfakes. However, because the deepfake types and generation methods are diverse, detecting fake faces is difficult using a single detection strategy. Furthermore, using a specific feature to detect deepfakes means that detection can be easily avoided. The proposed method combines various strategies into a single model for stable deepfake detection.

FIGURE 1: Sample dataset images
The datasets were collected or generated using representative methods, as shown in Table 1. The original and deepfake image datasets were preprocessed for face-region detection and cropping. In dataset generation, deepfake images are created by applying various deepfake generation methods to the original images. Because deepfake images are generated from various combinations of original images, the number of deepfake images is relatively larger than that of the original images. Therefore, we included the VGG face dataset [34] to equalize the numbers in training and testing. Figure 1 shows the original and fake sample images of each dataset.

A. GENERATING FACE-SWAP IMAGES
We generated face-swap images using the synthesis application provided by DeepFaceLab [7]. Because this method is based on an autoencoder with a GAN, numerous facial images of a specific person are required. The greater the variety and quantity of images of a specific person, the better the output quality. Thus, we collected 40 videos of five different people (Donald J. Trump, Moon Jae-in, Xi Jinping, Abe Shinzo, and Kim Jong-un). From these videos, we collected approximately 195 K cropped facial images and then trained a face-swap model for each person. After training, we generated 428 K face-swap images covering all swappable cases among the five people (5P2 = 20). We also included 91 K face-swap images from the Deepfake Detection Challenge (DFDC) [35] training dataset and 87 K face-swap images from Celeb-DF [6], which showed low distortions from various synthesis methods.
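The count of swappable cases is the number of ordered (source, target) pairs among the five identities; as a quick check:

```python
from math import perm

# Ordered source -> target pairs among five identities: 5P2 = 5!/(5-2)! = 20
num_identities = 5
swappable_cases = perm(num_identities, 2)
print(swappable_cases)  # 20
```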

B. GENERATING PUPPET-MASTER IMAGES
To build puppet-master datasets, we used the ICFace [9] model. We used 3,521 frames extracted from randomly selected videos in the VoxCeleb2 dataset [36] as the target image datasets. For each image, attribute vectors representing the head poses and AUs of 17 facial muscles [37] were extracted using OpenFace [38]. We then randomly sampled 70,420 images from MegaFace [39] and generated 70,420 fake images using an attribute vector. We also included 428 K images from the FaceForensics [19] dataset, which was generated using Face2Face [20], a technique for facial reenactment manipulation.

C. GENERATING ATTRIBUTE-CHANGE IMAGES
We built attribute-change datasets using the Glow [11] model. We randomly selected 38 K images from CelebA-HQ [40] as source images. In total, 13 relatively valid and natural attributes (5_o_Clock_shadow, Bags Under Eyes, Bald, Black Hair, Blond Hair, Bushy Eyebrows, Chubby, Heavy Makeup, Male, No Beard, Rosy Cheeks, Smiling, and Young) were chosen for the transformation. Consequently, we obtained 334 K face images transformed from 38 K images.

IV. DEEPFAKE IMAGE DETECTION
A. TRACES OF DEEPFAKE
1) Residual noise
Once an image pixel is modified (or synthesized) by image operations, its relationship with its neighbors is expected to change, yielding traces with a periodicity that depends on the image operations. To analyze traces of image generation, residual-domain approaches have been investigated for image forgery detection [41]. In deepfake detection, residual noise is a transfiguration trace generated by passing the image through autoencoder and GAN network filters [42].
Image residuals are affected by the transformation methods rather than by the image content. Therefore, we propose a feature extraction method using a deep learning model that can focus on the residual noise of the deepfake operation. Specifically, we propose the use of SRNet [43] to capture residual-noise traces in deepfake images. SRNet, which was designed for steganalysis, concentrates on fine signals by excluding pooling layers at the front of the network. As deepfake generation techniques develop, distinguishing a deepfake by its content becomes more difficult: differences between real and fake images, such as unnatural face collapse, edges of the pasted face, and unnatural eye and mouth behavior, become minimal in high-level features. Therefore, detecting the residual noise caused by synthesis operations is an important factor in deepfake detection.
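The idea of suppressing content to expose residual noise can be illustrated with a fixed high-pass filter. SRNet learns its front-end filters end to end, so the following NumPy sketch, which applies one classic steganalysis-style residual kernel, is only an illustration of the principle, not the paper's network:

```python
import numpy as np

def highpass_residual(img: np.ndarray) -> np.ndarray:
    """Suppress image content with a fixed high-pass kernel.

    The kernel sums to zero, so smooth content cancels out and only
    fine pixel-level deviations (residual noise) survive. SRNet learns
    such filters instead of fixing them; this is a hand-crafted stand-in.
    """
    k = np.array([[-1,  2, -1],
                  [ 2, -4,  2],
                  [-1,  2, -1]], dtype=np.float64) / 4.0
    h, w = img.shape
    pad = np.pad(img.astype(np.float64), 1, mode="edge")
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * k)
    return out

# A flat patch yields a zero residual; added noise survives the filter.
flat = np.full((8, 8), 128.0)
assert np.allclose(highpass_residual(flat), 0.0)
noisy = flat + np.random.default_rng(0).normal(0, 2, flat.shape)
assert np.abs(highpass_residual(noisy)).max() > 0.0
```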

2) Warping artifacts
Owing to inaccurate predictions of facial geometry and lighting, deepfake images contain warping artifacts, which are useful high-level features for detecting deepfakes. For example, pupils and teeth, which require detailed expression, are visually distorted in deepfake images. In addition, boundary artifacts typically appear on the forehead, chin, or edges of the face because of pasting. Warping artifacts appear because of the limitations of deepfake generation technology. Differences in the size and position of facial components, facial color, head angle, facial expression, and lighting conditions between the source and target can cause warping artifacts, as can a limited number of photos or videos and insufficient training. Warping artifacts tend to appear in semantic areas of the face, as shown in Fig. 3. Therefore, we extracted landmark image patches from semantic face regions to focus on warping artifacts.
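Patch extraction around landmark points can be sketched as follows. The paper uses the Dlib shape predictor to locate the landmarks; here the landmark coordinates are passed in as an array (the example points are hypothetical), and crops near the border are clamped to stay inside the image:

```python
import numpy as np

def extract_landmark_patches(face: np.ndarray,
                             landmarks: np.ndarray,
                             size: int = 32) -> np.ndarray:
    """Crop size x size patches centered on each landmark point.

    `landmarks` is an (N, 2) array of (x, y) points, e.g. a subset of
    the 68 points returned by Dlib's shape predictor. Crop origins are
    clamped so that every patch lies fully inside the image.
    """
    half = size // 2
    h, w = face.shape[:2]
    patches = []
    for x, y in landmarks.astype(int):
        x0 = int(np.clip(x - half, 0, w - size))
        y0 = int(np.clip(y - half, 0, h - size))
        patches.append(face[y0:y0 + size, x0:x0 + size])
    return np.stack(patches)

face = np.zeros((128, 128), dtype=np.uint8)       # 128 x 128 cropped face
pts = np.array([[10, 10], [64, 64], [120, 120]])  # hypothetical landmark points
patches = extract_landmark_patches(face, pts)
assert patches.shape == (3, 32, 32)
```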

3) Blur effects
In [26], a deepfake-like dataset was created by blurring facial regions in images. From this, we inferred that a blur-like effect exists in deepfake images. This is because of the resolution inconsistency and postprocessing that occur during the deepfake generation process. The resolution of the inputs and outputs of deepfake networks is typically fixed, whereas the resolution of the source or target image is not. Furthermore, the size of the face in an image varies according to the distance from the camera. Therefore, face-image resizing occurs frequently in the deepfake generation process and causes interpolation and blur-like effects.
Owing to the limitations of deepfake generation techniques, the output of a deepfake typically has unnatural features. In particular, the output of a deepfake generation network often leaves boundary traces owing to the discontinuity between the source and target faces [6], and textured noise often occurs in the output. Therefore, postprocessing typically includes blurring, among other methods, for naturalness. Consequently, blur-like effects often occur in deepfake images, and we exploited this as a trace of deepfakes. Blurring does not make a significant difference when applied to an already-blurred image. Based on this observation, we applied a Gaussian filter to face-only images and compared each face-only image with its blurred version using IQM tools.
As shown in Table 2, we used the following 17 IQM tools: Laplacian blur variance (LPV), high-low frequency index (HLFI) [48], spectral phase error (SPE) [49], spectral magnitude error (SME) [49], gradient-magnitude error (GME) [44], gradient-phase error (GPE) [44], structural content (SC) [46], average difference (AD) [46], mean square error (MSE) [47], signal-to-noise ratio (SNR) [50] in dB, normalized absolute error (NAE) [46], peak signal-to-noise ratio (PSNR) [51], Laplacian MSE (LMSE) [46], maximum difference (MD) [46], R-averaged maximum difference (RAMD) [47], normalized cross-correlation (NCC) [45], and visual information fidelity (VIF) [52]. Owing to the complexity of its formula, VIF, an image quality assessment index that uses natural scene statistics to quantify the loss of image information, is excluded from Table 2. In Table 2, G is the gradient of an image, and G_M and G_P denote the magnitude and phase of G, respectively. Moreover, F indicates the Fourier transform operation, and F_l and F_h denote the low and high frequencies in the Fourier domain, respectively. In addition, φ indicates the phase in the Fourier domain, and L is the Laplacian filter.

Figure 4 shows example histograms for two IQM feature values: PSNR and LMSE. The red area represents the feature values of deepfake images, and the blue area represents those of the originals. In PSNR, a higher value represents a smaller difference between the two compared images, whereas a higher LMSE value represents a larger difference. We found that deepfake images tend to have a blurred effect because they are less affected by the Gaussian filter. Therefore, we applied IQM features to the proposed network to capture blur-like traces in deepfake images.

Figure 2 illustrates the overall architecture of the proposed network for deepfake detection, where LT and LB denote the layer type and landmark block, respectively. The numbers in parentheses are kernel numbers.
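Two of the 17 IQM measures (PSNR and LMSE) can be sketched in NumPy, computed between a face image and its Gaussian-blurred copy; the kernel size of 3 and sigma of 0.5 follow the blur settings reported later in the paper, while the helper functions and the naive convolution are our own illustration, not the paper's implementation:

```python
import numpy as np

def conv2_same(img, k):
    """2-D 'same' convolution with edge padding (for 3x3 kernels)."""
    pad = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(pad[i:i + 3, j:j + 3] * k)
    return out

# 3x3 Gaussian kernel, sigma = 0.5 (the blur setting used in the paper)
ax = np.array([-1.0, 0.0, 1.0])
g1 = np.exp(-ax**2 / (2 * 0.5**2))
gk = np.outer(g1, g1)
gk /= gk.sum()
LAP = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means more similar."""
    mse = np.mean((a - b) ** 2)
    return np.inf if mse == 0 else 10 * np.log10(peak**2 / mse)

def lmse(a, b):
    """Laplacian MSE: normalized error energy of Laplacian responses."""
    la, lb = conv2_same(a, LAP), conv2_same(b, LAP)
    return np.sum((la - lb) ** 2) / np.sum(la ** 2)

rng = np.random.default_rng(0)
face = rng.uniform(0, 255, (64, 64))   # stands in for a sharp real face crop
blurred = conv2_same(face, gk)
# A sharp original changes a lot under blurring (low PSNR);
# an already-blurred (deepfake-like) image changes much less.
assert psnr(blurred, conv2_same(blurred, gk)) > psnr(face, blurred)
```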

B. DETECTION NETWORK
To capture residual noise, we adopt the SRNet architecture [43] as the base network. SRNet is used primarily in steganalysis to concentrate on fine signals at the pixel level; its key idea is not to reduce the dimensions at the front of the network, achieved by excluding pooling. In the experimental section, we demonstrate that SRNet can effectively capture noise traces in deepfake images. The base network is optimized to detect fine signals but does not focus on high-level features such as warping artifacts.
To detect warping artifacts, we extracted 14 landmark patches where the warping artifacts primarily appeared and used them as the input of the LB. Unlike the front part of the base network, a pooling layer was added to the LB because it does not need to preserve residual noise.
The output of the LB was concatenated with the output of LT 3 to deliver the warping artifact information. After LT 2, without a skip connection, we used global average pooling to prevent overfitting and reduce the number of neurons in the fully connected layer. Finally, to detect traces of blur effects, the 17 IQM features computed from the target and blurred images were concatenated with the output of the global average pooling after passing through a fully connected layer. Subsequently, the output passed through the last fully connected layer for two-class classification.
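The final fusion step described above can be sketched with array shapes. The channel and feature dimensions below are illustrative placeholders, not the network's exact sizes; only the 17 IQM features come from the paper:

```python
import numpy as np

def fuse_features(base_maps, lb_vec, iqm_vec):
    """Fuse the three branches before the final classification layer:
    global average pooling (GAP) over the base network's feature maps,
    then concatenation with the landmark-block output and the
    (FC-projected) IQM feature vector."""
    gap = base_maps.mean(axis=(1, 2))  # GAP: (C, H, W) -> (C,)
    return np.concatenate([gap, lb_vec, iqm_vec])

base_maps = np.random.rand(256, 16, 16)  # base-network feature maps (assumed shape)
lb_vec = np.random.rand(64)              # landmark-block output (assumed size)
iqm_vec = np.random.rand(17)             # 17 IQM features after an FC layer
fused = fuse_features(base_maps, lb_vec, iqm_vec)
assert fused.shape == (256 + 64 + 17,)   # fed into the last FC layer
```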

V. EXPERIMENTAL RESULTS
We used 2.88 M images to train and test the proposed network. The size of the fake and original training sets was 2.55 M. For the tests, 320 K fake and original images were used. The training and test sets do not include the same subjects, so that training-data leakage does not occur during the training process. For each deepfake dataset experiment, the original images were randomly imported such that the ratio of fake to original images was 1:1. We used the Dlib [53] library for preprocessing, which included facial landmark detection and region cropping. As described in Fig. 3, 12 facial landmark patches of size 32×32 were extracted around the landmark points.
The resolution of the input images was 128 × 128. While constructing the deepfake image dataset for our experiment, we collected deepfake images from FHD (1920 × 1080) videos and from the DFDC [35] and FaceForensics [19] datasets (> 640 × 480). We then cropped the facial area from the images in these datasets, which reduced the input image size. Because we pre-extracted the facial area, the input size could be considerably smaller than the original size. In addition, most of the cropped facial images had a size of 128 × 128; therefore, we unified the input size at this value.
The model was trained for 10 epochs with a batch size of 32. The binary cross-entropy (BCE) loss function was used with a learning rate of 0.001. We used the Adam optimizer for smooth learning-rate adjustment and initialized the weights using Kaiming initialization [54].

A. COMPARISON OF THE RESIDUAL-NOISE DETECTOR
Residual noise generated when producing deepfakes is fine noise at the pixel level. To detect it, we propose the use of SRNet, which was designed for steganalysis. To demonstrate the efficiency of the residual-noise feature in deepfake detection, a comparative experiment was conducted with existing well-known CNN models. For comparison, we trained and tested VGG [55], ResNet [56], DenseNet [57], XceptionNet [14], and SRNet [43] on our datasets. The environmental settings were the same as for the proposed network. Table 3 lists the accuracy of the CNN models for each deepfake type. SRNet, which focuses on fine noise, exhibits better performance than existing CNN models that analyze high-level features. The puppet-master technique does not change the face of the target person but changes only the expression; therefore, a significant difference in appearance is not observed compared with the other deepfake generation methods. This is why the puppet-master type shows the largest difference in detection performance between SRNet and the other network models.

B. BOOSTING PERFORMANCE USING THE PROPOSED FEATURES
We propose deepfake image detection using warping artifacts, blur effects, and residual noise. Landmark patches were applied to the network to detect warping artifacts, and IQM features between the original and blurred images were used to capture the blur effect. For blurring, a Gaussian filter with a kernel size of 3 and a standard deviation of 0.5 in the X direction was used. We tested the effectiveness of our strategy by applying it to XceptionNet and SRNet, which exhibited high performance in the previous experiments. In XceptionNet, we concatenated the result of the landmark block to the middle of the exit flow and concatenated the IQM features to the input of the final fully connected layer. Table 4 shows the effect of applying the warping-artifact and blur-effect features to the networks. The results show that each strategy is effective in detecting deepfakes. Furthermore, the network that combined all of the methods showed the highest performance in total.

C. COMPARISON TEST AND DISCUSSION
We conducted a comparative experiment between the proposed method and other deepfake-detection techniques. In this experiment, SRNet was used as the base network of the proposed method. Li et al. [26], Afchar et al. [25], and Rössler et al. [27] with XceptionNet, which exhibited the highest performance on FaceForensics++, were used for comparison. The proposed models, XceptionNet, and Meso-4 were trained using our datasets. For ResoNet, we used a pretrained model for the tests. Table 5 lists the accuracy and area under the receiver operating characteristic curve (AUROC) of the network models for each type of deepfake, and Fig. 5 illustrates the receiver operating characteristic (ROC) curves of each network model per dataset. Our proposed networks have higher accuracies and AUROCs than the existing deepfake-detection networks on all datasets. The SRNet+IQM method exhibited higher performance than the SRNet+LM+IQM method on datasets that have relatively more blur-like traces, such as Celeb-DF. Similarly, the SRNet+LM method was more effective than the SRNet+LM+IQM method on datasets that have few traces of blurring and more warping artifacts, such as FaceForensics. When the accuracy of SRNet+LM or SRNet+IQM is the highest, the gap between the highest accuracy and that of the SRNet+LM+IQM method is always small. However, on the DFDC and Celeb-DF datasets, the SRNet+LM+IQM method exhibited an accuracy approximately 4% higher than those of SRNet+LM and SRNet+IQM, respectively. That is, the detection performance of the SRNet+LM and SRNet+IQM methods depends relatively more on the deepfake generation algorithm, whereas the SRNet+LM+IQM method exhibits stable detection performance on various deepfake datasets. This is why the SRNet+LM+IQM method shows the highest accuracy and AUROC values over the entire dataset.
In terms of time consumption, the IQM feature extraction of our method required 1.14 seconds, and extracting the landmark patches required 1.06 seconds; both can be drastically shortened using parallel computation. Table 6 lists the network execution times (on an NVIDIA RTX 2080Ti GPU) and the number of network parameters. Compared with XceptionNet, which showed the best performance on FaceForensics++, our final network (SRNet+LM+IQM) achieved better performance with approximately 60% of the execution time and 20% of the parameters. Furthermore, the additional network parameters and execution time required to detect blur traces and warping artifacts were only approximately 157 K and 0.001 s, respectively.
Limitations of state-of-the-art deepfake detectors have been reported, including a lack of generalization caused by focusing on specific artifact identification and rapid performance decreases as deepfake quality improves [58]. Addressing this, this study alleviated the generalization problem by detecting deepfakes using various strategies, and insignificant performance degradation was observed when experimenting with sophisticated deepfake images created using the latest generation methods and datasets. Despite these strengths, predictions in some cases still fail. For example, an original image can be determined to be a deepfake because of the rapid movement of a face or an overlapped image that occurs at a scene change when dividing a video into a frame-by-frame sequence. In addition, owing to the strategy of detecting blur effects, blurred images that are not deepfakes are sometimes incorrectly detected. Signal-processing attacks, such as compression-quality changes and noise additions, can also interfere with detection.
We discuss adversarial attacks in terms of noise addition. Adversarial attacks cause errors in a model by adding perturbations to the image; such a perturbation is a type of noise that is extremely small yet has a fatal effect on the model [59]. Szegedy et al. [59] indicated that deep neural network (DNN)-based models are vulnerable to adversarial attacks, and [58] mentioned the problems and limitations posed by adversarial attacks when using deep learning models to detect deepfakes. Adversarial attacks [60,61] and defense techniques [62,63] are being actively studied. In future studies, we can improve robustness against adversarial attacks using methods such as adversarial training [59] and defensive distillation [61]. Finally, methods more advanced than the Dlib library we used in the landmark-patch extraction process are available; we can expect to further improve performance using state-of-the-art face detection technology, such as MediaPipe [64], in future studies.

VI. CONCLUSION
In this study, we proposed a generalized detection method to detect three types of deepfake techniques: face-swap, puppet-master, and attribute-change. We exploited three types of common traces (residual noise, warping artifacts, and blur effects) generated by the deepfake process and applied them to the proposed network for deepfake detection. First, a network designed for steganalysis was adopted as the base network to detect residual noise. Second, landmark patches were extracted from the semantic facial regions to detect warping artifacts, which are unnatural high-level features. Finally, we applied IQM features to capture the statistical characteristics of the blur-like effects of a deepfake. The results revealed that each detection strategy is effective and that the performance of the proposed network is superior to that of existing networks.
From an additional perspective, we focused on identifying deepfakes' common traces, which are difficult to bypass. Because detecting image-based traces is more difficult to bypass than detecting traces of time-based inconsistencies in deepfakes, we targeted image features. Because a deepfake video inherits residual features from image operations, our approach can be directly adopted in deepfake video detection pipelines based on frame-by-frame detection. Based on the proposed method, we plan to expand this study to include a deepfake video detection method.

SANGYEONG LEE is a senior majoring in big data at Hallym University, South Korea. She is an undergraduate researcher at the Multimedia Computing Laboratory. Her main interests are multimedia forensics, computer vision, and deep learning.
DAEHEE JANG is an assistant professor in the Security Engineering Department at Sungshin W. University. He received his Ph.D. in Information Security from KAIST in 2019 and worked as a postdoctoral researcher at Georgia Tech until 2020. He has participated in various global hacking competitions (such as DEFCON CTF) and won several awards. He received a special prize at the 2016 KISA annual event for finding 0-day security vulnerabilities in many software products. He is also the founder of the pwnable.kr wargame, an education platform for training hacking skills.
JONG-UK HOU received the B.S. degree in information and computer engineering from Ajou University, South Korea, in 2012, and the M.S. and Ph.D. degrees from KAIST, South Korea, in 2014 and 2018, respectively. He has been an Assistant Professor in the School of Software, Hallym University, since 2019, and is the Principal Investigator of the Multimedia Computing Laboratory. His major interests include various aspects of information hiding, point cloud processing, computer vision, machine learning, and multimedia signal processing.