MSRD-CNN: Multi-Scale Residual Deep CNN for General-Purpose Image Manipulation Detection

The authenticity of digital images is a major concern in multimedia forensics due to the availability of advanced photo editing tools and devices. Several image forensic methods in the literature detect specific image processing or editing operations. However, designing a universal forensic method that can detect multiple image editing operations remains challenging. In this paper, a novel Multi-Scale Residual Deep CNN (MSRD-CNN) is designed to learn image manipulation features adaptively for the detection of multiple image manipulations. Our network comprises three stages: pre-processing, hierarchical high-level feature extraction, and classification. First, a multi-scale residual module is employed in the pre-processing stage to extract the prediction error or noise features adaptively. Next, the obtained noise features are processed by a feature extraction network with multiple Feature Extraction Blocks (FEBs) to extract high-level image tampering features. Finally, the resulting feature map is passed to a fully-connected dense layer for classification. The experimental results show that our model surpasses existing schemes, even under anti-forensic attacks, when evaluated on large-scale datasets covering multiple image processing operations. The proposed network provides overall classification accuracies of 97.07% and 97.48% on the BOSSBase and Dresden datasets, respectively.


I. INTRODUCTION
Digital information can be shared in the form of audio, images, and video on social media platforms such as Facebook, Instagram, and Snapchat. The advent of powerful editing software has significantly increased the number of tampered images on social media, used for political or personal attacks, publicity, and similar purposes. The authenticity of digital images is therefore crucial. Moreover, the investigation of digital images plays an important role in many fields, including medicine, news media, scientific exploration, law, and crime [1]-[3]. Thus, it is a concern of great importance in multimedia forensics.
The detection of different image processing operations is highly relevant to the forensic community because a counterfeiter may use these operations when creating an image forgery. It is generally understood that different image processing operations embed special artifacts or footprints in the processed image. Several forensic algorithms have been designed to detect a particular image processing operation by analyzing the corresponding artifacts. The operations considered include resampling [4]-[8], JPEG compression [9]-[12], median filtering [13]-[16], and contrast enhancement [17]-[20]. In addition, many anti-forensic approaches targeting operations such as JPEG compression [21], [22], median filtering [23], and contrast enhancement [24] have been proposed to mislead forensic techniques by concealing the footprints of the corresponding image processing operations.
The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.
Researchers have also developed general-purpose image manipulation detection schemes that detect several different image processing operations [25]-[30]. Recent works on multi-purpose image tampering detection are based on deep learning techniques, in particular Convolutional Neural Networks (CNNs).
These CNNs have demonstrated the ability to learn image manipulation features automatically from data. A CNN based on a novel constrained convolutional layer is proposed in [25] to detect multiple image processing operations by suppressing the image content information, and the authors further optimized their constrained neural network in [28] for better performance. In [26], a densely connected CNN based on an isotropic constraint is proposed for general-purpose image forensics, taking anti-forensic attacks into account. The isotropic convolutional layer works as a high-pass filter that highlights the artifacts of image processing operations by suppressing the image content information. Moreover, an image manipulation detection approach built upon [25] and combined with a deep Siamese CNN is presented in [27]. However, that work does not identify the specific image manipulation; it classifies whether the two images of an input patch pair were identically processed or not. In [29], the Xception architecture is employed to classify multiple image processing operations on small-sized images. Most of the existing general-purpose forensic techniques can be easily circumvented by anti-forensic attacks. Recently, a universal image manipulation detection approach based on a densely-connected CNN was proposed in [30]; its evaluation covers most image processing operations, including various anti-forensic techniques. However, the proposed CNN differs significantly from the approach in [30] in terms of both the network architecture and the image manipulation datasets used.
Overall, designing a unified forensic scheme capable of detecting different image manipulations under different attacks is still a challenging task for researchers. Also, to the best of our knowledge, the existing works have not performed any cross-dataset testing to evaluate the generalization of their models. In this work, we present a novel and effective image manipulation detection approach capable of detecting multiple editing operations, including anti-forensic methods. The main contributions of our work are as follows:
• We propose a novel method, MSRD-CNN, for general-purpose image manipulation detection.
• Inspired by Res2Net [31], we propose a multi-scale residual module to obtain efficient noise features adaptively. Further, the obtained noise features are processed by FEBs to extract the high-level image manipulation features.
• We consider several image processing operations, including anti-forensic schemes, with arbitrary parameters to evaluate our network. The extensive experimental results show that our MSRD-CNN provides better accuracy than the existing methods, even in cross-dataset settings.
The remainder of the paper is organized as follows: Section II gives a detailed description of the proposed network, Section III discusses the experimental results, and Section IV concludes our work.

II. PROPOSED MSRD-CNN ARCHITECTURE
In this section, we propose a novel MSRD-CNN architecture capable of detecting the traces of multiple image processing operations and anti-forensic techniques. The architecture of MSRD-CNN, shown in Fig. 1, comprises three stages: extraction of noise features using a multi-scale residual module, a feature extraction network that extracts high-level features related to image tampering artifacts, and classification.

A. MULTI-SCALE RESIDUAL MODULE
Most image manipulation detection schemes suppress the content information of an input image to highlight the image manipulation artifacts. Rather than applying fixed filters to the input image before the CNN to extract prediction error features, it is preferable to employ a trainable filtering scheme for pre-processing, which can potentially learn more appropriate image manipulation features adaptively for image forensic tasks. In our approach, we use a data-driven pre-processing scheme that consists of a two-layer CNN and a multi-scale residual module. Each convolution layer in the two-layer CNN contains 64 filters of size 3 × 3, followed by batch normalization and a ReLU layer. This two-layer CNN is employed to obtain better input features for the multi-scale residual module. Let us denote the functions of these two convolution layers by $C_1(\cdot)$ and $C_2(\cdot)$, respectively. For a given input image $I$ of size 256 × 256, the output of this two-layer CNN is formulated as:

$$I_{C_1 C_2} = C_2(C_1(I)) \tag{1}$$

This output $I_{C_1 C_2}$, of size 256 × 256 × 64, is then passed to the multi-scale residual module, which is inspired by Res2Net [31] and designed to learn suitable noise features. The proposed multi-scale residual module explores a multi-scale feature representation by dividing the input features of size 256 × 256 × 64 along the channel axis, which results in four groups of size 256 × 256 × 16. These groups are then interconnected in a hierarchical residual-like style, as shown in Fig. 1(b). Each group is further processed by a Convolutional Block (CB) having two convolution layers with 16 filters of size 3 × 3, each followed by batch normalization and ReLU layers. The output feature maps of the first CB are added to the second group before being passed to the second CB, as shown in Fig. 1(b). Let $x_i$ represent the feature maps of the $i$-th group, where $i \in \{1, 2, 3, 4\}$, and let $H_i(\cdot)$ be the function performed by the convolutional block of the $i$-th group.
The output of $H_i(\cdot)$, denoted $y_i$, is added to the group $x_{i+1}$ and passed to the $(i+1)$-th convolutional block $H_{i+1}(\cdot)$, as given in Eq. (2):

$$y_i = \begin{cases} H_i(x_i), & i = 1 \\ H_i(x_i + y_{i-1}), & 1 < i \le 4 \end{cases} \tag{2}$$
The outputs of all the convolutional blocks are concatenated and passed to a convolution layer having 64 filters of size 1 × 1. The output of this convolutional layer is subtracted from the input of the multi-scale residual module to obtain the noise features $\mathrm{MSRM}(I_{C_1 C_2})$, where $\mathrm{MSRM}(\cdot)$ denotes the function performed by the multi-scale residual module. The feature extraction blocks further process these noise features to extract the high-level image manipulation features. Note that the spatial size of the features (height and width) remains the same throughout the pre-processing stage; only the number of channels changes.
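The data flow of the multi-scale residual module described above (channel split, hierarchical residual-like connection, concatenation, fusion, and subtraction from the input) can be sketched on a single pixel's channel vector. This is only an illustrative sketch in plain Python, not the authors' implementation: the helper names are hypothetical, and the convolutional blocks and the 1 × 1 fusion layer are replaced by toy stand-in functions so that the connection pattern can be followed by hand.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def split_groups(x: Sequence[float], groups: int) -> List[Vector]:
    """Split a channel vector into equally sized groups along the channel axis."""
    if len(x) % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = len(x) // groups
    return [list(x[g * size:(g + 1) * size]) for g in range(groups)]


def multi_scale_residual(x: Sequence[float], blocks: Sequence[Block], fuse: Block) -> Vector:
    """Apply the hierarchical connection, fuse the outputs, and subtract from the input."""
    outputs: List[Vector] = []
    previous: Vector = []
    for x_i, h_i in zip(split_groups(x, len(blocks)), blocks):
        # The first group goes straight into its block; every later group is
        # first summed with the output of the preceding block.
        block_input = x_i if not previous else [a + b for a, b in zip(x_i, previous)]
        previous = h_i(block_input)
        outputs.append(previous)
    concatenated = [value for y in outputs for value in y]
    fused = fuse(concatenated)  # stand-in for the 1 x 1 convolution
    return [a - b for a, b in zip(x, fused)]


# Toy stand-ins (assumptions): each "block" halves its input, fusion is the identity.
halve: Block = lambda v: [0.5 * t for t in v]
identity: Block = lambda v: list(v)

noise = multi_scale_residual([1.0] * 64, [halve] * 4, identity)
print(noise[0], noise[16], noise[32], noise[48])
```

With these toy stand-ins, later groups accumulate contributions from all earlier groups, which is the multi-scale effect the hierarchical connection is meant to provide; in the actual network each block is a pair of 3 × 3 convolutions operating on full feature maps.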

B. FEATURE EXTRACTION NETWORK
The noise features obtained from the multi-scale residual module are passed to the feature extraction network to extract the high-level image manipulation features. This feature extraction network has four FEBs, and each FEB ($F_B$) is based on a residual skip connection containing two regular convolution layers of size 3 × 3 and a 1 × 1 convolution layer. The input of an FEB is added to the output of the second convolution layer, followed by an average pooling operation, as shown in Fig. 1. The output of this feature extraction network, i.e., $I_{F_B}$, is further processed by two convolution layers, each having 64 filters of size 3 × 3, to obtain more relevant image manipulation features. The first convolution layer is followed by batch normalization and ReLU, and the second convolution layer is followed by batch normalization. Afterward, an average pooling layer with filter size 4 × 4 and stride 4 is applied to reduce the feature dimension.
Lastly, the global features obtained after the average pooling layer are fed to a fully-connected (FC) layer with 11 neurons, one per class considered for classification (the original image and the 10 image processing operations). We use the softmax function to obtain the probabilities of the predicted classes and the cross-entropy function to calculate the overall network loss.
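The classification step, a softmax over the 11 outputs of the FC layer followed by a cross-entropy loss, can be illustrated with a small standard-library sketch. The function names and the example logits are illustrative assumptions; a deep learning framework would normally provide these operations directly.

```python
import math
from typing import List, Sequence

NUM_CLASSES = 11  # number of FC outputs stated in the text


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw FC outputs into class probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])


# Example with uninformative (all-zero) logits: every class is equally likely,
# so the loss equals log(NUM_CLASSES).
uniform_logits = [0.0] * NUM_CLASSES
print(softmax(uniform_logits)[0], cross_entropy(uniform_logits, target=3))
```

A loss close to log(11) early in training therefore indicates that the network is still guessing uniformly among the 11 classes.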

C. COMPARISON WITH GIMD-NET
The proposed CNN differs significantly from the existing GIMD-Net approach [30] in terms of both the network architecture and the image manipulation datasets used. The model proposed in [30] is inspired by DenseNet [32] and employs local and global residual learning to extract high-level image manipulation features using residual dense blocks (RDBs). In contrast, the proposed MSRD-CNN is inspired by Res2Net [31]: it learns prediction error features adaptively to highlight the image manipulation artifacts and then extracts high-level hierarchical image tampering features using a feature extraction network. In [30], no pre-processing is used to extract the noise features, whereas we propose a pre-processing stage (the multi-scale residual module) to extract the noise features adaptively. The RDBs used in [30] are fused globally, and the convolutional layers in each RDB are densely connected to ease training and optimization. Instead of using global fusion, the FEBs in the proposed method are connected sequentially to extract the high-level features, and the convolutional layers in each FEB are employed without dense connectivity. Further, the image processing operations used in [30] to create the image manipulation datasets are based on fixed parameters. We, on the other hand, have created image manipulation datasets based on arbitrary parameters, as shown in Table 1. Therefore, the dataset considered in this work for evaluating model performance is more challenging than that of [30].

III. EXPERIMENTAL RESULTS
We conducted extensive experiments to evaluate the performance of the proposed model in detecting multiple image processing operations and various anti-forensic attacks. First, to confirm the multi-purpose nature of our MSRD-CNN, we considered 10 image processing operations along with the corresponding parameters listed in Table 1. The image processing parameters are selected randomly to create more challenging image manipulation datasets. For instance, for JPEG compression, we compress the original images with a Quality Factor (QF) selected randomly in the range from 60 to 90. We consider the BOSSBase [33] and Dresden [34] image datasets for the evaluation of the different image tampering detection approaches. The standard BOSSBase dataset comprises 10,000 grayscale images of resolution 512 × 512 in PGM format. We transformed these PGM images into PNG format for evaluation purposes. The standard Dresden dataset contains 1,491 raw images of size 3008 × 2000 in NEF format. We converted these raw images into PNG format for evaluation. Our model is implemented using the PyTorch 1.8 deep learning framework, and all the experiments are performed on a Tesla V100 GPU with 32 GB RAM. We compared our network with recent multi-purpose image tampering detection methods [26], [28]-[30] in terms of detection accuracy. We also assessed our model's robustness and generalization by performing cross-dataset testing. The experimental results demonstrate the efficacy of the proposed model in comparison to the existing image manipulation detection methods. All the relevant code is available on request for reproducibility and research advancement.
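As a minimal sketch of the random parameter selection described above, the following draws one JPEG quality factor per image from the stated range of 60 to 90. Treating both endpoints as inclusive, drawing integer values uniformly, and the fixed seed are assumptions made for illustration; the function name is hypothetical and the actual compression step is omitted.

```python
import random
from typing import List


def sample_quality_factors(num_images: int, low: int = 60, high: int = 90,
                           seed: int = 0) -> List[int]:
    """Draw one integer quality factor per image, uniformly from [low, high]."""
    rng = random.Random(seed)  # a local generator keeps the draw reproducible
    return [rng.randint(low, high) for _ in range(num_images)]


quality_factors = sample_quality_factors(num_images=5)
print(quality_factors)
```

Using a seeded local generator rather than the module-level functions means the same manipulated dataset can be regenerated later for comparison experiments.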

A. MULTIPLE IMAGE MANIPULATION DETECTION
In this subsection, we evaluate the performance of our MSRD-CNN in detecting multiple image processing operations, including anti-forensic techniques, using the BOSSBase and Dresden datasets. We created one original image (OR) dataset and 10 tampered image datasets using the image processing operations listed in Table 1, taking 4,167 and 1,333 images sequentially from the BOSSBase dataset for training and testing, respectively. We extracted 4 patches of size 256 × 256 from each of these images, which results in 16,668 training and 5,332 testing images for each of the image processing operations. In total, we obtained a dataset of 242,000 grayscale images. We used 183,348 images (including 16,668 original images) for training and the remaining 58,652 images (including 5,332 original images) for testing. Note that we follow the strategy used in existing work [28] to create the image manipulation datasets corresponding to the different image manipulation operations, which makes the comparison feasible. For this reason, we used only 4,167 and 1,333 images from the BOSSBase dataset for training and testing, respectively. Note also that the complete BOSSBase dataset is not used because of limited computational resources: we consider 10 image manipulation methods, including anti-forensic approaches, which are highly compute-intensive and time-consuming.
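The BOSSBase dataset sizes quoted above follow from simple bookkeeping: images per split, times 4 patches per image, times 11 classes (the original class plus 10 manipulations). The short sketch below reproduces that arithmetic; the function name is illustrative.

```python
from typing import Tuple


def dataset_sizes(num_images: int, patches_per_image: int, num_classes: int) -> Tuple[int, int]:
    """Return (patches per class, patches over all classes) for one split."""
    per_class = num_images * patches_per_image
    return per_class, per_class * num_classes


train_per_class, train_total = dataset_sizes(4167, patches_per_image=4, num_classes=11)
test_per_class, test_total = dataset_sizes(1333, patches_per_image=4, num_classes=11)

print(train_per_class, train_total)  # patches per class and in total for training
print(test_per_class, test_total)    # patches per class and in total for testing
print(train_total + test_total)      # size of the full dataset
```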
We also evaluated our network using 881 images from the Dresden dataset. We follow the same strategy as for the BOSSBase dataset when preparing the image manipulation datasets with the different image processing operations. We used 667 images for training and 214 images for testing the considered neural networks. All of these images are cropped from the center to obtain a sub-image region of size 1280 × 1280. Afterward, 25 patches of size 256 × 256 are extracted from each sub-image region and converted into grayscale format. We thus obtained approximately 16,668 images for training and approximately 5,332 images for testing for each of the image processing operations listed in Table 1. Our network is trained using the Adam optimizer with a learning rate of 0.001, for 100 epochs in each experiment.
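The Dresden patch extraction described above can be sketched as a center crop followed by a grid of patches. The sketch assumes the 25 patches form a non-overlapping 5 × 5 grid, which is consistent with 1280 / 256 = 5 but is not stated explicitly in the text; the function names are hypothetical, and boxes are given as (left, top, right, bottom) pixel coordinates.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def center_crop_box(width: int, height: int, crop: int) -> Box:
    """Coordinates of a crop x crop square centered in a width x height image."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return (left, top, left + crop, top + crop)


def patch_boxes(region: Box, patch: int) -> List[Box]:
    """Non-overlapping patch x patch boxes tiling the region, row by row."""
    left, top, right, bottom = region
    return [
        (x, y, x + patch, y + patch)
        for y in range(top, bottom - patch + 1, patch)
        for x in range(left, right - patch + 1, patch)
    ]


region = center_crop_box(width=3008, height=2000, crop=1280)
boxes = patch_boxes(region, patch=256)
print(region, len(boxes), boxes[0], boxes[-1])
```

Under this assumption, 667 training images yield 667 × 25 = 16,675 patches and 214 testing images yield 5,350 patches per operation, in line with the approximate figures of 16,668 and 5,332 given above.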
We computed confusion matrices for our model over the multiple image processing operations for the BOSSBase and Dresden datasets, as shown in Tables 2 and 3. Our MSRD-CNN provides average accuracies of 97.07% and 97.48% on the BOSSBase and Dresden datasets, respectively, when evaluated on multiple image processing operations. Table 2 reveals that the proposed network gives an accuracy greater than 97% for each image processing operation on the BOSSBase dataset, except for the original and contrast-enhanced (CE) images, whose accuracies are 87.92% and 90.15%, respectively. Table 3 demonstrates that our approach identifies each image processing operation on the Dresden dataset with an accuracy greater than 97%, except for the original and contrast-enhanced images, with 92.22% and 85.03%, respectively. Moreover, the robustness of our model is confirmed by the high accuracies it achieves against the different anti-forensic approaches on both datasets.
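For readers reproducing this kind of evaluation, per-class accuracy is the diagonal entry of the confusion matrix divided by its row sum, and the average accuracy is the mean over classes. The sketch below uses a made-up 3-class matrix of counts purely for illustration; its values are not taken from Tables 2 or 3, and the function names are hypothetical.

```python
from typing import List, Sequence


def per_class_accuracy(confusion: Sequence[Sequence[int]]) -> List[float]:
    """Fraction of each true class (row) that was predicted correctly (diagonal)."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


def average_accuracy(confusion: Sequence[Sequence[int]]) -> float:
    """Mean of the per-class accuracies."""
    accuracies = per_class_accuracy(confusion)
    return sum(accuracies) / len(accuracies)


# Hypothetical counts: rows are true classes, columns are predicted classes.
toy_confusion = [
    [90, 6, 4],
    [2, 97, 1],
    [0, 1, 99],
]
print(per_class_accuracy(toy_confusion), average_accuracy(toy_confusion))
```

When every class has the same number of test samples, as with 5,332 test images per class on BOSSBase, this class-averaged value coincides with the overall fraction of correctly classified images.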
We also conducted an experiment in which the training sets of the BOSSBase and Dresden datasets are combined. Combining the two training sets increases the model accuracy further, likely because of the larger training set and/or its greater diversity. The testing accuracy increases from 97.07% to 97.38% on the BOSSBase test set and from 97.48% to 98.11% on the Dresden test set. However, the training time increases significantly due to the larger amount of training data.

B. COMPARATIVE ANALYSIS WITH EXISTING APPROACHES
We compared our MSRD-CNN with existing multi-purpose forensic schemes [26], [28]-[30] on multiple image processing operations, including anti-forensic techniques, using the same training and testing datasets as defined in Section III-A. For ease of comparison, Table 4 lists the diagonal entries of the confusion matrices for the different methods. As shown in Table 4, the proposed model provides better detection than the existing approaches for all the considered image manipulations except the GB, JPEGAF [22], and CEAF [24] operations when tested on the BOSSBase dataset. Similarly, our network achieves better detection accuracy for all image manipulations except the JPEG, GB, and CE operations on the Dresden dataset. Note, however, that for the GB and CEAF [24] operations on the BOSSBase dataset, our model is second best and is around 0.2% lower than the best performing method. Also, for the JPEG and GB operations on the Dresden dataset, our method is 0.02% and 0.17% lower than the best performing method, respectively. Moreover, Table 4 shows that our model outperforms the recent deep learning based scheme [30], with average accuracy improvements of 1.04% and 1.48% on the BOSSBase and Dresden datasets, respectively.

C. PERFORMANCE EVALUATION BASED ON CROSS DATASET IMAGES
In this subsection, we evaluate the performance of our network on cross-dataset testing images. In the first experiment, the considered models, trained on the BOSSBase training images, are applied to the Dresden test images. Similarly, we also perform experiments with the Dresden training images and the BOSSBase test images. The average accuracies of these cross-dataset testing experiments are presented in Table 5. Our MSRD-CNN architecture outperforms the recent multi-purpose forensic schemes, providing higher detection accuracies of 86.49% and 81.40% for BOSSTrain-DRESTest and DRESTrain-BOSSTest, respectively. Table 5 also shows that none of the considered forensic methods performs well on the original images. This is because such models, including the proposed one, focus on the artifacts that image manipulation operations introduce into the image, whereas the original images contain no manipulation artifacts, only camera fingerprint-related features. Moreover, the original images of the two datasets were acquired with different camera models/devices. Therefore, we also provide the overall average accuracies excluding the original images, as shown in Table 5. These results also favour the proposed MSRD-CNN, with accuracies of 95.1% and 87.7% in the two settings considered. This highlights the overall best generalization ability of the proposed approach.
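The two averages discussed above, with and without the original (OR) class, differ only in which per-class accuracies enter the mean. The sketch below shows the computation on invented per-class values chosen to make the effect visible; they are not the entries of Table 5, and the function name is hypothetical.

```python
from typing import Dict, Iterable


def mean_accuracy(accuracy_by_class: Dict[str, float], exclude: Iterable[str] = ()) -> float:
    """Average the per-class accuracies, skipping any class labels listed in `exclude`."""
    skipped = set(exclude)
    kept = [value for label, value in accuracy_by_class.items() if label not in skipped]
    return sum(kept) / len(kept)


# Hypothetical cross-dataset accuracies (in percent) for a few classes.
toy_accuracies = {"OR": 10.0, "JPEG": 90.0, "GB": 95.0, "CE": 85.0}

print(mean_accuracy(toy_accuracies))                  # all classes
print(mean_accuracy(toy_accuracies, exclude=["OR"]))  # manipulated classes only
```

A single poorly transferred class, such as the original images here, can dominate the overall mean, which is why reporting both averages gives a clearer picture of cross-dataset behaviour.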

D. ABLATION STUDIES
We examined the performance of our MSRD-CNN under different architectural design choices to arrive at an optimal design for the proposed model. First, we evaluate our MSRD-CNN model with different numbers of initial convolution layers in the pre-processing stage. Then, we examine the influence of the multi-scale residual module on the model performance. We also conducted experiments to evaluate the effect of the number of FEBs on the model performance, as well as experiments on the choice of activation function used in the proposed model. All of these experiments on structural design choices are performed on multiple image processing operations using the BOSSBase dataset. We have also plotted the testing accuracy versus the number of epochs for these experiments, as shown in Figs. 2 to 5.
In the first ablation study, where different numbers of initial convolutional layers are considered, overall classification accuracies of 95.99%, 97.07%, and 97.06% are achieved with one, two, and three convolutional layers, respectively. The accuracy is about 1.1 percentage points lower when only one convolutional layer is used, while the accuracies with two and three initial convolution layers are almost the same. However, the training time increases significantly with three initial convolution layers. This is because the pre-processing stage does not contain any pooling layer and performs its convolution operations on the full-sized image, so each additional initial convolution layer increases both the number of trainable parameters and the training time. It is clear from Fig. 2 that our MSRD-CNN with two initial convolution layers consistently performs better, providing higher classification accuracy for most of the epochs than the other design choices. Moreover, we also evaluated our model without the multi-scale residual module to reveal its importance. Fig. 3 shows that our model with the multi-scale residual module consistently performs better than MSRD-CNN without it, providing higher accuracy for most of the epochs. These results therefore reveal the importance of the multi-scale residual module in our proposed network.
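To give a rough sense of why an extra initial convolution layer is costly when no pooling precedes it, the sketch below counts the weights and multiply-accumulate operations (MACs) of one additional 3 × 3 layer with 64 input and 64 output channels applied to a 256 × 256 feature map. It assumes stride 1, padding that preserves the spatial size, and one bias per filter, and it ignores batch normalization; these are illustrative assumptions, and the function names are hypothetical.

```python
def conv2d_params(in_channels: int, out_channels: int, kernel: int, bias: bool = True) -> int:
    """Number of trainable weights (and optional biases) in a square-kernel convolution."""
    return out_channels * (in_channels * kernel * kernel + (1 if bias else 0))


def conv2d_macs(height: int, width: int, in_channels: int, out_channels: int, kernel: int) -> int:
    """Multiply-accumulates for one forward pass with size-preserving padding and stride 1."""
    return height * width * out_channels * in_channels * kernel * kernel


extra_params = conv2d_params(64, 64, kernel=3)
full_res_macs = conv2d_macs(256, 256, 64, 64, kernel=3)
half_res_macs = conv2d_macs(128, 128, 64, 64, kernel=3)

print(extra_params)                    # weights added by one extra layer
print(full_res_macs)                   # cost per image at full 256 x 256 resolution
print(full_res_macs // half_res_macs)  # how much cheaper the layer would be after 2x downsampling
```

The parameter count of such a layer is modest, but because it runs at full resolution its per-image computation is four times that of the same layer applied after a single 2× downsampling step.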
In another experiment, we perform an ablation study on the number of FEBs. The classification accuracies with three, four, and five FEBs are 95.69%, 97.07%, and 97.08%, respectively. Fig. 4 shows a significant improvement in accuracy for all epochs when the number of FEBs is increased from three to four. With five FEBs, however, there is little further improvement in classification accuracy, while each additional FEB increases the computation cost by increasing the total number of model parameters. Therefore, we choose four FEBs in our proposed model. We also perform experiments with the Tanh and the recent Mish [35] activation functions to evaluate the model performance. Here too, the proposed MSRD-CNN with the ReLU activation function provides better performance than with the Tanh or Mish activation functions, as shown in Fig. 5.
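For reference, the three activation functions compared in this ablation can be written in a few lines. The sketch uses the standard scalar definitions, with Mish defined as x · tanh(softplus(x)); it is only meant to show how the functions differ on a handful of inputs, not how they are applied inside the network.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, zeroes out negative ones."""
    return max(0.0, x)


def tanh(x: float) -> float:
    """Hyperbolic tangent: saturates towards -1 and +1."""
    return math.tanh(x)


def softplus(x: float) -> float:
    """Smooth approximation of ReLU, log(1 + exp(x)), in an overflow-safe form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x)); smooth and slightly negative for x < 0."""
    return x * math.tanh(softplus(x))


for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(value, relu(value), round(tanh(value), 4), round(mish(value), 4))
```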

IV. CONCLUSION
In this paper, a novel general-purpose forensic approach is proposed for image manipulation detection. Our MSRD-CNN employs a multi-scale residual module to learn the prediction error features adaptively by suppressing the image content information. A feature extraction network further processes these low-level forensic features to provide high-level image manipulation features for better classification. A series of experiments was performed using two large-scale datasets. The results consistently show that our model can effectively classify different image processing operations, including anti-forensic attacks. Our model provides overall accuracy improvements of 1.04% and 1.48% over the recent forensic method [30] on the BOSSBase and Dresden datasets, respectively. Even in cross-dataset testing settings, our model outperforms the other approaches and exhibits good generalization ability. In the future, we plan to evaluate the robustness of our network against adversarial attacks and in image manipulation chain detection scenarios.

He is currently working as an Associate Professor with the Department of Computer Science and Engineering, Indian Institute of Technology Ropar, India. His current research interests include image processing, computer vision, image forensics, applied deep learning, and assistive technologies.

VOLUME 10, 2022