Generation of Novelty Ground Truth Image Using Image Classification and Semantic Segmentation for Copy-Move Forgery Detection

Since the ground truth (GT) generated by a CNN contains patch information of the learned class, accurate detection of Copy-Move forgery is ambiguous. The GT images generated by the various CNNs for image classification and semantic segmentation differ from one another, yet they show similar patch patterns for detecting forgery regions, so it is difficult to determine which network model generates the most suitable GT image. An optimal GT image is therefore essential in image forensics. To solve this problem, the scheme proposed in this paper generates a novelty GT image for the correct detection of Copy-Move forgery. The novelty GT image is configured using image classification and semantic segmentation. A variety of GT images is generated by adopting four state-of-the-art image classification networks and one semantic segmentation network. The proposed scheme mainly implements three tasks: 1) each network model generates GT images (GTnet), 2) these are converged and synthesized into one image (GTconv), and 3) GTconv is decomposed again into GT images (GTdecomp) using the threshold value of the ‘Threshold Filter.’ Here, the GTnet images carry two pieces of information: the image classification and the semantic segmentation of the forgery image. GTconv holds both pieces of information in one GT image. GTdecomp is obtained by decomposing GTconv into various GT images according to the threshold value, which sets the degree to which the ‘Image classification’ and ‘Semantic segmentation’ information permeates. The proposed novelty GT image is produced by this operational flow for Copy-Move forgery detection (CMFD). An experiment comparing the existing GTnet images with the GTdecomp images of the proposed scheme showed that the Accuracy and F1 Score of the proposed scheme improved by up to 0.4% and 0.2%, respectively. In addition, the Area Under the Curve (AUC) of the proposed CMFD scheme is graded ‘Excellent (A)’, with values of 0.9 or higher.


I. INTRODUCTION
In modern social media, digital images are an essential communication medium that carries a vast amount of information. Because images of our lives can easily be edited, transmitted, and distributed, it is difficult to trust them. This has emerged as an image forensics problem.
Using image editing tools (Premiere, Final Cut Pro, Vegas, Movavi, and After Effects, all registered trademarks), a malicious user can easily manipulate or modify an image. Consequently, Cut-Paste and Copy-Move manipulation and the distribution of malicious images cause severe personal infringement.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhan-Li Sun.
If there is image manipulation by such aggressive means, a defensive means to detect it is needed. Therefore, Cut-Paste [1], [2] and Copy-Move detection [3], [4] methods are being developed day by day as countermeasures.
Copy-Move operation selects one part of an image and copies it to another region of the same image. This manipulation method is used widely in the field of image forensics.
There are three main methods of detecting Copy-Move forgery, as follows.
1) 'Block segmentation': The forgery image to be tested is divided into several blocks, and features are extracted from each block. If blocks of n × n pixels have similar characteristics, they are regarded as a Copy-Move area. This approach has disadvantages: all blocks must be compared, so the computational cost is high, and geometric transformations of the moved patch cannot be handled. Alternatively, the Simple Linear Iterative Clustering (SLIC) algorithm [5] computes the 2D superpixels of the image, dividing it into irregular shapes instead of regular squares by grouping pixels with similar values into regions.
2) 'Keypoint matching': Similar areas in the image can be found by extracting and matching the key points of pixels, without dividing the forgery image [6]. Scale-invariant feature transform (SIFT) [7], [8] is the most basic method for detecting Copy-Move forgery. SIFT is an algorithm that extracts features that are invariant to the size and rotation of an image. The basic principle is to extract the SIFT features from two different images and find the corresponding parts by matching the most similar features in the two images. The disadvantage of SIFT is that the copy and move regions are mismatched when the key points are poor. In addition, expanding the dimension of the feature vector for key point generation increases the computing cost and the decision time for the Copy-Move area, so it is not easy to localize the area efficiently.

3) 'Pixel-based segmentation classification' of images: BusterNet was proposed by Wu et al. [9] using the VGG16 network structure. BusterNet has two branches, namely Simi-Det and Mani-Det. However, it has two drawbacks: first, both branches must locate regions correctly, and second, the Simi-Det branch only extracts single-level, low-resolution features due to the four pooling layers in VGG16 [10]. Chen et al. proposed the CMSDNet (copy-move similarity detection network) and the STRDNet (source/target region distinguishment network) [11], in which an atrous convolution is used instead of the fourth pooling layer of VGG16 to preserve the 'Field-of-views of filters.'
The above detection categories are 'Passive forensic detection' methods, which can detect the forgery area without needing a ground truth image of the Copy-Move area. On the other hand, methods that use a CNN for Copy-Move detection [12] are being developed. However, since the classes (copy area/move area) of the ground truth image are trained together in the network model implementation, the author regards such methods as 'pre-Active forensic detection.' The main words of the state-of-the-art Copy-Move forgery detection literature [13] are shown as a word cloud in Fig. 1.
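As an illustration of the 'Block segmentation' approach in item 1) above, the following minimal Python sketch slides an n × n window over a small gray-level image and groups blocks with identical content. It is a simplification under stated assumptions: the block feature is the raw pixel tuple, so only exact copies are found, and the names find_duplicate_blocks and make_forged_example are hypothetical helpers, not taken from any cited method.

```python
from collections import defaultdict


def find_duplicate_blocks(image, n):
    """Return groups of top-left corners whose n x n blocks are pixel-identical.

    Assumption: the block "feature" is the raw pixel tuple, so only exact
    copies are found; scaled, rotated, or blurred patches are missed.
    """
    height, width = len(image), len(image[0])
    groups = defaultdict(list)
    for r in range(height - n + 1):
        for c in range(width - n + 1):
            feature = tuple(image[r + i][c + j] for i in range(n) for j in range(n))
            groups[feature].append((r, c))
    # A feature shared by two or more positions marks a candidate Copy-Move pair.
    return [positions for positions in groups.values() if len(positions) > 1]


def make_forged_example(size=8, n=3, source=(0, 0), target=(5, 5)):
    """Build a toy image with unique pixel values, then copy one n x n patch."""
    image = [[r * size + c for c in range(size)] for r in range(size)]
    for i in range(n):
        for j in range(n):
            image[target[0] + i][target[1] + j] = image[source[0] + i][source[1] + j]
    return image
```

Every window position is visited, which reflects the computational cost noted above. Because the match is exact, a transformed patch would not be grouped, which corresponds to the stated limitation regarding geometric transformations.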
Essentially, generating a good ground truth to detect the exact tampered area of the forgery image is necessary for the image forensics field. To this end, this paper proposes a method of generating a novelty ground truth for Copy-Move forgery detection. For this purpose, it adopts the state-of-the-art deep learning CNN structures of existing 'image classification' [14]-[18] and 'semantic segmentation' [19]. The central concept of this paper is twofold: 1) The generated novelty GT image involves the two characteristics of 'image classification' and 'semantic segmentation' of a forgery image. 2) Copy-Move detection is considered 'pre-Active forensic detection' because it refers to the ground truth image only while the network is trained; when testing for forgery detection, the trained forgery detector becomes a 'Passive forensic ground truth' in the actual field. The contributions of this paper to forensic detection are as follows: 1) Copy-Move forgery detection uses the deep learning CNN structures for 'image classification' and 'semantic segmentation,' and the various generated ground truth images are converged into one synthesized pattern. 2) The pattern image in 1) is decomposed into a novelty ground truth image that can most accurately detect the Copy-Move patch, using the 'Crossing Over' operation and the threshold value of the 'Threshold Filter.' 3) In general, the performance evaluation of forgery detection calculates the Accuracy and F1 Score from the Copy-Move detection result after overlaying the ground truth on the forgery image. In the proposed scheme, however, the many generated ground truth images already carry their Accuracy and F1 Score. Therefore, a novelty ground truth image with high Accuracy and F1 Score is ready to use for forgery detection, and CMFD is performed rapidly. These contributions reduce the generation cost of a ground truth image and improve the performance of the forensic detector.
The rest of this paper is organized as follows. Section 2 briefly introduces the state-of-the-art CNNs for 'image classification' and 'semantic segmentation.' Section 3 proposes the new Copy-Move forgery detection scheme, in which a novelty ground truth image is generated with the properties introduced in Section 2. Section 4 discusses the experimental results and compares the performance of the existing and proposed ground truth images. Lastly, Section 5 draws the conclusion and presents future research possibilities in the area of image forensics.

II. THEORETICAL BACKGROUND

A. ResNet
ResNet [14], [15] is based on the structure of VGG-19 [20] and is a CNN model that is basically used in the YOLO model. Developed by Microsoft, it won the 2015 ILSVRC (ImageNet Large Scale Visual Recognition Challenge). Before ResNet, network models tried to increase performance by stacking many layers, but the ResNet authors found that the Vanishing Gradient Problem (VGP) prevents such deep models from being optimized, as shown in Fig. 2(a). The VGP is the phenomenon in which the influence of the gradient during backpropagation rapidly decreases as the model deepens. Fig. 2(b) shows the residual block that was introduced to solve this problem.
In this module, a shortcut connection adds the input x directly to the output, giving ReLU(F(x) + x). This solves the VGP in Fig. 2(a), in which the input value is forgotten as the number of layers increases.
In versions with 50 or more layers, the bottleneck skip-connection structure shown in Fig. 2(c) is used [14], [15].
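The shortcut computation ReLU(F(x) + x) can be sketched in a few lines of Python. This is a minimal sketch on a plain list of numbers, not the layer implementation of ResNet: residual_fn stands in for the stacked layers F, and both function names are hypothetical.

```python
def relu(values):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Return ReLU(F(x) + x), where residual_fn plays the role of F."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same length as x for the identity shortcut")
    # The shortcut adds the unchanged input to the output of F.
    return relu([f + xi for f, xi in zip(fx, x)])
```

If F outputs zeros, the block reduces to ReLU(x), so the input is carried forward even when the stacked layers contribute nothing.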

B. MobileNetv2
Google developed MobileNetV2 [16]. Fig. 3(a) shows the residual block structure of ResNet, and Fig. 3(b) shows the inverted residual block of MobileNetV2.
In Fig. 3, (a) is of the wide → narrow → wide type, in which the narrow part forms a bottleneck, whereas (b) is of the narrow → wide → narrow type. Its first feature is the linear bottleneck: the output of the last narrow layer is added to the skip connection without going through ReLU. The necessary information is kept in the narrow layers and passed to deeper layers through the skip connection. Also, because the skip connection links the narrow layers, memory usage is reduced.
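The narrow → wide → narrow flow and the linear bottleneck can be illustrated on the channel vector of a single pixel. The sketch below is deliberately reduced and rests on assumptions: the 3 × 3 depthwise convolution between expansion and projection is omitted, the 1 × 1 convolutions are written as plain matrix-vector products, and ReLU6 is used as the activation in the wide part. The function names are hypothetical.

```python
def relu6(value):
    """ReLU clipped at 6, used here as the activation of the wide layer."""
    return min(max(value, 0.0), 6.0)


def matvec(weights, x):
    """Apply a 1 x 1 convolution to one pixel: a matrix-vector product over channels."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def inverted_residual_pixel(x, w_expand, w_project):
    """Narrow -> wide -> narrow on one pixel (depthwise 3 x 3 step omitted)."""
    wide = [relu6(v) for v in matvec(w_expand, x)]  # expansion followed by ReLU6
    narrow = matvec(w_project, wide)                # linear bottleneck: no ReLU here
    return [n + xi for n, xi in zip(narrow, x)]     # skip connection between narrow ends
```

Because the projection is linear, negative values survive in the narrow output, and the skip connection only has to hold the narrow tensor.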

C. XCEPTION
'Xception' [17] stands for 'Extreme Inception,' a more powerful version of the Inception architecture. If the size of a deep neural network is increased to improve its performance, 1) overfitting occurs, and 2) the required computational resources increase. GoogLeNet [21] uses the Inception module shown in Fig. 4 to solve this problem. Rhee [22] proposed a network model for Cut-Paste forgery detection as an application thereof.

D. INCEPTIONRESNETV2
Inception [18] was published as the model that surpassed VGG and won first place in ILSVRC 2014; GoogLeNet [21] is one of several versions of Inception. Fig. 5(a) and (b) present the original module and the factorized module, respectively. Inception v2 has 2.5 times more training parameters than GoogLeNet but is more efficient than VGG.

E. DEEPLAB V3PLUS
Convolutional layers are used for all other primary networks.
This DeepLab v3plus implementation does not include a global average pooling layer in the ASPP; it converts the base network into a CNN for semantic image segmentation.
The network uses an encoder-decoder architecture with dilated (atrous) convolution. To use a classification CNN for semantic segmentation, the network is retrained with a class assigned to each segment region.

F. SEMANTIC SEGMENTATION
Semantic segmentation [24] segments objects in an image into meaningful units. More specifically, it predicts which class each pixel in an image belongs to. This task is sometimes called dense prediction because prediction is performed on all pixels in the image.
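Dense prediction can be sketched as a per-pixel argmax over class scores. The following Python sketch assumes that a network has already produced one score map per class; the layout scores[k][r][c] and the function name segmentation_map are assumptions made for illustration, and labels start at 1.

```python
def segmentation_map(scores):
    """Return the per-pixel class label from per-class score maps.

    scores[k][r][c] is the score of class k + 1 at pixel (r, c).
    Ties are resolved in favor of the lower class index.
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda k: scores[k][r][c]) + 1
         for c in range(width)]
        for r in range(height)
    ]
```

The output has the same height and width as the input, with exactly one class label per pixel.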
When semantic segmentation is applied to an image, the class to which each pixel belongs becomes known. That is, the input of the semantic segmentation algorithm is a color or black-and-white image, and the output is a segmentation map representing the predicted class of each pixel, as in Fig. 7 [25], with the semantic labels {1: person, 2: purse, 3: plants/grass, 4: sidewalk, and 5: building/structures}.

G. BUSTERNET
The feature extraction modules of BusterNet use the VGG16 network, and the mask decoder module uses four pairs of BN-Inception [9] and performs bilinear upsampling. A correlation module is added to the Simi-Det branch to detect copy patches. This module consists of an autocorrelation layer, a percentile pooling layer, and a batch normalization layer. Mani-Det detects the forged area (moved patch), and Simi-Det detects the similar patch (copy patch). These two patches are assigned to the copy and move regions in the distinguishment map. BusterNet has two drawbacks [11]: first, both branches must locate regions correctly, and second, the Simi-Det branch only extracts single-level, low-resolution features due to the four pooling layers in VGG16. Therefore, the detected forged area may not be correct.

H. SOURCE/TARGET DISTINGUISHMENT
Chen et al. proposed the Source/Target Distinguishment [11], shown in Fig. 9, to compensate for the shortcomings of BusterNet. The parallel scheme of BusterNet must ensure that each branch locates regions correctly. In contrast, this structure connects the CMSDNet (copy-move similarity detection network) and the STRDNet (source/target region distinguishment network) in series, with a detection map inserted between them. It comprises three modules: the feature extraction module, the correlation module, and the mask decoder module. Its peculiarity is that an atrous convolution is used instead of the fourth pooling layer of VGG16 to preserve the 'Field-of-views of filters.'
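The effect of an atrous convolution on the field of view can be seen in one dimension. The sketch below is an illustration only, not the layer used in [11]: it applies a kernel whose taps are rate samples apart, without padding, and the function names are hypothetical.

```python
def field_of_view(kernel_size, rate):
    """Number of input samples spanned by one filter application."""
    return (kernel_size - 1) * rate + 1


def atrous_conv1d(signal, kernel, rate):
    """Valid 1-D cross-correlation whose kernel taps are `rate` samples apart."""
    span = field_of_view(len(kernel), rate)
    return [
        sum(k * signal[start + i * rate] for i, k in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]
```

With three taps, rate 1 covers three samples and rate 2 covers five, so the field of view grows without adding weights and without the downsampling that a pooling layer would introduce.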

III. PROPOSED COPY-MOVE FORGERY DETECTION SCHEME

A. CONFIGURE TO SEMANTIC SEGMENTATION OF NETWORK STRUCTURE
In this work, a novelty ground truth image for detecting a Copy-Move region is generated using the state-of-the-art deep CNNs {Resnet50 [14], Resnet18 [15], Mobilenetv2 [16], Xception [17], Inceptionresnetv2 [18], and the DeepLab v3plus structure [19]}, which were developed for, and perform excellently in, 'image classification' and 'semantic segmentation,' respectively. In Fig. 10, the networks [14]-[18] used for 'image classification' are converted in turn into 'net_Model n' (n = 1 ∼ 5) through the network [19] used for 'semantic segmentation.' The structures of the net_Models are thus newly designed and have the characteristics of both 'image classification' and 'semantic segmentation.'
Table 1 shows the composition of the Copy-Move image data sets {CoMoFoD [26], CVIP [27], GRIP [28], CASIA [29], COVERAGE [30], MICC-F600 [31], and CPH [32]}. These images, together with augmented Copy-Move images shifted in eight directions, are used to train the net_Models configured with the structures of 'image classification' [14]-[18] and 'semantic segmentation' [19]. A total of 11,350 images from the seven data sets are divided at a ratio of 0.7:0.15:0.15 for training, validation, and testing of the net_Models. For augmentation, 1,275 images are randomly selected from the training data; in each, a copy patch is transformed {scaled, rotated, or blurred} and then moved to another region of the image.
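The data split and the eight-direction shift can be sketched as follows. This is a minimal sketch under assumptions that the text does not specify: the split uses integer percentages with the remainder assigned to the test set, the shift step is one pixel, and vacated pixels are filled with 0. The function names and the seed are hypothetical.

```python
import random

# Row and column offsets of the eight neighboring directions.
EIGHT_DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def split_dataset(items, train_pct=70, val_pct=15, seed=0):
    """Shuffle and split into training, validation, and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    # Whatever remains after training and validation goes to the test subset.
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]


def shift_image(image, dr, dc, fill=0):
    """Shift a 2-D image by (dr, dc) pixels; vacated pixels receive `fill`."""
    height, width = len(image), len(image[0])
    return [
        [image[r - dr][c - dc] if 0 <= r - dr < height and 0 <= c - dc < width else fill
         for c in range(width)]
        for r in range(height)
    ]


def eight_direction_shifts(image, step=1):
    """Return the eight shifted copies of an image (assumed augmentation step)."""
    return [shift_image(image, dr * step, dc * step) for dr, dc in EIGHT_DIRECTIONS]
```

With 11,350 items, this rounding gives 7,945 training, 1,702 validation, and 1,703 test items; the exact per-subset counts used in the experiments are not stated in the text and may differ.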

B. CONFIGURE TO NET_MODEL TRAINING
In the manipulated area of the forgery image, the copy and move areas are Class 1, and the remaining innocent area is Class 2. In the ground truth image, the two classes appear in 'white' and 'black,' respectively, as shown in Fig. 11.
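The class encoding of the ground truth image can be sketched as follows. The sketch assumes rectangular patches and 8-bit gray levels (255 for white, 0 for black); the patch format (top, left, height, width) and the function names are hypothetical.

```python
COPY_MOVE, INNOCENT = 1, 2  # class labels as defined in the text
WHITE, BLACK = 255, 0       # assumed 8-bit gray levels for the two colors


def class_map(height, width, patches):
    """Label each pixel: Class 1 inside any (top, left, h, w) patch, else Class 2."""
    labels = [[INNOCENT] * width for _ in range(height)]
    for top, left, h, w in patches:
        for r in range(top, min(top + h, height)):
            for c in range(left, min(left + w, width)):
                labels[r][c] = COPY_MOVE
    return labels


def render_ground_truth(labels):
    """Draw Class 1 in white and Class 2 in black."""
    return [[WHITE if label == COPY_MOVE else BLACK for label in row] for row in labels]
```

Both the copy area and the move area are passed as patches, so they share Class 1 and are not distinguished from each other in the rendered image.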

C. PROPOSED SCHEME
The net_Models are trained to serve as detectors for Copy-Move detection. The training option used in the proposed scheme adopts SGDM (Stochastic Gradient Descent with Momentum) [35].
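The SGDM update can be sketched as below. This is the common velocity form of momentum, written for a flat list of parameters; the learning rate and momentum values are illustrative assumptions, since the training hyperparameters are not listed in the text, and the function names are hypothetical.

```python
def sgdm_step(params, grads, velocity, learning_rate, momentum):
    """One SGDM update: v <- momentum * v - learning_rate * g, then w <- w + v."""
    new_velocity = [momentum * v - learning_rate * g for v, g in zip(velocity, grads)]
    new_params = [w + v for w, v in zip(params, new_velocity)]
    return new_params, new_velocity


def minimize_quadratic(start, steps, learning_rate=0.1, momentum=0.9):
    """Toy use: minimize f(w) = w^2, whose gradient is 2w."""
    params, velocity = [start], [0.0]
    for _ in range(steps):
        grads = [2.0 * w for w in params]
        params, velocity = sgdm_step(params, grads, velocity, learning_rate, momentum)
    return params[0]
```

The velocity term accumulates past gradients, so each step depends on the previous update as well as on the current gradient.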
Fig. 12 shows the workflow of the scheme proposed in this paper to improve the performance of CMFD. The process from the GT generation of the forgery image in the net_Models (upper left) to the decision on the Copy-Move forgery (far right) is described as follows:
1) The structures of the five state-of-the-art networks [14]-[19] are transformed into net_Models (No. 1-5), which generate the ground truth image (GTnet) of the forgery image.
7) ... image, Accuracy, Recall, and F1decomp Score are calculated from TP, TN, FP, and FN through the same procedure as in 5).

8) The higher of the F1net Score in 5) and the F1decomp Score in 7) is selected, and CMFD is determined accordingly.
In Fig. 12, procedure 4) is depicted; procedure 6) works in the same way.
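The overall flow (GTnet → GTconv → GTdecomp → selection by F1 Score) can be sketched as follows. The operations here are assumptions chosen for illustration and are not the 'Crossing Over' operation or the 'Threshold Filter' of the proposed scheme, whose details are not reproduced in this text: GTconv is taken as the per-pixel mean of binary GTnet masks (1 = Copy-Move, 0 = innocent), and GTdecomp as that mean thresholded at t. All function names are hypothetical.

```python
def converge(gt_nets):
    """Assumed GTconv: per-pixel mean of the binary GTnet masks."""
    count = len(gt_nets)
    height, width = len(gt_nets[0]), len(gt_nets[0][0])
    return [[sum(gt[r][c] for gt in gt_nets) / count for c in range(width)]
            for r in range(height)]


def decompose(gt_conv, threshold):
    """Assumed GTdecomp: pixels whose converged value reaches the threshold."""
    return [[1 if value >= threshold else 0 for value in row] for row in gt_conv]


def f1_score(predicted, reference):
    """F1 Score of a binary mask against the reference mask."""
    pairs = [(p, t) for p_row, t_row in zip(predicted, reference)
             for p, t in zip(p_row, t_row)]
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def select_ground_truth(gt_nets, reference, thresholds):
    """Return (F1, source name, mask) of the best GTnet or GTdecomp candidate."""
    candidates = [(f1_score(gt, reference), f"GTnet {i + 1}", gt)
                  for i, gt in enumerate(gt_nets)]
    gt_conv = converge(gt_nets)
    for t in thresholds:
        mask = decompose(gt_conv, t)
        candidates.append((f1_score(mask, reference), f"GTdecomp t={t}", mask))
    # Mirrors step 8): the candidate with the higher F1 Score is kept.
    return max(candidates, key=lambda item: item[0])
```

In this toy setting, a thresholded combination can score higher than every individual mask when the individual masks err on different pixels, which is the kind of gain the selection in step 8) is meant to capture.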

D. PROPOSED FORGERY DETECTION METHOD
In Fig. 14, the performance evaluation of Copy-Move forgery detection is the block drawn in green, and the green line indicates the calculation of the Accuracy and F1 Score. The general and proposed methods are shown in (a) and (b), respectively.
In Fig. 14(a), the ground truth image generated by some method is overlaid with the forgery image, and then the evaluation is performed to examine the Copy-Move patch area.
If the result is not satisfactory, a ground truth image is generated by another method and overlaid with the forgery image again; this is repeated until the evaluation of the detection result is satisfactory.
In Fig. 14(b), on the other hand, the performance evaluation of Copy-Move forgery detection in the proposed scheme is performed directly by overlaying the generated ground truth images and the forgery image.

IV. EVALUATION OF EXPERIMENTAL RESULTS AND PERFORMANCE
The proposed scheme was tested using MATLAB 2021a as the simulation software on a PC (64-bit Win10 Pro, AMD Ryzen9 3950X 16-Core CPU @3.5GHz, 128GB DDR4 memory, and two NVIDIA 2080Ti 11GB graphics boards).

A. NET_MODEL TRAINING EVALUATION
The training process of each net_Model configured in Section III-B is shown in Fig. 15 (a) ∼ (e), which include the {Training Loss, Training Accuracy, and Final Validation Accuracy}. The measured {Training time, number of layers, and Final Validation Accuracy} of the net_Models are listed in Table 2.

B. EVALUATION OF TRAINED NET_MODEL
The Copy-Move forgery images prepared for testing in Section III-B are tested on the trained net_Models of the proposed scheme. For the CMFD performance evaluation of 1) and 3) in the proposed scheme shown in Fig. 12, the GT images generated in procedures 4) and 6) form a paired pattern of the 'Crossing Over' operation (Fig. 13), in which TP, TN, FP, and FN are displayed as the following color components.
- The original GT is the Copy-Move area to be detected, but the generated GT is missing.
From procedures 5) and 7), the F1 Score is calculated by measurement equations (1) to (7). The meaning of each measurement is described in Table 3. The values in Table 4 are recorded at the final validation after the training.
Fig. 16 shows the confusion matrices that classify 'Class 1' (white: Copy-Move patch) and 'Class 2' (black: innocent region) for the GTnet images generated from all test forgery images by each net_Model (a) ∼ (e), see Fig. 11-1). The detection rate of the Copy-Move region is about 70% for 'Class 1,' which exceeds the class classification criterion of 50% and confirms the validity of the semantic segmentation. Here, the brightness of the class overlay images (e) and (h) is displayed differently to distinguish the GT image generated by the net_Model from that of the proposed scheme. Also, Table 5 shows which model's GT became the best GTdecomp image and at which threshold value.
The measurement results {..., Specificity} are given in Table 6. Comparing the GT images generated by each net_Model with those of the proposed scheme, the performance of the proposed scheme was somewhat lower than that of net_Model 1, whereas it was superior to the results of the other net_Models 2 ∼ 5 (red), and it was confirmed that the improvement rate increased (blue bold).
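The measurements derived from TP, TN, FP, and FN can be sketched as follows. Equations (1) to (7) are not reproduced in this text, so the sketch uses the standard definitions of these measures, which is an assumption about the paper's exact formulas. Masks are binary (1 = Copy-Move, 0 = innocent), and the function names are hypothetical.

```python
def confusion_counts(predicted, reference):
    """Count TP, TN, FP, and FN between a generated mask and the original GT."""
    tp = tn = fp = fn = 0
    for p_row, t_row in zip(predicted, reference):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, tn, fp, fn):
    """Standard measures computed from the four confusion counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "precision": _ratio(tp, tp + fp),
        "recall": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }
```

Accuracy is dominated by the large innocent region, whereas the F1 Score reflects only the Copy-Move class, which is why the two are reported together.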
The final validation accuracy in Table 4 is the accuracy of the verification process after net_Model training, whereas the accuracy in Table 6 is the result of testing with the trained net_Model. The maximum Accuracy is 0.9523 (highlighted in red bold). Fig. 36 presents the ROC curves used to estimate the proposed scheme: (a) the GTnet classification of the existing net_Models and (b) the GTdecomp classification of the proposed scheme.
All of the AUC values are 0.9 or higher. Thus, according to the evaluation criterion [36], the proposed CMFD scheme was graded 'Excellent (A).'
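The AUC can be sketched with its rank interpretation: the probability that a randomly chosen positive pixel receives a higher score than a randomly chosen negative pixel, with ties counted as one half. This is a minimal sketch for small inputs, not the routine used to draw the ROC curves in the experiments. Only the 'Excellent (A)' grade for values of 0.9 or higher is taken from the text; the function names are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def is_excellent(auc):
    """Grade check stated in the text: 'Excellent (A)' for an AUC of 0.9 or higher."""
    return auc >= 0.9
```

The pairwise form costs time proportional to the number of positive-negative pairs, so a sorted, rank-based computation would be preferable for full-size images.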
In particular, it was also confirmed that all measurement items of net_Models 2 and 3 were 'excellent' (red), with increased improvement rates.
For visual aid, the measurement results in Table 6 are presented as graphs in Fig. 37: GTnet in (a), GTdecomp in (b), and the improvement rate of each net_Model in (c).
The performance of the proposed CMFD scheme was compared with the state-of-the-art methods [9], [11] of pixel-based segmentation classification described in Sections 1 and 2. Table 7 presents the results.
The experiments were executed under the same conditions as in this paper. It was confirmed that, for classifying the copy and move region patches in CMFD, the deep learning network structure of the converged method (image classification + semantic segmentation) has a higher classification rate than the structures with a parallel and a serial configuration, respectively.

V. CONCLUSION
This paper proposed a new scheme for generating a novelty ground truth image that detects Copy-Moved patches in a forged image. The generated GT involves both the 'Image classification' and the 'Semantic segmentation' of Copy-Move forgery images.
According to the information of 'Image classification' and 'Semantic segmentation,' the best-fit GT image of the test forgery image is generated.
Various kinds of Copy-Move forgery patches (such as Rotate, Scale, and Blur) were included in the data set for net_Model training and were detected well in the test stage.
The proposed GT image, built on dedicated CNN structures that excel at image classification and semantic segmentation, improved the detection Accuracy and F1 Score for Copy-Move matching. Consequently, the proposed scheme could be applied more in depth to the image forensics field (such as Cut-Paste, Inpainting, and forgery feature extraction).
The method of detecting the Copy-Move patch region in the proposed scheme can be used quickly in an online JIT (Just In Time) environment. In the future, research on detecting multiple classes of moved patches needs to be advanced.