Source Camera Identification Based on Coupling Coding and Adaptive Filter

Source Camera Identification (SCI) has played an important role in the security field for decades. With the development of deep learning, the performance of SCI has improved markedly. However, most proposed methods address only a single identification category, e.g., camera model identification. To exploit the coupling between different camera categories, we present a new coding method: we apply multi-task training to classify brands, models and devices simultaneously in a single network. Unlike common multi-task methods, we obtain the multi-class classification result from a single-label classification. Specifically, we classify the categories progressively, so that the classification result for the parent category is used in the classification of the child category (a detailed explanation is given in the main text). In addition, by appropriately increasing the redundancy of the coding method, the training time needed to classify new camera categories can be greatly reduced. To better extract camera attributes, we propose an adaptive filter. We also propose an auxiliary classifier dedicated to camera model re-classification, because the main classifier performs poorly on certain models. Finally, extensive experiments show that our methods outperform existing methods.


I. INTRODUCTION
With the rapid development of multimedia technology, digital images have become an increasingly popular means of expressing ideas. In many scenarios, digital images play an important role [1], [2]. For example, they can be treated as evidence in criminal investigations. However, as image editing applications evolve rapidly, modifying digital images no longer requires professional skills [3]-[6]. This can cause a series of problems; in the example above, it may affect justice and law enforcement. To ensure the credibility of an image, identifying the source of digital images becomes necessary. Many forensic algorithms have been proposed to identify the source of a digital image [7], [8]. The essence of camera forensics is to detect differences in camera attributes. As shown in Fig. 1, there are multiple processing steps on the way from a real scene to a digital image. For each step of the imaging process, corresponding traditional methods have been proposed to classify the images [9], [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Kim-Kwang Raymond Choo.
In the past few years, SCI performance has improved greatly. Thanks to its powerful learning capability, a CNN (Convolutional Neural Network) can automatically learn the differences among cameras [11], [12]. Network performance depends heavily on the number of images in the training set: increasing the number of training images improves accuracy but also increases training time. Because of the limited storage capacity of hardware devices, original camera images are difficult to use directly as CNN input, since doing so would generate an excessive number of parameters. Therefore, existing deep learning methods divide the images into fixed-size blocks [13]. SCI comprises identification of the brand, the model and the device. Although camera forensics methods have improved significantly, several issues remain to be resolved. First, most existing methods focus on a single category, such as the camera model. Although the camera brand can be identified with high accuracy, there is still room to improve the accuracy of model and device classification. No existing SCI method takes the correlation among the three categories into account. Second, the classification network needs to be retrained to classify new categories, even if there are only a few of them. Third, image content greatly affects performance when the identifying attributes of the camera are extracted, so it is necessary to preprocess the images. Tuama et al. [14] use a high-pass filter and a wavelet-based denoising filter to remove the image content. However, a fixed high-pass filter may also remove the original camera attributes.
In our work, we propose a category-coupling multi-task training method based on an adaptive filter. To make full use of the correlation among the camera categories of brand, model and device, we adopt a progressive method to classify the three categories. We start by classifying camera brands, which assigns images to different visual subspaces (brands). Model classification is then performed separately within each brand subspace. Meanwhile, we expect that classification of the subclasses can in turn improve the classification accuracy of the parent class. In this way, we use a single label to realize multi-class classification, which we call SLMC. In addition, we set up redundancy coding before training to improve the scalability of the network, which reduces the training time for new categories. To better extract image attribute information, we use residual learning to extract image content with multi-layer convolution. The outputs of the convolution layers are concatenated, and a 1 × 1 convolution kernel is used to selectively extract the low-frequency or high-frequency content relevant to SCI. To capture local neighborhood differences of the camera lens, we train an additional position classifier as an auxiliary classifier to re-classify some camera categories.

II. RELATED WORKS
In general, a digital camera consists of two major subsystems, the hardware and the software [15]. The most common forensic methods that target the hardware are based on optical aberration [9] and Sensor Pattern Noise (SPN) [16], [17]. Similarly, methods that target the software involve JPEG compression [10] and color interpolation [18]. Lighting varies dramatically in real environments, so illuminance is unsuitable as an SCI feature. However, illuminance is consistent within a single image, and Riess et al. [19] used it as a feature for image forgery detection. Cameras from different manufacturers differ in their lens distortion parameters, so the interpolation map for a specific camera's lens distortion can be considered fixed. Therefore, Hwang et al. [20] used interpolation-based lens distortion parameters as a feature to classify the camera model.
Differences in sensor SPN make it the most widely studied basis for camera forensics, and the distortion introduced by SPN is very helpful for camera model classification. SPN includes fixed pattern noise and Photo-Response Non-Uniformity (PRNU) noise. PRNU is generated by the non-uniform sensitivity of the sensor hardware to different illuminance intensities. Lucas et al. [21] obtained camera fingerprints by averaging the residuals extracted from a large number of images. Through the decomposition and combination of image color channels, Li et al. [22] obtained a relatively complete PRNU and proposed the DPRNU method to verify the integrity of camera images. By calculating the relationship between neighboring pixels of different color channels, Choi et al. [23] proposed a method based on color interpolation to identify camera categories. Most cameras use JPEG compression to store the final image.
Different cameras produce images of different sizes, so the quality of the images they produce also differs considerably. Choi et al. were the first to classify images by JPEG compression. Similar to Riess et al., Mahdian et al. [24] used JPEG compression to detect image forgery. Farah Ahmed et al. [25] presented a comparative analysis of SCI using deep learning and a traditional method (PRNU). Camera forgery methods include seam carving, fingerprint copying, and adaptive PRNU denoising.
Sameer et al. [26] proposed a deep learning method to detect camera forgery. Bondi et al. [27] proposed a method that combines a CNN with an SVM classifier: the CNN extracts features, and the SVM classifier performs the identification. At the same time, with the popularity of mobile phones, mobile device-based source camera forensics [28]-[30] is becoming more and more widespread.
Whether a deep learning method or a traditional method is used, camera attribute extraction is affected by image content and various kinds of noise. Therefore, Tuama et al. [14] proposed pre-processing the input image by extracting the high-frequency texture with a high-pass filter and then classifying the high-frequency image. Bayar et al. [11] proposed a robust CNN-based camera model identification method in which a constrained convolution layer makes camera model identification robust to re-compression.

III. PROPOSED METHODS
In this section, we first describe the proposed network architecture. We analyze the network from the perspective of residual learning and use a new coding method that differs from those of other methods. Notably, our coding method can easily be transferred to other similar tasks. In what follows, we describe our method in detail.

A. OVERALL STRUCTURE
As shown in Fig. 2, we use residual learning [13], [31] to extract the image content. First, we use one convolution layer to extract the image features and then use multi-layer convolution to extract the content of the image. Next, the outputs of the multi-layer convolution are concatenated, and a 1 × 1 convolution kernel extracts from them content with the same dimensions as the image features. The CNN can determine what needs to be removed in order to preserve as much information related to the classification attributes as possible. The identification method thus removes the content that disturbs the classification and preserves the camera attributes as far as possible. However, the residual learning network disrupts the correlation between neighborhoods of the original camera image. Therefore, we use a series of convolution layers to extract the relevant neighborhood information separately, and then concatenate the two parts of information as the final output feature.
This paper uses the SLMC method to classify the camera categories. Unlike previous classification methods, when extracting image features we aim to extract enough common features to classify camera brands, models and devices simultaneously. As shown in Fig. 4, we use a recursive method to extract camera features. The sub-classifier can influence the parent classifier to drop features that are invalid for sub-classification. In the end, we extract enough common features to classify the three categories.
Based on this extraction process, we propose a new coding method. As shown in Table 1, existing deep learning methods classify brands, models and devices with 14, 27 and 74 output categories, respectively. None of these methods considers the correlation among the three categories of a single camera. Multi-class classification affects the classification performance of the network: the more output categories there are, the more significant the impact, and a binary classifier usually performs better than a multi-class classifier. To improve the classification performance for models and devices, we classify models (devices) separately under the same brand (model), which transforms the unconstrained multi-class classification into a constrained binary or ternary classification. As shown in Table 1, the encoding length N of our coding method is determined by b_l, m_l and d_l, where i = 0, 1, ..., b_l − 1 and j = 0, 1, ..., m_l − 1. Here b_l denotes the number of brands, and m_l and d_l are the numbers of models and devices under the same parent class, respectively. We first select some images (Sony_DSC) as the pre-training data set. The encoding method is shown in Table 2. Since all camera devices are of the same brand, there is no need to classify the brands. We use six bits to represent the output categories. For example, for the Sony_DSC-H50_0 camera the ideal output of the model classifier is 110000 and the ideal output of the device classifier is 110100, so the camera is classified in a progressive way. When a parent class has fewer models or devices than there are binary bits reserved for them (e.g., Sony_DSC-W170 has fewer than three devices), we call the unused codes coding redundancy. For example, no device is encoded as 101001.
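The six-bit examples above can be reproduced with a small sketch. It assumes that each level is one-hot within its own segment and that the segments are concatenated; the function names and the segment widths (1 brand bit, 2 model bits, 3 device bits, taken from the worked example in this section) are illustrative and are not the authors' implementation.

```python
# Illustrative sketch only: function names and the one-hot-per-segment layout
# are assumptions, chosen to reproduce the six-bit examples in the text.

def one_hot(index, width):
    """Return a '0'/'1' string of length `width` with bit `index` set."""
    if not 0 <= index < width:
        raise ValueError("index does not fit in the segment")
    return "".join("1" if k == index else "0" for k in range(width))


def encode(brand, model, device, widths):
    """Concatenate the brand, model and device segments into one code word.

    `model` is counted within its brand and `device` within its model, so the
    same model/device bits are reused under every parent class.
    """
    brand_bits, model_bits, device_bits = widths
    return (one_hot(brand, brand_bits)
            + one_hot(model, model_bits)
            + one_hot(device, device_bits))


def progressive_labels(code, widths):
    """Labels for the brand, model and device classifiers: each keeps the
    parent bits and zeroes everything below its own level."""
    brand_bits, model_bits, _ = widths
    n = len(code)
    cut_brand = brand_bits
    cut_model = brand_bits + model_bits
    return [code[:cut_brand].ljust(n, "0"),
            code[:cut_model].ljust(n, "0"),
            code]


widths = (1, 2, 3)                  # 1 brand bit, 2 model bits, 3 device bits
code = encode(0, 0, 0, widths)      # first device of the first model
labels = progressive_labels(code, widths)
```

Under these assumptions, `code` is `110100` and `labels` is `100000`, `110000`, `110100`, matching the brand, model and device targets quoted in the text.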
Coding redundancy affects network performance. However, experiments show that this impact is negligible (unless too much coding redundancy is set), and the codes of different classes do not affect each other. That is, when we classify the devices, the binary bits of the brand and model are not set. Details are given in the experiment section.
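To make the notion of an unused (redundant) code concrete, the sketch below reads a six-value score vector level by level and looks the result up in a table of assigned codes. The helper names, the score values and the `KNOWN` table are hypothetical; only the six-bit layout, the Sony_DSC-H50_0 code 110100 and the unassigned code 101001 come from the text.

```python
# Illustrative sketch only: `decode`, `lookup`, `KNOWN` and the score values
# are hypothetical; the bit layout follows the six-bit example in the text.

def argmax(values):
    """Index of the largest value in a non-empty sequence."""
    return max(range(len(values)), key=values.__getitem__)


def decode(scores, widths):
    """Read a score vector level by level: brand, then model, then device."""
    brand_bits, model_bits, device_bits = widths
    cut_brand = brand_bits
    cut_model = cut_brand + model_bits
    cut_device = cut_model + device_bits
    brand = argmax(scores[:cut_brand])
    model = argmax(scores[cut_brand:cut_model])      # index within the brand
    device = argmax(scores[cut_model:cut_device])    # index within the model
    return brand, model, device


# Codes that are assigned to a camera; the remaining entries are omitted here.
KNOWN = {(0, 0, 0): "Sony_DSC-H50_0"}


def lookup(scores, widths):
    """Return the camera name, or None when the code is not in the table."""
    return KNOWN.get(decode(scores, widths))


widths = (1, 2, 3)
assigned = lookup([0.9, 0.8, 0.1, 0.7, 0.2, 0.1], widths)   # close to 110100
unused = lookup([0.9, 0.1, 0.8, 0.1, 0.2, 0.7], widths)     # close to 101001
```

Here `assigned` resolves to `Sony_DSC-H50_0`, while `unused` is `None` because 101001 corresponds to no device. Reserving such spare codes is what later allows a new device to be added without changing the output width.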
The process is given in Algorithm 1, where Classify denotes the softmax cross-entropy function, and Conv3 and Conv1 represent the 3 × 3 and 1 × 1 convolution operations. However, this approach does not increase the coupling among different camera categories. For example, assume that the label of the given images is 110100, and that the camera brand (1 bit), model (2 bits) and device (3 bits) are encoded as 100000, 010000 and 000100, respectively. Ideally, the output of the device classifier is consistent with the label (110100). However, the network may map different camera categories into different numerical spaces and thereby indirectly eliminate the coupling between categories, in which case the coding method degenerates into separate classifications. As shown in Table 3, the division of the numerical spaces means that the parent class (brand, model) has no effect on the loss of the subclass (model, device). That is, the parent bits (arbitrary values in the range [a, b]) in the subclass code have no effect on the subclass classifier after the softmax function. We therefore propose a correlation loss.
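The degenerate case described above can be illustrated numerically. The sketch assumes one particular reading, namely that the device loss is a softmax cross entropy taken over the device segment only; under that assumption the parent logits can take any value without changing the device loss. The function names and logit values are hypothetical.

```python
import math

# Illustrative sketch only: a per-segment softmax cross entropy, which is one
# possible reading of the "separate classification" case in the text.

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_cross_entropy(logits, label_bits, start, end):
    """Cross entropy between label and softmax, restricted to one segment."""
    probs = softmax(logits[start:end])
    return -sum(t * math.log(p) for t, p in zip(label_bits[start:end], probs))


label = [1, 1, 0, 1, 0, 0]                     # 110100
logits_a = [5.0, 2.0, -1.0, 3.0, 0.5, 0.1]     # plausible parent logits
logits_b = [-7.0, 0.0, 9.0, 3.0, 0.5, 0.1]     # wrong parent logits, same device logits

device_loss_a = segment_cross_entropy(logits_a, label, 3, 6)
device_loss_b = segment_cross_entropy(logits_b, label, 3, 6)
```

Both device losses are identical even though `logits_b` gets the brand and model wrong, which is the lack of coupling that motivates a loss term tying the levels together.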

B. LOSS FUNCTION
SLMC classifies the camera attributes progressively. Model classification works under the condition that brand classification is accurate. Likewise, the device classifier can be trained better when the camera brands and models are classified accurately. As shown in Table 1, the camera brands occupy most of the binary bits, which results in the lowest accuracy for camera brand classification when the network weights are randomly initialized. As described in the previous section, this training method is pathological. Therefore, to prevent the camera classification from falling into local minima and to resolve this pathological behavior, when classifying the camera models or devices we replace the formula in Algorithm 1 with a formula in which Label1, Label2 − Label1 and Label3 − Label2 represent the output logits of the network for camera brand, model and device. However, this method only solves the problem of divided numerical spaces; the subclasses still do not improve the classification accuracy of the parent class. We therefore further modify the loss function, as shown in Fig. 5. Cost_b, Cost_m and Cost_d are defined in Algorithm 2. Cost1 and Cost2 assist the information sharing among the three categories; in their definition, Label has three bits set to 1, L_1 stands for the L1 norm [32], and b_l and m_l denote the binary bits of the brand and model. Therefore, the device classifier classifies the camera brands and models as well. However, this method imposes the same loss on the three camera categories, so additional weight settings and progressive training are still necessary.
We recommend setting the above parameters to 0.5, 0.4, 0.1, 0.1 and 0.1, respectively.

C. CODING REDUNDANCY
In the proposed methods, the number of final outputs of the network is fixed. Accordingly, the model needs to be retrained to classify new data. To improve the scalability of the network, it is appropriate to set redundancy for network classification before training. The extended classifier affects model performance, but it makes the network easier to retrain. Table 4 gives the number of binary bits of our proposed methods and of existing classification methods. When new data needs to be classified, the trained model can be used as a pre-trained model, which can greatly reduce the training time. We define redundancy in terms of C_b, C_m and C_d, which represent the number of category binary bits in the corresponding classifier, and OC_d, which denotes the number of devices in the training set. Redundancy is a simple measure of the efficiency of the coding. At the same time, the higher the redundancy, the greater its impact on the performance of the classification network.

D. AUXILIARY CLASSIFIER
The proposed methods work well for most models, but for some models, e.g., the D70 and the D70s, the classification performance is not as good as expected. Therefore, we propose an auxiliary classifier to improve on the classification performance of the main classifier. However, this classification method requires a separate classifier for each camera model, which increases the memory requirement. Therefore, we use it only as an auxiliary classifier to re-classify camera categories that the main classifier finds difficult to classify.
As shown in Fig. 6, patches at the same position in different pictures (from the same camera model) are assigned to the same class, i.e., the classification is based on the lens position. Too many patches in the same image reduce the accuracy of the Position Classifier (PC). In our experiment, we set the patch size to 48. Considering the differences in image size across camera categories, we use only the top-left corner of the image for classification. The selected area is defined by C_line and C_Col, which are both set to 14, so that images are divided into 196 categories. Because of the strong coupling between neighboring pixels and the limited training data, the classification performance of the PC is poor. The coupling between adjacent categories can be reduced by setting the stride appropriately when dividing patches. We set the stride to 30 and retrain the network for the comparison experiments. Fortunately, despite the poor classification performance of the PC, the extracted position information still contributes to the Binary Classifier (BC).
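The position labelling can be sketched as follows. The sketch assumes a row-major numbering of the 14 × 14 grid and a parameter `step` giving the distance between the origins of neighboring patches; both are assumptions, and how the stride of 30 reported in the text maps onto `step` is not specified in the source.

```python
# Illustrative sketch only: row-major labels and the `step` parameter are
# assumptions; the patch size 48 and the 14 x 14 grid come from the text.

def patch_grid(step, rows=14, cols=14):
    """List (label, top, left) for every patch origin in the top-left region."""
    return [(r * cols + c, r * step, c * step)
            for r in range(rows)
            for c in range(cols)]


def region_side(patch, step, count):
    """Side length in pixels covered by `count` patches laid out with `step`."""
    return patch + (count - 1) * step


grid = patch_grid(step=48)          # step equal to the patch size: no overlap
side = region_side(patch=48, step=48, count=14)
```

With a step equal to the patch size, the grid holds 196 position classes and the selected top-left region is 672 pixels on each side.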
For any camera category, we first train the PC. The network extracts the camera attribute information better as the classification accuracy of the PC gradually improves. It is worth noting that the neighborhood classification method can extract attribute information at any local position of the camera. However, when images from the same camera pass through the same post-processing (with different processing methods for different camera categories), the location-based classifier may ignore the global information. Therefore, alongside the multi-location classifier, we set up additional networks to extract global information from different camera models. We fix the network parameters of the PC when we train the BC. The input of the BC is the concatenation of three network outputs. More details are given in Algorithm 3.

IV. EXPERIMENTS
A. DATASETS
In our experiments, all data come from the publicly available Dresden database [33]. We select all 74 categories to evaluate the network performance. Following the experimental settings of [13], we randomly divide the data set into training and testing sets, with 70% of the data used for training and the remaining 30% for testing. Network performance is evaluated by calculating the average accuracy over all image blocks from the testing set.
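The evaluation protocol can be sketched with the standard library. The sketch assumes a seeded random shuffle and leaves open whether the split is made per image or per block, which the text does not state; the function names and the toy inputs are hypothetical.

```python
import random

# Illustrative sketch only: the names, the seed and the toy data are
# assumptions; the 70/30 ratio and block-level accuracy come from the text.

def split_dataset(items, train_percent=70, seed=0):
    """Shuffle the items and split them into training and testing lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_percent // 100   # integer arithmetic
    return shuffled[:n_train], shuffled[n_train:]


def block_accuracy(predicted, actual):
    """Fraction of test blocks whose predicted class equals the true class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


train, test = split_dataset(range(10))
accuracy = block_accuracy([1, 2, 2, 0], [1, 2, 0, 0])   # 3 of 4 blocks correct
```

Fixing the seed keeps the split reproducible between runs, which matters when results are compared against [13] under the same settings.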

B. PRE-TRAINING
We first pre-train the network with images from Sony_DSC. The pre-training is used to test the effect of redundant coding on classification performance. To save training time, we crop the images before training: all input images are cut into non-overlapping 48 × 48 blocks.
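The number of blocks produced by this cropping can be checked with a short sketch. It assumes that partial blocks at the right and bottom edges are discarded, which the text does not state; the function names are hypothetical, and the image sizes are the ones quoted in the evaluation section.

```python
# Illustrative sketch only: discarding partial edge blocks is an assumption.

def block_origins(width, height, block=48):
    """Top-left corners (top, left) of all full non-overlapping blocks."""
    return [(top, left)
            for top in range(0, height - block + 1, block)
            for left in range(0, width - block + 1, block)]


def count_blocks(width, height, block=48):
    """Number of full blocks that fit into a width x height image."""
    return (width // block) * (height // block)


blocks_2560 = count_blocks(2560, 1920)   # 53 columns x 40 rows
blocks_2500 = count_blocks(2500, 2000)   # 52 columns x 41 rows
```

These sizes give 2120 and 2132 blocks respectively, consistent with the figure of roughly 2,000 blocks per image reported for the final data set.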
Considering the limited hardware storage capacity, we set the batch size to 96. We use the Extended-Classifier to encode the camera categories. As shown in Table 5, when we train the network with Sony_DSC, the device coding bits contain redundancy, but this has only a small effect (<0.0001) on device classification performance. Therefore, the new encoding method is well suited to classifying camera categories.

C. EVALUATION ON FINAL DATASET
We now train the network with 74 devices from 14 brands. As in the experimental settings above, we first crop the original images into 48 × 48 blocks. If the image size is 2500 × 2000 (it may also be 2560 × 1920 or other shapes), every image contains about 2,000 blocks on average. Using all the blocks as the training set would take too much time, so we randomly extract 60000 images from the training set to train the network and regenerate this training subset in the same way every epoch. We reset the learning rate to 0.001 and adjust it as training proceeds. The final results are shown in Table 7. We implement the proposed network structure in TensorFlow. All experiments are run on NVIDIA GTX 1080Ti GPUs. When testing the camera models, we assign all devices of the same model to the same class. Table 10 shows the confusion matrix of brands, and Table 6 shows the confusion matrix of models. Table 11 shows the classification accuracy of several devices in the data set used in the experiment. Camera models can be classified by either the model classifier or the device classifier. Table 7 shows that, with our proposed method, the model classifier performs better than the device classifier. Meanwhile, our method performs better than the existing methods for both the model classifier and the device classifier.
As shown in Table 6, the proposed classifier has low classification performance for the Nikon D70 and the Nikon D70s. This phenomenon is not unique to our proposed method, so a reasonable explanation is that the two camera models are highly similar. We assign all devices belonging to the same camera model to the same category. The blocks at the same position in different images are assigned to the same class. Each image is divided into 196 blocks, namely, 196 categories. To train the BC better, we need to extract enough camera attributes.
The classification performance of the PC is positively correlated with the features (position attributes) extracted by the classifier. During training, the limited amount of data and the similarity between neighborhoods greatly reduce the classification performance of the network. Over-fitting occurs when the network is trained for too many iterations. At the same time, the attribute information extracted by the PC may not always be helpful to the BC, although it can effectively distinguish different positions in the images.

As shown in Algorithm 3, we use an alternating training method to train the entire network gradually, switching the network being trained every five epochs. Our final results are shown in Table 8. Compared with the original results, the effect of the PC is not obvious. BC classification performance depends on PC classification performance, and the latter is poor (about 0.1, whereas random-guess accuracy is 1/196). However, we believe that as the training data set grows, the performance of the PC will gradually improve. To show the impact of the PC on the BC, we removed the global information extraction network and used only the location information extracted by the PC as input to the BC. As shown in Fig. 7, each alternation includes 5 epochs (for BC training). We report results for 45 epochs; further training does not improve the network performance.
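The alternating schedule can be sketched as a function of the epoch index. The sketch assumes that the PC is trained first (as stated for the auxiliary classifier) and that both phases last five epochs; the text only gives the five-epoch figure explicitly for BC training, so the equal phase length is an assumption, as is the function name.

```python
# Illustrative sketch only: equal five-epoch phases for PC and BC are an
# assumption; Algorithm 3 in the paper is the authoritative description.

def active_network(epoch, period=5, first="PC", second="BC"):
    """Return which classifier is trained in a given 0-based epoch."""
    return first if (epoch // period) % 2 == 0 else second


schedule = [active_network(epoch) for epoch in range(20)]
```

Under these assumptions, epochs 0 to 4 train the PC, epochs 5 to 9 train the BC with the PC parameters fixed, and the pattern then repeats.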

V. THE SCALABILITY
In this section, we test the scalability of the proposed method. The selected categories are shown in Table 9. Samsung_NV15_2, Nikon_D70 and Agfa_Sensor505-x_0 are used to test the network scalability for a new camera device, model and brand, respectively. We train the network with the same coding method and code length as in the experimental settings above. The samples shown in black font in Table 9 are used to pre-train the network. After several iterations, we load the pre-trained network to train on all samples in Table 9. The accuracy curve is shown in Fig. 8. As can be seen, the retraining process needs fewer iterations to reach a higher accuracy, which can greatly reduce the training time for new camera categories. Every 100 iterations take about 52 seconds when the batch size is set to 96.

VI. CONCLUSION
In this paper, we propose a new deep learning approach to the problems of category coupling and image attribute extraction. To this end, image content is extracted by multiple convolution kernels, and subtracting this content from the original images allows the images to be classified better. We adopt a coupling coding method to train the network, which decomposes the multi-class classification problem into a number of binary or ternary classification problems. Meanwhile, redundant coding improves the scalability of the network, and pre-training experiments show that it has only a small effect on classification performance. For several models that are difficult to classify, we propose an auxiliary classifier, and we believe that it can be trained better by setting the crop position appropriately. We evaluate the effectiveness of the proposed methods on the Dresden database. The final experiments show that our method outperforms the existing methods.