Ensembling Shallow Siamese Neural Network Architectures for Printed Documents Verification in Data-Scarcity Scenarios

The popularity of printing devices has multiplied the diffusion of printed documents, raising concerns regarding the security and integrity of their content. The same device that prints reliable contracts, newspapers, and others, can also be used for malicious purposes, such as printing fake money, forging fake contracts, and produce illegal packaging, thus calling for the development of image forensics techniques to pinpoint criminal printed materials and trace back to their origin. Despite some recent advances, previous works model such a problem as a big data-focused closed-set classification problem. In this work, we address the source linking problem of printed color documents by treating it as a verification problem. Specifically, we aim at deciding if two documents have been printed by the same printer or not. To achieve this goal, and to cope with the data scarcity deriving from the difficulty of gathering massive amounts of printed and scanned documents, we propose to use an ensemble of Siamese Neural Networks, with unique architectures expressly designed to work with a small training dataset. As a further unique feature, the proposed approach is suited to work in an open set scenario, where the printers used to produce the documents analyzed at the test time are not included in the training set. Results obtained under both open and closed set conditions, with a thorough comparison with available baseline methods, showed classification performance higher than 97% in the closed set scenario and higher than 86% in the open set case, highlighting the practicality of such approaches in real-world scenarios.


I. INTRODUCTION
Despite the tremendous efforts to replace the number of printed documents with their digital counterparts, printed documents are still common and can be found everywhere. Generating printed documents is very cheap, with printing devices being more accessible and easy to use than ever. This accessibility and ease of use plays a major role in the production of vast amounts of printed documents, as we see every day. From advertisements to currencies, books to newspapers, magazines to contracts, and product packaging, there is always a printing technology involved. With the diffusion The associate editor coordinating the review of this manuscript and approving it for publication was Wentao Fan . of staff-less and cashless stores in big cities [1], there will be only printed data available (such as the QR-CODES) for purchases and interaction with clients, making printing and scanning technologies even more important in a near future.
Notwithstanding the wide diffusion of printed information, the lack of regulation and forensic procedures on printed documents allow counterfeiters and other criminals to use printing technology for malicious purposes. For example, fake currency can be printed and distributed in a neighborhood, thus harming the local economy; pedophiles can print and distribute child porn to avoid the controls over the Internet of security agencies; deceivers can fake badges to access restricted areas, hitting up the organization and the security of meetings and other kinds of events.
Additionally, the analysis of printed documents related to other criminal activities like corruption or terrorism may help to trace back to crime perpetrators.
In addition to the previous issues, modern printing and scanning technologies have made counterfeiting easier and more profitable than ever, as counterfeiters can perfectly copy and print packages of fake products that are visually identical to the original ones. Acknowledging such a problem, the International Chamber of Commerce has raised an alarm of 3.7 trillion loss due to counterfeiting and piracy, with 5.4 million jobs at risk by 2022 [2]. Finally, counterfeiting has also a significant impact on health: according to the World Health Organization, up to half of the malaria medications could be fake [3].
As an answer to the above needs, several works have focused on present Printer Source Linking solutions using Computer Vision and Machine Learning to pinpoint the ownership of printed texts [4]- [23], color images [24]- [34] or both [35]- [37]. In particular, the approaches based on Convolutional Neural Networks (CNNs) and Deep Learning (DL), in general, [19], [21], [37] have allowed significant progress in this research area, in special due to their ability of learning from the data itself, artefacts specific to given printers. CNN-based solutions usually have better performance than previous image/noise description approaches for printer attribution.
When applied in real-life scenarios, however, CNN-based approaches have the following significant limitations: (i) as far as we know, there is no study in the literature considering the fact that the source of an unknown document is not included in the training set. Such an open-set setting is the most probable one in the real world and the lack of methods suitable to work in such a scenario limits strongly the practical usability of CNN techniques proposed so far; and (ii) DL models require a huge amount of data (printed and scanned documents) for training, especially when very deep networks with a huge amount of parameters are used. However, the dataset generation process is not only expensive (given that printer inks/toners are among the most expensive liquids in the world [38]), but it is also a very time-consuming task, requiring dedicated personnel and very strict procedures.
In this paper, we propose to deal with the previous limitations by treating the source linking problem as a verification problem. According to such an approach, any new sample is compared against a reference document, producing a score indicating how similar the unknown sample and the reference template are. Given that the network is trained to decide whether two printed documents have been generated by the same device, the samples analyzed at test time do not need to been generated by one of the printers used during training, thus opening the way to the use of the network in an open-set scenario. Indeed, a similar approach has been conveniently adopted in biometric recognition [39], where one person is allowed to access a system if his/her biometric traits are similar to anyone in the dataset of enrolled users, even when the system has not been trained to directly ''recognize'' the biometric traits of the users. However, In contrast to the biometrics scenario, where visually-relevant traits are compared, the answer to the question: ''have two documents containing different or even the same content, been generated by the same printer?'', requires the analysis of subtle, texture-like patterns produced by different printers, possibly neglecting the semantic of the images. The answer to such a question can be useful in several forensics and anti-counterfeiting applications. As a first example, we may consider a situation where a printed child pornography picture is found and someone is suspected to have printed it. The investigators can print other pictures with the printer of the suspect person and detect if the criminal document has been printed by the same printer.
Anti-counterfeiting provides another scenario fitting our solution. Assuming that authentic packages or labels can be printed only by a pool of authorized printers. If, in a suspected counterfeited merchandise, the printing artefacts of the barcode (or any other pattern) printed on the package do not match the patterns present in the original (templates) barcodes printed by an authorized printer, the product is classified as being counterfeited. Finally, the manipulation of printed documents is another interesting application of our research. As the verification network is trained to find similarities, or dissimilarities, in printed patterns, if a document (e.g., a contract) is counterfeited by replacing some pages with fake ones printed by a different printer, then the fake document can be revealed by verifying that all the image pairs of the documents have been printed by the same printer.
In order to develop a printed document verification system, expressly thought to cope with the data scarcity problem typical of such a scenario, we propose to use an ensemble of Siamese Neural Networks (SNNs) with novel architectures characterized by shallow-but-wide topologies. In particular, we tested the performance of two different ensembles: one with diverse SNN topologies and another consisting of networks having the same topology, but with some tweaks in the SNNs parameters. The results we got in the closed set and, most importantly, in the open set scenarios built based on a challenging realistic dataset, demonstrate the promising performance of the proposed solution.
In summary, the contributions of this paper are: 1) We propose the use of Siamese Networks for printed documents attribution, treating it as a verification problem. This approach marks a significant difference concerning state-of-the-art solutions, which mostly treat such a problem as a classification one. 2) To deal with the availability of small training datasets, we propose several shallow-and-wide SNN architectures, that can be trained also on small amounts of data. 3) We propose two different ensemble strategies with and without diversity in the network topologies, achieving different, and somewhat complementary results. 4) As far as we know, this is the first study dealing with open set source linking of printing devices.
Last but not least, we compare all the proposed solutions against several baseline techniques, in both open and closed set conditions. The remaining of this paper is organized as follows: in Section II, we discuss some progress of the related work in detail. Section III describes our printer verification solution composed of SNNs ensembles. Section IV details the experimental procedure adopted to validate the solutions for the printer verification problem. Section V reports the results of the experiments we carried out and, finally, Section VI concludes this paper highlighting future research directions and open problems.

II. RELATED WORK
Several previous works have investigated artifacts that printers leave in the generated documents in order to identify their source. Although several other works have worked on physical, microscopical, and chemical techniques [40], most of them are destructive methods that require specialized staff and expensive material [35]. Therefore, we focus our discussion on methods based on machine learning and computer vision, as they are cheaper, faster, and simpler. For the task of pinpointing the owner of printer documents, most of the surveys in the literature [41]- [44] divide them into the following branches: (i) approaches focused on text documents source linking; (ii) approaches for colored documents source linking; and (iii) approaches for any kind of document. Several other techniques have been proposed to identify among different printing technologies (e.g., inkjet, laser, etc.) [45]- [49], however, in this section, we focus on linking the laser printer source of any kind of documents. The literature has reported two possible clues that can be used to pinpoint the source of a text document. The first one is the extrinsic signatures. These signatures are introduced by the printing process in order to explicitly input the printer biometrics in any printed document. Such signatures are investigated by active forensic methods, and some examples are the embedding code sequences in electrophotographic halftone images [50] and also machine identification codes [51]. Approaches based on extrinsic signatures are usually expensive because they require expensive modifications of the printer. The main disadvantages of techniques based on extrinsic signature are the fact that some of these signatures can be hidden from the printed material [52] and that they are not adopted as a standard for all printer brands.
The second branch of investigation analyses intrinsic signatures through passive forensic methods. Intrinsic signatures are unintentionally injected into the printed document due to electromechanical printer devices imperfections. The most investigated intrinsic signatures are banding, jitter, and skewed jitters [25]. As the artifacts investigated by such a branch of research are difficult to erase, the literature has been mostly focused on such signatures, which are also the focus of our proposed solution.
In the case of text documents, intrinsic signatures are not so explicit given that the printed data is very small, and such artifacts are often visible only employing a microscope [35]. Due to this limitation, literature solutions have been focused on character extraction and texture description of such images [27]. For example, in one of the first works in this regard, Ali et al. [4] extracted ''I'' letters from printed text, using the raw pixel values in a multi-class machine learning algorithm to discriminate different printers intrinsic artifacts. Given an unknown document, majority voting of individual ''I'' letters classification defines the printer source of the document. Inspired by this solution, several works have followed a similar path. Some examples are the Gray Level Co-occurrence Matrices (GLCMs) features from Mikkilineni et al. [5]- [7], [12] and Ferreira et al. [35], features from Distance Transform from Deng et al. [9], features from Discrete Cosine Transform from Jiang et al. [11], statistics of GLCMs, residual noise and Wavelet Transform features from Tsai and Liu [13], Tsai et al. [15], [18] and Elkasrawi and Shafait [14], ad-hoc texture descriptors from Joshi and Khanna [20], [23], SURF and ORB features from Kumar et al. [53], and geometric distortions signatures from Jain et al. [22]. Finally, the use of deep neural networks from Ferreira et al. [19] and their extension from Joshi et al. [21] proved their ability to learn better the features from the data itself when sufficient data is used for training.
In the case of colored documents, which is the focus of the research presented in this paper, intrinsic artifacts are more frequently found as more areas in the paper are printed. One of the first works in this direction comes from Ali et al. [24], who extracted from image patches Fourier transform features to discriminate color laser printers through different banding frequencies. Eid et al. [25] looked for jitter artifacts with Gabor filtering and Discrete Fourier Transform. Choi et al. [27] looked for specific banding artifacts by calculating noise features from Discrete Wavelet Transform sub-bands in RGB and CMYK channels. Later, the same authors estimated noise with Wiener filtering and Gray Level Co-occurrence Matrix statistics [28]. Tsai et al. [30] extracted features from different Discrete Wavelet Transform subbands, adding to the pipeline feature selection. Other important techniques for color documents source attribution involve describing geometric distortions [26], [33] and halftone texture descriptors [29], [32], [34].
The last branch of approaches is those that can deal with any kind of document, being it a color-image or a text document. For example, the work of Ferreira et al. [35] proposed texture descriptor approaches based on GLCMs, together with a Convolutional Texture Convolutional Filter approach to be applied on frames, which are areas of the printed paper with sufficient printed material. Tsai et al. [36] applied several filters to the documents, extracting features from GLCMs, Discrete Wavelet Transform, spatial filters, Gabor filter, Wiener filter, Gray Level Co-occurrence Matrices features, and fractal features. In recent work, Nguyen et al. [54] printed eight different patterns and investigated the printer attribution using shape descriptor indexes features on Support Vectors Machines and Random  Forest classifiers to perform printer attribution. Finally, Bibi et al. [37] reached state of the art results by using convolutional neural networks.
Notwithstanding the progress made in the last 20 years, there are still several gaps and limitations that should be explored, including: (i) solutions proposed so far treat laser printer attribution as a classification problem, neglecting the possibility to treat it as a verification problem; (ii) the datasets considered in the research are quite outdated, without professional printers being considered; and (iii) as far as we know, the open set laser printer attribution has not been explored in the literature. In the next section, we detail our solution to tackle such limitations.

III. PROPOSED METHOD
In this paper, we propose to treat the printed document source linking problem as a verification problem. To do so, we train an ensemble of Siamese Neural Networks on a dataset of printed images. Siamese networks can learn similarities from pairs of images to minimize the distance of similar pairs by sharing the same weights in the two networks. The general pipeline of a Siamese neural network is reported in Figure 1.
Siamese Networks have the following advantages over common CNNs for our specific problem: (i) they learn to find similarities or dissimilarities between different documents according to the printing sources, learning to detect if the patterns introduced during the printing process originate from the same printer or not; (ii) Siamese Neural Networks are also known for their one-shot learning capabilities [55]- [57], not requiring many samples for successful learning; and (iii) since they learn to match -or unmatch -similar pairs no matter their class labels, they can be used in an open set modality to also classify samples produced by printers that do not belong to the training set.
The pipeline of the proposed method is illustrated in Figure 2. During the training stage, patches of interest are extracted from a dataset of known printers, and then several Siamese Networks are trained so to let them learn to minimize the distances between patches from the same printers and maximize them when the patches have been generated VOLUME 9, 2021 by different printers. After the models have been trained, we ensemble them in two manners: (i) without diversity, where we tweak only some parameters of a unique SNN architecture; and (ii) with diversity, where completely diverse SNN topologies are considered. During testing, unknown test patterns are compared by these networks with a template generated by a printer that does not necessarily belong to the training dataset. Each pair of patches is classified according to a majority voting procedure. Finally, the second round of majority voting, this time merging the results obtained on every single patch, is applied to decide on the source of the analyzed images. In the following subsections, we detail each of the above steps of our proposed ensemble. For reproducibility and open access purposes, the source code of our approach can be found at GitHub. 1

A. PRE-PROCESSING (ROI EXTRACTION)
Most of the solutions in the literature consider source linking in specific regions of the printed material, where the artifacts investigated are expected to be more visible: Ali et al. [24] searched for letters ''I'', Kee and Farid [8] performed their investigation by focusing only on letters ''E'', and Ferreira et al. [35] searched for artifacts on areas with sufficient printed material.
In this paper, we follow a path similar to that adopted in [58], focusing on the top-n high energy patches of printed documents after Canny filtering binarization. Selecting pairs of patches in this way has a threefold advantage: (i) more relevant information about printer sources is acquired, as the top-n high energy patches are usually noisy areas containing more information about printing patterns (edges/banding with high energy); and (ii) they can generate significant training data of different printers from a relatively small dataset of printed documents, as one printed document is now represented by several high energy patches; (iii) The binarization and edge extraction applied by Canny filtering helps to remove the semantic content of the images, given that all the edges are equally weighted thanks to binarization.
The first step for the detection of high-energy patches consists of the application of a Canny filter. Such a filter initially applies a Gaussian filter to images. The Gaussian filter F of size (2k + 1) × (2k + 1) is defined as follows: where σ is the standard deviation of the Gaussian distribution. After Gaussian lowpass filtering, the smoothed image is filtered with a Sobel kernel in both horizontal and vertical directions to get first the derivative in the horizontal (G x ) and vertical (G y ) directions. The derivatives are computed by applying to the images two 3 × 3 kernels, defined as follows: 1 https://github.com/anselmoferreira/siamese4PrintedVerification After such convolutions, the magnitude G and the direction θ of the gradient are computed: Then, the edges are processed by two final steps. The first one, commonly called non-maximum suppression, eliminates thin edges. To do that, the strength of each edge pixel is compared with the edge strength of the neighboring pixels in the positive and negative θ gradient directions calculated in Equation 3. If the edge strength G of the current pixel is the largest, such a value is preserved, otherwise, it is set to zero. The second step uses one low and one high threshold value to filter edges. The following comparisons are then made using the two thresholds: (i) if a pixel gradient is higher than the upper threshold, the pixel is an edge detected; (ii) if a pixel gradient value is below the lower threshold, then it is rejected; and (iii) if the pixel gradient is between the thresholds interval, then it is accepted only if it is connected to an accepted edge (a pixel whose gradient is higher than the upper threshold, or one that passed condition (i)). The low and high thresholds are automatically generated and are image-dependent so, in our system, we used the median intensity of the gray-level imageĨ to generate the intervals of thresholds as follows: where δ is used to vary the thresholds using a percentage of the medianĨ . In our system, we used δ = 0.33, a value that works well in many object detection applications [59].
After the above operations have been completed, we divide the resulting binary image into squared blocks of varying sizes. We choose squared blocks of 28 × 28, 64 × 64, 224 × 224, depending on the size of the input accepted by the SNNs used in our system. Then, we apply a one-level Discrete Wavelet Decomposition (DWT) using Daubechies family to calculating the energy E of each image patch by considering the horizontal detail (H ), vertical detail (V ), and diagonal detail (D) sub-bands of the DWT as follows: where N is the number of values in the sub-bands of the DWT and M is the size of the squared patches. We chose to compute the energy in the DWT domain because DWT sub-bands contain useful information about edges in different directions, as noted by several previous works [13], [27], [30], [36]. After the energy is calculated for all patches, we select only the n patches with the largest energy of each printed image to train and test the Siamese networks. The areas with high energy contain enough information for printer source linking based on noisy/banding artifacts.
When we apply such a procedure to digital images, we also tend to find areas with strong sharp edges. However, for the case of printed documents, flat areas can also be selected. This happens because, by ranking the energy of binarized Canny filtered images, the algorithm searches areas with many edges, instead of extracting areas with the sharpest edges. This is useful when the noise of the printer is hidden in the background or flat areas. As the halftone printing patterns and printing imperfections from printers can introduce several new edges in the form of noise in the background, the proposed patch extractor can help in the forensic analysis by isolating such areas for further investigation. To exemplify such an effect, in Figure 3, we show the background of a digital and printed version of the same image. In the printed version, more halftoning and noisy artifacts are visible, which can be used for printer attribution.

B. APPLYING SIAMESE NETWORKS ARCHITECTURES TO HIGH ENERGY PATCHES
In our method, we take further advantage of SNNs characteristics by training them on a small training set. Inspired by previous works on steganalysis [60], [61] we propose to use shallow-but-wide architectures, where multiple and diverse filters are applied to the same outputs of the previous layers, with their final outputs fused and forwarded through the network. Specifically, we propose three novel SNNs architectures based on this idea as discussed in the following paragraphs.

1) MULTI-CONVOLUTION SUMMATION SNN (MCS-SNN)
Our first architecture is inspired to the the Residual Convolutional Neural network, also known as RESNET. Proposed by He et al. [62], RESNET uses a residual module consisting of two convolutional layers, where the output of the second layer is summed to the input of the first one through a bottleneck operation.
The MCS-SNN model is a shallow-but-wide SNN that uses multiple convolutions modules, where different feature maps are built in parallel, and further fused and forwarded to the rest of the network. Part of the topology of one sister MCS-SNN is illustrated in Figure 4. Every input is processed by three different convolutional layers, then the output maps are summed together and forwarded to an output convolutional layer. This procedure is repeated three times, and then a global average pooling is applied to build a pair of 48-dimensional vectors indicating the similarity or dissimilarity of a pair of printed image patches. Concerning RESNET, MCS-SNN is wide, simpler, and shallower. In addition, we neither apply bottleneck nor skip connections, making such a network more effective and fast for printer documents verification in data scarcity scenarios.
Similar to the other architectures that will be discussed later, the MCS-SNN is trained by applying the Contrastive Loss function. Such a loss is more effective than the common Binary Cross Entropy as the goal of SNNs is not to classify sample pairs, but to differentiate between them. Essentially, such a loss evaluates how well the SNNs are distinguishing image pairs using the following formula: where Y p is the label of the training pairs images (0 if the pair comes from different printers and 1 otherwise) and D p is the Euclidian Distance between the image pairs given by the 48-D output vectors of the SNNs. Finally, the margin m defines a radius around the sample space in such a way that dissimilar pairs only contribute to the loss function if D p is within the margin. In particular, we set m = 1, so that only distances between 0 and 1 will contribute to the loss. By training the networks with such a loss, pairs of images coming from the same printer, have lower distances, while images generated by different sources have large distances. Then, according to a threshold value T (we set it to 0.5), we verify if the pair of documents have been printed by the same printer or not.
We instantiated four different networks with this specific architecture, whose details are given below: 1) MCS-1: it contains the same number of filters (32) in all layers, but in the multiple convolution summation modules we apply 3× 3, 5 × 5 and 7 × 7 convolutions.
The receiver convolutional layer has 3× 3 filters, and we apply the LeakyRELU [63] activation in all layers. 2) MCS-2: it contains the same number of filters (32) in all layers, but in the multiple convolution summation modules we apply 3× 3, 5 × 5 and 7 × 7 convolutions. In the multiple convolutions summation modules, we have different activations for each layer inside the multiple convolutions summation modules: LeakyRELU, RELU, and sigmoid. The receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions through the different modules, and we apply the LeakyRELU [63] activation to their outputs. 3) MCS-3: it is somewhat similar to MCS-1Ṫhe differences are: (i) the receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions through the different modules, and they also have 32, 64, and 96 filters. 4) MCS-4: it contains the same number of filters (32) in all the multiple convolutions summation modules layers, and we apply 3× 3, 5 × 5 and 7 × 7 convolutions in every multiple convolutions summation module. However, the activation is fixed to be only LeakyRELU [63], RELU, or Sigmoid for each module.
The receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions, and they also have 32, 64, and 96 filters, all of them activated by LeakyRELU [63]. All the networks (as all the others that will be presented in the remaining of this section) work on 64 × 64 × 3 top-energy input patches, extracted as explained in Section III-A.

2) MULTI-CONVOLUTION INCEPTION SNN (MCI-SNN)
The second shallow but wide set of Siamese Networks is based on the Inception modules proposed by Szegedy et al. [64]. The SNN with Inception modules uses modules of parallel convolutional layers with filters of different sizes. The output feature maps are concatenated, as illustrated in Figure 5, and passed through the rest of the SNN. These SNNs are somewhat similar to the MCS-SNNs, with the difference of performing concatenation of feature maps instead of summation. As detailed below, we also created four networks with this specific architecture: 1) MCI-1: it contains the same number of filters (32) in all layers but in the multiple convolutions inception modules we apply 3× 3, 5 × 5 and 7 × 7 convolutions. The receiver convolutional layer has 3× 3 filters, and we apply the LeakyRELU [63] activation in all layers. 2) MCI-2: it contains the same number of filters (32) in all layers but in the multiple convolutions inception modules we apply 3× 3, 5 × 5 and 7 × 7 convolutions. In the multiple convolutions inception modules we also have different activations for each layer in a given multiple convolutions inception module: LeakyRELU [63], RELU, and sigmoid. The receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions through the different modules, and we apply the LeakyRELU activation to their outputs. 3) MCI-3: it is somewhat similar to MCI-1. The differences are: (i) the receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions through the different modules, and they also have 32, 64, and 96 filters. 4) MCI-4: it contains the same number of filters (32) in all the multiple convolutions inception modules layers, and we apply 3× 3, 5 × 5 and 7 × 7 convolutions in every module. However, we let the activation be LeakyRELU [63], RELU or Sigmoid. The receiver convolutional layers have 3× 3, 5 × 5, or 7 × 7 convolutions, and they also have 32, 64, and 96 filters, all of them activated by LeakyRELU [63].

3) SHALLOW XCeption SNN (SX-SNN)
The last set of Siamese Neural networks is inspired by the XCeption architecture [65]. Such architectures have in common point-wise and depth-wise convolutions. Firstly, ordinary 1×1 convolutions are performed, then channel-wise spatial convolution is applied (depth-wise convolution). This architecture evolved from the INCEPTION-v3 [64] network with two differences: (i) no ReLU is applied after depth-wise (channel-wise) convolutions; and (ii) the operations in the XCeption architecture are applied in inverse order. Such adaptations have been found to be faster to train than than INCEPTION-v3 (having less parameters), and to provide a higher accuracy on Imagenet [65], [66].
Point-wise convolutions are a special kind of convolution, with a size of 1 × 1 × N . 1 × 1 convolutions perform convolutions over the channels of an input image, an operation also called channel-wise pooling or feature map pooling. Such filters contain one single parameter, or weight, for each input channel and are then well-suited to summarize the input feature maps. After the point-wise convolutions, another convolution is applied to each channel of the feature map created previously. The result of such operations are added to the output of another ordinary convolution carried out in parallel through a bottleneck operation such as the one in RESNET [62] CNN.
Our XCeption-based Siamese networks follow the original pipeline of the XCeption network, which is composed of flows (or modules), which are pieces of networks with their specific layers, bottlenecks, separable convolutions, maxpooling, activation, inputs, and outputs. One of these flows (the middle flow) is repeated several times, and in the end, all the flows are stacked in such a way to form a unique CNN. The main differences between XCeption and our architecture are (i) the size of the input images is 64 × 64 × 3; and (ii) we control the depth of the SNN in our ensemble, by proposing several variations of shallow XCeptions as follows: 1

C. ENSEMBLES OF DIVERSE AND HOMOGENEOUS SNNs
Ensembles is a popular approach in machine learning according to which several complementary models are applied trusting that even the weakest models can help to classify some samples that are commonly mistaken by the others. The power of ensembling has been validated in several works in diverse applications [67]- [70] in the literature.
In this paper, we ensemble different versions of the SNN architectures described in the previous section. In particular, we apply a two-fold majority voting fusion scheme, as illustrated in the rightmost part of Figure 2. In the first fusion stage (patch-wise fusion), we apply several SNN models to classify each pair of patches as coming from the same printer or not. The decision on each pair is made by looking at the most frequent class given by the SNN ensemble. The second round (document-wise fusion) considers the decisions made on all the patch pairs to decide if a pair of documents comes from the same printer or not. Such a fusion, also known as late fusion [71] has commonly been applied in several previous works in the printer source attribution literature [4], [8], [19], [35].
To enrich the ensembling analysis and performance, we investigate two different ensembling methods in the patch-wise classification. In the first one, which we call Homogeneous ensemble, we apply the same architecture (either MCS, MCI, or SX) with their unique parameters and topologies and evaluate the performance of the ensemble when they are combined three by three (we chose an odd number to avoid ties in the majority voting). By Evaluating hundreds of unique combinations of three classifiers with the same architecture but trained with different weight optimization methods, we found that that the best performance is achieved with the following combination of SNNs: 1) MCS-2: It contains the same number of filters (32) in all layers, but in the multiple convolutions summation modules we apply 3× 3, 5 × 5 and 7 × 7 convolutions. In the multiple convolutions summation modules, we also have different activations for each layer in a given multiple convolutions summation module: LeakyRELU, RELU, and sigmoid. The receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions through the different modules, and we apply the LeakyRELU activation on their outputs. Such an SNN is trained with the ADAMAX weight optimizer [72]. 2) MCS-3: It contains the same number of filters (32) in all layers, but in the multiple convolutions summation modules we apply 3× 3, 5 × 5 and 7 × 7 convolutions. The receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions through the different modules, and they also have 32, 64, and 96 filters. Such an SNN is trained with the RMSPROP optimizer [73]. 3) MCS-4: It contains the same number of filters (32) in all the multiple convolutions summation modules layers, and we apply 3× 3, 5 × 5 and 7 × 7 convolutions in every module. However, we keep the activation as being only LeakyRELU, RELU, or Sigmoid for each module. The receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions, and they also have 32, 64, and 96 filters, all of them activated by LeakyRELU. Such an SNN is trained with the NADAM optimizer [74]. Throughout the rest of the paper, we refer to the above ensemble as HOM-ENS.
In the second patch-wise verification ensemble method, we fuse the output of different SNN architectures. By selecting three unique models belonging to the MCS, MCI, and SX classes, we investigated a total of 4096 unique diverse ensembles with all possible combinations of weight optimization methods. Based on our experiments, we selected an ensemble containing the following models: 1) MCS-3: it contains the same number of filters (32) in all layers, but in the multiple convolutions summation modules we apply 3× 3, 5 × 5 and 7 × 7 convolutions. The receiver convolutional layers have 3× 3, 5 × 5 and 7 × 7 convolutions through the different modules, and they also have 32, 64, and 96 filters. Such an SNN is trained with the RMSPROP optimizer [73]. 2) MCI-3: its an SNN similar to MCS-3, but with inception modules instead of summation operations. Such an SNN is trained with the ADAM optimizer [72]. 3) SX-4: its a shallow-based XCeption architecture with seven blocks in the middle flow (instead of eight from the original architecture). Such an SNN is trained with the RMSPROP optimizer [73]. We name such a heterogeneous ensemble as HET-ENS as it consists of different sister Siamese networks specifically thought for printed documents verification.

IV. EXPERIMENTAL SETUP
To assess the effectiveness of the proposed SNNs and their ensembles for the problem of close and open set printed documents verification, we performed a series of experiments further explained in this section. In the following subsections, we discuss the dataset, methodology, baselines, and metrics considered.

A. DATASET
Our work has been validated on the second version of the VIPPrint dataset [58]. Such a dataset has a challenging nature as it contains pictures of human faces printed by several modern and professional color laser printers. We consider the original set of eight printers presented in such a dataset which are further described in Table 1. The first set of experiments aims at analyzing the behavior of Siamese Networks for printed documents verification in a closed set environment.
In the second set of experiments, we make one step further, addressing the capability of the proposed architectures for open set source linking. To this aim, we printed the same digital images of the VIPPrint dataset with four new printers, contained in a second version of the dataset [75], and we tested the performance of a system trained on the first printers on these new printers. The printers we used are professional and large-scale printers such as those used in print shops. From the analysis of Table 2, we can appreciate the difficulties presented by this new scenario: (i) there are printer brands that were not present in the training set (such as CANON and Ricoh printers); and (ii) new printing resolutions are also present which were not represented in the training set (e.g., 1200 × 1200 dpi and 2400 × 2400 dpi). For the acquisition process, we used the same scanner, a TaskAlfa 3551 multi-functional printer scanner with 600 x 600 dpi resolution at the highest possible sharpness. The images are saved in a lossless compression configuration and used to feed and evaluate the accuracy of the proposed and baseline classifiers.

B. METHODOLOGY
To train the SNNs we selected 800 printed images (100 from each printer) and extracted the top-10 high energy patches from all of them (see Section III-A). For each block in the  training set, we semi-randomly 2 selected another block from the same printer and another block from a different printer, thus building two pairs of similar and dissimilar blocks. In Figure 6, we show some examples of image patches coming from the same printer or different printers. This procedure results in about 16,000 pairs of images to train the SNNs. In a testing scenario, we used other 800 documents printed by the same printers, but containing different printed patterns (i.e., different faces). Then, we randomly selected (with a fixed random seed as stated previously) 400 pairs of documents for testing. As in a real-world scenario, there will be more pairs of documents coming from different printers, so we investigate the closed set behavior considering 357 pairs of documents coming from different printers and 43 pairs of documents coming from the same printer.
We opted for such a random choice of the pairs to avoid any bias coming from a manual selection.
In the open set scenario, we would like to analyze the behavior of the SNNs when facing pairs of documents coming from unknown printers. For such a task, we initially considered pairs of documents coming from the same unknown printers. In such a testing scenario, we printed the same 200 images in the original dataset with the four new unknown printers, obtaining 100 semi-random pairs of documents from the same unknown printers for models evaluation. In a second set experiment, we also select semi-random pairs of documents from all possible combinations of different printers, evaluating the probability that documents coming from different printers are judged as being printed by the same printer.

C. BASELINES
We compare our approach against nine baseline approaches, some of them based on Siamese Networks others relying on different technologies. The first baseline SNN is inspired by one CNN applied on the MNIST dataset [76]. This is the shallower network we have analyzed since it contains only two convolutional layers, followed by max pooling. Finally, we adopted a dropout training strategy after the max-pooling layer. This network, similarly to ours, is trained from scratch on a dataset of top-10 energy patches.
Finally, specifically for the closed set scenario, we consider a handcrafted feature approach, used together with K-Nearest Neighborhood classifiers. According to this approach, which we call LBP-KNN, pairs of test images are described through Local Binary Patterns [81], and features from each image are classified individually by a KNN classifier. The KNN classifier is previously trained with LBP features from the 8 closed set classes. If the features from a pair of images under investigation lie in the same cluster, we define them as coming from the same printer and different printers otherwise. We performed a grid search on the number of possible neighbors in the KNN classifier, selecting them on the set of possible neighbors.

D. METRICS
In this section, we present the metrics used to validate our approaches against the baselines. For a better understanding of the metrics shown in the following subsections and also the discussion present in the remaining of this paper, we define the positive class as the class with pairs of documents coming from the same printer, and the negative one as the class with documents printed by different printers.

1) SPECIFICITY
For binary classification problems, the specificity, also known as true negative rate, indicates the percentage of correctly classified negative samples and is calculated as: where TN (True Negatives) represents the number of samples correctly classified as negatives, and FP (False positives) is the number of negative samples wrongly labeled as positive.
In our binary classification problem, the sensitivity metric measures how many pairs of documents created by different printers were correctly detected as such.

2) RECALL
For binary classification problems, the recall, also known as true positive rate, indicates the percentage of correctly classified positive samples and is calculated as where TP (True Positives) represents the number of samples correctly classified as positives, and FN (False Negatives) is the number of positive samples wrongly labeled as negative. In our binary classification problem, the Recall metric measures how many pairs of documents created by the same printer were correctly detected as such.

3) PRECISION
With this metric, we want to know how many samples classified as positive are indeed positive. That is where FP (False Positives) is the number of samples incorrectly classified as being positive (from the same printer).

4) F-MEASURE
This is the most important metric for unbalanced problems and thus will be used to rank our results. The F-measure (F) calculates the harmonic mean of precision and recall as follows: such a metric, like all the others considered, is normalized between 0 and 1 (1 being the best value).

5) GEOMETRIC MEAN
In the binary classification case, Geometric Mean can combine Sensitivity and Specificity into a single score that balances both concepts in the following manner: The Geometric mean combines the true negative rate and the true positive rate at one specific threshold of the SNNs VOLUME 9, 2021 TABLE 3. Closed Set experiments details and results. Approaches are trained using the ordinary Binary Cross Entropy (BCE) or the Contrastive Loss (CL) and we report the losses with the best results. The comparative study uses metrics specifically thought of unbalanced problems, namely: (i) the Balanced Accuracy (BACC); (ii) G-Mean (GMEAN); and (iii) f-measure (F1). The performance of the proposed approaches are boldfaced while the best results are highlighted in yellow. Results are sorted according to the F1 measure. similarity score (in our case, we use 0.5 as threshold according to the contrastive loss discussed in Section III-B1). This formula has the beneficial property of averaging out both scores penalizing unbalanced pairs.

6) BALANCED ACCURACY
The balanced accuracy gives the quality of detection based on the mean performance on the positive and negatives classification, or The balanced accuracy is the most recommended metric when considering unbalanced testing scenarios and was validated in several works in the literature where such an environment is found [82], [83].

V. EXPERIMENTS
In this section, we discuss the results we have got by validating our proposed method against the baselines detailed in the previous section. We first consider the closed set scenario, then we pass on the more challenging case of open set verification.

A. CLOSED SET PRINTED DOCUMENTS VERIFICATION
We start validating our approaches in a closed set scenario with the methodology explained in Section IV-B. From the results in Table 3, we see that deeper networks in a Siamese setup do not work properly for printer verification. In particular, we highlight the bad results of EFFICIENTNET-B0 and EFFICIENTNET-B7 for such a problem. Such networks were applied successfully in some digital image forensic works [84], [85], however, in the Siamese setup such networks provide bad performance due to their deep structure and complexity which would require very large training sets. Training printed documents data-driven models on large datasets, however, is a very expensive and time-consuming task, thus suggesting the use of fewer complex models.
Our second set of experiments included some SNNs of interest as they are less complex being suitable to run also on mobile devices: the MOBILENET and NASNET-MOBILE based SNNs. Both these networks do not show good results being designed and pre-trained for other applications. In Table 3 their results are similar or even worse than random classifiers, even though they have been pre-trained and their generalization capabilities were supposed to increase due to data augmentation. In the same regard, results in Table 3 show that IMAGENET-based classifiers that are better than random guessers are those that include specific modules such as the dense modules (DENSENET-201) and those at the basis of our MCS-based approaches (residual modules from RESNET-50). Results in Table 3 show that the best IMAGENET-based SNN does not help in the identification of the source of the printed documents, with a specificity of 0.79 and a recall equal to 0.48. Such results further prove the non-transferability of the IMAGENET dataset weights to the printer verification problem.
The results we got highlight the better results of nonpre-trained methods. Our specific LBP-inspired approaches used as baselines (LBP+KNN-224 and LBP+KNN-64) outperform the IMAGENET SNNs by a large margin, but with a performance that are still unsatisfactory. We found that using 224 × 224 blocks for this task is a little better, with an almost perfect specificity (0.99) and a small false alarm probability (0.86 precision). However, such an approach has the following drawbacks: (i) it is an easy approach to be attacked given its small recall; and (ii) given the fact that it uses nearest neighborhoods in a known number of clusters, it cannot be used for the open set case.
We finish the analysis of the baselines by discussing the results of the shallower and simpler methods used in the experiments. The MNIST SNN showed a better balance between precision and recall, resulting in the highest F1 measure (0.71) among the baselines. Such better performance comes from the fact that this SNN is trained from scratch on our dataset and because of its simplicity that makes it easier to train on a small dataset involved. However, such an approach also has an unacceptable false alarm rate (0.55 precision). Finally, the top results in Table 3 highlight the promising performance of our shallow but wide SNNs and their ensembles. Our best individual SNN (MCS-3) shows an almost perfect accuracy of 0.98, classifying correctly all the documents from the same printer and misclassifying only 4% of documents from different printers. However, as the problem is highly unbalanced (there are more documents from different printers than from the same), such an approach shows the fourth better precision (0.74). For sake of completeness, in Table 4 we report the confusion matrix of MCS-3.
Some interesting findings were discovered by analyzing the semi-randomly generated pairs of testing documents and their misclassification/classification performance by the best proposed individual method. First, all of the misclassified documents come from the same pairs of printers: the Kyocera Color Laser (printer #2 in Table 1) and one individual sample of the Kyocera Task Alfa 3551 (printer #3 in Table 1). Such difficulty and discriminating such printers are in line with previous work in the same dataset [58]. Second, the semi-random pairs selected for testing contained five pairs of the same picture, but only one was incorrectly classified (the one that contained the pair of printers discussed before). Such problems are only partially alleviated with the ensembles containing twins Siamese networks (HOM-ENS) which minimized partially such a confusion of these two printers (exactly one document is now classified correctly), and no improvement was done with the diverse ensemble (HET-ENS). As such a scenario is easier and the individual best classifier got already an almost perfect result, the real importance of such ensembles will be further highlighted in the open set scenario discussed in the following subsections.

B. OPEN SET PRINTED DOCUMENTS VERIFICATION
We start the performance investigation in the open set scenario by considering the case of documents produced by the same printer, analyzing the recall or true positive rate. Then, we analyze the false alarm rate obtained by feeding the networks with documents coming from different printers. We show and discuss such results in the next subsections.

1) OPEN SET RECALL ANALYSIS
We start by evaluating in Table 5 the performance of some of the approaches presented previously when tested on pairs coming from the same but unknown printers (i.e, printers from Table 2).
The first issue we notice is the bad performance of SNNs based on IMAGENET, with the best result being achieved by RESNET-50. These networks face several difficulties when applied to the problem herein: (i) they were trained and structured for another kind of application; (ii) they are too deep for the amount of training data we can rely on; and (iii) they were never validated in an open set scenario. As we will see, these networks always predict any pair of documents as coming from different printers.
For the proposed methods, we see that the multiple convolution summation SNN in its third version (MCS-3) classifies almost perfectly all the pairs coming from the OKI (printer #9) and CANON (#11). The CANON printer is a professional printer usually employed by print shops, making these results very more interesting as it is very likely that this kind of printer is used by counterfeiters. The images coming from the other two printers were more difficult to classify: the KYOCERA P5021 (printer #12) is some months old, and the Ricoh is a brand that was not included in the training set, with a resolution (1200 × 1220) that was neither used during training. This last printer presented the hardest challenge for all the best approaches. The proposed ensembles without diversity (HOM-ENS) helped just a little, exactly as happened before in the closed set, improving the detection accuracy of Kyocera and RICOH.
For the last baseline, the results of MNIST seem to be the best one in this scenario. The minimum recall is 0.79 for the Ricoh printer. However, such a simple structure got only 0.01 better result than our approaches, and, as we will show later, MNIST approach suffers a lot when dealing with documents produced by different printers.

2) OPEN SET SPECIFICITY ANALYSIS
In this final set of experiments, we fed the various systems with documents produced by different printers. We did so by choosing all possible pairs of the four new printers in Table 2.
The results we got, reported in Table 6, confirm the biased behavior of IMAGENET-based SNNs. For this family of SNNs, the NASNET-MOBILE showed the best result of all approaches, with a 0.95 mean specificity, highlighting the fact that, according to its worse results on Table 5, this SNN tends to always say that any pair of documents come from different printers. All other related CNNs (DENSENET-201, RESNET-50,MOBILENET, EFFICIENTNET-B0 and EFFICIENTNET-B7) follow a similar path.
The results in Table 6 also show how the shallower MNIST SNNs can fail in classifying pairs of documents coming from different printers, having the second-worst mean specificity (0.41).
Finally, the results in Table 6 highlight even better the promising results allowed by the ensembles, since they boost significantly the results of the best individual MCS-3 SNN. In particular, the ensemble with diversity (HET-ENS) with three different SNN architectures achieves the best result for three pairs of printers, achieving a mean specificity of 0.94, with the smallest false alarm of 0.6. Such result improves by 11% the best individual MCS-3 SNN. The ensemble obtained with homogeneous networks (HOM-ENS) also outperforms by 5% the best individual approach, with a clear improvement for the OKI − CANON and RICOH − KYOCERA pairs.

3) OPEN SET FINAL ANALYSIS
To give a better global view of the results we have got in the open set scenario, we calculated the Balanced Accuracies (BACC) using the previous specificities and recalls, sorting such performance metric in Figure 7. It can be seen that both proposed ensembles are better positioned among all the other approaches, with the MNIST being the best competitor. The fact that shallow networks are better positioned both in the open and close-set cases, is due to the data scarcity scenarios we face in this application.
In particular, it can be noticed that the ensembles, being them homogeneous (HOM-ENS) or heterogeneous (HET-ENS) outperform the best individual approach (MCS-3) by 2% and 3% respectively.
In general, the results discussed so far highlight the use of the diverse ensemble (HET-ENS), achieving the best final result considering both closed and open sets. In the closed set, the diverse ensemble showed a competitive performance with both individual MCS-3 and HOM-ENS. In the open set, it showed a reasonable 78% recall but a very good 94% specificity. In summary, the HOM-ENS showed to be the best when the printer is known, but it yields a 17% false acceptance and 12% false alarm. Our proposed HET-ENS has a similar final result of HOM-ENS (it is only 1% better), but its true power comes from minimizing the false alarms, having only 6% error when innocent pairs come to test being, thus, our best proposed method in general.

VI. CONCLUSION AND FUTURE WORK
The accessibility and popularity of printing and scanning devices is a double-edged sword: while they can generate increasingly good and reliable printed information such as currency, advertisements, attention signs, they can also be used to generate illegal content such as fake currency and counterfeited products. Several efforts have been made in the digital forensics literature to detect the source of unreliable printed documents, however, most of them have been tested on simple datasets (with common printers), use black-box approaches that require massive amounts of training data, and do not consider that printers outside the training dataset could be used by the counterfeiter.
In this paper, we propose to treat printer source attribution as a verification problem (instead of a classification problem).
By using siamese networks, we can make the best out of a small training dataset. With Siamese networks, documents coming from unknown printers can be evaluated, given that the networks learn similarities instead of labels. We propose to better exploit such networks by creating three shallow but wide siamese networks. We further propose four variations of each network and improve even more their verification accuracy by building ensembles with and without diversity. Our best approach using ensemble of heterogeneous Siamese Networks showed a promising mean 0.92 balanced accuracy when considering a challenging unbalanced closed set scenario and an open set scenario with four unknown printers at the same time, defeating other baseline approaches (both deep and shallow ones) by a large margin.
Notwithstanding these promising results, further work can be done to better exploit the verification scenario in the printer source attribution. First of all, the obvious search for more data and their effect on the verification task should be studied. Second, other losses, such as the triplet loss [86] could be used to improve the performance of the proposed networks. Finally, other shallow but wide architectures could be designed and their performance could also be evaluated to enrich the ensemble architecture.