SimilarNet: Pairwise Similarity Comparator Layer for Versatile Comparison

In recent years, deep learning has attracted considerable attention owing to its ability to address complex problems in various fields. One notable problem is metric space learning, which aims to train embedding models that produce feature embeddings whose similarity can be computed in the embedding space. However, research on metric space similarity has been limited, and existing methods, such as those based on cosine similarity or the concatenate layer, exhibit drawbacks in terms of flexibility and performance. For example, to apply cosine similarity, the shapes of the two vectors must be identical, and an advanced comparator cannot be stacked on its scalar output. Moreover, the concatenate layer does not reflect the positions of the elements in the two vectors, degrading both performance and the learning success rate. To address these limitations, this paper proposes a specialized artificial neural network layer named SimilarNet, designed to compare two feature vectors while considering the positions of their elements and to produce an output in vector format. By combining the advantages of cosine similarity and the concatenate layer, SimilarNet can effectively compare two vectors, enabling the construction of trained comparison models using multidimensional activation functions. In addition, unlike cosine similarity, SimilarNet can realize 1:1 comparisons of data with different shapes. The results of experiments conducted on various datasets indicate that models employing SimilarNet outperform those with the concatenate layer by 4.3% to 26.5% in comparison accuracy and by 5% to 75% in learning success rate.


I. INTRODUCTION
With the advancement and commercialization of deep learning, innovative methodologies are increasingly being applied to address traditional problems in various fields. Representative techniques include convolutional neural networks for basic tasks such as classification, regression, and clustering; the Deep Q-Network [1] for reinforcement learning; generative adversarial networks [2]; and diffusion models [3]. In the domain of natural language processing, transformer-based approaches [4] have demonstrated remarkable performance.

(The associate editor coordinating the review of this manuscript and approving it for publication was Jerry Chun-Wei Lin.)
In the context of comparison techniques, numerous researchers have focused on feature extractors and embedding models. Notably, research on learning suitable embedding models for comparison has typically been based on the Siamese network structure [5] with contrastive loss [6], triplet loss [7], and center loss [8]. These techniques have been applied to problems including face verification [9] and few-shot learning, with representative models including the prototypical network [10]. However, only a few researchers have applied comparators for evaluating the similarity between two extracted features. Current techniques apply metrics such as the cosine distance or Euclidean distance [5], [6], [7], [8], [9], [10], or neural networks trained on concatenated embedding vectors, such as the Relation Network [11].
However, applying these metrics to measure distance involves two main problems. First, to use these metrics, the shapes of the two vectors must be identical, which makes heterogeneous comparisons between vectors of different shapes challenging. Second, the output of these metrics is a single scalar, which cannot be connected to an upper layer, as shown in Figure 1.
Although the use of a concatenate layer avoids both of these problems, it involves unnecessary comparisons, resulting in low efficiency. When two vectors are simply concatenated, their elements are combined without considering the position of each element, even if the basis vectors for each feature are identical. In this configuration, the comparator model may learn invalid patterns from elements in different positions, as illustrated in Figure 2. This phenomenon significantly decreases the accuracy and learning success rate.
First, unnecessary comparisons arise owing to the calculation of weighted sums of irrelevant combinations, as illustrated in Figure 2(a), resulting in invalid comparisons. Notably, the operation shown in Figure 2(a) is simplified to two elements for ease of understanding. In practice, many more elements would be involved, resulting in excessively complex computations and inefficient learning.
Second, neural networks that employ weighted sum operations must construct complex models to produce outputs proportional to the similarity between two input values. The similarity calculation problem resembles the XNOR problem, which cannot be modeled by a single weighted sum; a single-layer artificial neural network is known to be unable to solve the XOR problem, so a complex model is required. Figure 2(b) illustrates a valid comparison resulting from an appropriate comparison of two elements from the two vectors. To achieve only these valid comparisons, an artificial neural network with two or more layers must be constructed. For example, the Relation Network employs a model that stacks a convolution block twice and adds fully connected (FC) layers.
Considering these aspects, in this study, we establish a specialized comparison layer, named SimilarNet, to overcome the aforementioned problems and generate an efficient comparator capable of effective learning. This layer takes two vectors as input and outputs a single vector. Moreover, it is designed to treat the two vectors distinctly at the element level. The rationale of SimilarNet is to avoid invalid comparisons between unrelated elements of the two vectors.
The original contributions of this work can be summarized as follows: 1) We propose an efficient comparison-specialized neural network layer, named SimilarNet, that can be applied to various comparison problems. Unlike concatenation strategies, this layer differentiates the two input vectors being compared and selectively performs only the operations necessary for comparison, based on the position of each element. This design effectively enhances the accuracy and learning success rate in comparison problems.
2) The SimilarNet layer offers a level of flexibility that cannot be achieved with traditional metrics such as cosine similarity. As the output of SimilarNet is a vector rather than a single scalar, diverse models may be stacked on top of it and trained using conventional learning methodologies. 3) We extend the functionality of SimilarNet to facilitate comparisons of heterogeneous vectors. Comparison operations with characteristics optimal for the target of comparison can be implemented by exchanging the activation function defined in the perceptrons.

II. RELATED WORKS
The fundamental approach for comparing two vectors is to use various metrics [12], such as the Euclidean distance and cosine distance, each tailored to specific applications owing to its unique characteristics. The cosine distance, known to be effective in high-dimensional spaces regardless of vector magnitude, is widely used in the field of machine learning [13].
Research on the use of deep learning for comparing two datasets began with the introduction of the Siamese network [5] in 1993. In 2005, a mapping method that applied contrastive loss [6] to the learning process of the Siamese architecture was proposed. With the increasing popularity of deep learning since 2015, numerous researchers have applied the contrastive loss and Siamese artificial neural networks. The triplet loss [7], an extension of contrastive loss, was later introduced. Since then, comparison-based methods have been successfully applied to various problems in different fields.
Notably, the abovementioned studies use specific metrics such as the Euclidean and cosine distances for comparison, which corresponds to metric learning, as discussed in several papers [14], [15]. Specifically, these studies use only fixed metrics as comparators. Despite the notable progress of metric learning, which focuses on efficiently learning feature embedding models, the comparator model itself has been largely overlooked in relevant research.
Certain researchers have attempted to implement learnable comparators and apply them to few-shot learning. For example, the Relation Network [11] serves as a comparator by concatenating vector pairs for comparing multiple embedding vectors. However, this method requires a relation module, consisting of a convolution block and fully connected layers, to learn the comparison operation over the concatenated vector input. This approach leads to the inefficient comparison of embedding vectors using concatenation, as described in Section I.
A 3D activation function, i.e., the functional perceptron, has been designed for indoor positioning problems based on the comparison of Wi-Fi fingerprints [16]. However, this functional perceptron is specialized for indoor positioning solutions and has limited applicability to other problems.

III. DESIGN AND IMPLEMENTATION OF SIMILARNET

A. VANILLA SIMILARNET
The design objective of SimilarNet is to combine the flexibility of the concatenate layer with the efficiency of cosine similarity. To this end, SimilarNet uses a modified form of the cosine similarity algorithm inspired by artificial neural networks. Specifically, we partition the cosine similarity algorithm into three stages: L2 normalization, the Hadamard product, and the sum of elements.
Artificial neural networks typically rely on activation functions. However, the Hadamard product, a matrix operation defined as H(X, Y) = X ⊙ Y, cannot be directly applied as an activation function. To address this problem, we introduce the concept of a functional perceptron [16], which uses a three-dimensional activation function that takes an element of each matrix as input. Specifically, we reconstruct the Hadamard product as a function of each element pair, f(x_i, y_i) = x_i y_i, which can then be employed as the activation function of the functional perceptron. We refer to this function as the cosine activation function. It outputs higher values when the elements x_i and y_i are closer to each other and lower values when they are farther apart, as depicted in Figure 3.
The SimilarNet layer uses a functional perceptron based on the cosine activation function instead of the Hadamard product. Moreover, we eliminate the sum operation at the end of the cosine similarity to obtain the output in the form of a vector instead of a scalar. The vector representation of the output enhances the model flexibility. A comparative analysis of the cosine similarity and SimilarNet is presented in Algorithm 1 and Figure 4.
In SimilarNet, operations are performed only on the elements corresponding to the same index on both sides of the vectors, similar to cosine similarity. Notably, when the concatenate layer is used, elements of different features are forcibly combined, resulting in invalid elements, as shown in Figure 2(a). In contrast, SimilarNet operates on elements corresponding to the same features, minimizing the occurrence of invalid elements and allowing the model to efficiently calculate the similarity. Furthermore, as the output is a similarity vector rather than a scalar, various comparator models can be flexibly introduced after SimilarNet, as illustrated in Figure 5. An implementation of SimilarNet using Keras [17] with TensorFlow [18] can be found on GitHub [19].
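As a concrete illustration, the forward pass described above (L2 normalization followed by the element-wise cosine activation, with no final sum) can be sketched in NumPy as follows; the function name and the epsilon guard are ours, not part of the official Keras implementation [19]:

```python
import numpy as np

def vanilla_similarnet(x, y, eps=1e-12):
    """Forward pass of vanilla SimilarNet on two equal-shape vectors.

    The cosine similarity algorithm is split into L2 normalization,
    the element-wise cosine activation f(x_i, y_i) = x_i * y_i, and a
    final sum; SimilarNet keeps the first two stages and drops the sum,
    so the output is a similarity vector rather than a scalar.
    """
    x = x / (np.linalg.norm(x) + eps)  # L2 normalization
    y = y / (np.linalg.norm(y) + eps)
    return x * y  # cosine activation applied element-wise, no reduction

# Summing the output vector recovers the ordinary cosine similarity:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])       # parallel to a
sim_vec = vanilla_similarnet(a, b)  # sim_vec.sum() is 1.0 (cosine of parallel vectors)
```

Because the reduction is omitted, `sim_vec` retains one similarity score per feature position, which is what allows further layers to be stacked on top.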

B. PARAMETRIC SIMILARNET
To perform a proper comparison, it is necessary to learn methods and criteria that match the characteristics of the data being compared. A representative example is a case in which the weights for the similarities and differences of each feature should be calculated separately. However, vanilla SimilarNet does not have trainable weights and biases, unlike general artificial neural networks, rendering it ineffective in such cases. To enhance the learning capability, we extend vanilla SimilarNet to parametric SimilarNet by applying an activation function with trainable parameters, enabling it to learn how to compare each feature. An example of a trainable activation function is the parametric rectified linear unit (PReLU) [20], which extends the commonly used ReLU [21] in fully connected layers by training the weight of the negative part, α, for each perceptron. The graph of PReLU is shown in Figure 6.
Similar to PReLU, the parametric SimilarNet is designed to learn different weights for the positive and negative parts.In other words, we assign different weights to the similarities and differences of each feature when performing comparisons.
We refer to this activation function as the parametric cosine (PCosine). Using PCosine, the model can learn to neglect certain differences in a pair when there are clear similarities, thereby yielding high similarity values. PCosine applies the trainable parameter α to the negative part of the cosine activation function, analogous to PReLU; its shape depends on α, as illustrated in Figure 7. The structure of parametric SimilarNet incorporating this activation function is illustrated in Figure 8. The Keras implementation of parametric SimilarNet can be found on GitHub [22].
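Assuming the PReLU analogy above holds literally (the paper's own formal definition should be taken as authoritative), PCosine can be sketched as the element-wise product with the negative, disagreement-indicating part scaled by the trainable α:

```python
import numpy as np

def pcosine(x, y, alpha):
    """PCosine activation (assumed form, by analogy with PReLU):
    the element-wise product x_i * y_i is kept as-is when positive
    (the elements agree) and scaled by the trainable parameter alpha
    when negative (the elements disagree)."""
    p = x * y
    return np.where(p >= 0, p, alpha * p)

# With alpha = 0, differences are ignored entirely; with alpha = 1,
# PCosine reduces to the vanilla cosine activation.
```

In a trained layer, α would be a per-perceptron weight updated by backpropagation, exactly as in Keras's PReLU layer.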

C. HETEROGENEOUS SIMILARNET
In comparison problems, the shapes of the two datasets being compared are typically assumed to be identical. Moreover, the basis vectors representing each feature in the two vectors being compared are considered identical. Various metrics, such as the Euclidean distance and cosine similarity, as well as comparison functions, such as the mean squared error (MSE) and mean absolute error (MAE), can only be meaningfully used when the shapes and basis vectors of the two vectors are identical. However, when comparing data of different shapes, the shapes of the embedding vectors also differ, as depicted in Figure 9. Consequently, a novel strategy for treating heterogeneous vectors is necessary. For example, the recommendation problem can be regarded as a comparison problem over multiple heterogeneous vectors. To address this problem, methods such as clustering [23], [24] or concatenation have been used to create a single vector for training the model [25], [26], thereby avoiding direct comparison of the two datasets.
In the heterogeneous SimilarNet framework, we extend SimilarNet to directly compare two heterogeneous vectors by applying the dot product, as illustrated in Figure 10. This process is similar to the dot-product attention technique [4] commonly used in natural language processing. Algorithm 2 describes this process, which can be implemented by adding a flatten operation for the two vectors and a transpose operation for one vector in Algorithm 1. A Keras implementation of SimilarNet for heterogeneous comparison is available on GitHub [19].
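Following the description of Algorithm 2 (flatten both vectors, transpose one, multiply), the heterogeneous comparison can be sketched as an outer product of the two L2-normalized, flattened embeddings; the function name and epsilon guard are ours:

```python
import numpy as np

def hetero_similarnet(x, y, eps=1e-12):
    """Sketch of heterogeneous SimilarNet: flatten both embeddings,
    L2-normalize them, and multiply one by the transpose of the other
    (an outer product), analogous to dot-product attention. Every
    element of x is compared against every element of y, so the two
    inputs may have different shapes."""
    x = np.ravel(x)
    y = np.ravel(y)
    x = x / (np.linalg.norm(x) + eps)
    y = y / (np.linalg.norm(y) + eps)
    return np.outer(x, y)  # similarity matrix of shape (x.size, y.size)
```

The output is a 2-D similarity map rather than a vector, so a comparator model stacked on top would flatten or convolve it before the final decision layers.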

IV. EXPERIMENTS

A. DATASETS
The performance of SimilarNet is assessed using 1:1 pair datasets generated from six sources: MNIST [27], Fashion-MNIST [28], Omniglot [29], CIFAR-10 [30], miniImageNet [31], and a combined MNIST:CIFAR dataset. Data pairs are generated by randomly pairing values and labeled according to their class equivalence. For instance, the label 1 represents ''same'' for a pair of cars, while the label 0 represents ''different'' for a pair of a car and an airplane.
Unlike conventional few-shot learning approaches, the classes of the two values are randomly extracted from the entire range, and the data pairs from the same class and other classes are extracted in a ratio of approximately 1:1.The implementation of the datasets can be found on GitHub [32].
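The pairing procedure described above can be sketched as follows; the helper name and exact sampling details are ours, and the actual dataset implementation is in the repository [32]:

```python
import numpy as np

def make_pairs(x, labels, n_pairs, rng=None):
    """Generate 1:1 pair data: random pairs labeled 1 if the two
    samples share a class and 0 otherwise, with same/different pairs
    drawn in an approximately 1:1 ratio (alternating here for
    simplicity). Classes are drawn from the entire label range."""
    rng = np.random.default_rng(rng)
    left, right, y = [], [], []
    classes = {c: np.flatnonzero(labels == c) for c in np.unique(labels)}
    for k in range(n_pairs):
        i = rng.integers(len(x))
        if k % 2 == 0:  # "same" pair: second sample from the same class
            j = rng.choice(classes[labels[i]])
        else:           # "different" pair: resample until classes differ
            j = rng.integers(len(x))
            while labels[j] == labels[i]:
                j = rng.integers(len(x))
        left.append(x[i])
        right.append(x[j])
        y.append(int(labels[i] == labels[j]))
    return np.stack(left), np.stack(right), np.array(y)
```

The alternation guarantees the approximately 1:1 ratio of same-class to different-class pairs regardless of the class distribution of the source data.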
To investigate heterogeneous comparisons, in which two vectors with different shapes are compared, we generate a dataset consisting of data pairs from MNIST and CIFAR-10. A one-to-one correspondence between the ten classes of MNIST and CIFAR-10 is established while preserving their original ordering. Consequently, the digits 0, 1, and 2 of MNIST are considered to be in the same class as the airplane, automobile, and bird classes of CIFAR-10, respectively, and so on. For instance, a label representing ''same'' is assigned to the pair of 0 and airplane, and a label representing ''different'' is assigned to the pair of 0 and bird.

B. EXPERIMENTAL MODELS
To evaluate the performance of SimilarNet, we design a testbed model with a structure that fits the 1:1 pair datasets, as shown in Figure 12. Specifically, we use the embedding models proposed in previous few-shot learning studies, namely ProtoNet [10] and the Relation Network [11]. These embedding models are incorporated into Siamese networks, where their weights are shared. The structure of a single embedding model is shown in Figure 11.
Upon these two embedding models, we add the SimilarNet and concatenate layers, respectively, followed by an FC layer with eight outputs as the hidden layer and one FC layer with one output as the output layer, as shown in Figures 12(a) and (b). The datasets used in the experiments consist of data pairs and equivalence labels represented as 0 or 1. Therefore, we apply the sigmoid function to the output perceptron to match the output format.
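A minimal NumPy sketch of this comparator head stacked on the similarity vector is shown below; the hidden-layer activation is our assumption (the text specifies only the layer widths and the sigmoid output), and the weight arguments stand in for trained parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def comparator_head(sim_vec, w1, b1, w2, b2):
    """Comparator stacked on the similarity vector: an FC hidden layer
    with eight outputs (ReLU assumed here) followed by a single-output
    FC layer with a sigmoid, yielding a match probability in (0, 1).
    Expected shapes: w1 (d, 8), b1 (8,), w2 (8, 1), b2 (1,)."""
    h = np.maximum(sim_vec @ w1 + b1, 0.0)  # hidden FC layer
    return sigmoid(h @ w2 + b2)             # match probability
```

With all-zero weights the head outputs exactly 0.5, i.e., the undecided point of the sigmoid, which matches the 0/1 equivalence-label format.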
Owing to differences in the image size and number of pixel channels between MNIST and CIFAR-10, the embedding models for heterogeneous comparison have different node sizes. Therefore, we cannot share their weights and instead use two independent embedding models without constructing a Siamese network.

C. EXPERIMENTAL SUBJECTS
To evaluate the performance of SimilarNet, we conduct experiments by replacing the concatenate layer used in the Relation Network [11] with the vanilla SimilarNet (Section III-A) and parametric SimilarNet (Section III-B) at the same position in the model described in Section IV-B. However, unlike SimilarNet and the concatenate layer, cosine similarity cannot support a trainable comparator model, and therefore its model architecture and learning environment would differ. Additionally, because cosine similarity cannot handle heterogeneous comparisons, it is excluded from the experimental targets.

D. EXPERIMENTAL ENVIRONMENT
To conduct the experiments, we implement the models described in Section IV-B using TensorFlow 2.8.1 and Keras 2.8.0. The code used in the experiments is publicly available on GitHub [32]. To evaluate the accuracy and learning success rate simultaneously, we create a script that automatically restarts the computer and executes the experiments, providing a clean slate for each measurement. We configure the training environment to be as close to the default settings of TensorFlow and Keras as possible. The MSE is used as the loss function without any modification, and the Adam optimizer supported by Keras is used with its default parameters. The learning rate is set to the default value of 0.001. The batch size is 32 and the step size is 256 for training. The number of epochs is set to 10,000, along with the EarlyStopping callback provided by Keras with a patience parameter of 100. This callback halts training if there is no improvement in performance within the given patience window and restores the model with the best validation loss. The average number of training epochs is 307, ranging from a minimum of 100 to a maximum of 738.
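The stopping rule relied on here (Keras's EarlyStopping with patience 100 and best-weight restoration) can be summarized in plain Python; this is a behavioral sketch of the callback, not the Keras implementation itself:

```python
def early_stopping_run(val_losses, patience=100):
    """Sketch of the EarlyStopping behaviour used in the experiments:
    training halts once the validation loss has not improved for
    `patience` consecutive epochs, and the epoch with the best
    validation loss is the one whose weights are restored.
    Returns (best_epoch, best_loss)."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0  # improvement: reset patience
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` epochs: stop
    return best_epoch, best
```

With patience 100, this explains why every run trains for at least 100 epochs past its best validation loss, consistent with the reported minimum of 100 epochs.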
The computer used for the experiments has an Intel Core i9-11900K CPU, 32 GB of RAM, and an NVIDIA RTX 3070 GPU. The operating system is Windows 10, and the software includes Python 3.9.9 and TensorFlow 2.8.1.

E. PERFORMANCE EVALUATION METRICS
The performances of the SimilarNet configurations and the concatenate layer in the same model are evaluated. Accuracy serves as the primary performance metric. It is measured by having the model estimate whether the classes of an input pair match and checking whether the result agrees with the ground truth.
In addition to accuracy, the learning success rate is an important metric. Learning often fails in the 1:1 comparison model using the concatenate layer, owing to its inefficient comparisons, as explained in Section I. In such cases, using SimilarNet instead of the concatenate layer enhances the comparison efficiency, thereby increasing the learning success rate. These enhancements can be quantified by examining the accuracy and output characteristics of multiple models trained from scratch. In this study, we verify whether the accuracy exceeds 50% and whether the outputs are not constant across all input pairs as measures of learning success. Models that fail to achieve an accuracy above 50%, or that produce outputs solely in either the positive or negative category, are deemed unsuccessful in learning.
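The two success criteria can be expressed directly; the 0.5 decision threshold matches the sigmoid output of the testbed model, and the function name is ours:

```python
import numpy as np

def learning_succeeded(y_true, y_pred):
    """Check the two success criteria described above: accuracy above
    50%, and outputs that are not constant across all input pairs.
    `y_pred` holds sigmoid outputs in (0, 1); 0.5 is the decision
    threshold separating ''same'' from ''different''."""
    decisions = (np.asarray(y_pred) > 0.5).astype(int)
    accuracy = float((decisions == np.asarray(y_true)).mean())
    non_constant = len(np.unique(decisions)) > 1  # rules out collapsed models
    return accuracy > 0.5 and non_constant
```

A model that always answers ''same'' on a balanced pair dataset scores exactly 50% accuracy and fails both criteria, which is why the two checks together identify collapsed training runs.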
The class compositions of the training and test sets are the same for MNIST, FashionMNIST, and CIFAR-10 but different for Omniglot and miniImageNet. Nevertheless, because the output of the 1:1 comparison model used in the experiments only determines the match or non-match of the input, the type and number of classes are irrelevant to the model operation. Therefore, experiments can be performed by directly inputting all the unseen data of the test set.

F. EXPERIMENTAL RESULTS
Table 1 presents the performance metrics of models using the SimilarNet and concatenate layers for the six datasets.
The accuracy is calculated as the arithmetic mean over 20 trained-from-scratch models, measured using 65,536 pairs of test set data, with a 95% confidence interval. The success rate is measured as the ratio of models satisfying two criteria out of a total of 20 models. The first criterion is exceedance of an accuracy threshold of 50%. Because the models used in the experiment are binary classifiers designed to differentiate between similarity and dissimilarity, any model demonstrating successful learning must achieve an accuracy higher than 50%. The second criterion is that the model output is not constant across all inputs. Any model that produces only positive or only negative outputs for all pairs of inputs is considered to have failed at learning. This aspect is evaluated using a test dataset consisting of 256 pairs. The raw data from the experiment are presented in the Appendix.

G. ABLATION STUDY
SimilarNet consists of L2 normalization and a functional perceptron, corresponding to the L2 normalization and Hadamard product stages of the cosine similarity algorithm, respectively. By excluding L2 normalization from SimilarNet in the model used in the performance experiments and replacing the functional perceptron with a concatenate layer, a model equivalent to the concatenate layer can be obtained. To clarify the contributions of L2 normalization and the functional perceptron in SimilarNet, we conduct experiments replacing each component.
Specifically, for the ablation study, we conduct experiments on the same datasets under the same environment for two cases: the concatenate layer with L2 normalization and SimilarNet without L2 normalization. The results are presented in Table 2.

H. DISCUSSION
The experimental results demonstrate that in all cases, SimilarNet outperforms the concatenate layer in terms of both accuracy and success rate, regardless of the activation function used. When comparing the two activation functions used in SimilarNet, parametric SimilarNet exhibits slightly higher or similar accuracy, whereas vanilla SimilarNet achieves a higher success rate. This suggests that the performance of SimilarNet can be further enhanced by optimizing various factors, such as the dataset characteristics, model characteristics, and activation function definitions. Future research can thus focus on optimizing SimilarNet configurations and activation functions.
The results of the ablation study indicate that both the functional perceptron and L2 normalization in SimilarNet considerably influence the accuracy and training success rate. In particular, when the functional perceptron is replaced with the concatenate layer, the accuracy significantly decreases in all datasets. Moreover, the removal of L2 normalization results in a substantial decrease in the training success rate, except in the case of the MNIST and FashionMNIST datasets. Based on these findings, it can be inferred that the functional perceptron contributes more to the accuracy, whereas L2 normalization contributes more to the training success rate.

V. CONCLUSION
This paper proposes the SimilarNet layer, a novel framework for creating an efficient 1:1 comparison model. By selectively comparing only relevant elements between two feature vectors and excluding invalid cross-comparisons, the proposed layer exhibits excellent performance in comparison problems, rendering it a promising alternative to traditional concatenate-layer-based methods. Moreover, the proposed layer allows the activation function to be flexibly replaced and the comparison operation to be set according to the problem, enabling the comparison of heterogeneous data.
Experiments on several representative datasets reveal that SimilarNet outperforms the concatenate layer in terms of both accuracy and success rate. Thus, by using SimilarNet instead of the concatenate layer in various problems necessitating comparison, the overall model performance can be enhanced.
In future work, we will investigate the relationship between the performance of SimilarNet and its activation function and explore the use of various activation functions.

APPENDIX A RAW DATA FROM EXPERIMENT
A. MNIST
See Tables 3-7.

FIGURE 11. Structure of the embedding model for experiments.
The training and test sets of each dataset are strictly separated to ensure unbiased evaluation. The data pairs constituting the training set are extracted only from the original training set of each dataset, and the data pairs constituting the test set are extracted only from the original test set. Consequently, for Omniglot and miniImageNet, in which the class compositions of the training and test sets differ, all data constituting the test pairs belong to unseen classes not included in the training data.

FIGURE 12. Structure of the testbed models for experiments.

TABLE 2 .
Results of ablation study.

TABLE 3 .
Results of MNIST dataset input to SimilarNet.

TABLE 4 .
Results of MNIST dataset input to parametric SimilarNet.

TABLE 5 .
Results of MNIST dataset input to concatenate layer.

TABLE 6 .
Results of MNIST dataset input to SimilarNet without L2 normalization.

TABLE 7 .
Results of MNIST dataset input to concatenate layer with L2 normalization.

TABLE 8 .
Results of FashionMNIST dataset input to SimilarNet.

TABLE 9 .
Results of FashionMNIST dataset input to parametric SimilarNet.

TABLE 10 .
Results of FashionMNIST dataset input to concatenate layer.

TABLE 11 .
Results of FashionMNIST dataset input to SimilarNet without L2 normalization.

TABLE 12 .
Results of FashionMNIST dataset input to concatenate layer with L2 normalization.

TABLE 13 .
Results of Omniglot dataset input to SimilarNet.

TABLE 14 .
Results of Omniglot dataset input to parametric SimilarNet.

TABLE 15 .
Results of Omniglot dataset input to concatenate layer.

TABLE 16 .
Results of Omniglot dataset input to SimilarNet without L2 normalization.

TABLE 17 .
Results of Omniglot dataset input to concatenate layer with L2 normalization.

TABLE 22 .
Results of CIFAR-10 dataset input to concatenate layer with L2 normalization.

TABLE 23 .
Results of MiniImageNet dataset input to SimilarNet.

TABLE 24 .
Results of MiniImageNet dataset input to parametric SimilarNet.

TABLE 25 .
Results of MiniImageNet dataset input to concatenate layer.

TABLE 26 .
Results of MiniImageNet dataset input to SimilarNet without L2 normalization.

TABLE 27 .
Results of MiniImageNet dataset input to concatenate layer with L2 normalization.

TABLE 32 .
Results of MNIST:CIFAR dataset input to concatenate layer with L2 normalization.