Radial Basis Function Networks for Convolutional Neural Networks to Learn Similarity Distance Metric and Improve Interpretability

Radial basis function neural networks (RBFs) are prime candidates for pattern classification and regression and have been used extensively in classical machine learning applications. However, RBFs have not been integrated into contemporary deep learning research and computer vision using conventional convolutional neural networks (CNNs) due to their lack of adaptability with modern architectures. In this paper, we adapt RBF networks as a classifier on top of CNNs by modifying the training process and introducing a new activation function to train modern vision architectures end-to-end for image classification. The specific architecture of RBFs enables the learning of a similarity distance metric to compare and find similar and dissimilar images. Furthermore, we demonstrate that using an RBF classifier on top of any CNN architecture provides new human-interpretable insights about the decision-making process of the models. Finally, we successfully apply RBFs to a range of CNN architectures and evaluate the results on benchmark computer vision datasets.


I. INTRODUCTION
Inspired by the locally tuned response of biological neurons, Broomhead and Lowe introduced radial basis function neural networks (RBFs) in 1988 [1]. The modeling concept behind RBFs is a combination of unsupervised and supervised learning for pattern classification and regression. However, RBFs have not been integrated into the contemporary approaches to computer vision using Convolutional Neural Networks (CNNs) so far due to structural deficiencies. This paper presents developments in a new area of research and lays the foundation for using RBFs in deep learning and computer vision by modifying their architecture and learning process. The results demonstrate that integrating RBFs into CNN models for computer vision provides both a similarity distance metric as well as an interpretable decision-making process.
This research is motivated by the unique opportunities the RBF architectures introduce when used with CNN models. A new training process introduced for RBFs in this paper provides the opportunity of using labeled and unlabeled data by optimizing two loss functions combining supervised and unsupervised learning. The training process of RBF architectures employs a distance metric optimization that we propose to use as a similarity distance metric to find similar and dissimilar images. Additionally, this research proposes visu-alization techniques to illustrate the clusters and activations with training and test images to gain more insight into the reasoning behind the decisions made by the networks, thus improving interpretability. The contributions of this paper to computer vision literature can be summarized as follows: • Combining supervised and unsupervised learning. • Learning a similarity distance metric to find similar images. • Improving the interpretability of decision-making. Despite the advantages of combining RBFs to modern CNN architectures, there are two factors in the architecture and training process of RBFs hindering their integration into CNNs. First, the nonlinear activations and computational graphs of RBFs used in the literature prevent efficient gradient flow. Secondly, RBFs assume that the training features are fixed and the cluster centers are initialized accordingly. Nonetheless, CNN architecture learns the embeddings, which are used as features of RBFs. This paper tackles the limitations of the original RBFs and presents the following contributions to RBF literature: • Introducing a quadratic activation function and a linear computational graph for end-to-end learning. • Adding an unsupervised loss term to update the cluster centers in the training process with the learned embed- The output of a CNN backbone is connected to the input of an RBF through a fully connected layer. The input features of the RBFs are referred to as embeddings in this research work. The embeddings of each image are compared to cluster centers with a trainable similarity distance metric. The same distance metric can be used to find images similar to a test sample amongst training images (visualized in the table in the middle row). The RBFs apply an activation function to the distance of the training images from the cluster centers to compute activation values. The output layer of the RBF is optimized for classification based on these activation values. We train the entire CNN-RBF architecture end-to-end.
dings. • Applying the RBFs to computer vision in a first attempt using deep CNN architectures.
The remainder of the paper covers the related works in Section II followed by the theoretical background of RBFs in Section III. We then present our original research and contributions in Section IV with our proposed modifications to RBFs followed by a visual explanation of the new proposed training and decision-making process in Section V. The experimental results of applying the proposed RBF-CNN architectures using a range of CNN backbones on benchmark datasets are presented in Section VI. The potential contributions of our proposed similarity distance metric on the field of computer vision to enhance the transparency of the decision making process is demonstrated in Section VII. We finally present our conclusions in VIII.

II. RELATED WORKS
Research into optimizing RBF architectures has followed two approaches. The first appraoch concentrates on the training process and initialization of the networks while the second approach aims to find better optimized activation functions. This paper presents improvements to both approaches and integrates the RBFs into contemporary vision models using CNNs.
RBFs were originally introduced as a supervised approach for classification and regression tasks. Broomhead and Lowe proposed to draw the cluster centers either from a uniform distribution or randomly from the training samples and then optimizing the output weights using a pseudo-inverse analytic solution [1]. Initializing the cluster centers randomly and only training the output weights is a one-phase training process for RBFs. Two-phase training for RBFs uses various methods to initialize the cluster centers as well as optimizing the output weights. Research since 1988 has used supervised and unsupervised methods to initialize the centers. Moody and Darken proposed an unsupervised algorithm to initialize these cluster centers [2] while Schwenker et al. proposed supervised vector quantization [3]. Decision trees were used to find centers independently by [4] and [5] before training the output weights. Finally, Schwenker et al. proposed a third phase to optimize the entire RBF network end-to-end including output weights, cluster center, and parameters of activation functions using gradient descent [6].
All of these methods for cluster center initialization assume a fixed feature space for the input layer. However, CNNs learn the embeddings automatically and develop the feature space of the images during the training process. Therefore, we suggest optimizing an unsupervised learning loss during the training to cope with this change in the feature space. This work differs from previous research as it combines supervised and unsupervised learning by optimizing two separate losses using gradient descent.
Various applications and implementations have motivated several activation functions presented in the literature of RBFs [7]. The Gaussian function is the kernel encouraged by modeling the data through a multivariate Gaussian distribution [1]. Other functions adopted in the RBF architecture include linear kernels, thin-plate splines, logistic functions, and multiquadratic functions [8]- [11]. Hardy's multiquadratic functions motivate an activation function for RBFs used by Karimi et al. and Zhao et al. [12], [13]. Du et al. proposed a kernel for digital signal processing (DSP) units 9. In this paper, we suggest a quadratic kernel to build a linear computational graph for efficient gradient flow and integrate RBFs for end-to-end training with CNN architectures.
Besides the mature fundamental research, RBFs have been applied to a broad range of applications for pattern classification and regression in recent years. Nicodemou et al. used RBF networks for 3D hand pose estimation [14], Dehghan and Mohammadi estimated a numerical solution for Fokker-Planck differential equations with RBFs [15], Li et al. used sparse multiscale RBFs for seizure detection in EEG signals [16], and Zhao et al. predicted interfacial interactions by training RBFs [13]. Furthermore, Geng et al. introduced deep RBF networks and applied the method to food safety inspection data hland RBFs are used to train models for classification and regression in discrete and continuous pain quantification [17].
RBFs have been used for computer vision tasks and image classification as well. Friedhelm et al. used raw images as feature vectors to classify hand-written digits [6]. Er et al. extracted the features from facial images using principle component analysis (PCA) and processed these features using Fisher's linear discriminant (FLD) technique before classifying the patterns using RBFs [18]. However, the successful rise of the modern conventional neural networks, such as LeNet-5 [19] and AlexNet [20], led to a paradigm shift from using hand-crafted features to automated deep convo-lutional feature and representation learning using CNNs. In recent years, most computer vision tasks, like facial recognition [21], are dominated by modern CNN architectures as they present superior performance compared to classical methods for image processing. To the best of our knowledge, this paper presents the first attempt to integrate RBFs into modern CNN architectures for computer vision.
This research work relates to literature focusing on deep metric learning since RBFs optimize a similarity distance metric automatically during training based on their architecture. Euclidean distance, Mahalanobis distance and cosine similarity have been used to evaluate similarity between the embeddings (extracted features from CNNs) of two images in the literature [22]- [24]. Researchers have applied different strategies and loss functions to optimize these similarity metrics for same-class images while also maximizing the distance of different-class images. The research in this area concentrates on the training process and the design of a loss function which brings similar images closer in the embedding space based on a similarity measure. Hu et al. proposed to minimize the inter-class scores and maximize the intraclass scores based on Euclidian distances [25]. Hoffer and Ailon suggested optimizing a similarity-based loss function defined for selected triplets of images [22]. Song et al. used the pairwise distances between images of an entire batch and proposed a structured loss function for metric learning [26]. Similar research work aimed at optimizing angular distance, cosine distance and large-margin Euclidean distance of similar and dissimilar samples [24], [27], [28].
This paper presents a method to retrieve a ranked list of similar and dissimilar images, which leads to visually appealing results for similarity metric learning. However, the proposed similarity metric learned by the RBFs does not require any complicated triplet sample section or loss design. The presented results are obtained using a typical supervised loss function for classification (softmax cross-entropy). Furthermore, RBFs can not only optimize for Euclidean and Mahalanobis distances, but also for the entire covariance matrix.

III. RADIAL BASIS FUNCTION NETWORKS
In this section, we briefly review and explain the theoretical foundation of radial basis function networks. RBFs are presented in the literature as a global approximation method for learning a mapping F from a given feature space with dimensionality d to label a space with dimensionality K (F : IR d → IR K ) [1]. In this paper, the function F of features x approximates the one-hot encoded labels y. The features used in this work to train the RBFs are the embeddings of deep (CNNs) to predict the class labels using end-to-end optimization. However, we use a fully connected layer between CNN architectures and RBFs to provide compatibility between the two architectures and prevent overfitting. The architecture of the RBF consists of an input layer, a single trainable hidden layer with C cluster centers (c j ) 1, and an output layer.  During the evaluation, also termed inference in the deep learning architecture, the RBF computes a distance between embeddings of deep CNNs and the cluster centers and applies an activation function to this distance. The network outputs are then computed by multiplying the weights of the output layer with the activation values. This evaluation process formally defined as: where r represents the distance, R j is the positive definite covariance matrix (trainable distance), T denotes the matrix transposition, w jk shows the weights of the output layers, h is the activation function, and w 0k are the biases. In these Equations, µ, j, and k enumerate the number of samples, cluster centers, and classes, respectively. Trainable parameters in Equation 1 and 2 are the output weights, cluster centers, and covariance matrix.
Optimizing the RBF networks with an identity covariance matrix results in training in Euclidean space. It is possible to optimize a Mahalanobis distance [29] by training the main diagonal on the covariance matrix. Any arbitrary distance metric can be trained by optimizing the entire covariance matrix while projecting the matrix to the space of positive definite matrices. The distance r computed in Equation 1 is not only a measure of the proximity of an image to a cluster center but can also be used to compare images and find similar and dissimilar images in the embeddings space.
The linear and nonlinear activation functions used in RBFs are as follows [9]- [11]: Gaussian : Thin-plate spline : h(r) = r 2 ln r (5) In addition to the standard machine learning activation kernels in Equations 3 to 6, the kernel presented in Equation 7 is derived from the generalized Hardy's multiquadratic function [8] and results in the same function for α = 1/2. Du et al. [7] proposed the kernel in Equation 9 because of its convenience for implementation on DSP units. Various activation functions for RBF are depicted in Figure 2.
The complete process of training RBFs was introduced by Schwenker et al. [6] as three phase process: Unsupervised learning: This step is aimed at finding cluster centers that are representative of the data. The k-means [30] clustering algorithm is widely used for this purpose. kmeans iteratively finds a set of cluster centers and minimizes the overall distance between cluster centers and members over the entire dataset. The target of the k-means algorithm can be written in the following form: where x µ ∈ ϑ j denotes the members of the j th cluster shown by ϑ j .
Computing weights: The output weights of an RBF network can be computed using a closed-form solution. The matrix of activation of the samples is defined from the training set (H) as follows: Based on Equation 2, the matrix of output weights (W ), which estimates the matrix of labels (Y ), is computed by the following equation: where H † is the Moore-Penrose pseudo-inverse matrix [31] of H and is computed as: End-to-end optimization: After initializing the RBF weights and cluster centers with a clustering algorithms such as k-means, it is possible to optimize the network end-to-end via backpropagation and gradient descent. Schwenker et al. computed the gradients of the loss function for a Gaussian activation function in [6].

IV. ADAPTING RBFS FOR CNNS
In this section, we propose using RBF classifiers on top of CNNs as depicted in Figure 1. The deep embeddings are computed using standard convolutional layers and inception blocks. The deep embedding of the CNNs are flattened and fed to an RBF after using a fully connected layer in the architecture. The network ends with an output layer with softmax activation and is optimized end-to-end. Integrating the RBFs into deep structures and using them in conjunction with CNNs presents three challenges: Initialization: Training the RBFs from scratch with randomly initialized weights using gradient descent is quite inefficient due to inappropriate initial cluster centers. The large initial distances in high dimensional spaces leads to small activation values and the gradients attenuate considerably after the RBF hidden layer during backpropagation. Therefore, we use the k-means algorithm to initialize the cluster centers before starting the training. Furthermore, computing the weights from Equation 12 is not feasible at the scale of computer vision problems such as ImageNet [32], which has over 14 million images and 1000 classes. Hence, we randomly initialize the output layer and optimize it using gradient descent.
Dynamic input features: The input features of classical RBFs are fixed, but this assumption is not valid with respect to CNNs. As the embeddings of CNNs develop during the training process, the cluster centers initialized by the kmeans algorithm are no longer optimal after a few epochs of training. We propose to optimize the k-means algorithm target with the unsupervised loss defined in Equation 10 during the training process.
Activation: The nonlinear computational graph drawn by computing the distance in Equation 1 and applying the activations in equations 3-9 leads to inefficient gradient flow. We target this challenge by introducing a new activation function.

A. INTRODUCING UNSUPERVISED LEARNING LOSS
In this section, we explain two modifications to classical RBFs to make them suitable for deep CNNs. First, we introduce an additional loss term to the RBF hidden layer. This term is based on the target function of the k-means algorithm defined in Equation 10 and continues in the unsupervised learning process during the development of embeddings. Secondly, we introduce a new quadratic kernel to build a linear computational graph for efficient optimization using backpropagation.
The embeddings of CNNs change during the training process and necessitates updating the cluster centers with an unsupervised loss. We introduce an additional term to the networks loss function based on the unsupervised target in Section III to optimize the cluster centers during training using the k-means loss in Equation 10.
The final loss of a CNN with RBFs as the classifier (CNN-RBFs) is computed as Loss rbf = Loss supervised + λ × Loss unsupervised (14) where the classification loss Loss classification is any arbitrary loss function, for instance categorical cross entropy. It is conventional to use clustering algorithms such as the k-means or expectation-maximization (EM) algorithms to initialize the cluster centers. The loss function in Equation 14 is optimized using gradient descent by minimizing the distance of the embeddings for each sample from its nearest cluster center regardless of the class labels. The distance from the nearest cluster center is computed from the distance metric R j defined in Equation 1.

B. QUADRATIC KERNEL
The kernels used for classical RBFs are nonlinear, increasing model complexity. The architectures proposed in 1 profit from utilizing the state-of-the-art in representation learning, i.e. CNNs, as a backbone. Therefore, CNN-RBF architectures can be trained with simpler linear models to improve the gradient flow during backpropagation. The proposed quadratic activation function is linear in the space of r 2 and is defined as follows: where σ is the parameter that determines the width of the kernel. The proposed kernel is depicted in Figure 2 along with the conventional activation kernels. Our proposed quadratic kernel reduces the nonlinearity of the CNN-RBF computational graph for backpropagation.  a distance computation with matrix multiplication and linear activations.

V. VISUALIZATION OF TRAINING PROCESS
In this section, we visualize the performance of the RBFs on top of CNNs on a simple dataset. The experiments are conducted on the modified national institute of standards and technology dataset (MNIST) [34], which is a dataset of hand-written digits including 10 classes. Learning the dataset is considered a simple task in computer vision. The simplicity of the dataset and learning task provides us with the opportunity to visualize the training process at a fine level of detail. We chose the same number of cluster centers as classes (10) in the dataset to depict the training process of a CNN-RBF. The network architecture in this section consists of a four-layer CNN and the output of these layers is connected to the RBF after a global average pooling layer and a fully connected layer. Figure 3 demonstrates the evolution of a representation around the cluster center during the training process. The data samples in Figure 3 are placed according to their distance from the center and at a random angle. The samples are shown with a number denoting their class and are additionally color code accordingly in Figure 3. To reduce the overlap of similar samples, we add random uniformly distributed noise at an amplitude of 0.1 to the distance of the samples from the cluster centers.
Minimizing the unsupervised loss in Equation 10 reduces the distance of the data samples from the cluster centers. Furthermore, the supervised loss enforces the samples of the same class to maintain the same distance from cluster centers, as the activations are the only information for the final decision of the network. The circles with samples of the same class around the cluster centers demonstrate the effect of supervised loss in training. It should be noted that the clusters presented in Figure 3 are selected to optimally illustrate the concepts underlying training CNN-RBFs. Figure 4 illustrates the two-dimensional mapping of the CNN embeddings (top row) and RBF activations (bottom row) using t-SNE [33]. The effect of both supervised and unsupervised loss from Equation 14 is also visible in this figure. The data samples split into clusters regardless of their class labels in the embedding space of CNN due to the unsupervised loss (top row in Figure 4). The activation values divide into clusters corresponding to the class labels, a process encouraged by the supervised loss.

VI. EXPERIMENTAL RESULTS
This section presents the experimental results that reinforce the applicability of RBFs to CNNs. We use several standard computer vision benchmark datasets and investigate the effect of tweaking various hyperparameters of the CNN-RBF architectures in the training phase and generalization to test data. We used three convolutional backbones: those of EfficientNet-B0 [35], InceptionV2 [36], and ResNet50 [37]. A list of the benchmark computer vision datasets is presented in Table 1.
The hyperparameter searches in Figure 5 are conducted using the hyperband [45] algorithm with 4 agents running in parallel on two Quadro T2000 graphic processing units (GPUs) for approximately 10 days. It should be noted that dropout is only applied after the CNN backbone and before the fully connected layer in Figure 1. The output of the fully connected layer, without any activation function, is used as the input feature of the RBFs. The results in Figure 5 shows that training CNN-RBF architectures leads to high performances with a wide range of hyperparameters. However, achieving good test performance with a high dropout rate and a large input dimension is challenging. CNN-RBF architectures show a better performance without dropout and rectified linear unit (ReLU) activations in the input layer of RBFs. Thus, we neglect the dropout for further hyperparameter searches conducted on the datasets in Table 1. The list of optimal hyperparameters for all datasets is presented in Table  2. We also applied CNN-RBFs to several other computer vision datasets with various backbone architectures. The experimental results of applying the CNN-RBFs to the computer vision benchmark datasets with the standard train and test splits are presented in Table 3. CNN-RBFs show the capacity to learn the entire dataset in all of the cases. There is, however, a small gap between the best reported performances in computer vision literature and CNN-RBF architectures. Using dropout with CNN-RBFs for regularization does not lead to desirable results and reducing the number of parameters of the RBFs, while limiting the input size, is the best regularization strategy that we found for RBFs besides data augmentation. Developing regularization methods for RBFs to improve generalization is an open research topic for reducing the gap between our current results and those VOLUME 4, 2016 of state-of-the-art computer vision models.

VII. SIMILARITY METRIC LEARNING AND INTERPRETABILITY OF CNN-RBFS
Using a different approximation strategy compared with fully connected layers provides CNN-RBFs with the chance to probe the decision-making process based on these visual clues: • Similar images as measured by the similarity distance metric of RBFs trained on CNN embeddings • Visualizing the clusters with higher contribution to the network's decision and distance of the samples from the centers of these clusters The embeddings of CNN are evaluated by their learned distance metric from cluster centers in RBFs. The same distance can be used to measure the distance between a test image and similar images from training data. Figure 6 shows the similar images found in the training dataset for a given test sample by the similarity distance metric in Equation 1. The most similar and dissimilar images are computed using the following criteria: where x test presents the input of RBFs for a given test image, x µ train shows the input vector for training samples, and µ enumerates the training samples from 1 to N . x simliar and x dissimliar represent the most similar and dissimilar images to the given test image (x test ) respectively. We can use the same similarity metrics in Equation 16 and 17 to create a ranked list of similar and dissimilar images for a given test sample. Figure 7 compares the performance of the similar sample selection for a given test images. The figure suggests that the learned metric and Euclidean distances outperform the cosine distance for similar sample selection. Furthermore the learned metric slightly outperforms the Euclidean distance in this specific case.
The active clusters for every sample provide the reasoning behind the final decision of a CNN-RBF. The clusters can be described by the distance of images from their centers. Figure  8   centers against a test sample. The product of activations and output weights determines the final decision of an RBF. Thus, the importance of clusters for a decision can be determined by sorting the product of activations and class weights. Figure  8 depicts the clusters with the highest contributions to the correct class (ground truth) and the wrong class based on this multiplication product. The wrong class here refers to the class with the second-highest level of confidence.

VIII. CONCLUSION
The research work presents fundamental architectural modifications that are applied to RBFs to integrate them with CNNs for computer vision. The experimental results indicate that the integration of RBFs on top of CNNs achieves competitive performances in benchmark computer vision datasets by combining supervised and unsupervised learning. The proposed activation and training process is compatible with any arbitrary state-of-the-art CNN architecture, including inception blocks and residual connections. The small gap between the CNN-RBFs performance and best CNN models is a subject for future research to find optimal regularization methods for RBF networks. Using RBF architecture with CNNs introduces two unique and network-specific opportunities for learning a similarity distance metric and interpreting the decision-making process in more detail. Similar and dissimilar images found using a similarity distance metric trained by RBFs are interpretable by humans. The cluster representations are currently only used to trace the decision making process as in the current research, the distribution of images around clusters are not visually conclusive due to being optimized in an unsupervised manner regardless of ground-truth labels.

ACKNOWLEDGMENTS
We acknowledge Thilo Stadelmann and Yvan Putra Satyawan for their suggestions, grammatical and stylistic edits. This paper was made possible by the open source Tensorflow [46] and Keras [47] deep learning libraries, the  Table 1 with the same order.   Weights & Biases toolbox [48] 1 for experiment logging, and PlotNeuralNet 2 for neural network visualization. We acknowledge the significant effort invested into developing these tools. Processing. His research interests include -but are not limited to -artificial neural networks, machine learning, statistical learning theory, data mining, pattern recognition, information fusion and affective computing. He (co-)edited 20 special issues and workshop proceedings published in international journals and publishing companies. He published 200+ papers at international conferences and journals and served as (Co-)Chair of the IAPR TC3 on Neural Networks and Computational Intelligence. Since 2016 he is the Chair of the new IAPR TC9 on Pattern Recognition in Human-Computer Interaction.