Hybrid Restricted Boltzmann Machine–Convolutional Neural Network Model for Image Recognition

Convolutional Neural Networks (CNNs) have become a standard approach to many image processing problems. Most proposed CNN architectures tend to increase model depth or layer complexity; as a result, they have many parameters and need considerable computing resources and training examples. However, some recent works show that shallow neural networks, or architectures without convolutions, can achieve similar results, and such models are often used in systems with limited resources. These considerations led us to a relatively simple preprocessing layer that increases the accuracy of a CNN or may reduce its complexity. The layer is composed of two parts: the first transforms RGB data into a binary representation; the second is a neural network, trained in a fully unsupervised manner, that transforms the binary data into a multi-channel, real-valued matrix. Our proposal also includes a metric for measuring the similarity of training data, which proves useful in transfer learning. Our experiments show that the resulting architecture not only improves accuracy but is also more robust to image noise, including adversarial attacks, when compared to state-of-the-art models.


I. INTRODUCTION
Due to the consolidation in artificial intelligence (AI), many problems are being approached from the deep learning perspective [1]. To date, plenty of deep models, architectures, and training methods have been designed and implemented to improve the efficiency of general solutions. There have also been many evaluations, surveys, and reviews of the existing state-of-the-art models, such as [2]-[5], to mention just a few recent works. With so many existing solutions, there are even methods to compare models with each other based on their behavioral responses [6]. The majority of these solutions are based on the supervised approach since, due to its deterministic nature, it is easier to evaluate. On the other hand, there is a parameter that is extremely important in many applications and partially stands for the complexity of the model: the number of its parameters. There is a clear trend showing that more accurate solutions have more parameters and take more time to compute the output. Since our work aims to reduce model complexity by lowering the number of parameters, and also introduces a hybrid architecture trained in both a supervised and an unsupervised manner, we present an overview of the trends in both areas with model size as an evaluation factor.
The associate editor coordinating the review of this manuscript and approving it for publication was Jiachen Yang.
The first layers in our hybrid model are trained in a fully unsupervised manner, which means we can use unlabelled data. Many methods can handle these kinds of problems; for models with categorical output, the solution most likely leads to clustering-based methods [7]. Otherwise, the most common tasks are data transformation [8], [9], regression [10], or some sort of structured prediction [11]. These are relatively simple techniques with a low parameter count that prove useful in many scenarios. However, these methods have major disadvantages, primarily the lack of simple measures describing how well a model fits the data. Generally, the fit depends on model hyper-parameters that have to be chosen before model fitting and are not simple to estimate. Moreover, there are no simple methods to measure how models generalize to new data not seen during training, especially for large inputs like images. Improved performance may be achieved with models that combine data transformation and clustering, such as the neural network-based models called auto-encoders [12]. These models transform one space into another with lower dimensionality and fit an auto-associative memory to recognize potential clusters [13]. A good example of such models is the Boltzmann Machine (BM) [14] and its simplified version, the Restricted Boltzmann Machine (RBM) [15], which has proven effective in many tasks such as feature extraction [16], dimensionality reduction [17], classification [18], and collaborative filtering [19]. An RBM is a generative model that can learn the probability distribution of the training dataset.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This feature is extremely useful in many tasks in which a model has to detect the most important features occurring in image data [20]-[22]. It is fundamental to the research presented in this paper, since the preprocessing we propose is meant to transform the input image into high-dimensional features useful in semantic object recognition. RBMs have a rather low parameter count, in the range of thousands, which makes them compact and efficient models. Other good examples of generic machine learning models trained in an unsupervised manner can be found in projects that use Variational Autoencoders (VAEs) [23], where VAEs serve as pre-trained first convolutional layers of the CNN model. There exist many variations of VAEs [24] that are useful, among other areas, in image processing [25]. However, VAEs are much more complex models than RBMs: their number of parameters ranges from 8 to 40 million [26], which makes them costly to compute. They also require two-phase training for effective solutions [27]. To achieve accuracy comparable to VAE-based models, RBMs have been stacked together to create Deep Belief Networks (DBNs) [28]. This also allows RBM models to be trained with techniques similar to those used for VAEs [29]. These methods improved RBM performance (and increased the number of parameters), but not to the level of modern very deep CNNs [30]. This is due to their binary nature and their complexity for high-dimensional data, and it leads to the second challenge taken up in this paper: data preprocessing for the binary input of RBMs.
The backbone of the hybrid model introduced in this paper is a classic convolutional network. Since we used it in our experiments to categorize images, it is trained in a supervised manner, with the images preprocessed by the RBM layer fed to the input of the backbone. Thanks to the preprocessing technique, the number of parameters in the backbone can be limited, making the entire model less complex to compute. As mentioned above, there exist a number of models in the unsupervised training category that feature high accuracy, together with comprehensive reviews. However, the number of parameters in these models is in the tens of millions [31], which makes them very complex and GPU dependent. In addition, newer models usually add complexity with respect to previous solutions [32].
Our proposal for an efficient and relatively shallow network is the concatenation of the DBN, preceded by an input preprocessing unit, with a classic convolutional back-end. The natural choice for the preprocessing unit is a binary descriptor; moreover, data transformed with binary descriptors contain more complex features of an image than its raw RGB form. Among many descriptors, we decided to use a Local Binary Pattern with eight neighbors (LBP8) because it performs best when combined with the RBM layer [33]. In this paper, we propose how to extend it to include the color information of processed images. The model provides feedback on the data that can be used to assess whether it has to be retrained for unseen data. Moreover, it features high accuracy, similar to its deep [30] and shallow [34] counterparts. The RBM layer increases robustness to noise. This is true even for heavy adversarial noise, which constitutes a serious problem in other state-of-the-art models [35]. Experimental results show that our model's accuracy can be higher by tens of percentage points in the case of adversarial attacks.

A. COLOR LOCAL BINARY PATTERN
The LBP descriptor performs feature coding in gray-scale color space, hence the color information is lost. We also lose the intensity of the center pixel, as this transformation relies only on comparisons of the intensity of the neighboring pixels. This may be sufficient for simple tasks and has been used successfully, for example, in face recognition [36], [37]. However, more complex classification problems require color features, as they can add important information. To compensate, the LBP8 descriptor was enhanced with another 8 bits that represent the color and intensity of the processed feature. The value of the additional binary feature is obtained from the center pixel of the currently processed blob. For each pair of colors (R: red, G: green, B: blue), 2 bits are computed ($d = [b_0, b_1]$). For a single pair of colors, $C_0$ and $C_1$, we propose to compute the corresponding bits with logic formulas that use a threshold $T = \frac{2-\sqrt{2}}{2}\,\mathrm{MAX}(C)$. This procedure equalizes the probability of each possible descriptor over all the color pairs.
Assuming the color pixel values are in the range [0, 255], T is equal to 73; the resulting descriptor distribution is visualized in Fig. 1. It is important to note that, since the CLBP descriptor of each pixel is computed independently, all descriptors can be computed in parallel, which makes this feature extraction algorithm very fast, especially on GPU-accelerated systems.
The LBP8 is computed in the standard way. For a given part of the image, where $p$ is a single pixel in grayscale, a single bit of the descriptor is computed for each neighboring pixel $p_{j,k}$, where the $(j, k)$ pairs start at $(i-1, j-1)$ and are chosen sequentially clockwise. The LBP8 descriptor is the vector of these bits.
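The standard LBP8 computation can be illustrated with a minimal sketch in plain Python. It assumes the usual convention in which a bit is set when a neighbor is at least as bright as the center pixel, and it visits the neighbors clockwise starting at the top-left one. The function name `lbp8`, the `OFFSETS` table, and the comparison direction are illustrative assumptions rather than the exact formulas used in this work.

```python
# Clockwise neighbor offsets (row, col), starting at the top-left neighbor.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp8(gray, i, j):
    """Return the 8-bit LBP descriptor (list of 0/1) for pixel (i, j).

    Assumption: a bit is 1 when the neighbor is >= the center pixel.
    """
    center = gray[i][j]
    return [1 if gray[i + di][j + dj] >= center else 0 for di, dj in OFFSETS]


# Toy 3 x 3 grayscale patch; the center pixel has intensity 100.
patch = [
    [10, 200, 30],
    [90, 100, 110],
    [250, 40, 100],
]
bits = lbp8(patch, 1, 1)
print(bits)  # [0, 1, 0, 1, 1, 0, 1, 0]
```

Because each pixel only reads its own 3 × 3 neighborhood, the same function can be applied to every interior pixel independently, which is what makes the descriptor easy to parallelize.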

B. RBM AS BINARY PATTERNS PROCESSOR
The RBM can be presented as an undirected graph with input vector $v$, hidden units $h$, weight matrix $W$, visible biases $a$, and hidden biases $b$, as presented in Fig. 3. The probability of hidden unit activation can be computed as follows [22]:
$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i v_i W_{ij}\Big)$$
where $\sigma$ is the sigmoid function. In our pipeline, RBMs are used to transform the binary data obtained with the CLBP descriptor into real-valued vectors. These data represent image features such as edges, blobs, or simple shapes, which are usually recognized by the filters of the first CNN layers trained in a supervised way [38]. Using RBMs permits skipping these layers, which results in a smaller architecture for a given image classification task. As a consequence, our preprocessing and the RBM layers allow the first convolutional layers to process more complex features.
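The hidden-unit activation described above can be sketched in a few lines of plain Python. The layout `W[i][j]` (visible unit `i`, hidden unit `j`), the function names, and the toy weights are assumptions made for illustration only; they are not trained values.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probabilities(v, W, b):
    """P(h_j = 1 | v) for a binary visible vector v.

    W[i][j] is the weight between visible unit i and hidden unit j,
    b[j] is the bias of hidden unit j.
    """
    return [
        sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(b))
    ]


# Toy example: 4 visible units, 2 hidden units (weights are made up).
v = [1, 0, 1, 1]
W = [[0.5, -1.0], [0.2, 0.3], [-0.5, 1.0], [0.0, 2.0]]
b = [0.0, -2.0]
p = hidden_probabilities(v, W, b)
print(p)  # both pre-activations are 0, so both probabilities are 0.5
```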
Once the CLBP stage is done, the RBM processes the data. In the simplest form, the descriptor values are passed directly to the RBM, so the number of visible units is 16; the receptive field can be made larger by concatenating the CLBP descriptors from a kernel of size $K$. The stride $S$ can also be used to adjust the overlapping regions and the size of the output matrix. The dimension of the RBM input vector $v$ is $n = 16K^2$. Fig. 4 presents how the input vector for the RBM is formed for $K = 2$. The input of the RBM layer is a matrix formed from these vectors. The RBM in this architecture becomes de facto a standard feed-forward neural network with the sigmoid activation function, so the value of the $h_j$ unit for the visible vector $v$ is given by:
$$h_j = \sigma\Big(b_j + \sum_{i=1}^{n} v_i W_{ij}\Big)$$
The same computation can be written in matrix notation for all input vectors at once. The dimensionality of the output data depends on the number of hidden units $RH$ and the stride $S$, which together determine the size of the matrix computed by the preprocessing pipeline. The CLBP-RBM transformation and the entire preprocessing pipeline are illustrated in Fig. 5.
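The way the RBM input vectors are formed from a kernel of size $K$ with stride $S$ can be sketched as follows. This is only an illustration under stated assumptions: no padding, row-major concatenation of the descriptors inside each window, and a hypothetical helper name `rbm_inputs`. With 16-bit CLBP descriptors, each vector then has $16K^2$ bits (16 for $K = 1$, 64 for $K = 2$).

```python
def rbm_inputs(desc, K, S):
    """Concatenate K x K neighboring descriptors into RBM input vectors.

    desc[r][c] is a list of bits (16 per pixel for CLBP).
    Returns a 2-D grid of flat binary vectors, one per window position.
    Assumptions: no padding, row-major concatenation order.
    """
    rows, cols = len(desc), len(desc[0])
    grid = []
    for r in range(0, rows - K + 1, S):
        line = []
        for c in range(0, cols - K + 1, S):
            vec = []
            for dr in range(K):
                for dc in range(K):
                    vec.extend(desc[r + dr][c + dc])
            line.append(vec)
        grid.append(line)
    return grid


# Toy 4 x 4 descriptor map with 16 zero bits per pixel.
desc = [[[0] * 16 for _ in range(4)] for _ in range(4)]
grid = rbm_inputs(desc, K=2, S=2)
print(len(grid), len(grid[0]), len(grid[0][0]))  # 2 2 64
```

Passing every vector of the grid through the hidden layer yields one real-valued vector of length $RH$ per window position, so a larger stride gives a smaller output grid.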

C. DATASET COMPARISONS BASED ON RBM ENERGY
RBMs are energy-based models, which means that they assign a scalar value (usually called energy) to every possible configuration of $v, h$ [39], defined as:
$$E(v, h) = -a^{T}v - b^{T}h - v^{T}Wh$$
The energy is directly related to the probability of occurrence of a given configuration of $v$ and $h$:
$$P(v, h) = \frac{e^{-E(v,h)}}{Z}$$
where $Z$ is a partition function that sums $e^{-E(v,h)}$ over all possible $v, h$ configurations. The partition function is not tractable for high-dimensional RBMs and has to be estimated [40]. The marginal probability of a given $v$ is defined as the sum of the probabilities over all possible $h$ configurations, but it may be computed with the use of the free energy value $F(v)$ [41] by using the following formula:
$$e^{-F(v)} = \sum_{h} e^{-E(v,h)}$$
where $F(v)$ is defined as:
$$F(v) = -a^{T}v - \sum_{j} \log\left(1 + e^{x_j}\right)$$
and
$$x_j = b_j + \sum_i v_i W_{ij}$$
Finally, the marginal probability of a given $v$ vector is defined by the following equation:
$$P(v) = \frac{e^{-F(v)}}{Z}$$
These probabilities, computed for a subset of the data, can be collected in a histogram that has a small number of bins. The histogram of the original data used for training has to be stored as a reference for comparison with unknown data. For practical purposes, we recommend using the interquartile range (IQR) to select the most important part of the histogram, and then using Scott's rule [42] to optimize the number of bins. The intervals of $P(V)$ used to generate a histogram may be constant, given as $\frac{\max(P(V))}{\text{number of bins}}$, or chosen automatically to flatten the reference histogram, so that the height of each bin is $\approx \frac{1}{\text{number of bins}}$. Flattening the histogram avoids empty bins and large imbalance between the heights of particular bins; in this case, the width of each bin is computed for the reference histogram, and the same widths are then used to compute histograms for other datasets.
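The free-energy computation described above can be sketched in plain Python. The sketch assumes the standard RBM free energy with `W[i][j]` connecting visible unit `i` to hidden unit `j`; the function names and toy parameters are illustrative. It returns $e^{-F(v)}$, which is only proportional to $P(v)$, because the partition function $Z$ is not computed here.

```python
import math


def free_energy(v, W, a, b):
    """F(v) = -a.v - sum_j log(1 + exp(b_j + sum_i v_i W_ij))."""
    visible_term = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden_term = 0.0
    for j in range(len(b)):
        x_j = b[j] + sum(v[i] * W[i][j] for i in range(len(v)))
        # log(1 + e^x) computed in a numerically stable way
        hidden_term += max(x_j, 0.0) + math.log1p(math.exp(-abs(x_j)))
    return -visible_term - hidden_term


def unnormalized_probability(v, W, a, b):
    """exp(-F(v)), proportional to P(v); Z is deliberately left out."""
    return math.exp(-free_energy(v, W, a, b))


# Toy RBM with 3 visible and 2 hidden units (parameters are made up).
v = [1, 0, 1]
W = [[0.5, -0.3], [0.1, 0.8], [-0.2, 0.4]]
a = [0.1, -0.1, 0.2]
b = [0.0, 0.3]
p_tilde = unnormalized_probability(v, W, a, b)
```

Since every dataset is scored by the same trained RBM, the unknown constant $Z$ is shared by all values, so histograms of the unnormalized values differ from those of $P(v)$ only by a rescaling of the horizontal axis.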
To measure the similarity of two datasets ($T_1$ and $T_2$) using their histograms ($P$ and $Q$) with a scalar value $d$, we can use the Chi-Squared ($\chi^2$) distance [43], which is accumulated over the histogram bins $P_i$ and $Q_i$, where $i$ denotes a particular histogram index.
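A histogram comparison of this kind can be sketched as below. Several variants of the Chi-Squared distance are in use; this sketch assumes the common symmetric form $\frac{1}{2}\sum_i \frac{(P_i - Q_i)^2}{P_i + Q_i}$, which may differ from the exact definition in [43]. The helper names, bin edges, and sample values are illustrative assumptions.

```python
def histogram(values, edges):
    """Normalized histogram of values over shared bin edges.

    Values outside [edges[0], edges[-1]] are ignored; the last bin
    includes its right edge.
    """
    counts = [0] * (len(edges) - 1)
    for x in values:
        for k in range(len(counts)):
            last = k == len(counts) - 1
            if edges[k] <= x < edges[k + 1] or (last and x == edges[-1]):
                counts[k] += 1
                break
    total = sum(counts)
    return [c / total for c in counts]


def chi_squared(p, q):
    """Assumed form: 0.5 * sum_i (p_i - q_i)^2 / (p_i + q_i); empty bins skipped."""
    return 0.5 * sum(
        (pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q) if pi + qi > 0
    )


# The bin edges come from the reference data and are reused for other data.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
reference = histogram([0.1, 0.3, 0.6, 0.9], edges)   # flat: 0.25 per bin
other = histogram([0.1, 0.1, 0.2, 0.9], edges)       # mass shifted to bin 0
d = chi_squared(reference, other)
print(d)  # 0.375
```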

D. NOISED PATTERN RECONSTRUCTION WITH RBMs
Since RBMs are similar to Hopfield network models, they may be used for data reconstruction [44]. The binary patterns processed by the network may be noised or corrupted. This happens especially in real-time systems, where the data are gathered directly from image sensors and there is no time for re-send procedures in the case of transmission failures.
For reconstruction, we propose using CLBP patterns. The RBM is a recurrent neural network, so the noised pattern may initialize the Markov chain, and after several Gibbs steps the data are closer to the patterns the RBM has been trained on. The simplified reconstruction process is presented in Fig. 6. Algorithm 1 obtains the data for the CNN for a given $v$ vector with reconstruction. Its output is always a real-valued vector, as it is without reconstruction, but the intermediate values are binary. It is hard to predict how many Gibbs steps should be run to achieve the best reconstruction results: too few steps may not denoise the pattern correctly, while too many steps may cause the RBM to meander in the attractor space. A detailed investigation of the number of Gibbs steps for our case is given later in the results section.
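The reconstruction procedure described above can be sketched in plain Python as follows. This is a hedged illustration, not the exact algorithm of this work: it assumes that each Gibbs step samples binary hidden and visible states from their conditional probabilities, and that the final output is the real-valued hidden activation of the last visible state. All function names and the toy parameters are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(v, W, b):
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def sample(probs, rng):
    """Draw a binary vector from independent Bernoulli probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def reconstruct(v, W, a, b, steps, rng):
    """Run `steps` Gibbs steps from a noised v, then return hidden activations.

    Intermediate states are binary; the returned vector is real-valued.
    """
    for _ in range(steps):
        h = sample(hidden_probs(v, W, b), rng)
        v = sample(visible_probs(h, W, a), rng)
    return hidden_probs(v, W, b)


# Toy RBM with 4 visible and 2 hidden units (parameters are made up).
W = [[2.0, -1.0], [2.0, -1.0], [-1.0, 2.0], [-1.0, 2.0]]
a = [0.0, 0.0, 0.0, 0.0]
b = [-1.0, -1.0]
v_noised = [1, 1, 1, 0]
features = reconstruct(v_noised, W, a, b, steps=1, rng=random.Random(0))
```

With `steps=0` the function reduces to the plain feed-forward pass, which makes it easy to compare the network with and without reconstruction.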

III. RESULTS
The previous section presents how Restricted Boltzmann Machines can be applied to input data preprocessing for convolutional neural networks. The main assumption is that the RBM is a relatively small and fast structure whose use may result in better performance of the overall image recognition pipeline. This section presents the experiments that we performed to check this hypothesis. The experiments are split into 4 cases: A) input data preprocessing in order to make use of unlabeled data, B) input data preprocessing in order to reduce the size of the convolutional neural network, C) input data denoising for better recognition ability, D) image dataset comparison. For experiment A, we trained an RBM to visualize how it can learn image features without labeled input data; then we compared how the use of the CLBP-RBM layer affects the recognition ability of the convolutional network. For experiment B, we reduced the number of convolutional layers in the CNN and tested how the accuracy decreases with and without the CLBP-RBM layer. These tests showed that the proposed preprocessing layer improves the overall effectiveness of the CNN in terms of generalization ability and possible size reduction. For experiment C, we used two types of noise added to the input data:
• random pixel noise,
• gradient distortion.
Then we tested how the CNN recognizes the validation data. For both types of noise, the accuracy of the network decreases as the noise factor increases, but we showed that CLBP-RBM can significantly reduce this decrease. Experiment D consisted of comparing the probability histograms computed by RBMs for different datasets; it showed that this type of measure can be used to check the similarity of input data.

A. TRAINING RBMs TO LEARN SIMPLE IMAGE FEATURES
The first step in investigating whether an RBM is able to recognize image features is to visualize the responses of its hidden units. Single hidden units should react to particular features, such as edges, corners, blobs, or colors. Visualizing the RBM filters is not useful in this case because they process binary patterns that are not directly a part of the image. That is why we processed a test image containing different features and colors and visualized the RBM response of each hidden unit, as presented in Fig. 7. The middle part of Fig. 7 presents how the RBM processes the image when it has not been trained: the hidden units respond randomly, independently of the image features. The responses of trained RBMs differ in each channel depending on what the particular unit reacts to. For example, the regions marked with blue rectangles show the connection between the blue color in the image and the high response in filter number 37 (row 5, column 5), so this hidden unit is trained to recognize blue regions. Similar connections may be observed in the regions marked with red and green rectangles, but those units react to lines or blobs.

B. CLBP-RBM PREPROCESSING VALIDATION
This section presents how a CLBP-RBM preprocessing layer affects the overall validation accuracy of a CNN. To make use of unsupervised learning, we utilized the STL-10 dataset [45], which is composed of three parts:
• 100000 unlabeled images for unsupervised learning,
• 5000 labeled images (500 per category) for supervised training,
• 8000 labeled images (800 per category) for testing.
For the CNN part of the network we chose 3 commonly used backbones: [47]. We also added a small backbone network composed of six convolutional layers followed by max pooling layers, further referred to as ''our'' network. For all the backbones, after the convolutional layers we added global average pooling (GAP) [48] and dropout [49] before the last fully connected (FC) layer, as shown in Fig. 8. The first step was to find the best parameters for the preprocessing layer. For varying kernel size and number of hidden units, we tested the accuracy of the entire network and the response time of the preprocessing layer. The results are visualized in Fig. 9. All the results are relative to the lowest point (kernel size = 1, number of hidden units = 16). The best accuracy with a low response time is obtained for kernel size = 2 and number of hidden units = 48; therefore, we use this set of parameters in further experiments.
We trained an RBM on the unlabeled data for 30 epochs, then the CNN and FC on the training data for 150 epochs with the cross-entropy loss function and the RMSProp optimizer. We then compared the validation accuracy with and without CLBP-RBM preprocessing. The measured value is a standard relative error. The results in Table 1 illustrate that, with the proposed preprocessing, the final neural network achieves higher validation accuracy; this was confirmed using four backbones, and the accuracy metric for all of them was better when CLBP-RBM was used. The table also includes the size of the network and the time needed to process the image relative to the CLBP-RBM time. This indicates that adding the preprocessing layer does not greatly increase the overall processing time and that we can achieve better accuracy independently of the size of the CNN backbone. Fig. 10 presents the learning curves for our backbone. The network with CLBP-RBM preprocessing outperforms the version without it.
Based on the assumption that CLBP-RBM preprocessing enriches the input features for CNNs, we tested how reducing the number of convolutional layers affects the overall validation accuracy. Table 2 presents the results relative to ''our'' network with six convolutional layers. For a CNN without preprocessing, removing some of the convolutional layers affects the accuracy more significantly than for the network using our preprocessing, so our approach may make it possible to use smaller networks that perform the same tasks with higher accuracy. For example, removing two convolutional layers reduced the accuracy by 5% with preprocessing, whereas the reduction was 10% when no preprocessing was used. Therefore, designing a CNN for resource-limited systems, where the trade-off between the size and accuracy of the network is significant, may be simpler with the proposed preprocessing.

C. PATTERN RECONSTRUCTION
As mentioned in the previous section, RBMs may be used to reconstruct noised data, and we use this ability with binary patterns. Since the LBP transformation is not linear, we cannot invert it to visualize the reconstructed patterns, but we can measure the Hamming distance between the CLBP descriptors of the original and the reconstructed pattern and compare it with the distance between the CLBP descriptors of the original and the noised pattern. To investigate the reconstruction ability, we trained RBMs on input vectors obtained from the STL-10 dataset; since the response time is not critical for this experiment, we used kernel size = 4, which gives 256 visible units in an RBM. Then we chose random patterns from the dataset and added random noise in n pixels. Some examples are shown in Table 3. Using this noising method, we defined a metric to measure the reconstruction ability (Eq. 17), where D is the Hamming distance [50], OP is the original pattern, NP is the noised pattern, and R is the reconstructed pattern. The Hamming distance is a natural way to compare binary vectors, so this metric allows us to observe the general improvement due to reconstruction: the higher the value, the better the reconstruction. The comparison was carried out 100 times on 50 randomly chosen patterns from the training dataset, and the value was then averaged for each noise level and each number of Gibbs steps. The results in Table 4 show that the reconstruction efficiency increases with increasing noise. The best number of Gibbs steps for reconstruction is 1, because it gives the most significant improvement compared to other numbers of steps and needs the least computation.
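The two Hamming distances that the reconstruction metric relies on can be computed as in the short sketch below. Only the distances themselves are shown; how they are combined into the metric of this work is not reproduced here. The function name and the example bit vectors are made up for illustration.

```python
def hamming(u, v):
    """Number of positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(u, v))


original      = [1, 0, 1, 1, 0, 0, 1, 0]
noised        = [1, 1, 1, 0, 0, 1, 1, 0]  # 3 bits flipped
reconstructed = [1, 0, 1, 0, 0, 0, 1, 0]  # 1 bit still wrong

d_noised = hamming(original, noised)
d_reconstructed = hamming(original, reconstructed)
print(d_noised, d_reconstructed)  # 3 1
```

In this toy case the reconstructed pattern is closer to the original than the noised one, which is the situation a successful reconstruction should produce.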
Since the previous experiments suggest that a CLBP-RBM layer may help in recognizing noised images, we added random noise to the images being classified in the STL-10 validation set. Noise was added randomly in each validation stage with three noise factor values. The noise factor is the proportion of noised pixels in the image. Examples of noised samples are shown in Fig. 11. The accuracy and loss over the epochs are presented in Fig. 13 and Fig. 12a. The results show the impact of noise on the generalization ability of the entire network. The accuracy decreases as the noise factor increases, but the influence of the CLBP-RBM layer is also significant: it improves the accuracy independently of the noise factor. For the highest noise factor = 0.75, the accuracy was 18% higher with the RBM layer than without.
Another type of perturbation of an input image is the fast gradient sign method [51], which relies on noising the image based on the gradient computed by the network. The formula for obtaining the distorted image is as follows:
$$\tilde{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(x, y)\right)$$
where $x$ is an input image with pixel values assumed to be in the range [0, 1], $J$ is the loss function, $y$ is the output label, and $\epsilon$ is the noise factor. In this case, the distortion is performed on an already preprocessed image (i.e., after the CLBP-RBM layer); however, an RBM may reconstruct the data according to Algorithm 1, assuming $v$ is the noised input. We tested how the fast gradient sign method affects the generalization ability of the network with CLBP-RBM preprocessing compared to a network without it. The results are shown in Fig. 14. This distortion affects both networks, but for the network that uses CLBP-RBM preprocessing the decrease in accuracy is significantly smaller. This demonstrates that our method may also be applied to this use case.
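The fast gradient sign perturbation can be illustrated with a small self-contained sketch. To keep it runnable without a deep learning framework, it assumes a toy logistic-regression classifier whose input gradient is known analytically; the model, its weights, and the final clipping to [0, 1] are assumptions for illustration and are not the networks used in the experiments.

```python
import math


def sign(g):
    """Return -1, 0, or 1 depending on the sign of g."""
    return (g > 0) - (g < 0)


def logistic_loss_grad(x, y, w, b):
    """Gradient of the binary cross-entropy w.r.t. the input x
    for a logistic model p = sigmoid(w.x + b)."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - y) * wi for wi in w]


def fgsm(x, grad, eps):
    """x_adv = x + eps * sign(grad), clipped to the valid range [0, 1]."""
    return [min(1.0, max(0.0, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]


# Toy input with three "pixels" and a made-up linear classifier.
x = [0.2, 0.9, 0.5]
y = 1
w = [1.0, -2.0, 0.0]
b = 0.0
eps = 0.1

grad = logistic_loss_grad(x, y, w, b)
x_adv = fgsm(x, grad, eps)
```

Each pixel moves by at most `eps` in the direction that increases the loss; the third pixel is unchanged here because its weight, and therefore its gradient, is zero.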

D. RBM ENERGY FOR MEASURING THE SIMILARITY OF TWO DATASETS
For measuring the similarity of data from different datasets, we used the previously mentioned STL-10 and two others: DTD and Indoor Scene [53], the latter composed of images from different room scenes. For humans, the DTD images differ from the other two, while the Indoor Scene and STL-10 images seem similar: although they present different objects, they are pictures taken from the real world, so their features should, in general, be similar. Table 5 includes the distances between those datasets. The first two rows in Table 5 show the distances for RBMs trained on the STL-10 and Scene datasets. For these two rows, the distance to the DTD dataset is much higher than the distances to the other datasets. Furthermore, for an RBM trained on the DTD dataset, the distances to the Scene and STL-10 datasets are similar but significantly higher than for the other two RBMs. This implies that RBMs should not be transferred from the Scene or STL-10 datasets to the DTD dataset and vice versa. Fig. 15 presents the histograms for all trained RBMs and different datasets. For visualization purposes, the number of bins is 10, no IQR is used, and the histograms are flattened. Although the comparisons are simplified, the visualizations lead to similar conclusions as Table 5.

IV. CONCLUSION
This paper describes and investigates the potential of Restricted Boltzmann Machines as Local Binary Pattern processors used as a preprocessing tool for Convolutional Neural Networks. First, we introduced the improved Local Binary Pattern, which enhances the original LBP with an additional 8 bits that describe the color and simplified intensity of processed pixels. In further tests, we showed that an RBM is capable of recognizing these additional features. Fig. 7 shows an example of how hidden units are trained to recognize basic image features, which are then efficient representations for the CNN to classify complex objects. The enhanced LBP is an efficient way to provide additional information about image features.
The primary aim of this project was to achieve improved generalization ability by using an RBM as a preprocessor for the CNN, trained on an unlabeled dataset with unsupervised learning. We addressed this with a CLBP-RBM layer, which can be trained in a fully unsupervised manner. Used for image preprocessing, this layer increased the overall accuracy of the entire model on the validation dataset by 2.6%-7.5%. We tested several commonly used CNN backbones plus one small custom backbone of our own; they differ in the number of convolutional layers and the number of parameters. This suggests that a CLBP-RBM layer is a good choice for a wide range of CNN backbone architectures. The time measurements show that the preprocessing layer does not greatly affect the overall processing time. Another test revealed that by using a CLBP-RBM layer we can reduce the number of convolutional layers without a significant decrease in accuracy. This feature can be extremely useful for systems with limited resources. An additional advantage of the proposed preprocessing layer is its potential use in denoising input images. We performed a range of tests on the recognition of noised images, and the results lead to the conclusion that the CLBP-RBM layer improves image recognition quality. Thanks to their recurrent structure and reconstruction capability, RBMs may also denoise corrupted patterns by running a number of Gibbs steps. This proved useful for corrupted images when the noise factor was very high. Furthermore, the proposed denoising method was effective against adversarial attacks: we showed that our method achieves significantly better accuracy under this type of perturbation than a network without any preprocessing.
We also addressed the problem of measuring how well trained RBMs fit the data in transfer learning tasks. This is achievable since the free energy of an RBM can be used to compute the marginal probability of a given input vector. In this way, we can create a histogram of these probabilities taken from a subset of data and then measure the distance to the reference histogram created while training the RBM on the original data. In our opinion, this is an essential feature of the preprocessing step since it introduces the model's ''awareness'' of the input data. In other words, the model can estimate whether it fits the input data well or has to be retrained. Most of the images used in training the presented models depict natural objects and therefore have a similar binary pattern distribution. Thus, if the preprocessing model fits the data well, the time of the retraining procedure can be saved.
Since we have shown that a CLBP-RBM preprocessing layer is applicable in classification tasks, we believe that it may be useful in other image processing problems, such as object detection or segmentation because the first layers in CNNs for those rely on similar feature extraction methods. Therefore, there are potentially many other areas in image processing where this preprocessing may be successfully applied.
SZYMON SOBCZAK received the Master of Science degree in automatic control and robotics from the Poznań University of Technology, in 2016, after writing his thesis on the generalization abilities of deep neural networks. He currently works in industry as a Software Engineer and an Artificial Intelligence Researcher focusing mainly on video and audio processing and automatic event detection. His research interests include deep neural networks and unsupervised learning, particularly in terms of optimization for resources-limited systems. He is interested mainly in the application of machine learning in video processing and vision systems.
RAFAL KAPELA received the M.S. and Ph.D. degrees in control and robotics from the Poznań University of Technology, Poland, in 2000 and 2008, respectively. In August 2009, he had the opportunity to join a new team of quality assurance engineers at Mentor Graphics, Poznan. His role there included quality assurance of software intended for design for test purposes. He was working simultaneously at both PUT and MGC, until May 2011. At that point, he became an FP7 Intra-European Fellowship Marie Curie Fellow for the VISION Project which was conducted in CLARITY at Dublin City University, Ireland. Following this, he returned to Poland, where he continues his research on image processing algorithms and FPGA systems. His research interests include multimedia, video annotation and compression, signal processing systems, and artificial intelligence.