Fixed-Sign Binary Neural Network: An Efficient Design of Neural Network for Internet-of-Things Devices

High computational requirements and heavy memory costs are the major issues that limit the deployability of Convolutional Neural Networks in the resource-constrained environments typical of Internet-of-Things (IoT) edge devices. To address this problem, binary and ternary networks have been proposed to constrain the weights and thereby reduce computational and memory costs. However, owing to the binary or ternary values, backward propagation is less efficient than in full-precision training, which makes such models hard to train on edge devices. In this paper, we take a different route and propose the Fixed-Sign Binary Neural Network (FSB), which decomposes each convolution kernel into a sign and a scaling factor as in prior research, but trains only the scaling factors instead of both. By doing so, our FSB keeps the sign out of backward propagation and makes models easy to deploy and train on IoT devices. Meanwhile, the convolution-acceleration architecture that we design for the FSB reduces the computing burden while achieving the same performance. Thanks to the efficiency of our FSB, even though we randomly initialize the sign and fix it to be untrainable, the FSB still achieves remarkable performance.


I. INTRODUCTION
Although Convolutional Neural Networks (CNNs) have been developing rapidly and have achieved remarkable improvements in most Artificial Intelligence (AI) domains, most CNNs cannot be directly deployed on resource-constrained devices. Early CNNs [1]-[4] achieve promising results over many datasets and tasks but still have large model sizes and a huge computation burden. Lightweight design and network compression will therefore be the future development trends of CNNs for the Internet of Things (IoT).
Numerous research efforts on network compression aim to remove redundant weights and unnecessary parameters, including network pruning [5], [6], knowledge distillation [7]-[9], weight quantization [10]-[13], and combinations of these techniques.
Among these efforts, weight quantization compresses a network by reducing the number of bits that represent each weight. Typically, weight binarization [14]-[18] approximates the parameters with binary values, which yields almost 32× storage savings. Meanwhile, assisted by bitwise operations, weight binarization can make convolutional operations about 58× faster [17]. Similarly, we can quantize the weights with ternary values [12], [13], [19]-[21]. These algorithms achieve over 10× model compression and accelerate convolutional operations as well; moreover, some of them perform only slightly worse than their full-precision counterparts.
The core idea of these research efforts is to constrain the filters to a binary or ternary space. To do so, they utilize the sign function or some other step function to binarize or ternarize the weights, so each filter can be decomposed into a sign and a scaling factor. Owing to the promising tradeoff between performance and efficiency, CNNs with binary weights can be deployed on edge devices. However, training the sign is less efficient than usual because of the discrete space of the binary values.
Meanwhile, many frameworks have been proposed for CNN inference on edge devices, but most of them are software based, which means the computation cannot be accelerated at the hardware layer. To address these problems, we propose the Fixed-Sign Binary Neural Network (FSB), which constrains all convolution kernels to the binary space and keeps the signs fixed at all times. We can then deploy the signs of the FSB in hardware and update the FSB by adjusting the scaling factors, which can be regarded as the signal strength of each corresponding neuron. By doing so, we transfer most of the convolution computation from software to hardware and reduce the number of parameters updated during training, which results in greater energy savings and computational efficiency.
The main contributions of our article are the following:
• We concisely introduce the FSB, an efficient method that trains only one parameter per convolution kernel during the forward and backward propagations, and we make the code for the FSB available.
• Different from general convolution kernels, we design a brand-new convolution calculation algorithm for the FSB to speed up the computation.
We arrange this paper as follows: first, we overview recent works related to ours; second, we describe the details of our algorithm; third, we illustrate the techniques we apply for computation acceleration; fourth, we present experimental results over widely used datasets to show the effectiveness of our algorithm; finally, we conclude with a summary and future work.

II. RELATED WORKS
In this section, we overview recent studies on binary neural networks. Since discrete binarized weights inevitably cause information loss and gradient approximation error, these studies can be categorized by how they address these issues: directly binarizing the networks, augmenting the loss function with a quantization loss, and approximating the gradient of the sign function.
BinaryConnect [14] pioneered the study of binary neural networks; it converts the full-precision weights into binary values and utilizes a clip function for backward propagation. Meanwhile, the Binarized Neural Network (BNN) [15] binarizes the input signals in addition to the weights and applies Hardtanh as the activation function instead of ReLU.
Some works focus on minimizing the quantization error. Binary Weight Networks (BWN) and XNOR-Net [17] took the lead in considering the quantization error; both approximate the full-precision weights by combining {−1, +1} values with a scaling factor. Many works extend the framework of XNOR-Net, such as HORQ [16], ABC-Net [24], and XNOR-Net++ [25]. Works represented by Two-Step Quantization (TSQ) [26] and Learned Quantization (LQ-Nets) [27] pay more attention to minimizing the quantization error efficiently, while Incremental Network Quantization (INQ) [28] adds a regularization term to the loss function.
For optimization, the backward propagation algorithm remains the key to training. However, the derivative of the binarization function either does not exist or is equal to zero almost everywhere. To tackle this problem, many works employ the straight-through estimator (STE) [29] to estimate the gradients. Moreover, both Circulant Binary Convolutional Networks [30] and Bi-Real Net [31] utilize continuous functions to replace the sign for back-propagation, and Binary Neural Networks+ (BNN+) [32] defines a new approximation to the gradient of the sign.
These state-of-the-art results verify that it is reasonable to employ a sign and a scaling factor to approximate a full-precision weight. Next, we introduce our algorithm in detail.

III. PROPOSED METHOD
Since all the aforementioned algorithms constrain the weights of the convolutions to a sparse space with binary values, the gradients of the parameters are not as accurate as those of full-precision weights; this inevitably requires more training iterations to reach the same result.
Meanwhile, since much research focuses on how to fine-tune weights with only two or three possible values to achieve a good tradeoff between accuracy and complexity, we are motivated to raise the following questions: what role do the sign values of the convolutions play in a CNN? And if we fix the sign values and train only the scaling factors, can we obtain a CNN with satisfactory validation accuracy?
To answer these questions, we randomly initialize the signs of all the convolutions and keep them fixed without gradients, which means we train only the scaling factors of all the convolution kernels. As only the scaling factors enter the forward and backward propagations, we name our algorithm the Fixed-Sign Binary Neural Network (FSB); it shrinks the number of trainable parameters and brings efficiency to model update and deployment. Next, we detail our algorithm under different settings.

A. DEFINITION
A normal convolutional layer can be represented as a triplet $\langle X, W, Y \rangle$. Here $X \in \mathbb{R}^{C_{in} \times h_{in} \times w_{in}}$ and $Y \in \mathbb{R}^{C_{out} \times h_{out} \times w_{out}}$ are the input and output tensors, respectively, $W \in \mathbb{R}^{C_{out} \times C_{in} \times s \times s}$ is the weight tensor, and $X_j$ refers to the $j$-th input channel of $X$. The $i$-th output channel is

$$Y_i = \sum_{j=1}^{C_{in}} W_{ij} * X_j + b_i, \qquad (1)$$

where $*$ is the convolution operation and $b_i$ is the bias. In the following subsections, we present our techniques for CNN compression and computational efficiency.
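For instance, a layer with $C_{in} = 64$, $C_{out} = 128$, $s = 3$, stride 1, and padding 1 maps $X \in \mathbb{R}^{64 \times 32 \times 32}$ to $Y \in \mathbb{R}^{128 \times 32 \times 32}$, and each output channel $Y_i$ in (1) sums 64 single-channel convolutions plus a bias.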

B. SIGN AND SCALING FACTOR REPRESENTATIONS OF THE KERNELS
In this subsection, we introduce the way in which we quantize the weights of the convolution kernels. In general, a convolutional layer consists of $C_{out} \times C_{in}$ distinct kernels without any constraints. Denote by $W_{ij} \in \mathbb{R}^{s \times s}$ the weight of a kernel. Referring to the related work [15], let

$$B_{ij} = \mathrm{sign}(W_{ij}) \in \{-1, +1\}^{s \times s}, \qquad (2)$$

and assign each $W_{ij}$ a scaling factor $\omega_{ij}$. In our FSB, we utilize

$$\widetilde{W}_{ij} = \omega_{ij} B_{ij} \qquad (3)$$

in place of $W_{ij}$, so that the convolution (1) becomes

$$Y_i = \sum_{j=1}^{C_{in}} \omega_{ij} \left( B_{ij} * X_j \right) + b_i. \qquad (4)$$

Generally, $\widetilde{W}_{ij}$ needs two trainable parameters: $B_{ij}$ and $\omega_{ij}$. However, in our algorithm, we keep $B_{ij}$ fixed and untrainable at all times. In other words, for every convolution kernel in our FSB, we train only $\omega_{ij}$ rather than the $s^2$ elements of the kernel. Therefore, we avoid taking the derivative of the sign function, which is either zero or does not exist.
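To make the decomposition concrete, the following is a minimal PyTorch-style sketch of such a layer (the class name FSBConv2d and its details are illustrative; our released code may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FSBConv2d(nn.Module):
    # Sketch of a fixed-sign binary convolution: B_ij is a frozen buffer
    # of random signs, and only the scaling factors omega_ij are trained.
    def __init__(self, c_in, c_out, s=3, stride=1, padding=1):
        super().__init__()
        sign = torch.randint(0, 2, (c_out, c_in, s, s)).float() * 2 - 1
        self.register_buffer("sign", sign)          # fixed, no gradient
        self.scale = nn.Parameter(torch.full((c_out, c_in, 1, 1), 0.01))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # Effective weight W~_ij = omega_ij * B_ij, as in (3); gradients
        # flow only into self.scale, so the sign never enters backprop.
        w = self.scale * self.sign
        return F.conv2d(x, w, stride=self.stride, padding=self.padding)

Because the signs are registered as a buffer rather than a parameter, an optimizer built over model.parameters() updates only the scales, matching the training behavior of the FSB.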
Furthermore, the number of parameters that need to be updated in the FSB is drastically reduced. As most of the convolution calculation involves the fixed signs, we can transfer most of it from the software layer to the hardware layer by deploying the signs of the FSB in hardware.

IV. ACCELERATING CONVOLUTION
Thanks to binary weights and bitwise operations, most binary neural networks can significantly speed up computation compared to their full-precision counterparts. For a given network topology, denote the amounts of calculation required by the full-precision and binarized networks as $C$ and $C_b$, respectively. The calculation ratio of the binarization algorithm is then $r = C_b / C$; for example, the ratio of XNOR-Net is nearly 1/58 [17].
For the FSB, benefiting from the fixed signs, we can speed up the convolution calculation by reducing the number of multiply-accumulate operations (MACs) with a brand-new technique different from the ones mentioned above. In short, we reuse known convolution results to calculate the convolutions of new kernels, which reduces the amount of calculation without relying on bitwise operations.

A. TECHNIQUE USED IN THE FSB
In this subsection, we analyze how the FSB accelerates convolutions by counting multiply-accumulate operations. Following the definitions in Subsection III-A, a general convolution requires $C_{in} \times C_{out} \times h_{out} \times w_{out} \times s \times s$ multiplication and addition operations to calculate $Y$.
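For example, a layer with $C_{in} = 64$, $C_{out} = 128$, $h_{out} = w_{out} = 32$, and $s = 3$ requires $64 \times 128 \times 32 \times 32 \times 9 \approx 7.55 \times 10^{7}$ multiplications and a similar number of additions.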
Consider convolution kernels of size larger than 1. Each input channel $X_i$ is convolved with $C_{out}$ different kernels of size $s \times s$, and for each kernel, every output position requires $s^2$ multiplications and $s^2 - 1$ additions. In the FSB, we can multiply the convolution result by the scaling factor afterwards, which reduces the number of multiplications from $C_{in} \times C_{out} \times h_{out} \times w_{out} \times s \times s$ down to $C_{in} \times C_{out} \times h_{out} \times w_{out}$. The theoretical acceleration factor of multiplication, $r_{mul}$, of each convolutional layer is therefore

$$r_{mul} = \frac{C_{in} \, C_{out} \, h_{out} \, w_{out} \, s^2}{C_{in} \, C_{out} \, h_{out} \, w_{out}} = s^2. \qquad (5)$$

For the $3 \times 3$ kernels with signs in our FSB, a kernel can be decomposed as Fig. 1 shows. It is worth noting that the rightmost kernel in Fig. 1 has only $s^2 = 9$ different kinds, and it takes only $C_{in} \times h_{out} \times w_{out} \times s \times s$ addition operations to obtain all of their convolution results. If we have already obtained the convolution result between $X_i$ and a kernel $\kappa$, then for a kernel $\kappa'$ with only one element different from $\kappa$, we can get its convolution result with $h_{out} \times w_{out}$ addition operations instead of $8 \times h_{out} \times w_{out}$.

Fig. 2 demonstrates the calculation process for each input channel. In stage 0, we pick a kernel $\kappa_0$ and compute its convolution result, $R_0$, with $8 \times h_{out} \times w_{out}$ addition operations. Define $\kappa_i$ as the set of convolution kernels with $i$ elements different from $\kappa_0$, and $R_i$ as the set of convolution results of the kernels in $\kappa_i$.
As mentioned above, for each kernel in $\kappa_1$, we calculate its convolution result from $R_0$ with only $h_{out} \times w_{out}$ addition operations. Recursively, in stage $j + 1$, we utilize the results in $R_j$ to calculate the convolution result of each kernel in $\kappa_{j+1}$ with $h_{out} \times w_{out}$ addition operations.
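A minimal numpy sketch of this reuse rule for a single input channel follows (function names are ours for illustration): flipping one $\pm 1$ entry of the kernel changes the result by twice the corresponding shifted window of the input.

import numpy as np

def conv2d_valid(x, k):
    # Plain valid cross-correlation of one input channel x with kernel k.
    s = k.shape[0]
    h_out, w_out = x.shape[0] - s + 1, x.shape[1] - s + 1
    out = np.zeros((h_out, w_out))
    for p in range(s):
        for q in range(s):
            out += k[p, q] * x[p:p + h_out, q:q + w_out]
    return out

def reuse_result(x, r0, k0, p0, q0):
    # Result for a kernel differing from k0 only at (p0, q0): one
    # h_out x w_out update instead of a full s*s-tap convolution.
    h_out, w_out = r0.shape
    return r0 - 2 * k0[p0, q0] * x[p0:p0 + h_out, q0:q0 + w_out]

# Sanity check of the update rule on random data.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k0 = rng.choice([-1.0, 1.0], size=(3, 3))
k1 = k0.copy(); k1[1, 2] = -k1[1, 2]   # a stage-1 kernel: one flipped element
r0 = conv2d_valid(x, k0)
assert np.allclose(reuse_result(x, r0, k0, 1, 2), conv2d_valid(x, k1))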
However, for a kernel set $\kappa$ of size $k$, the remaining $C_{out} - k$ kernels may differ in more than one element from every kernel in $\kappa$. In that case, it takes more than $h_{out} \times w_{out}$ addition operations to calculate each such convolution. When $C_{out}$ is small, this situation is more likely to happen.
Denote the number of additions required by the above process for one input channel as $N \times h_{out} \times w_{out}$. Then the total number of addition operations is

$$C_b^{add} = C_{in} \, h_{out} \, w_{out} \, s^2 + C_{in} \, N \, h_{out} \, w_{out} + C_{in} \, C_{out} \, h_{out} \, w_{out}. \qquad (6)$$

Here, $C_{in} \times h_{out} \times w_{out} \times s^2$ is the number of additions used to calculate the results of the kernels with only one element equal to 2, which are similar to the rightmost kernel in Fig. 1, and $C_{in} \times C_{out} \times h_{out} \times w_{out}$ is the number of additions used to sum the $C_{in}$ input channels and the bias for the $C_{out} \times h_{out} \times w_{out}$ output feature maps. Thus, the theoretical acceleration factor of addition, $r_{add}$, of each convolutional layer is

$$r_{add} = \frac{C_{in} \, C_{out} \, h_{out} \, w_{out} \, s^2}{C_b^{add}} = \frac{C_{out} \, s^2}{s^2 + N + C_{out}}. \qquad (7)$$

It is worth noting that sign-only convolution kernels have just $2^{s^2}$ different kinds, which means there is an upper bound on $N$. Thus, as $C_{out}$ increases, $r_{add}$ becomes larger. Furthermore, by combining the technique used in our FSB with bitwise operations, we can speed up the convolution calculation even further.
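For example, taking $s = 3$ and $N$ at its upper bound of 263 (see Subsection IV-B), (7) gives $r_{add} = 512 \cdot 9 / (9 + 263 + 512) \approx 5.9$ for $C_{out} = 512$ and $2048 \cdot 9 / (9 + 263 + 2048) \approx 7.9$ for $C_{out} = 2048$, approaching the bound $s^2 = 9$.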

B. SIMULATION
In this subsection, we analyze the relationship between $C_{out}$ and $N$, $r_{add}$ via Monte Carlo simulations. We consider convolution kernels of size $3 \times 3$, with $C_{out}$ ranging from 16 to 2048, and use 100 replications for each $C_{out}$. As Fig. 3 shows, with the exponential growth of $C_{out}$, $r_{add}$ increases approximately linearly and approaches its upper bound $s^2$. Meanwhile, although the $3 \times 3$ kernels with signs have $2^9 = 512$ different kinds, there are 256 pairs of kernels whose signs are exactly opposite. Thus, the upper bound of $N$ is $s^2 + 2^{s^2}/2 - 2 = 263$, which matches the simulation (red line in Fig. 3).
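A simplified sketch of one such replication is given below. It reflects one plausible reading of the staged procedure, charging each distinct up-to-sign kernel the number of single-element steps from the nearest already-computed kernel (a minimum spanning tree built with Prim's algorithm); the helper names simulate_N and class_distance are ours and this is not necessarily the exact simulation protocol.

import numpy as np

def class_distance(a, b):
    # Element-wise differences between two +/-1 kernels, up to a global
    # sign flip (opposite-sign kernels share results up to negation).
    d = np.count_nonzero(a != b)
    return min(d, a.size - d)

def simulate_N(c_out, s=3, rng=None):
    # One replication: draw c_out random sign kernels, then N = cost of
    # stage 0 (s*s - 1) plus the MST weight over the distinct kernels.
    rng = rng or np.random.default_rng()
    kernels = rng.choice([-1, 1], size=(c_out, s * s))
    reps = {tuple(k if k[0] > 0 else -k) for k in kernels}
    nodes = [np.array(r) for r in reps]
    best = {i: class_distance(nodes[0], nodes[i]) for i in range(1, len(nodes))}
    total = s * s - 1                      # additions for kappa_0
    while best:                            # Prim's algorithm
        j = min(best, key=best.get)
        total += best.pop(j)
        for i in best:
            best[i] = min(best[i], class_distance(nodes[j], nodes[i]))
    return total

Averaging simulate_N over 100 replications per $C_{out}$ and substituting the result into (7) yields curves of the shape shown in Fig. 3.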

V. EXPERIMENTS
We evaluate our FSB on the CIFAR-10 and ImageNet datasets. We compare the performance of our FSB with a full-precision network of the same topology, and we also report results against other state-of-the-art algorithms. Our code will be available at https://github.com/nkstatly/Fixed-Sign-Binary-Neural-Network.

A. EXPERIMENTS ON CIFAR-10 WITH FSB
To evaluate the role of the fixed sign, the number of output channels in each convolutional layer should be large enough to avoid extreme cases in the distribution of the convolution kernels. Thus, we utilize ResNet-18 [1] and adapt it for our FSB.
Different from the original ResNet-18 architecture, we use $3 \times 3$ convolution kernels with stride 1 in the first convolutional layer. For a detailed analysis, 100 replications are used to train both the FSB and the full-precision model with a mini-batch size of 256. We train these models for 500 epochs with an SGD optimizer and momentum 0.9. The initial learning rate is 0.1, and we reduce it by a factor of 10 after 120, 240, and 400 epochs.
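For reference, this schedule corresponds to the following PyTorch sketch, assuming model is the FSB-adapted ResNet-18 and train_loader yields CIFAR-10 mini-batches of size 256:

import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[120, 240, 400], gamma=0.1)

for epoch in range(500):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()   # divide the learning rate by 10 at the milestones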
The full-precision model contains 1,220,800 convolution kernels, which amount to over 11 million trainable parameters; in our FSB, we assign 16 bits instead of 32 bits to all the scale parameters, resulting in over 14× storage savings on ResNet-18. We report the best accuracy on the test set. Fig. 4 shows the distributions of the accuracies over the replications of the FSB and ResNet-18, respectively. From these results, we find that our FSB is only slightly worse than its full-precision counterpart and achieves nearly state-of-the-art results. It is worth noting that we randomly initialized all the sign parameters for each FSB run, which suggests that the scale plays a more vital role than the sign in these binarized CNNs.
Meanwhile, the distribution curve of our FSB is sharper than that of the full-precision model. Contrary to our expectation, randomly initializing the signs does not cause much fluctuation in the performance of the FSB; on the contrary, having fewer parameters makes our FSB more stable than the full-precision CNN of the same architecture. Moreover, we also display the performance of the FSB with signs taken from a trained CNN, denoted FSB*. As Fig. 4 shows, there is almost no gap between FSB* and ResNet-18, and the distributions of their accuracies are approximately identical. It can be inferred that the signs from well-trained CNNs can be used as prior knowledge for our FSB.

B. EXPERIMENTS ON ImageNet WITH FSB
As the resolution of images in CIFAR-10 is only 32 × 32, the images are relatively small. To be more convincing, we also evaluate our proposed method on the more challenging ImageNet (ILSVRC2012) [23] dataset, which contains over 1.2 million training images from 1,000 categories and 50 thousand validation images, all of them natural images with reasonably high resolution.
We train the FSB and other methods with the ResNet-18 architecture on the ImageNet dataset. We run our training algorithm for 200 epochs with a mini-batch size of 256. The learning rate starts at 0.1 and is scaled by 0.1 after 40, 70, and 100 epochs. We compare our FSB with various methods and report Top-1 and Top-5 accuracies, and we also analyze how much our FSB compresses the convolutions. For the compression ratio, we only count the trainable parameters of the kernels with size larger than 3, as they comprise most of the parameters of the CNNs in our experiments. Table 1 shows the classification accuracies (Top-1 and Top-5) on the validation set and the compression ratios of these methods. The accuracies of our FSB are the highest among these methods. An impressive fact is that we assign only one scale to each 7 × 7 convolution kernel in the first layer, resulting in just 192 trainable parameters there; even so, our method is far better than the other methods with binary weights (XNOR and BWN). Besides, although the FSB does not need to set any thresholds for ternary assignments, it still outperforms these ternary neural networks. Moreover, we also test the FSB with ResNet-34 (Top-1: 70.3% vs. 73.3%; Top-5: 89.3% vs. 91.4% compared with the full-precision model), and the accuracy gap does not widen as the depth of ResNet increases.

VI. CONCLUSION
In this paper, we propose a novel binarized CNN design, FSB, which has no need to train the signs of the convolution kernels. Unlike other binary or ternary weight networks, the proposed FSB gives up training the signs entirely and focuses on adjusting the scales of the kernels. With remarkable performance over the CIFAR-10 and ILSVRC2012 datasets, our FSB achieves over 18× compression of the trainable parameters and reduces the MACs by a ratio on the order of the kernel size. By encoding the scales and inputs during the forward/backward propagations, we can further accelerate the computations of the FSB with little loss of accuracy. Last but not least, both theoretical analysis and experimental results verify that the FSB is capable of approximating full-precision CNNs. Our future work will focus on figuring out how and why the sign and the scale exert different influences on the FSB, and on evaluating the performance on other tasks (e.g., object detection) and models (e.g., RNNs).