Exploring Convolutional Neural Networks for Branch Prediction

Recently, there have been significant advances in deep neural networks (DNNs), which have shown distinctive performance in speech recognition, natural language processing, and image recognition. In this paper, we explore DNNs to push the limit of branch prediction. We treat branch prediction as a classification problem and employ both deep convolutional neural networks (CNNs), ranging from LeNet to ResNet-50, and a deep belief network (DBN) for branch prediction. We compare the effectiveness of the DNNs with state-of-the-art branch predictors, including the perceptron, our prior work, and the Multi-poTAGE+SC and MTAGE+SC branch predictors; the last two are the most recent winners of the Championship Branch Prediction (CBP) contests. Several interesting observations emerged from our study. First, for branch prediction, the DNNs outperform the perceptron model by as much as 60-80%. Second, we analyze the impact of the depth of CNNs (i.e., the number of convolutional and pooling layers) on misprediction rates; the results confirm that deeper CNN structures lead to lower misprediction rates. Third, the DBN outperforms our prior work but not the state-of-the-art TAGE-like branch predictors, whereas ResNet-50 outperforms not only our prior work but also Multi-poTAGE+SC and MTAGE+SC.


I. INTRODUCTION
The ever-increasing computational power is a key factor behind recent advances in deep learning. In microprocessor design, pipelining is a critical technique for high performance, as it facilitates high clock rates and instruction-level parallelism (ILP). As pipelines become deeper, the control hazard caused by conditional branches (e.g., those arising from an if-then-else structure) incurs higher performance overhead. The reason is that the outcome of a conditional branch, i.e., 'Taken' or 'Not Taken', is computed late in the pipeline, in the execution stage, after the branch is fetched, decoded, and dispatched. Without intervention, the pipeline would have to wait for the branch outcome before the next instruction could be fetched. To avoid this stall, branch prediction, which guesses the branch outcome before the branch is executed, is commonly used to continue fetching and executing instructions along the predicted path.
The associate editor coordinating the review of this manuscript and approving it for publication was Chunbo Xiu.
The accuracy of predicting branch outcomes directly affects the performance of deeply pipelined processors because of the severe performance penalty of mispredictions.
Due to its importance, branch prediction has long been a focus of computer architecture research [1]. The accuracy of traditional predictors, such as the 2-bit counter, Yeh/Patt [2], Agree [3], Filter [4], and YAGS [5] predictors, reaches up to 95%, and traditional hybrid predictors have hit a bottleneck at this level of accuracy. Even a small improvement in branch prediction accuracy can yield a large improvement in both the performance and the energy efficiency of contemporary deeply pipelined processors [6]. In fact, even on modern architectures, an average of 30% of the total instructions executed in an integer program are wasted due to branch mispredictions [1].
There has been a great deal of research and practice on branch prediction in academia and industry, and new branch prediction algorithms [7]-[12] keep emerging. The Championship Branch Prediction (CBP) contest is organized with the International Symposium on Computer Architecture (ISCA) or the International Symposium on Microarchitecture (MICRO) every few years. In the past five championships, the highest-performing branch predictors fell into two primary categories: one based on the TAGE (TAgged GEometric history length) branch predictor, the other based on neural networks. TAGE, commonly considered the predictor with the lowest misprediction rate, is derived from Seznec's GEHL predictor [13] and Michaud's tagged PPM-like predictor [14]. TAGE-SC-L [15] adds a statistical corrector predictor and a loop predictor. The Multi-poTAGE+SC predictor [16], the champion of the unlimited-resource track of CBP-4 in 2014, combines multiple TAGE predictors and selects the final prediction among them via a combined output lookup table (COLT) [17] predictor. This colossal multiple-TAGE predictor and its further fine-tuned version, MTAGE+SC [18], the champion of the unlimited-resource track of CBP-5 in 2016, are not meant for practical use; they mainly push the lower bound of the misprediction rate of branch predictors, similar to the purpose of this work.
The other type of state-of-the-art branch predictor is the perceptron predictor [19]-[21]. It uses a single-layer perceptron, one of the simplest neural networks, to learn the correlation between the branch history and branch outcomes, confirming that neural networks can be useful in branch prediction. However, only a few works [22] have explored more advanced machine learning methods for the branch prediction problem, much less deep neural networks. Recently, there have been significant advances in deep learning [23], which has shown superior performance in image recognition [24], [25], speech recognition [26], and natural language processing [27]. It is therefore worthwhile to examine whether more advanced deep neural networks can open new possibilities for branch prediction.
This paper explores deep neural networks for branch prediction by treating it as a classification problem. We explore both a deep belief network (DBN) and convolutional neural networks (CNNs) for branch prediction. Note that, in this work, we focus on pushing the limit of branch prediction and do not consider the complexity of the actual predictor designs. This is consistent with the CBP competition rules, which state that ''CBP will make no attempt to assess algorithm-level exploration for predictors with the unlimited storage budget''. We make several interesting observations from our study. First, we confirm that deep learning outperforms the perceptron predictor. Second, between the DBN and CNNs, we find that CNNs are the better choice for branch prediction. Third, we analyze the impact of the depth of CNNs on misprediction rates; our experimental results show that deeper CNN structures achieve lower misprediction rates. Fourth, we find that deep CNNs can achieve lower misprediction rates than the state-of-the-art branch predictors Multi-poTAGE+SC and MTAGE+SC under unlimited storage budgets.
The rest of the paper is organized as follows: Section II reviews perceptron branch predictors and describes the DBN and CNN predictor models. Section III describes our experimental methodology and benchmarks. Section IV presents the comparison results and analysis of the DNNs and other related predictors. Section V concludes our work, and Section VI discusses future work.

II. PERCEPTRON PREDICTOR AND DEEP LEARNING
The perceptron is widely used to build branch predictors and achieves low misprediction rates. A perceptron uses a layer of hand-coded features and tries to recognize objects by learning how to weight these features. Unfortunately, such a naïve single-layer perceptron is only capable of learning linearly separable branches. To overcome this shortcoming, several variations have been proposed. The piecewise perceptron [28] adds one more dimension to the perceptron table: a global history address corresponding to the instruction address of each bit in the global history. However, both the perceptron and piecewise perceptron predictors assign each weight to only a single history bit or history address, which means the complexity of the output computation grows linearly with the number of bits in the global history. Tarjan and Skadron [20] showed that this side effect can be mitigated by a hashed perceptron, in which multiple history bits are hashed to a single weight.

A. PERCEPTRON-BASED PREDICTOR
A perceptron-based predictor uses a single-layer perceptron, one of the simplest neural networks, to learn the correlation between the branch history and branch outcomes. The predictor builds a perceptron table indexed by the branch address (PC) [21], as shown in Fig. 1. Each entry in the table consists of a set of weights. When making a prediction, the predictor first computes the output as the dot product of the input (i.e., history bits merged from the local and global histories (LHR/GHR [21])) and the indexed weights. The sign of the output then provides the final prediction. After the branch is resolved, if it was mispredicted or the magnitude of the output is smaller than a pre-defined threshold, the selected weights are trained: each weight is adjusted by adding the product of the corresponding input bit and the branch outcome. This training policy effectively strengthens the weights corresponding to inputs with strong correlation to the outcome.
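The prediction and training rules described above can be sketched as follows. The table size, the 8-bit history length, and the threshold value here are hypothetical choices for illustration only, not parameters taken from any specific published predictor; history bits use the usual bipolar (+1/-1) encoding.

```python
import numpy as np

class PerceptronPredictor:
    """Minimal sketch of a perceptron branch predictor (illustrative only)."""

    def __init__(self, table_size=64, history_len=8, threshold=16):
        # One weight vector (plus a bias weight) per table entry, indexed by PC.
        self.weights = np.zeros((table_size, history_len + 1), dtype=np.int32)
        self.threshold = threshold
        self.table_size = table_size

    def _inputs(self, history):
        # Bias input of 1, then history bits encoded as +1 (taken) / -1 (not taken).
        return np.concatenate(([1], np.where(history, 1, -1)))

    def predict(self, pc, history):
        w = self.weights[pc % self.table_size]
        y = int(np.dot(w, self._inputs(history)))
        return y >= 0, y  # sign of the dot product gives the prediction

    def train(self, pc, history, taken):
        pred, y = self.predict(pc, history)
        # Train on a misprediction, or when confidence is below the threshold.
        if pred != taken or abs(y) <= self.threshold:
            t = 1 if taken else -1
            self.weights[pc % self.table_size] += t * self._inputs(history)

# Demo: a branch whose outcome equals the most recent history bit is
# linearly separable, so the perceptron learns it quickly.
p = PerceptronPredictor()
hist = np.zeros(8, dtype=bool)
for i in range(40):
    taken = (i % 2 == 0)   # alternating outcomes, mirrored in hist[0]
    hist[0] = taken
    p.train(0x40, hist, taken)
```

Note how the update rule only strengthens weights whose input bit agrees with the outcome, which is exactly the correlation-learning behavior described above.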
A perceptron-based predictor is essentially a perceptron binary-classification model for each branch PC address. If the PC address is merged into the branch history bits, as shown in Fig. 2, the predictor becomes a classic input-data-to-output binary-classification model, where the black box in the middle may use any machine learning classification method.

B. DNN AND DNN-BASED PREDICTOR MODELS
With recent advances in deep learning achieving impressive misclassification rates for image- and audio-based processing [29], we aim to further push the lower bound of misprediction rates by applying deep learning algorithms to branch prediction. Deep learning [23] is a set of algorithms to train and utilize multi-layer neural networks. The deep hierarchical architecture tries to extract and represent the high-order features of the training data. The deep belief network (DBN) and convolutional neural network (CNN) are the focus of this work.

1) DBN AND DBN-BASED PREDICTOR MODEL
A DBN [30] is a class of deep neural networks composed of multiple layers of hidden units, with connections between the layers but not between units within each layer. A DBN is built from several stacked restricted Boltzmann machines (RBMs). It first uses unlabeled data to pre-train the network layer by layer, applying contrastive divergence learning to every RBM; this step [31] is a form of unsupervised feature learning. After pre-training, global training algorithms such as back-propagation are used to fine-tune the weights in the network.
For branch prediction, as shown in Fig. 3, the input and output are the same as those of the perceptron-based predictor, and the classification black box is replaced with 4 fully-connected layers of hidden units.

2) CNN AND CNN-BASED PREDICTOR MODEL
Locally connected networks, such as convolutional neural networks (CNNs) [32], [33], are likely to be a better choice for branch prediction [22], because most branches show high correlation with nearby history. CNNs [29], [34]-[38] are deep feedforward neural networks. The vital components of a CNN architecture are convolutional layers and pooling layers, as shown in Fig. 4; they exploit the high correlation in local groups of data. The convolutional layer [39] detects local conjunctions of features from the previous layer, and the pooling layer merges similar local features. A CNN typically stacks multiple stages of convolutional and pooling layers one after another, followed by a fully-connected (fc) layer. As with regular deep network layers, back-propagation through the CNN updates the weights during the training phase. As shown in Fig. 4, for branch prediction the input and output are again the same as those of the perceptron-based predictor, while the classification black box is replaced with convolutional layers, pooling layers, and several fully-connected layers of hidden units.
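To make the convolution-and-pooling mechanics concrete on a bit-history vector, here is a minimal sketch. The single hand-picked 1 × 3 kernel and the toy 8-bit history are purely illustrative assumptions; a real CNN predictor learns many such filters, but the sliding-window and max-pooling operations are the same.

```python
import numpy as np

def conv1d(history, kernel):
    # Slide the kernel across the history; each output value measures how
    # strongly a local window of outcomes matches the kernel's pattern.
    n, k = len(history), len(kernel)
    return np.array([np.dot(history[i:i + k], kernel) for i in range(n - k + 1)])

def max_pool(x, width=2):
    # Keep only the strongest response in each window, merging similar
    # nearby features as described above.
    return np.array([x[i:i + width].max() for i in range(0, len(x) - width + 1, width)])

history = np.array([0, 1, 1, 0, 1, 1, 0, 0])   # recent branch outcomes (toy example)
kernel = np.array([1, 1, -1])                  # fires on "taken, taken, not-taken"
feature_map = max_pool(conv1d(history, kernel))
```

The pooled feature map reports where in the history the local pattern occurred, which is exactly the kind of position-tolerant local correlation that fully-connected DBN layers cannot exploit as directly.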

III. EXPERIMENTAL METHODOLOGY
A. THE DATASET
To evaluate the various algorithms for branch prediction, we leverage the simulation framework provided with the 4th Championship Branch Prediction (CBP-4), which was held with the 41st ACM/IEEE International Symposium on Computer Architecture (ISCA-41). The framework is based on trace-driven simulation and features 20 benchmarks, grouped into four categories: I (integer), F (floating point), M (multimedia), and S (server). In this work, we focus on the conditional branches from each trace, as listed in Table 1. The number of branches shown in Table 1 is the total number of dynamic branch executions: if the same static branch executes N times, it is counted N times in the branch count of the trace. As shown in Fig. 5, nearly 20% of the branches execute only once, and nearly 50% execute fewer than 10 times. Compared with earlier datasets of highly repetitive branches [40], this dataset is more difficult to predict. The more recent CBP-5 has 233 traces, which would have resulted in impractical training time for the DNNs; therefore, we focus on CBP-4 in this work.

In this paper, the problem of branch prediction is treated as a binary classification problem. We use an off-line training method with a training set of 90% randomly sampled branches, a validation set of 5%, and a testing set of the remaining 5%. The training set is used to train the networks. The validation set is used to estimate how well a network has been trained during the training process: if the classification error rate on the validation set does not decrease over the latest several iterations, the training process stops. The test set is then used to evaluate the final classification error rate after the network has been trained. Our DNNs are constructed, trained, and tested using the Caffe [41] framework.
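The 90/5/5 random split described above can be sketched as follows. The `records` list of (features, outcome) pairs and the fixed seed are hypothetical stand-ins for the real CBP-4 trace data.

```python
import random

def split_dataset(records, seed=0):
    # Shuffle a copy, then carve off 90% / 5% / 5% as train / validation / test.
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(n * 0.90), int(n * 0.05)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Dummy (features, taken) pairs standing in for dynamic branch records.
records = [(i, i % 2) for i in range(1000)]
train, val, test = split_dataset(records)
```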

B. THE ARCHITECTURE
We have previously studied the influence of the history length on the misprediction rate of DBN models [22] and found that a DBN with 944 history bits achieves a lower misprediction rate. Therefore, 944 history bits are used as the input data to our DNNs. Each input sample has a size of 944 bits, comprising a 32-bit PC, a 16-bit LHR, a 512-bit GHR, and 48 8-bit GAs [19]. Every bit feeds one neuron of the input layer of a DNN, as shown in Fig. 6, so the input to all our models is a fixed-size 944-dimensional vector. We have tested the DBN and CNNs with various configurations and have observed consistent results. To provide concrete instances for discussion, we describe one DBN and four CNN models for branch prediction as follows.
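Assembling the fixed 944-bit input vector described above (32 + 16 + 512 + 48 × 8 = 944) can be sketched as follows; the helper names and the sample PC/LHR values are hypothetical, chosen only to show the layout.

```python
def to_bits(value, width):
    # Unpack an integer into `width` bits, most significant bit first.
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]

def build_input(pc, lhr, ghr_bits, gas):
    # 32-bit PC + 16-bit LHR + 512-bit GHR + 48 global addresses (8 bits each).
    assert len(ghr_bits) == 512 and len(gas) == 48
    vec = to_bits(pc, 32) + to_bits(lhr, 16) + list(ghr_bits)
    for ga in gas:
        vec += to_bits(ga, 8)
    return vec

x = build_input(pc=0x00400ABC, lhr=0b1011, ghr_bits=[0] * 512, gas=[0] * 48)
```

Each element of `x` feeds one input neuron, matching the fixed-size 944-dimensional vector used by all the models below.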
Perceptron: Based on the perceptron branch model shown in Fig. 2, the input data contains the 944 history bits, each bit feeding one neuron of the input layer of the perceptron model, and the output layer is a single neuron for binary classification.
DBN: Based on a popular DBN structure [22], the numbers of neurons in layer 1 and layer 2 are selected from a thorough search of the large design space of their structures, as shown in Fig. 7. Layer 3 has the same size as layer 1, and layer 4 has the same size as the input layer. The last layer is the output layer, which is constructed as a simple single-layer neural network. The notation 'fc' means fully-connected.
LeNet: The first CNN branch prediction model is inspired by LeNet [32] and contains two convolutional layers and two fully-connected layers. The convolutional layers have 1 × 8 filters. The network ends with a fully-connected layer with logistic regression. The size of the first fully-connected layer (i.e., 500) is also selected from a thorough search of the large design space of possible structures.

AlexNet: The second is based on AlexNet [34]. We delete one fully-connected layer and adapt the filters to branch data. The network contains five convolutional layers and two fully-connected layers. The first convolutional layer filters the 944-bit input history with 96 kernels of size 1 × 11 with a stride of 4 bits. The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 1 × 5. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 1 × 3 connected to the (normalized, pooled) outputs of the second convolutional layer; the fourth convolutional layer has 384 kernels of size 1 × 3, and the fifth has 256 kernels of size 1 × 3. The fully-connected layers are the same as in the LeNet branch prediction model, which differs from the original AlexNet [34].
VGG: The third is based on VGG [35], as shown in Fig. 10. The network contains sixteen convolutional layers and two fully-connected layers. The convolutional layers mostly have 1 × 3 filters. The number of kernels starts at 64 in the first convolutional layer and doubles after each max-pooling layer until it reaches 512. Not every convolutional layer is followed by max-pooling: in the earlier 64-kernel and 128-kernel stages, every two convolutional layers are followed by one pooling layer, while in the 256-kernel and 512-kernel stages, every four convolutional layers are followed by one pooling layer. The fully-connected layers are again the same as in the first CNN model (LeNet).
ResNet-50: The fourth is based on ResNet [37], as shown in Fig. 11. It contains 50 parameter layers: 49 convolutional layers and one fully-connected layer. The first convolutional layer has 64 kernels of size 1 × 7 and is followed by a 1 × 3 max-pooling layer. The remaining convolutional layers form a series of 3-layer bottleneck blocks, whose three convolutions are 1 × 1, 1 × 3, and 1 × 1. The first kind of bottleneck, stacked 3 times, contains 64, 64, and 256 kernels; the second, stacked 4 times, contains 128, 128, and 512 kernels; the third, stacked 6 times, contains 256, 256, and 1024 kernels; and the last, stacked 3 times, contains 512, 512, and 2048 kernels and is followed by a 1 × 3 max-pooling layer. A single fully-connected layer ends the network.
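As a quick arithmetic check of the 50-parameter-layer count described above (one initial convolution, four stages of 3-layer bottleneck blocks, and one final fully-connected layer):

```python
# Bottleneck blocks per stage, as described above: 3, 4, 6, and 3 repeats.
stage_repeats = [3, 4, 6, 3]

# Each bottleneck block contributes 3 convolutional layers;
# add the single initial convolutional layer.
conv_layers = 1 + sum(3 * r for r in stage_repeats)

# Plus the single fully-connected layer at the end of the network.
param_layers = conv_layers + 1
```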

IV. RESULT AND DISCUSSION
The misprediction rate on the testing set is our primary concern; all results are reported on the testing sets. We report the misprediction rate (= #mispredictions / #predictions × 100) of all the prediction algorithms.
A. COMPARING PERCEPTRON AND DNNs
Fig. 12 shows the misprediction rates of the perceptron, DBN, and CNN branch prediction models described in the last section. The misprediction rate of the perceptron predictor is labeled 'Pcptrn', that of the DBN is labeled 'DBN', and those of the CNN models are labeled 'LeNet', 'AlexNet', 'VGG', and 'ResNet-50', respectively.
From Fig. 12, we can see that the DNNs outperform the perceptron consistently. Compared to the perceptron, the average reductions in the misclassification rate of the DBN and ResNet-50 are 60.7% (= (6.266 - 2.461)/6.266) and 84.0%, respectively. The misprediction rate of the perceptron is already much lower on the benchmarks 'F3', 'F4', 'F5', and 'I5', so we can loosely say that these benchmarks are linearly separable; all DNN branch models have similarly low misprediction rates on them, and the improvement there is very small. On the contrary, on benchmarks whose branch outcomes are not well linearly separable, such as 'I2'-'I4', 'M1'-'M2', and the 'S' traces, the DNNs show their potential to greatly reduce the misprediction rate. This confirms that the DNNs can reduce the misprediction rates of non-linearly separable branches: the multiple layers of the DNNs implement complex nonlinear functions more concisely and thus work well on the nonlinearly separable branches.
From Fig. 12, we can also see that LeNet and the DBN have similar misprediction rates, while the other CNNs have much lower misprediction rates than either. LeNet contains only two convolutional layers, whereas AlexNet contains 5, VGG contains 16, and ResNet-50 contains 49. This shows that, for branch prediction, deep CNNs are more effective than the DBN and that a larger number of convolutional and pooling layers leads to lower misprediction rates. Fig. 12 also shows that, compared to the LeNet model with its two convolutional layers, the misprediction rates of AlexNet, VGG, and ResNet-50 are lower on every benchmark; thus, compared with the traditional convolutional neural network LeNet, deeper convolutional neural networks achieve lower misprediction rates. On 14 benchmarks, the misprediction rate reduces consistently as the convolutional depth increases from the 2 layers of LeNet to the 49 layers of ResNet-50; the exceptions are the linearly separable benchmarks 'F3'-'F5', on which the misprediction rates of the deeper CNN models (AlexNet, VGG, and ResNet-50) are already close to 0. There is no benchmark whose misprediction rate increases from LeNet to ResNet-50. This clearly demonstrates the importance of convolutional depth in a CNN branch prediction model.
As a result, we can see that the CNN approaches have the potential to further reduce the lower bound of branch misprediction rates. The deep convolutional layers help to extract local features that share the same history bits but lead to different outcomes in some branches, much like the separation of homonyms. Since the misprediction rate of ResNet-50 is the lowest among the four CNN branch prediction models, we focus on ResNet-50 alone in the evaluation and discussion in the following sections.

Fig. 13 presents the misprediction rates of the DBN, ResNet-50, and the state-of-the-art models. Our prior work, AIP [21] (Adaptive Information Processing), the winner of CBP-1, is labeled 'priorWork'; it is a state-of-the-art perceptron-based predictor. The Multi-poTAGE+SC predictor [16], labeled 'Multi-poTAGE+SC', is the winner of CBP-4, and MTAGE+SC [18], labeled 'MTAGE+SC' in Fig. 13, is the winner of CBP-5. These two TAGE predictors represent the state of the art in pushing the upper bound of branch prediction accuracy.

B. COMPARING DNNs WITH THE STATE-OF-THE-ART BRANCH PREDICTORS
From Fig. 13, we can see that the misprediction rate of the DBN is lower than that of AIP on 14 benchmarks, and the average misprediction rate of the prior work is higher than that of the DBN. However, we can hardly find a benchmark on which the misprediction rate of the DBN is clearly lower than those of the TAGE predictors; conversely, the average misprediction rate of the DBN is clearly higher than those of the TAGEs. In other words, the DBN outperforms our prior work but not the state-of-the-art TAGE-like branch predictors.
From Fig. 13, we can also see that the misprediction rate of ResNet-50 is lower than those of the prior work and the DBN on every benchmark, and on 18 benchmarks the ResNet-50 model has lower misprediction rates than Multi-poTAGE+SC and MTAGE+SC. The average misprediction rate of the ResNet-50 model is 29% and 28.6% lower than those of Multi-poTAGE+SC and MTAGE+SC, respectively. Thus, ResNet-50 outperforms not only our prior work but also Multi-poTAGE+SC and MTAGE+SC.

C. COMPARING THE RANDOM-SELECTED BRANCHES WITH THE BRANCHES IN THE DYNAMIC ORDER
In this experiment, we also select branches in their dynamic order: the training set is the first 90% of branches, the validation set is the next 5%, and the testing set is the last 5%. In this way, the causality of the data is maintained, meaning that future data are not used to train the network to produce a current prediction. Fig. 14 shows the average misprediction rate on the 20 CBP-4 traces. The difference between the two approaches, i.e., random selection and selection in dynamic order, is very small, confirming that the conclusions drawn on randomly selected branch traces are also valid for those based on the dynamic order.
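The dynamic-order split can be sketched as a counterpart to the random split above; the integer record list is again a hypothetical stand-in for real dynamic branch records.

```python
def split_in_order(records):
    # First 90% train, next 5% validation, last 5% test, preserving dynamic
    # order so no future branch is used to predict an earlier one.
    n = len(records)
    n_train, n_val = int(n * 0.90), int(n * 0.05)
    return (records[:n_train],
            records[n_train:n_train + n_val],
            records[n_train + n_val:])

train, val, test = split_in_order(list(range(1000)))
```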

V. CONCLUSION
This paper takes a binary classification perspective on the branch prediction problem. We utilize deep neural networks as classifiers and explore both a DBN and CNNs to push the lower bound of branch misprediction rates. We made the following observations from our experiments: (1) deep neural networks significantly outperform simple perceptron classifiers; (2) deep CNNs outperform the DBN; (3) the depth of the convolutional layers is also important for branch prediction; (4) the DBN outperforms our prior work but not the state-of-the-art TAGE-like branch predictors; (5) ResNet-50 outperforms the state-of-the-art TAGE-like branch predictors; and (6) the conclusions drawn on randomly selected branch traces are also valid for those based on the dynamic order.

VI. FUTURE WORK
This paper treats branch prediction as a pure binary classification problem. To simplify the problem, we only implemented off-line training; to apply deep learning to branch prediction in practice, an online training algorithm needs to be employed. In addition, since most of the state-of-the-art branch predictors integrate several standalone predictors, it is also worthwhile to explore the influence of incorporating such complementary predictors into deep CNNs.