Recurrent Neural Networks: An Embedded Computing Perspective

Recurrent Neural Networks (RNNs) are a class of machine learning algorithms used for applications with time-series and sequential data. Recently, there has been strong interest in executing RNNs on embedded devices. However, difficulties have arisen because RNNs require high computational capability and a large memory space. In this paper, we review existing implementations of RNN models on embedded platforms and discuss the methods adopted to overcome the limitations of embedded systems. We define the objectives of mapping RNN algorithms on embedded platforms and the challenges facing their realization. Then, we explain the components of RNN models from an implementation perspective. We also discuss the optimizations applied to RNNs to run efficiently on embedded platforms. Finally, we compare the defined objectives with the implementations and highlight some open research questions and aspects currently not addressed for embedded RNNs. Overall, applying algorithmic optimizations to RNN models and decreasing the memory access overhead is vital to obtain high efficiency. To further increase the implementation efficiency, we point out the more promising optimizations that could be applied in future research. Additionally, this article observes that high performance has been targeted by many implementations, while flexibility has, as yet, been attempted less often. Thus, the article provides some guidelines for RNN hardware designers to support flexibility in a better manner.


I. INTRODUCTION
Recurrent Neural Networks (RNNs) are a class of Neural Networks (NNs) dealing with applications that have sequential data inputs or outputs. RNNs capture the temporal relationship between input/output sequences by introducing feedback to FeedForward (FF) neural networks. Thus, many applications with sequential data, such as speech recognition [1], language translation [2], and human activity recognition [3], can benefit from RNNs.
In contrast to cloud computing, edge computing can guarantee better response time and enhance security for the running application. Augmenting edge devices with RNNs grants them the intelligence to process and respond to sequential problems. Realization on embedded platforms in edge devices requires RNN applications to be optimized. Embedded platforms are time-constrained systems that suffer from limited memory and power resources. To run efficiently on embedded platforms, RNN applications need to overcome these restrictions.

A. SCOPE OF THE ARTICLE
In this article, we study RNN models and specifically focus on RNN optimizations and implementations on embedded platforms. The article compares recent implementations of RNN models on embedded systems found in the literature. For a research paper to be included in the comparison, it should satisfy the following conditions:
• The implementation includes the recurrent layer of an RNN model.
• The target platform is an embedded platform such as FPGA, ASIC, etc.

FIGURE 1: Structure of the survey article. RNN models should run on an embedded platform in an edge device. Section II discusses the objectives of such an implementation and the challenges facing it. Section III describes the RNN models in detail. There follows a discussion of how algorithmic optimizations (Section IV-A) may be applied to RNN models and how platform-specific optimizations (Section IV-B) are applied to embedded platforms. The resulting implementations are discussed in Section V and compared to the objectives in Section VI.

To provide a complete study, the survey also addresses the methods used for optimizing the RNN models and realizing them on embedded systems. This survey distinguishes itself from related works because no existing article covers the components of the RNN models together with their optimizations and implementations in a single analysis, as may be seen in Table 1. The other surveys focus on only one or two of the aspects covered in this article. Some articles study RNNs from an algorithmic point of view [4], [5], while another group of survey articles looks at hardware implementations. For example, one survey on the efficient processing of neural networks [6] studied CNNs, CNN optimizations, and CNN implementations, while another CNN survey [7] studied CNN mappings on FPGAs. Some articles specialized in algorithmic optimizations such as compression [8]. All algorithmic optimizations for both CNNs and RNNs were surveyed in one article that also discussed their implementations [9]. However, the main scope of that article was optimizations, and so RNN models and their components were not studied. Furthermore, the RNN implementations included were limited to speech recognition applications on the TIDIGITS dataset.

B. CONTRIBUTIONS
This survey article provides the following:
• A detailed comparison of RNN models' components from a computer architecture perspective that addresses computational and memory requirements.
• A study of the optimizations applied to RNNs to execute them on embedded platforms.
• An application-independent comparison of recent implementations of RNNs on embedded platforms.
• Identification of possible opportunities for future research.

C. SURVEY STRUCTURE
This survey article is organized as shown in Figure 1. Section II defines the objectives of realizing RNN models on embedded platforms and the challenges faced in achieving them. We then define a general model for RNN applications and discuss different variations for the recurrent layers in RNN models in Section III. However, it is difficult to run RNN models in their original form efficiently on embedded platforms. Therefore, researchers have applied optimizations to both the RNN model and the target platform. The optimizations applied to the RNN model are called algorithmic optimizations and are discussed in Section IV-A; the optimizations applied to the hardware platform are called platform-specific optimizations and are discussed in Section IV-B. Then, in Section V, we present an analysis of the hardware implementations of RNNs suggested in the literature. The implementations are compared against the applied optimizations and their achieved performance. In Section VI, we compare the implementations analyzed in Section V with the objectives defined in Section II to define the gap between them and propose research opportunities to fill this gap. Finally, in Section VII, we summarize our survey.

II. OBJECTIVES AND CHALLENGES
Implementation efficiency is the primary objective in implementing RNN applications on embedded systems. Implementation efficiency requires the implementation to have high throughput and low energy consumption, and to meet real-time requirements. A secondary objective for the implementation is flexibility. Flexibility requires the implementation to support variations in the RNN model, to allow for online training, and to meet the requirements of different applications. In meeting these objectives, there are challenges in mapping these applications onto embedded systems, such as the large number of computations to be performed within the limited available memory. These objectives and challenges are discussed in detail below.

A. OBJECTIVES OF REALIZING RNNS ON EMBEDDED PLATFORMS
To realize RNN models on embedded platforms, we define some objectives that will influence the solution. These objectives are divided into implementation efficiency objectives and flexibility objectives.

1) Implementation Efficiency
Since we target embedded platforms, we consider the online execution of the application. To satisfy the implementation efficiency objective, the implementation should have high throughput and low energy consumption, and should meet the real-time requirements of the application. The real-time requirements of the application pose additional demands on the throughput, energy consumption, and accuracy of the implementation. Accuracy indicates how correct the model is in performing recognition, classification, translation, etc.
• High throughput: Throughput is a measure of performance. It measures the number of processed input/output samples per second. Application-level inputs and outputs are diverse. For image processing applications, the input can be frames and the throughput can be the number of consumed frames per second, which may also depend on the frame size. For speech/text applications, it can be the number of predicted words per second. Thus, for different sizes and types of inputs and outputs, throughput can have different units and the throughput value may be interpreted in various ways. To compare different applications, we use the number of operations per second as a measure of throughput.
• Low energy consumption: For an implementation to be considered efficient, the energy consumption of the implementation should meet embedded platforms' energy constraints. To compare the energy consumption of different implementations, we use the number of operations per second per watt as a measure of energy efficiency.
• Real-time requirements: In real-time implementations, a response cannot be delayed beyond a predefined deadline, and energy consumption cannot exceed a predefined limit. The deadline is defined by the application and is affected by the frequency of sensor inputs and the system response time. Normally, the RNN execution should meet the predefined deadline.
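As a minimal sketch of how the two comparison metrics above relate to each other, the following Python snippet computes throughput in giga-operations per second (GOP/s), energy efficiency in GOP/s per watt, and a simple deadline check. The function names and all numeric values (operation count, latency, power, deadline) are hypothetical and chosen only for illustration; they are not taken from any surveyed implementation.

```python
def throughput_gops(num_operations, latency_s):
    """Throughput in giga-operations per second (GOP/s)."""
    return num_operations / latency_s / 1e9


def energy_efficiency_gops_per_watt(num_operations, latency_s, power_w):
    """Energy efficiency in GOP/s per watt."""
    return throughput_gops(num_operations, latency_s) / power_w


def meets_deadline(latency_s, deadline_s):
    """A real-time implementation must respond no later than the deadline."""
    return latency_s <= deadline_s


# Hypothetical example values, for illustration only.
operations_per_sample = 10_000_000  # operations needed for one input sample
latency_s = 0.002                   # assumed time to process one sample
power_w = 0.5                       # assumed average power draw
deadline_s = 0.010                  # assumed application deadline

print("GOP/s:  ", throughput_gops(operations_per_sample, latency_s))
print("GOP/s/W:", energy_efficiency_gops_per_watt(operations_per_sample, latency_s, power_w))
print("Meets deadline:", meets_deadline(latency_s, deadline_s))
```

Expressing both metrics in operations rather than frames or words is what makes implementations from different application domains comparable.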

2) Flexibility
The flexibility of the solution in this context is its ability to run different models under different constraints without being restricted to one model or one configuration. For an implementation to be flexible, we define the following requirements that should be satisfied:
• Supporting variations in RNN layer: The recurrent layers of RNN models can vary in the type of the layer (different types of recurrent layer are discussed in Section III-B), the number of hidden cells, and the number of recurrent layers.
• Supporting other NN layers: RNN models have other types of NN layers as well. A solution that supports more NN layers is considered a complete solution for RNN models, and not just a flexible solution. Convolution layers, fully connected layers, and pooling layers might be required in an RNN model.
• Supporting algorithmic optimization variations: Different algorithmic optimizations are applied to RNN models to implement them efficiently on embedded systems (Section IV). Support for at least one algorithmic optimization in the hardware solution is in many cases mandatory for a feasible execution of RNN models on an embedded system. Supporting combinations of optimizations will lead to higher efficiency and flexibility, as this gives the algorithmic designer more choices while optimizing the model for embedded execution.
• Online training: Training is a process that sets parameter values within the neural network. For embedded platforms, training is usually performed offline, and only inference is run on the platform at run-time. For real-life problems, it is often not enough to run only inference on the embedded platforms; some level of training is required at run-time as well. Online training allows the neural network to adapt to new data that was not encountered within the training data, and to adapt to changes in the environment. For example, online training is required for object recognition in autonomous cars to achieve lifelong learning, by continuously receiving new training data from fleets of robots and updating the model parameters [10]. Another example is in automated visual monitoring systems that continuously receive new labeled data [11].
• Meeting the requirements of different application domains: One aspect of flexibility is to support the requirements of different application domains. This makes the implementation attractive because the solution can support a wider range of applications. However, different application domains can have different performance criteria. Some application domains, such as autonomous vehicles [12], might require very high throughput with moderate power consumption, while others, such as mobile applications [13], [14], require extremely low power consumption but have less stringent constraints on throughput.

B. CHALLENGES IN MAPPING RNNS ON EMBEDDED PLATFORMS
We shall now take a look at the challenges faced by hardware solutions to meet the objectives discussed above.

1) Computation challenges
The main computation bottleneck in RNNs is the matrix to vector multiplications. The LSTM layer (explained in detail in Section III-B) has four computation blocks, each of which has one matrix to vector multiplication. For example, if the size of the vector is 1280 and the size of the matrices is 1280 × 1024, each matrix to vector multiplication requires 1280 × 1024 MAC (Multiply And Accumulate) operations. The total number of MAC operations in the LSTM would be 4 × 1280 × 1024 ≈ 5.24 Mega MAC. Since each MAC counts as one multiplication and one addition, this is approximately equivalent to 10.5 MOP (mega operations). The high number of computations negatively affects both the throughput and the energy consumption of the implementation.
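The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the counting rule stated above (four matrix to vector multiplications, two operations per MAC); the function name is a hypothetical helper introduced for illustration.

```python
def lstm_matvec_macs(vector_size, rows_per_matrix, num_blocks=4):
    """MAC operations for the matrix to vector multiplications of one LSTM time-step."""
    return num_blocks * vector_size * rows_per_matrix


macs = lstm_matvec_macs(1280, 1024)
ops = 2 * macs  # each MAC is one multiplication plus one addition

print(f"{macs / 1e6:.2f} Mega MAC, {ops / 1e6:.1f} MOP")
# prints "5.24 Mega MAC, 10.5 MOP"
```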
Another problem is the recurrent structure of the RNN. In RNNs, the output is fed back as an input in such a way that each time-step computation needs to wait for the previous time-step computation to complete. This temporal dependency makes it difficult to parallelize the implementation over time-steps.

2) Memory challenges
The memory required for the matrix to vector multiplications can be very large. The size and the access time of these matrices become a memory bottleneck. The previous example of the LSTM layer requires four matrices, each of size 1280 × 1024. With 32-bit floating-point weights, the size of the required memory for the weights would be 32 × 4 × 1280 × 1024 bits ≈ 168 Mbit, that is, approximately 21 MB. Also, the high number of memory accesses affects the throughput and energy consumption of the implementation [15].
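The weight memory estimate can be checked in the same way. The snippet below is a sketch under the assumptions of the example (four 1280 × 1024 matrices, 32 bits per weight, biases ignored); the helper name and the `bits_per_weight` parameter are introduced here for illustration only.

```python
def lstm_weight_memory_bytes(vector_size, rows_per_matrix, bits_per_weight=32, num_blocks=4):
    """Bytes needed to store the LSTM weight matrices (biases not included)."""
    total_bits = bits_per_weight * num_blocks * vector_size * rows_per_matrix
    return total_bits // 8


size_bytes = lstm_weight_memory_bytes(1280, 1024)
print(size_bytes, "bytes")            # prints "20971520 bytes"
print(f"{size_bytes / 1e6:.1f} MB")   # prints "21.0 MB"
```

Because the memory grows linearly with the number of bits per weight, the same helper also shows why reducing the weight precision directly reduces the memory footprint.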

3) Accuracy challenges
To overcome the previous two issues (computation and memory challenges), optimizations can be applied to RNN models as discussed in Section IV. These optimizations may affect accuracy. The acceptable decrease in accuracy varies with the application domain. For instance, in aircraft anomaly detection, the accepted range of data fluctuation is only 5% [16].

III. RECURRENT NEURAL NETWORKS
The intelligence of humans, as well as most animals, depends on having a memory of the past. This can be short-term, as when combining sounds to make words, or long-term, as when the word "she" refers back to "Anne" mentioned hundreds of words earlier. This is exactly what RNNs provide in neural networks. They add feedback that enables the outputs of the previous time-step to be used while processing the current time-step input. They aim to add memory cells that function similarly to human long-term and short-term memories.
RNNs add recurrent layers to the NN (Neural Network) model. Figure 2 presents a generic model for RNNs that consists of three sets of layers (input, recurrent, and output). Input layers take the sensor output and convert it into a vector that conveys the features of the input. These are followed by the recurrent layers, which provide feedback. In most recent recurrent layer models, memory cells exist as well. Subsequently, the model completes similarly to most NN models with Fully Connected (FC) layers and an output layer that can be a softmax layer. FC layers and the output layer are grouped into the set of output layers in Figure 2. In this section, we discuss the input layers, different types of recurrent layer, output layers, RNN modes of operation, deep RNN, and RNN applications and their corresponding datasets.

A. INPUT LAYERS (FEATURES EXTRACTOR) AND CORRESPONDING APPLICATIONS AND DATASETS
Input layers are needed by many implementations to prepare the sensor output for processing (these may also be called feature extraction layers). Often, the raw sensor data, e.g., the audio samples or video frames, are in a form that is unsuitable for direct processing in the recurrent layer. Also, the RNN performance (in learning rate and accuracy) can be significantly improved if suitable features are extracted in the input layer.
As sensor types (and numbers) change with the application, RNN models show a large variation with application types as well. Thus, it is important to study which applications an RNN model is used for and their corresponding datasets. Datasets are used by researchers to demonstrate success in applying their methods and the modifications to them. Datasets differ in the size of the data samples, the values of data samples, and the total size of the dataset. The success of NN models is measured by accuracy. Accuracy indicates how correct the model is when carrying out recognition, classification, translation, etc.
In this section, we discuss examples from three application domains where input layer pre-processing is used: audio, video, and text. In Table 2, we summarize these application domains and their corresponding datasets. For different datasets, different metrics are used to assess accuracy.

1) Audio inputs
Audio feature extractors translate sound signals into feature vectors. In speech processing, we often want to extract the frequency content from the audio signal (in a similar way to the human ear) [17]. There are many ways to do this, for example, by using the short-time Fourier transform (STFT), mel frequency cepstral coefficients (MFCC), and linear predictive coding (LPC) coefficients [18].
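To make the idea of an audio feature extractor concrete, the following sketch computes STFT magnitude features with NumPy: the signal is cut into overlapping frames, each frame is windowed, and its frequency content is obtained with a real FFT. The frame length, hop size, Hann window, and the synthetic 1 kHz test tone are assumptions chosen for illustration; practical front-ends (for example MFCC pipelines) add further steps that are not shown here.

```python
import numpy as np


def stft_magnitude(signal, frame_len=256, hop=128):
    """Return one magnitude-spectrum feature vector per audio frame."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window  # windowed frame
        for i in range(num_frames)
    ])
    # Real FFT along each frame gives frame_len // 2 + 1 frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1))


# Hypothetical input: one second of a 1 kHz tone sampled at 16 kHz.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)

features = stft_magnitude(tone)
print(features.shape)                 # (number of frames, frequency bins)
print(int(features[0].argmax()))      # index of the strongest frequency bin
```

Each row of `features` is a feature vector that could be passed to the recurrent layer for one time-step.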
Applications:
• Speech recognition: Speech recognition applications receive audio as input, understand it, and translate it into words. Speech recognition can be used for phonetic recognition, voice search, conversational speech recognition, and speech-to-text processing [19].

2) Video inputs
When the input is a video signal, that is, a sequence of images or frames, it is natural to use a convolutional neural network (CNN) as an input layer. CNN layers then extract image features from each video frame and feed the resulting feature vector to the recurrent layer. This use of a CNN as an input layer before a recurrent layer has been employed for many applications with video inputs, such as activity recognition, image description [3], [20], or video description [21].
The use of a CNN as an input layer can also be found for audio signals [22]. In this case, a short segment of audio samples is transformed into a frequency domain vector using, for example, STFT or MFCC. By combining a number of these segments into a spectrogram, we can show information about the source's frequency and amplitude against time. This visual representation is then fed into a CNN as an image. The CNN then extracts speech or audio features suitable for the recurrent layer.
Applications:
• Image/video applications: Image/video applications cover any application that takes images as input, for example, image captioning, activity recognition, and video description.

3) Text inputs
When the input is in the form of text, we often want to represent words as vectors, and word embedding is one common way to do this [23]. The word embedding layer extracts the features of each word in relation to the rest of the vocabulary. The output of the word embedding is a vector. For two words with similar contexts, the distance between their two vectors is short, while it is large for two words that have different contexts.
Following word embedding in an input layer, deeper text analysis or natural language processing is performed in the recurrent layers.
Applications:
• Text generation: RNN models can be used for language-related applications such as text generation. RNN models can predict the next words in a phrase, using the previous words as inputs.
• Sentiment analysis: Sentiment analysis is the task of understanding the underlying opinion expressed by words [24], [25]. Since the input words comprise a sequence, RNN methods are well-suited to performing sentiment analysis.

B. RECURRENT LAYERS
In this section, we cover the various types of recurrent layers. For each layer, we discuss the structure of the layer and the gate equations. The most popular recurrent layer is the Long Short Term Memory (LSTM) [39]. Changes have been proposed to the LSTM to enhance algorithmic efficiency or improve computational complexity. Enhancing algorithmic efficiency means improving the accuracy achieved by the RNN model; such changes include LSTM with peepholes and ConvLSTM, as discussed in Sections III-B2 and III-B3. Improving computational complexity means reducing the number of computations and the amount of memory required by an LSTM to run efficiently on a hardware platform. Techniques include LSTM with projection, GRU, and QRNN/SRU, which are discussed in Sections III-B4, III-B5, and III-B6, respectively. These changes can be applied to the gate equations, interconnections, or even the number of gates. Finally, we compare all the different layers against the number of operations and the number of parameters in Table 3.

1) LSTM
TABLE 2: Application domains and their corresponding datasets and accuracy metrics.
• Audio input, speech recognition: AN4 [27], TIMIT [28], Wall Street Journal (WSJ) [29], LibriSpeech ASR corpus [30].
• Video input, image/video applications: COCO [31], BLEU (higher is better); Moving MNIST [32], cross entropy loss (lower is better); comma.ai driving dataset [33], RMS prediction error (lower is better).
• Text input, text generation: Penn Treebank (PTB) [34], wikitext [35], Text8 [36], WMT'14 [37]; perplexity per word (PPW, lower is better) and Bilingual Evaluation Understudy (BLEU, higher is better).
• Text input, sentiment analysis: IMDB [38], testing accuracy (higher is better).

First, we explain the LSTM (Long Short Term Memory) layer. Looking at the LSTM as a black box, the input to the LSTM is a vector combination of the input vector $x_t$ and the previous time-step output vector $h_{t-1}$, where the output vector at time $t$ is denoted as $h_t$. Looking at the structure of an LSTM, it has a memory cell state $C_t$ and three gates. These gates control what is to be forgotten and what is to be updated by the memory state (forget and input gates). They also control the part of the memory state that will be used as an output (output gate). Our description of the LSTM unit is based on its relationship with hardware implementations. Thus, in Figure 3a, we show the LSTM as four blocks instead of three gates because the LSTM is composed of four similar computation blocks. The computation block is the matrix to vector multiplication of the combination of $x_t$ and $h_{t-1}$ with one of the weight matrices $\{W_f, W_i, W_c, W_o\}$. This is considered the dominant computational task in LSTMs. Each block is composed of a matrix to vector multiplication followed by the addition of a bias vector $\{b_f, b_i, b_c, b_o\}$, and then the application of a nonlinear function. Each block might have element-wise multiplication operations as well. The nonlinear functions used in the LSTM are the tanh and sigmoid functions. The four computation blocks are as follows:
• Forget gate: The role of the forget gate is to decide which information should be forgotten. The forget gate output $f_t$ is calculated as
$$ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \tag{1} $$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_f$ is the weight matrix, $b_f$ is the bias vector, and $\sigma$ is the sigmoid function.
• Input gate: The role of the input gate is to decide which information is to be renewed. The input gate output $i_t$ is computed similarly to the forget gate output as
$$ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \tag{2} $$
using the weight matrix $W_i$ and the bias vector $b_i$.
• State computation: The role of this block is to compute the new memory state. First, the new state candidate vector $\tilde{C}_t$ is calculated as
$$ \tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \tag{3} $$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_c$ is the weight matrix, and $b_c$ is the bias vector. Then, the new state vector $C_t$ is calculated by adding the previous state vector $C_{t-1}$, element-wise multiplied with the forget gate output vector $f_t$, to the new state candidate vector $\tilde{C}_t$, element-wise multiplied with the input gate output vector $i_t$, as
$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t, \tag{4} $$
where $\odot$ is used to denote the element-wise multiplication.
• Output gate: The role of the output gate is to compute the LSTM output. First, the output gate vector $o_t$ is computed as
$$ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \tag{5} $$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_o$ is the weight matrix, $b_o$ is the bias vector, and $\sigma$ is the sigmoid function. Then, the hidden state output $h_t$ is computed by applying the element-wise multiplication of the output gate vector $o_t$ (that holds the decision of which part of the state is the output) to the tanh of the state vector $C_t$ as
$$ h_t = o_t \odot \tanh(C_t). \tag{6} $$
The number of computations and parameters for the LSTM are shown in Table 3. Matrix to vector multiplications dominate the number of computations and parameters. For each matrix to vector multiplication, the input vector $x_t$ of size $m$ and the hidden state output vector $h_{t-1}$ of size $n$ are multiplied with a weight matrix of size $(m + n) \times n$. That requires $n(m + n)$ MAC operations, which is equivalent to $nm + n^2$ multiplications and $nm + n^2$ additions. The number of parameters in each weight matrix is $nm + n^2$ as well. Since this computation is repeated four times within the LSTM computation, these numbers are multiplied by four in the total number of operations and parameters for an LSTM. For the models in the studied papers, $n$ is larger than $m$. Thus, $n$ has a dominating effect on the computational complexity of the LSTM.
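The four computation blocks and the resulting operation count can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard LSTM formulation described above; the function names, the dictionary layout of the weights, and the random initialization are hypothetical choices. Each weight matrix is stored with shape (n, m + n) so that it can be applied directly to the concatenated vector [h_{t-1}, x_t].

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time-step. W[k] has shape (n, m + n), b[k] has shape (n,)."""
    v = np.concatenate([h_prev, x_t])            # combined input [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ v + b["f"])           # forget gate
    i_t = sigmoid(W["i"] @ v + b["i"])           # input gate
    c_tilde = np.tanh(W["c"] @ v + b["c"])       # new state candidate
    c_t = f_t * c_prev + i_t * c_tilde           # new state (element-wise)
    o_t = sigmoid(W["o"] @ v + b["o"])           # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden state output
    return h_t, c_t


def lstm_matvec_macs(m, n):
    """MACs (and weights) of the four matrix to vector multiplications."""
    return 4 * n * (m + n)


# Hypothetical sizes and random parameters, for illustration only.
m, n = 3, 5
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n, m + n)) for k in "fico"}
b = {k: np.zeros(n) for k in "fico"}

h_t, c_t = lstm_step(rng.standard_normal(m), np.zeros(n), np.zeros(n), W, b)
print(h_t.shape, c_t.shape, lstm_matvec_macs(m, n))
```

The element-wise operations in the state and output computations cost only a few multiples of n operations, which is why the 4n(m + n) term dominates.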

2) LSTM with peepholes
Peephole connections were added to LSTMs to make them able to count and measure the time between events [40]. As seen in Figure 3b, the output from the state computation is used as input for the three gates. The LSTM gate equations are changed to:
$$ f_t = \sigma(W_f [h_{t-1}, x_t, C_{t-1}] + b_f), $$
$$ i_t = \sigma(W_i [h_{t-1}, x_t, C_{t-1}] + b_i), $$
and
$$ o_t = \sigma(W_o [h_{t-1}, x_t, C_t] + b_o), $$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $C_{t-1}$ is the state vector at time $t-1$, $C_t$ is the updated state vector at time $t$, $W_f$, $W_i$, and $W_o$ are the weight matrices, and $b_f$, $b_i$, and $b_o$ are the bias vectors.
The number of operations and parameters for an LSTM with peepholes are shown in Table 3. There are two rows for an LSTM with peepholes. The first one considers the multiplication with the cell state in the three gates as a matrix to vector multiplication. In this case, the number of multiplications, additions, and weights increases by $3n^2$. However, the weight matrices multiplied with the cell state can be diagonal matrices [41]. Thus, the matrix to vector multiplication can be treated as an element-wise vector multiplication, a form that has become widely used for LSTMs with peepholes. In this case, the number of multiplications, additions, and weights increases by $3n$ only.
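The reduction from 3n² to 3n follows from the fact that multiplying by a diagonal matrix is the same as an element-wise product with its diagonal. The short NumPy sketch below checks this equivalence and compares the two overheads; the helper name, the vector size, and the random values are hypothetical and used for illustration only.

```python
import numpy as np


def peephole_extra_weights(n, diagonal=True):
    """Extra weights for the three peephole connections of an LSTM with n hidden cells."""
    return 3 * n if diagonal else 3 * n * n


rng = np.random.default_rng(0)
n = 8
w_peep = rng.standard_normal(n)   # diagonal entries of one peephole weight matrix
c_prev = rng.standard_normal(n)   # cell state C_{t-1}

full_product = np.diag(w_peep) @ c_prev   # n^2 multiplications, mostly by zero
elementwise = w_peep * c_prev             # n multiplications

print(np.allclose(full_product, elementwise))
print(peephole_extra_weights(1024, diagonal=False), peephole_extra_weights(1024))
```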

3) ConvLSTM
ConvLSTM is an LSTM with all matrix to vector multiplications replaced with 2D convolutions [42]. The idea is that if the input to the LSTM is data that holds spatial relations, such as visual frames, it is better to apply 2D convolutions than matrix to vector multiplications. Convolution is capable of extracting spatial information from the data. The vectors $x_t$, $h_t$, and $C_t$ are replaced with 3-D tensors. One can think of each element in the LSTM vectors as a 2D frame in the ConvLSTM tensors. Convolution weights need less memory than the weight matrices of matrix to vector multiplications. However, using them involves more computation.
The number of operations and parameters required for a ConvLSTM are shown in Table 3. The calculated numbers are for a ConvLSTM without peepholes. If peepholes are added, the number of multiplications, additions, and weights will increase by $3n$. Since the main change from an LSTM is the replacement of the matrix to vector multiplications with convolutions, the change in the number of operations and parameters is in the $nm + n^2$ factor that appears in the equations for the number of multiplications, additions, and weights. The number of multiplications and additions (MACs) in the convolutions of the input $x_t$ and the hidden state output $h_{t-1}$ is $r c n m k_i^2 + r c n^2 k_s^2$, where $r$ is the number of rows and $c$ is the number of columns in the frames, $m$ is the number of frames in the input $x_t$, $n$ is the number of frames in the output $h_t$ (or the number of hidden cells), $k_i$ is the size of the filter used with $x_t$, and $k_s$ is the size of the filter used with $h_{t-1}$. The number of weights is the size of the filters used for the convolutions.
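A small calculation illustrates why convolutions need fewer weights but more computation. The sketch below follows the Table 3 convention in which m is the number of input frames and n the number of hidden frames, and evaluates the MAC expression above for one computation block. The weight count assumes one k × k filter per pair of input and output frames, which is the usual convention for 2D convolutions but is an assumption here; function names and sizes are hypothetical.

```python
def convlstm_block_macs(r, c, m, n, k_i, k_s):
    """MACs of one ConvLSTM block: convolutions over x_t and over h_{t-1}."""
    return r * c * n * m * k_i ** 2 + r * c * n * n * k_s ** 2


def convlstm_block_weights(m, n, k_i, k_s):
    """Filter weights of one block, assuming one k x k filter per frame pair."""
    return n * m * k_i ** 2 + n * n * k_s ** 2


# Hypothetical sizes, for illustration only.
r, c = 4, 4        # frame rows and columns
m, n = 2, 3        # input frames and hidden frames
k_i, k_s = 3, 3    # filter sizes

macs = convlstm_block_macs(r, c, m, n, k_i, k_s)
weights = convlstm_block_weights(m, n, k_i, k_s)
print(macs, weights, macs // weights)
```

Under these assumptions, every weight is reused once per frame position, so the number of MACs is r × c times the number of weights.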

4) LSTM with projection layer
The LSTM is changed by adding one extra step after the last gate [43]. This step is called a projection layer. The output of the projection layer is the output of the LSTM and the feedback input to the LSTM in the next time-step, as shown in Figure 3c. Simply put, a projection layer is like an FC layer. The purpose of this layer is to allow an increase in the number of hidden cells while controlling the total number of parameters. This is performed by using a projection layer that has a number of units $p$ less than the number of hidden cells. The dominating factor in the number of computations and the number of weights will be $4pn$ instead of $4n^2$, where $n$ is the number of hidden cells and $p$ is the size of the projection layer. Since $p < n$, $n$ can increase with a smaller effect on the size of the model and the number of computations.
In Table 3, we show the number of operations and parameters required for an LSTM with a projection layer. In the original paper proposing the projection layer, the authors considered the output layer of the RNN as a part of the LSTM [43]. The output layer was an FC layer that changes the size of the output vector to $o$, where $o$ is the output size. Thus, there is an extra $po$ term in the number of multiplications, additions, and weights. We put the extra terms between curly brackets to show that they are optional terms. The projection layer can be applied to an LSTM with peepholes as well. In Table 3, we show the number of operations and parameters for an LSTM with peepholes and a projection layer.
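The effect of the projection layer on the model size can be illustrated with a short calculation. The sketch below counts only the weights of the matrix to vector multiplications (biases, peepholes, and the optional {po} term are ignored) and assumes that the projection itself is a p × n matrix, as in a fully connected layer. The helper names and the example sizes are hypothetical.

```python
def lstm_weights(m, n):
    """Weights of the four gate matrices of a plain LSTM."""
    return 4 * n * (m + n)


def lstm_projection_weights(m, n, p):
    """Gate matrices fed by the projected output (size p) plus the projection matrix."""
    return 4 * n * (m + p) + p * n


# Hypothetical sizes, for illustration only.
m, n, p = 128, 1024, 256
print(lstm_weights(m, n))                 # 4,718,592 weights
print(lstm_projection_weights(m, n, p))   # 1,835,008 weights
```

With these example sizes, the recurrent term shrinks from 4n² to 4pn, and the added pn projection weights are small by comparison.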

5) GRU
The Gated Recurrent Unit (GRU) was proposed in 2014 [44]. The main purpose was to make the recurrent layer able to capture dependencies at different time scales in an adaptive manner [45]. In addition, the fact that the GRU has only two gates (three computational blocks) instead of three (four computational blocks) as with the LSTM makes it more computationally efficient and more promising for high-performance hardware implementations. The three computational blocks are as follows:
• Reset gate: The reset gate is used to decide whether to use the previously computed output or treat the input as the first symbol in a sequence. The reset gate output vector $r_t$ is computed as
$$ r_t = \sigma(W_r [h_{t-1}, x_t]), $$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, $W_r$ is the weight matrix, and $\sigma$ is the sigmoid function.
• Update gate: The update gate decides how much of the output is updated. The update gate output vector $z_t$ is computed in the same way as the reset gate output, using the weight matrix $W_z$, as
$$ z_t = \sigma(W_z [h_{t-1}, x_t]). $$
• Output computation: The role of this block is to compute the hidden state vector $h_t$. First, it computes the possible values for the hidden state vector $\tilde{h}_t$ as
$$ \tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]), $$
where $x_t$ is the input vector, $h_{t-1}$ is the hidden state output vector, and $W$ is the weight matrix. Then, the hidden state vector $h_t$ is computed from the old output $h_{t-1}$ and the new possible output $\tilde{h}_t$ as
$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t. $$
As with the LSTM, we visualize a GRU in Figure 3d as three blocks, not two gates, as it has three blocks of matrix to vector multiplications. In Table 3, we show the number of operations and parameters required for a GRU. The number of operations and parameters is approximately 0.75 times the number of operations and parameters in the LSTM.

In Table 3, we use the following symbols: $m$ is the size of the input vector $x_t$, $n$ is the number of hidden cells in $h_t$, $p$ is the size of the projection layer, $o$ is the size of the output layer, $r$ is the number of rows in a frame, $c$ is the number of columns in a frame, $k_i$ is the size of the 2D filter applied to $x_t$, $k_s$ is the size of the 2D filter applied to $h_{t-1}$, and $k$ is the size of the 1D convolution filter. The term $\{po\}$ is an optional term, as discussed in Section III-B4.
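The three GRU computation blocks can be sketched in NumPy in the same style as the LSTM. This is a minimal illustration that assumes the standard GRU formulation without bias vectors, with an update gate $z_t$ computed like the reset gate; the function names, weight shapes (n, m + n), and random values are hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W_r, W_z, W):
    """One GRU time-step. Each weight matrix has shape (n, m + n)."""
    v = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ v)                                       # reset gate
    z_t = sigmoid(W_z @ v)                                       # update gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate output
    return (1.0 - z_t) * h_prev + z_t * h_tilde                  # new hidden state


def gru_matvec_macs(m, n):
    """MACs of the three matrix to vector multiplications."""
    return 3 * n * (m + n)


# Hypothetical sizes and random parameters, for illustration only.
m, n = 3, 5
rng = np.random.default_rng(0)
W_r, W_z, W = (rng.standard_normal((n, m + n)) for _ in range(3))
h_t = gru_step(rng.standard_normal(m), np.zeros(n), W_r, W_z, W)

print(h_t.shape, gru_matvec_macs(m, n) / (4 * n * (m + n)))
```

The printed ratio compares the three GRU blocks with the four LSTM blocks and equals 0.75 for the matrix to vector multiplications.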

6) QRNN and SRU
The purpose of the Quasi-RNN (QRNN) [46] and the Simple Recurrent Unit (SRU) [47] is to make the recurrent unit friendlier for computation and parallelization. The bottleneck in an LSTM/GRU is the matrix to vector multiplications. It is difficult to parallelize this part because it depends on the previous time-step output $h_{t-1}$ and the previous time-step state $C_{t-1}$. In QRNN/SRU, $h_{t-1}$ and $C_{t-1}$ are removed from all matrix to vector multiplications and appear only in element-wise operations.
The QRNN has two gates and a memory state. It has three heavy computational blocks, in which only the input vector $x_t$ is used as input. It replaces the matrix to vector multiplications with 1D convolutions over inputs along the time-step dimension. For instance, if the filter dimension is two, the convolution is applied to $x_t$ and $x_{t-1}$. The three computation blocks compute the forget gate vector $f_t$, the candidate for the new state vector $\tilde{C}_t$, and the output gate vector $o_t$ as
$$f_t = \sigma(W_f * x_t),$$
$$\tilde{C}_t = \tanh(W_c * x_t),$$
and
$$o_t = \sigma(W_o * x_t),$$
where $W_f$, $W_c$, and $W_o$ are the convolution filter banks and "$*$" denotes the convolution operation.
The state vector $C_t$ is computed as
$$C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t,$$
and the hidden state vector $h_t$ is computed using equation 6.
Figure 3e visualizes the QRNN layer. The number of operations and parameters required for a QRNN is shown in Table 3, where $k$ is the size of the convolution filter.
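The QRNN computation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of [46]: the gates are assumed to use sigmoid activations and the candidate state tanh, inputs before the start of the sequence are treated as zero, and the output is assumed to be $o_t \odot \tanh(C_t)$, taking this to be the LSTM output equation referred to as equation 6. The function names (`conv_at`, `qrnn_layer`) and the filter layout are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv_at(filters, xs, t):
    """1D convolution along the time axis, evaluated at time-step t.

    filters[j] is an n x m matrix applied to x_{t-j} (assumed layout);
    inputs before the start of the sequence are treated as zero.
    """
    n = len(filters[0])
    out = [0.0] * n
    for j, w in enumerate(filters):
        if t - j < 0:
            continue
        x = xs[t - j]
        for row in range(n):
            out[row] += sum(a * b for a, b in zip(w[row], x))
    return out


def qrnn_layer(xs, w_f, w_c, w_o):
    """Run a QRNN layer over the input sequence xs and return all h_t."""
    n = len(w_f[0])
    c = [0.0] * n
    hs = []
    for t in range(len(xs)):
        # Heavy blocks: they depend on the inputs only, so they could be
        # computed for all time steps in parallel.
        f = [sigmoid(v) for v in conv_at(w_f, xs, t)]
        c_cand = [math.tanh(v) for v in conv_at(w_c, xs, t)]
        o = [sigmoid(v) for v in conv_at(w_o, xs, t)]
        # Light block: the only recurrent part, purely element-wise.
        c = [fi * ci + (1.0 - fi) * cc for fi, ci, cc in zip(f, c, c_cand)]
        # Assumed output equation: h_t = o_t * tanh(C_t).
        hs.append([oi * math.tanh(ci) for oi, ci in zip(o, c)])
    return hs
```

The point of the structure is visible in the loop body: the three convolutions never read `c` or a previous output, so only the two element-wise lines carry a dependency between time steps.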
The SRU has two gates and a memory state as well. Its three heavy computational blocks are matrix to vector multiplications, not convolutions. The two gates (forget and update gates) are computed using the equations
$$f_t = \sigma(W_f x_t + v_f \odot C_{t-1} + b_f)$$
and
$$r_t = \sigma(W_r x_t + v_r \odot C_{t-1} + b_r),$$
respectively. In both gate calculations, $C_{t-1}$ is used, but only in element-wise multiplications. The parameter vectors $v_f$ and $v_r$ are learned together with the weight matrices and biases during training.
The third computational block is the state computation
$$C_t = f_t \odot C_{t-1} + (1 - f_t) \odot (W x_t),$$
where $C_{t-1}$ is the old state vector and $x_t$ is the input vector. The computation is controlled by the forget gate output vector $f_t$, which decides what is to be forgotten and what is to be treated as new.
Finally, the SRU output $h_t$ is computed from the new state $C_t$ and the input vector $x_t$, weighted by the update gate (which decides the parts of the output that are taken from the new state and the parts that are taken from the input), using the equation
$$h_t = r_t \odot C_t + (1 - r_t) \odot x_t.$$
Figure 3f visualizes the SRU. The output computation is performed in the same block as the update gate. It is worth observing that neither the QRNN nor the SRU uses $h_{t-1}$ in its equations; only the old state $C_{t-1}$ is used. The number of operations and parameters for an SRU is shown in Table 3.
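One SRU time step, as described in the text, might be sketched as follows. The sketch assumes sigmoid gate activations and an input vector of the same size as the state (needed for the final mixing of $C_t$ and $x_t$); the function name `sru_step` and the argument order are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def sru_step(x, c_prev, w, w_f, w_r, v_f, v_r, b_f, b_r):
    """One SRU time step; returns (h_t, C_t)."""
    # Heavy blocks: three matrix to vector products that use x_t only.
    wx = matvec(w, x)
    fx = matvec(w_f, x)
    rx = matvec(w_r, x)
    # Gates: the old state enters through element-wise products only.
    f = [sigmoid(a + v * c + b) for a, v, c, b in zip(fx, v_f, c_prev, b_f)]
    r = [sigmoid(a + v * c + b) for a, v, c, b in zip(rx, v_r, c_prev, b_r)]
    # State: the forget gate mixes the old state with the new candidate.
    c = [fi * ci + (1.0 - fi) * wi for fi, ci, wi in zip(f, c_prev, wx)]
    # Output: the update gate mixes the new state with the raw input.
    h = [ri * ci + (1.0 - ri) * xi for ri, ci, xi in zip(r, c, x)]
    return h, c
```

Because `wx`, `fx`, and `rx` depend on the input alone, the three products could be batched over all time steps before the element-wise recurrence is run.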
In Table 3, we compare the LSTM and all of its variations with respect to the memory requirements for the weights and the number of computations per single time-step. This comparison helps to understand the hardware platform required for each of them. To make it easier for the reader to understand the difference between the LSTM and the other variants, we express the equations for operations and parameters in terms of the LSTM operations and parameters where they are comparable.

C. OUTPUT LAYERS
The output layers in the RNN model are the FC layers and the output function.

1) FC (Fully Connected) Layers
The RNN model might have one or more FC layers after the recurrent layers. Non-linear functions may be applied between the FC layers as well. The layer is called fully connected because each neuron in the input is connected to each neuron of the output. Computationally, this is done by a matrix to vector multiplication using a weight matrix of size $Input_{size} \times Output_{size}$, where $Input_{size}$ is the size of the input vector and $Output_{size}$ is the size of the output vector. One purpose of the FC layer in RNN models can be to change the dimension of the hidden state output vector $h_t$ to the dimension of the RNN model output to prepare it for the output function. In this case, the FC layer might be replaced by adding a projection layer in the recurrent layer.

2) Output function
The output function is the final step in neural network inference. It generates the output of the neural network model. This output can be a prediction, classification, recognition, and so on. For example, in a text prediction problem, the softmax function is used as an output function. The output is a vector of probabilities that sum to one. Each probability corresponds to one word. The word with the highest probability becomes the prediction of the neural network [48].
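As an illustration of the softmax output function in a text prediction setting, the following sketch converts a vector of scores into probabilities and picks the most probable word. The vocabulary and scores are made-up values used only for the example.

```python
import math


def softmax(scores):
    # Subtracting the maximum does not change the result but avoids
    # overflow in exp() for large scores.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocabulary = ["cat", "dog", "bird"]   # hypothetical vocabulary
scores = [2.0, 1.0, 0.1]              # hypothetical FC layer output
probabilities = softmax(scores)
prediction = vocabulary[probabilities.index(max(probabilities))]
```

The exponentials and the division are the costly parts of this function on embedded hardware, which is one reason the output function is often considered separately from the recurrent layers.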

D. PROCESSING OF DATA IN RNN MODELS
There are many ways in which the processing of data may vary in RNN models. The first is to vary the way time steps are treated. This is influenced by the nature of the application, which may have inputs with temporal relations, outputs with temporal relations, or both. The second form of variation is related to bidirectional RNNs. We discuss below how a bidirectional RNN can process inputs both forwards and backwards in time. We also discuss what is meant by a deep RNN model.

1) RNN unfolding variations through time-steps
RNN unfolding/unrolling is performed to reveal the repetition in the recurrent layer and to show the number of time steps required to complete a task. Unfolding the RNN illustrates the different types of RNN models one can encounter.
• One to many A one to many model generates a sequence of outputs for a single input, as shown in Figure 4a. Image captioning is one example [3]. The model takes one image as input and generates a sentence as an output. The words of the sentence compose a sequence of temporally related data. In this case, the temporal sequence is only in the output.
• Many to one A many to one model combines a sequence of inputs to generate a single output, as shown in Figure 4b. Activity recognition [3] and sentiment analysis [49] are two examples. In activity recognition, the model takes a sequence of images as input and determines the activity taking place in the images. In sentiment analysis, the model takes a sequence of words (sentence) as input and generates a single emotion at the end. In this case, the temporal sequence is only in the input.
• Many to many A many to many model has a sequence in the input and a sequence in the output, as shown in Figure 4c. Language translation [2] and video description [3] are two examples. In language translation, the model has a sequence of words (sentence) as an input and a sequence of words (sentence) as an output. In video description applications, the model has a sequence of image frames as input and a sequence of words (sentence) as output.
• One to one There is no RNN model with one to one unrolling. One to one simply means that there is no temporal relation contained in the inputs or the outputs (a feedforward neural network).

2) Bidirectional RNN
In a Bidirectional RNN, input can be fed into the recurrent layer from two directions: past to future and future to past. This requires a duplication of the recurrent layer, so that two recurrent layers work simultaneously, each processing the input in a different temporal direction. This can help the network to better understand context by obtaining data from the past and the future at the same time. This concept can be applied to different variations of recurrent layers, giving, for example, the BiLSTM [50] and the BiGRU [51].
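The duplication of the recurrent layer can be sketched as follows. For brevity, the sketch uses a plain tanh recurrent layer rather than an LSTM or GRU, and it combines the two directions by concatenating their outputs at each time step; both choices, as well as the function names, are assumptions made for the example.

```python
import math


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def run_layer(xs, w_x, w_h):
    """Plain tanh recurrent layer: h_t = tanh(W_x x_t + W_h h_{t-1})."""
    h = [0.0] * len(w_h)
    outputs = []
    for x in xs:
        a = matvec(w_x, x)
        b = matvec(w_h, h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        outputs.append(h)
    return outputs


def bidirectional(xs, forward_params, backward_params):
    # Past to future.
    fwd = run_layer(xs, *forward_params)
    # Future to past: feed the reversed sequence, then restore time order.
    bwd = run_layer(xs[::-1], *backward_params)[::-1]
    # Assumed combination: concatenate both directions per time step.
    return [f + b for f, b in zip(fwd, bwd)]
```

The two calls to `run_layer` are independent of each other and can run simultaneously, at the cost of doubling the weights and requiring the whole input sequence to be available before the backward pass can start.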

E. DEEP RECURRENT NEURAL NETWORKS (DRNN)
Making a neural network a deep neural network is achieved by adding non-linear layers between the input layer and the output layer [52]. This is straightforward in feedforward NNs. However, in RNNs, there are different approaches that can be adopted. Similarly to feedforward NNs, there can be a stack of recurrent layers (stacked RNN) [41], as shown in Figure 5, where we have a stack of two recurrent layers. The output of the first layer is considered as the input for the second layer. Alternatively, the extra non-linear layers can be within the recurrent layer computations [53]. Extra non-linear layers can be embedded within the hidden layer vector $h_t$ calculation, where the $x_t$ and $h_{t-1}$ vectors used to calculate $h_t$ pass through additional non-linear layers. This model is called the deep transition RNN model. The extra non-linear layers can also be added in computing the output from the hidden state vector; this model is called the deep output RNN model. It is possible to have an RNN model that is both a deep transition and a deep output RNN model [54].
Another way to have extra non-linear functions within the recurrent layer is to have them within the gate calculations, a method called H-LSTM (Hidden LSTM).

IV. OPTIMIZATIONS FOR RNNS
As with all neural network applications, RNN applications are based upon intensive operations performed on high precision values. They therefore require high computation power and large memory bandwidth, and they consume a large amount of energy.
Because of the resource constraints of embedded platforms, there is a need to decrease the computation and memory requirements of RNN applications. In this section, we present optimizations that have been applied to RNNs to realize them on embedded systems. In Section V, which follows, we discuss hardware implementations of RNNs on embedded platforms and how they relate to the optimizations presented here. Researchers have been working on two types of optimizations. The first type is related to the RNN algorithms themselves, where RNN algorithms are modified to decrease computation and memory requirements. The modification should have no effect or only a limited effect on accuracy.
The second type of optimization is related to the embedded platform, where hardware improvements are applied to increase the parallelization of the application and decrease the overhead of memory accesses. Figure 6 illustrates these two types of optimizations.

A. ALGORITHMIC OPTIMIZATIONS
In this section, we discuss the different algorithmic optimizations that may be performed on the recurrent layer of an RNN application to decrease its computation and memory requirements. We discuss how these optimizations are carried out and how they affect accuracy. Applying optimizations directly to inference can have unacceptable effects on accuracy. Thus, training the network is required to recover the accuracy. Optimizations may be applied during the main training of the model, or after the model is trained, in which case the model is retrained for some epochs (training cycles).
Different datasets measure accuracy using different units. For some units higher values are better, while for others lower values are better. To provide a unified measure of the change in accuracy, we calculate the percentage change in accuracy from the original value to the value after applying the optimization method as
$$a_\Delta = (-1)^{\alpha} \times \frac{V_a - V_b}{V_b} \times 100, \tag{22}$$
where $a_\Delta$ is the effect of the optimization method on accuracy as a percentage of the original accuracy value, $V_b$ is the value of accuracy before optimization, $V_a$ is the value of accuracy after optimization, and $\alpha$ is an indicator that has a value of 0 if higher accuracy values are better and 1 if lower accuracy values are better. Thus, if the baseline accuracy achieved by the original model without optimizations is 96% and the accuracy after optimization is 94%, the effect of optimization on accuracy is −2.1%. If the accuracy after optimization is 98%, the effect of optimization on accuracy is +2.1%. If the optimization has no effect on accuracy, then the effect on accuracy is 0%.
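The accuracy measure can be checked with a short script. The sketch below assumes the measure is the signed relative change between $V_b$ and $V_a$, with the sign flipped when lower values are better, which reproduces the ±2.1% figures of the worked example; the function name and the error-rate example are hypothetical.

```python
def accuracy_change(v_before, v_after, lower_is_better=False):
    """Percentage effect of an optimization on accuracy.

    alpha is 0 when higher values are better and 1 when lower values
    are better, so an improvement always yields a positive result.
    """
    alpha = 1 if lower_is_better else 0
    return (-1) ** alpha * (v_after - v_before) / v_before * 100.0


drop = accuracy_change(96.0, 94.0)    # accuracy falls from 96% to 94%
gain = accuracy_change(96.0, 98.0)    # accuracy rises from 96% to 98%
# Hypothetical error-rate metric where lower is better: 10 -> 11 is worse.
worse_error = accuracy_change(10.0, 11.0, lower_is_better=True)
```

Normalizing by the original value makes results from datasets with different metrics comparable, but it also means that the same absolute change weighs more heavily when the baseline value is small.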
As shown in Figure 6, the algorithmic optimizations are quantization, compression, deltaRNN, and non-linear function approximation. The first three optimizations are applied to the matrix to vector multiplication operations and the last is applied to the computation of non-linear functions. The table in Figure 6 compares quantization, compression, and deltaRNN in terms of their effect on memory requirements, number of memory accesses, number of computations, and MAC operation cost. The MAC operation cost can be decreased by decreasing the precision of the operands.

1) Quantization
Quantization is a reduction in the precision of the operands. Quantization can be applied to the network parameters only, or to the activations and inputs as well. While discussing quantization, there are three important factors to consider. The first is the number of bits used for weights, biases, activations, and inputs. The second is the quantization method, which defines how to store the full precision values in a lower number of bits. The third is whether quantization was applied with training from the outset or the model was retrained after applying quantization. These three factors all affect accuracy. However, they are not the only factors affecting accuracy, which may also be affected by model architecture, dataset, and other factors. Yet, these three factors have more relevance when applying quantization to the RNN model.
In discussing quantization methods, we cover fixed-point quantization, multiple binary codes quantization, and exponential quantization. We also study whether the selection of the quantized value is deterministic or stochastic. In deterministic methods, the selection is based on static thresholds. In contrast, selection in stochastic methods relies on probabilities and random numbers. Relying on random numbers is more difficult for hardware.

a: Quantized values representation
There are different methods for representing quantized values. In the following, we explain three commonly used methods.
1) Fixed-point quantization In fixed-point quantization, a value is stored in the format $Q_{m.f}$, where $m$ is the number of integer bits and $f$ is the number of fractional bits. The total number of bits required is $k$. The sign bit may be included in the number of integer bits [55] or added as an extra bit in addition to $m$ and $f$ [56]. For example, in the first case [55], $Q_{1.1}$ is used to represent a 2-bit fixed-point value that has three values {−0.5, 0, 0.5}. This quantization method is also called Pow2-ternarization [57]. Usually, fixed-point quantization is deterministic, in that for each floating-point value, there is one quantized fixed-point value defined by an equation (i.e., it is rule-based). Fixed-point quantization is performed by clipping the floating-point value between the minimum and maximum boundaries, and then rounding it.
2) Exponential quantization Exponential quantization quantizes a value into an integer power of two. Exponential quantization is very beneficial for the hardware, as multiplying with an exponentially quantized value is equivalent to a shift operation if the second operand is a fixed-point value, and to an addition to the exponent if the second operand is a floating-point value [55], [58]. Exponential quantization can be either deterministic or stochastic.
3) Binary and multi-bit codes quantization The lowest precision in RNNs is binary precision [59]. Each full precision value is quantized into one of two values. The most common two values are {−1, +1}, but they can also be {0, +1}, {−0.5, 0}, {−0.5, +0.5}, or any combination of two values [55]. Binarization can be deterministic or stochastic. For deterministic binarization, a sign function can be used. For stochastic binarization, the quantized value $w^b$ of a full precision value $w$ is selected according to a probability:
$$w^b = \begin{cases} +1 & \text{with probability } p = \sigma_h(w), \\ -1 & \text{with probability } 1 - p, \end{cases}$$
where $\sigma_h$ is the "hard sigmoid" function defined as
$$\sigma_h(x) = \max\left(0, \min\left(1, \frac{x + 1}{2}\right)\right).$$
Binarization has great value for hardware computation as it turns multiplication into addition and subtraction. The greatest value comes with full binarization, where both the weights and the activations have binary precision. In this case, it is possible to concatenate weights and activations into 32-bit operands and perform multiple MAC operations using XNOR and bit-count operations. Full binarization can reduce memory requirements by a factor of 32 and decrease computation time considerably [60].
Adding one more value to binary precision is called ternarization. Weights in a ternarized NN are restricted to three values. These three values can be {−1, 0, 1} [61]. Power two ternarization, discussed above as a form of fixed-point quantization, is an example of ternarization with the three values {−0.5, 0, 0.5}. Both deterministic and stochastic ternarization have been applied to RNNs [55]. Having four possible quantization values is called quaternarization. In quaternarization, the possible values can be {−1, −0.5, +0.5, +1} [62]. In order to retain the computational benefit of binary weights and activations while using a higher number of bits, multiple binary codes {−1, +1} have been used for quantization [63]. For example, two-bit quantization has four possible values {{−1, −1}, {−1, 1}, {1, −1}, {1, 1}}.
The most common method for deterministic quantization is uniform quantization. Uniform quantization may not be the best quantization method, as it can change the distribution of the original data, especially for non-uniform data, which can affect accuracy. One solution is balanced quantization [64]. In balanced quantization, data is divided into groups containing the same amount of data before quantization to ensure a balanced distribution of data following quantization. Other suggested solutions treat quantization as an optimization problem, and include greedy quantization, refined greedy quantization, and alternating multi-bit quantization [63], [65].
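The three representations can be illustrated with scalar quantizers. This is a sketch under several assumptions that are not taken from the cited works: the fixed-point range is clipped symmetrically with the sign bit counted among the integer bits (which reproduces the three values of the 2-bit example), the exponential quantizer rounds in the logarithmic domain, and the hard sigmoid clips $(x+1)/2$ to $[0, 1]$. All function names are hypothetical.

```python
import math
import random


def fixed_point(value, int_bits, frac_bits):
    """Deterministic fixed-point quantization: clip, then round."""
    step = 2.0 ** -frac_bits
    high = 2.0 ** (int_bits - 1) - step   # sign bit counted in int_bits
    clipped = max(-high, min(high, value))
    return math.floor(clipped / step + 0.5) * step


def power_of_two(value):
    """Deterministic exponential quantization to a signed power of two."""
    if value == 0.0:
        return 0.0
    exponent = round(math.log2(abs(value)))   # assumed rounding rule
    return math.copysign(2.0 ** exponent, value)


def hard_sigmoid(x):
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarize(value, stochastic=False, rng=random):
    """Quantize to {-1, +1} with a sign rule or a random draw."""
    if not stochastic:
        return 1.0 if value >= 0.0 else -1.0
    # +1 is selected with probability hard_sigmoid(value).
    return 1.0 if rng.random() < hard_sigmoid(value) else -1.0


ternary = [fixed_point(v, 1, 1) for v in (-2.0, -0.3, 0.2, 0.3, 2.0)]
```

Only the stochastic variant needs a random number source at run time, which is the hardware difficulty noted above; the deterministic quantizers reduce to comparisons, shifts, and rounding.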

b: Training/Retraining
As mentioned earlier, there are three options to minimize the accuracy loss due to quantization. The first is to apply quantization with training [66], where quantized weights are used during the forward and backward propagation only. Full precision weights are used for the parameter update step in Stochastic Gradient Descent (SGD). Copies of both the quantized and the full precision weights are kept, so that it can be decided at inference time which one to use [55]. In the second approach, quantization is applied to pre-trained parameters and the RNN model is retrained to decrease the accuracy loss. In addition, binarization of LSTM gate outputs during training has been applied using the Gumbel-Softmax estimator [67]. The authors of one RNN implementation [56] adopted a mix of the training and retraining approaches, where only the activations were not quantized from the beginning. Activations were quantized after training and then the model was retrained for 40 epochs. The third approach is to use quantized parameters without training/retraining. This is very commonly used with 16-bit fixed-point quantization. Usually, training happens on training servers and quantization is applied on the inference platform without the opportunity to retrain the model. It is also very common to use 16-bit fixed-point quantization together with other optimization techniques such as circulant matrices compression [68], pruning [69], and deltaRNN (discussed later in Section IV-A3) [70].
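The first option, quantization with training, can be illustrated on a toy problem. The sketch below is not taken from any of the cited works: it assumes a single linear neuron, a squared-error loss, and deterministic binarization, and it applies the gradient computed with the quantized weights directly to the full precision copy. The function name `train_step` is hypothetical.

```python
def sign(v):
    return 1.0 if v >= 0.0 else -1.0


def train_step(w_full, x, target, lr=0.1):
    """One SGD step with binarized weights in the forward/backward pass."""
    # Forward pass uses the quantized copy of the weights.
    w_q = [sign(w) for w in w_full]
    y = sum(wq * xi for wq, xi in zip(w_q, x))
    error = y - target
    loss = 0.5 * error * error
    # Backward pass: gradient of the loss with respect to the
    # quantized weights.
    grads = [error * xi for xi in x]
    # Update step: applied to the full precision copy, so that small
    # updates accumulate until a weight crosses the threshold.
    w_full = [w - lr * g for w, g in zip(w_full, grads)]
    return w_full, loss


weights = [0.05, -0.3]
weights, first_loss = train_step(weights, [1.0, 1.0], -2.0)
weights, second_loss = train_step(weights, [1.0, 1.0], -2.0)
```

Keeping the full precision copy is what allows training to make progress: an update smaller than the quantization step would otherwise be lost each time the weights are re-quantized.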

c: Effect on accuracy
In Table 4, we list studies that included experiments on the quantization of RNN models. Not all of the studies have a hardware implementation, as the purpose is to show that quantization can be performed while keeping accuracy high. In the table, we list the three factors affecting accuracy discussed earlier (number of bits, quantization method, and training), together with the type of recurrent layer (LSTM, GRU, ...) and the dataset. Then, we show the effect of quantization on accuracy, computed using Eq. 22 with respect to the accuracy achieved with full precision parameters and activations. For the number of bits, we use W/A, where W is the number of bits used for weights and A is the number of bits used for activations. For the RNN type, we list the recurrent layers used in the experiments. All recurrent layers are explained in Section III. We use x*y*z, where x is the number of layers, y is the type of the layers, and z is the number of hidden cells in each layer.
For training, if quantization was applied with training from the beginning, we write "With training". If quantization was applied after training and the model was later retrained, we write "Retraining". Positive values for accuracy mean that quantization enhanced the accuracy, and negative values mean that quantization caused the model to be less accurate.
Each experiment in Table 4 is applied to a different model and a different dataset, and may also have used different training methods. Thus, conclusions about accuracy from Table 4 cannot be generalized. Still, we can make some observations:
• Fixed-point quantization, exponential quantization, and mixed quantization had no negative effect on accuracy in the listed experiments; accuracy increased after applying these quantization methods. Quantized models can surpass baseline models in accuracy, as weight quantization has a regularization effect that counteracts over-fitting [56].
• Regarding binary quantization, the negative effect on accuracy varied within small ranges in some experiments [56], [62]. Experiments showed that using more bits for activations may enhance the accuracy [56]. Using binary weights with convLSTM is not solely responsible for the poor accuracy obtained, as ternary and quaternary quantization resulted in poor accuracy with convLSTM as well [62]. However, these quantization methods were successful when applied to LSTM and GRU in the same work [62].

2) Compression
Compression decreases the model size by decreasing the number of parameters or connections. As the number of parameters is reduced, memory requirements and the number of computations decrease. Table 5 compares different compression methods. The compression ratio shows the ratio between the number of parameters of models before and after applying compression methods. Accuracy degradation is computed using Eq. 22.
(i) Pruning
Pruning is the process of eliminating redundancy. Computations in RNNs are mainly dense matrix operations. To improve computation time, dense matrices are transformed into sparse matrices, which affects accuracy. However, careful choice of the method used to transform a dense matrix into a sparse matrix may result in only a limited impact on accuracy while providing significant gains in computation time. Reduction in memory footprint along with computation optimization is essential for making RNNs viable. However, pruning results in two undesirable effects. The first is a loss in the regularity of memory organization due to sparsification of the dense matrix, and the second is a loss of accuracy on account of the removal of weights and nodes from the model under consideration. The transformation from a regular matrix computation to an irregular application often results in the use of additional hardware and computation time to manage data. To compensate for the loss of accuracy caused by pruning, various methods, including retraining, have been applied. The following sections describe methods of pruning and compensation techniques found in the literature. Table 5 summarizes the methods of pruning and their impact on sparsity and accuracy. Sparsity in this context refers to the number of empty entries in the matrices. In Table 5, sparsity indicates the impact on the number of entries eliminated because of the method of pruning used. Within RNNs, pruning can be classified as either magnitude pruning for weight matrix sparsification, or structure-based pruning.
Table 4 notes: (1) Accuracy is also affected by the compression scheme and the non-linear function approximation used in this work. (2) We calculate the error at the tenth frame (third predicted frame). In the table, we have used the symbols: W/A for number of bits for weights/number of bits for activations, P2T for power two ternarization, EQ for exponential quantization, B for binary quantization, T for ternary quantization, and Q for quaternary quantization.
Magnitude pruning
Magnitude pruning relies on eliminating all weight values below a certain threshold. In this method, the choice of threshold is crucial to minimize the negative impact on accuracy, so magnitude pruning methods differ primarily in how they identify the correct threshold for pruning weights.
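The basic operation of magnitude pruning, removing every weight whose magnitude falls below a threshold, can be sketched as follows. The weight values and the threshold are made-up numbers, and the function names are hypothetical.

```python
def prune_by_threshold(weights, threshold):
    """Set every weight with a magnitude below the threshold to zero."""
    return [[0.0 if abs(w) < threshold else w for w in row]
            for row in weights]


def sparsity(weights):
    """Fraction of entries in the matrix that are zero."""
    total = sum(len(row) for row in weights)
    zeros = sum(1 for row in weights for w in row if w == 0.0)
    return zeros / total


dense = [[0.5, -0.05],
         [0.01, -0.9]]
pruned = prune_by_threshold(dense, threshold=0.1)
ratio = sparsity(pruned)
```

Raising the threshold increases the sparsity but removes progressively larger weights, which is why the choice of threshold governs the trade-off between compression and accuracy.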
• Weight sub-groups For weight matrix sparsification, the RNN model is trained to eliminate redundant weights and retain only the weights that are necessary. There are three schemes for creating weight sub-groups to select the pruning threshold [72]: class-blind, class-uniform, and class-distribution. In class-blind, the x% of weights with the lowest magnitude are pruned, regardless (blind) of the class. In class-uniform, the x% of weights with the lowest magnitude are pruned within each class, so that pruning is uniform across all classes. In class-distribution, weights with a magnitude within the standard deviation of their class are pruned.
• Hard thresholding Hard thresholding [73], [74] identifies the correct threshold value that preserves accuracy. ESE [74] uses hard thresholding during training to learn which weights contribute to prediction accuracy.
• Gradual thresholding This method [75] uses a set of weight masks and a monotonically increasing threshold. Each weight is multiplied with its corresponding mask. This process is iterative, and the masks are updated by setting all parameters that are lower than the threshold to zero. As a result, this technique prunes weights gradually during the training process, in contrast to hard thresholding.
• Block pruning In block pruning [76], magnitude thresholding is applied to blocks of a matrix instead of individual weights during training. The weight with the maximum magnitude is used as a representative for the entire block. If the representative weight is below the current threshold, all the elements in the block are set to zero. As a result, block sparsification mitigates the indexing overhead, irregular memory accesses, and incompatibility with array data-paths that characterize unstructured random pruning.
• Grow and prune Grow and prune [54] combines gradient-based growth [77] and magnitude-based pruning [74].
• Network sparsification Pruning through network sparsification [78] introduces sparsity for the connections at every neuron output, such that each output has the same number of inputs. Furthermore, an optimization strategy is formulated that replaces nonzero elements in each row with the highest absolute value. This step avoids any retraining, which may be compute-intensive and difficult in privacy-critical applications. However, the impact of this method of pruning on accuracy has not been directly measured. Design space exploration over different levels of sparsity measures the quality of output and gives an indication of the relationship between the level of approximation and the application-level accuracy.
• Drop-out DeepIoT [79] compresses neural network structures into smaller dense matrices by finding the minimum number of non-redundant hidden elements without affecting the performance of the network. For LSTM networks, Bernoulli random probabilities are used for dropping out hidden dimensions used within the LSTM blocks.
Retaining accuracy levels
Pruning alongside training and retraining has been employed to retain the accuracy levels of the pruned models. Retraining works on the pruned weights and/or the pruned model until convergence to a specified level of accuracy is achieved. Pruning has shown a regularization effect on the retraining phase [72]. The regularization effect might be the reason why pruned models can outperform the baseline model accuracy. Another possible reason is that pruning allows a better local minimum to be found: pruning increases the loss function immediately, which results in further gradient descent.
Handling irregularity in pruned matrices
Pruning to maximize sparsity results in a loss in regularity (or structure) of memory organization due to sparsification of the original dense matrix. Pruning techniques that are architecture agnostic mainly result in unstructured, irregular sparse matrices. Methods such as load balancing-aware pruning [74] and block pruning (explained earlier within magnitude pruning) [76] have been applied to minimize these effects. Load balancing-aware pruning [74] works towards ensuring the same sparsity ratio among all the pruned sub-matrices, thereby achieving an even distribution of non-zero weights. These techniques introduce regularity in the sparse matrix to improve performance and avoid index tracking.
(ii) Structured matrices
Circulant matrices
A circulant matrix is a matrix in which each column (row) is a cyclic shift of the preceding column (row) [58]. It is considered a special case of Toeplitz-like matrices. The weight matrices are reorganized into circulant matrices. The redundancy of values in the matrices reduces the space complexity of the weight matrices. For large matrices, circulant matrices can use nearly 4× less memory space. The back-propagation algorithm is modified to allow training of the weights in the form of circulant matrices.
Block-circulant matrices
Instead of transforming the weight matrix into a circulant matrix, it is transformed into a set of circulant sub-matrices [68], [80]. Figure 7 shows a weight matrix that has 32 parameters. The block size of the circulant sub-matrices is 4.
The weight matrix is transformed into two circulant sub-matrices with 8 parameters in total (4 parameters each). The compression ratio is 4×, where 4 is the block size. Thus, having larger block sizes will result in a higher reduction in model size. However, a high compression ratio may degrade the prediction accuracy. In addition, the Fast Fourier Transform (FFT) algorithm can be used to speed up the computations. Consequently, the computational complexity decreases by a factor of $O(k / \log k)$.
(iii) Tensor decomposition
Tensors are multidimensional arrays. A vector is a tensor of rank one, a 2-D matrix is a tensor of rank two, and so on. Tensors can be decomposed into lower-rank tensors, and tensor operations can be approximated using these decompositions in order to decrease the number of parameters in the NN model. Canonical polyadic (CP) decomposition, Tucker decomposition, and tensor train decomposition are some of the techniques used to apply tensor decomposition [81]. Tensor decomposition techniques can be applied to the FC layers [82], convolution layers [83], and recurrent layers [81]. In Table 5, we show an example of applying tensor decomposition to a GRU layer using the CP technique. In another example, Adam's algorithm has been used as an optimizer for the training process [84]. Tensor decomposition techniques can achieve a high compression ratio compared to other compression methods.
Table 5 notes: (1) H-LSTM is hidden LSTM. Non-linear layers are added in gate computations (explained in Section III). (2) Dataset name is not mentioned in the paper. (3) Accuracy is also affected by quantization (Table 4) and the non-linear function approximation used in this work.
(iv) Knowledge distillation
Knowledge distillation is a method that replaces a large model with a smaller model that should behave like the large model. Starting from a large model (teacher) with trained parameters and a dataset, the small model (student) is trained to behave like the large model [86]. In addition to knowledge distillation, pruning can be applied to the resulting model to increase the compression ratio, as shown in Table 5.
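To make the storage saving of the block-circulant matrices discussed under structured matrices concrete, the sketch below multiplies a 4 × 8 weight matrix, stored as two circulant blocks of size 4, with an input vector. Each block is assumed to be defined by its first column, and the product is computed directly rather than with the FFT to keep the example short; the function names and the numeric values are hypothetical.

```python
def circulant_matvec(first_col, x):
    """Multiply a circulant matrix, given by its first column, with x.

    Entry (i, j) of the matrix equals first_col[(i - j) mod n], so the
    full n x n matrix never has to be stored.
    """
    n = len(first_col)
    return [sum(first_col[(i - j) % n] * x[j] for j in range(n))
            for i in range(n)]


def block_circulant_matvec(blocks, x, block_size):
    """blocks[r][s] holds the first column of the block in block-row r
    and block-column s (assumed layout)."""
    y = []
    for block_row in blocks:
        acc = [0.0] * block_size
        for s, first_col in enumerate(block_row):
            segment = x[s * block_size:(s + 1) * block_size]
            part = circulant_matvec(first_col, segment)
            acc = [a + p for a, p in zip(acc, part)]
        y.extend(acc)
    return y


# A 4 x 8 weight matrix (32 entries) stored as 2 blocks x 4 values.
blocks = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 0.0]]]
x = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
y = block_circulant_matvec(blocks, x, block_size=4)
stored_parameters = sum(len(c) for row in blocks for c in row)
```

Each block product is a circular convolution, which is the operation that an FFT-based implementation would accelerate.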

3) DeltaRNN
Delta Recurrent Neural Networks (DeltaRNN) [87] make use of the temporal relation between input sequences. For two consecutive input vectors $x_t$ and $x_{t-1}$, the difference between corresponding values in the two vectors may be zero or close to zero. The same holds for the hidden state output vector. The idea is to skip computations for input/hidden state values whose difference from the corresponding values of the previous time step is less than a predefined threshold called delta ($\Theta$). The improvement comes from decreasing the number of computations and the number of memory accesses required by the recurrent unit. However, memory requirements will not decrease, because all the weights still need to be stored, as it cannot be predicted which computations will be skipped. The value of the delta threshold affects both accuracy and speedup. In Table 6, we summarize the effect of DeltaRNN on accuracy for two different datasets. On some occasions, it was necessary to train the RNN using a delta algorithm before inference to obtain better accuracy at inference time. Furthermore, the speedup gained by the delta algorithm at one delta value is not static. It depends on the relation between the input sequences. The highest speedup could be reached using video frames (open driving dataset) as input data, as seen in Table 6. However, the time-consuming CNN before the recurrent layer negated the speedup gained by deltaRNN. Thus, the 100× speedup in GRU execution dropped to a non-significant speedup for the model as a whole. On the other hand, CNN-Delta [88] applied a similar delta algorithm to CNNs. Applying delta algorithms to both recurrent layers and CNN layers might prove beneficial.
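The skipping mechanism can be sketched for a single matrix to vector product. The sketch assumes that the result of the previous time step is kept and corrected only with the columns whose input changed by at least the threshold, and that the stored reference value of an input is updated only when its change is propagated. The class name and the numeric values are hypothetical.

```python
class DeltaMatVec:
    """Incrementally maintained product y = W x with a delta threshold."""

    def __init__(self, weights, threshold):
        self.weights = weights
        self.threshold = threshold
        # Input values that were last propagated into the result.
        self.reference = [0.0] * len(weights[0])
        self.result = [0.0] * len(weights)
        self.skipped = 0

    def step(self, x):
        for j, (new, old) in enumerate(zip(x, self.reference)):
            delta = new - old
            if abs(delta) < self.threshold:
                # Column j is neither fetched nor multiplied.
                self.skipped += 1
                continue
            for i, row in enumerate(self.weights):
                self.result[i] += row[j] * delta
            self.reference[j] = new
        return list(self.result)


unit = DeltaMatVec([[1.0, 2.0], [3.0, 4.0]], threshold=0.1)
first = unit.step([1.0, 1.0])
second = unit.step([1.05, 2.0])   # the first input barely changes
```

In the second step the exact product would be [5.05, 11.15]; the unit returns [5.0, 11.0] and skips one column, which shows how the threshold trades accuracy for fewer computations and weight fetches. All weights remain stored because any column may be needed at a later step.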

4) Non-linear function approximation
Non-linear functions are the second most used operations in the RNN after matrix to vector multiplications, as may be seen in Table 3. The non-linear functions used in the recurrent layers are tanh and sigmoid. Both functions require floating-point division and exponential operations, which are expensive in terms of hardware resources. In order to have an efficient implementation of an RNN, non-linear function approximations are implemented in hardware. The approximation should strike a balance between high accuracy and low hardware cost. In what follows, we present the approximations used in the implementations under study.
Look-up tables (LUTs): Replacement of non-linear function computation with look-up tables is the fastest method [89].The input range is divided into segments with constant output values.However, to achieve high accuracy, large LUTs are required and that consumes a large area of silicon, which is not practical.Several methods have been proposed to decrease the LUTs size while preserving high accuracy.
Piecewise linear approximation: This approximation method is done by dividing the non-linear function curve into a number of line segments.Any line segment can be represented by only two values: the slope and the bias.Thus, for each segment, only two values are stored in the LUTs.The choice of the number of segments affects both accuracy and the size of LUTs.Thus, the choice of the number of segments must be made wisely to keep the accuracy high while keeping the LUTs as small as possible.The computational complexity of the non-linear function changes to be a single comparison, multiplication and addition, which may be implemented using shifts and additions.Compared to using look-up tables, piecewise linear approximation requires fewer LUTs and more computations.
Hard tanh / Hard sigmoid: Hard tanh and hard sigmoid are two examples of piecewise linear approximation with three segments.The first segment is saturation to zero or −1 (zero in case of sigmoid and −1 in case of tanh), the last segment is saturation to one, and the middle segment is a line segment that joins the two horizontal lines.
There is a variant of piecewise linear approximation called piecewise non-linear approximation.The line segments are replaced by non-linear segments and the use of multipliers cannot be avoided as they can in the linear version.This made the linear approximation preferable in hardware design.
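The piecewise linear approximation and its three-segment special cases can be sketched as follows. This is an illustrative sketch only: the slope 0.25 in `hard_sigmoid` is one common choice rather than a value taken from the surveyed papers, and the helper names `build_pwl_table` and `pwl_eval` are hypothetical.

```python
import numpy as np

def hard_sigmoid(x):
    # Saturate to 0, rise linearly, saturate to 1 (slope 0.25 is an assumption).
    return np.clip(0.25 * x + 0.5, 0.0, 1.0)

def hard_tanh(x):
    # Saturate to -1, follow the identity line, saturate to 1.
    return np.clip(x, -1.0, 1.0)

def build_pwl_table(fn, lo, hi, n_segments):
    """Store one (slope, bias) pair per uniform segment of fn on [lo, hi]."""
    edges = np.linspace(lo, hi, n_segments + 1)
    y = fn(edges)
    slopes = (y[1:] - y[:-1]) / (edges[1:] - edges[:-1])
    biases = y[:-1] - slopes * edges[:-1]
    return edges, slopes, biases

def pwl_eval(x, edges, slopes, biases):
    """Evaluate with one segment lookup, one multiplication, and one addition."""
    x = np.clip(x, edges[0], edges[-1])       # saturate outside the table range
    idx = np.searchsorted(edges, x, side="right") - 1
    idx = np.clip(idx, 0, len(slopes) - 1)
    return slopes[idx] * x + biases[idx]
```

For example, `build_pwl_table(np.tanh, -4.0, 4.0, 16)` stores 16 slope/bias pairs; increasing the number of segments lowers the approximation error at the cost of a larger table, which is the trade-off described above.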
RALUT: Another method to reduce the size of the LUTs is to use Range Addressable Look-Up Tables (RALUTs) [90]. In RALUTs, each group of inputs is mapped into a single output.
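A range-addressable table can be sketched in a few lines. The sketch below assumes non-uniform ranges chosen by hand (narrow where tanh changes quickly, wide in the saturated regions) and stores the function value at the midpoint of each range; both choices, and the names `build_ralut` and `ralut_eval`, are assumptions for illustration rather than the design in [90].

```python
import numpy as np

def build_ralut(fn, boundaries):
    """One stored output per input range [boundaries[i], boundaries[i+1])."""
    boundaries = np.asarray(boundaries, dtype=float)
    midpoints = 0.5 * (boundaries[:-1] + boundaries[1:])
    return boundaries, fn(midpoints)

def ralut_eval(x, boundaries, outputs):
    """Map every input in a range to that range's single stored output."""
    idx = np.searchsorted(boundaries, x, side="right") - 1
    idx = np.clip(idx, 0, len(outputs) - 1)   # inputs outside the table saturate
    return outputs[idx]

# Hand-picked ranges (an assumption): 16 entries cover the interval [-8, 8].
BOUNDS = [-8, -3, -2, -1.5, -1, -0.75, -0.5, -0.25,
          0, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 8]
bounds, outs = build_ralut(np.tanh, BOUNDS)
```

Because wide ranges are used where the function is nearly flat, far fewer entries are needed than in a table with one entry per input value.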

B. PLATFORM SPECIFIC OPTIMIZATIONS
In this section, we discuss the optimizations performed at the hardware level to run an RNN model efficiently. These optimizations may be related to computation or memory. For computation-related optimizations, techniques are applied to speed up the computations and obtain higher throughput. For memory-related optimizations, techniques are applied to reduce the overhead of memory usage and memory accesses.

1) Compute-specific
The bottleneck in RNN computations is the matrix to vector multiplications. It is difficult to fully parallelize matrix to vector multiplications over time-steps, as the RNN model includes a feedback part. Each time-step computation waits for the preceding time-step computation to complete so that it can use the hidden state output as an input for the new time-step computation.
• Loop unrolling: Loop unrolling is a parallelization technique that creates multiple instances of the looped operations to gain speedup at the expense of resources. There are two kinds of loop unrolling used in RNN implementations. The first is inner loop unrolling, where the inner loop of the matrix to vector multiplication is unrolled [21], [91]. The second kind is unrolling over time-steps. An RNN needs to run for multiple time-steps for each task to be completed. The computation of the recurrent unit can be unrolled over time-steps [92]. However, this cannot be fully parallelized, as discussed earlier. Only computations that rely on inputs can be parallelized, while computations relying on hidden state outputs are performed in sequence. One solution can be to use QRNN or SRU, as discussed in Section III-B. In QRNN and SRU, the matrix to vector multiplications do not operate on the hidden state output and thus can be fully parallelized over unrolled time-steps [93].
• Systolic arrays: 2D systolic arrays are a good candidate for matrix to vector multiplication [94], [95] and convolution units [15]. Systolic arrays are efficient because multiplication operands move locally between neighboring PEs (processing elements) [96]. Thus, systolic arrays require less area, less energy, and less control logic. Well-designed systolic arrays can guarantee that PEs remain busy to maximize throughput.
• Pipelining: Pipelining is an implementation technique that can increase throughput. Pipelining has been used in RNN implementations in various ways. Coarse-grained pipelining (CGPipe) is used to handle the data dependency of LSTM and its variants [68], [80]. The LSTM computation is performed in three stages, with double buffers in between. The first stage is for the multiplications of the weight matrices with the input and hidden cell vectors, the second stage is for the non matrix to vector operations, and the third stage is for the projection layer computations. Fine-grained pipelining (FGPipe) can be used to schedule the operations within the CGPipe stages. The design of the pipelining scheduler is a critical task due to the data dependency in LSTM and its variants [74]. Some operations need to be performed sequentially, while some operations can be done concurrently. Having sparse weight matrices (due to applying pruning) increases the complexity of the scheduler design.
• Tiling: Tiling consists of dividing one matrix to vector multiplication into multiple smaller matrix to vector multiplications. Usually, tiling is used when a hardware solution has built-in support for matrix to vector multiplication of a specific size in one clock cycle. When the input vector or the weight matrix is larger than the vector or matrix size supported by the hardware, tiling is used to divide the matrix to vector multiplication so that it can be performed on the hardware in multiple cycles [58], [91]. Thus, tiling can be combined with inner loop unrolling or systolic arrays. Figure 8 shows a vector that is broken into three vectors and a matrix that is broken into nine matrices. Thus, one matrix to vector multiplication is broken into nine matrix to vector multiplications. Each vector is multiplied with the matrices having a similar color. The output vector is built from three sub-vectors, each of which is the accumulation of three partial results. This computation requires nine cycles to be completed, assuming that new weights can be loaded into the hardware multiplication unit within the cycle time.
FIGURE 8: Tiling: converting one matrix to vector multiplication into nine matrix to vector multiplications.
• Hardware sharing: In the GRU recurrent layer, the execution of $r_t$ and $h_t$ has to be in sequence, as the $h_t$ computation depends on $r_t$, as shown in Eq. 12. Thus, the computation of $r_t$ and $h_t$ is the critical path in the GRU computation, while $z_t$ can be computed in parallel as it is independent of $h_t$ and $r_t$. The same hardware can be shared for computing $r_t$ and $z_t$ to save hardware resources [97].
• Load balancing: In the case of sparse weight matrices (resulting from pruning), load balancing techniques might be needed during parallelization of the matrix to vector multiplication over processing elements [73], [74].
• Analog computing: Analog computing is a good candidate for neural network accelerators [98]. Analog neural networks [99] and analog CNNs [100] have been studied recently. Interestingly, RNN implementations using analog computing have started to attract attention from researchers [98], [101]. Analog computing brings significant benefits, especially for the critical matrix to vector computation, by making it both faster and more energy-efficient. The same is true for the non-linear functions that are normally calculated between the NN layers. Analog computing also allows for more efficient communication, as a wire can represent many values instead of only a binary value. The performance of an analog computer will, however, depend critically on the digital-to-analog and analog-to-digital converters, for both speed and energy consumption.
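The tiling scheme described in the list above (Figure 8) can be sketched in NumPy as follows. The function name `tiled_matvec` and the requirement that the matrix dimensions be exact multiples of the tile size are assumptions of this sketch; each inner product stands in for one use of a fixed-size hardware multiplication unit.

```python
import numpy as np

def tiled_matvec(W, x, tile):
    """Compute W @ x using only tile-by-tile matrix to vector products."""
    rows, cols = W.shape
    assert rows % tile == 0 and cols % tile == 0, "sketch assumes exact tiling"
    y = np.zeros(rows, dtype=W.dtype)
    n_products = 0
    for r in range(0, rows, tile):            # one output sub-vector per tile row
        for c in range(0, cols, tile):        # accumulate partial results
            y[r:r + tile] += W[r:r + tile, c:c + tile] @ x[c:c + tile]
            n_products += 1
    return y, n_products
```

For a 6x6 matrix and a tile size of 2, the matrix is split into nine tiles and the vector into three sub-vectors, so nine small products are issued, as in the Figure 8 example.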

2) Memory considerations
For the processing of an RNN algorithm, memory is needed to store weight matrices, biases, inputs, and activations, where the weight matrices have the highest memory requirement. The first decision related to memory is where to store the weights. If all the weights are stored in off-chip memory, accessing the weights constitutes the largest cost with respect to both latency and energy [91], [102].
On-chip memory: After applying the algorithmic optimizations introduced in Section IV-A, the memory requirements of the RNN layer are reduced, which increases the possibility of storing the weights in on-chip memory. However, this restricts the largest model size that can run on the embedded platform. On-chip memory has been used for storing the weights by many implementations [56], [58], [68], [70], [103].
Hybrid memory: Storing all the weights in on-chip memory restricts the size of the model executed on the embedded solution. Storing some of the weights in on-chip memory with the remainder in off-chip memory might provide a solution [69].
In addition to maximizing the use of on-chip memory, some researchers use techniques to reduce the number and the cost of memory accesses.

• Multi time-step parallelization: The fact that QRNN and SRU remove the hidden state output from the matrix to vector multiplications can be leveraged to allow multi time-step parallelization [93]. Multi time-step parallelization is performed by converting multiple matrix to vector multiplications into fewer matrix to matrix multiplications. This method decreases the number of memory accesses by reusing the weights for computations involving multiple time-steps.
• Reordering weights: Reordering weights so that they occupy memory in the same order as the computation helps decrease the memory access time [91]. The parameters are reordered in memory in a way that ensures memory accesses will be sequential.
• Compute/load overlap: In order to compute matrix to vector multiplications, weights need to be accessed and loaded from memory and then used for computations. The total time is the sum of the access time and the computation time. To decrease this time, memory access and computations can be overlapped. This overlap can be achieved by fetching the weights for the next time-step while performing the computation of the current time-step. The overlap requires extra buffers for storing the weights of the next time-step while using the weights of the current time-step [74].
Domain-wall memory (DWM) [105]: DWM technology is based on magnetic spin [106]-[109]. Information is stored by setting the spin orientation of magnetic domains in a nanoscopic permalloy wire. Multiple magnetic domains can occupy one wire, which is called race-tracking. Race-tracking allows the representation of up to 64 bits. DWM density is expected to improve on SRAM by 30x and on DRAM by 10x [110]. Using DWM in an RNN accelerator can achieve better performance and lower energy consumption [106].
Processing In Memory (PIM): PIM gets rid of the data fetching problem by allowing computation to take place in memory, eliminating memory access overhead. In such an architecture, a memory bank is divided into three sub-array segments: memory sub-arrays, buffer sub-arrays, and processing sub-arrays, which are used as conventional memory, data buffers, and processing units, respectively. ReRAM-based PIM arrays are one approach used to accelerate CNNs [111]-[113] and RNNs [114]. ReRAM that supports XNOR and bit-counting operations is only sufficient for an RNN implementation if binary or multi-bit code quantization (Section IV-A1) has been applied [71]. Memristor crossbar arrays have successfully been used as an analog dot product engine to accelerate both CNNs [115] and RNNs [101].
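The multi time-step parallelization mentioned above can be illustrated with a short NumPy sketch. It applies only to products that depend on the inputs and not on the hidden state output, as in QRNN and SRU; the function names and the (time-steps, input size) layout of `xs` are assumptions made for illustration.

```python
import numpy as np

def input_projection_per_step(W, xs):
    """T separate matrix to vector products; W is read once per time-step."""
    return np.stack([W @ x for x in xs])

def input_projection_batched(W, xs):
    """One matrix to matrix product over all T time-steps; W is read once."""
    return (W @ xs.T).T
```

Both functions return the same (T, hidden size) result; the batched form reuses each fetched weight for all T time-steps, which is where the reduction in memory accesses comes from.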

V. RNN IMPLEMENTATIONS ON HARDWARE
In the previous section, we discussed the optimizations applied to decrease the computation and memory requirements of RNN models.In this section, we study recent implementations of RNN applications on embedded platforms.The implementations are divided into FPGA, ASIC, and other implementations.We analyze these implementations and study the effects of the applied optimizations.However, the effect of each optimization is not shown separately.Instead, the outcomes of applying the mix of optimizations are discussed with respect to the objectives presented in Section II.First, with regard to efficiency, the implementations are compared in terms of throughput, energy consumption, and meeting real-time requirements.Then, for flexibility, we discuss implementations that support variations in the models, online training, or different application domains.
Table 7 shows the details of the implementations studied here. The authors' names are shown, along with the name of the architecture (if named), the affiliation, and the year of publication. Table 8 and Table 9 present the implementations under study. Table 8 shows implementations performed on FPGAs, while Table 9 shows implementations performed on other platforms. Each implementation has an index. The index starts with "F" for FPGA implementations, "A" for ASIC implementations, and "C" for other implementations. For each implementation, the tables show the platform, the RNN model, the applied optimizations, and the runtime performance.
In most cases, only the recurrent layers of the RNN model are shown, as most of the papers provided the implementation for these layers only. The recurrent layers are written in the format x*y*z, where x is the number of recurrent layers, y is the type of recurrent layers (e.g., LSTM, GRU), and z is the number of hidden cells in each layer. If the model has different modules (e.g., two different LSTM models or LSTM + CNN), we give the number of executed time-steps of the RNN model. Both algorithmic and platform optimizations are shown in the tables. All the optimizations found in the tables are explained above in Section IV using the same keywords as in the tables. For quantized models, "Quantization X" is written in the optimizations column, where X is the number of bits used to store the weights. The effective throughput and the energy efficiency given in the tables are discussed in detail in the sub-section below.

A. IMPLEMENTATION EFFICIENCY
To study the efficiency of the implementations under study, we focus on three aspects: throughput, energy consumption, and meeting real-time requirements.

1) Effective Throughput
To compare the throughput of different implementations, we use the number of operations per second (OP/s) as a measure.Some of the papers surveyed did not directly state the throughput.For these papers, we have tried to deduce the throughput from other information given.One other aspect to consider is that compression optimization results in decreasing the number of operations in the model before running it.Consequently, the number of operations per second is not a fair indicator of the implementation efficiency.In this case, the throughput is calculated using the number of operations in the dense RNN model, not the compressed model.We call this an Effective Throughput.Below, we list the methods used to deduce the throughput values for the different papers.
• Case q1: Effective throughput is given in the paper.
• Case q2: The number of operations in the dense model and the computation time are given. By dividing the number of operations $n_{op}$ by the time, we get the effective throughput $Q_{eff}$ as shown in Eq. 25. In some papers, the number of operations and the computation time $time_{comp}$ were given for multiple time steps (multiple inputs), which would require running the LSTM $n_{steps}$ times.
$$Q_{eff} = \frac{n_{op} \times n_{steps}}{time_{comp}} \quad (25)$$
• Case q3: The implemented RNN model information is provided in the paper. Thus, we calculate the number of operations from the model information and then divide it by the computation time to get the throughput, as in case q2. To compute the number of operations, the operations in the matrix to vector multiplications are counted, as they have the dominant effect on the performance. If the paper does not give enough information to calculate the number of operations, it can be approximated by multiplying the number of parameters by two.
• Case q4: The energy efficiency is given in terms of OP/s/watt and the power consumption is given in watt. By multiplying the two values, the throughput is calculated.
• Case q5: Effective throughput could not be computed.
For a fair comparison between the ASIC implementations, we have applied scaling to 65 nm technology at 1.1 V using the general scaling equations in Rabaey [117] and the scaling estimate equations for 45 nm and smaller technologies [118]. If the voltage value is not mentioned in the paper, we assume the standard voltage for the implementation technology. For example, since A7 was implemented on 65 nm, we assume the voltage value to be 1.1 V.
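The throughput cases q2 to q4 amount to simple arithmetic, sketched below. The operation count assumes a plain LSTM with four gates and no peephole or projection weights, counts one multiply-accumulate as two operations, and treats `n_op` as the per-time-step count with `time_comp` covering all `n_steps`; these are assumptions of the sketch, and the function names are hypothetical.

```python
def lstm_matvec_ops(n_input, n_hidden, n_layers=1):
    """Operations in the matrix to vector multiplications of a dense LSTM
    for one time-step (two operations per weight: multiply and add)."""
    ops = 0
    layer_input = n_input
    for _ in range(n_layers):
        weights = 4 * n_hidden * (layer_input + n_hidden)   # four gates
        ops += 2 * weights
        layer_input = n_hidden                # next layer consumes hidden state
    return ops

def effective_throughput(n_op, time_comp, n_steps=1):
    """Cases q2/q3: dense-model operations divided by computation time (OP/s)."""
    return n_op * n_steps / time_comp

def throughput_from_efficiency(ops_per_s_per_watt, power_watt):
    """Case q4: energy efficiency (OP/s/watt) multiplied by power (watt)."""
    return ops_per_s_per_watt * power_watt
```

Because the count is taken from the dense model, a pruned or otherwise compressed implementation that finishes in the same time is credited with the same effective throughput as a dense one.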
To analyze Table 8 and Table 9 and understand the effect of different optimizations on throughput, the table entries are sorted in descending order of throughput. Two optimization groups appear more frequently among the high throughput implementations. The first group is related to decreasing memory access time. Memory access time is decreased either by using on-chip memory for all weights or by overlapping the computation time and the weight loading time.
The second group is related to algorithmic optimizations.Algorithmic optimizations present in all high throughput implementations are compression (pruning, block-circulant matrices, etc.), deltaRNN, and low precision quantization.Non-linear function approximations and 16-bit quantization are not within the groups of high effect optimizations.Quantization with 16-bit is present in many implementations that do not aim for lower precision, and it does not have a great effect on computation cost.Thus, it is not a differentiating factor.Non-linear function approximations do not contribute to the most used operations (matrix to vector multiplications).
Finally, the throughput values are plotted against the implementations in Figure 9.The scaled effective throughput values for the ASIC implementations are used.Implementations that have memory access optimizations and/or algorithmic optimizations are highlighted by putting them inside a square and/or circle.It can be observed from Figure 9 that all of the implementations with high throughput have some algorithmic optimization and applied memory access optimization.For example, F1 [56] applied low precision quantization and placed all the weights on the on-chip memory.F2 [80], F3 [70], F4 [68], and A2 [58], all applied both on-chip memory optimization and algorithmic optimizations.In F6 [74], the architecture had a scheduler that overlapped computation with memory accesses.All the weights required for computation are fetched before the computation starts.Thus, they managed to eliminate the off-chip memory access overhead by having an efficient compute/load overlap.
Both F2 and F4 applied block-circulant matrices optimization. In addition, A2 applied circulant matrices optimization. This indicates that restructuring weight matrices into circulant matrices and sub-matrices is one of the most fruitful optimizations. The reason could be that circulant matrices optimization does not cause the irregularity of weight matrices seen in pruning [68]. Additionally, circulant/block-circulant matrices can be accompanied by low precision quantization without a harsh effect on accuracy, as in A2 (6-bit) and F2 (12-bit). It is observed in Table 8 that the F2 and F4 optimizations are almost identical, but their performance is different. F2 and F4 differ in hardware architecture, and F2 applied lower precision than F4, but the most important reason is that F2 used a better approach in training the compressed RNN model. F2 was able to reach the same accuracy level reached by F4 with block size 8 while using block size 16. Thus, the RNN model size in F2 is approximately half that of F4.
Nevertheless, it is noticed that the only compute optimization in F2 and F4 is pipelining. In these two implementations, pipelining served two roles. The first is coarse-grained pipelining between LSTM stages, and the second is fine-grained pipelining within each stage. It is worth noting that F1 is based on the same architecture as F5. F1 achieved higher throughput than F5 by applying a higher frequency and using lower precision. Assuming linear frequency scaling, the ratio between the two implementations' throughput is close to the ratio between the precisions used for storing the weights by the two implementations.
The lack of algorithmic optimizations in A1 [101] was compensated by the use of analog crossbar-based matrix to vector multiplication units.Analog crossbar units allowed low latency matrix to vector multiplications.Implementations that used analog computing are marked with an "A" sign in Figures 9 and 10.Comparing A1 to A3, both were using analog crossbars.A1 surpassed A3 by applying PIM (Processing In Memory), which removes memory access overhead.Therefore, in Figures 9 and 10, we consider PIM as a memory access optimization.
One implementation that stands out is A6 [104], which has a very low throughput for an ASIC implementation while applying on-chip and algorithmic optimizations. This particular implementation was meant to meet a latency deadline of 16 ms while consuming very low power, at the micro-watt level. Thus, high throughput was not the objective from the beginning. Implementations that defined real-time requirements are marked by an "RT" sign in Figures 9 and 10. Another implementation that rewards close inspection is F8. Despite applying the two mentioned optimizations, it could not reach as high a performance as expected. The conclusion here is that applying memory access optimization and algorithmic optimization is necessary but not sufficient for high performance.
1 The cases q1-q4 are explained in Section V-A1.
2 The cases e1-e4 are explained in Section V-A2.
3 Scaled to 65 nm at 1.1 volt using general scaling [117] and scaling estimates for 45 nm and smaller technologies [118].
4 The throughput is not high as the purpose was to reach very low power consumption while performing inference within 16 ms.
5 RenderScript is a mobile-specific parallelization framework [119].
6 Quantization used 1 bit for weights and 4 bits for activations.
7 A4 proposed a GRU core without providing specific model details.
8 A1 did not specify which model achieved the provided throughput and energy efficiency.
In addition, Figure 9 shows that most of the ASIC implementations did not exceed the FPGA implementations in terms of throughput. We think the reason is that the ASIC implementations under study did not use the latest ASIC technologies, as shown in Table 9.
For the low throughput implementations, Figure 9 shows that some implementations did not apply either of the two optimizations (memory access and algorithmic), such as F9 [16] that had a strict accuracy constraint bounding the use of algorithmic optimizations and C3 [116].In addition, some implementations applied only one of the two optimizations, including F11 [78] and F12 [69].

2) Energy efficiency
To compare the implementations from the energy consumption perspective, we use the number of operations per second per watt as a measure.The last columns in Table 8 and Table 9 show the energy efficiency.Energy efficiency is calculated based on the dense model, not the sparse model, as for effective throughput.However, it was not possible to obtain values for energy efficiency for all implementations.In some cases, the power consumption was not mentioned in the paper, while in others, the consumed power was not provided in a precise manner.For instance, the power of the whole FPGA board may be provided, which does not indicate how much power is used by the implementation with respect to the peripherals [21], [91].
Here, we list the cases used for computing the energy efficiency in Table 8 and Table 9. The case number appears in angle brackets, <>, before the numeric value.
• Case e1: The energy efficiency $E_{eff}$ is given in the paper.
• Case e2: The power consumption is given in the paper. To compute the energy efficiency $E_{eff}$, the effective throughput $Q_{eff}$ (OP/s) is divided by the power $P$ (watt) as
$$E_{eff} = \frac{Q_{eff}}{P}$$
• Case e3: Energy and computation time are provided. First, we divide energy by time to get power. Then, we divide the effective throughput $Q_{eff}$ by the power to get the energy efficiency, as in case e2.
• Case e4: Energy efficiency could not be computed.
Figure 10 is a plot of the energy efficiency found or deduced for the implementations under study against the implementation index. Implementations are sorted in the plot according to energy efficiency, and the scaled values for the ASIC implementations are used. Again, to show the effect of optimizations, we chose the two most effective optimizations from Table 8 and Table 9 to include in the figure. They are the same as in Figure 9: memory access optimization and algorithmic optimizations. Comparing the effective throughput and energy efficiency of FPGA and ASIC implementations, it is observed that FPGA and ASIC have close values for effective throughput, while ASIC implementations are more energy efficient. The credit should go to ASIC technology.
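Cases e2 and e3 described above reduce to one or two divisions, as the following sketch shows. The function names and argument units (OP/s, watt, joule, second) are chosen for illustration and are not taken from the surveyed papers.

```python
def energy_efficiency_from_power(q_eff, power_watt):
    """Case e2: effective throughput (OP/s) divided by power (watt)."""
    return q_eff / power_watt

def energy_efficiency_from_energy(q_eff, energy_joule, time_s):
    """Case e3: derive power from energy and time, then proceed as in e2."""
    power_watt = energy_joule / time_s
    return q_eff / power_watt
```

Both return values are in OP/s/watt, the unit used in the last columns of Table 8 and Table 9.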
It can be observed that the highest energy efficiency was achieved by A7 [120] and A3 [98].Both implementations used analog crossbar based matrix to vector multiplications.A7 managed to save memory access time by computing in memory.The quantization method used was multi-bit code quantization (1-bit for weights and 4-bit for activations).Multi-bit code quantization enables replacing the MAC operation with XNOR and bit-counting operations, as discussed in Section IV-A1.It was sufficient to use an XNOR-RRAM based architecture to implement the RNN.
Both A1 (applying PIM and analog computing) and A6 (applying memory and algorithmic optimizations) were less energy efficient than expected. They were both less energy efficient than A4 (applying only memory optimization). A1 had a quite high clock frequency of 1 GHz. This high frequency helped the implementation to achieve high throughput. However, we suspect that this high frequency is the main reason for the energy efficiency degradation compared to the other implementations. A6 had the lowest power consumption among all implementations (≤ 5 µW). The low throughput of A6 affected the overall energy efficiency value.

3) Meeting real-time requirements
In some of the implementations, real-time requirements for throughput and power have been defined. For example, in F8 [103], the speech recognition system had two RNN models: one for acoustic modeling and the other for character-level language modeling. The real-time requirement was to run the first model 100 times per second and the second model 3,840 times per second. In A6 [104], an LSTM accelerator for an always-on Keyword Spotting System (KWS), the real-time requirement was that a new input vector be consumed every 16 ms and that the power consumption not exceed 10 µW.

B. FLEXIBILITY
Flexibility, as defined in Section II, is the ability of the solution to support different models and configurations. The flexibility of the solution can be met by supporting variations in the model. Models can vary in the number of layers, the number of hidden units per layer, the optimizations applied to the model, and more. Flexibility can also be met by supporting online training or meeting different application domain requirements.
Flexibility is not quantitative, like throughput. Thus, we use a subjective measure to reach a flexibility score for each implementation. Table 10 shows the flexibility aspects supported by each implementation, as discussed in the papers, and the flexibility score for each implementation. Papers that do not discuss any flexibility aspects are omitted from Table 10. In A4 [97], the architecture should be able to support various models, but the number of cells and layers the architecture can support is not mentioned in the paper. Hence, we cannot deduce how the implementation could support variations in the RNN model. Also, the variations should be supported on the hardware platform and not only by the method before fabrication. In A2 [58], the design method can support two different RNN layers. However, the fabricated chip supports only one of them. Thus, we do not consider A2 [58] to meet the flexibility objective.
To understand how far flexibility is met by the implementations, Figure 11 shows the percentage of implementations supporting each flexibility aspect.Flexibility is visualized as levels.Level 0 is used to indicate no flexibility.Level 0 requires the implementation to support only one recurrent layer configuration.All papers meet level 0 requirement but thereafter they vary in meeting other flexibility aspects.The flexibility aspects and how they can be met are discussed below.
Supporting variations in RNN layers (level 1): Recurrent layers can vary in the type of layers, the number of cells in each layer, and the number of layers (the depth of the RNN model). One optimization that might have a side effect on the flexibility of the solution is the choice of using on-chip or off-chip memory to store the weights. Being able to store all the weights in on-chip memory is very beneficial. It leads to better performance and less energy consumption by decreasing the cost of memory accesses. However, this solution may be infeasible for larger problems. For example, in F8 [103], the number of weights in the model and their precision are restricted by the on-chip memory size. It is not possible to run a model with an increased number of hidden cells or increased precision. A possible solution is to use an adaptable approach, where the location chosen to store the weights depends on the model size, so that a wide range of models can be supported. Another solution was adopted in F12 [69], where some of the weights are stored in internal memory and the rest are stored in off-chip memory (hybrid memory).
Supporting other NN layers (level 2): Supporting other NN layers allows the solution to run a broader range of NN applications. Also, other NN layers may exist in the RNN model, such as convolutions used as a feature extractor. Supporting such a convolution in the implementation increases the flexibility of the solution, as it can run RNN models with visual inputs and run CNN independent applications.

Index | Flexibility aspects in papers | Score
F2 [80] | Varying layer (LSTM/GRU); Varying number of cells; Varying block size (block circulant matrices) |
F4 [68] | Varying layer (LSTM/BiLSTM); Varying number of layers; Varying number of cells |
F1 [56] | Varying layer (LSTM/BiLSTM); Varying precision; FC supported |
F5 [92] | Varying layer (LSTM/BiLSTM); FC supported |
F7 [21] | Convolution supported; FC supported |
F8 [103] | Varying number of layers; Varying number of cells; Input layer |
F10 [91] | Varying number of layers; Varying number of cells |
A4 [97] | Online training |
A5 [73] | Varying number of cells; FC supported |
A6 [104] | Varying number of layers; Varying number of cells; Linear/nonlinear quantization; FC supported |
A8 [114] | Varying type of layer (LSTM/GRU); Convolution supported; FC supported |
A9 [95] | Varying number of cells; Varying number of layers; Dense/Sparse; Convolution supported |
A10 [94] | Varying number of cells; Varying number of layers; Convolution supported; Varying precision |
A1 [101] | Varying number of cells; Varying number of layers; Varying type of layers; Convolution supported; FC supported |
C2 [93] | Varying layer (LSTM/SRU/QRNN); Varying number of cells |
C3 [116] | Varying number of layers; Varying number of cells |

Supporting algorithmic optimization variations (level 3): Variations in the optimizations applied are also considered as variations in the model. For example, variation due to applying or not applying pruning is related to the presence of sparse or dense matrices in the matrix to vector multiplication computations. The design in A9 [95] employed a configurable interconnection network topology to increase the flexibility of the accelerator. The accelerator in A9 [95] supported both LSTM and CNN layers, and it supported both sparse and dense matrices. Another variation is in the precision of the weights and activations. The design in A10 [94] supported varying precision models by allowing dynamic precision per layer for both CNN and RNN models. Similarly, the Microsoft NPU Brainwave architecture [121] supported varying precision using a narrow precision block floating-point format [122]. To maximize the benefit of varying precision, F1 [56] applied a parameterizable parallelization scheme. When lower precision is required, LSTM units are duplicated to exploit the unused resources and gain speedup. When higher precision is used, SIMD folding is applied to save resources for the needed high precision.
Online training (Level 4): Incremental online training was included in A4 [97] to support retraining pre-trained networks to enhance accuracy. Changes in hardware design were applied to support both training and inference without affecting the quality of inference. For example, three modes of data transfer were applied: the first loads new weights, the second loads input sequences, and the third updates certain weights. Extra precision was only used for training.
Meeting different application domain constraints (Level 5): None of the implementations target variations in the application domain constraints. NetAdapt is a good example of an implementation that can adapt to different metric budgets [123]. However, it only targets CNNs.

VI. DISCUSSIONS AND OPPORTUNITIES
In the previous section, we studied the implementations of RNNs on embedded platforms. In Section II, we defined objectives for realizing RNN models on embedded platforms. In this section, we investigate how these objectives are being met by the implementations.
Throughput: It is clear that throughput was the main objective for most of the implementations. As seen in Figure 9, high throughput was achieved by many of them. Algorithmic and memory optimizations are present in most of these high-throughput implementations. The algorithmic optimizations applied were effective because they decrease both the computation and the memory requirements of the RNN models. For example, if 4-bit precision is used instead of 32-bit for weight storage, the memory requirement is decreased to 1/8 of its original size. Multiple 4-bit weights can be concatenated into one memory word during weight fetching, so the number of memory accesses can decrease as well. Furthermore, the hardware required for 4-bit operations is simpler than the hardware required for 32-bit floating-point operations.
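To make the 1/8 figure concrete, the following is a minimal sketch of packing signed 4-bit weights into 32-bit words, eight per word. The function names `pack_4bit` and `unpack_4bit`, the two's-complement nibble encoding, and the slot ordering are assumptions made for illustration; they are not taken from any of the surveyed implementations.

```python
def pack_4bit(weights):
    """Pack signed 4-bit integers (-8..7) into 32-bit words, eight per word."""
    words = []
    for start in range(0, len(weights), 8):
        word = 0
        for slot, w in enumerate(weights[start:start + 8]):
            if not -8 <= w <= 7:
                raise ValueError("weight out of signed 4-bit range")
            # Keep the low four bits (two's complement) and place them in their slot.
            word |= (w & 0xF) << (4 * slot)
        words.append(word)
    return words


def unpack_4bit(words, count):
    """Recover `count` signed 4-bit integers from packed 32-bit words."""
    weights = []
    for word in words:
        for slot in range(8):
            nibble = (word >> (4 * slot)) & 0xF
            # Nibbles 8..15 represent the negative values -8..-1.
            weights.append(nibble - 16 if nibble >= 8 else nibble)
    return weights[:count]


# Sixteen weights would occupy sixteen 32-bit words at full precision.
weights = [3, -1, 7, -8, 0, 2, -5, 4, 1, -2, 6, -7, 5, -3, -4, -6]
packed = pack_4bit(weights)
print(len(weights), "weights stored in", len(packed), "words")
```

Here sixteen weights fit in two words instead of sixteen, so a single memory access fetches eight weights.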
Memory-specific optimizations are effective because they decrease or hide the overhead of accessing the large number of weights used in RNN computations. Memory access time can be decreased by storing all weights in on-chip memory. However, this can limit the applicability of the solution to larger models, as on-chip memory may not be sufficient to store their weights. Overlapping the memory access time with computation and computing in memory are also considered to be memory optimizations.
Energy efficiency: Applying algorithmic and memory access optimizations has a positive effect on energy efficiency as well. Algorithmic optimizations lead to a decrease in the number of computations, the complexity of computations, and the number of memory accesses, and so decrease the energy consumed by the implementation. Also, minimizing off-chip memory use by storing weights in on-chip memory is an effective way to enhance energy efficiency. Analog computing and processing in memory implementations showed superior energy efficiency in ASIC implementations.
Meeting real-time requirements was not an objective for many of the implementations. In a few of them, real-time deadlines were mentioned and followed in the design of the solution.
Flexibility: In Section II-A, flexibility is defined as a secondary objective. Thus, we do not expect flexibility to be fully met by the implementations. Variations in the RNN model were partially supported by many implementations. However, the number of variations covered by each implementation is quite low. Few implementations included other NN layers and variations in algorithmic optimizations. Online training was targeted by only one implementation. Embedded implementations do not usually support online training. However, on the algorithmic side, researchers are carrying out interesting work based on online or continuous training [10], [11]. None of the RNN implementations support different applications, but this has been done by the CNN solution in [123]. Following a similar method for RNNs, while also supporting model variations, could lead to interesting solutions.
Opportunities for future research: Based on the survey reported in this article, we list some opportunities for future research.
QRNN and SRU: QRNN and SRU (Section III-B6) are two alternatives to LSTM where the matrix to vector computations for the current time-step are independent of previous time-step computations. Thus, using them in RNN models can make the parallelization more efficient and consequently lead to better performance.
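The benefit of this independence can be sketched with a simplified SRU-style layer. The code below is an illustration only: it keeps just the forget gate and the state update, omits the output gating, and uses hypothetical names (`sru_like_layer`, `W`, `W_f`, `b_f`). All matrix to vector products depend only on the inputs, so they are computed for every time step up front, and only a cheap element-wise recurrence remains sequential.

```python
import math


def matvec(W, x):
    """Dense matrix to vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def sru_like_layer(W, W_f, b_f, inputs):
    """Simplified SRU-style layer (forget gate and state only)."""
    # Phase 1: no product depends on the previous time step, so these could
    # run in parallel or be batched into a single matrix to matrix product.
    candidates = [matvec(W, x) for x in inputs]
    forget = [
        [sigmoid(a + b) for a, b in zip(matvec(W_f, x), b_f)]
        for x in inputs
    ]

    # Phase 2: the only sequential part is an element-wise recurrence.
    c = [0.0] * len(W)
    states = []
    for f_t, cand_t in zip(forget, candidates):
        c = [f * c_prev + (1.0 - f) * v for f, c_prev, v in zip(f_t, c, cand_t)]
        states.append(c)
    return states


# Toy example: identity input weights and a forget gate fixed at 0.5.
W = [[1.0, 0.0], [0.0, 1.0]]
W_f = [[0.0, 0.0], [0.0, 0.0]]
b_f = [0.0, 0.0]
states = sru_like_layer(W, W_f, b_f, [[1.0, 2.0], [3.0, 4.0]])
print(states)
```

In an LSTM, by contrast, the products in the first phase would also involve the previous hidden state, so they could not be hoisted out of the time loop.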
DeltaRNN [87] and DeltaCNN [88]: We believe that applying the delta algorithm to both recurrent and convolutional layers is a logical step because of the temporal relation between the input sequences. Adding a delta step to other algorithmic optimizations such as pruning and quantization would decrease memory access and computation requirements.
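As a rough sketch of the delta idea, the function below updates a matrix to vector product incrementally and skips every input element whose change since it was last propagated does not exceed a threshold. The name `delta_matvec`, the argument layout, and the threshold handling are assumptions for illustration and are not taken from [87] or [88].

```python
def delta_matvec(W, x, x_ref, y_prev, theta):
    """Update y = W x incrementally, skipping inputs that changed by <= theta.

    x_ref holds the input values that were last propagated and is updated
    in place. Returns the new output and the number of weight columns used.
    """
    y = list(y_prev)
    used = 0
    for j, (xj, rj) in enumerate(zip(x, x_ref)):
        d = xj - rj
        if abs(d) <= theta:
            continue  # column j is neither fetched nor multiplied
        for i in range(len(W)):
            y[i] += W[i][j] * d
        x_ref[j] = xj
        used += 1
    return y, used


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x_ref = [0.0, 0.0, 0.0]
y = [0.0, 0.0]

# First time step: every input differs from the reference, so all columns are used.
y1, used1 = delta_matvec(W, [1.0, 2.0, 3.0], x_ref, y, theta=0.1)
# Second time step: only the third input changed by more than theta.
y2, used2 = delta_matvec(W, [1.05, 2.0, 4.0], x_ref, y1, theta=0.1)
print(y1, used1, y2, used2)
```

In the second step only one of the three weight columns is accessed. The result is an approximation of the exact product, since the small change in the first input is ignored, which is why the threshold has to be tuned against accuracy.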
Block-circulant matrices: Using block-circulant matrices as an algorithmic optimization decreases the RNN size while avoiding the irregularity of computation introduced by pruning [68]. Applying circulant matrices can be accompanied by low precision parameters and activations, with a small effect on accuracy [58]. With the addition of the delta algorithm, as mentioned earlier, RNN inference can achieve a promising throughput and energy efficiency.
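The size reduction can be illustrated with a small sketch in which each b x b block of the weight matrix is stored as a single vector of length b. The convention that this vector is the first column of the block, and the helper names below, are assumptions for illustration. The direct product is used here for clarity and does not show any FFT-based acceleration.

```python
def circulant_block_matvec(w, x):
    """Multiply the b x b circulant matrix with first column w by vector x."""
    b = len(w)
    return [sum(w[(i - j) % b] * x[j] for j in range(b)) for i in range(b)]


def block_circulant_matvec(blocks, x, b):
    """blocks[r][q] is the defining vector (length b) of block (r, q)."""
    y = []
    for block_row in blocks:
        acc = [0.0] * b
        for q, w in enumerate(block_row):
            part = circulant_block_matvec(w, x[q * b:(q + 1) * b])
            acc = [a + v for a, v in zip(acc, part)]
        y.extend(acc)
    return y


def expand(blocks, b):
    """Build the equivalent dense matrix, used only to check the result."""
    dense = []
    for block_row in blocks:
        for i in range(b):
            dense.append([w[(i - j) % b] for w in block_row for j in range(b)])
    return dense


# A 2 x 4 weight matrix stored as two vectors of length 2 (4 values, not 8).
blocks = [[[1, 2], [3, 4]]]
x = [1, 0, 2, 1]
y = block_circulant_matvec(blocks, x, 2)
dense = expand(blocks, 2)
y_dense = [sum(a * v for a, v in zip(row, x)) for row in dense]
print(y, y_dense)
```

Every block has the same regular structure, so the access pattern stays uniform, unlike the irregular sparsity left behind by pruning.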
Hybrid optimizations: It has been shown that a mix of algorithmic optimizations can be applied to an RNN model with an acceptable loss in accuracy [58]. Applying a mix of optimizations would enable the implementations to benefit from each optimization. For an RNN implementation, three classes of optimizations can be mixed and tuned. The first optimization is the delta algorithm, and the corresponding parameter is delta. The second is quantization, and the corresponding parameters are the number of bits and the quantization method. The third optimization is compression. If the applied compression technique is block-circulant matrices, the parameter is the block size. By tuning the parameters of these three optimizations (delta, number of bits, quantization method, and block size), the designer can achieve the highest performance while keeping the accuracy within an acceptable range (the range is dependent on the application).
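As a back-of-the-envelope aid for exploring this design space, the sketch below estimates how two of the parameters, the number of bits and the block size, combine to shrink weight storage. The function `weight_storage_bits` and the 1024 x 1024 layer size are hypothetical. The delta threshold is not modeled because it affects the run-time number of operations rather than storage, and the accuracy of each configuration would still have to be measured on the target application.

```python
def weight_storage_bits(rows, cols, bits=32, block_size=1):
    """Bits needed to store a rows x cols weight matrix.

    block_size=1 means no block-circulant compression. Otherwise one vector
    of length block_size is stored per block_size x block_size block.
    """
    if rows % block_size or cols % block_size:
        raise ValueError("block_size must divide both dimensions")
    stored_values = (rows * cols) // block_size
    return stored_values * bits


# Hypothetical 1024 x 1024 layer stored with 32-bit weights as the baseline.
baseline = weight_storage_bits(1024, 1024)
for bits in (16, 8, 4):
    for block_size in (4, 8, 16):
        compressed = weight_storage_bits(1024, 1024, bits, block_size)
        ratio = baseline / compressed
        print(f"bits={bits:2d} block_size={block_size:2d} ratio={ratio:.0f}x")
```

The two factors multiply: moving from 32 to 4 bits gives a factor of 8, and a block size of 8 gives another factor of 8, for a 64-fold reduction in weight storage.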
Analog computing and processing in memory: Analog computing [98] and processing in memory [71], [101] have shown promising performance, especially in energy efficiency. Analog crossbar-based matrix to vector multiplication units can provide low latency, and computing in memory overcomes the memory access problems.
Flexible neural networks and domain-specific architectures: Domain-specific architectures (DSAs) have been highlighted as a future opportunity in the field of computer architecture [124]. DSAs (also called accelerators or custom hardware) for neural network applications can reach high performance and energy efficiency. Designing an architecture for neural network applications as a specific domain with known memory access patterns enhances parallelism and the use of the memory hierarchy. It is possible to use lower precision and benefit from domain-specific languages (DSLs). Google Edge TPU is an example of a DSA for neural network inference using 8-bit precision [125]. Based on the study in this article, we add that a neural network DSA needs to support flexibility. For the flexibility aspects defined earlier in Section II to be fulfilled, some features need to be supported in the underlying hardware:
• Variable bit-width operations as in A10 [94] to support different quantization schemes.
• Some optimizations require pre/post-processing on input vectors and weights. Support for weight reordering, delta vector computation, reconstructing circulant matrices from their equivalent vectors, and other operations required by miscellaneous optimizations would be useful.
• Support for training, which would imply support for back-propagation and for modifying the weights.

VII. SUMMARY
Today we see a trend towards more intelligent mobile devices that process applications with streamed data in the form of text, voice, and video. To process these applications, RNNs are important because of their efficiency in processing sequential data. In this article, we have studied the state-of-the-art in RNN implementations from the embedded systems perspective. The article includes all the aspects required for the efficient implementation of an RNN model on embedded platforms. We study the different components of RNN models, with an emphasis on implementation rather than on algorithms. Also, we define the objectives that are required to be met by the hardware solutions for RNN applications, and the challenges that make them difficult to implement. For an RNN model to run efficiently on an embedded platform, some optimizations need to be applied. Thus, we study both algorithmic and platform-specific optimizations. Then, we analyze existing implementations of RNN models on embedded systems. Finally, we discuss how the objectives defined earlier in the article have been met and highlight possible directions for research in this field in the future.

We conclude from the analysis of the implementations that two kinds of optimization are used in most of the efficient implementations. The first is algorithmic optimizations. The second is to decrease the memory access time for weight retrieval, which can be implemented by relying on on-chip memory for storing the weights, applying an efficient overlap between weight loading and computations, or by computing in memory. However, using analog crossbar-based multipliers can achieve high performance without relying too much on algorithmic optimizations. A study of the implementations in the literature shows performance high enough for many streaming applications while also showing a lack of flexibility. Finally, we deduce some opportunities for research to fill the gap between the defined objectives and the research work under study. We highlight some hardware-efficient recurrent layers and algorithmic optimizations that can enhance the efficiency of implementations. Additionally, we describe how custom embedded hardware implementations can support flexible RNN solutions.

FIGURE 2: Generic model of RNNs with diverse recurrent layers.

FIGURE 3: Different variations of an RNN layer. (a) One to Many RNN. (b) Many to One RNN. (c) Many to Many RNN.

FIGURE 4: Unfolding RNN models through multiple time steps.

FIGURE 5: Stacked RNN. The first layer output is $h^1_t$ and the second layer output is $h^2_t$.

FIGURE 6: Optimizations applied to RNN applications with section numbers indicated, comparing the effect of different algorithmic optimizations on memory and computation requirements.

FIGURE 9: Effective throughput of different implementations and the key optimizations affecting them.

FIGURE 10: Energy efficiency of different implementations and the key optimizations used.

FIGURE 11: Percentage of implementations meeting flexibility aspects for different flexibility levels and the definition of flexibility levels.
TOMAS NORDSTRÖM received his M.S.E.E. degree in 1988, his licentiate degree in 1991, and his Ph.D. in 1995, all from Luleå University of Technology, Sweden. His Ph.D. thesis "Highly Parallel Computers for Artificial Neural Networks" bridged the two fields of computer engineering and signal processing, between which he has been moving ever since. Between 1996 and 1999, Tomas Nordström was with Telia Research (the research branch of the major Swedish telephone operator), where he developed broadband Internet communication over twisted copper pair. He also became Telia's leading expert in speaker verification during these years. In December 1999, he joined the FTW Telecommunications Research Center Vienna, Austria, where he worked as a Key Researcher in the field of "broadband wireline access". During his years at FTW, he worked on various aspects of wireline communications such as the simulation of xDSL systems, cable characterization, RFI suppression, exploiting the common-mode signal in xDSL, and more recently, dynamic spectrum management. In 2009 he was appointed Associate Professor in computer systems engineering at Halmstad University (HH), Sweden. At HH he returned to the area of computer architecture, and his current research interests include all aspects of energy-efficient embedded computers. In 2012 he became full Professor in Computer Engineering at HH and built up a research group focusing on heterogeneous many-core architectures. Additionally, he has been working in the field of dependable wireless communication, studying optimal usage of communication resources, dynamic spectrum management, and IoT reference architectures. In 2019 he became full Professor in Embedded and Intelligent Systems at Umeå University, Sweden, where his research focuses on edge computing, intelligent IoT systems, and high-performance embedded computing architectures and platforms.

TABLE 1: Comparison with related survey articles.

TABLE 2: RNN input layer types and their corresponding application domains and datasets.

• State computation: The role of this computation is to compute the new memory state $C_t$ of the LSTM cell. First, it computes the possible values for the new state

• Update gate: The update gate decides how much of the output is updated. The output of the update gate $z_t$ is computed as the reset gate output $r_t$ using the weight matrix $W_z$ as

TABLE 3: Comparing LSTM and its variations.

TABLE 4: Effect of quantization methods on accuracy.

TABLE 5: Effect of compression methods on accuracy.

TABLE 6: Effect of DeltaRNN method on accuracy.

TABLE 7: Detailed information about papers under study.

TABLE 8: Comparison of RNN implementations on FPGAs.

TABLE 9: Comparison of RNN implementations on ASIC and other platforms.

TABLE 10: Flexibility score of implementations under study.