Accelerating Forward Algorithm for Stochastic Automata on Graphics Processing Units

A stochastic automaton is a non-deterministic automaton with input and output behavior that works serially and synchronously. Stochastic automata are used in many application areas. For large state spaces and long sequences, their performance is a major concern, and graphics processing units (GPUs) can be employed to improve it. In this study, a parallel version of the inference algorithm for stochastic automata is designed and mapped to the GPU using dynamic parallelism. The performance of the parallel version is compared across different realizations and parameters. The parallel implementation of the inference algorithm achieves a speedup factor of approximately 50 for 256 states.


I. INTRODUCTION
Stochastic automata are probabilistic automata with input/output behavior. After reading an input symbol, a stochastic automaton emits an output symbol and moves into another state. Stochastic automata have been applied in language understanding [1], [2], modeling of soft real-time systems [3], [4], and machine learning. Typically, a stochastic automaton can have transitions between all pairs of states [5]. This can lead to high computational complexity for real-world problems, so there is a need to improve the performance of stochastic automata algorithms.
The performance of stochastic automata algorithms can be enhanced with the help of modern high-performance computing. Graphics Processing Units (GPUs) can efficiently solve complex problems. GPUs are many-core, massively parallel architectures that can perform computation-intensive tasks [6].
The Compute Unified Device Architecture (CUDA) programming model introduced by NVIDIA provides extensions to the C language and supports CPU/GPU execution. CUDA provides a hierarchy of thread groups, shared memories, and barrier synchronization [6]. The execution of a single thread in CUDA is sequential, but many threads can be executed in parallel to process different parts of the data [6], [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Imran Ahmed.
The threads are grouped into a two-level hierarchy of grids and blocks. A grid is composed of one or more blocks, and each block can have one or more threads. The efficiency of GPU programs depends on the hardware configuration, the utilization of the allocated hardware, and the amount of parallelism exhibited by the problem [6]-[10].
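The grid/block hierarchy can be illustrated with a small CPU-side sketch. It only simulates how a one-dimensional launch maps (block, thread) pairs to data items; the function names and the default of 128 threads per block are assumptions chosen for illustration, not values taken from this work.

```python
def launch_config(num_items, threads_per_block=128):
    """Return (blocks, threads) such that blocks * threads >= num_items.

    threads_per_block=128 is an illustrative default, not a value from the paper.
    """
    blocks = (num_items + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block


def global_index(block_idx, block_dim, thread_idx):
    """Flat index of a thread in a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx


def covered_items(num_items, threads_per_block=128):
    """Simulate the launch and list the items that pass the bounds check."""
    blocks, threads = launch_config(num_items, threads_per_block)
    items = []
    for b in range(blocks):
        for t in range(threads):
            i = global_index(b, threads, t)
            if i < num_items:  # threads past the end of the data do nothing
                items.append(i)
    return items
```

With 300 items and 128 threads per block, three blocks are needed and the last block is only partially used, which is one way hardware can end up underutilized.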
In stochastic automata, the inference algorithm, called the Forward algorithm, is a variant of the Viterbi algorithm [11]. The Forward algorithm is a dynamic programming algorithm for finding the optimal sequence of states. For m states and a sequence of length n, the Forward algorithm finds the path with time complexity O(m^2 n). This work presents the formulation and experimental setup of a parallel version of the Forward algorithm. The Forward algorithm is partitioned into data-independent and data-dependent parts. The data-independent part is implemented on the GPU to enhance efficiency.
The remainder of the paper is structured as follows. Section II provides a brief overview of stochastic automata. Sections III and IV discuss the Forward algorithm for stochastic automata and different approaches to parallelizing it. Section V presents the results. Finally, Section VI concludes the paper.

II. STOCHASTIC AUTOMATA
An automaton (plural: automata) is a control mechanism or an abstract computing device that performs a predetermined sequence of operations automatically. A finite automaton is an automaton with a finite number of states. An automaton that changes its state and gives output according to a probability is called a stochastic automaton. Stochastic automata were introduced as systems that change their state and produce an output with a probability that depends on the input and the current state [11].
Shannon [12] and Von Neumann [13] introduced the theory of discrete stochastic systems by working on memoryless communication channels and on the synthesis of reliable systems from unreliable components, respectively. The hidden Markov model (HMM) is a probabilistic network related to stochastic automata [5], [14], [15]. HMMs address evaluation, decoding, and learning problems [1], [16]. Formally, a stochastic automaton consists of [17], [18]:
• a nonempty finite set of states $Q$,
• an alphabet of input symbols $\Sigma$,
• an alphabet of output symbols $\Omega$, and
• a conditional probability distribution $p$ on $\Omega \times Q$.
A stochastic automaton can be considered an abstract machine that takes on a well-defined state at each time step of the computation. The conditional probability distribution $p(\cdot, \cdot \mid u, q)$ on $\Omega \times Q$ consists of non-negative numbers $p(v, q' \mid u, q)$ for all $q' \in Q$ and $v \in \Omega$ so that
$$\sum_{v \in \Omega} \sum_{q' \in Q} p(v, q' \mid u, q) = 1.$$
The conditional probabilities are extended from single symbols to words of equal length, $|u| = |v|$, for all $u, u' \in \Sigma^*$, $v, v' \in \Omega^*$, and $q, q' \in Q$. The implementation of stochastic automata can be constrained by the local maxima problem. This problem can be avoided by determining the allowed transitions for the given problem [5].
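The definition above can be made concrete with a minimal sketch. It assumes that states, input symbols, and output symbols are encoded as integers and that the conditional probabilities are stored in a nested list indexed as p[u][q][v][q_next]; this layout and the function names are hypothetical choices for illustration only.

```python
import random


def is_conditional_distribution(p, num_states, num_inputs, num_outputs, tol=1e-9):
    """Check that p(., . | u, q) is non-negative and sums to 1 for every (u, q).

    Assumed layout (hypothetical): p[u][q][v][q_next] = p(v, q_next | u, q).
    """
    for u in range(num_inputs):
        for q in range(num_states):
            total = 0.0
            for v in range(num_outputs):
                for q_next in range(num_states):
                    value = p[u][q][v][q_next]
                    if value < 0:
                        return False
                    total += value
            if abs(total - 1.0) > tol:
                return False
    return True


def step(p, u, q, rng):
    """Read input u in state q; sample an output symbol and a next state."""
    pairs = [(v, q_next)
             for v in range(len(p[u][q]))
             for q_next in range(len(p[u][q][v]))]
    weights = [p[u][q][v][q_next] for v, q_next in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]


# Example: 2 states, 1 input symbol, 2 output symbols, uniform probabilities.
uniform = [[[[0.25, 0.25], [0.25, 0.25]] for _ in range(2)]]
v, q_next = step(uniform, u=0, q=0, rng=random.Random(0))
```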

III. FORWARD ALGORITHM
Stochastic automata are abstract machines with input/output behavior. Consider the stochastic automaton $SA = (Q, \Sigma, \Omega, P, \pi, f)$ with an $m$-element set of states $Q$, an $m_1$-element input alphabet $\Sigma$, and an $m_2$-element output alphabet $\Omega$. The marginal distribution gives the probability $p(u, v)$ of an input sequence $U = u_1, u_2, \cdots, u_n \in \Sigma^n$ and an output sequence $V = v_1, v_2, \cdots, v_n \in \Omega^n$. Using the sum-product decomposition of this marginal distribution, an $n \times m$ matrix $F$ can be used to calculate the probability $p(u, v)$.
The aim is to determine the optimal state sequence $\bar{q} \in Q^{n+1}$. The Forward algorithm proceeds as follows:

    for q ← 1 to m do
        F[0, q] ← 1/(m·m_1)
    end for
    for i ← 1 to n do
        for q ← 1 to m do
            F[i, q] ← 0
            for q' ← 1 to m do
                F[i, q] ← F[i, q] + F[i - 1, q'] · p(v_i, q | u_i, q')
            end for
        end for
    end for
    p ← 0
    for q ← 1 to m do
        p ← sum(p, F[n, q])
    end for

Given the input sequence $U = u_1, u_2, \cdots, u_n$ and the output sequence $V = v_1, v_2, \cdots, v_n$, the Forward algorithm finds the sequence of states that corresponds to this input and output behavior [11]. The algorithm initializes the forward matrix $F$ with the fraction $\frac{1}{m \cdot m_1}$. Matrix entries are calculated using the values of the previous row. The algorithm finds the matrix entry $F[i, q]$ by adding the marginal probability of the current state to the preceding value:
$$F[i, q] = \sum_{q' \in Q} F[i-1, q'] \, p(v_i, q \mid u_i, q').$$
The computational time complexity of the Forward algorithm is $O(m^2 n)$.

The optimal state sequence $\bar{q}$ can be computed efficiently using the tropicalization of the sum-product decomposition [11]. For this, put $d(v, q' \mid u, q) = -\log p(v, q' \mid u, q)$. The sums and products in the sum-product decomposition are replaced by tropical addition and tropical multiplication [11]. By using the tropicalized sum-product decomposition, the term $d(u, v)$ can be calculated.
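The Forward recursion described above can be sketched in a few lines of serial Python. This is an illustrative sketch rather than the implementation evaluated in this work: it assumes integer-encoded symbols and states, a hypothetical nested-list layout p[u][q_prev][v][q] = p(v, q | u, q_prev), and the standard forward update in which each entry sums over all m predecessor states.

```python
def forward(p, u, v, num_states, num_inputs):
    """Return (probability, F) for input sequence u and output sequence v.

    Assumed layout (hypothetical): p[u][q_prev][v][q] = p(v, q | u, q_prev).
    F has n + 1 rows; row 0 holds the uniform initial value 1 / (m * m_1).
    """
    n = len(u)
    m = num_states
    F = [[0.0] * m for _ in range(n + 1)]
    for q in range(m):
        F[0][q] = 1.0 / (m * num_inputs)
    for i in range(1, n + 1):            # data-dependent: row i needs row i - 1
        for q in range(m):               # data-independent across q
            F[i][q] = sum(F[i - 1][q_prev] * p[u[i - 1]][q_prev][v[i - 1]][q]
                          for q_prev in range(m))
    return sum(F[n]), F


# Example: 2 states, 1 input symbol, 2 output symbols, uniform probabilities.
uniform = [[[[0.25, 0.25], [0.25, 0.25]] for _ in range(2)]]
prob, table = forward(uniform, u=[0, 0], v=[0, 1], num_states=2, num_inputs=1)
```

The two nested loops over states inside the loop over positions make the O(m^2 n) cost visible, and the comments mark which loop carries the dependency that limits parallelism.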
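The Tropical_Forward algorithm proceeds as follows, where ⊕ and ⊙ denote tropical addition and tropical multiplication and d(v_i, q | u_i, q') denotes the tropicalized transition parameter:

    for q ← 1 to m do
        F[0, q] ← 0
    end for
    for i ← 1 to n do
        for q ← 1 to m do
            F[i, q] ← ⊕ over q' of ( F[i - 1, q'] ⊙ d(v_i, q | u_i, q') )
        end for
    end for

In this way, $d(u, v)$ is computed. The high computational complexity limits the usage of stochastic automata. The optimal state sequence(s) can be obtained by the backward algorithm [11].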
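A serial sketch of the tropicalized recursion may help to fix ideas. It assumes the min-plus convention, in which tropical addition is the minimum, tropical multiplication is ordinary addition, and the weights are negative logarithms of the transition probabilities; the data layout p[u][q_prev][v][q] and the final minimum over the last row are likewise assumptions made for illustration.

```python
import math


def tropical_forward(p, u, v, num_states):
    """Return (d, F), where d is the minimal total weight over state paths.

    Assumptions (not taken from the paper): min-plus semiring, weights
    -log p, and the layout p[u][q_prev][v][q] = p(v, q | u, q_prev).
    """
    n = len(u)
    m = num_states
    F = [[0.0] * m for _ in range(n + 1)]   # row 0 is the tropical one, 0
    for i in range(1, n + 1):
        for q in range(m):
            best = math.inf                  # neutral element of min
            for q_prev in range(m):
                prob = p[u[i - 1]][q_prev][v[i - 1]][q]
                weight = -math.log(prob) if prob > 0 else math.inf
                best = min(best, F[i - 1][q_prev] + weight)
            F[i][q] = best
    return min(F[n]), F


# Example: 2 states, 1 input symbol, 1 output symbol.
p_example = [[[[0.9, 0.1]], [[0.5, 0.5]]]]
d, table = tropical_forward(p_example, u=[0, 0], v=[0, 0], num_states=2)
```

In this example the cheapest path stays in state 0 for both steps, so d equals the negative logarithm of 0.9 · 0.9.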

IV. ACCELERATING FORWARD ALGORITHM
For long sequences and large state spaces, the high complexity restricts the usage of stochastic automata. Large performance gains can be attained by mapping stochastic automata algorithms to the GPU. Researchers have mapped HMM-based applications to the GPU and achieved order-of-magnitude speedups. They have applied task-parallel [19]-[23], data-parallel [24]-[27], and combined task- and data-parallel [28]-[32] approaches to HMMs. Similar approaches can be adopted to improve the performance of stochastic automata.
Different approaches can be used to enhance the performance of the Tropical_Forward algorithm. One method is to compute the probabilities in advance and store them in a matrix. The downside is that this large matrix must be transferred into GPU memory. Moreover, the maximum number of sequences and states that can be processed is restricted by the small GPU memory size.
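A rough size estimate shows why the precomputed matrix becomes a bottleneck. The sketch below assumes a hypothetical layout with one m × m slice of single-precision values per sequence position; the actual layout of the matrix is not specified here, so the numbers are indicative only.

```python
def precomputed_table_bytes(sequence_length, num_states, bytes_per_value=4):
    """Size of a table with one m x m slice per sequence position.

    The layout is an assumption for illustration; bytes_per_value=4
    corresponds to single-precision floating point.
    """
    return sequence_length * num_states * num_states * bytes_per_value


# Largest configuration considered in the experiments: n = 32,768 and m = 256.
size_gib = precomputed_table_bytes(32768, 256) / 2**30
```

Under these assumptions the largest configuration needs 8 GiB for the table alone, before any working storage is allocated.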
Another approach to accelerating the Tropical_Forward algorithm is to divide the algorithm into sequential CUDA kernel calls. In this manner, n kernels are launched for n sequences (Figure 2). The number of blocks and threads for each kernel is selected according to the problem size. The parallel implementation of this approach uses both coarse-grained and fine-grained parallelism, depending on the state space. However, each kernel launch requires control to be transferred back to the CPU. For long sequences, the overhead of multiple kernel launches and executions can degrade performance.
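The split into sequential kernel calls can be mimicked on the CPU. In the sketch below, each call to `launch_row_kernel` stands in for one kernel launch and each call to `state_thread` for the work of one GPU thread; these names, the min-plus update, and the `weights_per_step` layout are assumptions for illustration, and nothing here runs on a GPU.

```python
def state_thread(q, prev_row, step_weights):
    """Work of one simulated thread: update state q from the previous row.

    step_weights[q_prev][q] is the (hypothetical) weight of moving from
    q_prev to q at the current sequence position.
    """
    return min(prev_row[q_prev] + step_weights[q_prev][q]
               for q_prev in range(len(prev_row)))


def launch_row_kernel(prev_row, step_weights):
    """One simulated kernel launch: all states are independent of each other."""
    return [state_thread(q, prev_row, step_weights)
            for q in range(len(prev_row))]


def host_loop(weights_per_step, num_states):
    """Host side: one launch per position, since row i depends on row i - 1."""
    row = [0.0] * num_states
    for step_weights in weights_per_step:
        row = launch_row_kernel(row, step_weights)
    return min(row)


weights = [[[1, 4], [2, 0]], [[3, 1], [5, 2]]]
result = host_loop(weights, num_states=2)
```

The host loop is the part that returns to the CPU between launches, which is where the launch overhead accumulates for long sequences.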
Dynamic parallelism can be employed to reduce the GPU kernel launch overhead. It can minimize the need to transfer execution control and data between the CPU and GPU [7]. This approach can be adopted to implement the Tropical_Forward algorithm. This work launches multiple kernels in blocks using dynamic parallelism. Each block processes an input sequence (Figure 3). The parallel implementation uses both coarse-grained and fine-grained parallelism. However, the configuration of execution parameters and the allocation of resources are major concerns for dynamic parallelism.

V. RESULTS AND DISCUSSION
In this section, performance results of the serial and parallel implementations of the Forward algorithm for stochastic automata are presented. We consider execution time and speedup for performance evaluation. The execution time is measured by averaging over twenty runs. Theoretical floating-point operations (FLOPs) are not considered as a performance measure, because a single floating-point operation can be transformed into multiple operations during compilation. Moreover, we used single-precision arithmetic and different optimization flags in the experiments.
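The measurement procedure can be sketched as follows. The helper names are hypothetical, and the sketch only covers host-side wall-clock timing; for GPU code, the timed function would also need to wait for the device to finish, since kernel launches are asynchronous.

```python
import time


def average_runtime(fn, runs=20):
    """Average wall-clock time of fn() in seconds over the given number of runs."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / runs


def speedup(serial_seconds, parallel_seconds):
    """Speedup of the parallel version relative to the serial version."""
    return serial_seconds / parallel_seconds


# Example with a trivial placeholder workload.
elapsed = average_runtime(lambda: sum(range(1000)), runs=20)
```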
The computing environment used for the implementation is an Intel Core i7 6700 CPU (3.40 GHz) with CUDA version 9.0 on an NVIDIA Titan XP graphics card. Different parameters and realizations were used to obtain the results. To achieve the highest performance, ECC mode was disabled. The tests were performed using different optimization flags such as maxrregcount and use_fast_math. The kernel exploited different memory optimization techniques by considering different memory types. The test data was generated randomly. The performance was measured using a constant sequence size with a variable state space and vice versa. The experimental results are examined for state spaces of up to 256 states and a maximum sequence size of 32,768.
The serial and parallel versions of the Tropical_Forward algorithm have approximately the same runtime for a small number of states. The reason is that the threads within a block do not fully utilize the hardware. There is also processing overhead for context switching between the CPU and GPU. The parallel version of the Tropical_Forward algorithm is therefore not suitable for small state spaces. However, the parallel version outperforms the serial implementation by an order of magnitude for large state spaces. Moreover, the average execution time of the parallel version is almost the same for all state-space sizes (Table 1).
Figure 5 shows the performance obtained by varying the number of states and fixing the sequence length. The parallel version performs much better than the serial version for a large number of states. By increasing the number of states and launching multiple kernels using dynamic parallelism, the performance of the parallel version increases by an order of magnitude. The average runtime is directly proportional to the sequence length. This is valid for all approaches. Table 2 provides more detail.
Next, the speedup is calculated by comparison with the serial implementation (Figures 6 and 7). Figure 6 illustrates the speedup obtained by varying the sequence length and fixing the number of states.
For small state spaces, the parallel version shows non-monotonic behavior. The reason is the small degree of parallelism exhibited by a small number of states. Large state spaces achieve better speedup due to their larger degree of parallelism. The maximum average speedup is approximately 49 (Table 3).
Finally, the speedup obtained by varying the state space and fixing the sequence size is shown in Figure 7. For large sequence sizes, the performance of the parallel version degrades. The reason is that a large number of blocks is created, and hardware limits how many blocks can execute in parallel. The maximum average speedup attained is approximately 21 (Table 4).

VI. CONCLUSION
Stochastic automata are a class of probabilistic automata with input/output behavior. Their high computational complexity limits their usage. This study presents a parallel version of the Forward algorithm, designed using dynamic parallelism. The parallel Forward algorithm achieves a speedup factor of approximately 49 for 256 states. This approach should also be investigated for the learning problem of stochastic automata.