Improving Characteristics of LUT-Based Moore FSMs

Almost any digital system includes sequential blocks which can be represented using the model of a finite state machine (FSM). It is very important to improve such characteristics of FSM circuits as the number of logic elements, operating frequency and consumed energy. The paper proposes a novel design method targeting a decrease in the number of look-up table (LUT) elements in logic circuits of FPGA-based Moore FSMs. The method is based on using two sets of variables for encoding the collections of outputs. It results in a partition of the set of outputs into two blocks. The outputs from the first block depend on state variables, the outputs from the second block on additional variables. A method is proposed for splitting the set of outputs. The conditions for using the proposed method are given. An example of synthesis is shown. The results of experiments with standard benchmarks are discussed. The experimental outcomes show that the proposed approach reduces the number of LUTs and the consumed energy. Also, it leads to an increase in the operating frequency. The method targets rather complex FSMs in which the number of state variables exceeds the number of LUT inputs.


I. INTRODUCTION
The model of finite state machine (FSM) is used very often for specification and design of sequential blocks of digital systems [1], [2]. It is used, for example, for implementing: 1) the hardware-software interfaces of embedded systems [3]; 2) the complex functions such as hyper-tangent and exponential functions [4], [5]; 3) the activation functions in deep neural networks [6], [7]; 4) some blocks for integral stochastic computing [8]; 5) different stages of cascaded digital processing systems [9]-[11]; 6) the control units of computers and other digital systems [12]-[14]. As follows from [15]-[19], the model of Moore FSM is very often applied in logic design. For this reason, we chose this model in our current research.
Since the mid-twentieth century, many methods have been developed for FSM design [20], [21]. To compare outcomes of these methods, three basic metrics are used. They are: 1) the hardware amount, 2) the performance and 3) the consumed energy [12], [21]. Nowadays, the hardware amount is determined as a chip area occupied by an FSM circuit [22]. The performance is determined by either the propagation time or operating frequency. The operating frequency is inversely proportional to the number of logic levels in an FSM circuit. The consumed energy depends strongly on the hardware amount [15], [19]. It is known [19] that reducing the hardware amount leads to improved performance and energy consumption. Our article aims at reducing the number of look-up table (LUT) elements and their levels in Moore FSM circuits implemented with field programmable gate arrays (FPGA).
The associate editor coordinating the review of this manuscript and approving it for publication was Abdallah Kassem.
The FPGAs are widely used for implementing FSMs [23], [24]. The majority of FPGAs are based on LUTs [25], [26]. A LUT together with a flip-flop forms a logic element (LE). A slice includes up to four LEs. A configurable logic block includes up to four slices [25], [26]. The LE's output could be either combinational or registered (connected with the flip-flop). As a rule, the number of inputs, S L , of a LUT does not exceed 6 [25], [26]. Very often, it leads to the necessity of functional decomposition for Boolean functions representing FSM circuits [27]- [30]. In turn, it results in an increase in the number of logic levels in a circuit. Also, it makes interconnections more complex. All this has a negative impact on both the operating frequency and power consumption [15], [31].
To improve the basic metrics of LUT-based FSMs, it is necessary to reduce the numbers of arguments in systems of Boolean functions (SBF) representing FSM circuits [32].
This can be achieved through the application of methods of structural decomposition [21], [33]. In this case, an FSM circuit is represented as a composition of several large logic blocks. Each block is represented by an SBF with unique systems of arguments and output functions [13], [14], [33], [34]. It leads to an increase in the number of different functions compared to FSM circuits based on the functional decomposition. But these functions are much simpler and their implementation requires less hardware than in the case when the functional decomposition is used [21].
The main contribution of this article is a novel approach for reducing the number of LUTs (and their levels) in the part of the Moore FSM circuit generating outputs. It allows improving all metrics of Moore FSMs.

II. BACKGROUND OF MOORE FSMs
A Moore FSM is defined as a 6-tuple [35], [36] including the following components: X = {x 1 , . . . , x L } is a finite set of inputs, Y = {y 1 , . . . y N } is a finite set of outputs, A = {a 1 , . . . , a M } is a finite set of internal states, the functions of transitions and outputs, and the initial state a 1 ∈ A. There are many methods used for representing FSMs. In this article, we use a state transition table (STT) to represent a Moore FSM [36].
An STT has the following columns: a m is a current state; a s is a state of transition; X h is an input signal determining the transition < a m , a s > and equal to a conjunction of some elements of the set X (or their complements); h is a number of transition (h ∈ {1, . . . , H }). There is a collection of outputs Y (a m ) ⊆ Y written in the column a m of an STT. It includes outputs y n ∈ Y generated in the state a m ∈ A.
An example of an STT of Moore FSM S 1 is given in Table 1. The sets X, Y and A could be derived from Table 1. When the set A is constructed, it is necessary to represent each state a m ∈ A by a binary code K (a m ) having R bits. It is a step of state assignment. The state variables T r ∈ T are used for creating codes K (a m ), where |T | = R.
If R = M , it is a one-hot state assignment. This method is very popular in FPGA-based design [37], [38]. But, for example, such systems as SIS [39] and ABC [22] by Berkeley use a binary state assignment. In this case, there is
$$R = \lceil \log_2 M \rceil. \tag{1}$$
We also use this approach in our article. State codes are kept in a state register (RG). It includes R flip-flops with common pulses of synchronization (Clock) and clearing (Start). As a rule, D flip-flops create the RG for LUT-based FSMs [12], [35]. To change the content of RG, input memory functions D r ∈ Φ are used, where |Φ| = R.
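The difference between the binary and one-hot code widths can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical (they do not come from any tool mentioned in the article), and the value M = 18 is the number of states of FSM S 1 used later in the text.

```python
def binary_code_width(num_states: int) -> int:
    """Return R = ceil(log2(M)), the minimum width of a binary state code."""
    if num_states < 1:
        raise ValueError("an FSM needs at least one state")
    # (M - 1).bit_length() equals ceil(log2(M)) for M >= 1 and avoids
    # floating-point rounding.
    return (num_states - 1).bit_length()


def one_hot_code_width(num_states: int) -> int:
    """Return R = M, the width of a one-hot state code."""
    if num_states < 1:
        raise ValueError("an FSM needs at least one state")
    return num_states


def binary_state_codes(num_states: int) -> list[str]:
    """Assign consecutive binary codes to the states, as bit strings.

    The consecutive numbering is an assumption made for illustration;
    real state assignment methods choose the codes much more carefully.
    """
    width = max(1, binary_code_width(num_states))
    return [format(m, f"0{width}b") for m in range(num_states)]


if __name__ == "__main__":
    M = 18  # number of states of FSM S1
    print(binary_code_width(M), one_hot_code_width(M))
    print(binary_state_codes(M)[:3])
```

For M = 18 the binary assignment needs 5 state variables, whereas the one-hot assignment needs 18.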
A Moore FSM logic circuit is represented by the following SBFs:
$$\Phi = \Phi(T, X); \tag{2}$$
$$Y = Y(T). \tag{3}$$
To find systems (2)-(3), an STT should be transformed into a structure table (ST) [14]. An ST is an expansion of the STT by the following columns: K (a m ) is a code of the current state a m ∈ A; K (a s ) is a code of the next state (state of transition) a s ∈ A; Φ h is a collection of input memory functions equal to 1 to load K (a s ) into RG. The systems (2)-(3) determine a logic circuit of Moore FSM U 1 (Fig. 1). In Fig. 1, the symbol LUTer denotes a block whose circuit is implemented with LUTs [21].
In FSM U 1 , the LUTerΦ implements the system (2), the LUTerY the system (3). If a function D r is generated as the output of some LUT, then this output is connected with a flip-flop. These flip-flops form a state register RG distributed among the logic elements. It explains the presence of pulses Clock and Start as inputs of LUTerΦ.
To improve the characteristics of Moore FSM's circuit, it is very important to reduce the chip area occupied by the circuit [2], [35]. The methods of solving this problem depend strongly on logic elements used for implementing FSM circuits [12], [32]. Let us analyze design methods targeting FPGA-based FSMs.

III. STATE-OF-THE-ART
VOLUME 8, 2020
The process of FSM design has always been associated with the necessity of solving some optimization problems [2]. As a rule, when designing FPGA-based FSMs, four basic optimization problems arise [13], [32]. They are: 1) the decrease in the chip area occupied by an FSM circuit (the hardware reduction); 2) the reduction in the signal propagation time (the increase in the clock frequency); 3) the reduction in power consumption and 4) the improvement of testability. In this article, we consider the first of these problems. The analysis of library [40] shows that for some benchmark FSMs there is L + R ≥ 20. At the same time, for modern LUTs there is S L ≤ 6 [25], [26]. Thus, the following condition very often takes place:
$$L + R > S_L. \tag{4}$$
If condition (4) is satisfied for some FSM, then the problem of hardware reduction arises for the block LUTerΦ. If the following condition takes place
$$R > S_L, \tag{5}$$
then the LUTerY is represented by a multi-level circuit. So, it is necessary to reduce the number of levels in the Moore FSM circuit if conditions (4)-(5) take place. There are four main approaches for solving this problem, namely:
1) The optimal state assignment [1], [2], [28].
2) The functional decomposition [27], [30], [41], [42].
3) The replacement of LUTs by embedded memory blocks (EMB) [1], [13], [15]-[17], [19], [32], [42].
4) The structural decomposition of an FSM circuit [13], [20], [21], [43].
We shall call the optimal state assignment a process of obtaining state codes allowing to reduce the number of arguments in functions (2)-(3). These functions are represented as sum-of-products (SOP). But functions (2) and (3) differ in nature. The functions (2) depend on the terms
$$F_h = A_m X_h \quad (h \in \{1, \dots, H\}). \tag{6}$$
In (6), the symbol A m stands for a conjunction of state variables corresponding to the state code K (a m ) from the h-th row of ST. The functions y n ∈ Y depend on terms A m (m ∈ {1, . . . , M }) determined above. The number of bits in K (a m ) can range from ⌈log 2 M⌉ to M . If R = M , it is a one-hot state assignment [15]. When the one-hot is used, only a single state variable forms a conjunction A m (m ∈ {1, . . . , M }). It decreases the number of arguments in terms (6). It leads to circuits with fewer LUTs and layers of logic than in the case of binary encoding. The results of investigations [15] show that one-hot is 'attractive for large FSMs, but a better implementation of small machines can be obtained using binary state assignment'. The results of investigations [17] show that binary encoding gives better results if L > 10.
One of the most popular state assignment algorithms is JEDI, which is distributed with the system SIS [39]. It targets a multi-level logic implementation. It maximizes either the size of common cubes in logic functions (the input dominant algorithm) or the number of common cubes in a logic function (the output dominant algorithm).
Modern industrial packages use different state assignment strategies. For example, there are the following methods used in the design tool XST of Xilinx [44]: the automatic state assignment; one-hot; compact; Gray codes; Johnson codes; speed encoding. The same methods are implemented in the design tool Vivado [45].
It is possible to encode the states a m ∈ A in such a manner that it minimizes the number of arguments in functions y n ∈ Y [32]. For example, the methods [46] could be used to solve this problem. It is important if the condition (5) takes place.
So, there are many state assignment methods. It is really difficult to say which is the best for a particular FSM.
The functional decomposition is very popular in FSM design [27], [30], [41], [42]. If the number of arguments for some function exceeds S L , then the original function is broken down into smaller and smaller components. There are three approaches in this area: serial, parallel and balanced decomposition. These approaches are used, for example, in systems DEMAIN [47] or PKMIN [48]. Obviously, there are program tools for functional decomposition in any CAD system targeting FPGA-based design. One of the best CAD tools using this approach is the ABC package by Berkeley [22], [49].
In the best case, an FSM circuit is implemented as a single EMB. It is possible if the following condition takes place [43]:
$$2^{L+R}(N + R) \le V_0. \tag{7}$$
In (7), the symbol V 0 stands for the number of cells for EMB configuration with a single output. Our investigations [43] of library [37] show that condition (7) is true for 68% of benchmarks.
If (7) is violated, then an FSM circuit could be implemented as: 1) a network of EMBs or 2) a network of LUTs and EMBs. The survey of different approaches for EMB-based design can be found in [30]. Let us point out that these methods could be used only if there are 'free' EMBs, which are not used for designing other parts of a digital system.
In the case of structural decomposition, an FSM circuit is represented by several blocks [21], [32]. Each block implements functions different from (2)-(3). For example, in the case of replacement of inputs x e ∈ X [14], the set X is replaced by a set P = {p 1 , . . . , p G } such that G ≪ L. Now functions (2) are represented as
$$\Phi = \Phi(T, P). \tag{8}$$
It leads to Moore FSM U 2 (Fig. 2), where the block LUTerP generates functions
$$P = P(T, X). \tag{9}$$
For Mealy FSMs, the encoding of collections of outputs (CO) could be used [20], [33]. In this case, each collection Y q ⊆ Y is encoded by the binary code K (Y q ). Let there be Q different collections in an FSM. In this case, R Q bits are necessary for their encoding, where
$$R_Q = \lceil \log_2 Q \rceil. \tag{10}$$
Let us use variables z r ∈ Z for the encoding, where |Z | = R Q . It leads to the so-called PY Mealy FSM (Fig. 3).
In PY Mealy FSM, the LUTerΦZ implements systems (2) and
$$Z = Z(T, X). \tag{11}$$
The LUTerY implements the functions
$$Y = Y(Z). \tag{12}$$
This method leads to a single-level circuit of LUTerY if the following condition takes place:
$$R_Q \le S_L. \tag{13}$$
If (13) is violated, then it is possible to use the mixed encoding of outputs [50]. In this case, some functions y n ∈ Y are represented as SOPs depending on the terms (6). Let us point out that this approach has never been used for Moore FSMs.
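The encoding of collections of outputs described above can be sketched in Python. The sketch rests on several assumptions that are not part of the article: the FSM is given as a hypothetical mapping from a label (a state or a transition) to the outputs generated there, the codes K(Y q) are simply consecutive binary numbers, and the single-level check is read as "the code width does not exceed the number of LUT inputs". All function names are illustrative.

```python
def collections_of_outputs(outputs_by_label):
    """Return the distinct collections of outputs (COs) in first-seen order."""
    seen = []
    for outputs in outputs_by_label.values():
        co = frozenset(outputs)
        if co not in seen:
            seen.append(co)
    return seen


def code_width(count: int) -> int:
    """Return ceil(log2(count)): the number of bits needed for `count` codes."""
    return (count - 1).bit_length() if count > 0 else 0


def encode_collections(cos):
    """Assign consecutive binary codes K(Y_q) to the COs (naive assumption)."""
    width = max(1, code_width(len(cos)))
    return {co: format(q, f"0{width}b") for q, co in enumerate(cos)}


def single_level_possible(cos, lut_inputs: int) -> bool:
    """Check whether every output can be produced by one LUT fed by the z_r."""
    return code_width(len(cos)) <= lut_inputs


if __name__ == "__main__":
    # Hypothetical toy FSM, not taken from the article's tables.
    fsm = {
        "a1": [],
        "a2": ["y1", "y2"],
        "a3": ["y3"],
        "a4": ["y1", "y2"],
        "a5": ["y1", "y3"],
    }
    cos = collections_of_outputs(fsm)
    print(len(cos), code_width(len(cos)))
    print(encode_collections(cos))
    print(single_level_possible(cos, lut_inputs=3))
```

In this toy example there are four distinct collections, so two variables z r are enough to encode them.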
In this article, we discuss a case when the condition (5) takes place for some Moore FSM S. In this case, there is a multi-level circuit of the block LUTerY generating the system of outputs Y = Y (T ). To diminish the number of LUTs required for generating outputs y n ∈ Y , we propose to divide the set Y into two disjoint sets (Y = Y L ∪ Y 0 ). We propose to use the method of encoding of collections of outputs for the outputs y n ∈ Y L . These outputs now depend on some additional variables from the set Z . The outputs y n ∈ Y 0 are still implemented as functions (3) depending on state variables T r ∈ T . The proposed method is an evolution of ideas from our work [50]. In [50], we proposed a method of mixed encoding of outputs for Mealy FSMs. In this article, we have adapted the approach of mixed encoding for Moore FSMs where outputs depend only on states.

IV. MAIN IDEA OF PROPOSED METHOD
Let us create a set V of COs Y q ⊆ Y for some Moore FSM. For example, there is Q = 17 in the case of Moore FSM S 1 . These collections are listed in Table 2.
Let us use LUTs with S L = 3 to implement an FSM circuit. Because R = ⌈log 2 18⌉ = 5, the condition (5) takes place for FSM S 1 . So, there is a multi-level circuit of LUTerY for U 1 -based circuit of S 1 .
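Using (10) gives R = R Q > S L = 3. Let us try to diminish the number of COs to reach the equality
$$R_Q = S_L. \tag{14}$$
To do it, we should eliminate some functions y n ∈ Y from the COs. Let I (y n ) be the number of COs that coincide with other COs once the output y n is removed from every CO Y i containing it. Therefore, the elimination of y n from Y i leads to a decrease in the number of COs by I (y n ).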
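The effect of eliminating one output on the number of COs can be illustrated with a short Python sketch. It assumes, as our own reading rather than the article's definition, that I(y n) is the difference between the numbers of distinct collections before and after y n is removed from every collection, and that an empty collection still counts as a collection. The data are a hypothetical toy example, and the function names are illustrative.

```python
def distinct_collections(collections, removed=frozenset()):
    """Return the set of distinct COs after deleting the `removed` outputs."""
    removed = frozenset(removed)
    return {frozenset(co) - removed for co in collections}


def reduction_by_output(collections, output) -> int:
    """Return I(y_n): how many COs disappear when `output` is eliminated."""
    before = len(distinct_collections(collections))
    after = len(distinct_collections(collections, {output}))
    return before - after


if __name__ == "__main__":
    # Hypothetical collections, not the ones from Table 2.
    cos = [{"y1"}, {"y1", "y2"}, {"y2"}, {"y2", "y3"}, {"y3"}]
    for y in ("y1", "y2", "y3"):
        print(y, reduction_by_output(cos, y))
```

In this toy example, eliminating y2 merges the most collections, so y2 would be the most attractive candidate for elimination.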
The elimination of y n results in a transformation of the set V into a set V 1 having Q 1 = Q − I (y n ) elements. Now, R 1 bits are enough to encode the COs Y q ∈ V 1 :
$$R_1 = \lceil \log_2 Q_1 \rceil. \tag{15}$$
Let the following condition take place:
$$R_1 = S_L. \tag{16}$$
In this case, the output y n is implemented as (3), whereas the outputs y m ∈ Y \ {y n } are represented as (12). If the condition (16) is violated, then it is necessary to find an output y m such that its elimination from COs Y q ∈ V 1 leads to the set V 2 having Q 2 elements. Now, R 2 variables are enough to encode the remaining COs:
$$R_2 = \lceil \log_2 Q_2 \rceil. \tag{17}$$
Let the following condition take place:
$$R_2 < R_1. \tag{18}$$

It means that the elimination of y m results in a further decrease in the number of bits in codes of COs. The process of elimination should be continued until the following condition is true:
$$R_i = S_L. \tag{19}$$
The subscript 'i' shows that i outputs have been eliminated. Let Y 0 be a set of eliminated outputs. If y n ∈ Y 0 , then y n is represented as (3). Let Y L be a set of outputs creating the COs Y q ∈ V i . If y n ∈ Y L , then it is represented as (12). It leads to Moore FSM U 3 shown in Fig. 4.
In FSM U 3 , the LUTerL implements outputs y n ∈ Y L , the LUTerR implements outputs y n ∈ Y 0 . The LUTerΦZ implements SBFs (2) and (11). So, the proposed approach results in the mixed encoding of outputs [50].
Let the outputs y n ∈ Y L create Q L collections. To encode them, R L variables are necessary:
$$R_L = \lceil \log_2 Q_L \rceil. \tag{20}$$
Obviously, the following condition should take place:
$$R_L \le S_L. \tag{21}$$
In this case, a single LUT is enough to implement the circuit for each function y n ∈ Y L . Because the condition (5) takes place, there is a multi-level circuit of LUTerR. To diminish the number of arguments in SOPs of functions y n ∈ Y 0 , it is possible to use the following approach for state assignment. Let us encode the states a m ∈ A in such a way that functions y n ∈ Y 0 are represented by the minimum possible number of intervals of R-dimensional Boolean space. Let us name such an approach the special state assignment (SSA). To execute the SSA, the methods from [46] could be used.
In this article, we propose the method of synthesis of Moore FSM U 3 . The method includes the following steps: 1) Deriving the set V from the initial STT.
The algorithm for splitting the set Y takes into account the number of LUT's inputs S L . It uses as inputs the set V and the value of S L . The algorithm requires not more than N cycles. The algorithm generates sets Y 0 and Y L as its output. In the beginning, the set Y L = Y and the set V L includes all COs Y q ⊆ Y (line 2).
The main idea of the method is to find the outputs y n ∈ Y L whose exclusion from Y L leads to the maximum possible reduction in the number of COs Y q ⊆ Y L \ {y n }. The search is organized as a cycle with a cycle variable k. It starts from the operator 3. Each cycle starts with organizing the queue γ (line 4). The queue includes outputs y n ∈ Y L .
For each output y n of the queue γ , the value of Q(y n ) is calculated. It is equal to the number of COs Y q ∈ V L containing y n . Next, the elements of γ are ranked in the descending order of the value of Q(y n ). Each cycle k (k ∈ {1, . . . , N }) can have up to I steps where I = |V L |.
During each step (starting from line 5), a single element of γ is analyzed (line 6). The value of Q i is calculated for a chosen output y n ∈ Y L . The Q i is equal to the number of COs excluded from V L due to transferring y n from Y L into Y 0 . Next, the value of R k is calculated (line 7) according to (22). If excluding the i-th element of γ leads to a reduced value of R Q (line 8), then it is necessary to check the condition (14). It is executed as the operator 9. If condition (14) is true, then the solution is found. The algorithm is finished (go to 17). Otherwise, the modification is executed for sets Y L and Y 0 (line 10). During this step, the output y i is excluded from Y L and included into Y 0 (Y L := Y L \ {y i } and Y 0 := Y 0 ∪ {y i }). The value of k is incremented (line 11). If k > N , then all outputs are analyzed and the algorithm is finished (line 12). Otherwise, the correction of V L is executed (line 13). After obtaining new COs, the next queue γ is created (line 14).
If condition (14) is violated (line 15), then the next element of γ should be analyzed (i := i + 1). If the queue is not exhausted, then its next element should be analyzed (go to 6 in line 16). Otherwise, the algorithm is terminated.
Let us apply this algorithm to Moore FSM S 1 . As follows from Table 2, there is Q = 17. Using (10) gives R Q = 5. Let us use LUTs having S L = 3 inputs. So, the condition (13) is violated and it is necessary to divide the set Y . The process of dividing is shown in Table 4.
There are outputs y n , values of Q(y n ) and Q i shown in the corresponding columns of Table 4. If some output y n ∈ Y is taken for analysis during the cycle i, then it is shown in the brackets in the column Q(y n ). The sign '+' shows that the corresponding output is included into Y 0 . The sign '-' means that the output y n is excluded from the analysis. The row Y 0 shows the outcome of dividing.
Analysis of Q(y n ) allows creating the queue γ =< y 3 , y 1 , y 2 , y 4 , y 6 , y 5 , y 7 , y 8 , y 9 >. Including y 3 into Y 0 (cycle 1) gives Q 1 = 1 and R 1 = 5. Because R 1 = R Q , the analysis should be continued for the next element of the γ . The analysis of Y q (cycle 2) shows that R 1 = 4. Because there is R 1 < R Q , the output y 1 is included into the set Y 0 . Because the condition (13) is violated, the next step should be executed. To do it, the new table of COs should be constructed (Table 5).
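The splitting of the set Y can be approximated by a simple greedy procedure. The sketch below is not the algorithm from the article: it does not build the queue γ, it stops as soon as the code width no longer exceeds S L (rather than at an exact equality), it counts the empty collection as a collection, and it breaks ties by output name. All names are hypothetical and the data are a toy example.

```python
def code_width(count: int) -> int:
    """Return ceil(log2(count)): the number of bits needed for `count` codes."""
    return (count - 1).bit_length() if count > 0 else 0


def restrict(collections, kept):
    """Return the distinct COs that remain when only `kept` outputs are encoded."""
    return {frozenset(co) & kept for co in collections}


def split_outputs(collections, lut_inputs: int):
    """Greedily move outputs from Y_L to Y_0 until the CO codes fit into a LUT.

    Returns the pair (Y_0, Y_L). Outputs in Y_0 stay functions of the state
    variables; outputs in Y_L are decoded from the additional variables z_r.
    """
    if lut_inputs < 0:
        raise ValueError("the number of LUT inputs cannot be negative")
    y_l = set().union(*map(set, collections))
    y_0 = set()
    while code_width(len(restrict(collections, y_l))) > lut_inputs:
        # Choose the output whose removal leaves the fewest distinct COs;
        # iterating over sorted names makes tie-breaking deterministic.
        best = min(sorted(y_l),
                   key=lambda y: len(restrict(collections, y_l - {y})))
        y_l.remove(best)
        y_0.add(best)
    return y_0, y_l


if __name__ == "__main__":
    # Toy example: all eight subsets of {y1, y2, y3} occur as collections.
    cos = [set(), {"y1"}, {"y2"}, {"y3"}, {"y1", "y2"},
           {"y1", "y3"}, {"y2", "y3"}, {"y1", "y2", "y3"}]
    print(split_outputs(cos, lut_inputs=2))
```

In the toy example, eight collections need three code bits, which exceeds two LUT inputs; moving one output into Y 0 leaves four collections, which fit.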

VI. EXAMPLE OF SYNTHESIS
Let us discuss an example of synthesis for Moore FSM S 1 . The steps 1 and 2 are already executed for this example. There is M = 18. Using (1) gives R = 5 and T = {T 1 , . . . , T 5 }.
Let us execute the special state assignment for FSM S 1 . Using the method from [46] gives the state codes shown in Fig. 5.
Let us encode the COs Y q ⊆ Y L using the method [46]. It targets diminishing the number of arguments in the functions (12). The outcome is shown in Fig. 7.
The system (11) is generated by the LUTerΦZ. So, the ST of Moore FSM U 3 should include the column Z h . This column contains variables z r ∈ Z equal to 1 in the code of CO Y q for a state a s from the row h (h ∈ {1, . . . , H }). Table 7 is an ST for Moore FSM S 1 . It includes state codes from Fig. 5, the functions D r ∈ Φ = {D 1 , . . . , D 5 } and variables z r ∈ Z . Now, the column a m includes only outputs y n ∈ Y 0 . The column q includes the subscripts of COs Y q from Table 6.
Let us explain, for example, the row 2 of Table 7. There is a m = a 2 . As follows from Table 1, Y (a 2 ) = {y 1 , y 2 , y 3 }. So, there is y 1 in the column a m . There is a s = a 3 with Y 3 = {y 3 , y 4 , y 6 } (Table 1). After elimination of y 6 ∈ Y 0 , the CO Y 3 = {y 3 , y 4 } (Table 6) should be generated in the state a 3 . As follows from Fig. 7, there is K (Y 3 ) = 010. So, there is the symbol z 2 in the column Z h . The state codes are taken from Fig. 5.
The functions (2) and (11) are derived from Table 7. They depend on the terms (6). It is done in the trivial way [14].
As follows from Table 7, the outputs y n ∈ Y 0 are represented by the system (23). Using state codes (Fig. 5) transforms the system (23) into the system (24). Using Table 6 and codes from Fig. 7, we can get the system (25) for the outputs y n ∈ Y L . As follows from (24), there is only a single LUT in the circuit of LUTerR. As follows from (25), there are 6 LUTs in the circuit of LUTerL.
We do not discuss the last step of synthesis for this example. This step is connected with using some standard tools [25], [26] and a VHDL model of U 3 .

VII. EXPERIMENTAL RESULTS
To investigate the efficiency of the proposed method, we use standard benchmarks from the LGSynth93 library [40]. The library includes 48 Mealy FSMs taken from the practice of FSM design. They are presented in the KISS2 format. We transform these Mealy FSMs into equivalent Moore FSMs using three rules taken from [14]. Rule 1: if K different collections of outputs are generated during transitions into a state a m ∈ A of Mealy FSM, then this state is replaced by K states ($a_m^1, \dots, a_m^K$) of equivalent Moore FSM. Rule 2: each of the states $a_m^k$ (k ∈ {1, . . . , K }) has a unique collection of outputs. Rule 3: all states $a_m^k$ ∈ A have the same function of transition. The characteristics of obtained Moore FSMs are shown in Table 8.
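The three transformation rules can be sketched as a small Python function. This is a simplified illustration under our own assumptions, not the K2F tool: a Mealy FSM is given as a list of hypothetical tuples (source, condition, destination, outputs), state names are strings, and the reset copy of the initial state is assumed to carry an empty collection of outputs.

```python
def mealy_to_moore(transitions, initial):
    """Split Mealy states into Moore states, one per entering CO.

    Returns (moore_states, moore_transitions), where a Moore state is the
    pair (mealy_state, collection_of_outputs).
    """
    transitions = [(s, c, d, frozenset(o)) for s, c, d, o in transitions]

    # Rules 1 and 2: one copy of a state per distinct CO generated on the
    # transitions entering it; each copy carries exactly that CO.
    variants = {}
    for src, _cond, dst, outs in transitions:
        variants.setdefault(dst, set()).add(outs)
        variants.setdefault(src, set())
    # Assumption: after reset the initial state produces no outputs.
    variants.setdefault(initial, set()).add(frozenset())
    # States that are never entered still need a single copy.
    for cos in variants.values():
        if not cos:
            cos.add(frozenset())

    moore_states = [(state, co)
                    for state in sorted(variants)
                    for co in sorted(variants[state], key=sorted)]

    # Rule 3: every copy of a state inherits all transitions of that state.
    moore_transitions = [((src, co), cond, (dst, outs))
                         for src, cond, dst, outs in transitions
                         for co in sorted(variants[src], key=sorted)]
    return moore_states, moore_transitions


if __name__ == "__main__":
    # Hypothetical two-state Mealy FSM.
    mealy = [
        ("a1", "x1", "a2", {"y1"}),
        ("a1", "~x1", "a2", {"y2"}),
        ("a2", "1", "a1", set()),
    ]
    states, trans = mealy_to_moore(mealy, initial="a1")
    print(len(states), len(trans))
```

In the toy example, the state a2 is entered with two different collections, so it is split into two Moore states. This illustrates why a Moore FSM usually has more states than an equivalent Mealy FSM.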
To use these benchmarks, we use the CAD tool named K2F [43]. It translates the KISS2 file into a VHDL model of an FSM. To synthesize and simulate the FSM, we use the Active-HDL environment. To get the FSM circuit, we use Xilinx Vivado 2019.1 tool [45]. The investigation path used in our system is shown in Fig. 8. The Xilinx Vivado 2019.1 package was used for synthesis and implementation of FSM for a given benchmark. The target platform was the FPGA device Xilinx Virtex-7 (XC7VX690tffg1761-2). It includes LUTs with S L = 6 [44].
The column U 3 of Table 8 shows the feasibility of using the proposed method for particular benchmarks. If condition (5) is true, then our method can be used. We marked these benchmarks with a plus in the corresponding row of Table 8.
There are 16 signs '+' in Table 8. So, only the corresponding 16 benchmarks are used in our research.
We compared our approach with four other methods. They are: 1) Auto of Vivado; 2) one-hot of Vivado; 3) JEDI-based U 1 and 4) DEMAIN-based U 1 . The results of experiments are shown in Table 9 (for the number of LUTs), Table 10 (for the operating frequency, MHz), and Table 11 (for the consumed energy, Watts).
These tables are organized in the same manner. Their rows are marked with the names of benchmarks, the columns by design methods. The rows 'Total' include results of summations for numbers from each column. We took as 100% the totals for the method U 3 . The rows 'Percentage' show the percentage of summarized characteristics relative to the results obtained for U 3 .
As follows from Table 9, the proposed method reduces the number of LUTs compared to the other researched methods. There is the following gain: 1) 35% relative to Auto; 2) 58% relative to one-hot; 3) 13% relative to JEDI-based FSMs and 4) 24% relative to DEMAIN.
In all cases studied, our approach produces FSM circuits having exactly a single level of LUTs for blocks generating output functions y n ∈ Y . Due to this, U 3 -based FSMs have a higher operating frequency than the other methods used in the research. As follows from Table 10, our approach gives the following gain in the operating frequency: 1) 32.5% compared to Auto; 2) 32.5% compared to one-hot; 3) 10.2% compared to JEDI and 4) 28% compared to DEMAIN.
Reducing the numbers of LUTs and their levels allowed obtaining FSM circuits with lower energy consumption than for the other methods. As follows from Table 11, U 3 -based FSMs have the following gain in consumed energy: 1) 30.4% in comparison with Auto; 2) 40.8% in comparison with one-hot; 3) 12.1% in comparison with JEDI and 4) 22.9% in comparison with DEMAIN.
So, if R > 6, then our approach gives better results than Auto, one-hot, JEDI and DEMAIN. Of course, it is true only for benchmarks [40] and the device XC7VX690tffg1761-2. Let us point out that we conducted similar research using the device Xilinx Virtex-5 (XC5VLX30FF324) having LUTs with S L = 6. To do it, we used the Xilinx ISE 14.1 package [44]. The results of these investigations confirmed our hypothesis about the feasibility of using the model U 3 when the condition (5) is met.
Each of the quantities from Tables 9-11 evaluates only one of the characteristics of FSM circuits. In this article, we propose a comprehensive assessment that takes into account all three characteristics (the number of LUTs, operating frequency and consumed energy). We propose to evaluate an FSM circuit by the value β defined by (26). The value of β is the inverse of the amount of energy (mJ) consumed per LUT of an FSM circuit.
We show these values for generalized characteristics of FSMs (Tables 9-11) in Table 12. We use total characteristics to get the average value of β (row β of Table 12). We took as 100% the values of β for U 3 . As follows from the row 'Percentage', our approach improves this general characteristic in comparison with the other researched methods.

VIII. SOME IMPORTANT ADDITIONAL ISSUES
A sequential block can be represented as either Mealy or Moore FSM. For both Mealy and Moore FSMs, the input memory functions are represented as the system (2). As a rule, a Moore FSM has more states than an equivalent Mealy FSM [14]. It makes the system (2) of a Moore FSM more complex than its counterpart of an equivalent Mealy FSM. But the outputs of Mealy FSMs depend on inputs and state variables [14], [37]:
$$Y = Y(T, X). \tag{27}$$
Obviously, the functions (27) have more arguments than functions (3) of an equivalent Moore FSM. So, each FSM model has its own advantages and disadvantages. It is impossible to say unequivocally that one model is always better than another. Let us analyze the influence of specifics of Mealy and Moore FSMs on optimization methods for LUT-based design. We hope this will help to show more clearly the novelty of our approach to optimization of LUT-based Moore FSMs.
As follows from (27), the one-hot codes of outputs are generated by the block LUTerY. These codes have N bits. In Moore FSMs, outputs depend only on state variables. Here, state codes play the role of maximum codes of collections of outputs. In Moore FSMs, the LUTerY transforms these maximum codes into one-hot codes of outputs. Due to this difference, different approaches are used to optimize characteristics of the part of FSM circuit generating outputs y n ∈ Y .
As follows from (2), for Mealy FSMs, functions f i ∈ Φ ∪ Y depend on terms (6). Let NA(f i ) be the number of arguments of the function f i . If the condition
$$NA(f_i) > S_L \tag{28}$$
takes place, then the corresponding function f i (i ∈ {1, . . . , N + R}) is represented by a multi-level circuit. It is known that multi-level circuits have lower operating frequency and consume more power than their single-level counterparts [2], [33]. If condition (28) takes place, then a corresponding circuit can be optimized by elimination of the direct dependence of outputs y n ∈ Y on inputs x l ∈ X . This could be done using the encoding of collections of outputs. This approach allows improving the characteristics of Mealy FSMs. It leads to PY Mealy FSMs shown in Fig. 3. But if condition (13) is violated, then the circuit of LUTerY is multi-level. To diminish the number of levels in circuits implementing FSM outputs, we proposed the approach of mixed encoding of collections of outputs [50].
In the case of mixed encoding, some outputs y n ∈ Y 0 depend on terms (6). They are represented by one-hot codes generated by LUTerΦZ. The outputs y n ∈ Y R form new collections of outputs which are encoded by maximum binary codes. These codes are transformed into one-hot codes of y n ∈ Y R by LUTerY. This approach leads to PY M Mealy FSMs shown in Fig. 8. Our research [50] shows that this approach significantly reduces the number of LUTs in circuits of Mealy FSMs. The negative effect of this approach is a decrease in the operating frequency due to elimination of direct dependence between outputs and inputs.
This method cannot be used to optimize the circuit of Moore FSM because its outputs do not directly depend on inputs x l ∈ X . Given this feature, we have adapted the approach of mixed encoding of collections of outputs proposed in [50]. If condition (5) takes place, then we construct two sets of collections of outputs. The outputs y n ∈ Y L form collections encoded by maximum binary codes. To encode them, additional variables z r ∈ Z are used, where the set Z includes R L = S L variables. These codes are transformed by LUTerL into one-hot codes (Fig. 4). Because R L = S L , a single LUT is enough to implement any function y n ∈ Y L . The collections of outputs y n ∈ Y 0 are encoded using state variables. So, these outputs are still implemented as functions (3). They are generated by LUTerR. We execute the partition of the set Y into sets Y L and Y 0 in a way minimizing the number of outputs in the set Y 0 . Such an approach allows encoding of states in a way minimizing the number of literals in functions y n ∈ Y 0 . The minimization can decrease the number of LUTs and their levels in the circuit of Moore FSM U 3 compared with equivalent FSM U 1 .
So, the proposed approach is a new one. It is not the mixed encoding of collections of outputs proposed in [50]. We propose a quite different way of reducing the numbers of LUTs in circuits of Moore FSMs. In this new approach, only maximum codes of collections of outputs are used. This is the main scientific contribution of this article. Until now, we have considered FSMs as separate blocks. However, they are parts of digital systems. So, FSMs interact with other parts of digital systems. Theoretically, the Mealy FSMs have benefits such as lower resource usage and faster response to their inputs. But in the case of FPGA-based digital systems, these benefits can be significantly reduced.
For Mealy FSMs, outputs (27) are generated in parallel with input memory functions. But in practice, outputs (3) can depend on up to L + R ≈ 30 arguments [14]. Due to the rather small number of LUT inputs (S L ≤ 6), it is necessary to use the methods of functional decomposition. After decomposition, the clock cycle time increases. For Moore FSMs, outputs (3) depend on R arguments. Moreover, there are methods of state encoding [36], [47] which allow minimizing the number of arguments in system (3). In the best case, there are exactly N LUTs in the circuit of LUTerY. This leads to a reduction of the difference in the operating frequency for equivalent Mealy and Moore FSMs.
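The effect of the number of arguments on the number of LUT levels can be estimated with a short Python sketch. It computes only a lower bound: a network of LUTs with S L inputs and d levels can depend on at most S L^d arguments, so at least the smallest such d is needed. The function name min_lut_levels is hypothetical, and a real functional decomposition may require more levels than this bound.

```python
def min_lut_levels(num_args, lut_inputs):
    """Lower bound on the number of LUT levels for a function that
    depends on num_args arguments when each LUT has lut_inputs inputs.

    A d-level network of such LUTs reaches at most lut_inputs**d arguments.
    """
    if num_args < 1 or lut_inputs < 2:
        raise ValueError("need num_args >= 1 and lut_inputs >= 2")
    levels, reach = 1, lut_inputs
    while reach < num_args:
        reach *= lut_inputs
        levels += 1
    return levels


S_L = 6
print(min_lut_levels(30, S_L))  # output depending on about 30 arguments
print(min_lut_levels(6, S_L))   # output depending on at most S_L arguments
```

With S L = 6, a function of about 30 arguments cannot be implemented in fewer than two levels, while a function of at most six arguments needs a single LUT.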
Next, it is known [2], [24] that outputs of Mealy FSM are not stable. To make them stable, it is necessary to use an additional output register having N latches. To operate, the register consumes power. Also, it is necessary to use an additional circuit to deliver the master clock to the output register. This circuit consumes some resources of an FPGA chip (LUTs, interconnections, power). The output register further increases the cycle time. For Moore FSMs, only the state register is required, which includes only R flip-flops. If R ≪ N , then the state register of a Moore FSM consumes significantly less power than a pair <state register, output register> of equivalent Mealy FSM. In the case of Moore FSM U 3 , it is necessary to use flip-flops to keep additional variables z r ∈ Z . But even in this case, only (R + R L ) ≤ 2R flip-flops are necessary.
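The register sizes compared above can be tabulated with a minimal Python sketch. The function name memory_elements and the numeric values are hypothetical; the sketch counts memory elements only and says nothing about the power of the clock distribution circuit. The bound R + R L ≤ 2R is checked under the assumption R L ≤ R, which holds when the number of state variables exceeds the number of LUT inputs and R L = S L .

```python
def memory_elements(R, N, R_L):
    """Count memory elements for three FSM organizations.

    R   : number of state variables (flip-flops of the state register)
    N   : number of outputs (elements of a Mealy output register)
    R_L : number of additional variables z_r
    """
    return {
        "mealy_with_output_register": R + N,
        "moore": R,
        "moore_with_additional_variables": R + R_L,
    }


# Hypothetical FSM: R = 8 state variables, N = 20 outputs, R_L = S_L = 6.
counts = memory_elements(R=8, N=20, R_L=6)
for name, value in counts.items():
    print(name, value)
# The bound from the text: R + R_L <= 2R whenever R_L <= R.
assert counts["moore_with_additional_variables"] <= 2 * 8
```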
In the case of Moore FSMs, there are effective methods [24], [33] for optimizing the LUT-based circuits of blocks LUTer generating input memory functions (1). These methods allow getting a circuit of LUTer having practically the same number of LUTs as for an equivalent Mealy FSM. The methods [24], [33] use classes of pseudoequivalent states of Moore FSMs. A single class of pseudoequivalent states corresponds to a state of an equivalent Mealy FSM. We do not discuss this approach in our article; the corresponding methods can be found in [24], [33].
It follows from the above that it is difficult to say which FSM model (Mealy or Moore) is better for implementing the LUT-based circuit of a particular sequential block. It depends on characteristics of both the FSM (the numbers of inputs, outputs and states, the ratio of the number of functions for which the condition (28) is met to the total number of functions) and the FPGA (the number of LUT inputs, the architecture of a configurable logic block). The criteria of optimality of a digital system as a whole are also significant. For example, for embedded systems [3], it is very important to reduce the power consumption. It is quite possible that a Moore FSM-based sequential block will consume less power than a Mealy FSM-based block. Therefore, the final decision on which FSM model to use is made by the designer of a particular digital system.

IX. CONCLUSION
The paper presents an original approach targeting FPGA-based Moore FSMs. The proposed design method leads to FSM circuits having a single level of LUTs in the blocks generating outputs y n ∈ Y . The method is based on dividing the set of outputs Y into the sets Y 0 and Y L . The outputs y n ∈ Y 0 depend on state variables, the outputs y n ∈ Y L on some additional variables. The splitting of outputs is performed so that the condition (21) takes place.
The experiments show that this approach reduces such FSM characteristics as the number of LUTs, consumed energy and delay. We compared our approach with four other methods. The experiments were conducted with the Xilinx Vivado 2019.1 package. Also, we propose a generalized characteristic for comparing different design methods. It is directly proportional to the FSM operating frequency and inversely proportional to the product of consumed power and the number of LUTs in an FSM circuit.
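The generalized characteristic can be sketched in Python as follows. Only the proportionality stated above is used; the constant of proportionality is assumed to be 1, and the function name generalized_characteristic, the units and the numeric values are hypothetical rather than taken from the experiments.

```python
def generalized_characteristic(frequency, power, luts):
    """Figure of merit: proportional to the operating frequency and
    inversely proportional to the product of consumed power and the
    number of LUTs (proportionality constant ASSUMED to be 1)."""
    if power <= 0 or luts <= 0:
        raise ValueError("power and LUT count must be positive")
    return frequency / (power * luts)


# Hypothetical circuits: (frequency in MHz, power in W, number of LUTs).
baseline = generalized_characteristic(200.0, 0.5, 20)
improved = generalized_characteristic(220.0, 0.5, 16)
print(baseline, improved)
```

Under this definition a larger value indicates a better trade-off, so a method that raises the frequency while lowering the power or the LUT count obtains a higher score.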
There is a single limitation to the application of the proposed method. Namely, it does not make sense to use our approach if the number of state variables does not exceed the number of LUT inputs.