REDRESS: Generating Compressed Models for Edge Inference Using Tsetlin Machines

Inference at the edge using embedded machine learning models is associated with challenging trade-offs between resource metrics, such as energy and memory footprint, and performance metrics, such as computation time and accuracy. In this work, we go beyond the conventional Neural Network based approaches to explore the Tsetlin Machine (TM), an emerging machine learning algorithm that uses learning automata to create propositional logic for classification. We use algorithm-hardware co-design to propose a novel methodology for training and inference of TM. The methodology, called REDRESS, comprises independent TM training and inference techniques to reduce the memory footprint of the resulting automata to target low and ultra-low power applications. The array of Tsetlin Automata (TA) holds learned information in binary form as bits $\lbrace 0,1\rbrace$, called excludes and includes, respectively. REDRESS proposes a lossless TA compression method, called include-encoding, that stores only the information associated with includes to achieve over 99% compression. This is enabled by a novel, computationally minimal training procedure, called Tsetlin Automata Re-profiling, which improves the accuracy and increases the sparsity of TA to reduce the number of includes and, hence, the memory footprint. Finally, REDRESS includes an inherently bit-parallel inference algorithm that operates on the optimally trained TA in the compressed domain, without requiring decompression at runtime, to obtain high speedups when compared with state-of-the-art Binary Neural Network (BNN) models.
In this work, we demonstrate that, using the REDRESS approach, TM outperforms BNN models on all design metrics for five benchmark datasets, viz. MNIST, CIFAR2, KWS6, Fashion-MNIST and Kuzushiji-MNIST. When implemented on an STM32F746G-DISCO microcontroller, REDRESS obtained speedups and energy savings ranging from 5× to 5700× compared with different BNN models.


I. INTRODUCTION
THE emergence of AI in sensor based systems has empowered greater model functionality, thus allowing for substantial advances in intelligent wearables, personalized healthcare [1] and smarter and more sustainable cities [2]. To address the burden of long inference times and to meet embedded device resource constraints, offloading the data for cloud computation is often the method of choice [3], [4]. This comes with the added issues of privacy, increased latency and network connectivity.
To enable true wide-spread integration of AI sensors for low and ultra-low power edge inference there is a need for focused design effort towards delivering energy efficient and memory frugal implementations [5], [6], [7]. The prevailing approach to edge inference is based on Neural Network (NN) models [5], [8]. However, for a given inference application, designers are forced to choose intelligent trade-offs through hardware-software codesign considerations as the deep neural network (DNN) models are resource hungry in terms of storage, runtime memory (RAM) and computation. These trade-offs stem from finding the balance between two key considerations: the memory and compute limitations of the target platform and selecting a trained model that achieves a competitive accuracy within acceptable latency [9]. Recently, dedicated Application Specific Integrated Circuits (ASIC) have been designed and validated for ultra-low power edge inference using NN variants [10], [11]. They present custom hardware capable of operating at near-threshold voltages with power-gating to reduce active chip power during operations.
Several approaches have already been considered for easing the NN based edge transition. These include network pruning [12], [13] to reduce the number of parameters and model complexity, weight quantization [9], [14] to alleviate the compute intensity of floating point arithmetic, and layer decomposition [15] to allow for model compression. These methods all tackle one significant and unavoidable challenge: the arithmetic nature of Neural Networks. Moreover, there are practical limitations to black box ML models being used in critical applications to make high-stake decisions [16], which suggests the exploration of interpretable models instead [17]. This work considers a fundamentally different machine learning (ML) algorithm, the Tsetlin Machine (TM) [18]. The TM addresses the aforementioned challenges by removing floating-point arithmetic from training and inference routines. The learning element of the TM is the Tsetlin Automaton (TA), which decides whether to include or exclude a feature while creating logic propositions for the learning problem. The includes and excludes are represented by bits '1' and '0', respectively. The logic-based underpinning enables bit-wise operations in the inference routine between Boolean input features and include-exclude decisions, making it computationally less intensive and energy frugal.
While TM offers the advantage of logic based inference, the memory footprint of trained models is usually quite significant [19]. A naive implementation would use the model as it is, which makes it impractical for edge inference. Each TM contains a user-specified number of clauses, each of which holds the TA that form a logic proposition. As the classification problem complexity increases, so does the number of logic propositions (or clauses) required by the TM. All TM related work clearly demonstrates that more clauses lead to better model accuracy, with a large associated impact on memory. However, there is no deterministic and definitive method of selecting the optimum number of propositions per class for a given problem to minimize the memory footprint. MILEAGE [20] is the first attempt at determining the number of clauses or logic propositions sufficient to classify a given dataset. It proposes an automated method to minimize the number of clauses at runtime depending on their contribution towards correct classification. Although it demonstrates that high accuracy can be achieved using relatively fewer clauses, the model size still remains large for edge inference. Therefore, to reap the advantages of logic over arithmetic computation when comparing against NNs in inference, TMs require substantial model compression.
This work proposes REDRESS: a novel leaRning EDge infeRence methodology for Embedded tSetlin machineS. The primary objective of REDRESS is to find the optimum balance between the TM model size, accuracy and latency. To achieve this, the methodology first employs an automated TM architecture search paradigm to find the optimal model size and hyperparameters to best represent the classification problem. This is accomplished by automated training of multiple candidate TM configurations. The user can examine the resulting models to decide which model should proceed to the next stage. The next stage involves a computationally minimal REDRESS training procedure involving Tsetlin Automata Re-profiling, which attempts to increase the sparsity of useful automata decisions (the number of include decisions) but provides an optimally trained model with high accuracy. REDRESS proposes a lossless compression approach called include-encoding to compress the TA by storing only the information associated with includes, achieving over 99% compression in terms of memory footprint for all benchmark datasets. Finally, REDRESS proposes a bit-parallel inference methodology that operates on TA in the compressed domain and performs fast multi-class classification.
The key difference between MILEAGE and REDRESS is that MILEAGE aims to reduce the model size by reducing the number of clauses while REDRESS aims to minimize the model using sparsity of includes irrespective of the number of clauses.
The study of TM in this work mainly targets microsystems with resource constraints for, e.g., IoT or edge inference applications, which fall in the low and ultra-low power category with limited storage and computational capacity. The scope of the paper is therefore limited to smaller datasets with Boolean features such as images encoded with Booleans. The validation is performed on STM32F746G-DISCO micro-controller that offers 1 Mbytes of flash memory and 340 Kbytes of runtime memory (RAM) with an ARM Cortex-M7 core. Deploying large models trained for CIFAR 10/100 or ImageNet is not possible on such a platform and lies outside the scope of this paper. Other variants of TM have been evaluated with larger datasets such as CIFAR10 and CIFAR100 where they have demonstrated accuracy up to 75% and 45%, respectively [21]. TM is an actively evolving field of research where novel architectures and training methods are being developed.
Through the REDRESS workflow, we propose the following contributions:
• A novel TM training approach using the TA Re-profiling procedure to find the best trade-off between model accuracy and memory footprint.
• An Include-Encoding based lossless TA compression technique.
• A fast inference algorithm capable of using a bit-parallel arrangement to perform many image classifications simultaneously, achieving order-of-magnitude speedups.
• Extensive validation of the REDRESS-based TM vis-à-vis the state-of-the-art Binary Neural Networks [9] using several ML benchmark datasets on an STM32 micro-controller.

II. BINARY NEURAL NETWORK (BNN)
The BNN, first proposed by Courbariaux et al. [22], is the state-of-the-art NN variant used for edge inference and embedded applications [9]. It performs the most extreme quantization of features and weights to a single bit, where the weights are passed through a sign function that converts them to ±1. The negative weights are clamped to −1 and positive weights are clamped to 1, and they are represented with 0 and 1, respectively. BNN enables fast computation by replacing 32-bit weights, which require multiply-accumulate operations, with 1-bit weights that use XNOR and popcount [9], [22]. This is similar to the TM clause output and class sum computation. A comparative analysis of TM and NN based approaches can be found in [23].
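The XNOR and popcount replacement for multiply-accumulate can be illustrated with a minimal sketch. The function names and the bit-packing order below are illustrative assumptions, not taken from [9] or [22]; the sketch only shows why counting matching bits recovers the ±1 dot product.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors over {-1, +1}.

    Each vector is packed into an int: bit = 1 encodes +1, bit = 0 encodes -1
    (the 0/1 representation described in the text).
    """
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the two signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # (#matches) - (#mismatches)


def unpack(bits, n):
    """Expand a packed vector back to a list of -1/+1 values (for checking only)."""
    return [1 if (bits >> i) & 1 else -1 for i in range(n)]


a, b, n = 0b1011, 0b0110, 4
reference = sum(x * y for x, y in zip(unpack(a, n), unpack(b, n)))
print(binary_dot(a, b, n), reference)
```

Both values agree, since every matching bit contributes +1 and every mismatching bit contributes −1 to the sum.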
The use of the sign function poses challenges in training during back-propagation, since its gradient is zero almost everywhere, which eliminates the possibility of gradient descent. To accommodate this, Courbariaux et al. retain the real-valued weights and binarize them each time for the forward pass, and a straight-through estimator is used to update the real weights during the backward pass. This is a key difference between DNN and BNN. For details, interested readers should refer to [22], [24]. To further optimize the effectiveness of the feed-forward process, McDanel et al. propose the combination of the BNN layer computation with batch normalization (if specified) and the activation function. The temporary results are stored as binary outputs, which further reduces the memory footprint in embedded implementation [9]. It is enabled by re-ordering the operations such that, instead of computing all intermediates of a layer, the layer is computed in chunks, where the batch normalization and activation function can be applied to each chunk to produce only the final output, thus trading extra computation effort for a reduced memory footprint. To date, this is the best performing method of deploying BNN to a resource-constrained device and, hence, has been used in this paper as the gold standard for comparison. BNN and TM are two completely different approaches to learning data. BNN is very similar to a multilayered NN or DNN, as it uses floating-point operations, backpropagation and multiple layers to learn input data. TM is single layered and its training process uses learning automata devoid of floating-point operations, with a randomised feedback process. In BNN, all neurons receive feedback, while in the TM the clauses and the automata that receive feedback are selected randomly.
In this paper, we have trained BNN in different configurations, including the single-layered FC-128, which closely resembles the TM, and show that the TM outperforms FC-128 in computation time and energy.
TABLE I. Table of key terms for explaining TM training and inference and the REDRESS methodology.

III. TSETLIN MACHINES
The Tsetlin Machine is a machine learning algorithm that relies on the principles of learning automata, called Tsetlin Automata, and game theory to create logic propositions for classification. Theoretical proof of TM's capability to solve complex pattern recognition problems, derivations of propositional formulas and its alignment with Nash equilibrium can be found in [18]. Interested readers can find the proof of convergence of TM in [25], [26] and further details on theoretical aspects of TM in [27], [28], [29], [30]. In this section, we visualize the working principle and actual implementation of the TM algorithm. A visual depiction is extremely helpful in understanding the REDRESS approach and facilitating TM adoption, exploration and knowledge sharing. In Section III-A, we examine the pre-processing methodology for input literals and the inference process to highlight the logic driven nature of the TM. In Section III-B, we present the training methodology used by TM to illustrate the reduced computational intensity in comparison with the gradient descent based NN approaches. Table I collates the key terms that will be used across these subsections and the subsequent sections. Fig. 1 demonstrates the inference routine for TM. It shows how raw data is prepared through a Booleanization method and how this Boolean data is then linked to the learning elements, the TA, by computation of a Clause Output. Multiple clause outputs are passed through a voting system, and the class with the most votes is the inferred class.

A. Data Preparation and Inference
Data Preparation: The TM requires Boolean Literals as inputs to the model. To transform raw data into Boolean Literals, a Booleanization process is used, as demonstrated in the upper left corner of Fig. 1. Raw features that are of integer or floating point value are passed through a Booleanizer that compares each of them with a pre-determined threshold and generates a single-bit Boolean value of 0 or 1, called the Boolean Feature. In the example presented in Fig. 1, the threshold is chosen arbitrarily for demonstration purposes; generally, however, the threshold(s) can be created as per the designer's choice. For example, a threshold can be set to the mid-point of the raw feature range or chosen through off-the-shelf adaptive thresholding techniques for images. It should be noted that the number of bits used to represent the Booleanized data is user and application dependent. It allows the designer control over the granularity of the input space seen by the TM. In Fig. 1, we have used a single bit to store the Booleanized data, but the granularity required by the datasets used in this work is discussed in detail in Section VII. The Boolean Features are then expanded into Boolean Literals by including their complements. By using both the features and their complements, the Boolean Literal space can represent every possible value that each Boolean Feature can acquire. In Fig. 1, we have presented the Booleanization process using two Boolean features that produce four literals to be used as inputs to the TM.
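As a minimal sketch of the Booleanization step, the following assumes a single fixed threshold, a strict "greater than" comparison, and an interleaved literal order (feature, complement, feature, complement, ...). All three are illustrative choices, and the function names are hypothetical.

```python
def booleanize(raw_features, threshold):
    """Compare each raw feature with a threshold to obtain 1-bit Boolean Features."""
    return [1 if x > threshold else 0 for x in raw_features]


def to_literals(boolean_features):
    """Expand f Boolean Features into l = 2 * f Boolean Literals.

    Assumed layout: literal 2*i is feature i, literal 2*i + 1 is its complement.
    """
    literals = []
    for bit in boolean_features:
        literals.append(bit)
        literals.append(1 - bit)
    return literals


features = booleanize([0.2, 0.9], threshold=0.5)   # two raw features
literals = to_literals(features)                    # four literals, as in Fig. 1
print(features, literals)
```

Using more than one threshold per raw feature would simply yield more Boolean Features per input, which is how the granularity mentioned above could be increased.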
Clause Computation: The main computation component in the inference is the clause output which interprets Boolean literals using the TA. Fig. 1 presents the clause computation process using the four literals, obtained from the previous Booleanization step, in conjunction with the TA that pass through the proposition logic consisting of NOT, OR and AND gates. Each literal is assigned a corresponding TA in each clause. Each automaton in our diagram has 6 states, the three on the left are the exclude states and the other three on the right are the include states. If the state of the automaton is exclude then its output will be a bit '0', else if the state is include then the output will be a bit '1'. The number of states that each TA can have is a design choice. Having a larger number of states will increase the granularity of the decision making. We will explore this later in Section VI. The positions of TA states shown in Fig. 1 assume that they have settled in near optimum positions after training. During inference the positions of TA can no longer transition and, therefore, have a fixed include or exclude decision. Through the logic circuitry for the clause, we create a logic proposition that relates the literals to the TA state decision and generate a 1-bit Clause Output.
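The clause logic described above (each literal ORed with the negated include decision, then all results ANDed) could be sketched as follows. The 6-state automaton with states numbered 1 to 6 follows the example in the text; the function names are hypothetical.

```python
def ta_action(state, n_states=6):
    """Return 1 (include) if the state lies in the upper half, else 0 (exclude)."""
    return 1 if state > n_states // 2 else 0


def clause_output(literals, ta_states, n_states=6):
    """1-bit Clause Output: AND over (literal OR NOT include) for every literal."""
    out = 1
    for literal, state in zip(literals, ta_states):
        include = ta_action(state, n_states)
        out &= literal | (1 - include)   # an excluded literal always contributes 1
    return out


literals = [0, 1, 1, 0]                            # f0, NOT f0, f1, NOT f1
print(clause_output(literals, [2, 5, 4, 1]))       # includes literals 1 and 2
print(clause_output(literals, [5, 1, 1, 1]))       # includes literal 0, which is 0
```

Only the included literals can pull the output to 0, which is the property that the include-encoding of Section IV relies on.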

Structure and Inference:
To classify a problem with multiple classes, we need the Multiclass TM model. Since most problems have multiple classes, we will use TM and Multiclass TM interchangeably in this work. In Fig. 1, we present a hypothetical example of a Multiclass TM of M classes with N clauses each containing 4 TA per clause. In actual implementation the number of TA per clause will mirror the number of input literals l (= 2 × f ), where f is the number of Boolean input features.
An important design choice determining the size of TM and compute complexity is the number of clauses N specified by the user for any application. This is illustrated in the bottom half of Fig. 1. For every input datapoint,¹ which includes l Boolean literals, we compute a 1-bit Clause Output using the TA of the respective clauses. Each class must contain an even number of clauses, as the clauses carry polarities of +1 and −1 alternately. The clause polarity determines whether the 1-bit Clause Output is multiplied with +1 or −1 before the outputs are summed together to determine the Class Sum. The clause polarity indicates whether the clause will support or oppose the classification, i.e., a +ve polarity clause will support the classification while a −ve polarity clause will oppose it. With the Class Sums for all the classes, an argmax function is used to determine the largest value, which gives the classification for the given datapoint. Fig. 2 visualizes a trained TM model with model/automata size of M × N × l (see Table I). Each class has the same number of user-specified clauses N and each clause consists of l automata, where l is the number of literals. For a dataset with f features, l = 2 × f. In Section VI-C, we discuss how an appropriate value of N is selected for a particular dataset. The range of values that TA states can acquire is [1,400], with any value > 200 considered as include and the rest as exclude. The range of values TA can acquire is a user design choice. From our experimentation and the available literature, the ranges [1,200] or [1,400] are mostly used and are sufficient for all encountered classification problems. The difference between the two is discussed in Section VI. In Fig. 2, the include-exclude decisions corresponding to TA state values can be seen. The includes are highlighted and purposefully shown as relatively sparse compared with excludes, as is generally the case. The sparsity of TA will be discussed further in Section VI.
¹ A datapoint can be any instance of information that needs to be classified by a trained model, for example, an image or audio clip.
Fig. 3. Overview of the training procedure followed in Multiclass Tsetlin Machine classification. It begins with the initialization of the Tsetlin Automata with one of two randomly chosen values: middle_state ± 1. In an epoch, the training procedure iterates over all the datapoints in the training sample space. For each datapoint, two classes receive feedback: one of them is the expected class and the second is any randomly chosen class other than the expected class. They are indicated by the expected output $y_c \in \{0, 1\}$, with $y_c = 1$ indicating feedback to the expected class and $y_c = 0$ indicating feedback to the random class.
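The voting step (polarity-weighted Class Sums followed by argmax) can be sketched as below. Starting the alternation with +1 at clause index 0 and resolving ties in favour of the lowest class index are assumptions made for illustration.

```python
def class_sum(clause_outputs):
    """Sum 1-bit clause outputs with alternating polarity (+1, -1, +1, ...)."""
    return sum(out if j % 2 == 0 else -out for j, out in enumerate(clause_outputs))


def classify(clause_outputs_per_class):
    """Return (predicted class, list of class sums) for one datapoint."""
    sums = [class_sum(outputs) for outputs in clause_outputs_per_class]
    predicted = max(range(len(sums)), key=sums.__getitem__)   # argmax
    return predicted, sums


# Hypothetical clause outputs for M = 3 classes with N = 4 clauses each.
outputs = [
    [1, 0, 1, 0],   # two supporting clauses fire         -> class sum  2
    [0, 1, 1, 1],   # one supporting, two opposing fire   -> class sum -1
    [1, 1, 0, 0],   # one supporting, one opposing fire   -> class sum  0
]
print(classify(outputs))
```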

B. Training Tsetlin Machines
The rationale behind the training process is to find an optimum combination of state positions for the TAs across the model to achieve high classification accuracy. TM provides Feedback to each automaton for its state transition. The TM is a supervised learning algorithm; its feedback procedure is therefore independent of the classification output obtained at the end of TM inference in Fig. 1. Instead, it requires the class sum and the actual class to which the input datapoint belongs. Fig. 3 presents an overview of the training procedure. The process starts with the initialization of TA states randomly to one of two values: middle_state ± 1. Using a uniform random number generator, initialization gives every automaton an equal chance of becoming an include or an exclude. Let us assume a training sample space $S_{train}$ consisting of τ datapoints: $(X, y) \in S_{train}$. $X \in \{0, 1\}^l$ is a binary literal vector of length l belonging to class y, referred to as the expected class. In each epoch, the feedback process iterates over all τ datapoints. In an M-class TM, each class is assigned one TM that consists of N clauses, each with l automata. For each datapoint $X_i$, a pairwise learning approach is employed where two classes are selected for feedback. One of them is the expected class $y_i$, as in $(X_i, y_i) \in S_{train}$, while the other is any randomly chosen class $\neq y_i$. To differentiate between the two selected classes, they are assigned an expected output value $y_c = 1$ or $0$, as shown in Fig. 3. $y_c$ is used to determine the type of feedback at a later stage, as discussed next. Fig. 4 shows the feedback procedure within a selected class using a decision tree. The feedback procedure requires two user-defined hyperparameters, viz. s and threshold T. For the selected class, class_sum is calculated using the inference procedure discussed in Section III-A.
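The initialization and the pairwise class selection can be sketched as follows. Taking middle_state as half of the state range (200 for the range [1,400]) is an assumption, as are the function names; with an include boundary at > 200, the two initial values 199 and 201 give each automaton an equal chance of starting as exclude or include.

```python
import random


def init_ta_states(n_classes, n_clauses, n_literals, n_states=400, rng=random):
    """Initialize every TA to middle_state - 1 or middle_state + 1 with equal chance."""
    middle_state = n_states // 2   # assumed definition of middle_state
    return [
        [
            [rng.choice((middle_state - 1, middle_state + 1)) for _ in range(n_literals)]
            for _ in range(n_clauses)
        ]
        for _ in range(n_classes)
    ]


def select_feedback_classes(expected_class, n_classes, rng=random):
    """Pick the two classes that receive feedback for one datapoint.

    Returns [(class index, y_c), ...] with y_c = 1 for the expected class and
    y_c = 0 for a uniformly chosen different class.
    """
    others = [c for c in range(n_classes) if c != expected_class]
    return [(expected_class, 1), (rng.choice(others), 0)]


rng = random.Random(0)                       # fixed seed for reproducibility
states = init_ta_states(2, 4, 6, rng=rng)    # M = 2, N = 4, l = 6
print(select_feedback_classes(3, 5, rng=rng))
```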
While training, the class_sum needs to be clipped to lie within the threshold range of [−T, T], such that $\text{class\_sum} = \max(-T, \min(T, \text{class\_sum}))$. The clipped class_sum plays a crucial role in determining the probability of a clause getting feedback, using T and a random number comparison as shown in Equations C1 and C2 in Fig. 4. This ensures that all clauses have an equal chance of getting feedback and are randomly selected. A clause can get one of two types of feedback, viz. Type I and Type II, depending on the clause polarity and $y_c$. For Type I feedback, the hyperparameter s and the random number are used to determine the probability of an automaton transitioning state, be it an increment or a decrement in the state value, as shown in Equations S1 and S2 in Fig. 4. The clause output and the literal determine whether the state value will be incremented or decremented. The Type II feedback is straightforward and always increments the state value if the clause output is 1 and the automaton is an exclude. The hyperparameters will be discussed further in Section VI.
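Since Fig. 4 is not reproduced here, the exact form of Equations C1 and C2 is not available in the text. The sketch below assumes the formulation commonly used for the TM, where the feedback probability is (T − class_sum) / 2T for the expected class and (T + class_sum) / 2T for the randomly chosen class; it should be read as an illustration of the clipping and of the endpoint behaviour, not as the authors' equations.

```python
def clip(class_sum, T):
    """Clip the class sum to the range [-T, T]."""
    return max(-T, min(T, class_sum))


def feedback_probability(class_sum, T, y_c):
    """Probability that a clause of the selected class receives feedback.

    Assumed forms: (T - c) / (2T) when y_c == 1 and (T + c) / (2T) when y_c == 0,
    where c is the clipped class sum.
    """
    c = clip(class_sum, T)
    return (T - c) / (2 * T) if y_c == 1 else (T + c) / (2 * T)


T = 20
for cs in (-35, -20, 0, 20, 35):
    print(cs, feedback_probability(cs, T, 1), feedback_probability(cs, T, 0))
```

Under this assumption, an expected class whose class sum has reached T receives no further feedback, and neither does a random class whose class sum has reached −T.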
The Type I feedback combats false negatives and Type II feedback combats false positives. Ideally, for accurate classification, the class sum of the expected class must be higher than that of all other classes. By randomly selecting another class, all other classes have an equal chance to calibrate their TA for smaller class sums. The class sum increments if a +ve polarity clause output is 1 and decrements if a −ve polarity clause output is 1. Therefore, to obtain a high class sum, more +ve clauses need to have an output of 1 and −ve clause outputs should be 0. The opposite holds if a smaller class sum is desired. As shown in Fig. 5, the Type I and Type II feedback attempt to change the clause outputs to achieve the desired class sums. The figure presents a qualitative comparison of the motives and effects of both types of feedback. To change the clause outputs, the feedback will alter the number of automata that identify as includes. The number of includes is crucial to the TM model memory footprint, as we will see in Sections IV and VI.
Fig. 4. Overview of the feedback procedure within the class. There are two types of feedback, viz. Type I and Type II. The type of feedback a clause gets depends on the hyperparameters, clause polarity, expected output ($y_c$), and probabilities C1 and C2. C1 and C2 determine if a clause will get any feedback based on the comparison with a random number generated from a uniform distribution. Similarly, in the Type I feedback, a random number is used to determine the probability of an automaton getting an increment or decrement in state value. y: Yes, n: No, Inc: Include, Exc: Exclude, T: Threshold hyperparameter and s: the second hyperparameter, as mentioned in Table I.

IV. INCLUDE-ENCODING
The proposition logic used for clause computation is shown again in Fig. 6(a). Here, we observe an interesting property of TM, where only the literals corresponding to includes contribute to the clause output. Assuming Exc means exclude, a bitwise OR with 1 (= NOT Exc) will invariably result in an output of 1 and does not affect the output of the following AND gate. In the hypothetical example shown in Fig. 6(a), only the literals B 1 and B 2 determine the clause output, as their corresponding automata are includes. In almost all datasets trained using standard TM, the TA have been found to be extremely sparse, consisting of > 99% excludes. Considering that < 1% of TA are includes and only these contribute to the clause computation, it is reasonable to store only the include-related information individually. Fig. 6(b) shows l TA corresponding to l literals and their respective offsets (or addresses). We encode these offsets along with some additional information in a 16-bit integer. It should be noted that only includes determine the clause outputs and we encode and store all includes; therefore, no loss of information occurs in the proposed compression scheme, making it lossless. Fig. 6(c) presents the detailed explanation of the encoded include. The additional information indicates the polarity of the clause and demarcates the change of clause while computing the clause output. The encoded includes are stored serially; hence, it is necessary to know the polarity of the clause each include belongs to, and to group all includes belonging to one clause together using the 2nd most significant bit (MSB). Fig. 6(c) presents three cases to show how the first two MSBs flip depending on the serial number of the clause and its polarity. If a clause with all excludes is encountered then there is nothing to store; hence, the 2nd MSB flips only when the next include is encountered in the 3rd clause, as shown in Fig. 6(c). Using the Include-Encoding scheme, 2 bytes are required to store each include.
Hence, the memory footprint of the TM model becomes directly proportional to the number of includes. In Section VI, we present the REDRESS training procedure using TA re-profiling to address this challenge. We will present the details on compression achieved using the proposed encoding scheme in Section VIII. Although not specifically encoded, the least significant bit (LSB) in Fig. 6(c), by default, indicates the literal polarity. Literal polarity is used during inference to access the literals from the features of an input datapoint in the testing dataset and will be discussed along with the proposed inference algorithm in Section V. The include-encoding scheme stores only includes; hence, if more clauses with all excludes are encountered, they can be skipped entirely, resulting in higher compression. The algorithm scans through the TAs once and, while scanning, keeps track of clause changes; thus, no extra overhead is involved when more clauses with only excludes are encountered.
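A minimal sketch of an include-encoder is given below. The exact bit layout is defined in Fig. 6(c), which is not reproduced here, so the layout used is an assumption based on the description: bit 15 (MSB) holds the clause polarity (1 for +ve), bit 14 (2nd MSB) toggles whenever a new stored clause begins, and bits 13 to 0 hold the literal offset, whose LSB doubles as the literal polarity when literals are interleaved as (feature, complement). The function name and the choice to keep the toggle running across classes are also hypothetical.

```python
def include_encode(include_mask):
    """Encode include decisions as 16-bit integers, 2 bytes per include.

    include_mask[class][clause][literal] is 1 for include and 0 for exclude.
    Returns (includes_per_class, encoded_includes).
    """
    includes_per_class, encoded = [], []
    toggle, started = 0, False
    for clauses in include_mask:
        count = 0
        for j, automata in enumerate(clauses):
            offsets = [k for k, inc in enumerate(automata) if inc]
            if not offsets:
                continue                      # all-exclude clause: nothing is stored
            if started:
                toggle ^= 1                   # flip 2nd MSB at each new stored clause
            started = True
            polarity = 1 if j % 2 == 0 else 0  # assumed: even clause index is +ve
            for k in offsets:
                assert k < (1 << 14), "offset must fit in 14 bits"
                encoded.append((polarity << 15) | (toggle << 14) | k)
                count += 1
        includes_per_class.append(count)
    return includes_per_class, encoded


# One class, three clauses, four literals; the 2nd clause has no includes.
counts, enc = include_encode([[[0, 1, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]]])
print(counts, [hex(e) for e in enc], "bytes:", 2 * len(enc))
```

In this example the second clause is skipped entirely, so the toggle bit first changes at the include belonging to the third clause, which mirrors the case described for Fig. 6(c).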

V. REDRESS INFERENCE
The data structure obtained following the Include-Encoding method comprises two arrays: one holds the number of includes per class and the second holds the encoded includes, as shown in Fig. 7. The input datapoints from the testing dataset are bit-packed by forming W-bit integers from the corresponding bits of W datapoints. As we will see, this enables classification of W images simultaneously, taking advantage of the maximum word size, which is assumed here to be 32 as we validate TM models on a 32-bit micro-controller. The input datapoints only contain the features, and the literals are obtained during runtime by complementing them, such that $\{l_{i0} = f_i,\ l_{i1} = \bar{f}_i\}$. The LSB of the encoded include indicates if the corresponding literal is $l_{i0}$ or $l_{i1}$, where i is the feature offset and 0 or 1 indicates the literal polarity, respectively. Algorithm 1 presents the pseudocode of the proposed REDRESS inference algorithm. Inference of W datapoints is performed simultaneously in each iteration and, hence, the total number of iterations performed is τ/W, where τ is the total number of datapoints in the testing dataset. The algorithm is based on the assumption that W = 32 and that 16 bits or 2 bytes are used to encode each include, which justifies the constants appearing in the pseudocode. However, different values of W can in theory be used depending on the maximum word size of the underlying hardware.
The algorithm iterates through the array of encoded includes IncEnc once for every W datapoints. This enables parallel inference of W datapoints and significantly reduces the total computation time. While iterating through the IncEnc array, the algorithm essentially scrolls through all the includes in the model. It uses the offset from IncEnc to identify the literals corresponding to includes in each clause, and ANDs them to calculate the clause outputs simultaneously for W datapoints. When the end of the clause is encountered, the clause outputs are added to or subtracted from the class sums depending on +ve or −ve clause polarity, respectively. The algorithm keeps track of the maximum class sum and the corresponding class. literal_polarity is used in line 21 to determine if the literal is the input feature or its complement, as it is needed to compute the clause output. Comments in Algorithm 1 provide further explanation of its implementation.
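The following sketch illustrates bit-parallel inference in the compressed domain. It is not Algorithm 1, which is not reproduced in this text; it assumes a 16-bit layout with the clause polarity in bit 15, a clause-change toggle in bit 14 and a 14-bit literal offset whose LSB is the literal polarity. The per-datapoint class sums are accumulated with a plain loop for clarity, and ties are resolved in favour of the lowest class index.

```python
def pack_datapoints(batch):
    """Pack W feature vectors into one W-bit integer per feature (bit d = datapoint d)."""
    n_features = len(batch[0])
    return [sum(dp[i] << d for d, dp in enumerate(batch)) for i in range(n_features)]


def redress_infer(includes_per_class, inc_enc, packed, W):
    """Classify W bit-packed datapoints in a single pass over the encoded includes."""
    mask = (1 << W) - 1
    best_sum, best_class = [None] * W, [0] * W
    pos = 0
    for cls, n_includes in enumerate(includes_per_class):
        sums, end = [0] * W, pos + n_includes
        while pos < end:
            sign = 1 if inc_enc[pos] >> 15 else -1     # clause polarity (MSB)
            toggle = (inc_enc[pos] >> 14) & 1          # clause marker (2nd MSB)
            clause = mask                              # W clause outputs, all 1
            while pos < end and (inc_enc[pos] >> 14) & 1 == toggle:
                offset = inc_enc[pos] & 0x3FFF
                feature = packed[offset >> 1]
                # LSB = literal polarity: 0 -> feature, 1 -> complemented feature
                clause &= (~feature & mask) if offset & 1 else feature
                pos += 1
            for d in range(W):                         # add or subtract clause outputs
                if (clause >> d) & 1:
                    sums[d] += sign
        for d in range(W):                             # running argmax per datapoint
            if best_sum[d] is None or sums[d] > best_sum[d]:
                best_sum[d], best_class[d] = sums[d], cls
    return best_class


# Toy model with 2 features: class 0 fires on (f0 AND NOT f1), class 1 on f1.
counts, inc_enc = [2, 1], [0x8000, 0x8003, 0xC002]
packed = pack_datapoints([[1, 0], [0, 1], [1, 1]])     # W = 3 datapoints
print(redress_infer(counts, inc_enc, packed, W=3))
```

Clauses without includes never appear in the encoded array, so they contribute nothing to any class sum and cost no iterations.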

A. Hyperparameter: T
The hyperparameter T (or threshold) plays a crucial role in classification accuracy and the number of includes. Section III-B and Fig. 4 show that the class sum is clipped to [−T, T] and cannot exceed the threshold in magnitude. Equations C1 and C2, shown again in Fig. 8, determine the probability that a clause will get feedback, i.e., either Type I or Type II. Fig. 8 presents the average probability of a clause getting feedback for the possible range of class sums with respect to T = 20. It assumes that all classes obtain the same number of feedback hits τ × α for τ training samples and a certain constant α. We can see that for the expected class the probability reduces to zero if Class_sum = T, implying that no further change will happen in the TA profile of the expected class. Similarly, a random class with Class_sum = −T will stop getting feedback to prevent further deterioration of its TA profile vis-à-vis the expected class, as they can already be clearly distinguished with high accuracy due to the large difference in their class sums. At this point, a higher T would allow continued feedback even though it is not required.
Algorithm 1: REDRESS Inference Algorithm Pseudocode.
Ideally, it is assumed that the training sample space $S_{train}$ is unbiased, such that all classes have an equal number of samples for training. However, this is seldom the case in practical scenarios. With more samples, the expected class will get more chances of obtaining higher class sums in comparison to other classes, which can skew the training in favor of a particular class. Moreover, any classification problem has inherent biases where some class is relatively easier to distinguish among others depending on its properties, input raw data preprocessing and the underlying ML algorithm. The threshold combats these biases by limiting the number of feedback, as shown in Fig. 8. Fig. 9 illustrates the effect of T (see Table I for key term definitions), using the MNIST dataset for T = {5, 10, 20, 40, 60, 100}. We can draw the following observations from the figure:
• The higher the T, the greater the number of feedback. A very high T is unable to combat the aforementioned biases in the dataset. Similarly, a very small T fails to calibrate the TA profile to capture the important features. Fig. 9 shows that an optimum range of T exists for a specific dataset that will result in high accuracy.

B. Hyperparameter: s
The hyperparameter s is used to determine whether a state transition (increment or decrement) will happen for an automaton in Type I feedback, as shown in Fig. 4. Fig. 10 shows Equations S1 and S2 again and plots the probability of S1 and S2 being TRUE for β = 1 million checks for different values of s, each one being an independent trial. If S1 is TRUE then the state will decrement with the aim of making it an exclude, if it is not one already. If S2 is TRUE then the state will increment and more includes may appear for automata whose corresponding literal is 1. A high s will lead to more includes and vice versa, as shown by their probabilities in Fig. 10. Thus, the hyperparameter s determines the learning rate and capacity. The learning capacity relates to the maximum training accuracy at which the TM saturates, beyond which an increase in s is required to achieve any better accuracy. The hyperparameter s plays an extremely crucial role in avoiding overfitting and achieving high accuracy while minimizing the number of includes, and it forms the core of TA Re-profiling. We will continue to discuss hyperparameter s in detail in the upcoming sections. However, it should be noted that, similar to T, there exists an optimum range of s for accurate classification of a dataset. The empirical evidence will be provided in the following sections.
TABLE II: An excerpt from the tabulated output of the TM Architecture Search Paradigm for MNIST, arranged in descending order of accuracy. The data belongs to Fig. 11.
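The trend attributed to Fig. 10 can be reproduced approximately with a small Monte Carlo sketch. Equations S1 and S2 are not restated in this section, so the code assumes the standard TM Type I conditions: S1 (decrement) holds when a uniform random number falls below 1/s, and S2 (increment) holds when it falls below (s − 1)/s. The function name, the default number of checks and the seed are illustrative choices.

```python
import random


def estimate_s_probabilities(s, checks=100_000, seed=0):
    """Estimate how often S1 (decrement) and S2 (increment) hold for a given s.

    Assumed conditions (standard TM Type I feedback):
      S1: rand() < 1 / s
      S2: rand() < (s - 1) / s
    Each check draws an independent random number.
    """
    rng = random.Random(seed)
    s1_hits = sum(rng.random() < 1.0 / s for _ in range(checks))
    s2_hits = sum(rng.random() < (s - 1.0) / s for _ in range(checks))
    return s1_hits / checks, s2_hits / checks


if __name__ == "__main__":
    for s in (2.5, 5.0, 7.5, 10.0, 15.0):
        p_s1, p_s2 = estimate_s_probabilities(s)
        print(f"s={s:>4}: P(S1)~{p_s1:.3f} (exact {1 / s:.3f}), "
              f"P(S2)~{p_s2:.3f} (exact {(s - 1) / s:.3f})")
```

Under these assumed conditions the increment probability approaches 1 as s grows, which is consistent with the statement that a high s leads to more includes.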

C. TM Architecture Search Paradigm
There does not exist a deterministic method to find TM architecture parameters suitable to classify any dataset with satisfactory accuracy and a certain number of includes. TM architecture parameters are determined empirically and mainly consist of the number of clauses, T and s, and less often the number of epochs, as it can be decided during runtime. As mentioned previously, there exists a range of TM architecture parameters that can produce an optimal result for a dataset. To identify the optimal solution, we propose an automated TM Architecture Search Paradigm (TMASP) approach where the TM source code is integrated with the data visualization tool from Weights & Biases [31]. It generates sweeps for each combination of architectural parameters specified, a subset of which is visualized in Fig. 11 and tabulated in Table II. TMASP is an automated process that stores the raw state values of the TA for each sweep, generates visualizations, tabulates the results and can work in both online and offline modes. Multiple instances of TMASP can run simultaneously, with one architecture combination per core, to speed up the exploration process. The source code can be found at: https://github.com/nclaes/tmasp.
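The sweep over architecture combinations can be pictured with the following minimal sketch. It is not the TMASP implementation: the Weights & Biases integration is omitted, and `train_and_evaluate` is a hypothetical user-supplied callable that trains one TM configuration and returns its accuracy and number of includes.

```python
from itertools import product


def build_sweep(clauses, thresholds, s_values):
    """Return one configuration dict per combination of the given parameters."""
    return [
        {"clauses": n, "T": t, "s": s}
        for n, t, s in product(clauses, thresholds, s_values)
    ]


def run_sweep(configs, train_and_evaluate):
    """Run every configuration and sort results by accuracy, best first.

    train_and_evaluate is a user-supplied callable (hypothetical here) that
    takes a configuration dict and returns (accuracy, number_of_includes).
    """
    results = []
    for config in configs:
        accuracy, includes = train_and_evaluate(config)
        results.append({**config, "accuracy": accuracy, "includes": includes})
    return sorted(results, key=lambda row: row["accuracy"], reverse=True)


if __name__ == "__main__":
    # Example grid only; the values are illustrative, not a recommended range.
    configs = build_sweep(clauses=[200, 300], thresholds=[10, 15, 20], s_values=[2.5, 7.5])
    print(len(configs), "configurations, first:", configs[0])
```

Because the configurations are independent, the loop in `run_sweep` could be distributed over cores, for example with `multiprocessing.Pool.map`, to mirror the one-combination-per-core arrangement described above.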
From Table II, we notice that s = 7.5 has consistently produced high accuracy compared to the other values, while s = 2.5 is the worst performer. The number of includes is quite high for T = 15 and T = 20 with s = 7.5, N = 300 and epochs = 100. This reinforces the understanding developed in Section VI-A that a high threshold increases the amount of feedback, specifically Type I feedback, resulting in more includes. Often, more clauses may lead to more includes. The storage and computational requirements are directly proportional to the number of includes. Fewer includes require less storage space and consume less computation time during inference. It should be noted that, using the REDRESS training procedure, the number of includes can be reduced and the accuracy can be improved, as will be discussed in Section VI-D; hence, the selection here determines the configuration at the start of training. Using Table II, we can select the TM architecture parameters for MNIST that satisfy our requirements of includes and accuracy. Among other choices, we select the following parameter values: N = 200, s = 7.5, T = 10. Section VII elaborates on the parameters selected for the other datasets in this paper while discussing the complete REDRESS flow.

D. Tsetlin Automata Re-Profiling
TA re-profiling is the procedure to disturb the Nash equilibrium of the automata [18] while retaining the include-exclude decisions and, hence, the training accuracy from the last training epoch. It involves forcibly resetting each state to the minimum possible value that retains its include-exclude characteristic. Fig. 12 presents examples of two automata that retain their include and exclude decisions while their values are reset to middle_state + 1 and 1, respectively. The standard training procedure initializes the TA with random values around the middle of the state value range, as shown in Fig. 3. TA re-profiling enables us to initialize the TM with a previously trained model and gives the TM an opportunity to re-calibrate the automata with user-specified hyperparameters. Re-profiling the TA saves us from starting the training process from scratch and enables the TM to search for a globally optimal solution. It enables re-using a well-trained model for fine-tuning to increase the accuracy and decrease the model size.
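The reset rule described above can be written as a short sketch. It assumes that states lie in [1, 2 × middle_state] and that a state above middle_state denotes an include; the function names and the flat list representation of the TA are illustrative.

```python
def reprofile(states, middle_state):
    """Reset each automaton to the weakest state that keeps its decision.

    States lie in [1, 2 * middle_state]. A state above middle_state is an
    include and is reset to middle_state + 1; any other state is an exclude
    and is reset to 1.
    """
    return [middle_state + 1 if state > middle_state else 1 for state in states]


def decisions(states, middle_state):
    """Return 1 for include and 0 for exclude for every automaton."""
    return [1 if state > middle_state else 0 for state in states]


if __name__ == "__main__":
    middle_state = 200                     # state range [1, 400]
    trained = [400, 201, 200, 57, 1]
    reset = reprofile(trained, middle_state)
    print(reset)                           # [201, 201, 1, 1, 1]
    # The include-exclude decisions are unchanged by re-profiling.
    print(decisions(trained, middle_state) == decisions(reset, middle_state))  # True
```

Since the decisions are unchanged, clause outputs and therefore accuracy immediately after the reset are the same as before it; only the distance of each automaton from its decision boundary changes.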

E. Methodology
The proposed REDRESS training methodology is based on the hypothesis that by re-profiling the TA and adjusting the s hyperparameter, a similar or better accuracy may be achieved with fewer includes. We discuss the methodology here and empirically validate this hypothesis in Section VIII. Fig. 13(a) presents the REDRESS training methodology using the TA re-profiling procedure. First, TMASP is performed to determine optimal values for the number of clauses and the hyperparameters. In TMASP, described in Section VI-C, the TA are initialized randomly to includes and excludes following the standard training procedure presented in Section III-B. However, in our experiments we have found that when all TA are initialized as excludes and assigned the state value of 1, the TM tends to achieve the same accuracy with fewer includes, albeit with more epochs of training. Similarly, increasing the range of state values from [1,200] to [1,400] increases the number of automaton-level increments in Type I feedback required to become an include. Therefore, any include requires double the feedback and may demonstrate high confidence and importance in accurate classification. Larger ranges of state values may be explored by interested readers; however, this exploration is beyond the scope of this paper.
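The "double the feedback" remark follows from simple arithmetic, sketched below. The code assumes that the first include state is middle_state + 1, with middle_state equal to half the number of states, and that each increment moves the automaton by one state; the function name is illustrative.

```python
def increments_to_include(num_states, start_state=1):
    """Count single-step increments needed to reach the first include state.

    States span [1, num_states]; the first include state is assumed to be
    num_states // 2 + 1, matching the middle_state + 1 convention.
    """
    first_include_state = num_states // 2 + 1
    return max(0, first_include_state - start_state)


if __name__ == "__main__":
    print(increments_to_include(200))  # 100
    print(increments_to_include(400))  # 200
```

An automaton initialized as an exclude at state 1 therefore needs 100 net increments to become an include in the range [1,200] and 200 in the range [1,400].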
TMASP also provides the raw TA file for the chosen architectural parameters to start the training process. The raw TA file can be used to initialize the TA after going through the TA re-profiling procedure or, at the user's discretion, the first training cycle can start from scratch with all TA initialized as excludes with a state value of 1. A training cycle includes the TA re-profiling based initialization from the TA file obtained in the previous training cycle, followed by the desired number of training epochs. In all training cycles the number of clauses and T are kept fixed; however, s can vary as per user specification. The training procedure saves the raw TA file every time the training accuracy improves. If the accuracy and include profile, i.e., the number of includes, after the current epoch are found to be satisfactory, then the training process can be terminated. Otherwise, the user has the choice of continuing training or starting the next training cycle, which will require TA re-profiling based re-initialization of the TA. In Section VIII, we will see that the choice of s before starting the training cycle is crucial to achieve the desired model. Fig. 13(b) presents the overview of the entire REDRESS approach. All the datasets used in this paper for validation have gone through the same flow to obtain the optimized TM models.
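The control flow of the training cycles can be outlined as follows. This is a sketch under stated assumptions rather than the REDRESS implementation: `reprofile`, `train_epoch` and `is_satisfactory` are hypothetical user-supplied callables, the saved "raw TA file" is modelled as an in-memory snapshot, and each new cycle is assumed to start from the TA of the last epoch of the previous cycle.

```python
def run_training_cycles(ta_states, s_schedule, epochs_per_cycle,
                        reprofile, train_epoch, is_satisfactory):
    """Run REDRESS-style training cycles and return the best snapshot.

    All callables are user-supplied placeholders (hypothetical signatures):
      reprofile(states) -> states
      train_epoch(states, s) -> (states, accuracy, includes)
      is_satisfactory(accuracy, includes) -> bool
    The number of clauses and T are assumed fixed inside train_epoch.
    """
    best = {"accuracy": float("-inf"), "includes": None, "states": None}
    for s in s_schedule:
        # Each cycle starts from the previous cycle's TA after re-profiling.
        ta_states = reprofile(ta_states)
        for _ in range(epochs_per_cycle):
            ta_states, accuracy, includes = train_epoch(ta_states, s)
            if accuracy > best["accuracy"]:
                # Mirrors saving the raw TA file whenever accuracy improves.
                best = {"accuracy": accuracy, "includes": includes,
                        "states": list(ta_states)}
            if is_satisfactory(accuracy, includes):
                return best
    return best
```

Passing the s values as a schedule keeps the number of clauses and T fixed across cycles while letting s change from one cycle to the next, as the methodology describes.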
The same REDRESS training and inference methodology, shown in Fig. 13(b), is used with all five datasets (or benchmarks). Before training, however, we initialize the TA as excludes with a state value of 1 in all our experiments. The state value range used in our experiments is [1,400]. Table III presents the optimum TM architecture configuration obtained from TMASP runs for the different datasets. We compare REDRESS models with seven BNN models with the configurations shown in Table IV. We have used the default settings to train the BNN as provided […]
[Figure caption fragment: … Table III. […] the number of includes is particularly large compared to the other datasets and hence a different y-axis is used for representational purposes.]
This reinforces the choice of T = 10 for MNIST with the aim to reduce includes.
As mentioned in Section VI-B, the probability of state increment tends to 1 as s increases, implying that includes will increase with s. This is also demonstrated in Fig. 15, which presents the effect of s on the number of includes. For all datasets the includes increase with s. Fig. 16 presents the changes in the TA profile and accuracy when a TM model goes through cycles of the TA re-profiling based REDRESS training procedure. The experiments have been performed for all five datasets, with all parameters taken from Table III. Two experiments are performed for each dataset, one with decreasing and the other with increasing s, shown in the left and right columns of Fig. 16, respectively. In each REDRESS training cycle, the TM is trained for a variable number of epochs, generally fewer than 10, and the training starts from scratch by initializing all TA as excludes with state value = 1. The hyperparameter s is kept the same for three consecutive REDRESS training cycles, as shown in Fig. 16. We exit the current training cycle once the training results in improved accuracy. The accuracy shown in the figure is the test accuracy in the latest epoch. At the start of the next REDRESS training cycle, TA re-profiling is applied to initialize the TA. Thus, the TA file obtained from the previous training cycle is used in the next cycle after re-profiling.
For MNIST, the chosen values are s = {2.5, 5, 7.5, 10}. Within the three consecutive training cycles with the same s, we notice that better accuracy is achieved with a similar or smaller number of includes. The number of includes dips when s is reduced from 10 to 7.5 and rises when it increases from 2.5 to 5. This corroborates our observation in Fig. 15. An important observation in the decreasing s case can be seen when s is reduced from 5 to 2.5, which leads to a significant decrease in both the number of includes and the accuracy. This suggests that s = 2.5 is too small for this dataset. For the other experiments, we avoid s = 2.5 as they all demonstrate the same trend.
For FMNIST, KMNIST, CIFAR2 and KWS6, s ∈ [5,15]. For higher s values such as 12.5 and 15, we observe erratic behaviour in the case of decreasing s in FMNIST and KMNIST. This happens when the TM is unable to improve the accuracy while at the current optimal solution and hence the number of includes jumps to find a better solution. However, once the accuracy improves, this jump in includes can be mitigated at a later stage by decreasing s, as shown in Fig. 16. This demonstrates that a higher s can help approach a global optimum with satisfactory accuracy, after which the TA re-profiling based REDRESS training may be used with a smaller s to minimize the number of includes. Fig. 16 shows the effectiveness of our approach in finding an optimum model with high accuracy and a minimum number of includes. The three columns of subfigures in Fig. 17, each representing a training cycle, show the trends of feedback and includes before and after the two junctions at which TA re-profiling is applied. It shows the number of includes on the primary y-axis (left), feedback on the secondary y-axis (right), and accuracy (in the plot) at the TA re-profiling junctions. All TM parameters are kept constant in all three training cycles, with each cycle training for 10 epochs. Only the MNIST test case is provided here; the subfigures for CIFAR2, KWS6, FMNIST and KMNIST can be found in Fig. 1 […]
[Figure caption fragment: … Table III, are kept constant through the runs except s, which is changed after every three consecutive training cycles. The TA file obtained from the previous training cycle is used to initialize the TA in the next training cycle after re-profiling. A total of 12 training cycles, in the decreasing and increasing order of s, are shown for each dataset. It demonstrates how an optimal model can be obtained through the REDRESS training approach with changes in s. A high s can help identify a globally optimal solution, and decreasing s in the following training cycles can reduce the number of includes and fine-tune the model.]
[…] accuracy immediately after the TA re-profiling and gradually rises, producing better training accuracy with fewer includes in the case of CIFAR2 and KWS6, or with a similar number of includes in the case of MNIST, FMNIST and KMNIST. This demonstrates that TA re-profiling is a computationally minimal procedure to fine-tune the model to improve accuracy and reduce memory footprint.
Table V presents the inference results obtained from the microcontroller for the testing datasets of MNIST, FMNIST, KMNIST, CIFAR2 and KWS6. It compares seven different BNN models against the TM models obtained from REDRESS. All BNN models were trained for 100 epochs as per the parameters mentioned in Table IV. The parameters used in the TM models are mentioned in Table III. The memory footprint displayed in Table V consists of the model size together with the size of the source code. The model size in kilobytes (kB) includes coefficients in the case of BNN and encoded includes in the case of TM. The accuracy is measured over the testing dataset. The stabilized power measured during Conv-1 L and Conv-2 L inference was found to be 1.3 W, while for all other BNN and TM models it was measured to be 1.35 W. The difference can be attributed to larger model sizes leading to more data movement. The model size, excluding the source code size, for Conv-1 L and Conv-2 L ranges between 4-10 kB, while the minimum model size for the other models is >16 kB. Energy, reported in joules, is the product of inference time and power. The blue colored cells in Table V demarcate those instances where the BNN models outperform TM. As we can see, in most cases TM outperforms BNN, especially in energy and computation time, by orders of magnitude. The difference in accuracy for MNIST and CIFAR2 between the best performing BNN model and TM is < 1.2%. TM results in notably better test accuracies in the case of FMNIST, KMNIST and KWS6.
Overall, TM models demonstrate significantly better performance compared to the state-of-the-art BNN models. Table VI presents a comparison of the memory footprint of the TM model upon compression using the proposed Include-Encoding method. It shows that the REDRESS model occupies 86-367× less memory on the microcontroller compared with the entire TM model, if 1 B is used to store an automaton.
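The idea of storing only the information associated with includes, and evaluating clauses without decompression, can be illustrated with the sketch below. The exact Include-Encoding format and the bit-parallel inference algorithm are not described in this section, so the per-clause index list used here, the literal ordering (features followed by their negations) and the rule that a clause without includes outputs 0 are all assumptions made for illustration.

```python
def include_encode(ta_bits):
    """Keep only the positions of includes (1s) for each clause.

    ta_bits is a list of clauses, each a list of 0/1 decisions per literal.
    """
    return [[i for i, bit in enumerate(clause) if bit == 1] for clause in ta_bits]


def clause_outputs(encoded_clauses, literals):
    """Evaluate clauses directly on the encoded form.

    A clause fires only if every included literal is 1. Clauses with no
    includes are assumed to output 0 at inference time.
    """
    return [
        1 if indices and all(literals[i] == 1 for i in indices) else 0
        for indices in encoded_clauses
    ]


if __name__ == "__main__":
    features = [1, 0, 1]
    literals = features + [1 - x for x in features]  # features, then negations
    ta_bits = [
        [1, 0, 0, 0, 1, 0],  # x0 AND NOT x1
        [0, 1, 0, 0, 0, 0],  # x1
        [0, 0, 0, 0, 0, 0],  # no includes
    ]
    encoded = include_encode(ta_bits)
    print(encoded)                            # [[0, 4], [1], []]
    print(clause_outputs(encoded, literals))  # [1, 0, 0]
```

In this toy representation the stored size scales with the number of includes rather than with the number of automata, which is why a sparser TA profile translates into a smaller model and fewer operations per clause.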
A run-length encoding (RLE) based compression approach is presented in [38] to use TM in edge inference scenarios. Table VI presents a comparison of the memory footprints of compressed TA obtained using RLE and REDRESS relative to the original uncompressed TA in kilobytes. We can see that REDRESS considerably outperforms RLE, producing 81-354× more compression. It is possible that the TM models have many alternating include-exclude bits, which may be responsible for the relatively poor performance of RLE. The compression is performed after training is complete and the raw TA file with include-exclude decisions is available. Here, the same TA file goes through the RLE and REDRESS compression methods, and both techniques are lossless; hence, both have the same accuracy.
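The conjecture about alternating include-exclude bits can be made concrete with a generic run-length encoder. This is a textbook RLE sketch, not the specific scheme of [38], and the two bit patterns are synthetic examples rather than data from a trained TM.

```python
def run_length_encode(bits):
    """Encode a 0/1 sequence as a list of (bit, run_length) pairs."""
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1] = (bit, runs[-1][1] + 1)
        else:
            runs.append((bit, 1))
    return runs


if __name__ == "__main__":
    sparse = [0] * 12 + [1] + [0] * 11   # one include among 24 automata
    alternating = [0, 1] * 12            # include and exclude alternate
    print(len(run_length_encode(sparse)))       # 3
    print(len(run_length_encode(alternating)))  # 24
```

For the alternating pattern every run has length 1, so RLE stores as many runs as there are bits and achieves no reduction, whereas long runs of excludes compress well.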

IX. CONCLUSION
This article proposes how to significantly improve the sparseness of Tsetlin Machines (TMs), a recent and increasingly popular machine learning algorithm. After introducing the vanilla TM training and inference methodology to a wider audience, we present REDRESS, which includes an offline and online TM Architecture Search Paradigm (TMASP) for exploring model architectures and hyper-parameters on application-specific datasets. The proposed training methodology drastically increases the sparseness of TM models. We achieve this by re-profiling the underlying Tsetlin Automata that drive the learning, helping the TM search for a better solution with fewer includes (a more concise pattern representation). The sparse models then go through lossless compression using the so-called include-encoding, which distills the bare minimum information required to perform inference without loss in accuracy. Using the compressed-domain inference algorithm, the REDRESS TM outperforms seven BNN models on inference time, energy, and memory footprint across five benchmarks, viz. MNIST, FMNIST, KMNIST, CIFAR2 and KWS6. REDRESS further simplifies the vanilla TM clause computation by skipping unnecessary excludes, thereby classifying with fewer bitwise operations. In conclusion, the REDRESS approach demonstrates that highly sparse TMs yield improved accuracy while boosting computation speed and energy efficiency.