A Lightweight Deep Neural Network Using a Mixer-Type Nonlinear Vector Autoregression

The design of a lightweight deep learning model would be an ideal solution for overcoming resource limitations when implementing artificial intelligence at edge sites. In this study, we propose a lightweight deep neural network that uses a Mixer-type architecture based on nonlinear vector autoregression (NVAR), which we refer to as Mixer-type NVAR. We applied overlapping patch embedding to enrich the image input and the Sequencer architecture for vertical and horizontal operation inside the Mixer-type NVAR. We utilized a window partition technique and general quadratic positional encoding to increase the performance of the proposed model. Our model achieved a top-1 accuracy of 82.48% for the CIFAR-10 dataset with 0.159 M parameters and 98.36% for MNIST with 0.106 M parameters. Moreover, we evaluated its throughput on a central processing unit, which was 190.1 images per second for CIFAR-10 and 106.7 images per second for the MNIST dataset. These results are competitive with those of a state-of-the-art convolutional neural network-based model (MobileNetV3), MLP-Mixer, and the traditional reservoir-computing-based Mixer model under the same hyperparameter tuning.


I. INTRODUCTION
Implementing artificial intelligence (AI) at the edge site offers promising advantages for tackling problems in communication between end users and data management, such as connection bottlenecks to the cloud, bandwidth limitations, scalability, and privacy [1], [2], [3], [4]. However, the implementation of AI at the edge site, or edge intelligence, comes with its own limitations, especially in resource availability such as power and storage. Consequently, its implementation remains challenging: even though present end-user devices appear to have high performance, they still lack support for AI model implementation [5].
Deep-learning technology has the potential to yield lightweight models that can be implemented at edge sites, especially for vision tasks. We start with models based on convolutional neural networks (CNNs), such as MobileNet [6] and EfficientNet [5], [7]. The latest MobileNet (V3) achieved a top-1 accuracy of 92.80% with 4.21 M parameters [8]. Moreover, EfficientNetV2 exhibited superior performance on a dataset of tiny images, achieving a top-1 accuracy of 99% with 121 M parameters [5]. Besides these CNN-based models, Transformer-based models such as the Vision Transformer (ViT) offer higher accuracy in vision-based tasks [9]. Implementation of ViT at the edge site requires further study owing to the model's complexity and the number of parameters related to the training and testing costs, so a CNN-based model is currently a better solution for implementation at the edge site. (The associate editor coordinating the review of this manuscript and approving it for publication was Mehdi Sookhak.)
One particularly promising deep-learning model is MLP-Mixer. This model performs well in vision tasks without the convolutions of CNN-based models or the self-attention of Transformer-based models [10]. Taking advantage of the Transformer design, MLP-Mixer replaces multi-head attention (MHA) with stacks of MLP layers for token and channel mixing. Since token mixing is sensitive to spatial information [11], MLP-Mixer captures global translation without local operation. However, this model achieves its high accuracy with a large number of parameters, which is an important consideration for implementation at an edge site.
Furthermore, models based on recurrent neural networks (RNNs), which have proven effective at time-series tasks, also have the potential to be implemented at edge sites. RNN-based models also offer lighter and less complex operation. The deep long short-term memory (LSTM) in the Sequencer architecture [12] is an RNN-based model built for image classification. Sequencer applies two bidirectional LSTMs (BiLSTMs) as the building blocks replacing the MLP layer in the MLP-Mixer architecture. These BiLSTMs work in the vertical and horizontal directions as parallel operations. In short, Sequencer is able to achieve higher efficiency than Transformer-based models and MLP-Mixer. However, the number of parameters used in the Deep Sequencer model is 54 M, which is large compared with CNN-based models and MLP-Mixer.
Reservoir computing (RC) models are strong candidates for implementation at the edge site. Unlike RNN-based models, the neural connections in RC are fixed and randomly generated, enabling a faster training process compared with fully trained RNN-based models [13]. In [14], RC was successfully implemented for the vision task. The researchers in that study applied an echo state network (ESN), a type of RC that uses the nonlinear dynamics of the reservoir and a linear layer to recognize the output. In this setting, the ESN achieved a top-1 accuracy of 99.07% with 4000 nodes in the reservoir. In 2021, [15] introduced a new RC concept called next-generation reservoir computing (NGRC). NGRC proved that RC is mathematically equivalent to nonlinear vector autoregression (NVAR). Their results showed that NGRC is faster than traditional RC owing to its fewer fit parameters and thus its smaller feature vector size. Extending its implementation to the vision task has potential, following what was done with previous RC models. This model provides an optimized solution for implementation at the edge site, especially for enlarged applications such as image classification.
In the present study, we propose a lightweight deep neural network (DNN) model with the goal of achieving reasonable classification accuracy with fewer parameters than current state-of-the-art models. We adopt token and channel mixing by taking advantage of the MLP-Mixer and Sequencer architectures, allowing us to achieve a robust and lightweight model. We used NVAR to replace the MLP in the MLP-Mixer layer and the BiLSTM in the Deep Sequencer. The contributions of this work can be summarized as follows.
• An overlapping patch was implemented in the patch embedding to enrich the feature vectors from image input.
• Mixer-type NVAR utilized only one block of the vertical NVAR Sequencer-Mixer and one block of the horizontal NVAR Sequencer-Mixer for the token and channel mixing.
• We used a window partition [16], [17] and general quadratic positional encoding (GQPE) [16] to improve the proposed model performance by performing local operations inside the Mixer layer.
The rest of this paper is structured as follows. Section II is an overview of related work, and Section III is an outline of our model's architecture and its components. In Section IV, we present the evaluation conditions, the results, and a discussion of them. Section V presents our conclusions and future work.

II. RELATED WORK
A lightweight DNN model is a potential solution for overcoming resource limitations when implementing AI at edge sites. This is especially the case in vision tasks, for which light CNN-based models such as MobileNet and EfficientNet are the current state-of-the-art deep learning models. MobileNet was proposed for mobile applications and uses depthwise separable convolutions, and MobileNetV3 is an extension of MobileNetV2 [6] that applies the new h-swish nonlinearity, which is faster in computation and easier to quantize. MobileNetV3 also uses the squeeze-and-excitation module in the residual layer to adaptively recalibrate features. EfficientNet also offers good performance in vision tasks for mobile applications [7]. As an improvement over the previous version, EfficientNetV2 supports a faster and smaller model for image recognition [5]. EfficientNetV2 also uses adaptive regularization, which can be adjusted depending on the image size, thus enabling a faster training time. Since both MobileNetV3 and EfficientNetV2 have been proposed for mobile applications and have successfully demonstrated good performance in the vision task, they can both potentially be implemented in edge devices. However, these model architectures still require a large number of parameters.
MLP-Mixer, a newer deep-learning model architecture for vision tasks, applies MLP blocks consisting of two linear layers, a GELU activation function, and a fully connected layer as a classifier [10]. In general, the architecture of MLP-Mixer is similar to that of ViT. A patch-embedding layer is applied to divide and embed image inputs using linear projection, which are then fed into MLP blocks. The first MLP block is responsible for token mixing and the second for channel mixing. Both the MLP-Mixer and ViT models [9] use non-overlapping patches as input tokens. Consequently, MLP-Mixer relies only on the global information captured within each patch, and as a result it may lack important information from the edges of the patches, especially for tiny images like those in the CIFAR dataset.
Sequencer, a new architecture among RNN-based models, is also an optimized solution for enhancing model efficiency and decreasing the training time in vision tasks [12]. Sequencer adopts the architecture of ViT and applies a BiLSTM to replace the MHA block. The BiLSTM is constructed in two parts: a vertical BiLSTM and a horizontal BiLSTM. Since the vertical and horizontal BiLSTMs operate in parallel, the sequence length is reduced, thus decreasing the time involved in training and the inference process. Moreover, Sequencer is flexible with regard to the image input resolution, so it preserves accuracy if the image resolution differs from that of the training process. Since Sequencer uses recursion and memory saving to mix spatial information, many parameters are required to build the model architecture: Sequencer requires 54 M parameters for the large Sequencer2D model. Similar to MLP-Mixer, Sequencer also applies a non-overlapping image input, and thus may lack important features for tiny images.
As an RNN-based model, RC has been successfully implemented in time-series tasks. It offers a faster and lighter training process by applying a fixed and randomly generated network for extracting features. The ESN, as a type of RC, also has fixed random connections in its network architecture [18], [19]. Compared with a traditional RNN, the ESN has a less complex training process and is robust against noise and overfitting. After the implementation of RC was extended to vision tasks, the ESN was successfully used to perform the recognition task on the MNIST dataset [14]. Additionally, the ESN can handle high-dimensional input owing to its random connections, but its application to vision tasks still needs to be improved. The researchers in [13] and [15] demonstrated that RC is equivalent to NVAR, which has thus been referred to as next-generation reservoir computing. Their NVAR model performed better while reducing the training data and testing time. However, implementation of this model is still limited to time-series tasks, such as forecasting the short-term dynamics and long-term climate of chaotic systems.
In the present study, we applied the Mixer and Sequencer architecture processes by implementing NVAR for token and channel mixing. We utilized the CIFAR-10, MNIST, Fashion MNIST, and EMNIST datasets to evaluate our proposed model in vision tasks and to assess its performance on tiny images.

III. PROPOSED LIGHTWEIGHT DNN MODEL USING A MIXER-TYPE NVAR

A. OVERVIEW
In this section, we describe our proposed lightweight DNN model, which applies a Mixer-type construction and a Sequencer in its architecture. As shown in Fig. 1(a), we adopted the architectures of MLP-Mixer and Sequencer as the backbone of our model. We applied the overlapping patch to generate tokens from image inputs before feeding them to the Mixer blocks and a pooling layer (global average pooling). In the Mixer block, we constructed the vertical NVAR Sequencer-Mixer and the horizontal NVAR Sequencer-Mixer, which mix the tokens and channels. At the end of the architecture, a fully connected layer is used as a classifier for image classification. A normalization layer normalizes the outputs of patch embedding before feeding them to the NVAR Sequencer-Mixer block. Our model also applies skip connections to preserve the information that is introduced into the Mixer. Fig. 1(b) shows the construction of the Sequencer that builds the Mixer blocks; it is here that the vertical and horizontal Sequencers mix the token and channel information. Fig. 1(c) shows how the NVAR model is used in the Sequencer as the base Mixer layer. Additionally, a window partition is used to support the local operation inside the Mixer block. Furthermore, we applied GQPE as a type of relative positional encoding (RPE) to enhance the token and channel mixing.
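The overall flow described above can be sketched at the shape level as follows. This is a minimal numpy illustration in which the layer internals are stubbed with random projections; the function names, dimensions, and mixing stubs are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def overlap_patch_embed(img, kernel=3, stride=2, pad=1, dim=32):
    # img: (H, W, C_in) -> token grid (h, w, dim) via overlapping windows
    H, W, C = img.shape
    img = np.pad(img, ((pad, pad), (pad, pad), (0, 0)))
    h = (H + 2 * pad - kernel) // stride + 1
    w = (W + 2 * pad - kernel) // stride + 1
    proj = rng.standard_normal((kernel * kernel * C, dim)) * 0.02
    tokens = np.empty((h, w, dim))
    for i in range(h):
        for j in range(w):
            win = img[i*stride:i*stride+kernel, j*stride:j*stride+kernel]
            tokens[i, j] = win.reshape(-1) @ proj
    return tokens

def mixer_block(x):
    # stand-ins for the vertical and horizontal NVAR Sequencer-Mixer blocks:
    # mix tokens along the height axis, then the width axis, each with a
    # skip connection
    h, w, d = x.shape
    Tv = rng.standard_normal((h, h)) * 0.02
    Th = rng.standard_normal((w, w)) * 0.02
    x = x + np.einsum('ij,jwc->iwc', Tv, x)
    x = x + np.einsum('ij,hjc->hic', Th, x)
    return x

def classify(img, num_classes=10):
    t = overlap_patch_embed(img)          # (h, w, dim) token grid
    t = mixer_block(t)                    # token/channel mixing
    pooled = t.mean(axis=(0, 1))          # global average pooling
    head = rng.standard_normal((t.shape[-1], num_classes)) * 0.02
    return pooled @ head                  # class logits

logits = classify(rng.standard_normal((32, 32, 3)))
print(logits.shape)  # (10,)
```

The single Mixer block with a final pooled linear head mirrors the paper's choice of one vertical and one horizontal Sequencer-Mixer block only.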

B. NONLINEAR VECTOR AUTOREGRESSION
Our model is primarily based on NVAR, which is used as the base of the Mixer layer to replace the MLP in MLP-Mixer and the BiLSTM in the Deep Sequencer LSTM. We utilized one block of NVAR [15] in the vertical and horizontal Mixer blocks, as shown in Fig. 1(b) and (c). Since NVAR is mathematically equivalent to RC, it can serve as a more optimized solution compared with the state-of-the-art CNN and MLP models. In addition to its robustness to noise and overfitting, NVAR performs a simple operation with a smaller number of parameters, making it suitable for overcoming the resource limitations of edge site implementation. As explained in [15], NVAR, as the next generation of reservoir computing, utilizes a constant (c) and both a linear and a nonlinear portion of the feature vector, as shown in (3).
An illustration of the total output of NVAR is given in Fig. 2. The linear features shown in (1) consist of the input vector X at the current time step i and at the k − 1 previous steps, each separated by s steps; (s − 1) is the number of steps skipped between consecutive inputs, and k is the number of time-delay taps. The nonlinear feature vector (O_nonlin,i) in (2) is obtained by taking the tensor product of the linear feature vector with itself. As an important property, this model requires only s × k time steps at the starting point of feature vector processing. We also applied a polynomial as the nonlinear function to obtain the nonlinear output from the linear feature vector of NVAR, as in (4), where p is the order of the polynomial feature vector.
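As a concrete illustration, the feature construction described above (a constant c, the delayed linear inputs, and the unique degree-p monomials) can be sketched as follows. The function name and signature are hypothetical, but the construction follows the NVAR recipe of [15].

```python
import numpy as np
from itertools import combinations_with_replacement

def nvar_features(X, i, k=2, s=1, p=2, c=1.0):
    """Hypothetical NVAR feature vector at time step i.

    Linear part: the input at step i plus k-1 earlier inputs,
    each separated by s steps. Nonlinear part: all unique
    degree-p monomials of the linear part (tensor-product terms
    without duplicates).
    """
    # linear features: X[i], X[i-s], ..., X[i-(k-1)*s] concatenated
    lin = np.concatenate([X[i - j * s] for j in range(k)])
    # nonlinear features: unique products of p entries of the linear part
    nonlin = np.array([np.prod(lin[list(idx)])
                       for idx in combinations_with_replacement(range(lin.size), p)])
    return np.concatenate([[c], lin, nonlin])

X = np.arange(12, dtype=float).reshape(6, 2)   # 6 time steps, 2-dim input
f = nvar_features(X, i=3, k=2, s=1, p=2)
# linear part has k*d = 4 entries; quadratic part has 4*5/2 = 10
print(f.size)  # 1 + 4 + 10 = 15
```

This also shows why the feature vector stays small: with k = 2, s = 1, and p = 2 (the settings used in the evaluation), a d-dimensional input yields only 1 + 2d + 2d(2d+1)/2 features.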

C. NVAR SEQUENCER-MIXER
In the implemented Mixer architecture, we utilized a normalization layer, the Sequencer architecture for token and channel mixing, and a residual connection. As mentioned above, we replaced the MLP blocks in MLP-Mixer with NVAR as the base model. As Fig. 1(b) shows, the primary process in our Mixer architecture is composed of the vertical and horizontal Sequencer-Mixer blocks for token and channel mixing. In these Mixer blocks, the four-dimensional inputs (batch size × height × width × channel, or B × h × w × C) are split into two two-dimensional outputs, (h × C) and (w × C). The vertical Sequencer-Mixer traverses the vertical axis while conducting token and channel mixing of the (h × C) input feature, while the horizontal Sequencer-Mixer shuffles the horizontal axes of (w × C). We also utilized the layer norm and a skip connection for each Mixer block input before feeding them to the Sequencer parts, and we utilized a window partition that has previously been proven to work well in [16], [20], and [17]. The window partition crops the output of the linear layer into several small windows and captures the local information within each window.
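A minimal sketch of the two operations described above, the vertical/horizontal sequence split and the window partition, might look as follows. The data layout and function names are our assumptions; the window partition follows the common Swin-style reshape.

```python
import numpy as np

def window_partition(x, win):
    """Split an (h, w, C) token map into non-overlapping (win, win, C)
    windows so that mixing acts locally inside each window."""
    h, w, C = x.shape
    x = x.reshape(h // win, win, w // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def sequencer_split(x):
    """Per-column vertical sequences of shape (h, C) and per-row
    horizontal sequences of shape (w, C), which the vertical and
    horizontal Sequencer-Mixer blocks can process in parallel."""
    h, w, C = x.shape
    vertical = [x[:, j, :] for j in range(w)]    # w sequences of (h, C)
    horizontal = [x[i, :, :] for i in range(h)]  # h sequences of (w, C)
    return vertical, horizontal

tokens = np.zeros((8, 8, 16))
v, hseq = sequencer_split(tokens)
wins = window_partition(tokens, win=4)
print(len(v), v[0].shape, wins.shape)   # 8 (8, 16) (4, 4, 4, 16)
```

The split is what drops the Mixer input from a four-dimensional tensor to two-dimensional (h × C) and (w × C) slices, which is the source of the FLOPs savings discussed later.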
To enhance the performance of the Sequencer, we applied GQPE [16] as an additional feature in the Sequencer portion. This positional encoding supplies the token-position information that the Transformer treats as essential, thus enriching the feature data [21]. This local operation reduces the computational complexity in the Sequencer-Mixer blocks, enabling a lighter and faster DNN model.

D. OVERLAP
We applied the overlapping technique to obtain beneficial spatial information on tiny datasets, such as the CIFAR-10 dataset. Overlapping can provide local continuity of an input image in the patch embedding process [22]. The overlap-filtered window can also provide additional neighboring information, such as the edge information from the patches, which might contain essential features. Additionally, some information might be lost during the filtering process, especially for tiny images that are easily misinterpreted. In [23], it was shown that slight information exchanges between neighborhood windows increase the model's performance. An illustration of the overlapping patch is given in Fig. 3. Consider, for example, a 7 × 7 image input with a 3 × 3 kernel, stride 2, and padding 1 applied for patch division. The patches overlap by the pixels stridden over in both directions (horizontal and vertical) for each image input. Consequently, we obtain a number of patches in addition to those from non-overlapping patching. Overlapping also enriches the information by continuously capturing the local information in the image input and collecting the neighboring information from patches, as shown in the generated patches in Fig. 3.
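The patch-count arithmetic of this example can be checked with the standard sliding-window output formula; the padding chosen for the non-overlapping baseline below is our assumption.

```python
def num_patches(size, kernel, stride, pad):
    """Patches per spatial axis for a sliding-window patch embedding:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# the 7x7 example from the text: 3x3 kernel, stride 2, padding 1
n = num_patches(7, kernel=3, stride=2, pad=1)
print(n * n)   # 16 overlapping patches

# a non-overlapping split of the same image (stride == kernel)
m = num_patches(7, kernel=3, stride=3, pad=1)
print(m * m)   # 9 patches
```

With stride smaller than the kernel, adjacent windows share pixels, which is exactly the neighboring-edge information the overlapping embedding is meant to retain.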

E. POSITIONAL ENCODING
GQPE is a type of RPE that adds encoding based on time lags [24]. When implemented in the manner of [16], GQPE provides the positional information that is lacking in channel mixing. We applied GQPE in our model specifically to tackle this problem in our Mixer block, especially in the horizontal Sequencer performing as a channel-token Mixer. Since GQPE offers O(1) token-mixing complexity [16], this benefit might increase time efficiency, thus reducing operation time for both training and inference. The learnable vector v and the relative positional encoding r_δ are defined as in [16] and [24], where δ = (δ_x, δ_y) is the relative position, ∆ controls the displacement of the distribution center relative to that belonging at position p_i, and α controls the distribution function of the learnable vector.
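The quadratic form of this encoding can be sketched numerically as follows. The score expression below follows the quadratic positional encoding of [16] (a Gaussian-like bump peaked at a learnable center ∆ with width set by α); the exact variant used in the model may differ, so treat the function as an illustrative assumption.

```python
import numpy as np

def gqpe_scores(size, center, alpha):
    """Quadratic positional scores over relative offsets delta in
    [-size, size]^2: score(delta) = -alpha * ||delta - center||^2,
    peaked at the learnable center and decaying quadratically."""
    dy, dx = np.mgrid[-size:size + 1, -size:size + 1]
    return -alpha * ((dx - center[0]) ** 2 + (dy - center[1]) ** 2)

s = gqpe_scores(size=3, center=(1.0, 0.0), alpha=0.5)
# the maximum score sits exactly at offset delta = center = (1, 0)
iy, ix = np.unravel_index(s.argmax(), s.shape)
print(ix - 3, iy - 3)  # 1 0
```

Because the score depends only on the offset δ and two learnable scalars per head, the table can be precomputed once, which is consistent with the O(1) token-mixing cost cited above.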

IV. EVALUATION

A. DATASET

1) CIFAR-10 DATASET
Since [25] introduced the CIFAR-10 dataset, many deep learning models [5], [6], [10], [26] have applied it to vision-based tasks. The CIFAR dataset contains the CIFAR-10 dataset of ten classes of tiny images and the CIFAR-100 dataset of one hundred classes [27]. These datasets provide both training and testing images. In the present study, we applied the CIFAR-10 dataset to assess how our model works on tiny images. This dataset provides 50,000 images for training and 10,000 for testing, spread evenly over ten classes.

2) MNIST DATASETS
We also utilized the MNIST datasets [28], [29] in this study, of which there are three: MNIST, Fashion MNIST, and EMNIST. The MNIST and Fashion MNIST datasets have ten classes of images, while the EMNIST dataset has 64 classes. These datasets are composed of one-channel grayscale images with a 28 × 28 resolution. The MNIST and Fashion MNIST datasets provide 60,000 images for training and 10,000 images for testing, while EMNIST offers 814,255 images [30]. These datasets differ in the characters that they contain. We applied MNIST to classify ten classes of digits, Fashion MNIST to classify ten classes of clothing images, and EMNIST to classify 64 classes of handwritten characters.

B. EVALUATION CONDITIONS
In evaluating the performance of the proposed model, we compared it against several lightweight models: traditional RC and NVAR as RNN-based models, MobileNetV3 as a CNN-based lightweight model, and MLP-Mixer itself. Specifically, we used traditional RC and NVAR in the Mixer architecture, as shown in Fig. 4, the goal being to provide a fair comparison with our overlapping Mixer-type NVAR. Each model was tested using the CIFAR-10 and the MNIST family of datasets with the same settings for the hyperparameters and the training and testing environments. We used 200 reservoir nodes for RC-Mixer, and k = 2, s = 1, and p = 2 for the NVAR-based models. We applied the small architecture of MobileNetV3 [6] as the CNN-based model. The MLP-Mixer architecture and configuration used in this study were provided by the TIMM library [31] and are based on [10]. We utilized a separated dataset for training, which consists of the training and validation datasets, and the testing dataset for the inference process. After training our proposed model using the hyperparameters listed in Table 2, we tested it and the benchmark models on the M1 processor, the specifications of which are given in Table 1. We calculated the top-1 accuracy, the number of parameters, and the floating-point operations (FLOPs) usage for each model with each dataset. Additionally, we calculated the throughput so that we could evaluate model performance in the inference process.
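Throughput (images per second) can be measured with a simple timed loop such as the following; the harness, the warm-up count, and the toy stand-in model are our assumptions, not the authors' benchmarking code.

```python
import time
import numpy as np

def throughput(model, batch, n_iters=20, warmup=3):
    """Images/second on CPU: run a few warm-up passes, then average
    timed forward passes over n_iters iterations. `model` is any
    callable taking a batch of images."""
    for _ in range(warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(n_iters):
        model(batch)
    elapsed = time.perf_counter() - start
    return n_iters * batch.shape[0] / elapsed

# toy stand-in model: a single linear projection over flattened images
W = np.random.default_rng(0).standard_normal((3 * 32 * 32, 10)) * 0.01
model = lambda x: x.reshape(x.shape[0], -1) @ W
imgs = np.zeros((64, 3, 32, 32))
print(throughput(model, imgs) > 0)  # True
```

Warm-up iterations matter here because first-call costs (cache population, lazy allocation) would otherwise bias the images-per-second figure downward.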

C. ABLATION STUDIES
We performed ablation studies before arriving at our final model, the details of which are given in Table 3. The ablation studies validated the significance of the techniques applied in our proposed model. Starting from the state-of-the-art RC model that was implemented in the vision task, we applied ESN as the base model for RC. This is compared with the new generation of RC (NVAR) in the Mixer architecture shown in Fig. 4(a); Fig. 4(c) and (d) correspond to the RC-Mixer and NVAR-Mixer layers, respectively. First, we applied the traditional RC and NVAR layers in the Mixer architecture without overlapping in the patch embedding, the Sequencer, or GQPE with a window partition in the Mixer block. The results for RC-Mixer and NVAR-Mixer for each dataset with no additional techniques in the Mixer layer are shown in Table 4. Second, after confirming the results for the traditional RC-Mixer and NVAR-Mixer, we applied the overlapping patch of the image input to the patch embedding block for the NVAR Sequencer-Mixer model. As a result, the NVAR-Mixer architecture with the Sequencer and the overlapping patch embedding increased model performance by up to 3% for the CIFAR-10 dataset and 8% for the MNIST and Fashion MNIST datasets. Third, to validate the effectiveness of GQPE with a window partition, we applied GQPE to the NVAR Sequencer-Mixer with both overlapping and non-overlapping patch embedding, as shown in Table 3. The accuracy of the NVAR Sequencer-Mixer with GQPE improved significantly.
As can be seen in Table 4, accuracy improved when we used NVAR-Mixer with the vertical and horizontal Sequencers in the Mixer block, demonstrating their effectiveness for token and channel mixing. The inference time was also reduced significantly. Our proposed model using the Sequencer-Mixer architecture achieved higher throughput compared with the traditional RC-Mixer and NVAR-Mixer. The NVAR Sequencer-Mixer also reduced the number of parameters and the FLOPs usage. This demonstrates that the Sequencer architecture applied in the Mixer block increased resource usage efficiency during the computation of the token and channel input. Even though the throughput decreased slightly when using GQPE with the window partition, the result was still reasonable. Based on the results of the ablation studies, we chose to adopt the overlapping technique for the NVAR Sequencer-Mixer as our proposed model. We also applied GQPE with the window partition to enhance the performance of token and channel mixing in the Mixer block.

D. RESULTS
Upon conclusion of the ablation studies, we performed experiments with the state-of-the-art models using the same set of hyperparameters for each dataset and the same testing environment. The results are summarized in Table 5. Using the same hyperparameters for the CIFAR-10 dataset in the training process, our model showed a competitive result compared with MobileNetV3, a CNN-based model, and MLP-Mixer. It achieved a top-1 accuracy of 82.48% on the CIFAR-10 dataset, while the state-of-the-art model achieved a top-1 accuracy of 87%. As mentioned previously, the overlapping patch improved model accuracy, especially for tiny images like those in the CIFAR-10 dataset. For the MNIST dataset, the top-1 accuracy of our proposed model was close to that of the state-of-the-art models: our model achieved 98.36%, MobileNetV3 achieved 98.78%, and MLP-Mixer achieved 98.24%. Our model also showed good performance with the Fashion MNIST dataset, achieving a top-1 accuracy of 89.87%, while MobileNetV3 and MLP-Mixer achieved 93.49% and 91.24%, respectively. Last is the EMNIST dataset, for which our model again exhibited good performance: a top-1 accuracy of 84.15%, close to MobileNetV3 with 86.92% and MLP-Mixer with 85.68%.
Additionally, in our attempts to develop a lightweight model for edge implementation, we also evaluated the number of parameters, the FLOPs usage, and the throughput. We found that the number of parameters and the FLOPs used by our model were significantly smaller than those of MobileNetV3 and MLP-Mixer when using the CIFAR-10 dataset, even compared with the traditional RC and NVAR Mixers. Table 5 confirms that our model achieved a higher throughput than MobileNetV3 and MLP-Mixer for each dataset. In the MNIST family of datasets, the parameter counts differed slightly, but they were still smaller than those of the other state-of-the-art models.

E. DISCUSSION
Our goal was to develop a lightweight model with competitive performance that could overcome the limitations in resource allocation in the edge site.Therefore, the number of parameters, the FLOPs usage, and the model performance in inference are the highlights of this discussion.

1) PARAMETERS AND FLOPS USAGE
As shown in Tables 4 and 5, we compared our proposed model with the traditional RC and NVAR in the Mixer architecture, and with the state-of-the-art models MobileNetV3 and MLP-Mixer. Our model outperformed both of the traditional models and showed competitive performance compared with MobileNetV3 and MLP-Mixer under identical conditions. The number of parameters in our model decreased significantly, as shown in Table 6: the NVAR Sequencer-Mixer used 0.104 times the number of parameters used by MobileNetV3. Since the Mixer block becomes the main block in the Mixer architecture using the vertical and horizontal Sequencers, our model used only one Sequencer-Mixer block for token and channel mixing, which led to a reduction in parameter usage.
We also calculated the number of FLOPs used in inference. Since FLOPs quantify the computational complexity of a model [32], we can evaluate the efficiency of our model by counting them. FLOPs usage in our model was reduced significantly. As mentioned previously, the NVAR model in the Sequencer-Mixer layer, shown in Fig. 1(c), used two-dimensional tensors as inputs, (h × C) for the vertical Sequencer and (w × C) for the horizontal Sequencer, thus reducing the tensor input size from four dimensions. Additionally, NVAR applied only linear and nonlinear projections to the two-dimensional input. This operation reduced the complexity inside the Mixer blocks. Using the same hyperparameters and datasets, our model used only 0.556 times the FLOPs of MobileNetV3 for CIFAR-10 and 1.533 times for the MNIST datasets. Despite the number of FLOPs used in MobileNetV3 being smaller than in our proposed model for the MNIST datasets, the number used by our model was still reasonable compared with MLP-Mixer. The overlapping patch from the image input also increased the number of patches compared with the non-overlapping patch. However, adjusting the kernel size, stride, and padding values appropriately did not significantly influence the number of parameters, especially for tiny images. Table 4 shows that the number of parameters remained identical for each dataset regardless of whether overlapping or non-overlapping patches were used. Regarding FLOPs usage, the NVAR Sequencer-Mixer model with GQPE and a window partition consumed the same number of FLOPs for inference with and without overlapping patches. However, the FLOPs varied slightly when the proposed model used the window partition and GQPE inside the Mixer layer: FLOPs usage increased from 0.021 G to 0.023 G for the MNIST datasets. As stated previously, the window partition and GQPE are performed as
local operations inside the Mixer block and can contribute to the number of FLOPs used. Despite the slight increase in the number of FLOPs, the results overall tell us that our model is promising as a DNN model for edge implementation.
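A back-of-the-envelope FLOPs estimate for the two Sequencer-Mixer projections can be written as follows; the token-grid dimensions and the 2-FLOPs-per-multiply-add convention are our assumptions for illustration.

```python
def linear_flops(in_dim, out_dim, tokens):
    """Rough FLOPs for a dense projection applied per token, counting
    one multiply and one add (2 FLOPs) per weight per token."""
    return 2 * tokens * in_dim * out_dim

# a hypothetical 16x16 token grid with 64 channels: the vertical and
# horizontal passes each apply a C x C projection to every token
h, w, C = 16, 16, 64
total = linear_flops(C, C, h * w) + linear_flops(C, C, h * w)
print(total)  # 4,194,304 FLOPs for both passes
```

Estimates of this kind show why operating on (h × C) and (w × C) slices rather than the full four-dimensional tensor keeps the per-block cost low.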

2) MODEL PERFORMANCE
As previously stated, the NVAR had a smaller number of parameters, and it used nonlinear mapping to transform the input data into a high-dimensional space. Using NVAR inside the Sequencer-Mixer layer reduces computational complexity. We obtained a smaller number of parameters with competitive performance and a faster inference process. As shown in Table 4, our model improved accuracy compared with the traditional RC-Mixer and the NVAR-Mixer without the Sequencer construction, increasing to 82.48% for CIFAR-10 and 98.36% for the MNIST dataset. Moreover, the overlapping patch of the image input also increased the performance of our model; it achieved greater accuracy than the traditional RC-Mixer and NVAR Sequencer-Mixer models without overlapping input. The top-1 accuracy of our model for tiny images showed that the overlapping patch enriches the feature map, thereby providing essential information for enhancing model performance in image classification. Additionally, compared with the state-of-the-art models shown in Fig. 5(a), (b), (c), and (d) for each dataset, our proposed model performed well with a smaller number of parameters.
Additionally, Mixer architectures such as MLP-Mixer are linear in model complexity, in contrast to Transformer-based models, which are quadratic [11]. We therefore evaluated the throughput of our proposed model, defined as the number of images per second during inference. As stated previously, overlapping patches improved the accuracy of the model, as well as its throughput. Another feature that improved the throughput of the model was the use of a window partition and GQPE; both of these reduced the processing time owing to their lower complexity [16]. Table 4 shows that the throughput of our model surpassed that of RC-Mixer and NVAR-Mixer, reaching 190.1 images per second for the CIFAR-10 dataset and 111.4 for the EMNIST dataset, the highest among the MNIST datasets. The slight increase in FLOPs usage in our model when using a window partition and GQPE also increased the computation time. Compared with the state-of-the-art models shown in Table 6, our model outperformed the throughput of MobileNetV3 and MLP-Mixer for the CIFAR-10 dataset. For the MNIST datasets, the throughput of the proposed model was still competitive.

V. CONCLUSION AND FUTURE WORK
In this study, we proposed a lightweight DNN model that adopts the Mixer and Sequencer architectures. We utilized overlapping for the image input processing in the patch embedding. We developed a Mixer block with an NVAR layer to replace the MLP layer in the MLP-Mixer and the BiLSTM layer in the Deep Sequencer, respectively. Additionally, GQPE and a window partition were applied to enhance the performance of the local process inside the Mixer block. Our model showed a significant improvement in accuracy, number of parameters, FLOPs usage, and throughput compared with the traditional RC-Mixer and NVAR-Mixer models. It achieved a top-1 accuracy of 82.48% for the CIFAR-10 dataset with 0.159 M parameters and 98.36% for MNIST with 0.106 M parameters. Its throughput on a central processing unit was 190.1 images per second for CIFAR-10 and 106.7 images per second for the MNIST dataset. These results were also competitive with MobileNetV3 and the MLP-Mixer model. The proposed method may be an optimized solution for implementation at the edge site, as its lower complexity and smaller parameter usage are beneficial for tackling edge limitations.
Nevertheless, further study is required to improve our model. First, regarding the hyperparameter usage in building the model architecture, further experiments with a varying number of parameters are required to achieve better model performance. Second, this study used CIFAR-10 and the MNIST family of datasets, which consist of tiny, low-resolution images, so we need to expand the implementation to a larger dataset of high-resolution images. Using a larger and higher-resolution dataset will allow us to further assess the effectiveness of the window partition in enhancing local computation efficiency inside the Mixer block.

FIGURE 1. (a) Main workflow for performing image classification using a Mixer-type NVAR. An overlapping patch of the image input is utilized in the patch embedding process and then fed to the Mixer block. (b) The Mixer block consists of vertical and horizontal Sequencer NVAR blocks for token and channel mixing, respectively. (c) The NVAR model is the main portion of the Sequencer in the Mixer layer. After passing the linear layer, the output of the NVAR is split by a window partition, and GQPE is applied as relative positional encoding. The Mixer-type NVAR uses a fully connected layer as a classifier at the end of the architecture.

FIGURE 2. Illustration of the linear and nonlinear inputs of NVAR.

FIGURE 4. (a) Illustration of the Mixer block for token and channel mixing. (b), (c), and (d) are the Mixer layers used in the study: (b) for the MLP-Mixer model, and (c) and (d) for the RC-Mixer and NVAR-Mixer models, respectively.

Fig. 4(b), (c), and (d) are the respective Mixer layers. Since we utilized the overlap techniques in the patch embedding, we also studied the non-overlapping version of the Mixer-type NVAR.

TABLE 4. Results of the ablation studies.

TABLE 5. Results of image classification.