Differentiable Neural Architecture, Mixed Precision and Accelerator Co-Search

Quantization, effective Neural Network architecture, and efficient accelerator hardware are three important design paradigms to maximize accuracy and efficiency. Mixed Precision Quantization is the process of assigning different precisions to different Neural Network layers for optimized inference. Neural Architecture Search (NAS) is the process of automatically designing a neural network for a task and can also be extended to search for the precision of each weight and activation matrix. In this paper, we develop the following three methods: (i) a Fast Differentiable Hardware-aware Mixed Precision Quantization Search method to find optimal precisions, (ii) a Joint Differentiable Hardware-aware Architecture and Mixed Precision Quantization Co-search, and (iii) a Joint Accelerator, Architecture, and Precision triple co-search that jointly explores all three dimensions. We demonstrate the effectiveness of our proposed methods by searching mixed precision MobilenetV2 models targeting the Bitfusion accelerator. Our searched models achieve a better accuracy-latency trade-off than manually designed models and previously proposed search methods.


I. INTRODUCTION
Convolutional Neural Networks (CNNs) have attained remarkable results on a wide range of tasks, such as Image Classification [1] and Super Resolution [2], [3], [4], and are often deployed in various real-time scenarios on cloud and edge platforms. This success is mainly attributed to efficient neural network architecture design in terms of kernel and channel sizes, methods for efficient inference such as Pruning [5], [6] and Quantization [7], [8], and systematic domain-specific hardware configuration for network acceleration [9], [10]. The pruned and low-precision neural network models are smaller and faster on the hardware with similar accuracy to the unoptimized network. The rapid development of efficient deep learning optimization algorithms and advances in hardware platforms have led to deployment in various real-life applications, such as self-driving cars.
Manually architecting neural network models is tedious as it requires a thorough understanding of many allied fields. Neural Architecture Search (NAS) [11] is an Automatic Machine Learning (AutoML) technique that seeks to automatically discover the most effective neural network architecture from a predefined search space on the given dataset. This method is designed to reduce manual effort and computation time by intelligently searching for efficient neural architectures. NAS is a valuable tool for automating the neural network architecture design process without significant human intervention. Hardware-aware Neural Architecture Search (HW-NAS) [12], [13], [14], [15] is a subset of NAS whose objective is to search for a neural network that is not only accurate in terms of model performance, such as accuracy, but also hardware-efficient with respect to latency, energy, etc. NAS and HW-NAS have been proven successful on many tasks, such as Image Classification [11] and Object Detection [16], where the searched networks surpass handcrafted models.

FIGURE 1. Big picture of our work: differentiable neural architecture, mixed precision and accelerator co-search. The goal is to automatically search for the precision/bit-width of every weight and activation matrix, the parameters of the neural architecture (kernel size (k) and number of filters/channels (C)), and the hardware accelerator (array height and array width).
Neural Network model reduction techniques, such as Pruning and Quantization, are crucial for optimized inference on various hardware platforms. Low-precision inference is an effective approach to boost DNN performance on many hardware architectures. Quantization [17], [18], a well-known and widely used method, converts high-precision floating point weight and/or activation parameters, such as Floating Point 32 (FP32), to low-precision values like Integer 8 (Int8) or Integer 4 (Int4). Uniform Quantization assigns the same precision or bit-width to all weight and activation matrices across all layers in a network. Nonetheless, quantizing the model to ultra-low precision, such as 1 or 2 bits, can result in significant accuracy degradation [17].
Mixed Precision Quantization [19], [20] is a mechanism that allows different bit-widths for different layers in a model to overcome the challenges faced by Uniform Quantization. Generally, mixed precision assignment enables the model to capture the layer information better by setting different weight and activation matrices to different precisions. However, the main problem of this method is determining each layer's precision in the network so that the model attains acceptable accuracy while minimizing the model size. Algorithmic advances in NAS have made it applicable to other Deep Learning problems that are combinatorial in nature. Therefore, Mixed Precision Quantization can be translated into a search problem, thereby utilizing similar NAS principles. We propose a Fast Hardware-aware Mixed Precision Search based on a search space pruning and tensor-sharing methodology.
The bit-width and the layer dimensions (kernel and channel sizes) of a neural network are often correlated, thus allowing us to co-design the hyperparameters together. The optimal combination of neural architecture and precision of each layer contributes to the model accuracy and efficiency on the hardware. Therefore, a synergistic co-exploration is a path forward for more robust solutions for better model performance and hardware acceleration. In the second part of this paper, we propose a new Architecture and Precision Co-search methodology to perform end-to-end joint optimization over the neural architecture and quantization space.
An efficient hardware accelerator alongside neural architecture design and model compression methods is critical in accelerating DNNs, especially on real-time and low-memory devices. The hardware optimization is tied to the neural architecture, and the network topology depends on the underlying hardware. A simple heuristic of searching for a network given fixed accelerator specifications, such as systolic array dimensions and register memory, or of searching for an accelerator size given a fixed neural architecture, can be sub-optimal. Consequently, a joint co-search of both neural characteristics (precision and layer dimensions) and the accelerator space can achieve a better trade-off in terms of accuracy and efficiency. In the third part of our work, we develop a differentiable Accelerator, Architecture and Precision Co-search method to search over all three dimensions in a single loop. The chosen set of problems is extremely challenging, and the complexity is intensified by its diversified search elements in the Architecture, Quantization, and Accelerator spaces, as summarized in Figure 1.

A. LIMITATIONS OF THE STATE-OF-THE-ART (SOTA)
There have been past works targeting Mixed Precision Search on pretrained models [19], [21], Architecture and Mixed Precision Search for fixed hardware accelerator specifications [22], and limited work on the triple search over all three dimensions [23]. Mixed precision search algorithms such as HAQ [19] and ReLeQ [21] employ a Reinforcement Learning based controller, where the mixed precision model is searched in an iterative training and evaluation process, inducing a heavy computation burden. The previous one-shot method EdMIPS [24] for mixed precision is not hardware-aware and uses the number of parameters as a proxy instead of latency. The Architecture and Mixed Precision search method Energy-aware NAS [22] also employs a controller-based method, thereby requiring significant search time. Also, the SOTA methods consider quantization levels only up to 8 bits. In contrast, we also consider 16-bit precision for a few weight and activation matrices in layers that exhibit similar latency to 8-bit multiplication.

B. RESEARCH CHALLENGES AND PAPER SUMMARY
Overall, the research challenges in this paper revolve around optimizing neural network architectures and precision choices for specific hardware and datasets. The challenges include reducing the precision search space, developing hardware-aware search algorithms, designing efficient co-search methods for architecture and precision, and enabling differentiable triple search to outperform manually designed network-accelerator pairs. Specifically, in this paper, we propose Neural Architecture Search algorithms for efficient neural network architecture and precision search for given hardware and datasets. The first method prunes weight and activation precision choices to reduce the precision search space, which saves search time to find efficient mixed precision quantized models. The second method is a hardware-aware mixed precision search that assigns different bit-widths to different weights and activations. The third algorithm is a co-search method to jointly search neural architecture specifications and the precision of weight and activation tensors. Finally, the fourth method is a differentiable accelerator, architecture, and mixed precision triple search algorithm that co-searches all three dimensions in a single optimization loop, outperforming the manually designed network-accelerator pair.

C. CONTRIBUTIONS
Specifically, the main contributions of our paper are highlighted below:
1) We first analyze the layer-by-layer latencies on the chosen hardware accelerator and develop a generic search space pruning algorithm to remove redundant weight and activation precision choices. The search space pruning reduces the precision search space on a few hardware platforms, which minimizes the search time to find efficient models. We apply the proposed method to MobilenetV2 [25] on the Bitfusion [10] accelerator. The pruning method reduces the precision search space of MobilenetV2 on Bitfusion by more than 3x. The method is generic and can be applied to any hardware platform or neural network architecture.
2) We develop a Fast, Differentiable Hardware-aware Mixed Precision Search method to search for and assign different bit-widths to different weights and activations. The differentiable method is a one-shot search method where a single network called a Supernetwork, which encodes all possible combinations in the search space, is trained only once, and the best network lies in one of the paths of the trained Supernetwork. The automatically searched mixed precision quantized networks offer a better latency-accuracy tradeoff than Uniform Quantization on the CIFAR [26] and ImageNet [27] datasets. The search method is partly taken from our previous work [28], which we applied to a different hardware platform (Nvidia A100 GPU) and Neural Network (ResNet50). However, in this paper we significantly extend it with a new search space pruning and weight/activation sharing method.
3) We develop a Differentiable search method to jointly find the neural architecture specifications (kernel and filter size) and the precision of each weight and activation tensor for a given dataset and hardware. This co-search algorithm takes advantage of the search space pruning algorithm and efficient tensor sharing in the mixed precision search method. The automatically generated model outperforms the baseline Uniform Int8 network and the Mixed Precision-only searched network. The search time of the co-search method is significantly reduced compared to the previous similar method [22]. Our search method is efficient as it takes only 8 GPU days to search for an efficient network, while the previous work requires 25 GPU days.
4) We develop a Differentiable Accelerator, Architecture, and Mixed Precision triple search algorithm to co-search all three dimensions in a single optimization loop. The searched network-accelerator pair consistently outperforms the manually designed MobilenetV2 and Bitfusion pair for different hardware dimensions, latency, and accuracy metrics. For example, our searched accelerator and model combination attains 0.13% higher accuracy, 1.74 ms (33%) lower latency on the searched hardware accelerator, and a 2x smaller computing array than the Int8 MobilenetV2-Bitfusion pair. Our triple co-search is important as it fully automates the accelerator-network design process for a given dataset and enables efficient deployment.

D. PAPER ORGANIZATION
The remainder of this paper is organized as follows: In Section II, we briefly review the fundamentals of Neural Architecture Search, Mixed Precision Quantization methods, co-exploration algorithms and the experimental setup (Accelerator platform and Neural Network benchmarks) we follow in this paper. Section III describes our novel search space pruning method, followed by our Mixed Precision Quantization search method in Section IV. In Section V, we describe our Joint Architecture and Precision exploration method, followed by the triple search of the Accelerator, Architecture, and Precision search algorithm in Section VI. Section VII discusses the limitations of our work and possible future extensions of the methods we developed in the paper, followed by the conclusion in Section VIII.

II. BACKGROUND AND DESIGN ISSUES
A. QUANTIZATION
Quantization, an important method for efficient inference, approximates the expensive floating point parameters with low bit-width values [29]. This technique usually converts high-precision values (FP32) to low-precision data (Int8, Int4, Int1, etc.). The quantized model requires less memory, fewer computation resources, and lower latency on dedicated hardware platforms supporting low-precision multiplication.
In this paper, we use linear quantization [7], [8] to quantize each individual weight and activation tensor.
B. QUANTIZATION AWARE TRAINING (QAT)
QAT is a technique that induces quantization error into every neural network parameter to simulate low-precision behavior. It adjusts the values by considering the quantization loss and maps floating-point parameters to their appropriate integer or low-precision representation using a predefined function. The neural network is fine-tuned while accounting for the quantization noise added during training by simulating quantization effects. Quantization modules are placed between the layers to simulate the clamping and rounding of the conversion process, and the retrained model can then be quantized to integer precision using the data stored in these modules.
For example, the quantization scale parameter for the range {x_min, x_max} and bit-width N can be calculated as per Equation 1. The quantization function for lowering the precision is given in Equation 2, and dequantization is the exact opposite of quantization, as depicted in Equation 3.
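To make Equations 1-3 concrete, the following minimal NumPy sketch implements one common form of affine linear quantization; the function names and the clamp-and-round scheme are our own assumptions and may differ in detail from the exact formulation used here.

import numpy as np

def quant_scale(x_min, x_max, num_bits):
    # Equation 1 (assumed form): map the float range onto 2^N - 1 integer steps
    return (x_max - x_min) / (2 ** num_bits - 1)

def quantize(x, x_min, x_max, num_bits):
    # Equation 2 (assumed form): clamp, shift, scale and round onto the integer grid
    s = quant_scale(x_min, x_max, num_bits)
    q = np.round((np.clip(x, x_min, x_max) - x_min) / s)
    return q.astype(np.int32), s

def dequantize(q, s, x_min):
    # Equation 3: the inverse mapping back to (approximate) floating point
    return q * s + x_min

w = np.random.randn(8).astype(np.float32)
q, s = quantize(w, w.min(), w.max(), num_bits=4)
w_hat = dequantize(q, s, w.min())
print(np.abs(w - w_hat).max())  # error introduced by the 4-bit grid

During QAT, the quantize-dequantize pair (w_hat above) is applied in the forward pass so that the network is fine-tuned under the simulated rounding and clamping noise.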

C. HARDWARE-AWARE NEURAL ARCHITECTURE SEARCH (HW-NAS)
The manual design of a Neural Network for different applications and datasets is tedious as it requires proficiency in various interconnected fields such as signal processing and optimization. Neural Architecture Search (NAS) is a powerful algorithm to automate the design process of a Deep Neural Network to achieve better model performance in a self-learning manner. Hardware-aware Neural Architecture Search (HW-NAS) aims to automate the design process to search for a network that maximizes accuracy and efficiency on the underlying hardware [30], [31]. The seminal works on Reinforcement Learning-NAS [11], Evolutionary search [32] and Differentiable search [33] pushed the field of NAS research to many domains, applications, and hardware platforms. For more detailed information on NAS and Hardware-aware NAS, please refer to our detailed survey papers [15], [34].

D. MIXED PRECISION QUANTIZATION
Uniform Quantization assigns the same bit-width to all the weight and activation matrices in a network. On the other hand, Mixed Precision Quantization (MPQ) quantizes different layers to different precisions, or different weights and activations in the same layer to different bit-widths. Mixed precision has gained immense attention both in terms of algorithmic enhancements and support from commercial [35] and academic accelerators [10]. The main challenge lies in assigning these varying precisions to the different weight and activation tensors to achieve maximum efficiency, as it is a combinatorial problem. For a network with "L" layers and "b" precision choices per layer, there exist b^L distinct combinations; for example, even four precision options across 50 layers already yield 4^50, roughly 10^30 possible assignments. The seminal work Hardware-aware Automated Quantization (HAQ) [19] is an RL-based method to search for layer-wise precision for efficient performance on hardware accelerators.

E. VARYING PRECISION ACCELERATORS
Neural Networks have greatly benefited from the design of stand-alone domain-specific accelerators in many scenarios, such as cloud and edge [36]. Custom hardware offers better efficiency in terms of various performance metrics such as energy and latency. Due to the effectiveness of mixed precision models, several commercial and academic hardware platforms, such as BISMO [37] and Nvidia A100 [35], have been developed to support varying bit-width multiplication.
BitFusion: In this paper, we use Bitfusion [10] as the underlying accelerator, which has been used in many seminal methods, such as HAQ [19] and Energy-aware NAS [22]. A systolic array in the BitFusion accelerator dynamically fuses different processing elements to support the varying bit-width of each layer. The hardware supports Floating Point 16 (Float16), Integer 8 (Int8), Integer 6 (Int6), Integer 4 (Int4), and Integer 2 (Int2), which we use in our search space.

F. CO-SEARCH METHODOLOGIES
There have been many NAS algorithms to find an efficient sequence of neural operations or to search for mixed precision quantized models. However, there are limited NAS works that jointly explore multiple dimensions in the same loop. A typical Convolution network search space consists of the number of filters, depth, kernel size, and precision of each layer. The hardware accelerator search space consists of physical properties such as the systolic array and buffer/register sizes. Energy-aware Mixed Precision search [22] is an end-to-end exploration strategy to find an efficient architecture and the precision of each operation to optimize accuracy, latency, and energy on the accelerator. It uses a controller-based methodology that iteratively samples high-quality networks and evaluates their accuracy and performance metrics. Auto-NBA [23] is a triple search method that jointly optimizes the neural architecture, bit-widths and accelerator properties.

G. MODELS AND BENCHMARKS
In this paper, we use MobileNetV2 [25] and related models as benchmarks for comparison. We train the networks on the CIFAR-10 [26] and ImageNet [27] datasets using PyTorch [38]. The hardware performance metrics for the accelerator are obtained from the simulator developed by its authors [39]. The performance metric we target in this paper is latency, as this simulator supports latency estimation for neural architectures of different sizes.

III. PRECISION SEARCH SPACE ANALYSIS & PRUNING
In this section, we describe our proposed search space pruning method to reduce the search space size. The pruning method is based on the latency of each weight and activation precision choice and the quantization loss incurred when quantizing from the floating point representation to low precision. However, before we explain our method in detail, we first study the latencies of the Standard Convolutions and Depthwise Convolutions of the MobilenetV2 network on the Bitfusion accelerator to motivate the need for our search space pruning method.
The MobilenetV2 model is a concatenation of 17 Mobile Inverted Bottleneck (MBConv) layers, where each module is made up of a 3 × 3 Depthwise Convolution sandwiched between two 1 × 1 Standard Convolutions, as shown in Figure 9. The 1 × 1 Convolution and Depthwise Convolution are very important operations to study as they form a key component in many standard models for mobile deployment, such as the MobileNet family [25], [40], [41], and HPC networks such as the ResNet [1] and EfficientNet [42], [43] families.

A. SEARCH SPACE ANALYSIS
We first investigate layer-by-layer latencies of different weight and activation precision combinations of the MobilenetV2 model on the Bitfusion accelerator. The latency of a Depthwise Convolution is calculated in the following two steps: (1) we first calculate the latency of a standard Convolution by setting the input channels to 1, and (2) we multiply the obtained latency by the number of output channels present in the depthwise layer to get the overall latency [22], [39].
Figure 3 illustrates the plot of the 1 × 1 Standard Convolution (layer no. 12 of MobilenetV2) with all the weight and activation combinations that are possible with {4, 6, 8, 16} precisions on the ImageNet dataset for a batch size of 16. It can be observed from the plot that a few precision combinations exhibit similar latency.
Figure 4 illustrates the plot of the 3 × 3 Depthwise Convolution (layer no. 8 of MobilenetV2) with all the weight and activation combinations that are possible with {4, 6, 8, 16} precisions. It is evident from the latency graph that the difference between the highest bit-width (16-bit weight and 16-bit activation) and the lowest bit-width (4-bit weight and 4-bit activation) is 0.003 ms on the accelerator. Hence, not every low-precision implementation leads to a performance improvement on the chosen hardware. Therefore, it can be concluded from the latency plots of the two different Convolution operations that a few layers exhibit similar latency even though they differ in weight and activation precision. This phenomenon can be attributed to the low utilization of the weight and activation matrices for specific dimensions on the systolic array of the accelerator.

3) PRUNE CHOICES
The third and most important step determines the best choice to retain within a single group by calculating the quantization loss for every such precision combination. We refer to the quantization loss as the L2-norm between the floating point and quantized activations of that layer. We pick the weight-activation precision choice with the least L2-norm loss among the clustered group. Following the above example, the pruned search space after this step is equal to {(16,4), (16,6), (16,8), (16,16)}.

4) CHECK OPTIMAL LATENCY
The last step in the pruning process removes all choices in each layer whose latency is higher than the latency of the Int8 weight and Int8 activation choice by at least the grouping latency g_l. This is based on the intuition that Int8 precision for weight and activation is already optimized for many networks, and our goal is to find a model better than Uniform Int8. Such choices typically involve a 16-bit weight or activation. The optimal search space after this step for the above example is {(16,4), (16,6), (16,8)}, as the latency of the precision choice (16,16) is higher than that of (8,8), as illustrated in Figure 3.
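A minimal sketch of this pruning flow, under our assumptions: per-choice latencies come from the accelerator simulator, quant_loss holds the L2-norm between the floating point and quantized activations of each choice, and group_latency plays the role of the grouping threshold g_l. The greedy grouping loop is our own simplification of the latency-grouping step.

def prune_layer_choices(choices, latency, quant_loss, group_latency, int8_latency):
    # choices: list of (w_bits, a_bits); latency and quant_loss: dicts keyed by choice
    # Group choices whose latencies differ by less than the grouping threshold g_l
    groups, remaining = [], sorted(choices, key=lambda c: latency[c])
    while remaining:
        seed = remaining[0]
        group = [c for c in remaining if latency[c] - latency[seed] < group_latency]
        groups.append(group)
        remaining = [c for c in remaining if c not in group]
    # Step 3: keep only the choice with the smallest quantization loss in each group
    kept = [min(g, key=lambda c: quant_loss[c]) for g in groups]
    # Step 4: drop choices slower than the Int8/Int8 choice by at least g_l
    return [c for c in kept if latency[c] < int8_latency + group_latency]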
Table 1 summarizes the number of weight and activation choices before and after applying the search space pruning algorithm. Therefore, search space pruning can significantly cut down the computation cost of the search algorithm, which is discussed in the next section.

IV. ACCELERATOR-AWARE MIXED PRECISION QUANTIZATION SEARCH
In the previous section, we analyzed the search space and proposed a pruning strategy to reduce the number of precision choices. In this section, we describe our Accelerator-aware Mixed Precision Quantization Search method to find an efficient bit-width for every layer of a pretrained neural network based on the pruned search space. The schematic of our proposed method is illustrated in Figure 5 and consists of the following components:
(i) Search Space: The search space consists of all possible paths supported by the underlying hardware accelerator. As mentioned previously, we consider {2,4,6,8,16} precisions for models trained on the CIFAR-10 dataset [26] and {4,6,8,16} for ImageNet [27] based networks. We do not quantize the first 3 × 3 Convolution layer and the last Fully Connected layer, leaving both at 16-bit precision, as quantizing them to low precision leads to accuracy loss.
(ii) Search Space Pruning: The second step is to pick only important weight and activation combinations to reduce the computational burden, which we explained in detail in the previous section.
(iii) Search Process: The third step is a multi-objective differentiable search algorithm. We propose a one-shot search methodology based on weight and activation sharing, where we construct a Supernetwork based on the pruned search space. The Supernetwork is trained only once, and the best weight-activation precision combination is obtained from the Supernetwork.
(iv) Accuracy-Latency loss function: The Supernetwork is trained on a loss function that includes both accuracy and latency as optimization objectives.
The output activation O of a Composite Convolution is a Convolution operation between the Superweight and the Superactivation (O = W* ∗ I*). The main advantage of the Composite Convolution is that only a single Convolution operation exists for a given layer, irrespective of the number of choices in the search space.
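A simplified PyTorch sketch of such a Composite Convolution, assuming a placeholder fake_quant helper and omitting the straight-through gradient handling used in practice; the shapes, names and the exact weighting scheme are our assumptions rather than this paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quant(x, bits):
    # placeholder linear quantize-dequantize (a stand-in for the QAT operator)
    scale = (x.max() - x.min()).clamp(min=1e-8) / (2 ** bits - 1)
    return torch.round((x - x.min()) / scale) * scale + x.min()

class CompositeConv2d(nn.Module):
    # One shared convolution whose weight and activation are alpha-weighted sums of quantized copies
    def __init__(self, in_ch, out_ch, kernel, choices):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2, bias=False)
        self.choices = choices                                  # e.g. [(8, 8), (8, 6), (8, 4)]
        self.alpha = nn.Parameter(torch.zeros(len(choices)))    # shared across weight/activation paths

    def forward(self, x):
        a = F.softmax(self.alpha, dim=0)
        # Superweight W* and Superactivation I*: weighted sums over the precision paths
        w_star = sum(a[i] * fake_quant(self.conv.weight, wb) for i, (wb, _) in enumerate(self.choices))
        i_star = sum(a[i] * fake_quant(x, ab) for i, (_, ab) in enumerate(self.choices))
        # A single convolution is executed regardless of the number of precision choices
        return F.conv2d(i_star, w_star, padding=self.conv.padding)

layer = CompositeConv2d(16, 32, 3, [(8, 8), (8, 6), (8, 4)])
out = layer(torch.randn(1, 16, 14, 14))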

C. WEIGHT AND ACTIVATION SHARING:
The weight sharing in Composite Convolution can be enhanced in a few layers when the same weight or activation precision exists across all the choices. For example, consider the search space {(W,A)} = {(8,8), (8,6), (8,4)} obtained after pruning the full search space, which can be represented in the condensed format as follows: {W} = {8}, {A} = {8,6,4}. Hence, all paths with the same weight precision can be collapsed into a single path, as shown in Figure 7.
A Convolution layer for which there exists only one precision choice for both weight and activation after search space pruning, for example {(W,A)} = {(8,8)} or {(16,16)}, is quantized to the corresponding precision, both in the Supernetwork and in the searched model. The weight and activation sharing in a few layers is only possible due to pruning redundant paths, signifying the importance of search space pruning.

D. LEARNING THE SUPERNETWORK
Initially, all the architectural parameters (α) in a layer are treated equally and set to 1/N, based on the N parallel paths.

E. LATENCY LOSS:
Normally, when the Supernetwork is trained only on the Cross-entropy loss (L_CE), there is a high likelihood that the maximum precision is chosen across all the layers, because maximum accuracy can be obtained with the highest bit-width. Hence, in order to balance accuracy and latency, a latency-aware loss function (L_lat) is added to the final objective on which the Supernetwork is trained, as shown in Equation 7.
The λ in Equation 7 is a hyperparameter to adjust the relative importance of accuracy over latency, where a smaller λ value favors a network with better accuracy over latency, and vice versa. The latency loss (L_lat) is a function of the architectural parameters (α) and the latency of the corresponding weight and activation precision paths, as given in Equation 8.
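A sketch of the combined objective, under our assumed form of Equations 7 and 8: the latency term is the softmax(α)-weighted sum of the simulator latency of each precision path, accumulated over all layers.

import torch.nn.functional as F

def latency_loss(alphas, path_latencies):
    # Equation 8 (assumed form): alphas and path_latencies are lists of per-layer tensors
    return sum((F.softmax(a, dim=0) * lat).sum() for a, lat in zip(alphas, path_latencies))

def total_loss(logits, targets, alphas, path_latencies, lam=0.1):
    # Equation 7 (assumed form): L = L_CE + lambda * L_lat
    return F.cross_entropy(logits, targets) + lam * latency_loss(alphas, path_latencies)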

F. RESULTS AND EVALUATION
We evaluate our search process on the MobilenetV2 network on the CIFAR and ImageNet datasets. We search for a mixed precision quantized model directly on the target dataset. We compare our work with many state-of-the-art mixed precision quantized models. We pick every architecture for comparison from its respective paper and evaluate its latency on the Bitfusion simulator.

1) MOBILENETV2 ON CIFAR
Table 2 summarizes the accuracy and latency of our searched model and other similar benchmark networks on the CIFAR-10 and CIFAR-100 datasets. We can infer from the table that our mixed precision quantized model attains a better accuracy-latency tradeoff.

2) MOBILENETV2 ON IMAGENET
In Table 3, we report the accuracy and latency of our searched model and the same benchmarks on the ImageNet dataset. We can infer from the table that our mixed precision quantized model attains a better accuracy-latency tradeoff than many SOTA networks. We attain accuracy better than the uniform Int8 model as our models contain a few 16-bit weight and activation matrices.

G. SEARCH TIME COMPARISON
The well-known method HAQ [19], based on Reinforcement Learning (RL), is an iterative method where a mixed precision quantized model is first predicted, then fine-tuned to obtain its accuracy, and the process is repeated until a solution is found. This method requires close to eight GPU days. In contrast, as our method is a one-shot hardware-aware method with a reduced search space, we need less compute time for Supernetwork training. Our search method discovers efficient mixed precision models on MobilenetV2 directly on the large-scale ImageNet dataset in less than four GPU days. Proxy metrics such as Bit Operations (Ops) only give a numerical estimate of how much work the hardware must perform; under such a proxy, the cost of a layer is taken to be proportional to Ops × Weight Precision × Activation Precision. The actual latency of a mixed precision hardware accelerator depends on several factors, such as parallelism, data reuse, memory bandwidth, instruction set, pipelining and software optimizations.
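For illustration, a minimal sketch of this Bit-Operations proxy under our assumptions; the layer dimensions below are hypothetical.

def bit_ops(macs, w_bits, a_bits):
    # Proxy cost: multiply-accumulate count scaled by the weight and activation precision
    return macs * w_bits * a_bits

macs = 14 * 14 * 32 * 64                         # e.g. a 1x1 convolution on a 14x14 feature map
print(bit_ops(macs, 8, 8), bit_ops(macs, 4, 8))  # proxy cost at Int8/Int8 vs Int4/Int8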

V. NEURAL ARCHITECTURE AND MIXED PRECISION QUANTIZATION SEARCH
In the previous Mixed Precision Search section (Section IV), our goal was to search only for the precision of every weight and activation tensor of a pretrained MobilenetV2 model. In this section, we propose a Differentiable method to simultaneously search for architectural specifications, such as kernel and channel size, and the precision of each weight and activation matrix on the target dataset. The search space pruning and Composite Convolution introduced in the earlier section (Section IV) are utilized in this method.

A. SEARCH ELEMENTS
We use the MobilenetV2-based model as our backbone network, consisting of bottleneck layers in series. This specific network is chosen as it allows us to compare our work with the previous work based on the same macroarchitecture. We initially consider two types of bottleneck blocks in our search space, i.e., the Mobile Inverted Bottleneck (MBConv) and the Fused Inverted Bottleneck Convolution (Fused-MBConv). We first describe the topology of both modules, followed by a latency evaluation to pick only one of the two modules.

1) MBConv
The MBConv is a sequence of the following three operations: a 1 × 1 Standard Pointwise Convolution, a k×k Depthwise Convolution, and a 1 × 1 Standard Pointwise Convolution. The first Convolution expands the input channels I by an expansion ratio e. The channel size remains the same in the Depthwise Convolution, whereas the channel count shrinks back in the second 1 × 1 Pointwise Convolution. The order of operations in an MBConv is depicted in Figure 9, and their weight and activation dimensions are summarized in Table 4.

2) FUSED-MBConv
The Fused-MBConv, proposed by Gupta et al. [51], replaces the first expansion 1 × 1 Pointwise Convolution and the k×k Depthwise Convolution with a k×k Standard Convolution, as shown in Figure 9b. A few previous works [43], [51] have shown that Fused-MBConv is more latency-efficient than the traditional MBConv module for certain kernel and filter sizes.
During the search process, the input and output filter sizes of each MBConv layer remain constant, but we search for the kernel size of the Depthwise Convolution and the channel size inside the MBConv module. The search elements in each bottleneck module are kernel sizes {3, 5} and expansion ratios {3, 6}.

B. CHOOSING BETWEEN MBConv & FUSED-MBConv
As mentioned earlier, the Fused-MBConv module is more latency-efficient than the traditional MBConv for only a few weight and activation dimensions. We choose Fused-MBConv over MBConv for a layer in the Supernetwork only if the Int8 latency of the Fused-MBConv block is less than that of the MBConv module.
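The selection rule reduces to a per-layer latency check; in the sketch below, simulate_latency stands in for the Bitfusion latency simulator and is an assumption on our part.

def pick_block(layer_cfg, simulate_latency):
    # Use Fused-MBConv only where its Int8 latency beats the MBConv block's Int8 latency
    mbconv_lat = simulate_latency(layer_cfg, block="MBConv", w_bits=8, a_bits=8)
    fused_lat = simulate_latency(layer_cfg, block="Fused-MBConv", w_bits=8, a_bits=8)
    return "Fused-MBConv" if fused_lat < mbconv_lat else "MBConv"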

C. MACROARCHITECTURE SEARCH SPACE
We compare our method with Energy-aware Mixed Precision search [22], which addresses a similar architecture and mixed precision search problem. Therefore, for a fair comparison of search time and searched networks, we retain their macroarchitecture in our search algorithm, which is illustrated in Table 6. While the previous work [22] considers only MBConv modules, our search space includes Fused-MBConv for a few layers.

D. PRECISION SEARCH SPACE PRUNING
Our Mixed Precision search method prunes the search space based on latency and the quantization loss between the floating point and quantized activation matrices.

E. SUPERBOTTLENECK
The Architecture-Mixed Precision Supernetwork is built by populating every bottleneck in the backbone network with a Superbottleneck, as shown in Figure 10. We propose a weight and activation-sharing Superbottleneck structure to encode all possible kernel sizes, channel expansion ratios, and bit-widths in an efficient manner. As we consider two expansion ratios {3,6} in our search space, we build two paths from the input, where the first path (left path in Figure 10) has an expansion ratio of 3 and the second path (right path in Figure 10) has an expansion ratio of 6. Also, every path is associated with a γ parameter to guide the selection of the expansion ratio path. In this case, we define {γ_1, γ_2} for expansion ratios {3,6}, respectively. We populate every Depthwise Convolution with two Depthwise Convolutions, where the first Convolution has a kernel size of 3 and the second Convolution has a kernel size of 5, internal to each expansion ratio path. Every Depthwise Convolution path is associated with a β parameter to guide the selection of the kernel size. We define {β_1, β_2} and {β_3, β_4} for kernel sizes {3,5} on the expansion ratio 3 and 6 paths, respectively. The output activation of the Depthwise layer is a β-weighted sum of the activations of the kernel size 3 and 5 paths, as per Equation 9.
The kernel size of the first and third 1 × 1 Convolutions in an MBConv is constant, whereas we search the kernel size of the middle Depthwise Convolution in the MBConv module. Therefore, we can share the 1 × 1 Convolution operations within an expansion ratio path. The input to a Superbottleneck is the output of the (L−1)-th Superbottleneck, and the same feature map is broadcast to all the expansion ratio paths in the current Superbottleneck. The activation feature map at the output of the L-th Superbottleneck is a softmax γ-weighted sum of all the expansion ratio paths, as per Equation 10. We adopt the Composite Convolution we proposed earlier to search for weight and activation precision, which replaces every Convolution in the Supernetwork. The beta (β) and gamma (γ) parameters are external to the Convolution operation, whereas there exists a unique set of alpha (α) parameters for every Convolution.
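A PyTorch sketch of the Superbottleneck forward pass under our assumptions, with plain Conv2d modules standing in for the Composite (precision-search) Convolutions; the β-weighted depthwise sum corresponds to Equation 9 and the γ-weighted sum over expansion paths to Equation 10.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperBottleneck(nn.Module):
    def __init__(self, in_ch, out_ch, expansions=(3, 6), kernels=(3, 5)):
        super().__init__()
        self.paths = nn.ModuleList()
        for e in expansions:
            mid = in_ch * e
            self.paths.append(nn.ModuleDict({
                "expand": nn.Conv2d(in_ch, mid, 1, bias=False),
                "dw": nn.ModuleList([nn.Conv2d(mid, mid, k, padding=k // 2, groups=mid, bias=False)
                                     for k in kernels]),
                "project": nn.Conv2d(mid, out_ch, 1, bias=False),
            }))
        self.gamma = nn.Parameter(torch.zeros(len(expansions)))                # expansion-ratio choice
        self.beta = nn.Parameter(torch.zeros(len(expansions), len(kernels)))   # kernel-size choice per path

    def forward(self, x):
        g = F.softmax(self.gamma, dim=0)
        outs = []
        for j, path in enumerate(self.paths):
            h = path["expand"](x)
            b = F.softmax(self.beta[j], dim=0)
            # Equation 9 (assumed form): beta-weighted sum of the two depthwise kernel outputs
            h = sum(b[i] * dw(h) for i, dw in enumerate(path["dw"]))
            outs.append(path["project"](h))
        # Equation 10 (assumed form): gamma-weighted sum over the expansion-ratio paths
        return sum(g[j] * o for j, o in enumerate(outs))

block = SuperBottleneck(16, 24)
y = block(torch.randn(1, 16, 32, 32))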

1) FUSED-MBConv
The Superbottleneck in Figure 10 is specific to MBConv, whereas a similar approach can be followed for the Fused-MBConv module.We pass the input of the Superbottleneck module directly to the Standard Spatial Convolution in Fused-MBConv as this unit does not contain the first 1 × 1 Convolution.The beta (β) parameters are assigned to search for the kernel size of this Convolution.
The bottleneck structure is widely used in several networks such as MobilenetV2 [25], MobilenetV3 [41], ResNet [1], EfficientnetV1 [42], and EfficientnetV2 [43]. Therefore, our search methodology can be easily ported to similar networks to search for efficient models. Also, the Superbottleneck is scalable with different choices of kernel sizes, expansion ratios, and precision combinations.

F. LATENCY COST FUNCTION
The latency cost in the mixed precision search method is straightforward as there exists only a single set of architectural parameters (α). However, the latency cost function in the Architecture and Mixed Precision Supernetwork is more involved due to the presence of the three sets of architectural parameters (α, β, γ). The latency cost function of a single Composite Convolution is given in Equation 11.
The latency cost of the Depthwise Convolution is a β-weighted sum of the individual Convolution cost functions, as given in Equation 12.
The latency cost (Exp_cost) of one expansion ratio path is the sum of the costs of the two 1 × 1 Convolutions and the Depthwise Convolution. The cost of the Superbottleneck is a γ-weighted sum of all expansion ratio paths, as given in Equation 13.
The overall latency cost function of the Supernetwork is the sum of all bottleneck costs, as given in Equation 14:
Supernetwork_cost = Σ_l Bottleneck_cost_l. (14)
The final loss function (L_loss) on which the Supernetwork is trained is given in Equation 15.
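A sketch, under our assumptions, of how the hierarchical latency cost of Equations 11-14 can be assembled from the α, β and γ parameters and the per-path simulator latencies.

import torch.nn.functional as F

def conv_cost(alpha, path_latency):
    # Equation 11 (assumed form): alpha-weighted latency of one Composite Convolution
    return (F.softmax(alpha, dim=0) * path_latency).sum()

def expansion_cost(pw1, dw_list, pw2, beta):
    # Equation 12: beta-weighted Depthwise cost; Exp_cost adds the two pointwise Convolution costs
    b = F.softmax(beta, dim=0)
    dw_cost = sum(b[i] * conv_cost(a, lat) for i, (a, lat) in enumerate(dw_list))
    return conv_cost(*pw1) + dw_cost + conv_cost(*pw2)

def bottleneck_cost(exp_costs, gamma):
    # Equation 13: gamma-weighted sum over the expansion-ratio paths
    g = F.softmax(gamma, dim=0)
    return sum(g[j] * c for j, c in enumerate(exp_costs))

def supernet_cost(bottleneck_costs):
    # Equation 14: the overall latency cost is the sum of all Bottleneck costs
    return sum(bottleneck_costs)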
G. LEARNING THE SUPERNETWORK
Similar to the training of the Composite Convolution Supernetwork in the Mixed Precision search method, the Convolutional weights θ and the architectural weight set {α, β, γ} are updated alternately. The Convolution weights across all the MBConv modules in the Supernetwork are updated by freezing the architectural weights, and the architectural search parameters are updated while freezing the Convolution weights. The most important step in the search method is to pick the best-performing expansion ratio, kernel size, and weight/activation precision choice from the Supernetwork. We explain choosing the best choice with the example depicted in Figure 11. At every MBConv layer in the Supernetwork, we apply the following process: (i) First, we choose the best expansion ratio by selecting the path which has the highest parameter value in the set {γ_1, γ_2}. For example, if γ_2 is the highest in the set, the path of expansion ratio 6 is chosen, and the kernel sizes and precisions in the expansion ratio 3 path are completely ignored, as depicted in Figure 11.
(ii) We choose the best kernel size for the Depthwise Convolution by picking the path with the highest β value within the chosen expansion ratio path. In the example shown in Figure 11, we pick kernel size 3 if β_3 is the highest in the set {β_3, β_4}.
(iii) The best-performing weight and activation precisions for each Convolution operation are chosen based on the highest architectural parameter on the chosen Convolution in the Superbottleneck. For example, if α_5 has the highest value in the set {α_1, α_2, ..., α_5}, then Int4 and Int4 are chosen for the weight and activations, respectively. A minimal sketch of this derivation procedure is given below.
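A minimal sketch of this derivation step under our assumptions; for brevity only one α set (that of the chosen Depthwise Convolution) is shown, whereas in the method every Convolution carries its own α parameters.

import torch

def derive_bottleneck(gamma, betas, alphas, choices, expansions=(3, 6), kernels=(3, 5)):
    # betas and alphas are indexed per expansion path; choices lists the (w_bits, a_bits) options
    j = int(torch.argmax(gamma))            # (i) best expansion ratio
    i = int(torch.argmax(betas[j]))         # (ii) best depthwise kernel on the chosen path
    p = int(torch.argmax(alphas[j][i]))     # (iii) best weight/activation precision
    return {"expansion": expansions[j], "kernel": kernels[i], "precision": choices[p]}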

H. RESULTS AND COMPARISON
Table 7 summarizes the accuracy and latency of the Architecture and Mixed Precision co-searched MobilenetV2-based model on ImageNet. We compare our performance metrics with the Uniform Int8 MobilenetV2 and Energy-aware NAS [22]. Our searched model attains similar accuracy to the Uniform Int8 model while requiring less latency on the accelerator. Also, our base model performs better than the Energy-aware NAS base network in terms of both metrics. Our searched-small model is also more accurate than the Energy-aware NAS small network due to the presence of 16-bit precision for a few weight and activation matrices. The Energy-aware NAS method is a layer-wise search, i.e., the method finds the same precision for both the weight and activation matrix, which limits the algorithm's ability to find robust models. In contrast, we perform a tensor-wise search, assigning different precisions to different tensors in the network.

1) SEARCH TIME COMPARISON
Our extreme form of weight and activation sharing significantly reduces the search time to find the optimal model combination on the target dataset. The previous Energy-aware Mixed Precision search requires 5 GPU days to search on 20% of the ImageNet dataset, which is approximately 25 GPU days for the entire dataset. In contrast, our Supernetwork training requires only 8-10 GPU days on the full dataset, approximately 2.5× faster than the previous algorithm.

VI. ACCELERATOR, ARCHITECTURE AND MIXED PRECISION QUANTIZATION CO-SEARCH
In the previous two sections, we searched for an optimal network configuration for a fixed Bitfusion accelerator. In this section, we extend the network search space and include accelerator dimensions in the search space to co-design both worlds. This co-search is particularly useful when the user is given a chance to optimize both the network and the accelerator simultaneously. Although our process is partly inspired by DANCE [52], our methodology has more dimensions in the network search space and an improved accelerator search scheme. The co-search technique consists of an Architecture and Mixed Precision Quantization Supernetwork and an auxiliary neural network, as illustrated in Figure 12.

B. ARCHITECTURE AND MIXED PRECISION SPACE
In our earlier Architecture and Mixed Precision search process, we applied the search space pruning with respect to the Bitfusion accelerator for a single array of size 16*32. However, when considering nine distinct array sizes, we cannot directly obtain a single optimal pruned search space. Therefore, we build an integrated search space to accommodate choices from all array sizes. Initially, we consider an enlarged search space of all possible architecture and precision choices, and a unique pruned search space is obtained for every array size. In the second step, we build a Unified search space by combining the choices from all nine distinct array sizes. Weight and precision choices that are common to different array sizes are counted only once in the unified search space. The Architecture-Mixed Precision Supernetwork proposed in the previous section (Section V) is built over this unified search space.
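A small sketch of how such a unified space could be built, assuming a prune_fn that returns the pruned (w_bits, a_bits) choices of one layer for one array size (the Section III procedure applied per array size).

def unified_search_space(prune_fn, layers, array_sizes):
    # Union of the per-array-size pruned choices; duplicates are counted only once
    unified = {}
    for layer in layers:
        unified[layer] = set()
        for array in array_sizes:       # e.g. the nine (height, width) systolic-array options
            unified[layer] |= set(prune_fn(layer, array))
        unified[layer] = sorted(unified[layer])
    return unified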

C. ESTIMATOR NETWORK
Orthogonal to the Supernetwork, we use an auxiliary estimator network to predict accelerator specifications and compute the hardware performance loss. While the previous work [52] utilizes a Multilayer Perceptron (MLP) in the estimator network, we use a Transformer network [53]. The key feature of the Transformer is the efficient modeling of global dependencies between input elements. This network is important in our search algorithm because our network search space is large due to the enormous number of architecture and precision choices throughout the Supernetwork. The estimator network is divided into an Accelerator Predictor and a Performance Loss Estimator, whose functionality is explained subsequently.

1) ACCELERATOR PREDICTOR NETWORK
The primary purpose of the Accelerator Predictor network is to predict the array dimensions. It takes the architectural parameters {α, β, γ} of the Supernetwork as input and estimates the array size. The encoder module of the Vanilla Transformer is directly used as the accelerator predictor network. The exact details of the Transformer architecture can be found in the paper [53]. The weights of this model are randomly initialized and learned during the Supernetwork training process.
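A minimal PyTorch sketch of a Transformer-encoder predictor over the flattened architectural parameters; the token embedding, pooling and two-way regression head are our assumptions, not the exact configuration used in this work.

import torch
import torch.nn as nn

class AcceleratorPredictor(nn.Module):
    # Transformer-encoder regressor from architectural parameters to array (height, width)
    def __init__(self, num_params, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                       # one token per architectural parameter
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 2)                        # predicted array height and width

    def forward(self, arch_params):
        # arch_params: (batch, num_params) flattened alpha/beta/gamma values
        tokens = self.embed(arch_params.unsqueeze(-1))
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))                    # pool over tokens, then regress

predictor = AcceleratorPredictor(num_params=300)
hw = predictor(torch.randn(4, 300))   # -> (4, 2) predicted array dimensions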

2) PERFORMANCE LOSS ESTIMATOR
The latency of a network on an accelerator depends on both the neural architecture dimensions and the array size of the accelerator. The performance loss estimator is also part of the Transformer-based estimator network and computes the hardware performance loss used during Supernetwork training.
Generally, a large array size guarantees lower latency as more weight parameters are processed in a single computation cycle. However, our searched networks outperform the manually designed MobilenetV2-Bitfusion pair in terms of latency even though the array size is reduced by 2x and 4x. This is due to the efficient synergistic co-search of the neural architecture, precisions, and array size. The searched neural architectures contain three Fused-MBConv modules, which perform better than MBConv modules in terms of accuracy.

Figure 2 depicts the structure of the BitFusion accelerator. The individual Processing Element (PE) is a BitBrick (BB), as shown in Figure 2(a). The BitBricks are dynamically fused to support different bit-widths.

FIGURE 3. Latency comparison of all precision combinations of the Convolution (Layer 12) of MobilenetV2 on ImageNet for a batch size of 16. w_a on the x-axis indicates the weight (w) and activation (a) precision.


FIGURE 4. Latency comparison of all precision combinations of the Depthwise Convolution (Layer 8) of MobilenetV2 on ImageNet for a batch size of 16. w_a on the x-axis indicates the weight (w) and activation (a) precision.

FIGURE 5. Outline of our neural architecture search methodology.

Figure 6 depicts the construction of a Supernetwork, where a Composite Convolution replaces every Convolution layer. A Composite Convolution consists of "N" parallel paths, each corresponding to a weight and activation precision choice. The pretrained Convolution weight (W) and input activation (I) are passed to their respective paths, where they are quantized to different precisions. Each weight and activation precision choice is assigned a learnable parameter, called the architectural search parameter α, to guide the search process. It is important to note that the same set of architectural parameters (α) is used across the weight and activation paths.

TABLE 6. Macroarchitecture of the MobilenetV2-based search space inspired by Gong et al. [22]. #Filters indicates the output filter size of the MBConv, stride indicates the stride size, and n represents the number of repeated MBConv blocks.

FIGURE 10. Architecture and mixed precision Superbottleneck in the Supernetwork. Search elements: kernel size = {3, 5}, channel expansion ratio = {3, 6}. The Composite Convolution in the figure is shown only for one Convolution operation; however, the Composite Convolution for precision search is replicated across every Convolution in every layer.

FIGURE 11. The searched kernel and channel expansion sizes in the Superbottleneck. The blue arrows indicate the searched paths in the Superbottleneck module. First, the expansion ratio (e) of 6 is chosen as max(γ_1, γ_2) = γ_2. Within the chosen expansion ratio path, the best kernel size of 3 is chosen as max(β_3, β_4) = β_3. Also, within the chosen Convolution, the precision path with the highest α value is chosen in the final network.

FIGURE 12. Differentiable neural architecture, mixed precision and accelerator co-search method. The Architecture and Mixed Precision Supernetwork is retained from Section V, Figure 10.

FIGURE 14. The searched precisions for weight matrices for network-1.

FIGURE 15. The searched precisions for activation matrices for network-1.

FIGURE 17. The searched precisions for weight matrices for network-2.

FIGURE 18. The searched precisions for activation matrices for network-2.
1) LOSS FUNCTION
The loss function is a linear combination of the traditional Cross-entropy loss (L_CE) from the Supernetwork and the hardware loss (L_HW_cost) from the auxiliary network, as per Equation 16 and Figure 12.

TABLE 1. Number of unique [weight, activation] precision combinations before and after search space pruning.

TABLE 3. Comparison with other mixed precision methods on MobilenetV2 on the ImageNet dataset on BitFusion.

TABLE 7. Comparison of architecture and quantization co-search methods on the ImageNet dataset.

Table 8 reports the accuracy of two searched architecture-mixed precision models (for two different λ values) on the ImageNet dataset, along with their searched Bitfusion accelerator array size and the latency of each searched network on its searched accelerator array. Figures 13 and 16 illustrate the searched neural network macroarchitectures on the predicted systolic arrays of sizes 16*16 and 8*16, respectively. Figures 14 and 17 depict the searched weight precisions of each layer of the two searched networks, while Figures 15 and 18 depict the searched activation precisions.