Neural Architecture Search for Transformers: A Survey

Transformer-based Deep Neural Network architectures have gained tremendous interest due to their effectiveness in various applications across the Natural Language Processing (NLP) and Computer Vision (CV) domains. These models are the de facto choice in several language tasks, such as Sentiment Analysis and Text Summarization, replacing Long Short-Term Memory (LSTM) models. Vision Transformers (ViTs) have shown better model performance than traditional Convolutional Neural Networks (CNNs) in vision applications while requiring significantly fewer parameters and less training time. Designing a neural architecture for a given task and dataset is extremely challenging, as it requires expertise in several interdisciplinary areas such as signal processing, image processing, optimization, and allied fields. Neural Architecture Search (NAS) is a promising technique to automate the architectural design process of a Neural Network in a data-driven way using Machine Learning (ML) methods. The search method explores several architectures without requiring significant human effort, and the searched models often outperform manually built networks. In this paper, we review Neural Architecture Search techniques targeting the Transformer model and its family of architectures, such as Bidirectional Encoder Representations from Transformers (BERT) and Vision Transformers. We provide an in-depth literature review of approximately 50 state-of-the-art Neural Architecture Search methods and explore future directions in this fast-evolving class of problems.


I. INTRODUCTION
Deep Learning has achieved remarkable progress in the last decade due to its powerful automatic representation capability for a variety of tasks, such as Image Recognition [1], Speech Recognition [2], and Machine Translation [3]. This success is associated with network design, which is crucial to feature representation, leading to many innovative architectures such as Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Graph Neural Network (GNN) and Transformers.
The associate editor coordinating the review of this manuscript and approving it for publication was Frederico Guimarães.
Transformer networks [4] have attracted significant interest over the last few years due to their effectiveness in handling long-range dependencies in data, and have attained state-of-the-art (SOTA) performance on several Natural Language Processing (NLP) tasks. The architecture and its family of networks rely on the Self-attention mechanism to model global dependencies in the input sequence data. The Transformer, based on Self-attention, exhibits high scalability, model capacity, and data-parallel processing, as opposed to its predecessor, Long Short-Term Memory (LSTM) [3].
The original Transformer model [4] was primarily developed for NLP applications. Later, the Vision Transformer (ViT) [5], a type of Transformer network, was developed for many Computer Vision applications such as Image Classification [6], Object Detection [7], and Semantic Segmentation [8]. The ViT model establishes interdependence among the patches of an image using the Self-attention mechanism. Recent studies have shown that Vision Transformers demonstrate superior performance over traditional CNNs in terms of model size scaling and training time. The variants built on top of ViT, such as DeiT [9] and ConViT [10], are able to learn visual representations better than CNNs. In this paper, we refer to the original Transformer architecture [4] for language tasks as the Vanilla Transformer, and for Computer Vision applications, we use the term Vision Transformer.
Neural Architecture Search (NAS), a subset of Automated Machine Learning (AutoML) [11], is a method to automate the design process of a Neural Network architecture for a given task and dataset without significant human intervention. A NAS method automatically searches for an efficient neural architecture, saving the researcher's manual effort and computation time. Hardware-aware Neural Architecture Search (HW-NAS) is a class of problems whose goal is to search for networks that are not only accurate on the given dataset but also hardware-efficient in terms of latency. The resulting searched models outperform manually designed networks in several aspects, such as model performance and inference latency on the actual hardware. NAS and HW-NAS have been very successful in searching for efficient models that achieve SOTA performance on many tasks such as image classification [12], object detection [13], and machine translation [14].
The success of NAS methods for CNNs motivated researchers to develop similar methods for the highly successful Transformer [4] and Vision Transformer [5] architectures, giving rise to a new research direction in NLP and Computer Vision research. Hence, a review paper focusing on NAS methods for Transformers and their family of architectures is essential to summarize several key contributions, including efficient search spaces and search algorithms. This is the first paper to provide an in-depth review of SOTA Neural Architecture Search methods targeting Vanilla Transformers, BERT model, and Vision Transformer for language, speech, and vision tasks, along with exploring future directions.

A. LIMITATIONS OF STATE-OF-THE-ART SURVEY PAPERS
In the past, several survey papers on Transformers and Vision Transformers have been published. The previous AutoML, NAS and HW-NAS survey papers emphasized the theoretical concepts of the search methods, with a greater focus on CNNs than Transformers. However, there is no dedicated review paper for Transformer-based architecture search methods. The existing surveys on Vanilla Transformers, Vision Transformers and Neural Architecture Search are outlined in Table 1. As far as we know, our work is the first dedicated review paper on Neural Architecture Search for Transformers and their family of architectures.

B. PAPER ORGANIZATION
The detailed composition of this survey paper and classification of several Transformer NAS methods is given in Table 2. The remainder of this paper is arranged as follows: Section II briefly reviews the fundamentals of the Self-attention mechanism and Vanilla Transformer. The basics of Convolution and Vision Transformer are summarized in Section III, while Section IV presents several primitive attention and Convolution search spaces for Transformers. Section V reviews different search algorithms for Transformers, followed by Hardware-aware NAS methods in Section VI. Sections VII, VIII, and IX discuss different NAS methods for Vanilla Transformer, BERT model, and Vision Transformer, respectively. Section X briefly describes Hardware-aware NAS methods for Transformers. Section XI provides future directions to explore and Section XII concludes this survey paper.

II. SELF-ATTENTION AND TRANSFORMER MODEL
The Transformer network [4], one of the most remarkable topologies, was initially designed for NLP tasks and has outperformed its RNN predecessors in model performance. The key feature of this architecture is the modeling of global dependencies by learning pairwise correlations between elements of the input through a Self-attention mechanism. RNNs process the input sequence recursively and fetch only short-term information, whereas Transformers establish global dependencies by learning long-range relationships. The Vanilla Transformer (illustrated in Fig. 1) for a sequence-to-sequence application is built using the following basic modules: (1) Input and Output Embedding, (2) Positional Embedding, (3) Multi-Head Self-Attention, (4) Pointwise Feed-Forward Networks, (5) Residual Connection, and (6) Normalization. On a macro level, the Transformer architecture stacks ''N'' identical copies of each of the following two units: (1) Encoder and (2) Decoder.

A. BASIC MODULES 1) INPUT AND OUTPUT EMBEDDING
Transformers cannot directly process text, hence the input is converted to embedding vectors, similar to word2vec [58]. VOLUME 10, 2022 The first step in the Vanilla Transformer is to feed the input sequence to an input embedding layer that converts the word tokens into vectors. The embedding layer is learnable, so the network is trained such that each word is mapped to a vector that characterizes it.

2) POSITIONAL EMBEDDING
The Transformer is ignorant of positional information as it contains no Convolution operation or Recurrent modules, so an additional positional vector is required to model the order of the input word tokens. Positional Embedding, an encoding that denotes the position of each word in the input sequence, is added to the embedded input vector. This pre-processing step, combining the input embedding and the positional embedding, is performed before the encoder module. The Positional Embedding proposed by Vaswani et al. [4] in the original work is a d-dimensional vector that encodes the position of each word using sine and cosine functions: PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)), where pos is the index of the time step in the token embedding, i is the dimension index, and d is the vector dimension.
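For illustration, Vaswani et al.'s sinusoidal Positional Embedding, PE(pos, 2i) = sin(pos/10000^(2i/d)) and PE(pos, 2i+1) = cos(pos/10000^(2i/d)), can be sketched in a few lines of NumPy (a minimal sketch; practical implementations typically precompute and cache this table):

```python
import numpy as np

def sinusoidal_positional_embedding(seq_len, d):
    """Sinusoidal positional embedding from Vaswani et al. [4] (d must be even).

    PE(pos, 2i)   = sin(pos / 10000^(2i/d))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    """
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(0, d, 2)[None, :]             # (1, d/2) even dimension indices
    angles = pos / np.power(10000.0, i / d)     # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                # even dims get sine
    pe[:, 1::2] = np.cos(angles)                # odd dims get cosine
    return pe

pe = sinusoidal_positional_embedding(seq_len=50, d=16)
# position 0 encodes sin(0) = 0 on even dims and cos(0) = 1 on odd dims
```

Because the encoding is deterministic, the same table can be added to any input sequence of length up to seq_len.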

3) SELF-ATTENTION
The attention mechanism is the core operation of a Transformer, establishing long-range dependencies within a data sequence. The process is analogous to the human perceptual system, which selectively focuses on the most crucial part of an object. There are two types of attention methods: (1) Attention, between the input and output elements, and (2) Self-Attention, within the input elements only. The Self-attention mechanism quantifies interdependence by associating each word with every other word, amplifying the essential information and attenuating the noisy information. Self-attention [15] is realized by feeding the input sequence, after Positional Embedding, to three distinct Fully Connected (FC)/Linear layers to generate the Query (Q), Key (K), and Value (V) matrices, as shown in Fig. 2. The individual nodes in a Fully Connected layer each produce a unique weighted sum of all the inputs, thereby learning the interdependence differently. Given the Query matrix (Q ∈ R^(N×D_k)), Key matrix (K ∈ R^(M×D_k)), and Value matrix (V ∈ R^(M×D_v)), the scaled dot-product attention is given in Eq. 2.
The N and M notations in the dimensions of the Q-K-V matrices denote the lengths of the Query and the Key/Value sequences, respectively. D_q, D_k, and D_v represent the output dimensions of the Query, Key, and Value FC layers, respectively. The Query matrix is multiplied by the transpose of the Key matrix (QK^T) to generate a score matrix, which is further divided by √D_k (the square root of the Query/Key dimension). The softmax function is applied to the scaled scores to generate the attention weights, boosting the high-score elements and attenuating the low-score values. A = softmax(QK^T / √D_k) is known as the Attention matrix, where the softmax function is applied row-wise. The dot products of queries and keys are divided by √D_k to alleviate the vanishing-gradient problem of the softmax function, since the raw dot products can produce high-magnitude values. The score matrix and attention matrix measure the compatibility between the Query and Key, i.e., the importance of one word relative to the other words in the sequence.
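The scaled dot-product attention of Eq. 2 can be written directly in NumPy; this is a minimal sketch in which random matrices stand in for the learned Q-K-V projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(D_k)) V  (Eq. 2)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (N, M) scaled score matrix
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax -> attention matrix
    return A @ V, A

rng = np.random.default_rng(0)
N, M, Dk, Dv = 4, 6, 8, 5
out, A = scaled_dot_product_attention(rng.normal(size=(N, Dk)),
                                      rng.normal(size=(M, Dk)),
                                      rng.normal(size=(M, Dv)))
# out has shape (N, Dv); each row of the attention matrix A sums to 1
```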
4) MULTI-HEAD SELF-ATTENTION
Instead of computing a single attention function, the Transformer runs ''h'' scaled dot-product attention heads in parallel on separately projected Queries, Keys, and Values. The outputs of all heads are concatenated and fed to a final FC layer with weight matrix W_O, as per Eq. 5. The output of this FC layer is the output of the Multi-Head Self-Attention, which is further processed in the next layer.
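As an illustration of the multi-head computation, the following NumPy sketch splits the projected Q, K, and V into ''h'' heads, attends per head, then concatenates the head outputs and applies the output projection W_O; the weight matrices here are random stand-ins, not trained parameters:

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, h):
    """Run h attention heads in parallel, concatenate, and project by W_O (Eq. 5)."""
    N, D = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # (N, D) projections
    dh = D // h                                   # per-head dimension
    heads = []
    for i in range(h):
        q, k, v = (M[:, i * dh:(i + 1) * dh] for M in (Q, K, V))
        s = q @ k.T / np.sqrt(dh)                 # per-head scaled scores
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)        # row-wise softmax
        heads.append(a @ v)                       # (N, dh) head output
    return np.concatenate(heads, axis=-1) @ Wo    # concat -> output FC layer

rng = np.random.default_rng(1)
N, D, h = 5, 16, 4
X = rng.normal(size=(N, D))
Y = multi_head_self_attention(X, *(rng.normal(size=(D, D)) for _ in range(4)), h)
# Y has the same shape as X: (N, D)
```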

5) POINTWISE FEED FORWARD NETWORK (FFN)
The Pointwise Feed-Forward Network (FFN) unit in the Transformer is a series of two Fully Connected layers, with a ReLU [59] or GELU [60] activation and a dropout function [61] between them. The output vector of the Multi-Head Self-Attention is fed to the successive Pointwise FFN. The Feed-Forward layers operate point-wise: every position in the input tensor is processed independently with the same set of weight parameters, each position producing its own weighted sums. The input dimension of the first FC layer and the output dimension of the second FC layer in the FFN module are denoted by Dim, as shown in Fig. 4. The output dimension of the first FC layer and the input dimension of the second FC layer are denoted by MLP_Dim (typically the search element).
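A minimal NumPy sketch of the Pointwise FFN, using ReLU and zero biases, and omitting dropout for brevity:

```python
import numpy as np

def pointwise_ffn(X, W1, b1, W2, b2):
    """Two FC layers applied identically at every position, ReLU in between."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2  # (N, Dim) -> (N, MLP_Dim) -> (N, Dim)

rng = np.random.default_rng(2)
Dim, MLP_Dim = 16, 64                              # an MLP ratio of 4
X = rng.normal(size=(10, Dim))
Y = pointwise_ffn(X, rng.normal(size=(Dim, MLP_Dim)), np.zeros(MLP_Dim),
                  rng.normal(size=(MLP_Dim, Dim)), np.zeros(Dim))
# output shape matches the input: (10, Dim)
```

Note that MLP_Dim (here 4 × Dim) is precisely the inner dimension that many NAS methods expose as a search element.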

B. ENCODER AND DECODER 1) ENCODER
The encoder and decoder blocks are predominantly built using the Multi-Head Self-Attention unit and the Position-wise Feed-Forward Network. The first sub-layer is an MHSA module, which allows the encoder to focus on the most relevant information in the sequence. The MHSA output vector (Y_MHSA) is added to the Positional Embedding tensor (X_PE) through a residual connection [1] and passed through a Layer Normalization [62] layer for further processing, i.e., Z_MHSA = LayerNorm(X_PE + Y_MHSA), as given in Eq. 6.
The normalized residual MHSA output vector (Z_MHSA) is fed to a Position-wise Feed-Forward Network to enhance the expressiveness of the Transformer. The output of the FFN module (Y_FFN) is added to the FFN input (Z_MHSA) through a residual connection and normalized to produce the output of encoder unit i, i.e., Z_i = LayerNorm(Z_MHSA + Y_FFN), as given in Eq. 7.
The Transformer encoder is constructed by repeatedly stacking ''N'' such two-sub-layer blocks of MHSA and FFN, along with residual connections and normalization.
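The encoder wiring of Eqs. 6 and 7 can be sketched in a few lines of NumPy; the MHSA and FFN sub-layers are passed in as stand-in callables, and the learnable scale and shift of Layer Normalization are omitted for brevity:

```python
import numpy as np

def layer_norm(X, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def encoder_unit(X_pe, mhsa, ffn):
    """One post-norm encoder unit: Z = LN(X + MHSA(X)); out = LN(Z + FFN(Z))."""
    Z_mhsa = layer_norm(X_pe + mhsa(X_pe))   # Eq. 6: residual + normalization
    return layer_norm(Z_mhsa + ffn(Z_mhsa))  # Eq. 7: residual + normalization

rng = np.random.default_rng(3)
X = rng.normal(size=(7, 16))
# identity stand-ins for the MHSA and FFN sub-layers keep the sketch self-contained
Y = encoder_unit(X, lambda t: t, lambda t: t)
```

Stacking ''N'' such units (feeding the output of one into the next) yields the full encoder.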

2) DECODER
Similar to the encoder, the decoder is also composed of ''N'' identical blocks, each a concatenation of three sub-layers. The first sub-layer is a masked Multi-Head Self-Attention, whose operation is similar to the MHSA module discussed earlier, except that the future positions in the sequence are masked. The attention mechanism is restricted so that the i-th position in the Query matrix can attend only to itself and earlier positions. The future positions in the MHSA module of the encoder are not limited, as the input sequence is fully available; the decoder, however, cannot access the future positions because they are yet to be predicted by the network. The masking is done using a mask matrix, via an element-by-element product of the mask matrix and the unnormalized attention matrix. This masked scheme is often referred to as causal or autoregressive attention. The second sub-layer is the cross-attention MHSA module, where an encoder-decoder mixer blends the output of the encoder with the output of the masked MHSA unit. This scheme allows the decoder to utilize the representations generated by the encoder block and focus only on the relevant data in the sequence. The third sub-layer is a Pointwise FFN unit, which takes its input vector from the cross-attention MHSA unit. The outputs of all three sub-layers are coupled with residual connections and Normalization, similar to the encoder. Even though it is common for a Transformer architecture to have the same number of encoder and decoder blocks, their depths can be adjusted independently of each other.
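One common way to implement the causal mask (sketched below, and a slight variation on the element-wise-product description above) is to set future-position scores to −∞ before the row-wise softmax, which drives their attention weights to zero:

```python
import numpy as np

def causal_mask(n):
    """Lower-triangular boolean mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    """Replace masked-out (future) scores with -inf, then apply row-wise softmax."""
    s = np.where(mask, scores, -np.inf)
    s = s - s.max(axis=-1, keepdims=True)   # stable softmax; exp(-inf) -> 0
    a = np.exp(s)
    return a / a.sum(axis=-1, keepdims=True)

A = masked_softmax(np.zeros((4, 4)), causal_mask(4))
# row 0 attends only to itself; row 3 attends uniformly to all 4 positions
```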

C. FAMILY OF TRANSFORMER ARCHITECTURES
A wide variety of Transformer-based large self-supervised and pre-trained models are proposed for language applications such as Bidirectional Encoder Representations from Transformer (BERT) [63] and Generative Pre-trained Transformer (GPT) [64]. These networks are designed to learn universal language representation from large unlabeled text datasets and pass the knowledge to a downstream application on labeled data. The pre-trained BERT or GPT is further fine-tuned on a specific task to avoid training from scratch [21].

1) BIDIRECTIONAL ENCODER REPRESENTATIONS FROM TRANSFORMERS
Traditionally, language models are designed to process input text sequentially in a single direction: right to left or left to right. The BERT model [63], in contrast, predicts missing words from both the previous and following context in an input sequence and is made up of only the encoder stack of the Vanilla Transformer; hence the name bidirectional encoder. It uses masked language modeling, masking 15% of the words in the input data, as shown in Fig. 5.
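The token-masking step can be illustrated with a simplified sketch; note that the real BERT recipe additionally replaces a fraction of the selected tokens with random or unchanged tokens (the 80/10/10 rule), which this toy version omits:

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace ~15% of tokens with [MASK]; the model must predict them
    from both left and right context (masked language modeling)."""
    rng = random.Random(seed)
    n = max(1, round(len(tokens) * mask_ratio))        # at least one mask
    positions = sorted(rng.sample(range(len(tokens)), n))
    masked = list(tokens)
    for p in positions:
        masked[p] = "[MASK]"
    return masked, positions

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, positions = mask_tokens(tokens)
# 9 tokens * 0.15 rounds to 1 masked position
```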

2) GENERATIVE PRE-TRAINED TRANSFORMER (GPT)
While BERT retains only the encoder of Vanilla Transformer, GPT [64] exploits its decoder. GPT can perform a wide range of language tasks with millions of parameters. The usual cross-attention connection from the encoder to the decoder in the Vanilla Transformer is removed as the encoder module is absent. Therefore, a GPT consists of only positional encoding, masked MHSA, FFN, and normalization units. As far as we know, there are no GPT-based NAS methods in the literature due to their size and computational complexity.

III. CONVOLUTIONS AND VISION TRANSFORMER
Convolutional Neural Networks (CNNs) dominated Computer Vision applications for a long time, with successful networks from AlexNet [68], VGGNet [69], and ResNet50 [1] to the state-of-the-art EfficientNet [70]. Inspired by the success of Vanilla Transformers in language and speech applications, its variant, the Vision Transformer (ViT), is heavily utilized in vision tasks and has become a dominant player in this arena. In ViT and its descendants, the Self-attention mechanism is either used as the main feature-extraction operation or combined with Convolution in a hybrid manner. In this paper, a Neural Network falls under the category of Vision Transformer if a Self-attention module is used in the network and it is applied to a Computer Vision task such as Image Classification, Object Detection, etc. The Convolution operation is not limited to vision applications; it has also been explored and is now indispensable in many language and speech applications. For example, ConvBERT [71] replaces a few linear operations with Convolution to better learn local dependencies in language tasks. In this section, we first review basic Convolution operations, followed by the Vision Transformer architecture.

A. CONVOLUTION OPERATIONS
A Convolution operation is the most basic unit in a CNN, whose primary job is to extract features and model local representations from the input image or intermediate feature map. Even though Convolutions are predominantly used in vision applications, they are significantly utilized in language and speech tasks alongside Self-attention.

1) STANDARD SPATIAL CONVOLUTION
In a Spatial Convolution, the weight matrix of size (k, k, I, O) is divided into a set of ''O'' filters, where each filter has a uniform size of (k, k, I). Each (k×k) kernel convolves with a region of the same size in the input activation/feature map to produce one pixel in the output feature map, as shown in Fig. 6. Each filter of size (k, k, I) performs Convolution with all the channels of the input activation map to generate one output pixel in a single channel of the output feature map. Pointwise Convolution is a special type of standard Convolution whose kernel size equals 1 × 1.

2) DEPTHWISE CONVOLUTION
In a Depthwise Convolution, each individual filter is convolved with only one channel in the input feature map/activation map to produce one output pixel in a single layer, unlike in spatial Convolution, where a filter is convolved with all channels in the feature map. The weight matrix of dimension (k, k, 1, O) is divided into a set of ''O'' filters, where each kernel is of size (k, k, 1). The (k×k) kernel convolves only with one channel to produce one pixel in the output feature map, as shown in Fig. 7. Therefore, the complexity of the Convolution operation is reduced by a factor of ''I,'' where I is the input channel size. Depthwise-Separable Convolution, first introduced in Xception [72] and MobileNet [73], is a sequence of a Depthwise Convolution followed by a Pointwise Convolution to combine the outputs in the depth dimension. The Depthwise Convolution operation captures spatial correlations within individual channels, while the Pointwise operation is for cross-channel correlations.
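The parameter savings of the depthwise-separable factorization are easy to verify with a little arithmetic (the sizes below are illustrative, not tied to any particular network):

```python
def standard_conv_params(k, I, O):
    """O filters, each of size (k, k, I)."""
    return k * k * I * O

def depthwise_separable_params(k, I, O):
    """One depthwise (k, k, 1) kernel per input channel, then a 1x1 Pointwise
    Convolution with weights (1, 1, I, O) to mix channels."""
    return k * k * I + I * O

k, I, O = 3, 64, 128
standard = standard_conv_params(k, I, O)          # 3*3*64*128 = 73728
separable = depthwise_separable_params(k, I, O)   # 576 + 8192 = 8768
# the factorization cuts parameters by roughly a factor of k^2 when O is large
```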

3) MOBILE INVERTED BOTTLENECK CONVOLUTION (MBConv)
MBConv, introduced in MobileNetV2 [74], is an encapsulation of three distinct Convolution operations. This unit is a straightforward extension to Depthwise Separable Convolution to improve model performance and efficiency over the predecessor MobileNet [73]. The input feature map to MBConv is a low-dimensional compressed tensor of Channel size ''I,'' fed to a 1 × 1 Pointwise Convolution to enlarge the depth dimension of the feature map by a factor of expansion ratio ''e.'' The expanded feature map, whose channel size is ''I*e,'' is applied to a Depthwise Convolution for spatial correlation between pixels. The final operation is another 1 × 1 Pointwise Convolution to generate a low-dimension activation map for the next module. The MBConv module forms a fundamental unit in many Convolution-only and Hybrid Attention-Convolution search spaces.
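The channel dimensions flowing through an MBConv block can be traced with a toy helper (illustrative values; stride, normalization, and activation layers are omitted):

```python
def mbconv_channel_flow(I, e):
    """Channel sizes through MBConv: 1x1 expand -> depthwise -> 1x1 project."""
    expanded = I * e      # 1x1 Pointwise Convolution expands by the ratio e
    after_dw = expanded   # Depthwise Convolution keeps the channel count
    projected = I         # final 1x1 Pointwise projects back to a low dimension
    return [I, expanded, after_dw, projected]

flow = mbconv_channel_flow(I=24, e=6)
# a compressed 24-channel input expands to 144 channels and projects back to 24
```

The kernel size k of the depthwise step and the expansion ratio e are exactly the MBConv search elements listed later in Section IV.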

B. ATTENTION FOR CONVOLUTIONS 1) SQUEEZE AND EXCITATION (SE) CONNECTION
The concept of channel attention is utilized in Squeeze and Excitation Network (SE-Net) [75], even before using the Self-attention style computation (Q-K-V style) for vision applications. SE-Net introduced feature recalibration to model interdependency between different channels of a feature map at a low computing cost, as shown in Fig. 8.
The first step is to perform a regular Convolution operation (F_tr) on the input feature map (X) of size (H'×W'×C') to produce an output (U) of size (H×W×C). The second step applies a global average pooling operation (F_sq(.)) to squeeze the channel-wise elements into a 1D tensor of size (1 × 1 × C). The squeezed tensor, which condenses the information of each channel into a single value to reduce the computation cost, is fed through two FC layers to establish channel interdependence. The output of the second FC layer, a 1D weighting tensor, is multiplied with the unsqueezed feature map (U) to produce the weighted output feature map (Z). This way, useful features are selectively enhanced while the less important ones are suppressed.
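A minimal NumPy sketch of the SE recalibration path; the bottleneck reduction ratio r and the random FC weights are illustrative stand-ins, and the sigmoid gate follows the original SE-Net design:

```python
import numpy as np

def squeeze_excite(U, W1, W2):
    """SE recalibration: global-average-pool each channel, two FC layers,
    sigmoid gate, then rescale the feature map channel-wise."""
    s = U.mean(axis=(0, 1))                  # squeeze: (H, W, C) -> (C,)
    z = np.maximum(s @ W1, 0.0)              # excitation FC 1 + ReLU (bottleneck)
    g = 1.0 / (1.0 + np.exp(-(z @ W2)))      # excitation FC 2 + sigmoid gate in (0, 1)
    return U * g                             # broadcast channel weights over H, W

rng = np.random.default_rng(4)
H, W, C, r = 8, 8, 16, 4                     # r: bottleneck reduction ratio
U = rng.normal(size=(H, W, C))
Z = squeeze_excite(U, rng.normal(size=(C, C // r)), rng.normal(size=(C // r, C)))
# each channel of Z is the corresponding channel of U scaled by a gate in (0, 1)
```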

2) SELF-ATTENTION FOR IMAGE FEATURE MAPS
The Self-attention mechanism can be used directly on the image feature maps, as shown in Fig. 9, instead of token embeddings described in the previous section. The input feature map is passed through three different 1 × 1 Convolutions to produce a set of Query, Key and Value feature maps [76], similar to Vanilla Transformers. Even though this kind of implementation is computationally very expensive, the global information in a Convolution operation is greatly enhanced.

C. VISION TRANSFORMER (VIT)
Vision Transformer (ViT) [5] paved the way for a new research direction of using Self-attention based modules for vision applications. ViTs have great flexibility and exhibit desirable characteristics for vision tasks, such as a large receptive field, high model capacity, less inductive bias, and a grouping effect. Similar to Vanilla Transformers, ViTs are efficient in modeling long-range interdependencies and can process multi-modal data such as image, video, speech, and text. The original Vision Transformer [5] does not use any Convolutional layer; instead, it utilizes the encoder module of the Vanilla Transformer to perform image processing. ViT divides the input image (H, W, 3) into several patches, which are fed as input to the Transformer encoder. ViT attains better performance than traditional CNN models on many standard Image Classification benchmarks, such as ImageNet [77], CIFAR-10 [78], and CIFAR-100 [78], by pre-training the network on a large-scale private dataset such as JFT-300M [79]. The initial Vision Transformer (Fig. 10) consists of the following three important modules: (1) Patch Embedding, (2) Position Embedding, and (3) Transformer Encoder.

1) PATCH EMBEDDING
The Vanilla Transformer receives a 1D vector sequence of token embeddings as input for language modeling. However, the input to a Vision Transformer is an image of dimension (H×W×3), which is transformed into a sequence of flattened 2D patches. ViT splits the input image into several non-overlapping patches of size p×p and treats them as token embeddings. The input sequence is a flattened vector (2D to 1D) of image pixel values, as illustrated in Fig. 11. Consider an ImageNet input image of dimension (B, H, W, C) = (B, 224, 224, 3), where B, H, W, and C indicate the batch size, height, width, and input channel size, respectively. The patch size (p) must be chosen such that the height H and width W of the input image are divisible by it. The resolution of each image patch is (p, p), and effectively, the input sequence length is N = H·W/p². The number of patches along the height, p_h (along the width, p_w), equals the image height H (width W) divided by the patch size p, as given in Eq. 8.
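The patch extraction and the sequence length N = H·W/p² can be checked with a short NumPy sketch (the learnable linear projection to the embedding dimension is omitted):

```python
import numpy as np

def image_to_patches(img, p):
    """Split an (H, W, C) image into N = H*W/p^2 flattened non-overlapping patches."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "H and W must be divisible by the patch size"
    # group pixels into a (H/p, W/p) grid of (p, p, C) patches, then flatten each
    patches = img.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)    # (N, p*p*C) token-like sequence

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = image_to_patches(img, p=16)
# N = 224*224 / 16^2 = 196 patches, each flattened to 16*16*3 = 768 values
```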

2) POSITION EMBEDDING
The Transformers receive a set of vector elements as input, and therefore the Neural Network learns irrespective of the input sequence order. Nevertheless, the position of an element in the sequence is very important; this holds for words in language modeling and for pixels of an image in Computer Vision. Hence, a 1D learned positional encoding is linearly added to every patch embedding in the flattened sequence to inject information about the absolute position of the patches and preserve spatial information, similar to the Vanilla Transformer. In addition, ViT prepends a BERT-style learnable [class] embedding token to the patch sequence. Dosovitskiy et al. [5] perform 2D interpolation of pre-trained position embeddings based on pixel location in the input image. Other Vision Transformer-based networks replace the positional encoding of the original ViT with parametric encoding [80], Fourier-based kernelized versions [81], or relative position encoding [82].

3) TRANSFORMER ENCODER
The Vision Transformer employs only the encoder unit of the Vanilla Transformer for input feature extraction. The Patch- and Position-embedded patches of the split image are passed directly to the Transformer encoder for further processing. The encoder is a sequence of ''L'' identical units of Multi-Head Self-Attention and Multi-Layer Perceptron (MLP) units. The MLP discussed here is the same as the FFN module in the Vanilla Transformer. Unlike the encoder in the original Transformer, the normalization layer is applied before the MHSA and MLP modules, as shown in Fig. 12a. The output tensor of MHSA is added to the input vector of the encoder through a residual connection. The MLP unit consists of two FC layers with a GELU activation function, whose output is added through a residual connection to the tensor before the second normalization layer. The final layer in the Vision Transformer is a simple FC layer, similar to the FC layer in a traditional CNN, to predict the output probabilities. CvT [83] improves the ViT architecture by replacing the linear projection with a Convolution projection, as shown in Fig. 12b; the output of the Convolution projection is flattened and passed as a token. DeepViT [84] introduced a Re-attention unit in place of MHSA to resolve the attention collapse issue.

D. FAMILY OF VISION TRANSFORMERS
The remarkable success of the original Vision Transformer inspired researchers to improve its architectural structure. The advancement is accomplished either by modifying the QKV Self-attention layout or by introducing a Convolution operation alongside the MHSA segment in a hybrid manner to extract better local texture cues. Cascade hybrid ViTs only replace the Fully Connected layer in the initial Embedding layer [85] or the linear projection [83] with a Convolution and do not disturb the rest of the architecture. Parallel hybrid ViTs, on the other hand, insert a series of Convolution blocks between different MHSAs or replace a few MHSA modules entirely to exploit the advantages of both modules. Notable MHSA-only improved ViT architectures include Transformer in Transformer (TNT) [86], Swin [6], Twins [87], etc. TNT employs an inner and an outer Transformer module to minimize the intrinsic pixel-level information loss caused by dividing the input image into patches. Orthogonal to efficient Vision Transformer network design, a few researchers enhanced training methodologies for higher model performance. For example, DeiT [9] addresses data efficiency by modifying the Transformer topology and adopting a Knowledge Distillation strategy to train the ViT with a teacher-student method.

IV. TRANSFORMER SEARCH SPACE
The pivotal factors for an efficient end-to-end Neural Architecture Search (NAS) pipeline for any task, as depicted in Fig. 13, are (i) a set of primitive search elements, (ii) a well-designed search space, (iii) a search algorithm, and (iv) a performance evaluation strategy to pick the optimal network architecture from the pool of searched networks.

A. PRIMITIVE ELEMENTS
The primitive element set typically comprises manually designed Attention and Convolution units and their hyperparameters. The discrete components' size and dimension impact the architecture's computation complexity and model performance. The search components within a Multi-head Self-Attention block are Q-K-V dimensions (W Q , W K , W V matrix dimensions in Fig. 2), head number (h in Fig. 3), and hidden size. The search elements in the Feed Forward Network are MLP dimension/MLP ratio/MLP expansion ratio (MLP_Dim in Fig. 4). MLP ratio is the ratio of Input dimension (Dim in Fig. 4) and MLP dimension (MLP_Dim). The search components outside MHSA and FFN include depth or number of encoder/decoder layers (N in Fig. 1). Additionally, the Vision Transformer considers Patch Size (p) and Embedding dimension (P E in Fig. 11). The search elements for a Spatial or Depthwise Convolution are kernel size (k), number of Channels/Filters (C). The search constituents of the Mobile Inverted Bottleneck (MBConv) block [74] are kernel size (k) and expansion ratio (e).

B. CHALLENGES IN BUILDING A SEARCH SPACE
The first step is to carefully curate a search space consisting of a general structure of the Transformer architecture, from which the search method finds an acceptable model. The choice of fundamental Neural Network operations (such as FC or Convolution layers) in the search space is combinatorial, and the search method's computational complexity increases with the number of operations and the size of the search space: the larger the search space in terms of the fundamental units' size and dimension, the higher the computation time, as the search algorithm must evaluate more options. Moreover, since the NAS algorithm relies on standard ML techniques, the searched models are prone to overfitting, just like traditional Machine Learning models. As seen from the experiments of AutoFormer [88] (Fig. 14), increasing the depth (d), Embedding dimension (e), MLP ratio (r), and number of heads (h) increases the accuracy on the ImageNet dataset up to a peak value, after which the models overfit. Hence, there lies a tremendous challenge in choosing the correct dimensions of a Transformer and a robust search space that can guarantee optimal model performance for a given dataset: with a mediocre search space, the search algorithm can only find mediocre networks.

C. SEARCH SPACE TYPES
There exist two widely used types of search space through which the primitive elements are connected to form an end-to-end Transformer neural architecture: (i) Micro/Cell-level search space and (ii) Macro/Layer-wise search space. These search spaces are applicable to any kind of Neural Network, not just the Transformer model.

1) MICRO/CELL-LEVEL SEARCH
The Micro search works on the principle that a small Directed Acyclic Graph (DAG) or a cell structure is searched instead of searching for the entire network end-to-end. The searched DAG is stacked N times sequentially to form the final architecture, thereby reusing the same structure repeatedly throughout the model. A cell is constructed using M nodes and edges, as shown in Fig. 15, where each node is a tensor, and the edge represents an operation that transforms data between any two nodes. The task boils down to finding the optimal operation for each individual edge from the pre-defined set of primitive elements. Although the same topology is replicated across different layers in the network, the dimensions of the primitive elements, such as filter size, vary with each cell. Examples of NAS methods applying the cell-based search space are Evolved Transformer [89], DARTS-Conformer [90], etc.
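The cell abstraction can be made concrete with a toy DAG evaluator; the three primitive ''ops'' here are arbitrary stand-ins for real operations such as attention or Convolution, and the 4-node wiring is purely illustrative:

```python
# nodes are tensors; each directed edge applies one primitive op chosen from a set
ops = {"identity": lambda x: x, "double": lambda x: 2 * x, "negate": lambda x: -x}

def eval_cell(x, edges):
    """Evaluate a tiny 4-node cell: edges map (src, dst) -> op name,
    and each node sums the transformed outputs of its incoming edges."""
    nodes = {0: x}
    for dst in range(1, 4):
        nodes[dst] = sum(ops[name](nodes[src])
                         for (src, d), name in edges.items() if d == dst)
    return nodes[3]

# one candidate wiring; the search algorithm's job is to pick the op on every edge
edges = {(0, 1): "double", (0, 2): "negate", (1, 3): "identity", (2, 3): "double"}
y = eval_cell(3.0, edges)   # node1 = 6, node2 = -3, node3 = 6 + (-6) = 0
```

The searched cell is then stacked N times to form the final network, as described above.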

2) MACRO/LAYER-WISE SEARCH
The identical cell structure throughout the network in the micro-search process offers limited flexibility and performs poorly on multiple hardware platforms due to its unstructured computation. Layer-wise/Macro search space can resolve this problem by first constructing a chain-type macro-architecture and searching for different operations/configurations at each layer to obtain more robust models for hardware-friendly inference and better model performance. The majority of NAS methods discussed in this survey paper, such as GLiT [91], rely on layer-wise search.

D. TRANSFORMER SEARCH SPACE TYPES
Previously, we divided the search spaces based on the interconnection of several primitive operations, irrespective of the Self-attention or Convolution operation. There are two categories of search spaces with reference to the type of operations in the primitive element set: (i) Self-Attention only search space and (ii) Hybrid Attention-Convolution search space. Both search spaces can be employed in any architecture in the Transformer family (Vanilla Transformer, BERT, and Vision Transformer), depending on the application and requirements. We outline several key Attention and Convolution search spaces, which act as the backbone for searching different neural architectures, in the respective subsections while discussing each method. The summarized search spaces help in understanding the different components of a Transformer architecture search space, and researchers looking to start working on Transformer NAS problems can adopt the search spaces mentioned here.

1) SELF-ATTENTION (SA) ONLY SEARCH SPACE
The SA-only search space is limited to the elements in the Vanilla Transformer, such as head number, FFN hidden size, etc.; it is simply a Convolution-free search space. The early NAS methods, such as Evolved Transformer [89] and AutoFormer [88] on language and vision tasks, relied only on the encoder or decoder of the Vanilla Transformer to search for better hyperparameters. We outline a few commonly used search spaces and their components to get a sense of the problem.

a: AUTOFORMER [88]
The search space, illustrated in Table 3, consists of key components of a Transformer such as the Embedding dimension, number of heads, Q-K-V dimension, MLP ratio, and network depth. The authors propose three network setups (tiny, small, and base) to operate under different constraints and model complexities that vary across layers. The dimensions of search elements can take any value between the highest (x) and lowest (y) values with a step size of z, represented as a tuple (x, y, z) in the table.

b: TWINS SEARCH SPACE [87]
ViTAS [92] extended the tiny, small, large, and base backbones of the Twins Transformer [87] to search for operation type (Op.), patch size, head number, and the dimensions of MHSA and MLP. The backbone architecture is divided into four stages, as shown in Table 4, where each stage is defined by its maximum dimensions and number of repeatable blocks. The "local" and "global" entries in the table refer to Locally-grouped Self-attention (LSA) and Global sub-sampled attention (GSA), respectively, proposed by Twins [87]. LSA captures short-distance, fine-grained information, while GSA fetches long-distance, global information. Max_a specifies the maximum dimension of the MHSA layer (also the size of the patch embedding layer), and Max_m indicates the maximum dimension of the MLP layer. The ratio {i/10} denotes the reduction ratio from the maximum output dimension, i.e., the search space includes MHSA dimensions of (i/10) × Max_a for i = {1, 2, ..., 10}.
c: DeiT SEARCH SPACE [9]
ViTAS [92] also extended the tiny and small backbones of the Data-efficient Image Transformer (DeiT) [9], whose search space is delineated in Table 5. DeiT retains the skeleton of the original Vision Transformer [5], searching for the patch size, number of heads, and output dimensions of the MHSA and MLP blocks. The ''Max Dim.'' column in the table indicates the maximum output dimensions of the MHSA and MLP blocks, and the ratio specifies the reduction ratio from the maximum dimension.
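The search spaces above are essentially dictionaries of legal value ranges. As a toy illustration, the following sketch samples one subTransformer from an AutoFormer-style (x, y, z) space; the dimension names mirror Table 3, but the numeric ranges are illustrative stand-ins, not the published values:

```python
# Sketch of sampling a subTransformer from an AutoFormer-style search
# space. Each dimension is encoded as a (low, high, step) tuple; the
# concrete ranges below are illustrative, not AutoFormer's Table 3.
import random

SEARCH_SPACE = {
    "embed_dim": (192, 240, 24),   # patch-embedding dimension
    "depth":     (12, 14, 1),      # number of Transformer blocks
    "num_heads": (3, 4, 1),        # heads per MHSA layer
    "mlp_ratio": (3, 4, 1),        # FFN hidden size / embed_dim
}

def choices(low, high, step):
    """All legal values for one dimension, inclusive of both ends."""
    return list(range(low, high + 1, step))

def sample_subtransformer(rng):
    """Pick one value per dimension (per-layer choices omitted for brevity)."""
    return {k: rng.choice(choices(*v)) for k, v in SEARCH_SPACE.items()}

rng = random.Random(0)
config = sample_subtransformer(rng)
assert config["embed_dim"] in (192, 216, 240)
```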

2) HYBRID ATTENTION-CONVOLUTION SEARCH SPACE
There is a growing interest in integrating the Self-Attention mechanism and Convolution operations (Spatial and Depthwise Convolutions) in ViTs to combine the strengths of both worlds [94]. For instance, the Convolutional vision Transformer (CvT) [83] introduced a Convolution-based Embedding and QKV projection instead of an FC network in ViT. MobileViT [95] adopted a macro-architecture similar to MobileNetV2 [74] and replaced a few MBConv blocks with proposed attention-based modules. The blended Attention-Convolution search space is not limited to vision applications but is also widely used in NLP and speech tasks.
a: NASViT SEARCH SPACE [93]
The search space of NASViT [93] is inspired by the architecture of LeViT [96], where the first four layers are Convolution operations for better efficiency in processing high-resolution feature maps, followed by MHSA operations in the remaining part of the network to handle low-resolution embeddings for modeling global representation. The initial three layers of NASViT consist of MBConv units, and the remaining four layers are occupied by Transformer modules, as shown in Table 6. The search elements in MBConv are the output channel width (C), kernel size (k) of the Depthwise Convolution, expansion ratio (e), and the number of repeated MBConvs (depth). The primitive elements in the Transformer block are the number of windows (a technique from Swin Transformers [6]), width (hidden dimension features), number of MHSAs (depth), and MLP ratio in the FFN. The shifted windowing method in the Swin Transformer limits Self-attention to a non-overlapping local window for better efficiency while permitting cross-window connections in a few cases.

V. NEURAL ARCHITECTURE SEARCH METHODS
The pivotal aspect of the success of Neural Networks is the efficient design of an architecture for a given application. The manual design of a neural architecture is a labor-intensive procedure, requiring domain expertise and significant computing infrastructure to train and evaluate many networks. Furthermore, prior knowledge restricts researchers/engineers from developing out-of-the-box, innovative architectures [93] that attain higher model performance than state-of-the-art designs. Neural Architecture Search (NAS) aims to discover the best-performing neural architecture s* from a given Search Space S on a given Dataset D. The search algorithm describes the methodology used to find the optimal model from the pool of all possible Neural Networks. Although there exist several types of NAS methods in the literature, we discuss the preliminaries and basic outline of (1) Reinforcement Learning (RL) NAS, (2) Evolutionary Learning (EL) NAS, and (3) One-Shot NAS methods, on which Transformer NAS strategies rely.

A. REINFORCEMENT LEARNING NAS (RL-NAS)
Reinforcement Learning (RL) is a computational paradigm for learning a policy that takes optimal actions in various states of a task environment to maximize the cumulative reward received by the acting agent, as shown in Fig. 16. Reinforcement Learning-based Neural Architecture Search methods [12], [97] learn to discover the best-performing model architecture, similar to traditional RL methods whose objective is to learn the best action. The environment denotes the pool of all possible distinct Neural Networks, built using a pre-defined search space and primitive elements. NAS methods using the RL framework employ an RNN as the RL controller, whose job is to sample high-quality neural architectures from the pool of all possible models, as shown in Fig. 17. The RNN-predicted model is trained end-to-end to obtain the validation accuracy or another relevant performance metric. The validation accuracy is the feedback or reward that guides and updates the RNN controller to predict the next best candidate model. In RL terminology, the neural architecture moves from the current state to the next best state so as to maximize the reward.
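The sample-score-update loop described above can be sketched with a toy tabular controller in place of the RNN; the reward function below is a stand-in for validation accuracy (no network is actually trained), so the whole example is illustrative only:

```python
# Toy sketch of the RL-NAS loop: a tabular softmax "controller" (an RNN
# in the actual methods) samples one operation per layer, the sampled
# architecture is scored, and the score is fed back as a reward via a
# REINFORCE update with a moving-average baseline.
import numpy as np

OPS = ["attention", "conv", "ffn"]
NUM_LAYERS = 3
rng = np.random.default_rng(0)
logits = np.zeros((NUM_LAYERS, len(OPS)))  # controller parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def toy_reward(arch):
    # Stand-in for validation accuracy: pretend "attention" is best.
    return sum(op == "attention" for op in arch) / NUM_LAYERS

baseline, lr = 0.0, 0.5
for step in range(500):
    probs = [softmax(l) for l in logits]
    idx = [rng.choice(len(OPS), p=p) for p in probs]
    arch = [OPS[i] for i in idx]
    reward = toy_reward(arch)
    baseline = 0.9 * baseline + 0.1 * reward   # variance-reduction baseline
    for layer, i in enumerate(idx):            # REINFORCE update
        grad = -probs[layer]
        grad[i] += 1.0
        logits[layer] += lr * (reward - baseline) * grad

best = [OPS[int(np.argmax(l))] for l in logits]
print(best)
```

After a few hundred iterations, the controller's probability mass concentrates on the highest-reward operation at each layer, mirroring how the RNN controller drifts toward high-accuracy architectures.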

B. EVOLUTIONARY LEARNING NAS (EL-NAS)
Evolutionary Neural Architecture Search (EL-NAS) [98] is a generic population-based optimization algorithm that draws its inspiration from natural selection and genetics [45]. Although this search method comes in various forms, the most commonly used technique is the Genetic Algorithm (GA), which simulates the evolution of species. GA is an iterative process of evaluating selected individuals according to a fitness function and generating a new set of individuals using the characteristics of the best-performing models from the previous generation. The basic component of GA is the population of individuals, each of which describes a distinct architecture in the predefined search space. An initial population is generated by randomly sampling different networks from a large pool. The individuals, each representing a specific neural architecture, are trained end-to-end on the target task to determine their fitness (typically a model performance metric such as validation accuracy). Generating new individuals involves mutation of primitive operations within a network or crossover of the top individual models in the current generation.
The Genetic Algorithm utilizes the Tournament Selection method to choose the fittest candidates from the current generation of architectures [89]. The K-way tournament selection method selects k individuals from the population to run a tournament among themselves; the fittest of the k chosen candidates is picked for further processing in the iteration. In general, several tournaments are run to pick the best architecture for a given dataset and application. Weaker candidates have less chance of survival in a large tournament, as they compete with candidates of higher fitness. The Evolutionary Neural Architecture Search process is summarized in Fig. 18.
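A minimal sketch of K-way tournament selection, with toy fitness values standing in for validation accuracy:

```python
# Sketch of K-way tournament selection as used in Evolutionary search:
# draw k individuals from the population and keep the one with the best
# fitness. Fitness values here are toy stand-ins for accuracy.
import random

def tournament_select(population, fitness, k, rng):
    """Return the fittest of k randomly drawn individuals."""
    contestants = rng.sample(population, k)
    return max(contestants, key=fitness)

rng = random.Random(0)
population = [f"arch_{i}" for i in range(20)]
fitness = lambda arch: int(arch.split("_")[1]) / 20.0  # toy accuracy

parent = tournament_select(population, fitness, k=5, rng=rng)
assert parent in population
```

Note how the tournament size k controls selection pressure: as k approaches the population size, the weakest candidates have essentially no chance of being selected.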

C. ONE-SHOT NAS
The resource, time, and computational demands of RL search methods make it very difficult to find efficient neural architectures, as they require training and evaluating several networks. The One-shot NAS algorithm [100] lowers the search burden on limited computing devices by representing or encoding all possible distinct models in the search space in the form of a single Supernetwork. The key idea is to couple network training and neural architecture sampling within a supernet to reduce the search cost. The discrete set of candidate architectures is replaced by one continuous Supernetwork, as opposed to the RL or EL-NAS methods. Therefore, only one Supernetwork is trained instead of discretely training individual networks, leveraging the weight-sharing concept: the same operation can be used to evaluate several neural architecture combinations. Nevertheless, the search cost increases with the size of the search element set, as more parameters are trained in the Supernetwork. In the One-shot NAS approach, the term SuperTransformer denotes a Supernetwork consisting of all possible combinations in the Transformer space, and subTransformer refers to a subnetwork sampled from the SuperTransformer.

1) DARTS: DIFFERENTIABLE ARCHITECTURE SEARCH
The pioneering work in One-Shot NAS is DARTS [99], which formulates the search problem in a differentiable manner, unlike RL or Evolutionary search. DARTS uses the softmax function to relax the discrete search space and gradient descent to train the Supernetwork and search for the architecture. The pivotal idea in this search method is to construct a Supernetwork that encodes all possible distinct neural architectures based on the predefined primitive search elements. For example, DARTS defines a cell structure with four nodes (numbered 0, 1, 2, 3) and six edges (each edge corresponds to an operation). The distinct possible paths with four nodes are as follows: (0→1), (0→2), (0→3), (1→2), (1→3), and (2→3), as shown in Fig. 19a. The task is to search for the best operation at each edge such that the overall model performance is as high as possible. Each node is populated with all possible operations in the set of search elements to form a supernode, as illustrated in Fig. 19b. A set of auxiliary parameters called architectural weights {α1, α2, ..., αN} is defined for every individual node, which guides the search process at the respective node. The input activation at each node is broadcast to all distinct operations, and the output activation map is a softmax-weighted sum of the operations. If (α1, α2, ..., αN) are designated to operations (Op1, Op2, ..., OpN) respectively in a supernode, the activation feature map O_fmap at the output node for an input feature map I is given as per Eq. 9.
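The softmax-weighted sum of Eq. 9 can be sketched as follows; the candidate operations are toy stand-ins for Convolution/Attention layers:

```python
# Sketch of the DARTS continuous relaxation (Eq. 9): the output of a
# supernode is the softmax(alpha)-weighted sum of all candidate
# operations applied to the same input.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Candidate operations Op_1..Op_3 (illustrative stand-ins).
ops = [lambda x: 2.0 * x,           # e.g. a convolution
       lambda x: x,                 # identity / skip connection
       lambda x: np.zeros_like(x)]  # "zero" operation

alpha = np.array([2.0, 0.5, -1.0])  # architectural weights for this edge

def mixed_op(x, alpha):
    """O_fmap = sum_i softmax(alpha)_i * Op_i(x)."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.ones(4)
out = mixed_op(x, alpha)
# Discretization step: keep only the op with the largest alpha (here Op_1).
best_op = int(np.argmax(alpha))
assert best_op == 0
```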
The entire Supernetwork is trained end-to-end using standard gradient descent, similar to traditional Neural Network training. The Convolution or Attention weights W* of the Supernetwork are updated on the training dataset D_train while freezing the architectural weights α*. Conversely, the architectural weights α* are updated on the validation dataset D_val while freezing the operation weights W*. The best operation at each edge of the trained Supernetwork is sampled by choosing the operation with the highest architectural weight α from the respective set {α1, α2, ..., αN}. For example, the blue-colored path is chosen between nodes (0, 2), as shown in Fig. 19c. The final cell architecture is formed by repeatedly sampling operations at each edge, as depicted in Fig. 19d. Although One-shot NAS training eases the search burden, it is prohibitive for a huge search element set and search space, leading to high GPU memory consumption and eventually running out of memory on large datasets. Also, training a Supernetwork is non-trivial due to interference caused by child networks; thus, rank correlation is not preserved. Several follow-up methods, such as ProxylessNAS [101] and Single-path NAS [102], addressed these issues and enhanced differentiable search algorithms.

D. PERFORMANCE EVALUATION
The performance evaluation phase of a search algorithm specifies how the performance of a predicted neural architecture in the search space is estimated. It compares the different Neural Networks generated by search algorithms and guides the search process toward optimal models. Examples of performance evaluation strategies include full or partial training of architectures on the target/proxy datasets. The seminal RL-NAS method [12] trains every network predicted by the RNN controller until convergence. Hence, this process is highly expensive, requiring 60 GPU-years of computation time, even on a proxy dataset (CIFAR-10). To curtail expensive performance estimation, several methods have been introduced, such as early stopping [103], NAS Benchmarks [104], low-resolution datasets, and searching networks with fewer parameters and cells [99]. One-shot methods can alleviate the performance estimation burden by sharing the weight parameters of all networks within a single graph and evaluating a sub-graph of the Supernetwork. A few examples of performance evaluation in Transformer-based NAS methods are as follows: (i) GLiT [91] uses the validation dataset to evaluate a subnetwork during the search process, and (ii) subnetworks in AutoFormer [88] are evaluated under the management of the evolutionary algorithm.

VI. HARDWARE-AWARE NAS (HW-NAS)
Hardware-Aware Neural Architecture Search fully automates the Neural Network design procedure to search for models that are not only optimized for accuracy but also for efficient performance on the underlying hardware platform. Conventional hardware-agnostic NAS methods aim to search for the best neural architecture that maximizes only task performance or accuracy. However, HW-NAS finds promising networks optimized for a wide range of hardware performance metrics such as latency [105], [106], FLOPS [70], power consumption, energy [107], and memory usage [108]. This section first reviews efficient methods for hardware inference, followed by a discussion on injecting hardware awareness into RL, One-shot, and EL-NAS methods.

A. HARDWARE ACCELERATION METHODS
Pruning [109], [110], Quantization [111], [112], [113], Knowledge Distillation [114], Tensor Decomposition [115], Neural Architecture Search, etc., are some of the techniques for designing lightweight, memory-efficient, and hardware-friendly models for inference on a variety of devices such as CPUs, GPUs, ASICs, and FPGAs. These methods have achieved remarkable success in compressing large networks into smaller models with negligible loss in accuracy or task performance. Neural Network Pruning removes redundant or unimportant weights/nodes/neurons/filters that do not significantly contribute to model performance, thereby reducing the size and computational complexity of a model. Network Quantization converts high-precision model weights/parameters (Floating Point 32) to low precision (Integer 8, Integer 4). Quantization has attracted much attention in academia and industry, as inference can be performed at low precision with a negligible drop in accuracy, as opposed to training, where a model is trained at high precision. Knowledge Distillation is a network compression method in which a relatively small model is trained to mimic the behavior of a large pre-trained network.
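For illustration, the following sketch applies magnitude pruning and symmetric Int8 quantization to a toy weight matrix; it is a minimal demonstration of the two ideas, not any specific method from the literature:

```python
# Minimal sketches of two compression methods on a toy weight matrix:
# magnitude pruning (zero out the smallest-magnitude weights) and
# symmetric Int8 quantization (float32 -> int8 and back).
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)

# Pruning: remove the 50% of weights with the smallest magnitude.
threshold = np.quantile(np.abs(w), 0.5)
w_pruned = np.where(np.abs(w) >= threshold, w, 0.0)
assert (w_pruned == 0).sum() >= w.size // 2

# Quantization: map [-max|w|, max|w|] onto the int8 range [-127, 127].
scale = np.abs(w).max() / 127.0
w_int8 = np.round(w / scale).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale
assert np.max(np.abs(w - w_dequant)) <= scale  # bounded rounding error
```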

1) REINFORCEMENT LEARNING BASED HW-NAS (RL-HW-NAS)
In RL-HW-NAS, the controller samples networks that are not only accurate but also less computationally expensive. For instance, MNasNet [105] added hardware latency characteristics as constraints in formulating the NAS problem. Given a Neural Network model M, Acc(M), Lat(M), and T stand for the validation accuracy, the latency of model M on the hardware (or any other constraint), and the target latency, respectively. The NAS optimization can be reformulated to include the hardware constraints, as per Eq. 11.
A weighted product of accuracy and latency can be used to approximate Eq. 11, with a tunable parameter γ for the trade-off between these two metrics, as shown in Eq. 12. A higher γ value prefers latency minimization over accuracy maximization and vice versa.
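For reference, one consistent reconstruction of Eqs. 11 and 12, following the MNasNet [105] formulation (the exact exponent convention is an assumption chosen to match the description above):

```latex
% Eq. 11: hardware-constrained NAS objective
\max_{M} \ \mathrm{Acc}(M) \quad \text{subject to} \quad \mathrm{Lat}(M) \le T

% Eq. 12: soft weighted-product relaxation; a larger \gamma \ge 0
% penalizes latency more heavily relative to accuracy
\max_{M} \ \mathrm{Acc}(M) \times \left[\frac{T}{\mathrm{Lat}(M)}\right]^{\gamma}
```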
Jiang et al. [116] improve RL-HW-NAS by avoiding the training of sampled networks that do not meet the target constraint. In this case, the controller receives a zero reward signal to update and predict the next best network in the search process, thereby decreasing the search cost.

3) DIFFERENTIABLE HW-NAS
Differentiable HW-NAS methods, such as ProxylessNAS [101], add a hardware cost function directly to the loss function of the Supernetwork, thereby formulating a multi-objective optimization, as shown in Eq. 14, where C(Lat(α)) is the latency cost function of both the architectural parameters (α) and the actual hardware latencies. If {α_i1, α_i2, ..., α_iN} correspond to the architectural parameters of layer i and {Lat_i1, Lat_i2, ..., Lat_iN} represent the latencies of operations {Op_i1, Op_i2, ..., Op_iN} in the same layer, the latency cost function C(Lat(α)) is formulated as per Eq. 15.
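The expected-latency formulation of Eq. 15 can be sketched as follows, with made-up per-operation latency numbers; the key point is that the cost is a smooth function of the architectural parameters:

```python
# Sketch of the differentiable latency term (Eq. 15): the expected
# latency of layer i is the softmax(alpha_i)-weighted sum of profiled
# per-operation latencies. Latency numbers are made up for illustration.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Profiled latencies (ms) of candidate ops {Op_i1 .. Op_iN} per layer.
latency_table = np.array([
    [1.2, 0.8, 0.1],   # layer 0: conv, attention, skip
    [2.5, 1.1, 0.1],   # layer 1
])
alphas = np.zeros_like(latency_table)  # architectural parameters

def expected_latency(alphas):
    """C(Lat(alpha)) = sum_i sum_j softmax(alpha_i)_j * Lat_ij."""
    return sum(softmax(a) @ lat for a, lat in zip(alphas, latency_table))

cost = expected_latency(alphas)  # uniform alphas: mean latency per layer
assert abs(cost - ((1.2 + 0.8 + 0.1) / 3 + (2.5 + 1.1 + 0.1) / 3)) < 1e-9
```

Because `expected_latency` is differentiable in the alphas, gradient descent can push probability mass toward cheap operations (here, the skip connection) while the task loss pulls toward accurate ones.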
A similar bi-level optimization of alternate weights (w*) and architectural parameters (α) update is performed to search for a subnetwork that minimizes validation loss and latency cost function.
Once-For-All (OFA) [117] combines the ideas of One-shot Supernetwork training and Evolutionary search to reduce the search cost. OFA for CNN search first trains a large over-parameterized network of maximum dimensions (kernel size and channel width) and then samples different-sized models to specialize for different hardware platforms. The main advantage is avoiding the fine-tuning of sampled networks for different cases, as weights are retained from the Supernetwork. Several Transformer NAS methods described in this paper, such as AutoFormer and ViT-ResNAS [118], rely on the Once-For-All methodology by first training a large Transformer Supernetwork and then applying Evolutionary search to find the optimal submodel. For instance, an MHSA/FC dimension of 256 is encapsulated within a linear layer of dimension 512 in the SuperTransformer. During the Evolutionary search, various subnetworks within the size of 512 are sampled for evaluation.

C. NAS FOR PRUNING
Automatic pruning applies the search methods to automatically prune the redundant weights/neurons in a network instead of manually designed pruning algorithms. The automatic pruning methods, such as APQ [119] and NetAdapt [120], significantly reduce the size of a CNN model. Several BERT-based automatic pruning methods, like AdaBERT [121] and NAS-BERT [122], have been developed to compress the large pre-trained BERT model on a huge dataset into a small model for downstream tasks.

D. NAS FOR QUANTIZATION
Uniform Quantization (UQ) lowers the precision of all Neural Network parameters to the same precision. For example, uniform Integer 8 quantization is a simple and promising method to quantize with negligible accuracy loss. On the other hand, Mixed Precision Quantization (MPQ) quantizes layers within a model or tensors (weight and activation) within a layer to different precisions to attain additional speedup compared to uniform Int8. For a given network of L layers and P precisions to choose from, there exist P^L distinct configurations. Hence, mixed precision quantization can be transformed into an architecture search problem, where the operations in the search space are replaced by different precisions such as Int2, Int4, Int8, etc. Many CNN-based works, such as HAQ [123] and DNAS [124], apply NAS principles to solve this problem, and there is growing interest in applying these search methods to Transformers, e.g., AQ-BERT [125].
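The P^L blow-up can be verified in a few lines; the precision set and layer count below are arbitrary:

```python
# The count of mixed-precision configurations: a network of L layers
# with P candidate precisions admits P**L assignments, which is why MPQ
# is cast as a search problem rather than enumerated exhaustively.
from itertools import product

precisions = ["int2", "int4", "int8"]   # P = 3
num_layers = 4                          # L = 4

configs = list(product(precisions, repeat=num_layers))
assert len(configs) == len(precisions) ** num_layers  # 3**4 = 81
```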

VII. NAS FOR VANILLA TRANSFORMERS
This section reviews Vanilla Transformer-based NAS methods for language and speech applications. While summarizing the various methods, we highlight the following three components: (i) search space design, (ii) search algorithm, and (iii) model performance of the searched network. Vanilla Transformers are used in numerous language modeling and speech tasks. Sequence-to-Sequence (seq2seq) Machine Translation converts a text sequence in one language (e.g., English) to another (e.g., French). The commonly used datasets for machine translation are WMT 2014 English-German (En-De), WMT 2014 English-French (En-Fr), WMT 2014 English-Czech (En-Cs) [126], and the One Billion Words Benchmark (LM1B) [127]. BLEU (BiLingual Evaluation Understudy) [128] is a metric for automatically evaluating machine-translated text; the BLEU score is a number between zero and one that measures the similarity of the machine-translated text to a set of high-quality reference translations. Text Classification trains a model to assign a label to input text as per its content. Natural Language Inference determines whether a given hypothesis text is an entailment (true), a contradiction (false), or neutral with respect to a premise. Text-to-Speech (TTS), as the name suggests, synthesizes natural speech audio from input text. Automatic Speech Recognition (ASR) processes human speech and extracts useful information in the form of text.

A. EVOLVED TRANSFORMER
Evolved Transformer [89] is one of the early methods for efficient Transformer search, intended for sequence-to-sequence tasks. It adopts a cell-based search space, inspired by NASNet [103], using Convolution and Self-attention elements as primitive operations to leverage the strengths of both layers. The search space comprises two stackable cells, as shown in Fig. 20, one each for the encoder and decoder units. The cell topology is repeated ''number of cells'' times to form the overall architecture, where each cell consists of multiple blocks: six unique blocks per cell for the encoder and eight for the decoder. Each block receives two hidden states as input and applies a transformation to each input individually, referred to as the right and left branches. The Evolved Transformer method builds a large search space focusing on many primitives in the network, as shown in Fig. 20. The authors devised Progressive Dynamic Hurdles (PDH) within the evolution-based architecture search [134]: it dynamically allocates more computational resources to networks with higher fitness values and allows them to train for more epochs than models with lower fitness values. The Evolved Transformer [89] finds better feedforward configurations by alternating between Convolution and Self-attention modules. The searched architecture exhibits improvement over the original Transformer on the WMT 2014 En-De, WMT 2014 En-Fr, WMT 2014 En-Cs, and One Billion Word Language Model Benchmark datasets, matching the performance of the Vanilla Transformer with only 78% of its parameters. Nevertheless, the search process is computationally very expensive, requiring the training and evaluation of several architectures in the search space.

B. PRIMER
Primer [135] introduces the Multi-DConv-Head Attention (MDHA) unit by adding a 3 × 1 Depthwise Convolution after the Q, K, and V projections in the Self-attention heads, as shown in Fig. 21. It also squares the ReLU output activations in the feed-forward module. The search space includes basic operations, such as add, multiply, and cosine, and key Spatial and Depthwise Convolution parameters like kernel and filter size. Primer's search strategy is similar to the Evolutionary search used in Evolved Transformer [89]; it evolves close to 25k distinct architectures by mutating one network at a time through inserting, deleting, and swapping operations, or randomly changing operation and parameter values. The search algorithm's constraint is finding models that offer low validation loss for a fixed training budget (24 TPUv2 hours). The search is performed on the One Billion Words Benchmark, and the searched Primer architecture matches the performance of the Vanilla Transformer with 4.2x less compute time on the language modeling task.

C. DARTSFORMER
DARTSformer [136] first motivates the problem of directly applying DARTS [99] to the Transformer search space: the memory consumption of the Transformer Supernetwork increases with the hidden size and runs out even for small values. Therefore, DARTSformer combines the DARTS methodology with Reversible Networks [137] to search without running out of memory. The idea is to reconstruct the input of a reversible network layer from its output during backpropagation, hence requiring only the output of the last layer to be stored. This relieves the memory burden on the Transformer Supernetwork and allows including operations with larger hidden sizes and more candidate choices. The searched network for the Machine Translation task consistently performs better than the Vanilla Transformer and is comparable to the large Evolved Transformer, with a significant reduction in search cost.

E. TEXTNAS
TextNAS [141] proposed a novel multi-path search space using a mix of Convolution (filter sizes of 1, 3, 5, and 7), Pooling (max and average), Recurrent (bi-directional GRU), and Self-attention layers. The macro search space can be depicted as a Directed Acyclic Graph (DAG), as shown in Fig. 22a. Each layer in the DAG can choose any of the previously mentioned operations to form the searched DAG, as depicted in Fig. 22b. The authors of TextNAS directly utilize Efficient Neural Architecture Search (ENAS) [97] to find better Transformer models for Text Classification and Natural Language Inference tasks. ENAS is an enhanced version of RL-NAS in which the common weight tensors of all the predicted child models in the same iteration are shared to avoid individual training and save computation time. The network is searched on the Stanford Sentiment Treebank dataset [142], and the performance is measured on eight Text Classification datasets [143]. The searched Transformer, consisting of Convolution and Attention operations, outperforms the previous Convolution-only networks searched using DARTS [99], ENAS [97], etc., by a significant margin of at least 0.5 points.

F. KNAS
KNAS [144] can be dubbed Green Neural Architecture Search, meaning that candidate networks in the search process are evaluated without training the actual architecture. The authors use gradients as a coarse-grained proxy to extract essential features from randomly-initialized models, and these key features are directly used to evaluate the selected network. The Text Classification search space comprises 12 encoder and 12 decoder layers, which share the same hyperparameters. The searched encoder-decoder network outperforms the baseline RoBERTa-large [65] on two text classification datasets.

G. SEMINAS
SemiNAS [145] is a semi-supervised search method that leverages an accuracy predictor to evaluate networks at a low cost. The accuracy predictor is initially trained on a small set of network-accuracy pairs and then used to estimate the accuracy of unseen architectures. The predicted architecture-accuracy pairs are added to the predictor's training dataset to enhance the surrogate model's performance. The end-to-end accuracy predictor framework comprises an LSTM-based encoder-decoder network and an MLP-based predictor. The backbone for the Text-to-Speech application is a multi-layer encoder-decoder network with the following three operations in the search space: a standard Convolution layer with kernel sizes {1, 5, 9, 13, 17, 21, 25}, Transformer blocks with {2, 4, 8} heads in MHSA, and an LSTM operation. SemiNAS explores a Self-attention-based architecture on the LJSpeech dataset [146] in four GPU days, and the searched model achieves a better Intelligibility Rate (%) and Diagonal Focus Rate (%) than the manually designed Transformer TTS network [147], by 9% and 4%, respectively.
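The predictor-in-the-loop idea can be sketched with a toy surrogate: a linear least-squares model stands in for SemiNAS's LSTM-based predictor, and the architecture encodings and accuracies below are synthetic:

```python
# Toy sketch of the semi-supervised predictor loop: fit a cheap
# surrogate on a few (architecture, accuracy) pairs, pseudo-label
# unseen architectures, and fold those pairs back into the training set.
import numpy as np

rng = np.random.default_rng(0)
true_acc = lambda x: 0.5 + 0.1 * x[0] - 0.05 * x[1]  # hidden ground truth

# Small labeled set: architectures actually trained and evaluated.
X = rng.uniform(0, 1, size=(8, 2))
y = np.array([true_acc(x) for x in X])

def fit(X, y):
    """Least-squares surrogate: acc ~ [X, 1] @ w."""
    X_aug = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w

def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# Semi-supervised step: pseudo-label unseen architectures and refit.
w = fit(X, y)
X_unseen = rng.uniform(0, 1, size=(50, 2))
y_pseudo = predict(w, X_unseen)
w2 = fit(np.vstack([X, X_unseen]), np.concatenate([y, y_pseudo]))

# The refit predictor still ranks a clearly better arch above a worse one.
good, bad = np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]])
assert predict(w2, good)[0] > predict(w2, bad)[0]
```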

H. AUTOTRANSFORMER
AutoTransformer [148] first designs a customized search space for the Time Series Classification (TSC) task and applies the Gradient-based Differentiable Architecture Sampler (GDAS) [149], an improved DARTS method, to search for an efficient Transformer. The search space incorporates several structures and operations within the Transformer backbone that can extract global and local features from the time series. Fig. 23 illustrates the overall search space of AutoTransformer, where an operation choice is selected from the following pool of elements: 1D standard and dilated Convolutions of kernel sizes {1, 3, 5, 7, 9, 11, 13, 17, 21}, max and average pooling, LSTM [3], and MHSA. The solid black line indicates fixed data flow, red dotted lines are the candidate choices for input, and blue lines represent the choices for residual input. The input to a given layer or residual connection within a layer can be chosen from any of its predecessor layers, including the original input. For example, layer 3 can independently choose any tensor from the set {original input, layer 1 output, layer 2 output}. This type of multi-layer connection is crucial, as initial layers capture low-level information and deep layers capture high-level information in the time series input. AutoTransformer utilizes GDAS [149] to sample and train only a sub-graph of the Supernetwork in each training iteration to reduce search computation time. The searched AutoTransformer model achieves SOTA results on the UCR TSC Archive dataset [150], thereby outperforming the Vanilla Transformer.

I. EVOLVED SPEECH-TRANSFORMER
Evolved Speech-Transformer [151] is a straightforward extension of Evolved Transformer [89] to the Automatic Speech Recognition (ASR) task, utilizing the same primitive elements, cell-level search space (Fig. 20), and PDH-based Evolutionary search algorithm. However, the search method is modified so that only models with fewer than three million parameters are considered in the trimmed search space, improving search time. Evolved Speech-Transformer is searched on Librispeech [152], a dataset of 80 hours of read English speech, and Zeroth [153], a dataset for Korean speech recognition. The searched model outperforms the baseline Speech-Transformer [2] in terms of Word Error Rate (WER %) with considerably fewer parameters. The search time is lower than that of Evolved Transformer, as the trimmed search space includes only around 150 models below three million parameters.

J. DARTS-CONFORMER
DARTS-Conformer [90], as the name suggests, is a DARTS-based search method on the Conformer network [154]. The Conformer is a modified version of the Vanilla Transformer, formed by replacing the final FFN layer with a half-step FFN [155]. Experiments show that the searched DARTS-Conformer model performs better than the Conformer by at least 4.7% in terms of Character Error Rate (CER).

K. IMPROVED CONFORMER
Improved Conformer [156] is a parallel work to DARTS-Conformer using the same Conformer backbone, but with more primitive elements and larger sizes for each operation in the search space. The set includes the following: (i) half-step FFN: inner dimension of {1024, 512, 256}; (ii) Convolution: Skip, Spatial, and Dilated Convolution, each of kernel sizes {7, 11, 15}; and (iii) MHSA: number of heads from the pool of {4, 8, 16}. Improved Conformer also proposes a Dynamic Search Schedule method to alleviate the start-up problem caused by DARTS. Comparing the two directly, Improved Conformer achieves a CER of 7.5% with 28.89M parameters, while DARTS-Conformer attains a CER of 6.3% with only 26.68M parameters. Hence, DARTS-Conformer is the stronger model.

L. BM-NAS
Bilevel Multimodal Neural Architecture Search (BM-NAS) [157] makes the architecture of multimodal fusion models fully searchable via a bilevel searching scheme. Fig. 25 depicts the DAG of the multimodal fusion network search space in BM-NAS, where the cells {Cell(1), Cell(2), Cell(3)} receive inputs either from a previous cell or from two pre-trained unimodal networks A and B. The bilevel scheme has two levels of search: (i) the high-level search space finds inter-modal/intra-modal features via the pre-trained unimodal backbones, and (ii) the low-level searchable-components space includes the primitive operations (Self-attention, Zero, activation functions) at Step 1 and Step 2, along with the connectivity of different operations inside each cell. BM-NAS is adjustable to multimodal tasks of various scales, as the number of cells and steps are chosen as hyperparameters. The search process is similar to DARTS [99]: a Supernetwork with all possible operations and connectivity is constructed, and the search is guided by three sets of architectural parameters {α, β, γ}, corresponding to cell connectivity among {Cell(1), Cell(2), Cell(3)}, connectivity inside each cell, and primitive operation search (Step 1 and Step 2), respectively. The same bilevel optimization as in DARTS is used to train network weights and architectural parameters. The results on three multimodal tasks show that the BM-NAS framework improves over baseline multimodal NAS methods, with less search time and fewer parameters in the searched model.
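The DARTS-style continuous relaxation that BM-NAS builds on can be illustrated in a few lines; this is a hedged NumPy sketch with toy stand-in operations, not the actual BM-NAS primitives:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixed_op(x, alphas, ops):
    """DARTS-style relaxation: the edge output is the softmax(alpha)-weighted
    sum of every candidate operation's output, so the architecture parameters
    alphas can be trained by gradient descent alongside the network weights."""
    w = softmax(alphas)
    return sum(wi * op(x) for wi, op in zip(w, ops))

# Toy candidates standing in for the searchable primitives (assumed names):
ops = [lambda x: x,                 # identity / skip
       lambda x: 0.0 * x,           # Zero op
       lambda x: np.maximum(x, 0)]  # activation (ReLU)

x = np.array([-1.0, 2.0])
alphas = np.zeros(3)                # uniform weights -> simple average
print(mixed_op(x, alphas, ops))
```

After training, the operation with the largest alpha on each edge is kept and the rest are discarded, yielding a discrete architecture.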

VIII. NAS FOR BERT MODELS
Large pretrained models such as BERT have shown their capabilities on several NLP tasks. BERT is essentially composed of only the encoder of the Transformer. This section reviews search methods specific to BERT models on tasks such as Question Answering, Sentence Similarity, and Relation Classification (RC). The Stanford Question Answering Dataset (SQuAD) [158] is a widely used comprehension dataset of question-answer pairs, consisting of queries on a set of Wikipedia articles for Question Answering tasks. General Language Understanding Evaluation (GLUE) [159] is an extensive collection of training and validation datasets for multiple NLP applications covering Single-Sentence, Similarity and Paraphrase, and Inference tasks.

A. ADABERT
AdaBERT [121] leverages differentiable NAS to compress a pre-trained BERT model into smaller ones in a task-dependent manner, unlike previous works, which compress the network in an application-independent manner. AdaBERT adopts a cell-based architecture in the search process and integrates task-specific knowledge distillation to impart hints about the application. The search algorithm also incorporates an efficiency-aware loss (Eq. 16) to fetch models with better efficiency and performance. Since task-specific redundant parameters in the network are further pruned, the searched AdaBERT models are 11.5x to 17.0x smaller and 12.7x to 29.3x faster at inference than task-agnostic compressed models on GLUE benchmark datasets.

B. NAS-BERT
NAS-BERT [122] is a task-independent, NAS-based compression strategy to automatically compress a pre-trained BERT model. NAS-BERT training is performed on a standard self-supervised pre-training task and does not rely on a specific downstream application. The search space is a chain-like structure of MHSA, FFN, and Separable Convolution units; Table 7 lists the NAS-BERT search space and each element's dimensions. The Separable Convolution in the search space is a sequential cascade of two Depthwise Separable Convolution operations (Depthwise + Pointwise Convolution). NAS-BERT operates on block-wise search [160], where the whole Supernetwork is divided into multiple blocks of a few layers, and each block is trained under the supervision of the corresponding block of a pre-trained teacher model. The block-wise scheme reduces the effective search space and, combined with a performance approximation policy, improves search efficiency. On the GLUE and SQuAD [158] datasets, the NAS-BERT-searched model outperforms the manually designed BERT model and previously searched networks such as AdaBERT in terms of both model performance and number of parameters.

C. AUTOBERT-ZERO
AutoBERT-Zero [161] deviates from the conventional design of a Transformer encoder (Fig. 26a) by introducing an inter-layer search space to find an optimal sequence of Self-attention and Convolution operations in a chain-like structure (e.g., Fig. 26c) to learn better local dependency. It also departs from the traditional design of a Self-attention module (Fig. 26b) by proposing an intra-layer search space of the most fundamental math operations to find an optimal interconnection of primitive operations (e.g., Fig. 26d). The pool of Self-attention primitive operations in the intra-layer search space can be divided into the following two types:

D. MAGIC
MitigAtinG InterferenCe (MAGIC) [162], as the name indicates, proposes two methods, MAGIC-T and MAGIC-A, to alleviate the interference problem caused by sampling different child models during the search process. The authors first investigate the reasons for this behavior and conclude that interference in a shared operation strongly correlates with the number of differing operations in the child models. This is because the shared operation receives different gradient directions from sampled child networks with different topologies, even on the same set of training samples. The phenomenon is particularly severe when heterogeneous architectures, such as Convolution and Self-attention, exist in the search space. MAGIC-T minimizes interference by sampling a child network that differs from the previously sampled child by only one change in the architecture. In MAGIC-A, the neural architecture with the best validation accuracy among all child networks is selected as an anchor model to align the child models' inputs and outputs. The proposed methods are analyzed on Single-path NAS [163].

E. LIGHTHUBERT
LightHuBERT [164] is a Once-for-all BERT search and compression framework that finds desirable architectures by automatically pruning the model parameters of a large pre-trained network. Specifically, a Transformer-based Supernetwork is constructed to host several weight-sharing subnetworks, followed by two-stage distillation training using HuBERT [165] as the teacher model. The search elements are the Embedding dimension, number of heads, FFN ratio, and model depth, as detailed in Table 8. The Once-for-all student Transformer is the network with the maximum dimensions of the given search space and is trained using the pre-training distillation loss function. The student Transformer's trained weights are used in the second stage, where subnetworks are randomly sampled at each forward pass during Supernetwork training. Many sub-transformers are then sampled and evaluated without fine-tuning, and the network that achieves the best performance is selected.

IX. NAS FOR VISION TRANSFORMERS
This section reviews prominent NAS methods for Vision Transformers on several vision applications, such as Image Classification and Object Detection. Top-1 and top-5 accuracies are widely used model performance metrics for Image Classification: top-1 accuracy indicates the percentage of test samples correctly predicted by the network, and top-5 accuracy signifies the percentage of samples whose true label lies in the network's top-5 predictions. While most methods target Image Classification on the ImageNet dataset, we also summarize a couple of ViTs for other applications. We classify a network as a ViT if it has at least one Self-attention layer and is used for a Computer Vision application.
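These two metrics are straightforward to compute from a network's logits; a small NumPy sketch with toy values:

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of samples whose true label is among the k highest logits.
    k=1 gives top-1 accuracy; k=5 gives top-5 accuracy."""
    topk = np.argsort(logits, axis=1)[:, -k:]        # indices of k largest
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

logits = np.array([[0.1, 0.9, 0.0],    # predicted class 1
                   [0.8, 0.1, 0.1],    # predicted class 0
                   [0.2, 0.3, 0.5]])   # predicted class 2
labels = np.array([1, 2, 2])
print(topk_accuracy(logits, labels, k=1))  # 2 of 3 correct
print(topk_accuracy(logits, labels, k=2))
```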

A. AUTOFORMER
AutoFormer [88] is a simple yet powerful technique that combines One-shot NAS and Evolutionary search to find efficient Vision Transformer architectures. The large AutoFormer search space (Table 3) considers key dimensions of a Transformer, such as the number of heads, Embedding dimension, Q/K/V dimension, MLP ratio, and network depth. Traditional weight-sharing/One-shot NAS methods [99] decouple the weights of different operations in the search space into different paths, as shown in Fig. 27a. This requires forward propagation through all the distinct paths in the Supernetwork and updating all individual weights in backpropagation. In contrast, the Weight Entanglement proposed in AutoFormer forces all candidates to share weights in the form of a superweight of the highest dimension, as illustrated in Fig. 27b, similar to OFA [117]. Hence, the memory footprint is significantly reduced, and the Supernetwork requires only one forward and backward propagation through a single large weight matrix. After training the Supernetwork, several subnetworks are obtained, and an Evolutionary search is performed to pick the optimal model, with the objective of maximizing accuracy while minimizing network size. The searched AutoFormer-tiny/small/base architectures attain 74.7%/81.7%/82.4% accuracy on the ImageNet dataset with 5.7M/22.9M/53.7M model parameters, respectively. The searched models outperform manually designed Transformers (ViT [5] and DeiT [9]) and the automatically searched CNN model EfficientNet [70] in terms of accuracy, number of parameters, and FLOPs.
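Weight Entanglement can be pictured as every candidate dimension reusing a slice of one shared superweight; a minimal NumPy sketch, where the candidate dimensions and random values are illustrative, not AutoFormer's actual table:

```python
import numpy as np

# One superweight of the largest searched dimension; every candidate
# Embedding dimension reuses its top-left slice.
EMBED_CHOICES = [192, 320, 448]            # candidate dims (assumed)
W_super = np.random.default_rng(0).normal(size=(448, 448))

def subnet_weight(embed_dim):
    """Weight entanglement: a subnetwork's projection matrix is a slice of
    the shared superweight, not an independently stored path."""
    return W_super[:embed_dim, :embed_dim]

small, large = subnet_weight(192), subnet_weight(448)
# The smaller candidate's weights are literally a view into the larger one,
# so updating W_super updates every candidate at once.
assert np.shares_memory(small, W_super)
print(small.shape, large.shape)
```

This is why only one large weight matrix needs to be stored and trained, rather than one set of weights per candidate path.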

B. VITAS
Vision Transformer Architecture Search (ViTAS) [92] is a search method built on ensemble contributions to the search space, the ViT token structure, and the training methodology. The search space backbone is a pure Self-attention network, an extension of the Twins [87] and DeiT [9] Transformers. The primitive search elements include the main Vision Transformer dimensions, such as the number of heads, patch size, output dimension of MHSA and FFN, and network depth. A detailed explanation of the Twins and DeiT search spaces is provided in Section IV-D1 (Tables 4 and 5). In the original ViT, a class token of size p (equal to the patch size) is appended to the patch tokens before they are passed to the Transformer encoder. ViTAS instead introduces a private class token for each patch size p to avoid interaction between class tokens. The authors also develop a cyclic weight-sharing method for token embedding to ensure better channel balance and an even contribution toward all candidate architecture choices. ViTAS uses an Evolutionary search algorithm to find the optimal Transformer architecture at a given target budget, with weak augmentation and regularization for steadier training. The best-searched model achieved 84.0% accuracy on ImageNet on the Twins Transformer search space, outperforming the manually designed Swin [6], Twins [87], and DeiT [9] Transformers.

C. TRAINING-FREE TRANSFORMER ARCHITECTURE SEARCH (TF-TAS)
Zero-cost NAS proxies [170] are substitutes for accuracy prediction that speed up the search process. TF-TAS [171] introduces a zero-cost proxy, the DSS-indicator, to evaluate several ViT architectures at low cost. The indicator is based on two theoretical components: the synaptic diversity of the MHSA modules and the synaptic saliency of the MLP modules in the network. As the input propagates through the depth of a ViT, outputs gradually become the same and their rank converges to one; synaptic diversity estimates the degree of this rank collapse in an MHSA module. Synaptic saliency measures the amount of the most important weights/parameters in an MLP unit. On the AutoFormer search space of pure Self-attention modules, TF-TAS attains comparable performance while accelerating the NAS process from 24 GPU days to less than 0.5 GPU days.
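The saliency component can be sketched as the familiar |w · ∂L/∂w| score; in this minimal sketch the gradients are supplied as stand-ins rather than computed by backpropagation, and the exact aggregation in the DSS-indicator may differ:

```python
import numpy as np

def synaptic_saliency(weights, grads):
    """Zero-cost saliency score sum|w * dL/dw|: parameters that are both
    large and gradient-sensitive contribute most.  A real proxy obtains
    grads from a single backward pass on one minibatch."""
    return float(np.abs(weights * grads).sum())

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64))        # toy MLP weight matrix
g = rng.normal(size=(64, 64))        # stand-in gradients (assumed, not computed)
print(synaptic_saliency(w, g))
```

Candidate architectures are then ranked by such scores at initialization, with no training at all, which is where the speed-up comes from.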

D. VIT-RESNAS
ViT-ResNAS [118] first manually designs a residual spatial reduction (RSR) module to reduce the sequence length and increase the Embedding dimension for deeper layers. The search space is formed by adding an RSR unit at each stage of the ViT Transformer encoder, hence the name ViT-ResNAS. The residual spatial reduction unit reshapes a 2D feature map into a sequence using a multi-branch connection of Average pooling and Norm-Convolution operations. At each stage of ViT-ResNAS, the search algorithm finds the Embedding dimension and the number of Transformer blocks; the search elements for each block include the number of MHSA heads and the FFN hidden size. Although the backbone network contains a Convolution operation (in the RSR module), the search space of ViT-ResNAS cannot be directly placed in the hybrid Convolution-Attention category, as it does not search for the Convolution operation or its dimensions. The second step in the search process involves two sub-steps: One-shot Supernetwork training with multi-architecture sampling, followed by Evolutionary search. The ViT-Res Supernetwork is enlarged to contain all possible combinations in the search space. In each Supernetwork iteration, multiple subnetworks are randomly sampled and trained on a batch of training samples with only one forward and backward propagation, decreasing the Supernetwork training complexity. The Evolutionary search then evaluates subnetworks using the Supernetwork weights without further fine-tuning, and the best network found is trained to convergence. Experiments on ImageNet show that the searched ViT-ResNAS model outperforms DeiT and ViT in terms of accuracy and MACs.

E. AS-VIT
Scaling up CNNs or ViTs is the process of designing a small base network under a limited resource budget and progressively increasing its size in terms of channel width, network depth, hidden dimension, or number of attention modules to obtain better accuracy. For example, the ResNet [1]/EfficientNet [70] families scale from ResNet-18/EfficientNet-B0 to ResNet-200/EfficientNet-B7 to attain better accuracy. As-ViT [172] is a NAS framework to automatically search for and scale up ViTs, without training, in a well-defined manner. While EfficientNet used NAS only to search for a base CNN network that was then scaled up manually, As-ViT proposes NAS for both searching the ''seed'' network and scaling up the base ViT automatically. The search space consists of four stages, and the searchable elements (kernel size, attention splits, and FFN expansion choice) are listed in Table 9. An RL-NAS method is utilized to find the seed/base network, where the best-performing networks are iteratively sampled from the search space. The controller is updated based on the fast-to-estimate Length Distortion parameter (L_E) and the Neural Tangent Kernel condition number (κ) instead of ViT accuracy, thereby avoiding full training of the sampled network. The seed network is then scaled up in the training-free As-ViT search process using the same parameters to estimate the scaled-up network's importance. The process starts by taking the seed model as the initial network, with only one attention block per stage and a hidden size of 32. The search process recomputes L_E and κ after adding each choice (width, depth, and expansion ratio) to the base network, and the network with the best L_E and κ ranking is chosen as the best model in the current iteration. This process is repeated until the desired number of parameters in the scaled-up network is attained.
The end-to-end process of searching the seed network and automatically scaling it up for a specific parameter count requires only 12 V100 GPU hours (5 for seed search and 7 for scaling up). As-ViT delivers strong top-1 accuracy (83.5%) on ImageNet for Image Classification and mAP (52.7%) on the COCO dataset for Object Detection, outperforming several manually designed and automatically searched ViT architectures.
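The greedy, training-free scale-up loop can be sketched as follows; the parameter model and the score function below are toy stand-ins of ours for the paper's L_E and κ measurements, kept only to show the control flow:

```python
def param_count(net):
    """Toy parameter model: each (depth, width) stage contributes depth*width^2."""
    return sum(d * w * w for d, w in net)

def score(net):
    """Stand-in for a training-free figure of merit (the paper ranks
    candidates by L_E and kappa; any cheap score fits the sketch)."""
    return sum(d + w / 64 for d, w in net)

def scale_up(seed, budget):
    """As-ViT-style greedy growth: try each enlargement, keep the one with
    the best training-free score, stop when the parameter budget is hit."""
    net = list(seed)
    while True:
        candidates = []
        for i in range(len(net)):
            d, w = net[i]
            for change in [(d + 1, w), (d, w + 32)]:   # deepen or widen stage i
                trial = net[:i] + [change] + net[i + 1:]
                if param_count(trial) <= budget:
                    candidates.append(trial)
        if not candidates:          # every enlargement would exceed the budget
            return net
        net = max(candidates, key=score)

seed = [(1, 32)] * 4        # one block per stage, hidden size 32 (as in text)
final = scale_up(seed, budget=200_000)
print(final, param_count(final))
```

Each iteration evaluates only cheap scores on candidate enlargements, so the whole loop runs without a single training step.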

F. SEARCHING FOR SEARCH SPACE (S3-NAS)
S3-NAS [173], as the name implies, automates both the search space and the search process, unlike previous methods whose search spaces are manually constructed. The overall ViT search space, which includes the Embedding dimension, network depth, MLP expansion ratio, number of heads, window size, and QKV FC size, is decomposed into several dimensions. The search space progressively evolves along each dimension, guided by a measurement called the E-T Error. The search on the evolved space is a two-step process: first training a Supernetwork without any constraints, similar to AutoFormer, and then applying Evolutionary search, similar to SPOS [163], with accuracy, model size, and FLOPs as constraints. The best S3-NAS-searched network achieves 84.7% accuracy on ImageNet and outperforms ViT [5], Swin [6], and DeiT [9] in terms of model performance, size, and FLOPs. The searched network also performs well when transferred to other tasks, such as Object Detection, Semantic Segmentation, and Visual Question Answering.

G. NASFORMER
NASformer [174] first proposes a structural change to the Vision Transformer by introducing dilated window-based Self-attention, inspired by Dilated Convolution [175], to fetch hierarchical feature representations in a parameter-free manner. The windowing approach of the Swin Transformer [6] can exchange information only among neighboring windows, whereas the dilated window is designed to cover tokens as distant as possible rather than only local ones. The backbone network for the search is a pure Self-attention-based ViT with the dilated window technique. The end-to-end network is divided into stages, where the searchable parameters are the number of blocks and the Embedding dimension; the block-level search space includes the window size, number of MHSA heads, and QKV dimensions. The search process is similar to Once-for-all [117]: a Transformer Supernetwork with maximum dimensions is trained, followed by an Evolutionary search to find an optimal submodel. The best NASformer-searched model attains 83.9% accuracy on ImageNet, performing better than ViT [5], Swin [6], Twins [87], and DeiT [9].

H. GLOBAL LOCAL IMAGE TRANSFORMER (GLIT)
GLiT [91] is the first work to apply NAS principles to a hybrid Attention-Convolution search space for vision applications. GLiT designs a new module called the Global-local (GL) block, a combination of (i) Self-attention, to model global dependencies, and (ii) a Convolution bundle, to capture local representations. Similar to the standard ViT, where Self-attention modules are stacked to form the final architecture, the GL blocks are stacked ''M'' times to form the end-to-end GLiT network, as shown in Fig. 28. The Self-attention (SA) units and Convolution bundles operate in parallel, equivalent to the SAs in a conventional MHSA module, and the total number of SA units (G) and Convolution bundles (L) in any given layer cannot exceed four. The search task then boils down to finding the optimal combination of SA units and Convolution bundles in each layer from the following pool of (G, L) choices: {(0,3), (1,2), (2,1), (3,0)}. The total search space is summarized in Table 10. The two-level search, i.e., searching for the optimal number of SA units and Convolution bundles (G, L) and for their respective hyperparameters (k, e, d_k, d_z), is decomposed using a hierarchical search method with a separate Evolutionary process for each level. GLiT-Small and GLiT-Base achieve up to 80.5% and 82.3% accuracy, respectively, outperforming the baselines DeiT [9] and ViT [5] with a slight increase in the number of parameters but a reduction in FLOPs.

I. HIGH-RESOLUTION NAS (HR-NAS)
HR-NAS [176] aims to search for a model that takes a high-resolution image as input, using a multi-branch search space incorporating both Convolution and Self-attention components. The search space, inspired by HRNet [177], contains multi-scale features and global contexts at the same time while preserving a high-resolution representation throughout the model. Each branch in the HR-NAS Supernetwork is a chain of searchable Convolution and attention blocks at a given resolution. The same search space can be used for several dense prediction tasks of different granularities, such as Semantic Segmentation and Instance Segmentation, as each branch is specialized for a specific feature resolution. The authors design a lightweight Transformer, shown in Fig. 29, whose computational complexity is dynamically adjustable for different budget constraints. The Transformer module consists of an encoder and a decoder, similar to the Vanilla Transformer. In the projector unit, the input feature map is concatenated with a 2D positional map instead of a sinusoidal encoding; the feature map is then transformed into a set of n tokens and fed to the MHSA unit for Self-attention.
The Supernetwork is a multi-branch model, where each branch is composed of a series of searchable blocks, and each block can take either a MixConv [178] or a lightweight Transformer unit. The proposed resource-aware search process, a blend of a DARTS-like method and a progressive shrinking strategy, explores optimal combinations at each stage of the network. The Supernetwork jointly learns the network weights and the architectural parameters (α) that guide the search, while the search algorithm progressively discards Convolution channels and Transformer queries from the Supernetwork. As a result, HR-NAS [176] can search for networks for various tasks, such as Image Classification, Semantic Segmentation, Human Pose Estimation, and 3D Detection, and the searched models perform better than several CNN-based searched models such as ProxylessNAS and FBNet.

J. BOSSNAS
Block-wisely self-supervised Neural Architecture Search (BossNAS) [179] is an unsupervised NAS method on a hybrid Convolution-Transformer (HyTra) search space. Unsupervised NAS (UnNAS) methods [180] search for an efficient network without using any human-annotated labels, relying only on the input data (e.g., images). Block-wise weight-sharing methods divide the end-to-end network into blocks to optimize individually, thereby reducing the weight-sharing search space. The authors propose ensemble bootstrapping, a self-supervised scheme to optimize each block, in which each sampled network is trained to predict the probability ensemble of all sampled models. The search space consists of two fundamental units: the Residual bottleneck (ResConv) from ResNet [1], and ResAtt, a modified BoTBlock module [181] formed by replacing the relative position encoding branch with a lightweight Depthwise Separable Convolution. The searched BossNAS model on the hybrid search space attains 82.5% top-1 accuracy on ImageNet, exceeding EfficientNet [70] by 2.4%.
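The bootstrapping target can be sketched as an averaged softmax over the sampled sub-models; a minimal NumPy illustration in which the logits are toy values of ours:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_target(sampled_logits):
    """Ensemble-bootstrapping sketch: the training target for each sampled
    sub-model is the averaged probability output of all sampled models, so
    no human-annotated labels are needed."""
    probs = np.stack([softmax(l) for l in sampled_logits])
    return probs.mean(axis=0)

# Three hypothetical sub-models' logits for a batch of two inputs:
logits = [np.array([[1.0, 0.0], [0.0, 2.0]]),
          np.array([[0.5, 0.5], [0.0, 1.0]]),
          np.array([[2.0, 0.0], [1.0, 1.0]])]
target = ensemble_target(logits)
print(target)   # each row is a valid probability distribution
```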

K. NASVIT
NASViT [93] is a One-shot NAS method for efficient hybrid Convolution-Attention network search. Its main contribution lies in resolving an issue in Supernetwork training: the gradients of different subnetworks within a Supernetwork conflict with each other more strongly in ViTs than in CNNs, leading to early saturation and slow convergence. NASViT overcomes this with a projected-gradient algorithm that identifies and removes the component of a subnetwork's gradient that causes the conflict. A switchable channel-wise scaling is also added to the Transformer layer (Fig. 30b) to reduce overlap among different Self-attention blocks and further lessen gradient conflict. Finally, a weak data augmentation method reduces training regularization and optimization difficulty. The search space backbone is a sequential interconnection of Mobile Inverted Residual blocks (MBConv) and MHSA units (Self-attention + MLP). While the positions of the MBConv units (Fig. 30a) and Transformer blocks (Fig. 30b) are fixed, their hyperparameters, such as depth and block dimensions, are searched, as summarized in Table 6 and explained in detail in Section IV-D2.
The experiments on several vision tasks such as Image Classification and Semantic Segmentation show that the searched NASViT models outperform several hand-designed CNNs such as ResNet50 in terms of accuracy and FLOP count.

L. BURGERFORMER
Yang et al. propose BurgerFormer [182], a novel search space built on three hierarchy levels: Micro, Meso, and Macro. As shown in Fig. 31, the backbone network is divided into four stages, and each stage consists of a varying number of Transformer-like blocks. Each stage is a series of primitive operations of the form Norm-Op-Norm-Act searched from the search space, where Op comprises Convolution and Self-attention operations. The Micro, Meso, and Macro levels correspond to the granularity of the operation type, the Transformer-style block structure, and the composition of each stage, respectively. The micro (low) level includes operations (Op), norm, and activation functions, while the meso (intermediate) level finds the optimal interconnection or combination of these three elements. At the macro (high) level, each stage's depth, channel width, and expansion ratio between the two 1 × 1 Convolutions are searched as a whole, as shown in Table 11. Note that the definitions of Micro and Macro used in BurgerFormer differ from the traditional definitions given in the search space section (Section IV-C). BurgerFormer utilizes Single-path One-shot search [163] and Sandwich sampling [183] to train the Supernetwork, and Evolutionary search to evaluate subnetworks. The best BurgerFormer-searched model achieves 83% accuracy on ImageNet, outperforming AutoFormer, GLiT, and ViTAS.

M. UNINET
UniNet [184] is an RL-based NAS method that jointly searches the optimal combination of Convolution, Self-attention, and MLP layers, along with their depth and channel dimensions, on a macro search space. It considers these three operations to leverage the advantages of each type at different stages of the network: Convolutions capture local dependency, MHSA extracts global and long-range representations, and MLPs offer high utilization and efficiency. The MLP layer refers to an MLP-Mixer [185] style module that captures spatial features. The backbone of UniNet is a stack of five stages, where each stage consists of General Operator (GOP) and Down-Sampling Module (DSM) units, as depicted in Fig. 32a. The authors design three DSM candidates: (i) Local-DSM (L-DSM) with only 2D Convolution, (ii) Local-Global-DSM (LG-DSM) with MHSA and 2D Convolution, and (iii) Global-DSM (G-DSM) with only MHSA and a 1D Convolution. Each stage of the initial network is initialized with a certain repeat number (r), Convolution channel size (c), and initial expansion ratio (e), which are progressively increased or decreased during the search process using the choices shown in Fig. 32a.
The search algorithm jointly optimizes model accuracy and FLOPs by formulating the reward as r(m) = a(m) × [f(m)/t]^α, where a(m) and f(m) are the predicted accuracy and FLOPs of model m, respectively, t is the target FLOPs, and α is the weight factor balancing accuracy and FLOP cost. Even though the network is searched directly on the large-scale ImageNet dataset, a proxy setting is implemented by training each sampled architecture for only five epochs; the validation accuracy for the reward is calculated on the partially trained network. Fig. 32b shows the base searched network with its dimensions, which is scaled up using EfficientNet's [70] compound scaling method. Experiments on Image Classification, Object Detection, and Semantic Segmentation show that the searched model outperforms manually designed CNNs and Self-attention-based ViTs.
FIGURE 32. UniNet search space [184].
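Assuming the MnasNet-style form suggested by the symbols in the text (the default α below is MnasNet's value, an assumption of ours), the reward can be sketched as:

```python
def uninet_reward(acc, flops, target_flops, alpha=-0.07):
    """Multi-objective reward r(m) = a(m) * (f(m)/t)**alpha: a negative alpha
    penalizes models whose FLOPs exceed the target t and mildly rewards
    cheaper ones, trading accuracy against compute cost."""
    return acc * (flops / target_flops) ** alpha

# A model exactly at the target budget is scored by its accuracy alone;
# a model at twice the budget is scored strictly lower.
print(uninet_reward(0.80, 1e9, 1e9))
print(uninet_reward(0.80, 2e9, 1e9))
```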

N. SPVIT
Single-Path Vision Transformer pruning (SPViT) [186] is a One-shot method to automatically compress a pre-trained ViT model. SPViT prunes the costly MHSA into a lightweight operation and searches for MLP expansion ratios under different performance constraints. The main contribution of this work lies in a novel weight-sharing technique between Convolution and MHSA that encodes all candidate choices into a single MHSA layer instead of a multi-path space, as shown in Fig. 33. The output of a Convolution can be derived by indexing the intermediate results of the MHSA, which eases the computational burden of the search and reduces the number of trainable parameters. Learnable binary gates (g_|p|, g_|p|-1, ..., g_1) are introduced to encode the candidate choices and are jointly trained along with the model parameters to find the configuration of each layer. The gates determine the significance of each FFN dimension and prune the trivial ones. Automatic pruning experiments on the DeiT model [9] show that 56.2% of the FLOPs can be removed with only a 0.3% loss in accuracy on ImageNet.
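The gate mechanism for FFN pruning can be sketched as follows; the gate values and the 0.5 threshold below are illustrative assumptions, not SPViT's learned values:

```python
import numpy as np

def prune_ffn(hidden, gates, threshold=0.5):
    """Gate-based pruning sketch: a learnable scalar g_i scores each FFN
    hidden dimension; dimensions whose sigmoid-activated gate falls below
    the threshold are dropped from the layer."""
    keep = 1.0 / (1.0 + np.exp(-gates)) >= threshold   # sigmoid(g) >= 0.5
    return hidden[:, keep], keep

hidden = np.arange(12.0).reshape(2, 6)       # batch of 2, FFN width 6
gates = np.array([2.0, -1.0, 0.5, -3.0, 1.0, -0.2])    # toy learned gates
pruned, mask = prune_ffn(hidden, gates)
print(pruned.shape)   # width shrinks from 6 to the surviving dimensions
```

In training, the gates are relaxed to be differentiable so they can be optimized jointly with the network weights before the hard thresholding is applied.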

O. AUTO-REGRESSIVE NAS
Zhou et al. [187] propose a fully Self-attention-based, self-supervised search method with context auto-regression. The layer-wise search space consists of five stages with fixed stride, input, and output channels, and each stage searches for either Local Multi-head Self-attention (Local-MHSA) [188] or Non-local Self-attention [189]. The search elements of Local-MHSA are the spatial extent {3, 5, 7} and the number of heads {4, 8}. The backbone network and search space are detailed in Fig. 34. The search is formulated as differentiable/One-shot NAS: a Supernetwork is constructed with all possible combinations of local Self-attention modules and trained in a bi-level fashion to find a good network. The search phase uses a self-supervised method with context auto-regression to guide the architecture search, followed by a fine-tuning phase on the target task. The best model attains 73.3% top-1 accuracy on ImageNet with 4M parameters, better than DeiT [9].

P. VTCAS
Vision Transformer with Convolutions Architecture Search (VTCAS) [190] is a One-shot NAS method based on Progressive DARTS (P-DARTS) [191] on a hybrid search space. The DARTS method [99] first finds a base cell architecture on a small-scale proxy dataset such as CIFAR-10 and transfers the same searched topology to a large-scale dataset such as ImageNet by increasing the network depth. This straightforward transfer of the searched architecture to large-scale data causes a performance gap as the overall network depth increases. P-DARTS resolves this issue by simultaneously dropping candidate choices from the search space and progressively increasing the network depth over a series of steps, unlike DARTS, which drops all candidate choices of the Supernetwork in a single training step on the proxy task. The initial Supernetwork consists of three Transformer-Convolution Mixture blocks, inspired by the Swin Transformer [6], with seven candidate choices on each node. The depth of the mixture block is gradually increased to four and five over three successive steps, as shown in Fig. 35, while the number of candidate operations is reduced to five, three, and one in those steps. The searched VTCAS network achieves 82.0% accuracy on ImageNet and 50.4% mAP on the COCO dataset, surpassing many manually designed CNN and Transformer architectures.
Q. αNAS
αNAS [192] is an Evolutionary search-based approach in which a novel property-guided synthesis procedure directs network mutations. The authors develop a principled way of overcoming the limitations of prior Evolutionary search algorithms, which apply stochastic mutations to neural architectures to produce new models; each mutated network is trained and validated before being inserted into the population that seeds future mutations. In contrast, αNAS considers a significantly larger search space and makes high-quality changes in each mutation by inferring from a set of program properties. A random submodule is selected from a large network, a stochastic mutation is applied to the properties of that submodule, and a new submodule satisfying the mutated properties is synthesized and substituted in place of the old one. αNAS is applied to the original ViT-based search space and attains better accuracy than ViT with 30% fewer FLOPs and parameters.

R. FBNETV5
FBNetV5 [193] is a framework that searches for architectures on diverse Computer Vision tasks, such as Image Classification, Semantic Segmentation, and Object Detection, in a single run. Specifically, the authors first designed a flexible and easily transferable search space, followed by a search process that adapts to different tasks by not adhering to a single task's training pipeline. The search method is applied to a Transformer-style model only for the Semantic Segmentation application. FBNetV5 proposed Lite MaskFormer, a modified version of MaskFormer [194], as the backbone network in the search process. The results on the ADE20K dataset [195] show that the searched network outperforms manually designed architectures such as ResNet-50 [1] and the Swin Transformer [6].

S. T2IGAN
Generative Adversarial Networks (GANs) [196] are unsupervised Neural Networks that automatically discover input data patterns to generate new, realistic natural images. A typical GAN has a Convolution-based Generator and a Discriminator network playing against each other. Transformers are being used to replace the Convolution operations in the traditional Generator and Discriminator for better performance [197]. T2IGAN [198] is the first method to apply NAS principles to efficient Transformer-based GAN architecture design for the Text-to-Image (T2I) task, which generates a realistic image from an input text description. It uses a cell-based search space and finds an efficient topology using an RL-based search strategy over a pool of primitive elements comprising Convolution and lightweight Transformer MHSA operations. The results on the CUB-200 Birds [199], Oxford-102 Flowers [200], and COCO [201] datasets show that the searched model surpasses many hand-designed GAN models.

T. TRANSFORMER-BASED NAS PREDICTOR (TNASP)
Previously, we discussed how different NAS methodologies were developed to search for efficient Transformer models for different applications. Now, we explore the opposite context, i.e., employing Transformers to aid NAS methods in the performance estimation of CNN models. TNASP [202], a Transformer-based NAS predictor, utilizes MHSA to map discrete architectures to a relevant feature representation and applies a linear transformation of the Laplacian matrix as the positional encoding to strengthen the representation of topology information. The method consists of an encoding phase that encodes the operations and positions in a DAG into a continuous representation, followed by three Transformer encoder (MHSA) layers and a regressor that obtains the final prediction from the output features of the Transformer layers. The performance estimator learns the relation between a network in the search space and its validation accuracy, which is then used to predict the accuracy of an unseen network in the same search space. The advantages of using Self-attention-style modules, as listed by the authors, are as follows: (i) MHSA can extract better features from structured graph data, and (ii) owing to its modeling of global dependencies, MHSA can enhance the encoding capability across distant positions. TNASP performs better than previous encoding methods on the NAS-Bench-101 [104], NAS-Bench-201 [203], and DARTS [99] search spaces.
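The Laplacian-based positional encoding can be illustrated with a small sketch. This is a simplified stand-in, assuming the plain L = D − A Laplacian on the DAG's out-degrees; the linear projection below is a hypothetical, randomly initialized substitute for TNASP's learned transformation.

```python
import random

def laplacian(adj):
    """L = D - A for an adjacency matrix given as a list of rows."""
    n = len(adj)
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        degree = sum(adj[i])  # out-degree of node i
        for j in range(n):
            lap[i][j] = (degree if i == j else 0.0) - adj[i][j]
    return lap

def positional_encoding(adj, dim, seed=0):
    """Project each node's Laplacian row into a dim-sized position feature."""
    rng = random.Random(seed)
    n = len(adj)
    lap = laplacian(adj)
    # Hypothetical learned projection W (n -> dim), simulated with random weights.
    w = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
    return [[sum(lap[i][k] * w[k][d] for k in range(n)) for d in range(dim)]
            for i in range(n)]

# Tiny 3-node architecture DAG: node 0 -> node 1 -> node 2
adj = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
enc = positional_encoding(adj, dim=4)
```

Each row of `enc` would then be added to the corresponding operation embedding before the Transformer encoder layers.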
Vision Transformer NAS Summary. A detailed overview of all Vision Transformer NAS methods on the ImageNet dataset is provided in Table 12. Having reviewed several ViT-based NAS strategies, we observe that the search space plays a role as important as the search method in finding efficient Transformer models. The majority of NAS methods draw inspiration from methods developed for CNNs, such as One-shot NAS and Evolutionary search, adapted to Transformers. However, the search elements and backbone search spaces differ across most of the ViT NAS methods reviewed in this paper. A given NAS method may perform better on one search space and may not transfer well to other search spaces. For a fair comparison between two search algorithms, the search space should be identical, including the total number of search choices. The early Reinforcement Learning and Evolutionary NAS algorithms are expensive for model search, as every network sampled by the controller needs to be trained end-to-end until convergence. Also, Supernetwork-based techniques require significant GPU memory to load and train all the operations. A few methods, such as As-ViT [172] and UniNet [184], use approximation or proxy-task settings that exploit properties of the Vision Transformer, thereby making the search faster. Hence, in the future, priority should be given to zero-cost proxy methods for efficient network design to make NAS more approachable.

X. HARDWARE-AWARE NAS FOR TRANSFORMERS
In the previous sections, we extensively discussed NAS methods pertaining to Vanilla Transformers, BERT models, and Vision Transformers without considering a target hardware platform. This section reviews several Hardware-aware multi-objective Transformer NAS methods for efficient inference on several devices.

A. HARDWARE-AWARE TRANSFORMERS (HAT)
HAT [14] is one of the early methods for Hardware-aware Transformer search, targeting NLP tasks on a Raspberry Pi ARM CPU, an Intel Xeon CPU, and an Nvidia TITAN Xp GPU. The search space of HAT breaks from the traditional Vanilla Transformer in the following ways: (i) Arbitrary Encoder-Decoder Connections: the Transformer is searched such that a decoder block can draw connections from one or multiple encoder layers, unlike the Vanilla Transformer, where only the final encoder output is connected to the decoder; and (ii) Heterogeneous Transformer Layers: the encoder/decoder is more elastic in the sense that each encoder or decoder layer can have different dimensions (key dimension, number of heads). Traditional CNN-based Hardware-aware NAS methods, such as ProxylessNAS [101] and FBNet [106], train a Supernetwork specific to the target hardware platform and sample an optimal architecture. In contrast, HAT trains only one SuperTransformer (Supernetwork) and derives different networks suitable for different devices (MCU, CPU, GPU). HAT performs an Evolutionary search under target latency constraints on the trained SuperTransformer, keeping only the models with latency smaller than the target constraint. The HAT-searched model for the Raspberry Pi 4 on the WMT'14 translation task attains a 3× speedup with 3.7× fewer parameters over the Vanilla Transformer. The key takeaway from HAT is that a model specialized for one hardware platform may not be optimal for other devices. For example, Table 13 shows that the model searched for the GPU runs faster on the GPU than on the CPU and vice versa at the same BLEU score.
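The latency-constrained sampling step can be sketched as follows. The search-space values and the analytical latency proxy below are invented for illustration; HAT itself predicts or measures real device latency rather than using a formula like this.

```python
import random

# Hypothetical SubTransformer dimensions drawn from a trained SuperTransformer.
SEARCH_SPACE = {
    "encoder_layers": [4, 5, 6],
    "decoder_layers": [3, 4, 5, 6],
    "embed_dim": [512, 640],
    "num_heads": [4, 8],
}

def predicted_latency_ms(cfg):
    # Made-up analytical latency proxy, NOT a real device measurement.
    return 0.8 * (cfg["encoder_layers"] + cfg["decoder_layers"]) * cfg["embed_dim"] / 100

def sample_population(size, latency_budget_ms, seed=0):
    """Build an evolutionary-search population of device-feasible configs."""
    rng = random.Random(seed)
    population = []
    while len(population) < size:
        cfg = {k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
        # Keep only models whose latency fits the target device's budget.
        if predicted_latency_ms(cfg) <= latency_budget_ms:
            population.append(cfg)
    return population

pop = sample_population(size=8, latency_budget_ms=60.0)
```

Changing `latency_budget_ms` per device is what lets one SuperTransformer yield different specialized subnetworks for an MCU, a CPU, or a GPU.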

B. SHIFTADDNAS
Multiplication-free Neural Networks boost hardware performance by replacing expensive matrix multiplications, such as those in Convolutions or Transformers, with addition-only operations in AdderNet [204] or a mix of bitwise shifts and additions in ShiftAddNet [205]. However, merely replacing the matrix multiplications with multiplication-free operations can hurt model performance despite the gain in hardware performance. Therefore, ShiftAddNAS [206] proposed to search for a hybrid network of multiplication-based (Vanilla Convolution and Self-attention) and multiplication-free (Shift and Add) operators by incorporating both kinds of blocks in the search space. Table 14 summarizes the search space for the NLP task, consisting of the block type and key Self-attention dimensions. The vision task search space is given in Table 15, consisting of the type of block to be searched and the depth of each block.
The Convolution weights follow a Gaussian distribution, whereas the weight parameters in the multiplication-free addition operation follow a Laplacian distribution. Hence, naively including both types of operations leads to inconsistent architecture ranking in the One-shot NAS algorithm. Therefore, the authors developed a novel search method with improved weight-sharing capability by including a KL-divergence loss in addition to the traditional cross-entropy loss function. The searched ShiftAddNAS models on the WMT'14 En-Fr and WMT'14 En-De NLP datasets outperform the baseline Vanilla Transformer, Evolved Transformer, and HAT networks in terms of latency, energy, and BLEU score on the Eyeriss accelerator [207]. Also, the searched ShiftAddNAS network performs better than ResNet-50 and other searched ViTs on ImageNet.
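A minimal sketch of such a combined objective is shown below: alongside the usual cross-entropy, a KL-divergence term aligns the output distribution of a sampled subnetwork with that of a reference subnetwork, so that heterogeneous operators share weights more consistently. The `lam` weighting and the logits are illustrative choices, not values from the paper.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exp = [math.exp(x - m) for x in logits]
    s = sum(exp)
    return [e / s for e in exp]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def kl_divergence(p_logits, q_logits):
    p, q = softmax(p_logits), softmax(q_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def shared_weight_loss(sub_logits, ref_logits, label, lam=0.5):
    """Cross-entropy on the label plus a KL term tying subnet to reference."""
    return cross_entropy(sub_logits, label) + lam * kl_divergence(sub_logits, ref_logits)

loss = shared_weight_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9], label=0)
```

The KL term vanishes when the two subnetworks already agree, so it only penalizes ranking-distorting disagreement between operator types.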

C. LIGHTWEIGHT ITERATIVE NAS (LINAS)
Lightweight Iterative NAS [208] is designed to accelerate the subnetwork search phase after Supernetwork training, as the validation component in One-shot NAS methods comes with a huge computational cost, especially on large datasets. The main contribution of LINAS is a generalizable framework that supports various search methods and model performance predictors in a multi-objective setting across multimodal environments. LINAS also showed that Evolutionary search algorithms and Sequential Model-based Optimization (SMBO) methods pair well with One-shot/weight-sharing search spaces. In a nutshell, the validation phase of LINAS is summarized as follows:
1) Randomly sample subnetworks as the initial population
2) Validate the subnetworks to fetch accuracy
3) Train the predictors with the validated networks
4) Run the search algorithm with the trained predictors
5) Pick only the best subnetworks from the updated pool
6) Validate a few subnetworks and retrain the predictor
7) Repeat steps 2 to 6 until LINAS finds the best network
The Supernetwork used for evaluating the proposed method in LINAS is inspired by HAT [14]. Finally, various performance predictors, such as Ridge Regression, Support Vector Machine regression, and stacked regression, are compared with one another, showing the effectiveness of LINAS.
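The seven steps above can be condensed into a toy loop. Everything here is an illustrative stand-in: architectures are encoded as single scalars, the "true" accuracy function replaces expensive validation, and a 1-nearest-neighbour lookup replaces the paper's trained predictors.

```python
import random

def true_accuracy(cfg):
    # Hypothetical expensive validation: peaks at cfg = 0.3.
    return 1.0 - abs(cfg - 0.3)

def predict(cfg, validated):
    # Tiny 1-NN surrogate "trained" on the validated (config, accuracy) pairs.
    return min(validated.items(), key=lambda kv: abs(kv[0] - cfg))[1]

def linas(iterations=5, seed=0):
    rng = random.Random(seed)
    validated = {rng.random(): 0.0 for _ in range(4)}     # 1) random population
    validated = {c: true_accuracy(c) for c in validated}  # 2) validate
    for _ in range(iterations):
        # 3-4) run the search with the (retrained) predictor: mutate the pool
        candidates = [min(max(c + rng.gauss(0, 0.1), 0), 1) for c in validated]
        # 5) keep only the most promising candidate per the predictor
        best = max(candidates, key=lambda c: predict(c, validated))
        # 6) validate a few subnetworks; the enlarged dict retrains the predictor
        validated[best] = true_accuracy(best)
    return max(validated.values())

acc = linas()
```

Only step 6 pays the real validation cost each round; the predictor absorbs the rest, which is the source of LINAS's speedup.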

D. LIGHTSPEECH
LightSpeech [209] is an HW-NAS framework for lightweight Transformer search targeting Text-to-Speech models. The search space is adopted from FastSpeech [210], with four feed-forward Transformer (FFT) blocks in both the encoder and decoder. As shown in Fig. 36, each FFT unit is made up of MHSA and a Separable Convolution (SepConv) operation [211]. The search candidates for LightSpeech are {2, 4, 8} for the number of heads in MHSA and {1, 5, 9, 13, 17, 21, 25} for the kernel size in SepConv. The search process relies on the Gradient Boosting Decision Tree (GBDT) method [212], where an accuracy predictor is trained on pre-collected samples of architecture-accuracy pairs. The searched Transformer comprises an encoder with SepConv kernel sizes 5, 13, and 25, and a decoder with kernel sizes 9, 13, and 21. The evaluation results on an Intel Xeon CPU E5-2690 v4 show that the searched network is 6.5× faster than the baseline FastSpeech model [210] while preserving voice quality metrics.

E. LITETRANSFORMERSEARCH
LiteTransformerSearch (LTS) [213] performs a zero-cost proxy, or training-free, Evolutionary search directly on constrained hardware. The authors empirically established a high correlation between final model validation accuracy and the number of parameters in the decoder layers (the non-embedding parameter count), thereby utilizing the decoder parameter count as a strong substitute in the search process. Also, 200 randomly sampled architectures from the search space are trained on the WikiText-103 and LM1B datasets to verify this correlation. LTS obtains real latency values by measuring directly on the target hardware during the search process instead of using a look-up table or a latency predictor. The search space can be expanded without constraints, as the search method does not rely on any Supernetwork. It supports all the internal dimensions of a decoder in a heterogeneous manner, i.e., different hyperparameters for different layers in the network. The primitive element set comprises the number of decoder layers, the number of heads, the output dimension of the decoder, the inner dimension of the FFN, the embedding dimension, and the division factor in adaptive input embedding. The search is performed over the entire Pareto-frontier, spanning a wide range of the latency and accuracy spectrum. At each search iteration, a set of models is sampled from the current Pareto line and fed to the Evolutionary search algorithm to predict a new set of networks. Accuracy is estimated through the proxy metric, and latency is measured on the target hardware. The Pareto-frontier is then updated based on the new set of neural architectures. This iterative process is repeated until an optimal model is found at each latency. The search experiments on ARM CPUs, an Intel Core i7, and an Nvidia TITAN Xp GPU show that the searched model is twice as fast as the Vanilla Transformer with comparable accuracy.
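The Pareto-frontier bookkeeping described above can be sketched in a few lines: each candidate carries a measured latency and a training-free proxy score (LTS uses decoder parameter count), and a model survives only if no other model is both faster and higher-scoring. The numeric values below are made up for illustration.

```python
def pareto_front(models):
    """models: list of (latency_ms, proxy_score); lower latency and higher
    proxy score are better. Returns the non-dominated set, sorted by latency."""
    front = []
    for lat, score in models:
        # A model is dominated if some other model is at least as fast AND
        # at least as accurate (and differs in at least one coordinate).
        dominated = any(l <= lat and s >= score and (l, s) != (lat, score)
                        for l, s in models)
        if not dominated:
            front.append((lat, score))
    return sorted(front)

candidates = [(10, 0.60), (12, 0.72), (15, 0.71), (20, 0.80), (25, 0.79)]
front = pareto_front(candidates)
```

Here (15, 0.71) and (25, 0.79) drop out because (12, 0.72) and (20, 0.80) dominate them; the survivors seed the next round of evolutionary mutations.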

F. SUPERSHAPER
The standard BERT-base model has a fixed hidden size of 768 across all layers of the Transformer encoder (Fig. 37a). In contrast, the task of SuperShaper [214] is to find the optimal hidden dimensions of each encoder layer and sublayer separately, which is the only component in its search space. For example, Fig. 37b shows a searched BERT model, where the green, blue, and yellow colors indicate the height and width of the searched Add & Norm, Feed Forward, and MHSA modules, respectively. The intuition behind choosing this parameter is that the shape of each layer contributes uniquely to model performance. Also, for a given parameter count that meets a device's requirements, the shape is a free variable that can be optimized for higher accuracy, as latency on CPUs and GPUs is insensitive to the shape of the network. The authors adopt an Evolutionary search algorithm to find the optimal shape of each layer that meets accuracy, parameter count, and latency constraints on 1080Ti and K80 GPUs and a server-class single-core Xeon CPU. The SuperShaper Supernetwork is trained irrespective of a downstream task and fine-tuned on a specific application. This Supernetwork training and subnetwork evaluation process enables specialization across tasks, hardware devices, and model sizes. The models searched using the SuperShaper method outperform several SOTA BERT compression techniques on the GLUE benchmark.

G. COMPILER-AWARE BERT
Niu et al. [215] proposed a compiler-aware architecture search framework to find a BERT network that achieves good accuracy and is latency-friendly on mobile CPUs and GPUs. The search process is identical to the RL-NAS method, where a controller predicts a BERT model with different hyperparameters. The predicted BERT is fine-tuned on downstream tasks to evaluate validation accuracy, while the compiler code generation phase returns the latency of the predicted model on the target device. The controller is updated based on the combined accuracy and latency metric to predict the next best model. The compiler-aware searched BERT model is 5.2× faster on the CPU and 4.1× faster on the GPU than BERT-base with only 0.5-2% accuracy loss. It also achieves a 7.8× speedup over the TFLite compiler [216].
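One common way to fold accuracy and latency into a single controller reward is multiplicative shaping. The form below (borrowed from MnasNet-style hardware-aware NAS) and its `beta` exponent are illustrative assumptions; [215] specifies only that accuracy and latency are combined into one metric.

```python
def combined_reward(accuracy, latency_ms, target_ms, beta=-0.07):
    """Reward = accuracy scaled by how far latency deviates from the target.
    beta < 0 penalizes models slower than the target and mildly rewards
    faster ones, keeping the controller near the latency budget."""
    return accuracy * (latency_ms / target_ms) ** beta

fast = combined_reward(0.80, latency_ms=50, target_ms=100)
slow = combined_reward(0.80, latency_ms=200, target_ms=100)
```

With equal accuracy, the faster model earns the larger reward, steering the controller toward latency-friendly BERT variants.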

H. AUTOTINYBERT
AutoTinyBERT [217] is a method to search for efficient hyperparameters of pre-trained language models under different latency constraints. The process is similar to the Once-for-all search, where a large Supernetwork of maximum dimensions is trained in a One-shot manner, followed by an Evolutionary search for the optimal architecture under a specific latency constraint. The Supernetwork is a proxy for all submodels, where sampled architectures can be evaluated without training from scratch. The searched AutoTinyBERT outperforms NAS-BERT in terms of latency on a single CPU and GLUE score on the SQuAD dataset.

I. MOBILEBERT-EDGETPU
MobileBERT-EdgeTPU [218] is an efficient search algorithm for several vision and language tasks targeting On-device Deep Learning accelerators such as an EdgeTPU. The authors first designed a Performance Power Evaluation (PPE) service using the EdgeTPU compiler, a cycle-accurate simulator for latency estimation. TuNAS-based weight sharing approach [219] is utilized for vision tasks to search for tiny architectures. However, a multi-trial NAS approach is used for BERT due to the limitations of the weight-sharing method. The search space of MobileBERT-EdgeTPU is similar to MobileBERT [220]. The accuracy of MobileBERT-EdgeTPU on the SQuAD dataset is comparable to the BERT-large model and can be deployed to edge devices.

J. AE-BERT
AE-BERT [221] is an automatic BERT compression framework that selects an optimal subnetwork from a pre-trained BERT model for a given target pruning ratio. The end-to-end process has three steps: (1) randomly generate n pruning strategies to produce n pruned BERT candidates, such that each strategy results in the target pruning ratio; (2) train the n candidates and pick the subnetwork with the highest validation metric; (3) fine-tune the subnetwork chosen in the previous step to obtain the final trained and compressed BERT. Experiments on the GLUE benchmark show that AE-BERT outperforms SOTA manually designed pruning methods on BERT-base. Also, the compressed BERT-base is 1.83× faster on a Xilinx Alveo U200 FPGA compared to an Intel(R) Xeon(R) Gold 5218 (2.30GHz) CPU.
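The three AE-BERT steps above can be condensed into a toy sketch: generate n per-layer pruning strategies that all hit the target overall ratio, score each pruned candidate, and return the best one for fine-tuning. The validation scorer is a made-up proxy (it simply prefers uniform pruning), standing in for actually training and evaluating each candidate.

```python
import random

def random_strategy(num_layers, target_ratio, rng):
    """Per-layer pruning ratios whose mean is (approximately, after clipping)
    the target overall pruning ratio."""
    raw = [rng.random() for _ in range(num_layers)]
    scale = target_ratio * num_layers / sum(raw)
    return [min(r * scale, 1.0) for r in raw]

def fake_validation_score(strategy):
    # Hypothetical proxy for validation accuracy: penalize per-layer variance,
    # i.e., pretend that uniform pruning hurts the model least.
    mean = sum(strategy) / len(strategy)
    return 1.0 - sum((r - mean) ** 2 for r in strategy)

def ae_bert_search(num_layers=12, target_ratio=0.5, n=20, seed=0):
    rng = random.Random(seed)
    # Step 1: n candidate pruning strategies at the target ratio.
    candidates = [random_strategy(num_layers, target_ratio, rng) for _ in range(n)]
    # Step 2: score each candidate; the winner proceeds to step 3 (fine-tuning).
    return max(candidates, key=fake_validation_score)

best = ae_bert_search()
```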

K. AUTODISTILL
AutoDistill [222] generates a task-agnostic BERT model for the target hardware platform by considering several objectives, constraints, design spaces, evaluation metrics, and hardware performance. The framework operates in a loop over a series of three steps: Model Exploration, Flash Distillation, and Evaluation. The Model Exploration phase finds a good model based on the comprehensive set of inputs provided to the framework; the authors use Bayesian Optimization (BO) [223] to perform the multi-objective search for a student architecture by considering accuracy and latency. Flash Distillation quickly distills promising candidate student architectures from the teacher network on the pretraining dataset. Bayesian Optimization combined with Flash Distillation significantly reduces the search cost. The network chosen in the Flash Distillation stage is then evaluated for accuracy and hardware latency in the Evaluation step. The AutoDistill-searched model outperforms several previous works, such as MobileBERT [220] and BERT-base [63], in terms of accuracy on the SQuAD dataset and latency on TPUv4.
L. REAL-TIME STYLE TRANSFER
Benmeziane et al. performed Real-time Style Transfer [224] on a Xiaomi Redmi 10 mobile phone and a Raspberry Pi 3 embedded platform by searching for efficient ViTs. Style Transfer [225], [226] is a vision task of transferring the style of one image onto the content of another image. The ViT model used for Style Transfer can be divided into two modules: (i) a style sub-network that fetches the style from the style image using a Transformer encoder, and (ii) a content sub-network that obtains the content of the content image using a different Transformer encoder and transfers the extracted style using a Transformer decoder. The search space is composed of typical ViT hyperparameters such as the patch size, number of heads, number of FFN layers, and number of encoder/decoder layers. The style sub-module is searched using random search, and the content sub-network is searched using Evolutionary search, where the search optimizes frames per second and model accuracy. The searched ViT is 3.5× and 2.1× faster than the CNN-only model AdaIN [227] and the Self-attention-based network StyTr2 [228], respectively.

XI. FUTURE DIRECTIONS
Transformer NAS research has become an important topic due to its capability of circumventing many hurdles in building a model. There have been nearly 50 research papers on Transformer architecture NAS algorithms in the last two years alone. However, we are still in the early stages, as there are only a few works compared to the vast number of methods for Convolutional Neural Network search. The current solutions are only a first step towards developing more efficient search algorithms for the family of Transformer models. There is still room for improvement in search space design, adaptability to various scenarios, and hardware performance. This section provides an overview of a few limitations and highlights open research challenges.

A. EFFICIENT SEARCH SPACE DESIGN
The primitive search element set and the search space often limit NAS algorithms in finding efficient models for any backbone Neural Network, not just Transformers. While Transformer search algorithms are directly inspired by previously proposed CNN-based search methods, the search space is manually and systematically developed for each search algorithm discussed in this paper. Hence, going forward, manual search space design will continue to play a vital role in exploring out-of-the-box Transformer models.

B. TOPOLOGIES
Transformer-style modules and their variants have been exploited in various classes of Neural Networks, such as U-Net [229], Graph Neural Networks (GNNs) [230], Graph Convolutional Networks (GCNs) [231], 3D Convolutional Neural Networks [232], Point Clouds [233], Spiking Transformer Neural Networks [234], etc. However, very little attention has been given to using NAS for efficient Transformer search within these other topologies. Therefore, expanding the scope of Transformer NAS to other groups of neural architectures is crucial to advancing the field. For example, a hybrid Convolution, Self-attention, and GCN search space followed by an efficient search algorithm could be studied for Node Classification tasks.

C. LOW-COST PERFORMANCE ESTIMATION
The increase in the size and complexity of search spaces calls for mature NAS algorithms that keep the search cost in check without compromising quality. Effective zero-cost proxy methods, such as TF-TAS [171], play an important role in quickly evaluating model performance without expensive training. However, the current training-free methodologies operate on only a small set of search elements and do not consider hybrid Attention-Convolution search spaces. Therefore, more research is needed in this direction to make NAS accessible to a wider set of researchers and engineers.
D. HARDWARE-AWARE NAS ON DIVERSE PLATFORMS
Transformer Hardware-aware NAS has not been extensively examined on diverse platforms. More specifically, Vision Transformer Hardware-aware NAS is still fresh and unexplored on large-scale datasets such as ImageNet. Although the current hardware-agnostic NAS algorithms for Transformer architectures are extremely useful for research purposes, hardware-aware search methods on actual or simulated hardware are critical for the real-time deployment of ViTs. Therefore, Hardware-aware search spaces and enhanced NAS algorithms for several hardware platforms are needed to improve accuracy and performance.

E. AUTOMATED SPARSE AND MIXED PRECISION QUANTIZED TRANSFORMER SEARCH
Neural Network Pruning [110] is an effective technique for optimized hardware inference, as it removes redundant parameters. To exploit the sparse nature of pruned networks, numerous hardware platforms, such as the Nvidia A100 GPU [241], offer explicit support for sparse computation. The present NAS techniques only consider dense MHSA (FC) and Convolution layers while ignoring sparse matrix computing. Hence, future search methods should include sparsity features in the search space for a higher performance gain. On the other hand, Mixed Precision Quantization [113] also plays an important role on top of pruning. Owing to the low-precision capability of several DNN models, accelerators such as Bitfusion [242] boost the performance of varying bit-width multiplication. As far as we know, AQ-BERT [125] is the only method to utilize NAS for bit-width assignment. However, it is a hardware-agnostic method that does not evaluate the searched model on mixed-precision-capable hardware. There is scope to explore Hardware-aware Mixed Precision Quantization search methods, such as HAQ [123], for the Transformer and its family. Therefore, future Transformer NAS algorithms should be aligned in the direction of sparsity and mixed precision to search for models that are smaller in size and faster on hardware platforms.

F. NETWORK-ACCELERATOR CO-SEARCH
Neural Network and Hardware Accelerator Co-search is an important class of multi-objective co-design problems where two independent search algorithms are used to simultaneously find the optimal network parameters and accelerator configuration [40]. The co-design involves iteratively sampling a network-accelerator pair and evaluating validation accuracy and hardware performance metrics to guide the search process and converge on an optimal combination of network and hardware specifications. Many CNN-based co-search methods, such as DIAN [243], have reformulated HW-NAS as a co-search problem. Nonetheless, there are no similar methods in the Transformer domain. Hence, in the future, Transformer model and hardware accelerator co-search can be utilized to achieve a better accuracy-efficiency tradeoff than SOTA CNN-only co-designs.

G. APPLICATIONS AND PURPOSES
NAS and HW-NAS for CNN models have found their way from simple tasks on small datasets to complex applications on gigantic datasets. However, Transformer NAS methods have not been specialized for a wide variety of tasks and datasets, even though manually designed Transformers have been used in diverse tasks such as Super Resolution [244], Semantic Segmentation [245], Medical Image Segmentation [246], etc. Also, algorithmic breakthroughs have allowed search algorithms to be used for other purposes, such as Winograd Convolution search [247], Mixed Precision Quantization search [123], Fault-tolerant network search [248], etc. Therefore, the scope of Transformer NAS should be extended beyond plain architecture search to these additional tasks and purposes.

H. NEURAL ARCHITECTURE SEARCH BENCHMARKS
The computational demand of NAS methods is a hurdle for the development of novel NAS algorithms, as it requires training and evaluating numerous architectures during the search process. Reproducibility is another challenge due to dissimilarities in search spaces and experimental settings, such as learning rate, initialization, etc., which make comparisons between search methods and searched models unfair. As a consequence, several Neural Architecture Search Benchmarks (NAS-Bench) have been released, acting as a look-up table (LUT) or dictionary of models and their corresponding performance metrics. For example, NAS-Bench-101 [104] is the first work in this context, where the dictionary consists of 423k distinct cell-based architectures along with their validation accuracy and training time on the CIFAR-10 dataset. Users of such benchmarks can obtain the accuracy of models in the covered search space by querying the benchmark dictionary, thereby avoiding the expensive training process.
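The query pattern described above can be sketched in a few lines: the benchmark is effectively a dictionary from an architecture encoding to pre-computed metrics, so a search algorithm "evaluates" a model with a cheap lookup instead of training it. The entries and numbers below are invented, not real NAS-Bench-101 values.

```python
# Hypothetical tabular benchmark: architecture encoding -> pre-computed metrics.
BENCH = {
    ("conv3x3", "conv1x1", "maxpool"): {"val_acc": 0.912, "train_sec": 1180},
    ("conv3x3", "conv3x3", "conv1x1"): {"val_acc": 0.934, "train_sec": 1675},
    ("maxpool", "conv1x1", "conv1x1"): {"val_acc": 0.887, "train_sec": 960},
}

def query(arch):
    """Replace an expensive training run with a dictionary lookup."""
    return BENCH[tuple(arch)]["val_acc"]

# A "search" over this space reduces to an argmax over cheap lookups.
best = max(BENCH, key=query)
```

Real benchmarks expose the same idea through an API (with training curves, multiple seeds, etc.), which is what makes NAS experiments reproducible and cheap to rerun.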
Hardware-agnostic benchmarks are equipped only with training- and validation-related metrics and do not provide the measured latency of networks on hardware. Hence, these benchmarks are not helpful to non-hardware researchers in developing novel HW-aware NAS algorithms. There are a few attempts in this direction, such as LatBench [249] and HW-NAS-Bench [250], which provide hardware performance metrics for CNN networks on a wide range of CPUs, GPUs, and ASICs. Thus, moving forward in Transformer NAS research, NAS and HW-NAS benchmarks will play a key role in the development of better search methods. In fact, NAS-Bench-NLP [251], the first NAS benchmark dataset for language tasks, based on RNNs, pointed out that its benchmarked architectures are not as efficient as Transformers. Following the success of NAS benchmarks, many surrogate models (ML models) have evolved, which are trained on a limited set of benchmark entries to predict the performance of unseen networks in the search space. Therefore, Transformer NAS research requires such NAS benchmarks and auxiliary ML models to cover a large search space.

I. INTEGRATING IN NAS FRAMEWORKS
The remarkable development of efficient NAS algorithms over the last few years has led the way for multiple frameworks, such as Auto-Keras [252] and Microsoft Archai [253]. The next logical step is to integrate easy-to-use NAS methods and benchmarks into these NAS toolkits. The current frameworks do not widely support Transformer-based networks, as their success is fairly recent. Therefore, software engineers can work towards incorporating Transformers and Vision Transformers into NAS frameworks.

XII. CONCLUSION
Transformer network design is a challenging problem with important applications across several tasks and hardware platforms. In the last few years, Neural Architecture Search and Hardware-aware NAS methods have significantly contributed to the automatic design of efficient Transformer, BERT, and Vision Transformer models. The automatically searched Transformers outperform many manually designed Transformer architectures in terms of model and hardware performance. In this survey paper, we extensively reviewed recent advances in NAS algorithms specific to the Transformer and its family of architectures. We mainly summarized the search space, search strategy, and performance of the searched Transformers of state-of-the-art NAS methods. A diverse set of methods has been developed over the last two years, with innovations in architecture design and learning methodologies, which we analyzed in separate sections. Although the performance of NAS algorithms has been greatly enhanced, the SOTA methods still have limitations, some of which are outlined in this paper. We hope our effort helps the reader understand the latest in Transformer NAS and ignites interest in developing novel and efficient methods.

ACKNOWLEDGMENT
All opinions, findings, and conclusions expressed are those of the authors.
MURALI EMANI received the Ph.D. degree from the School of Informatics, The University of Edinburgh, U.K., in 2015. He is currently an Assistant Computer Scientist with the Argonne National Laboratory, Data Science Group, Argonne Leadership Computing Facility (ALCF). Prior to this, he was a Postdoctoral Research Staff Member at the Lawrence Livermore National Laboratory. His research interests include scalable machine learning, emerging HPC and AI architectures, and AI for science. He serves as the Co-Chair for MLPerf HPC Group at MLCommons to benchmark large scale ML on HPC systems. He also co-leads the AI Testbed at ALCF to evaluate novel AI accelerators for scientific machine learning applications. He has organized workshops and participated in tutorials that include benchmarking deep learning workloads on emerging hardware, MLPerf-Bench at MLSys'20, MLSys'21, ISPASS'20, ISPASS'21, and ASPLOS'21. He has also co-chaired the MLPerf birds-of-a-feather sessions at SC'19, SC'20, and SC'21.
VENKATRAM VISHWANATH received the Ph.D. degree in computer science from the University of Illinois at Chicago, in 2009. He is currently a Computer Scientist at the Argonne National Laboratory. He is also the Data Science Team Lead at the Argonne Leadership Computing Facility (ALCF). His current focus is on algorithms, system software, and workflows to facilitate data-centric applications on supercomputing systems. His research interests include scientific applications, supercomputing architectures, parallel algorithms and runtimes, scalable analytics, and collaborative workspaces. He has received best papers awards at venues, including HPDC and LDAV, and a Gordon Bell Finalist.