Improving Vision Transformers to Learn Small-Size Dataset From Scratch

This paper proposes various techniques that help Vision Transformer (ViT) learn small-size datasets from scratch successfully. ViT, which applied the transformer structure to the image classification task, has recently outperformed convolutional neural networks. However, the high performance of ViT results from pre-training using large-size datasets, and its dependence on large datasets comes from low locality inductive bias. In addition, conventional ViT cannot effectively attend to the target class due to redundant attention caused by a rather high constant temperature factor. In order to improve the locality inductive bias of ViT, this paper proposes a novel tokenization (Shifted Patch Tokenization: SPT) using shifted patches and a position encoding (CoordConv Position Encoding: CPE) using $1 \times 1$ CoordConv. Also, to improve the poor attention, we propose a new self-attention mechanism (Locality Self-Attention: LSA) based on learnable temperature and self-relation masking. SPT, CPE, and LSA are intuitive techniques, but they successfully improve the performance of ViT even on small-size datasets. We qualitatively show that each technique attends to more important areas and contributes to a flatter loss landscape. Moreover, the proposed techniques are generic add-on modules applicable to various ViT backbones. Our experiments show that, when learning Tiny-ImageNet from scratch, the proposed scheme based on SPT, CPE, and LSA increases the accuracy of ViT backbones by +3.66 on average and up to +5.7. Also, the performance improvements of ViT backbones in ImageNet-1K classification, learning on COCO from scratch, and transfer learning on classification datasets verify that the generalization ability of the proposed method is excellent.


I. INTRODUCTION
Recently, many studies have reported that Vision Transformer (ViT) [1] shows better performance than Convolutional Neural Networks (CNNs) such as ResNet [2] and EfficientNet [3]. This has brought the Transformer a lot of attention in the computer vision field. ViTs rarely use convolutional filters, i.e., the core of CNNs; actually, ViTs employ them only for tokenization. Thus, ViTs structurally have less locality inductive bias than CNNs, so they require a huge amount of training data to obtain an acceptable visual representation [4]. For example, to learn a small-size dataset,
ViT first had to be pre-trained on a large-size dataset such as JFT-300M [5]. However, building a large dataset is very expensive, so this is a difficult requirement in the real world.
In order to alleviate the burden of pre-training, several ViTs that can learn a mid-size dataset such as ImageNet from scratch have been proposed. Such data-efficient ViTs tried to increase the locality inductive bias in terms of network architecture. For instance, some adopted a hierarchical structure like CNNs to leverage various receptive fields [6], [7], [8], and others tried to modify the self-attention mechanism itself [6], [8], [9], [10], [11]. However, learning from scratch on mid-size datasets still requires significant costs. Moreover, if the domain gap between the pre-training dataset and the target dataset is large, transfer learning may not be an effective strategy for learning a small-size dataset [12]. So we need to study ViTs that can learn small-size datasets from scratch.

FIGURE 1. Performance of the proposed method when learning Tiny-ImageNet from scratch. Throughput was measured on a V100 GPU. The stars and dots indicate after and before the proposed method is applied, respectively.
We identify two problems that limit the performance of ViT on small-size datasets. The first problem is the structure of ViT, which is designed to have very little locality inductive bias. ViT divides a given image into non-overlapping patches of the same size and linearly projects each patch to a visual token. So, the tokenization of ViT has the permutation-invariant property, which enables a good embedding of relations between patches [13]. However, these simple linear projections over non-overlapping patches do not sufficiently embed spatial relationships between adjacent pixels into each visual token [14]. In addition, there is no locality inductive bias in the transformer block that captures the relationships between the visual tokens. Moreover, conventional ViT never injects position information except when generating the initial input tokens. This lack of position information makes it difficult for ViT to analyze high-dimensional positional relations between visual tokens and to learn effectively with a small amount of data [1], [15].
The second problem is the poor attention mechanism. The feature dimension of image data is usually much larger than that of signals in other domains such as NLP. In other words, image data with a large feature dimension yields a much larger number of tokens than other domains. So, if the attention mechanism used in NLP is directly applied to such a large number of tokens, the distribution of attention scores becomes over-smoothed. Therefore, ViTs face the problem that they cannot attend locally to important visual tokens.
So, in this paper, we propose several novel solutions that enable ViT to effectively learn small-size datasets from scratch. The first solution is Shifted Patch Tokenization (SPT), which further utilizes spatial relations between neighboring pixels in the tokenization process. The idea of SPT was derived from the Temporal Shift Module (TSM) [16]. TSM is an effective temporal modeling technique that shifts some temporal channels of features.
Inspired by this, we propose effective spatial modeling that tokenizes together patches that are spatially shifted slightly but have the same semantic meaning. This has the effect of increasing the locality inductive bias by embedding more spatial information in each visual token. Second, we propose CoordConv Position Encoding (CPE), which injects position information inside the transformer block. CoordConv, which is an extension of convolution that allows convolutional filters to consider the spatial coordinates of pixels, can integrate a localization prior into the network architecture [17], [18]. Motivated by this property, we propose replacing the linear projections of a transformer block with 1 × 1 CoordConv [19]. Thereby, position information about tokens can be continuously encoded even inside the transformer blocks. Finally, we propose Locality Self-Attention (LSA), which allows ViT to attend locally. LSA mitigates the smoothing of the attention score distribution by excluding self-tokens and applying a learnable temperature to the softmax function. LSA induces attention to work locally by forcing each token to focus more on tokens with a large relation to itself.
Our proposed method can be very easily applied as add-on modules to various ViT backbones and helps ViT learn strong representations. Fig. 1 shows that the proposed method dramatically improves the performance of various ViT backbones when learning a small-size dataset from scratch. In particular, when SPT, CPE, and LSA are applied together, the accuracy of CoAtNet-0 [20] reaches 84.82%, i.e., an increase of +1.41 in the case of CIFAR-100, which is superior to the powerful CNN RegNetY-400MF [21]. Also, the proposed method improves the performance of ViT backbones in ImageNet-1K classification [22], learning on COCO [23] from scratch, and transfer learning on classification datasets because it has good generalization ability (see Sec. V-B and Sec. V-C). For example, in the case of ImageNet-1K, the proposed method increases the accuracy of DeiT-Ti [24] and Swin-T [6] by +3.2 and +1.1, respectively. The main contributions of this paper are as follows:
• To solve the problems that prevent ViTs from learning small datasets from scratch, namely the lack of locality inductive bias and over-smoothed attention, SPT, CPE, and LSA are presented, which flatten loss landscapes. They can be easily applied to various ViT backbones and bring high performance improvement relative to cost.
• We experimentally verify the generalization ability of the proposed method on various datasets, learning strategies, and tasks, and confirm that it is superior to CNNs relative to cost.
This paper is organized as follows: Sec. II reviews various approaches to improving ViT. Sec. III describes the related preliminaries and then describes the formulations and structure of the proposed method in detail. In Sec. IV, the datasets, backbones, and training regime used in the experiments are explained. Sec. V shows quantitatively and qualitatively that the proposed method works well on various datasets and ViT backbones. Finally, Sec. VI concludes this study.

II. RELATED WORKS
Recently, several methods have been proposed to alleviate the dependence of ViT on large-size datasets. The previous ViTs could learn mid-size datasets from scratch. For example, DeiT [24] improved the efficiency of ViTs by employing data augmentations and regularizations, and then realized knowledge distillation by introducing the distillation token concept. T2T [14] proposed a method to tokenize overlapping patches by applying a transformer block. This made it possible to learn local structure information around a token. PiT [7] produced various receptive fields through spatial dimension reduction based on the pooling structure of a convolutional layer. CvT [8] replaced both linear projection and multi-layer perceptron with convolutional layers. Also, like PiT, CvT generated various receptive fields with a convolutional layer. Swin Transformer [6] presented an efficient hierarchical transformer that gradually reduces the number of tokens through patch merging while using attention of local windows. CaiT [9] employed LayerScale, which converges well even in training ViTs with a large depth. In addition, the transformer layer of CaiT is divided into a patch-attention layer and a class-attention layer, which is effective for class embedding. CoAtNet [20] is a hybrid model that properly combines only the strengths of CNN and Transformer. It unified depth-wise convolution and self-attention through relative attention, and effectively improved generalization by stacking convolutional blocks in front of transformer blocks.
Position encoding is important for transformers to capture the sequence ordering of input tokens. Wu et al. [15] proposed a new position encoding called image Relative Position Encoding (iRPE). Since iRPE considers interactions between queries and relative positions within the self-attention mechanism, it accomplished effective position encoding. [25] demonstrated that strong representations can be learned by smoothing the loss landscape of ViT. For example, it was shown that increasing the locality inductive bias of ViT through local attention flattened the loss landscape and improved generalization ability.

Park and Kim
As such, various efforts have been made to resolve ViT's data dependence. However, learning small-size datasets from scratch is still challenging. Therefore, this paper presents several solutions which allow various ViT backbones to effectively learn small-size datasets from scratch by flattening loss landscapes.

III. METHOD
This section describes the three key ideas helping ViT learn small-size datasets: SPT, CPE, and LSA. Also, the optimal structure of each technique is explored. Fig. 2(a) depicts the concept of SPT. First, SPT spatially shifts an input image in several directions and concatenates the shifted images with the input image. The figure shows an example of shifting in four diagonal directions. Next, patch partitioning is applied as in standard ViT. Then, for embedding into visual tokens, three processes are sequentially performed: patch flattening, layer normalization [26], and linear projection. Through this, patches with the same semantic meaning but slightly shifted spatially can be embedded together in visual tokens. As a result, SPT can utilize more spatial information and increase the locality inductive bias of ViT. Fig. 2(b) explains the second idea, i.e., CPE. CPE first flattens additional feature maps with coordinate information, e.g., x- and y-coordinates, concatenates the input visual tokens and the reshaped feature maps, and then applies linear projection. Since CPE linearly projects the coordinate information together, it can realize effective 2D position encoding for visual tokens. Fig. 2(c) describes LSA. In general, a softmax function can control the smoothness of the attention score distribution through temperature scaling [27]. LSA primarily sharpens the distribution of attention scores by learning the temperature parameter of the softmax function. Additionally, the self-token relation is removed by applying so-called diagonal masking, which forcibly suppresses the diagonal components of the similarity matrix computed from Query and Key. This masking relatively increases the attention scores between different tokens, making the distribution of attention scores sharper. As a result, LSA improves ViT by inducing locally-focused attention.

A. PRELIMINARY
Prior to a detailed description of SPT, CPE, and LSA, this section briefly reviews the tokenization, CoordConv, and formulation of the self-attention mechanism of standard ViT.

1) TOKENIZATION
Let $x \in \mathbb{R}^{H \times W \times C}$ be an input image, where $H$, $W$, and $C$ indicate the height, width, and channel of the image, respectively. First, ViT divides the input image into non-overlapping patches and flattens the patches to obtain a sequence of vectors:
$$x \rightarrow \left[ x_p^1;\, x_p^2;\, \cdots;\, x_p^N \right] \tag{1}$$
where $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ represents the $i$-th flattened vector, and $P$ and $N\,(=HW/P^2)$ stand for the patch size and the number of patches, respectively.
Next, ViT performs patch embedding by linearly projecting each vector into the space of the hidden dimension of the transformer encoder. Each patch embedding corresponds to a visual token that is input to the transformer encoder, so this series of processes is called tokenization ($\mathcal{T}$). This is defined by
$$\mathcal{T}(x) = \left[ x_p^1 E_t;\, x_p^2 E_t;\, \cdots;\, x_p^N E_t \right] \tag{2}$$
where $E_t \in \mathbb{R}^{(P^2 \cdot C) \times d}$ is the learnable linear projection for tokens, and $d$ is the hidden dimension of the transformer encoder. This operation is exactly the same as a non-overlapping convolutional layer, and this simple method does not have sufficient locality inductive bias. In Sec. III-B, we propose SPT to leverage rich spatial information by increasing the receptive field of tokenization.
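The tokenization described above can be sketched in a few lines of NumPy. This is a toy illustration only; the function and variable names are ours, not the paper's, and a real implementation would use a framework's strided convolution or unfold operation.

```python
import numpy as np

def tokenize(x, P, E_t):
    """Standard ViT tokenization: split an H x W x C image into
    non-overlapping P x P patches, flatten each patch, and linearly
    project it to a d-dimensional visual token."""
    H, W, C = x.shape
    N = (H // P) * (W // P)                    # number of patches
    patches = x.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(N, P * P * C)
    return patches @ E_t                       # (N, d) visual tokens

# toy example: 8x8 RGB image, patch size 4, hidden dimension 16
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))
E_t = rng.standard_normal((4 * 4 * 3, 16))
tokens = tokenize(x, 4, E_t)
print(tokens.shape)  # (4, 16): N = 64 / 16 = 4 patches
```

Note that each token depends only on the pixels of its own patch, which is exactly why this tokenization carries little locality inductive bias across patch boundaries.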

2) CoordConv
CoordConv [19] injects additional coordinate information into the convolutional layer. The coordinate information is defined as the x- and y-coordinates of each pixel in the feature map, i.e., $c_x \in \mathbb{R}^{H \times W \times 1}$ and $c_y \in \mathbb{R}^{H \times W \times 1}$, and is scaled to the range $[-1, 1]$. The coordinate channels are channel-wise concatenated with the feature map $x \in \mathbb{R}^{H \times W \times C}$ and then input to the convolutional layer. The position information of each pixel is explicitly encoded in the feature map, providing a localization prior to the neural network. Since the increased computational cost amounts to only two additional channels, CoordConv is very computationally efficient. Sec. III-C proposes CPE, which replaces ViT's linear projections with $1 \times 1$ CoordConv. CPE effectively improves the performance of ViT backbones by increasing the locality inductive bias inside the transformer blocks.
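The coordinate-channel construction can be sketched as follows in NumPy (a minimal illustration with names of our choosing, not the paper's code):

```python
import numpy as np

def add_coord_channels(x):
    """Append CoordConv-style coordinate channels to a feature map.
    c_x and c_y hold each pixel's x- and y-coordinate, scaled to [-1, 1],
    so a subsequent convolution sees explicit position information."""
    H, W, C = x.shape
    ys = np.linspace(-1.0, 1.0, H)
    xs = np.linspace(-1.0, 1.0, W)
    c_y, c_x = np.meshgrid(ys, xs, indexing="ij")   # each (H, W)
    return np.concatenate([x, c_x[..., None], c_y[..., None]], axis=-1)

x = np.zeros((4, 6, 3))
out = add_coord_channels(x)
print(out.shape)      # (4, 6, 5): only two extra channels
print(out[0, 0, 3:])  # [-1. -1.]: top-left corner coordinates
```

The cost argument in the text is visible here: whatever the channel count $C$, only two channels are added before the convolution.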

3) ATTENTION MECHANISM
Meanwhile, the self-attention mechanism of standard ViT operates as follows. First, learnable linear projections are applied to each token to obtain Query, Key, and Value. Next, the similarity matrix $R \in \mathbb{R}^{(N+1) \times (N+1)}$, indicating the semantic relations between tokens, is calculated through the dot product of Query and Key. The diagonal components of $R$ represent self-token relations, and the off-diagonal components represent inter-token relations:
$$R = x E_q \left( x E_k \right)^T \tag{3}$$
Here, $E_q \in \mathbb{R}^{d \times d_q}$ and $E_k \in \mathbb{R}^{d \times d_k}$ indicate learnable linear projections for Query and Key, respectively, and $d_q$ and $d_k$ are the dimensions of Query and Key, respectively. Next, $R$ is divided by the square root of the Key dimension, and then the softmax function is applied to obtain the attention score matrix. Finally, the self-attention is calculated as the dot product of the attention score matrix and Value, as in Eq. (4):
$$\mathrm{SA}(x) = \mathrm{softmax}\!\left( \frac{R}{\sqrt{d_k}} \right) x E_v \tag{4}$$
where $E_v \in \mathbb{R}^{d \times d_v}$ is the learnable linear projection for Value. Eq. (4) was designed so that the attention of tokens with large relations becomes large. However, due to the following two causes, the attentions of standard ViT tend to be similar to each other regardless of relations. The first cause is that, since Query ($x E_q$) and Key ($x E_k$) are linearly projected from the same input tokens, token vectors belonging to Query and Key tend to have similar values. Eq. (3) shows that $R$ is the dot product of Query and Key. So, self-token relations, which are dot products of similar vectors, are usually larger than inter-token relations. Therefore, the softmax function of Eq. (4) gives relatively high scores to self-token relations and small scores to inter-token relations. The second cause is that dividing $R$ by $\sqrt{d_k}$ in Eq. (4) acts as a high temperature scaling for softmax. Table 1 shows that attention scores smoothed by a high temperature degrade the performance of ViT; the best performance is achieved when the softmax temperature is less than $\sqrt{d_k}$. In Sec. III-D, we propose LSA to improve the performance of ViT by solving the smoothing problem of the attention score distribution.
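The over-smoothing effect of a high softmax temperature can be seen with a small numerical sketch (toy relation scores of our own choosing, not values from the paper):

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax over a 1D score vector."""
    z = z / tau
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

# toy relation scores: one strong relation plus weaker ones
R = np.array([4.0, 1.0, 0.5, 0.2])
d_k = 64
high = softmax(R, np.sqrt(d_k))  # standard ViT scaling: temperature sqrt(d_k) = 8
low = softmax(R, 1.0)            # lower temperature

print(high.round(3))  # nearly uniform: the distribution is over-smoothed
print(low.round(3))   # clearly peaked on the largest relation
```

With temperature $\sqrt{d_k} = 8$, the score differences are divided away and the attention distribution approaches uniform; with a lower temperature, the same scores yield sharply peaked attention.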

B. SHIFTED PATCH TOKENIZATION 1) FORMULATION
First, an input image is spatially shifted by half the patch size in four diagonal directions, that is, left-up, right-up, left-down, and right-down. In this paper, this shifting strategy is named $\mathcal{S}$ for convenience, and the SPT in all experiments follows $\mathcal{S}$. Of course, various shifting strategies other than $\mathcal{S}$ are available; they are dealt with in Fig. 3 and Tables 2-3. Next, the shifted features are cropped to the same size as the input image and then concatenated with the input. Then, the concatenated features are divided into non-overlapping patches, and the patches are flattened as in Eq. (1). Next, visual tokens are obtained through layer normalization (LN) and linear projection. The whole process is formulated as Eq. (5):
$$\mathcal{T}_{\mathcal{S}}(x) = \mathrm{LN}\!\left( \left[ u_p^1;\, u_p^2;\, \cdots;\, u_p^N \right] \right) E_{\mathcal{S}}, \quad u = \mathrm{Concat}\!\left( x, s^1, \cdots, s^{N_{\mathcal{S}}} \right) \tag{5}$$
Here, $s^i \in \mathbb{R}^{H \times W \times C}$ represents the $i$-th shifted image according to $\mathcal{S}$, $u_p^i \in \mathbb{R}^{P^2 \cdot C \cdot (N_{\mathcal{S}}+1)}$ is the $i$-th flattened patch of the concatenated features, and $E_{\mathcal{S}} \in \mathbb{R}^{(P^2 \cdot C \cdot (N_{\mathcal{S}}+1)) \times d_{\mathcal{S}}}$ indicates a learnable linear projection. Also, $d_{\mathcal{S}}$ represents the hidden dimension of the transformer encoder, and $N_{\mathcal{S}}$ represents the number of images shifted by $\mathcal{S}$. In Eq. (5), an additional tokenization scheme [14], [20] can be applied after the concatenation of the shifted features.
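The shift-crop-concatenate step of SPT can be sketched in NumPy as follows. This is a simplified illustration under our own naming and with zero padding for the uncovered border (the exact padding scheme is an assumption, not taken from the paper):

```python
import numpy as np

def shift_image(x, dy, dx):
    """Shift an image by (dy, dx) pixels, cropping to the input size
    and zero-padding the uncovered border."""
    H, W, C = x.shape
    out = np.zeros_like(x)
    src_y, dst_y = (slice(dy, H), slice(0, H - dy)) if dy >= 0 else (slice(0, H + dy), slice(-dy, H))
    src_x, dst_x = (slice(dx, W), slice(0, W - dx)) if dx >= 0 else (slice(0, W + dx), slice(-dx, W))
    out[dst_y, dst_x] = x[src_y, src_x]
    return out

def spt_features(x, P, r_shift=0.5):
    """SPT input features: concatenate the image with its four
    diagonally shifted copies (strategy S) along the channel axis."""
    s = int(P * r_shift)
    shifts = [(-s, -s), (-s, s), (s, -s), (s, s)]  # four diagonal directions
    return np.concatenate([x] + [shift_image(x, dy, dx) for dy, dx in shifts], axis=-1)

x = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
feat = spt_features(x, P=4)
print(feat.shape)  # (8, 8, 15): C * (N_S + 1) = 3 * 5 channels
```

Patch partitioning, LN, and the linear projection of Eq. (5) would then be applied to `feat` exactly as in standard tokenization, so each patch now sees its own pixels plus the same region under four diagonal shifts.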

2) PATCH EMBEDDING LAYER
This section describes how to use SPT as a patch embedding layer. We concatenate a class token to the visual tokens and then add position encoding. Here, the class token is the token carrying the representation of the entire image, and the position encoding gives position information to the visual tokens. If a class token is not used, only position encoding is added to the output of SPT. Applying SPT to the patch embedding layer is formulated as follows:
$$z_0 = \left[ x_{cls};\, \mathcal{T}_{\mathcal{S}}(x) \right] + E_{pos} \tag{6}$$
where $x_{cls} \in \mathbb{R}^{d_{\mathcal{S}}}$ is the class token and $E_{pos} \in \mathbb{R}^{(N+1) \times d_{\mathcal{S}}}$ is the learnable position encoding. Also, $N$ is the number of tokens embedded in Eq. (5).

3) POOLING LAYER
Tokenization is the process of embedding 3D-tensor features into 2D-matrix features. For example, it embeds $x \in \mathbb{R}^{H \times W \times C}$ into $y = \mathcal{T}(x) \in \mathbb{R}^{N \times d}$. Since $N = HW/P^2$, the spatial size of the 3D feature is reduced by a factor of $P^2$ through the tokenization process. So, if tokenization is used as a pooling layer, the number of visual tokens can be reduced. Therefore, we propose to use SPT as a pooling layer as follows. First, the class token and visual tokens are separated, and the visual tokens in the form of a 2D matrix are reshaped into a 3D tensor with spatial structure, i.e., $\mathcal{R}: \mathbb{R}^{N \times d} \rightarrow \mathbb{R}^{(H/P) \times (W/P) \times d}$. Then, applying the SPT of Eq. (5) embeds new visual tokens with a reduced number of tokens. Finally, the linearly projected class token is concatenated with the embedded visual tokens. If there is no class token, only $\mathcal{R}$ is applied before the SPT. The whole process is formulated as Eq. (7):
$$z' = \left[ x_{cls} E_{cls};\, \mathcal{T}_{\mathcal{S}}\!\left( \mathcal{R}(z) \right) \right] \tag{7}$$
where $z$ denotes the visual tokens, $E_{cls} \in \mathbb{R}^{d \times d_{\mathcal{S}}}$ is a learnable linear projection, and $d_{\mathcal{S}}$ is the hidden dimension of the next stage. As a result, SPT embeds rich spatial information into visual tokens by increasing the receptive field of tokenization by as much as the spatial shift.
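The token-count reduction of tokenization-as-pooling can be sketched as follows. For brevity this sketch uses plain tokenization without the shifted copies, and all names are ours, not the paper's:

```python
import numpy as np

def pool_tokens(z, h, w, P, E):
    """Use tokenization as pooling: reshape N = h*w visual tokens back
    into an (h, w, d) tensor (the map R), re-partition it into P x P
    patches, flatten, and project. Token count shrinks by P^2."""
    d = z.shape[1]
    t = z.reshape(h, w, d)                            # R: 2D matrix -> 3D tensor
    t = t.reshape(h // P, P, w // P, P, d).transpose(0, 2, 1, 3, 4)
    t = t.reshape((h // P) * (w // P), P * P * d)     # flatten P x P neighborhoods
    return t @ E                                      # project to next-stage dim

rng = np.random.default_rng(0)
z = rng.standard_normal((64, 24))                     # 8x8 token grid, d = 24
E = rng.standard_normal((2 * 2 * 24, 48))
print(pool_tokens(z, 8, 8, 2, E).shape)  # (16, 48): token count reduced by P^2 = 4
```

SPT-as-pooling follows the same pattern, with the concatenation of shifted copies inserted between the reshape and the patch partitioning.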

4) CHOICE OF SHIFT STRATEGY
SPT can adopt a wide variety of spatial shift strategies. However, in this paper, only spatial shifts that can be easily defined by a shift direction and a shift ratio are considered. We examined the following three sets of shift directions. The first is the 4 cardinal directions, consisting of the up, down, left, and right directions (Fig. 3(a)). The second is the 4 diagonal directions, including up-left, up-right, down-left, and down-right (Fig. 3(b)). The last is the 8 cardinal directions, including all the preceding directions (Fig. 3(c)). Table 2 shows the top-1 accuracy on small-size datasets such as CIFAR10, CIFAR100, SVHN, and Tiny-ImageNet for each shift direction. This experiment adopted a model applying SPT to standard ViT. The 4 cardinal directions showed the best performance on CIFAR10 and SVHN. On the other hand, the 4 diagonal directions and the 8 cardinal directions provided the best performance on CIFAR100 and Tiny-ImageNet, respectively. This shows that the shift direction is somewhat dependent on the characteristics of the dataset. For example, in CIFAR10 or CIFAR100 the target class tends to be in the center of the image, whereas in the other datasets it does not. The location of the target class has some degree of correlation with the shift direction, and this correlation can affect the performance. However, since the performance difference was experimentally marginal, the shift direction in the experiments of this paper was fixed to the 4 diagonal directions.
Next, we look at various shift ratios. The degree of image shifting in SPT is defined as $\mathrm{SHIFT} = P \times r_{shift}$, where $P$ represents the patch size and $r_{shift}$ represents the shift ratio. Table 3 shows the performance according to the shift ratio for CIFAR100, Tiny-ImageNet, and ImageNet-1K. In this experiment, a model with SPT applied to standard ViT was used, and the 4 diagonal directions were adopted. On CIFAR100 and ImageNet-1K, a ratio of 0.5 was the best, and on Tiny-ImageNet, a ratio of 0.25 was the best. This experimental result shows that the optimal shift ratio also depends on the dataset. Since the relatively most reasonable shift ratio is 0.5 according to our experiments, all the experiments in this paper fixed the shift ratio to 0.5.
Note that more various shifting strategies will be available in addition to the methods considered here. The exploration of optimal shifting strategy according to datasets remains future work.

5) THE POOLING LAYER ALSO NEEDS SHIFTED PATCHES
Patch embedding is the process of converting a large image into a small number of visual tokens, which is very similar to pooling. So, SPT, which improves tokenization, can also be applied to the pooling layer. We experimented to verify that SPT is effective in both the patch embedding and pooling layers, and Swin-T with a hierarchical structure was used for this experiment. Table 4 shows that SPT improves performance not only in the patch embedding layer but also in the pooling layer. In particular, it achieved a significant accuracy increase of +4.91 on TINY. Thus, we apply SPT to all pooling layers of ViT backbones with a hierarchical structure.

C. CoordConv POSITION ENCODING 1) FORMULATION
First, we flatten $c_x$ and $c_y$ so that they match the token-grid shape of the visual tokens $x \in \mathbb{R}^{HW \times d}$. Then, we concatenate $x$, $c_x$, and $c_y$ along the channel dimension and apply a linear projection $E_c \in \mathbb{R}^{(d+2) \times d_c}$, where $d_c$ is the dimension of the output tokens. As in Eq. (7), the class token is linearly projected separately and then concatenated with the output visual tokens.
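CPE can be sketched as a linear projection over tokens with two appended coordinate channels, which is equivalent to a 1 × 1 CoordConv over the token grid. This is a minimal illustration with names and shapes of our choosing:

```python
import numpy as np

def cpe(tokens, H, W, E_c):
    """CoordConv Position Encoding sketch: append flattened x/y
    coordinate channels (scaled to [-1, 1]) to the visual tokens and
    apply a linear projection -- a 1x1 CoordConv over the token grid."""
    ys = np.linspace(-1.0, 1.0, H)
    xs = np.linspace(-1.0, 1.0, W)
    c_y, c_x = np.meshgrid(ys, xs, indexing="ij")
    coords = np.stack([c_x.reshape(-1), c_y.reshape(-1)], axis=1)  # (H*W, 2)
    return np.concatenate([tokens, coords], axis=1) @ E_c

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 32))   # 4x4 token grid, d = 32
E_c = rng.standard_normal((32 + 2, 32))  # (d + 2) x d_c projection
out = cpe(tokens, 4, 4, E_c)
print(out.shape)  # (16, 32)
```

Because the coordinate channels differ per grid position, two identical input tokens at different positions are mapped to different outputs, which is exactly the position information CPE injects inside the block.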

2) WHERE TO APPLY
Linear projection in a transformer block is used for the Query, Key, and Value projections of the multi-head self-attention layer, as well as for the output projection that merges the multi-head attention outputs. The FFN's channel expansion and compression also use linear projections. CPE can be applied at all of these locations and improves the locality inductive bias of the transformer block through additional coordinate information embedding. In this paper, CPE is applied to both multi-head self-attention and the FFN according to the results of Table 5.

3) CPE CAN IMPROVE BOTH MSA AND FFN
As a layer to replace linear projection, CPE can be applied to the linear projections inside the Multi-head Self-Attention (MSA) layer and the linear projections of the FFN (see Table 5).

4) CPE CAN REPLACE APE
CPE explicitly encodes coordinate information into visual tokens. This is similar to the role of Absolute Position Encoding (APE) [28] used in existing ViTs. So, we compared the performance when CPE and APE were applied to ViT and Swin-T. According to Table 6, APE plays a very important role in ViT, which has little locality inductive bias: when APE is removed from standard ViT, performance degrades by −9.95. However, CPE injected locality inductive bias more effectively than APE, improving accuracy by +2.19 and +2.53 on CIFAR100 and Tiny-ImageNet, respectively. On the other hand, quite different from the results on ViT, APE did not play a big role in Swin Transformer, which has sufficient locality inductive bias thanks to its hierarchical structure and local attention. But CPE injected coordinate information more effectively than APE, improving accuracy by +0.64 and +1.19 on CIFAR100 and Tiny-ImageNet, respectively. Therefore, we remove APE and use only CPE.

D. LOCALITY SELF-ATTENTION MECHANISM 1) DIAGONAL MASKING
Diagonal masking gives larger scores to inter-token relations by fundamentally excluding self-token relations from the softmax operation. Specifically, diagonal masking forces the diagonal components of $R$ to $-\infty$:
$$R^M_{i,j} = \begin{cases} R_{i,j}, & i \neq j \\ -\infty, & i = j \end{cases} \tag{8}$$
where $R^M_{i,j}$ indicates each component of the masked similarity matrix. This makes ViT's attention focus more on other tokens rather than attending to its own token.

2) LEARNABLE TEMPERATURE SCALING
The second component of LSA is learnable temperature scaling, which allows ViT to determine the softmax temperature by itself during the learning process. Based on Eq. (4), the LSA with both diagonal masking and learnable temperature scaling applied is defined by:
$$\mathrm{SA}_{LSA}(x) = \mathrm{softmax}\!\left( \frac{R^M}{\tau} \right) x E_v \tag{9}$$
where $\tau$ is the learnable temperature. LSA improves the performance of ViT by allowing each token to focus more on the tokens related to it through diagonal masking and the learnable temperature.
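Both LSA components can be sketched together for a single head in NumPy. This is an illustrative single-head sketch under our own naming; in training, `tau` would be a learned parameter rather than a constant:

```python
import numpy as np

def softmax(z, axis=-1):
    """Row-wise softmax; exp(-inf) becomes exactly 0."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def lsa(x, E_q, E_k, E_v, tau):
    """Locality Self-Attention sketch: similarity matrix R = Q K^T,
    diagonal masking (self-token relations forced to -inf), and a
    temperature tau replacing the fixed sqrt(d_k)."""
    q, k, v = x @ E_q, x @ E_k, x @ E_v
    R = q @ k.T                              # (N, N) token relations
    R[np.diag_indices_from(R)] = -np.inf     # diagonal masking, Eq. (8)
    return softmax(R / tau) @ v              # Eq. (9)

rng = np.random.default_rng(0)
N, d = 8, 16
x = rng.standard_normal((N, d))
E_q, E_k, E_v = (rng.standard_normal((d, d)) for _ in range(3))
out = lsa(x, E_q, E_k, E_v, tau=1.0)  # tau would be learned in practice
print(out.shape)  # (8, 16)
```

After masking, each row of the attention score matrix has an exactly zero self-score and redistributes that mass over the other tokens, while a low learned `tau` sharpens the remaining distribution.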

3) SHARING TEMPERATURE ACROSS HEADS
In general, the transformer block uses an MSA consisting of multiple self-attention heads [28]. Therefore, all heads can either share the same temperature or learn it individually. Table 7 compares the performance of ViT when the learnable temperature is shared across attention heads and when it is not. When the temperature was learned individually in each head, there was a performance improvement of +1.26 and +1.33 on CIFAR-100 and Tiny-ImageNet, respectively. Therefore, we find that attention heads focusing on different features are only meaningful when each learns its own appropriate temperature.

4) LEARNING LOW TEMPERATURE
As seen in Table 1, the well-learned temperature should be less than √ d k . Fig. 4 shows the average learned temperature according to depth. Note that the average learned temperature is lower than the constant temperature of standard ViT. In general, the low temperature of softmax sharpens the score distribution. Therefore, the learnable temperature scaling sharpens the distribution of attention scores.

5) RESOLVING SMOOTHING PROBLEM
LSA solves the smoothing problem of the attention score distribution. Fig. 5 shows the depth-wise averages of the total Kullback-Leibler divergence ($D_{KL}^{total}$) over all heads. Here, T and M mean that only learnable temperature scaling or only diagonal masking is applied to ViTs, respectively, and L indicates that the entire LSA is applied. The lower the average of $D_{KL}^{total}$, the flatter the attention score distribution. We find that when LSA is fully applied, the average of $D_{KL}^{total}$ is larger by about 0.5 than that of standard ViT, so LSA attenuates the smoothing of the attention score distribution.

6) BOTH MASKING AND LEARNABLE TEMPERATURE ARE NEEDED
Let us look at the effect of the two key elements of LSA (L), learnable temperature scaling (T) and diagonal masking (M), on overall performance. Table 8 shows that learnable temperature scaling and diagonal masking effectively resolve the smoothing of the attention score distribution (see Fig. 5). For example, on TINY, learnable temperature scaling and diagonal masking improved performance by +0.88 and +1.22, respectively. Considering that LSA with both techniques applied shows a performance improvement of +1.43, we can claim that the contribution of each is sufficiently large and that the two techniques produce a synergy.

E. WHY DOES THE PROPOSED METHOD WORK?
SPT facilitates the analysis of spatial correlation by tokenizing shifted patches of the same semantic meaning together, which effectively improves the locality inductive bias. CPE embeds spatial coordinates, one of the critical pieces of image locality information, into each token; this also improves the locality inductive bias. The improved locality inductive bias through SPT and CPE allows ViT to learn a strong representation despite little data [25]. In addition, attention acts more locally in transformers that have learned strong representations. Through learnable temperature scaling and diagonal masking, LSA's attention focuses more on meaningful regions of the feature map (see Fig. 6). In other words, LSA makes ViT learn strong representations, improving its performance.

IV. EXPERIMENT SETUP
This section describes the setup of experiments to verify the performance of the proposed method. First, the datasets used in this study are described, and then the detailed configuration of ViT backbones is given. Finally, the detailed regime in which each model was trained is described.

A. DATASET
For the small-size dataset experiments, CIFAR-10 and CIFAR-100 (C-10 and C-100) [29], SVHN [30], and Tiny-ImageNet (TINY) [31] were employed. For the mid-size dataset experiment, ImageNet-1K [22] was employed. In addition, experiments were conducted on COCO 2017 [23] to verify the extendibility of the proposed method to other tasks. Table 9 lists the train set size, test set size, and number of classes of the datasets used in this paper.

B. BACKBONE
In order to achieve high performance on a small-size dataset, we have to limit the capacity of ViT by adjusting model hyperparameters. Since small-size datasets generally have small resolution, the number of tokens in the backbones designed for ImageNet should also be adjusted. So, we set the number of visual tokens in the transformer block to 8 × 8. If a pooling layer is used in the middle of the transformer, the number of tokens in the intermediate stage is 8 × 8. Here, C-10, C-100, and SVHN all have a resolution of 32 × 32, and TINY has a resolution of 64 × 64. So, in the case of TINY, we double the patch size compared to other datasets. Alternatively, we control the number of tokens by adding a pooling layer. On the other hand, mid-size datasets such as ImageNet-1K and COCO were trained according to popular configurations. The detailed configuration of each model for the small-size dataset is as follows.

1) ViT
For standard ViT [1], which has a high risk of overfitting, the depth was set to 9, the hidden dimension to 192, and the number of heads to 12. The patch size on TINY is 8, and the patch size on the other datasets is 4.

2) PiT
Based on the PiT-XS [7] architecture, the patch size was set to 4 for TINY and to 2 for the remaining datasets.

3) T2T
We designed the model based on T2T-14 [14], using Tokens-to-Token modules with a kernel size of 3, a stride of 2, and a padding of 1. The depth of the Tokens-to-Token module was 3 for TINY and 2 for the rest of the datasets. To apply SPT to T2T, which uses transformer-based tokenization, we concatenate the shifted images before the Tokens-to-Token module of the first stage.
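The shift-and-concatenate step of SPT can be sketched as follows. This is a minimal sketch under our assumptions: the four diagonal shifts by half a patch follow the paper's description, while the zero-padded shift helper and the convolutional patch projection are our own simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shift2d(x, dx, dy):
    """Shift an image tensor (B, C, H, W) by (dx, dy) pixels, zero-filling."""
    _, _, H, W = x.shape
    # pad order is (left, right, top, bottom); crop back to H x W afterwards
    x = F.pad(x, (max(dx, 0), max(-dx, 0), max(dy, 0), max(-dy, 0)))
    return x[..., max(-dy, 0):max(-dy, 0) + H, max(-dx, 0):max(-dx, 0) + W]

class SPT(nn.Module):
    """Shifted Patch Tokenization: concatenate the input with four diagonally
    shifted copies (by half the patch size), then patchify and project."""
    def __init__(self, in_chans=3, dim=192, patch_size=4):
        super().__init__()
        self.shift = patch_size // 2
        self.to_tokens = nn.Conv2d(in_chans * 5, dim,
                                   kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        s = self.shift
        shifted = [shift2d(x, dx, dy)
                   for dx, dy in [(-s, -s), (s, -s), (-s, s), (s, s)]]
        x = torch.cat([x] + shifted, dim=1)    # (B, 5*C, H, W)
        return self.to_tokens(x).flatten(2).transpose(1, 2)  # (B, N, dim)
```

On a 32 × 32 input with patch size 4, this yields 8 × 8 = 64 tokens of dimension 192.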

5) SWIN TRANSFORMER
For the Swin Transformer, Swin-T [6] and Swin-S [6] were used as backbones. For all datasets, the patch size was set to 2 and the window size to 4. Note that conventional Swin-T and Swin-S are composed of 4 stages and thus pass through 3 pooling layers. Consequently, an image with a resolution of 32 × 32 has a 2 × 2 feature map in the last stage, to which a window size of 4 cannot be applied. Therefore, we modified the models to have three stages for C-10, C-100, and SVHN: Swin-T consisted of 3 stages with depths of 2, 6, and 4, and Swin-S of 3 stages with depths of 3, 6, and 12.
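This stage arithmetic can be checked directly. Assuming the patch embedding divides the image side by the patch size and each patch-merging (pooling) layer halves it, a helper of our own shows why the 3-stage variants are needed:

```python
def stage_resolutions(img_size: int, patch_size: int, num_stages: int):
    """Feature-map side length at each stage; patch merging halves it per stage."""
    side = img_size // patch_size
    sides = [side]
    for _ in range(num_stages - 1):
        side //= 2
        sides.append(side)
    return sides

# 4-stage Swin on 32 x 32 with patch size 2: the last stage is 2 x 2,
# so a window size of 4 cannot fit
print(stage_resolutions(32, 2, 4))   # -> [16, 8, 4, 2]
# the 3-stage variants keep a 4 x 4 final map, where window size 4 fits
print(stage_resolutions(32, 2, 3))   # -> [16, 8, 4]
```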

6) CoAtNet
CoAtNet-0 [20] was used as the backbone. For TINY, pooling layers were used from the second stage; for the rest of the datasets, from the third stage.

C. TRAINING REGIME
For small-size dataset training, the following learning strategies were basically followed. We applied data augmentations such as CutMix [32], Mixup [33], AutoAugment [34], and Repeated Augment [35] to all models. In addition, regularization techniques such as label smoothing [36], stochastic depth [37], and random erasing [38] were employed. AdamW [39] was used as the optimizer, with a weight decay of 0.05, a batch size of 128, and a warm-up of 10 epochs. All models were trained for 100 epochs with cosine learning rate decay. The initial learning rate was set to 0.003 for ViT and CNN, and to 0.001 for the rest of the models. ImageNet was trained at 224 × 224 resolution with a batch size of 256; the rest of that training regime followed [6]. To learn COCO from scratch, we followed the Direct(P1x)-1x pipeline of [40], in which a detection dataset is trained from scratch.
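The optimizer and schedule above can be sketched as follows. The hyperparameter values mirror the text, while the linear-warm-up-then-cosine shape is a common implementation choice rather than necessarily the authors' exact code:

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer_and_scheduler(model, lr=1e-3, weight_decay=0.05,
                                 warmup_epochs=10, total_epochs=100):
    """AdamW with linear warm-up followed by cosine decay, stepped per epoch."""
    opt = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)

    def lr_lambda(epoch):
        if epoch < warmup_epochs:
            return (epoch + 1) / warmup_epochs            # linear warm-up
        progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    return opt, LambdaLR(opt, lr_lambda)
```

For ViT and CNN on the small-size datasets this would be called with `lr=0.003`; after each training epoch, `scheduler.step()` advances the schedule.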

V. EXPERIMENTAL RESULTS
A. ABLATION STUDY
This section presents ablation studies of SPT, CPE, and LSA. First, we show that each technique mitigates redundant attention through visualizations of the class-token attention. Next, we demonstrate that each technique flattens the loss landscape through 3D loss landscape visualizations. Finally, we quantitatively show that the proposed techniques improve the performance of ViT. For these experiments, two small-size datasets, i.e., C-100 and TINY, are used, and the backbones are ViT [1] and Swin-T [6]. The gray highlight in each table indicates the baseline model.

Fig. 6 visualizes the attention scores of the final class token when SPT, CPE, and LSA are applied to ViT. First, when LSA is applied, attention that was scattered over the background concentrates on the target class. Next, when either SPT or CPE is applied, the attention better captures the shape of the target class. Finally, when SPT, CPE, and LSA are all applied, the shape of the class is maintained and ViT focuses on the more important regions. In the first example of the figure (top row), the dog is discriminated more accurately because the facial region, which is crucial for distinguishing the dog, receives the most attention. Therefore, the proposed method effectively increases the locality inductive bias and improves the attention of ViTs.

Fig. 7 visualizes the two-dimensional loss contours according to the application of SPT, CPE, and LSA, using the visualization method of Li et al. [45]. We observe that SPT and CPE significantly flatten ViT's loss landscape. This shows that in ViT, which structurally lacks locality information of tokens, spatial shift modeling of visual tokens and position encoding improve locality inductive bias very effectively. Also, Table 10 quantitatively shows that SPT and CPE have a large effect on improving the performance of ViT.
On the other hand, LSA flattens the loss landscape of ViT less than CPE and SPT do. Table 10 likewise shows that LSA has the smallest influence on the performance improvement of ViT on both CIFAR-100 and Tiny-ImageNet. Note that the loss landscape is most effectively flattened when SPT, CPE, and LSA are all applied; Table 10 confirms this. Thus, Fig. 7 and Table 10 show that the flatter the loss landscape, the better the performance of ViT. In other words, the proposed method effectively flattens the loss landscape, which in turn helps ViT learn strong representations.

FIGURE 7. Change of loss contour of ViT according to the application of the proposed method. CIFAR-100 was used for this experiment. SPT and CPE, which have a great influence on performance improvement, flatten the loss landscape of ViT more clearly. Flattening is most pronounced when SPT, CPE, and LSA are all applied.

Table 10 also verifies the synergy of SPT, CPE, and LSA. The best synergy, i.e., the highest performance improvement, was observed when all three were applied simultaneously: the improvement of ViT reached +4.69 and +5.08 on C-100 and TINY, and that of Swin-T reached +3.62 and +5.7 on C-100 and TINY, respectively. Therefore, we argue that SPT, CPE, and LSA, designed to overcome the drawbacks of ViTs, are effective add-on modules.

B. IMAGE CLASSIFICATION
This section shows that the proposed techniques effectively improve the performance of ViT backbones across various datasets and tasks. First, we show that the performance of ViTs is dramatically improved when data is insufficient, i.e., the proposed method makes ViT comparable to CNNs even without much data. Second, we demonstrate that the proposed method is also effective in transfer learning and on a mid-size dataset, i.e., ImageNet-1K, which supports its generalization ability. Finally, we show that the proposed method improves the performance of ViT backbones in object detection and instance segmentation, which indicates that it works well not only for image classification but also for other vision tasks.

TABLE 10. Ablation study of the application of data augmentations, SPT, CPE, and LSA. Top-1 accuracy is measured on CIFAR-100 and Tiny-ImageNet.

TABLE 11. Top-1 accuracy when learning small-size datasets from scratch. Tiny-ImageNet was used here, and throughput was measured on a V100 GPU.

TABLE 12. Results of learning COCO from scratch. Mask R-CNN [43] was trained with the Direct(P1x)-1x pipeline of [40] using only COCO. FPS was measured on a V100 GPU.

1) LEARNING SMALL-SIZE DATASETS FROM SCRATCH
Table 11 shows the performance improvement when the proposed method is applied to various ViT backbones. In all cases, the proposed method effectively improved the performance of the ViT backbones, especially on C-100 and TINY. For example, SCL improved PiT-XS by +5.25 on C-100 and improved Swin-T by up to +5.7 on TINY. The table also shows that the proposed method effectively reduces the gap between ViT and CNN on small-size datasets. For example, the hybrid model SCL-CoAtNet-0 achieves the best performance on C-10 and C-100, surpassing CNNs, and SCL-CaiT-XS24 clearly outperformed CNNs on SVHN. In addition, the proposed method improves the performance of ViT more efficiently than scaling up the model, e.g., by increasing depth or hidden dimension: SCL-CaiT-XXS24 achieves much better performance on all datasets, notably +4.75 on C-100, at a lower cost than CaiT-XS24.
2) LEARNING ImageNet-1K FROM SCRATCH
Table 13 shows the performance when training ImageNet-1K from scratch. The proposed method improved the accuracy of DeiT-Ti by +3.2 and that of Swin-T by +1.1. Note that SCL-Swin-T outperforms RegNetY-8GF [21]. This result supports that the proposed method improves ViT backbones even on mid-size datasets.

TABLE 14. Transfer learning accuracy on classification datasets. The model pre-trained with ImageNet-1K was fine-tuned on small-size datasets. We performed fine-tuning for 100 epochs with a warm-up of 5 epochs.

3) TRANSFER LEARNING
Table 14 shows that the proposed method works well in transfer learning. In this experiment, networks pre-trained on ImageNet were transferred to C-100 and TINY. Performance improved on both datasets; for example, the accuracy of DeiT-Ti on TINY increased by +3.59. Therefore, we argue that the proposed method has sufficient generalization ability.

C. COCO 2017
The training set of COCO 2017 consists of 118K images, which is very small compared to the 1.2M images of ImageNet, widely used for pre-training. By learning the object detection and instance segmentation tasks of COCO from scratch, we tested whether the proposed method works in other vision tasks. Table 12 shows that the proposed method improves the Swin-T backbone by +1.2 box AP and +0.9 mask AP in object detection and instance segmentation, respectively. The performance on COCO also improves by +0.6 box AP and +0.5 mask AP even on the larger-scale Swin-S backbone. Therefore, the proposed method is effective even in object detection and instance segmentation, where locality information is important.

VI. CONCLUSION
To train ViT from scratch on small-size datasets, this paper presented novel techniques that increase the locality inductive bias of ViT and solve the smoothing problem of the attention score distribution. First, SPT embeds rich spatial information into visual tokens through specific spatial transformations. Second, CPE explicitly uses the coordinate information of the visual tokens for position encoding. Finally, LSA induces ViT to attend locally through a learnable temperature and diagonal masking within softmax. SPT, CPE, and LSA achieve significant performance improvements independently, and they are applicable to any ViT. The proposed method effectively improves the generalization performance of ViT backbones by flattening the loss landscape. Therefore, this study shows that ViTs can learn small-size datasets from scratch and opens an opportunity for ViTs to develop further.