Enhancing Few-Shot Image Classification With Cosine Transformer

This paper addresses the few-shot image classification problem, where the classification task is performed on unlabeled query samples given only a small amount of labeled support samples. One major challenge of few-shot learning is the large variety of object visual appearances, which prevents the support samples from representing an object comprehensively. This can result in a significant difference between support and query samples, thereby undermining the performance of few-shot algorithms. In this paper, we tackle the problem by proposing the Few-shot Cosine Transformer (FS-CT), in which the relational map between supports and queries is effectively obtained for few-shot tasks. FS-CT consists of two parts: a learnable prototypical embedding network that obtains categorical representations from support samples, including hard cases, and a transformer encoder that computes the relational map between the two different sets of support and query samples. We introduce Cosine Attention, a more robust and stable attention module that enhances the transformer module significantly and improves FS-CT accuracy by 5% to over 20% compared to the default scaled dot-product mechanism. Our method achieves competitive results on mini-ImageNet, CUB-200, and CIFAR-FS for 1-shot and 5-shot learning tasks across backbones and few-shot configurations. We also developed a custom few-shot dataset for Yoga pose recognition to demonstrate the potential of our algorithm for practical applications. Our FS-CT with cosine attention is a lightweight, simple few-shot algorithm that can be applied to a wide range of applications, such as healthcare, medical, and security surveillance. The official implementation code of our Few-shot Cosine Transformer is available at https://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer.


Introduction
Deep learning methods have shown promising performance in numerous computer vision tasks and real-world applications by leveraging large-scale annotated data [10,37]. However, collecting and labeling an adequate amount of data for training such methods is extremely costly and time-consuming, particularly in special contexts such as medical treatment, healthcare monitoring, or security surveillance, where data availability is limited by scarcity, expense, or privacy concerns, respectively. In many practical applications, the balance between quality and quantity in building datasets must often be considered to optimize the cost of training and maintenance. For example, developing a smart system that can simultaneously monitor the body movement, heart rate, and respiration of yoga learners requires building an extremely costly dataset annotated by various tracking devices under the supervision of specialists in different domains, which makes large-scale dataset development impractical. To address such data limitation challenges in deep learning, few-shot image classification [8,24,43] has been proposed as one solution for learning to predict unseen data with a very restricted quantity of labeled instances.
In a few-shot learning scenario, the feature representation network is typically trained with a constrained amount of labeled support data to classify unlabeled query instances sharing the same categories [15,45,48]. This approach enables the model to recognize images with only a small quantity of data. To evaluate the few-shot learning model effectively, the training set and the testing set are disjoint in terms of categories and divided into episodic tasks. From a learning perspective, few-shot learning can be categorized into two approaches: (i) inductive learning [35,36,49], which categorizes each query instance independently by exploiting the per-instance representation for decision making, and (ii) transductive learning [29,30,52], which categorizes query instances concurrently, exploring the entire relationship between support and query instances within a task. From an algorithmic perspective, few-shot learning can be divided into two strategies: (i) learning an embedding space that clusters data features by category under a distance-based metric [1,40,46] and (ii) fine-tuning the model's parameters effectively for a novel task with only a few gradient steps [21,32,39].
While few-shot learning methods have shown promise in dealing with limited data problems, they still face two critical challenges that significantly affect their learning capacity. First, the limited quantity of support instances may not fully represent the target categories under various visual conditions such as appearance, point of view, and object shape. As the support instances are often chosen randomly, the selected samples might consist mostly of hard cases and fail to capture the characteristics of the target object adequately. Thus, effectively handling hard samples is crucial in few-shot learning. Second, the support and query distributions might differ greatly due to random sampling and extremely small sample sizes. Therefore, it is crucial to establish a meaningful relationship between these two sets, particularly in transductive learning approaches, where aligning the support and query distributions becomes essential.
To tackle these challenges, we have developed a novel framework aimed at improving support representations and enhancing the transductive few-shot classification process. Our framework consists of two main modules: an enhanced prototypical embedding and a transformer encoder with a cross-attention mechanism for effective support-query relational mapping. Firstly, the prototypical embedding is enhanced with learnable and adjustable mean weights for computing centroids. This ensures that the prototypical representations take into account both easy and hard samples, resulting in more comprehensive embeddings. Secondly, we introduce a novel Cosine Attention mechanism based on cosine similarity, replacing the traditional scaled dot-product one. Cosine attention removes the magnitude disparities between the two feature sets, leading to a more stable and effective relational map. It significantly highlights the correlation between support and query samples that share the same characteristics. The correlation map is then used for query prediction in a transductive learning manner. These two improvements on the baseline framework enhance few-shot image classification, resulting in our proposed method, the Few-shot Cosine Transformer (FS-CT).
Through a comprehensive empirical evaluation, we demonstrate the effectiveness of our proposed FS-CT under various configurations and datasets. We further analyze the effectiveness of the proposed cosine attention under various transformer-based few-shot algorithms in ablation studies. These studies validate the impact of our improvements for few-shot image classification. To showcase the real-world applicability of our method, we developed a small-scale custom dataset specifically for yoga pose recognition. This dataset consists of 50 categories and nearly 2,500 images. The results of FS-CT on this Yoga dataset showcase the potential of few-shot learning in practical healthcare applications in general, and in smart Yoga monitoring systems in particular. The official implementation code for FS-CT, along with the proposed Yoga Poses Recognition dataset, is available at https://github.com/vinuni-vishc/Few-Shot-Cosine-Transformer.
Our contributions can be summarized as follows:
- We propose a novel few-shot image classification method called Few-shot Cosine Transformer (FS-CT), which incorporates a prototypical embedding module and a Transformer encoder architecture.
- We improve the conventional prototypical embedding by introducing a learnable weighted mean operation. This helps enhance the categorical representations for the support set and mitigates the impact of hard samples in few-shot tasks.
- We develop a new attention mechanism called cosine attention, which enhances the transformer layer's ability to map support and query features. This results in a more stable and significant correlation map, leading to improved transformer outputs and more accurate few-shot predictions.
- We demonstrate the effectiveness of our proposed FS-CT method, particularly in conjunction with the improved cosine attention mechanism, through detailed empirical evaluations and ablation studies across few-shot datasets.
- We develop a custom few-shot dataset for the Yoga pose recognition task, where our method's performance showcases the potential of few-shot learning in practical healthcare applications, particularly in Yoga scoring and monitoring systems.
The rest of this paper is organized as follows. Section 2 reviews previous studies. Section 3 provides a mathematical formulation of the few-shot classification problem, and Section 4 specifies the proposed FS-CT with the Cosine Attention mechanism. Section 5 presents our experimental settings and evaluation. Section 6 discusses limitations and future work. Section 7 concludes the paper.

Related Work
The main objective of this study is to enhance the accuracy of few-shot image classification in a transductive learning setting by computing a correlational map between the support and query sets using transformer attention. This section provides a brief overview of related research on image classification, few-shot learning, and the Transformer attention mechanism that we employ to compute the correlational map.

Image Classification
Image classification is one of the fundamental tasks of computer vision, where deep models are developed to recognize images based on their content. Although general deep learning methods with deep neural network architectures have demonstrated their effectiveness in classification on large-scale datasets for most tasks [10,18,37], image classification often faces challenges in specific domains or suffers under certain conditions, e.g., imbalanced data or small-scale data. Therefore, improvements to deep neural networks are often made to tackle these problems. For example, in dealing with small dataset problems, improvements are often made to the network architecture, cost function, data augmentation, latent augmentation (adversarial training), and warm-starting with a pre-trained model [6]. Zhou et al. [51] improved neural networks by developing an entanglement coefficient algorithm between pixels, based on a quantum physics perspective, as a general case of various traditional distance functions on geometric sensing images. Zhou et al. [50] considered both local and global features for blind quality prediction of natural scene images without prior knowledge. Ban et al. [2] focused on enhancing the quality of microscopic images captured by a monocular camera using depth estimation techniques. Their aim was to improve the images before applying image recognition. While these improvements often bring benefits for specific domain tasks, hyperparameter optimization is currently underestimated and should be considered in future studies to ensure more accurate evaluations and fair comparisons between methods [6].
Instead of deeply exploring the nature of images from various perspectives for domain-specific tasks, our research focuses on enhancing network architectures to address small dataset challenges in few-shot learning across domain tasks. Our technique is highly adaptable, as it can be built on any feature backbone architecture using pre-trained models. We conduct a thorough evaluation of our proposed method and provide a fair comparison between ours and existing studies through detailed experiments and ablation studies with an optimized hyperparameter configuration.

Few-shot Learning
Few-shot learning is a subset of meta-learning [22,32,39], which develops models that are able to adapt to unseen tasks with small training data. Meta-learning algorithms take advantage of prior knowledge from a large-scale dataset (e.g., a pre-trained deep network) to learn effectively on a small novel dataset via a meta-learner (or few-shot learner) [13,14]. Based on the learning method, few-shot learning algorithms can be divided into two categories. (i) Metric-based learning [11,20,31,36,38] focuses on learning an embedding space where samples from the same category are mapped closely together under a distance metric. Active Instance Selection [36] fits a categorical distribution to the support set and selects new instances for the support set based on a clustering algorithm using a metric distance, thereby improving few-shot learning performance. This method relies on the assumption that the distribution of the support set can be well approximated by a specific model, which may not hold in all scenarios. Prototypical Network [38] computes prototypical embeddings by averaging support features within the same category, and then measures the Euclidean distance between queries and the prototypical embeddings. However, it relies solely on averaging support features without considering the relative importance of different samples, leading to suboptimal representations, especially when dealing with hard or challenging samples. (ii) Optimization-based learning [8,15,21,23,32] fine-tunes the model's parameters to quickly adapt to a new task with only a few effective gradient descent steps. Baseline++ [8] improves the fine-tuning step by replacing the dot product operation with cosine similarity in the linear classifier layer. Meta-Learner LSTM [32] employs an LSTM for updating and tracking parameters across few-shot tasks, enabling fast adaptation. P>M>F [21] adopts a three-stage approach: pre-training the feature backbone on unlabeled external data, re-training the model using the prototypical network, and finally fine-tuning it on novel tasks with a few gradient steps while employing data augmentation.
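As a point of reference for the metric-based family, the nearest-prototype rule of the Prototypical Network [38] can be sketched in a few lines. This is a minimal illustration on toy 2-D features rather than the reference implementation; the function names `mean_prototypes` and `nearest_prototype` are hypothetical.

```python
import math

def mean_prototypes(support):
    """support: n x k x d nested lists -> n x d arithmetic-mean prototypes."""
    return [[sum(col) / len(feats) for col in zip(*feats)] for feats in support]

def nearest_prototype(query_feat, protos):
    """Return the index of the prototype closest in Euclidean distance."""
    dists = [math.dist(query_feat, p) for p in protos]
    return min(range(len(protos)), key=dists.__getitem__)

# Toy 2-way 2-shot support features (assumed values, for illustration only).
support = [[[0.0, 0.0], [2.0, 0.0]],
           [[10.0, 10.0], [12.0, 10.0]]]
protos = mean_prototypes(support)
pred_a = nearest_prototype([2.0, 1.0], protos)
pred_b = nearest_prototype([9.0, 9.0], protos)
```

Because every support sample contributes equally to its prototype, a single atypical sample shifts the centroid as much as a typical one, which is the limitation noted above.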
Among the two categories of few-shot learning methods, metric-based learning stands out as a straightforward yet effective approach. It typically comprises two stages: feature extraction, which embeds both labeled and unlabeled samples in the same embedding space, and a metric function, which uses a similarity or distance metric to categorize unlabeled samples by comparing [40] or clustering [38] the embedded features. Based on the inference setting, there are two learning approaches for few-shot algorithms: inductive learning [14,35,43] and transductive learning [3,26,40]. Inductive learning classifies each query sample individually, while transductive learning classifies all query samples collectively [5,36]. The latter allows additional information about the data distribution or visual resemblance among query samples to be obtained and leveraged, thus potentially improving overall performance.
In this study, we explore the alignment between support and query features for few-shot recognition in a transductive manner with a metric-based learning approach. We focus on the cross-attention mechanism in the transformer as an effective method for computing support-query correlation maps and discuss its advantages and limitations in the following section.

Transformer Attention mechanism
After being introduced in [42] for natural language processing tasks, the transformer soon rose to dominance in computer vision [7,9,12]. The core of the transformer is the attention mechanism, which calculates an attention map that indicates the similarity between features. The mechanism comes in two variants: self-attention determines the internal relationships within a feature set, and cross-attention calculates the external relationships between two feature sets. Several few-shot learning studies are inspired by the transformer and its attention mechanism [1,11,17,25,47]; these methods generally use attention to align labeled and unlabeled features for classification. SetFeat [1] tackles few-shot classification by matching support and query features at multiple scales using shallow attention mechanisms, incorporating various distance-based methods. CTX [11] focuses on obtaining a coarse alignment between query and support samples by emphasizing local features through an improved spatial attention mechanism. URT [25] takes a different approach by computing a universal representation between labeled and unlabeled samples through the averaging of multiple scaled dot-product attention heads on domain-specific representations. While attention is a powerful mechanism, it suffers from two critical problems: a lack of good insight into how attention works and an expensive quadratic computational cost. Thus, research on the attention mechanism focuses on three main directions: (i) reducing the computation cost, (ii) obtaining good insight into attention, and (iii) designing a good attention mechanism for a specific task. Most studies focus on the self-attention mechanism, with one set of features as input rather than the two used in cross-attention. Therefore, the difference between the two feature sets that can occur in cross-attention goes unnoticed. This makes the attention output unstable, thus reducing transformer performance.
In this study, we investigate this limitation of the scaled dot-product attention mechanism and propose cosine attention as a replacement to tackle the problem. Although prior research has utilized the cross-attention mechanism for few-shot learning, our work is, to our knowledge, the first to investigate the cross-attention limitation in depth for the few-shot classification problem. To the best of our knowledge, this is also the first time a cosine similarity-based attention mechanism has been explored and shown to be effective for few-shot learning.

Problem formulation and notations
We first formalize a standard few-shot classification problem and introduce some notation. In the few-shot learning problem, the objective is to develop a model that is able to perform a task, in this case image classification, on any set of random categories given only a very small number of labeled samples per category as support information. Given a training set $D_{train}$ and a few-shot learner $A(\cdot \mid \theta)$, the objective of few-shot classification is to learn the optimal parameter $\theta^*$ so that $A(\cdot \mid \theta^*)$ achieves good performance on a test set $D_{test}$. $D_{train}$ and $D_{test}$ must be disjoint in categories.
A few-shot learning problem is usually trained with an episodic learning strategy, where the proposed approach is trained and tested on different tasks with different sets of categories. Episodic learning with m tasks is generally described in the corresponding figure. Let $(x, y)$ denote an input image sample and its ground-truth label, respectively. The objective of a few-shot classification task is to predict the labels of the query set $Q = \{(x_q, y_q)\}$ given the support set $S = \{(x_s, y_s)\}$ of $n \times k$ labeled samples, where $y_s \in C$ for a set $C$ of n categories (n-way), k is the number of training samples per category (k-shot), and q is the total number of samples in Q. The value of k must be small in a few-shot setting. In this paper, we explore two configurations: 5-way 5-shot and 5-way 1-shot. The few-shot classification problem can be formulated with the following optimization formula:

$$\theta^* = \arg\min_{\theta} \sum_{(S, Q) \sim D_{train}} L\big(S, Q; A(\cdot \mid \theta)\big) + \lambda R(\theta),$$

where

$$L\big(S, Q; A(\cdot \mid \theta)\big) = -\sum_{(x_q, y) \in Q} \log p\big(y \mid A(x_q \mid \theta); S\big) \quad (1)$$

with $p\big(y \mid A(x_q \mid \theta); S\big)$ the probabilistic prediction of sample $x_q \in Q$ on its true label $y$ using few-shot algorithm $A(\cdot \mid \theta)$ given $x_q$ and support set $S$, and $\lambda R(\theta)$ an optional regularization term with factor $\lambda$. The loss function $L\big(S, Q; A(\cdot \mid \theta)\big)$ depends on the few-shot problem and method. In this work, the categorical cross-entropy loss in Equation 1 is used for categorical classification.

Fig. 2: The overall architecture of the proposed Few-shot Cosine Transformer, which includes two main components: (a) a learnable proto-representation that calculates the categorical proto-representation given random support features that might lie either at the far margin of the distribution or very close to each other, and (b) a Cosine Transformer that determines the similarity matrix between proto-representations and query samples for few-shot classification tasks. The heart of the transformer architecture is Cosine Attention, a cross-attention mechanism with cosine similarity and no softmax function, designed to deal with two different sets of features. The Cosine Transformer shares a similar architecture with a standard transformer encoder block, with two skip connections to preserve information, a two-layer feed-forward network, and layer normalization between them to reduce noise. The output is passed through a cosine linear layer, with cosine similarity replacing the dot product, before being fed to softmax for query prediction. Panel labels: Embedding network, Embedding space, Learnable proto-representation, Cosine Transformer. This figure is better viewed in color.
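The episodic setup formalized in this section can be sketched as a small sampling routine. This is only an illustration under stated assumptions: the helper `sample_episode`, the toy data dictionary, and the per-category query count `n_query` are hypothetical (the text uses q for the total number of query samples, which here equals `n_way * n_query`).

```python
import random

def sample_episode(labels_to_items, n_way, k_shot, n_query, rng):
    """Draw one n-way k-shot task: a support set S and a query set Q."""
    classes = rng.sample(sorted(labels_to_items), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw k support and n_query query items of this class without overlap.
        items = rng.sample(labels_to_items[cls], k_shot + n_query)
        support += [(x, episode_label) for x in items[:k_shot]]
        query += [(x, episode_label) for x in items[k_shot:]]
    return support, query

# Toy data: 8 categories with 20 item ids each (placeholders for images).
data = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(8)}
rng = random.Random(0)
support, query = sample_episode(data, n_way=5, k_shot=1, n_query=3, rng=rng)
```

Labels are re-indexed from 0 to n-1 inside each episode, so the learner cannot rely on global category identities; training and test episodes would be drawn from disjoint category pools.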

Few-Shot Cosine Transformer (FS-CT)
In this section, we describe the proposed Few-Shot Cosine Transformer (FS-CT) architecture, which uses the transformer framework to learn the similarities between labeled support and unlabeled query features in order to recognize query samples following the transductive learning approach. Fig. 2 presents the overall architecture of FS-CT, with two main components: the learnable prototypical embedding and the Cosine Transformer.
Overall, FS-CT shares a similar architecture with the transformer encoder. Given the input support set S and query set Q, their images are fed into a backbone feature extractor to obtain two feature tensors $Z_S$ and $Z_Q$, and the features from $Z_S$ are then averaged within individual categories to obtain the prototypical representation $Z_P$, as in the prototypical network [38]. Unlike other conventional transformer-based architectures, positional encoding is removed, since the arrangement of features is unimportant. After that, $Z_P$ and $Z_Q$ are passed through three linear layers and split into multi-head sets of three features $q^*$, $k$, $v$, which then go through a multi-head cross-attention mechanism to obtain the attention-weighted features of $Z_Q$ on $Z_P$. Instead of using the vanilla softmax attention, we propose a variation of the attention mechanism named "cosine attention", which uses cosine similarity to calculate the attention weights. The attention outputs of all heads are then fed to an output linear layer that combines the heads, followed by a two-layer MLP with a GELU activation function. Two skip connections are applied to prevent information loss, and layer normalization is applied before the linear layers to smooth values throughout FS-CT. Finally, the output feature is passed through a cosine linear layer followed by softmax to yield probabilistic scores on individual categories and predict the queries' labels. In the following subsections, we describe in detail the essential modules within our FS-CT method.

Learnable Prototypical representation
Given a support set $S = (X_S, Y_S)$ and a query set $Q = (X_Q, Y_Q)$ that follow the few-shot setting (n-way, k-shot, q query samples), the backbone produces support features $Z_S \in \mathbb{R}^{n \times k \times d}$ and query features $Z_Q \in \mathbb{R}^{q \times d}$. In the prototypical network [38], the support features that share the same category c are averaged equally (arithmetic mean) to obtain the prototypical representation $Z_P \in \mathbb{R}^{n \times d}$, with $z_c \in Z_P$ representing the centroid of category c. However, with the few supporting samples chosen randomly, the prototypical representations are not guaranteed to be representative under this arithmetic mean approach. This may happen when the embeddings of the chosen samples lie at the far margin of the categorical distribution space or close to each other, as in Fig. 2. The problem becomes critical when the number of shots is low. To tackle it, we propose turning the arithmetic mean into a weighted sum (or weighted mean), where the weights can be adjusted through the learning process to obtain a better prototypical representation for each category given the same small set of support samples. The learnable prototypical embedding is described by:

$$Z_P = \sum_{i=1}^{k} \mathrm{softmax}(W_{avg})_{:,i} \odot (Z_S)_{:,i}, \qquad W_{avg} = g(a \mid \theta_P),$$

where the softmax is taken over the k axis. Initially, the mean weights $W_{avg} = g(a \mid \theta_P)$ for calculating the categories' centroids are obtained by feeding a fixed scalar input a into the linear layer $g(\cdot \mid \theta_P)$ with learnable weight $\theta_P \in \mathbb{R}^{n \times k \times 1}$ and no bias. We use a = 1 so that the output $W_{avg}$ has the same value as the weight $\theta_P$. After obtaining the mean weights, softmax is applied along the k axis to ensure that the components along that axis form a weight distribution that sums to 1. Then, the prototypical representation $Z_P \in \mathbb{R}^{n \times d}$ is obtained by element-wise multiplication between $Z_S \in \mathbb{R}^{n \times k \times d}$ and $W_{avg} \in \mathbb{R}^{n \times k \times 1}$, followed by summing the values along the k axis to obtain the weighted mean representation of each category, which is the prototypical representation.
In our implementation, instead of feeding a fixed scalar into a linear layer, we directly initialize a learnable parameter with value a = 1 in the dimensional space $\mathbb{R}^{n \times k \times d}$ and feed it to a softmax function to obtain $W_{avg}$. We use the initial value of 1 for the learnable parameter because we want to begin by averaging equally among the feature vectors within the same category; the weight values $w \in W_{avg}$ then vary through the training process to obtain a prototypical representation that best fits the given data. In our experiments, the learnable prototypical embedding yields a better representation than the standard one and thus achieves higher performance in many scenarios.
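The weighted-mean prototype can be illustrated with a short, framework-free sketch. It assumes one learnable logit per support sample (the $n \times k \times 1$ form described above); `weighted_prototypes` and the toy values are hypothetical, and no training loop is shown.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_prototypes(z_support, theta):
    """z_support: n x k x d features; theta: n x k learnable logits."""
    prototypes = []
    for class_feats, class_logits in zip(z_support, theta):
        w = softmax(class_logits)  # weights over the k shots sum to 1
        d = len(class_feats[0])
        proto = [sum(w[i] * class_feats[i][j] for i in range(len(w)))
                 for j in range(d)]
        prototypes.append(proto)
    return prototypes

z_s = [[[1.0, 0.0], [3.0, 2.0]],   # category 0: k=2 shots, d=2
       [[0.0, 4.0], [2.0, 0.0]]]   # category 1
# Equal logits (the initial value 1) reproduce the arithmetic mean.
uniform = weighted_prototypes(z_s, [[1.0, 1.0], [1.0, 1.0]])
# A larger logit on the second shot of category 0 gives weights 0.25 / 0.75.
skewed = weighted_prototypes(z_s, [[1.0, 1.0 + math.log(3.0)], [1.0, 1.0]])
```

Initializing all logits to the same value makes the module start as the standard prototypical embedding, so any change in the weights during training is a learned departure from the plain mean.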

Cosine Transformer
Cosine similarity between two matrices. Given two vectors $a, b \in \mathbb{R}^n$, the cosine similarity score between a and b is the dot product of the two vectors divided by the product of their magnitudes:

$$S_C(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|} \quad (3)$$

From the formula for vectors above, we extend the definition of cosine similarity to matrices. Specifically, the cosine similarity $S_C(A, B) \in \mathbb{R}^{n \times m}$ between matrices $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{k \times m}$ is the Hadamard division between the matrix product of A and B and the outer product of the vectors $M_A \in \mathbb{R}^n$ and $M_B \in \mathbb{R}^m$, where $M_A$ collects the magnitudes of the row vectors of A and $M_B$ collects the magnitudes of the column vectors of B:

$$S_C(A, B) = (A B) \oslash (M_A \otimes M_B) \quad (4)$$

With that definition, the individual element $S_C(A, B)_{i,j}$ is the cosine similarity score between the row vector $a_i$ of matrix A and the column vector $b_j$ of matrix B, where $i \in \{1, \dots, n\}$ and $j \in \{1, \dots, m\}$.
Cosine attention mechanism. Initially, as n and q differ in value, we reshape the proto-feature $Z_P \in \mathbb{R}^{n \times d}$ into a $1 \times n \times d$ tensor and $Z_Q \in \mathbb{R}^{q \times d}$ into a $q \times 1 \times d$ tensor so that both the n and q dimensions are maintained in the attention output $h_a \in \mathbb{R}^{q \times n \times d}$, which provides the similarity matrix between $Z_P$ and $Z_Q$. With the two reshaped tensors, a set of three representations $q^*$, $k$, $v$ is obtained by linear layers:

$$q^* = Z_P \theta_q, \qquad k = Z_Q \theta_k, \qquad v = Z_Q \theta_v,$$

where $\theta_q$, $\theta_k$, $\theta_v$ are the weight matrices and $d_h$ is the dimension inside attention. The output attention head $h_a$ can be computed using the scaled dot-product or "Softmax Attention" (Soft Attn) by:

$$h_a = \mathrm{SoftAttn}(q^*, k, v) = \mathrm{softmax}\left(\frac{q^* k^\top}{\sqrt{d}}\right) v$$

Specifically, in softmax attention, the matrix multiplication performs the dot-product operation between every pair of feature vectors from $q^*$ and $k$; the result is divided by a scaling factor $\sqrt{d}$ before being fed to a softmax function to give an attention map $A \in \mathbb{R}^{q \times n \times 1}$, which is then multiplied with $v \in \mathbb{R}^{q \times 1 \times d}$ to give the attention output $h_a \in \mathbb{R}^{q \times n \times d}$. However, as there is an extra dimension in both A and v, the matrix multiplication becomes a Hadamard product under tensor broadcasting. Therefore, the attention output $h_a$ is in essence the element-wise multiplication between the query features and the attention map of query samples and support proto-representations. While this is quite different from the usual concept, the nature of the attention mechanism is still preserved in $h_a$.
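The matrix form of cosine similarity defined above can be checked numerically with a small sketch. The helper names and the `eps` guard against zero-length vectors are assumptions added for illustration; the source definition has no such guard.

```python
import math

def matmul(A, B):
    """Plain matrix product of nested lists: (n x k) times (k x m)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def cosine_similarity_matrix(A, B, eps=1e-12):
    """S_C(A, B)[i][j] = cosine between row i of A and column j of B."""
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in A]
    col_norms = [math.sqrt(sum(x * x for x in col)) for col in zip(*B)]
    product = matmul(A, B)
    # Hadamard division by the outer product of the two magnitude vectors.
    return [[product[i][j] / max(row_norms[i] * col_norms[j], eps)
             for j in range(len(col_norms))]
            for i in range(len(row_norms))]

A = [[3.0, 4.0], [1.0, 0.0]]               # n=2 rows, k=2
B = [[3.0, 0.0, -4.0], [4.0, 5.0, 3.0]]    # k=2, m=3 columns
S = cosine_similarity_matrix(A, B)
```

Here the first row of A is parallel to the first column of B (similarity 1) and orthogonal to the third (similarity 0), regardless of how long either vector is.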
The core of softmax attention is the dot-product operation, which calculates a similarity that involves both the angle and the length (magnitude) of the vectors. The involvement of feature length comes with limitations. First, the feature magnitude is a distinguishing yet unimportant factor in calculating attention, and the similarity output map can become unstable when feature magnitudes vary. This can be critical when q and k come from two distributions: the few-shot support and query sets, respectively. Second, enlarged magnitudes make the softmax function produce extremely small gradients, thus leading to vanishing gradients [42]. While dividing by the fixed scaling factor $\sqrt{d}$ in softmax attention helps counter this phenomenon, the difference in feature magnitudes still remains.
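The saturation effect described above can be reproduced on toy numbers. The sketch below is an illustration under assumed values, not a measurement from the model: it scales one feature vector and reports the largest diagonal entry of the softmax Jacobian, $p_i (1 - p_i)$, as a proxy for gradient size.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_diag_grad(values):
    """Diagonal of the softmax Jacobian: p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in softmax(values)]

q_vec = [1.0, 2.0]                               # one projected feature
k_vecs = [[1.0, 1.0], [2.0, 0.5], [0.5, 2.0]]    # three features to attend to
d = len(q_vec)

def scaled_scores(scale):
    """Scaled dot-product scores after multiplying q_vec by `scale`."""
    q = [scale * x for x in q_vec]
    return [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_vecs]

grad_small = max(softmax_diag_grad(scaled_scores(1.0)))
grad_large = max(softmax_diag_grad(scaled_scores(50.0)))
```

With the larger magnitude the softmax output is effectively one-hot, so the gradient proxy collapses toward zero even though the fixed $\sqrt{d}$ scaling is still applied.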
Therefore, to remove the effect of the differences in the vectors' magnitudes, we replace the dot product with cosine similarity for calculating A. We clarify that the concept of replacing the dot-product operation with cosine similarity is not new: it has been studied in [28] and applied in few-shot image classification in [8,16] in the form of a cosine similarity-based classification/recognition model. In this study, using cosine similarity in the attention mechanism helps us highlight the alignment between two representations based on the features' content. Specifically, instead of using a fixed number for scaling the entire weight matrix, each component of the multiplied matrix is divided by the product of its corresponding vectors' magnitudes. We refer to this attention mechanism as "Cosine Attention" (Cos Attn), given by Equation 6 and based on the definition of cosine similarity for matrices in Equation 4:

$$h_a = \mathrm{CosAttn}(q^*, k, v) = S_C(q^*, k^\top)\, v \quad (6)$$
With cosine similarity, the attention map A focuses more on the features' content and determines a better correlation matrix between every pair of features $q^*_i \in q^*$, $k_j \in k$. Furthermore, the output distribution of cosine attention can remain stable even if the input magnitude varies [28]. By removing the magnitude, the output of cosine similarity is bounded in the range [−1, 1], indicating the similarity between the two features. Thus, the softmax function is no longer necessary for scaling the values. Without the softmax, the attention map A still maintains the probabilistic distributions in the row vectors as well as the ratios between its components. Moreover, as cosine similarity does not scale the weight distributions to sum to 1, $a_{i,j} \in A$ has a wider range of values. This helps emphasize query features on aligned categories in $h_a$ and suppress them on unaligned ones, hence boosting the model's performance. In our empirical experiments, removing the softmax operation in cosine attention increases the performance of our proposed FS-CT significantly, and normalizing q and k before feeding them to softmax attention does not produce an attention map as adequate as using cosine attention alone.
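A single-head version of cosine attention can be sketched as follows. This is a minimal illustration, assuming that $q^*$ is projected from the n prototypes and that $k$, $v$ are projected from the q query features, which is how we read the tensor shapes above; the projections themselves are omitted, and `cosine_attention` and the `eps` guard are hypothetical.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)

def cosine_attention(q_star, k, v):
    """q_star: n prototype vectors; k, v: q query vectors.
    Returns the attention map A (q x n) and the output h (q x n x d)."""
    n, q = len(q_star), len(k)
    # No softmax: each entry is a raw cosine score in [-1, 1].
    A = [[cosine(q_star[c], k[i]) for c in range(n)] for i in range(q)]
    # Broadcasting turns the product into a scaling of v[i] by A[i][c].
    h = [[[A[i][c] * x for x in v[i]] for c in range(n)] for i in range(q)]
    return A, h

q_star = [[1.0, 0.0], [0.0, 1.0]]   # n = 2 prototypes
k = [[2.0, 0.0], [1.0, 1.0]]        # q = 2 query samples
v = [[1.0, 2.0], [3.0, 4.0]]
A, h = cosine_attention(q_star, k, v)

# Rescaling either input set leaves the attention map unchanged.
A_scaled, _ = cosine_attention([[10.0 * x for x in p] for p in q_star],
                               [[0.1 * x for x in s] for s in k], v)
```

The second row of `A` sums to more than 1, which illustrates that the map is not renormalized into a probability distribution, while the rescaled inputs give the same map because only the directions of the vectors matter.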
In our FS-CT method, we apply the multi-head mechanism to cosine attention. The initial three linear layers split $Z_P$ and $Z_Q$ into sets of $q^*_t$, $k_t$, $v_t$, where $t \in [1, 8]$. For each set, a corresponding attention output $h_a^t$ is computed by either softmax or cosine attention, representing the projection output from a different perspective. Then, $H_{out} \in \mathbb{R}^{q \times n \times d}$ is obtained with the output weight matrix $\theta_o \in \mathbb{R}^{d_h \times d}$ by:

$$H_{out} = \big[h_a^1; \dots; h_a^8\big]\, \theta_o$$

Cosine linear layer for query prediction. After the attention block, two skip connections are applied, $H_{out} = (Z_P + H_{out}) + \mathrm{FFN}(Z_P + H_{out})$, with layer normalization before each step. The feed-forward network FFN consists of two linear layers with a GELU [19] activation function in between. Given the final output feature $H_{out} \in \mathbb{R}^{q \times n \times d}$, a linear layer with weight $\theta_{out} \in \mathbb{R}^{d \times 1}$ is applied, followed by softmax, to obtain $P_{out} \in \mathbb{R}^{q \times n}$, which represents the probabilistic prediction for every query feature over the n categories. Instead of a conventional linear layer, we use the cosine linear layer from [8]: rather than performing the dot product between $H_{out}$ and $\theta_{out}$, the cosine similarity $S_C(a, b)$ of Equation 3 is computed between the two L2-normalized tensors. Replacing the conventional dot-product operation with cosine similarity helps us achieve a better prediction score for $P_{out}$. Overall, the probabilistic prediction $p(c \mid h_{q,c}; \theta_{out})$ for the representation score $h_{q,c}$ on label c of query sample $x_q$ over n categories and the predicted label $\hat{y}$ are calculated by:

$$p(c \mid h_{q,c}; \theta_{out}) = \frac{\exp\big(S_C(h_{q,c}, \theta_{out})\big)}{\sum_{c'=1}^{n} \exp\big(S_C(h_{q,c'}, \theta_{out})\big)}, \qquad \hat{y} = \arg\max_{c} \; p(c \mid h_{q,c}; \theta_{out})$$
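The prediction step with a cosine linear layer can be sketched on toy features. This is an illustration only: `predict` is a hypothetical helper, the values are made up, and no temperature or scale factor is applied to the cosine scores because the text above does not mention one.

```python
import math

def cosine(u, v, eps=1e-12):
    """Cosine similarity between two vectors, guarded against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(x - m) for x in values]
    total = sum(exps)
    return [e / total for e in exps]

def predict(h_out, theta_out):
    """h_out: q x n x d features; theta_out: length-d weight vector.
    Returns per-query category probabilities and predicted labels."""
    probs, labels = [], []
    for per_category in h_out:
        scores = [cosine(h_c, theta_out) for h_c in per_category]
        p = softmax(scores)
        probs.append(p)
        labels.append(max(range(len(p)), key=p.__getitem__))
    return probs, labels

h_out = [[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],   # query 0 against 3 categories
         [[0.0, 1.0], [3.0, 0.1], [1.0, 1.0]]]    # query 1
theta_out = [2.0, 0.0]
p_out, y_hat = predict(h_out, theta_out)
```

Because the scores are cosines, the magnitude of `theta_out` and of each feature vector has no influence on the ranking of categories.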

Episodic Training
We train our proposed FS-CT method with an episodic learning strategy, presented in detail in Algorithm 1. For each training step, a task $T = \{S, Q\}$ with $n$ categories is selected randomly from $D_{train}$. For convenience, all learnable parameters are referred to as a general parameter $\theta$. Running FS-CT, including the backbone feature extractor $f(\cdot \mid \theta_f)$, yields the probabilistic prediction $p(y \mid x_Q, S; \theta)$ of sample $x_Q \in Q$ on label $y$ given $S$. Finally, the Categorical Cross-entropy loss is applied to update the parameters at the end of each training step, as in Equation 1.
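The task selection step of this strategy could be sketched as follows. This is a minimal illustration of $n$-way $k$-shot episode sampling in plain Python, assuming a hypothetical dataset stored as a mapping from category name to a list of sample identifiers. The function name, the toy data, and the seeding are assumptions and are not taken from the official code.

```python
import random


def sample_episode(dataset, n_way, k_shot, n_query, rng):
    """Draw one task T = {S, Q}: n_way categories, k_shot supports and
    n_query queries per category, with episode-local labels 0..n_way-1."""
    categories = rng.sample(sorted(dataset), n_way)
    support, query = [], []
    for label, category in enumerate(categories):
        picked = rng.sample(dataset[category], k_shot + n_query)
        support += [(sample, label) for sample in picked[:k_shot]]
        query += [(sample, label) for sample in picked[k_shot:]]
    return support, query


# Hypothetical toy dataset: 10 categories with 30 sample ids each.
toy = {f"class_{c}": [f"img_{c}_{i}" for i in range(30)] for c in range(10)}
rng = random.Random(0)
support, query = sample_episode(toy, n_way=5, k_shot=5, n_query=16, rng=rng)
```

Support and query samples are drawn without replacement from the same categories, so the two sets share labels but never share images.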
Experiment Results

Algorithm 1 (continued):
8 Compute the loss function $\mathcal{L}$ as in Equation (1).
9 Perform a gradient descent step to update the parameters: $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$
end

The mini-ImageNet dataset consists of 100 different categories with 600 image samples each, every image having a size of 84 × 84 pixels. In our implementation, we used the splits by Ravi and Larochelle [32], which include 64 training categories, 16 validation categories, and 20 testing categories. CIFAR-FS includes 100 categories with 600 images per label, each of size 32 × 32 pixels. The splits of this dataset are similar to those of mini-ImageNet. The CUB-200 dataset contains 200 categories of bird species with 11,788 images of 84 × 84 pixels, which are divided into 100 categories for training, 50 categories for validation, and 50 categories for testing.
Besides the three few-shot datasets above, we also created a custom dataset for yoga pose scoring, including 50 categories of main yoga poses with 2,480 images. We developed this small-scale dataset as an initial step toward building a smart monitoring and study scheme for yoga participants. The dataset is partially derived from Kaggle [34] and stored with our implementation code on GitHub. The number of images ranges from 30 to 81 samples per category, with arbitrary sizes. Furthermore, samples within some categories differ in viewpoint, appearance, or visual condition, which makes the dataset more challenging. We split the dataset into 25 categories for training, 13 categories for validation, and 12 categories for testing. Some example poses from the dataset are presented in Fig. 3, with the statistics of the dataset distribution presented in Table 1.

Implementation Details
We implemented our method and conducted experiments in PyTorch, using an Intel Core i9-10900X 3.7GHz CPU with an NVIDIA GeForce RTX 3090 24GB GPU and 16GB of RAM. Methods were trained with a fixed learning rate of 0.001 without a scheduler, the AdamW [27] optimizer with weight decay $1 \times 10^{-5}$, and no dropout. The model is optimized with the Categorical Cross-entropy loss in Equation 1. These hyper-parameters are fixed to make a fair comparison between all experiment scenarios. We evaluated two configurations, 5-way 5-shot and 5-way 1-shot, with 16 query samples for each category, making a total of 80 queries. All models are trained for 50 epochs with 200 episodic batches (episodes) each. Each training epoch is followed by a validation step with 200 episodes to select the best-performing model for the testing phase on 600 episodes. The training, validation, and testing sets are disjoint in categories. To increase the number of training samples, we applied augmentation, including random resizing, cropping, horizontal flipping, color jittering, and image normalization. All experiments were conducted on the same common ground of code, settings, and environment for a fair evaluation and comparison.

* For mini-ImageNet, we will have a different experiment using more suitable pre-trained backbones for a fair comparison.
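The episodic configuration described in this section can be summarized in a small sketch that makes the implied sample counts explicit. The class and field names below are hypothetical; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodicConfig:
    """Hypothetical container for the settings reported in the text."""
    n_way: int = 5
    k_shot: int = 5
    n_query: int = 16                    # query samples per category
    epochs: int = 50
    train_episodes_per_epoch: int = 200
    val_episodes_per_epoch: int = 200
    test_episodes: int = 600
    learning_rate: float = 1e-3          # fixed, no scheduler
    weight_decay: float = 1e-5           # used with AdamW

    @property
    def queries_per_episode(self) -> int:
        return self.n_way * self.n_query

    @property
    def supports_per_episode(self) -> int:
        return self.n_way * self.k_shot

    @property
    def total_train_episodes(self) -> int:
        return self.epochs * self.train_episodes_per_epoch


five_shot = EpisodicConfig(k_shot=5)
one_shot = EpisodicConfig(k_shot=1)
```

With these values, each episode contains 80 queries, the 5-shot and 1-shot settings use 25 and 5 support samples respectively, and a full training run covers 10,000 episodes.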
For backbone feature extraction, we mainly utilized four backbone models, Conv4, Conv6, ResNet-18, and ResNet-34 [18], for the experiments. Conv4 and Conv6 are lightweight CNN models with 4 and 6 layers, respectively, and are trained from scratch without pre-training. These models have been used in previous studies on few-shot classification [8,40,43]. On the other hand, the ResNet backbone networks pre-trained on ImageNet are available in Torchvision. However, as mini-ImageNet is a subset of ImageNet, we deployed a special pre-trained model named FETI [3] for evaluating on that dataset, which will be described in detail later. For each type of backbone architecture, we resized sample images before training, depending on the dataset and backbone model. In particular, with CNN backbones, we resize images to 64 × 64 pixels for CIFAR-FS (due to its small original size) and 84 × 84 pixels for the other datasets; with ResNet backbones, the resized input images are 112 × 112 and 224 × 224 pixels, respectively. For all experiments, we report accuracy in percentage (Equation 10) as the sole metric and use it for all comparisons. The official implementation of our Few-shot Cosine Transformer and all experimental configurations are available on our GitHub.

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \times 100 \qquad (10)$$
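As a small illustration of this metric, the sketch below computes the accuracy of one episode and the mean over several episodes, in line with reporting the mean over randomly generated test episodes. The function names and toy predictions are hypothetical.

```python
def accuracy(predictions, labels):
    """Percentage of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and of equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels) * 100


def mean_accuracy(episodes):
    """Mean accuracy over a list of (predictions, labels) pairs, one per episode."""
    scores = [accuracy(preds, labels) for preds, labels in episodes]
    return sum(scores) / len(scores)


# Two toy episodes with four queries each (illustrative values only).
episode_a = ([0, 1, 2, 2], [0, 1, 1, 2])  # 3 of 4 correct
episode_b = ([3, 3, 4, 0], [3, 3, 4, 0])  # 4 of 4 correct
per_episode = [accuracy(*episode_a), accuracy(*episode_b)]
overall = mean_accuracy([episode_a, episode_b])
```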

Ablation study
Evaluation on mini-ImageNet, CIFAR-FS, and CUB-200 datasets. For this experiment, besides the FS-CT model, we also deploy another attention-based few-shot learning algorithm, CTX [11], as a baseline for comparison with our proposed FS-CT method. We applied both attention mechanisms, the baseline softmax attention (Equation 5) and our proposed cosine attention (Equation 6), to the two few-shot methods for our ablation evaluation. The two main experiment results on the mini-ImageNet, CIFAR-FS, and CUB-200 datasets are presented in Table 2 for the FS-CT method only, with the two embedding backbones Conv4 and Conv6, and in Table 3 for both CTX and FS-CT, with the ResNet-18 and ResNet-34 backbones. Both experiments are conducted in 5-way 1-shot and 5-way 5-shot settings. We use the full ImageNet pre-trained models on ResNet backbones for Table 3, and a separate experiment for a fair evaluation on mini-ImageNet is presented in a later section. Data augmentation is applied to our proposed FS-CT method only.

[Fig. 4 legend: CTX Softmax, CTX Cosine, FS-CT Softmax, FS-CT Softmax (Aug), FS-CT Cosine, FS-CT Cosine (Aug)]
In general, our FS-CT outperformed CTX across ResNet backbones, datasets, and few-shot settings in most cases, with both softmax attention and cosine attention. The only exceptions occur on mini-ImageNet, where CTX outperformed FS-CT on 5-shot learning with the ResNet-18 backbone and on 1-shot learning with the ResNet-34 backbone (Table 3). However, in both cases, CTX is equipped with our proposed cosine attention mechanism rather than its original softmax attention. On the other hand, cosine attention helps few-shot algorithms outperform the standard softmax mechanism across all backbones, few-shot settings, and datasets, with improvements ranging from nearly 5% to over 20% across cases. Overall, 5-shot learning yields better performance than 1-shot learning. This is typical of few-shot algorithms, as more labeled samples give a better centroid representation for each category and thus better query classification. Furthermore, augmentation helps improve classifier performance on FS-CT, mainly in the 5-shot setting. In particular, in Table 3, the second-best and best results are mainly FS-CT using cosine attention and its counterpart trained with augmentation, respectively. There are some occasions when augmentation does not improve performance in one-shot learning. A possible explanation is that augmentation increases the noise in the categorical representation, which affects one-shot performance. Augmentation seems to be effective on ResNet backbones, as FS-CT using cosine attention with augmentation mainly achieves the best result within individual scenarios. Across scenarios, deeper backbones come with better performance, as increasing the number of layers helps both CNN and ResNet backbones achieve higher results. Moreover, the models' performance is heavily affected by the choice of backbone: in most cases, the ResNet-34 backbone as feature extractor yields the highest performance among the four. These observations are further illustrated in Fig. 4, where the line graphs present the test accuracies of the CTX and FS-CT variants on the ResNet-34 backbone across the few-shot settings and datasets of Table 3.
Performances of two attention mechanisms. Fig. 5 illustrates the attention outputs as correlation heatmaps for the baseline softmax attention (top) and the improved cosine attention (bottom) in 5-way 5-shot learning, from the early training phase (left) to the later training phase (right). Each heatmap point represents the similarity frequency between query samples (x-axis) from one category and the corresponding prototypical representation (y-axis) of the same category. All heatmaps are obtained from the same few-shot task derived from the testing set of mini-ImageNet. Cosine attention results in a more robust heatmap, as it generates a stronger similarity matrix between the query and support samples (through proto-representations) that share the same ground truth, which lie along the main diagonal. In the early epoch, the attention heatmap produced by cosine attention is similar to, if not better than, the softmax attention heatmap from the later epoch. The strong diagonal connection between queries and support representations with the same categorical ground truth becomes more apparent with cosine attention as training progresses, resulting in a more stable heatmap. This emphasizes the robustness of our improved cosine attention for the few-shot classification task compared with the baseline softmax attention. Additionally, applying cosine attention without normalizing features is more effective than our early attempts to normalize the two input feature sets before calculating the softmax attention. This further highlights the robust improvement of our cosine attention for the cross-attention mechanism compared with the baseline scaled dot-product attention. Furthermore, the training graphs in Fig. 4 and Fig. 6 also show that, under the same configuration (dataset, few-shot setting, backbone, and attention mechanism), our FS-CT learns and performs significantly better than the baseline CTX.
Performances of FS-CT on mini-ImageNet with partially pre-trained model. As the mini-ImageNet dataset is a subset of the ImageNet dataset, using a default ResNet pre-trained on the full ImageNet as the feature extractor gives a natural advantage on mini-ImageNet, as shown by the very high performance in Table 3. It would therefore be unfair to compare these FS-CT results with other few-shot classification methods. To tackle the problem, we used a specific pre-trained model that had been trained on a subset of ImageNet that does not overlap with the testing set of mini-ImageNet. We adapted this pre-trained model, called "Feature Extractor Trained (partially) on ImageNet", or FETI for short, from [3]. Because the pre-trained model was trained only on ResNet-18, we used the two backbone models ResNet-18 and ResNet-12, as ResNet-34 is too different in layer number and layer size, whereas ResNet-12 is roughly comparable to ResNet-18 in architecture. Table 4 shows our FS-CT performance on mini-ImageNet with the FETI pre-trained model. In general, the performances are lower than the results in Table 3. This emphasizes the necessity of having a good embedding through a pre-trained model in order to address the few-shot problem, as pointed out in [41]. The results show that FS-CT with cosine attention still outperforms the baseline softmax attention, demonstrating the robust learning capability of our proposed attention mechanism.
Evaluation on Yoga poses dataset. For the custom Yoga poses dataset, the results are presented separately in Table 5. While cosine attention still yields more robust performance than the baseline softmax attention, augmentation overall does not seem to improve the results. The best and second-best performances are mainly FS-CT and CTX, both with cosine attention, while training with augmentation resulted in much lower performance for FS-CT. Our hypothesis is that this phenomenon is caused by the significant differences between hard cases and the visual variation within the same pose category in the dataset. Still, these results showcase the potential of few-shot learning algorithms in general, and of our proposed FS-CT in particular, for practical healthcare applications, leading to our future studies on developing a smart monitoring and scoring system for Yoga learners on downstream devices such as smartphones.

Discussion
We have proposed a transformer-based method for the few-shot classification task with an attention mechanism using cosine similarity. We find that the algorithm of our proposed Few-shot Cosine Transformer (FS-CT), along with the improved cosine attention, is straightforward and simple to implement, with the detailed code implementation and experimental configurations presented in the previous section. Our experiments and ablation studies indicate that cosine similarity helps the attention mechanism produce a better and more consistent correlational map as attention output and enhances our framework's performance across configurations, backbones, and few-shot settings. However, while our current results on various few-shot datasets are promising, there are some limitations that should be considered in future research. First, our method's performance depends heavily on the choice of embedding backbone, particularly on backbones with a pre-trained model. While pre-trained backbones provide a good embedding representation that helps few-shot learning algorithms perform well, the dependence on the pre-trained model and its impact should be further investigated in future studies. Second, the complexity of the architecture may prevent FS-CT from reaching higher performance levels. Although skip connections were used to preserve information, it is possible that this was insufficient. Many few-shot approaches, including those from [8,38,40,43], share a straightforward pipeline but perform well across few-shot datasets. We want to continue this line of research. Third, while we consider our improved learnable prototypical embedding a simple method to address the support variation challenge and hard samples in few-shot learning, further exploration is needed to shed light on this improvement, especially when the training and testing sets come from two disjoint domains. Moreover, we believe that more efficient improvements to the prototype network are yet to be discovered, and the balance between hard and easy samples has not been fully investigated. Last, due to our limited resources, we were only able to compare our proposed cosine attention with the standard scaled dot-product attention, leaving out other recent variants of the attention mechanism. We suggest that future studies based on our work consider a more comprehensive comparison across recent variations of the attention mechanism for vision transformer-based algorithms (not limited to few-shot learning or image classification). We leave these limitations and discussions for future studies.

Conclusion
In this study, we introduce Few-shot Cosine Transformer (FS-CT), a lightweight and straightforward transductive learning method for the few-shot image classification task based on the prototypical network and vision transformer. We made two improvements to our framework: (i) a learnable prototypical embedding to balance between easy and hard samples of the provided labeled support instances and (ii) cosine attention based on cosine similarity to compute the correlational map between support and query samples for few-shot recognition. Through extensive experiments and analysis, we show that cosine similarity helps the attention mechanism provide a better and more consistent attention output as the correlational map, enabling FS-CT to achieve competitive results across few-shot datasets under various settings and configurations. The empirical results further show that our proposed cosine attention also enhances the performance of other vision transformer-based few-shot algorithms. Finally, we showcase the potential of FS-CT for practical applications in healthcare research via a custom yoga pose dataset. However, the potential of the proposed learnable prototypical embedding in dealing with hard samples and the impact of pre-trained models on few-shot learning algorithms should be investigated in further studies.

Fig. 1: Formulation of the few-shot learning problem, including the training set $D_{train}$ with $m$ tasks and the testing set $D_{test}$ with $v$ tasks. Here, $m$ and $v$ could be equal or different. Each task $T$ comes with a different set of categories and consists of a labeled support set $S$ and an unlabeled query set $Q$ that share the same categories. The support set $S$ follows the $n$-way $k$-shot setting. $D_{train}$ and $D_{test}$ are disjoint in categories. A few-shot learner $\mathcal{A}$ is trained on $D_{train}$ and tested on $D_{test}$.

Algorithm 1: Episodic training algorithm of FS-CT over one training epoch with $N$ tasks (episodic batch). Each task $T_i$ is chosen randomly from the training set $D_{train}$ with a different set of categories to train and update the general parameter $\theta$ of our proposed FS-CT, including the embedding backbone $f(\cdot \mid \theta_f)$.
Input: Training set $D_{train}$. The number of learning tasks $N$. Learnable parameters $\theta$, learning rate $\alpha$
Output: Updated parameters $\theta$
for $i$ in $\{1, ..., N\}$ do
1 Randomly choose task $T_i = \{S, Q\} \sim p(D_{train})$.
2 Obtain feature representations $Z_S$, $Z_Q$ from $S$ and $Q$ by the backbone $f(\cdot \mid \theta_f)$
3 Calculate the prototypical representation $Z_P$ using Equation (2).

Fig. 3: Several exemplary samples for categorical poses of the yoga poses dataset, consisting of 50 different poses with a total of 2,480 images.

Fig. 4: Testing accuracies of CTX and FS-CT across the three few-shot datasets with softmax and cosine attention mechanisms, and with augmentation for FS-CT, using ResNet-34 as the backbone. Generally, FS-CT achieved higher accuracies than CTX, with the only exception on mini-ImageNet, where the performances of CTX are nearly equal to (in 5-shot learning) or higher than (in 1-shot learning) those of FS-CT. However, CTX only achieves these performances with our cosine attention mechanism as the core. Cosine attention improves the performance of both few-shot methods, and data augmentation further enhances them.

Fig. 6: Accuracy of FS-CT and CTX throughout training for the 1-shot and 5-shot settings with the ResNet-34 backbone, using either the standard softmax attention or the proposed cosine attention mechanism. Cosine attention significantly improves both few-shot methods across settings and datasets, with higher starting points and better training curves. FS-CT achieves better training performance than its counterpart CTX with both attention mechanisms.

Table 1: Statistical description of the custom image dataset for Yoga poses over the three main sets for training, validation, and testing of few-shot image classification methods.

Table 2: Performance of our proposed FS-CT for the 5-way setting on mini-ImageNet, CUB-200, and CIFAR-FS, using either the baseline softmax attention (Soft Attn) or the proposed cosine attention (Cos Attn) with the two shallow backbones Conv4 and Conv6, 50 training epochs, and data augmentation. Similar to [8], we report the mean of 600 randomly generated test episodic tasks. The best and second best results are bolded and underlined, respectively. The evaluation metric is accuracy in percentage.

Table 3: Performance of the baseline CTX [11] and our proposed FS-CT for the 5-way setting on the three datasets mini-ImageNet, CUB-200, and CIFAR-FS, using the baseline softmax attention (Soft Attn) or the proposed cosine attention (Cos Attn), with the two embedding backbones ResNet-18 and ResNet-34, pre-trained on ImageNet. Data augmentation is applied to our FS-CT method only. Similar to Table 2, we validate the methods with 600 random episodic tasks and report the mean value. The best and second best results are bolded and underlined, respectively. The evaluation metric is accuracy in percentage.

Table 4: Performance of FS-CT on mini-ImageNet with the pre-trained model FETI (Feature Extractor Trained (partially) on ImageNet), which was trained on ImageNet classes that do not overlap with the test set, to avoid the natural advantage of an ImageNet pre-trained model. We use the two supported backbones ResNet-12 and ResNet-18 for the pre-trained model, with the same validation scheme and configurations.

Table 5: Evaluation of the baseline CTX and our proposed FS-CT on the custom Yoga dataset using the two backbones ResNet-18 and ResNet-34 with 50 training epochs and 600 random testing tasks. The best and second best results are bolded and underlined, respectively. The evaluation metric is accuracy in percentage.

Method                      ResNet-18 1-shot   ResNet-18 5-shot   ResNet-34 1-shot   ResNet-34 5-shot
FS-CT (ours) + Soft Attn    61.40              69.15              58.89              71.12
FS-CT + Soft Attn + Aug     51.89              66.21              55.08              70.12
FS-CT + Cos Attn            66.38              80.34              64.32              77.66
FS-CT + Cos Attn + Aug      57.98              73.58              59.76              77.92