Abstract:
This paper introduces an innovative and efficient multi-scale Vision Transformer (ViT) for image classification. The proposed model leverages the inherent power of the transformer architecture and combines it with the multi-scale processing commonly used in convolutional neural networks (CNNs). The work aims to address a limitation of conventional ViTs, which typically operate at a single scale and hence overlook the hierarchical structure of visual data. The multi-scale ViT enhances classification performance by processing image features at different scales, effectively capturing both low-level and high-level semantic information. Extensive experimental results demonstrate the superior performance of the proposed model over standard ViTs and other state-of-the-art image classification methods, signifying the effectiveness of the multi-scale approach. This research opens new avenues for incorporating scale variance in transformer-based models for improved performance in vision tasks.
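The abstract does not specify how scales are fused, so the following is only a minimal illustrative sketch of one common multi-scale tokenization scheme: tokenize the image at several patch sizes, project each scale to a shared embedding width, and concatenate the token sequences before a single transformer encoder. All function names, the patch sizes, and the random linear projections are hypothetical and are not taken from the paper.

```python
import numpy as np

def extract_patches(img, patch):
    # img: (H, W, C). Split into non-overlapping patch x patch tiles,
    # each flattened to a vector of length patch*patch*C.
    H, W, C = img.shape
    rows, cols = H // patch, W // patch
    tiles = img[:rows * patch, :cols * patch].reshape(rows, patch, cols, patch, C)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(rows * cols, patch * patch * C)

def multi_scale_tokens(img, patch_sizes, d_model, rng):
    # Hypothetical multi-scale tokenizer: one token sequence per patch size,
    # each linearly projected to d_model, then concatenated along the
    # token axis so a single transformer can attend across scales.
    tokens = []
    for p in patch_sizes:
        patches = extract_patches(img, p)                      # (N_p, p*p*C)
        proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
        tokens.append(patches @ proj)                          # (N_p, d_model)
    return np.concatenate(tokens, axis=0)                      # (sum N_p, d_model)

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))
toks = multi_scale_tokens(img, patch_sizes=(8, 16), d_model=64, rng=rng)
print(toks.shape)  # 16 coarse-scale-8 tokens + 4 scale-16 tokens -> (20, 64)
```

In this sketch a 32x32 image yields 16 tokens at patch size 8 and 4 tokens at patch size 16, so coarse tokens carry higher-level context while fine tokens preserve detail; the paper's actual fusion mechanism may differ.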
Published in: 2023 International Conference on Image Processing, Computer Vision and Machine Learning (ICICML)
Date of Conference: 03-05 November 2023
Date Added to IEEE Xplore: 13 February 2024