
An Efficient Convolutional Multi-Scale Vision Transformer for Image Classification

Abstract:

This paper introduces an efficient multi-scale Vision Transformer (ViT) for image classification. The proposed model leverages the inherent power of the transformer architecture and combines it with the multi-scale processing commonly used in convolutional neural networks (CNNs). The work aims to address a limitation of conventional ViTs, which typically operate at a single scale and thereby overlook the hierarchical structure of visual data. The multi-scale ViT improves classification performance by processing image features at multiple scales, capturing both low-level and high-level semantic information. Extensive experiments show that the proposed model outperforms standard ViTs and other state-of-the-art image classification methods, demonstrating the effectiveness of the multi-scale approach. This research opens new avenues for incorporating scale awareness into transformer-based models for improved performance on vision tasks.
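The abstract does not give implementation details, so the following is only a minimal, hypothetical sketch of the general idea it describes: a convolutional multi-scale tokenizer feeding a standard transformer encoder. The image is embedded with strided convolutions at several patch sizes, the resulting token sequences are concatenated, and classification is done from a class token. All module names, patch sizes, depths, and dimensions below are illustrative assumptions, not the authors' configuration (positional embeddings are omitted for brevity).

```python
import torch
import torch.nn as nn


class MultiScalePatchEmbed(nn.Module):
    """Embed an image at several scales with strided convolutions and
    concatenate the resulting token sequences (hypothetical design)."""

    def __init__(self, in_chans=3, embed_dim=192, patch_sizes=(8, 16, 32)):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(in_chans, embed_dim, kernel_size=p, stride=p)
            for p in patch_sizes
        )

    def forward(self, x):
        tokens = []
        for proj in self.projs:
            t = proj(x)                                   # (B, C, H/p, W/p)
            tokens.append(t.flatten(2).transpose(1, 2))   # (B, N_p, C)
        return torch.cat(tokens, dim=1)                   # tokens from all scales


class MultiScaleViT(nn.Module):
    """Convolutional multi-scale tokenizer followed by a standard
    transformer encoder and a linear classification head."""

    def __init__(self, num_classes=1000, embed_dim=192, depth=6, heads=3):
        super().__init__()
        self.embed = MultiScalePatchEmbed(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        tok = self.embed(x)                               # multi-scale tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)   # prepend class token
        tok = torch.cat([cls, tok], dim=1)
        tok = self.encoder(tok)
        return self.head(self.norm(tok[:, 0]))            # classify from [CLS]


if __name__ == "__main__":
    model = MultiScaleViT(num_classes=10)
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 10])
```

In this sketch the token sequence mixes coarse and fine patches, so self-attention can relate low-level detail to high-level context directly; the paper's actual mechanism for fusing scales may differ.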
Date of Conference: 03-05 November 2023
Date Added to IEEE Xplore: 13 February 2024
Conference Location: Chengdu, China