
Spatial-Enhanced Multi-Level Wavelet Patching in Vision Transformers


Abstract:

By integrating wavelet transforms into the image patching stage of ViT, we leverage multi-level wavelet transforms to decompose images into a diverse array of frequency-domain features. These features, fused with spatial characteristics at equivalent scales, enrich image details and enhance ViT's ability to delineate intricate textures and distinct edges. As a result, we achieve a notable 2.7% accuracy improvement on the ImageNet100 dataset with ViT. Our wavelet patching module is designed for versatility and fits into various ViT derivatives without architecture modifications. It improves the performance of several leading vision transformers by 0.46–4.3% while preserving parameter efficiency and adding no notable FLOPs.
Published in: IEEE Signal Processing Letters ( Volume: 31)
Page(s): 446 - 450
Date of Publication: 12 January 2024


I. Introduction

The Vision Transformer (ViT) [1] tokenizes images into fixed-size patches, employing Transformer layers akin to language models to determine inter-token relationships for image classification. However, this method often overlooks the vital local nuances [2], [3] within each patch, notably textures [4], edges [5], and lines, requiring larger training datasets to match CNN benchmarks [6]. In signal processing, techniques like the discrete wavelet transform (DWT) can distinguish such features across varied frequency bands and efficiently spotlight these obscured local features. Nevertheless, many ViT variants sidestep patch-processing enhancements.
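The paper's wavelet patching module is not reproduced here, but the multi-level 2D DWT it builds on can be sketched with a Haar transform in NumPy. The helper names (`haar_dwt2`, `multilevel_dwt2`) are illustrative, not from the paper; each level halves the spatial resolution and yields one approximation band plus three detail subbands capturing horizontal, vertical, and diagonal structure (edges, lines, textures):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT: returns the LL (approximation) band and
    (LH, HL, HH) detail subbands, each at half the input resolution."""
    # Pairwise low-/high-pass filtering along rows
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row low-pass
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row high-pass
    # Then along columns, giving the four subbands
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # approximation
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal details
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical details
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal details
    return ll, (lh, hl, hh)

def multilevel_dwt2(x, levels):
    """Multi-level decomposition: recursively transform the LL band,
    producing a coarse-to-fine pyramid of frequency-domain features."""
    bands = []
    ll = x
    for _ in range(levels):
        ll, details = haar_dwt2(ll)
        bands.append(details)
    return ll, bands

img = np.random.rand(224, 224).astype(np.float32)
ll, bands = multilevel_dwt2(img, levels=2)
# 224 -> 112 -> 56: the final approximation is 56x56, and each level
# contributes three detail subbands at its own resolution.
```

In a patching pipeline along the lines the abstract describes, subbands at a given level could be concatenated with spatial features at the same scale before tokenization, since their resolutions already match.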
