I. Introduction
The Vision Transformer (ViT) [1] tokenizes images into fixed-size patches and applies Transformer layers, akin to those in language models, to model inter-token relationships for image classification. However, this approach often overlooks vital local details [2], [3] within each patch, notably textures [4], edges [5], and lines, so ViTs require larger training datasets to match CNN benchmarks [6]. In signal processing, techniques such as the discrete wavelet transform (DWT) can separate these features into distinct frequency bands, efficiently exposing local cues that patch tokenization obscures. Nevertheless, most ViT variants leave the patch-processing stage itself unimproved.
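To make the frequency-band separation concrete, the following minimal sketch applies a one-level 2D DWT to a single image patch. It assumes the PyWavelets package, and the 'haar' wavelet and 16x16 patch size are illustrative choices rather than the configuration used in this work; the decomposition yields a low-frequency approximation plus three high-frequency sub-bands where edges, lines, and textures are most prominent.

```python
# Minimal sketch: decompose one image patch into wavelet sub-bands.
# Assumes the PyWavelets package (pip install PyWavelets); the 'haar'
# wavelet and the 16x16 patch are illustrative, not this paper's setup.
import numpy as np
import pywt

patch = np.random.rand(16, 16)  # stand-in for one grayscale image patch

# One-level 2D DWT: LL is the low-frequency approximation; LH, HL, HH
# hold horizontal, vertical, and diagonal high-frequency details,
# where edges, lines, and fine textures are most visible.
LL, (LH, HL, HH) = pywt.dwt2(patch, 'haar')

print(LL.shape, LH.shape, HL.shape, HH.shape)  # each sub-band is (8, 8)
```

Because each sub-band is half the spatial resolution of the input, the transform isolates local structure at a lower cost than processing the full-resolution patch directly.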