I. Introduction
Fine-grained image classification [1]–[3], which involves distinguishing between visually similar classes, plays a crucial role in various applications such as species identification, product categorization, and medical diagnostics. Despite the remarkable success of deep learning in computer vision [4]–[9], achieving high accuracy in fine-grained classification remains challenging due to the scarcity of labeled data and the subtlety of distinguishing features [10].