
Visual State Space Model With Graph-Based Feature Aggregation for No-Reference Image Quality Assessment



Abstract:

Inspired by the human visual system (HVS), no-reference image quality assessment (NR-IQA) has made significant progress without relying on pristine reference images. When capturing subjective perceived quality, the HVS is primarily influenced by the combined effects of representational information with different receptive fields and attribute categories. However, existing methods only roughly or partially utilize representations of multi-dimensional information. Furthermore, current NR-IQA methods either rely on convolutional neural networks (CNNs), whose perception is limited to local regions, or on vision transformers (ViTs), which incur high computational complexity. To compensate for the shortcomings of these two architectures, the emerging visual state space model (VMamba) is introduced. Motivated by this, this paper presents an NR-IQA method via a VIsual State space model with Graph-based feature Aggregation (VISGA). Specifically, we utilize a plain, pre-training-free, and feature-enhanced VMamba as the backbone. To align with the perceptual mechanisms of the HVS by effectively using features with different dimensional information, a graph convolutional network-based multi-receptive field and multi-level aggregation module is designed to deeply explore the correlations and interactions of multi-dimensional representations. Additionally, we propose a gated local enhancement module with patch-wise perception to strengthen VMamba's local perception. Extensive experiments conducted on seven databases demonstrate that VISGA achieves outstanding performance. Notably, our model remains state-of-the-art when trained with very few parameters. The code is released at https://github.com/xirihao/VISGA.
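As a rough illustration of the gated local enhancement idea mentioned in the abstract, the following minimal PyTorch sketch adds patch-wise local context to a feature map via a depthwise convolution and modulates it with a sigmoid gate. The module name, kernel size, and gating form are assumptions made for illustration; the actual module is defined in the released code.

import torch
import torch.nn as nn

class GatedLocalEnhancement(nn.Module):
    # Illustrative sketch only; not the released VISGA module.
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # Depthwise convolution provides cheap patch-wise local perception.
        self.local = nn.Conv2d(dim, dim, kernel_size,
                               padding=kernel_size // 2, groups=dim)
        self.gate = nn.Conv2d(dim, dim, 1)  # 1x1 conv produces the gate

    def forward(self, x):
        # x: (B, C, H, W) feature map from the backbone
        g = torch.sigmoid(self.gate(x))     # per-position, per-channel gate
        return x + g * self.local(x)        # gated residual enhancement

x = torch.randn(2, 64, 28, 28)
print(GatedLocalEnhancement(64)(x).shape)   # torch.Size([2, 64, 28, 28])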


I. Introduction

With the rise of social media and online content-sharing platforms, billions of images are generated and disseminated every day. During acquisition, compression, storage, and transmission, images are often affected by various types of distortion, compromising users' visual experience [1], [2]. Effective and reliable perceptual quality assessment is therefore essential when processing and sharing images [3], [4]. Subjective image quality assessment (IQA) relies on the human visual system (HVS) to evaluate image quality and is widely considered the most accurate method [5], but its time-consuming and labor-intensive nature greatly restricts its practical application. In contrast, objective IQA employs algorithms to simulate the HVS, allowing assessments to be conducted without human intervention and thereby significantly improving evaluation efficiency [6]. This has facilitated the widespread application of objective IQA across various domains, including image generation [7], image restoration [8], and image editing [9]. Objective IQA is categorized into three types based on the availability of reference images: full-reference IQA (FR-IQA) [10], reduced-reference IQA (RR-IQA) [11], and no-reference IQA (NR-IQA) [12]. Given that reference images are rarely available in practical applications, NR-IQA has attracted extensive attention in recent research; it is also the most challenging of the three categories.

Traditional IQA primarily relies on hand-crafted features, such as gradient information [13], wavelet components [14], and texture characteristics [15], to measure image distortion. Recent research has shifted towards deep learning-based methods, leveraging their powerful representation capabilities to improve assessment performance. Because the HVS's perception of image quality is closely tied to distortions in both the background and prominent objects, existing IQA methods typically regress quality scores from multi-level features extracted from networks [16], [17], [18]. Among these features, low-level features capture fundamental image details, including texture, color, and edge information in both the foreground and background, while high-level features emphasize the semantic content of prominent objects, such as their categories and shapes. Effectively utilizing multi-level features therefore not only enhances the accuracy of IQA algorithms but also improves their applicability in complex scenes.

Based on how multi-level features are used, existing IQA methods can be categorized into four types: score averaging [19], final layer mapping [5], bottom-up [20], and top-down [6]. The score averaging approach illustrated in Fig. 1(a) first regresses each level of features into a score and then averages the scores; although it utilizes features from different levels, it treats them separately and fails to reflect the HVS's joint interpretation of detail and semantic information. The final layer mapping approach shown in Fig. 1(b) maps only the features of the final layer to a quality score, clearly underutilizing detail information. Figs. 1(c) and 1(d) illustrate the bottom-up and top-down approaches, respectively.
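To make the four schemes concrete, the following minimal PyTorch sketch contrasts them on pooled per-level feature vectors. All names (TinyHead, fuse, feats) and the concatenation-based fusion are hypothetical placeholders for illustration, not the implementation of any of the cited methods.

import torch
import torch.nn as nn

class TinyHead(nn.Module):
    # Maps one pooled feature vector to a scalar quality score.
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, 1)

    def forward(self, x):
        return self.fc(x)

def score_averaging(feats, heads):
    # Fig. 1(a): regress each level to a score, then average the scores.
    scores = [head(f).squeeze(-1) for f, head in zip(feats, heads)]
    return torch.stack(scores).mean(dim=0)

def final_layer_mapping(feats, head):
    # Fig. 1(b): use only the deepest (most semantic) level.
    return head(feats[-1]).squeeze(-1)

def bottom_up(feats, fuse, head):
    # Fig. 1(c): fuse shallow -> deep, so semantic features dominate.
    fused = feats[0]
    for f in feats[1:]:
        fused = fuse(torch.cat([fused, f], dim=-1))
    return head(fused).squeeze(-1)

def top_down(feats, fuse, head):
    # Fig. 1(d): fuse deep -> shallow, so detail features dominate.
    fused = feats[-1]
    for f in reversed(feats[:-1]):
        fused = fuse(torch.cat([fused, f], dim=-1))
    return head(fused).squeeze(-1)

dim = 64
feats = [torch.randn(8, dim) for _ in range(4)]   # four pooled feature levels
heads = nn.ModuleList(TinyHead(dim) for _ in feats)
fuse = nn.Linear(2 * dim, dim)
print(score_averaging(feats, heads).shape)        # torch.Size([8])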
While both of these approaches aggregate multi-level features in a reasonable way, the bottom-up approach emphasizes semantic features and focuses on distortions in salient objects, potentially overlooking distortions in the background, whereas the top-down approach prioritizes shallow detail features, resulting in weaker feature expression. Notably, IQA methods simulate the HVS's ability to simultaneously capture detail and semantic information by combining low-level and high-level features to reflect distortions [21], [22]. Given the non-Euclidean relationships between multi-level features, as shown in Fig. 1(e), this paper proposes a graph convolutional network (GCN)-based [23] method to measure quality more accurately and to enhance robustness against various complex distortions. As shown in Fig. 2(b), our evaluation results are closer to the ground truth (mean opinion score, MOS) under the different distortions presented in Fig. 2(a).
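To make the GCN-based aggregation concrete, here is a minimal sketch in the spirit of the GCN of [23]: each pooled feature level becomes one graph node, and a single normalized graph convolution lets detail and semantic nodes exchange information before a readout. The fully connected adjacency, feature dimension, and mean readout are illustrative assumptions, not the actual VISGA aggregation module.

import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    # One GCN layer: H' = ReLU(A_hat H W), where A_hat is the
    # symmetrically normalized adjacency with self-loops.
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)

    def forward(self, h, adj):
        a = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
        d_inv_sqrt = a.sum(dim=-1).pow(-0.5)
        a_hat = d_inv_sqrt.unsqueeze(-1) * a * d_inv_sqrt.unsqueeze(0)
        return torch.relu(a_hat @ self.weight(h))

# Treat each pooled feature level as one node; a fully connected graph
# lets detail and semantic nodes interact directly.
num_levels, dim = 4, 64
feats = torch.randn(num_levels, dim)              # pooled multi-level features
adj = torch.ones(num_levels, num_levels) - torch.eye(num_levels)
gcn = SimpleGCNLayer(dim)
aggregated = gcn(feats, adj).mean(dim=0)          # readout for the quality head
print(aggregated.shape)                           # torch.Size([64])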

Fig. 1. Five types of IQA feature usage: score averaging, final layer mapping, bottom-up, top-down, and our method.

Fig. 2. Comparison of predicted scores between our method and four existing methods under four distortion scenarios. (a) Examples of the four scenarios: (1) both background and objects are distorted; (2) only the background is distorted; (3) only the objects are distorted; (4) global distortion is present. (b) Predicted scores from the five methods along with the corresponding MOS values.

