Merging Super Resolution and Attribute Learning for Low-Resolution Person Attribute Recognition

In video surveillance, visual person attributes such as gender, backpack, and type of clothing are crucial for person search and re-identification. Detecting and retrieving these attributes with high accuracy generally requires high-quality video. However, this cannot be guaranteed in typical surveillance videos or images; besides improving hardware technology, improving inference algorithms on low-resolution data is therefore valuable. This paper proposes two solutions in this direction: a combined neural network architecture built from existing architectures, and a novel combination approach for person attribute recognition on low-resolution videos. The proposed architecture, called SRMAR, jointly trains Super-Resolution and Multi-Attribute Recognition models for more effective recognition. Experiments on two benchmark datasets demonstrate the effectiveness and applicability of the proposed architecture for low-resolution multi-attribute recognition. Furthermore, a higher-level linear combination scheme that optimally combines the SRMAR architecture and the multi-attribute recognition network is presented, yielding superior results in low-resolution person attribute recognition.


I. INTRODUCTION
Person attribute recognition (PAR) concentrates on recognizing personal characteristics such as gender, age, hair, height, clothing style, or carried items, especially from surveillance videos [1]. PAR has a variety of real-world applications, including security, market behavior recognition, traffic management, population management, and more [2], [3]. Like other tasks in computer vision, PAR suffers from limited computational resources and discrepancies in image quality. Challenges include varying illumination conditions, different viewpoints, complex and varying appearance, and low-resolution (LR) imagery. Furthermore, due to the nature of the problem, attribute recognition requires analyzing delicate details to achieve good results; when the imagery is in low resolution, the loss of information makes recognition quite challenging.
Challenges due to illumination, viewpoint differences, and varying appearance can be handled by training Convolutional Neural Networks (CNNs) in a supervised fashion on large, suitable datasets. Similarly, to address the issues of LR imagery, CNN architectures have recently been developed and applied with satisfactory results [4]. Here, instead of using neighborhood information of pixels, CNNs are designed to estimate pixel values directly. These CNN-based models fall under the category of super-resolution (SR) [5], [6]. SR models can upsample LR images by learning from many samples, enabling the model to use local information and understand the scene [7].
Another property of PAR tasks is that there are usually many different attributes (e.g., age, height, hair), depending on the purpose of the application. Recently, this task has also been addressed using CNN architectures, specifically so-called multi-attribute recognition (MAR) networks. This is currently a very active subject area and demonstrates great potential for understanding the contents of a scene [8].
This work explores the PAR task in LR images as a multi-class recognition problem. To this end, we propose an end-to-end deep learning model that merges the power of SR models and MAR models in a joint architecture. We call this end-to-end CNN architecture the Super-Resolution Multi-Attribute Recognition (SRMAR) model. The idea is to combine the powers of SR networks and MAR models into a single network, trained in an end-to-end fashion. The final architecture is achieved by combining two state-of-the-art architectures proposed for the SR and MAR tasks. We further propose a linear combination of the SRMAR network and the MAR network to boost recognition performance. Our experiments verify that combining models is better for low-resolution person attribute recognition.

FIGURE 1. Most real-world surveillance cases demand recognition of personal attributes from a distance, which motivates the need to process these attributes in low resolution. Our proposed hybrid network combines the power of an SR network with multi-attribute recognition to provide better recognition of personal attributes in LR images.
The overall idea is illustrated in Fig. 1. To the best of our knowledge, this work is the first study that proposes an end-to-end learning model for MAR problems on LR images. We establish the corresponding analysis on two benchmark datasets: Market-1501 [9] and DukeMTMC-reID [10]. Experimental results demonstrate that the proposed SRMAR network increases recognition accuracy on LR images.
To sum up, the main contributions of this article are: i) a joint CNN architecture, called SRMAR, that combines an SR network trained to reconstruct LR images with a MAR network; ii) a novel linear combination scheme to combine the SRMAR model with the MAR model; iii) improvements over state-of-the-art PAR methods even on LR images; and iv) extensive experiments with the proposed models on two benchmark datasets widely used for PAR, whose results demonstrate that a joint network for the SR and MAR tasks improves recognition accuracy.
The acronyms used throughout the paper are listed in Table 1 for ease of reference.

II. RELATED WORK
This section briefly reviews the multi-attribute recognition (MAR) and super-resolution (SR) techniques.

A. MULTI-ATTRIBUTE RECOGNITION
One of the primary approaches to leveraging semantically defined mid-level attributes is that of [1]. Low-level color and texture features are obtained from six equal-sized horizontal strips of each person image for attribute detection. Each strip is described by 8 color channels (YCbCr, HSV, and RGB) and 21 texture filters. A Support Vector Machine classifier is used to detect attributes. Deng et al. [11] evaluate several techniques, such as an intersection-kernel Support Vector Machine, a Markov random field with a Gaussian kernel, and random forests. More recently, Li et al. [12] propose two deep models to address the drawbacks of hand-crafted features and of ignoring the relationships between attributes. Their single-attribute recognition model (DeepSAR) recognizes each attribute individually; their second model [12] is a deep learning framework that exploits the relationships between attributes.
Jianqing et al. [13] propose a multi-label ConvNet model for predicting multiple attributes jointly. Lin et al. [8] pursue the best performance in large-scale person re-identification by exploiting the connections and correlations between attribute labels and ID labels. Their algorithm has two stages: a PAR framework and an attribute re-weighting module. A CNN in the PAR framework extracts pedestrian image feature representations. In the first stage, attribute losses are computed; the resulting scores are concatenated and sent to the second stage. Finally, the output of this MAR module is combined with the global features.
Shi et al. [14] propose a network with two modules: a coarse and a fine alignment module. The first module uses a part detector to locate the body parts and form candidate attributes; in the second module, these attributes are aggregated via a bilinear-pooling layer. Wu et al. [15] propose a parallel model consisting of intra-attention and inter-attention parts to learn the relationships among images and attributes.

B. SUPER RESOLUTION
One of the pioneering works on SR was [16]; the term SR itself appeared later, around 1990 [5]. A notable single-image SR approach from the pre-deep-learning era is [7], which is based on image statistics: first, a sparse representation of each patch of the LR input image is found, and then the HR output is generated using the coefficients of this representation. Deep learning techniques such as CNN-based SR methods have since yielded outstanding results. Wang et al. [4] show that the domain expertise of the conventional sparse coding structure can be combined with the key ingredients of deep learning to achieve further improvements. Dong et al. [17] use a three-layer fully convolutional network with bicubic interpolation as the upscaling algorithm; their SRCNN learns an end-to-end mapping from LR images directly to High-Resolution (HR) images. Later, they propose the Fast Super-Resolution Convolutional Neural Network (FSRCNN) [18] based on SRCNN [17]; FSRCNN has fewer parameters and a lower computational cost than its predecessor, which enables real-time operation on a regular CPU.
Generative adversarial networks (GANs) have been found to play an essential role in SR research [19]-[23]. Ledig et al. [24] focus on single-image SR and present a GAN for image super-resolution whose perceptual loss function combines an adversarial loss and a content loss. Yamanaka et al. [25] use deep CNN models and introduce an efficient and faster single-shot SR model.
For better performance, they use deeper CNN architectures, extending a parallelized CNN structure that consists of two smaller parts: a feature extraction network and a reconstruction network. Lim et al. [26] develop an enhanced deep SR network (EDSR) that is optimized by removing redundant modules from conventional residual networks. Haris et al. [27] propose Deep Back-Projection Networks (DBPN), which employ iterative up- and down-sampling layers and implement an error feedback mechanism for projection errors.

III. PROPOSED METHOD
The overall pipeline of the proposed end-to-end framework is illustrated in Fig. 2. Our framework consists of the joint training of two main parts: i) an SR network and ii) an attribute recognition network. In the remainder of this section, we discuss the details of these two components.

A. SUPER RESOLUTION NETWORK
The first part of the proposed network is the SR component. Note that we aim to learn a joint network, where the SR part helps the recognition of attributes in LR images. In this context, any end-to-end trainable SR model can be integrated into our framework. We utilize two recent state-of-the-art SR models: EDSR [26] and DBPN [27]. We selected these two because they are state-of-the-art, they cover different ends of the spectrum in terms of network size, and we can reproduce exactly the results stated in the corresponding papers.
The EDSR model [26] extends residual SR networks such as [24], [28] by removing additional modules/layers such as batch normalization and ReLU. With the elimination of batch normalization layers, the authors show that performance increases considerably and the required memory during training is reduced by 40%. This reduction allows the model to be improved by expanding its size; as a result, EDSR yields better performance. For upsampling factors ×3 and ×4, training is initialized with a pre-trained ×2 network, which speeds up training and improves the final performance.
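As an illustrative toy of the EDSR-style residual block described above, the following 1-D numpy sketch uses our own simplifications (real EDSR applies 2-D convolutions over multi-channel feature maps; the residual-scaling factor of 0.1 follows the EDSR paper):

```python
import numpy as np

def conv1d(x, k):
    # 'same'-padded 1-D convolution, a toy stand-in for EDSR's 3x3 convolutions
    return np.convolve(x, k, mode="same")

def edsr_resblock(x, k1, k2, scale=0.1):
    # conv -> ReLU -> conv, with NO batch normalization, and the residual
    # branch scaled before the skip connection, as in EDSR
    h = np.maximum(conv1d(x, k1), 0.0)  # ReLU
    return x + scale * conv1d(h, k2)
```

Removing batch normalization keeps the feature range unconstrained and, as the authors report, reduces training memory substantially, which is what allows the network to be made wider and deeper.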
The second SR model that we use is DBPN [27]. DBPN has two stages: a reciprocally connected up- and down-sampling stage and an error feedback stage. Generally, feed-forward architectures that act as a one-way mapping only map rich representations of the input to the output space; such an approach struggles on LR images because of the limited features available. To address this, the DBPN model generates HR features during up-sampling and, during down-sampling, projects these features back to LR space. The error feedback stage propagates errors from the up- to the down-scaling steps, which positively influences the training process and achieves a better reconstruction.
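The back-projection idea behind DBPN can be illustrated with a tiny 1-D numpy sketch; here, as our own simplification, linear interpolation and block averaging stand in for DBPN's learned up- and down-projection layers:

```python
import numpy as np

def upsample(x, s):
    # linear interpolation by factor s (stand-in for a learned up-projection)
    n = len(x)
    return np.interp(np.arange(n * s) / s, np.arange(n), x)

def downsample(x, s):
    # block averaging by factor s (stand-in for a learned down-projection)
    return x.reshape(-1, s).mean(axis=1)

def back_projection(lr, s, iters=5):
    # iteratively project the LR-space residual back into HR space,
    # mimicking DBPN's error feedback between up- and down-sampling stages
    hr = upsample(lr, s)
    for _ in range(iters):
        err = lr - downsample(hr, s)  # reconstruction error in LR space
        hr = hr + upsample(err, s)    # feed the error back to the HR estimate
    return hr
```

Each iteration shrinks the LR-space residual of the HR estimate, which is the behavior DBPN's learned error-feedback stage exploits.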
Both SR networks were trained following the original papers [26], [27] for the EDSR and DBPN models, respectively, using the loss function of [29] on the outputs of the SR network, denoted L_SR. We combine L_SR with a Dice loss term; the resulting loss function is then used to train the SRMAR model.

B. THE ATTRIBUTE NETWORK
The second part of the proposed model is a network for learning attributes. LR images that are up-scaled by the SR component are fed into the attribute recognition network, which predicts the corresponding attributes. For the MAR task, we adopt the recent network proposed by Lin et al. [8], which targets person re-identification and pedestrian attribute recognition at the same time. We adopt the person attribute recognition part of their model, trained only on the attribute dataset, with ResNet-50 [30] as the backbone. The backbone is followed by the attribute recognition head, which includes M (the number of attributes) fully connected layers, each followed by a Softmax layer [8]. The binary cross-entropy loss with L2 regularization,

L_MAR = -(1/N) Σ_{i=1}^{N} [ Y_i log Ŷ_i + (1 - Y_i) log(1 - Ŷ_i) ] + γ Σ_j ||w_j||²,

is used as the loss function in training, where Ŷ is the predicted output, Y is the ground-truth label, γ is a regularization factor set to 0.02, and w_j represents the weights of convolution layer j.
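A minimal numpy sketch of this objective, binary cross-entropy plus a γ-weighted L2 penalty over the convolution weights (the function name and the exact averaging convention are our assumptions):

```python
import numpy as np

def mar_loss(y_hat, y, conv_weights, gamma=0.02, eps=1e-7):
    # binary cross-entropy between predicted attribute probabilities y_hat
    # and ground-truth labels y
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    bce = -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    # gamma-weighted L2 regularization over the convolution-layer weights w_j
    reg = gamma * sum(float(np.sum(w ** 2)) for w in conv_weights)
    return float(bce + reg)
```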

C. THE SUPER-RESOLUTION MULTI-ATTRIBUTE RECOGNITION NETWORK
For the combined SRMAR neural network, the training process can be handled in different ways; designing the loss function and optimizer is essential. We have experimented with different loss functions, data augmentation, and other hyper-parameter optimization techniques. Our idea is to design a weighted loss based on the SR and MAR network loss functions. This helped reduce overfitting during training; however, it requires a careful selection of the regularization parameter β. Based on our results, we set β = 0.00005 during training, which is a trade-off between accuracy and overfitting of the network. Merging the SR and MAR networks leads to a large network that must be trained with care.
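The paper does not print the weighted loss explicitly; a minimal sketch, assuming a simple β-weighted sum of the attribute loss and the SR loss (the function name and form are ours):

```python
def srmar_loss(l_mar: float, l_sr: float, beta: float = 5e-5) -> float:
    # hypothetical joint objective: attribute-recognition loss plus a small,
    # beta-scaled contribution from the super-resolution loss
    # (beta = 0.00005 is the value reported in the paper)
    return l_mar + beta * l_sr
```

With β this small, the MAR term dominates while the SR term still regularizes the shared representation; by the authors' account, larger β values pushed the joint network toward overfitting.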

D. LINEAR COMBINATION OF MODELS
We also investigate the effect of combining the SRMAR network with the MAR network. The overall architecture of this combination scheme is shown in Fig. 3. The idea is to combine the predictive power of the SRMAR model and the MAR model, since our preliminary experiments indicate that some attributes are predicted better by the SRMAR network, whereas others are predicted better by the MAR network. The simplest approach is to assign equal weights (i.e., w_k = 1/K for K combined models; here K = 2), which amounts to simply averaging the two outputs.
Here we use the networks' output statistics on the in-sample data to maximize the accuracy of the linear combination. One could use other accuracy measures in the calculation, but that would restrict the problem and make it difficult to handle with linear optimization techniques. After reformulation, the problem reduces to a linear programming problem. This type of modeling falls within the subject of so-called data envelopment analysis, which appears in different areas of science addressing similar problems [31].
More formally, let μ_{i,p} and μ_{i,n} for i = 1, 2 be the means of the outputs of the SRMAR network (i = 1) and the MAR network (i = 2) on positive and negative ground-truth data, respectively, and let σ_{i,p} and σ_{i,n}, i = 1, 2, be the corresponding standard deviations. Let ψ_1 and ψ_2 be the outputs of the SRMAR and MAR networks, respectively, with values in [−1, 1]. We take a weighted average of the two as the output:

y = w_1 ψ_1 + w_2 ψ_2.

The goal is to calculate w_1 and w_2 from the output statistics of the two networks so that y > 0 indicates possessing the attribute and y ≤ 0 otherwise; w_1 and w_2 must therefore satisfy certain relations. Denoting the networks' outputs by ψ_i^p when the input possesses the attribute (''p'') and ψ_i^n when it does not (''n''), the direct reformulation of y > 0 (the first case) and y ≤ 0 (the second case) leads to:

ψ_1^p w_1 + ψ_2^p w_2 > 0,   ψ_1^n w_1 + ψ_2^n w_2 ≤ 0,

while maximizing the number of correct predictions. We can calculate the confidence intervals of the normalized coefficients as ψ_1^p ∈ (μ_{1,p} − ε_1, μ_{1,p} + ε_1), ψ_2^p ∈ (μ_{2,p} − ε_2, μ_{2,p} + ε_2), ψ_1^n ∈ (μ_{1,n} − ε_1, μ_{1,n} + ε_1), and ψ_2^n ∈ (μ_{2,n} − ε_2, μ_{2,n} + ε_2).
Here the half-widths are ε_i = Z^i_{0,1} σ/√N for i = 1, 2, where Z^i_{0,1} is the normalization of the outputs, Z_{0,1} stands for the normal distribution with mean 0 and variance 1, N is the number of samples (here, the number of in-sample data points), and σ stands for the corresponding distribution's standard deviation.
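A small numpy sketch of this half-width computation, assuming the standard z·σ/√N form (the choice z = 1.96 for a 95% interval is our assumption):

```python
import numpy as np

def ci_half_width(outputs, z=1.96):
    # eps = z * sigma / sqrt(N): half-width of the confidence interval for
    # the mean of a network's in-sample outputs
    outputs = np.asarray(outputs, dtype=float)
    return float(z * outputs.std(ddof=1) / np.sqrt(len(outputs)))
```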
In our case we set W_i = {0.01 × j : j = 1, 2, . . . , 100} for i = 1, 2, obtaining the finite feasible region W_1 × W_2, over which we evaluate the target function and select the pair (w_1, w_2) that leads to the highest accuracy. As the training set is large, we solve the resulting integer program using the Python wrapper of the integer programming solver SCIP [32]. The architecture of this linear combination strategy is shown in Fig. 3.
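Since the feasible region is the finite grid W_1 × W_2, the optimal pair can also be found by direct enumeration, as in this numpy sketch (the SCIP-based formulation in the paper solves the same search more efficiently on large training sets; the function name is ours):

```python
import numpy as np

def best_weights(psi1, psi2, labels, step=0.01):
    # enumerate the grid W1 x W2 = {0.01, 0.02, ..., 1.00}^2 and keep the
    # (w1, w2) pair with the highest in-sample accuracy; labels are +1 when
    # the attribute is present and -1 otherwise
    grid = np.arange(step, 1.0 + step / 2, step)
    best_pair, best_acc = (float(grid[0]), float(grid[0])), -1.0
    for w1 in grid:
        for w2 in grid:
            pred = np.where(w1 * psi1 + w2 * psi2 > 0, 1, -1)
            acc = float(np.mean(pred == labels))
            if acc > best_acc:
                best_pair, best_acc = (float(w1), float(w2)), acc
    return best_pair, best_acc
```

Because the equal-weight pair (0.5, 0.5) lies inside the grid, the enumerated optimum can never do worse than simple averaging on the in-sample data.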
Note that the proposed method performs at least as well as simple equal weights, since the solution w_1 = 1/2, w_2 = 1/2 is already inside the feasible region of the linear programming model. As demonstrated in the experiments section, this combination strategy leads to better results than using a single model.

IV. EXPERIMENTS

A. DATASETS
We carry out our experimental evaluation on two widely used benchmark datasets, Market-1501 [9] and DukeMTMC-reID [10]. Market-1501 [9] is one of the largest person re-ID datasets, containing 32,668 images and 3,368 query images. 751 identities (19,732 images) are used for training and 750 identities (13,328 images) for testing [8]. There are 27 attributes provided in the dataset. Following [8], we work with 11 attributes, some of which are taken as the average of related attributes. The attributes are annotated at the identity level. In this dataset, background and junk images are not considered during training or testing, since they do not have corresponding attribute labels. The second dataset is DukeMTMC-reID [10]. It contains 702 identities (16,522 images) for training and 702 identities (19,889 images) for testing, with 23 attributes provided. These datasets are used in this paper because we can reproduce the results of a state-of-the-art model already trained on them [8], and because they are among the most frequently referenced datasets in multi-attribute recognition research.

B. TRAINING DETAILS
For multi-attribute recognition, the feature extraction part is followed by M small sub-nets, each consisting of a convolutional layer, a pooling layer, a fully connected layer, and finally a Softmax layer that produces the classification probability. We use the original backbone network (ResNet-50) [30], pre-trained on ImageNet. In the training stage of the MAR network, the loss function is as stated in Eq. 3; for the SR network, which has an encoder-decoder-like architecture, Eq. 1 is taken as the loss function. For training the SR and MAR networks, the Adam optimizer is used. When training the SRMAR model, all layers of the SR part except the last 15 were frozen, to prevent them from being trained on the multi-attribute data. Our preliminary experiments indicate that freezing more layers leads to poor results, while freezing fewer layers causes overfitting; the choice of the last 15 layers is based on the outcome of different trials. During SRMAR training, the DiffGrad optimizer [33] with a cyclic learning-rate schedule is used to optimize the combined loss function (Equation 4). The maximum image size in the selected datasets is 64 × 128, so we take this as the reference size. Considering the input sizes of the SR models EDSR (×2, ×3, and ×4) and DBPN (×2, ×4, and ×8), the images were downsampled with bicubic interpolation into four sizes matching the SR model inputs: 32 × 64, 21 × 42, 16 × 32, and 8 × 16. The batch size is 32 in all experiments, and the networks were trained for 40-60 epochs with an initial learning rate of 0.001 and a scheduler that multiplies the learning rate by 0.1 every five epochs.
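The mapping from an SR model's upsampling factor to its LR input size can be reproduced from the 64 × 128 reference size (floor division is our assumption; the paper lists the resulting sizes but not the rounding rule):

```python
def lr_input_size(scale, ref=(64, 128)):
    # downscale the 64x128 reference size by the SR model's upsampling
    # factor; integer division reproduces the four sizes used in training
    w, h = ref
    return w // scale, h // scale
```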

C. EXPERIMENTAL RESULTS
The SRMAR model is evaluated via extensive experiments on the two benchmark datasets. There are two versions: SRMAR-E uses the EDSR [26] SR model within the joint network, whereas SRMAR-D uses the DBPN [27] SR model. The proposed SRMAR model is compared with the MAR [8] model applied to images of the same resolution. Several resolutions, such as 32 × 64, 21 × 42, and 16 × 32 or 8 × 16, are evaluated in the experiments. The total number of attributes is 30 and 23 for the Market-1501 [9] and DukeMTMC-reID [10] datasets, respectively. For compactness of presentation, we summarize them into 11 (Market-1501) and 10 (DukeMTMC-reID) attributes by averaging similar attributes that belong to one category (such as upper-body or lower-body colors). The rightmost column reports the average accuracy of each method at the presented image resolution.
In Table 2, the results of the SRMAR-E method on the Market-1501 dataset are shown. According to Table 2, for all resolutions, the SRMAR-E model improves the recognition performance of the MAR model significantly. For the 16 × 32 input size, the original model without any SR component achieves an accuracy of 68.78%, whereas the proposed SRMAR-E model achieves 83.22%. The accuracy improves further, to 84.37%, when the MAR network is combined with the SRMAR network using the proposed linear combination strategy (Combined-E). Similarly, for the 21 × 42 input size, the proposed SRMAR-E model improves the accuracy of the MAR [8] model from 74.98% to 82.14%, and the linear combination of the two models (Combined-E) achieves an accuracy of 85.58%. For the 32 × 64 input size, even though the improvements are less drastic, the accuracy still improves from 77.21% to 82.89%. Table 3 presents similar experiments using SRMAR-D (which utilizes the DBPN model as the SR component). We observe that the SRMAR model improves over MAR [8] from 77.21% to 79.71% for the large input, from 74.98% to 79.91% for the medium input, and from 65.45% to 80.16% for the small input. As with SRMAR-E, the relative improvement in accuracy for the small input size (8 × 16) is larger than for the other two input sizes. We also observe that the linear combination of the two models is again effective in increasing the overall accuracy even further.
For the DukeMTMC-reID dataset, we observe a similar trend in our experiments; Table 4 and Table 5 present the corresponding results. In Table 4, the results of the SRMAR-E model are presented; as shown there, the proposed linear combination of the MAR [8] and SRMAR-E models is also effective, offering a performance improvement of more than 1% over the SRMAR-E model. Table 5 shows the performance of the proposed SRMAR-D model, with the following improvements over the reference MAR model [8]: for the 32 × 64 input, the overall accuracy increases from 79.09% to 79.34%; for the 16 × 32 input, from 77.05% to 79.40%; and for the 8 × 16 input, from 68.02% to 78.98%. The same pattern is observed in the magnitude of the improvements with respect to input size: the smaller the input, the greater the improvement.
From the results in Tables 2-5, the performance on individual attributes can be evaluated further. In the case of the Market-1501 dataset (Table 2), the attribute with the lowest recognition rate is gender, whereas long sleeve (l.slv) appears to be the attribute with the highest recognition accuracy for the MAR model [8] across resolutions. The SRMAR-E model improves the recognition accuracy of the gender attribute significantly. The improvement is also remarkable for the age, style of clothing (ll.clth), color of upper clothing (c.up), and color of lower clothing (c.down) attributes. For some resolutions, especially the 21 × 42 and 32 × 64 input sizes, there is a reduction in accuracy for the ''bag'' and ''hair'' attributes in Table 2. In such cases, the proposed linear combination of the models helps: it resolves the accuracy reduction of SRMAR-E for the ''bag'' attribute, improving from 69.51% to 73.76% for the 32 × 64 input, from 62.23% to 74.21% for the 21 × 42 input, and from 70.93% to 76.32% for the 16 × 32 input. Table 6 summarizes the experimental results for the 16 × 32 input resolution over both datasets. As can be seen from this table, the SRMAR-E model, which uses the EDSR [26] SR model, performs better than the SRMAR-D model, which uses the DBPN [27] SR model, on both benchmark datasets. Moreover, the proposed linear combination of the SRMAR and MAR models offers a notable increase in accuracy for both SRMAR-E and SRMAR-D. Our observations suggest that combining the SRMAR and MAR models in the proposed way offers the best performance for recognizing person attributes in LR images; based on these results, the linear combination is superior to using the individual models. In addition, we have shown that person-attribute recognition accuracy on LR images can be improved with the help of SR models, and further improvement can be obtained by combining the models.

V. CONCLUSION
Identifying personal attributes in LR images is a crucial task for surveillance applications; however, this task is rarely addressed in the literature. To fill this gap, this paper evaluates the effect of SR CNN architectures on improving MAR performance on LR images. To this end, the paper adapts one state-of-the-art model proposed for MAR and two different SR network architectures, and then constructs a combined architecture, entitled SRMAR-(E, D). To the best of our knowledge, this research is the first combined learning model for multi-attribute recognition in low-resolution images. We also propose a linear combination scheme to combine the proposed SRMAR network with the base multi-attribute recognition network.
The experiments are carried out on two benchmark datasets that have been proposed for person attribute recognition. The experimental results demonstrate that, for low-resolution input images, the proposed end-to-end convolutional architecture succeeds in improving the recognition performance of the base model for person attribute recognition. The performance improvement is especially remarkable for small-sized inputs, while the joint learning of the SR and MAR networks contributes to recognition performance at both small and larger resolutions. The proposed linear combination scheme, which combines the base model and the SRMAR model via statistical properties of the models' performance on the training set, successfully combines the recognition power of both models.