Supervised Matrix Factorization Hashing With Quantitative Loss for Image-Text Search

Image-text hashing approaches have been widely applied in large-scale similarity search applications due to their fast search speed and low storage cost. Most recent supervised hashing approaches either learn a hash function by constructing a pairwise similarity matrix or directly learn the hash function and hash codes (i.e., 1 or −1) from class labels. However, the former suffers from high training complexity and storage cost, and the latter ignores the semantic correlation of the original data; both drawbacks hinder the learning of discriminative hash codes. To this end, we propose a novel discrete hashing algorithm called supervised matrix factorization hashing with quantitative loss (SMFH-QL). The proposed SMFH-QL first generates hash codes via the class labels, avoiding the construction of a pairwise similarity matrix; then, matrix factorization is used to design hash codes from the original image-text data, thereby eliminating the impact of class labels and reducing the quantization error. Moreover, we introduce a quantitative loss function term to learn hash codes by incorporating class labels and the original data information, which facilitates learning a similarity-preserving hash function for image-text search. Extensive experiments show that SMFH-QL outperforms several existing hashing methods on three representative datasets.


I. INTRODUCTION
Image-text search has attracted much attention due to the explosive growth of data in search engines and social networks in recent years. Image-text search plays an important role in many scenarios in the fields of target monitoring and object tracking [1]-[3], video surveillance [4], [5], audio-text recognition [6], face and saliency detection [7], [8], human computer interaction [9], [10] and multimodal modelling [11], [12], etc. Given an image query, the task of image-text search is to retrieve the most relevant texts in a text dataset, and vice versa. However, performing accurate and efficient image-text similarity searches on large-scale datasets is challenging when faced with limited storage resources and search capability. To address this challenge, many hashing-based methods have been proposed to transform image-text data in the original feature space into compact binary codes (i.e., hash codes) in a low-dimensional Hamming space. The crucial problem of hashing-based image-text search is how to preserve the intermodal and intramodal similarity correlation of the original image-text data for hash codes in Hamming space.
The associate editor coordinating the review of this manuscript and approving it for publication was Fuhui Zhou.
Generally, according to whether label information is utilized, existing hashing-based image-text methods can be classified as unsupervised methods and supervised methods. Unsupervised methods [13]-[22] utilize only the image-text pair, including co-occurrence information, to explore their semantic correlation in the shared image-text feature representation space. However, these methods cannot take advantage of semantic label information, i.e., cannot exploit class labels to preserve the intermodal and intramodal correlations of image-text data from the original feature space, which deteriorates the image-text search performance. By contrast, supervised methods [23]-[32] attempt to preserve correlation by exploiting the semantic labels to learn more consistent hash codes for the original image-text data, resulting in better retrieval performance. In this paper, we focus on supervised hashing-based methods for image-text similarity search.
Although many supervised hashing-based methods have achieved promising results, they still confront some common drawbacks in learning common feature representations or hash codes and controlling the quantization error. (1) Most hashing-based image-text methods directly quantize the common latent real-valued representation into hash codes in the process of binary quantization. Due to the inherently discrete nature of hash codes, the constraints are relaxed to continuous values to obtain a solution, resulting in a large quantization error, which makes the generated hash codes suboptimal. Consequently, this process substantially affects the retrieval accuracy of image-text search. (2) Most existing hashing approaches learn a hash function using the class labels to construct a pairwise similarity matrix. However, these approaches entail high computational complexity and storage cost. (3) Another approach is to obtain hash codes directly from the class labels, which ignores the influence of the original image-text data. Moreover, this approach cannot take advantage of the semantic correlation of intramodal and intermodal (e.g., image and text) information to learn discriminative hash codes.
VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In light of the aforementioned problems, we propose a novel supervised hashing method called supervised matrix factorization hashing with quantitative loss (SMFH-QL). The main contributions of the proposed method are summarized as follows:
• We propose an SMFH-QL algorithm for image-text searching by combining class labels with a matrix factorization strategy. SMFH-QL introduces class labels to generate hash codes directly and exploits matrix factorization to learn the hash function. Based on this fusion strategy, the proposed approach maintains the well-learned similarity correlation of the original image-text data, generating more discriminative hash codes.
• We introduce a quantitative loss function term to constrain the objective function of SMFH-QL to achieve a closer correlation between the common representation and hash codes. Thus, we can eliminate redundant features by using label information, which eventually reduces the quantitative loss and improves the quality of the hash function when performing quantization.
• We evaluate the proposed SMFH-QL in extensive experiments on three image-text retrieval datasets. The results demonstrate that SMFH-QL obtains better retrieval performance than the compared state-of-the-art hashing methods.
The rest of this paper is organized as follows. Section II introduces related work. Section III presents the details of our method, including matrix factorization, the hash function, and the optimization algorithms. Section IV presents the experimental results, followed by conclusions and directions for future work in Section V.

II. RELATED WORK
As mentioned previously, a variety of unsupervised and supervised image-text hashing methods have been proposed. The following review of related work focuses on these two aspects.

A. UNSUPERVISED IMAGE-TEXT HASHING METHODS
Unsupervised image-text hashing methods exploit only the original image-text data to find intramodal and intermodal semantic correlations when learning the hash function. Kumar and Udupa [33] proposed cross-view hashing (CVH), an extension of the spectral hashing (SH) [34] method that minimizes the weighted average distance between hash codes to preserve the similarity between data modalities. Zhu et al. [35] proposed linear cross-modal hashing (LCMH), which adopts anchor maps to preserve the similarity within each modality, avoiding the construction of similarity maps for all training data and thus greatly improving the hash search efficiency. Zhou et al. [36] proposed latent semantic sparse hashing (LSSH), which first performs sparse coding and matrix factorization to capture the latent semantic features of images and texts and then projects the features into a latent semantic space, where they are quantized to obtain an optimized hash code. Ding et al. [37] presented collective matrix factorization hashing (CMFH). This method was the first attempt to project data of different modalities into a common subspace by exploiting collective matrix factorization to learn unified hash codes. Unsupervised methods cannot use class label information to guide the similarity correlation of the original image-image, text-text and image-text data, leading to degradation of retrieval performance.

B. SUPERVISED IMAGE-TEXT HASHING METHODS
Supervised image-text hashing methods can improve upon the retrieval performance of unsupervised methods by considering label information. Therefore, supervised methods can be applied in more scenarios than can unsupervised approaches. In general, supervised image-text hashing methods can be further categorized into supervised nonmatrix factorization schemes and supervised matrix factorization schemes.

1) SUPERVISED NONMATRIX FACTORIZATION METHODS
Supervised nonmatrix factorization schemes employ given class labels to guide the hash code or hash function learning procedure. Bronstein et al. [38] proposed cross-modality similarity-sensitive hashing (CMSSH), which was the first attempt to apply binary classification to hash code learning and which uses a boosting algorithm to learn the hash codes of different modalities. Lin et al. [39] proposed semantics-preserving hashing (SePH), which first adopts the semantic correlation matrix of the training data as the supervised information, then transforms the semantic correlation matrix and the learned hash code into probability distributions, and finally minimizes the Kullback-Leibler (KL) divergence between the two distributions. In the process of hash code learning, the method performs nonlinear projection via kernel logistic regression (KLR) and maps data features into hash codes, which are then utilized to learn hash functions. Zhang and Li [40] proposed semantic correlation maximization (SCM), which measures the semantic similarity between data modalities by multiplying the class matrix of the training data. Xu et al. [41] proposed discriminative cross-modal hashing (DCH). DCH directly generates discriminative binary hash codes via a discrete coordinate descent strategy and then learns modality-specific hash functions based on the learned binary codes. However, these supervised nonmatrix factorization schemes ignore the co-correlation of the original image-text features and cannot mine and employ the semantic similarity correlation of image-text modalities.

2) SUPERVISED MATRIX FACTORIZATION METHODS
Supervised matrix factorization schemes apply matrix factorization to mine the intramodal and intermodal semantic correlations from the original image-text modalities, which improves the hash function or hash code. For instance, Liu et al. [42] proposed supervised matrix factorization hashing (SMFH), which utilizes the label information to preserve the semantic similarity between different data modalities. Moreover, building on the CMFH method, SMFH uses the adjacent structure to maintain the similarity between data of different modalities, which further enhances retrieval accuracy. Tang et al. [28] proposed a similar supervised matrix factorization method based on the class label matrix. However, the above approaches require a pairwise semantic similarity matrix, which leads to large storage and computational costs. Wang et al. [43] proposed label-consistent matrix factorization hashing (LCMFH), which directly guides the process of learning the hash function via the label information of the training data. Therefore, LCMFH achieves a short training time and high search precision by avoiding the construction of a pairwise semantic similarity matrix. However, this method focuses on the similarity between hash codes of the same type of data, which ignores the feasibility of different data hash codes.
The above review shows that supervised approaches can achieve better retrieval performance than unsupervised approaches by considering class labels. Some supervised nonmatrix factorization hashing methods ignore the co-correlation of original image-text features and do not consider the semantic similarity correlation of image-text modalities. Meanwhile, supervised matrix factorization approaches are designed to preserve semantic relationships by using class labels to construct pairwise similarity matrices. However, pairwise similarity matrices entail high training complexity and storage costs. In this paper, the SMFH-QL algorithm is proposed to address these two drawbacks and to optimize the hash codes and hash function directly via the class labels, taking into account the intermodal and intramodal semantic correlations in the original image-text feature space and avoiding the construction of pairwise similarity matrices. The main differences between the proposed SMFH-QL and several state-of-the-art hashing methods are summarized in Table 1.
Fig. 1 shows an overview of our proposed SMFH-QL. It includes collective matrix factorization to produce a common representation, a hash function to learn mapping matrices for guiding the query procedure, a classification loss function to directly generate a unified hash code via the class label matrix, and a quantitative loss function to maintain the well-learned similarity correlation between the hash code and the common representation when performing quantization. These parts are combined in a training procedure that learns the common representation for each image or text modality to calculate image-text similarities. Table 2 lists the notations and corresponding definitions used in the study. SMFH-QL accepts original paired image-text data as the input and processes them through common representation learning and hash coding. The ultimate goal of the proposed approach is to retrieve the most relevant texts given an image query and vice versa.

1) COLLECTIVE MATRIX FACTORIZATION
FIGURE 1. An overview of our proposed SMFH-QL, which consists of four parts that employ a collective matrix factorization strategy to produce a common representation from the features of image-text modality to learn the hash function and generate hash codes based on the semantic label. We also exploit a quantitative loss function term to strengthen the semantic correlation between the common representation and hash codes when performing quantization.
Collective matrix factorization is commonly used in low-rank representation learning. Ding et al. [37] demonstrated that collective matrix factorization is effective in mining the relationships between multimodal data and shallow semantics in image-text retrieval. Given the image and text modalities $X$ and $Y$, the proposed method needs to find the basis matrices $U_1 \in \mathbb{R}^{d_1 \times l}$, $U_2 \in \mathbb{R}^{d_2 \times l}$ and the shared latent representation $V \in \mathbb{R}^{l \times n}$ via matrix factorization.
Although the features extracted from different data of an image-text instance are heterogeneous, the two modalities still have common semantic information because they describe the contents of the identical instance. Hence, we assume that different modalities can share a common latent space generated by collective matrix factorization. Considering this, we define the following formulation:
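For reference, the collective matrix factorization objective popularized by CMFH [37] can be written in the notation above as in the sketch below. It rests on two assumptions that this section does not state explicitly: a Frobenius-norm reconstruction loss, and $\lambda$ (the first of the trade-off coefficients listed with the overall objective) acting as the balance weight between the two modalities.

```latex
\min_{U_1,\, U_2,\, V} \;
  \lambda \,\lVert X - U_1 V \rVert_F^2
  + (1 - \lambda)\, \lVert Y - U_2 V \rVert_F^2,
\qquad
X \in \mathbb{R}^{d_1 \times n},\;
Y \in \mathbb{R}^{d_2 \times n},\;
V \in \mathbb{R}^{l \times n}.
```

Under this reading, $\lambda = 0.5$ weights the image and text reconstruction errors equally, and each column of $V$ is the shared $l$-dimensional representation of one image-text pair.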

2) HASH FUNCTION
As mentioned above, we obtain the unified representation $V$ by Eq. (1), but it cannot be used directly for similarity queries owing to its high dimensionality. To handle out-of-sample instances, we need to learn two mapping functions, one for the image modality and one for the text modality, that produce hash codes. The corresponding transformation is defined in Eq. (2), where $W_1$ and $W_2$ are the mapping matrices.

3) CLASSIFICATION LOSS FUNCTION
To maintain the consistency between the learned hash code and the original semantic label, we utilize a shared label matrix to produce a unified hash code directly. That is, a classification error term is introduced to constrain the hash code, and a classifier is trained by minimizing the classification error in the process of learning the corresponding objective function, which improves the discrimination of the learned hash codes. This term is given in Eq. (3), subject to $H \in \{-1, 1\}^{l \times n}$, where $Z \in \mathbb{R}^{l \times c}$ is a projection matrix.

4) QUANTITATIVE LOSS FUNCTION
To avoid a continuous relaxation strategy, we directly generate binary hash codes. To this end, we construct a connection between $H$ and $V$ such that the matrix $V$ can be accurately represented by the learned hash codes. In this paper, we minimize the squared loss between $H$ and $V$ to avoid a large quantitative loss, i.e., $\min \lVert H - V \rVert_F^2$ subject to $H \in \{-1, 1\}^{l \times n}$.
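The effect of this term can be illustrated on toy data. The sketch below is only an illustration: the values of `V` are made up, and mapping zero to +1 in `sign` is a convention assumed here. It binarizes a small representation and measures the squared distance between the codes and the real-valued entries.

```python
def sign(v):
    """Sign mapped to {-1, +1}; zero is sent to +1 (an assumed convention)."""
    return 1 if v >= 0 else -1


def quantization_loss(H, V):
    """Squared Frobenius distance ||H - V||_F^2 between codes H and representation V."""
    return sum((h - v) ** 2
               for h_row, v_row in zip(H, V)
               for h, v in zip(h_row, v_row))


# Toy common representation with l = 2 bits and n = 3 instances (made-up numbers).
V = [[0.9, -0.2, -1.1],
     [-0.8, 0.1, 1.0]]

# Binarize element-wise and measure how far the codes are from V.
H = [[sign(v) for v in row] for row in V]
loss = quantization_loss(H, V)
```

Entries of `V` close to zero (here -0.2 and 0.1) contribute most of the loss, which is why pulling `V` toward the binary codes during training reduces the error incurred at quantization time.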

5) THE KERNEL METHOD
In this work, to address the linear inseparability of data in the low-dimensional space, we adopt a nonlinear embedding method for all training samples, which ensures that the data are linearly separable in the high-dimensional kernel space [44], [45]. In the corresponding formulation, $S^{(t)} = \{X, Y\}$ denotes the aggregation of the image and text modalities, and the embedding is computed with respect to $m$ randomly selected anchor points.
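A minimal sketch of an anchor-based nonlinear embedding is given below. It assumes a Gaussian (RBF) kernel, which is a common choice for this kind of mapping but is not confirmed by the text above; the function name `rbf_anchor_features`, the bandwidth `sigma`, and the toy samples are all hypothetical.

```python
import math
import random


def rbf_anchor_features(samples, anchors, sigma):
    """Map each sample x to [exp(-||x - a_j||^2 / (2 * sigma^2)) for each anchor a_j]."""
    features = []
    for x in samples:
        row = []
        for a in anchors:
            sq_dist = sum((xi - ai) ** 2 for xi, ai in zip(x, a))
            row.append(math.exp(-sq_dist / (2.0 * sigma ** 2)))
        features.append(row)
    return features


random.seed(0)
samples = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # toy training samples
m = 2                                                      # number of anchor points
anchors = random.sample(samples, m)                        # anchors drawn at random
phi = rbf_anchor_features(samples, anchors, sigma=1.0)     # n x m kernel features
```

Each sample is represented by its similarity to the `m` anchors, so the embedded dimension is `m` regardless of the original feature dimension.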

6) OVERALL OBJECTIVE FUNCTION
Integrating Eq. (1), Eq. (2), Eq. (3), Eq. (4) with Eq. (5), we establish a unified hashing framework and express the overall objective function of the proposed SMFH-QL. However, the large number of parameters in the above five formulations increases the complexity of the proposed model and may lead to overfitting. Thus, we introduce a regularization term to mitigate overfitting, which keeps the model simple and constrains its characteristics. With this regularization term added, the final objective function of SMFH-QL is given in Eq. (8), subject to $H \in \{-1, 1\}^{l \times n}$, where $\lambda$, $\beta$, $\mu$, $\alpha$, $\gamma$ are the trade-off coefficients. Eq. (9) gives the detailed form of Eq. (8), subject to the same constraint on $H$. The objective function of SMFH-QL is formulated to generate discriminative hash codes by preserving the label consistency and the distribution of features. Therefore, our final aim is to minimize Eq. (9).

B. OPTIMIZATION ALGORITHM OF THE PROPOSED SMFH-QL
Eq. (9) is a nonconvex problem in its matrix variables. Fortunately, the problem is convex with respect to any one of the six matrix variables if the other variables are fixed [43]. Therefore, we adopt an iterative optimization strategy that approaches the optimal solution gradually. The optimization problem in Eq. (9) can then be solved by iterating the following steps.
Step 1: Updating $U_1$. Fixing $U_2$, $W_1$, $W_2$, $V$, $Z$, we learn the basis matrix $U_1$. Setting $\partial J / \partial U_1 = 0$ yields a closed-form update for $U_1$.
Step 2: Updating $U_2$. Similarly, setting $\partial J / \partial U_2 = 0$ yields the update for $U_2$.
Step 3: Updating $W_1$. Fixing $U_1$, $U_2$, $W_2$, $V$, $Z$, we learn the mapping matrix $W_1$. Setting $\partial J / \partial W_1 = 0$ yields a closed-form update for $W_1$.
Step 4: Updating $W_2$. Similarly, setting $\partial J / \partial W_2 = 0$ yields the update for $W_2$.
Step 5: Updating $Z$. Fixing $U_1$, $U_2$, $W_1$, $W_2$, $V$, we optimize the projection matrix $Z$. Setting $\partial J / \partial Z = 0$ yields a closed-form update for $Z$.
Step 6: Updating $V$. Fixing $U_1$, $U_2$, $W_1$, $W_2$, $Z$ and setting $\partial J / \partial V = 0$ yields a closed-form update for $V$.
Step 7: Updating $H$. The hash code $H$ is optimized with the other variables fixed, subject to $H \in \{-1, 1\}^{l \times n}$. The two terms in Eq. (19) are expanded using $\mathrm{tr}(\cdot)$, the trace of a matrix. Because the other variables are fixed when updating $H$, the remaining terms const1 and const2 are regarded as constants. By combining Eq. (18) and Eq. (19), we see that Eq. (20) is equivalent to a trace problem subject to $H \in \{-1, 1\}^{l \times n}$. Therefore, we can directly obtain the final closed-form solution for $H$ in Eq. (23), where $\mathrm{sign}(\cdot)$ is the element-wise sign function.
The optimization strategy can discretely generate all bits of the hash code via the class label and common representation. Here, SMFH-QL avoids the large quantitative loss suffered by relaxation and reduces the time consumption of a bit-by-bit optimization scheme. Meanwhile, the discrete optimization strategy accelerates the training process.
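The closed-form update in Step 7 can be sketched as follows. This is a sketch under stated assumptions rather than the exact update of the paper: it assumes the $H$-subproblem has the form `w_cls * ||H - ZL||_F^2 + w_quant * ||H - V||_F^2`, where `ZL` stands for the class labels projected by $Z$ and the weight names `w_cls` and `w_quant` are hypothetical. Because every entry of $H$ is ±1, $\lVert H \rVert_F^2$ is constant, so minimizing this objective is equivalent to maximizing $\mathrm{tr}(H^T (w_{quant} V + w_{cls} ZL))$, which the element-wise sign achieves in one step.

```python
from itertools import product


def update_hash_codes(V, ZL, w_quant, w_cls):
    """Return H = sign(w_quant * V + w_cls * ZL) with entries in {-1, +1}."""
    return [[1 if w_quant * v + w_cls * p >= 0 else -1
             for v, p in zip(v_row, p_row)]
            for v_row, p_row in zip(V, ZL)]


def h_objective(H, V, ZL, w_quant, w_cls):
    """Assumed H-subproblem: w_quant * ||H - V||^2 + w_cls * ||H - ZL||^2."""
    return sum(w_quant * (h - v) ** 2 + w_cls * (h - p) ** 2
               for h_row, v_row, p_row in zip(H, V, ZL)
               for h, v, p in zip(h_row, v_row, p_row))


# Toy values with l = 2 bits and n = 2 instances (made-up numbers).
V = [[0.4, -0.7], [-0.1, 0.3]]     # common representation
ZL = [[-0.9, -0.2], [0.5, -0.6]]   # projected class labels
w_quant, w_cls = 1.0, 2.0

H = update_hash_codes(V, ZL, w_quant, w_cls)

# Brute-force check over all 2^4 binary matrices: no candidate beats the sign solution.
best = min(h_objective([list(bits[:2]), list(bits[2:])], V, ZL, w_quant, w_cls)
           for bits in product([-1, 1], repeat=4))
```

The brute-force search is only feasible for such a tiny example; it is included to show that the one-shot sign update reaches the same minimum without optimizing the code bit by bit.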

C. COMPUTATIONAL COMPLEXITY ANALYSIS
In this section, we theoretically analyze the computational complexity of SMFH-QL. Suppose that $d = \{d_1, d_2\}$ is the feature dimension of the modalities and $c$, $l$, $n$ and $w$ are the length of the class label, the length of the hash codes, the number of training instances, and the number of iterations, respectively. The computational complexity of each step of the optimization process for SMFH-QL is shown in Table 3.
During the training of the SMFH-QL algorithm, because $l, d, c \ll n$, the overall computational complexity is $O(nw)$, which is linear in the number of training instances. In the query procedure, the computational complexity of SMFH-QL is $O(dl)$, i.e., linear in the feature dimension for a fixed code length. In summary, SMFH-QL is highly scalable for large-scale datasets due to its linear training complexity and query complexity.

D. OUT-OF-SAMPLE EXTENSION
For a new query instance $x$ or $y$, its corresponding hash code $b$ can be generated by the trained mapping matrices $W_1$, $W_2$. We can obtain the hash functions $h(x)$ and $h(y)$ for out-of-sample extension: $b = h(x) = \mathrm{sign}(W_1 x)$ and $b = h(y) = \mathrm{sign}(W_2 y)$.
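The query procedure can be sketched as follows for an image-to-text search. The mapping matrices and feature vectors are toy values chosen for illustration, ranking by Hamming distance is the usual way of comparing binary codes and is assumed here, and any kernel embedding of the inputs is omitted for brevity.

```python
def hash_code(W, x):
    """b = sign(W x) with entries in {-1, +1}; zero maps to +1 (assumed convention)."""
    return [1 if sum(w * xi for w, xi in zip(row, x)) >= 0 else -1 for row in W]


def hamming(a, b):
    """Number of positions at which two codes differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


# Toy mapping matrices (made-up values): l = 2 bits, d1 = 3 image dims, d2 = 2 text dims.
W1 = [[1.0, -1.0, 0.0], [0.5, 0.5, -1.0]]
W2 = [[2.0, -1.0], [-1.0, 1.0]]

image_query = [0.2, 0.9, 0.1]
text_db = [[1.0, 0.5], [0.1, 0.8], [0.3, 0.2]]

q = hash_code(W1, image_query)            # code of the image query
db = [hash_code(W2, y) for y in text_db]  # codes of the text database

# Rank database texts by Hamming distance to the query code (ties keep index order).
ranking = sorted(range(len(db)), key=lambda i: hamming(q, db[i]))
```

In this example the second text shares the query's code exactly and is ranked first. Encoding a query costs one matrix-vector product per instance, which matches the $O(dl)$ query complexity stated above.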

IV. EXPERIMENTS
To demonstrate the effectiveness of SMFH-QL, we conduct extensive experiments on three widely used datasets. First, we introduce the three representative datasets and the evaluation and comparison methods. Then, we perform comparative experiments using SMFH-QL and other methods. Finally, to further validate the efficacy of SMFH-QL, we evaluate the experiments via empirical analysis, including convergence analysis, training time results and parameter analysis.

A. DATASETS
We evaluate SMFH-QL on three representative image-text search datasets: Wiki, MIRFLICKR-25K and NUS-WIDE. The details of the three datasets are shown in Table 4.
• Wiki [46]: This dataset consists of 2,866 image-text pairs from Wikipedia. The instances cover ten topics, such as warfare, art, and sky, and each sample is an image-text pair. Each image is represented as a 128-D SIFT feature vector, and each text is represented by a 10-D LDA topic vector. Following the experimental protocol in SePH [39] and DLFH [30], we randomly select 2,173 pairs as the training set and use the rest as the query set.
• MIRFLICKR-25K [47]: This dataset contains approximately 25K instances, and each image is annotated with several user-assigned tags selected from 24 labels. Each image is represented as a 512-D GIST feature vector, and each text is represented by a 1386-D BOW vector. Following the experimental protocol, we randomly sample 5,000 instances as the training set and select 2,000 instances as the query set.
• NUS-WIDE [48]: This dataset contains approximately 270K images with annotated tags from 81 semantic concepts. Following DLFH, we choose the 10 most frequent concepts consisting of 186,577 images as the experimental data. Each image is a 500-D bag-of-visual words (BOVW) vector, and each text is represented as a 1000-D BOW vector.

B. EVALUATION METRIC
The mean Average Precision (mAP) is a common evaluation metric. The mAP is the mean of the average precision (AP) over all queries, and the AP of the top $R$ instances for a query $q$ is defined as $AP(q) = \frac{1}{L} \sum_{r=1}^{R} P(r)\,\xi(r)$, where $q$ is a query instance, $R$ is the number of retrieved instances, $L$ is the number of relevant instances in the retrieved set, and $P(r)$ represents the precision of the top $r$ retrieved instances. $\xi(r)$ is an indicator function, with $\xi(r) = 1$ if the $r$th instance is relevant to the query and $\xi(r) = 0$ otherwise. The mAP is then computed as $mAP = \frac{1}{N} \sum_{i=1}^{N} AP(q_i)$, where $N$ is the number of queries; $R$ is the size of the query set in the following experiments. Moreover, we adopt two other criteria, i.e., the Precision-Recall curve and the topN-Precision curve [49], which are frequently used in image-text searches.
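The metric can be computed as in the sketch below, which follows the definitions above: precision at each rank where a relevant item appears, averaged over the relevant items in the top R, then averaged over queries. The relevance lists are toy values, and returning 0 for a query with no relevant item in the top R is a convention assumed here.

```python
def average_precision(relevance, R):
    """AP over the top-R ranked items; relevance[r - 1] is 1 if the r-th item is relevant."""
    top = relevance[:R]
    L = sum(top)                  # number of relevant items in the retrieved set
    if L == 0:
        return 0.0                # assumed convention when nothing relevant is retrieved
    hits = 0
    total = 0.0
    for r, xi in enumerate(top, start=1):
        if xi:
            hits += 1
            total += hits / r     # P(r) * xi(r)
    return total / L


def mean_average_precision(relevance_lists, R):
    """Mean of the per-query AP values."""
    return sum(average_precision(rel, R) for rel in relevance_lists) / len(relevance_lists)


# Two toy queries, each with a ranked list of four retrieved items.
rankings = [[1, 0, 1, 0], [0, 1, 1, 1]]
map_score = mean_average_precision(rankings, R=4)
```

For the first query the relevant items sit at ranks 1 and 3, giving AP = (1/1 + 2/3) / 2 = 5/6; the second query gives 23/36, so the mAP of the two is 53/72.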

C. BASELINE METHODS AND EXPERIMENTAL SETUP
Our method is compared against several state-of-the-art hashing methods: LSSH, CMFH, SMFH, SCM-seq, DCH and LCMFH. The first two are unsupervised methods and the last four are supervised methods.
• LSSH [36] extracts an image representation via sparse coding and a text representation via matrix factorization and conducts unified optimization of the objective function by means of hash code learning.
• CMFH [37] learns a unified hash code via matrix factorization with a latent factor model, which can decompose the characteristics of different modes of samples into the same space.
• SMFH [42] constructs a similarity matrix with class label information to strengthen the constraint of data similarity between modalities based on the CMFH method.
• SCM-seq [40] utilizes the label information of samples to maximize the semantic correlation between modalities and proposes two learning models as optimization algorithms. In this paper, we adopt the superior approach SCM-seq, which implements sequential learning.
• DCH [41] directly generates discriminative hash codes via discrete coordinate descent and then learns modality-specific hash functions based on the learned binary codes.
• LCMFH [43] directly guides the procedure of hash function learning based on the semantic labels of the training data. Therefore, LCMFH avoids the construction of a pairwise similarity matrix, which reduces the number of calculations.
Here, SMFH-NQL, a variant of SMFH-QL without the quantitative loss term, is used to demonstrate the influence of the quantitative loss on the retrieval performance of the proposed SMFH-QL. For fairness, SMFH-NQL adopts the same parameters as SMFH-QL. To validate the superiority of SMFH-QL, several state-of-the-art hashing methods are used for comparison, including unsupervised hashing (LSSH, CMFH), supervised nonmatrix factorization hashing (SCM-seq, DCH) and supervised matrix factorization hashing (SMFH, LCMFH). The source codes of the baseline methods were kindly provided by the authors. All parameters in their objective functions are set according to their original papers.
All baseline methods and our SMFH-QL are implemented in MATLAB (64 bit). We perform two image-text search tasks: I→T and T→I. The I→T task uses an image query to retrieve relevant texts, and the T→I task uses a text query to retrieve relevant images. The experiments are conducted on a personal computer with an Intel(R) Core(TM) i7-8550U CPU @ 1.80 GHz, 8 GB RAM and the 64-bit Windows 10 operating system.

D. RESULTS AND DISCUSSION

1) ASSESSMENT OF SMFH-QL'S QUALITY ON WIKI
The first experiment compares the baseline approaches and SMFH-QL on the Wiki dataset. The parameters {λ, β, α, µ, γ} for Eq. (9) are {0.5, 10, 10, 10000, 0.1}. The mAP values of all methods on Wiki are shown in Table 5. The Precision-Recall curves and the topN-Precision curves for all compared methods are plotted in Figs. 2 and 3, respectively. From Table 5, we have the following observations: (1) SMFH-QL achieves the best results, which confirms the effectiveness of the proposed algorithm. (2) The mAP values of LCMFH, DCH and SMFH-QL are much better than those of the other baseline methods because generating hash codes directly from the class label matrix improves the performance of hashing-based methods. Among these three, SMFH-QL achieves the best results, followed by DCH, possibly because SMFH-QL introduces a quantization loss term, which can learn a discriminative common representation via the generated hash code. (3) As the hash code length increases, the performance of SMFH-QL improves, which indicates that longer hash codes embed more semantic information. (4) The mAP values of most algorithms are better in T→I than in I→T because the features of the text modality can better express the semantic information of an original instance. (5) SMFH-NQL achieves much worse mAP scores than the proposed SMFH-QL because the quantitative loss term is not included in SMFH-NQL, which leads to a large quantization error. These experimental results illustrate the importance of the quantitative loss term and demonstrate the effectiveness of SMFH-QL.
The Precision-Recall curves and the topN-Precision curves on the Wiki dataset for code lengths from 16 bits to 128 bits are shown in Figs. 2 and 3, respectively. We make the following observations from Figs. 2 and 3. (1) SMFH-QL has the highest precision in image-text searching, which is consistent with the mAP results. (2) SMFH-QL achieves higher precision than SMFH-NQL because SMFH-NQL lacks the quantitative loss term, which further confirms the benefit of this term in our proposed algorithm.

2) ASSESSMENT OF SMFH-QL'S QUALITY ON MIRFLICKR-25K
In this part, we conduct the same experiments on MIRFLICKR-25K as on Wiki, with hash code lengths of 16, 32, 64, and 128 bits. The parameters {λ, β, α, µ, γ} for Eq. (9) are {0.5, 10, 100, 1000, 0.1} on MIRFLICKR-25K. The mAP scores of SMFH-QL and the baseline methods are reported in Table 5. From Table 5, we can observe the following results: (1) SMFH-QL outperforms the other methods in terms of mAP on MIRFLICKR-25K, confirming that our proposed algorithm is effective for large-scale datasets. (2) As the hash code length increases, the mAP values of all comparison methods increase because the longer the code length is, the more semantic information the hash code contains. (3) The results of I→T and T→I differ greatly on Wiki, whereas they are similar on MIRFLICKR-25K. There are two reasons for this observation. First, the quality of the images in Wiki is poor, and their correlation with the semantic tags is weak. By contrast, the text in Wiki is well edited at the time of collection and is more relevant to the tags. Second, MIRFLICKR-25K contains 25K images, each with corresponding tags and annotations attached, which greatly reduces the semantic gap between the heterogeneous data and yields similar retrieval performance on the two tasks.
The Precision-Recall curves and the topN-Precision curves are illustrated in Fig. 4 and 5, respectively. Clearly, SMFH-QL achieves the best results, consistent with the mAP values. In addition, all methods have higher precision on MIRFLICKR-25K than on Wiki because the semantic gap is much smaller for MIRFLICKR-25K. Finally, SMFH-QL achieves higher precision than SMFH-NQL, which confirms the influence of the quantitative loss term and the effectiveness of SMFH-QL.

3) ASSESSMENT OF SMFH-QL'S QUALITY ON NUS-WIDE
In this part, we compare SMFH-QL and other methods on the NUS-WIDE dataset. The parameters {λ, β, α, µ, γ} for Eq. (9) are {0.5, 10, 100, 1000, 0.1}. Table 5 shows the mAP values of all methods on NUS-WIDE. Figs. 6 and 7 plot the Precision-Recall curves and the topN-Precision curves for all methods, respectively. According to the experimental results in Table 5 and Figs. 6 and 7, the proposed SMFH-QL outperforms the other methods in image-text search tasks because SMFH-QL introduces a quantization loss function term to constrain the objective function, which yields a closer correlation between the shared semantic representation and hash codes. In addition, SMFH-NQL performs much worse than SMFH-QL, which is consistent with the evaluations on Wiki and MIRFLICKR-25K. These experimental results demonstrate the superior retrieval performance of SMFH-QL.
The six subfigures in Fig. 8 illustrate the mAP values on the two retrieval tasks with the different settings of α and µ on the three image-text datasets. The results shown in Fig. 8 yield the following observations. Observation 1: Performance on Wiki.
(1) SMFH-QL can achieve stable performance when α varies in [10,1e4] and µ varies in [1e3,1e5]. (2) When α is excessively small, e.g., α is in [0,1], the mAP scores of SMFH-QL deteriorate because the learned hash codes do not consider the common latent features, which makes the hash function poor. When α is excessively large, e.g., α is in [1e3,1e4], the mAP scores of SMFH-QL start to decline because the learned hash codes consider more common latent features, which makes the hash function imprecise. Therefore, a suitable score for α can decide the range of µ. (3) In addition, when α is in [1,100], SMFH-QL achieves the best performance on the three datasets for µ in [1e3,1e4]. Thus, as for Wiki, if the value of α is excessively large or small, it can affect the value of µ, which prevents the generated hash codes from learning a discriminative common feature representation. In addition, the imprecise common feature representation worsens the to-be-learned hash function and ultimately degrades the image-text search performance.
Observation 2: Performance on MIRFLICKR-25K. (1) For α = 0, the mAP values are the worst on both retrieval tasks, regardless of the value of µ. The main reason is that SMFH-QL can only learn the hash function from the common feature representation obtained by matrix factorization of the original image-text data. Thus, in this case, the label information does not affect the common feature representation, which makes the to-be-learned hash function weak and consequently affects the efficiency of the image-text search. (2) When the values of α vary in the range of [1e3, 1e4], the mAP values corresponding to µ are unstable. (3) When the value of α is in [10,100], SMFH-QL achieves the best performance on the three datasets for µ in [1e3,1e4].
Observation 3: Performance on NUS-WIDE. Similarly, the same observations can be found on NUS-WIDE. Thus, the quantitative loss term has a substantial effect on the classification loss term, which demonstrates the importance of the quantitative loss function term.
The six subfigures in Fig. 9 illustrate the mAP scores in the two retrieval tasks under different settings of α and β on the three benchmark datasets. From Fig. 9, we have the following observations. (1) When α is in the range of [0,1], the mAP results corresponding to parameter β are sensitive and unstable for the three datasets. For α = 0, the mAP corresponding to β achieves the worst scores because the generated hash code cannot be associated with the common feature representation. That is, the label information does not guide the process of learning the hash function, which impacts the retrieval performance. (2) From Fig. 9, except for Wiki, if α is in the range of [1e3,1e4], the mAP corresponding to β obtains worse scores when α is in [0,1]. One potential reason is that the quantitative loss term generates a large value of matrix V , which leads to redundant features in label information during learning of the hash function. (3) Another phenomenon is that the mAP for β in the range of [10,100] obtains the best scores when the value of α is in the range of [10,100], i.e., SMFH-QL achieves optimal performance.
In summary, the quantitative loss function term establishes a similarity correlation between the hash codes H and the common feature representation V , which leads to a more discriminative hash function being learned during the training procedure. In light of the aforementioned analysis, we confirm that the quantitative loss term yields a close connection between the classification loss term and the hash function term; i.e., the label information indirectly guides the procedure of learning the hash function. Finally, Figs. 8 and 9 directly verify the importance of the quantitative loss term and indirectly demonstrate the effectiveness of SMFH-QL in image-text searching.
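To make the role of the quantitative loss term concrete, the following minimal sketch measures how far a real-valued representation V lies from its binarized codes H = sign(V). The squared Frobenius distance used here is an assumption chosen for illustration; the exact form of the term in Eq. (9) is not reproduced in this section, and the helper names `quantize` and `quantization_loss` are hypothetical.

```python
def sign(x):
    # Map a real value to a code entry in {-1, +1}; zero is mapped to +1.
    return 1 if x >= 0 else -1


def quantize(V):
    # Element-wise sign of a real-valued matrix stored as a list of rows.
    return [[sign(v) for v in row] for row in V]


def quantization_loss(H, V):
    # Squared Frobenius distance ||H - V||_F^2 (assumed form, for illustration only).
    return sum((h - v) ** 2 for h_row, v_row in zip(H, V) for h, v in zip(h_row, v_row))


# Two toy representations: one already close to binary values, one far from them.
V_close = [[0.9, -1.1], [-0.8, 1.2]]
V_far = [[0.1, -0.2], [-0.1, 0.3]]

loss_close = quantization_loss(quantize(V_close), V_close)
loss_far = quantization_loss(quantize(V_far), V_far)
```

Under this assumed form, a representation whose entries already sit near ±1 loses little information when binarized, whereas entries near zero incur a large loss. This is one way to read the claim that the term ties H and V together.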

E. EMPIRICAL ANALYSIS
In the previous section, we conducted extensive experiments on the three datasets and employed the three most common evaluation metrics, i.e., mAP, the Precision-Recall curve and the topN-Precision curve. In this section, we perform an empirical analysis to validate the retrieval performance of the proposed SMFH-QL algorithm.
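For readers unfamiliar with these metrics, the sketch below shows one common way to compute mAP and topN-Precision under Hamming ranking. It assumes that a database item is relevant to a query when the two share at least one class label, which is a widespread convention in cross-modal hashing but is not stated in this section; all function names are hypothetical.

```python
def hamming(a, b):
    # Number of positions at which two equal-length codes differ.
    return sum(1 for x, y in zip(a, b) if x != y)


def rank_by_hamming(query_code, db_codes):
    # Database indices sorted by increasing Hamming distance (ties keep database order).
    return sorted(range(len(db_codes)), key=lambda i: hamming(query_code, db_codes[i]))


def average_precision(query_code, query_labels, db_codes, db_labels):
    # Mean of the precision values taken at the rank of each relevant item.
    hits, precision_sum = 0, 0.0
    for rank, i in enumerate(rank_by_hamming(query_code, db_codes), start=1):
        if query_labels & db_labels[i]:  # assumed relevance: at least one shared label
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(q_codes, q_labels, db_codes, db_labels):
    # Average of the per-query AP values.
    aps = [average_precision(c, l, db_codes, db_labels) for c, l in zip(q_codes, q_labels)]
    return sum(aps) / len(aps)


def top_n_precision(query_code, query_labels, db_codes, db_labels, n):
    # Fraction of relevant items among the n nearest database codes.
    top = rank_by_hamming(query_code, db_codes)[:n]
    return sum(1 for i in top if query_labels & db_labels[i]) / n


# Toy example with 4-bit codes and label sets.
db_codes = [(1, 1, -1, -1), (1, 1, -1, 1), (1, -1, 1, 1), (-1, -1, 1, 1)]
db_labels = [{"cat"}, {"dog"}, {"cat"}, {"dog"}]
query_code, query_labels = (1, 1, -1, -1), {"cat"}

ap = average_precision(query_code, query_labels, db_codes, db_labels)
p_at_2 = top_n_precision(query_code, query_labels, db_codes, db_labels, 2)
m_ap = mean_average_precision([query_code], [query_labels], db_codes, db_labels)
```

In the toy example the relevant items appear at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2 = 5/6, and the precision over the top two results is 1/2.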

1) CONVERGENCE ANALYSIS
The objective function of SMFH-QL is optimized via an iterative strategy in Algorithm 1. The convergence rate of this strategy is significant for the retrieval performance of SMFH-QL. Hence, we perform additional experiments on the same three datasets with the length of hash code fixed to 64 bits. The convergence curves are shown in Fig. 10. According to Figure 10, the following observations can be obtained.

2) TRAINING TIME ANALYSIS
To demonstrate the efficiency of SMFH-QL on large-scale datasets, we analyze and compare the training time (in seconds) of all comparison methods with different code lengths on MIRFLICKR-25K and NUS-WIDE. The experimental results are shown in Tables 6 and 7. For all comparison approaches, the training time includes the time for learning the hash function and hash code. (1) As shown in Table 6, LSSH takes the most time at every code length during the training procedure. SCM-seq consumes increasingly more time as the code length increases. CMFH and SMFH require less time than LSSH and SCM-seq but much more time than DCH, LCMFH, SMFH-NQL and SMFH-QL because these four methods directly guide the hash code or hash function procedure based on the semantic label, thereby avoiding the construction of pairwise similarity matrices, which requires a large amount of computation. (2) From Table 7, we see that LSSH, CMFH, SMFH, and SCM-seq require much more training time than DCH, LCMFH, SMFH-NQL and SMFH-QL. Additionally, the latter four methods have similar training times, which is the same as the results for MIRFLICKR-25K. Besides, LCMFH achieves the best results among the compared methods. In summary, however, SMFH-QL delivers the best retrieval performance while maintaining computational efficiency comparable to that of several state-of-the-art hashing-based image-text methods.
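The cost argument about pairwise similarity matrices can be illustrated with simple arithmetic. The sketch below compares the number of entries in an n × n pairwise similarity matrix with that of an n × c label matrix. The values of n and c, the 4-byte entry size, and the function names are hypothetical and are not taken from the datasets or methods compared here.

```python
def similarity_matrix_entries(n):
    # A pairwise similarity matrix stores one entry per pair of training samples.
    return n * n


def label_matrix_entries(n, c):
    # A label matrix stores one entry per (sample, class) pair.
    return n * c


# Hypothetical sizes chosen only for illustration.
n, c = 100_000, 20
bytes_per_entry = 4  # assumes 32-bit entries

pairwise_bytes = similarity_matrix_entries(n) * bytes_per_entry
label_bytes = label_matrix_entries(n, c) * bytes_per_entry
ratio = similarity_matrix_entries(n) / label_matrix_entries(n, c)
```

With these assumed sizes, the pairwise matrix needs 4 × 10^10 bytes while the label matrix needs 8 × 10^6 bytes, a ratio of n / c = 5000. The gap grows linearly with n, which is consistent with the observation that label-guided methods train faster on the larger datasets.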

3) PARAMETER ANALYSIS
In the previous experiments, we empirically set the values of all parameters on the three datasets. In this section, we conduct confirmatory experiments to demonstrate the influence of changing parameters λ, β, α, µ, γ on the proposed SMFH-QL. All parameter experiments are completed on the same three datasets with 64-bit hash codes, where ''I→T'' and ''T→I'' denote querying text by image and querying image by text, respectively. As shown in Fig. 11, we have the following observations.
• Parameter λ controls the influence of image-text modalities during the procedure of matrix factorization. We consider that the original image-text data play an identical role in matrix decomposition and keep common importance in SMFH-QL. Therefore, we set the value λ = 0.5 without concrete analysis.
• Parameter β influences the performance of learning the hash function term. From Fig. 11, it can be observed that SMFH-QL achieves the best results when β = 10 on the three datasets.
• Parameter µ controls the importance of label consistency and the generated hash codes in the final objective function. If µ is too small, it cannot make full use of label information to generate a discriminative hash code, which reduces the performance of SMFH-QL. If the value of µ is too large, the redundant features in the label matrix will be brought into the procedure of learning the hash code, reducing the quality of the hash code. Fig. 11 illustrates that SMFH-QL achieves the best results when µ = 10000 on Wiki and when µ is 1000 on MIRFLICKR-25K and NUS-WIDE.
• Parameter α influences the quantitative loss function term. If α is too small or too large, it degrades the performance of SMFH-QL. The main reason is that the common semantic representation V cannot be accurately represented via the learned binary code H . Therefore, we obtain α = 10 on Wiki and α = 100 on MIRFLICKR-25K and NUS-WIDE based on Fig. 11.
• Parameter γ controls the weight of the regularization term. Thus, when the value of γ is too small, the effect of the regularization term is reduced, which causes the training process of SMFH-QL to overfit. By contrast, when γ is too large, underfitting may occur. In addition, Fig. 11 shows that the mAP of SMFH-QL remains stable on MIRFLICKR-25K and NUS-WIDE but tends to decline on Wiki when γ is more than 0.1. One possible reason is that the training sets in Wiki are small, which results in overfitting.
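The one-parameter-at-a-time procedure behind this kind of analysis can be sketched as follows. The default values are the NUS-WIDE settings reported for Eq. (9); everything else is an assumption: `toy_map` is a hypothetical placeholder that stands in for training SMFH-QL and measuring mAP, and the logarithmic grid is illustrative rather than the grid used in Fig. 11.

```python
import math

# Settings {λ, β, α, µ, γ} reported for NUS-WIDE.
DEFAULTS = {"lam": 0.5, "beta": 10.0, "alpha": 100.0, "mu": 1000.0, "gamma": 0.1}


def toy_map(params):
    # Hypothetical stand-in for "train the model, then measure mAP".
    # It simply peaks when every parameter equals its default value.
    gap = sum((math.log10(params[k]) - math.log10(DEFAULTS[k])) ** 2 for k in DEFAULTS)
    return 1.0 / (1.0 + gap)


def sweep(name, grid, evaluate, base=DEFAULTS):
    # Vary one parameter over the grid while holding the others at their base values.
    results = []
    for value in grid:
        params = dict(base)
        params[name] = value
        results.append((value, evaluate(params)))
    return results


# Illustrative logarithmic grid: 1, 10, ..., 1e5.
grid = [10.0 ** e for e in range(0, 6)]
alpha_curve = sweep("alpha", grid, toy_map)
best_alpha = max(alpha_curve, key=lambda pair: pair[1])[0]
```

Replacing `toy_map` with a function that trains the model and returns the measured mAP would yield one curve per parameter, comparable in shape to a single panel of a sensitivity plot.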

V. CONCLUSION
In this paper, we propose a novel discrete supervised matrix factorization hashing-based method for image-text searching. The proposed SMFH-QL can learn discriminative hash codes because of two contributions: (1) it directly generates hash codes and a common feature representation by employing semantic class and matrix factorization, respectively; (2) it constructs a strong similarity correlation between the common feature representation and hash codes when performing quantization. Extensive experiments on three widely used benchmark datasets demonstrate that SMFH-QL substantially outperforms several state-of-the-art hashing-based image-text search methods. In future work, we plan to integrate the proposed SMFH-QL with manifold embedding learning to capture the real local structure in the original image-text data, which can be used to generate more compact hash codes.