Non-Co-Occurrence Enhanced Multi-Label Cross-Modal Hashing Retrieval Based on Graph Convolutional Network

Supervised cross-modal retrieval has significant advantages in retrieval efficiency and storage cost. In the field of hashing retrieval, existing supervised methods are divided into single-label and multi-label methods. For single-label methods, using only one label to measure the semantic relevance between instances introduces errors into the supervision information. Existing multi-label hashing methods have problems of their own. For example, considering only the co-occurrence of labels among instances may not accurately reflect their similarity. In addition, previous methods do not process the text modality at as fine a level as the image modality, so text information is underused. To address these issues, we propose Non-co-occurrence enhanced Multi-label cross-modal hashing retrieval based on Graph Convolutional Network (MHGCN). First, we introduce a multi-label non-co-occurrence similarity measure, which adds multi-label non-co-occurrence information among instances to the multi-label similarity measurement in order to capture the differences between instances. Second, we use Graph Convolutional Networks (GCNs) to process the text modality. Third, we introduce a memory mechanism to constrain hash code learning. Extensive experiments show that the proposed method performs well. On three widely used datasets (NUS-WIDE, MIRFlickr-25K, IAPR TC-12), MAP on the image-to-text and text-to-image tasks improves by about 8%, 9%, and 7%, respectively.


I. INTRODUCTION
Since the advent of the network era, and especially the era of big data, many fields have become intertwined with the Internet, and multimodal data (e.g., videos, texts, images, and audio) has grown explosively. Cross-modal retrieval [13], [14], [15], [16] starts from data in one modality and finds relevant information in other modalities (e.g., retrieving videos by querying texts). Because the multimodal data of an instance describes the instance from different dimensions, there is a semantic gap. Therefore, bridging the semantic gap and obtaining a common semantic description is a great challenge. To this end, scholars have developed hashing retrieval technology [8], [9], [14], [47], one of the most effective and popular approaches, which aims to obtain similar hash representations by mapping the different modalities of an instance to Hamming space.

The associate editor coordinating the review of this manuscript and approving it for publication was Yassine Maleh.
Hash codes are widely used in many areas of computing. Mapping the original data to Hamming space through a hash function is fast and has low computational and storage costs. Early cross-modal hashing methods [7], [10], [11], [12], [17], [19], [20], [21], [22], [23], [24] are based on hand-crafted features and use simple architectures that cannot extract deep semantic features well. Therefore, the accuracy of their retrieval results cannot be further improved. The outstanding performance of neural networks stems from their ability to extract high-level features from original sensory data and to capture effective representations of instances. So far, various methods that apply deep neural networks to cross-modal hashing retrieval have been proposed. In the field of cross-modal hashing retrieval, much research focuses on supervised and unsupervised retrieval. The difference between the two research directions is whether pre-annotated labels are used. In unsupervised methods, the instance features extracted by the network are used to build an affinity matrix that guides network training [4], [5], [6], [7]. In supervised methods, the label information is used directly as strong supervision in the training process [29], [30], [31], [32], [33], [34], [35], [36], [37], [38].
Due to the strong representation ability of graphs, graph-based hashing has been widely studied. Traditionally, affinity graphs are used as a guide in the learning process. However, model training then requires a global similarity measurement, so the time cost is very large. For this reason, recent research has tried to bring graphs into the feature learning process itself to extract more semantic information, as in Graph Convolutional Hashing (GCH) [47] and Aggregation-based Graph Convolutional Hashing for Unsupervised Cross-modal Retrieval (AGCH) [1]. Specifically, GCH adds a Graph Convolutional Network (GCN) to the learning framework and uses it to explore the inherent similarity structure between data points, which helps to generate discriminative hash codes. In AGCH, the intrinsic information embedded in each modality is combined through graph convolution to aggregate the complementary semantic information in different modalities.
In real life, everything is multifaceted. It is impossible to distinguish similarities effectively when only one label is used to describe each instance, which may lead to suboptimal retrieval results. In fact, most instances have multiple labels, and the labels shared between paired instances can serve as supervised information that describes the semantic similarity between instances more accurately. The similarity of an instance pair can be measured by its number of co-occurrence labels: the more labels two instances share, the more similar they are, and the fewer they share, the less similar they are (Figure 1).

FIGURE 1.
From the perspective of a single label, the similarity between instances a, b and a, c is the same, which is unreasonable.
However, even if the number of co-occurrence labels between two pairs of instances is equal, their similarity should be different. Inspired by MDMCH [2], we add multi-label non-co-occurrence information between instances to the multi-label similarity measurement. If instances a and b have the same number of shared labels as instances a and c, but the number of non-co-occurrence labels between instances a and c is less than that between instances a and b, then we have reason to think that the similarity of the latter is greater than that of the former (Figure 2).

FIGURE 2.
From the perspective of multiple labels, even if a, b, and a, c share the same number of labels, the similarity should also be different.
To this end, we propose Non-co-occurrence enhanced Multi-label cross-modal hashing retrieval based on Graph Convolutional Network (MHGCN) for cross-modal multi-label hashing retrieval. We use non-co-occurrence information between instances to enhance our similarity matrix. To process the text modality at as fine a level as the image modality, we use a Graph Convolutional Network [78] to mine semantic features and retain as much as possible of the semantic information between instances in the original space.
The contributions of our MHGCN are the following:
1: We introduce a multi-label non-co-occurrence similarity measurement method, which adds multi-label non-co-occurrence information among instances to the multi-label similarity measurement to enhance the similarity matrix. Therefore, we can judge the similarity between instances more accurately.
2: Because graph networks have strong representation ability, we introduce Graph Convolutional Networks [78] into our proposed model (MHGCN). Therefore, our model can fully mine the semantic information in the text, which helps our model learn the hash codes.
3: In addition, we introduce a memory bank [70] to retain the hash codes generated in our learning process. Therefore, we can constrain the hash representation over the whole training process, not only within the mini-batch.
4: Our model performs better on the three benchmark datasets in most cases than the most recent competitive work. This indicates that our model can better extract the semantic features of an instance and generate hash codes with richer semantic information, which is conducive to downstream tasks.
The remaining sections are organized as follows. We review the related work in Section II. In Section III, we introduce our method (MHGCN) and give the symbol definitions. Section IV gives a detailed description of the optimization algorithm of our framework. We describe the experimental analysis and results in Section V. We give our conclusions in Section VI.

II. RELATED WORK
The rapid development of the Internet connects the whole world, and a large amount of multimodal data is released daily. As an active research field, cross-modal hashing retrieval has been widely studied, and a large number of efficient methods have been proposed. According to whether pre-annotated labels are used, cross-modal hashing methods can be divided into supervised and unsupervised methods. Unsupervised methods usually use affinity matrices to constrain the generation of consistent hash codes. Representative unsupervised cross-modal hashing methods include Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval (DJSRH) [4], Semantic Topic Multimodal Hashing for Cross-Media Retrieval (STMH) [7], Unsupervised Contrastive Cross-modal Hashing (UCCH) [5], and Unsupervised Deep Cross-modal Hashing with Virtual Label Regression (UDCH-VLR) [6].
VOLUME 11, 2023
Unlike unsupervised methods, supervised methods usually use pre-annotated labels to construct the similarity matrix, which guides the training process. Representative methods include, but are not limited to, the following. Cross-modality Metric Learning using Similarity-Sensitive Hashing (CMSSH) [19] embeds incommensurable data into a common metric space. Semantics-Preserving Hashing for Cross-View Retrieval (SePH) [20] standardizes all Hamming distances by transforming each into a probability that depends on all the others, thereby combining the correlations between Hamming distances. SCM [21] seamlessly integrates semantic labels into the hash learning process for large-scale data modeling. Generalized Semantic Preserving Hashing for cross-modal retrieval (GSPH) [22] uses kernel logistic regression. Although the above methods are effective, they are all based on hand-crafted features. They cannot extract the deeper semantic features of an instance, which causes inaccuracy in the training process and leads to suboptimal experimental results.

A. SINGLE-LABEL METHOD
With the improvement of hardware performance, deep learning has spread to many other fields. Deep Cross-Modal Hashing (DCMH) [29] introduces deep neural networks into cross-modal hashing. In DCMH [29], an image network and a text network extract features from the respective modalities, and a negative log-likelihood function is used as the loss to be optimized. Adversary Guided Asymmetric Hashing for Cross-Modal Retrieval (AGAH) [30] introduces adversarial guidance into end-to-end hashing learning and obtains consistent hash codes through the adversarial interplay between the text and image outputs. Pair-wise relationship guided deep hashing for cross-modal retrieval (PRDH) [31] integrates different types of pairwise constraints to encourage the similarity of the hash codes from an intra-modal view and an inter-modal view, respectively. Cross-modal Hamming hashing (CMHH) [32] achieves efficient retrieval by penalizing the instance pairs whose Hamming distance is greater than a threshold. Correlation hashing network for efficient cross-modal retrieval (CHN) [33] optimizes the maximum margin loss on similar pairs.

B. MULTI-LABEL METHOD
Due to the powerful learning ability of deep neural networks, the above methods perform well.
However, almost all of them calculate the similarity between instances from a single label, which causes much fine-grained semantic information to be ignored. Using multiple labels can enrich the semantic features extracted by the network. Improved Deep Hashing with Soft Pairwise Similarity for Multi-label Image Retrieval (IDHN) [34] uses soft and hard similarity to distinguish the semantic similarity between instances. Deep Multi-Level Semantic Hashing for Cross-Modal Retrieval (DMSH) [35] uses multi-level semantic similarity to construct the similarity matrix. Self-supervised adversarial hashing networks for cross-modal retrieval (SSAH) [36] uses multiple labels as supervisory information. Multi-label semantics preserving based deep cross-modal hashing (MLSPH) [3] defines a new similarity calculation method to utilize multi-label information. However, even instance pairs with the same measured similarity should be distinguishable rather than treated as identical. Multiple deep neural networks with multiple labels for cross-modal hashing retrieval (MDMCH) [2] measures the difference between instances by calculating semantic factors.

C. GRAPH CONVOLUTIONAL NETWORK
A graph neural network (GNN) regards each data point as a node and uses an adjacency matrix to measure the relationships between data points. Since GNN [13] was first proposed, it has attracted extensive research interest in classification, association prediction, and other fields. In a GNN, each iteration updates a node using the features of its neighbours, so that the information of the neighbours is eventually aggregated and the relationships between data points are captured. Nevertheless, the features of the node itself are easily ignored during the iterations. GCN [78] was proposed to solve this problem. GCN strengthens a node's own features while weakening the information of its neighbours in the aggregation process so that the features of the data can be extracted well. Other representative work includes GraphSAGE [14], Graph Generative Networks (DGMG) [16], MolGAN [79], and Graph Attention Networks (GATs) [15]. Specifically, GATs are spatial graph convolutional networks that use attention in the aggregation process and can amplify the impact of the most important parts of the data. GraphSAGE leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. DGMG generates one node of the graph in each iteration and makes a decision after each node is added. If the decision is positive, it selects nodes from the existing nodes and adds edges. When finished, DGMG updates the representation of the graph. MolGAN [79] improves the authenticity of the generated objects through the competition between the discriminators and the generators. In this paper, to process the text modality at as fine a level as the image modality, we select GCN to improve the processing of text information.
Inspired by MDMCH [2], we use multi-label non-co-occurrence information to enhance the similarity matrix, which makes our similarity matrix more fine-grained. Moreover, we introduce a Graph Convolutional Network to improve our representation of text features. To ensure the generation of hash codes with correct semantic information, we reduce the error between semantic feature similarity and label similarity by using a mean square error loss.

III. METHOD

A. PROBLEM DEFINITION AND NOTATION
We give the definition of symbols used throughout the paper. We use uppercase letters, e.g., $V$, to represent matrices, and lowercase letters, e.g., $v$, to represent vectors. Row $i$ and column $j$ of matrix $V$ are respectively expressed as $V_{i*}$ and $V_{*j}$. Each instance $o_i$ has a label vector $l_i = [l_{i1}, \ldots, l_{ic}]$, where $c$ represents the number of label categories of the instance. $l_{ik} = 1$ represents that the instance $o_i$ belongs to semantic category $k$; otherwise $l_{ik} = 0$. $\mathrm{sign}(\cdot)$ is a sign function defined as:

$$\mathrm{sign}(x) = \begin{cases} 1, & x \ge 0 \\ -1, & \text{otherwise} \end{cases}$$

Traditionally, $s_{ij} = 1$ represents that instances of different modalities share at least one identical semantic class; otherwise $s_{ij} = 0$. Obviously, this definition does not help to distinguish the degree of similarity between instances.
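As a small illustration of the notation above, the following Python sketch implements a sign function and the traditional 0/1 similarity. The function names and the toy label vectors are assumptions made for illustration, and the convention sign(0) = +1 is assumed here rather than taken from the paper.

```python
def sign(x):
    """Sign function for binarisation (assumed convention: +1 for x >= 0, else -1)."""
    return 1 if x >= 0 else -1


def hard_similarity(l_i, l_j):
    """Traditional 0/1 similarity: 1 if the two label vectors share at least one class."""
    return 1 if any(a == 1 and b == 1 for a, b in zip(l_i, l_j)) else 0


# Hypothetical label vectors over c = 4 classes.
labels_a = [1, 1, 0, 0]
labels_b = [1, 0, 0, 0]  # shares one class with a
labels_c = [1, 1, 0, 1]  # shares two classes with a
labels_d = [0, 0, 1, 0]  # shares no class with a

s_ab = hard_similarity(labels_a, labels_b)
s_ac = hard_similarity(labels_a, labels_c)
s_ad = hard_similarity(labels_a, labels_d)
```

Both `s_ab` and `s_ac` evaluate to 1 even though the pairs share different numbers of labels, which is the limitation of the hard similarity that the text points out.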
In IDHN [34] and ISDH [37], the pairwise similarity is divided into soft similarity and hard similarity. The similarity between instances can be obtained by calculating the distance between their labels with the cosine function. Similarly, the Hamming distance between instances can be calculated with the cosine function.
DMSH [35] uses 'multi-level semantic similarity' to construct the similarity matrix to preserve the semantic information.
In MLSPH [3], the similarity matrix is constructed with a method similar to Intersection over Union.
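To make the two kinds of soft similarity mentioned above concrete, here is a minimal Python sketch of a cosine similarity between label vectors and an intersection-over-union style similarity. These are generic textbook forms under assumed function names; they are not claimed to be the exact formulas of IDHN, ISDH, or MLSPH.

```python
import math


def cosine_label_similarity(l_i, l_j):
    """Cosine of the angle between two binary label vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(l_i, l_j))
    norm = math.sqrt(sum(a * a for a in l_i)) * math.sqrt(sum(b * b for b in l_j))
    return dot / norm if norm > 0 else 0.0


def iou_label_similarity(l_i, l_j):
    """Shared labels divided by labels present in at least one of the two instances."""
    inter = sum(1 for a, b in zip(l_i, l_j) if a == 1 and b == 1)
    union = sum(1 for a, b in zip(l_i, l_j) if a == 1 or b == 1)
    return inter / union if union > 0 else 0.0


# Hypothetical label vectors over five classes.
l_1 = [0, 1, 1, 0, 1]
l_2 = [1, 0, 0, 0, 1]

cos_12 = cosine_label_similarity(l_1, l_2)  # 1 / sqrt(3 * 2)
iou_12 = iou_label_similarity(l_1, l_2)     # 1 shared label out of 4 present labels
```

Both values lie strictly between 0 and 1, so they grade the degree of similarity instead of collapsing it to a binary decision.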
With the above methods, the similarity between instances is no longer simply 0 or 1 but a number between 0 and 1, which helps to distinguish the degree of similarity between different instances. Inspired by MDMCH [2], we use semantic factors as multi-label non-co-occurrence information to enhance the similarity matrix, which makes our similarity matrix more fine-grained.
where $l_i$ is the label vector of instance $o_i$, $|l_i|$ represents the length of label vector $l_i$, and $\|l_i - l_j\|_F^1$ is the number of non-co-occurrence class labels between instances $o_i$ and $o_j$.
In this way, we can use non-co-occurrence information between instances. For example, consider two instances $o_1$ and $o_2$ with label vectors $l_1 = \{0, 1, 1, 0, 1\}$ and $l_2 = \{1, 0, 0, 0, 1\}$. Neither instance belongs to the fourth semantic class. Under the traditional definition, this non-co-occurrence information is not counted toward the similarity of the two instances. However, we believe that if both instances do not belong to the same semantic class, this can also be regarded as a kind of similarity information.
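The example can be checked with a short Python sketch that counts, for a pair of label vectors, the classes both instances have, the classes neither has, and the classes on which they differ. The function name is an assumption, and the sketch deliberately stops at counting; how these counts enter the similarity matrix is defined by the paper's own equations.

```python
def label_agreement_counts(l_i, l_j):
    """Return (both_present, both_absent, differing) for two binary label vectors."""
    both_present = sum(1 for a, b in zip(l_i, l_j) if a == 1 and b == 1)
    both_absent = sum(1 for a, b in zip(l_i, l_j) if a == 0 and b == 0)
    differing = sum(1 for a, b in zip(l_i, l_j) if a != b)
    return both_present, both_absent, differing


# Label vectors from the example in the text.
l_1 = [0, 1, 1, 0, 1]
l_2 = [1, 0, 0, 0, 1]

counts = label_agreement_counts(l_1, l_2)
```

For this pair, the two instances share the fifth class, both lack the fourth class, and disagree on the first three classes.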
Therefore, we can use non-co-occurrence information and traditional similarity to construct our similarity matrix $S_{new}$ as follows:

For simplicity, we use $S$ instead of $S_{new}$ in the following.
C. FRAMEWORK
Fig. 3 shows the framework of our model (MHGCN), which mainly includes two networks for feature learning and a multi-label non-co-occurrence information enhancement method for building the similarity matrix. The image network extracts image information to generate the hash representation of images. Similarly, the text network extracts text information to generate the hash representation of texts. We use the non-co-occurrence label information between instances to optimize the similarity matrix and make full use of the label information to guide network learning. At the same time, the similarity matrix is input into the text network as the adjacency matrix.
For the text network, inspired by AGCH [1], we use a Graph Convolutional Network (GCN) to mine text semantic features, and input the similarity matrix formed from the labels into the GCN as the adjacency matrix. Because of the powerful representation ability of the Graph Convolutional Network, we can obtain more robust text hash codes. For the image network, we use ResNet34 [41], because the shortcut connections in its residual blocks alleviate the vanishing gradient problem caused by increasing depth in deep neural networks, and because it performs well in image classification.
We use cur_f to represent the image features of an instance extracted from the image network $f(o^v; \theta_v)$. $F \in \mathbb{R}^{k \times N}$ denotes the hash representation of the images stored in the memory bank. Similarly, cur_g represents the text features extracted from the GCN network $g(o^t; \theta_t)$. $G \in \mathbb{R}^{k \times N}$ denotes the hash representation of the texts stored in the memory bank. $N$ and $k$ represent the number of instances and the length of the final hash code, respectively. The propagation rule of the Graph Convolutional Network (GCN) is written as follows:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

where $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$ and $W^{(l)}$ is a layer-specific weight matrix. $\sigma(\cdot)$ denotes an activation function. $H^{(l)} \in \mathbb{R}^{d \times m}$ is the matrix of activations in the $l$-th layer. The binary codes $B_I$ and $B_T$ are generated as follows:

$$B_I = \mathrm{sign}(f(o^v; \theta_v)), \qquad B_T = \mathrm{sign}(g(o^t; \theta_t))$$

where $\theta_v$ and $\theta_t$ represent the parameters of the networks. To avoid the back-propagation gradient problem caused by the $\mathrm{sign}(\cdot)$ function in the training process, we use $\tanh(\cdot)$ to replace it. From Eq. (7) and (8), we can see that in the graph convolutional network, an instance is regarded as a node, and the features of a node are updated through its neighbours. Through the weighted summation over the neighbours of a node, the features of its neighbours are propagated to the node, so the features of adjacent nodes become closer in the feature space. Therefore, the generated hash codes reflect the relationships of instances in the feature space well.
The construction of the adjacency matrix, which was elaborated above, is crucial for node learning in the graph convolutional network.
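The propagation step described above can be sketched in plain Python for small matrices. This is an illustrative sketch of the standard symmetric-normalised graph convolution, not the authors' implementation: the function names are assumptions, node features are stored one row per node, the adjacency matrix is assumed to already contain self-connections, and tanh is used as the activation. A practical implementation would use a tensor library such as PyTorch.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, h, w):
    """One graph convolution: tanh(D^-1/2 * A * D^-1/2 * H * W).

    adj: n x n adjacency (here: a similarity matrix with self-connections)
    h:   n x d node features, one row per node
    w:   d x m layer weights
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    norm_adj = [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)]
                for i in range(n)]
    z = matmul(matmul(norm_adj, h), w)
    return [[math.tanh(v) for v in row] for row in z]


# Toy example: two text instances with label similarity 0.5.
adjacency = [[1.0, 0.5], [0.5, 1.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]

hidden = gcn_layer(adjacency, features, weights)
```

After one layer, each node's representation contains a share of its neighbour's features, which is how similar instances are pulled closer together in the feature space.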

D. HASH LEARNING
The quality of hash codes determines the accuracy of retrieval to a certain extent. Therefore, it is a challenge to obtain hash codes that not only contain rich semantic information but can also distinguish semantically similar instances. We constrain the network learning process through three kinds of losses. To bridge the semantic gap, we use an inter-modal loss to reduce the semantic difference between the different modalities of the same instance. For similar instances, we use an intra-modal loss and a quantization loss to generate discriminative hash codes. We calculate the cosine similarity between hash representations to represent the semantic similarity between instances learned by the model. By continuously reducing the mean square error loss, we obtain hash codes that retain more and more semantic information. Cosine similarity is defined as follows:

$$\cos(f_i, g_j) = \frac{\langle f_i, g_j \rangle}{\|f_i\|_{L2}\,\|g_j\|_{L2}}$$

The range of cosine similarity $\cos(f_i, g_j)$ is $[-1, 1]$, $\|\cdot\|_{L2}$ is the $L_2$ norm, and $\langle \cdot, \cdot \rangle$ represents the inner product. Obviously, the lower the similarity of two instances, the lower the cosine similarity of their hash representations. The intra-modal pairwise loss consists of two parts, the image-to-image pairwise loss and the text-to-text pairwise loss, which are defined as follows:

$$L_{intra}^{v} = \sum_{i,j=1}^{N} \left(\cos(f_i, f_j) - S_{ij}\right)^2 \tag{10}$$

$$L_{intra}^{t} = \sum_{i,j=1}^{N} \left(\cos(g_i, g_j) - S_{ij}\right)^2 \tag{11}$$

where $N$ represents the number of training instances. $S$ in Eq. (10) and (11) represents the similarity between instances. For example, $S_{ij}$ represents the similarity between instance $i$ and instance $j$. Through the intra-modal loss, we preserve the similarity between different instances within the same modality.
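The cosine-based pairwise loss can be sketched in plain Python as follows. The function names are assumptions, and the loss is written as an unnormalised sum of squared differences over all pairs, which is one reading of the description; the paper's exact scaling may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def pairwise_cosine_loss(x, y, s):
    """Sum over all pairs (i, j) of (cos(x_i, y_j) - S_ij)^2.

    Passing the same representations for x and y gives an intra-modal loss;
    passing image and text representations gives a cross-modal variant.
    """
    return sum((cosine(x[i], y[j]) - s[i][j]) ** 2
               for i in range(len(x)) for j in range(len(y)))


# Toy example: two instances that the labels declare dissimilar.
similarity = [[1.0, 0.0], [0.0, 1.0]]
good_reps = [[1.0, 1.0], [1.0, -1.0]]  # orthogonal, matches the labels
bad_reps = [[1.0, 1.0], [1.0, 1.0]]    # identical, contradicts the labels

loss_good = pairwise_cosine_loss(good_reps, good_reps, similarity)
loss_bad = pairwise_cosine_loss(bad_reps, bad_reps, similarity)
```

Representations whose cosine similarities agree with the label similarity give a loss near zero, while collapsed representations are penalised on every dissimilar pair.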
Meanwhile, the inter-modal loss is defined as follows:

We introduce a quantization loss to smooth the difference between hash codes and hash representations while reducing the distance between them.
where $f_i$ represents the hash representation extracted from the image network, and $f_{ij}$ represents its $j$-th value. Similarly, $g_i$ represents the hash representation extracted from the text network, and $g_{ij}$ represents its $j$-th value. We use $b_{ij}$ to represent the $j$-th bit of the unified hash code $b_i$. From Eq. (10), (11), (12), (13) and (14), we obtain the final objective function:

$\alpha$, $\beta$ and $\gamma$ are the hyper-parameters used for loss calculation.
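To illustrate what a quantization term of this kind typically looks like, the following Python sketch sums the squared differences between each bit of the unified code and the corresponding values of the image and text representations. This squared-difference form and the function name are assumptions for illustration and may not match the paper's quantization equations.

```python
def quantization_loss(f, g, b):
    """Assumed form: sum over i, j of (b_ij - f_ij)^2 + (b_ij - g_ij)^2."""
    total = 0.0
    for f_i, g_i, b_i in zip(f, g, b):
        for f_ij, g_ij, b_ij in zip(f_i, g_i, b_i):
            total += (b_ij - f_ij) ** 2 + (b_ij - g_ij) ** 2
    return total


# Toy example: one instance with a 2-bit code.
image_reps = [[0.9, -0.8]]
text_reps = [[0.7, -0.6]]
codes = [[1, -1]]

q_loss = quantization_loss(image_reps, text_reps, codes)
```

The loss shrinks as the continuous representations approach the binary values, which reduces the information lost when the representations are finally binarised.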
In addition, we introduce a memory bank [70] to memorize the hash representations in each training batch. In each training batch, we use the up-to-date hash representations and label constraints between instances for loss calculation to retain the semantic information in instances and the semantic relevance between instances.
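A memory bank of this kind can be sketched as a table with one slot per training instance that is overwritten whenever the instance appears in a mini-batch. The class and method names below are assumptions, and the representations are stored one row per instance for readability; this is a sketch of the general idea, not the authors' implementation.

```python
class MemoryBank:
    """Keeps the most recent hash representation of every training instance."""

    def __init__(self, num_instances, code_length):
        self.bank = [[0.0] * code_length for _ in range(num_instances)]

    def update(self, indices, representations):
        """Overwrite the slots of the instances in the current mini-batch."""
        for idx, rep in zip(indices, representations):
            self.bank[idx] = list(rep)

    def snapshot(self):
        """Return a copy of all stored representations, including stale ones."""
        return [list(row) for row in self.bank]


# Toy example: 4 training instances, 2-bit representations.
bank = MemoryBank(num_instances=4, code_length=2)
bank.update([1, 3], [[0.5, -0.5], [1.0, 1.0]])
stored = bank.snapshot()
```

Because the bank holds representations for all instances, a loss can compare the current mini-batch against the whole training set rather than only against the other instances in the batch.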

IV. OPTIMIZATION
We optimize the parameters ($\theta_v$, $\theta_t$, $B$) with an alternating strategy: in each iteration, one parameter is updated while the others are fixed.
The optimization algorithm of our model (MHGCN) is summarized in Algorithm 1.

Algorithm 1 MHGCN
4: for iter = 1 to num_x do
5:     Randomly sample n_v instances from O to construct a mini-batch;
6:     For each sampled instance o_i in the mini-batch, calculate f_i by forward propagation;
7:     Calculate the derivative using Eq. (16);
8:     Update parameters θ_v by using back propagation;
9: end for
10: for iter = 1 to num_y do
11:     Randomly sample n_t instances from O to construct a mini-batch;
12:     For each sampled instance o_i in the mini-batch, calculate g_i by forward propagation;
13:     Calculate the derivative using Eq. (17);
14:     Update parameters θ_t by using back propagation;
15: end for
16: Update B using Eq. (18);
17: until convergence.
A. UPDATING θ_v
By fixing $\theta_t$ and $B$, we can learn the parameters $\theta_v$ of the image network through stochastic gradient descent (SGD) with back-propagation (BP). Each time, we randomly select a mini-batch from the training set for loss calculation. We calculate the gradient as follows and update the parameters of the network through back-propagation.
Then, the chain rule is used to calculate $\frac{\partial L}{\partial \theta_v}$ from $\frac{\partial L}{\partial f_{*}}$. Finally, the parameters $\theta_v$ are updated by back-propagation.

B. UPDATING θ_t
As with $\theta_v$, by fixing $\theta_v$ and $B$, we can learn the parameters $\theta_t$ of the text network through stochastic gradient descent (SGD) with back-propagation (BP).
Then, the chain rule is used to calculate $\frac{\partial L}{\partial \theta_t}$ from $\frac{\partial L}{\partial g_{*}}$. Finally, the parameters $\theta_t$ are updated by back-propagation.

C. UPDATING B
When θ v and θ t are fixed, we can update B as follows:
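For orientation only, the following Python sketch shows one common way to refresh a unified binary code from fixed image and text representations, namely taking the element-wise sign of their sum. This rule is an assumption borrowed from common practice in deep cross-modal hashing and is not claimed to be the paper's Eq. (18); the function name is likewise hypothetical.

```python
def update_unified_codes(f, g):
    """Assumed rule: b_ij = sign(f_ij + g_ij), with sign(0) taken as +1."""
    return [[1 if (f_ij + g_ij) >= 0 else -1 for f_ij, g_ij in zip(f_i, g_i)]
            for f_i, g_i in zip(f, g)]


# Toy example: two instances with 2-bit representations.
image_reps = [[0.9, -0.2], [-0.4, 0.1]]
text_reps = [[0.3, -0.7], [0.5, -0.6]]

unified = update_unified_codes(image_reps, text_reps)
```

When the two modalities disagree on a bit, the modality with the larger magnitude decides the bit under this rule.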

D. OUT-OF-SAMPLE EXTENSION
For instances that are not in the training set, we can easily obtain the hash code through the trained model. Specifically, for the image modality of a query instance, we obtain the hash code through forward propagation as follows:

$$b^v = \mathrm{sign}(f(o^v; \theta_v))$$

Similarly, we obtain the hash code for the text modality of a query instance as follows:

$$b^t = \mathrm{sign}(g(o^t; \theta_t))$$

Then, we obtain the corresponding retrieval results by calculating the distance between hash codes.
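The retrieval step can be sketched in plain Python: binarise the network output of the query, compute the Hamming distance to every database code, and rank the database by that distance. The function names and toy data are assumptions for illustration.

```python
def binarize(representation):
    """Turn a real-valued hash representation into a code in {-1, +1}."""
    return [1 if v >= 0 else -1 for v in representation]


def hamming_distance(code_a, code_b):
    """Number of positions where two codes differ."""
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def rank_by_hamming(query_code, database_codes):
    """Database indices sorted by increasing Hamming distance to the query."""
    return sorted(range(len(database_codes)),
                  key=lambda i: hamming_distance(query_code, database_codes[i]))


# Toy example: a text query against three image codes of length 4.
query_code = binarize([0.8, -0.3, 0.1, -0.9])
database = [[1, 1, 1, 1], [1, -1, 1, -1], [-1, -1, 1, -1]]

ranking = rank_by_hamming(query_code, database)
```

Here the second database item matches the query code exactly and is returned first, followed by the items at Hamming distance 1 and 2.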

V. EXPERIMENTS
In this section, we detail the evaluation metrics and the datasets used in the experiments. We compare our method (MHGCN) with competitive existing methods and discuss its performance.

A. DATASETS
The IAPRTC-12 dataset is a commonly used dataset in the field of cross-modal retrieval, which contains 20,000 instances. Since one of the instances is not labeled, we use the remaining 19,999 instances as our experimental data. For each instance, the image is resized to 224 × 224 × 3 and the text is converted to a 1,251-dimensional bag-of-words vector. We use 10,000 instances to build the training set, 2,000 to build the query set, and the rest as the database. The three subsets are mutually exclusive.
The NUS-WIDE dataset is a commonly used cross-modal retrieval dataset, which contains 269,648 instances and 81 semantic class labels. After removing some instances with incorrect labels, we selected 190,421 instances, whose labels belong to the 21 most frequent semantic categories. The text of each instance is converted to a 1,000-dimensional bag-of-words vector. We use 10,500 instances to build the training set, 2,100 to build the query set, and the rest as the database. The three subsets are mutually exclusive.
We selected 20,015 instances from the MIRFLICKR-25K dataset, which contains 25,000 instances, as our experimental dataset. The text of each instance is converted to a 1,386-dimensional bag-of-words vector. We use 10,000 instances to build the training set, 2,000 to build the query set, and the rest as the database. The three subsets are mutually exclusive.

B. EVALUATION PROTOCOL
In cross-modal retrieval, we aim to return data of another modality given data of one modality of an instance. For example, given a sentence describing an instance, the described picture is returned. To measure the retrieval accuracy of the model, we use the mean average precision (MAP) as the evaluation metric; a higher MAP value indicates higher retrieval accuracy. Similarly, the top-N curve reflects the retrieval precision of the model for different numbers of returned results. The area enclosed by the precision-recall curve (PR curve) reflects the performance of the model. The average precision (AP) is defined as follows:

where $i$ represents the instance used for the query, $n_r$ indicates the number of instances waiting to be retrieved in the database, and $N$ represents the returned query results. $p_i$ represents the probability that the query instance is similar to the top $i$ retrieval results, and $t(i) = 1$ or $0$ indicates similarity or dissimilarity. From the average precision (AP), we get the definition of MAP as follows:

The MAP reflects the precision of model retrieval. The retrieval precision of the model for different numbers of returned results constitutes the top-N precision curve. The precision-recall curve is used to measure the accuracy of the hash lookup protocol.
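As a reference point for the metric, the following Python sketch computes average precision from a ranked list of 0/1 relevance flags and averages it over queries. It uses the common convention of normalising by the number of relevant items among the returned results, which is an assumption here; the function names are also hypothetical.

```python
def average_precision(relevance):
    """AP of one ranked result list; relevance[r] is 1 if the item at rank r+1 is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(relevance_lists):
    """MAP: mean of the per-query AP values."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)


# Toy example: two queries with ranked relevance flags.
query_1 = [1, 0, 1, 0]  # relevant at ranks 1 and 3
query_2 = [0, 1]        # relevant at rank 2

ap_1 = average_precision(query_1)
ap_2 = average_precision(query_2)
map_value = mean_average_precision([query_1, query_2])
```

For the first query, the precisions at the two relevant ranks are 1/1 and 2/3, so its AP is their mean; the second query has a single relevant item at rank 2, giving an AP of 1/2.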

C. MODEL EVALUATION
We compare our method with existing state-of-the-art methods, including CMSSH [19], SePH [20], SCM [21], GSPH [22], DCMH [29], PRDH [31], CMHH [32], CHN [33], SSAH [36], SCAHN [38], MLSPH [3], and MDMCH [2]. Because we could not obtain the complete experimental code of MDMCH, we could not conduct a complete experiment with it on the IAPRTC-12 dataset. In the order listed, the first four are hand-crafted methods and the remaining eight are based on deep features. The results of all the above methods come from reproduced experiments or from the settings of the original papers. The performance of our model with different code lengths on the MIRFLICKR-25K, NUS-WIDE, and IAPRTC-12 datasets is shown in Table 1.
Because our model adopts a more accurate similarity matrix construction method, it obtains more distinctive semantic features in the training process. It can distinguish instances with an equal number of shared labels, and it learns deeper semantic features that yield more discriminative hash codes. The graph neural network regards each data point as a node and uses an adjacency matrix to measure the relationships between data points. Because of its powerful representation capability, our method achieves a higher MAP value than the baseline methods in most cases.
MLSPH, SePH and MDMCH consider the semantic relevance of multiple labels when generating hash codes, so they stand out among the baseline methods. This shows that, compared with the single-label method, the multi-label method makes it easier to guide the network to mine the semantic information hidden in the instances.
We vary the Hamming radius from 0 to k to draw the precision-recall curves. Figs. 4, 5 and 6 show that our MHGCN method is superior to the other methods at different code lengths (16, 32, 64) on the three datasets.
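The way a precision-recall curve arises from the Hamming radius can be sketched in plain Python: for each radius, every database item within that distance of the query counts as retrieved. The function name, the toy distances, and the convention of reporting precision 0.0 when nothing is retrieved are assumptions for illustration.

```python
def precision_recall_at_radius(distances, relevance, radius):
    """Precision and recall when all items with Hamming distance <= radius are retrieved."""
    retrieved = [rel for dist, rel in zip(distances, relevance) if dist <= radius]
    hits = sum(retrieved)
    total_relevant = sum(relevance)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Toy example: 4-bit codes, five database items for one query.
distances = [0, 1, 1, 3, 4]   # Hamming distance of each item to the query
relevance = [1, 1, 0, 1, 0]   # 1 if the item is truly similar to the query

curve = [precision_recall_at_radius(distances, relevance, r) for r in range(5)]
```

As the radius grows, recall increases toward 1 while precision generally falls, and plotting the resulting pairs gives one precision-recall curve per query or, after averaging, per method.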
We take the number of returned results from 1 to 5000 at intervals of 200 and plot the results for different hash lengths (16, 32, 64) on the three datasets in Figures 7, 8 and 9. The figures show that our top-N precision curves are better than those of the baseline methods.

D. IMPLEMENTATION DETAILS AND ABLATION STUDY
We set the hyper-parameters α, β and γ to 0.9, 1.2 and 0.1, respectively, and the mini-batch size to 128. The learning rate decreases from an initial $10^{-1.5}$ to $10^{-6}$ over 200 iterations. All experiments are conducted under the above settings. Our experimental platform is the open-source framework PyTorch with an NVIDIA 3080Ti GPU.
To compare the effects of different similarity matrix construction methods, we report the experimental results for the 64-bit code length on the MIRFlickr-25K dataset. Table 2 shows the impact of the three methods on model performance. To verify the improvement that the Graph Convolutional Network brings to our model, we replace it with two commonly used model architectures and report the results for the 64-bit code length on the MIRFlickr-25K dataset. The results are given in Table 3. The Graph Convolutional Network exploits the features in the text information well thanks to its strong representation ability, which brings a substantial performance improvement to the model.

E. CONVERGENCE ANALYSIS
To verify the convergence of our optimization algorithm, we conducted experiments on the three datasets and plotted the convergence curves for the 64-bit code length in Figure 10. The figure shows that our objective loss decreases rapidly and then stabilizes.

F. FUTURE RESEARCH
When building the similarity matrix, we refine the similarity between instances by adding non-co-occurrence label information between instances. However, we assign the same weight to co-occurrence label information (1-1) and non-co-occurrence label information (0-0). The information implied by the former is clearly more important, so reducing the contribution of the latter to the construction of the similarity matrix is worth studying.

VI. CONCLUSION
In this paper, we propose an effective cross-modal hashing retrieval method called Non-co-occurrence enhanced Multi-label cross-modal hashing retrieval based on Graph Convolutional Network (MHGCN). We use a novel multi-label non-co-occurrence similarity measure to construct our similarity matrix, which makes the similarity matrix more refined and makes it easier to distinguish similar instances. Compared with the single-label method, our similarity matrix is more fine-grained; compared with the multi-label method, it can better distinguish instances with consistent co-occurrence information. In addition, we use a Graph Convolutional Network with a strong representation ability to extract features, which enables the hash codes generated by the model to retain more semantic information. The experimental results on three benchmark datasets show that our model performs well in cross-modal hashing retrieval tasks.