Embedding Imputation With Self-Supervised Graph Neural Networks

Embedding learning is essential in various research areas, especially in natural language processing (NLP). However, given the nature of unstructured data and word frequency distribution, general pre-trained embeddings, such as word2vec and GloVe, are often inferior in language tasks for specific domains because of missing or unreliable embedding. In many domain-specific language tasks, pre-existing side information can often be converted to a graph to depict the pair-wise relationship between words. Previous methods use kernel tricks to pre-compute a fixed graph for propagating information across different words and imputing missing representations. These methods require predefining the optimal graph construction strategy before any model training, resulting in an inflexible two-step process. In this paper, we leverage the recent advances in graph neural networks and self-supervision strategy to simultaneously learn a similarity graph and impute missing embeddings in an end-to-end fashion with the overall time complexity well controlled. We undertake extensive experiments to show that the integrated approach performs better than several baseline methods.


I. INTRODUCTION
Embedding techniques [1], [2], [3] have attracted considerable attention, especially in the domain of natural language processing, because high-quality word representations are indispensable for most language learning tasks. Specifically, many NLP tasks employ transfer learning by creating an embedding lookup layer using a set of pre-trained word embedding vectors trained on a large corpus. Transfer learning is preferred over training a language model from scratch to learn completely new embeddings because most real-world datasets contain a large volume of rare words, making it difficult to find the right representation for them. Additionally, when embeddings are trained from scratch, the number of trainable parameters increases, and the training process slows down significantly.
Despite all these advantages of pre-trained embedding vectors, some words in the dataset of a specific task may not have a pretrained embedding vector. For example, thousands or even millions of terminologies and abbreviations in the medical field do not have pretrained embedding vectors because they are often not contained in a general corpus. When employing a set of pretrained embeddings for an NLP task, such words without any pretrained embedding are usually assigned a randomly sampled embedding vector. This phenomenon hinders the downstream NLP tasks from benefiting from these words' embeddings. Fortunately, it is possible to leverage some side information to mitigate the problem of missing word embedding. For example, a knowledge graph in the form of a medical taxonomy and ontology, or even a collection of chemical or physiological attributes of the terminologies, can serve as side information and be fused into the semantic space to impute missing word embeddings.
The associate editor coordinating the review of this manuscript and approving it for publication was Chuan Li.
In this work, we aim to design a semi-supervised learning model that leverages side information to define pair-wise similarities between entities (words) and propagates information between entities with known targets (embeddings) and the ones with unknown targets based on their similarities. Graphs can naturally be used to model such relationships. Most of the existing graph-based methods require a given/preconstructed graph to start with; however, such a graph may not always be readily available. In this work, we aim to learn and optimize a graph and use it for propagating node attributes.
In [4] and [5], the authors first build a fixed graph between words based on side information and then use it to propagate information between nodes. This approach is depicted in Fig. 1. First, the side information matrix X is used to learn pair-wise similarities between words and construct the fixed similarity graph G. Then, the parameterized mapping function f (G, X ) learns to map the side information vector for a word w i to its corresponding pre-trained embedding by updating its parameters.
Hence, for these methods: 1) the graph has to be pre-constructed and fixed, 2) the optimal parameters for the mapping function have to be computed for the fixed graph. If the pre-constructed graph is not suitable for the imputation task, i.e., it does not accurately capture the pairwise relationships between words, it may significantly limit the capacity and hurt the performance of the model. Optimizing the graph structure is costly because it requires repeating the two-step process every time. To address these challenges caused by the two-step approach, we propose to infer a graph from the data, similar to latent graph learning approaches, such as [6], [7], and [8] instead of calculating a fixed graph. These methods are more flexible compared to fixed graphs due to their ability to adjust graph structure to minimize the embedding learning loss function. For embedding imputation using side information, latent graph learning is a promising approach, because it does not impose any constraint on graph structure and allows joint learning of the graph and the neural network parameters in an end-to-end manner.
One challenge for latent graph learning approaches is that they heavily rely on limited labels and suffer from poor generalization because of over-fitting. This happens particularly when labeled training data are scarce [9]. In [8], the authors identify a supervision starvation problem in latent graph learning approaches in which the edges between pairs of unlabeled nodes that are far from labeled nodes receive insufficient supervision, leading to unreliable graphs during test time. Because our embedding imputation is concerned with imputing rare words, such as terminologies or proper nouns, it is likely that the domain for side information also consists of only some related rare words, resulting in a limited amount of labeled training data. Any embedding imputation on these rare words suffers from over-fitting and supervision starvation issues. In this paper, we tackle the embedding imputation problem by leveraging the recent advances in self-supervised learning with latent graph structure.
Additionally, graph learning often incurs quadratic cost in the number of graph nodes, which can be challenging for large-scale data. To reduce the complexity of the method, we leverage the anchor-graph idea [10] to build an approximate graph where the prior knowledge is translated into the word-word pairwise relationship, with guaranteed linear time complexity and provable algebraic properties. Overall, our method generates superior experimental results compared to previous works and is scalable for building large graphs.
FIGURE 1. Side information matrix X and target embedding matrix Y for words w 1 , w 2 , w 3 and w 4 . The i-th rows in X and Y correspond to the side information vector and pre-trained embedding of word w i , respectively. For all words, the side information vectors are available, but target embeddings for w 3 and w 4 are missing.
The contributions of this work are as follows:
• We design a powerful embedding method built on top of the recent advances in latent graph learning to address the critical problem of word embedding imputation in natural language processing.
• We learn a dynamic graph from prior knowledge that best fits the embedding learning problem and use it to propagate and transform the prior knowledge into effective embeddings.
• We customize the graph construction using an anchor sampling process to reduce the complexity from quadratic to linear. Consequently, the overall approach is scalable to large datasets.
• We demonstrate the effectiveness of the approach with thorough experiments.

II. RELATED WORKS
The embedding imputation problem was formulated and studied in Latent Semantic Imputation [4], where the authors first use a piece of side information to construct a weighted graph based on the non-negative least squares approach defined in Locally Linear Embedding [11] and convert it into a minimum-spanning-tree-k-nearest-neighbor-graph. The missing embedding vectors are then imputed via a matrix power iteration process that theoretically guarantees deterministic convergence. KG2Vec [5] also uses prior knowledge to build a similarity graph where each node corresponds to a word. Similar to [4], the authors formulate the imputation problem as graph-based semi-supervised learning and apply graph convolutional networks (GCN) [12] to learn missing embedding vectors. The advantages of using GCN in the solution include the incorporation of graph topology into the embedding and the enlarged model capacity provided by the neural network parameters. Another line of work addressing the problem of missing word embedding vectors includes federated learning based on character-level information [13] and the robust backed-off approach [14] based on sub-word information. These two approaches do not incorporate side information or prior knowledge. We follow the same problem setting as the first line of work, which integrates prior knowledge into the solution. Unlike these works, we learn graph topology and embeddings within the same network architecture in an end-to-end fashion as shown in Fig. 2, rather than using an inflexible, fixed graph to propagate information. The method in this paper is closely related to graph-based semi-supervised learning, including the early endeavors [15], [16], where the modeling process takes into consideration the feature vectors of all samples and uses the pair-wise relationships in a graph to enforce locality and smoothness. Graph neural networks [12], [17], [18] further enhance the solution with the rich capacity of neural networks.
Another line of work considers solving classification problems with GNNs when a graph structure is unavailable. LDS-GNN [6] jointly learns the graph structure and the parameters of a GCN by approximately solving a bilevel program that learns a discrete probability distribution on the edges of the graph. The method allows applying GCNs in scenarios where the given graph is not available, incomplete, or corrupted. IDGL [19] uses an iterative approach that repeatedly alternates between projecting the nodes to a latent space and constructing an adjacency matrix from the latent representations. In [8], the authors propose to simultaneously learn the adjacency and GNN parameters with self-supervision for inferring a robust graph structure. Our method is closely related to this line of work.
Most GNN application scenarios assume that the graph topology is given. However, graph topology is often unknown in our problem setting and may have to be constructed from prior knowledge multiple times during training. Graph construction is based on some distance metric among node vectors and incurs quadratic time complexity in the number of nodes. A brute-force approach does not scale to large datasets. To address this issue, we look into near-linear-time geometric graph construction. The recent theoretical work [20] systematically investigates the problem and presents a solution based on well-separated-pair-decomposition, coupled with Johnson-Lindenstrauss lemma if the node feature vector is high-dimensional. Fast approximate k-NN graph construction is also tackled in [21] using locality-sensitive hashing. In our work, we employ a relatively simple yet effective approach that samples anchors and is similar to the idea in [10]. Our experiment results show that the straightforward anchor sampling achieves efficacy and efficiency simultaneously. Moreover, when combined with self-supervised GNNs, the final model performance is robust against graph variation and randomness due to the anchor sampling.
the objective is to infer {y u }. To apply graph-based semi-supervised learning methods, we need to build a graph based on prior knowledge {x i }, where each word w i is a node. Note that |{w i }| = |{x i }| = n, |{w l }| = |{x l }| = |{y l }| = p and |{w u }| = |{x u }| = |{y u }| = q, where n = p + q. Here, n, p and q represent the total number of words, the number of words that have pretrained embedding vectors and the number of words without a pretrained embedding vector, respectively.

IV. PRELIMINARIES
We define a weighted, attributed graph as G = {V,Ã, X}, where V = {v 1 , v 2 , . . . , v n } is the node set,Ã ∈ R n×n is the adjacency matrix withÃ ij (the element at i-th row, j-th column) indicating the edge weight from node i to node j, (Ã ij = 0 implies there is no edge) and X ∈ R n×f is the feature matrix with f representing the dimensionality of the feature vector of each node.
The graph convolutional neural network (GCN) [12] aims to improve the quality of the representations by aggregating information from neighbor nodes and applying transformations to the representations in each layer. For a graph G = {V, Ã, X}, the output of the l-th layer of a GCN is defined as:
$$H^{(l)} = \sigma\left(\hat{A} H^{(l-1)} W^{(l)}\right)$$
where $H^{(l)} \in \mathbb{R}^{n \times d_l}$ and $H^{(l-1)} \in \mathbb{R}^{n \times d_{l-1}}$ are the node representations of the current and the previous layers, respectively, and the representations in the first layer are initialized as $H^{(0)} = X$. $\hat{A} = D^{-1/2} \tilde{A} D^{-1/2} + I$ is the normalized adjacency matrix with the added self-loop, where $D$ is the diagonal degree matrix of $\tilde{A}$ and $I$ represents the $n \times n$ identity matrix. $W^{(l)} \in \mathbb{R}^{d_{l-1} \times d_l}$ is a trainable weight matrix and $\sigma$ is an activation function, such as tanh.
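As an illustration of the layer update described above, the following is a minimal sketch in plain Python (lists of rows, no external dependencies). It assumes a normalization of the form D^{-1/2} Ã D^{-1/2} + I, with D holding the row sums of Ã; the helper names `matmul`, `normalize_adjacency` and `gcn_layer`, as well as the toy matrices, are hypothetical and chosen only for this example. A practical implementation would use a tensor library with sparse operations.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2} + I, where D holds the row sums (degrees) of A.

    The exact normalization is an assumption made for this sketch.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] + (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def gcn_layer(a_hat, h_prev, weight, activation=math.tanh):
    """One propagation step: H = activation(A_hat @ H_prev @ W)."""
    z = matmul(matmul(a_hat, h_prev), weight)
    return [[activation(v) for v in row] for row in z]

# Toy example: 3 nodes on a path graph, 2 input features, 2 output units.
A = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.5, -0.5], [0.25, 0.75]]
H1 = gcn_layer(normalize_adjacency(A), X, W)  # shape: 3 nodes x 2 units
```

Stacking two such calls, with the output of the first passed as `h_prev` to the second, gives a two-layer network of the kind used later for regression.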

V. METHODOLOGY
In this section, we introduce the framework of our self-supervised imputer (SSI). Our goal is to define a mapping from the side information data space to the embedding space, i.e., for each word, the feature vector (side information) will be mapped to its embedding vector. We propose to use a GCN to incorporate the affinity information between words and thereby improve the quality of the learned representations. However, most GCN-based architectures assume a pre-defined graph structure. Our problem setting does not have a pre-defined graph structure and requires graph construction from available side information. Existing approaches often adopt kernel tricks and nearest neighbors to construct a fixed similarity graph based on side information. However, these fixed graphs may not be appropriate for our imputation objective due to the difference between the two spaces and the lack of flexibility in domain adaptation. Hence, we propose to learn the graph structure and optimize the model weights simultaneously in an end-to-end fashion similar to the graph construction algorithm in [8]. Fig. 2 shows the overall architecture of the proposed embedding imputation framework, consisting of the following components:
1) Graph generator module: transforms the original node features using an MLP and computes a kNN-graph based on the similarity of the transformed features. The generated graph is then used in both the regression module and the self-supervision module to increase the robustness of the learned graphs.
2) Regression module (GNN R ): consists of a GCN that learns to map the original node features to the target embeddings based on the kNN-graph generated by the previous module.
3) Self-supervision module (GNN S ): randomly masks (converts to zero) a subset of the entries in the original feature matrix and then uses a graph autoencoder, with the kNN-graph produced by the graph generator as its input graph, to reconstruct the original features from the masked (corrupted) version of the original node features.
FIGURE 2. Network architecture. Side information matrix X is input into the generator to obtain the processed adjacency matrix A. A is then used for both tasks. GNN S tries to reconstruct the original feature matrix X and the self-supervision loss L S is computed based on its output. GNN R predicts the embedding for each word and the regression loss L R is computed based on known embeddings. During test time, the unknown embeddings are replaced with the predictions made by GNN R .

A. GRAPH GENERATOR
The graph generator uses side information (node features) to construct the affinity graph. Formally, the graph generator G : R n×f → R n×n is a function that takes the side information matrix X ∈ R n×f as input and produces the affinity matrix Ã ∈ R n×n as output. First, X is passed to a multi-layer perceptron, MLP : R n×f → R n×f ′ , which outputs the transformed node representations X ′ ∈ R n×f ′ . Based on X ′ , the k-nearest-neighbors function kNN : R n×f ′ → R n×n selects the top k neighbors for each node and generates the sparse k-nearest-neighbor graph.
Let x ′ i denote the transformed representation of node v i . To select the nearest neighbors for v i , we compute the dot product between x ′ i and x ′ j for all j ∈ {1, 2, . . . , n} and select the k nodes with the largest dot products. Finally, for all selected nodes v j , we set Ã ij = x ′ i · x ′ j , where · represents the dot product operation.
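The neighbor selection described above can be sketched as follows. This is an illustrative, dependency-free version that assumes the transformed features are already available as a list of rows (the MLP step is omitted); the function name `knn_by_dot_product` and the toy data are hypothetical. Because the text ranges j over all nodes, the sketch lets a node select itself, and ties are broken by the lower index, which is an arbitrary choice made here.

```python
def knn_by_dot_product(x_prime, k):
    """Build an n x n matrix keeping, for each row, the k largest dot products."""
    n = len(x_prime)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Similarity of node i to every node j (including itself).
        sims = [sum(a * b for a, b in zip(x_prime[i], x_prime[j])) for j in range(n)]
        # Indices of the k largest similarities (ties broken by lower index).
        top = sorted(range(n), key=lambda j: (-sims[j], j))[:k]
        for j in top:
            adj[i][j] = sims[j]
    return adj

# Toy example: 4 nodes with 2-dimensional transformed features.
X_prime = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
A_raw = knn_by_dot_product(X_prime, k=2)
```

This brute-force version computes all n × n dot products per call, which is the quadratic cost that motivates the anchor-based variant discussed later.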
The output matrix Ã may contain both negative and positive values, may be asymmetric and needs to be normalized, so we further process Ã to obtain the affinity matrix of the learned graph:
$$A = \tilde{D}^{-1/2} \left( \frac{P(\tilde{A}) + P(\tilde{A})^{\top}}{2} \right) \tilde{D}^{-1/2}$$
where $\tilde{D}$ is the matrix of node degrees and $P$ is an element-wise function with non-negative range, such as ReLU.
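A minimal sketch of this post-processing step is given below. It assumes that P is ReLU, that symmetry is obtained by averaging the matrix with its transpose, and that normalization is the symmetric degree scaling D^{-1/2} (·) D^{-1/2} computed from the symmetrized matrix; these specific choices, the function name `process_adjacency` and the toy input are assumptions made for illustration.

```python
import math

def process_adjacency(raw):
    """Make a raw affinity matrix non-negative, symmetric and degree-normalized."""
    n = len(raw)
    # P: element-wise ReLU removes negative affinities.
    relu = [[max(0.0, v) for v in row] for row in raw]
    # Symmetrize by averaging with the transpose.
    sym = [[0.5 * (relu[i][j] + relu[j][i]) for j in range(n)] for i in range(n)]
    # Symmetric normalization by node degree (row sums of the symmetrized matrix).
    deg = [sum(row) for row in sym]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    return [[inv_sqrt[i] * sym[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

# Toy example: an asymmetric matrix with one negative entry.
raw = [[1.0, 0.9, 0.0], [0.0, 1.0, -0.5], [0.2, 0.0, 1.0]]
A_proc = process_adjacency(raw)
```

Every step is differentiable almost everywhere, which is what allows gradients from the downstream losses to reach the generator parameters in an end-to-end setup.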

B. REGRESSION MODULE
The GNN-based regression module GCN R : R n×f × R n×n → R n×d takes the original node features X and the generated adjacency matrix A as input and outputs the predicted embeddings, where d is the dimensionality of the embedding vectors to be imputed. The module is parameterized by two trainable weight matrices, W (1) R and W (2) R , and generates its output by applying two GCN layers (Section IV) that propagate information over A. The regression loss L R for training is defined as the mean squared difference between the original embedding and the predicted embedding, computed over all nodes with an available target embedding. Here, target embeddings serve the same role as labels in supervised learning. We generalize the definition of label to be the learning target regardless of whether the model is for classification or regression.
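The restriction of the regression loss to nodes with known targets can be sketched as below. The function name `regression_loss` and the boolean list `has_target` are hypothetical; the sketch assumes that the mean is taken over all entries of the labeled rows, and rows without a target are simply skipped, so their placeholder values never influence the loss.

```python
def regression_loss(pred, target, has_target):
    """Mean squared error over the rows whose target embedding is known."""
    total, count = 0.0, 0
    for p_row, t_row, known in zip(pred, target, has_target):
        if not known:
            continue  # unlabeled node: contributes nothing to the loss
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0

# Toy example: 3 words with 2-dimensional embeddings; the third has no target.
pred = [[1.0, 2.0], [0.0, 0.0], [5.0, 5.0]]
target = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # last row is a placeholder
has_target = [True, True, False]
loss_r = regression_loss(pred, target, has_target)  # (0 + 4 + 0 + 1) / 4
```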

C. SELF-SUPERVISION MODULE
The graph generator learns to generate graphs mainly based on the regression loss L R computed on only labeled nodes. In [8], the authors identify the problem of starved edges and argue that only relying on the supervision from labels may not be sufficient while learning a dynamic graph. A starved edge is defined as an edge generated between two nodes that receives no supervision from the labels because it is more than k-hops away from any labeled node when using a k-layer GCN. They show that a graph suitable for predicting node features can also be useful in predicting node labels and such a graph can be used to regularize the starved edges. We use this idea and introduce the self-supervision module, which is useful for learning a more robust graph.
VOLUME 11, 2023
First, we augment the node features by masking out some node attributes with a binary mask matrix M ∈ {0, 1} n×f and obtain a randomly masked version X̃ of X: X̃ = X ⊙ M, where ⊙ represents the Hadamard product. A different M is created randomly at each epoch and contains a fixed number of zero entries.
We define an autoencoder GCN S : R n×f × R n×n → R n×f with parameters W (1) S and W (2) S that takes the masked node features X̃ and the generated adjacency matrix A and tries to reconstruct X. The self-supervision loss is defined as:
$$L_S = \mathrm{MSE}\left(X_{idx},\ \mathrm{GCN}_S(\tilde{X}, A)_{idx}\right)$$
where idx = {(i, j) | M ij = 0} is the set of masked indices, selected uniformly at random in each epoch, and MSE is the mean squared error loss. The final model is trained to minimize L = L R + λL S , where λ is a hyperparameter that controls the relative importance between the embedding regression loss L R and the self-supervision loss L S .
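The masking step and the loss restricted to masked entries can be illustrated with the sketch below. The helper names (`sample_mask`, `apply_mask`, `self_supervision_loss`, `total_loss`) are hypothetical, and the reconstruction passed to the loss is a stand-in for the output of the autoencoder, which is not implemented here.

```python
import random

def sample_mask(n, f, num_zeros, rng):
    """Binary n x f mask with exactly num_zeros zero entries, chosen uniformly."""
    zero_positions = set(rng.sample(range(n * f), num_zeros))
    return [[0 if i * f + j in zero_positions else 1 for j in range(f)]
            for i in range(n)]

def apply_mask(x, mask):
    """Hadamard product X * M: masked entries become zero."""
    return [[v * m for v, m in zip(x_row, m_row)] for x_row, m_row in zip(x, mask)]

def self_supervision_loss(x, reconstruction, mask):
    """MSE between X and its reconstruction over masked entries (M_ij = 0) only."""
    errors = [(x[i][j] - reconstruction[i][j]) ** 2
              for i in range(len(x)) for j in range(len(x[0])) if mask[i][j] == 0]
    return sum(errors) / len(errors) if errors else 0.0

def total_loss(loss_r, loss_s, lam):
    """Combined objective L = L_R + lambda * L_S."""
    return loss_r + lam * loss_s

# Toy example: 3 nodes, 4 features, 5 masked entries; a new mask would be
# drawn at every epoch in training.
rng = random.Random(0)
X = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], [9.0, 10.0, 11.0, 12.0]]
M = sample_mask(3, 4, 5, rng)
X_tilde = apply_mask(X, M)
# Stand-in for the autoencoder output: a perfect reconstruction gives zero loss.
loss_s = self_supervision_loss(X, X, M)
```

Evaluating the loss only on masked positions means the autoencoder is scored on entries it could not copy from its input, so it has to rely on information propagated from neighboring nodes through the learned graph.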

D. ANCHOR-kNN GRAPH CONSTRUCTION
The graph generator computes a new kNN graph at each epoch. This may result in scalability issues for large datasets because conventional kNN graph construction requires quadratic time with respect to the number of nodes. To mitigate the scalability issue, we adopt the idea of the anchor graph [10]. The Anchor-kNN process is outlined in Algorithm 1. choice() is the random sampling process on a uniform distribution: choice(X, m) samples m unique rows from X and constructs X m . We take the inner product of the feature matrix X and X m to construct the similarity matrix C, where C ij denotes the similarity between nodes v i and v j . The graph adjacency Ã is initialized as a matrix of zeros. NN_index takes in C i (the i-th row of C, i.e., the list of similarities for v i ) and an integer k (the node degree) as inputs, and returns a list of indices, γ, corresponding to the largest k elements in C i . Then, for all indices j ∈ γ, the corresponding entry Ã ij is set to the similarity score C ij . To use the anchor graph, we replace the kNN function in the Graph Generator module with the Anchor-kNN function.
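The Anchor-kNN steps described above can be sketched as follows. This is an illustrative reading of the description rather than a reproduction of Algorithm 1: the function name `anchor_knn` is hypothetical, and the sketch assumes that a column index of C is mapped back to the original node index of the corresponding anchor when writing into the n × n adjacency matrix.

```python
import random

def anchor_knn(x, m, k, rng):
    """Connect every node to its k most similar nodes among m sampled anchors."""
    n = len(x)
    # choice(X, m): sample m unique rows uniformly at random.
    anchor_ids = rng.sample(range(n), m)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row i of C: dot products between node i and the m anchors only.
        sims = [sum(a * b for a, b in zip(x[i], x[a_id])) for a_id in anchor_ids]
        # NN_index: positions of the k largest similarities in this row.
        top = sorted(range(m), key=lambda c: (-sims[c], c))[:k]
        for c in top:
            # Assumption: store the score at the anchor's original node index.
            adj[i][anchor_ids[c]] = sims[c]
    return adj, anchor_ids

# Toy example: 6 nodes, 3 anchors, 2 neighbors per node.
rng = random.Random(0)
X = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
adj, anchors = anchor_knn(X, m=3, k=2, rng=rng)
```

Only n × m similarities are computed per call instead of n × n, so all non-zero columns of the result belong to anchor nodes.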

1) COMPLEXITY ANALYSES
In addition to the model effectiveness, we need to evaluate whether the end-to-end approach for embedding imputation is scalable to large datasets. During graph construction, the anchor sampling process takes O(1) time, the similarity computation between n nodes and m anchors takes O(fmn) time where f denotes the dimension of the original feature vector. Overall time complexity for k nearest neighbor search is O(mn). The time complexity of the GNN model evaluation is also linear in n given that the graph is sparse. Therefore, the end-to-end approach with a learnable graph has a time complexity O(n) with m ≪ n.
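To make the comparison with exact kNN construction concrete, the following back-of-the-envelope count contrasts the number of similarity computations per graph construction. The values n = 4092 (large finance dataset) and m = 500 are taken from the experiments in Section VI; the symbols T_exact and T_anchor are introduced here for illustration only, and the count ignores constant factors and the cost of top-k selection.

```latex
\begin{align*}
T_{\text{exact}} &= O(f n^{2}), \qquad T_{\text{anchor}} = O(f m n), \\
\frac{T_{\text{anchor}}}{T_{\text{exact}}} &\approx \frac{m}{n} = \frac{500}{4092} \approx 0.12 .
\end{align*}
```

Under this rough count, 500 anchors require about 12% of the dot products needed by the exact graph. Measured runtimes also include overheads that the count does not capture, so the two figures need not coincide.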

VI. EXPERIMENTS
We carry out comprehensive experiments on various real-world datasets, including the finance industry and online application markets, to demonstrate the effectiveness, scalability and robustness of the proposed approach. Specifically, we try to answer the following questions:
• Is SSI more effective than the baseline methods in mapping the side information to the semantic space and facilitating downstream tasks?
• Can the anchor sampling process reduce graph construction complexity while retaining model performance?
• What is the impact of self-supervision on model performance?
We use LSI [4], GCN [12], IDGL [19] and LDS-GNN [6] as the baseline methods for comparison. These methods cover a variety of machine learning models with different architectures. LSI is based on the standard (linear) matrix power method while GCN takes advantage of the rich capacity of the multi-layer perceptron (MLP) and graph neural networks. IDGL and LDS-GNN jointly learn the graph structure and the parameters of a GCN, and SSI further improves upon all the aforementioned methods by deriving robust graphs from self-supervision. Implementation and hyperparameter tuning details are provided in Appendix A.

A. DOWNSTREAM TASKS
When we want to avoid building a language model from scratch, we typically utilize a pretrained embedding set such as GloVe [3]. However, some word embedding in the dataset corpus may not be available in this pretrained embedding set. To impute those missing words, we can utilize a side information source and apply an imputation model. Note that, for data imputation, we do not need to have side information for all the words in the corpus. It is enough to have side information only for a subset of the words which includes the missing words and some selected words from the corpus having the pretrained embedding available. For example, in a finance sentiment analysis task, the corpus may consist of a collection of financial articles, which include some company names along with some other more commonly used words. After picking a pretrained embedding set for the task, we may observe that some of the company names in the corpus do not have a pretrained embedding. Suppose that we identify a side information source (e.g., daily stock returns) that contains information about n companies in the corpus: q of those n companies do not have a pretrained embedding, while the remaining p = n − q companies have a pretrained embedding. Then, we apply the designed imputation algorithm to predict the embedding vectors for those q companies using the guidance from the other p companies. Below, we conduct experiments on two real-world datasets and demonstrate the effectiveness of our imputation model.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

1) IMPUTING FINANCE COMPANY EMBEDDINGS
For this task, we adopt the datasets from [4]. There are two datasets of different sizes. The small dataset has a word set of 488 company names retrieved from S&P500 index. The large dataset has a word set of 4092 company names covering almost all publicly listed stocks in US market retrieved from NYSE and NASDAQ. The goal is to successfully impute the missing embeddings that are not available in the pretrained embeddings using the available side information (historical return data). This is done separately for all three pretrained embedding sets. A more detailed description of the dataset is provided in the Appendix B. As LSI and GCN require a fixed graph, we follow the steps in [4] to construct an MST-kNN graph using the side information matrix X as the domain matrix, solve for the optimal edge weights using Non-Negative-Least-Squares and normalize the weights to obtain the graph. LDS-GNN, IDGL and SSI do not require this as they learn the graph during training. For GCN, LDS-GNN, IDGL and SSI, side information matrix X is used as the feature matrix.

2) IMPUTING MOBILE APPLICATION EMBEDDINGS
Mobile App Statistics dataset contains more than 7000 Apple iOS mobile application details extracted from the iTunes Search API at the Apple Inc website. Each app has a name and a primary genre such as Games, Sports, or Business. There are 23 possible genres. Each app also has categorical (e.g., content maturity rating) and numerical features (e.g., price) and a textual description. For each app, we process and merge the features and obtain a feature vector of size f . Details of the dataset and the processing steps are provided in the Appendix B. Side information matrix X ∈ R n×f is obtained by stacking the individual feature vectors of all apps. Again, X is used to construct the fixed graph for LSI and GCN and also as the feature matrix for GCN, LDS-GNN, IDGL and SSI.

3) EVALUATION
After the imputation, we obtain a set of predicted embedding vectors {ŷ 1 , ŷ 2 , . . . , ŷ n } for all words. Any word w i which previously had a pretrained embedding retains its original embedding y i , while we set y j = ŷ j for all the remaining words w j (the words without any pretrained embedding). After this step, we refer to {y 1 , y 2 , . . . , y n } as the completed set of embeddings. To evaluate and compare the quality of the embeddings learned by different methods, we perform the k-Nearest-Neighbors (kNN) evaluation described in Algorithm 2 in Appendix C, where we also provide the reasoning for this evaluation. Essentially, we iteratively leave one company/app out of the set and try to predict its label (industry for finance, app category for the app statistics dataset) using its k nearest neighbors.
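The leave-one-out procedure can be sketched as below. Since Algorithm 2 is given in the appendix and not in this section, the details here are assumptions: cosine similarity is used as the neighbor metric, the label is predicted by majority vote, and ties are broken by the order in which neighbors are encountered. The function names and toy data are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def leave_one_out_knn_accuracy(embeddings, labels, k):
    """Predict each item's label from its k nearest neighbors among the others."""
    n = len(embeddings)
    correct = 0
    for i in range(n):
        # Rank all other items by similarity to the held-out item i.
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: (-cosine(embeddings[i], embeddings[j]), j))
        votes = Counter(labels[j] for j in others[:k])
        predicted = votes.most_common(1)[0][0]
        correct += int(predicted == labels[i])
    return correct / n

# Toy example: two well-separated groups of completed embeddings.
Y = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
labels = ["a", "a", "a", "b", "b", "b"]
accuracy = leave_one_out_knn_accuracy(Y, labels, k=2)
```

A higher accuracy indicates that items sharing a label lie close together in the completed embedding space, which is the property this evaluation is meant to probe.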
The results are shown in Table 1, Table 2 and Table 3 for the small finance, large finance and Mobile App Statistics datasets, respectively. We run every experiment five times and compute the mean and standard deviation. We perform the kNN classification and gradually change k to examine the robustness of different methods under different choices of k. From Table 1, Table 2 and Table 3 (see Appendix D for more detailed tables) we make the following observations: (1) SSI, LDS-GNN and IDGL often outperform GCN, demonstrating the effectiveness of latent graph learning compared to using fixed graphs. This result highlights the potential of these approaches in improving the mapping function from the side information space to the embedding space and, consequently, their ability to improve overall embedding quality; (2) LSI outperforms GCN, LDS-GNN and IDGL. This result shows that neither the capacity improvement brought by the MLP parameters nor the latent graph learning approach can bring enough improvement to outperform LSI. A possible explanation for this is LSI's deterministic convergence, which guarantees that the model reaches the optimal solution once the hyperparameters are set, which is not the case for the other models; (3) SSI outperforms LDS-GNN and IDGL, indicating that latent graph learning guided by self-supervision is able to learn more robust graphs for domain adaptation compared to the alternative approaches offered by LDS-GNN and IDGL; (4) SSI outperforms all baselines, suggesting that GCN-based methods combined with a robust latent graph learning strategy are the most effective of the compared approaches for embedding imputation.

B. SENSITIVITY ANALYSES

1) NUMBER OF ANCHORS (m)
In this section, we evaluate the effectiveness of the anchor sampling process in reducing computational complexity while retaining the model's accuracy. For the large finance dataset and fastText, fixing the optimal hyperparameters found and varying only the number of anchors, we train SSI multiple times and observe the kNN accuracies (for k = 30) and runtimes for building a single graph. The reported results are the averages of five different runs for each configuration. Figure 4 confirms that Anchor-kNN achieves comparable accuracy based on an approximate graph that only considers the node affinity with respect to a small subset of nodes (anchors), in comparison with the exact solution, which has quadratic time complexity and requires comprehensive pair-wise calculations among all nodes. With 500 and 1000 anchors, Anchor-kNN loses around 1%-2% accuracy but decreases construction time significantly, by 77.6% and 63.3%, respectively. Using 2000 anchors (approximately half of all the nodes) roughly preserves the original performance while reducing the construction time by 43.4%. Recall that the complexity of anchor graph construction is linear in the number of anchors and in the number of graph nodes. Our algorithm uses a constant number of anchors, achieves performance comparable to full-scale graph-based algorithms, and has the advantage of overall linear complexity in the number of nodes.

2) SELF-SUPERVISION STRENGTH (λ)
Figure 3 shows the results for the Mobile App Statistics dataset for different values of λ. We train multiple models with varying λ values using the optimal hyperparameters. Note that λ = 0 corresponds to no self-supervision. For both word2vec and GloVe embeddings, we observe that as λ increases up to a particular value, the model accuracy improves, confirming that self-supervision indeed enhances model robustness and performance.
Once λ reaches a specific value, the model performance peaks. After this point, increasing λ deteriorates the performance. This behavior is similar to cases where applying too much regularization starts to introduce a lot of bias. In fact, the self-supervision loss term indeed serves as a regularization mechanism to ensure that the model trades off between the supervised learning objective and graph robustness in preserving node features.

VII. DISCUSSION AND CONCLUSION
In this paper, we tackle the problem of embedding imputation with the recent advances in graph neural networks. Instead of using a pre-computed graph, we use the idea of self-supervision to learn and evolve the graph structure during training to address the challenge of converting the side information into a suitable network structure for imputation. Combining the reconstruction loss of the original node features with the actual prediction task has a close connection to regularization and is highly effective. We also integrate the idea of anchor sampling into our framework to reduce the complexity of graph construction for scalability. The approach yields superior performance and robustness on multiple tasks and datasets compared to previous works. We anticipate that our embedding imputation technology will be especially useful in domain-specific NLP tasks.

APPENDIX A IMPLEMENTATION DETAILS
The hyperparameters for all models are tuned using the losses on the validation set (mean squared error between predicted and original embeddings). After finding the optimal hyperparameters, we train and test each model using 5 random seeds and report the average results and standard deviations on the test sets. We use the Adam optimizer [22] for all models except LSI unless stated otherwise, and tune the weight decay parameter from (0.0, 1e-7, 1e-6, 1e-5).
For LSI and GCN, we tune the number of neighbors δ for constructing the MST-kNN graph from (10, 20, 30). We tune the learning rate from (1e-5, 1e-4, 1e-3) and apply dropout on the adjacency matrix and the weight matrices with the keep probabilities selected from (1.0, 0.75, 0.5), applied at each layer. We train for 400 epochs. For LSI, the stopping criterion η is set to 1e-4.
For IDGL, the kNN size of the input graph is tuned from (10, 20, 30). The weight dropout and adjacency dropout rates are tuned from (0.0, 0.25, 0.5). A two-layer GCN is used for the GNN module. Weighted cosine is used as the graph metric type. λ is set to 0.9, η is set to 0.2, and δ is set to 8.5e-5. The number of perspectives is set to 4. The learning rate is tuned from (1e-5, 1e-4, 1e-3, 1e-2), and early stopping is applied with a patience of 20 epochs.
For SSI, we apply dropout on the weight matrices of GCN R with keep probability selected from (1.0, 0.75, 0.5), applied at each layer. We tune the learning rates for GCN R and GCN S from (1e-5, 1e-4, 1e-3); the mask-out ratio, i.e., the portion of entries of the feature matrix to mask out (set to zero), from (0.1, 0.2, 0.5, 0.9); k for the kNN function in the Graph Generator from (10, 20, 30); the self-supervision strength parameter λ from (0.1, 1, 2, 3, 4, 5, 10); and all MLP activations from (tanh, ReLU). The Graph Generator uses two 300 × 300 diagonal weight matrices. We train for 400 epochs, and the first 20% of the epochs are used to train only the self-supervision module.
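The mask-out step can be sketched as follows. This is only an illustration under stated assumptions: entries are chosen uniformly at random over the whole feature matrix and exactly a `mask_ratio` fraction of them is zeroed. The function name `mask_features` and the `seed` parameter are hypothetical, and the actual sampling scheme used by SSI may differ.

```python
import random


def mask_features(features, mask_ratio, seed=0):
    """Return a copy of `features` with a mask_ratio fraction of entries set to zero.

    Assumption: masked positions are sampled uniformly without replacement.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(features), len(features[0])
    total = n_rows * n_cols
    n_mask = int(round(mask_ratio * total))
    # Positions are flat indices into the row-major matrix.
    masked_positions = set(rng.sample(range(total), n_mask))
    masked = [
        [0.0 if (i * n_cols + j) in masked_positions else features[i][j]
         for j in range(n_cols)]
        for i in range(n_rows)
    ]
    return masked, masked_positions


features = [[1.0] * 5 for _ in range(4)]
masked, positions = mask_features(features, mask_ratio=0.5)
```

The returned positions identify the entries that a reconstruction loss would be evaluated on, if the loss is restricted to the masked entries (an assumption; the loss could also cover all entries).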
Finally, for a fair comparison, we keep the capacity of the regression modules the same, i.e., GCN, inner objective GCN of LDS-GNN, GNN module of IDGL and GCN R of SSI all have two layers of 600 hidden units.
Each company has an industry category label, e.g., Google belongs to the IT industry, while Blackrock belongs to the financial industry. There are eleven different category labels representing eleven industry sectors. Every company also has a historical daily trading return vector $\vec{r} = [r_{t_1}, r_{t_2}, \ldots, r_{t_f}]$ available as side information. For the small dataset, this vector contains the daily stock returns from 2016-08-24 to 2018-08-27. For the large dataset, this vector contains the daily returns for 400 trading days ending on 2018-11-01. For each dataset, we obtain the matrix of returns $X \in \mathbb{R}^{n \times f}$ by stacking the individual return vectors of all companies.
For all three pretrained embedding sets, the companies that are not contained in the embedding set form the test set. We perform a stratified split on the remaining companies by forcing the distribution of the industry category labels to be the same across the training and validation sets. We designate 80% of the data as the training set and the remaining 20% as the validation set. Training and hyperparameter tuning details are provided in Appendix A.
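The stratified 80/20 split can be sketched as below. This is a minimal illustration rather than the code used in the experiments: the function name `stratified_split`, the rounding rule for per-class cut points, and the seed handling are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split item indices so each label keeps roughly the same train/validation ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_label):
        members = by_label[label]
        rng.shuffle(members)
        # Assumption: the per-class cut point is rounded to the nearest integer.
        cut = int(round(train_fraction * len(members)))
        train_idx.extend(members[:cut])
        val_idx.extend(members[cut:])
    return sorted(train_idx), sorted(val_idx)


# Toy example with two industry labels.
labels = ["IT"] * 10 + ["Financials"] * 5
train_idx, val_idx = stratified_split(labels)
```

Splitting within each label group is what keeps the label distribution aligned between the two sets; a plain random split offers no such guarantee for rare categories.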

B. MOBILE APP STATISTICS DATASET
The Mobile App Statistics dataset from Kaggle 1 contains details of more than 7000 Apple iOS mobile applications extracted from the iTunes Search API at the Apple Inc website. Each app has a name and a primary genre such as Games, Sports, or Business. There are 23 possible genres in total.
Each app also has numerical features including: "price": Price amount, "size_bytes": Size (in bytes), "ratingcounttot": User rating counts (for all versions), "ratingcountver": User rating counts (for the current version), "user_rating": Average user rating value (for all versions), "userratingver": Average user rating value (for the current version), "ver": Latest version code, "sup_devices_num": Number of supported devices, "ipadSc_urls.num": Number of screenshots shown for display, "lang.num": Number of supported languages. There are also categorical features including: "cont_rating": Content (maturity) rating (e.g., 7+, 13+), "vpp_lic": VPP device-based licensing enabled (True or False), "currency": Currency type. We drop the ver and currency columns because they do not provide any useful information (there is only one currency). We also drop the ratingcountver and userratingver columns because they are only valid for the current version of the app. Instead, we keep ratingcounttot and user_rating.
We convert the categorical features into one-hot vectors and normalize the numerical features using min-max normalization. We use word embeddings to transform the textual description of each app into an average representation vector $\vec{v} = \frac{1}{n} \sum_{i=1}^{n} \vec{w}_i$, where n is the number of words in the description and $\vec{w}_i$ is the pre-trained embedding from Word2vec or GloVe (the same one as the task) for the i-th word. Any word in the textual description that is not contained in the pretrained embedding set is excluded from the computation. We merge the textual representation $\vec{v}$, the numerical features, and the one-hot vectors for the categorical features to obtain a final feature vector, $\vec{x} \in \mathbb{R}^{f}$, for each app. We obtain the side information matrix $X \in \mathbb{R}^{n \times f}$ by stacking the individual feature vectors of all app names. As in the finance task, X is used as the domain matrix to construct the fixed graph for LSI and GCN and also as the feature matrix for GCN, LDS-GNN and SSI.
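The feature construction described above can be sketched as follows. The helper names (`average_embedding`, `min_max`, `one_hot`, `build_feature_vector`), the toy embedding table, and the zero-vector fallback for descriptions with no known words are all assumptions made for illustration. The sketch also assumes that the average is taken over the words that have a pre-trained embedding.

```python
def average_embedding(words, embeddings, dim):
    """Average the pre-trained vectors of the words found in `embeddings`."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        return [0.0] * dim  # assumption: zero vector when no word is covered
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def min_max(column):
    """Min-max normalize one numerical feature column to [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [(v - lo) / (hi - lo) for v in column]


def one_hot(value, categories):
    """Encode a categorical value as a one-hot vector over `categories`."""
    return [1.0 if value == c else 0.0 for c in categories]


def build_feature_vector(text_vec, numeric_values, categorical_vec):
    """Concatenate the three parts into the final feature vector x."""
    return list(text_vec) + list(numeric_values) + list(categorical_vec)


# Toy example with a two-dimensional embedding table (hypothetical values).
toy_embeddings = {"photo": [1.0, 3.0], "editor": [3.0, 5.0]}
v = average_embedding(["photo", "editor", "zzz"], toy_embeddings, dim=2)
prices = min_max([0.0, 5.0, 10.0])
rating = one_hot("12+", ["4+", "9+", "12+", "17+"])
x = build_feature_vector(v, [prices[1]], rating)
```

Stacking one such vector per app, row by row, gives the side information matrix X.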
Word2vec, GloVe, and fastText embeddings already contain 190, 260, and 10 app names, respectively, while the remaining app names are missing from these embedding sets. We only work with Word2vec and GloVe (fastText contains only a few apps) and try to impute the missing embeddings using the app features as side information.
For both Word2vec and GloVe, the app names that are not contained in the embedding set form the test set. We perform a stratified split on the remaining app names by forcing the distribution of the primary genre labels to be the same across the training and validation sets. We designate 80% of the data as the training set and the remaining 20% as the validation set. Training and hyperparameter tuning details are provided in Appendix A.

APPENDIX C kNN EVALUATION
The algorithm takes in the completed set of embeddings $\{y_1, y_2, \ldots, y_n\}$, the list of industry category labels $l$ with the i-th element $l_i$ representing the industry category (or primary genre for the mobile apps dataset) label for $w_i$, and an integer k. The classification is conducted by leaving out one word at a time and predicting its label based on the labels of its k nearest neighbors in terms of Euclidean distance in the embedding space (lines 8-10). We repeat this for all the words in the dataset and compute the overall accuracy, i.e., the ratio of the number of correct predictions to the number of all predictions (line 12).
A company is expected to be semantically more similar to a company from the same industry than to one from a different industry. For example, Google is more similar to another tech company, Apple, than it is to Walmart. Any effective word imputation method should preserve semantic locality in the embedding space; in other words, similar words should be embedded close to each other. Hence, if the imputation method works well, we should be able to accurately infer the industry of a company based on the industry labels of the nearby companies in the embedding space, resulting in a high k-Nearest-Neighbors accuracy. Similar reasoning also applies to mobile apps and their genres.
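The leave-one-out kNN evaluation can be sketched as below. This is a minimal illustration, not a transcription of the algorithm listing: it assumes that the label is predicted by majority vote among the k nearest neighbors, with ties resolved in favor of the label encountered first among the nearest neighbors. The function name `knn_accuracy` is hypothetical.

```python
import math
from collections import Counter


def knn_accuracy(embeddings, labels, k):
    """Leave-one-out kNN accuracy using Euclidean distance in the embedding space."""
    n = len(embeddings)
    correct = 0
    for i in range(n):
        # Distances from the held-out word i to every other word.
        dists = sorted(
            (math.dist(embeddings[i], embeddings[j]), j)
            for j in range(n) if j != i
        )
        neighbor_labels = [labels[j] for _, j in dists[:k]]
        # Assumption: majority vote over the k nearest neighbors.
        predicted = Counter(neighbor_labels).most_common(1)[0][0]
        if predicted == labels[i]:
            correct += 1
    return correct / n


# Two well-separated toy clusters, so every held-out point is classified correctly.
toy_embeddings = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_labels = ["a", "a", "a", "b", "b", "b"]
accuracy = knn_accuracy(toy_embeddings, toy_labels, k=2)
```

This brute-force version computes all pair-wise distances and is therefore quadratic in the number of words, which is acceptable for evaluation on datasets of a few thousand items.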