Inferring Multilingual Domain-Specific Word Embeddings From Large Document Corpora

The use of distributed vector representations of words in Natural Language Processing has become established. To tailor general-purpose vector spaces to the context under analysis, several domain adaptation techniques have been proposed. They all require sufficiently large document corpora tailored to the target domains. However, in several cross-lingual NLP domains both large enough domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English. This paper aims at tackling the aforesaid issue. It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language on the basis of the general-purpose and domain-specific models available for a source language (typically, English). The proposed inference method relies on a two-step process, which first automatically identifies domain-specific words and then opportunistically reuses the non-linear space transformations applied to the word vectors of the source language in order to learn how to tailor the vector space of the target language to the domain of interest. The performance of the proposed method was validated via extrinsic evaluation by addressing the established word retrieval task. To this aim, a new benchmark multilingual dataset, derived from Wikipedia, has been released. The results confirmed the effectiveness and usability of the proposed approach.


I. INTRODUCTION
In recent years, distributed vector representations of text have been widely applied to solve complex tasks in Natural Language Processing (NLP) such as sentiment analysis [1], machine translation [2], text categorization [3], and synonym prediction [4].
A pioneering word embedding model, namely Word2Vec, was proposed in [5]. The quality of its word-level text representations is impressive: it has been shown to effectively capture most of the semantic word-level relationships in large document corpora. Later on, several new word-level encodings (e.g., FastText [6], GloVe [7]) and contextualized models (e.g., XLNet [8], ELMo [9], BERT [10]) were proposed. The present study focuses on the Word2Vec model because, as discussed later on, it allows both word-level domain adaptation and multilingual alignment and still retains a high popularity level in several NLP applications [11].
Domain adaptation entails transforming high-dimensional vector spaces to specific domains [12]-[15]. The goal is to tailor the designed NLP solutions to specific application domains, such as energy [12], biology [15], and industry [14]. Within this scope, unsupervised domain adaptation techniques are particularly appealing, as they allow end-users to fine-tune a general-purpose model even in the absence of labeled data [13], [16].
The associate editor coordinating the review of this manuscript and approving it for publication was Gianluigi Ciocca.
Since the learning phase of the distributed representations of words relies on Deep Learning architectures, their computation requires (i) sufficiently large document corpora to learn robust data representations and (ii) adequate computational power (e.g., ad hoc Graphics Processing Units) to accomplish the task in reasonable time. To overcome the above-mentioned issues, in the last decade the NLP community has released several pre-trained general-purpose multilingual models (see, for example, [7], [17]-[19]).
Multilingual document corpora are not only used to separately train language-specific embedding models, but also to align them in a unified latent space [19]. To this purpose, a bilingual lexicon is used to map the words of a source language (e.g., English) to the corresponding translations. Aligned word embedding models have been exploited to effectively address cross-lingual NLP tasks, such as cross-lingual text classification [20], emotion lexicon induction [21], and cross-lingual summarization [22]. As a drawback, in many cross-lingual NLP scenarios the use of aligned multilingual word embeddings is still limited by the lack of pre-trained domain-specific models for languages other than English. Currently, the vast majority of pre-trained vectors were trained on general-purpose document corpora (e.g., Wikipedia). Only a few domain-specific models are currently available, mostly for the English language (see Section V-A1). Moreover, for less spoken languages it can be very hard to retrieve a sufficiently large corpus of domain-specific documents. This calls for new approaches to automatically inferring aligned domain-specific multilingual word embeddings.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This paper presents a new inference method aimed at adapting the general-purpose Word2Vec vectors of a target language to its domain-specific version. The idea is to rely on the underlying mapping between general-purpose and domain-specific word embeddings that is known for the English language. These aligned pre-trained models are either easy to retrieve or can be inferred thanks to the abundance of English-written document corpora. In other words, the goal is to overcome the lack of domain-specific data and word vectors of the target language by exploiting data richness for the English language.
Notice that the proposed approach can be easily extended to any application domain in which data and word vectors are predominantly available for one specific language (not necessarily English).
The proposed method consists of a two-step inference process: first, it automatically identifies the sub-space of domain-specific words of the target language using a binary classifier. According to the domain under consideration, a word in the original space can either change its coordinates in the hyperspace if its relative position does not reflect the semantic similarity with its neighbor words in the domain-adapted space, or be invariant under domain adaptation if its general meaning is not influenced by the domain. The classification step discriminates between the two cases mentioned above. Hence, it allows us to tailor the next adaptation phase to a reduced word set (typically, one order of magnitude smaller than the original one) and, thus, to avoid introducing bias in the original model. Next, the proposed method infers the new position of each selected word in the domain-specific latent space. The latter step relies on a multivariate regression model trained on word vectors of the source language. The key idea is to learn and opportunistically reuse the (potentially nonlinear) transformations that were previously applied by the multilingual embedding aligner to the words of the original language. Notably, the inference step is aligner-agnostic, i.e., it can be successfully applied whatever word embedding aligner was previously used on the aligned word vectors of the source language. As discussed later on, the proposed methodology is instrumental in addressing various cross-lingual NLP tasks (e.g., domain-specific text classification, text summarization).
Since, to the best of our knowledge, this is the first attempt to solve this particular issue, we crawled, prepared, and released a benchmark multilingual dataset tailored to our purposes. Benchmark data consist of (i) a set of document corpora retrieved from Wikipedia and written in seven different languages (i.e., Italian, English, French, Spanish, German, Arabic, Russian), (ii) the per-language word embeddings trained on general-purpose Wikipedia pages, (iii) a selection of terms related to specific domains (i.e., finance, technology, and medicine), (iv) the domain-specific, multilingual document corpora consisting of the term definitions on the basis of the Wikipedia interlanguage glossary, and (v) the per-language domain-specific embeddings.
To test the effectiveness and usability of the proposed method, we conducted an extrinsic evaluation of the model performance achieved on the word retrieval NLP task [23]. To this aim, we used the models trained on English documents as source vectors and separately tested the inferred domain-specific embeddings for the other languages (one by one) against the retrieved ground truth. We tested both linear and non-linear neural network-based regressors, relying on shallow and deep architectures. The results show that the models inferred using a deep fully-connected neural network model outperformed both general-purpose and linear models for most of the tested languages.

A. SUMMARY OF THE CONTRIBUTION
• To overcome the lack of domain-specific document corpora and pre-trained specialized models for less spoken languages, we study the problem of domain adaptation in multilingual Word2Vec embeddings. This work is, to the best of our knowledge, the first attempt to address the aforesaid research issue.
• We propose a two-step inference process based on (i) automatic identification of domain-specific words and (ii) supervised inference of the new word vectors in the domain-specific hyperspace of the target language.
• We release a new benchmark multilingual dataset tailored to the task under consideration. To the best of our knowledge, this is the first benchmark including general-purpose, multi-domain, and multilingual data and aligned word vectors at the same time.
The rest of the paper is organized as follows. Section II presents the preliminary results achieved in two practical NLP use cases. Section III overviews the related works and discusses the position of the present paper in the related literature. Section IV thoroughly describes the proposed methodology. Section V summarizes the results of the empirical evaluation, whereas Section VI draws conclusions and discusses the future research agenda.

II. MOTIVATING EXAMPLES
We report and qualitatively describe here the preliminary outcomes achieved by adopting the proposed method to address two well-known NLP tasks, i.e., word analogy [24] and retrieval [23]. The respective results are summarized in Tables 1 and 2, where Base indicates the outcomes produced by exploiting the general-purpose models, whereas Domain denotes the outcomes produced by the inferred models tailored to the technology domain (assuming that a sufficient amount of domain-specific data is not available to directly train the domain-specific model).

A. WORD ANALOGY TASK
The word analogy task entails answering analogical questions like man is to king as woman is to? by specifying the most appropriate word (e.g., queen). Word embeddings have significantly simplified and improved the performance of the NLP approaches used to tackle the above-mentioned task. Specifically, in [5] the authors showed that Word2Vec embeddings exhibit seemingly linear behaviour. The embeddings of the analogy woman is to queen as man is to king approximately describe a parallelogram [25], even if the model is not specifically trained to address such a task. Hence, given the vector representations of the words man, king, and woman in the hyperspace, the analogical question man is to king as woman is to? can be solved by simply computing a linear combination of vectors in the hyperspace ($v_{king} - v_{man} + v_{woman}$).
For each analogical question, Table 1 reports the top-5 nearest neighbor words in the vector space corresponding to the resulting vector. The aim is twofold: (i) test the ability of the models to retrieve appropriate words at the top of the rank and (ii) compare the rank produced by the general-purpose model with those achieved by the domain-specific ones. The latter are expected to produce more pertinent answers to questions related to the technological domain. The results confirmed the expectation for all the tested languages.
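The vector arithmetic described above can be illustrated with a minimal sketch. The toy two-dimensional vectors and the helper names `cosine` and `analogy` are hypothetical and chosen only for illustration; they are not taken from the models evaluated in Table 1, and excluding the three query words from the candidates is a common convention assumed here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(emb, a, b, c, top_k=5):
    """Answer 'a is to b as c is to ?' by ranking words near v_b - v_a + v_c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    # Assumption: the three query words are excluded from the candidates.
    candidates = [w for w in emb if w not in (a, b, c)]
    ranked = sorted(candidates, key=lambda w: cosine(emb[w], target), reverse=True)
    return ranked[:top_k]

# Hypothetical toy embeddings: axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "apple": [-1.0, 0.2],
}
top = analogy(toy, "man", "king", "woman", top_k=2)
```

With real embeddings the same ranking would be computed over the full vocabulary, once with the general-purpose vectors and once with the domain-specific ones.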

B. MOST SIMILAR WORD RETRIEVAL TASK
The most similar word retrieval task entails answering a query by retrieving the most similar words. The goal is to evaluate the ability of domain-specific models to better capture the semantic relationships among words belonging to the technological domain. Table 2 summarizes the achieved results, which highlight the specialization of the inferred model. For example, given the query memoria (i.e., the Italian word for memory) it retrieves words like usb rather than ricordo or commemorazione, which are the Italian translations of recollection and remembrance, respectively.
A quantitative evaluation of the performance of the proposed method in solving this particular task is given in Section V.

III. RELATED WORK
The main goal of word embedding methods is to organize words into a high-dimensional hyperspace such that their distance reflects their semantic similarity [26]. To achieve this goal, the learning process relies on the distributional hypothesis. The rationale behind such a hypothesis is that linguistic items occurring in similar contexts likely have similar meanings [27]. Hereafter we will separately present (i) the most relevant word embedding models, (ii) the studies aimed at tailoring general-purpose models to specific domains, (iii) the strategies used to align embeddings in multilingual contexts, and (iv) the efforts made in contextualized embeddings. Finally, we will clarify the position of the present work in the state-of-the-art literature.

A. WORD EMBEDDING MODELS
Training vector representations of text using neural networks was first proposed by Bengio et al. [28], whose main goal was to learn a probabilistic language model. A pioneering work in this field was presented in [5]. Given a large training corpus, the authors proposed an effective and efficient neural network-based approach (namely Word2Vec) to learning word embeddings based on a sliding window strategy. The indisputable success of the Word2Vec model in supporting several NLP tasks has fostered a huge body of work on learning vector space models. For example, FastText [6] extended the Word2Vec model by also encoding sub-words. This alleviates the Out-Of-Vocabulary problem since the network can infer the embedding of a new word by combining the vector representations of the n-grams that compose it. GloVe [7] and MWE [29] inferred word vector representations based not only on the local context of a word, but also on global information reported in a word co-occurrence matrix. The present study focuses on Word2Vec. Notice that, unlike FastText, GloVe, and MWE, Word2Vec supports both word-level domain adaptation and multilingual word vector alignment.

B. DOMAIN ADAPTATION
Word embeddings may differ from one domain to another due to lexical and semantic text variations. Hence, their performance has been shown to be strongly dependent on the training corpus [30]. To capture domain specificity, a relevant research effort has been devoted to fine-tuning general-purpose vector spaces to the peculiarities of specific domains. For example, the method presented in [31] focuses on capturing the word polysemy in different contexts based on topic modeling, whereas in [32] a meta-learner is used to expand the in-domain corpus by exploiting the corpora from a set of past related domains.
Unsupervised domain adaptation approaches (e.g., [33]) often rely on ad hoc heuristics to identify pivot words, i.e., words that are frequently used in a specific domain. Domain adaptation is crucial to successfully employ the embedding model in specific application areas such as finance and healthcare [13]. For example, in [12] and in [15], [34] the authors empirically demonstrated how document corpora respectively ranging over oil/gas and biomedical domains can be exploited to improve the quality of word embeddings. In [14] the authors proposed an architecture aimed at adapting general-purpose word embeddings using industry-specific data in order to improve document classifier performance. The benefits of using specialized word embedding models have been demonstrated in languages other than English as well [35].

C. BILINGUAL EMBEDDING ALIGNMENT
Several studies have investigated the alignment between pairs of embedding models (namely, the source and target models). The goal is to map words of a source language to the corresponding ones of the target language. This is particularly useful for addressing automated machine translation [36]. Unsupervised approaches (e.g., [37], [38]) focused on learning a transformation from the source to the target by assuming an empirical distribution in the embedding models, whereas supervised strategies (e.g., [19], [39], [40]) relied on bilingual lexicons.

D. CONTEXTUALIZED EMBEDDINGS
Contextualized embeddings are vector representations of text where a target word's embedding can change depending on the context in which it appears [8], [9], [41], [42]. Unlike Word2Vec, FastText, and GloVe, they rely on a dynamic representation for each word. Therefore, by construction, they are unsuitable for generating multilingual word-level vector alignments.

E. POSITION OF THE PRESENT WORK IN THE STATE OF THE ART
• This work focuses on Word2Vec embeddings [5]. FastText [6] is not applicable because it relies on sub-word compositionality; thus, it can be aligned only for static embedding models. GloVe [7] cannot be used since it is based on the overall word co-occurrence statistics of a single corpus known only at initial training time.
• The aim is to adapt multilingual general-purpose word embeddings to a specific domain to overcome the lack of domain-specific data. Hence, it is a combination of the domain adaptation and bilingual alignment tasks.
• The aim is not to propose new ad hoc solutions separately for the domain adaptation and supervised embedding alignment tasks.
• The use of contextualized embeddings is out of scope of the present work and will be addressed as future work (see Section VI).

IV. PROPOSED METHODOLOGY
Let $L$ be a set of languages and let $V_l$ be the vocabulary of words of a language $l \in L$. We assume that we have sets of word embeddings $E_l$ ($l \in L$) trained independently on monolingual data. We differentiate between general-purpose embeddings $E^G_l$, i.e., word embeddings trained on multi-domain, general-interest document corpora such as the whole Wikipedia corpus, and domain-specific embeddings $E^\delta_l$, which are specialized using document corpora tailored to a specific domain $\delta$.
Algorithm 1 reports the main steps of the proposed methodology. A graphical sketch of key phases is depicted in Figure 1. The procedure takes as input the general-purpose and domain-specific document corpora for the source language as well as the general-purpose corpus for one or more target languages. The expected outcome is to infer domain-specific word embedding models separately for each target language. Once all general-purpose embedding models are trained, the model corresponding to the source language is fine-tuned by exploiting a domain-specific corpus (see Figure 1a). In the current implementation of the proposed method, both model training and domain adaptation rely on Word2Vec [5]. However, the embedding method can be straightforwardly substituted with any other word-level embedding that allows domain-adaptive fine-tuning. Then, the general-purpose models for the target languages are all aligned to the corresponding model for the source language by adopting the supervised approach proposed by [19] (see Figure 1b). As in the previous step, different bilingual alignment strategies can be easily integrated as well. Next, to infer domain-specific embeddings for the target languages it is first necessary to discriminate between words that are specific to the target domain and words that are not. To this aim, a binary classifier is trained on the source language models to predict which words in the general-purpose model of each target language are likely to be specific to the target domain (see Figure 1c). For the subset of words of the target language that are labeled as domain-specific ($V^{true}_l$), new vectors are inferred by using a regression model (see Figure 1d). The regression step learns from the embeddings available in the source language the mapping between word vectors of the general-purpose and domain-specific models. The mapping is opportunistically reused to infer new word vectors for the target languages. Finally, the newly inferred vectors are joined with the word vectors labeled as not domain-specific ($V^{false}_l$) to compose the complete domain-specific embeddings for the target languages $E^\delta_t$. A more thorough description of each step is given in Algorithm 1.

A. DOMAIN ADAPTATION
For each language $l \in L$ the domain adaptation phase takes as input the general-purpose embedding $E^G_l$ and the domain-specific corpora $D^\delta$. It generates the corresponding domain-specific embedding $E^\delta_l$ (see Figure 1a).
This phase entails fine-tuning the general-purpose model by shifting the vectors of domain-specific words in order to better capture their context-specific semantic meaning. The key idea is to specialize the general-purpose model for the source language (typically, English) for which a sufficiently large amount of domain-specific data is available. Such a specialized model will be opportunistically re-used to infer the mapping between general-purpose and domain-specific models for the target languages.
Notice that, at this stage, pretrained general-purpose models (e.g., [17]) can be exploited to avoid retraining the vector representations of the source text from scratch. Although a number of open-source projects have released general-purpose models (more details are given in Section V-A), only a few of them include domain-specific data and models, and mostly for a limited number of languages. The latter evidence inspired our research.

B. BILINGUAL EMBEDDING ALIGNMENT
Let $l_s$, $l_t$ be a pair of source and target languages. Each word $w_i$ in the vocabulary of the source language (respectively target language) is associated with a real-valued vector $x_i$. To align the two corresponding embeddings $E_{l_s}$ and $E_{l_t}$ we exploit an initial bilingual lexicon, of size $d$, that maps each word $w^s_i$ of the source language to the corresponding translation $w^t_i$ of the target language. The bilingual alignment step entails extending the lexicon to all source words in $V_{l_s}$ that are not present in the initial lexicon so that all word vectors in $E_{l_s}$ have an explicit mapping to $E_{l_t}$ (see Figure 1b). State-of-the-art alignment methodologies leverage bilingual lexicons to optimize a retrieval criterion that generalizes to the full vocabulary, thereby learning a source-to-target alignment function.
In our context, we consider as source language the one for which both general-purpose and domain-specific corpora are given (typically, English). The target language is the language for which only general-purpose document corpora are currently available but there is a need to learn domain-specific word embeddings.
Let $M_s$ and $M_t$ be the real-valued matrices respectively containing the word embeddings in $E_{l_s}$ and $E_{l_t}$ of the words in the initial lexicon. Bilingual embedding alignment entails learning a linear mapping $W$ between the source and target hyperspaces so that the discrepancy between the corresponding word vectors is minimized, where $x^t_i$ and $y^s_j$ are mapped word vectors in the source and target spaces and $\lVert x^t_i - y^s_j \rVert^2$ is the square loss function to be minimized.
To align bilingual word embeddings we exploited the supervised approach proposed in [19] by considering, as initial bilingual lexicons, the ones released by [39].
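To make the idea of learning a linear mapping under a square loss concrete, the following is a minimal sketch that fits a matrix by plain gradient descent on a toy lexicon. It is a simplified stand-in and not the supervised approach of [19], which optimizes a retrieval criterion; the function names, the learning rate, the number of epochs, and the toy vectors are all assumptions made for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def square_loss(W, pairs):
    """Mean squared distance between mapped source vectors W x and target vectors y."""
    total = 0.0
    for x, y in pairs:
        total += sum((p - q) ** 2 for p, q in zip(matvec(W, x), y))
    return total / len(pairs)

def fit_linear_map(pairs, dim, lr=0.1, epochs=500):
    """Learn W by gradient descent on the mean square loss (assumed hyperparameters)."""
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]  # identity start
    n = len(pairs)
    for _ in range(epochs):
        grad = [[0.0] * dim for _ in range(dim)]
        for x, y in pairs:
            residual = [p - q for p, q in zip(matvec(W, x), y)]
            for i in range(dim):
                for j in range(dim):
                    grad[i][j] += 2.0 * residual[i] * x[j] / n
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad[i][j]
    return W

# Hypothetical toy lexicon: each target vector is the source vector rotated by 90 degrees.
lexicon_pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.0, 1.0], [-1.0, 0.0]),
    ([1.0, 1.0], [-1.0, 1.0]),
]
W = fit_linear_map(lexicon_pairs, dim=2)
final_loss = square_loss(W, lexicon_pairs)
```

Once fitted on the lexicon pairs, the same matrix would be applied to every source vector outside the lexicon, which is how the mapping generalizes to the full vocabulary.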

C. DOMAIN-SPECIFIC VECTOR IDENTIFICATION
The classification step (depicted in Figure 1c) aims at classifying each word vector $x_i$ belonging to the general-purpose embedding $E^G_{l_t}$ for the target language $l_t$ as follows.
$$ l(x_i) = \begin{cases} \text{true}, & \text{if } w_i \text{ is likely to be domain-specific} \\ \text{false}, & \text{otherwise} \end{cases} $$
To accomplish this task we study the correlation between general-purpose and domain-specific word vectors $E^G_s$ and $E^\delta_s$ in the source language. The idea behind it is to rely on the empirical evidence from the domain adaptation process previously applied to the source language. Specifically, the word vector shifts that would be produced by domain adaptation for the target language are expected to reflect, to a good approximation, those observed for the source language. Hence, similar word vectors are likely to show similar shifts in the adaptation phase. The word-level prediction model can be formulated as a boolean function $f$.

D. DOMAIN-SPECIFIC VECTOR INFERENCE
This step builds the domain-specific embeddings $E^\delta_t$. They consist of (i) the vectors of domain-specific words (i.e., the words labeled as true at the previous step), which are likely to change with respect to the corresponding vector in $E^G_t$, and (ii) vectors of not domain-specific words (i.e., the words labeled as false), which are invariant under domain adaptation as their semantic meaning is unlikely to be influenced by the domain under consideration. To estimate the domain-specific vectors we infer the position of the type-(i) vectors using a regression model, whereas we approximate the type-(ii) vectors as those already available in the general-purpose model (i.e., we assume that domain adaptation does not yield any type-(ii) vector shift in the hyperspace).
Analogously to what was previously done for domain-specific vector identification, we learn how to shift word vectors for the target language by studying the correlations between general-purpose and domain-specific word vectors $E^G_s$ and $E^\delta_s$ in the source language. At this stage, we predict the exact values of each element of the new vector by learning a regressor $r$ that maps $x_i$ to $x_j$, where $x_i$ is the vector associated with word $w^t_i$ in the general-purpose model, whereas $x_j$ is the vector associated with the same word in the domain-specific model (after the possible shift due to domain adaptation).
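The way the classification and regression steps combine into the final embeddings can be sketched as follows. This is only an illustration under strong assumptions: source words are labeled as domain-specific when their observed shift exceeds a hypothetical `threshold`, and a 1-nearest-neighbor lookup in the aligned space stands in for both the trained classifier and the trained regressor. None of these choices, nor the function names, are taken from the paper.

```python
import math

def nearest(vec, source_gp):
    """Source word whose general-purpose vector is closest to vec (aligned space assumed)."""
    return min(source_gp, key=lambda w: math.dist(source_gp[w], vec))

def infer_domain_embeddings(target_gp, source_gp, source_ds, threshold=0.5):
    """Two-step inference sketch: classify each target word, then shift or copy its vector."""
    # Shifts observed on the source language during domain adaptation.
    shifts = {w: [d - g for g, d in zip(source_gp[w], source_ds[w])] for w in source_gp}
    # Assumption: a source word is domain-specific if its shift norm exceeds the threshold.
    labels = {w: math.sqrt(sum(s * s for s in shifts[w])) > threshold for w in shifts}
    inferred = {}
    for word, vec in target_gp.items():
        ref = nearest(vec, source_gp)
        if labels[ref]:
            # Type-(i) word: stand-in regressor reuses the shift of the nearest source word.
            inferred[word] = [v + s for v, s in zip(vec, shifts[ref])]
        else:
            # Type-(ii) word: the general-purpose vector is kept unchanged.
            inferred[word] = list(vec)
    return inferred

# Hypothetical toy data: "memory" shifts under domain adaptation, "the" does not.
source_gp = {"memory": [1.0, 0.0], "the": [0.0, 1.0]}
source_ds = {"memory": [1.0, 1.0], "the": [0.0, 1.0]}
target_gp = {"memoria": [0.9, 0.1], "il": [0.1, 0.9]}  # already aligned to the source space
inferred = infer_domain_embeddings(target_gp, source_gp, source_ds)
```

In a fuller implementation the two stand-ins would be replaced by a trained binary classifier and a trained multivariate regressor, while the final join of shifted and unchanged vectors would stay the same.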

V. EXPERIMENTAL RESULTS
We summarize here the outcomes of the empirical analysis carried out on the document corpora retrieved from Wikipedia. Specifically, Section V-A describes the newly released benchmark dataset, Sections V-B and V-C formalize the addressed NLP task and the tested models, respectively. Section V-D reports the outcomes of the performance comparison. Section V-E analyzes the effect of the system parameters.
The experiments were run on a machine equipped with 32GB of RAM, an Intel Xeon E5-2680 CPU, and an Nvidia Tesla K40 GPU. The computational time required by the overall process of domain-specific model inference (including both classification and regression) was quite variable across languages. It ranged from 51 seconds (Arabic language) to 175 seconds (German language).

A. BENCHMARK DATASET
The lack of open multilingual datasets that fit our purposes prompted us to crawl, prepare, and release a new benchmark dataset, namely AMED (Adapting Multilingual word Embeddings to specific Domains). The AMED benchmark dataset consists of a set of multilingual document corpora retrieved from Wikipedia and ranging over different topics. The Wikipedia online encyclopedia is a common source of data to learn word representations, as it is available in many languages [17]. More specifically, it includes
1) The full Wikipedia dump crawled in November 2020 separately for each of the following languages: Italian, English, French, Spanish, German, Arabic, Russian.
2) The general-purpose word embedding models trained on the per-language Wikipedia dumps at Point (1).
3) For a subset of domains (i.e., medicine, technology, finance), the lists of most representative terms in the Wikipedia glossary translated in all the languages considered at Point (1).
4) The multilingual document corpora consisting of the definitions of the selected Wikipedia terms retrieved at Point (3). Definitions are given in all the languages considered at Point (1).
5) The domain-specific word embedding models adapted to the domains specified at Point (3) by using the multilingual document corpora selected at Point (4).
The multilingual document corpora used to train the general-purpose models were retrieved from the latest dump of the language-specific Wikipedia encyclopedia. Domains at Point (3) were selected among the most common categories in the English Wikipedia dump (e.g., https://en.wikipedia.org/wiki/Category:Finance). Glossary terms at Point (3) were extracted by considering the corresponding glossary sub-categories. The domain-specific documents at Point (4) were retrieved by first querying the Wikipedia glossary in English through the PetScan tool and then by following the corresponding Wikipedia interlanguage links in order to retrieve consistent documents across different languages. Table 3 summarizes the main data characteristics. As one can clearly deduce from the reported statistics, the English corpus is six times larger than those available in the other languages (6 million vs. 1 million). Furthermore, the number of domain-specific documents tailored to a single domain is significantly smaller (three orders of magnitude lower). This reinforces the motivations behind our research: in contexts where retrieving sufficiently large corpora written in languages other than English is challenging (e.g., summarization of patents or technical reports, conversational agents for technical support, multilingual search engines), training domain-specific models would be challenging. Finally, the characteristics of the textual definitions are rather diversified across languages (e.g., definitions in Russian contain approximately half as many words as those in all the other languages).

1) COMPARISON WITH EXISTING BENCHMARKS
Other researchers have previously released large textual corpora and word embedding models along with the open source implementations of their research projects. For example, in [5] the authors released English word embeddings trained on Google News; in [7] the authors released English models trained on Wikipedia, Gigaword and Common Crawl. In [18] the authors released general-purpose word embeddings trained for 100 languages based on Wikipedia data. [6] and [17] respectively released FastText and Word2Vec word embeddings for 44 and 157 languages using Wikipedia and data from the Common Crawl project. However, to the best of our knowledge, a benchmark dataset consisting of both general-purpose and domain-specific embeddings in various domains and languages has not been presented in the literature yet.

B. WORD RETRIEVAL TASK
The retrieval task has long been known [23] and has been largely addressed by the NLP community (e.g., [43]-[45]).
To extrinsically evaluate the quality of the inferred models we formulated the retrieval task on the benchmark dataset as follows: given a Wikipedia term, retrieve the keyphrases in the corresponding glossary definition. Since this work focuses on word embeddings, we applied the following data preparation steps:
1) For each term in the multilingual Wikipedia glossaries retrieve the title of the corresponding Wikipedia page.
2) Extract the set of words occurring in the title (excluding the stopwords).
3) Term ← $w^T_1, w^T_2, \ldots, w^T_n$
4) Summarize the Wikipedia page using the top-2 sentences in the document.
5) Extract the set of words occurring in the keyphrases (except for the stopwords).
To our purposes, we reformulate the word retrieval task as follows: given a term retrieve the words in the definition.

1) EXTRINSIC EVALUATORS
We extrinsically evaluate model effectiveness in addressing the word retrieval task in terms of Precision, Recall and F1-Measure [46]. The aforesaid measures are established in Information Retrieval [23].
For each term T we first retrieve a ranked list of words Ret. Then, we evaluate the pertinence of the retrieved words placed at the top of the ranking to the description as follows.
where K is the target number of top ranked words to retrieve, Ret K is the number of words in the top-K of Ret that were actually retrieved from the description D, and |D| is the total number of words in the description.
Precision is the percentage of correctly retrieved words over the total number of retrieved words, recall is the percentage of correctly retrieved words over the total number of description words to be retrieved, whereas F1-measure is the harmonic mean of precision and recall.
The aforesaid measures are averaged over all the analyzed terms in order to get a single quality score per model. Notice that the number of words in the definition is approximately twice the number K of words to retrieve (see Table 3). The only exception is the Russian language, where the two aforesaid counts are approximately equal.
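As an illustration of the measures described above, the following sketch computes Precision, Recall, and F1-measure at a cut-off K for a single term and then averages the scores over terms. The function names are hypothetical, and the ranked list is assumed to contain no duplicate words.

```python
def precision_recall_f1_at_k(ranked_words, description_words, k):
    """Score the top-k retrieved words against the description words."""
    description = set(description_words)
    # Number of top-k retrieved words that occur in the description.
    hits = sum(1 for w in ranked_words[:k] if w in description)
    precision = hits / k
    recall = hits / len(description) if description else 0.0
    # Harmonic mean of precision and recall; defined as 0 when nothing is hit.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_scores(per_term_scores):
    """Average (precision, recall, f1) triples over all analyzed terms."""
    n = len(per_term_scores)
    return tuple(sum(s[i] for s in per_term_scores) / n for i in range(3))
```

For example, with K = 3, a description of six words, and two of the top-3 retrieved words occurring in it, precision is 2/3, recall is 1/3, and F1-measure is 4/9.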

C. EMBEDDING MODELS
We tested the following multilingual embedding models:
• Non-Linear Inference Model (NLIM): the word embedding model inferred from the general-purpose one for the target language using the proposed method. The inference process relies on non-linear classifiers and regressors.
Both word embedding training and fine-tuning phases were performed using the Gensim library [47]. The general-purpose (GP) model will be used as a reference to get a lower-bound estimate of the performance, as domain-specific models are expected to perform better than the general-purpose ones. Conversely, the performance of the Ground Truth (GT) model will be considered as an upper-bound estimate, since the proposed inference method is assumed not to take advantage of domain-specific data in the target language. The closer the extrinsic evaluation scores are to the GT's, the better the result.
LIM is the proposed inference method, where both the classification and regression steps rely on linear predictive models. As linear models we considered the Linear Regressor and Support Vector Classifier available in the SciKit-Learn library [48]. NLIM is the variant of the proposed inference model, where both steps potentially rely on non-linear predictions. The comparison between LIM and NLIM is aimed at understanding the extent to which non-linear predictors could enhance model performance compared to simpler (linear) ones. In NLIM we explored the use of deep neural network-based models as well. Specifically, as non-linear models we relied on fully connected neural networks (MultiLayer Perceptron with ReLU activation function) and explored both shallow and deep versions of the network architecture (more details are given in Section V-E).

... and Ground Truth separately for each domain. To deepen the analyses, Table 4 reports the Precision, Recall, and F1-measure scores for three representative K values (i.e., 3, 7, and 10) separately for each domain. Columns labeled as F1 vs. G.T. in Table 4 indicate, for each method, the ratio (expressed as a percentage) of the achieved F1-measure to the G.T. score. In most cases, both linear and non-linear methods outperformed the general-purpose model. The gap is particularly significant for specific European languages (e.g., French, Spanish), where the syntactic and semantic language similarities with the source language (English) provide clear benefits. Surprisingly, convincing results were achieved for non-European languages as well for all the analyzed domains (e.g., in Russian NLIM achieved 86% of the G.T. score for Medicine). This supports the hypothesis that word shifts due to domain adaptation are, to a large extent, predictable independently of language grammar and syntax.
As expected, the non-linear model achieved better performance than the linear one in almost all languages and domains due to the inherent complexity of the inference task. In the Arabic language the Ground Truth performed slightly worse than the inference model according to the extrinsic evaluation scores. This is probably due to the higher morphological richness and greater lexical ambiguity of the Arabic language compared to English, which have already been highlighted by previous studies related to Arabic Wikipedia content (e.g., [49]). The latter findings reinforce the need for alternative, algorithmic solutions to automatically infer domain-specific models, such as the approach proposed in the present study.

E. PARAMETER ANALYSIS
We investigated the use of fully connected neural networks with different characteristics to tackle both the vector identification and inference problems. Figure 5 plots the F1-measure scores achieved by the 2-layer deep neural network architectures with different widths (W) for the technology domain (chosen as representative). The results show that, independently of the considered language, the performance is weakly influenced by the number of nodes per layer, provided that it is above the number of inputs (300). Therefore, to limit the computational complexity of model training, hereafter we set W to 900 (3 times the number of inputs) for all the considered languages. Figure 6 shows the impact of the network depth, where we varied the number of hidden layers from 1 to 5. The best average performance was achieved by the 2- and 3-layer networks on most of the tested languages and domains. Typically, the level of complexity of the inference process does not seem to require the use of more than 2 or 3 layers. For example, for the Arabic language the 4-layer Deep Learning architecture performed worst (see Figure 6c). Hence, to avoid data overfitting and to limit the computational time, we recommend using, as default setting, a 2-layer fully connected network.
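To give a rough sense of how width and depth affect model size, the sketch below counts the trainable parameters (weights plus biases) of a fully connected network. It assumes, for illustration only, that the regressor maps a 300-dimensional input vector to a 300-dimensional output vector; the output size and the function name are assumptions and are not stated in the text above.

```python
def mlp_parameter_count(n_inputs, n_outputs, width, depth):
    """Count weights and biases of an MLP with `depth` hidden layers of `width` nodes."""
    layer_sizes = [n_inputs] + [width] * depth + [n_outputs]
    # Each dense layer has fan_in * fan_out weights plus fan_out biases.
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:])
    )


# Recommended default setting: 2 hidden layers, W = 900 (assumed 300-d output).
default_size = mlp_parameter_count(n_inputs=300, n_outputs=300, width=900, depth=2)
```

Under these assumptions the default 2-layer network has 1,352,100 parameters, and each additional hidden layer of width 900 adds 810,900 more, which is consistent with preferring shallower networks when deeper ones bring no accuracy gain.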

VI. CONCLUSION AND FUTURE WORK
The paper proposed to infer aligned domain-specific Word2Vec embeddings in a multilingual scenario where, for some of the considered languages, there is a lack of domain-specific data and/or pre-trained word vectors. Since, typically, this is not an issue for all languages but only for a subset of them, we proposed to opportunistically reuse the information provided by a source, data-rich language (e.g., English) to infer how word vectors should change in order to tailor general-purpose models to specific domains. An extrinsic evaluation carried out on a newly proposed benchmark dataset shows that the proposed approach is able to effectively support word retrieval in a multilingual context.
The main takeaways from the experiments are enumerated below:
• Both Linear and Non-Linear models outperformed the General-Purpose one. When coping with document corpora relative to domains and languages for which the standard domain adaptation pipeline is not applicable, they bring clear benefits to the NLP process.
• For specific combinations of language and domain (e.g., French-Medicine, Russian-Technology), the best performing version of the proposed approach achieved results comparable to the Ground Truth. In a few exceptional cases relative to the Arabic language, the inference-based model even beat the Ground Truth, probably due to the inherent complexity of the domain adaptation step.
• Non-Linear 2-layer fully connected deep models performed best on average. They were able to capture non-linear word vector relationships without incurring data overfitting.
The achieved results leave room for further improvements. Firstly, since multilingual data are often changing, we aim at studying how multilingual domain-specific word embeddings evolve over time [34]. Secondly, we plan to apply the proposed methodology to address various cross-lingual NLP tasks, including cross-lingual text summarization and sentiment analysis, search engines, cross-lingual media retrieval, and conversational agents. Finally, we aim at leveraging the proposed inference-based approach to map the vector representations of multimodal content (e.g., videos, images).

ACKNOWLEDGMENT
Computational resources were provided by HPC@POLITO, a project of Academic Computing within the Department of Control and Computer Engineering at the Politecnico di Torino (http://www.hpc.polito.it).