SpaceTransformers: Language Modeling for Space Systems

The transformers architecture and transfer learning have radically modified the Natural Language Processing (NLP) landscape, enabling new applications in fields where open-source labelled datasets are scarce. Space systems engineering is a field with limited access to large labelled corpora and a need for enhanced knowledge reuse of accumulated design data. Transformers models such as the Bidirectional Encoder Representations from Transformers (BERT) and the Robustly Optimised BERT Pretraining Approach (RoBERTa) are, however, trained on general corpora. To answer the need for domain-specific contextualised word embeddings in the space field, we propose SpaceTransformers, a novel family of three models, SpaceBERT, SpaceRoBERTa and SpaceSciBERT, respectively further pre-trained from BERT, RoBERTa and SciBERT on our domain-specific corpus. We collect and label a new dataset of space systems concepts based on space standards. We fine-tune and compare our domain-specific models to their general counterparts on a domain-specific Concept Recognition (CR) task. Our study demonstrates that the models further pre-trained on a space corpus outperform their respective baseline models in the Concept Recognition task, with SpaceRoBERTa achieving a significantly higher ranking overall.


I. INTRODUCTION
In the past three years, the transformers architecture [1] and transfer learning [2] have profoundly impacted the Natural Language Processing (NLP) landscape. Transfer learning consists of two stages: (i) a pre-training phase in which contextualised word embeddings are learned through self-supervised training tasks on a large unlabelled corpus (for instance, Masked Language Model (MLM) and Next Sentence Prediction (NSP) [2]), and (ii) a second phase in which the pre-trained model is fine-tuned for a specific task [3]. The performance of the downstream NLP tasks is thus greatly improved with the knowledge transferred from the pre-trained models. Numerous studies presented the theoretical background and empirical proof of the positive impact of the pre-training and fine-tuning setting for downstream tasks [4], [5]. The BERT model, standing for Bidirectional Encoder Representations from Transformers, from Google AI Language [2] advanced the state-of-the-art (SOTA) performance on 11 NLP tasks. Transfer learning brings a decisive advantage for NLP applications, especially for domains where annotated corpora are scarce.
The associate editor coordinating the review of this manuscript and approving it for publication was Francisco J. Garcia-Penalvo.
Space systems engineering is a field where access to large-scale annotated data is limited. Yet, experts involved in the early stages of space mission design can spend up to 50% of their work time searching for heritage and design information [6]. The accumulated data explored by experts mostly consist of unstructured data: past design reports, books and journal publications. This information bottleneck can be reduced by implementing NLP and text mining solutions. Concept Recognition (CR) is a first essential step for the identification and extraction of domain-specific fundamental concepts, enabling the structuring of accumulated data via the construction of ontologies [7].
While pre-trained transformer models such as BERT [2] or RoBERTa, a Robustly Optimised BERT Pretraining Approach [8], are trained on general corpora, domain-specific models such as SciBERT [9] have proven to be better suited to domain-specific downstream tasks. Pre-training language models from scratch is resource intensive, requiring large corpora (160 GB for RoBERTa [8]) and costly computational resources (7 days of training on a Tensor Processing Unit (TPU) for SciBERT [9]). Instead, we propose to further pre-train the baseline models on our domain-specific corpus. We choose the BERT-Base, RoBERTa-Base and SciBERT-SciVocab models to build SpaceTransformers, a family of three models for space systems language modeling: SpaceBERT, SpaceRoBERTa and SpaceSciBERT. While models pre-trained on a general corpus learned contextualised word embeddings for a general or scientific English vocabulary, our further pre-training specialises these models in space systems engineering. The models' performance is evaluated through a fine-tuning Concept Recognition (CR) task with a set of space systems terms annotated by hand by three human annotators. The contributions of this paper are summarised as follows:
1) We further pre-train and release SpaceTransformers, a novel open-source family of three models: SpaceBERT, SpaceRoBERTa and SpaceSciBERT, further pre-trained from BERT, RoBERTa, and SciBERT on our space systems corpus.
2) We release a novel labelling scheme based on space standards and its corresponding hand-annotated dataset for Concept Recognition (CR) of space systems terms.
3) We provide, for the first time, a thorough comparison of the performance of domain-specific models with respect to several baseline models on a classification task.
4) We demonstrate that further pre-training from RoBERTa-Base considerably improves the results on the downstream CR task for domain-specific language models.
The source code and domain-specific models are available at github.com/strath-ace/smart-nlp. All data underpinning this publication are openly available from the University of Strathclyde KnowledgeBase at https://doi.org/10.15129/8e1c3353-ccbe-4835-b4f9-bffd6b5e058b (further pre-training corpus) and https://doi.org/10.15129/3c19e737-9054-4892-8ee5-4c4c7f406410 (fine-tuning corpus and labelled concepts).
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. BACKGROUND AND RELATED WORK
A. TRANSFER LEARNING
The purpose of transfer learning is to first learn from an initial training objective, then apply the acquired knowledge to a different target objective. Let $s$ be an input sequence consisting of $m$ words such that $s = (t_1, \dots, t_m)$, where $t_i$ is the $i$-th word of the sequence. These tokens have a fixed initial embedding of dimension $n$, noted as $x_i$. The pre-training phase yields a contextualised embedding $y_i$ of dimension $d$ for each embedding $x_i$ of a term $t_i$, through a model $f$ configured by $\theta_f \in \Theta_f$, where $\theta_f$ represents a particular set of model parameters. In the pre-training phase, the model $f$ is trained in a self-supervised fashion. In a second phase, the pre-trained model is fine-tuned for a specific task. The contextualised representations previously obtained are used as inputs to the model $g$. The output is a probability distribution obtained through an identity or softmax activation function, configured by the parameters $\theta_g \in \Theta_g$ and of dimension $q$. The parametrisation of the fine-tuned model is thus configured by both $\theta_f$ and $\theta_g$. This framework has proven to be more efficient than training a task-specific model from scratch, requiring at least 10 times fewer task-specific data samples [2], [4]. The number of pre-training parameters, $\theta_f$, is usually much higher than the number of fine-tuning parameters $\theta_g$. For instance, the configuration of BERT-Base involves a $\theta_{f,\mathrm{BERT}}$ of 110M parameters [2]. Thus, the training set required for fine-tuning is significantly smaller than for the pre-training, while avoiding over-fitting.
Finally, let $C(\cdot, \cdot)$ be the loss function for training a neural net (e.g. cross-entropy). The cumulative empirical risk minimised in the fine-tuning setting is then defined in terms of $f_{\theta_f}$, the pre-trained model configured by the parameters $\theta_f$; $g_{\theta_g}(f_{\theta_f})$, the fine-tuning model configured by the parameters $\theta_g$; and $X_f$ and $Y_g$, respectively the pre-training and fine-tuning training sets.
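As a hedged illustration only, the notation of this subsection can be written out as below. The explicit mappings, the sample index $k$, the number of fine-tuning samples $K$, the sequence lengths $m_k$ and the target labels $z^{(k)}_i$ are assumptions introduced for this sketch and are not taken from the original equations.

```latex
% Sketch under stated assumptions: K labelled fine-tuning sequences,
% the k-th one having m_k tokens with target labels z^{(k)}_i.
\begin{align}
  s &= (t_1, \dots, t_m), \\
  (y_1, \dots, y_m) &= f_{\theta_f}(x_1, \dots, x_m),
      \qquad x_i \in \mathbb{R}^{n},\; y_i \in \mathbb{R}^{d},\; \theta_f \in \Theta_f, \\
  p_i &= g_{\theta_g}(y_i) \in \mathbb{R}^{q},
      \qquad \theta_g \in \Theta_g, \\
  \hat{R}(\theta_f, \theta_g) &= \sum_{k=1}^{K} \sum_{i=1}^{m_k}
      C\!\left( g_{\theta_g}\!\big( f_{\theta_f}(x^{(k)})_i \big),\; z^{(k)}_i \right).
\end{align}
```

In this reading, fine-tuning minimises $\hat{R}$ starting from the pre-trained value of $\theta_f$ and a freshly initialised $\theta_g$.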

B. DOMAIN-SPECIFIC LANGUAGE MODELS
There are three approaches found in the literature to generate domain-specific language models: (i) a generic model is fine-tuned on a domain-specific task, (ii) a model is further pre-trained from a generic pre-trained model with a domain-specific corpus, or (iii) a model is trained from scratch on a domain-specific corpus. Fine-tuning a pre-trained model for a domain-specific task is the quickest and easiest approach. In [10], the authors fine-tuned BERT-Base on a patent database for a classification task. Their model, patentBERT, achieved better results than the previous SOTA method based on a Convolutional Neural Network (CNN) and word vector embeddings. Reference [11] presents a downstream application similar to our work. The authors fine-tuned BERT-Base on a CR task to identify concepts related to space systems engineering. To the best of our knowledge, their study is so far the only application of transfer learning in the space field. Their labelled dataset was however based on a single document, the NASA System Engineering Handbook [12], and they chose high-level labels such as event or location, whereas our labels cover all management, product assurance and engineering disciplines found in 126 space standards.
Pre-training from scratch or further pre-training on a domain-specific corpus enables the introduction of domain-specific word embeddings in the language model, improving the performance on downstream domain-specific tasks. BioBERT [13] and VNLawBERT [14] were both further pre-trained from BERT-Base, respectively with biomedical publications and a Vietnamese legal corpus. A clinical language model presented in [15] was further pre-trained from BERT-Base and from BioBERT. Both ClinicalBERT [16] and FinBERT [17] were trained from scratch on an architecture similar to BERT's with, respectively, a corpus of clinical notes and a large financial corpus. The benefits of either further pre-training or training from scratch on a domain-specific corpus have been largely proven by these studies, as they all outperformed the original general language models on domain-specific tasks.
The choice between further pre-training and training from scratch is a trade-off between (i) the available domain-specific corpus size, (ii) the available computational resources, and (iii) the fine-tuning performance sought. Training from scratch is resource intensive: it requires a large domain-specific corpus and heavy computational resources. Both BERT and SciBERT use a corpus of around 3B tokens. The training of BERT-Base was performed in 4 days on 4 cloud TPUs [2]. RoBERTa was trained in one day over 1024 V100 GPUs [8]. SciBERT took 7 days to train from scratch with a single TPU v3 with 3 cores [9]. In [18], a legal language model, LEGAL-BERT, is trained on a 12 GB corpus of legal texts, either from scratch or further pre-trained from BERT-Base. The authors found that both were valid approaches with similar results. Our training corpus is similar in size to that of [18] and we use a single NVIDIA V100 GPU with 16 cores to train our models. Based on these limitations, the decision was taken to further pre-train our domain-specific models rather than train them from scratch. The methods mentioned in this literature review are summarised in Table 1.

C. CONCEPT RECOGNITION FOR SPACE SYSTEMS
CR is an NLP task used to identify and classify terms of interest from text. It is a word-level annotation exercise. For instance, CR in the clinical domain annotates labels associated with general terms, including, in the analysis of patient data, terms such as "treatments", "findings", and "problems" [19]-[21]. Similarly to the clinical domain, CR for space systems engineering includes generic terms, describing, for instance, the interface between engineering and management [11]. Therefore, concepts can be loosely defined as sequences that represent a specific cognitive construct in their domain [22]. In the context of systems engineering, these concepts can be "engineering unit", "system architecture" or "system analysis", given as examples for the label "system concepts" in [11]. One can assume that in systems engineering the concept "system" almost exclusively stands for the technical assembly of interconnected items or devices of a satellite or spacecraft, in contrast to generic text where "system" could have different meanings depending on context. In general, ambiguity depends on the target domain as well as on the level of granularity in the annotation scheme, which defines the level of abstraction. For instance, labels such as "tasks", "processes", and "materials" were used for constructing a scientific knowledge graph in [23]. These labels can be applied to multiple scientific domains such as computer science, biology, and mathematics, and thus have a low level of granularity with a high chance of ambiguity, as the meaning of a concept varies with the scientific field. Nevertheless, for the purpose of comparing scientific publications based on their intrinsic concepts, this level of granularity is considered sufficient [23]. Thus, the necessary level of granularity in the annotation scheme for CR depends on the later application, the target domain, and their tolerated level of ambiguity.
Different approaches for CR applications exist. Rule-based and pattern matching systems leverage hand-crafted rules on the text and its linguistic features to extract concepts, as shown in [24]. Alternatively, other methods are based on supervised Machine Learning (ML) methods, trained from example inputs and their expected outcomes. Linguistic feature-based ML systems such as support vector machines, decision trees, and conditional random fields used to be the preferred methods for CR [20]. However, in recent years, these were increasingly replaced by deep learning approaches using word embeddings as input features [19], [23]. Language models and transfer learning have recently contributed significantly to this field. Transfer learning increases the performance of CR applications, as seen in [25], [26], while requiring a smaller labelled dataset than training from scratch. Furthermore, the contextualised representation contributes to recognising and differentiating concepts based on their context, thus increasing the accuracy of the model predictions.

III. CORPORA
The study involves two corpora:
1) A further pre-training corpus: a 14.3 GB collection of unstructured documents related to space systems, acquired from heterogeneous sources.
2) A fine-tuning corpus: 28,763 textual requirements extracted from European Cooperation for Space Standardisation (ECSS) standards.

A. FURTHER PRE-TRAINING CORPUS
The training corpus is a collection of 5,266 unstructured documents including books, publication abstracts, and Wikipedia pages. These documents were manually gathered. They were chosen as they represent the typical information sources used by space systems engineers. The books cover most of the fields of space mission design, and are publicly available. The abstracts were extracted from papers published in three peer-reviewed journals: Acta Astronautica, Advances in Space Research, and Aerospace Science and Technology. All papers were published between 2017 and 2019 inclusive, and therefore describe recent work. Using the abstracts of the publications was found to yield better results than using the full journal publications. The reason is most likely that papers include mathematical notations, figures and tables which introduce noise. The Wikipedia webpages were scraped and manually filtered using the hyperlinks connecting pages to the spacecraft design webpage. Table 2 provides statistics on the training corpus. The sentences are mainly extracted from books (70%), then from publication abstracts (17.6%) and Wikipedia (12.4%). This distribution reflects the language complexity of these different sources.

B. FINE-TUNING CORPUS
The fine-tuning corpus consists of annotated requirements extracted from ECSS standards. ECSS is an initiative launched by the European Space Agency (ESA) in 1993 to define a coherent and single set of standards for all European space activities [27]. A total of 28,763 requirements are collected from 126 individual standards, as shown in Table 3. The ECSS standards are split into three main branches under an overhead branch called System: Management, Product assurance and Engineering, covering the design and implementation of the standards and requirements. Each requirement briefly describes a regulatory provision to be complied with in the form of "what to do" in a customer-supplier context [28]. Because they are intended for use in a binding contract, the requirements are written in a clear and unambiguous language. Additionally, the average number of tokens per requirement is similar for all branches.
For the fine-tuning, we used requirements from the three branches. Focusing on just the majority branch, Engineering, would not be feasible as the standards are to be used in conjunction with each other and not as single documents. For instance, the topic "Software" is covered by two standards belonging to the Engineering and the Product assurance branches. Nevertheless, there is an effort to avoid duplication of content in requirements, with the ideal situation being that each requirement is unique [29].

IV. METHODOLOGY
The SpaceBERT, SpaceRoBERTa and SpaceSciBERT models are respectively further pre-trained from BERT-Base, RoBERTa-Base, and SciBERT-SciVocab. The pre-trained and further pre-trained models are fine-tuned on a domain-specific CR task. The methodology is summarised in Figure 1.

A. FURTHER PRE-TRAINING
Further pre-training a model means that in the pre-training phase, instead of randomly initialising the weights $\theta_f$, the weight values of a baseline model such as BERT, RoBERTa or SciBERT are reused.
Hence the weights $\theta_f$ for the three further pre-training tasks are initialised with one of the following sets of weights: $\theta_f \in \{\theta_{f,\mathrm{BERT}}, \theta_{f,\mathrm{RoBERTa}}, \theta_{f,\mathrm{SciBERT}}\}$, where $\theta_{f,\mathrm{BERT}}$, $\theta_{f,\mathrm{RoBERTa}}$ and $\theta_{f,\mathrm{SciBERT}}$ are respectively the sets of weights of the pre-trained models BERT, RoBERTa and SciBERT. Weight initialisation from a pre-trained model also implies the reuse of the original model vocabulary.
The authors of the SciBERT model [9] observed an average improvement of only +0.76 F1 score on biomedical tasks when using their domain-specific vocabulary. They concluded that training with a domain-specific corpus had more impact than using a domain-specific vocabulary. A study similar to ours, BioBERT [13], chose to rely on the BERT-Base vocabulary. The authors assessed that since the WordPiece tokenization used to build the BERT vocabulary reduces out-of-vocabulary issues, it was fit to represent and fine-tune their domain-specific terms. An alternative to training from scratch with a domain-specific corpus is to replace "Unused" tokens in the vocabulary with domain-specific words. To assess if a modification of the original vocabulary was necessary, we extracted the top thousand most frequent words from our domain-specific corpus and compared our frequency-based lexicon to the vocabulary of BERT-Base-uncased, RoBERTa-Base, and SciVocab-uncased. The top 10 most frequent words in our frequency-based lexicon are: "satellite", "system", "orbit", "space", "spacecraft", "data", "time", "mission", "model", and "control". Out of our frequency-based lexicon, 87.8% of the words were already included in the BERT-Base-uncased vocabulary, 88.8% in the RoBERTa-Base vocabulary, and 89.9% in the SciVocab. Within these 1000 words, the 10% most frequent words were already included in all three vocabularies. Table 4 gives a sample of the words not found in the generic models' vocabularies.
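The lexicon comparison described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function names, the simple regular-expression word splitting, the whole-word membership test and the toy corpus and vocabulary are all assumptions made for the example.

```python
from collections import Counter
import re


def top_k_lexicon(texts, k):
    """Return the k most frequent lower-cased alphabetic words across the texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [word for word, _ in counts.most_common(k)]


def vocabulary_coverage(lexicon, vocabulary):
    """Return the share of lexicon words found as whole tokens in the vocabulary,
    together with the list of words that are missing."""
    vocab = set(vocabulary)
    missing = [word for word in lexicon if word not in vocab]
    return 1.0 - len(missing) / len(lexicon), missing


# Toy data (hypothetical): a real run would use the further pre-training corpus
# and the token list of the baseline model's tokenizer.
corpus = [
    "The satellite orbit is controlled by the spacecraft propulsion system.",
    "The spacecraft system transmits telemetry data during the mission.",
]
toy_vocab = {"the", "satellite", "orbit", "system", "data", "mission", "is", "by", "during"}

lexicon = top_k_lexicon(corpus, k=5)
coverage, missing = vocabulary_coverage(lexicon, toy_vocab)
print(f"coverage={coverage:.0%}, missing={missing}")
```

A word missing from the vocabulary is not lost to a subword tokenizer, since it is split into known pieces; the coverage figure only indicates how often that splitting is needed.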
As the number of domain-specific terms not covered by the original vocabularies was negligible, we decided to re-use the vocabularies and tokenizers of the models we were further pre-training from. The configuration and pre-trained weights of the BERT-Base, RoBERTa-Base and SciBERT models are accessed through the HuggingFace Python Transformers library [30]. For each model the pre-training weights and hyperparameters are thus initialised from one of the three baseline models, with the exception of the batch size and maximum sequence length. The batch size is set to 256, as for RoBERTa [8], with a gradient accumulation step of 16. The maximum sequence length of the input is set to 512 as defined in BERT [2]. The models are further pre-trained for 70 epochs on one NVIDIA V100 GPU hosted on the ARCHIE-WeST High Performance Computer. The further pre-training corpus is split between a training and a testing set, based on the classic 80%/20% partition.

B. REQUIREMENTS LABELLING
For the fine-tuning of the pre-trained models, the corpus presented in section III-B was used as a basis for the annotated dataset. The requirements are written in a precise and brief manner, with a high density of concepts relevant to space systems, making them useful for generating a CR dataset in this domain. An annotation scheme was carefully designed to cover the whole spectrum of the ECSS standards, creating labels for each of the three main branches: Management, Product assurance and Engineering. The labels were constructed from the domain experience of three human annotators as well as with the help of taxonomies of the space domain available online, such as the ESA Technology tree [31], the ESA Product tree [32] and the NASA taxonomy viewer. Eventually, 18 labels were defined for the annotation scheme. The complete description of each label is found at github.com/strath-ace/smart-nlp. Table 5 summarises the annotation scheme.
The individual requirements were annotated with the commercial software tool Prodigy (https://prodi.gy/) from the software company explosion.ai. To facilitate the annotation process, requirements addressing similar topics were annotated simultaneously. The process was repeated for all topics, ensuring that similar numbers of requirements were selected so that the resulting dataset would be balanced and cover the full scope of the ECSS standards. The annotation process was considered complete once the performance of the CR classifier was within an acceptable accuracy. Eventually, 882 requirements were annotated. Each annotator labelled the whole fine-tuning corpus independently. These results were then compared, showing a high level of inter-annotator agreement of 96.5%. Discrepancies between the three annotators were discussed and removed from the final set. The resulting numbers of annotated concepts present in the final dataset are shown in Table 6. The number of unique concepts found per label, as well as the ratio of unique concepts to the total number of concepts, called non-overlapping, are also displayed.
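The two dataset statistics mentioned above can be illustrated with a short sketch. The text does not state how the inter-annotator agreement was computed, so the token-level "all annotators agree" definition used here is an assumption, as are the function names and the toy labels.

```python
def percent_agreement(annotations):
    """Share of tokens on which every annotator assigned the same label.

    annotations: one label sequence per annotator, all of equal length.
    This simple definition is an assumption made for illustration.
    """
    n_tokens = len(annotations[0])
    if any(len(seq) != n_tokens for seq in annotations):
        raise ValueError("all annotators must label the same tokens")
    agreed = sum(1 for labels in zip(*annotations) if len(set(labels)) == 1)
    return agreed / n_tokens


def non_overlapping_ratio(concepts):
    """Ratio of unique concepts to the total number of concepts for one label."""
    return len(set(concepts)) / len(concepts)


# Toy example (hypothetical labels): "O" marks a non-concept token.
annotator_a = ["O", "Propulsion", "Propulsion", "O", "Thermal", "O", "O", "OBDH"]
annotator_b = ["O", "Propulsion", "Propulsion", "O", "Thermal", "O", "O", "OBDH"]
annotator_c = ["O", "Propulsion", "Propulsion", "O", "Thermal", "O", "O", "Communication"]

agreement = percent_agreement([annotator_a, annotator_b, annotator_c])
ratio = non_overlapping_ratio(["thruster", "thruster", "nozzle", "propellant tank"])
print(f"agreement={agreement:.3f}, non-overlapping={ratio:.2f}")
```

Chance-corrected measures such as Fleiss' kappa are a common alternative when label frequencies are very unbalanced.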

C. FINE-TUNING FOR CONCEPT RECOGNITION
The Python Transformers library from HuggingFace [30] was used to load the pre-trained and further pre-trained models. For CR, a linear layer is added as output layer with a softmax activation function. The models were trained three times with a 10-fold, 80% to 20% split, cross-validation. The split size was established from the mean ratio of non-overlapping samples, which, at 78%, is slightly below the 80% training share, as shown in Table 6. Another choice made for the training was to reinitialise the weights of the final layer if the fine-tuning resulted in a failed run for the fold. This is in accordance with previous studies, which stated that the random initialisation of the fine-tuning layers can have a significant influence on the fine-tuning results in computer vision [33] as well as NLP [34]. A run was considered failed when its validation accuracy stayed below the accuracy obtained by assigning all examples to the majority class, that is, by classifying every word as a non-concept [35].
Further hyperparameters for the fine-tuning were a linearly decreasing learning rate and a batch size of 16. The models were trained for up to 10 epochs. To compare the models' predictions, the results of the epoch with the lowest validation loss for each respective fold were taken. One benefit of the further pre-training was already observed during fine-tuning. In comparison to RoBERTa with three failed runs overall, SpaceRoBERTa did not fail any.
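The failed-run criterion and the per-fold epoch selection can be sketched as below. This is an illustration under assumptions: the helper names and the toy numbers are hypothetical, and a run that only equals the majority-class baseline is also counted as failed here.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by predicting the most frequent label for every token."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def is_failed_run(val_accuracies, val_labels):
    """A run is treated as failed if no epoch beats the majority-class baseline.

    Assumption: matching the baseline exactly is also counted as a failure.
    """
    return max(val_accuracies) <= majority_baseline_accuracy(val_labels)


def best_epoch(val_losses):
    """Return the 1-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__) + 1


# Toy fold (hypothetical values): 8 of 10 validation tokens are non-concepts ("O").
val_labels = ["O"] * 8 + ["Propulsion", "Thermal"]

failed = is_failed_run([0.80, 0.80, 0.80], val_labels)
healthy = is_failed_run([0.80, 0.86, 0.91], val_labels)
selected = best_epoch([0.62, 0.41, 0.35, 0.38, 0.44])
print(failed, healthy, selected)
```

In a full pipeline, a fold flagged as failed would be rerun after reinitialising the weights of the final layer, as described in the text.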

V. RESULTS
A. MODELS SELECTION
During the trial and error phase, we experimented with uncased and cased vocabularies and various batch sizes. Further pre-training on uncased vocabulary yielded better results than cased vocabulary. This was to be expected as our labelled concepts are not named entities and thus casing is not relevant to our application. We also found that a higher batch size of 256 yielded better results than lower batch sizes of 16 or 32.
The models are further pre-trained for 70 epochs, which is enough to achieve the convergence of the evaluation perplexity, as shown in Figure 2. Perplexity is a common metric for evaluating language models. It quantifies how well a model reduces the uncertainty in the prediction of the language in a tokenized sequence of text $s$. Perplexity $PPL$ is derived from the cross-entropy $H$ and is defined in [36] as:
$$PPL = 2^{H_p(s)} \qquad (10)$$
with
$$H_p(s) = \frac{1}{m} \log_2 \frac{1}{P(s)} \qquad (11)$$
where $m$ is the number of words in the sequence $s$, $P(s)$ is the probability of the sequence of words provided by the model, $H_p(s)$ the cross-entropy of the text in relation to the model, and finally $PPL$ the perplexity of the model. We chose to retain the SpaceBERT model trained for 60 epochs, the SpaceRoBERTa model trained for 57 epochs, and the SpaceSciBERT model trained for 54 epochs. These models either correspond to the start of the perplexity convergence or to a local minimum close to convergence. Although of disparate initial configuration and pre-training corpus, these models interestingly take a similar number of further pre-training epochs to converge.
Figure 3 displays the evolution of the validation loss for all models with respect to the number of fine-tuning epochs. The validation loss curves have a parabola-like shape, reaching a minimum after a certain number of epochs. When comparing the minima of each model, the validation loss appears to be the lowest for SpaceRoBERTa and the highest for BERT. While SpaceSciBERT and SciBERT have similar validation losses, SpaceRoBERTa and SpaceBERT demonstrate significant improvements with respect to their respective baseline models. Although the results were averaged over 30 folds, the standard deviation for the validation loss is still high. Former studies [34], [35] reported similar issues for comparable dataset sizes.
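As a small worked example of the perplexity definition, the sketch below computes the cross-entropy $H_p(s) = \frac{1}{m}\log_2\frac{1}{P(s)}$ and the perplexity $2^{H_p(s)}$ from per-token probabilities. Treating $P(s)$ as the product of per-token probabilities, as well as the function names, are assumptions made for the illustration.

```python
import math


def cross_entropy_bits(token_probs):
    """H_p(s) = (1/m) * log2(1 / P(s)), with P(s) taken as the product of the
    per-token probabilities (an assumption of this sketch)."""
    m = len(token_probs)
    log2_p_s = sum(math.log2(p) for p in token_probs)
    return -log2_p_s / m


def perplexity(token_probs):
    """PPL = 2 ** H_p(s)."""
    return 2.0 ** cross_entropy_bits(token_probs)


# A model that hesitates uniformly between 4 options at every position
# has a perplexity of 4.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
mixed_ppl = perplexity([0.5, 0.125])
print(uniform_ppl, mixed_ppl)
```

Working with summed log-probabilities rather than the raw product avoids numerical underflow on long sequences.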

B. CONCEPT RECOGNITION RESULTS
The CR F1 scores for all 6 models and 18 labels are reported in Table 7. The results were computed from the epochs with the lowest average validation loss, averaged over all 30 folds. The standard deviation is provided along with the F1 score. The weighted label represents the averaged F1 score over all the labels weighted by the number of examples in the validation set, and is defined as:
$$\mathrm{weighted} = \frac{1}{\sum_{l \in L} n_{\hat{y}_l}} \sum_{l \in L} n_{\hat{y}_l} \, F_1(y_l, \hat{y}_l) \qquad (12)$$
where $l$ is one label from the set $L$ of all labels, $\hat{y}_l$ is the set of true samples for label $l$, $y_l$ is the set of predicted samples for label $l$, $F_1(y_l, \hat{y}_l)$ is the F1 score calculated for label $l$, and $n_{\hat{y}_l}$ is the number of true samples for label $l$.
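The weighted score defined above is a support-weighted mean of the per-label F1 scores. A minimal sketch, with hypothetical function names and toy values, could look like this:

```python
def weighted_f1(f1_by_label, support_by_label):
    """Average the per-label F1 scores, weighting each label by its number of
    true samples (its support) in the validation set."""
    total_support = sum(support_by_label[label] for label in f1_by_label)
    weighted_sum = sum(
        support_by_label[label] * f1_by_label[label] for label in f1_by_label
    )
    return weighted_sum / total_support


# Toy values (hypothetical, not taken from Table 7).
f1_scores = {"Propulsion": 0.8, "Thermal": 0.6, "GN&C": 0.5}
supports = {"Propulsion": 30, "Thermal": 10, "GN&C": 10}

score = weighted_f1(f1_scores, supports)
print(f"weighted F1 = {score:.2f}")
```

Because labels with many true samples dominate the average, a model can rank differently on this score than on an unweighted mean over labels.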
Considering only this weighted F1 score, SpaceRoBERTa clearly outperforms the other models, followed by SpaceSciBERT. BERT and RoBERTa obtain the lowest scores. SpaceRoBERTa ranks the highest on several labels. As shown in Table 7, the labels displaying the most significant improvements compared to the BERT baseline are GN&C with a 7.8% improvement, then Space environment with 4.5%, followed by Thermal with around 4%, and Structure & mechanisms with 3.8%. SpaceSciBERT also substantially improves the score of the Communication and OBDH labels, respectively by 12% and 4%, compared to BERT.
Altogether, the reported F1 scores are consistent with the observed validation loss trends, with SpaceRoBERTa leading the F1 score table and the further pre-trained models outperforming their baselines. The standard deviations of the single scores are still generally high, usually exceeding the achieved improvement between the baseline and the further pre-trained models. Therefore, statistical tests are conducted and summarised in section V-C to evaluate the statistical significance of the results.
To fully assess the impact of the further pre-training with a domain-specific corpus, the scores of the baseline models are compared to their respective space variants in Figure 4. SpaceRoBERTa again displays the most significant improvements compared to its baseline model RoBERTa. All three domain-specific models show substantial improvements for the Propulsion, Space environment, Structure & mechanisms, Communication, GN&C, and OBDH labels. These labels correspond to the main engineering disciplines of a spacecraft's subsystems. However, the scores of more general labels such as Safety & risk control, Nonconformance, and Quality control were either unaffected or slightly deteriorated by the further pre-training. These labels all belong to the ECSS branch of Product assurance. For the remaining labels, no clear trend can be inferred, as the further pre-training resulted either in an improvement or a deterioration of performance depending on the model used.
A more thorough investigation is conducted for the SpaceRoBERTa model as it achieved the highest performance. Figure 5 displays the confusion matrix for the SpaceRoBERTa model. The majority of samples are concentrated on the diagonal, thus predictions are predominantly accurate. A few incorrect classifications occur between the OBDH and Communication labels, indicating a lack of clear boundaries between the two topics. The annotated requirement shown in Figure 6 illustrates this overlap. SpaceRoBERTa wrongly associates the concepts found in this requirement with the Communication label instead of the OBDH label to which they were manually assigned. These concepts, including communication frame and command word, actually fall under the domain of signal processing and can be used in either a communication or a data handling context. The requirement was here extracted from a standard related to data handling. This information is however hidden from the model and therefore cannot be used to guide it. The ambiguity of these terms was already highlighted by the human annotators.
Figure 7 quantifies the number of new concepts not seen by the model during training but found in the validation set, demonstrating the ability of the model to generalise over the training set and discover new concepts in unevaluated samples. The prediction of the model was compared for one fold to a simple look-up approach. The latter method identifies concepts present in both training and validation sets. As seen in Figure 7, the prediction with the fine-tuned model achieves substantially better results than the look-up approach. Out of 844 unique concepts, 690 were recognised exactly by the model and 78 concepts were partly recognised. For partial recognition, the span was either too long or too short. For instance, the concept 50W resistors, corresponding to the two labelled concepts 50W and resistor, was merged by the model. The concept flight production was extracted by the model while the full labelled concept was proto-flight production. In contrast, the look-up approach resulted in only 170 complete and 187 partial matches.
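A look-up baseline of the kind described above can be sketched as follows. The text does not specify how partial matches were counted, so the rule used here (a validation concept shares at least one whole word with a training concept) is an assumption, as are the function name and the toy concept lists.

```python
def lookup_matches(train_concepts, validation_concepts):
    """Split the unique validation concepts into exact, partial and missed
    matches against the concepts seen in the training set.

    Assumption: a partial match shares at least one whole word with a
    training concept without being identical to any of them.
    """
    known = {concept.lower() for concept in train_concepts}
    exact, partial, missed = [], [], []
    for concept in sorted({c.lower() for c in validation_concepts}):
        words = set(concept.split())
        if concept in known:
            exact.append(concept)
        elif any(words & set(k.split()) for k in known):
            partial.append(concept)
        else:
            missed.append(concept)
    return exact, partial, missed


# Toy concept lists (hypothetical), reusing examples mentioned in the text.
train = ["flight production", "resistor", "communication frame"]
validation = ["proto-flight production", "resistor", "command word", "50W resistors"]

exact, partial, missed = lookup_matches(train, validation)
print(exact, partial, missed)
```

Such a baseline can only retrieve what it has already seen, which is why it is a useful lower bound for judging how well a fine-tuned model generalises to unseen concepts.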

C. STATISTICAL TESTS
The results obtained have been statistically analysed with the Friedman test followed by the Bonferroni-Dunn and Nemenyi post-hoc tests. To determine the statistical significance of the F1 score of each method with respect to the label set, a non-parametric Friedman test was completed with the ranking of the F1 score of the best model as the test variable. The Friedman test shows that the differences between the models are statistically significant at a level of 5%, as the confidence interval is $C_0 = (0, F_5 = 2.322)$ and the F-distribution statistical value is $F^* = 6.330 \notin C_0$. Consequently, the Friedman test rejects the null hypothesis that all models perform equally well in mean ranking. Based on this rejection, the Nemenyi post-hoc test is completed to compare the performance of the different models. The difference in ranking resulting from the Nemenyi test can be observed in Figure 8, for $\alpha = 0.05$. The results of the Bonferroni-Dunn test for $\alpha = 0.05$ are reported in Table 7. From the results of both tests it can be concluded that SpaceRoBERTa has a significantly higher ranking than all the other methods and that RoBERTa, its baseline, has the lowest one. The remaining methods, BERT, SciBERT and their space counterparts, do not differ significantly in mean ranking.
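The rank-based test can be illustrated with a short sketch. It assumes that the F-distributed statistic mentioned above is the Iman-Davenport correction of the Friedman chi-square, which the text does not state explicitly; the function names and the toy score matrix are hypothetical.

```python
def average_ranks(scores):
    """Mean rank of each model over all rows (one row per label).

    Rank 1 is the best (highest) score; tied scores share the average rank.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared_rank = (i + j) / 2 + 1  # average of positions i+1 .. j+1
            for idx in order[i:j + 1]:
                totals[idx] += shared_rank
            i = j + 1
    return [total / len(scores) for total in totals]


def friedman_statistics(scores):
    """Return mean ranks, the Friedman chi-square and the Iman-Davenport F value.

    The F value is undefined (division by zero) if every row ranks the
    models in exactly the same order.
    """
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
    f_value = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return ranks, chi2, f_value


# Toy F1 matrix (hypothetical): 4 labels (rows) by 3 models (columns).
toy_scores = [
    [0.80, 0.70, 0.60],
    [0.75, 0.72, 0.60],
    [0.90, 0.85, 0.70],
    [0.65, 0.70, 0.60],
]
ranks, chi2, f_value = friedman_statistics(toy_scores)
print(ranks, chi2, f_value)
```

Under this assumption, the F value is compared with the critical value of an F-distribution with $(k-1)$ and $(k-1)(N-1)$ degrees of freedom, which for 6 models and 18 labels gives 5 and 85.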

VI. DISCUSSION AND FUTURE WORK
The weighted F1 scores demonstrate that the domain-specific models outperformed their respective baseline models. SpaceRoBERTa benefited the most from the further pre-training, with an increase of 8% in F1 score with respect to RoBERTa. SpaceBERT and SpaceSciBERT show smaller improvements, with increases of 0.3% and 0.85% respectively. Both SpaceSciBERT and SciBERT outperformed SpaceBERT and BERT, indicating that the scientific pre-training gave an additional advantage over training from a general model. The decisive advantage came from combining our domain-specific training corpus with the alternative pre-training architecture and tokenizer of RoBERTa. Indeed, the latter model is pre-trained on a single Masked Language Model (MLM) task [8], where the model must predict randomly hidden tokens, whereas the BERT-based models are also trained on a Next Sentence Prediction (NSP) task [2], [9]. The statistical analysis and Bonferroni-Dunn test, which unlike the weighted F1 score ignore the number of labels in the evaluation set, demonstrated that there is no significant difference between SpaceBERT, SpaceSciBERT and their baseline counterparts. The Bonferroni-Dunn test however confirmed the significantly higher ranking of SpaceRoBERTa.
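To make the MLM objective mentioned above concrete, the sketch below hides tokens at random and records the originals as prediction targets. It is a simplified illustration, not the pre-training code of any of the models: the function name, the 15% default masking rate, and the mask token string are assumptions, and practical implementations usually add refinements (such as occasionally keeping or randomising the selected token) that are omitted here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens for a masked-language-model objective.

    Returns the masked sequence and a parallel list of targets:
    the original token where it was hidden, None elsewhere.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            targets.append(token)   # the model must recover this token
        else:
            masked.append(token)
            targets.append(None)    # position not scored by the loss
    return masked, targets


# Hypothetical requirement text, split on whitespace for simplicity.
sentence = "the on-board computer shall decode each command word".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)
print(targets)
```

The training loss is then computed only at the positions where a target is present, so the model learns to reconstruct hidden tokens from their context on both sides.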
Labels covering more common concepts such as Nonconformance, Project Scope, and Quality Control benefited less from the domain-specific training. Domain-specific labels such as Propulsion, Structure & Mechanisms, and Communication however saw their F1 scores increase significantly for all space models. These results were obtained for one fine-tuning task. When fine-tuning for another task, it is recommended not to discard either SpaceSciBERT or SpaceBERT, as different models might be better adapted to different applications.
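The difference between a support-weighted F1 score and a per-label view that treats every label equally, as the rank-based tests do, can be shown with a short sketch. The per-label scores and sample counts below are hypothetical and chosen only to make the arithmetic visible; the function names are likewise assumptions.

```python
def weighted_f1(f1_by_label, support_by_label):
    """Average of per-label F1 scores, weighted by the number of
    evaluation samples (support) of each label."""
    total = sum(support_by_label.values())
    return sum(
        f1_by_label[label] * support_by_label[label] for label in f1_by_label
    ) / total


def macro_f1(f1_by_label):
    """Plain average of per-label F1 scores; every label counts equally."""
    return sum(f1_by_label.values()) / len(f1_by_label)


# Hypothetical values, not results from the study.
f1_by_label = {"Propulsion": 0.9, "Quality Control": 0.6}
support_by_label = {"Propulsion": 300, "Quality Control": 100}

print(weighted_f1(f1_by_label, support_by_label))
print(macro_f1(f1_by_label))
```

With these toy numbers the weighted score is pulled towards the label with more samples, which is one reason why a weighted average and a per-label ranking can lead to different conclusions about the same models.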
In future work, other pre-training tasks, beyond MLM and NSP, could be explored as in [37], where a domain-specific model was trained on four different tasks. This is a resource-intensive approach requiring additional computational power and a larger training set. To improve the performance on ambiguous concepts that could belong to several engineering disciplines, information about the original document from which the requirements were extracted should be integrated. Regarding fine-tuning, the comparison could be extended to additional downstream tasks to further compare the performances of SpaceRoBERTa, SpaceSciBERT and SpaceBERT. CR can also support additional text mining operations on the ECSS standards. Standards contain key information on space systems, and they are highly correlated. Thus, a follow-up task could be to associate similar requirements based on common concepts. This application could facilitate the identification of requirements relevant to a new project. Finally, we recommend the development of a standard taxonomy for transformers, as in the literature the concepts of pre-training and further pre-training often overlap or are misused.
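One simple way to picture the follow-up task of associating requirements through common concepts is a set-overlap measure. The sketch below uses Jaccard similarity, which is an assumption chosen for illustration rather than a method proposed in this work; the requirement identifiers and concept sets are hypothetical.

```python
def jaccard(concepts_a, concepts_b):
    """Share of concepts common to both sets, relative to their union."""
    a, b = set(concepts_a), set(concepts_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query_concepts, requirements, top_n=3):
    """Rank stored requirements by concept overlap with the query.

    requirements maps a requirement identifier to its set of
    recognised concepts. Requirements with no overlap are dropped.
    """
    scored = [
        (jaccard(query_concepts, concepts), req_id)
        for req_id, concepts in requirements.items()
    ]
    # Highest similarity first; identifiers break ties for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(req_id, sim) for sim, req_id in scored[:top_n] if sim > 0]


# Hypothetical requirements with concepts as a CR model might extract them.
requirements = {
    "REQ-A": {"communication frame", "command word"},
    "REQ-B": {"command word", "telemetry"},
    "REQ-C": {"thruster"},
}
query = {"command word", "communication frame", "telemetry"}
matches = most_similar(query, requirements)
print(matches)
```

A concept-based index of this kind would depend directly on the quality of the recognised spans, so partial matches such as those discussed for Figure 7 would need to be normalised before comparison.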

VII. CONCLUSION
In this paper, we proposed SpaceTransformers, a new family of three models, SpaceBERT, SpaceRoBERTa and SpaceSciBERT, providing contextualised word embeddings for space systems. Our domain-specific models were further pre-trained from BERT-Base, RoBERTa-Base and SciBERT-SciVocab on our domain-specific corpus. The pre-trained and further pre-trained models were evaluated on a CR task with our new labelled dataset of space systems concepts. All further pre-trained models outperformed their respective baseline models. The model further pre-trained from RoBERTa-Base, SpaceRoBERTa, considerably improved the F1 score of several labels, with a weighted average improvement of 8% with respect to its baseline. The SpaceSciBERT model, further pre-trained from SciBERT-SciVocab, achieved the highest improvement on a single label with respect to BERT-Base, with an F1 score increase of 12% for the Communication label. Finally, SpaceRoBERTa achieved the highest ranking in the Nemenyi CD diagram. The statistical analysis however showed a lack of significant difference in mean ranking for the remaining models.
AUDREY BERQUAND received the Dipl.-Ing. degree from EPF, France, and the M.Sc. degree in aerospace engineering from KTH, Sweden. She is currently pursuing the Ph.D. degree with the Intelligent Computational Engineering (ICE) Laboratory, University of Strathclyde. Her Ph.D. is half-funded by the European Space Agency (ESA) and in cooperation with AIRBUS, RHEA, and satsearch in the frame of a Networking Partnering Initiative (NPI). She is an Alumnus of the International Space University Space Studies Program. Her research interests include knowledge management and reuse, text mining, natural language processing, and autonomous reasoning for space systems.
PAUL DARM received the Dipl.-Ing. degree in aerospace engineering from Dresden University of Technology, Germany. He recently completed his master's thesis on a Knowledge Graph (KG) for space system requirements. He is currently a Research Assistant with the Intelligent Computational Engineering (ICE) Laboratory, University of Strathclyde, working on various applications of natural language processing for space systems.
ANNALISA RICCARDI is currently a Lecturer in computational intelligence with the Department of Mechanical and Aerospace, University of Strathclyde. She has more than ten years of experience in optimization techniques, machine learning, and their applications. She is involved in projects on text mining and data-driven decision making for engineering design.