Aligning Semantic in Brain and Language: A Curriculum Contrastive Method for Electroencephalography-to-Text Generation

Electroencephalography-to-Text generation (EEG-to-Text), which aims to directly generate natural text from EEG signals, has drawn increasing attention in recent years due to its enormous potential for brain-computer interfaces. However, the remarkable discrepancy between the subject-dependent EEG representation and the semantic-dependent text representation poses a great challenge to this task. To mitigate this, we devise a Curriculum Semantic-aware Contrastive Learning strategy (C-SCL), which effectively recalibrates the subject-dependent EEG representation to the semantic-dependent EEG representation, thereby reducing the discrepancy. Specifically, our C-SCL pulls semantically similar EEG representations together while pushing apart dissimilar ones. Besides, we employ curriculum learning to not only craft meaningful contrastive pairs but also make the learning progress from easy to hard. We conduct extensive experiments on the ZuCo benchmark, and our method, combined with diverse models and architectures, shows stable improvements across three types of metrics while achieving a new state of the art. Further investigation shows not only its superiority in both the single-subject and low-resource settings but also its robust generalizability in the zero-shot setting. Our code is available at: https://github.com/xcfcode/contrastive_eeg2text.


Xiachong Feng, Xiaocheng Feng, Bing Qin, and Ting Liu

Index Terms: Brain-computer interface, computational neurolinguistics, contrastive learning, curriculum learning.

I. INTRODUCTION
DEVASTATING neurological conditions such as spinal cord injuries or neuromuscular disorders can suddenly lead to people losing their ability to communicate [9], [33]. Such patients may still have intact language and cognitive skills, but injuries might hinder them from expressing themselves [11]. Fortunately, brain-computer interfaces (BCIs) can restore language abilities to such patients by decoding neural activities into natural language (Brain-to-Text), which can drastically improve their quality of life [4]. To pursue this goal, various Brain-to-Text works have been proposed, building upon either invasive brain recordings, such as electrocorticography (ECoG) [2], [24], [25], or non-invasive brain recordings, such as functional magnetic resonance imaging (fMRI) [37] and electroencephalography (EEG) [34]. Among these, EEG offers superior portability and cost-effectiveness in real-world applications, so EEG-to-Text generation has recently attracted considerable research interest [10], [34]. Fig. 1 depicts the EEG-to-Text generation task flow.

The authors are with the Research Center for Social Computing and Information Retrieval, Faculty of Computing, Harbin Institute of Technology, Harbin 150001, China (e-mail: xiachongfeng@ir.hit.edu.cn; xcfeng@ir.hit.edu.cn; qinb@ir.hit.edu.cn; tliu@ir.hit.edu.cn).
Digital Object Identifier 10.1109/TNSRE.2023.3314642

However, we claim that existing studies neglect the discrepancy between the subject-dependent EEG representation and the semantic-dependent text representation, which inevitably degrades EEG-to-Text model performance. To explain why this is a crucial challenge for the task, we present brain topological graphs that visualize the discrepancy in two situations. Firstly, as shown in Fig. 2(a), EEG representations elicited by the same subject tend to be similar no matter what the sentence stimulus is, demonstrating that the same subject is prone to favour similar cognitive patterns in the face of different sentence stimuli. Secondly, by contrast, Fig. 2(b) reveals that different subjects respond variably, even disparately, to the same sentence stimulus. These observations are in line with findings in previous studies, including neuroscience [1] as well as some machine learning research areas, such as emotion classification [32] and visual recognition [21].

On this account, such subject-dependent EEG representation negatively impacts the performance of the EEG-to-Text model from two perspectives. On the one hand, it introduces a "many-to-one" generation problem (multiple EEG signals correspond to the same sentence), which is challenging for training current sequence-to-sequence generation models. On the other hand, it largely hinders good cross-subject generalizability, since transferring the original subject-dependent EEG representation to unseen subjects is intractable.
To address this issue, we propose a novel Curriculum Semantic-aware Contrastive Learning strategy (C-SCL), which can effectively recalibrate the original subject-dependent EEG representation into our desired semantic-dependent EEG representation so that it can be better adapted to the EEG-to-Text generation task. In detail, the core part of our C-SCL is the Semantic-aware Contrastive Learning strategy (SCL), which aims to maximize the similarities of EEG representations across subjects w.r.t. the identical sentence stimulus (positive pairs) while minimizing the similarities of EEG representations w.r.t. different sentence stimuli (negative pairs). Note that the critical ingredient for successful contrastive learning is to construct hard positive and negative pairs. However, with random selection, we observe that nearly 45.93% of the constructed contrastive pairs already satisfy the final objective, in which positive pairs are similar and negative pairs are dissimilar. Therefore, we construct contrastive pairs of different difficulties by pre-computing similarities between numerical EEG signals (e.g., hard positive pairs initially have low similarity while hard negative pairs have high similarity), and we draw on curriculum learning to not only introduce hard contrastive pairs but also enable a progressive learning process that moves from easy pairs to hard pairs. With the integration of curriculum learning, we finalize our Curriculum Semantic-aware Contrastive Learning strategy (C-SCL).
We conduct experiments on the ZuCo benchmark [18], [19] and assess the generation performance via three types of metrics. The experimental results, which achieve state-of-the-art performance, demonstrate the efficacy of our proposed method across various models and architectures and indicate the necessity of curriculum learning. Further investigation empirically shows its benefits in both the single-subject and low-resource settings as well as its robust generalizability in the zero-shot setting. In summary: (a) We take the first step to mitigate the challenge of the discrepancy between the subject-dependent EEG representation and the semantic-dependent text representation for the EEG-to-Text generation task; (b) We devise a curriculum semantic-aware contrastive learning strategy that succeeds in yielding the semantic-dependent EEG representation; (c) We conduct extensive experiments on the ZuCo benchmark that demonstrate the effectiveness of our method as well as its robustness and superior generalizability.

II. PRELIMINARIES
In this section, we first describe the task formulation and then introduce the ZuCo benchmark.

B. ZuCo Benchmark
We use the ZuCo dataset, which is a corpus of EEG signals and eye-tracking data recorded during natural reading. The reading materials are collected from movie reviews and Wikipedia articles. Specifically, following Wang and Ji [34], we combine ZuCo [18] and ZuCo 2.0 [19] to form our final ZuCo benchmark. For each EEG-text pair in the dataset, the EEG signals are composed of a sequence of word-level EEG features E. For each word-level feature e, 8 frequency bands are recorded and denoted as follows: theta1 (4-6Hz), theta2 (6.5-8Hz), alpha1 (8.5-10Hz), alpha2 (10.5-13Hz), beta1 (13.5-18Hz), beta2 (18.5-30Hz), gamma1 (30.5-40Hz) and gamma2 (40-49.5Hz). Each band of the feature has a fixed dimension of 105. We concatenate all 8 bands of features to construct the final word-level feature vector with a dimension of 840 ($e \in \mathbb{R}^{840}$). Additionally, all features are Z-scored as done by Willett et al. [35]. We further split the dataset into train, valid and test parts (80%, 10%, 10%) following Wang and Ji [34]. Note that each part of the dataset maintains the same subject set with no overlapping sentences. Table I shows the statistics of the ZuCo benchmark.
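The word-level feature construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the helper names `word_feature` and `z_score` and the toy data are hypothetical, and standardizing each 840-dimensional vector on its own is an assumption, since the text does not state over which axis the Z-scoring is computed.

```python
import math

# Band names and the per-band dimension are taken from the text.
BANDS = ["theta1", "theta2", "alpha1", "alpha2", "beta1", "beta2", "gamma1", "gamma2"]
BAND_DIM = 105


def word_feature(band_values):
    """Concatenate the 8 per-band vectors of one word into one 840-d vector."""
    feature = []
    for band in BANDS:
        vec = band_values[band]
        if len(vec) != BAND_DIM:
            raise ValueError(f"{band}: expected {BAND_DIM} values, got {len(vec)}")
        feature.extend(vec)
    return feature


def z_score(values):
    """Standardize a list of numbers to zero mean and unit variance.

    Assumption: normalization is applied per feature vector.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for constant input
    return [(v - mean) / std for v in values]


# Toy word (hypothetical data): band k is filled with the constant value k.
toy_word = {band: [float(k)] * BAND_DIM for k, band in enumerate(BANDS)}
e = z_score(word_feature(toy_word))
print(len(e))
```

A sentence-level input E would then be a list of such vectors, one per word.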

III. METHOD
In this section, we introduce our curriculum semantic-aware contrastive learning strategy (C-SCL) step by step, including (1) semantic-aware contrastive learning, (2) curriculum learning, (3) the backbone model BRAINTRANSLATOR and (4) the overall learning procedure.

TABLE I. STATISTICS FOR THE ZUCO BENCHMARK. "# PAIRS" MEANS THE NUMBER OF EEG-TEXT PAIRS, "# UNIQUE_SENT" REPRESENTS THE NUMBER OF UNIQUE SENTENCES, "# SUBJECT" DENOTES THE NUMBER OF SUBJECTS AND "AVG.WORDS" MEANS THE AVERAGE NUMBER OF WORDS OF SENTENCES

A. Semantic-Aware Contrastive Learning
1) Motivation: The critical ingredient of training a superior model for EEG-to-Text generation is reducing the discrepancy between the subject-dependent EEG representation and the semantic-dependent text representation. To this end, we draw support from contrastive learning [16], which is skilled at recalibrating the representation space, and propose our semantic-aware contrastive learning strategy (SCL), which pulls semantically similar EEG representations together (positive pairs) and pushes apart dissimilar ones (negative pairs). Note that by employing the semantics embedded within EEG signals as a supervisory signal to direct the optimization of EEG representations, we implicitly achieve a joint model of EEG signals and textual semantics, thereby deriving a semantic-dependent EEG representation.
2) Positive Pairs: One important question in contrastive learning is how to construct positive pairs $(E_i, E_i^{+})$. To learn semantic-dependent EEG representations, given an anchor EEG representation $E_i$ with its corresponding sentence $S_i$, we randomly choose one EEG $E_i^{+}$ from the positive set $\mathcal{E}_i^{+}$, in which all EEG signals correspond to the same sentence stimulus $S_i$ across different subjects, as shown in Fig. 3(a). Such positive pairs promote the clustering of semantically similar EEG signals.
3) Negative Pairs: In practice, the original in-batch negative samples provide only weak supervision for contrastive learning. To alleviate this problem, Gao et al. [13] verify that introducing specially designed negative pairs can further promote the learning process. Inspired by this conclusion, given the anchor EEG representation $E_i$ elicited by subject $p_i$ with its corresponding sentence $S_i$, we construct the negative pair $(E_i, E_i^{-})$, where $E_i^{-}$ satisfies two conditions: (1) $E_i^{-}$ corresponds to a sentence other than $S_i$ and (2) $E_i^{-}$ is elicited by a subject other than $p_i$. All $E_i^{-}$ that satisfy both conditions form the negative set $\mathcal{E}_i^{-}$, as shown in Fig. 3(b).
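The construction of the positive and negative sets can be sketched as follows. This is a minimal illustration under stated assumptions: the `Sample` record, the subject labels `p1` to `p3` and the toy dataset are hypothetical, and the anchor's own recording is assumed to be excluded from its positive set.

```python
import random
from collections import namedtuple

# Hypothetical record type: one EEG recording of one sentence by one subject.
Sample = namedtuple("Sample", ["subject", "sentence_id", "eeg"])


def positive_set(anchor, dataset):
    """Same sentence as the anchor, elicited by a different subject."""
    return [s for s in dataset
            if s.sentence_id == anchor.sentence_id and s.subject != anchor.subject]


def negative_set(anchor, dataset):
    """Different sentence AND different subject (both conditions must hold)."""
    return [s for s in dataset
            if s.sentence_id != anchor.sentence_id and s.subject != anchor.subject]


def sample_triple(anchor, dataset, rng):
    """SCL-style random selection of one positive and one negative sample."""
    return (anchor,
            rng.choice(positive_set(anchor, dataset)),
            rng.choice(negative_set(anchor, dataset)))


# Toy dataset: 3 subjects, each reading the same 2 sentences.
dataset = [Sample(subj, sid, [float(sid), 1.0])
           for sid in (0, 1) for subj in ("p1", "p2", "p3")]
anchor, pos, neg = sample_triple(dataset[0], dataset, random.Random(0))
print(anchor.subject, pos.subject, neg.subject, neg.sentence_id)
```

Excluding the anchor's subject from the negative set means a negative can never be separated from the anchor by subject identity alone when that identity is shared, which is consistent with the stated goal of learning semantic rather than subject-specific structure.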

B. Curriculum Learning
1) Motivation: Recall that our final learning objective is to make the EEG representations corresponding to the same sentence similar while making the EEG representations corresponding to semantically different sentences dissimilar. To examine the learning efficiency, we conduct one preliminary experiment by running SCL for 10 epochs on the ZuCo train set, resulting in 145670 (14567 × 10) contrastive triples. However, we find that 66906 triples already satisfy the final objective. In other words, 45.93% ($66906 / 145670 = 45.93\%$) of the positive and negative pairs satisfy the condition that EEG representations with respect to the same sentence are already similar and semantically different EEG representations are already dissimilar without needing contrastive learning, which severely reduces the effectiveness of the learning process. To overcome this problem, we employ curriculum learning to not only introduce hard contrastive pairs but also ensure the model's learning efficiency, thus finalizing our Curriculum Semantic-aware Contrastive Learning strategy (C-SCL). Compared with SCL, which randomly selects a positive sample and a negative sample from $\mathcal{E}_i^{+}$ and $\mathcal{E}_i^{-}$ respectively, C-SCL selects samples in an easy-to-hard order.
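The arithmetic behind the 45.93% figure, together with one possible way to test whether a triple is already satisfied, can be sketched as below. The counts come from the text; the criterion itself is an assumption (the text does not define it precisely), namely that a triple counts as satisfied when the anchor is more similar to its positive than to its negative. The function name `already_satisfied` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def already_satisfied(anchor, positive, negative):
    """Assumed criterion: the positive is closer to the anchor than the negative."""
    return cosine(anchor, positive) > cosine(anchor, negative)


# Counts reported in the text: 14567 triples per epoch over 10 epochs.
total_triples = 14567 * 10
satisfied_triples = 66906
satisfied_ratio = satisfied_triples / total_triples
print(f"{total_triples} triples, {satisfied_ratio:.2%} already satisfied")
```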
2) Curriculum Criterion: How to determine the ordering? Recall that our goal is to introduce hard contrastive pairs, where positive pairs are initially far away from each other while negative pairs are, conversely, similar. Therefore, we pre-calculate the cosine similarity between two EEG representations and craft contrastive pairs of varying difficulties by taking the similarity into consideration. Specifically, given an anchor EEG representation $E_i$, for positive pair construction, we calculate the similarities between $E_i$ and all $E_i^{+} \in \mathcal{E}_i^{+}$ and then sort the $E_i^{+}$ in descending order, resulting in $\grave{\mathcal{E}}_i^{+}$. Conversely, we sort the negative set $\mathcal{E}_i^{-}$ in ascending order and obtain $\acute{\mathcal{E}}_i^{-}$. Both the hard positive and the hard negative samples w.r.t. the anchor $E_i$ are located at the end of $\grave{\mathcal{E}}_i^{+}$ and $\acute{\mathcal{E}}_i^{-}$, respectively. In other words, the samples in $\grave{\mathcal{E}}_i^{+}$ and $\acute{\mathcal{E}}_i^{-}$ are now in an easy-to-hard order.
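The ordering criterion can be sketched in Python as follows. This is a minimal illustration, assuming each EEG representation has already been reduced to a single flat vector (the text does not specify how a sequence of word-level features is compared); `sort_easy_to_hard` is a hypothetical helper name and the vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sort_easy_to_hard(anchor, positives, negatives):
    """Return both sets ordered from easy to hard with respect to the anchor.

    Positives: descending similarity (dissimilar, i.e. hard, positives last).
    Negatives: ascending similarity (similar, i.e. hard, negatives last).
    """
    pos_sorted = sorted(positives, key=lambda p: cosine(anchor, p), reverse=True)
    neg_sorted = sorted(negatives, key=lambda n: cosine(anchor, n))
    return pos_sorted, neg_sorted


anchor = [1.0, 0.0]
positives = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
negatives = [[1.0, 0.2], [-1.0, 0.0], [0.0, 1.0]]
pos_sorted, neg_sorted = sort_easy_to_hard(anchor, positives, negatives)
print(pos_sorted[-1], neg_sorted[-1])  # hardest positive, hardest negative
```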
3) Curriculum Level: What are the curriculum levels? We conduct preliminary experiments by varying the number of curriculum levels from 2 to 5 and finally decide to split $\grave{\mathcal{E}}_i^{+}$ and $\acute{\mathcal{E}}_i^{-}$ into 3 levels due to their better performance. In detail, we split the sorted $\grave{\mathcal{E}}_i^{+}$ into three equal-length parts, namely $\mathcal{E}_i^{easy+}$, $\mathcal{E}_i^{medium+}$ and $\mathcal{E}_i^{hard+}$, and split the sorted $\acute{\mathcal{E}}_i^{-}$ likewise into $\mathcal{E}_i^{easy-}$, $\mathcal{E}_i^{medium-}$ and $\mathcal{E}_i^{hard-}$. In other words, we obtain curriculums of different difficulty according to the length of the sorted $\grave{\mathcal{E}}_i^{+}$ and $\acute{\mathcal{E}}_i^{-}$. Fig. 4 shows two examples of contrastive pairs of different difficulties. The easy pair already satisfies the condition that positive pairs are similar while negative pairs are dissimilar. In contrast, the hard contrastive pair follows the opposite condition: positive pairs are dissimilar while negative pairs are similar.
4) Curriculum Scheduler: When to update the curriculum? We adopt a One-Pass scheduler with a linear pace [3] to progressively train the model in an easy-to-hard order. The One-Pass scheduler trains the model only once per curriculum, while the linear pace ensures that each curriculum takes the same amount of training time. In detail, when reaching the hard level, given an anchor EEG $E_i$, we select the positive sample and the negative sample from $\mathcal{E}_i^{hard+}$ and $\mathcal{E}_i^{hard-}$, respectively.
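The level split and the One-Pass scheduler with a linear pace can be sketched together. This is a minimal illustration, not the authors' implementation: `split_levels` and `one_pass_level` are hypothetical names, and giving leftover items to the earlier levels when the list length is not divisible by the number of levels is an assumption.

```python
def split_levels(sorted_items, n_levels=3):
    """Split an easy-to-hard list into n contiguous, near-equal parts."""
    size, extra = divmod(len(sorted_items), n_levels)
    levels, start = [], 0
    for k in range(n_levels):
        # Assumption: the first `extra` levels receive one additional item.
        end = start + size + (1 if k < extra else 0)
        levels.append(sorted_items[start:end])
        start = end
    return levels


def one_pass_level(step, total_steps, n_levels=3):
    """Linear pace: every level gets an equal share of steps and is visited once."""
    return min(step * n_levels // total_steps, n_levels - 1)


easy, medium, hard = split_levels(list(range(9)))
schedule = [one_pass_level(step, total_steps=9) for step in range(9)]
print(easy, medium, hard)
print(schedule)
```

At a given training step, the positive and negative samples for an anchor would be drawn from the parts of its two sorted lists that correspond to the level returned by `one_pass_level`.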

C. Backbone Model
Our backbone model BRAINTRANSLATOR inherits a typical Encoder-Decoder framework, which first encodes a sequence of word-level EEG features E into distributed representations and then generates the target sentence S with the decoder. The overall architecture is shown in Fig. 5. BRAINTRANSLATOR takes word-level EEG features as input and produces the corresponding sentence. It mainly consists of three parts: (a) Word-Level EEG Feature Construction, which concatenates the features of different bands of one word to form the final word-level EEG feature; (b) a Pre-encoder, which transforms the original EEG features into the pre-trained Seq2Seq embedding space; and (c) a Pre-trained Seq2Seq model, which takes the sequence of transformed embeddings and produces the final output sentence. Formally, the pre-encoder maps E into the embedding space of the pre-trained Seq2Seq model, which then generates S.

D. Learning Procedure
The overall training process follows a two-step manner. The first step is C-SCL, which aims to pre-train the pre-encoder. The second step is language modelling, which aims to jointly optimize the whole EEG-to-Text generation model.
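The contrastive objective of the first step can be sketched as follows. This is a minimal sketch under a stated assumption: the loss is taken to have the supervised form of Gao et al. [13] with one explicit negative per anchor, i.e. $\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j=1}^{N} \left( e^{\mathrm{sim}(h_i, h_j^{+})/\tau} + e^{\mathrm{sim}(h_i, h_j^{-})/\tau} \right)}$, where $h$ denotes the averaged pre-encoder output, $N$ the mini-batch size and $\tau$ the temperature. The function name `contrastive_loss` and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(h, h_pos, h_neg, i, tau):
    """Loss for anchor i over a mini-batch of N (anchor, positive, negative) triples.

    Assumed form: cross-entropy over all in-batch positives and explicit
    negatives, with the anchor's own positive as the target.
    """
    logits = []
    for j in range(len(h)):
        logits.append(cosine(h[i], h_pos[j]) / tau)
        logits.append(cosine(h[i], h_neg[j]) / tau)
    target = cosine(h[i], h_pos[i]) / tau
    # Log-sum-exp with the maximum subtracted, so tiny tau values do not overflow.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - target


h = [[1.0, 0.0], [0.0, 1.0]]
h_pos = [[1.0, 0.1], [0.1, 1.0]]
h_neg = [[0.0, 1.0], [0.0, -1.0]]
loss_aligned = contrastive_loss(h, h_pos, h_neg, i=0, tau=0.05)
loss_swapped = contrastive_loss(h, h_neg, h_pos, i=0, tau=0.05)
print(loss_aligned < loss_swapped)
```

The loss is small when the anchor is close to its own positive and far from every other candidate, and large when the roles of positive and negative are swapped.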
Firstly, we adopt our C-SCL to train the pre-encoder. Formally, given an anchor $E_i$ and one specific curriculum level c_level, we have the contrastive triple $(E_i, E_i^{+}, E_i^{-})$, whose positive and negative samples are drawn from that level (the construction process is illustrated in an accompanying figure). After the transformation of the pre-encoder, we get $(h_i, h_i^{+}, h_i^{-})$, where $h_i$ is the averaged vector of the outputs of the pre-encoder. Following the contrastive framework in Gao et al. [13], we minimize a cross-entropy loss $\ell_i$ computed over a mini-batch of size $N$, where $\tau$ is a temperature hyperparameter and $\mathrm{sim}(h_i, h_j)$ is the cosine similarity. Note that our C-SCL works in an online manner, which means that both positive and negative pairs are constructed dynamically during training rather than offline before training. This increases the diversity of contrastive pairs, thus improving training efficiency. This loss constitutes the learning objective of the first step.

Secondly, based on the contrastive-trained pre-encoder, we jointly fine-tune all the parameters of BRAINTRANSLATOR to minimize the cross-entropy loss on a parallel training corpus (E, S).

IV. EXPERIMENTS

A. Baseline Models
We adopt the previous state-of-the-art BRAINBART [34] as our baseline model, which is composed of the Transformer pre-encoder and the BART pre-trained seq2seq model [22]. Besides, we further employ two other widely used pre-trained seq2seq models, PEGASUS [36] and T5 [30], building upon the Transformer pre-encoder to form BRAINPEGASUS and BRAINT5, respectively. All three models come in two model-size variants, LARGE and BASE, leading to six models in total.

B. Evaluation Protocol
Following Wang and Ji [34], we adopt ROUGE [23] and BLEU [29] for evaluating the EEG-to-Text generation task. Besides, following Metzger et al. [25], we also adopt Word Error Rate (WER) as a metric to examine more fine-grained generation performance.
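For readers unfamiliar with WER, the metric can be sketched as the word-level edit distance between hypothesis and reference, divided by the reference length. This standalone sketch is for illustration only and is not the implementation used in the experiments; the function name `word_error_rate`, the whitespace tokenization and the example sentences are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the movie was great", "the film was great"))
```

Because insertions count as errors, WER can exceed 1 when the hypothesis is much longer than the reference; lower values are better.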

C. Implementation Details
Our pre-encoder consists of 6 layers, each with 8 heads and a hidden dimension of 2048. The dimension of the input EEG representation is 840. For the contrastive training process, we use Adam with a learning rate of 0.001 and a batch size of 32. τ is set to 0.00001. For the curriculum training process, we train one epoch for each curriculum from easy to hard (easy, medium and hard). For the overall training process, we first load the checkpoint of the contrastive-trained pre-encoder and then fine-tune the whole model using Adam with a learning rate of 2e-5 and a batch size of 32. For the generation process, following Wang and Ji [34], we equip our model with greedy decoding to produce the final sentences. For all three metrics, we use the standard implementations provided by HuggingFace.

V. RESULTS

A. Automatic Evaluation
Table II shows the performance of our SCL and C-SCL on the ZuCo benchmark. In detail, we evaluate our model under two settings: (1) the 10-fold cross-validation setting and (2) the same data split setting as Wang and Ji [34]. Overall, we find that SCL consistently attains strong performance across various baseline models and architectures. With the enhancement of curriculum learning, C-SCL further boosts performance. In detail, our approach achieves state-of-the-art performance across six different architectures. Specifically, when comparing our method to the previous SOTA model (BRAINBART-LARGE), we observe a 1.58-point increase in ROUGE-L, a 2.41-point increase in BLEU-4 and a 2.25-point improvement in WER, which serves as substantial evidence of the effectiveness of our method. When comparing SCL to C-SCL, our state-of-the-art C-SCL is better on all metrics. In addition to the main observations, our empirical results also support the following two findings. Firstly, BART performs well. Although this finding is derived exclusively from results based on three pre-trained seq2seq models, it still provides a guideline for choosing future backbone seq2seq models for the EEG-to-Text generation task: prefer task-agnostic language models (e.g., BART) over task-oriented models (e.g., PEGASUS for summarization and T5, which requires task prompts). Secondly, EEG-to-Text generation also follows the scaling law, which means that the generation performance improves with an increasing number of model parameters.

B. Human Evaluation
To further assess the quality of the generated texts, we conduct a human evaluation study. We choose two metrics: consistency (EEG representations with respect to the same sentence are consistently decoded into the same sentence) and correctness (the decoded sentence is factually consistent with the reference sentence). Specifically, we employ three evaluators to undertake the human evaluation. Each evaluator is remunerated $40 for this evaluation task. We randomly select 50 unique sentences from the test set and take 5 EEG representations elicited by different subjects for each sentence to conduct the evaluation. For consistency, given 5 EEG representations corresponding to one sentence, we evaluate whether the 5 generated sentences are consistent. For correctness, we evaluate whether the 250 generated sentences are factually consistent with the ground truth. For each metric, the score ranges from 1 (worst) to 5 (best). The results are shown in Table III. Firstly, we find that our proposed SCL and C-SCL achieve better scores on both metrics, with C-SCL performing the best. Secondly, even our best method still cannot achieve very good results on correctness, which shows that factual inconsistency remains an important challenge for the brain-to-text generation task.

C. Analysis
1) Parameter Search for τ: Fig. 6 shows the contrastive training loss under different values of τ. We find that setting τ to a small number is critical for successful EEG contrastive training. In contrast, a larger τ value of 0.05 results in a loss of almost 0, indicating that contrastive training is ineffective when τ = 0.05. We attribute this to the fact that the original EEG signals are similar to each other, so a small τ can produce more distinguishable EEG representations, thus enabling effective contrastive learning. We conduct preliminary experiments and find that setting τ to 0.00001 yields better EEG-to-Text generation performance.
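The claim that a small τ makes nearly identical representations more distinguishable can be illustrated with a temperature-scaled softmax. This sketch only illustrates that scaling effect and does not reproduce the loss curves of Fig. 6; the similarity values are hypothetical, chosen to mimic EEG representations whose cosine similarities differ only in the fourth decimal place.

```python
import math


def softmax_with_temperature(similarities, tau):
    """Softmax over similarities divided by the temperature tau."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical, nearly identical cosine similarities to one anchor.
sims = [0.9990, 0.9985, 0.9980]
probs_large_tau = softmax_with_temperature(sims, tau=0.05)
probs_small_tau = softmax_with_temperature(sims, tau=0.00001)
print([round(p, 4) for p in probs_large_tau])
print([round(p, 4) for p in probs_small_tau])
```

With τ = 0.05, a similarity gap of 0.0005 becomes a logit gap of only 0.01, so the three candidates receive almost equal probability; with τ = 0.00001 the same gap becomes 50, and the closest candidate dominates.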

2) Parameter Search for the Number of Curriculum Levels: The search results for varying numbers of curriculum levels are presented in Table IV. In particular, $\grave{\mathcal{E}}_i^{+}$ and $\acute{\mathcal{E}}_i^{-}$ are partitioned equally based on the number of curriculum levels. Subsequently, the model is trained progressively in an easy-to-hard order. After evaluating the performance, the number of curriculum levels is set to 3.
3) Ablation Study for Explicit Negative Pairs: To examine whether explicit negative pairs are necessary, we conduct an ablation study that considers only the in-batch negative pairs without incorporating explicitly crafted negative pairs. Table V shows the results. We find that explicit negative pairs indeed benefit contrastive learning and are therefore both effective and necessary.

4) Comparison with Domain-Adversarial Learning Method: Recall that the key challenge of the brain-to-text generation task is to mitigate the discrepancy between the subject-dependent EEG representation and the semantic-dependent text representation. Accordingly, for a more comprehensive evaluation of our proposed method, we additionally explore one critical method, domain-adversarial learning (DAL) [12], which is also skilled in learning domain-invariant representations, to address this challenge. Specifically, the objective of domain-adversarial learning is to learn EEG representations that are indistinguishable with respect to the same sentence by treating any two EEG representations corresponding to the same sentence as the source domain and the target domain, respectively. The experimental results are presented in Table VI. We find that both contrastive learning and domain-adversarial learning can mitigate the discrepancy and improve brain-to-text performance. However, DAL underperforms C-SCL, suggesting that the method requires careful adaptation to this task. Overall, we believe domain-adversarial learning holds promise as an important research direction for brain-to-text generation. Further efforts are warranted to fully realize its potential.

5) Embedding Visualization: To verify whether our C-SCL learns semantic-dependent EEG representations, we give a straightforward comparison via t-SNE between the original representations (Fig. 7(a)) and the representations obtained after the transformation of the contrastive-trained pre-encoder (Fig. 7(b)). We can easily observe that our learned EEG representations of the same sentence tend to be closer to each other than the original, scattered ones. This result coincides with our initial goal.

Fig. 9. Results of different methods tested on 4 subjects respectively, including both male and female, youth and middle-aged subjects; e.g., ZPHmale-26 describes the subject identified as ZPH, who is male and 26 years old. BRAINBART-LARGE (Single) means training and testing on the data of a single subject. The others mean training on the whole data while testing on the data of a single subject.

6) Single-Subject Setting: Relative to single-subject training, the other three mixed-subjects training methods achieve remarkable improvements, which indicates that mixed-subjects training methods are worth exploring. Besides, the results also show the effectiveness of our proposed SCL and C-SCL at a more fine-grained level.
7) Low-Resource Setting: To verify the robustness of our methods on varying data sizes, we provide datasets of different sizes to train the pre-encoder using SCL and C-SCL and then fine-tune the whole model. Note that the size of the test set is the same across all experiments. The results are shown in Fig. 8. We find that the model performance clearly improves as the dataset grows in terms of all metrics. Notably, our methods show great advantages in the low-resource setting. In particular, when using only 25% of the dataset, our C-SCL reduces the WER from 92.83% to 78.89%, achieving results comparable to those obtained with 50% of the dataset.
8) Zero-Shot Setting: To verify the generalizability of our methods, we conduct zero-shot experiments by training on a partial ZuCo dataset that excludes the data of one selected test subject. The results are shown in Fig. 10. We can see that our methods yield strong performance for the unseen subjects ZPH and ZKP, respectively. We attribute this good generalizability to the fact that contrastive learning not only learns better representations for currently available subjects but also optimizes a distinguishable representation space that can be easily transferred and adapted to unseen subjects.

9) Single-Curriculum Setting: To verify the necessity of curriculum learning for our C-SCL, we individually perform SCL based on contrastive pairs from each curriculum level, including SCL($\mathcal{E}^{easy}$), SCL($\mathcal{E}^{medium}$) and SCL($\mathcal{E}^{hard}$). Then, we select one-third of the data from each curriculum level and conduct C-SCL on this fixed subset. Note that all the above contrastive learning datasets have the same size and that the fine-tuning is based on the whole ZuCo train part. The results are shown in Table VII. Firstly, we find that curriculum learning indeed benefits model performance. Besides, both SCL($\mathcal{E}^{easy}$) and SCL($\mathcal{E}^{hard}$) achieve relatively lower results. We attribute this to easy pairs providing little learning signal, whereas learning directly from hard pairs is too challenging for the model.

TABLE VII. RESULTS OF DIFFERENT CURRICULUM LEVELS

In addition to the extrinsic evaluation based on the downstream Brain-to-Text generation task, we further conduct an intrinsic analysis to give an in-depth understanding of the efficiency of our proposed C-SCL compared with SCL. For each method, during the training process, we calculate the average cosine similarity of the EEG representations (obtained after the transformation of the contrastive-trained pre-encoder) corresponding to the same sentence in the valid set. Specifically, we set four calculation points, which are at the start, one-third, two-thirds and the end of the full training process, respectively. The results are shown in Fig. 11. Firstly, we find that our C-SCL learns more similar EEG representations corresponding to the same sentence, which is in line with our learning objective. Secondly, the results coincide with the findings in Table VII, where C-SCL performs the best while SCL($\mathcal{E}^{easy}$) and SCL($\mathcal{E}^{hard}$) perform worse. By means of this intrinsic analysis, we can attribute the success of our C-SCL to the effective learning of semantic-dependent EEG representations.
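The intrinsic measure used above, the average cosine similarity of EEG representations that correspond to the same sentence, can be sketched as follows. This is a minimal illustration: the function name `mean_same_sentence_similarity` and the toy vectors are hypothetical, and pooling all within-sentence pairs before averaging (rather than averaging per sentence first) is an assumption.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_same_sentence_similarity(reps_by_sentence):
    """Average cosine similarity over all pairs of representations per sentence.

    `reps_by_sentence` maps a sentence id to the representations produced for
    that sentence by different subjects.
    """
    sims = []
    for reps in reps_by_sentence.values():
        sims.extend(cosine(u, v) for u, v in combinations(reps, 2))
    return sum(sims) / len(sims) if sims else float("nan")


toy_reps = {
    "s1": [[1.0, 0.0], [1.0, 0.0]],  # two subjects, identical representations
    "s2": [[1.0, 0.0], [0.0, 1.0]],  # two subjects, orthogonal representations
}
print(mean_same_sentence_similarity(toy_reps))
```

Evaluating this quantity on the valid set at several points during training gives a curve that rises if the pre-encoder is learning semantic-dependent representations.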

D. Case Study
Fig. 12 shows the case study. We find that our method generates the same sentence for EEG signals elicited by different subjects based on the learned semantic-dependent EEG representations, whereas the baseline produces different ones. Besides, our result is more semantically related to the reference than the baseline results, which indicates that semantic-dependent EEG representation can enhance the generation performance. However, there still exists a large gap between our generation and the golden reference. We believe future work should pay attention to the following research directions: (1) strategies that jointly model continuous word-level EEG signals and the syntactic structure of sentences, since the current generation still fails to capture the linguistic structure; (2) strategies to close the gap between the word-level EEG feature and token-level generation, since the current generation still contains several spelling errors.

VI. RELATED WORK

A. Brain-to-Text Generation
Brain-to-Text generation is an active area of research at the intersection of artificial intelligence and neuroscience [31] and is closely related to research on simulating human perceptual experiences and reasoning processes [7], [8], [20]. According to vocabulary size, there are two series of related works: closed vocabulary and open vocabulary brain-to-text generation. The first line of work generates words in small closed vocabularies [24], [28]. For example, Moses et al. [28] focus on a 50-word vocabulary. While these works exhibit promising generation accuracy and speed, effective day-to-day communication requires access to a larger vocabulary. Accordingly, Wang and Ji [34] study the open vocabulary EEG-to-Text decoding task by utilizing pre-trained language models (PLMs) [22]. This brings two benefits: on the one hand, PLMs offer a large vocabulary; on the other hand, PLMs can serve as a bridge between brain signals and linguistic information [26]. In our work, we focus on the open vocabulary paradigm due to the non-invasive nature and widespread application prospects of EEG-based BCIs. Specifically, we pay particular attention to the challenge of the discrepancy between the subject-dependent EEG representation and the semantic-dependent text representation for the EEG-to-Text generation task.

B. Contrastive Learning
Contrastive learning is a technique that aims to make the representation of a given anchor similar to its positive pairs while being dissimilar to its negative pairs. It shows promising results in computer vision [5], [16], [17] and has gained popularity in natural language processing [13], [14]. After witnessing its superiority in the above areas, neuroscientists have turned their attention to contrastive learning, and it has been applied to several EEG-based classification tasks [6], [10], [21], [27], [32]. More recently, a contrastive learning method was proposed in [32] to tackle the cross-subject emotion recognition problem. Défossez et al. [10] devise a contrastive learning objective to align representations of brain signals and natural speech. In our work, we devise a curriculum semantic-aware contrastive learning strategy (C-SCL), aiming to learn semantic-dependent representations, which effectively reduces the discrepancy between the EEG and text representations.

VII. CONCLUSION AND FUTURE WORK
In this paper, we propose a curriculum semantic-aware contrastive learning strategy (C-SCL) to reduce the discrepancy between the subject-dependent EEG representation and the semantic-dependent text representation. The experimental results on the ZuCo benchmark demonstrate its effectiveness for the EEG-to-Text generation task. Besides, our analyses also verify the robustness and superior generalizability of our C-SCL in the low-resource setting and the zero-shot setting, respectively. Moreover, single-subject setting experiments also point to the necessity of exploring mixed-subjects training methods for the EEG-to-Text generation task.
We believe that forthcoming research endeavours will seek to implement the proposed method in real scenarios. First, building upon the existing C-SCL framework, future work could consider semantic similarity when constructing contrastive pairs and integrate multiple solutions like contrastive learning and domain-adversarial learning to further improve performance. Second, findings from neuroscience research could inform the text decoding stage, associating brain-inspired related words during decoding to mitigate the hallucination problem. Third, collaborating with hospitals would enable deploying the method with actual patients, gauging its effectiveness and robustness. Overall, opportunities remain to refine the technique through semantic-similarity-aware contrastive learning, brain-inspired text decoding, and validation in real-world clinical settings.

Manuscript received 1 July 2023; revised 22 August 2023; accepted 7 September 2023. Date of publication 12 September 2023; date of current version 9 October 2023. This work was supported in part by the National Key Research and Development Program of China under Grant 2020AAA0106502, in part by the National Natural Science Foundation of China (NSFC) under Grant 62276078, in part by the Key Research and Development Program of Heilongjiang under Grant 2022ZX01A32, and in part by the International Cooperation Project of PCL under Grant PCL2022D01. (Corresponding author: Xiaocheng Feng.)

Fig. 1. Illustration of the EEG-to-Text generation. The left part shows the EEG recording process, in which one subject reads a sentence on the screen while recording their EEG signals. Concurrently, the eye-tracking device permits defining exact word boundaries via fixations. Given recorded EEG signals, the task aims to generate the sentence that stimulated those EEG signals.

Fig. 2. Brain topological graph of the sentence-level EEG representation (averaged word-level EEG representations).(a) Four topological graphs denote EEG representations elicited by the same subject in response to four different sentences.(b) Four topological graphs describe EEG representations elicited by four different subjects corresponding to the same sentence.

A. Task Formulation
Given a sequence of word-level EEG features $E$, the EEG-to-Text generation task aims at producing a sentence $S$ via a model $\theta$, where $E$ consists of $|E|$ features $[e_1, e_2, \ldots, e_{|E|}]$ and $S$ consists of $|S|$ tokens $[s_1, s_2, \ldots, s_{|S|}]$. Each $e \in \mathbb{R}^n$ symbolizes a word-level EEG feature vector and $\theta$ denotes the parameters of a sequence-to-sequence model. Each sequence of EEG features $E$ is associated with a subject $p_i \in P$, $P$ being a set of subjects. During the training phase, EEG-Text pairs come from various subjects and the learning objective. At the test phase, sentences are totally unseen. Besides, the train, valid and test sets maintain the same set of subjects $P$.

Fig. 3. Illustration of our semantic-aware contrastive learning strategy. (a) Positive pairs derive from EEG signals corresponding to the same sentence elicited by different subjects. In contrast, (b) negative pairs come from EEG signals elicited by different subjects corresponding to different sentences.
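The pair definitions in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Sample` record, the `build_triple` helper, and the reading of "different subjects" as "a subject other than the anchor's" are assumptions made for illustration, and the curriculum levels of C-SCL are not modelled here.

```python
import random
from collections import namedtuple

# Hypothetical record: one EEG recording of one subject reading one sentence.
Sample = namedtuple("Sample", ["subject", "sentence", "eeg"])


def build_triple(anchor, samples, rng=random):
    """Return (anchor, positive, negative), or None if no valid pair exists.

    Positive: same sentence as the anchor, different subject.
    Negative: different sentence and (by assumption) a different subject.
    """
    positives = [s for s in samples
                 if s.sentence == anchor.sentence and s.subject != anchor.subject]
    negatives = [s for s in samples
                 if s.sentence != anchor.sentence and s.subject != anchor.subject]
    if not positives or not negatives:
        return None
    return anchor, rng.choice(positives), rng.choice(negatives)
```

A triple built this way could then be passed to any triplet-style or InfoNCE-style contrastive objective; which objective to use is left open in this sketch.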
denotes $M$ decoding layers. $Y_0$ describes the shifted right version of $S$, $\mathrm{FFN}(\cdot)$ represents a position-wise feed-forward network, and $\mathrm{ATT}(\cdot)$ represents a multi-head attention.

Fig. 7. t-SNE visualization of sentence-level EEG representations of sentences in the training set, which are (a) original EEG representations and (b) generated by the pre-encoder after C-SCL. Different colours mean different subjects. Each dot represents a sentence. The red box dots represent the EEG representations corresponding to the same sentence "He and his wife had seven children".
Fig. 7(a) also shows distinct subject clusters (different colours), while Fig. 7(b) reveals subjects distributed more evenly. Nevertheless, Fig. 7(b) also shows that the EEG representations of the same sentence are not fully clustered. Instead, multiple sub-clusters are formed, which indicates that achieving a desirable semantic-dependent EEG representation space is a challenging task. To address this challenge, we envision three potential paths. First, optimize EEG signal preprocessing: we could introduce a new normalization method that incorporates the semantic-dependent EEG representation idea during preprocessing to bias the initial EEG representation. Second, employ pre-training techniques: pre-trained EEG models could also enhance EEG modelling, and through careful design of the pre-training objective we could guide the model to learn semantic-dependent EEG representations. Third, leverage joint learning approaches like contrastive learning and domain-adversarial learning to augment the model's learning objective and achieve enhanced performance.
6) Single-Subject Setting: Given that the subject-dependent EEG representation poses a great challenge to the EEG-to-Text generation task, in this analysis we aim to answer one question: is single-subject training a more suitable way for the EEG-to-Text generation task? To verify this, we test both mixed-subjects training and single-subject training methods on data from 4 distinct subjects. The results are shown in Fig. 9. Compared with single-subject training, all

Fig. 10. Zero-shot results by training on data that excluded the final test subject.
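The zero-shot protocol named in this caption amounts to a leave-one-subject-out split. The sketch below is a minimal illustration under assumed names: the record layout (dicts with a `"subject"` key) and the `zero_shot_split` helper are hypothetical and do not come from the paper's code.

```python
def zero_shot_split(records, held_out_subject):
    """Split records so the held-out subject appears only in the test set.

    `records` is assumed to be a list of dicts with a "subject" key;
    this layout is illustrative, not the ZuCo file format.
    """
    train = [r for r in records if r["subject"] != held_out_subject]
    test = [r for r in records if r["subject"] == held_out_subject]
    return train, test


# Usage with made-up subject identifiers.
records = [
    {"subject": "S1", "sentence": "a"},
    {"subject": "S2", "sentence": "a"},
    {"subject": "S3", "sentence": "b"},
]
train_set, test_set = zero_shot_split(records, "S3")
```

Repeating the split once per subject would give one zero-shot evaluation per held-out subject.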

Fig. 11. Average cosine similarity of EEG representations corresponding to the same sentence in the valid set during the training process.
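One plausible way to compute the quantity tracked in this figure is sketched below. The grouping of representations by sentence, the helper names, and the choice to average over all within-sentence pairs are assumptions; the paper may aggregate differently.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_same_sentence_similarity(reps_by_sentence):
    """Average cosine similarity over all pairs of representations
    that belong to the same sentence.

    `reps_by_sentence` maps a sentence to a list of sentence-level EEG
    vectors, e.g. one per subject (hypothetical layout).
    """
    sims = []
    for reps in reps_by_sentence.values():
        sims.extend(cosine(u, v) for u, v in combinations(reps, 2))
    return sum(sims) / len(sims) if sims else float("nan")
```

Under this reading, a value that rises during training would suggest that representations of the same sentence from different subjects are moving closer together.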

Algorithm 1 Contrastive Pairs Construction for Specific Curriculum Level
Input: EEG $E_i$ with its corresponding subject $p_i$ and sentence $S_i$; a dict $f_s : S_i \to E_{S_i}$ that maps $S_i$ to a set of EEG signals $E_{S_i}$; a dict $f_p : p_i \to E_{p_i}$ that maps $p_i$ to a set of EEG signals $E_{p_i}$; a set of sentences $S$; curriculum level c_level
Output: a contrastive triple

TABLE II. TEST RESULTS ON THE ZUCO BENCHMARK UNDER THE 10-FOLD CROSS-VALIDATION SETTING. THE RESULTS ENCLOSED IN PARENTHESES ARE OBTAINED UTILIZING THE IDENTICAL DATASET SPLITS AS THOSE EMPLOYED BY WANG AND JI [34]. ↑ MEANS HIGHER IS BETTER. ↓ MEANS LOWER IS BETTER

TABLE IV. RESULTS OF THE DIFFERENT NUMBER OF CURRICULUM LEVELS BASED ON THE BRAINBART-LARGE (W/ C-SCL)

TABLE V. ABLATION STUDY FOR EXPLICIT NEGATIVE PAIRS

TABLE VI. RESULTS OF DIFFERENT PRE-TRAINING METHODS