Zero-Shot Learners for Natural Language Understanding via a Unified Multiple-Choice Perspective

Zero-shot learning is an approach where models generalize to unseen tasks without direct training on them. We introduce the Unified Multiple-Choice (UniMC) framework, which is format-independent and applicable to a range of label-based tasks such as text classification and sentiment analysis. Furthermore, we design a two-stage tuning method that first trains on multiple-choice formats to develop format-agnostic capabilities and then makes direct predictions on unseen tasks for zero-shot learning. Our methodology avoids issues faced by large-scale models such as FLAN, enhancing generalization while requiring far fewer parameters. In experiments, UniMC achieves State-of-the-Art (SOTA) performance across out-of-domain and in-domain benchmarks with only 235M parameters, far fewer than previous methods. Moreover, the UniMC-Chinese model exceeds human performance on benchmarks such as EPRSTMT and CHID-FC, underscoring its generalization capacity across languages. Additionally, ablation experiments demonstrate the effectiveness of our design. The code and model weights are available at https://github.com/IDEA-CCNL/Fengshenbang-LM/tree/main/fengshen/examples/unimc.


I. INTRODUCTION
In the field of Natural Language Understanding (NLU), the evolution of language models has served as a foundation for advancements across a variety of tasks, including sentiment analysis, coreference resolution, and text classification [2], [3]. The progress in Zero-Shot Learning (ZSL) presents opportunities for NLU tasks, particularly in predicting labels on unseen tasks [4], [5]. Most solutions are designed within a framework known as prompt tuning [6], [7]. This framework activates specific parameters in Pre-trained Language Models (PLMs) to adapt them for zero-shot tasks. A variation of prompt tuning, known as instruction tuning, has been introduced in recent literature [8]. This method allows for the sharing of knowledge across different domains.
We provide an overview of the mainstream large-scale PLMs, as illustrated in Fig. 1.
While these frameworks have marked a notable success in various applications, their inherent limitations present challenges that hinder the full realization of their potential, particularly in the field of ZSL. Firstly, the trend towards employing prompt-related models is characterized by an overwhelming number of parameters. For instance, FLAN [8] is equipped with 137 billion parameters, while PaLM [5] takes it a step further with more than 540 billion. A pressing challenge is the complexity of training large-scale PLMs, which affects both efficiency and the ease of deployment and application. Secondly, hand-crafted designs are often indispensable. For example, the method in T0 [4] manually constructs thousands of prompts for managing over a hundred new tasks. Lastly, the prevailing models in the field predominantly adopt a unidirectional strategy, relying on either auto-regressive or sequence-to-sequence methodologies. This constrained approach hampers the optimal utilization of information, neglecting the simultaneous consideration of both forward and reverse directional data. Note that recent work [9] highlights the superiority of Pre-trained Masked Language Models (PMLMs) over PLMs in the context of NLU tasks. Fig. 1 (c) illustrates PMLMs' attempt to implement a zero-shot learner. However, PMLMs require fine-tuning on task-specific samples, since their classification heads are otherwise randomly initialized. Such a necessity limits PMLMs' capability in addressing zero-shot scenarios, emphasizing a constraint on their broader application.

FIGURE 1. The mainstream zero-shot learning approaches and our proposed UniMC for NLU tasks. ''PLM'' refers to the pre-trained language model, typically a unidirectional model for understanding language context. Conversely, ''PMLM'' represents the pre-trained masked language model, designed for tasks necessitating comprehension of context from both directions of a given input.
In response to these challenges, we present a light-weight architecture named the Unified Multiple-Choice model (UniMC), which embodies a new approach to Multiple-Choice (MC) tuning. Firstly, we standardize the input, enabling our model to handle various types of tasks from a unified perspective. Specifically, we use the original sample text as the passage. Since NLU tasks are built on labels, a natural idea is to use the labels as input options. Therefore, labels are treated as selectable alternatives, rather than constructing verbalizer maps and supplying the corresponding text information to the models as in previous approaches. From the labeled data, we are able to derive valuable label information directly. To accomplish this, we replace the classifiers that have been identified as problematic with mechanisms for selecting options.
After unifying the input data, a critical question arises: how can PMLMs select an option? The selection must be clear and precise, without the need for additional classification modules. As described in Sec. III-B, we design option-mask tokens, denoted as [O-MASK], to facilitate the prediction of either ''yes'' or ''no'' preceding each selectable option. Initially, in a manner akin to Masked Language Modeling (MLM) [3], we employ an approach called Option MLM (O-MLM). This method is designed to ascertain a ''yes'' or ''no'' response for each given option, thereby enhancing the model's ability to interpret and respond to specific choices. Following the O-MLM method, we introduce an Option Prediction (OP) technique, which determines the appropriate choice by evaluating the likelihood of selecting ''yes'' for each option. To equip the model with the ability to select the correct option from the candidates, we introduce MC tuning in Sec. III-B. The proposed MC tuning method offers several benefits: i) the updating of parameters is confined to the MC training phase, ii) the models can effortlessly handle unseen tasks in the zero-shot inference phase, and iii) it facilitates the deployment of the model.
To thoroughly investigate the capabilities of the UniMC framework, we conduct both out-of-domain and in-domain experiments. In out-of-domain tasks, our UniMC exhibits outstanding zero-shot performance, achieving State-of-the-Art (SOTA) results on datasets such as ANLI R1 and CB. For in-domain (MC tasks) comparisons, our UniMC improves the zero-shot performance of existing SOTA models by 42.3%. Furthermore, to validate the language generalizability of the UniMC framework, we develop a Chinese version of the DeBERTa-v2 [10] model, enhancing the toolkit available to the Chinese language processing community. We then integrate this model with the UniMC framework, dubbing it UniMC-Chinese. Our experiments demonstrate the effectiveness of our framework, which even exceeds human performance on datasets such as EPRSTMT and CHID-FC. Additionally, we conduct extensive ablation studies, affirming that our framework adapts well to different PMLMs and validating the efficacy of our design.
The achievements of UniMC within zero-shot scenarios underscore the promising capabilities and adaptability that this method offers for diverse applications. The contributions of the present work¹ are:
• We propose a new approach to the zero-shot paradigm by converting label-based NLU tasks into MC formats, minimizing manual intervention. To accommodate MC tasks seamlessly, we introduce the O-MLM and OP tasks and an MC tuning method with PMLMs. These elements are collectively assembled under the Unified Multiple-Choice model (UniMC) framework.
• Experiments show that UniMC achieves SOTA performance in both in-domain and out-of-domain tasks. For example, it achieves up to a 48% improvement on Dbpedia over SOTA baselines that are a few hundred times larger than our model. This demonstrates the effectiveness of our proposed model.

¹An early version of this paper was presented at EMNLP 2022 [11].
• To further explore the generalizability of the UniMC framework, we develop the UniMC-Chinese model based on DeBERTa-v2. This model shows competitive zero-shot performance and even surpasses human benchmarks across various tasks. Furthermore, comprehensive ablation studies demonstrate the robustness of our proposed method across various scenarios.

II. RELATED WORK

A. UNIFIED FORMATS FOR NLP TASKS
In the field of NLP, tasks are frequently presented in various formats. This diversity stems from the rapid development and introduction of different types of datasets. Recent studies underscore the imperative of standardizing formats to bridge the disparities across diverse tasks [4], [8], [12]. By designing forms for natural language input, T0 transforms original NLP datasets into target templates guided by custom prompts [4]. In addition, FLAN organizes multiple datasets into 12 distinct task clusters and subsequently designs 10 specialized instruction templates to standardize formats [8]. Although effective, this approach is primarily geared towards generative styles, limiting its adaptability to the wide array of label-based models that necessitate selection. This has led us to focus on unifying label-based NLU tasks by developing standardized Multiple-Choice (MC) formats.

B. LABEL INFORMATION
Label semantics play a pivotal role in enhancing the performance of few-shot and zero-shot tasks, a conclusion strongly supported by extensive research [13], [14], [15]. Among the innovative applications of this concept, a framework [13] stands out, skillfully merging label information with handcrafted prompts to address the challenges of few-shot slot tagging. Moreover, another approach [14] takes the integration of label semantics a step further by weaving it into both the pre-training and fine-tuning stages of PLMs. The success of these methodologies in low-resource environments has not only demonstrated the potential of label semantics but also spurred our exploration into its application within our unified MC inputs, specifically targeting the zero-shot scenario.

C. ZERO-SHOT LEARNING IN NLU TASKS
PLMs with large-scale parameters, such as GPT-3, excel in few-shot tasks but face limitations in zero-shot tasks, which hold broader practical implications. Recent endeavors have sought to overcome this challenge from various angles. For instance, FLAN [8] employs specific instruction templates and over 60 labeled datasets to fine-tune a 137B language model. On the other hand, T0 [4] adopts a unified approach, transforming all tasks into a source-target structure, and utilizes over 2,000 manually constructed prompts for multitask learning. Similarly, ZeroPrompt [6] leverages more than 1,000 supervised datasets and introduces a novel genetic prompt search method for new tasks. These methods, while innovative, demand considerable effort in prompt engineering and template creation. The pre-training and tuning of large-scale PLMs consume significant computational resources, posing challenges for new task deployment.
In contrast, our UniMC model is streamlined, with only 235M parameters and a few manual text transformations, making it adaptable to a wider range of scenarios.

III. APPROACHES
In this section, we introduce the proposed framework, referred to as UniMC. Moreover, we provide details of the MC tuning method and the multiple model variants.

A. THE UNIMC FRAMEWORK

1) UNIFIED INPUTS
A standardized input format promotes the process of model generalization, thereby enhancing the efficiency of knowledge sharing across various tasks. In order to accomplish this goal, we formulate all individual task objectives into a unified Multiple-Choice (MC) framework, as illustrated in Fig. 2.
An MC problem typically includes three components: options, a question, and a passage. We now explain how to obtain these components. The passage component is often readily available in the original data. In handling the question component, we have the option to utilize the original question as it is. Alternatively, we can formulate a new corresponding question should it be absent. The conversion of options relies on our ability to obtain a clear and unambiguous representation of labels. On the one hand, we can transform all classification tasks into selectable options if the necessary details for the choices are provided. On the other hand, we must formulate an option prompt to generate specific choices when they do not exist. The details of this transformation are given in Sec. IV-A. This approach allows us to move away from label indices in classification tasks, which contain less information than the options we use.
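To make this unification concrete, the following Python sketch converts a labeled sample into the (options, question, passage) structure. The function name, the option-prompt template, and the toy SST-2 sample are illustrative placeholders rather than the exact templates of our experiments (those are summarized in Table 2).

# A minimal sketch of converting a label-based sample into the unified
# MC format. The option prompt ''It was {label}.'' is illustrative only.
def to_unified_mc(sample, label_names, question=""):
    return {
        "options": [f"It was {name}." for name in label_names],  # labels become options
        "question": question,           # reuse the original question, or formulate one
        "passage": sample["text"],      # the raw sample text serves as the passage
        "answer": sample.get("label"),  # gold option index (absent at inference time)
    }

# Example with a toy sentiment sample (binary labels).
sst2_sample = {"text": "A gorgeous, witty, seductive movie.", "label": 1}
print(to_unified_mc(sst2_sample, ["bad", "good"], question="What is the sentiment?"))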

2) NETWORK
The overall structure of the UniMC is shown in Fig. 3.
In our framework, we utilize BERT-like PMLMs, including ALBERT [17] and RoBERTa [18], as the backbone architecture to incorporate the bidirectionally modeled input x_inp. Rather than applying the conventional embedding methods, we engineer a new approach for the segment id, the position id, and the attention mask matrix to simultaneously accommodate MC tasks. Additionally, we explore various language environments with distinct backbones, which will be detailed in Sec. III-C.

a: TOKENIZATION
In this framework, the essential factor in achieving the capability to handle MC tasks is the configuration of an appropriate option. To enhance the representation ability, we introduce option-mask tokens ([O-MASK]), specifically designed to replace ''yes'' or ''no'' in the input text. Building on this, [O-MASK] inherits the functionality of [MASK], and thus continues to utilize token predictions to ascertain the correct option. For example, consider an input set denoted as (o, q, x), including the following: i) one passage (x = x_1 . . . x_|x|), ii) N_Q questions (each represented as q = q_1 . . . q_|q|), and iii) N_O candidate options (each represented as o^i = o_1 . . . o_|o|, where the o_j are tokens that define an option). The input token sequence x_inp is formulated as follows:

x_inp = [CLS] [O-MASK] o^1 . . . [O-MASK] o^{N_O} [SEP] q [SEP] x [SEP]. (1)

Here, an [O-MASK] token is placed in front of each candidate option, so that the model can later recover it as ''yes'' or ''no''.
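As a sketch of this tokenization (the placement of separator tokens follows our reading of Eq. (1) and Fig. 3, and the whitespace tokenizer stands in for a real subword tokenizer):

# Assemble the unified input token sequence: an [O-MASK] precedes every
# option, followed by the question and the passage (cf. Eq. (1)).
def build_input_tokens(options, question, passage, tokenize=str.split):
    tokens = ["[CLS]"]
    for opt in options:
        tokens += ["[O-MASK]"] + tokenize(opt)
    tokens += ["[SEP]"] + tokenize(question)
    tokens += ["[SEP]"] + tokenize(passage) + ["[SEP]"]
    return tokens

print(build_input_tokens(
    ["It was bad .", "It was good ."],
    "What is the sentiment ?",
    "A gorgeous , witty , seductive movie ."))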

b: THE SEGMENT ID EMBEDDING, THE POSITION ID EMBEDDING AND THE ATTENTION MASK MATRIX
Note that a unified input text encompasses multiple options, which may cause unintended interactions between options, leading to misunderstandings in the answers. We address this issue from three perspectives: the segment id embedding (seg), the position id embedding (pos), and the attention mask matrix (M_mask). First, we assign segment id embeddings to differentiate between options and context (questions, passages) information, as shown in Fig. 3. Second, we update the position id embeddings to distinguish the intra-option information. This is because PMLMs cannot obtain position information from the tokens themselves.
We aim to allow PMLMs to treat tokens' position information based on their position embeddings. Lastly, we manage the information flow between options by utilizing M_mask in self-attention, as shown in Fig. 4. Specifically, black squares are utilized to mask part of the input attention matrix, ensuring the separation between different options. We place −inf on the masked slots, following the same approach that BERT uses to mask tokens. Then, we obtain the encoded hidden representations, denoted as T = [T_1 . . . T_n], using multiple Transformer-based layers as follows:

T = encoder(x_inp, pos, seg, M_mask). (2)
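A sketch of these three flow-control components is given below; the restart of position ids inside each option and the exact masking pattern reflect our reading of Fig. 3 and Fig. 4, and the helper name flow_control is illustrative.

import torch

# Build segment ids, position ids, and the additive attention mask M_mask
# for a sequence of length n whose options occupy the index spans in
# `option_spans`. M_mask is added to the attention logits: 0 lets a pair
# of tokens attend, -inf blocks attention between different options.
def flow_control(n, option_spans):
    seg = torch.zeros(n, dtype=torch.long)      # 0 = context, 1 = option tokens
    pos = torch.arange(n)                       # default positions for context tokens
    mask = torch.zeros(n, n)                    # all-zero: attention fully allowed
    for i, (s, e) in enumerate(option_spans):
        seg[s:e] = 1
        pos[s:e] = torch.arange(e - s)          # restart positions inside each option
        for j, (s2, e2) in enumerate(option_spans):
            if i != j:                          # block option-to-option attention
                mask[s:e, s2:e2] = float("-inf")
    return seg, pos, mask

seg, pos, M_mask = flow_control(12, [(1, 4), (4, 7)])
print(M_mask)  # -inf between tokens of different options, 0 elsewhere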

B. MC TUNING
Recall that the backbones of our system are pre-trained models, which excel in capturing commonsense knowledge from the pre-training corpus. Intuitively, these models can be employed as foundational modules, leveraging their extensive knowledge base. In particular, we use the outputs of pre-trained models as the initial states for the subsequent MC tasks. Following this, we establish a two-stage tuning paradigm, specifically designed to enhance the system's ability to accurately select the correct answer. In the MC training phase, we train the models with MC tasks, achieving an optimal initialization for selecting the correct options. Finally, in the zero-shot inference phase, we deploy the UniMC models to handle unseen zero-shot tasks.

1) MC TRAINING PHASE
We now introduce the proposed Option Masked Language Modeling (O-MLM) and Option Prediction (OP) methods.
Both of these techniques are specifically designed to assist PMLMs in handling MC tasks. Masked Language Modeling (MLM) is a pre-training task in BERT [3] for self-supervised learning, whose loss can be written as

L_MLM = − Σ_{t ∈ m(T̂)} log p(t | T̂ \ m(T̂)), (3)

where T̂ is a version of T in which a set of tokens has been randomly perturbed; m(T̂) and T̂ \ m(T̂) represent the masked tokens and the unmasked tokens, respectively. In practice, we randomly replace tokens in the passage sequence x with the special token [MASK], instead of masking the whole sequence as done in standard BERT. The main difference between O-MLM and the standard MLM lies in the masking technique: in O-MLM, we consistently mask the tokens denoted by [O-MASK] and predict either ''yes'' or ''no'' at these positions, as illustrated in Fig. 3. Therefore, the losses L_O-MLM and L_MLM are constructed in a similar manner and share the same style.
Once the prediction probabilities for ''yes'' or ''no'' are obtained, we introduce a specific operation (OP) to guide the model in learning MC tasks. This process is illustrated in Fig. 5. To learn the mutually exclusive characteristics between options, OP takes the logits T_yes of the token ''yes'' at each option's [O-MASK] position to generate a label distribution. OP then computes the cross-entropy loss, which measures the difference between the predicted distribution and the ground truth label distribution Y:

L_OP = − Σ_{i=1}^{N_O} Y_i log softmax(T_yes)_i. (4)

During the Softmax operation, all logits other than those of the [O-MASK] tokens are set to −inf (negative infinity), which effectively masks them during the computation. Recent studies demonstrate that incorporating mixed tasks within a batch enhances the generalization ability of deep neural networks [19]. When dealing with mixed tasks, we likewise mask the output logits, leaving only the [O-MASK] tokens unmasked during the Softmax operation. This technique is used to compute the OP loss within a minibatch, as illustrated in Fig. 6. The logit masking approach enables our model, UniMC, to handle MC tasks efficiently, even when the tasks in a single batch vary in type and in the number of options. Furthermore, such flexibility enhances the robustness of the model in managing diverse task configurations.
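The following PyTorch sketch shows one way the two objectives could be computed for a single example; the tensor names, the vocabulary ids yes_id/no_id, and the helper itself are illustrative assumptions rather than our released implementation.

import torch
import torch.nn.functional as F

# `logits`: MLM-head outputs of shape (seq_len, vocab_size); `omask_pos`:
# the [O-MASK] position of each option; `answer`: index of the gold option.
def o_mlm_and_op_loss(logits, omask_pos, yes_id, no_id, answer):
    # O-MLM: recover each [O-MASK] as ''yes'' (gold option) or ''no'' (others).
    targets = torch.full((len(omask_pos),), no_id)
    targets[answer] = yes_id
    l_omlm = F.cross_entropy(logits[omask_pos], targets)

    # OP: Softmax over the ''yes'' logits of the [O-MASK] tokens only;
    # indexing these positions directly is equivalent to placing -inf
    # on every other logit before the Softmax.
    t_yes = logits[omask_pos, yes_id]           # one ''yes'' logit per option
    l_op = F.cross_entropy(t_yes.unsqueeze(0), torch.tensor([answer]))
    return l_omlm + l_op

logits = torch.randn(16, 30000)                 # a toy 16-token sequence
loss = o_mlm_and_op_loss(logits, torch.tensor([1, 5, 9]), yes_id=2748, no_id=2053, answer=1)
print(loss)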
In summary, the overall training objective in MC training is defined as follows:

L = L_O-MLM + L_OP. (5)

2) ZERO-SHOT INFERENCE PHASE
After obtaining a tuned UniMC model, we utilize O-MLM and OP to predict the answer on unseen zero-shot datasets. However, the ground truth labels are missing, making it impossible to compute a loss. As an alternative, we determine the most confident option with the OP, since the model still recovers the [O-MASK] tokens to ''yes'' or ''no'' using O-MLM.
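Concretely, zero-shot answer selection could look like the following sketch (same illustrative names as above):

import torch

# Pick the option whose [O-MASK] token is most confidently recovered as
# ''yes'', i.e. the argmax of the OP distribution; no labels are needed.
def predict_option(logits, omask_pos, yes_id):
    t_yes = logits[omask_pos, yes_id]           # ''yes'' logit per option
    return int(torch.argmax(torch.softmax(t_yes, dim=-1)))

logits = torch.randn(16, 30000)
print(predict_option(logits, torch.tensor([1, 5, 9]), yes_id=2748))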

C. MODEL VARIANTS

1) ENGLISH MODELS
As referenced in Sec. III-A2, we employ PMLMs as the backbone models to understand the context through bidirectional learning. Specifically, we utilize ALBERT-xxlarge-v2 [17] as the primary backbone, leveraging its parameter efficiency, and we name this configuration UniMC. Further details on other settings are provided in the corresponding sections.

2) CHINESE MODELS
In the fast-growing field of Chinese NLP, the lack of a PMLM on par with ALBERT has been a clear shortcoming. While English language models have advanced significantly, the Chinese language field has faced challenges in data diversity, model design, and deployment. Identifying this gap, we have embarked on pre-training a Chinese PMLM, choosing DeBERTa-v2 [10] for its strategic advantages. DeBERTa-v2 is recognized for its unique design, utilizing disentangled attention to grasp subtle contextual relationships in Chinese. Its flexible architecture can handle diverse text challenges, fitting various NLP tasks. Our work goes beyond copying models like ALBERT; it is a focused effort to create a solution tailored to Chinese language needs. We have specifically conducted the pre-training of our DeBERTa-v2-Chinese model (1.4B version) on the WuDao Corpora [20] for one epoch. The details of this process are provided in this work [1]. Using DeBERTa-v2-Chinese, we strive to improve the clarity, precision, and adaptability of Chinese NLP applications. Furthermore, we incorporate our pre-trained DeBERTa-v2-Chinese as the backbone and refer to this configuration as UniMC-Chinese.

IV. EXPERIMENTAL SETTINGS
In this section, we detail the configurations of the experiments.

A. DATA PREPARATION
We follow the preparation outlined in T0 [4] to categorize label-based out-of-domain NLU datasets into six groups. Specifically, we gather publicly available NLP datasets from HuggingFace,² assigning each label-based dataset to a corresponding task group. We then divide the entire collection of datasets into two segments for the two phases in our framework: one segment for MC task training, and the other for zero-shot scenarios. It is noteworthy that confining the MC tasks to the MC training phase minimizes the need for intensive resource computing.

1) MC TRAINING DATASETS
The Multiple-Choice (MC) task involves selecting the correct answer from multiple candidate options based on related questions and passages. As shown in Table 1, we summarize the datasets used in the MC training phase. Additionally, we provide a brief introduction to them:
2) CommonsenseQA [22] is made for testing question answering that needs commonsense knowledge, not just a specific context or document.
3) Cos-E [23] compiles human-generated explanations for commonsense reasoning, encapsulated in both natural language sequences and specific annotations.
4) CosmosQA [24] is a comprehensive dataset designed for evaluating commonsense reading comprehension, presented in the form of multiple-choice questions. Unlike many existing datasets that concentrate on the factual interpretation of the text, CosmosQA extends the focus to understanding implicit meanings within a wide array of everyday stories, demanding reasoning that transcends the literal content.
5) DREAM [25] is the first dataset of its kind for multiple-choice reading comprehension, gathered from English tests made for Chinese learners. Unlike other datasets, DREAM focuses on understanding dialogues that have many turns and involve several parties.
6) MCTest [26] is a set of stories and questions that anyone can use for free. It is made for studying how machines understand text. Unlike earlier work that only looked at specific areas or goals, MCTest asks machines to answer questions about made-up stories. This helps test how well machines can understand text in general, not just in specific areas.
7) MultiRC [27] is a collection of short texts and questions. The questions need answers that come from putting together information from more than one sentence in the text. The dataset is special for three reasons: there is no set number of right answers for each question; the right answers do not have to be a specific part of the text; and the texts come from seven different areas like news, stories, and history, so there is a lot of variety.
8) OpenBookQA [28] is a unique dataset created for question-answering challenges in the field of machine learning. Drawing inspiration from open book exams, it serves to assess human understanding of specific scientific concepts.
9) PIQA [29] serves as a standard in evaluating reasoning based on physical commonsense, concentrating on real-world scenarios such as applying eyeshadow with a cotton swab or a toothpick.
10) QASC [30] represents a complex challenge in multi-hop reasoning, requiring the extraction of facts from extensive text and the assembly of these facts to respond to multiple-choice queries. What sets QASC apart is the marked facts within a broad corpus, coupled with the challenge that the breakdown into these facts is not readily apparent from the questions themselves.
11) RACE [31] is designed for the evaluation of reading comprehension. It is assembled from English examination materials targeting Chinese middle and high school students in the age range of 12 to 18.
12) Social IQa [32] represents a comprehensive benchmark, specifically crafted for the analysis of commonsense reasoning within social contexts. This benchmark encompasses a series of multiple-choice questions, aimed at evaluating emotional and social intelligence across various commonplace scenarios.
13) WikiHop [33] is a multi-hop question-answering dataset, constructed with entities and relations from WikiData, and supported by documents from WikiReading. The task is to predict the correct answer given a query and multiple supporting documents.
14) WIQA [34] is the first large-scale dataset for ''What if. . .'' questions in procedural text. The dataset includes paragraphs, each with multiple graphs showing how one change affects another. It also has 40,000 ''What if. . .?'' multiple-choice questions. There are three types of questions in the dataset: perturbations related to the paragraph, external perturbations needing commonsense knowledge, and irrelevant perturbations.

2) EVALUATION DATASETS
To evaluate the ability of models to perform tasks they have not been specifically trained for, we collect 19 datasets from areas not originally related to the models' MC training domain. These datasets, focused on NLU, are then organized by the specific tasks they represent. The details of the datasets for each task are as follows.

Natural language inference (NLI) determines whether a ''hypothesis'' (x_1) is true (entailment), indeterminate (neutral), or false (contradiction) given a ''premise'' (x_2).
1) ANLI (R1/R2/R3) [35] represents a large-scale benchmark dataset for NLI. This dataset is crafted through an iterative, adversarial procedure that engages both human evaluators and AI models, ensuring robustness and depth in the evaluation process.
2) CB [36] is a dataset comprising short texts, with at least one sentence in each text containing an embedded clause. These embedded clauses are annotated with the degree to which the author appears committed to the truth of that clause. The data contains examples excerpted from the Wall Street Journal, fiction from the British National Corpus, and Switchboard.
3) SNLI [37] is a collection of 570,000 human-authored English sentence pairs, grouped into three distinct categories: entailment, contradiction, and neutral.
4) MNLI-m/mm [38] is designed to facilitate the development and evaluation of machine learning models for sentence comprehension, and it provides a specific framework for assessing cross-genre domain adaptation.
5) QNLI [39] is a modified version of a question-answering dataset, formed by pairing each question with sentences from the corresponding context and filtering pairs with low lexical overlap.
6) RTE [16], [40], [41] is compiled from annual textual entailment challenges. Based on news and Wikipedia text, the datasets are converted into a two-class split, collapsing neutral and contradiction into not entailment for consistency.
7) WNLI [42] consists of pairs of sentences, where the second sentence is constructed by replacing an ambiguous pronoun in the first sentence with a possible referent.

Commonsense reasoning (Commonsense) involves the model's ability to infer conclusions using general knowledge, or what is often referred to as ''common sense''.
1) COPA [43] is a dataset that aids in assessing progress in open-domain commonsense causal reasoning.
2) Hellaswag [44] is a dataset that tests the ability of a machine to complete sentences logically.

Sentiment analysis (Sentiment) involves classifying the polarity of input text.
1) SST-2 [45] is a corpus containing 11,855 sentences extracted from movie reviews, fully labeled with parse trees for comprehensive sentiment analysis.
2) IMDB [46] serves as a robust resource for binary sentiment classification. This dataset not only significantly surpasses the size of prior benchmark datasets but also incorporates additional unlabeled material.

Coreference resolution (Coreference) involves the systematic grouping of textual expressions that correspond to the same entities in the real world.
1) WSC [42] (Winograd Schema Challenge) tests the ability of a system to reason with commonsense. It comprises pairs of sentences that contain ambiguous pronouns.
2) Winogrande [47] is a collection of 44,000 problems inspired by the Winograd Schema Challenge, adjusted to enhance scale and robustness against dataset-specific bias.
3) DPR [48] is made up of sentence pairs created by undergraduate students, following certain rules. The sentences cover topics from real events to made-up situations.

Text classification (Classification) is the task of assigning a class label to a specific text from a set of possible classes.
1) AG News [49] is a collection of more than 1 million news articles, gathered from over 2,000 news sources.
2) Dbpedia [50] is an ontology classification dataset composed of 14 distinct, non-overlapping classes from DBpedia 2014.

In practical applications, the requirements for AI systems extend beyond handling out-of-domain tasks. An effective AI must have the ability to manage previously unseen in-domain datasets as well. In evaluating the zero-shot capability of UniMC in dealing with in-domain tasks, we consider the following benchmarks.

1) MCScript [51] is the official dataset for SemEval-2018 Task 11, consisting of text passages related to daily life activities. Each passage is accompanied by questions, each with two answer choices.
2) ReClor [52] is a specialized reading comprehension dataset that focuses on logical reasoning, and is drawn from standardized graduate admission examinations.
3) OneStopQA [53] is a multiple-choice reading comprehension dataset annotated following the STARC scheme, utilizing Guardian articles from the OneStopEnglish corpus.

3) DETAILS OF UNIFIED FORMAT
As outlined in Sec. III-A1, we have developed a specific transformation rule that converts each type of task into a unified MC format, as shown in Table 2. In particular, since the datasets in commonsense reasoning tasks and coreference resolution tasks already follow the MC format, we simply adopt the original format in our experiments. Furthermore, we provide two examples that illustrate the procedure of converting raw text into input tokens.
An example of Social IQa (MC):
1) Raw text: {x_1: ''Riley treated the girls to dinner after they won the girl's soccer championship.'', ''question'': ''How would the girls feel as a result?'', ''options'': [. . .]}
2) Transformed text: ''. . . How would the girls feel as a result? Riley treated the girls to dinner after they won the girl's soccer championship.''
An example of SNLI (NLI):
1) Raw text: {x_1: ''A surfer showing off his talent in public using a man-made water machine.'', x_2: ''A man is cutting the grass.'', ''options'': [''we can infer that A man is cutting the grass.'', ''we can not infer that A man is cutting the grass.'', ''it is difficult for us to infer that A man is cutting the grass.''], ''answer'': ''we can not infer that A man is cutting the grass.''}
2) Transformed text: ''no we can infer that A man is cutting the grass. yes we can not infer that A man is cutting the grass. no it is difficult for us to infer that A man is cutting the grass. Based on the paragraph. A surfer showing off his talent in public using a man-made water machine.''
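The option-prompt rule behind the SNLI example above can be written as a short Python sketch; the phrasings mirror that example, while the full set of transformation rules is summarized in Table 2.

# Render each NLI label as an option that embeds the hypothesis,
# following the phrasings of the SNLI example above.
NLI_PROMPTS = {
    "entailment": "we can infer that {h}",
    "contradiction": "we can not infer that {h}",
    "neutral": "it is difficult for us to infer that {h}",
}

def nli_to_mc(premise, hypothesis):
    return {
        "options": [p.format(h=hypothesis) for p in NLI_PROMPTS.values()],
        "question": "Based on the paragraph.",   # the question prompt used above
        "passage": premise,
    }

print(nli_to_mc("A surfer showing off his talent in public using a man-made water machine.",
                "A man is cutting the grass."))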
B. EVALUATION
In our assessment, we follow the popular settings [8], [56], measuring model accuracy across all datasets to ensure a fair and consistent comparison of their effectiveness. When the baseline experiments are performed across multiple iterations, only the mean results are presented to maintain uniformity in evaluation. Furthermore, we include a basic random guessing strategy as a baseline.

C. IMPLEMENTATION DETAILS
To ensure a fair comparison, we configure the maximum token length to 512 across all experiments, following the common setting [17]. During the MC training phase, we adhere to a single epoch, in line with the settings used in FLAN [8]. We cap the number of samples for each dataset at 20,000 for both the English and Chinese model variants, with the goal of preventing any specific task from dominating the model. We also conduct each experiment five times, employing different seeds to ensure variability. All our experiments are executed on 8 NVIDIA A100 GPUs.
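A sketch of the per-dataset sampling cap described above is given below; the shuffling and seed handling are illustrative details rather than the exact procedure of our training code.

import random

# Cap each MC training dataset at 20,000 examples so no single task
# dominates, then mix the tasks into one training stream.
def cap_datasets(datasets, cap=20_000, seed=42):
    rng = random.Random(seed)
    capped = []
    for name, examples in datasets.items():
        examples = list(examples)
        rng.shuffle(examples)
        capped.extend(examples[:cap])
    rng.shuffle(capped)
    return capped

print(len(cap_datasets({"race": range(30_000), "dream": range(6_000)})))  # 26000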

V. EXPERIMENTAL RESULTS

A. MAIN RESULTS

1) NATURAL LANGUAGE INFERENCE
We showcase the primary findings from the Natural Language Inference (NLI) task, as outlined in Table 3.
Our UniMC achieves the best performance across various datasets, demonstrating its zero-shot capability in NLI. In particular, UniMC achieves this level of performance with a mere 235M parameters, in contrast with the hundreds of billions found in alternative baselines. These outcomes validate the efficacy of implementing a multiple-choice style as a means of format unification. Moreover, the bidirectional structure in UniMC strengthens its ability to capture information, as opposed to previous one-directional structures.

2) TEXT CLASSIFICATION
The task of text classification is centered on assigning an appropriate label or class to a given text, a goal that bears similarity to the nature of MC tasks. Given this correlation, we carry out a zero-shot text classification experiment to substantiate the capabilities of our model. As shown in Table 4, UniMC outperforms previous SOTA models by a remarkable margin. Specifically, the inclusion of 14 categories within Dbpedia introduces an additional layer of complexity to the classification challenge. However, UniMC possesses an intrinsic advantage in managing multiple classes, a trait attributable to the congruence between the options in a multiple-choice task and the class labels in classification, culminating in a 48.9% improvement.

3) A COMPREHENSIVE COMPARISON TO FLAN
We recognize that FLAN is renowned for handling option- or label-related zero-shot tasks, and its zero-shot generalization ability is particularly noteworthy. To more effectively illustrate the capabilities of UniMC, we present a detailed comparison between our model and FLAN, as depicted in Table 5. In the NLI task, UniMC outperforms FLAN overall, aligning with the findings in Table 3. Furthermore, we choose more tasks, such as sentiment analysis, commonsense reasoning, and coreference resolution, to delve into the generalization capabilities of our model. UniMC demonstrates a clear advantage over competing models on various datasets, including COPA, Hellaswag, Winogrande, WSC, and DPR, particularly in the evaluation of coreference resolution and commonsense reasoning tasks. Beyond these two tasks, we find that the construction of datasets emerges as a vital factor in performance. Generally, these datasets fall into two primary classes: the generation style and the understanding style. UniMC demonstrates improved performance on datasets aligned with the understanding paradigm, as its backbones are PMLMs. For the sentiment analysis task, the constraint on the number of categories renders the methodology of dataset construction less critical compared to commonsense reasoning and coreference resolution tasks. Consequently, UniMC and FLAN both exhibit commendable performance.

4) THE ANALYSIS OF THE IN-DOMAIN TASK
Given that we continually encounter novel data in the open world that is not present during training, we investigate the performance of our model on such unseen data. In Table 6, we compare our model with competitive baselines. These baselines are designed for training on question-answering tasks, including massive multiple-choice problems, and are subsequently evaluated on unseen data. Notably, the previous SOTA model, OLTQA, utilizes a frozen large-scale language model with 10 billion parameters, a retrieval model, a re-ranking model, and a QA model. These four models collectively learn knowledge and are then tested on unseen datasets. Therefore, in Table 6, we report performance on popular datasets that are not included in our or their training phase, namely MCScript [51], ReClor [52], and OneStopQA advanced [53]. As these datasets already follow the MC format, no further modifications are required to apply our UniMC model. We adhere to the settings, such as using the same test set and data processing methods, described in a previous study [60].
In summary, our method consistently delivers SOTA performance across all datasets. For instance, it improves on OLTQA by 42.3% on the OneStopQA dataset. Moreover, OLTQA, which represents a collaborative effort of multiple models, has been fine-tuned across 21 datasets, whereas UniMC has only been exposed to 14 datasets. This underscores the effectiveness of our method in domain-specific knowledge transfer.

B. ABLATION STUDIES
In this section, we aim to validate the essential role of core components within our UniMC framework, encompassing aspects such as the MC training, the prompt effect, and flow control strategies. Furthermore, we present an analysis of how variations in the model's complexity and the choice of backbone architectures impact its performance.

1) HOW IMPORTANT IS MC TRAINING?
Recall that our proposed UniMC takes advantage of O-MLM and OP to evaluate zero-shot tasks without any further tuning. To provide a clearer understanding of this design, we craft a variant named UniMC*, which specifically omits the MC training component. The results of this variant are presented in Table 7, where the performance of UniMC* is found to be closely aligned with what one might expect from a ''Random Guess''. This observation emphasizes the essential function of MC training.

2) HOW DOES THE PROMPT AFFECT THE PERFORMANCE?
Our framework aims to minimize the effort required in designing prompts. Therefore, we analyze the impact exerted by specific prompts, encompassing both question prompts and option prompts.
For the question prompts, as presented in Table 8, we conduct experiments on four challenging tasks, comparing the performance with and without question prompts. Although the performance varies across tasks, we hypothesize that this variation stems from the method of data construction. These datasets are mainly designed for two purposes: the language modeling task and the relationship choice task [2]. The requirement for question prompts escalates as the data aligns more closely with the language modeling task; conversely, the need diminishes when the data deviates from this task. Furthermore, we classify these datasets into two categories, spoken-based and written-based, according to the widely-used definition [61]. MNLI-m/mm, CB, SNLI, SST-2, and IMDB belong to the spoken-based corpus, while the rest of the datasets belong to the written-based corpus. Considering that a PMLM is usually pre-trained on written-based corpora, e.g., the pre-training datasets of BERT are Wikipedia and BookCorpus [3], our model may not need question prompts for written-based data. This, again, confirms that data construction affects the requirements of question prompts.
For the option prompts, we present the experimental results in Table 9. We would like to emphasize that option prompts are necessary for our UniMC; therefore, we cannot remove this component as in the above experiment. Instead, we design different option prompts to demonstrate their effects. We observe that different prompts show limited performance variations, indicating the robustness of our UniMC to option prompts. Since FLAN and PaLM are not open-sourced, we choose one of the most powerful open models, UnifiedQA-T5, as the baseline to ensure fairness in the comparison. In the experiment, we find that UnifiedQA-T5 is sensitive to option prompts, exhibiting a standard deviation (Std) of up to 8.3.

3) HOW DOES THE FLOW CONTROLLING AFFECT THE PERFORMANCE?
We design the prompt to frame the input sequences, allowing all datasets to fit directly into UniMC. However, some recent methods need extra processing, such as combining an option with its context (question and passage) into one sequence and aggregating multiple different sequences to get an answer [12]. To bridge this gap, we design two strategies to control the flow of information, as in Sec. III-A2. We summarize the performance of these two strategies in Table 10. We observe that UAMM adds the greatest improvement to the results, much larger than that of UE. On the one hand, UniMC can learn the positional relationship between options; on the other hand, UniMC can distinguish between options and context. However, UE is unable to prevent the mutual influence between options. Thanks to the self-attention mechanism, UAMM makes the options unreachable to each other, eliminating inter-option interference.

4) HOW DOES THE MODEL SIZE AFFECT THE PERFORMANCE?
A common intuition in this domain is that a larger model size results in better performance [5], [8], particularly for large-scale PLMs. Following the settings of ALBERT [17], UniMC employs various ALBERT models as the backbones, as illustrated in Table 11. Naturally, we expect our backbone PMLM to follow this rule as well. To validate this, we run an experiment varying the model size, as shown in Fig. 7. All four tasks show the same trend, confirming the aforementioned intuition.

5) HOW DO DIFFERENT BACKBONES AFFECT THE PERFORMANCE?
In our investigation into the efficacy of various backbone models within the UniMC framework, we replace ALBERT-xxlarge-v2 with RoBERTa-large. The comparative analysis, detailed in Table 12, reveals that ALBERT consistently surpasses RoBERTa across several domains, including but not limited to NLI, commonsense reasoning, sentiment analysis, and coreference resolution tasks. A simple explanation is that ALBERT-xxlarge-v2 [17] demonstrates superior performance compared to RoBERTa-large [18] in their respective papers. Additionally, our experimental findings suggest that tokenization may contribute to this difference. Since O-MLM aims to predict ''yes'' or ''no'', UniMC needs a stable tokenizer to accurately reconstruct these terms. Unlike ALBERT, RoBERTa uses a byte-level BPE tokenizer. Under a byte-level BPE tokenizer, a word's id does not depend only on the word itself but is also influenced by its position, e.g., whether it follows a space. Consequently, RoBERTa encounters challenges in handling the O-MLM and OP tasks during the MC training phase, resulting in an underperformance relative to ALBERT. Based on these insights and the superior outcomes associated with ALBERT, we have selected it as the default backbone model for all our experimental endeavors.
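This tokenizer effect can be checked with a few lines of Python (a sketch using the HuggingFace transformers package; the exact subword pieces depend on the vocabulary and are not asserted here):

from transformers import AutoTokenizer

# Under RoBERTa's byte-level BPE, ''yes'' is encoded differently depending
# on whether it follows a space, whereas ALBERT's SentencePiece vocabulary
# yields one stable piece in both cases.
roberta = AutoTokenizer.from_pretrained("roberta-large")
albert = AutoTokenizer.from_pretrained("albert-xxlarge-v2")

print(roberta.tokenize("yes"), roberta.tokenize(" yes"))  # e.g. ['yes'] vs ['Ġyes']
print(albert.tokenize("yes"), albert.tokenize(" yes"))    # the same piece either way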

VI. GENERALIZING UNIMC FRAMEWORK TO CHINESE LANGUAGE TASKS
In this section, we extend the UniMC framework to Chinese tasks, demonstrating the cross-lingual capabilities of our approach. Specifically, Sec. VI-A details the setup for our Chinese experiments. Then, we present the results for our DeBERTa-v2 model in Sec. VI-B, and discuss the robust performance of our UniMC-Chinese across various Chinese language tasks in Sec. VI-C.

A. EXPERIMENTAL PREPARATION

1) MC TRAINING DATASETS FOR CHINESE
Compared to English MC training datasets, Chinese datasets are less available, especially as open-source datasets. Therefore, in the MC training phase of tuning a Chinese variant (UniMC-Chinese), we include the widely-used Chinese MC datasets summarized in Table 13. We prepare them in the unified format to fit our models, as described in Sec. III-A1.
1) C3 [62] is a free-form multiple-choice Chinese machine reading comprehension dataset, which is collected from Chinese-as-a-second-language examinations.
2) ClozeT [63] is a manually annotated Chinese story cloze test dataset. The corresponding task is to select one sentence from two options that can logically complete a given story missing a sentence. The story dataset originates from children's stories crawled from the internet.
3) CMRC2019 [64] is a Chinese Machine Reading Comprehension (MRC) dataset, specifically designed for a new task known as Sentence Cloze-style Machine Reading Comprehension (SC-MRC). The primary objective of this task is to correctly insert the appropriate candidate sentence into a passage that contains several blanks.
4) GCRC [65] is a challenging dataset featuring high-quality multiple-choice questions, sourced from Gaokao Chinese (the Chinese subject of the National College Entrance Examination of China). This dataset is designed to address two major limitations observed in existing MRC datasets: the simplicity of defined tasks and the lack of explainable evaluation.

2) EVALUATION DATASETS FOR CHINESE
To assess the zero-shot capabilities of the model in a comprehensive manner, we select datasets that are frequently utilized in Chinese language evaluations. These datasets, originating from the zeroCLUE benchmark [66], encompass various constructs and types of tasks.
1) EPRSTMT (E-commerce Product Review Dataset for Sentiment Analysis), also referred to as EPR-sentiment, is a specialized sentiment analysis dataset derived from consumer reviews on an e-commerce platform. Each entry within the dataset is categorized as either positive or negative, reflecting the sentiment of the review.
2) CHID-FC (Multiple-Choice) constitutes a comprehensive large-scale Chinese IDiom cloze test dataset, encompassing diverse domains such as news articles, literary novels, and academic essays.

3) BASELINE METHODS
For Chinese benchmarks, we present the following large language models as the current SOTA methods in the Chinese language domain: GLM [67], ERNIE 3.0 Titan [68], and Prefix LM (Yuan 1.0) [69].
In our Chinese-related experiments, we employ the accuracy metric to evaluate performance. Our approach aligns with the methodologies utilized in our English language experiments to ensure methodological coherence across languages.

B. EXPLORING THE PERFORMANCE OF DEBERTA-V2-CHINESE
As detailed in Sec. III-C2, we introduce a robust pre-trained masked language model, DeBERTa-v2-Chinese, to the Chinese NLP community. To validate its efficacy, we conduct tests on the CLUE benchmark [70], specifically fine-tuning the candidate models across various datasets. By selecting RoBERTa-wwm-ext [71], a well-recognized model within the community, we establish a basis for comparison. As presented in Table 14, the results show that our DeBERTa-v2-Chinese outperforms the baselines across all datasets. Moreover, it demonstrates adaptability to diverse scenarios, including both single-sentence and sentence-pair tasks. With 119 categories, IFLYTEK presents a challenging evaluation of the model's performance. Nevertheless, the DeBERTa-v2-Chinese model overcomes this complexity, improving on the baseline score by 2.6% and thereby demonstrating the robustness of our pre-trained model.

C. EVALUATING UNIMC-CHINESE ON CHINESE TASKS
To investigate the generalization capabilities of the UniMC framework, we implement it within the Chinese linguistic context. English uses an alphabetic system, while Chinese operates with a logographic writing system that includes tens of thousands of distinct characters. Many of these characters, particularly those frequently used, carry specific and detailed meanings. This complexity is compounded by the lack of extensive datasets in the Chinese community, a situation that has led to a strong desire for universally applicable language methods. As shown in Table 15, UniMC-Chinese demonstrates performance that not only meets this need but even exceeds human abilities in zero-shot scenarios. It matches the performance of large language models that are a hundred times its size, earning a close second place in the rankings. The minimal difference from the top rank highlights the potential of the model. These findings show the language-agnostic nature of the UniMC framework, underscoring its applicability across diverse linguistic landscapes.

TABLE 15. The zero-shot results on Chinese tasks. In our comparison, large-scale language models that exhibit robust zero-shot capabilities in Chinese are selected. These models are systematically arranged in descending order based on their size. The scores surpassing human performance are in bold.

VII. DISCUSSION
Our model demonstrates superior effectiveness and generalizability, as evidenced by our experiments. Additionally, we identify two key advantages of our model. Firstly, the unified input approach expands the application scenarios of UniMC, enabling flexible problem settings and providing responses from multiple perspectives. This eliminates the need for separate models for different questions, unlike traditional PMLMs trained with predetermined queries. Secondly, our task-driven design eliminates the need for further tuning, ensuring consistency in performance. In contrast, previous models often have divergent learning objectives, resulting in performance oscillation. Our MC tuning approach provides a new perspective for aligning the training and inference phases in PMLMs, drawing parallels with the established alignment in PLMs.

VIII. CONCLUSION
This paper introduces the Unified Multiple-Choice (UniMC) framework, a new paradigm for zero-shot learning that enhances flexibility and generalization in NLU tasks. Through the use of the O-MLM and OP tasks in the MC tuning method, UniMC achieves top performance over existing SOTA models. With only 235M parameters, the framework's success across both in-domain and out-of-domain benchmarks highlights its potential as a robust tool for zero-shot tasks. Additionally, we have successfully adapted UniMC to the Chinese language environment, where it has exceeded human performance in multiple zero-shot tasks. The creation of UniMC-Chinese and its success on specific Chinese datasets such as EPRSTMT and CHID-FC further demonstrate the effectiveness and adaptability of UniMC. Future work will focus on further optimizing the framework's efficiency and exploring integration with other deep learning paradigms to create versatile and robust AI systems.
JUNJIE WANG is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, Waseda University. His research interests include natural language processing and multimodal learning.
PING YANG is currently a Senior Researcher with the International Digital Economy Academy (IDEA), where he brings a wealth of expertise and practical experience in the fields of artificial intelligence and machine learning.

FIGURE 2. Examples of the unified input text. These examples are drawn from various datasets during the zero-shot learning phase. The prompt text is underlined and the correct options are in bold.

FIGURE 3. An overview of the UniMC structure. The tokens [C], [S], and [M] are abbreviations of the tokens [CLS], [SEP], and [MASK]. The framework employs the Option Masked Language Modeling (O-MLM) and Option Prediction (OP) methods for handling multiple-choice tasks, while Masked Language Modeling (MLM) serves as an auxiliary tool for model learning. The samples are sourced from the RTE dataset [16].

FIGURE 4. An example of the Self-Attention Mask Matrix within the UniMC framework. Given the input sequence [C], [O-MASK], o^1_1, o^1_2, . . . , x_3, [S], it is important to note that the tokens of options cannot attend to each other.

FIGURE 5. The details of the O-MLM and OP tasks during the MC training phase. The tokens ''[O-MASK] 1'' and ''[O-MASK] 2'' are the same as the tokens in Fig. 3. Moreover, the tokens ''[O-MASK] 1'' and ''[O-MASK] 2'' correspond to options 1 and 2, respectively, serving as key elements that enable the model to perform multiple-choice tasks.

FIGURE 6. Incorporating the logit masking method within the OP task.

FIGURE 7. Zero-shot performances on several tasks with model variants.

TABLE 2. Prompt designs for out-of-domain English datasets. Inspired by the templates in FLAN [8], we design a simple rule to convert original text to a unified MC format.

TABLE 3. Zero-shot results on the natural language inference task. The best scores are in bold.

TABLE 4. Zero-shot results on the text classification task. The best results are in bold.

TABLE 5. A summary of the natural language inference, commonsense reasoning, coreference resolution, and sentiment analysis tasks. The best scores are in bold.

TABLE 6. Comparison with competitive baselines on unseen MC datasets. Except for UniMC, the results are from [60]. The best results are in bold.

TABLE 8. We report results of UniMC with and without question prompts. We present three tasks (NLI, Sentiment, Classification) because question prompts are not designed for the other tasks.

TABLE 9. Zero-shot results on the sentiment analysis task. ''Std'' indicates standard deviation. The best average results are in bold. The more stable performance is underlined.

TABLE 10. Zero-shot performance with different strategies to control the flow between options. ''UE'' indicates Updating Embeddings, including segment id embeddings and position id embeddings. ''UAMM'' means Updating Attention Mask Matrix. ''Improvement'' shows the accuracy improvement over random guessing.

TABLE 11. The configurations of the UniMC variants.

TABLE 13. Dataset statistics for the Chinese variant during its MC training phase. To ensure a balanced representation across tasks, the sample count for each task is capped at 20k.

TABLE 14. The results on TNEWS, IFLYTEK, OCNLI, and AFQMC. Bold text denotes the best result in each column.