MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation

Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We use MultiPL-E to create the first massively multilingual code generation benchmark, translating two popular Python benchmarks, HumanEval (Chen et al., 2021) and MBPP (Austin et al., 2021), into 18 additional languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex (Chen et al., 2021), CodeGen (Nijkamp et al., 2022), and InCoder (Fried et al., 2022). We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.


I. INTRODUCTION
Code generation models, also known as large language models (LLMs) of code, are deep neural networks trained on massive corpora of source code. Over the past few years, code generation models have demonstrated their utility on a wide variety of software engineering tasks, including test generation, documentation generation, and even synthesizing working programs from natural language descriptions [1], [3], [4], [5], [6]. New products built on code generation models, such as GitHub Copilot, 1 Amazon CodeWhisperer, 2 and Tabnine, 3 are growing in popularity with developers [7]. Although several code generation models are trained on multiple programming languages, they are typically only evaluated on a single programming language: Python. Machine learning researchers are familiar with Python: they have painstakingly constructed several Python code generation benchmarks [1], [2], [8], [9], and it is the best represented language in training datasets [1], [2], [9], [10]. However, we should also evaluate code generation models with other languages to support a wider variety of programmers. There is prior work on multi-language evaluation [6], but it uses perplexity as a proxy for performance, instead of benchmarks that check correctness.
In this article we present MultiPL-E, a system for translating code generation benchmarks from Python into new languages, and use it to propose the first massively parallel, multi-language benchmark for code generation. By "multi-language" we mean multiple programming languages: MultiPL-E supports 18 languages and is straightforward to extend with more. By "parallel", we mean that MultiPL-E produces parallel problems for each language, thus we can measure performance of a code generation model on a consistent set of problems across multiple programming languages. What makes MultiPL-E possible is that code generation benchmarks have unit tests to determine if the generated function behaves correctly.
MultiPL-E uses a suite of 18 compilers that translate Python benchmarks to each target language. 4 What makes this approach scale is that these are not full-fledged compilers. Each compiler must be able to translate four components from Python: (1) a function signature (name and arguments), (2) simple unit tests, (3) a comment describing the expected function behavior, and (4) type annotations if the target language is statically typed. Notably, the compiler does not have to translate the body of a function, since it is the job of the code generation model to synthesize it. Thus each MultiPL-E compiler is approximately 200 LOC and easy to build. MultiPL-E also includes a simple, rule-based tool that translates technical terms in comments to be more language appropriate, e.g., a Python list is approximately a C++ vector.
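To make the signature-translation component concrete, the sketch below translates a typed Python function signature to TypeScript. This is an illustrative stand-in, not MultiPL-E's actual code: the helper names and the tiny type table are our own, and a real compiler handles many more types (requires Python 3.9+ for the `ast` subscript layout).

```python
import ast

# Hypothetical mapping from Python annotations to TypeScript types;
# a real MultiPL-E compiler covers many more cases.
TYPE_MAP = {"int": "number", "float": "number", "str": "string", "bool": "boolean"}

def to_ts_type(annotation: ast.expr) -> str:
    """Translate a simple Python type annotation to a TypeScript type."""
    if isinstance(annotation, ast.Name):
        return TYPE_MAP.get(annotation.id, annotation.id)
    if (isinstance(annotation, ast.Subscript)
            and isinstance(annotation.value, ast.Name)
            and annotation.value.id == "List"):
        return to_ts_type(annotation.slice) + "[]"
    raise ValueError("unsupported annotation")

def translate_signature(python_src: str) -> str:
    """Compile the signature of the first function in python_src to TypeScript."""
    fn = ast.parse(python_src).body[0]
    assert isinstance(fn, ast.FunctionDef)
    params = ", ".join(f"{a.arg}: {to_ts_type(a.annotation)}" for a in fn.args.args)
    return f"function {fn.name}({params}): {to_ts_type(fn.returns)} {{"

print(translate_signature("def add_all(xs: List[int], y: int) -> int:\n    pass"))
# function add_all(xs: number[], y: number): number {
```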
MultiPL-E also includes a containerized sandbox that (1) compiles programs if necessary, (2) runs them with appropriate timeouts, (3) validates their results on unit tests, and (4) classifies each output as successful, syntax error, etc. Thus each language requires an evaluation script, which is typically about 20 LOC.
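The evaluation step can be sketched as follows for a Python target. This is a simplified stand-in for MultiPL-E's containerized sandbox (the function name and classification labels are ours): run the candidate program with a timeout and classify the outcome.

```python
import os
import subprocess
import sys
import tempfile

# Simplified sketch: write the program (prompt + completion + unit tests)
# to a file, run it with a timeout, and classify the result. The real
# system runs inside a container and also supports compiled languages.
def evaluate(program: str, timeout_sec: float = 15.0) -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True,
                                text=True, timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        return "Timeout"
    finally:
        os.unlink(path)
    if result.returncode == 0:
        return "OK"
    if "SyntaxError" in result.stderr:
        return "SyntaxError"
    return "Exception"

print(evaluate("def inc(x):\n    return x + 1\n\nassert inc(1) == 2"))  # OK
print(evaluate("def inc(x) return x"))  # SyntaxError
```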
We use MultiPL-E to translate two widely-used code generation benchmarks, HumanEval [1] and MBPP [2], into 18 languages. The 18 languages capture a broad spectrum of language features, application areas, and popularity, allowing us to explore the impact of these factors on model performance.
We use the multi-language parallel MultiPL-HumanEval and MultiPL-MBPP benchmarks to evaluate three state-of-the-art code generation models: Codex [1], CodeGen [3], and InCoder [4]. Our evaluation presents new insights into the effectiveness of code generation models, including:
1) Across models and benchmarks, code generation models perform extremely well on JavaScript, sometimes outperforming Python, even on benchmarks originally designed to evaluate Python performance. Codex also performs well on C++, Scala, and TypeScript.
2) There is no strong correlation between model perplexity and correctness of generated code, which suggests that perplexity may not be a good estimate of performance.
3) Code generation performance is correlated with language popularity, but some niche languages perform as well as more popular languages.
4) Code generation performance is sensitive to prompt design for both niche and popular languages.
5) Static type-checking neither helps nor hinders code generation model performance.
To summarize, our key contributions are:
• MultiPL-E: a suite of compilers and an evaluation framework for translating code generation benchmarks from Python into other programming languages. MultiPL-E translates unit tests, doctests, Python-specific terminology, and type annotations.
• Two parallel benchmarks for code generation in 19 languages encompassing a variety of programming paradigms, language features, and popularity levels.
• Explorations of language frequency effects, the impact of type annotations, and prompt translation sensitivity on code generation performance, along with a fine-grained error analysis for four languages.
We hope this evaluation work will help the many software engineers that do not use Python to assess the feasibility of code generation models for their work and to understand the language factors that may affect model performance.
MultiPL-E is presently being employed to evaluate code generation models for the BigCode project, which is an open collaborative effort to build code generation models on permissively licensed code. 5 The MultiPL-E system, dataset, and tutorial are available at github.com/nuprl/MultiPL-E.

II. CODE GENERATION
Code generation has long been a task of interest: there is extensive work on program synthesis [11], [12], [13] using both symbolic and neuro-symbolic approaches. More recently, large neural language models (LLMs) trained for text generation have demonstrated the ability to perform program completion [5], [14], [15]. Since 2020, several code generation models have been trained or fine-tuned on programming language text [1], [3], [4], [6], [16], [17]. Unlike traditional program synthesis techniques, neural language models are able to condition on and generate both natural language (i.e., code comments) and programming language text. However, existing code generation models are tested using monolingual benchmarks that largely target Python. Thus there is little quantitative data about how well they perform on other languages. We make progress towards answering this question by proposing two large-scale parallel benchmarks for code generation in 19 languages, which we use to evaluate three state-of-the-art models: Codex, CodeGen, and InCoder.

A. The Natural Language to Code Task
Code generation models have been applied to a variety of tasks, including test generation [18], docstring generation [19], code search [16], [20], type inference [21], [22], [23], and more [24]. We focus on the natural-language-to-code task (NL2Code): given the description of a function in natural language, complete the function body.
The input to a code generation model is called a prompt. Fig. 1(a) shows an example prompt from the HumanEval benchmark for NL2Code [1]. The prompt has several sources of information for the model: the function signature (its name and parameters); a brief comment describing the function; and, optionally, examples in the form of Python doctests. Given the prompt as input, the code generation model generates a completion that is likely to follow the given prompt. Note that the model does not receive an explicit cue about the target language, but each of the three prompt regions provides implicit cues: the syntax of the function signature, the terminology used in the natural language description, and the syntax of the doctests all suggest that the target is Python. Consequently, to translate this prompt to a new programming language, we must target all three regions of the prompt.
(Fig. 2 caption: Code generation models produce tokens, not ASTs, and may produce output beyond that requested; truncated output from InCoder given just the first highlighted line as the prompt.)

B. Sampling Program Completions
There are several ways to configure how a code generation model produces completions, each of which can have a significant effect on the quality of generated code. Fundamentally, a completion is a sequence of tokens, not an abstract syntax tree. Therefore, a completion can readily include tokens that go beyond a single function. For example, given just the signature of "mean", InCoder produces the mean, variance, standard deviation, and several other functions (Fig. 2). In fact, it can continue producing code up to the maximum sequence length, which, for InCoder, is 2048 tokens.
We control this output by specifying stop sequences that typically demarcate the end of a function. For Python, we use the stop sequences that have been employed in prior work [1]. For example, when completing a top-level function, \ndef marks the start of the next top-level function, but allows nested helper functions. For other languages, we design different sets of stop sequences.
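Truncation at stop sequences can be sketched as follows. The stop list mirrors the kind of Python stop sequences used in prior work [1]; the `truncate` helper is our own illustrative stand-in.

```python
# Stop sequences of the kind used for Python top-level functions in prior
# work [1]; each target language gets its own set.
PYTHON_STOPS = ["\ndef", "\nclass", "\n#", "\nif", "\nprint"]

def truncate(completion: str, stops: list[str]) -> str:
    """Keep the completion only up to the earliest stop sequence."""
    cut = len(completion)
    for s in stops:
        i = completion.find(s)
        if i != -1:
            cut = min(cut, i)
    return completion[:cut]

body = "    return sum(xs) / len(xs)\n\ndef variance(xs):\n    ..."
print(truncate(body, PYTHON_STOPS))  # keeps only the first function body
```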
Under the hood, given a prompt, a code generation model produces a completion one token at a time. At each step, the neural network receives an encoded prompt as input and produces a distribution for the following token. To generate several tokens, a sampling algorithm iteratively samples next tokens, extending the prompt at each step with the previously sampled token.
There are a variety of sampling approaches that one can use. A naive approach is to greedily select the most likely next token, but this performs poorly in practice [25]. One approach employed in prior work [1], and in this article, is to rescale the probability distribution to favor high probability tokens more strongly using a temperature hyperparameter (0 ≤ t < 1): low temperature makes the completion more "predictable" and high temperature makes it more "creative". This is commonly combined with top-p sampling (also called nucleus sampling), which cuts off the least likely tokens that contribute in aggregate 1 − p to the probability mass, and redistributes their mass to the remaining tokens. Since this procedure is nondeterministic, we sample 200 completions for every prompt and choice of hyperparameters.
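One temperature-plus-top-p sampling step can be sketched over a toy next-token distribution. This is an illustration of the procedure, not any model's implementation; real models work over logits for a large vocabulary.

```python
import random

def sample_next(probs: dict[str, float], temperature: float, top_p: float,
                rng: random.Random) -> str:
    # 1. Rescale with temperature: p_i^(1/t), renormalized. Low t sharpens
    #    the distribution toward the most likely tokens.
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    z = sum(scaled.values())
    scaled = {t: p / z for t, p in scaled.items()}
    # 2. Top-p: keep the most likely tokens whose cumulative mass reaches p,
    #    dropping the tail (the tokens contributing 1 - p in aggregate).
    kept, total = {}, 0.0
    for tok, p in sorted(scaled.items(), key=lambda kv: -kv[1]):
        kept[tok] = p
        total += p
        if total >= top_p:
            break
    # 3. Sample from the surviving tokens, with the tail's mass implicitly
    #    redistributed by sampling relative to `total`.
    r = rng.random() * total
    acc = 0.0
    for tok, p in kept.items():
        acc += p
        if r <= acc:
            return tok
    return tok

rng = random.Random(0)
probs = {"return": 0.6, "pass": 0.3, "raise": 0.1}
print(sample_next(probs, temperature=0.2, top_p=0.95, rng=rng))  # return
```

At temperature 0.2 the distribution is sharpened so strongly that "return" alone exceeds the top-p cutoff, illustrating why low temperature yields "predictable" completions.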

C. Evaluating Code Generation
Early work on code generation relied on textual similarity metrics for evaluation [16], [26]. However, previous work shows that textual similarity is not reliably correlated with code correctness [1], [2]. The best way to evaluate code generation is to test code correctness using a suite of hidden unit tests.
We translate two code generation benchmarks that include unit tests for every problem. Fig. 1(c) shows 3 of the 12 unit tests that accompany the problem from Fig. 1(a). Note that these unit tests are simple assertions: each test asserts that the output value produced by the function matches an expected value.
We judge a generated function correct if it passes all tests. Fig. 1(b) shows just one of the solutions generated by Codex for the example prompt. This solution is incorrect because it fails some of the unit tests ( Fig. 1(c)). Because the output of the code generation model is stochastic, it is common to sample multiple completions per problem and report an estimated pass rate (Section IV-B).

III. THE MULTIPL-E APPROACH
This section describes how we select and prepare languages and benchmarks for MultiPL-E.

A. Benchmark Selection
There are a number of existing single-language NL2Code benchmarks [8], [9], [10]. We choose to translate HumanEval [1] and MBPP [2] as two of the most widely-used benchmarks. HumanEval is a good choice of benchmark for several reasons. It is a diverse collection of 164 problems, where all problems have tests to check correctness, and most have examples or doctests as part of the prompt. All of the problems are functions that receive and return first-order values, which facilitates unit testing and test translation. Many also use Python's optional type annotations. Moreover, it is a challenging benchmark: the best model evaluated by Fried et al. [4] achieves only a 36% pass rate on Python.
MBPP is another large, commonly used benchmark of Python problems. As originally formulated, it is a little unusual. Each problem has a description and a list of assertions. The prompt for code generation includes both the description and the assertions, and the generated code is then tested with the same set of assertions. We argue that the HumanEval approach, where test cases are hidden, is a significantly better way to evaluate code generation. We therefore remove the assertions from the MBPP prompts so that we can use them as hidden unit tests. However, with only a problem description, a code generation model is free to make up the name of a function (or not even produce a function). Therefore, we mechanically augment every prompt with a function signature, based on the name and arity implied by the assertions. Fig. 3 shows an example of an original MBPP prompt and our modification.
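The signature-augmentation step can be sketched as follows: recover the function name and arity from the first `assert f(...) == expected` test. The helper name and the placeholder argument names are our own illustrative choices.

```python
import ast

def infer_signature(assertion: str) -> str:
    """Infer a function signature (name and arity) from an MBPP-style assert."""
    test = ast.parse(assertion).body[0]
    assert isinstance(test, ast.Assert)
    # The asserted expression is either `f(...) == expected` or a bare call.
    call = test.test.left if isinstance(test.test, ast.Compare) else test.test
    assert isinstance(call, ast.Call) and isinstance(call.func, ast.Name)
    args = ", ".join(f"arg{i}" for i in range(len(call.args)))
    return f"def {call.func.id}({args}):"

print(infer_signature("assert similar_elements((3, 4, 5), (5, 7, 4)) == (4, 5)"))
# def similar_elements(arg0, arg1):
```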

B. Programming Language Selection
MultiPL-E supports 19 programming languages, which we categorize into four frequency classes (NICHE, LOW, MEDIUM, or HIGH) based on a weighting of TIOBE rank and GitHub frequency (Table I). Eight of the languages in MultiPL-E had never been used before to measure NL2Code performance; this set includes newer languages (Julia and Swift), older scripting languages (Bash and Perl), and languages for specific applications (Lua and R). Half of the languages are statically type-checked. The broad range of languages in MultiPL-E shows the generality of our compilation approach and allows us to explore how language frequency and language features affect performance (Section VI). Finally, we ensured that we only selected languages for which the authors had enough expertise to confidently build a compiler and validate its results.
A key feature of MultiPL-E is that it is easy to extend with new models, benchmarks, and languages. To support new languages and benchmarks without manual (and error-prone) effort, we build 18 compilers to translate NL2Code benchmarks written in Python. Writing one of these compilers is straightforward when the target language is similar to Python, but requires care for statically typed languages and even some dynamically typed languages, notably Perl, Bash, and R. Section III-D discusses unsupported languages.

C. Compiling Python Benchmarks
A MultiPL-E compiler is significantly easier to build than a complete compiler. To translate a benchmark problem, we only need to compile function signatures and unit tests (not arbitrary statements and expressions). Our compilers preserve comments, since they contain the natural language description for the NL2Code task; however, we automatically rephrase them to replace Python-specific terminology.

1) Compiling Unit Tests:
MultiPL-E supports any unit test where the input and output to the test are first-order values. In Python, these include constants and data structures such as lists, tuples, and dictionaries, but exclude values such as lambda expressions. 6 HumanEval and MBPP unit tests apply the model-generated function to a first-order value, and compare the result with an expected first-order value. Each MultiPL-E compiler has a recursive function that compiles Python values to the target language's values. Even for a dynamically typed target, this value-to-value compilation requires care, because not all Python value types have perfect analogues in every target. For example, we compile both tuples and lists to JavaScript arrays, since JavaScript lacks a canonical tuple type. We also support dynamically typed targets where the compilation strategy is less obvious. For example, when the target is R, it may appear natural to compile Python lists to R lists: both kinds of lists can be nested and allow heterogeneous values. However, R's vector type is much more commonly used (data frames are made of vectors). Unfortunately, vectors must be homogeneous and cannot be nested, so not all Python lists can be translated to vectors. For example, an argument typed List[int] can be translated to a vector, but a nested list cannot. In order to more closely match the token distribution of idiomatic R code seen during training, our R compiler uses types (described below) to identify homogeneous list values and maps them to vectors using c(), even though R is dynamically typed.
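The recursive value-to-value translation at the heart of each compiler can be sketched as follows, here targeting JavaScript. The dictionary-to-Map encoding is one plausible choice, not necessarily MultiPL-E's; note how both Python lists and tuples become JavaScript arrays, as described above.

```python
def to_js(value) -> str:
    """Compile a first-order Python value to JavaScript source text."""
    if value is None:
        return "undefined"
    if isinstance(value, bool):  # must precede int: bool subclasses int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    if isinstance(value, (list, tuple)):
        # JavaScript has no canonical tuple type, so both become arrays.
        return "[" + ", ".join(to_js(v) for v in value) + "]"
    if isinstance(value, dict):
        items = ", ".join(f"[{to_js(k)}, {to_js(v)}]" for k, v in value.items())
        return f"new Map([{items}])"
    raise ValueError(f"no JavaScript analogue for {type(value)}")

print(to_js([(1, 2), (3, 4)]))       # [[1, 2], [3, 4]]
print(to_js({"a": True, "b": None})) # new Map([["a", true], ["b", undefined]])
```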
The final step of compiling tests is to choose an appropriate test for equality. The meaning of equality operators varies across programming languages. Python's == operator checks deep equality, i.e., item-by-item equality within data structures. Deep equality is the appropriate choice for unit tests. In some languages, we need to import equality-testing functions from testing libraries, as in the JavaScript example shown in Fig. 4.
2) Translating Types and Type Inference:
Compiling a function signature to a dynamically typed language is straightforward, but requires care when the target is typed. Most typed languages require argument and return type annotations, whereas Python's type annotations are optional. Thus, if a benchmark has type annotations, we can translate them to types in a target language. Fortunately, a large subset of the HumanEval benchmarks employs Python's optional type annotations, and we introduce type annotations to the few that do not. None of the MBPP benchmarks have type annotations. Instead of manually adding annotations to 400+ benchmarks, we infer the types of the values that appear in the MBPP assertions.
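The kind of type inference used for MBPP assertion values can be sketched as follows, assuming simple first-order values; the function name is ours, and heterogeneous or empty containers would need extra care in a full implementation.

```python
def infer_type(value) -> str:
    """Infer a Python type annotation string from a first-order value."""
    if isinstance(value, bool):  # must precede int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "str"
    if isinstance(value, list) and value:
        elem_types = {infer_type(v) for v in value}
        # Homogeneous lists get a precise element type; mixed lists fall
        # back to Any (which some typed targets cannot translate).
        return f"List[{elem_types.pop()}]" if len(elem_types) == 1 else "List[Any]"
    if isinstance(value, tuple):
        return "Tuple[" + ", ".join(infer_type(v) for v in value) + "]"
    raise ValueError("cannot infer type")

print(infer_type([1, 2, 3]))      # List[int]
print(infer_type((1, "a", 2.0)))  # Tuple[int, str, float]
```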
Translating types and typed values is subtly different for every language. For example, five HumanEval problems use types such as Any which cannot be translated to most traditional statically typed languages (e.g., C++ and Rust); we fail to compile these few problems to these languages. Another issue arises when compiling to languages with algebraic datatypes or discriminated unions. For example, consider translating the Python type Optional[int] to Rust, Swift, or Scala. The analogous type in the target language is an algebraic datatype, which means that when a Python number n has type Optional[int], it must translate to the value Some(n). Optional values are very common in Python benchmarks, and we use this approach extensively.
Finally, many typed languages require type annotations in data structures, which appear in unit tests. For example, C++ vectors require an annotation specifying their element type, and numbers in Rust (sometimes) require a type suffix. We perform limited local type inference to calculate these types from the type of the function signature to ensure that the unit tests always compile successfully.
3) Translating Doctests:
Python doctests are a standard format for examples in documentation. While many of the HumanEval prompts include examples, not all of them are validly formatted doctests. We standardize examples to the Python doctest format (lines prefixed with ">>>"). We apply value-to-value compilation to the doctests as we do for unit tests. However, since not all languages have an equivalent doctest format, we keep the Python format for all target languages.

4) Translating Python Terminology in Prompts:
Different programming languages use different terminology to refer to the same concept. For example, a Python list is closest to a JavaScript array or a Rust vector. To mitigate the impact of these differences, we identify Python-specific terminology in the natural language portion of the prompt, and translate it to the most natural equivalent for the target language. Fig. 5 shows an example of a prompt translated from Python to Perl. Notably, Perl not only lacks Booleans, but uses 1 for true and the empty string for false.
We conservatively avoid translating number types. Although some target languages use different terms for floats and integers, the term integer is commonly used in a mathematical sense rather than in reference to the Python type.
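A rule-based term translator of this kind can be sketched as follows; the term tables below are illustrative, not MultiPL-E's actual curated rules.

```python
import re

# Illustrative per-language tables of Python terminology and its closest
# target-language equivalent; real rules are curated per language.
TERMS = {
    "javascript": {"list": "array", "dictionary": "object"},
    "rust": {"list": "vector", "dictionary": "HashMap"},
}

def translate_terms(prompt: str, target: str) -> str:
    """Rewrite Python-specific terms in a prompt for the target language."""
    out = prompt
    for py_term, new_term in TERMS[target].items():
        out = re.sub(rf"\b{py_term}\b", new_term, out)  # whole words only
    return out

print(translate_terms("Return the list of unique values.", "rust"))
# Return the vector of unique values.
```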

D. Limitations of Our Approach
A handful of benchmarks cannot be easily translated using the MultiPL-E approach. Of the 164 original HumanEval benchmarks: (1) we exclude 3 benchmarks that have Python helper functions in their prompt; (2) we modify 2 benchmarks to use unit tests instead of randomized testing; and (3) for certain typed languages, we fail to compile up to 5 benchmarks with untranslatable types. These changes do not lead to significantly different results for Python (Section V-B1).
Our approach can be generalized to additional programming languages, so long as the target language has natural analogues for the Python data types used in the benchmarks. We do not include two previously studied languages, C [6] and SQL [27], [28], because they do not meet this criterion. SQL queries and tables define relations, whereas our benchmarks define functions. The problem with pure C is that it has no canonical types for dictionaries and lists. This is unlike C++, which has STL data structures, and our C++ prompts include the STL headers. A prompt for C would have to include one of several non-canonical libraries and would be very sensitive to the choice made.

E. Validating Prompts and Test Suites
Since we automatically generate thousands of prompts and test suites, we validated the quality of generated prompts and test suites in several ways.
First, we identified a subset of 23 problems on which we manually validated the generated prompt and test cases for every language. We selected these problems because they exercised a variety of syntactic and semantic features. For example, we selected problems that used different types and compositions of types, including lists, dictionaries, tuples, unions, the any type, the optional type, and nested types. We also selected problems with test cases that used particular kinds of values, including empty lists, the None value, and literal newline characters in strings. In addition, we performed language specific validation when necessary, e.g., with Bash.
Second, on the aforementioned 23 problems, we manually checked several sample completions produced by Codex. This helped us ensure that the solutions were in the right programming language, and develop the set of stop tokens for each language.
Finally, we validated our evaluators by writing test cases for every language, including tests that we expected to fail. Moreover, since we classify failures as syntax errors or runtime errors, we constructed tests that failed in both ways.

IV. CODE GENERATION MODELS
We evaluate three state-of-the-art code generation models, each of which uses a Transformer architecture [29] and is trained with a language modeling objective on a mixture of natural language and code. We evaluate the largest, best-performing version of each of these three models.

A. Models
InCoder. InCoder [4] is a 6.7B parameter language model trained using a causal masking objective [30]. It supports both code infilling and code completion; we test only the latter. InCoder was trained on 159 GB of deduplicated, filtered code from GitHub (around a third in Python) and 57 GB from Stack Overflow.
CodeGen. CodeGen is a 16.1B parameter language model trained with a next-token prediction objective. We evaluate the multilingual CodeGen model, which was trained first on The Pile [31], an 825 GB dataset of mostly natural language text with around 8% GitHub-scraped code. The model was further trained (fine-tuned) on a subset of the BigQuery code dataset covering six programming languages (C, C++, Go, Java, JavaScript, and Python). 7
Codex. Codex is a GPT-3 language model fine-tuned on code. Chen et al. [1] describe a 12B parameter version of Codex fine-tuned on 159 GB of deduplicated, filtered Python code from GitHub. We use the more recent 175B parameter code-davinci-002 model, which is trained on multiple languages. Details of its training set are not publicly known. We use the OpenAI API to query Codex.

B. Metrics
For each language, we calculate pass@k using the methodology employed by [1] and subsequent work. Intuitively, pass@1 is the likelihood of the model producing a completion that passes all unit tests, pass@10 is the likelihood of any one of 10 completions passing all unit tests, and so on. We calculate pass@1 with temperature 0.2, and use temperature 0.8 for pass@10 and pass@100. For statistical reliability, we take 200 completions at each temperature and calculate average pass rate using the unbiased sampling estimator presented in [1]. 8
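The unbiased estimator from Chen et al. [1] computes, for n samples of which c pass, the probability that at least one of k drawn samples passes: pass@k = 1 − C(n−c, k)/C(n, k). A numerically stable product-form sketch:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c passing [1]."""
    if n - c < k:
        # Fewer failures than draws: some draw must be a passing sample.
        return 1.0
    # Product form of 1 - C(n-c, k) / C(n, k), avoiding huge factorials.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# With 200 samples and 50 passing, pass@1 is just the raw pass rate:
print(round(pass_at_k(200, 50, 1), 3))   # 0.25
print(round(pass_at_k(200, 50, 10), 3))  # much higher than pass@1
```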

V. EVALUATION
In this section, we present the results of evaluating Codex, InCoder, and CodeGen on MultiPL-HumanEval and MultiPL-MBPP. We fit mixed-effects models to evaluate the statistical significance of the differences between groups that we report below [32]. Appendix C, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TSE.2023.3267446, has a full description of each statistical model with its estimate table.

A. Effect of Model Size and Training Data
We expect to find that Codex will outperform CodeGen, and CodeGen will outperform InCoder. First, the most important factor is model size: Codex has 175B parameters, CodeGen 16.1B, and InCoder 6.7B. A larger model has higher capacity and can be trained effectively on more data. Second, the training time and data also matter. The CodeGen corpus is nearly 2 TB, the InCoder corpus 159 GB, and the size of the Codex corpus is unknown. Apart from size, the training data for InCoder and CodeGen have different compositions of programming languages: CodeGen is trained on six languages, while InCoder is trained on more, but most of these extra languages have negligible quantities of data. Finally, we expect InCoder's performance on left-to-right generation to suffer slightly because a portion of its training is on fill-in-the-middle problems. Although the InCoder authors did not perform this ablation, subsequent work by some of the authors of InCoder and this article shows that training to fill-in-the-middle has a small impact (1%) on pass@100 rates [33].

B. MultiPL-HumanEval Results
We explore the code generation abilities of the three models on our translated version of HumanEval, MultiPL-HumanEval. Fig. 6 shows the by-language performance of each model on both benchmarks. Note that the Codex model is over 10× and 20× larger than CodeGen and InCoder respectively, so it performs much better as expected.
We find reliable differences between Codex pass@1 rates on MultiPL-HumanEval for Python and all but 4 languages: C++, JavaScript, Scala, and TypeScript. For these languages, Codex performance is similar to Python.
CodeGen performs best on the languages included in its fine-tuning dataset (Python, JavaScript, Java, C++, and Go). It also performs well on TypeScript, likely due to its similarity to JavaScript. A mixed-effects model finds reliable differences in pass@1 rates on MultiPL-HumanEval between all languages and Python, except Ruby, where performance is so low that the model fails to find a reliable estimate.
InCoder performs significantly better on the Python version of MultiPL-HumanEval than all of the other language versions (p < 0.001 for all languages).

1) Python Results and Replication:
Our InCoder results on Python exactly replicate its previously reported performance on HumanEval [4]. We measure a slightly higher pass@1 rate for CodeGen than what is reported in [3] (19.2% compared to 18.3%). 9 These findings show that the small standardization changes we made to the HumanEval benchmarks do not significantly affect model performance.
We evaluate a more recent Codex model (code-davinci-002) than the original article and observe a large improvement on Python: a pass@1 rate of 45.9%, compared to 28.8% reported earlier [1]. Our pass@1 rate on the Python HumanEval subset replicates what is reported for code-davinci-002 in [34]. 10

2) Codex Performs Best on JavaScript:
Codex's performance on JavaScript is better than its performance on Python, though the difference is not significant (+2.3%; p = 0.43). Codex also achieves a pass@1 rate higher than 40% on C++, Java, TypeScript, PHP, Ruby, Rust, Scala, and Lua. Since its training set is not public, we cannot independently verify the cause. CodeGen also performs well on JavaScript and TypeScript, though the latter is not included in its fine-tuning dataset. InCoder's performance is the weakest. Like the other models, it performs better on more frequently-used languages (Python) than less popular ones. However, unlike Codex and CodeGen, it does not match its Python performance on any other language.
3) Performance by Language Frequency:
Fig. 7 shows MultiPL-HumanEval pass@1 rates for each model, grouping languages by frequency. All three models perform best on high frequency languages.
Although we find reliable differences in Codex pass@1 rates between the MEDIUM, LOW, and NICHE languages when compared to the HIGH category (p = 0.006; p < 0.001; p = 0.002), we observe that Codex performs very well on some LOW and NICHE languages. For instance, Lua is the 9th-best language in our Codex evaluation, although it only appears in 0.2% of GitHub activity and is not in the TIOBE Top-20. CodeGen also performs well on Scala, Rust, and Julia.
Our evaluation therefore shows that contemporary code generation models may be useful even for developers working with less commonly used programming languages.

4) Perplexity and Code Correctness Do Not Correlate:
Xu et al. [6] report Codex perplexity scores for 11 of our 18 languages. We do not observe a strong correlation between Codex pass@1 scores and their perplexity scores (Fig. 8). Notably, perplexity is highest for JavaScript and TypeScript, while we find that Codex performs best on these languages. Therefore, perplexity may not be a reliable evaluation metric for NL2Code. One caveat is that [6] likely evaluate an older Codex model, since they report substantially lower pass rates for Python.

C. MultiPL-MBPP Results
As with MultiPL-HumanEval, CodeGen performs best on the MultiPL-MBPP languages included in its fine-tuning set: Python, JavaScript, C++, Java, and Go. It performs almost as well on TypeScript as on JavaScript. A mixed-effects model finds significant differences in CodeGen pass@1 rates between Python and all languages except Ruby, where performance is so low that the model fails to find a reliable estimate.

Unlike with MultiPL-HumanEval, on MultiPL-MBPP, InCoder's performance on TypeScript, JavaScript, and PHP actually exceeds its performance on Python. InCoder's Python pass@1 rate is similar on MultiPL-HumanEval and MultiPL-MBPP, one of the few instances where MBPP performance is not considerably better than HumanEval. A mixed-effects model finds significant differences in InCoder pass@1 rates for all languages, with positive coefficients for TypeScript, JavaScript, and PHP.
Note that we do not have any doctests in MBPP (Section III-A). The performance differences between MultiPL-MBPP and MultiPL-HumanEval on certain languages may relate to this, as we discuss in more detail in Section VI-A.
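For illustration, a HumanEval-style prompt pairs a signature with a docstring containing doctests, whereas an MBPP-style prompt carries only a natural-language description and an inferred signature. The toy problem below is our own, not drawn from either benchmark:

```python
import doctest

# A HumanEval-style prompt: a signature plus a docstring with doctests.
humaneval_style = '''\
def add(a: int, b: int) -> int:
    """Return the sum of a and b.

    >>> add(2, 3)
    5
    >>> add(-1, 1)
    0
    """
'''

# An MBPP-style prompt: a description and an inferred signature, no doctests.
mbpp_style = '''\
"""Write a function to add two numbers."""
def add(a, b):
'''

# The doctest module recognizes the >>> examples in the first prompt.
examples = doctest.DocTestParser().get_examples(humaneval_style)
```

The doctests in the first prompt give the model concrete input/output examples; the second prompt offers no such cues, which is the difference the ablation in Section VI-A probes.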

1) MBPP Is Less Challenging Than HumanEval: MBPP appears to be a less challenging benchmark than HumanEval. The MultiPL-MBPP pass@1 rate is higher than the MultiPL-HumanEval pass@1 rate for all but 6 of our 57 model/language pairs. This is despite the fact that MBPP does not provide doctests in any prompts, which, as we show in Section VI-A, affects performance for some languages.
This suggests that HumanEval may be a more useful benchmark suite for the community than MBPP: it provides an equally good or better indication of model performance with a smaller, more computationally efficient set of problems.

2) Python Results and Replication:
Our Python MBPP pass@1 rates for Codex are slightly higher than those reported in [34] (60.3% compared to 58.1%). The authors of [34] prompt with a function signature and docstring, even though the original MBPP problems do not include function signatures; we also include function signatures, which we infer from the provided test cases.
Our Python MBPP results for InCoder are lower than those reported in the original article (15.5% compared to 19.4%) [4]. We calculate pass@1 rates for MBPP differently than Fried et al. [4] in two key ways. First, since the original MBPP benchmarks do not include function signatures, Fried et al. [4] prompt InCoder with the MBPP docstring only. We infer function signatures for MBPP problems from the provided test cases, as described in Section III. Second, Fried et al. [4] report computing pass@1 rates for MBPP using a single completion, rather than computing the unbiased sampling estimator with 200 samples described in Chen et al. [1], as we do. We suspect this leads to inflated pass@1 rates.
3) Performance by Language Frequency: Fig. 9 shows model performance on MultiPL-MBPP by language frequency. As with the MultiPL-HumanEval benchmark, models generally perform better on more common languages. However, Codex performance on MultiPL-MBPP is robust on many MEDIUM, LOW, and NICHE languages, such as Lua and Scala. CodeGen performs surprisingly well on the D translation of MBPP, even though D is a niche language not included in its fine-tuning dataset.
We find reliable differences in Codex pass@1 rates between languages in the MEDIUM, LOW, and NICHE categories when compared to the HIGH category (p = 0.007; p < 0.001; p < 0.001).

4) Comparing Model Performance by Problem:
Although Codex outperforms CodeGen and InCoder, one may wonder whether the latter models can solve problems that Codex cannot. This type of comparison is subtle, since all of these models are non-deterministic, and the pass@k metric estimates the likelihood of a prompt producing a working program. We therefore consider the subset of prompts for which Codex never produces a working program: exactly 3,000 of over 10,000 prompts, across all languages and benchmark suites. For these prompts, we calculate the 95% confidence interval for pass@1 with CodeGen and InCoder and find that 3% of them have pass@1 > 0. Moreover, the mean pass@1 for this 3% of prompts is 0.16 for CodeGen and 0.14 for InCoder. Thus the likelihood of CodeGen or InCoder solving a problem that Codex cannot is very low.
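The pass@1 figures throughout this evaluation use the unbiased estimator of Chen et al. [1]. A minimal sketch (the function name is ours):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator of Chen et al. [1]: the probability that
    at least one of k samples, drawn without replacement from n generated
    completions of which c pass the unit tests, is correct.

    Computes 1 - C(n-c, k) / C(n, k) as a numerically stable product.
    """
    if n - c < k:
        # Fewer failing completions than samples: some sample must pass.
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))
```

With n = 200 samples per prompt, as in our experiments, this estimator has much lower variance than computing pass@1 from a single completion.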

D. Summary
On the whole, our results replicate previously reported model performance on code generation for Python. We benchmark three state-of-the-art models on 18 additional languages, most of which have never been evaluated before. Surprisingly, we find remarkably good model performance on some lower-frequency languages, such as Lua. We also find that performance on JavaScript and TypeScript is consistently high and occasionally exceeds Python, even though the benchmarks we explore originated in Python.

E. Cost
We use CodeGen and InCoder "off-the-shelf" from Hugging Face Transformers and estimate that we spent about two GPU-years of compute on a variety of Volta and Ampere GPUs. We used the Codex model when it was in limited beta; it is now available from Azure at 10 cents per 1,000 tokens. The total length of all Codex completions in our dataset is nearly 1.5 billion characters. With the OpenAI estimate of 0.25 tokens per character, our experiment would have cost approximately $37,000 if we had had to pay for it.
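The cost estimate follows from simple arithmetic, assuming the rule of thumb of roughly 0.25 tokens per character (about 4 characters per token):

```python
# Back-of-the-envelope cost estimate for the Codex completions.
characters = 1_500_000_000   # total length of all completions (nearly 1.5B)
tokens_per_char = 0.25       # rule of thumb: ~4 characters per token
usd_per_1k_tokens = 0.10     # Azure Codex price: 10 cents per 1,000 tokens

estimated_cost = characters * tokens_per_char / 1000 * usd_per_1k_tokens
print(f"${estimated_cost:,.0f}")  # roughly $37,000
```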

VI. FACTORS IN CODE GENERATION SUCCESS
In this section, we explore factors that impact code generation success. Focusing specifically on the MultiPL-HumanEval benchmark suite, we report results from a number of follow-up experiments, including an ablation study of MultiPL-E's translation components and finer-grained examinations of language features and prompt translation choices. We also provide a fine-grained analysis of the kinds of errors that arise in NL2Code across several languages.

A. Ablation Study
Our compilers target multiple distinct regions of the prompt for each problem. We explore the impact of each component with an ablation study (Fig. 10). Translating these components appears more important for certain languages: Bash, PHP, Perl, R, Rust, Swift, and TypeScript.
We note that six of these languages are ones where Codex does not perform substantially better on MultiPL-MBPP than MultiPL-HumanEval (Fig. 6). The performance degradation observed for these languages when doctests are removed from the MultiPL-HumanEval problems suggests that the worse-than-expected performance on MultiPL-MBPP could be due to the lack of doctests in that benchmark suite.
Overall, we find significant differences between the Full Translation and Test-Only Translation experiments (p = 0.03), and between No Doctests and Test-Only Translation (p < 0.001), but not between No Translation and Test-Only Translation (p = 0.2). This suggests that the Python terminology translation has a small but reliable effect, and that the presence of the doctests is important, though their translation is not.

B. Explicitly Prompting With Language Name
Our prompts do not explicitly specify the name of a programming language; instead, they rely on the models to infer the desired language from other cues in the prompt (Section II-A). We run a small ablation study with Codex using three programming languages on which Codex performs poorly: D, R, and Racket. For each of these languages, we take the original prompt (specifically, the full translation that translates doctests and terminology) and add "Generate in language" as a comment before the original prompt. Table II shows pass@1 rates with and without this explicit language cue. The results are inconclusive across languages: Codex's performance on Racket improves slightly, but is slightly worse on D and R.
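Prepending the cue must respect each language's comment syntax. A minimal sketch of this prompt transformation (our own illustration; the exact wiring in MultiPL-E may differ):

```python
# Single-line comment markers for the three languages in this experiment.
COMMENT_MARKER = {"D": "//", "R": "#", "Racket": ";"}

def with_language_cue(language: str, prompt: str) -> str:
    """Prepend an explicit 'Generate in <language>' comment to a prompt."""
    marker = COMMENT_MARKER[language]
    return f"{marker} Generate in {language}\n{prompt}"
```

For example, `with_language_cue("Racket", prompt)` places `; Generate in Racket` above the translated prompt, mirroring the variant evaluated in Table II.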

C. Type Annotations
One may conjecture that type annotations improve model performance by constraining the code generation search space. Or, perhaps, they might hurt performance by complicating the task, since the model must generate correct type annotations.
In Figs. 7 and 9, the dashed lines in each category separate languages with type annotations (left) from languages without (right). We observe no overall effect of type annotations on Codex pass@1 rates on MultiPL-HumanEval (p = 0.33) or MultiPL-MBPP (p = 0.23).
To explore the impact of type annotations at a finer-grained level, we run a series of follow-up experiments using the MultiPL-HumanEval benchmark suite. We focus on two languages: Python, which allows optional type annotations, and TypeScript, a gradually typed cousin of JavaScript. Gradual typing allows us to weaken type annotations or even configure the TypeScript compiler to ignore all type errors.
1) Precise Types Improve TypeScript Performance: TypeScript has an "any" type, which is compatible with all types. We run Codex on a variation of the MultiPL-HumanEval TypeScript prompts where all types in the function signature are translated to "any". We find that the loss of precise types hurts performance on TypeScript (-2.5%; p < 0.001).
2) TypeScript Type Errors Correlate With Runtime Errors: Even gradual type-checking can reject programs that would in fact run without error. We run the Codex-generated TypeScript programs without first checking types. We observe no significant difference in pass@1 rates (p = 0.14), suggesting that typed programs are rejected for genuine errors.
3) Type Annotations Do Not Affect Python Performance: We run a similar experiment with Codex and Python, where we remove all type annotations from the MultiPL-HumanEval prompts. We find that this has no significant effect on Codex's pass@1 rate for Python (p = 0.23).
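The annotation-removal variant can be produced mechanically. A sketch using Python's ast module (our own illustration; not the exact MultiPL-E implementation):

```python
import ast

def strip_annotations(source: str) -> str:
    """Remove parameter and return type annotations from every function
    definition in a Python source string."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.returns = None
            all_args = node.args.args + node.args.posonlyargs + node.args.kwonlyargs
            for arg in all_args:
                arg.annotation = None
    return ast.unparse(tree)
```

For example, `strip_annotations("def add(a: int, b: int) -> int: ...")` yields a definition beginning `def add(a, b):`, leaving the docstring and body untouched.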
We interpret these results as evidence that type annotations do not guide search in general, since they do not improve Python performance, but that informative types are necessary for languages where type annotations are standard, perhaps in order to fit the token distribution of high-quality typed code seen in training.

D. Sensitivity to Compilation Choices
Each MultiPL-E compiler makes small choices about how to translate prompts that could have an impact on performance. We explore some of these choices below.
1) Comment Style Affects Performance: Most programming languages have several comment styles (e.g., single-line vs. multi-line). To investigate their impact, we ran follow-up experiments with Codex on the MultiPL-HumanEval benchmark suite for two languages: PHP (MEDIUM) and Racket (NICHE). Our original prompts use single-line comments for both PHP and Racket, following conventional style. We re-ran Codex on versions of the MultiPL-HumanEval problems for both languages using multi-line comments instead. This improves the pass@1 rate for Racket (+1.9%, p < 0.001), but decreases it for PHP (-3.1%, p = 0.001).
2) Naming Arguments Improves Performance for Perl: Functions in Perl do not have formal named arguments. Instead, all arguments are passed in a special array. Our compiler to Perl produces a prompt that pops elements off the special array and names them, with the expectation that this would improve model performance.
We ran a follow-up experiment on a version of the MultiPL-HumanEval problems for Perl where we omit this argument-naming prompt, so the model has to infer everything about arguments from the natural language description and examples. This significantly lowers Codex's pass@1 rate (-8%; p < 0.001).
In summary, our results show that code generation performance can be sensitive to prompt engineering choices for both high and low frequency languages.

E. Impact of Language Features
One challenge of extending existing benchmarks to new programming languages is that not all programming languages have the same features. Although the MBPP and HumanEval benchmarks consist of relatively simple functions, they exercise a variety of datatypes, not all of which have a straightforward equivalent in all programming languages in our dataset. To explore whether model success is impacted by the Python language features used in the program, we categorized all problems from the HumanEval benchmark suite into groups based on the Python language features used in their type annotations: Booleans, dictionaries, lists, tuples, or none of the above. Fig. 11 shows the performance by language on each type of problem. A mixed-effects model finds no significant effect of problem type, when programming language is treated as a random effect.
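Our grouping can be approximated by inspecting the type names that appear in each problem's Python signature. A hedged sketch (the helper name and the priority order among features are our own assumptions):

```python
import ast

def feature_group(signature: str) -> str:
    """Classify a Python signature by the datatypes named in its
    annotations: 'dict', 'tuple', 'list', 'bool', or 'none'."""
    names = {node.id for node in ast.walk(ast.parse(signature))
             if isinstance(node, ast.Name)}
    # Check container types first, then Booleans; accept typing-module
    # names (List, Dict, ...) as well as builtin names (list, dict, ...).
    for feature in ("Dict", "Tuple", "List", "bool"):
        if feature in names or feature.lower() in names:
            return feature.lower()
    return "none"
```

For instance, a stub like `def f(xs: List[int]) -> int: ...` falls in the list group, while `def f(n: int) -> int: ...` falls in none of the above.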
Many models appear to struggle with questions involving tuples. Some of these failures occur in languages that lack a native tuple type, such as Java. However, JavaScript performs well despite also lacking tuples.
Although many languages show poor performance on dictionary problems, there are only 3 problems in this category, so these results should be interpreted with caution.

F. Fine-Grained Error Analysis
Code generation models generate many more failing programs (programs that produce errors or fail to pass unit tests) than programs that run successfully. This section presents a detailed evaluation of errors present in the Codex-generated completions for MultiPL-HumanEval problems in four languages: Python, C#, Swift, and Racket. See Appendix D, available in the online supplemental material, for a full categorization.
We first identified specific error labels for each language and then grouped them into themes (e.g. "NullReference"). We produced five general error categories: RUN-TIME, STATIC, TYPE, LANGUAGE, and MODEL. We group similar error sources together across languages, even if they occur in different contexts: for example, calling a function with a value of the wrong type may fail at compile-time or run-time, depending on the language's type system.
The most common STATIC theme across all languages is "UndefinedIdentifier", which contains errors related to referencing non-existent terms (e.g., Fig. 12(a)). These errors can be caused in many ways: calls to functions not in the local context, use of Python-like keywords, or calls to methods from external libraries that were not imported.
Some errors in the RUNTIME category mimic those we expect from software engineers (e.g., index-out-of-range errors). However, others are unlike human mistakes. Notable themes in the latter group (MODEL) include generating code that throws exceptions on purpose (e.g., Fig. 12(b)) and generating code in an entirely different language. For instance, Codex frequently generates Markdown code for Racket problems. Although we don't have access to the Codex dataset, we suspect that Racket is not well-represented in the dataset. We posit that these errors occur because Racket files begin with a language declaration (#lang racket) that is easily mistaken for a Markdown heading (e.g., Fig. 12(c)).
Finally, the category LANGUAGE includes multiple themes related to the specifics of the target language itself. The "LanguageSpecific" theme contains idiosyncratic errors such as the requirement of labeled arguments in Swift (e.g., Fig.  12). An additional theme ("DoesNotKnowSyntax") includes errors in Racket caused by incorrectly generated core language constructs.

VII. THREATS TO VALIDITY
Our work translates two Python code generation benchmarks into 18 other languages and evaluates the performance of three code generation models on the translated benchmarks.
The principal threat to validity is that the (translated) benchmarks may not be representative of the kinds of problems that programmers typically solve in each languages. For example, we evaluate both scripting languages (e.g., Python and JavaScript) and systems languages (e.g., C++ and Rust) on the same task, but programmers frequently use these languages for very different tasks. We characterize the HumanEval and MBPP benchmarks as a mix of basic programming problems and straightforward interview questions. Thus, performance on benchmarks may not accurately represent real-world performance.
Code generation models are sensitive to small changes in how prompts are designed, as we show in our exploration of prompt design choices for three of our languages (Section VI-D). It is likely that the pass rate on individual languages could be improved with even more language-specific effort. We do provide an ablation study on prompts for all languages in our dataset (Section VI-A) to investigate the impact of our different translation components. All our prompts use >>> to mark examples, which is Python doctest notation. It is possible that a different marker may improve performance on some other languages.
The quality of generated code is also sensitive to decisions about how to sample completions (Section II-B). We use the same sampling configuration that is used in most prior work on code generation. Empirical results show these settings are optimal for Python [1], but it is possible that different settings would be better for other programming languages. However, in a practical deployment of a multi-language code generation model, it may not be feasible to adjust the sampler for every input language.

VIII. RELATED WORK
In this section we focus on related work on evaluating neural code generation models.
Early approaches. Early work on neural network code generation relied on textual similarity metrics for evaluation. For instance, Feng et al. [16] evaluate their CodeBERT model on six programming languages using BLEU [35]. Ren et al. [26] propose a code generation-specific formulation of this metric. However, previous work has found that textual similarity metrics correlate only weakly with code correctness [1], [2], [26], highlighting the importance of benchmark suites with unit tests.
Other benchmark formats. The benchmarks that we translate pair code with comments; some other benchmarks pair natural language descriptions of other kinds. For instance, the CoNaLa [10] benchmark consists of matching natural language questions and code snippets mined from StackOverflow. We note that MCoNaLA [36], which extends CoNaLa to Spanish, Japanese, and Russian, is the only currently available benchmark for evaluating code generation from multiple natural languages.
CodeGen [3] introduces a multi-turn benchmark that involves interleaving prompts and results. We believe it could be translated to other languages with some effort, using a variation of the MultiPL-E approach.
InCoder [4] supports "in-filling" or "fill-in-the-middle" and presents a number of infilling tasks, of which the most general is to fill in an arbitrary line of code in the middle of a function. This article does not benchmark in-filling, since the other two models that we evaluate do not support it. However, MultiPL-E-generated solutions are used as multi-language infilling benchmarks for the SantaCoder [33] code generation models, which do support in-filling.
MathQA [2] is a benchmark of math word problems, where answers are number-valued expressions in Python and a DSL. Since the canonical solutions are number-valued, it is easy to check that a solution in another language produces the same number, within some epsilon. However, MathQA requires some formulas to be translated for either fine-tuning or few-shot prompting.
Other Monolingual Benchmarks. There are monolingual code generation benchmarks in languages beyond Python. Kulal et al. [8] present a C++ dataset consisting of crowdsourced descriptions of lines of code. Iyer et al. [37] present a Java benchmark taken from online code repositories. Zhong et al. [28] and Yu et al. [38] propose benchmarks for SQL. However, we do not believe SQL is amenable to translation, since conventional types in programming languages do not naturally translate to the types of relational tables. Moreover, of these datasets, only Kulal et al.'s includes unit tests to enable evaluation of code correctness [8].
Our approach could be applied to other Python code generation benchmark suites like MathQA-Python [2], a set of mathematical word problems with multiple choice answers, or APPS [9], a set of problems taken from open-access code competition websites like Codeforces.
Other tasks. Although we focus specifically on benchmarks for the code generation task, there are many other tasks that have been used to evaluate code generation models, including generating unit tests from code [18], code search [16], [20], and type inference [21], [22], [23]. Lu et al. [19] propose a suite of evaluation datasets for ten tasks, including code translation, docstring generation, and code summarization.
Other code generation models. We evaluate three state-of-the-art code generation models, but many other models have been proposed. Two influential early models were CodeBERT [16] and PyMT5 [17]. More recent models include PolyCoder [6], CodeParrot [39], AlphaCode [40], and PaLM-Coder [41]. PolyCoder was not trained on natural language text, and its authors explicitly state that it may not be suitable for NL2Code. AlphaCode and PaLM-Coder are not available for academic researchers to investigate.
Other multi-language evaluation. Xu et al. [6] measure the performance of several code generation models on 12 languages. However, they evaluate model performance using perplexity, rather than building a benchmark with unit tests, as we do; they test code correctness only for Python.
HumanEval-X is an unpublished benchmark that appeared after our work; it manually translates the HumanEval problems into four languages (C++, Java, JavaScript, and Go).
MBXP [42] is a concurrent effort to evaluate code generation models. We support more languages (19 vs. 13), though MBXP translates an additional benchmark (MathQA). Both MBXP and our work could be extended to support more languages and benchmarks. However, there are deeper differences in the nature of our translation and evaluation:
• We believe our approach to testing is more reliable. Rather than keeping the unit tests hidden from the model, MBXP prompts the model with the same unit tests it uses to test the generated code. Thus the code generation model can "see" the test cases that it will be evaluated on. In contrast, we use a hidden set of unit tests to evaluate code correctness.
• Our work is more faithful in translating types from Python into typed languages. For example, our type inference infers types like Either[X,Y] and Optional[X] and translates them to algebraic datatypes in typed languages (Section VI-C). MBXP produces types such as Object and Any in languages like Java and Scala, which are less idiomatic. For languages that do not support Any, such as C++, MBXP fails to translate these benchmarks altogether.
• MBXP uses greedy decoding in its evaluation of public models. Greedy decoding produces a single candidate program, which may not be the most likely program. Prior work has shown that sampling the output of a code generation model significantly improves the correctness of generated code [1]. We follow standard practices for sampling (Section II-B).
• Finally, MBXP has publicly released a subset of its benchmarks, but not the system used to build them. All code and data for MultiPL-E is open source.

IX. CONCLUSION
We propose MultiPL-E, the first massively parallel, multilanguage benchmark for natural-language-to-code generation. We write compilers to translate code generation benchmarks from Python to 18 additional programming languages that span a spectrum of language features and popularity.
We translate two widely used unit test-driven benchmarks for code generation: HumanEval and MBPP. Using our multilanguage parallel versions, we present the first multi-language code correctness evaluation of three state-of-the-art models: Codex, CodeGen, and InCoder. We demonstrate that Codex displays high performance across a variety of programming languages, performing similarly to Python on several languages, most notably, JavaScript.
In our detailed by-language analysis, we find a predictable effect of language frequency, but draw mixed conclusions about the impact of type annotations. Our detailed error analysis highlights common patterns in four languages, finding model errors that are both like and unlike those of human programmers. We hope that our in-depth, parallel evaluation of a large set of languages will be a useful guide for developers weighing the utility of code generation tools in their project context.
Our publicly available benchmark is also easy to extend to new problems and languages. We hope it will help evaluate and develop future work on multi-language code generation models.
Federico Cassano is an undergraduate student and researcher with Northeastern University. He is currently working toward the Bachelor of Science degree in Cybersecurity and Economics. His research interests include systems security and programming languages.
John Gouwar received the BA degree in computer science and mathematics from Grinnell College. He is currently working toward the PhD degree in computer science from Northeastern University. His research focuses on implementation and evaluation of example-based program synthesis tools and the implementation of novel abstractions for circuit-based probabilistic programming languages.
Daniel Nguyen will be an undergraduate with Carnegie Mellon University. He is interested in algorithms and software engineering.
Sydney Nguyen is working toward the BA degree in computer science with a minor in Music at Wellesley College. Her research interests include AI code generation and translation.
Luna Phipps-Costin received the BS degree in computer science from the University of Massachusetts Amherst. She is currently working toward the PhD degree with Northeastern University. Her research interests are programming languages and compilers.
Donald Pinckney is currently working toward the PhD degree with Northeastern University, leading research on understanding and improving the software supply chain, security and malware analysis, and other programming languages and software engineering topics. His recent work includes building MaxNPM, a replacement for NPM which utilizes MaxSMT solvers to optimize various software properties when installing dependencies.
Ming-Ho Yee received the Bachelor of Software Engineering and Master of Mathematics degrees from the University of Waterloo. He is currently working toward the PhD degree with Northeastern University. His research interests include programming language design and implementation and software engineering.
Yangtian Zi received the Bachelor of Computer Science and Master of Mathematics degrees from the University of Waterloo. He is currently working toward the PhD degree with Northeastern University. His research interests are large language models for code and just-in-time compiler implementation.
Carolyn Jane Anderson is an assistant professor of Computer Science with Wellesley College. Her research explores the intersection of computation and meaning using a variety of methods, including formal semantics, Bayesian modeling, psycholinguistic experiments, and deep learning.
Molly Q Feldman received the PhD degree in computer science from Cornell University. She is an assistant professor of Computer Science with Oberlin College. Her research considers the application of newly practical computational methods to human-centered domains.
Arjun Guha is an associate professor of Computer Science with Northeastern University and a visiting professor with Roblox Research. His research interests span programming languages, software engineering, and systems.
Michael Greenberg received the BA degree in computer science and Egyptology from Brown University, and the PhD degree in computer and information science from the University of Pennsylvania. He is an assistant professor with the Stevens Institute of Technology. He works on programming languages and systems, on a broad range of topics: from functional-reactive JavaScript to runtime verification of higher-order programs; from software-defined networking to present activities focused on the POSIX shell and executable formalism.
Abhinav Jangda received the PhD degree from the University of Massachusetts Amherst. He is a senior researcher with Microsoft Research. His research interests lie at the intersection of programming languages, systems, and high-performance computing.