Code Generation Using Machine Learning: A Systematic Review

Recently, machine learning (ML) methods have been used to create powerful language models for a broad range of natural language processing tasks. An important subset of this field is the generation of code in programming languages for automatic software development. This review provides a broad and detailed overview of studies for code generation using ML. We selected 37 publications indexed in the arXiv and IEEE Xplore databases that train ML models on programming language data to generate code. The three paradigms of code generation we identified in these studies are description-to-code, code-to-description, and code-to-code. The most popular applications that work in these paradigms were found to be code generation from natural language descriptions, documentation generation, and automatic program repair, respectively. The most frequently used ML models in these studies include recurrent neural networks, transformers, and convolutional neural networks. Other neural network architectures, as well as non-neural techniques, were also observed. In this review, we summarize the applications, models, datasets, results, limitations, and future work of the 37 publications. Additionally, we include discussions on topics general to the reviewed literature, including comparisons of different model types and tokenizers, the volume and quality of the data used, and methods for evaluating synthesized code. Furthermore, we provide three suggestions for future work for code generation using ML.


I. INTRODUCTION
Software development is a complex and time-consuming process. It consists of two main phases: analysis and coding [1]. In the analysis phase, the requirements and architecture of the software system are formalized. In the coding phase, source code is written and tested to meet the requirements set in the first phase. Usually, maintenance of the system is included as an additional phase in the software development cycle where previous steps can be adapted to reflect changes in the needs of the system user. Figure 1 shows a flowchart for a simple software development model. In this review, we focus on the coding phase, which works directly with source code.
Modern society relies on complex software applications. These applications can consist of millions of lines written in many programming languages (PLs) by many teams of developers. Even small software projects will often leverage large libraries that are expected to be easy to use and trusted to be efficient and safe. PLs are difficult to read and understand quickly, so developers must also document their programs to make them more maintainable. Mistakes made during the coding phase lead to software bugs that can cost time and money for the software creators and users. In the worst-case scenario, software bugs can jeopardize the safety of human beings.
As a result, many software development tools and technologies have been created to help developers write better software. A popular technology used by software developers is a ''linter,'' which flags syntactic errors in code. Auto-formatters add or remove whitespace and ''newline'' characters to improve the readability of code. Statement auto-complete tools can suggest tokens that programmers might write next to improve their productivity. While these traditional tools can be useful for programmers, most of them cannot help a developer with complex tasks such as writing understandable code documentation or implementing algorithms.
More recently, machine learning (ML) has opened up the possibility to automate difficult programming-related tasks. In particular, advances in neural network (NN) architectures, such as recurrent neural networks (RNNs) and transformers [2], have been used to advance the state-of-the-art (SOTA) for many difficult automated software engineering tasks. These tasks include code generation from code documentation [3], [4], documentation generation from code [5], [6], and cross-PL translation [7]. These technologies, among others, have even led to commercial products such as Tabnine [8] and Github's [9] Copilot [10].
To provide a broad and detailed introduction to this field, this systematic review summarizes and discusses publications that use ML to generate code. More specifically, we reviewed publications retrieved from searches on the arXiv [11] and IEEE Xplore [12] databases that propose models that synthesize code from non-code inputs (description-to-code), generate code documentation given code inputs (code-to-description), or modify existing code from one form to another (code-to-code). We categorize each publication by its ML model and the relevant sub-domain. The intention of this review is to provide a broad but detailed overview of ML techniques applied to the domain of automatic software generation. A summary of each publication and general discussions are provided. Topics discussed in this review include the application categories, popular ML models, tokenization strategies, the quantity and quality of data, and metrics used to evaluate synthesized code. Additionally, three directions for future work are suggested before concluding this systematic review.

II. METHODOLOGY
This review followed the Preferred Reporting Elements for Systematic Reviews and Meta-analyses (PRISMA) guidelines [13] where appropriate for the scope of this review (see Section III for more details). The arXiv [11] and IEEE Xplore [12] databases were searched to identify potential publications for review. Inclusion criteria refer to filters applied to the search functions of each database. The search applied two general inclusion criteria and one inclusion criterion specific for each of the databases searched. Four exclusion criteria were then applied to the studies retrieved from this search to remove studies that are not appropriate for this review. Figure 2 provides a graphical overview of the search and selection methodology explained in more detail in the remainder of this section.
The first inclusion criterion applied to both databases was the search terms used. The search phrase ''code generation using machine learning'' was applied to all fields (title, abstract, full-text, etc.) using each database's search engine to identify possible publications for review. The second general inclusion criterion was the publication date. To ensure variety in the ML models studied by the retrieved publications, the start of the first publication date range was chosen to be 2016, one year before the introduction of the transformer architecture [2]. This was done because transformers have become very popular recently and we wanted to ensure that studies using other ML models were also retrieved by our search. The first searches were therefore limited to studies published between 2016 and 2021. Additional searches over earlier years were applied iteratively, each time decrementing the publication year searched and keeping all other inclusion criteria the same, until no new studies were selected for the review. An additional filter was added to the search of each database using the filtering options provided by the search function of each database. The search on the IEEE Xplore database was limited to conference and journal publications. The search on the arXiv database was limited to publications on the subject of computer science (including cross-listings).
Four exclusion criteria were defined to filter out publications that were returned by the search queries but are not applicable to the review. These exclusion criteria are as follows: (i) publications that do not propose new ML techniques or models for code-to-description, description-to-code, or code-to-code applications; (ii) survey, benchmark, and vision papers; (iii) publications where an ML model predicts numerical parameters for non-ML code generation engines; and (iv) publications not written in English. These criteria were first applied to the title and abstract of each publication. Publications whose titles and abstracts did not provide sufficient information to justify removal from consideration were not filtered at this stage. Next, duplicates between the remaining publications from both databases were removed. Finally, the full texts of the remaining publications were screened based on the exclusion criteria. The publications not filtered in this final stage were added to the set of selected publications. All searches were performed in April 2022.

III. RESULTS
For the initial query using the publication date range of 2016-2021, the searches on the arXiv [11] and IEEE Xplore [12] databases returned 613 and 274 publications, respectively. After applying the inclusion and exclusion criteria to the titles and abstracts of the 887 identified publications, 811 publications were excluded from consideration for the review. One duplicate study was found among the remaining 76 studies between the two databases and was removed. The final selection of publications was obtained by applying the inclusion and exclusion criteria to the full text of the 75 remaining publications. The final selection consisted of 37 publications, 28 indexed in the arXiv database [11] and 9 indexed in the IEEE Xplore database [12].
An overview of these search results for every step in the selection methodology is shown in Figure 3. Figure 3 also shows the results of the second search which was limited to studies published in 2015, one year earlier than the publication date range of the first search. Since no new studies were selected from this second search, no additional searches were conducted. All searches were performed in April 2022.
A direct comparison of the results of each publication is not possible as they use different model types, are trained and evaluated on different datasets, and use different methods for evaluation. Instead, a summary of each publication is provided in Table 1. The table consists of the following columns: (i) the application studied, (ii) the ML model used, (iii) the datasets used by the study, (iv) the results of the study and discussion, and (v) limitations and/or future work. General aspects of the applications, ML model types, tokenizers, datasets, and evaluation methods of these publications are discussed in more detail in the next section.

IV. DISCUSSION
This review examines 37 studies that propose ML models to generate code from non-code descriptions, generate code documentation, or modify code. This section discusses common challenges these studies faced as well as key findings from the selected studies as a whole. First, the different application categories are introduced and explained with the help of examples. Next, the popular model types are introduced and compared. Subsequently, different tokenization strategies for code generation are compared. The quantity and quality of readily available data for the different applications are discussed in their respective sub-sections. The different metrics for measuring the quality of synthesized source code are also compared in a dedicated sub-section. Finally, three suggestions for future work are listed.

A. APPLICATION CATEGORIES
The studies reviewed in Table 1 can be categorized into three paradigms: description-to-code, code-to-description, and code-to-code. These paradigms describe the nature of the inputs and outputs of their applications. The studies can further be categorized into a number of application categories as shown in Table 2. Figure 4 shows the percentage of selected studies belonging to each paradigm. This section explains the three paradigms and introduces the most popular application categories of the selected publications.

TABLE 2. Application categorization of the selected publications. The application categories are grouped together by the three paradigms: description-to-code, code-to-description, and code-to-code. All studies within a row are ordered chronologically by publication date.

FIGURE 4. Percentage of selected publications that study each of the three application paradigms: description-to-code, code-to-description, and code-to-code.
Description-to-code applications involve generating code conditioned on model inputs that are not code. This was the most popular paradigm, applicable to 46% of all selected studies. The descriptions can come in various forms. The most popular description type is natural language (NL) documentation. These descriptions are often obtained from code comments written before a code snippet. An example NL description with an associated code implementation is shown in Figure 5.
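To make this input-output structure concrete, a minimal, hypothetical NL-code pair in the spirit of Figure 5 is sketched below; the comment and function are invented for illustration and are not drawn from any dataset used by the selected studies.

```python
# Input: an NL description, written as a code comment.
# "Return the sum of all even numbers in the list `numbers`."

# Desired output: code implementing the described functionality.
def sum_even_numbers(numbers):
    """Return the sum of all even numbers in `numbers`."""
    return sum(n for n in numbers if n % 2 == 0)
```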
FIGURE 5. An input-output pair example for code generation from NL. The input consists of an NL description, depicted here as a code comment, which describes the functionality of the desired code output.

Programming-by-example (PBE) is the second most popular application category for the description-to-code paradigm. For PBE, the functionality of the desired program to be generated is described by pairs of program input and output examples. Figure 6 shows an example of a list of possible program inputs and a list of corresponding outputs as well as a program that satisfies the given input-output pairs.

FIGURE 6. An example of a data-program pair for programming-by-example (PBE) with a verification code snippet. PBE generates a program that satisfies the functionality described by input-output data. This data is depicted here as a list of tuples (top rectangle) consisting of an input and then an output. Under this data is a for loop which verifies that the function to be generated should return the corresponding output for each input. The example program (bottom rectangle) returns the desired output for each input in the list ''inputs''.
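A minimal sketch of such a data-program pair, mirroring the structure described in the Figure 6 caption (an input-output list, a verification loop, and a candidate program), is shown below; the concrete task and names are illustrative assumptions.

```python
# Input-output examples describing the desired program: each tuple is (input, output).
examples = [(2, 4), (3, 9), (5, 25), (10, 100)]

# A candidate program that could be synthesized from the examples.
def f(x):
    return x * x

# Verification loop: the generated function must return the expected output for each input.
for inp, out in examples:
    assert f(inp) == out
print("All input-output examples are satisfied.")
```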
Another important description type is images. For [24] and [29] the images are screenshots of graphical user interfaces (GUIs) and for [114] the images are sketches of data visualizations. The desired output is a program that can synthesize the given image. Figure 7 shows a possible image-program pair for a simple GUI implemented in the HTML language.

FIGURE 7. An example of an image-code pair for code generation from images. The image depicted here is a simple GUI with a black background and two buttons. The desired code to be generated, shown in the bottom rectangle, is HTML code that synthesizes this GUI.
Code-to-description studies in this review all belong to a single application category, documentation generation. With 25% of all selected studies being code-to-description studies, this paradigm is the least popular of the three, while documentation generation is the single most popular application category. This task, sometimes called source code summarization, aims to generate an NL description of the code, usually in the form of a comment. An example of documentation generation is shown in Figure 8, and Figure 7 would also serve as an example if its input and output data were swapped.
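As a small, hypothetical illustration of a code-description pair (in the spirit of Figure 8), consider the following; the function and the target summary are invented for illustration.

```python
# Input: a Python function without documentation.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Desired output: an NL description of the code, e.g., as a comment or docstring.
# "Converts a temperature from degrees Celsius to degrees Fahrenheit."
```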
Code-to-code applications generate code conditioned on other code. The most popular application category of this paradigm is automatic program repair (APR). The input for APR is faulty, or buggy, code for which the model should generate similar code that does not have the bug. Figure 9 shows a buggy line of Python [39] code that has a syntactic error and the fixed line of code as the output.

FIGURE 8. An input-output pair example for documentation generation from code. The input consists of code, in this case a Python [39] function. The desired output is NL which describes the semantics of the given code. The problem statement for documentation generation is the same as the problem statement of code generation from NL with the inputs and outputs reversed.

FIGURE 9. An example of an input-output pair for automatic program repair (APR). The input example here contains a syntactic error that has been recognized by the Pylance [116] linting tool. The output example is the same code as the input with an additional closing bracket which fixes the input's error.
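A minimal, hypothetical repair pair of the kind described in the Figure 9 caption (a missing closing bracket flagged by a linter) might look as follows; the specific statement is an assumption for illustration.

```python
order_items = [(2.50, 4), (1.25, 2)]

# Buggy input line (kept as a comment because it would not parse):
#   total = sum([price * qty for price, qty in order_items)
# A linting tool flags the missing closing bracket as a syntactic error.

# Repaired output line: the same code with the closing bracket added.
total = sum([price * qty for price, qty in order_items])
print(total)
```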
Cross-PL translation involves translating code written in one programming language (PL) to code written in another PL while preserving as many features of the original code as possible. An example where C++ [40] code is translated to Python [39] code that preserves the same functionality is shown in Figure 10. Refactoring is similar to cross-PL translation in that features of the input code should be preserved, but different in that the input and output code are written in the same PL. Refactoring aims to transform the input code into a form that is more understandable for humans. Reference [14] does this by adding or removing whitespace or new-line characters in the input code, while [62] attempts to paraphrase code statements so that the transformed code is more concise than the input. The final application category, code completion, involves predicting subsequent code statements from prior code alone.

FIGURE 10. A cross-PL translation pair example. The first code snippet is written in C++ [40]. A functionally equivalent Python [39] translation of the code snippet is shown in the second code snippet.

B. MACHINE LEARNING MODEL TYPES
The review shows that a wide variety of ML methods can be used for different code generation tasks. Table 3 shows which of the selected studies use certain model types. Figure 11 shows the number of times an ML method is used for the different application categories introduced in Section IV-A. The popular ML methods used by the selected studies are introduced and compared in this section.
Recurrent neural networks (RNN) are a class of NN often used with sequential data such as NL. RNNs use previous outputs and states of the network as supplementary information to the current input. Diagrams and equations of three types of RNNs, a basic RNN, LSTM, and GRU, are shown in Figures 12, 13, and 14, respectively. The hidden state, h_t, allows the model to use previous data in a sequence alongside the current input. Basic RNNs do not capture long-term dependencies well. LSTMs address this issue by passing along a cell state, c_t [117]. GRUs combine the hidden state and cell state into one state, simplifying the network. Attention mechanisms, first proposed in [118], are often used by RNNs to encode information in variable-length vectors, which reduces information loss for large inputs. RNNs are the most popular NN type in the review, used in 43% of the selected studies, 80% of which use LSTMs specifically. RNNs are used for every application category. RNN decoders are used in combination with CNN encoders in four studies [24], [66], [69], [114]. References [24], [114] use CNNs as encoders for image inputs while [66], [69] use CNNs for encoding text.

FIGURE 13. A long short-term memory (LSTM) network. Compared to simple RNNs such as the one shown in Figure 12, LSTMs add a cell state (c_t) which allows for better handling of long-term dependencies in sequential data.
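For reference, a common formulation of the LSTM updates for the hidden state h_t and cell state c_t mentioned above is given below; the exact parameterization and notation may differ between the selected studies, so this should be read as a representative sketch rather than the equations used in any particular publication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```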
Transformers [2] rely solely on attention mechanisms to capture dependencies between tokens in sequential data. While complex RNNs incorporate attention mechanisms to enhance dependency information, using only self-attention allows for greater parallelization. Figure 15 shows multi-head attention, the main building block of the original transformer architecture [2]. The transformer architecture is studied in 12 out of the 37 selected publications, as shown in Figure 16, making it the second most popular model type overall. Figure 16 also shows that transformers are the most popular model type if only the last year is considered. Similar to RNNs, transformers are used for a wide variety of tasks. Transformers are used for five out of seven code generation from NL description studies, more than any other model type. The only tasks that do not have a selected study that uses a transformer are code generation from other structured data and refactoring.
As RNNs and transformers both work well with sequential data, they are often compared in experiments. Of the 12 studies that use transformers in their proposed models, 9 of them [47], [76], [79], [83], [85], [89], [91], [95], [102] compare their results with RNN baselines. The transformer model used in [47] underperformed the RNN model it was compared against. The other eight publications propose transformer-based models that outperformed the RNN-based baselines of each respective study. Reference [111] proposed a method with an LSTM component that outperformed transformer baselines, but it was suggested that replacing the LSTM component with a transformer could improve the proposed method.
Convolutional neural networks (CNN) use convolution layers that sweep an input using a feature filter to aggregate information about the input. Convolutional operators work well on grid-like data such as images. Pooling is frequently used in CNNs to reduce the number of parameters of the model. Figure 17 shows an example of the convolution operation as well as the pooling operation. Note that no zero-padding is used in this example and simple parameters are used (e.g., strides of size 1). For more detailed information on these, we refer to [119].
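As a minimal numerical sketch of the two operations described above, assuming the simple parameters mentioned (no zero-padding and strides of size 1), a NumPy implementation might look as follows.

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """2D convolution with no zero-padding and stride 1 ('valid' mode).
    As in most CNN libraries, the filter is not flipped (cross-correlation)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling, which reduces the number of values."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - h % size, size):
        for j in range(0, w - w % size, size):
            out[i // size, j // size] = np.max(feature_map[i:i + size, j:j + size])
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # a toy 4x4 "image"
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])       # a 2x2 feature filter
features = convolve2d_valid(image, kernel)          # 3x3 feature map
pooled = max_pool2d(features, size=2)               # 1x1 after 2x2 pooling
print(features)
print(pooled)
```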
All three publications on code generation from images used CNNs. Reference [29] used a CNN to classify objects in GUI screenshots. References [24], [114] used CNNs to extract features from images in combination with RNN or transformer decoders. CNNs are also used as encoders in other encoder-decoder models to create embeddings of code inputs [38], [66], [89]. References [38], [66] use CNNs in RNN encoder-decoder architectures while [89] use CNNs to augment a transformer architecture. Reference [69] uses a convolutional graph NN on AST inputs due to their ability to encode spatial information well.
ML augmented search for PBE is a method of generating programs by building an AST node-by-node where the next node is chosen by an ML model until the program satisfies all input-output examples. Explicit functional specifications such as input-output examples are important so that the model knows when to stop searching. The advantage of this method, compared to other code-generation techniques, is that a generated program is guaranteed to compile and behave as specified by the input-output examples given. Three of the five PBE studies [59], [63], [73] use ML augmented search. These studies use DSLs specifically designed to reduce the search space of all possible programs. General PLs like Java [16] or Python [39] have large program spaces that make these search techniques infeasible. For general PLs, RNNs or transformers are commonly utilized as they can more efficiently build sequences one token at a time by using decoding strategies such as beam search [120]. These decoding strategies usually do not provide any syntactic or functional guarantees for the generated program in contrast to PBE with AST search.
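A heavily simplified sketch of this idea is given below, using a toy DSL of unary integer operations and a placeholder scoring function standing in for a trained ML model; real systems use richer DSLs and learned guidance, so this is illustrative only.

```python
import heapq
import itertools

# Toy DSL: a program is a sequence of unary operations applied to the input.
OPERATIONS = {
    "add1": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
}

def run(program, x):
    for op in program:
        x = OPERATIONS[op](x)
    return x

def satisfies(program, examples):
    return all(run(program, inp) == out for inp, out in examples)

def score(partial_program, examples):
    # Placeholder for an ML model that would rank which node (operation)
    # to expand next; here we simply prefer shorter programs.
    return len(partial_program)

def search(examples, max_depth=4):
    # Best-first search over the program space, guided by the scoring function.
    counter = itertools.count()            # tie-breaker for the heap
    frontier = [(0, next(counter), ())]    # (score, tie-breaker, partial program)
    while frontier:
        _, _, program = heapq.heappop(frontier)
        if program and satisfies(program, examples):
            return program                 # guaranteed to satisfy all examples
        if len(program) >= max_depth:
            continue
        for op in OPERATIONS:
            candidate = program + (op,)
            heapq.heappush(frontier, (score(candidate, examples), next(counter), candidate))
    return None

examples = [(1, 4), (2, 9), (3, 16)]       # desired behavior: square of (x + 1)
print(search(examples))                    # e.g., ('add1', 'square')
```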

C. TOKENIZERS AND THE OUT-OF-VOCABULARY PROBLEM
Tokenization is a preprocessing step where an input string is partitioned into chunks. These chunks or ''tokens'' are mapped to numbers that ML models can process. The outputs of the models can be mapped back to tokens which form a part of the model output. Models with tokenizers recognize a finite set of tokens which is called the vocabulary of the model. Whenever a chunk of the input string does not have a matching token in the vocabulary, a special <unknown> token must be used. This results in a loss of information which is referred to as the out-of-vocabulary (OOV) problem. Table 4 categorizes the selected studies of the review into three main tokenizer types: word-based, character-based, and subword-based. Word-based tokenizers split words on whitespace characters with special rules for punctuation. In the context of code, parsers are frequently used to handle ''code punctuation'' such as brackets. Word tokens capture a complete unit of meaning from the input string but require a large vocabulary. Character-based tokenizers split the input string on every character. This simplifies the vocabulary, but each token usually holds little meaningful information. Subword-based tokenization provides a compromise between character- and word-based tokenization. The vocabulary consists of all base characters as well as frequently occurring sequences of characters. Subword tokenizers are the most popular tokenizer type, as shown in Figure 18. Figure 19 shows examples of each tokenizer type.
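A small, self-contained sketch of the three tokenizer types applied to the same code snippet (in the spirit of Figure 19) is shown below; the tiny vocabulary and the greedy longest-match subword strategy are illustrative assumptions rather than the exact algorithms used by the selected studies.

```python
def word_tokenize(text):
    # Word-based: split on whitespace (handling code "punctuation" would need a parser).
    return text.split()

def char_tokenize(text):
    # Character-based: every character becomes its own token.
    return list(text)

def subword_tokenize(text, vocab):
    # Subword-based (greedy longest match): base characters plus frequent sequences.
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece if piece in vocab else "<unk>")
                i = j
                break
    return tokens

vocab = {"print", "_", "hello", "world", "(", ")", " "}
snippet = "print_hello_world()"
print(word_tokenize(snippet))             # ['print_hello_world()']  -> one large, likely OOV token
print(char_tokenize(snippet))             # ['p', 'r', 'i', 'n', 't', ...]
print(subword_tokenize(snippet, vocab))   # ['print', '_', 'hello', '_', 'world', '(', ')']
```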
A problem that tokenizing source code faces is the fact that the number of unique ''words'' in code is generally much larger than in NL. This is mostly due to identifiers for functions or variables, which are multiple words concatenated together using some naming or casing convention (e.g., a function that prints ''hello world'' to the console can be named printHelloWorld or print_hello_world depending on the convention used).

TABLE 4. Overview of tokenizer types used by the selected studies. For non-string data, tokenizers are not used.

FIGURE 18. Percentages of selected studies that used different tokenizer types. Non-string data refers to studies that did not use tokenizers to preprocess data for ML models.
Custom tokenization processes can be used to keep vocabulary sizes small while encapsulating useful information in each token. Token copying is one such process observed in the selected publications. Reference [76] used positioned <unknown> tokens to copy tokens not in the model vocabulary from the input to the output string. Similarly, [75] keeps out-of-vocabulary tokens and a position encoding in a lookup table to replace <unknown> tokens during the decoding of the output. Reference [50] used copying mechanisms for AST tokens based on probabilities from the training data.
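A minimal sketch of token copying with positioned unknown tokens is given below; it loosely follows the idea described above rather than the exact implementation of any cited study, and the vocabulary and token names are invented for illustration.

```python
VOCAB = {"def", "(", ")", ":", "return", "+", "1"}

def encode_with_positioned_unks(tokens):
    # Replace each out-of-vocabulary token with a positioned <unk_i> token
    # and remember the original token in a lookup table.
    encoded, lookup = [], {}
    for tok in tokens:
        if tok in VOCAB:
            encoded.append(tok)
        else:
            placeholder = f"<unk_{len(lookup)}>"
            lookup[placeholder] = tok
            encoded.append(placeholder)
    return encoded, lookup

def decode_with_copy(output_tokens, lookup):
    # Copy the original tokens back in place of the positioned placeholders.
    return [lookup.get(tok, tok) for tok in output_tokens]

source = ["def", "increment_counter", "(", "counter_value", ")", ":"]
encoded, lookup = encode_with_positioned_unks(source)
print(encoded)   # ['def', '<unk_0>', '(', '<unk_1>', ')', ':']

# Suppose the model emits placeholders in its output; decoding copies the originals back.
model_output = ["return", "<unk_1>", "+", "1"]
print(decode_with_copy(model_output, lookup))  # ['return', 'counter_value', '+', '1']
```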
The second type of tokenization enhancement process observed in the selected literature is token abstraction, which uses multiple <unknown> tokens to distinguish different types of tokens. Reference [21] noticed that <unknown> tokens are terminal nodes in the AST and used the node type as a replacement token rather than a generic <unknown> token. Similarly, [34], [67] abstracted OOV identifiers for methods and variables to separate <unknown> tokens. Reference [31] used an identifier abstraction mechanism where all instances of an identifier were tokenized to a numbered identifier token.

D. VOLUME OF AVAILABLE DATA
The volume of data needed to train and evaluate code generation models is critical to the performance of the model. The studies in this review use open-source repository data, manually created data, and/or automatically generated data. Table 5 shows which data source types were used by each selected study. Percentages of the number of studies that use data from different combinations of these sources are shown in Figure 20.
Data from open-source repositories are used by 62% of the selected publications. Open-source repositories provide large volumes of varied data in many PLs. Reference [4] shows an example of this in their proposed model, which was trained on millions of lines of code to be able to generate a wide variety of multi-line Python [39] functions. These lines of code were collected from Github [9]. Eighteen of the selected publications [4], [14], [21], [31], [34], [38], [43], [46], [47], [66], [67], [71], [79], [85], [91], [95], [97], [102] used data sourced from git repositories. This data often includes source code, documentation, and repository change information. This last type of data is especially useful for APR studies as these changes are occasionally pre- and post-bugfix pairs. All APR studies used open-source repository data. Reference [76] used data sourced from StackOverflow [123], a forum where users ask and answer programming-related questions, which is especially useful for obtaining NL-code pairs.
A lack of available data led five studies [26], [44], [53], [59], [114] to manually create data. For example, [75] manually wrote translations in a target programming language given code from a source programming language. Manually created datasets made publicly available from other studies were used by six [29], [50], [62], [63], [76] of the selected studies. Data generation performed by humans is time-intensive, which is why it is avoided where possible.

FIGURE 19. Example word-, character-, and subword-based tokenizers with limited vocabularies. Each tokenizer processes the input string differently. The output elements are called tokens. These tokens can then be translated to numbers using a lookup table to create a valid input for a neural network. The <unknown> token is denoted as <unk>.

TABLE 5. Overview of data source types of the datasets used by the selected studies. Manually created datasets have humans-in-the-loop during data generation while automatic generation implies data created in a programmatic manner.

If there is not enough data readily available, automatic data generation methods can also be considered. Machine-generated data is used in all of the works reviewed in the domain of PBE since randomly generated input-output pairs for numeric calculations or string transformations are usually acceptable for deriving a program. References [26], [114] generate data automatically in addition to creating data manually to achieve a balance between quality and quantity of data.
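A minimal sketch of automatic data generation for PBE is shown below; the string-transformation task and all names are invented for illustration.

```python
import random

def target_transformation(s):
    # The "hidden" program whose behavior the generated data should describe:
    # here, upper-casing the first letter of every word.
    return " ".join(word.capitalize() for word in s.split())

def random_input(rng, n_words=3, word_len=5):
    letters = "abcdefghijklmnopqrstuvwxyz"
    return " ".join(
        "".join(rng.choice(letters) for _ in range(word_len)) for _ in range(n_words)
    )

def generate_examples(n, seed=0):
    rng = random.Random(seed)
    inputs = [random_input(rng) for _ in range(n)]
    return [(inp, target_transformation(inp)) for inp in inputs]

# Randomly generated input-output pairs of this form are usually acceptable
# for deriving a program, which is why PBE data is easy to create automatically.
for pair in generate_examples(3):
    print(pair)
```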
The need for a partition of the dataset for evaluation purposes reduces the amount of data that can be used for training. This problem is usually resolved by using cross-validation. Cross-validation involves training many models and is often computationally expensive. This is especially the case for large language models. More discussion on this topic is provided in Section IV-G.

E. QUALITY OF AVAILABLE DATA
As mentioned in the previous section, researchers leverage automatic mining from open-source repositories to obtain large volumes of data. Even after preprocessing and filtering, the quality of the automatically collected data can be unreliable. Automatically mined source code often has dependencies that can be difficult to obtain automatically, making the source code non-executable. Reference [91] found this to be problematic for obtaining input-output pairs in the domain of PBE. Executable source code is also important for the functional evaluation of other applications. References [4], [114] manually curated test datasets to ensure better quality testing data. An alternative to evaluation by comparing synthesized code with a reference code snippet is human evaluation. This is done by [4], [43], [44], [95]. Manually generating or evaluating data is time-intensive, which is why automatic data mining or generation and automatic evaluation are more common.

TABLE 6. Overview of evaluation methods used by the selected studies. Token match metrics used in the review which are popular for NLP applications include BLEU [22], CIDEr [35], ROUGE [37], and METEOR [36]. Examples of other token match metrics are ''exact match'' and ''token accuracy''. Dynamic analysis analyzes runtime behavior of code while static analysis does not require code to be executed.

F. EVALUATING GENERATED CODE
Evaluating the quality of synthesized code is done either by comparing it to a ''ground-truth'' code statement, analyzing it statically, or analyzing it at runtime. Table 6 shows which evaluation methods each of the selected studies used.
The selected studies that evaluate synthesized code by comparing it to ground-truth code statements do so at the token level. Token comparisons are performed either by algorithms from the NLP literature such as BLEU [22], CIDEr [35], ROUGE [37], and METEOR [36] or by other metrics such as ''exact match'' and ''token accuracy''. Figure 21 shows that token match is used by 76% of the selected studies. Token match evaluation is popular because the same data used for training can be used for evaluation, it is automatic, and it does not require the code to be executable. Using token match metrics from NLP for documentation generation is not a problem as these metrics have been shown to correlate with human judgments. However, reference [4] shows that this is not the case for synthesized code. Reference [124] argues that BLEU and exact match do not properly capture code semantics and instead proposes a code-specific metric, CodeBLEU. CodeBLEU uses a weighted sum of BLEU, BLEU weighted on code keywords, syntactic similarity by comparing ASTs, and data-flow similarity [124]. CodeBLEU was not used by any of the works reviewed. References [18], [111] used custom, code-specific token match metrics to measure program equivalence and thereby better capture code semantics.

Dynamic analysis involves evaluating the functional correctness and/or the time-to-completion of executable code at runtime. Functional correctness requires certain types of data such as input-output examples (e.g., PBE), unit tests [4], or formal specifications [53]. This is a code-appropriate metric and allows different code implementations that are functionally equivalent to obtain the same score. This is unlike token match evaluation, which gives a better score to the implementation that is most similar to the reference code. Furthermore, reference [4] argues that functional correctness correlates well with what humans would consider to be quality code. Dynamic analysis often requires the code to be syntactically correct, which is not always guaranteed by ML-based code generation methods as discussed at the end of Section IV-B.
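A small sketch contrasting token match (here, exact match) with functional-correctness evaluation is shown below; the reference solution, candidate, and unit tests are invented for illustration, and metrics such as BLEU and CodeBLEU are omitted for brevity.

```python
reference = "def add(a, b):\n    return a + b"
candidate = "def add(x, y):\n    return y + x"   # functionally equivalent, different tokens

# Token match (here: exact match on whitespace tokens) penalizes the alternative implementation.
exact_match = int(reference.split() == candidate.split())
print("Exact match:", exact_match)                # 0

# Functional correctness executes the candidate against test cases instead.
namespace = {}
exec(candidate, namespace)                        # dynamic analysis requires executable code
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 5), 4)]
passed = all(namespace["add"](*args) == expected for args, expected in tests)
print("Functional correctness:", passed)          # True
```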
Static analysis is more accessible than dynamic analysis since the code does not need to be executable and no ground-truth references are needed. However, only using syntactic correctness as a metric to validate models can lead to degeneration where synthesized code does not exhibit any desirable functional or semantic properties [111]. Human evaluation is a holistic evaluation method but is time-consuming and requires programmers with knowledge of the PL the code is written in.

G. FUTURE WORK
In this section, we list three suggestions for future work that would contribute to the field of code generation using ML.
Improving language model efficiency is our first suggestion for future work. Models such as transformers [2] are good for general code generation tasks but are extremely data-hungry [4], [95]. Training and evaluating these models is therefore computationally expensive. Improving the energy efficiency of language models would also lower the barrier of entry for research, as high energy consumption leads to high monetary costs. References [43], [71], [75], [83] mention computation costs as restrictive to their research.
Ensemble learning is our second suggestion for future work. Some models excel in specific contexts while performing poorly in general. References [69], [71] are examples of this as they found that different models performed better on certain bug types in the context of APR. Reference [75] gives an example of how training and evaluating a model for three specific types of cross-PL translations required half as much data to achieve similar performance compared to a model trained and evaluated on four types of translations. Studying ensembles that combine the strengths of different models to improve performance over a variety of cases is a promising direction for future work.
New ways of using abstract syntax tree (AST) representations of source code are our final suggestion for future work. Multiple studies discussed in this review, such as [21], [95], use AST representations of code for their models. Further exploitation of this data structure, which is characteristic of PLs, is recommended for future research. Generalized ASTs over multiple programming languages could lead to greater transfer learning capabilities and models that generalize to multiple languages. Code-specific decoding methods for sequential output models remain unexplored to the best of our knowledge. A decoding method that exploits AST or other syntax information could lead to more efficient and syntactically correct synthesized code.
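As a brief illustration of the structure such methods could exploit, Python's built-in ast module exposes the AST of a code snippet directly; the snippet parsed below is illustrative only.

```python
import ast

source = "def add(a, b):\n    return a + b"
tree = ast.parse(source)

# The dump shows the tree of syntax nodes (FunctionDef, arguments, Return, BinOp, ...)
# that syntax-aware decoding methods could generate node-by-node.
print(ast.dump(tree))
```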

V. THREATS TO VALIDITY
This section first discusses threats to the validity of the search criteria used by this review. Afterward, the threats to the validity of the models proposed by the selected studies are discussed.
One search phrase was used across two databases due to a large number of publications returned. While we believe the search phrase accurately and precisely defines the types of works we aimed to survey, we recognize that it is sensitive to variations in terminology and/or missing keywords. For example, [6] presented an influential code language model, CodeBERT, but never mentions ''machine learning'' even though ML techniques were used by the study. This led to it not being returned by our search. Other examples of influential models that were not retrieved by our searches are PyMT5 [3], Code2Seq [5], and TransCoder [7]. This survey should first and foremost be used as an introduction to the various applications, ML models, tokenizers, data, and evaluation methods used in the various sub-domains of code generation. We encourage readers who want to read more about a topic covered in this review to perform citation searches on the selected publications, or the influential publications mentioned above, to find additional relevant literature.
To investigate whether our chosen search phrase retrieves a disproportionate amount of description-to-code publications, a small experiment was conducted. Table 7 compares retrieval statistics for the original query with publication dates between 2016 and 2021 with a similar query that also includes two additional search phrases. These additional search phrases, ''code modification using machine learning'' and ''code summarization using machine learning'', are more specific to the code-to-code and code-to-description paradigms, respectively. This extended search query retrieved roughly 8% more publications than the query with only the original search phrase. Extrapolated to the number of selected studies, this increase leads to 40 selected studies. We consider this to be a relatively small number of additional studies and therefore conclude that our search phrase adequately covers all three paradigms discussed in this review.
As mentioned in Section IV-F, many of the metrics used to evaluate generated code are not code-specific and rely on comparisons with a ground-truth code statement. Comparing generated code to a ground-truth program is limiting as the space of valid output programs for a given input is generally large. Furthermore, many token match metrics such as BLEU are sensitive to the tokenizer used [125]. This means that results from different studies using these metrics should only be compared if a similar tokenizer is used to tokenize the output before calculating the token match metric. Functional correctness is the best metric to avoid this problem but requires extra data in the form of test programs.

TABLE 7. Statistics for database publication searches using different search phrases and a publication date range between 2016 and 2021. The search phrase ''code generation using machine learning'' is the one applied to identify publications for this review. Adding two additional search phrases, ''code modification using machine learning'' and ''code summarization using machine learning'', using the or logical operator retrieves only slightly more publications.

VI. CONCLUSION
This systematic review selected 37 works published in the last six years in the arXiv [11] and IEEE Xplore [12] databases that proposed ML models for code generation, documentation generation, and code modification. Each publication's application, model, datasets, results, limitations, and proposed future work were summarized. Then, the general findings of these 37 studies were discussed.
The discussion started by introducing the various application categories of the selected studies. The most popular applications of the selected publications include code generation from NL descriptions, documentation generation, APR, and PBE. The popular model types used by these studies, such as RNNs, transformers [2], CNNs, and ML augmented search, were introduced and compared in the context of different applications. RNNs and transformer models were used mostly for code generation given natural language (NL) descriptions as well as for documentation generation. In general, transformer models outperformed RNN models when the two were compared within the same study. CNNs are used for image data but also to augment other models such as transformers.
Different tokenization strategies used by these publications were listed. How some of these strategies handle the out-of-vocabulary problem was also discussed. Effective tokenization processes used by the reviewed publications are subword tokenization, copy mechanisms, and/or multiple <unknown> tokens that capture a particular subset of possible tokens such as variable identifiers.
Limitations in the quantity and quality of data for code generation models were highlighted in respective sections of the discussion. Language models require large datasets, especially when the model is expected to be able to generate code in many different types of contexts. Automatic mining of online sources such as Github [9] is often needed to obtain large enough volumes of data. The quality of automatically mined source code varies greatly. This led two studies [4], [114] to manually create their own test datasets. Automatically generating data is a fast alternative for obtaining data but is only appropriate for certain contexts such as PBE.
The question of how to measure the quality of code was also discussed in a corresponding section. Automatic evaluation by comparing tokens of generated code to tokens from ground-truth references was conducted by 76% of the selected studies. However, most of these token match algorithms are not appropriate for evaluating code. Functional correctness is a viable alternative but requires certain types of data and requires the generated code to be executable. Static analysis does not require the code to be executable but can lead to degeneration when only syntax is considered, and it is time-consuming if performed by humans.
Finally, three promising directions for future work were suggested: (i) improving the efficiency of language models, (ii) ensemble learning for specialized models, and (iii) more research on the possibilities for exploiting abstract syntax tree representations of source code.
SANDIP HALDER received the Ph.D. degree in metallurgy and materials science from RWTH Aachen, in 2006. He joined IMEC, in 2007, as a Research Scientist at the Advanced Materials and Process Department and was responsible for leading the metrology and inspection path-finding activities for the 3D SIC program. In 2013, he moved to the Advanced Patterning Center, IMEC, where he has worked as a Research Scientist and then as the Team Leader for metrology and inspection. Since 2020, he has been the Litho Group Lead within the same department. He has published more than 90 papers and has ten published patents.
STEFAN DE GENDT (Senior Member, IEEE) received the Doctor of Science degree from the University of Antwerp, in 1995. He subsequently was recruited by IMEC, Leuven, Belgium, the world's largest independent research institute in nanoelectronics and technology. He is currently a Full Professor (part-time) at KU Leuven and the Scientific Director of IMEC. Together with his respective teams, he has (co)authored more than 500 peer-reviewed journal publications. His research interests over his 25-year career at IMEC include metrology, semiconductor cleaning and passivation, high-k and metal gate unit process research, and post-CMOS nanotechnology (including nanowires, carbon nanotubes, graphene, and related 2D materials).
WANNES MEERT (Member, IEEE) received the Master of Electrotechnical Engineering degree in microelectronics, the Master of Artificial Intelligence degree, and the Ph.D. degree in computer science from KU Leuven, in 2005, 2006, and 2011, respectively. He is currently an IOF Fellow and a Research Manager at the DTAI Section, Department of CS, KU Leuven. His work is focused on applying machine learning, artificial intelligence, and anomaly detection technology to industrial application domains.