A Neural Approach to Program Synthesis for Tabular Transformation by Example

Data transformation is a laborious and time-consuming task that analysts must perform to prepare data for analysis. Programming by example (PBE) is a technology that eases this troublesome task by automatically generating data-transformation programs from examples. Most past PBE research has been based on search algorithms, but with the recent success of machine learning (ML), PBE research using ML has emerged. For example, RobustFill, an ML-based PBE system for string transformations that uses long short-term memory (LSTM) as its sequential encoder-decoder model, has been proposed. However, there has been no ML-based PBE research on tabular transformations, which are frequently used in data analysis. This paper proposes an ML-based PBE for tabular transformations. First, we address tabular transformations, which are more complex and data-intensive than string transformations, and propose a new ML-based PBE built on the state-of-the-art Transformer sequential encoder-decoder model. To the best of our knowledge, this is the first ML-based PBE for tabular transformations. Next, we propose two decoding methods, multistep beam search and Program Validation (PV)-beam search, which are optimized for program generation and thus generate correct programs with higher accuracy. The evaluation results show that the Transformer-based PBE achieves much higher performance than LSTM-based PBE when applied to tabular transformations, and furthermore, the Transformer-based model with the proposed decoding methods outperforms the conventional PBE using the search-based method.


I. INTRODUCTION
The integration of data from a variety of data sources into a unified format is a time-consuming and labor-intensive task for engineers and domain specialists. This data integration task has long been a high hurdle in data analysis and data utilization. An example of such a data integration task is described below. Suppose a data analyst wants to utilize the semi-structured tabular data shown in Table 1.1 This data needs to be stored in a database so that it can then be analyzed. However, since Table 1 was created without such later data analysis in mind, it needs to be converted into a relational form that can be stored in a database, such as Table 2. This task requires various transformations, which are difficult and troublesome for data analysts who do not have expertise in data transformation.
1 This example is similar to the previously published "Foofah" example [1].
To solve this hurdle in data utilization, various approaches have been adopted. Programming by example (PBE) is one of these approaches. PBE is a type of program synthesis that has been studied for a long time. Program synthesis is a technology that generates programs according to the specification conditions given by the user. PBE generates programs that satisfy the input-output examples given by the user as specification conditions.
As pointed out in [2], many data integration tasks involve string transformations and table layout transformations. (The transformation in the previous example is also a combination of both types of transformations.) Such a transformation is called a tabular transformation in this paper. Our contributions are as follows.
• We adopt the Transformer-based model as the ML model for PBE. Fast training through parallel processing and the expressiveness afforded by its large capacity make it possible to handle tabular transformations, which are more complex and data-intensive than string transformations.
• We propose two decoding methods: multistep beam search, which is suited to finding a consistent program by running beam search multiple times, and PV (Program Validation)-beam search, which realizes efficient beam search by searching only the hypothesis space that is valid as a program.
• We implemented the Transformer-based model with the proposed decoding methods and evaluated its performance in experiments. We empirically show that the Transformer-based model outperforms the LSTM-based model. Furthermore, the Transformer-based model with our proposed decoding methods achieved better performance than the conventional search-based method, which has been the most advanced study of PBE for tabular transformations.
The remainder of this paper is organized as follows. We formally define the problem in Section II and explore related work in Section III. We present our proposal in Section IV, examine its performance on a set of benchmarks used in previous work, and evaluate our ML model in Section V. Finally, we conclude the paper in Section VI.

II. PROBLEM FORMULATION
We now formally define the PBE problem discussed in this paper and the domain-specific language used for tabular data transformation.

A. THE PROBLEM
Given a pair of input-output tabular data (T_i, T_o), the requirement is to synthesize a tabular transformation program P that satisfies T_o = P(T_i). The program P is composed of a sequence of tabular transformation operations (P_1, P_2, . . ., P_n). Each operation P_i is a unit of tabular data transformation. The program P is synthesized by chaining operations so that the following expression is satisfied: T_o = P_n(· · · P_2(P_1(T_i)) · · · ).
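The composition T_o = P(T_i) can be sketched as follows; the two toy operations and the list-of-rows table encoding are illustrative assumptions, not the paper's DSL.

```python
# A minimal sketch of T_o = P(T_i): a program P is a sequence of
# operations (P_1, ..., P_n) applied left to right; the output of one
# operation becomes the input of the next.

def apply_program(table, operations):
    """Apply each operation in order to the table."""
    for op in operations:
        table = op(table)
    return table

# Two toy operations on a table encoded as a list of rows (assumption).
def delete_column(pos):
    return lambda t: [row[:pos] + row[pos + 1:] for row in t]

def swap_rows(i, j):
    def op(t):
        t = [row[:] for row in t]  # copy rows before mutating
        t[i], t[j] = t[j], t[i]
        return t
    return op

program = [delete_column(1), swap_rows(0, 1)]
result = apply_program([["a", "x", "1"], ["b", "y", "2"]], program)
# result == [["b", "2"], ["a", "1"]]
```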

B. PROGRAM COMPONENTS
Each operation is parameterized by arguments that specify, for example, column(s) to be manipulated or delimiters for dividing cells. We use the operators and their arguments listed in Table 3 as the domain-specific language in this work. Our operation list is very similar to that of "Foofah" [1], with the differences highlighted in Table 3.
For example, we can convert the input Table 1 to the output Table 2 by the following sequence of operators:
• Split(":", 1): splits column 1 with the delimiter ":", which converts Table 1 to Table 4.
• Delete(1): deletes rows containing an empty cell at column 1, which converts Table 4 to Table 5.
• Fill(0): fills empty cells at column 0 with the value above, which converts Table 5 to Table 6.
• UnFold(): unfolds the table, which converts Table 6 to Table 2.
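The first and third operators above might look as follows in a minimal sketch; the exact semantics (first-occurrence splitting, top-down filling) are assumptions based on the description, and tables are encoded as lists of rows.

```python
# Illustrative implementations of Split and Fill; the precise
# Foofah-style semantics are assumptions for the sketch.

def split(table, delim, pos):
    """Split column `pos` on the first occurrence of `delim`."""
    out = []
    for row in table:
        cell = row[pos]
        if delim in cell:
            left, right = cell.split(delim, 1)
        else:
            left, right = cell, ""
        out.append(row[:pos] + [left, right] + row[pos + 1:])
    return out

def fill(table, pos):
    """Fill empty cells in column `pos` with the value above."""
    out, last = [], ""
    for row in table:
        row = row[:]
        if row[pos] == "":
            row[pos] = last
        else:
            last = row[pos]
        out.append(row)
    return out

t = [["Alice", "Tel: 03-1234"], ["", "Fax: 03-5678"]]
t = fill(split(t, ": ", 1), 0)
# t == [["Alice", "Tel", "03-1234"], ["Alice", "Fax", "03-5678"]]
```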

III. RELATED WORK A. SEARCH-BASED PBE
PBE has been applied to a variety of domains including data integration. FlashFill [5] searches the space for a program that matches the example efficiently using a pruning technique called Version Space Algebra.
TABLE 3. Operators in our domain-specific language (the first operator name was lost in extraction):
• …(pos): duplicates a column to the following position
• CopyToDest(pos, pos): duplicates a column to another position (newly added)
• Fill(pos): fills empty cells in a column with the one above
• Split(delim|delimregex, pos): splits a column into two columns at the occurrence of a delimiter or the match of a regex
• MergeToOne(delim, pos): merges two or more columns after a position by a delimiter
• Divide(predicate, pos): divides a column into two columns whereby one satisfies the predicate and the other does not
• Extract(regex, pos): extracts the first match at a column with addition of the matched column (newly added)
• Fold(pos): folds all the columns after the specified column into one column
• Unfold(): expands the folded columns of the table
In addition, ProgFromEx [6], Foofah [1] and FlashRelate [7] are PBE studies that include layout transformation among the data integration tasks they address. They also use search algorithms for program generation. For example, Foofah uses the A* algorithm, a graph search algorithm, to make the search process more efficient. Thus, traditional PBE studies advanced by devising search algorithms to make the search more efficient.

B. ML-BASED PBE
Since neural methods are expected to scale a model simply by increasing the amount of training data, without the need for laborious technical development, research on realizing PBE using neural methods has attracted much attention in recent years [3], [8], [9], [10]. RobustFill [3] is a foundational study of PBE using neural methods and serves as a basis for many other studies. It achieves string-transformation PBE by leveraging the LSTM model, a well-known sequential encoder-decoder model. However, to the best of our knowledge, no neural method for tabular-transformation PBE has been proposed and implemented.

C. BEAM SEARCH
Beam search has been studied for a long time, especially in recent years in natural-language-processing (NLP) neural models [11], [12], [13], [14], [15]. Iterative beam search [15] focuses on generating diverse sequences for tasks such as dialog response generation and is an example of research on improving beam search to make it suitable for applications.
In this paper, we propose variants of beam search, namely multistep beam search and PV-beam search, which are beam search methods optimized for program generation.
FIGURE 1. System overview in our approach
Fig. 1 gives an overview of the system architecture used in our approach. The processing in our system can be split into a training phase and an inference phase.

IV. OUR APPROACH
A. SYSTEM OVERVIEW
In the training phase, the neural model is trained using training datasets, which are composed of pairs of input-output tables and the corresponding programs.
In the inference phase, given an example input-output table pair, the previously trained model generates candidate programs, and eventually provides a final program consistent with the example.

B. TABULAR DATA LINEARIZATION
We linearize tabular data into a serialized structure in order to feed it into the sequential encoder-decoder model. Fig. 2 shows an example of this processing. First, each string in each cell of the input table is separated into character-level tokens, with the special tokens <eoc> and <eol> representing the boundaries of columns and rows respectively, and linearized into a sequence of tokens. In addition, in the Transformer-based model (see IV-D), we concatenate the input table and output table by introducing the special token <sep> to represent the separation point between the two tables.
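The linearization described above can be sketched as follows; the token names <eoc>, <eol> and <sep> are from the text, while everything else is illustrative.

```python
# Character-level linearization of a table with column/row separators,
# and concatenation of the input and output tables with <sep>.

def linearize(table):
    tokens = []
    for row in table:
        for c, cell in enumerate(row):
            tokens.extend(list(cell))          # one token per character
            if c < len(row) - 1:
                tokens.append("<eoc>")         # column boundary
        tokens.append("<eol>")                 # row boundary
    return tokens

def linearize_pair(t_in, t_out):
    return linearize(t_in) + ["<sep>"] + linearize(t_out)

print(linearize([["Name", "No"], ["Al", "3"]]))
# ['N','a','m','e','<eoc>','N','o','<eol>','A','l','<eoc>','3','<eol>']
```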

C. PROGRAM LINEARIZATION
A program is also linearized into a sequence of operation tokens. An operation token is constructed from the Operators, its numeric parameters and its non-numeric parameters listed in Table 3 according to the tokenization rules. We have the following two options for tokenization rules.

a: Separated Token Rule
Under the separated token rule, a combination of operator and non-numeric parameter is represented by one token, and each numeric parameter is represented by its own token.
The vocabulary size of operation tokens under the separated token rule is 100, including the special tokens. The separated token rule is the default rule in our experimental settings.

b: Combined Token Rule
Under the combined token rule, a combination of operator, non-numeric parameter and numeric parameter is represented by a single token.
The vocabulary size of operation tokens under the combined token rule is 2148, including the special tokens.
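The two tokenization rules might be sketched as follows for an operation such as Split(":", 1); the exact token spellings are assumptions, but the separated rule emits one operator+non-numeric token plus one token per numeric parameter, while the combined rule fuses everything into a single token (hence the much larger vocabulary).

```python
# Hypothetical tokenizations of an operation under the two rules.

def separated_tokens(op, non_numeric, numeric):
    """Operator + non-numeric parameter as one token; numerics separate."""
    head = f"{op}({non_numeric})" if non_numeric is not None else op
    return [head] + [str(n) for n in numeric]

def combined_tokens(op, non_numeric, numeric):
    """Operator and all parameters fused into a single token."""
    args = ([non_numeric] if non_numeric is not None else [])
    args += [str(n) for n in numeric]
    return [f"{op}({','.join(args)})"]

print(separated_tokens("Split", ":", [1]))  # ['Split(:)', '1']
print(combined_tokens("Split", ":", [1]))   # ['Split(:,1)']
```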

D. THE TRANSFORMER-BASED ENCODER-DECODER MODEL
"Transformer" is a state-of-the-art ML model for NLP tasks [16]. Its architecture is based on attention mechanisms and achieves high performance with a high degree of parallelization and short training times, realized by dispensing with recurrent and convolutional processing entirely. Fig. 3 illustrates our Transformer-based model.

In order to feed a pair of input-output tables and a program into the Transformer model, they are linearized into token sequences as described in IV-B and IV-C, respectively. Let t be the token sequence of the input-output tables and o be the token sequence of the program.
The Transformer embedding layers embed t and o and add the positional encoding PE.
Let x^i = (x^i_0, x^i_1, · · · , x^i_n) be the output of the i-th layer of the Transformer encoder, where x^i_j ∈ R^{d_emb} and d_emb is the embedding dimension size. Similarly, let y^i = (y^i_0, y^i_1, · · · , y^i_m) be the output of the i-th layer of the Transformer decoder, where y^i_j ∈ R^{d_emb}. The Transformer encoder and decoder are structured as stacks of layers, whose sublayers perform the computations described in the following expressions.
When all computations in the encoder and decoder sublayers have terminated, the output y^last of the last decoder sublayer is obtained. The linear and softmax layers at the end of the Transformer-based model (see Fig. 3) compute the program probabilities from y^last.
For a set T_io of token indexes for the input-output tables and the token sequence length n of the linearized input-output tables, we denote the token sequence of the input-output tables as an integer vector t = (t_0, t_1, · · · , t_n) where t_i ∈ T_io. For a set O of token indexes for the program and the token sequence length m of the program sequence, we denote the token sequence of a program as an integer vector o = (o_0, o_1, · · · , o_m) where o_i ∈ O. Now, the program probabilities can be represented as a sequence p = (p_0, p_1, · · · , p_m), where p_i ∈ R^{|O|}. The variable p_i = P(z | o_{<i}, t) represents the conditional probability distribution of the discrete random variable z ∈ O at position i when (o_0, o_1, · · · , o_{i−1}) and t are given.
The left part of Fig. 4 shows the decoding process during the training phase. The program probabilities p are output after feeding the token sequence t of the input-output tables into the encoder layers and the program token sequence (shifted by one token, beginning with <sos>) into the decoder layers.
We can compute the cross-entropy loss from p and (the one-hot vector of) o, and then proceed to training with the aim of reducing the loss value.
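The training objective can be sketched as follows; `p` holds the predicted distributions and `o` the ground-truth token indexes (a pure-Python illustration, not the PyTorch implementation).

```python
# Cross-entropy between predicted distributions p_i and the one-hot
# ground-truth tokens o_i, averaged over positions: picking the
# probability of the true token is equivalent to the one-hot dot product.
import math

def cross_entropy(p, o):
    """p: list of probability distributions; o: list of true token indexes."""
    return -sum(math.log(p_i[o_i]) for p_i, o_i in zip(p, o)) / len(o)

p = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
o = [0, 1]
loss = cross_entropy(p, o)  # -(log 0.7 + log 0.8) / 2 ≈ 0.2899
```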
The right part of Fig. 4 shows the decoding process during the inference phase. In this phase, beam search processes the decoding via autoregression and generates the candidates of the programs. The basic beam search processing is as follows.

FIGURE 4. Decoding process in the encoder-decoder model. (The encoder layers receive the token sequence t of the input-output tables; the decoder layers receive the program token sequence beginning with <sos>.)
In the first step, given the token <sos> at the decoder layer and the token sequence t at the encoder layer, the decoder outputs the conditional probability p_0 = P(z | t) at position 0. Beam search with a beam size of 1 selects the token z_0 ∈ O having the highest probability, which is given to the decoder at the next position.
In the next step, given the token sequence t at the encoder layer and the tokens <sos> and z_0 at the decoder layer, the decoder outputs the conditional probability p_1 = P(z | z_0, t) at position 1.
Again, beam search determines the token z_1 ∈ O having the highest score given by Equation 1 over the tokens <sos>, z_0 and z_1, which is given to the decoder at the next position. In this way, the token z_t with the highest score is found at each position t.
The beam search terminates when the end-of-sequence token <eos> has the highest score at some position t < m, finally yielding one candidate z = (z_0, z_1, · · · , z_{t−1}).
The procedure of beam search with a beam size of K > 0 is described in Algorithm 1.
The beam search maintains K candidate token sequences H^cand_t (each token sequence is called a hypothesis h) at each position t. At each position t, each hypothesis h in the previous candidates H^cand_{t−1} is extended with each token z ∈ O (line 8 in Algorithm 1), and the scores for the extended hypotheses h are given by Equation 1 (line 9 in Algorithm 1).
The hypotheses are ranked by score and the top-K hypotheses are kept in the candidates H^cand_t (lines 10-22 in Algorithm 1). When the <eos> token is selected as a candidate, the value of K is reduced by one and the candidate is placed in the "final-candidate" list H_final (lines 16-18 in Algorithm 1).
When the beam size K reaches zero, the search stops with a final-candidate list containing the original K candidates (lines 19-20 in Algorithm 1).
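Algorithm 1 can be sketched as follows, under the assumption that the score of Equation 1 is the summed log-probability of the tokens; `step_logprobs` stands in for the decoder, and the toy model at the end is purely illustrative.

```python
# Basic beam search: extend every hypothesis with every token, keep the
# top-K, retire finished hypotheses (<eos>) to the final list while
# shrinking the beam, and stop when the beam size reaches zero.
import math

def beam_search(step_logprobs, vocab, K, max_len, eos="<eos>"):
    beam = [([], 0.0)]                 # (hypothesis, score) pairs
    final = []
    for _ in range(max_len):
        scored = []
        for hyp, score in beam:        # extend each hypothesis (line 8)
            for z, lp in zip(vocab, step_logprobs(hyp)):
                scored.append((hyp + [z], score + lp))
        scored.sort(key=lambda x: x[1], reverse=True)
        beam = []
        for hyp, score in scored[:K]:  # keep the top-K hypotheses
            if hyp[-1] == eos:         # finished: move to final, shrink K
                final.append((hyp[:-1], score))
                K -= 1
            else:
                beam.append((hyp, score))
        if K <= 0 or not beam:
            break
    return sorted(final, key=lambda x: x[1], reverse=True)

# Toy next-token model over a three-token vocabulary (illustration only).
vocab = ["a", "b", "<eos>"]
def toy_model(hyp):
    if not hyp:
        probs = [0.6, 0.25, 0.15]
    elif hyp[-1] == "a":
        probs = [0.1, 0.2, 0.7]
    else:
        probs = [0.3, 0.3, 0.4]
    return [math.log(p) for p in probs]

results = beam_search(toy_model, vocab, K=2, max_len=5)
```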

Finally, we obtain the program by selecting a hypothesis from the final-candidate list and converting the hypothesis into program form.
Not all of the programs among the candidates are consistent with the input-output tables.
Therefore, in order to select a consistent program, each program is picked from the final-candidate list one by one and checked for consistency with the input-output tables.
Once the consistency check on a program succeeds, that program is the solution for the input-output tables. If no successful program is found in the final-candidate list, the search is regarded as having failed to find a solution for the input-output tables.

E. DECODING METHODS
Focusing on the fact that the PBE task is a search problem of finding consistent programs, in this section we propose two methods that enhance the basic beam search for the program generation task.

a: Multistep Beam Search
In order to keep the candidates of the intermediate tables and handle them efficiently, we maintain a directed tree structure T = (N, E) that comprises a set of nodes N = {n_i | i ∈ Z} and a set of directed edges E = {e_i | i ∈ Z} (see Fig. 5).
This tree structure represents the space already explored by the multistep beam search. Each node n_i represents a table and each directed edge e_i represents an operation that transforms one table into another. T is a tree structure that has the following features.
• The root node of T is the node n_input representing the given input table.
• A child node n_child is the result of applying a directed edge e_i to a parent node n_parent. We also denote this relation as n_child = e_i(n_parent).
• No two nodes in N represent identical tables.
First, the multistep beam search "picks up" a node n_pick from T as the origin node of a beam search (see Line 3 in Algorithm 2). In the first step, n_input is picked up as n_pick, because T contains only the node n_input.
Next, it invokes a basic beam search with the origin node n_pick, the goal node n_output (namely, the output table), and beam size K, returning the set of hypotheses H = ∪_{i=0}^{K−1} H_i (see Line 4 in Algorithm 2).
If the application of an operation to a node fails, all subsequent processing in that operation sequence also fails (denoted by null). The null entries are then deleted from N′ and E′.
If some n ∈ N ′ is identical to n output , the corresponding operation sequence from n input to n output is considered the desired program P (see Line 8 in Algorithm 2).
To merge the sequence S into the directed tree T, the first node of S is grafted onto n_pick of T (see Line 9 in Algorithm 2 and Fig. 6). It then prunes the edges and nodes that reach identical nodes between T and S and have longer paths from the root node n_input of T (see Line 10 in Algorithm 2 and Fig. 7). This pruning lets us avoid identical nodes in T, keeps T simple, and makes it efficient to run operations over all of T. This merging process is applied to all hypotheses, and the procedure then proceeds to the next pick-up (see Line 3 in Algorithm 2).
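The pruning step can be sketched as follows, assuming each table can be compared via a hashable key; the representation is illustrative.

```python
# When two nodes represent the same table, keep only the one with the
# shorter path from the input node, so the tree never holds duplicates.

def prune(nodes):
    """nodes: list of (table_key, depth); keep the shallowest per table."""
    best = {}
    for key, depth in nodes:
        if key not in best or depth < best[key]:
            best[key] = depth
    return best

kept = prune([("T_a", 2), ("T_b", 3), ("T_a", 1)])
# kept == {"T_a": 1, "T_b": 3}: the deeper duplicate of T_a is pruned
```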
Pick-up processing can be based on either of two strategies, namely a shortest-path strategy or a longest-path strategy. Using a shortest-path strategy, the node with the shortest path from n input is picked up, whereas the node with the longest path is picked up under a longest-path strategy.
After a node is picked up, it is marked as done and is never picked up again in subsequent processing. This procedure repeats until a program consistent with the given input-output table is generated or a timeout occurs.
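The pick-up step under the two strategies can be sketched as follows; the node bookkeeping is illustrative.

```python
# Pick a not-yet-done node by its path length from the input node:
# shallowest first (shortest-path) or deepest first (longest-path),
# then mark it done so it is never picked again.

def pick_node(nodes, depths, done, strategy="shortest"):
    candidates = [n for n in nodes if n not in done]
    if not candidates:
        return None
    key = lambda n: depths[n]
    n = min(candidates, key=key) if strategy == "shortest" \
        else max(candidates, key=key)
    done.add(n)
    return n

depths = {"input": 0, "t1": 1, "t2": 2}
done = set()
order = [pick_node(list(depths), depths, done) for _ in range(3)]
# order == ["input", "t1", "t2"] under the shortest-path strategy
```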

b: PV-Beam Search
Here we propose a beam search method called PV-beam search. Basic beam search keeps candidates that include programs that cannot produce any output table and can never become correct programs. In order to improve the efficiency of the basic beam search procedure, we propose PV-beam search, which removes invalid hypotheses in the inner loop of the beam search procedure and keeps only valid hypotheses.
The PV-beam search is described in Algorithm 3. It differs from basic beam search at lines 16-19 in Algorithm 3.
It synthesizes a program P from a hypothesis h and checks whether the program can transform the input table into any table at all; if not, it removes the corresponding hypothesis from the candidates H^cand_t. It therefore searches only the valid hypothesis space and finds a correct program efficiently.
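The validity filter can be sketched as follows; the operation semantics in `apply_op` and the convention of returning None on failure are illustrative assumptions.

```python
# Drop every hypothesis whose program fails to transform the input
# table into any table; only applicable programs stay in the beam.

def pv_filter(hypotheses, input_table, apply_op):
    valid = []
    for hyp in hypotheses:
        table, ok = input_table, True
        for op in hyp:
            table = apply_op(op, table)
            if table is None:      # operation failed -> invalid program
                ok = False
                break
        if ok:
            valid.append(hyp)
    return valid

# Toy semantics: "del0" deletes column 0 and fails once rows are empty.
def apply_op(op, t):
    if op == "del0":
        if any(len(r) == 0 for r in t):
            return None
        return [r[1:] for r in t]
    return t

hyps = [["del0"], ["del0", "del0"], ["del0", "del0", "del0"]]
print(pv_filter(hyps, [["a", "b"]], apply_op))
# only hypotheses whose program applies cleanly survive
```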

V. EVALUATION
A. EXPERIMENTAL SETTINGS
1) Evaluation Metrics
There are two metrics for evaluating the performance of PBE systems [3], namely consistency and generalization. Consistency measures whether any generated program is consistent with a user-given example. If some generated program is consistent with the example, the PBE system is regarded as successful in terms of the consistency test.
Generalization measures whether a program generated from an example is consistent with the original table from which the example was derived. If a program generated from the example can transform the original table into the user's intended table, the program is regarded as generalizable with respect to the original table. Consistency is the more important metric and is mainly used to evaluate our methods in V-B. We define accuracy as the proportion of benchmark tests for which the PBE system succeeds in terms of consistency.
Although generalization is also an important metric, with several past works [17], [18], [19] focusing on it in particular, we focus on consistency in our evaluations, as in [4].

2) Training Datasets
Preparing a large-scale training dataset for a PBE ML model is quite difficult, because such a dataset must comprise many input-output pairs of tabular data with corresponding transformation programs. We therefore synthesized the training datasets, as in previous works [3], [20], [21].
In order to construct the training data, we employed the random sampling and generation methodology shown in Fig. 8. An input table and a program are randomly synthesized from characters, operations, and other elements sampled uniformly at random. The input table is synthesized from ASCII letters, numbers, and punctuation characters. The string length in a cell and the numbers of rows and columns of the table are also chosen randomly, up to predefined maximum values. In our experiments, these maximum values for strings, rows, and columns were set to 10, 5, and 5, respectively.
Each program is a sequence of tokens generated from operators and arguments randomly sampled from the items in Table 3. The length of each program is bounded by a predefined maximum size, which is parameterized from 3 to 8 with a default of 6 in our experiments.
An output table is generated by transforming the synthesized input table using the synthesized program. Because a program might fail when applied to the input table, program synthesis is repeated until the synthesized program applies to the input table without failure.
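The synthesis loop can be sketched as a rejection-sampling procedure; the generator details and the None-on-failure convention are assumptions for the sketch.

```python
# Draw a random input table, then re-draw random programs until one
# applies to the table without failing; the triple (input, program,
# output) becomes one training example.
import random
import string

def random_table(max_str=10, max_rows=5, max_cols=5, rng=random):
    rows = rng.randint(1, max_rows)
    cols = rng.randint(1, max_cols)
    alphabet = string.ascii_letters + string.digits + string.punctuation
    def cell():
        return "".join(rng.choice(alphabet)
                       for _ in range(rng.randint(1, max_str)))
    return [[cell() for _ in range(cols)] for _ in range(rows)]

def make_example(sample_program, apply_program, rng=random):
    table = random_table(rng=rng)
    while True:                    # retry until the program applies
        program = sample_program(rng)
        output = apply_program(table, program)
        if output is not None:     # None signals a failed application
            return table, program, output
```

In the paper's setting, `sample_program` would draw operator sequences from Table 3 and `apply_program` would execute them on the table.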

3) Benchmark Datasets
We evaluated our models using the evaluation benchmark proposed and used in Foofah [1] 2. This benchmark is composed of 250 tests gathered from previous works and constructed to evaluate tabular-data transformation tasks. Examples from this benchmark are shown in Fig. 9. We divided the benchmark into two parts according to table size: the "small" benchmark contains the tests whose tables have fewer than 5 rows and fewer than 5 columns, and the remaining tests are referred to as "large".
The "small" benchmark comprises 73 tests, as opposed to 177 for the "large" benchmark.

4) Hyperparameters and Settings
The hyperparameters are given in Table 7. The experimental results in various settings are listed in Table 8.

5) Hardware and Software Settings
We trained and evaluated our model using one GPU slot of an NVIDIA Tesla V100-PCIE GPU with 32 GB of memory. An Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40 GHz with 132 GB of memory was used to run the search-based PBE system (based on Foofah). We developed and evaluated both types of neural models using the PyTorch library [22] 3. We adopted the LSTM-based model as a tabular-transformation PBE baseline to compare with our neural model, using the tabular data linearization and program linearization described in IV-B and IV-C respectively. See Fig. 10 for an overview of the model. We implemented a search-based baseline system following Foofah, using A*-graph search as the core search algorithm with the table-edit-distance batch heuristic function and the same pruning rules as Foofah. Its operators are listed in Table 3 and differ slightly from Foofah's.

B. ANALYSIS OF OUR NEURAL NETWORK MODELS
In this subsection, we present and analyze the experimental results from our proposed models and the baselines. All experimental results are listed in Table 8 and are referred to by Experiment number in the following discussions.

1) Comparing the LSTM-based Model with the Transformer-based Model
We compared a baseline (LSTM-based) model and our Transformer-based models with 2 and 6 layers in terms of the learning time required to train them. Table 9 gives the training times for each model.
Note that the LSTM-based model took much longer to train per epoch on average, despite having far fewer parameters than the Transformer-based models. Table 8 gives the accuracy on the benchmarks for the LSTM-based model and both Transformer-based models (2-layer and 6-layer). The Transformer-based models outperform the LSTM-based model on both large and small benchmarks.
The reason might be that the Transformer-based model has a greater capacity to learn a statistical distribution and captures the features of the training datasets much more quickly than the LSTM-based model.
While the Transformer-based model might perform even better when trained on a much larger variety of datasets, as described in the following experiments, the LSTM-based model cannot feasibly be trained further due to its much longer training time.

2) Comparing Models Trained on Various Numbers of Samples
We examined the performance for training datasets of various sizes. The corresponding Experiment numbers are 2-13 in Table 8. Fig. 12 gives these experimental results as a bar plot. We note that, in general, as the number of training samples increases, the accuracy on both large and small datasets also increases.
A second tendency is that the 6-layer Transformer-based model shows higher accuracy with large-scale training data (1M, 2M, 4M, 10M) than the 2-layer model, with a few exceptions. Because increasing the size of the training dataset enables the models to learn a larger variety of tabular data and achieve more expressive ability, we can hypothesize that the model with the larger capacity (the 6-layer model) learns expressive representations more effectively.

3) Comparing Models for a Variety of Program Lengths
We evaluated the accuracy with respect to program length. The corresponding experimental results are Experiment numbers 14-21 in Table 8. Fig. 13 shows these experimental results in bar-plot form.
Although we observe no significant differences across program lengths, the accuracy of the 6-layer model is always higher than that of the 2-layer model for program lengths over 6. This also indicates that the 6-layer model has sufficient capacity to train on datasets containing long programs (as was also found in Section V-B2).
We use 10M training samples and a program length of 6 for the training data in the following experiments, because these values showed nearly the best performance in V-B2 and V-B3.

4) Comparing Various Beam Search Methods
We compare the accuracy of our proposed multistep beam search and PV-beam search with basic beam search and iterative beam search.
We implemented iterative beam search with reference to [15] for comparison with our proposed decoding methods. Iterative beam search repeats the basic beam search while avoiding the hypotheses seen in previous iterations, eventually searching a wider space more diversely.
We set the beam size from 1 to 1000 and a timeout to 30 seconds in our experiments. Each experiment aborts when the timeout occurs. In the experiments of the following sections, the combined token rule described in IV-C is employed as a program linearization method.
These experimental results are given as Experiment numbers 22-49 in Table 8, and Fig. 14 plots them.
The performance of basic beam search improves as the beam size increases, because a larger beam reduces the chance that correct hypotheses are pruned.
Iterative beam search outperforms basic beam search at small beam sizes because it repeats the basic beam search and thereby explores a more extensive search space. However, it does no better than basic beam search at beam sizes over 50, because the second iteration takes more than 30 seconds owing to the time-consuming check of whether each hypothesis was explored in a previous iteration.
We report the results of multistep beam search with the shortest-path strategy, because we observed that its performance is slightly better than or equal to that of the longest-path strategy in almost all experimental settings. The performance of multistep beam search improves as the beam size increases, peaks at a beam size of 500, and eventually outperforms both iterative and basic beam search.
Since multistep beam search repeats the basic beam search with different starting points until it finds a consistent program, it succeeds whenever basic beam search does. Its lightweight computation allows it to outperform basic beam search at every beam size, whereas iterative beam search fails to do so at beam sizes over 50. PV-beam search outperforms the other beam search methods at beam sizes over 50 in terms of accuracy on all data, achieving the best overall accuracy.
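To illustrate the consistency check that both proposed methods rely on, the sketch below executes each complete candidate program on the example input table and keeps only the candidates that reproduce the example output. This is a toy formulation of the idea only: the mini-DSL, the operator names, and the scored candidate list are all hypothetical and far simpler than the paper's actual language and algorithms.

```python
# Toy illustration of validating candidate programs against an I/O example.
# The mini-DSL below (transpose / drop_col / drop_row) is hypothetical.

def run_program(program, table):
    """Execute a tiny table-transformation DSL on a list-of-lists table."""
    for op in program:
        if op == "transpose":
            table = [list(row) for row in zip(*table)]
        elif op.startswith("drop_col "):
            j = int(op.split()[1])
            table = [row[:j] + row[j + 1:] for row in table]
        elif op.startswith("drop_row "):
            i = int(op.split()[1])
            table = table[:i] + table[i + 1:]
        else:
            raise ValueError(f"unknown op: {op}")
    return table

def validate(candidates, example_in, example_out):
    """Keep only candidate programs consistent with the I/O example."""
    consistent = []
    for program, score in candidates:
        try:
            if run_program(program, example_in) == example_out:
                consistent.append((program, score))
        except (ValueError, IndexError):
            pass  # malformed candidates are simply discarded
    return consistent

# Hypothetical beam output: scored candidate programs, best-scored first.
candidates = [
    (["transpose"], -0.3),
    (["drop_col 0"], -0.9),
    (["drop_row 0"], -1.2),
]
ex_in = [["name", "x"], ["a", "1"], ["b", "2"]]
ex_out = [["x"], ["1"], ["2"]]

consistent = validate(candidates, ex_in, ex_out)
# Only the program consistent with the example survives:
# consistent == [(["drop_col 0"], -0.9)]
```

Note that the highest-scoring candidate under the model is not necessarily consistent with the example; filtering by execution is what lets the decoder return a correct program rather than merely a likely one.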

5) Comparing the Proposed Model with Baselines
We show the experimental results of the two baseline systems (the LSTM-based system and the search-based system) and the Transformer model with our proposed decoding methods (multistep beam search and PV-beam search). The corresponding results are given in Table 8 and summarized in Table 11. The LSTM-based system shows the worst performance on both small and large data. As described in Section V-B1, this is mainly because the LSTM-based system lacks sufficient expressiveness for tabular transformation and cannot be trained on large-scale training data; the Transformer model therefore outperforms it considerably. The search-based system, which is to the best of our knowledge the best existing PBE system for tabular transformation, performs independently of data size and achieves good performance even on large data. While the Transformer model outperformed the search-based system on small data, it had been difficult for the Transformer-based model to do so on large data. However, the Transformer-based system with our proposed PV-beam search (beam=500) achieved better accuracy than the search-based system even on large data, owing to the efficient program generation of PV-beam search.
Finally, we compare our Transformer-based systems with the baseline systems in terms of response performance. Fig. 15 compares the response times of Experiment numbers 1, 40, 47, and 50 from Table 8, namely the two baseline systems and the Transformer models with the proposed decoding methods at a beam size of 100. The horizontal axis shows the response time from the start of inference, and the vertical axis shows the corresponding accuracy at that time.
Every panel shows that the Transformer models outperform the LSTM-based system on every benchmark dataset from the start to the timeout (30 seconds).
The top panel shows that the Transformer-based models outperform the search-based system on the small benchmark data in terms of accuracy from start to timeout.
The center panel shows that the Transformer-based model with PV-beam search outperforms the search-based system except for a few seconds after the start and eventually ends almost even with it, and that the Transformer-based model with multistep beam search responds faster than the other systems for the first few seconds.
The bottom panel shows that the Transformer-based model with PV-beam search outperforms the search-based system from start to timeout, and that the Transformer-based model with multistep beam search has a much faster response than the other systems.
We observed in our experiments that the Transformer model with PV-beam search at beam sizes of 500 and 1000 (Experiment numbers 48 and 49) loses to the search-based method for a few seconds after the start because of the high computational cost of the program validation process (lines 16-18 in Algorithm 3). We therefore conclude that the Transformer-based model with PV-beam search at a beam size of 100 is the best system when balancing final accuracy and response time.
If final accuracy is preferred, the Transformer-based model with PV-beam search at a beam size of 500 would be better; if a quick response is more desirable, the Transformer-based model with multistep beam search would be the better choice.

VI. CONCLUSION AND FUTURE WORK
In this paper, we proposed a new ML approach to realizing PBE for tabular transformations, which are required in data analysis and data integration scenarios.
To handle tabular transformations, which are complex and data-intensive, we proposed an ML-based PBE using the Transformer model, which is faster and more expressive than the LSTM model (baseline) used in previous ML-based PBE research.
In addition, we proposed multistep beam search and PV-beam search, which optimize the conventional beam search for program generation, and implemented them on top of the Transformer model. We found that PV-beam search far surpasses the baselines in both accuracy and response performance, and that multistep beam search achieves a much shorter response time than the search-based system.
Overall, our proposed systems achieve superior performance over the baselines and provide a high-performance, scalable ML model for tabular transformation PBE, enabling users to perform table transformations easily and effortlessly.
As future work, we plan to study an embedding method for 2D tabular data using a sophisticated approach such as that proposed in Tapas [24], which would improve on the simple linearization method used in this paper.