SymFormer: End-to-End Symbolic Regression Using Transformer-Based Architecture

Many real-world systems can be naturally described by mathematical formulas. The task of automatically constructing formulas to fit observed data is called symbolic regression. Evolutionary methods such as genetic programming have been commonly used to solve symbolic regression tasks, but they have significant drawbacks, such as high computational complexity. Recently, neural networks have been applied to symbolic regression, among which the transformer-based methods seem to be most promising. After training a transformer on a large number of formulas, the actual inference, i.e., finding a formula for new, unseen data, is very fast (in the order of seconds). This is considerably faster than state-of-the-art evolutionary methods. The main drawback of transformers is that they generate formulas without numerical constants, which have to be optimized separately, yielding suboptimal results. We propose a transformer-based approach called SymFormer, which predicts the formula by outputting the symbols and the constants simultaneously. This helps to generate formulas that fit the data more accurately. In addition, the constants provided by SymFormer serve as a good starting point for subsequent tuning via gradient descent to further improve the model accuracy. We show on several benchmarks that SymFormer outperforms state-of-the-art methods while having faster inference.


Introduction
Many natural processes and technical systems can be described by mathematical formulas. Knowing the correct formula would not only provide us with some insight into the process's inner workings, but it would also allow us to predict how the process will evolve in the future. Therefore, being able to automatically derive a simple formula that fits the observed data would have a tremendous impact on applications in all areas of science. The task of finding such a formula from the observed data is called symbolic regression (SR) and has already been applied to a variety of real-world problems, e.g., in physics [Wadekar et al., 2022, Matchev et al., 2021], robotics [Kubalík et al., 2019, Hein et al., 2017], or machine learning [Wilstrup and Kasak, 2021].
Historically, symbolic regression [Schmidt and Lipson, 2009b, Kubalík et al., 2020] was mostly tackled by means of genetic programming [Koza, 1992, Schmidt and Lipson, 2009a, Staelens et al., 2013, Arnaldo et al., 2015, Błądek and Krawiec, 2019]. Unfortunately, these methods have to be carefully designed for each problem. Also, predicting formulas is slow and computationally expensive: for a single problem, an entire population of formulas has to be evolved and evaluated repeatedly through many generations. In recent years, neural approaches emerged [Petersen, 2019, Mundhenk et al., 2021] to tackle these problems. They are trained end-to-end with the sampled points as the input and the symbolic representation of the formula as the output. However, these methods are still trained from scratch for each formula. Although they work well for simple formulas, they are impractically slow and inefficient for more complex formulas due to their reliance on reinforcement learning. To increase the efficiency, fully supervised approaches were proposed [Biggio et al., 2021, Valipour et al., 2021] that train a transformer model on a large collection of formulas. The formula is autoregressively generated by predicting each symbol conditioned on previously decoded symbols. The generated expression is decoded without constants, i.e., all constants are replaced by a special symbol, and the concrete values are found using global optimization. We argue that the concrete values of constants have a large impact on the generated function, and without predicting them, the model will not learn to represent the data well. For example, we can imagine a simple model that always returns a sum of sine and cosine functions with increasingly higher frequencies. Since a wide class of functions can be approximated in the Fourier basis, such a model could represent most functions with low error just by changing the constants, without its symbolic output carrying any information about the data.
Inspired by d'Ascoli et al. [2022], where a similar idea was applied to the problem of recurrent sequences, we propose a novel approach called the SymFormer that generates the concrete values for constants alongside the symbolic representation of the formula. More specifically, we introduce the following contributions:
• We design a transformer-based architecture trained end-to-end on a large dataset consisting of hundreds of millions of formulas.
• The model generates both the symbolic representation of the formula and the concrete values for all constants at the same time. This allows the symbolic decoder to condition on the generated constants and it improves the quality of the symbolic representation.
• We also use the generated constants to initialize the local gradient search to fine-tune the final constants effectively and reliably.
• Our approach was thoroughly evaluated and compared to relevant methods. Also, we validate our design choices in an ablation study.
• The source code and the pre-trained model checkpoints are publicly available.

Related work
Genetic Programming approaches are a traditional way of solving SR [Koza, 1992]. Genetic programming evolves expressions encoded as trees using selection, crossover, and mutation. A limitation of the genetic algorithm-based approaches is that they are sensitive to the choice of hyperparameters [Petersen, 2019]. They need to evolve each equation from scratch, which is slow, and the models tend to increase in complexity without much performance improvement. It is also problematic to tune expression constants only by using genetic operators.
Neural Network approaches can be generally divided into three categories. The first one consists of approaches based on the Equation learner (EQL) [Martius and Lampert, 2016, Sahoo et al., 2018, Werner et al., 2021]. The idea behind EQL is to find a function f(x) = y by training a neural network on x as input and y values as output while using as few network weights as possible. Elementary functions (sin, log, ...) are used as activation functions, and after the training, they are read from the network with the corresponding weights. The limitations of such an approach are that it requires special handling of functions that are not defined over the whole R (e.g., log) and that the depth of the network limits the complexity of the predicted equation. Lastly, these approaches can be slow since they need to find each equation from scratch.
The second approach is based on training a recurrent neural network (RNN) using reinforcement learning [Petersen, 2019]. The idea is to let the RNN generate the equation and then calculate the reward function as an error between the ground-truth f(x) values and the values from the predicted function $\hat{f}(x)$. An interesting extension is proposed by Mundhenk et al. [2021], where they sample from the RNN, and the output is then taken as an initial population for a genetic algorithm. The limitation of both of these approaches is that the model does not predict the constants, and therefore they have to be found through global optimization in postprocessing steps, which slows down the whole training loop [Mundhenk et al., 2021, Petersen, 2019].
The transformer-based approach is proposed by Valipour et al. [2021], Biggio et al. [2021], and d'Ascoli et al. [2022], where they first generate a large amount of training data and train a transformer [Vaswani et al., 2017] model in a supervised manner. Valipour et al. [2021] train a GPT-2 [Radford et al., 2019] model on pairs of points and symbolic output. Then they use global optimization to find the constants for each equation. Biggio et al. [2021] use the encoder from the Set Transformer [Lee et al., 2018] and the decoder from the original transformer architecture [Vaswani et al., 2017]. Similar to Valipour et al. [2021], they train the models only on skeletons (expressions without the constants), and afterward, they fit the constants using global optimization. Another extension is introduced by d'Ascoli et al. [2022], where they train the transformer model [Vaswani et al., 2017] on recurrent sequences. They predict the expression constants jointly by encoding them into the symbolic output. To encode integers, they use their base b representation, e.g., for x = −325 and base b = 30 the representation would be [−, 10, 25]. In the case of floats, they use the IEEE 754 float representation and round the mantissa to the four most significant digits. They also introduce new tokens representing exponents. For example, if we have the number −0.015, then they encode it as [−, 15, e−4]. The disadvantage of this approach is that the mantissa has only finite precision. Therefore, the model typically only predicts the largest terms when approximating complicated functions [d'Ascoli et al., 2022].
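The integer encoding described above can be illustrated with a minimal sketch. The function name `encode_integer` and the use of a "+" token for non-negative numbers are assumptions made for illustration; only the sign-plus-digits layout and the example −325 in base 30 come from the text.

```python
def encode_integer(value: int, base: int = 30) -> list:
    """Encode an integer as a sign token followed by its digits in the given base.

    Illustrative only: the '+' token for non-negative values is an assumption.
    """
    sign = "-" if value < 0 else "+"
    magnitude = abs(value)
    digits = []
    while True:
        # Peel off the least significant digit first.
        magnitude, digit = divmod(magnitude, base)
        digits.append(digit)
        if magnitude == 0:
            break
    # Most significant digit goes first in the token sequence.
    return [sign] + digits[::-1]


print(encode_integer(-325, base=30))  # ['-', 10, 25]
```

Here 325 = 10 · 30 + 25, which gives the two digit tokens quoted in the text.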

Method
In symbolic regression, it is assumed that there is an unknown function f and that we observe its output on a finite set of input points. The goal is then to find the mathematical formula of this function. Therefore, we want to find a function $\hat{f}$ such that the squared difference between its output on the input points and the outputs of the unknown function f is minimized.
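One way to write this objective down is sketched below. The notation is an assumption introduced here for illustration: $(x_i, y_i)$, $i = 1, \dots, n$, are the observed pairs with $y_i = f(x_i)$, $\mathcal{F}$ is the set of formulas expressible with the available symbols, and $\hat{f}$ is the formula being sought.

```latex
\hat{f} \;=\; \operatorname*{arg\,min}_{g \in \mathcal{F}} \; \sum_{i=1}^{n} \bigl( g(x_i) - y_i \bigr)^2,
\qquad y_i = f(x_i).
```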
Given a set of observed input-output pairs, our model generates the symbolic representation of the formula together with values of all constants present in the formula in a single forward pass of a neural network. This is visualized in Figure 1. First, we transform all input-output pairs using an encoder block to obtain an internal representation. Given this representation, the decoder then autoregressively generates individual symbols and corresponding constants. This means that in order to generate the next token in the symbolic representation of the function, we pass all previously generated tokens and constants to the decoder. We proceed in this fashion until we obtain the entire formula. During inference, we sample multiple predictions from the model and fine-tune all constants to minimize the error between the predicted formula's outputs and the observed outputs. Finally, we select the prediction with the lowest error.

Model architecture
Our neural network is an autoregressive transformer-based architecture that contains an encoder with cross-attention blocks [Lee et al., 2018, Jaegle et al., 2021] and a simpler self-attention decoder [Vaswani et al., 2017]. The input to the encoder consists of the data points, which are first passed through a trainable affine layer to project them into a latent space. The resulting vectors are then passed through several induced set attention blocks [Lee et al., 2018], each of which consists of two cross-attention layers. The first cross attention uses a set of trainable vectors as the queries and the input features as keys and values. Its output is used as the keys and values for the second cross attention, and the original input vectors are used as the queries. After these cross-attention layers, we add a dropout layer [Srivastava et al., 2014]. In the end, we compute cross attention between a set of trainable vectors (queries) and the resulting features (keys and values) to fix the size of the output such that it does not depend on the number of input points. This final representation is then passed to the decoder.
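The encoder data flow can be summarised schematically as follows. This is a simplified sketch, not the exact layer definition: residual connections, normalisation, feed-forward sublayers, and dropout are omitted, and all symbols are assumptions introduced here. $X \in \mathbb{R}^{n \times d}$ denotes the projected input points, $I \in \mathbb{R}^{m \times d}$ the trainable vectors of one induced set attention block, $S \in \mathbb{R}^{k \times d}$ the trainable queries of the final layer, and $X_L$ the output of the last block.

```latex
\begin{aligned}
H &= \mathrm{Att}(Q = I,\; K = X,\; V = X), \\
\mathrm{ISAB}(X) &= \mathrm{Att}(Q = X,\; K = H,\; V = H), \\
Z &= \mathrm{Att}(Q = S,\; K = X_L,\; V = X_L).
\end{aligned}
```

Because the queries $S$ have a fixed number of rows $k$, the encoder output $Z \in \mathbb{R}^{k \times d}$ has the same size for any number of input points $n$.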
The decoder autoregressively generates the symbolic representation and the constants given the encoder's representation. The input symbols are first passed through the embedding layer and then pairwise summed with trainable positional encoding vectors. These newly formed vectors are pairwise concatenated with affine-projected constants. The resulting vectors are then passed through several decoder layers [Vaswani et al., 2017]. The decoder has two heads, where the first one is a classification head which predicts the probability distribution over the next symbol in the sequence. If the predicted symbol is a constant, the other (regression) head outputs its value. Training can efficiently process each sequence in a single forward pass of the network thanks to the masked attention and teacher forcing [Vaswani et al., 2017].
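A compact way to read the decoder description is sketched below. The notation is assumed for illustration: $s_t$ and $c_t$ are the symbol and constant at position $t$, $E$ is the embedding layer, $p_t$ the positional vector, $[\,\cdot\,;\,\cdot\,]$ denotes concatenation, $Z$ is the encoder output, and $d_t$ is the output of the decoder stack at position $t$. Treating both heads as single affine layers is also an assumption; the text only states that there is one classification head and one regression head.

```latex
\begin{aligned}
h_t &= \bigl[\, E(s_t) + p_t \;;\; W_c\, c_t + b_c \,\bigr], \\
P(s_{t+1} \mid s_{\le t},\, c_{\le t},\, Z) &= \mathrm{softmax}\bigl(W_s\, d_t + b_s\bigr), \\
\hat{c}_{t+1} &= W_r\, d_t + b_r .
\end{aligned}
```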

Training & inference
We train the model using cross-entropy $\mathcal{L}_{\text{class}}$ for the symbolic expression and mean squared error $\mathcal{L}_{\text{MSE}}$ for the constants:
$$\mathcal{L} = \mathcal{L}_{\text{class}} + \lambda \, \mathcal{L}_{\text{MSE}}, \qquad (1)$$
where λ is a hyperparameter. If λ is too small, the model will not learn to predict the constants at all, and if λ is too large, the model will not learn to predict the symbolic output well, and therefore, the constants will be useless. Therefore, at the beginning of the training, we set λ to zero, and after a few epochs, we gradually increase it using the cosine schedule [Loshchilov and Hutter, 2016]. Note that we calculate the regression loss only at indices where the model should predict a constant. We have also found it beneficial to add a small random noise sampled from $N(0, \sigma^2)$ to the constants during the training since, during inference, the constants are not always precise. The parameter σ is decreased according to the cosine schedule.
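The weighting scheme can be sketched as follows. This is a minimal illustration under stated assumptions: the exact shape of the cosine ramp, the function names `regression_weight` and `combined_loss`, and the list-based interface are not taken from the paper. Only the ideas of starting λ at zero, ramping it up with a cosine schedule, and computing the regression loss only at constant positions come from the text.

```python
import math


def regression_weight(step: int, delay: int, total_steps: int, final: float = 1.0) -> float:
    """Cosine ramp for lambda: zero until `delay`, then rising smoothly to `final`.

    Assumes total_steps > delay. The half-cosine shape is an assumption.
    """
    if step <= delay:
        return 0.0
    progress = min(1.0, (step - delay) / (total_steps - delay))
    return final * 0.5 * (1.0 - math.cos(math.pi * progress))


def combined_loss(class_loss, predicted, target, is_constant, lam):
    """Return class_loss + lam * MSE, with the MSE taken only over constant positions."""
    errors = [(p - t) ** 2 for p, t, m in zip(predicted, target, is_constant) if m]
    mse = sum(errors) / len(errors) if errors else 0.0
    return class_loss + lam * mse


# Example: only the last position holds a constant, so only it contributes to the MSE.
lam = regression_weight(step=550, delay=100, total_steps=1000)
loss = combined_loss(2.0, [0.0, 0.0, 0.2], [0.0, 0.0, 0.17], [False, False, True], lam)
```

The same ramp, run in the opposite direction, could be used to decay the noise level σ.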
During inference, we use Top-K sampling [Fan et al., 2018] to generate candidate equations. Then, we fix the symbolic expression and run gradient descent on all constants. As the objective, we use the mean squared error between the predicted function's outputs and the outputs of the ground-truth function. Finally, we select the equation with the lowest error on the input points.
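The fine-tuning and selection steps can be sketched in a few lines. This is an illustration under assumptions, not the released implementation: candidates are represented as hypothetical Python callables `f(x, consts)`, gradients are approximated by central finite differences rather than automatic differentiation, and the learning rate and step count are arbitrary.

```python
import math


def mse(f, consts, xs, ys):
    """Mean squared error of candidate f with the given constants."""
    return sum((f(x, consts) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def local_gradient_search(f, consts, xs, ys, lr=0.01, steps=500, eps=1e-6):
    """Plain gradient descent on the constants; the symbolic form f stays fixed."""
    consts = list(consts)
    for _ in range(steps):
        grad = []
        for i in range(len(consts)):
            up = consts[:i] + [consts[i] + eps] + consts[i + 1:]
            down = consts[:i] + [consts[i] - eps] + consts[i + 1:]
            grad.append((mse(f, up, xs, ys) - mse(f, down, xs, ys)) / (2 * eps))
        consts = [c - lr * g for c, g in zip(consts, grad)]
    return consts


def select_best(candidates, xs, ys):
    """Fine-tune the constants of every candidate and return the lowest-error one."""
    tuned = [(f, local_gradient_search(f, c, xs, ys)) for f, c in candidates]
    return min(tuned, key=lambda fc: mse(fc[0], fc[1], xs, ys))


# Toy data from 0.5 * x + 2 and two hypothetical candidates with initial constants.
xs = [-5 + 0.1 * i for i in range(101)]
ys = [0.5 * x + 2.0 for x in xs]
candidates = [
    (lambda x, c: c[0] * x + c[1], [0.4, 1.5]),
    (lambda x, c: c[0] * math.sin(x) + c[1], [1.0, 1.0]),
]
best_f, best_c = select_best(candidates, xs, ys)
```

In this toy run the linear candidate is selected, and its constants move from the initial guess towards the values used to generate the data. A good initial guess matters because plain gradient descent only searches locally.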

Dataset generation
We generate two training datasets, one with 130 million equations containing only univariate functions and a second one with 100 million equations containing bivariate functions, by following the algorithm described by Lample and Charton [2019] with a maximum of ten operators. The algorithm starts by generating a random unary-binary tree and filling the nodes with appropriate operations. The unnormalized probabilities of each operation and operator and the hyperparameters of the generator are given in Supp. Mat. In our dataset, we have also introduced new operators such as pow2(·) to make the corresponding operations easier to represent. The generated expressions are then simplified using SymPy [Meurer et al., 2017]. We discard the expressions that cannot be simplified in 5 seconds. Finally, we sample uniformly at random 100 points (200 for bivariate functions) from the interval [−5, 5], and if there are any non-finite values (NaN or ±∞), we try (0, 5] and then [−5, 0) (similarly for the bivariate dataset). The reason for selecting these intervals is that functions such as log x, √x or log(−x) are not defined on the full interval. Furthermore, we ignore any equations with values on the sampled points larger than $10^7$ (in absolute value). Similarly, we discard equations with constants smaller than $10^{-10}$ or larger than $10^{10}$ in absolute value. We also ignore any linear functions created by the simplification process, because keeping all of them would bias the dataset towards linear functions. Finally, we throw away any constant functions and functions that contain more than 50 symbolic tokens.
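The point-sampling and filtering rule for the univariate case can be sketched as below. The helper names `evaluate` and `sample_points` are hypothetical, and the decision to discard an equation as soon as a finite sample exceeds the magnitude limit is an assumption about the order of the checks.

```python
import math
import random


def evaluate(f, xs):
    """Return f(x) for every x, or None if any value is undefined or non-finite."""
    ys = []
    for x in xs:
        try:
            y = f(x)
            if not math.isfinite(y):
                return None
        except (ValueError, ZeroDivisionError, OverflowError, TypeError):
            return None
        ys.append(y)
    return ys


def sample_points(f, n=100, max_abs=1e7, rng=random):
    """Try [-5, 5], then (0, 5], then [-5, 0); return (xs, ys) or None if discarded."""
    samplers = [
        lambda: rng.uniform(-5.0, 5.0),
        lambda: 5.0 * (1.0 - rng.random()),   # rng.random() is in [0, 1), so this is (0, 5]
        lambda: -5.0 * (1.0 - rng.random()),  # and this is [-5, 0)
    ]
    for draw in samplers:
        xs = [draw() for _ in range(n)]
        ys = evaluate(f, xs)
        if ys is not None:
            if max(abs(y) for y in ys) > max_abs:
                return None  # values too large: discard the equation
            return xs, ys
    return None  # not finite on any of the three intervals


xs_log, ys_log = sample_points(math.log, rng=random.Random(0))
```

For `math.log`, the first interval fails because it contains negative inputs, so the points end up being drawn from (0, 5].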

Expression encoding
We use preorder tree representation to encode expressions and replace constants with special symbols. The constants are encoded using a scientific-like notation where a constant C is represented as a tuple of the exponent $c_e$ and the mantissa $c_m$:
$$C = c_m \cdot 10^{c_e}. \qquad (2)$$
In this representation, the mantissa is in the range [−1, 1], and the exponent is an integer. For example, the expression 0.017 · x + 1781.5 will have symbols [+, mul, x, C-1, C4] and constants [0, 0, 0, 0.17, 0.17815]. To further help the model represent constants, we add all integers from the interval [−5, 5] into the model vocabulary. Different encodings are compared in Section 4.7. In contrast to d'Ascoli et al. [2022], who are able to express constants only up to the four most significant digits, our approach achieves full float precision.
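The encoding of the worked example can be reproduced with a short sketch. The tuple-based tree format, the function names, and the `C0` token for a zero constant are assumptions for illustration; the special handling of the integers in [−5, 5] mentioned above is not modelled.

```python
import math


def encode_constant(value: float):
    """Split a constant into an exponent token and a mantissa in [-1, 1]."""
    if value == 0.0:
        return "C0", 0.0  # assumption: zero gets exponent 0
    exponent = math.floor(math.log10(abs(value))) + 1
    mantissa = value / 10.0 ** exponent
    return f"C{exponent}", mantissa


def encode_expression(tree):
    """Preorder traversal returning parallel lists of symbols and constant values."""
    symbols, constants = [], []

    def visit(node):
        if isinstance(node, tuple):      # (operator, child, child, ...)
            symbols.append(node[0])
            constants.append(0.0)
            for child in node[1:]:
                visit(child)
        elif isinstance(node, str):      # variable
            symbols.append(node)
            constants.append(0.0)
        else:                            # numeric constant
            token, mantissa = encode_constant(float(node))
            symbols.append(token)
            constants.append(mantissa)

    visit(tree)
    return symbols, constants


# 0.017 * x + 1781.5, written as a tree with the operator first.
symbols, constants = encode_expression(("+", ("mul", "x", 0.017), 1781.5))
print(symbols)  # ['+', 'mul', 'x', 'C-1', 'C4']
```

The constants list holds zeros at non-constant positions and the mantissas (approximately 0.17 and 0.17815, up to floating-point rounding) at the positions of `C-1` and `C4`.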

Experiments
This section describes our training setting and the metrics that we used to demonstrate the model's ability to predict the formulas and compare our model to previous approaches. We also show how the SymFormer generalizes to two dimensions and outside of the known range. Furthermore, we manually inspect the model's predictions to examine different equivalent mathematical formulas that the SymFormer found. In the end, we compare different encodings and their impact on the model's performance. In our experiments, we refer to the model trained only on univariate functions as the Univariate SymFormer and the model trained on both the univariate and bivariate functions as the Bivariate SymFormer. We also always use a local gradient search on the constants if not stated otherwise.

Training
We train our model using the Adam optimizer [Kingma and Ba, 2014] for 3 epochs on 8 NVIDIA A100 GPUs. The training of the model takes roughly 33 hours. We use 130 million univariate equations for the training set and 10 000 for the validation set. Furthermore, we randomly selected 256 equations to calculate the metrics using the beam search. We use a training schedule similar to the original transformer [Vaswani et al., 2017]. However, we divide the learning rate by five since the training often diverged when using the original learning rate. The regression weight λ is set according to the cosine schedule and delayed for 97 700 gradient steps, reaching 1.0 at the end of the training. For the random noise, we sample from $N(0, \sigma^2)$, where σ is initially set to 0.1 and decreased to zero during training using the same schedule. The complete set of hyperparameters for the model containing approximately 95 million parameters can be seen in Supp. Mat. The hyperparameters were found empirically. We use the same settings for our Bivariate SymFormer.
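For orientation, the learning-rate schedule of the original transformer, scaled down by a factor of five, might look like the sketch below. The text only says the schedule is "similar" to the original one, so the exact form is an assumption, and the defaults `d_model=512` and `warmup_steps=4000` are the values from Vaswani et al. [2017] used as placeholders, not the SymFormer hyperparameters.

```python
def learning_rate(step: int, d_model: int = 512, warmup_steps: int = 4000,
                  divisor: float = 5.0) -> float:
    """Linear warm-up followed by inverse square-root decay, divided by `divisor`."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    scale = d_model ** -0.5
    return scale * min(step ** -0.5, step * warmup_steps ** -1.5) / divisor


# The rate grows during warm-up, peaks at `warmup_steps`, then decays.
rates = [learning_rate(s) for s in (100, 4000, 16000)]
```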

Metrics
To assess the quality of the model, we have selected two metrics: the relative error and the coefficient of determination ($R^2$). The relative error is the average absolute difference between the predicted value and the ground truth divided by the absolute value of the ground truth:
$$\mathrm{RE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\hat{y}_i - y_i|}{|y_i|},$$
where $y_i$ and $\hat{y}_i$ are the ground-truth and predicted values for point $i$, respectively, and $n$ is the number of points. The coefficient of determination ($R^2$) [Glantz and Slinker, 2000] is defined as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$
where $y_i$ and $\hat{y}_i$ are the ground-truth and predicted values for point $i$, respectively, and $\bar{y}$ is the average of $y_i$ over all the points. The advantage of using $R^2$ is its nice interpretation. If $R^2 > 0$, then our prediction is better than predicting just the average value, and if $R^2 = 1$, then we have a perfect model.
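Both metrics are straightforward to compute. The sketch below follows the verbal definitions above; the function names are illustrative, and it assumes that no ground-truth value is zero and that the ground-truth values are not all identical, since either case would lead to a division by zero.

```python
def relative_error(y_true, y_pred):
    """Mean of |prediction - truth| / |truth| over all points."""
    return sum(abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y = [1.0, 2.0, 3.0, 4.0]
perfect = r2_score(y, y)               # 1.0: perfect model
baseline = r2_score(y, [2.5] * 4)      # 0.0: no better than predicting the mean
```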

In-domain performance
To demonstrate the SymFormer's ability to predict the formulas successfully, we use Top-K sampling [Fan et al., 2018] with K = 16 and 256 samples to generate the best equation. In our experiments, we report the median values of all metrics since the mean can be skewed by outliers. In Figure 2, we plot some of the model predictions. The Univariate SymFormer achieved an $R^2$ of 0.9995 and a relative error of 0.0288. Furthermore, when we used the local gradient search, the model improved to an $R^2$ of 1.0000 and a relative error of 0.0010. The Bivariate SymFormer achieved an $R^2$ of 0.9996 and a relative error of 0.0389 using Top-K [Fan et al., 2018] with K = 16, and an $R^2$ of 1.0000 and a relative error of 0.0035 when the local gradient search was used. This demonstrates the model's ability to generalize to higher dimensions.
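A single Top-K sampling step can be sketched as follows. This is a generic illustration of the technique, not the SymFormer code: `probs` stands for the next-symbol distribution produced by the classification head, and the function name is hypothetical.

```python
import random


def top_k_sample(probs, k, rng=random):
    """Sample an index from the k most probable entries, weighted by their probabilities."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    weights = [probs[i] for i in top]
    # random.choices normalises the weights, so no explicit renormalisation is needed.
    return rng.choices(top, weights=weights, k=1)[0]


probs = [0.5, 0.3, 0.15, 0.05]
rng = random.Random(0)
draws = [top_k_sample(probs, k=2, rng=rng) for _ in range(200)]
```

With `k=2`, only the two most probable symbols (indices 0 and 1) can ever be drawn. Repeating the step for each position yields one candidate equation, and repeating the whole procedure yields the set of samples.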
Unfortunately, these benchmarks are better suited for methods where parameters such as the number of variables, set of symbols, or sampling range are set specifically to match the problem at hand. In our method, these parameters are fixed in the beginning and cannot be changed later. Note that it is difficult to make a completely fair comparison on the benchmarks for two reasons. The first one is that some methods use a restricted vocabulary and thus have a smaller search space, giving them an advantage over our method. The second problem arises from the different sampling ranges and the number of sampled points.
We use Top-K sampling with K = 20 and 1 024 samples with early stopping for the benchmark.
From the results in Table 1, we can see that the SymFormer method is competitive in terms of the model performance on all of the benchmarks while outperforming both NSRS [Biggio et al., 2021] and DSO [Mundhenk et al., 2021] in the time required to find the equation. One of our observations is that the model sometimes predicts an expression that is semantically the same as the ground truth but more complex, e.g., in one case the model had to predict $\frac{x^2 + x}{2}$, but predicted $\ln(\exp(0.5x^2) \exp(0.5x))$, which is the same after simplification, but unreasonably complex. This was likely caused by the distribution of our dataset. Furthermore, to demonstrate the Bivariate SymFormer performance on both the univariate and bivariate functions, we evaluated a single model on all univariate and bivariate functions using the same benchmark. Note that the benchmark functions are mostly univariate. In Table 2, we can notice only a slight drop in performance. However, the average inference time increased, which could be explained by a larger search space the model needed to handle during the optimisation of constants. Furthermore, we have manually inspected the model's predictions on benchmark functions. We found that the model had no problems recovering simple equations but was slow or failed in cases of more complicated functions.

Out-of-domain performance
One intriguing property of symbolic regression is its ability to predict the correct values outside the sampling range. To test it, we first run the inference on the points sampled from the training range and then evaluate these predicted functions on points outside the sampling range. More formally, we calculate the metrics on the function values for points sampled from the set {x ∈ R | 5 < |x| < 5 + d}, where d is the maximal distance. The effect of the distance on the relative error and $R^2$ can be seen in Figure 3. Even though the error increases with the distance, the final relative error is minimal, even for the maximal range. Therefore, we can conclude that the model generalises outside of the sampling range, and the local gradient search does not overfit the constants to the sampled data.
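Drawing evaluation points from this set can be sketched as below. The function name, the uniform distribution over the allowed magnitudes, and the random choice of side are assumptions; the text only specifies the set itself.

```python
import random


def sample_outside(n, d, rng=random, limit=5.0):
    """Draw n points x with limit < |x| < limit + d, picking the side at random."""
    points = []
    while len(points) < n:
        magnitude = limit + rng.random() * d
        if not (limit < magnitude < limit + d):  # keep the interval open at both ends
            continue
        sign = 1.0 if rng.random() < 0.5 else -1.0
        points.append(sign * magnitude)
    return points


outside = sample_outside(n=100, d=2.0, rng=random.Random(0))
```

Every point lies just beyond the training range [−5, 5], at a distance of less than `d` from it.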

Discovering mathematically equivalent expressions
To qualitatively evaluate the SymFormer's prediction capabilities, we have manually inspected the model's predictions. The model is often able to find mathematically equivalent expressions. In one case the model discovered the rule $-\ln x = \ln x^{-1}$. The goal was to predict $-\frac{8}{\ln x}$, but the model predicted $\frac{8}{\ln(x^{-1})}$. Another rule the model discovered was the law of exponents $(x^a)^b = x^{a \cdot b}$. It was observed when the model had to predict $(x^{-1})^{1.5}$, but found an equivalent form $\frac{1}{x^{1.5}}$. Furthermore, the model was also able to find some trigonometric equivalences such as $\frac{1}{\tan x} = \cot x$. However, a more interesting example is the expression $\cos(3.5 + 2x + x^2)$. In this case, the model predicted $-\sin(2 + 2x + x^2)$, which has a very small error. The rule that the model discovered is $\cos(\frac{\pi}{2} + x) = -\sin(x)$. Applying this rule, we get $\cos(3.5 + 2x + x^2) \approx \cos(\frac{\pi}{2} + 1.9 + 2x + x^2) \approx -\sin(1.9 + 2x + x^2)$. Another interesting example the model predicted is the exponential rule $a^x = e^{x \cdot \ln a}$. This rule is probably used when the SymFormer needs to deal with precise constants, due to the numerical stability of ln. For example, the model had to find $-1.3673x$, but it found $\ln(0.2492^x) = \ln(e^{x \ln 0.2492}) = x \ln 0.2492 = -1.3895x$, which is close to the previous expression.
Table 3: Comparison of expression encoding strategies and local gradient search ('LGS'). The SymFormer uses both 'extended encoding' and local gradient search. Metrics are estimated using 256 equations using Top-K sampling with K = 20 and 256 samples, where the equation with the lowest mean squared error on the input points is selected. We report the $R^2$ and the relative error. The base encoding refers to the case when no preprocessing for the constants is used. BFGS init refers to a situation when the predicted constants are used as a starting point for the BFGS [Fletcher, 1987]. The 'GS' refers to the case when the gradient search was used to find or improve the constants further.
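The identities mentioned in this section are easy to check numerically. The snippet below is a small sanity check, not part of the method; the test point `x = 1.7` is an arbitrary positive value chosen for illustration.

```python
import math

x = 1.7  # arbitrary positive test point (not from the paper)

# Logarithm rule: -ln(x) = ln(x^-1)
lhs_log, rhs_log = -math.log(x), math.log(x ** -1)

# Law of exponents: (x^a)^b = x^(a*b) with a = -1, b = 1.5
lhs_pow, rhs_pow = (x ** -1) ** 1.5, x ** (-1 * 1.5)

# Trigonometric rule: cos(pi/2 + x) = -sin(x)
lhs_trig, rhs_trig = math.cos(math.pi / 2 + x), -math.sin(x)

# Offset left after removing pi/2 from the constant 3.5
offset = 3.5 - math.pi / 2
print(round(offset, 4))  # 1.9292

# Exponential rule: ln(0.2492^x) = x * ln(0.2492), so the slope is ln(0.2492)
slope = math.log(0.2492)
print(round(slope, 4))   # -1.3895
```

The offset of about 1.93 explains why −sin(2 + 2x + x²) is a close, but not exact, match for cos(3.5 + 2x + x²).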

Ablation study
This ablation study aims to look at the effect of different constant encodings. In the first setting, we did not predict the constants and used a global optimisation (BFGS [Fletcher, 1987]) to find the constants. This setting is the same as used by Biggio et al. [2021]. In the second setting, we used the constants during training, but we did not preprocess them. Therefore, a single symbol, 'const', was used to represent any constant regardless of its magnitude. In the last case, we have used the encoding described in Section 3.4, while also trying to use the predicted values of constants as a starting point for global optimisation. From the results in Table 3, we can see that the constants help the model performance in terms of both the $R^2$ and the relative error. Therefore, one can conclude that the performance of SymFormer in comparison to Biggio et al. [2021] is better not because of a different dataset or a larger model but because of the usage of constants during training. The last row shows the results for the extended encoding, which uses a local gradient search to improve the constants further. The extended encoding clearly outperforms the base encoding in terms of both the $R^2$ and the relative error. We believe this to be the case because it is easier for the model to attend to previous symbolic tokens than to real values and, therefore, the model can make a more informed decision when predicting the next symbol in the sequence.

Conclusion
To tackle the problem of symbolic regression, we introduced a novel transformer-based approach called the SymFormer that uses a neural network trained on hundreds of millions of formulas to efficiently generate a symbolic representation of a previously unseen formula given a set of input-output pairs. Our model jointly predicts the symbolic representation of a function and the values of all constants in a single forward pass of a neural network. A local gradient search is used to improve constants further to fit the input points better. We demonstrated that the SymFormer is competitive with current state-of-the-art approaches while outperforming them in terms of the time required to find the expression. We validated the importance of the proposed encoding of constants in an ablation study. Furthermore, by evaluating the SymFormer outside the sampling range, we showed that it has good extrapolation capabilities. Finally, in a qualitative evaluation, we presented some intriguing mathematical relations the model was able to recover just by learning on a large collection of formulas.
Limitations. One of the limitations of our approach is that the maximal number of dimensions and the sampling range cannot be changed after the model was trained. In the future, this can be partially tackled by varying the sampling distribution of the input points during training.

Supplementary Material
In the supplementary material, we give more details on the experiments presented in the paper and on the data generator's and model's hyperparameters, and we also include additional experiments that evaluate the performance of the presented approach in more detail. In Appendix A, we investigate the effect of the number of sampled equations using Top-K sampling [Fan et al., 2018] on the model's performance and the inference time. In Appendix B, we visualise several different model predictions using both univariate and bivariate SymFormer. In Appendix C, we enumerate all benchmark functions that were used to compare our method with the current state of the art, and in Appendix D, we present the exact values that we have measured while comparing the methods, including the used hyperparameters.
In Appendix E, we present the exact hyperparameters that were used to generate both datasets, and in Appendix F, we present the model hyperparameters, including its vocabulary and other training details.
Furthermore, we have also published the source code containing the necessary files to run the training and the inference. We have also included a video which shows a visualisation of inference on several functions. The process starts by sampling 256 equations and selecting the best eight functions from the sampled equations. In the video, they are visualised as light blue functions. The orange points represent the sampled points used for the inference. We then show how the functions improve with each constant optimization step.
First, we evaluate the model's performance with a varying number of sampled equations during inference. In Table 4, we can see that the SymFormer's ability to predict the correct expression improves with the number of sampled equations. However, the time increases substantially, averaging approximately three minutes per equation for the largest number of equations. These results are measured by running the inference with Top-K sampling [Fan et al., 2018] with K = 20 and without a local gradient search. If we had used a local gradient search, the $R^2$ and relative error (RE) would improve, but the time would increase.

B Examples of generated functions
We plotted several predictions generated by the SymFormer with the Top-K sampling [Fan et al., 2018] with K = 20 with 256 sampled equations and the local gradient search. The shaded area represents the sampling range. The generated predictions can be seen in Figure 4. The predicted functions are not always the same as the original formula. However, they fit almost perfectly on the function domain. In Figure 5, we show a similar visualisation for the bivariate SymFormer.
Figure 5: Examples of bivariate SymFormer predictions using Top-K sampling [Fan et al., 2018] with K = 20 and 256 samples. 'GT' denotes ground-truth and 'Pred', the model prediction.

C Benchmark functions
This section describes the exact functions used to compare the SymFormer with the current state-of-the-art methods. The benchmarks' names and the contained functions can be seen in Table 5.
Table 5: Benchmark functions that we have used in our experiments.We have restricted ourselves only to the univariate and bivariate functions.

D Benchmark comparison
This section presents a comparison for each equation from the used benchmarks. NSRS denotes Neural Symbolic Regression that Scales [Biggio et al., 2021], and DSO denotes Symbolic Regression via Neural-Guided Genetic Programming Population Seeding [Mundhenk et al., 2021]. DSO tends to perform similarly to SymFormer, but it usually takes more time to find the underlying equation due to the usage of reinforcement learning, for example, in the case of the Keijzer [Keijzer, 2003] benchmark, where the DSO required thousands of seconds in most cases. For the NSRS, we can see that the times are consistent across the benchmarks, but larger than for SymFormer. The reason for this observation is that the model uses global optimisation to find the coefficients, which slows down the inference. The hyperparameters for the DSO can be found in https://github.com/brendenpetersen/deep-symbolic-optimization/ and for the NSRS in https://github.com/SymposiumOrganization/NeuralSymbolicRegressionThatScales. The experiments were run using 32 CPU threads and 64 GB of RAM. From the results in Table 6, we can see that in most cases, the SymFormer is competitive in terms of $R^2$, while on average it requires less time than the other methods.

Figure 1: Schematic diagram of inference. The input points are passed through the transformer, generating several candidate equations using Top-K sampling. These candidates are further improved using gradient descent. The final equation is then selected by the lowest mean squared error.

Figure 2: Examples of model predictions using Top-K sampling with K = 20 and 256 samples. The shaded area represents the sampling range. For the 3D functions, x and y were sampled from [−5, 5]. 'GT' denotes ground-truth and 'Pred', the model prediction. The first six images are generated using univariate SymFormer, and the last one was generated by bivariate SymFormer.

Figure 3: The effect of the distance on the relative error calculated outside the sampling range. The values are estimated on 1024 equations generated with Top-K sampling with K = 20 and 256 samples. Each of the equations is also improved by a local gradient search on constants.

Figure 4: Examples of model predictions using Top-K sampling [Fan et al., 2018] with K = 20 and 256 samples. The shaded area represents the sampling range. 'GT' denotes ground-truth and 'Pred', the univariate SymFormer prediction.

Table 1: Results comparing the SymFormer with state-of-the-art methods on several benchmarks. The SymFormer uses Top-K sampling with K = 20 while generating 1024 samples and improving them using local gradient search with early stopping. We report $R^2$ and the average time to generate an equation in seconds.

Table 2: Results comparing the SymFormer when a separate model is used for univariate functions and bivariate functions and when the Bivariate SymFormer is used for all the benchmark functions. The (Bivariate) SymFormer uses Top-K sampling with K = 20 while generating 1024 samples and improving them using local gradient search with early stopping. We report $R^2$ and the average time to generate an equation in seconds.

Table 4: Number of sampled equations for Top-K [Fan et al., 2018] with K = 20 without a local gradient search on 256 equations. The time is in hours:minutes:seconds format.

Table 6: Comparison between methods on each of the benchmark functions. $R^2$ values are rounded to 4 decimals and time to whole seconds.

Table 7: Unnormalised probabilities of unary operators used by the dataset generator. We have also used special operations to generate them more often.

Table 8: Unnormalised probabilities of binary operators as used by the dataset generator.

Table 9: Unnormalised probabilities of leaf values as used by the dataset generator.

Table 11: Vocabulary as used by the model. Besides the elementary functions, it contains special functions such as pow2 and pow3, which are commonly used.