Learning and Applying a Function Over Distributions

We present a method for learning a function over distributions. The method is based on generalizing nonparametric kernel regression by using the earth mover’s distance as a metric for distribution space. The technique is applied to the problem of learning the dependence of pitcher performance in baseball on multidimensional pitch distributions that are controlled by the pitcher. The distributions are derived from sensor measurements that capture the physical properties of each pitch. Finding this dependence allows the recovery of optimal pitch frequencies for individual pitchers. This application is amenable to the use of signatures to represent the distributions and a whitening step is employed to account for the correlations and variances of the pitch variables. Cross validation is used to optimize the kernel smoothing parameter. A set of experiments demonstrates that the new method accurately predicts changes in pitcher performance in response to changes in pitch distribution and also outperforms an existing technique for this application.


I. INTRODUCTION
An important use of machine learning techniques is the recovery of a model from observed data. The development of learning methods for the recovery of three-dimensional shape from image data, for example, has been a topic of recent interest in computer vision [1], [2]. Nonparametric methods [3] are a powerful tool for model recovery and continue to support a variety of applications [4], [5]. In this work, we generalize nonparametric techniques that learn a function of multiple variables to the problem of learning a function over distributions.
The ability to quantify player skill and team performance in professional sports has been revolutionized by the deployment of sensors that collect large amounts of data during each game [6], [7] [8]. This has led to the use of machine learning algorithms by teams to exploit this data to gain a competitive advantage. Machine learning methods are particularly well suited for baseball due to the discrete structure of the sport [9]. We will apply the learning method derived in this article to one of the most challenging problems in baseball analytics.
Nonparametric kernel regression can be used to estimate a function of unknown form and has been applied in a wide range of settings [10]. Generalizing this approach to learn The associate editor coordinating the review of this manuscript and approving it for publication was Michele Magno . a function over distributions requires a suitable metric for distribution space. The Wasserstein metric or Earth Mover's Distance (EMD) can be used to compare distributions and has been applied to many problems in signal processing and machine learning [11]. The EMD uses a cost function called the ground distance to determine the minimum amount of work that is needed to transform one distribution into the other. The computational cost of finding the EMD can be expensive which leads to the use of signatures to approximate the distributions thereby enabling the use of efficient linear programming methods [12].
We develop an algorithm that learns a function over distributions by generalizing nonparametric kernel regression using the EMD as the distribution-space metric. The algorithm is applied to the problem of optimizing pitch distributions in baseball. A nonparametric learning method is appropriate for this application because the effectiveness of a pitch distribution has a complicated dependence on the quality, frequency, and interaction of a pitcher's set of pitches.
We represent a collection of pitches using a multidimensional distribution that is derived from sensor measurements that capture the physical properties of each pitch. These properties have been shown to have a strong effect on pitch value [13]. Pitchers typically use a small number of different pitch types which allows these distributions to be accurately encoded using signatures. A whitening transform [14] is used by the EMD ground distance to account for the variances and correlation structure of the component variables that define the distributions. A method that is similar to leave-one-out cross validation [15] is used to optimize the kernel smoothing parameter. After recovering the function over pitch distributions, an efficient low-dimensional search can be used to find the optimal frequencies for a pitcher's various pitch types. We show that the new model accurately predicts the dependence of pitcher performance on changes in pitch distribution and significantly outperforms an alternative approach based on game theory.

II. LEARNING A FUNCTION OVER DISTRIBUTIONS
We develop a method for learning a function over distributions when the underlying structure of the function is unknown. The method is based on generalizing nonparametric kernel regression using a whitened Earth Mover's Distance as the metric for distribution space. We will illustrate properties of the algorithm with a set of experiments in Section III.

A. NONPARAMETRIC KERNEL REGRESSION
Let (x i , y i ) for i = 1, 2, . . . , n be a set of observations where x is the explanatory variable and y is the response variable. The data can be modeled by where is an error term. Kernel regression [16], [17] is a nonparametric method that constructs an estimate for f (x) using the weighted average where d i = x − x i and k(·) is a kernel probability density function that is typically maximum at zero and decreases with |d i | so that the largest weights k(d i ) are given to the y i associated with the x i that are closest to x. A popular kernel function is the zero-mean Gaussian which depends on the smoothing parameter σ.

B. EARTH MOVER's DISTANCE
Given a set of observations (X i , y i ) where each X i is a multidimensional distribution, we can generalize equations (2) and (3) to approximate a function over distributions by replacing d i with a distance D i between the distributions X and X i The Wasserstein metric which is also called the Earth Mover's Distance (EMD) is a standard method for computing the distance between distributions. The EMD utilizes a ground distance between individual points to determine the minimum amount of work that is required to transform one full distribution into the other. For many applications [12], a distribution can be accurately represented as a signature S defined by a set of m clusters where µ i is the mean vector for cluster i and w i is the fraction of the distribution represented by cluster i. Thus, the signature S approximates a distribution by a set of m point masses at the locations µ i with the weights w i where m depends on the distribution. An established algorithm [12] for finding the EMD using signatures is based on the solution of the transportation problem [18] for finding the minimum cost to move product from a set of producers to a set of consumers with each having a known demand. For the transportation problem, the ground distance is the cost to move one unit of product from a given producer to a given consumer. The computation of the EMD using signatures can be formulated as a linear programming problem for which efficient solutions [19] and software [20] exist.

C. GROUND DISTANCE
The computation of the EMD requires the specification of a ground distance between the µ i mean vectors that define the point masses for each distribution. The use of a Euclidean distance between mean vectors is problematic because the component variables in the vectors can have different variances and these variables may also have significant correlations. We define the ground distance G(i, j) between µ i and µ j as the Mahalanobis distance [14] G where the covariance matrix for the population of mean vectors µ i serves to correct for differences in the variances of the vector components and also for their correlation structure. This distance is equivalent to a Euclidean distance after a whitening transform [14] has been applied to transform the original variables to a new set of variables which are uncorrelated and have unit variance.

D. FINDING THE SMOOTHING PARAMETER USING CROSS VALIDATION
The accuracy of kernel regression has a strong dependence on the smoothing parameter σ [14]. Let (X i , y i ) for i = 1, 2, . . . , n be a set of observations that associate distributions X i with responses y i . For the distribution X j we can use equation (4) to compute where D ij is the whitened EMD between X i and X j as described in Sections II-B and II-C and the (X j , y j ) VOLUME 8, 2020 observation is excluded from the sums. The error in the approximation is given by We define the optimal smoothing parameter σ * as the value of σ that minimizes the total absolute error in the approximation over the observations Note that if we include the (X j , y j ) observation in the sums in (7), then as σ approaches zero the approximation f (X , σ ) approaches a sum of Dirac delta functions centered at the observation points causing each E j (σ ) and the sum in equation (9) to approach zero. This yields a poor approximation to the underlying f (X ) function everywhere except at the observation points. The method described in this section for finding σ * is similar to leave-one-out cross validation methods that are used for density estimation [15].

A. SENSOR DATA
A baseball game is defined by a set of one-on-one matchups between a pitcher and a batter. The pitcher throws a ball which the batter attempts to hit with a bat. Each throw is called a pitch and each matchup consists of one or more pitches. The pitcher's goal is to make it difficult for the batter to make solid contact with a pitch. The PITCHf/x optical video and TrackMan Doppler radar sensors [7] capture data during baseball games that can be exploited to recover information about pitches. Our analysis considers the estimated s, b x , and b z parameters for each pitch as reported by Brooks Baseball (www.brooksbaseball.net). The parameter s represents the speed of a pitch in three dimensions and the pair (b x , b z ) specifies the pitch's horizontal and vertical movement relative to a theoretical pitch thrown at the same speed with no spin-induced movement [21]. The sensor coordinate system is arranged with the origin at home plate which is near the batter's location with positive z up, positive y parallel to the ground plane in the direction from the origin to the pitcher, and positive x chosen to complete a right-handed system. The pitcher starts the process of throwing each pitch from a location that is 60.5 feet from home plate. By convention, Brooks Baseball reports s for y = 55 feet and (b x , b z ) from y = 40 feet to home plate.
Pitchers typically throw different types of pitches to make the batter's task more difficult. A given pitch type has specific speed and movement characteristics. For example, a fourseam fastball from a right-handed major league baseball (MLB) pitcher will typically have a speed s above 90 miles per hour with a negative horizontal movement b x and a positive vertical movement b z due to backspin. A curveball from the same pitcher will typically have a speed s of less than 80 miles per hour with a positive b x and a negative b z due to topspin. For a left-handed pitcher, the sign of the horizontal movement b x will reverse for these pitches. Major League Baseball Advanced Media (MLBAM) uses measured pitch parameters to classify the type of each pitch in realtime. After each game, Pitch Info (www.pitchinfo.com) uses a manual review process to improve on the accuracy of the MLBAM classifications. As an example, Figure 1  B. OPTIMIZING THE PITCH DISTRIBUTION A pitcher's success is highly dependent on the characteristics of his pitch distribution. A larger speed s for an individual pitch reduces the batter's available reaction time while greater movement (b x , b z ) makes it more difficult for the batter to determine the optimal contact point. In addition, the diversity of a pitcher's distribution of pitches affects the batter's ability to anticipate the speed and movement of the next pitch. A pitcher can benefit from having pitches with large differences in speed [22] or from having pitches with similar speed that move in different directions [23].
The best result of a matchup for a pitcher is a strikeout which means that the batter was unable to hit the ball successfully given multiple opportunities. A pitcher's strikeout rate is the fraction of his matchups that result in a strikeout. This rate is a repeatable pitcher skill [24] and is a strong determinant of a pitcher's success [25]. We can use the algorithm described in Section II to learn the dependence of pitcher strikeout rate on the pitch distribution defined over the s, b x , and b z variables. Since a given pitcher can throw several different pitch types, he can adjust his pitch distribution and expected strikeout rate by changing the frequency of each pitch type.
Using the learned relationship between strikeout rate and pitch distribution, we can therefore find the pitch frequencies that optimize a pitcher's strikeout rate. We will evaluate this approach in the following sections.
Previous work on optimizing the pitch distribution has been based on game theory. Using this approach, Paine [26] has suggested that a pitcher's optimal pitch distribution occurs at Nash equilibrium where the pitcher's effectiveness is equal for each of his pitch types. This principle is used to derive the Nash score which is a measure of how close a pitch distribution is to Nash equilibrium. One difficulty with this method is that it requires the use of effectiveness values for each pitch type which are known to have a low reliability [27]. We will evaluate the use of the Nash score for assessing pitch distributions in Section III-C6.
We built the strikeout rate model described in Section III-B using 2016 sensor data for each MLB pitcher who threw at least 1500 pitches during the season. This threshold ensures the use of a reasonably large sample for generating the pitch distributions and strikeout rates and also removes pitchers who were used purely as relievers which often results in a different style of pitching. There were 108 right-handed pitchers and 41 left-handed pitchers who threw at least 1500 pitches in 2016.
The effectiveness of a given pitch depends on the handedness (left or right) of the batter and pitcher. Thus, we separately consider the dependence of strikeout rate on pitch distribution for each of the four possible platoon configurations (RHP vs. RHB, RHP vs. LHB, LHP vs. RHB, LHP vs. LHB). A pitcher's strikeout rate for a platoon configuration and year is defined as the ratio of strikeouts to the number of batters faced after removing all matchups with a pitcher as a batter and also removing all matchups that resulted in a bunt or an intentional walk. Using the 2016 constant of 4.262 batters per inning, the FIP equation [25] predicts that an increase of 0.03 in strikeout rate leads to 0.26 fewer runs allowed per game which is a significant improvement in pitcher performance.
The process of learning and applying a function over distributions can be summarized by the following steps. Training data is first partitioned by platoon configuration and each step is carried out separately for each configuration. The training data provides a set of pitch distributions specified by signatures S i as defined in Section II-B and associated strikeout rates y i . The covariance matrix in equation (6) is computed for the population of mean vectors specified by the S i signatures. The smoothing parameter σ is found using cross validation as described in Section II-D. The learned model can then be applied to a pitch distribution X described by a signature S to compute the expected strikeout rate by using equation (4). This process is summarized by Figure 2 where application of the model will be described in more detail in Section III-C5.

2) SIGNATURE MODEL
Pitchers tend to throw a small number of distinct pitch types which allows the pitch distribution for a pitcher for a given year and platoon configuration to be accurately modeled using the signature representation of equation (5) where each pitch type corresponds to a cluster. The number of clusters m corresponds to the number of distinct pitch types as identified by the Pitch Info classifier where m can depend on both the specific pitcher and the platoon configuration. For each pitch type i, µ i is the pitch parameter mean vector (s i , b xi , b zi ) and w i is the fraction of pitches of that type for the pitcher and platoon configuration.

3) COMPUTING THE EMD
The signatures are used to compute the distance between distributions using the EMD as described in Section II-B with the whitened ground distance defined in Section II-C. As a two-dimensional example of this process, Figure 3 is a scatterplot of the mean (s i , b zi ) values for each pitch cluster in a signature for the right-handed pitcher versus right-handed batter platoon configuration in 2016. We see that s i and b zi have a large positive correlation so that a pitch thrown with a higher speed will tend to have a larger vertical movement. The variance of the s i values is also larger than the variance of the b zi values. These effects are addressed by using the Mahalanobis ground distance defined by equation (6).
The impact of the correlation between the two variables can be seen by considering the orange, green, and red points in Figure 3 which correspond to the (s i , b zi ) values for three specific pitch clusters in the figure as detailed in Table 1. The Euclidean distance of 6.10 between the green point (Latos cutter) and the red point (Chacin four-seam) is significantly larger than the Euclidean distance of 3.49 between the green point (Latos cutter) and the orange point (Kennedy changeup). Since the vector difference between the green point and the red point is aligned with the direction of correlation of the variables, however, a significant portion of the separation between these points is due to the correlation between s and b z . On the other hand, the vector difference between the green point and the orange point is approximately orthogonal  to the direction of correlation. If we compute the Mahalanobis distance using the s and b z variables shown in Table 1, the distance of 0.81 between the green point and the red point is now significantly less than the distance of 1.32 between the green point and the orange point.

4) CROSS VALIDATION
The cross validation process described in Section II-D is used to find optimized values for the smoothing parameter σ for each platoon configuration using the total absolute error defined in equation (9). In cases where E T (σ ) is near its minimum value over a range of σ, we prefer smaller values of σ over the range since these yield more small values of g(D i , σ ) in equation (4) and therefore more terms in the sums that can be neglected without significantly affecting the approximation. Thus, we select the optimal value σ * of the smoothing parameter as the smallest value of σ for which The use of this equation to favor smaller values of σ has little effect on the accuracy of the model in equation (4) Table 2.    For small values of σ, the g(D ij , σ ) in equation (7) are approximately Dirac delta functions and f (X j , σ ) is approximately a sum of Dirac delta functions centered at the observations (X i , y i ) for i = j. This results in a relatively large error E j (σ ) for small σ in equation (8) and a relatively large error in E T (σ ) for small σ in equation (10). As σ increases, the approximation in equation (8) improves and the error decreases as shown in the figures.

5) FINDING OPTIMIZED PITCH FREQUENCIES
The goal for a pitcher is to maximize his future strikeout rate. This can be accomplished by using the estimated f (X , σ * ) function which represents strikeout rate as a function of the pitch distribution X . Suppose that a pitcher has a pitch distribution X which is represented by a signature with m pitch types as in equation (5). Each pitch type i has a pitch parameter vector µ i = (s i , b xi , b zi ) and a frequency w i . For a given pitcher, the pitch parameter vector µ i for each pitch type is characteristic of his ability and typically does not change. Each frequency w i , however, can be easily changed by varying how often pitch type i is thrown. Thus, a pitcher can endeavor to maximize future strikeout rate by finding the values of w i that maximize f (X , σ * ) subject to the constraints w 1 +w 2 +· · ·+w m = 1 and w i ≥ 0. Since the number of pitch types m is typically small, the optimal w i values can be found efficiently using an exhaustive search over combinations of the frequencies w i .
We illustrate this process for left-handed pitcher Danny Duffy for the LHP vs. LHB platoon configuration using his 2016 signature as shown in Table 3. We note that the signature model S in equation (5) is general and can accommodate any number of different pitch types. Individual pitchers, however, typically are not able to throw every pitch type effectively. As reported by Brooks Baseball, Danny Duffy only used the five pitch types listed in Table 3 during 2016. Other pitchers use other pitch types such as the cutter and the split which are represented in their signatures. Figure 8 is a visualization of f (X , σ * ) for pitch distributions X formed by varying the frequency w 1 of his fourseam and w 2 of his slider. In order to limit the plot to two dimensions, the w i for his two least frequent pitches are set to their 2016 values so that w 4 = 0.0252, w 5 = 0.0069, and w 3 is then constrained to w 3 = 1 − (w 1 + w 2 + w 4 + w 5 ). The red point in the figure indicates the location of Duffy's 2016 signature and corresponds to an actual strikeout rate of 0.330 and an estimated strikeout rate using f (X , σ * ) of 0.317. We see that the model predicts that the pitcher could improve his strikeout rate by increasing w 1 (fourseam frequency) and reducing w 2 (slider frequency). In 2017, Duffy's w 1 and w 2 frequencies for this configuration moved in the opposite direction to the point shown in black in the figure. This resulted in a reduced strikeout rate of 0.245 in where 2017 predicted strikeout rate is computed by evaluating f (X , σ * ) using equation (4) for the pitcher's 2017 pitch distribution with σ * computed as described in Section III-C4. Figure 9 is a scatterplot with 198 points that represent ( , ) for each of the 72 right-handed and 27 left-handed pitchers against each handedness of batter. We see that the points have a positive correlation. In particular, for the 25 points with strong positive predictions > 0.03 we have 21 points (84.0%) with a positive in actual strikeout rate. For the 39 points with strong negative predictions < −0.03 we have 24 points (61.5%) with a negative in actual strikeout rate. Thus, the model is useful for predicting the dependence of changes in strikeout rate on changes in pitch distribution. VOLUME 8, 2020 FIGURE 9. Predicting strikeout rate changes using f (X , σ * ). For comparison, Figure 10 is a scatterplot of the actual change in strikeout rate from 2016 to 2017 for each of the 99 pitchers versus each pitcher's Nash score difference [26] N = 2016 Nash score − 2017 Nash score (14) As described in Section III-B, a low Nash score indicates that a pitcher is close to Nash equilibrium while a higher Nash score indicates that a pitcher is farther from equilibrium. Thus, for N positive we would expect a pitcher to improve from 2016 to 2017 and for N negative we would expect a pitcher to get worse from 2016 to 2017. In Figure 10, however, we see that the points in the scatterplot do not have an increasing trend and, in fact, the points have a small negative correlation. We believe that this is due to the low reliability for the pitch values [27] on which the Nash score is based. We can assess the statistical significance of the difference between the correlation coefficients of r 1 = 0.320 in Figure 9 and r 2 = −0.081 in Figure 10 using the Fisher z-transformation [28]. Even if we disregard the negative sign on r 2 , this method yields a z observed test statistic of 2.01 and a corresponding p-value of 0.044 which supports the conclusion that r 1 is significantly larger than r 2 . Thus, the function f (X , σ * ) has value for predicting future strikeout rate and can be used to find optimized pitch frequencies w i using the approach described in Section III-C5.

IV. CONCLUSION
The proliferation of sensor systems at sporting events has provided large data sets that support the generation of predictive models using machine learning algorithms. These models are playing an increasingly prominent role in the operational activities of professional sports teams. In an industry where the difference between success and failure is often small, models derived from sensor data can be used to gain an edge over the competition.
We have developed and evaluated an algorithm for learning a function over distributions. The algorithm employs the earth mover's distance as a metric for distribution space within a nonparametric kernel regression scheme. We have demonstrated the algorithm for the task of learning a pitcher's strikeout rate as a function of a multidimensional pitch distribution that is generated from pitch trajectory measurements. The algorithm efficiently represents the pitch distributions using signatures and compensates for the correlation of the trajectory variables with a whitening step. The smoothing parameter for the regression kernel is learned using cross validation. We have assessed the algorithm for the prediction of strikeout rate from pitch distributions on out-of-sample data and have demonstrated that it performs better than an alternative algorithm based on game theory principles.
The new technique can be used for a number of applications in the areas of strategy [29], player development [30], and player evaluation [31] in baseball as well as for play selection [32] in football. Given the physical characteristics of a pitcher's different pitch types, the function can be used to determine the frequencies for each pitch type that maximize strikeout rate. The method can also be used to evaluate the improvement in strikeout rate that is possible by adding a new pitch type to a pitcher's current collection of pitches. By utilizing physical measurements, the algorithm allows the direct comparison of pitchers across environments. This enables, for example, a prediction of how a college pitcher would perform in major league baseball after optimizing his pitch distribution. The framework can also be applied outside of the baseball domain. We could, for example, use a similar approach to build a model for the dependence of a football team's performance on the distribution of offensive play types, e.g. run or pass, that are used. This model could then be utilized to determine the play distribution that a given offense should use to maximize success.