Measurement Space Partitioning for Estimation and Prediction

An important and challenging problem in the evaluation of baseball players is the quantification of batted-ball talent. This problem has traditionally been addressed using linear regression or machine learning methods. We use large sets of trajectory measurements acquired by in-game sensors to show that the predictive value of a batted ball depends on its physical properties. This knowledge is exploited to estimate batted-ball distributions defined over a multidimensional measurement space from observed distributions by using regression parameters that adapt to batted ball properties. This process is central to a new method for estimating batted-ball talent. The domain of the batted-ball distributions is defined by a partition of measurement space that is selected to optimize the accuracy of the estimates. We present examples illustrating facets of the new approach and use a set of experiments to show that the new method generates estimates that are significantly more accurate than those generated using current methods. The new methodology supports the use of fine-grained contextual adjustments and we show that this process further improves the accuracy of the technique.


I. INTRODUCTION
Radar and optical sensors have been installed in Major League Baseball (MLB) stadiums in recent years and collect several terabytes of data during each game [1]. As a result, the assessment of player skill and the prediction of future performance is increasingly dependent on data-driven models rather than subjective evaluation. The accuracy of these models is critical to a team's success. During 2021, for example, the Los Angeles Angels completed a 240 million dollar contract with player Albert Pujols but received only 41 million dollars of value [2]. Not surprisingly, the Angels failed to meet expectations and did not win even a single postseason game during the period of this contract.
Methods for evaluating and predicting player performance on batted balls are of particular interest since the majority of MLB matchups result in a batted ball. Player talent level on batted balls is defined as the expected value of a statistic which can be estimated from a sample of observations. The utility of an estimate is often evaluated by its ability to predict player performance on unobserved data. An intuitively appealing measure of talent level is the naive estimate, which is the value of the batted ball statistic over a player's observed sample.
The associate editor coordinating the review of this manuscript and approving it for publication was Anton Kos.
A disadvantage of using the naive estimate is that batted-ball results are subject to a large amount of random variation due to factors such as the response time and positioning of fielders [3] and the moisture content and texture of the playing surface [4]. These variables cause a player's batted-ball performance to have a low correlation across samples [5], [6]. Batted ball results are also biased by variables such as the atmospheric conditions [7], [8], the ballpark geometry [9], and the batter's running speed [10]. For these reasons, the prediction of batted ball results is considered the most difficult aspect of forecasting player performance [11], [12].
Machine learning (ML) algorithms [13] provide an alternative approach for deriving estimates of talent level. ML methods have achieved success for a diverse range of applications [14]-[16] and are especially appropriate for the analysis of sports such as baseball which are defined by a discrete series of events [17]. Several ML methods have used sensor data to quantify player performance on batted balls. These methods include a technique [18] that combines k-nearest neighbors with a generalized linear model as well as techniques [10], [19] that use kernel density estimates within a Bayesian framework. These measures reduce the impact of random variation compared to the naive estimate, but have the disadvantage that they are optimized for modeling performance on observed data rather than for predicting performance on unobserved data. Research in statistics [20]-[22] has shown that the naive and ML measures will be less accurate for prediction than estimates that are defined by a weighted average of a measure with the average performance over a group of players. This weighting is typically implemented using linear regression (LR) where the weights depend on the correlation of the performance measure across samples. Estimates generated using LR have been utilized by several systems for predicting player performance [12], [23]. A disadvantage of combining LR with any of the various performance measures, including those generated by ML methods, is that all batted balls are assumed to have the same predictive value.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We derive a new method called measurement space partitioning (MSP) for implementing the regression component of a talent level estimator. The method is based on the principle uncovered in this work that the predictive value of a batted ball depends on its physical properties. In order to exploit this principle we developed a new method for distribution estimation that transforms an observed distribution over local regions of measurement space. Implementation of this transformation required the development of new methods for predicting the correlation of distribution values across samples and for learning an optimal partition of measurement space. The result is that a player's underlying batted-ball distribution and the corresponding talent level can be estimated using a method that adapts to the physical characteristics of his particular collection of batted balls. We show that by modeling the variation in the predictive value of batted balls, the MSP method improves on the accuracy of existing methods for estimating batted-ball talent level.
Another advantage of the MSP approach is the ability to incorporate fine-grained contextual information into estimates. Contextual information includes a range of variables that can affect batted-ball value. The weather conditions and elevation, for example, will affect how far a batted ball will carry in the air [8]. Batted balls that follow similar trajectories can have different outcomes due to differences in outfield geometry from ballpark to ballpark [9]. A player's running speed [10] and variables that include the height of the infield grass and the composition of the infield surface [4] can affect the value of batted balls hit on the ground. The fate of batted balls also depends on the quality of the defenders in the field.
Contextual factors are typically accounted for by a coarse adjustment that compensates for the average effect of the environment [24]. These coarse adjustments often perform poorly because a given environment affects batted balls in different ways depending on their properties [25]. Since the MSP method computes talent level estimates from estimated batted ball distributions defined over physical parameters, contextual adjustments can be employed that depend on the characteristics of individual batted balls. A ball hit in the air at high speed, for example, can be adjusted differently from a ball hit softly on the ground. We will show that the use of fine-grained contextual adjustments further improves the accuracy of predictions made by the MSP method.

II. ESTIMATION AND PREDICTION
A. TALENT LEVEL
Talent for a skill varies from player to player and can be represented by a statistic that is derived from a set of observations. The computed value of such a statistic equals talent level T(j), which is the expected value of the statistic for player j, plus estimation error. In this work, we examine the problem of estimating player talent level on batted balls. Consider a dataset that contains information on 2N batted balls for each of P players where the data is arranged so that the first N batted balls for each player are observed and the second N batted balls for each player are unobserved. Let R(i, j) represent the numerical value of batted ball i for player j and define the observed performance statistic for player j as the average over the first N batted balls
$$x(j) = \frac{1}{N} \sum_{i=1}^{N} R(i, j) \tag{1}$$
and define the unobserved performance for player j as the average over the second N batted balls
$$y(j) = \frac{1}{N} \sum_{i=N+1}^{2N} R(i, j). \tag{2}$$
Estimation is the process of using the observed batted ball data to estimate talent level T(j) for the x(j) statistic for each player j. Prediction is the process of using the observed data to predict the unobserved performance y(j) for each player j.

B. LINEAR REGRESSION
The naive estimate of T(j) is simply the observed performance x(j) for player j. However, the James-Stein paradox [21], [22] as illustrated by Efron and Morris [20] shows that a more accurate estimate of T(j) is obtained by adjusting the x(j) using an average of the observed R(i, j) values over multiple players. Since an estimate for talent level can be assessed by its ability to predict the unobserved performance y(j), we can define an estimate $\hat{y}(j)$ for T(j) by minimizing the sum of the squared errors
$$E = \sum_{j=1}^{P} \left( y(j) - \hat{y}(j) \right)^2 \tag{3}$$
using the linear regression model
$$\hat{y}(j) = a\, x(j) + b. \tag{4}$$
The values of a and b that minimize E are
$$a = r\, \frac{\sigma_y}{\sigma_x} \tag{5}$$
$$b = \mu_y - a\, \mu_x \tag{6}$$
where $\mu_x$ and $\sigma_x$ are the mean and standard deviation for the x(j), $\mu_y$ and $\sigma_y$ are the mean and standard deviation for the y(j), and r is the correlation coefficient for the set of P points (x(j), y(j)) [26].
Since the data used to generate the y(j) are unobserved, the parameters $\mu_y$, $\sigma_y$, and r in equations (5) and (6) cannot be computed directly. The y(j), however, are generated in the same way for the same players as the x(j), which allows us to use the approximations $\mu_y = \mu_x$ and $\sigma_y = \sigma_x$. The remaining unknown parameter, the correlation coefficient r, can be approximated from the observed R(i, j) values using Cronbach's alpha [27]
$$\alpha(N) = \frac{N}{N-1} \left( 1 - \frac{\sum_{i=1}^{N} \sigma^2_{R_i}}{\sigma^2_{R_T}} \right) \tag{7}$$
where $\sigma^2_{R_i}$ is the variance of the observed R(i, j) values over players j for batted ball i and $\sigma^2_{R_T}$ is the variance of
$$R_T(j) = \sum_{i=1}^{N} R(i, j) \tag{8}$$
over players j. Using these approximations, equation (4) becomes
$$\hat{y}(j) = \alpha(N)\, x(j) + \left( 1 - \alpha(N) \right) \mu_x \tag{9}$$
which can be computed using the observed data. The estimate $\hat{y}(j)$ in equation (9) is consistent with the James-Stein result that an improved estimate for T(j) can be obtained by adjusting x(j) using the overall mean $\mu_x$.
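The regression step above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `cronbach_alpha` and `regress_to_mean` and the input layout `values[j][i]` (player j, observed batted ball i) are hypothetical, and population variances are used for both the per-ball and total variances, which leaves their ratio unchanged relative to sample variances.

```python
from statistics import mean, pvariance


def cronbach_alpha(values):
    """Cronbach's alpha for a P x N table, values[j][i] = value of ball i for player j."""
    n_items = len(values[0])
    # Variance over players of each batted-ball position i.
    item_vars = [pvariance([row[i] for row in values]) for i in range(n_items)]
    # Variance over players of each player's total value.
    total_var = pvariance([sum(row) for row in values])
    # Note: raises ZeroDivisionError if every player has the same total.
    return n_items / (n_items - 1) * (1.0 - sum(item_vars) / total_var)


def regress_to_mean(values):
    """Shrink each player's observed mean toward the group mean by alpha."""
    alpha = cronbach_alpha(values)
    x = [mean(row) for row in values]      # observed performance per player
    mu_x = mean(x)                         # mean over players
    return [alpha * xj + (1.0 - alpha) * mu_x for xj in x]


# Toy data: 4 players, 4 observed batted balls each (values are made up).
toy = [[1, 1, 0, 1], [0, 0, 0, 1], [1, 0, 1, 1], [0, 0, 0, 0]]
alpha_toy = cronbach_alpha(toy)
estimates = regress_to_mean(toy)
```

Each estimate lies between the player's observed mean and the group mean whenever alpha is between 0 and 1, which is the shrinkage behavior that the James-Stein result motivates.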

C. VARYING OBSERVED SAMPLE SIZE
The α(N) that is used to compute the estimate $\hat{y}(j)$ in equation (9) is derived using a dataset of N observed batted balls for each of P players using equation (7). The utility of the method is enhanced if we can use this dataset to compute the estimate $\hat{y}(j)$ using a sample of $N'$ batted balls for player j where $N' \neq N$. The value of α(N) tends to increase with N due to a decrease in the variance of the random error in the observed performance x(j) [28]. The Spearman-Brown prophecy formula [29], [30] allows us to predict $\alpha(N')$ from the estimated α(N) using
$$\alpha(N') = \frac{C\, \alpha(N)}{1 + (C - 1)\, \alpha(N)} \tag{10}$$
where $C = N'/N$. This $\alpha(N')$ can be used in equation (9) to compute $\hat{y}(j)$ using an observed performance x(j) computed using any number of samples $N'$.
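As a small illustration of the sample-size adjustment, the standard Spearman-Brown formula can be written as a one-line Python function. The function name `spearman_brown` is hypothetical; the value 0.294 for 150 batted balls is the α(150) reported later in Section IV-E, and the target size of 300 is chosen only for the example.

```python
def spearman_brown(alpha_n, n, n_new):
    """Predict the reliability for n_new samples from the reliability for n samples."""
    c = n_new / n
    return c * alpha_n / (1.0 + (c - 1.0) * alpha_n)


# Doubling the observed sample from 150 to 300 batted balls.
alpha_300 = spearman_brown(0.294, 150, 300)   # 0.588 / 1.294, about 0.454
```

With C = 1 the formula returns the input unchanged, and for C > 1 the predicted value moves toward 1, so larger samples receive less regression to the mean in equation (9).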

III. EXPLOITING SENSOR MEASUREMENTS
A. PARTITIONING THE MEASUREMENT SPACE
Sensors allow batted balls to be represented by a point in a measurement space with dimensions defined by properties such as speed, direction, and spin. The measurement space can be partitioned into B disjoint subsets. For the dataset described in Section II-A let M(i, j, k) be a binary-valued function which is one if batted ball i for player j is in subset k and zero otherwise. Define the observed batted ball distribution for player j over the subsets k by
$$p_x(j, k) = \frac{1}{N} \sum_{i=1}^{N} M(i, j, k) \tag{11}$$
and define the unobserved batted ball distribution for player j over the subsets k by
$$p_y(j, k) = \frac{1}{N} \sum_{i=N+1}^{2N} M(i, j, k). \tag{12}$$
We will show that an estimate $\hat{p}_y(j, k)$ for $p_y(j, k)$ can be used to generate an estimate for the talent level T(j).

B. ESTIMATING MEASUREMENT SPACE DISTRIBUTIONS
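For a given subset k we can use a linear regression model and approximations similar to those described in Section II to estimate $p_y(j, k)$ from the observed data according to
$$\hat{p}_y(j, k) = \alpha(N, k)\, p_x(j, k) + \left( 1 - \alpha(N, k) \right) \mu(k) \tag{13}$$
where
$$\mu(k) = \frac{1}{P} \sum_{j=1}^{P} p_x(j, k) \tag{14}$$
and α(N, k) is the Cronbach approximation to the correlation coefficient for the set of P points $(p_x(j, k), p_y(j, k))$
$$\alpha(N, k) = \frac{N}{N-1} \left( 1 - \frac{\sum_{i=1}^{N} \sigma^2_{M_i}}{\sigma^2_{M_T}} \right) \tag{15}$$
where $\sigma^2_{M_i}$ is the variance of the observed M(i, j, k) values over players j for batted ball i and subset k and $\sigma^2_{M_T}$ is the variance of
$$M_T(j, k) = \sum_{i=1}^{N} M(i, j, k) \tag{16}$$
over players j for subset k. α(N, k) can then be used in equation (13) to compute the regressed distribution $\hat{p}_y(j, k)$ using only the observed data. We note that the calculation in equation (15) can yield α(N, k) values that are negative [28], and in these cases α(N, k) is set to zero for the calculation of $\hat{p}_y(j, k)$.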
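The per-subset regression can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names and the input layout `player_subsets[j][i]` (the subset index of observed batted ball i for player j) are hypothetical. Negative alpha values are set to zero as the text describes; treating a subset with no between-player variation as alpha = 0 is an extra assumption of this sketch.

```python
from statistics import mean, pvariance


def observed_distribution(subsets, num_subsets):
    """Fraction of a player's observed batted balls that fall in each subset k."""
    n = len(subsets)
    return [sum(1 for s in subsets if s == k) / n for k in range(num_subsets)]


def subset_alpha(indicators):
    """Cronbach's alpha for one subset; indicators[j][i] is 1 if ball i of player j is in it."""
    n = len(indicators[0])
    item_vars = sum(pvariance([row[i] for row in indicators]) for i in range(n))
    total_var = pvariance([sum(row) for row in indicators])
    if total_var == 0:
        return 0.0  # assumption of this sketch: regress fully to the mean
    alpha = n / (n - 1) * (1.0 - item_vars / total_var)
    return max(alpha, 0.0)  # negative values are set to zero


def regressed_distribution(player_subsets, num_subsets):
    """Regress each player's observed distribution toward the mean, subset by subset."""
    p_x = [observed_distribution(s, num_subsets) for s in player_subsets]
    result = [[0.0] * num_subsets for _ in player_subsets]
    for k in range(num_subsets):
        indicators = [[1 if s == k else 0 for s in subs] for subs in player_subsets]
        alpha = subset_alpha(indicators)
        mu_k = mean(p[k] for p in p_x)
        for j, p in enumerate(p_x):
            result[j][k] = alpha * p[k] + (1.0 - alpha) * mu_k
    return result


# Toy data: 3 players, 4 observed batted balls each, 2 subsets.
toy_subsets = [[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]]
p_hat = regressed_distribution(toy_subsets, 2)
```

Because alpha is computed separately for each subset, subsets whose occupancy is more repeatable across samples keep more of a player's observed fraction, while the others are pulled closer to the group mean.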

C. ESTIMATING TALENT USING MEASUREMENT SPACE PARTITIONING
The batted ball distribution estimate $\hat{p}_y(j, k)$ for player j can be used to estimate the player's talent level T(j). If $\hat{R}(j, k)$ is an estimate of the expected value of batted balls for player j in subset k then T(j) can be estimated by
$$\hat{y}_s(j) = \sum_{k=1}^{B} \hat{p}_y(j, k)\, \hat{R}(j, k). \tag{17}$$
For cases where we would like to estimate $\hat{y}_s(j)$ using a sample of $N'$ batted balls for player j, the values $\alpha(N', k)$ for each k in equation (13) can be computed using the Spearman-Brown formula as described in Section II-C.
The $\hat{y}_s(j)$ estimate in equation (17) is equivalent to the linear regression estimate $\hat{y}(j)$ in equation (9) if α(N, k) has the same value α(N) for all subsets k and the average value of the observed batted balls in any subset k is the same for all players j. For this special case, if we let $\hat{R}(j, k)$ equal the overall mean of the observed R(i, j) for subset k
$$\bar{R}(k) = \frac{\sum_{j=1}^{P} \sum_{i=1}^{N} M(i, j, k)\, R(i, j)}{\sum_{j=1}^{P} \sum_{i=1}^{N} M(i, j, k)} \tag{18}$$
then equation (17) can be written
$$\hat{y}_s(j) = \alpha(N) \sum_{k=1}^{B} p_x(j, k)\, \bar{R}(k) + \left( 1 - \alpha(N) \right) \sum_{k=1}^{B} \mu(k)\, \bar{R}(k) \tag{19}$$
where the first sum in equation (19) equals x(j) and the second sum equals $\mu_x$, which demonstrates the equivalence to equation (9). We will see that by allowing α(N, k) to vary over subsets k and by allowing $\hat{R}(j, k)$ to vary over players j, the model in equation (17) can generate estimates that are more accurate than the linear regression estimate in equation (9).
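The special case can be checked numerically. The following Python sketch uses made-up numbers (two players, two subsets, one shared alpha, shared subset values) and a hypothetical helper `msp_talent`; it confirms that the sum over subsets of regressed probability times subset value matches the linear regression estimate of equation (9) when alpha does not vary across subsets and every player has the same average value in each subset.

```python
def msp_talent(p_hat, r_hat):
    """Sum over subsets of estimated probability times estimated value."""
    return sum(p * r for p, r in zip(p_hat, r_hat))


alpha = 0.3                          # single alpha shared by all subsets
r_bar = [0.2, 0.9]                   # hypothetical mean value of each subset
p_x = [[0.6, 0.4], [0.8, 0.2]]       # observed distributions for two players
mu = [sum(p[k] for p in p_x) / len(p_x) for k in range(2)]

# Observed performance of each player under the shared subset values.
x = [msp_talent(p, r_bar) for p in p_x]
mu_x = sum(x) / len(x)

for j, p in enumerate(p_x):
    # Regress the distribution, then convert it to a talent estimate.
    p_hat = [alpha * p[k] + (1 - alpha) * mu[k] for k in range(2)]
    y_s = msp_talent(p_hat, r_bar)
    # Linear regression estimate from the observed performance.
    y_lr = alpha * x[j] + (1 - alpha) * mu_x
    assert abs(y_s - y_lr) < 1e-12
```

Once alpha is allowed to differ between the two subsets, the two estimates no longer coincide, which is the extra flexibility the MSP method relies on.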

A. SENSOR DATA
The Trackman (TM) phased-array Doppler radar has been used by MLB's Statcast system [1] since 2017 to track and characterize batted balls. The TM radar operates in the X-band at approximately 10.5 GHz and is positioned high behind home plate. The measured initial speed s and vertical launch angle v (Figure 1) for batted balls play an important role in determining batted ball value [19]. In particular, batters tend to achieve the best results for batted balls with an initial speed greater than 90 miles per hour and a vertical launch angle between 10 and 30 degrees.

B. REPRESENTING BATTED BALL VALUE
Many statistics [24] can be used to quantify a batter's performance on batted balls. Batting average, for example, is the fraction of batted balls that result in a hit but has the deficiency that all hits are given equal value. Slugging percentage allocates different weights to different kinds of hits, e.g. single or double, but has been shown to overweight doubles, triples, and home runs. Weighted on base average (wOBA) [23] uses weights for each batted ball outcome that are proportional to run value and, for this reason, we use wOBA to represent batted ball value R(i, j).

C. CONTEXTUAL INFORMATION
A batted ball with a given set of physical parameters such as s and v occurs in a context that can affect its value. Variation in the outfield geometry across stadiums [9] and variation in the ambient weather conditions [8] can affect the value of a ball hit in the air. The batter's running speed [10] plays a role in determining batted ball value, especially for balls hit on the ground. The quality of defenders can also affect the value of a batted ball hit to a given region of the field. These factors cause the batted ball value R(j, k) for subset k to vary depending on the distribution of contextual variables for player j. We will show later in this section how contextual information can be combined with the batted ball distribution estimates $\hat{p}_y(j, k)$ to improve the accuracy of the $\hat{y}_s(j)$ predictions.

D. ASSESSING PREDICTION ACCURACY
Statcast data from MLB games in 2019 was employed to evaluate methods for using observed data to predict player performance in unobserved data. After removing bunts from the dataset, each of the P = 159 players with at least 300 batted balls during the 2019 season was considered. Switch-hitters who bat both right-handed and left-handed were regarded as a different batter for each handedness. The first 300 batted balls for each player were divided into an observed set of N = 150 batted balls and an unobserved set of N = 150 batted balls. The odd batted balls in chronological order for each player defined the observed set and the even batted balls defined the unobserved set. The batted ball value R(i, j) for batted ball i and player j was defined by the wOBA weight for the batted ball result as described in Section IV-B. For the 2019 MLB season the wOBA weights are out = 0.000, single = 0.870, double = 1.217, triple = 1.529, home run = 1.940, and batter reaches on error = 0.920 [31]. The observed batted ball data was used to generate predictions $\hat{y}(j)$ for the unobserved performance y(j). The accuracy of a set of predictions $\hat{y}(j)$ is evaluated using the sum of squared errors (SSE) between the unobserved performance and its prediction
$$\mathrm{SSE} = \sum_{j=1}^{P} \left( y(j) - \hat{y}(j) \right)^2. \tag{20}$$
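The evaluation protocol can be sketched in Python. The wOBA weights are the 2019 values quoted above; the dictionary keys, function names, and the toy outcome sequence are hypothetical labels chosen for this illustration and are not Statcast field names.

```python
# 2019 wOBA weights as quoted in the text; the key names are illustrative.
WOBA_2019 = {
    "out": 0.000,
    "single": 0.870,
    "double": 1.217,
    "triple": 1.529,
    "home_run": 1.940,
    "reached_on_error": 0.920,
}


def split_odd_even(outcomes):
    """Split a chronological list into odd (1st, 3rd, ...) and even (2nd, 4th, ...) balls."""
    return outcomes[0::2], outcomes[1::2]


def mean_woba(outcomes):
    """Average wOBA weight over a list of batted ball outcomes."""
    return sum(WOBA_2019[o] for o in outcomes) / len(outcomes)


def sse(actual, predicted):
    """Sum of squared errors between unobserved performance and its prediction."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


# Toy player with four batted balls in chronological order.
observed, unobserved = split_odd_even(["single", "out", "out", "home_run"])
x_j = mean_woba(observed)      # naive prediction from the observed half
y_j = mean_woba(unobserved)    # performance to be predicted
```

Interleaving the two halves, rather than splitting the season at a midpoint, keeps the observed and unobserved samples comparable with respect to time-dependent effects such as weather and schedule.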

E. LINEAR REGRESSION
The linear regression model defined by equation (9) was used to generate the $\hat{y}(j)$ predictions for the data described in Section IV-D. The resulting model is
$$\hat{y}(j) = 0.294\, x(j) + (1 - 0.294)(0.402) \tag{21}$$
where the observed batted ball data was used to compute α(150) = 0.294 and $\mu_x$ = 0.402 as described in Section II-B. This model gives an SSE of 0.647 using equation (20). Two boundary instances of the linear regression model are the naive prediction $\hat{y}(j) = x(j)$ for α(N) = 1 and the baseline prediction $\hat{y}(j) = \mu_x$ for α(N) = 0. For this dataset, the naive prediction gives an SSE of 0.780 and the baseline prediction gives an SSE of 0.743, which are both larger than the SSE obtained using the linear regression model in equation (21). The $\hat{y}(j)$ prediction lines for the linear regression model and the naive and baseline predictions are shown in Figure 2 along with the (x(j), y(j)) points for each of the 159 players. The SSE results for these methods are summarized in Table 1.

F. MACHINE LEARNING
A machine learning method [18] based on utilizing sensor measurements in combination with k-nearest neighbors and a generalized linear model has also been used to quantify player performance on batted balls. This method generates a value $R'(i, j)$ called xwOBAcon for batted balls and is publicly available at baseballsavant.com. The xwOBAcon prediction $\hat{y}_m(j) = x'(j)$ is given by
$$\hat{y}_m(j) = x'(j) = \frac{1}{N} \sum_{i=1}^{N} R'(i, j). \tag{22}$$
This prediction gives an SSE of 0.688 for the dataset described in Section IV-D. This result is worse than the SSE of 0.647 obtained using LR as shown in Table 1 since the xwOBAcon prediction is optimized for modeling the observed data rather than for predicting the unobserved data.
Although it is not part of the xwOBAcon calculation, the James-Stein paradox [20]-[22] suggests that we can improve the predictive accuracy of xwOBAcon by applying the steps in Section II-B to the $R'(i, j)$ values. This results in a regressed xwOBAcon prediction which is given by
$$\hat{y}_m(j) = \alpha'(N)\, x'(j) + \left( 1 - \alpha'(N) \right) \mu'_x \tag{23}$$
where $\alpha'(N)$ is Cronbach's alpha computed from the $R'(i, j)$ values. The SSE obtained using this prediction is 0.578, which improves on the xwOBAcon prediction and provides more empirical evidence for the James-Stein result. The xwOBAcon baseline prediction $\hat{y}_m(j) = \mu'_x$, where $\mu'_x$ is the mean for the $x'(j)$, gives an SSE of 0.736, which is slightly smaller than for the standard baseline prediction $\hat{y}(j) = \mu_x$. The results obtained using $x'(j)$ are summarized in Table 2 and we see that the SSE values are smaller than the corresponding values in Table 1.

G. MEASUREMENT SPACE PARTITIONING
The measured initial speed and launch angle can be used to represent a batted ball as a point in a two-dimensional (s, v) measurement space. This space can be partitioned into B disjoint subsets as described in Section III-A. In the Appendix, we show that the accuracy of the prediction in equation (17) depends on the partition. In this section we define different ways to partition the (s, v) measurement space and show how training data can be used to optimize partition selection.
The internal region can be further divided into rectangular subregions b(i, j) of dimension $s_{width} \times v_{width}$ which are defined by
$$b(i, j):\ s_{min} + (i - 1)\, s_{width} \le s < s_{min} + i\, s_{width} \ \text{ and } \ v_{min} + (j - 1)\, v_{width} \le v < v_{min} + j\, v_{width}.$$
The prediction method described in Section III-C was used to process the observed and unobserved data described in Section IV-D using each of the 42 partitions. For the finer partitions the observed data does not contain enough samples to reliably estimate $\hat{R}(j, k)$ for each (j, k). Therefore, the mean $\bar{R}(k)$ in equation (18) was used to approximate $\hat{R}(j, k)$ for each j and k. The smallest SSE of 0.532 was obtained for $P_{2.5,40}$ while the largest SSE of 0.743 was obtained for $P_{80,160}$. If we neglect the effect of the boundary regions, the use of $P_{80,160}$ is equivalent to the baseline prediction $\hat{y}(j) = \mu_x$ for which we also reported an SSE of 0.743 in Section IV-E.
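Assigning a batted ball to a rectangular subregion b(i, j) amounts to a pair of floor divisions, as the Python sketch below shows. Several things here are assumptions of the sketch rather than statements from the text: the function name `subregion_index` is hypothetical, the launch-angle index is assumed to follow the same pattern as the speed index, and the defaults s_min = 37.5 mph and v_min = -75 degrees are inferred from the example in Section IV-G3, where b(11, 9) covers 87.5 to 92.5 mph and 5 to 15 degrees for 5 mph by 10 degree cells. The boundary regions outside the internal region are not handled.

```python
import math


def subregion_index(s, v, s_width, v_width, s_min=37.5, v_min=-75.0):
    """Return the 1-based (i, j) index of the cell that contains the point (s, v).

    s_min and v_min are inferred defaults (assumptions), not values from the text.
    """
    i = math.floor((s - s_min) / s_width) + 1   # speed index
    j = math.floor((v - v_min) / v_width) + 1   # launch angle index
    return i, j


# A 90 mph batted ball at 10 degrees with 5 mph by 10 degree cells.
cell = subregion_index(90.0, 10.0, 5.0, 10.0)   # (11, 9)
```

The lower edge of each cell is inclusive and the upper edge is exclusive, matching the form of the inequalities in the definition of b(i, j).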

2) PARTITION SELECTION
Partition selection is an important issue since there are large differences in the SSE for different partitions. To address this issue, we examine whether the analysis of previous year data can be used to optimize partition selection for current year data. To this end, we computed the SSE for each of the 42 partitions defined in Section IV-G1 using 2018 batted ball data arranged as described in Section IV-D for the 2019 data. There were P = 158 players with at least 300 batted balls in 2018 that were considered for analysis. Figure 4

3) EXAMPLE
In this section we illustrate the mechanics of the MSP method using the 2019 batted ball data. The example considers the $P_{5,10}$ partition defined in Section IV-G1 that was selected using prior year data as described in Section IV-G2. Figure 5 plots the α(150, k) function and Figure 6 plots the mean µ(k) function over the subregions k for this partition. The α(150, k) function is approximately in the shape of a rotated V with most of the larger values occurring for s greater than 95 mph. Figures 7 and 8 demonstrate properties of α(150, k) and µ(k) for specific subregions $S_1$ and $S_2$ of $P_{5,10}$ defined by
$S_1$: 87.5 mph ≤ s < 92.5 mph and 5° ≤ v < 15°
$S_2$: 107.5 mph ≤ s < 112.5 mph and 15° ≤ v < 25°
which correspond respectively to b(11, 9) and b(15, 10) using the notation in Section IV-G1. The observed data described in Section IV-D gives values of $\alpha(150, S_1) = 0.01$, $\mu(S_1) = 0.017$, $\alpha(150, S_2) = 0.61$, and $\mu(S_2) = 0.011$, which predict little correlation between the fraction of batted balls in the observed and unobserved data for $S_1$ and a larger correlation between the fraction of batted balls in the observed and unobserved data for $S_2$. Figure 7 plots the P = 159 points $(p_x(j, S_1), p_y(j, S_1))$ as defined by equations (11) and (12) along with the prediction line from equation (13), where each point in the figure has been moved by a small random amount to increase the visibility of the points. There is little correlation between the $p_x(j, S_1)$ and the $p_y(j, S_1)$ as predicted by the small estimated value of $\alpha(150, S_1)$. Figure 8 is the same plot for $S_2$ where the points have a larger positive correlation as predicted by $\alpha(150, S_2)$. In each figure the red prediction line agrees reasonably well with the structure of the data.
Figure 9 displays the full observed distribution $p_x(j, k)$ for player j = Jorge Polanco as a left-handed batter using $P_{5,10}$. Figure 10 is the corresponding regressed distribution $\hat{p}_y(j, k)$ computed using equation (13). We see that the regressed distribution captures the overall structure of $p_x(j, k)$ but is substantially smoother. The regressed distribution results in a talent level estimate $\hat{y}_s(j)$ in equation (17) of 0.397. This $\hat{y}_s(j)$ is much closer to the unobserved performance y(j) of 0.386 than the LR prediction $\hat{y}(j) = 0.424$ or the naive prediction of x(j) = 0.475, which corresponds to the observed distribution shown in Figure 9.
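To make the contrast between the two subregions concrete, the short Python sketch below applies the regression of equation (13) to a hypothetical player who put 6 of 150 observed batted balls (a fraction of 0.04) into each of S1 and S2. The alpha and mean values are the ones quoted in this section; the player and the helper name `regress_fraction` are invented for illustration.

```python
def regress_fraction(p_x, alpha, mu):
    """Regress an observed fraction toward the mean over players using alpha."""
    return alpha * p_x + (1.0 - alpha) * mu


p_obs = 6 / 150   # hypothetical observed fraction in each subregion

# S1: alpha = 0.01, mean = 0.017 -> the estimate is almost entirely the mean.
p_hat_s1 = regress_fraction(p_obs, 0.01, 0.017)   # about 0.0172

# S2: alpha = 0.61, mean = 0.011 -> most of the observed excess is retained.
p_hat_s2 = regress_fraction(p_obs, 0.61, 0.011)   # about 0.0287
```

The same observed fraction therefore carries very different weight depending on where in measurement space it occurs, which is the behavior that a single alpha in equation (9) cannot express.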

4) COMPARISON WITH LINEAR REGRESSION
In this section we compare properties of the LR and MSP predictions. For the data described in Section IV-D the LR prediction is defined by the line (equation (21)) plotted in Figure 11. This figure also plots the 159 $\hat{y}_s(j)$ predictions for the same data using the $P_{5,10}$ partition. We see that players $j_1$ and $j_2$ with the same observed performance $x(j_1) = x(j_2)$ and therefore the same LR prediction $\hat{y}(j_1) = \hat{y}(j_2)$ can be assigned different MSP predictions $\hat{y}_s(j_1) \neq \hat{y}_s(j_2)$.
In Section III-C we showed that an important difference between $\hat{y}(j)$ and $\hat{y}_s(j)$ is that the former is defined using a single α(N) while the latter employs a separate α(N, k) for each subset k. Players with an observed batted ball distribution $p_x(j, k)$ that includes a large fraction of batted balls in subsets k with large values of α(N, k) will have less regression to the mean in the calculation of $\hat{y}_s(j)$ than players with a batted ball distribution that has smaller values of α(N, k). This allows the $\hat{y}_s(j)$ prediction to adapt the amount of regression to a player's collection of batted balls. By comparing equations (13) and (17) with the LR model of equation (9) we see that the correlation-weighted expected wOBA should capture a large fraction of the variance in the difference $\hat{y}_s(j) - \hat{y}(j)$. Figure 12 is a scatterplot of $\hat{y}_s(j) - \hat{y}(j)$ versus the correlation-weighted expected wOBA C.
Table 3 considers four players with similar $\hat{y}(j)$ LR predictions. The table also shows that several of the players have significant differences in correlation-weighted expected wOBA C. The players (Hernandez, DeJong) with below average values of C have negative $\hat{y}_s - \hat{y}$ differences while the players (Acuna, Donaldson) with above average values of C have positive $\hat{y}_s - \hat{y}$ differences as predicted by Figure 12. We see from the last two columns of the table that these differences benefit the MSP prediction as the LR prediction error $\hat{y} - y$ is larger in absolute value than the MSP prediction error $\hat{y}_s - y$ in each case.

5) INCORPORATING CONTEXTUAL INFORMATION
In Section IV-C we described several contextual factors that can affect the value of a batted ball with parameters (s, v). Accounting for each of these factors can improve the accuracy of the MSP predictions. In this section we describe a method that can be used to estimate $\hat{R}(j, k)$ in equation (17) to account for the effects of varying outfield geometry and atmospheric conditions across ballparks. Since a player j typically plays about half of his games in a single home park, these effects can have a significant impact on $\hat{R}(j, k)$. As an example, Figure 13 plots the outfield boundaries for Fenway Park in Boston and Yankee Stadium in New York where the batter's location is at home plate in the lower left corner. A shorter distance from home plate to the outfield boundary typically improves the batter's likelihood of a home run for a batted ball hit in the air. In addition, the altitude of the ballpark affects the air density, which plays an important role in determining how far a batted ball will carry [8]. The outfield geometry can affect players differently depending on whether they bat right-handed or left-handed since right-handed batters tend to hit most of their home runs to left field while left-handed batters tend to hit most of their home runs to right field.
We will learn ballpark-dependent batted ball values from 2018 data and use these values to process the 2019 data described in Section IV-D. The value of batted balls in a subset k will depend on the quality of the fielders that defend against these batted balls. The home team defenders are on the field about half of the time for games played in park p, which can cause bias in batted ball values for a given (k, p). Define $R_h(k, p)$ as the average wOBA value for batted balls hit by batters of hand h in subset k and park p with the visiting team on defense in 2018. Let $R_h(k)$ be the average wOBA value for batted balls hit by all batters of hand h in subset k in all parks in 2018.
For (h, k, p) groups that correspond to vertical launch angles v ≥ 15° and include at least ten batted balls in the calculation of $R_h(k, p)$ we compute the factor
$$F_h(k, p) = \frac{R_h(k, p)}{R_h(k)} \tag{25}$$
where otherwise $F_h(k, p)$ is set to 1. For a player j of hand h with home park p in 2019 we define
$$\hat{R}(j, k) = \bar{R}(k) \left[ 0.5\, F_h(k, p) + 0.5 \right] \tag{26}$$
where $\bar{R}(k)$ is defined in equation (18) and the 0.5 accounts for the fact that a player plays approximately half of his games in the same home ballpark. The $\hat{R}(j, k)$ can be used to improve the accuracy of the prediction in equation (17).
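A minimal Python sketch of the park adjustment follows. Two things are assumptions of this sketch: that the factor is the ratio of the park-specific average to the all-park average, and that the adjusted value blends that factor with 1 using the 0.5 home-game weight described above. The function names, argument names, and the guard against a zero all-park average are hypothetical additions.

```python
def park_factor(park_mean, overall_mean, n_balls, launch_angle, min_balls=10):
    """Ratio of park-specific to all-park value; 1.0 when the group does not qualify.

    Assumed form: park_mean / overall_mean for groups with launch angle >= 15
    degrees and at least min_balls batted balls.
    """
    if launch_angle >= 15.0 and n_balls >= min_balls and overall_mean > 0:
        return park_mean / overall_mean
    return 1.0


def adjusted_value(r_bar_k, factor, home_share=0.5):
    """Blend the park factor with 1 according to the share of games played at home."""
    return r_bar_k * (home_share * factor + (1.0 - home_share))


# Hypothetical subset: balls worth 20% more than average in the home park.
f = park_factor(park_mean=1.2, overall_mean=1.0, n_balls=12, launch_angle=25.0)
r_adj = adjusted_value(1.0, f)   # half of the 20% park effect is applied
```

Restricting the factor to launch angles of at least 15 degrees limits the adjustment to balls hit in the air, where outfield geometry and air density matter most, and the minimum count avoids factors estimated from very few batted balls.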
For this subregion we have the $\hat{R}(j, k)$ values shown in Table 4, which demonstrate that right-handed batters have an advantage in Fenway Park and left-handed batters have an advantage in Yankee Stadium. These observations are consistent with the outfield geometries shown in Figure 13. Let $\hat{y}_{s1}(j)$ be the prediction of equation (17) using $\hat{R}(j, k) = \bar{R}(k)$ and let $\hat{y}_{s2}(j)$ be the prediction using $\hat{R}(j, k)$ as defined by equation (26). As reported in Section IV-G2, $\hat{y}_{s1}(j)$ produces an SSE of 0.546 for partition $P_{5,10}$ on the data described in Section IV-D. The use of $\hat{y}_{s2}(j)$ reduces the SSE to 0.526. Table 5 presents the five players j with the largest differences $\hat{y}_{s2}(j) - \hat{y}_{s1}(j)$ and Table 6 presents the five players with the smallest differences $\hat{y}_{s2}(j) - \hat{y}_{s1}(j)$. Thus, the players in Table 5 are expected to benefit from their home ballpark while the players in Table 6 are expected to be hindered by their home ballpark. The parks represented in Table 5 are known to benefit batters. Coors Field in Denver has an altitude of 5197 feet which enables batted balls to carry longer distances and Citizens Bank Park in Philadelphia has an [...] Table 6 have outfield geometries that are detrimental to right-handed batters. The last two columns in each table give the prediction errors $E_1(j) = \hat{y}_{s1}(j) - y(j)$ and $E_2(j) = \hat{y}_{s2}(j) - y(j)$ where y(j) is the unobserved performance. $E_1(j)$ is negative for each of the players in Table 5, which is consistent with the expectation that these players should benefit from their home ballpark, while $E_1(j)$ is positive for four of the five players in Table 6, which is consistent with the expectation that these players should be hindered by their home ballpark. We see that for nine of the ten players in the two tables we have $|E_1| > |E_2|$ so that the use of home park information reduces the prediction error.

V. CONCLUSION
Sensor systems that acquire large sets of data have been deployed to document the mechanics of several sports including baseball [1], basketball [32], football [33], and golf [34] at unprecedented levels of detail. Data-driven techniques have been applied to these sensor measurements to discover new skills [35], quantify known skills with greater accuracy [36], and understand biomechanical principles [37] to improve performance and prevent injury. This information has been used by professional sports teams in search of an advantage [38]. While there are large disparities in the financial resources available to teams, the use of data-driven models has enabled small market franchises to compete successfully against their more affluent opponents [39].
We have used ball-tracking radar data to show that the predictive value of a batted ball in baseball depends on its speed and vertical launch angle. This constraint enables a batted ball distribution to be estimated from a set of observations using a regression process that adapts to a player's particular collection of batted balls. We showed that these estimated distributions can be used to make improved predictions about unobserved data. The methodology can be adapted to include additional sensor measurements for properties such as spin and horizontal angle as they become available. Since the approach is based on estimating distributions defined over a partition of measurement space, fine-grained contextual adjustments can be included to improve the accuracy of the predictions. The measurement space partitioning process can be used for several applications in baseball including performance forecasting and defensive positioning as well as for a range of other estimation and prediction tasks involving large sets of multidimensional sensor data.

APPENDIX: DEPENDENCE OF PREDICTION ACCURACY ON THE PARTITION
The error in a prediction generated using the MSP approach depends on the partition of measurement space. Using equation (17), we can write the unobserved performance for player j as
$$y(j) = \sum_{k=1}^{B} \left( \hat{p}_y(j, k) + \epsilon_p(j, k) \right) \left( \hat{R}(j, k) + \epsilon_R(j, k) \right). \tag{27}$$
The error terms are defined by $\epsilon_p(j, k) = p_y(j, k) - \hat{p}_y(j, k)$ and $\epsilon_R(j, k) = R_y(j, k) - \hat{R}(j, k)$ where $R_y(j, k)$ is the average value of the unobserved batted balls in subset k for player j. The prediction error is given by
$$y(j) - \hat{y}_s(j) = \sum_{k=1}^{B} \left[ \hat{p}_y(j, k)\, \epsilon_R(j, k) + \hat{R}(j, k)\, \epsilon_p(j, k) + \epsilon_p(j, k)\, \epsilon_R(j, k) \right] \tag{28}$$
where each term in the sum depends on the subset k.
The error terms have a complex dependence on the group of subsets that define the partition. Reducing the size of the $\epsilon_R(j, k)$ error depends on balancing the competing goals of using subsets k that include enough data to estimate $\hat{R}(j, k)$ accurately but which also allow a single $\hat{R}(j, k)$ to be representative of any particular sample within a subset that might occur in y(j). The variance of the $\epsilon_p(j, k)$ error is given by [26]
$$\mathrm{VAR}\left[ \epsilon_p(j, k) \right] = \sigma^2_p(k) \left( 1 - \alpha^2(N, k) \right)$$
where $\sigma^2_p(k)$ is the variance of $p_x(j, k)$ over batters j for subset k. Thus, $\mathrm{VAR}[\epsilon_p(j, k)]$ depends on both the distribution of the $p_x(j, k)$ and the α(N, k). Since the error terms and the prediction error in equation (28) have a complex dependence on the interaction between the measurement space partition and the structure of the data, we use a learning process for partition selection as described in Section IV-G2.
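The decomposition of the prediction error into three sums can be verified numerically. The Python sketch below uses made-up values for one player and two subsets; the function name `msp_error_terms` and the variable names `eps_p` and `eps_r` for the distribution and value error terms are naming assumptions of this sketch.

```python
def msp_error_terms(p_hat, r_hat, p_unobs, r_unobs):
    """Return the three sums whose total equals the MSP prediction error."""
    first = second = cross = 0.0
    for ph, rh, pu, ru in zip(p_hat, r_hat, p_unobs, r_unobs):
        eps_p = pu - ph            # error in the estimated distribution
        eps_r = ru - rh            # error in the estimated subset value
        first += ph * eps_r
        second += rh * eps_p
        cross += eps_p * eps_r
    return first, second, cross


# Hypothetical estimates and unobserved quantities for one player, two subsets.
p_hat, r_hat = [0.6, 0.4], [0.2, 0.9]
p_unobs, r_unobs = [0.5, 0.5], [0.3, 0.8]

y = sum(p * r for p, r in zip(p_unobs, r_unobs))     # unobserved performance
y_s = sum(p * r for p, r in zip(p_hat, r_hat))       # MSP prediction
terms = msp_error_terms(p_hat, r_hat, p_unobs, r_unobs)
assert abs((y - y_s) - sum(terms)) < 1e-12
```

Separating the three sums shows how a finer partition trades one error source against another: smaller cells make a single value per cell more representative but leave fewer batted balls to estimate both the cell value and the cell probability.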