Deep reinforcement learning in a racket sport for player evaluation with technical and tactical contexts

Evaluating the performance of players in a dynamic competition is vital for achieving effective sports coaching. However, a quantitative evaluation of players in racket sports is difficult because it is derived from the integration of complex tactical and technical (i.e., whole-body movement) performances. In this study, we propose a new evaluation method for racket sports based on deep reinforcement learning, which can analyze the motion of a player in more detail, rather than only considering the results (i.e., scores). Our method uses historical data including information related to the tactical and technical performance of players to learn the next-score probability as a Q-function, which is used to value the actions of the players. We leverage a long short-term memory (LSTM) model to learn the Q-function, with the poses of the players and the position of the shuttlecock as the input, which are identified by the AlphaPose and TrackNet algorithms, respectively. We verified our approach by comparing various baselines and demonstrated the effectiveness of our method through use cases that analyze the performance of the top badminton players in world-class events.


I. INTRODUCTION
With the development of deep learning technologies, computer vision-based deep learning methods have become increasingly important in sports analytics. Among the broad scope of tasks to be addressed, player evaluation is a major one; it has received increasing attention because it assesses the actions observed in a game for players, coaches, and other staff in order to facilitate decision making (i.e., tactics) and improve technical skills, thus providing a competitive advantage to an individual or a team.
There are two main approaches for player evaluation. The first is to use various statistics to sum up "the total contributions of a player to his/her team" into a number. The second approach is to assign values to the actions performed during a match. In the second approach, traditional methods (e.g., [1]) have significant limitations, as they can only evaluate the actions that directly lead to a score (e.g., shooting) and are unable to evaluate actions that lead to a score indirectly. Recently, Markov models have been used to address this issue; they have the advantage of unified evaluation criteria (actions are evaluated on the same scale by anticipating expected outcomes). These approaches are based on the analysis of event stream data (including optical data) that describe the actions performed in a game. However, in racket sports, the task of action value evaluation of players is almost unexplored because, in such sports, technical whole-body movements have to be evaluated in addition to the tactics.
In racket sports, most previous studies on player evaluation were limited to traditional methods. Shen et al. evaluated a single shot type (lob) in badminton, and defined a good/bad lob based on whether the subsequent action of the opponent is a smash and leads to a score [2]. This is one of the few available studies on the topic; inspired by its idea, we adopt the Markov model approach (i.e., the second approach in the above paragraph) using technical whole-body movement and tactical information for player evaluation in actual games.
With the recent advances in deep learning, deep reinforcement learning (DRL) [3] has been applied in various fields [4]-[7], and has shown significant promise in player evaluation in complex and dynamic environments. A previous method used the Markov model in team sports (e.g., ice hockey) to evaluate the tactical performance, but ignored the effect of technical performance on the value of the action. As racket sports have fewer participants (single- or double-player games), the result of a game depends largely on individual technical skills. Therefore, when compared to team sports, the sport-specific technical performance of the player(s) should be analyzed through videos. Most previous studies have used active reinforcement learning (RL) to calculate optimal strategies for complex continuous-flow games [8], [9]. Similar to Liu et al. [10], we solve a prediction (not control) problem in the passive learning (fixed policy) setting. We use RL as a behavioral analytic tool for real human agents, not to control artificial agents.
VOLUME 4, 2016
In this study, we propose a player evaluation approach with play contexts in badminton, which leverages historical match data containing tactical and technical information to assign a rating to the actions (e.g., smash) performed by the players in a match. For a given badminton game, we used deep learning methods to extract the features from videos that contain information related to both the tactical and technical performance of the players. Through experiments, we examined the effect of technical and tactical contexts on player evaluation by applying a DRL method to a badminton dataset (videos of the BWF: Badminton World Federation). Our research aims to provide coaches with some insight into how the movements of players create advantages and disadvantages in specific competition situations. Therefore, we analyzed the content of concrete movements and their influence.
In summary, the primary contributions of this study are as follows.
(1) We propose a player evaluation method in racket sports based on deep reinforcement learning (DRL) that can analyze the motion of a player in more detail, instead of analyzing the results (i.e., scores).
(2) Methodologically, our method leverages historical data including the tactical and technical performance information of players to learn the next-score probability as a Q-function, which is used to value the actions of the players.
(3) In the experiment, we verified our approach by comparing various baselines, and confirmed the effectiveness of our method through use cases that analyze the performance of the top badminton players in world-class events.
The remainder of this paper is organized as follows. An overview of the related works is presented in Section II, our method is described in Section III, the experimental results are discussed in Section IV, and this paper is concluded in Section V.

II. RELATED WORK
Evaluation methods in many sports. There are two main approaches employed in evaluating sports players. One involves boiling down the contributions of a player into a single number. Some well-known examples are Wins Above Replacement (WAR) [11] in baseball and Player Efficiency Rating (PER) [12] in basketball. The other approach involves quantifying the value of the action of a player using the Markov model. Recently, the latter approach has been applied in various sports. Cervone et al. proposed Expected Possession Value (EPV) to evaluate the decisions and actions of players using spatiotemporal tracking data in basketball [13]. Liu et al. introduced Game Impact Metric (GIM) to aggregate the values of the actions of players in ice hockey [10]. In soccer, a method of Valuing Actions by Estimating Probabilities (VAEP) [14] and its variants [15] have been proposed. A survey on team sports was recently conducted in [16].
Based on reinforcement learning frameworks, certain papers were published on team sports using inverse planning methods, which estimate the action model or reward function from the observed data using statistical learning techniques. For example, the state-action value function (Q-function) of a player was estimated using a recurrent neural network [10], [17], which was interpreted using a linear model tree [18]. To evaluate the shooting action of players, researchers investigated the expected possession value [19]-[21], and the value of the space [22], [23] by extending a Voronoi diagram [24]. In team games, DRL was used for estimating the quality of defensive actions in ball-screen defense in basketball [25]. Compared to these studies on invasive sports, we propose a DRL framework for racket sports utilizing technical (i.e., pose) information and tactical contexts.
Quantitative analysis in racket sports. In racket sports, there is a broad range of tasks to address. Most of the related research focused on the detection/tracking of players and sports equipment (e.g., ball) [26]-[28], stroke classification [29]-[31], and match outcome prediction [32], [33]. Despite the popularity of racket sports, only a limited number of works offer computer vision-based solutions for the task of player evaluation. Certain previous works [34] manually designed evaluation formulas for players (i.e., the first approach described above). Pfeiffer et al. were the first to adopt the Markov model approach for performance diagnosis in table tennis [35]. McGarry and Franks used the Markov model for explaining the championship performance in squash [36]. However, these studies discretized the coordinates of location and time, which resulted in the loss of information and failed to generalize to the unobserved parts of the state space.
For stroke evaluation, traditional methods require sports analysts to examine the videos repeatedly, and evaluate the stroke performance based on their knowledge. Deep learning-based computer vision approaches extract stroke features from videos [35], [37], [38] to characterize and model the competition process. Wang et al. [39] integrated the knowledge of analysts, and trained a classifier that learned the evaluation of analysts to obtain quantified evaluation results; however, their approach, which relies on domain knowledge (table tennis), differs from our general approach for modeling racket sports. Recently, Wang et al. [40] introduced a badminton language to describe the process and predict the win probability in a rally. Compared to reinforcement learning methods, their method cannot estimate the value of each stroke directly.

III. PROPOSED METHOD
In this section, we describe the proposed method for estimating the action value of a player. The overview of our approach is illustrated in Fig. 1.
FIGURE 1: Overview of our approach. Pose estimation of players and shuttlecock position detection using AlphaPose [41] and TrackNet [42], followed by a DRL model for estimating the Q-function. Given a Q-function, the action value is defined as the change in Q-value due to the action.
Badminton is a competitive sport, where the winner of a match is determined based on the best performance out of three games and each game is played for 21 points. A rally starts with a serve and ends when the point is won. To describe a badminton match, we considered a rally as the analysis unit. The course of the rally can be described as the transition from one state to another. A rally in badminton comprises a sequence of strokes with outcomes.
By using the video data of badminton games, we first segmented each game in a match into several rallies. For each rally, we estimated the XY-coordinate values of the joint positions (the 17 body parts defined in the COCO (Common Objects in Context) dataset: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles) using AlphaPose [41], a popular high-precision multi-person body-pose estimation system, and detected the shuttlecock position using TrackNet [42], an object tracking network that has been shown to exhibit a decent tracking capability in games with high-speed balls such as badminton.
As a preprocessing procedure, joint positions that were not properly estimated owing to an overlap were annotated through the COCO annotator. Moreover, we assumed that the midpoint of the two ankles indicates the position of the player, and the pose represents the coordinate values relative to the position of the player. For the pose of a left-handed player, we reversed the corresponding relative coordinate values.
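The preprocessing step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `normalize_pose` is hypothetical, the keypoints are assumed to follow the standard COCO 17-joint order (left ankle at index 15, right ankle at index 16), and "reversing" the coordinates of a left-handed player is assumed to mean mirroring the horizontal offset only.

```python
# Indices of the ankles in the standard COCO 17-keypoint order.
LEFT_ANKLE, RIGHT_ANKLE = 15, 16


def normalize_pose(keypoints, left_handed=False):
    """Return (player_position, relative_pose) for one player in one frame.

    keypoints: list of 17 (x, y) tuples in image coordinates.
    """
    if len(keypoints) != 17:
        raise ValueError("expected 17 COCO keypoints")
    (lx, ly), (rx, ry) = keypoints[LEFT_ANKLE], keypoints[RIGHT_ANKLE]
    # Player position: midpoint of the two ankles.
    position = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    # Pose: joint coordinates relative to the player position.
    relative = [(x - position[0], y - position[1]) for x, y in keypoints]
    if left_handed:
        # Assumption: mirror the horizontal offset for left-handed players.
        relative = [(-dx, dy) for dx, dy in relative]
    return position, relative
```

Whether left and right joint labels should also be swapped after mirroring is not stated in the text; the sketch leaves them unchanged.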
We combined the outputs as an input feature vector of a DRL model, applied the DRL method to estimate the Q-values, and finally, obtained the action value from the Q-values for evaluating the performance of a player. According to the position of the players relative to the camera, we divided the two players in a game into the front and back players. We considered the player closer to the camera as the front player; conversely, the player farther from the camera is the back player (we sometimes denote them as front and back, respectively).

A. FORMULATION
We utilize a reinforcement learning framework, specifically based on the recent sports-related work [10]. The reward R specifies the player who wins a point at the end of a rally. In this study, we generalized the strokes to nine types, namely serve, drop, smash, clear, lift, drive, block, net kill, and net shot. Fig. 2 illustrates the different stroke types. Action $a_t$ is one of the stroke types. To describe the state, we considered the feature vector at each hit time (the moment when the racket contacts the shuttlecock) as state representation $s_t$ at time step t.
The Q-function Q(s, a) represents the conditional probability that the front (respectively, back) player wins the point at the end of the rally:
$$Q_{\text{front/back}}(s, a) = P(\text{point} = 1 \mid s_t = s, a_t = a). \qquad (1)$$
The Q-function computes the expected rewards for an action taken in a given state. Different Q-functions can be used to study different outcomes of interest, such as goals and penalties [43]. In a variation from a previous study [35] that used "point" and "fault" as the expected outcomes, we used the next-score probability as the Q-function. The advantages are as follows. 1) Compared to the outcome of a rally ("point" or "fault"), the next-score probability function is highly interpretable, as it models the probability of an event. 2) It can provide coaches with a more detailed overview of player performance during a rally.
Fig. 3 shows the architecture of our DRL model. We constructed dynamic two-layer long short-term memory (LSTM) networks to learn the Q-function and estimate the Q-values. The networks take a sequence of states $s_t$ and actions $a_t$ at the moment the player hits the shuttlecock (hit frame) as the input. We used this model to simultaneously evaluate both the front and back players in a given rally.

B. LEARNING Q FUNCTION
The LSTM networks have an input layer with 256 nodes, a hidden layer with 256 nodes, and a dense output layer with 3 output nodes. The three output nodes, $Q_{\text{Back}}(s, a)$, $Q_{\text{Front}}(s, a)$, and $Q_{\text{Rally\_end}}(s, a)$, represent the probability that the back player wins the next point, the probability that the front player wins the next point, and the probability that the rally ends, respectively, given the present state and action. The LSTM networks require four types of input data for model training, namely the XY-coordinates of the positions of the front and back players, both their poses, the position of the shuttlecock, and the action.
In this study, the Q-function is learned via a neural network, which is called a function approximation approach. We approximate the Q-function with the LSTM networks, denoted by q, so that the approximation can be written as Q(s, a) ≈ q(s, a; w), where w denotes the parameters of the LSTM networks. We use the positions of the front and back players, both their poses, and the position of the shuttlecock as the state s, together with the action a, in q(s, a; w).
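As an illustration of how the four input types could be combined into one vector per hit frame, consider the sketch below. The function name `build_input`, the concatenation order, and the one-hot encoding of the stroke type are all assumptions made for this example; the text does not specify the exact encoding.

```python
# The nine stroke types named in the formulation.
STROKE_TYPES = ["serve", "drop", "smash", "clear", "lift",
                "drive", "block", "net kill", "net shot"]


def build_input(front_pos, back_pos, front_pose, back_pose, shuttle_pos, stroke):
    """Concatenate state and action into one flat feature vector.

    Positions are (x, y) tuples; poses are lists of 17 (x, y) tuples.
    """
    if stroke not in STROKE_TYPES:
        raise ValueError("unknown stroke type: %r" % (stroke,))
    # Assumption: the action is one-hot encoded over the nine stroke types.
    one_hot = [1.0 if s == stroke else 0.0 for s in STROKE_TYPES]

    def flatten(points):
        return [float(c) for p in points for c in p]

    return (flatten([front_pos, back_pos])
            + flatten(front_pose) + flatten(back_pose)
            + flatten([shuttle_pos]) + one_hot)
```

Under these assumptions each hit frame yields 4 + 68 + 2 + 9 = 83 values, and a rally becomes a sequence of such vectors.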
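We applied state-action-reward-state-action (SARSA), which is an on-policy reinforcement learning algorithm for estimating Q-values. The Q-value for a state-action pair is updated using the standard SARSA update:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right],$$
where $r_{t+1}$ is the reward received after taking action $a_t$, and $s_{t+1}$ and $a_{t+1}$ denote the state and action at time step t+1. α is the learning rate, and γ is the discount factor. Instead of tabular learning, we considered the neural networks as function approximators.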
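For intuition, the SARSA update can be sketched in its tabular form. This is not the LSTM-based learner used in this study: the function name `sarsa_update`, the dictionary-based Q-table, and the learning rate of 0.1 are illustrative assumptions, while γ = 0.3 follows the value reported for training.

```python
from collections import defaultdict


def sarsa_update(q, s, a, reward, s_next, a_next,
                 alpha=0.1, gamma=0.3, terminal=False):
    """Apply one tabular SARSA update in place and return the TD error.

    q maps (state, action) pairs to Q-values; unseen pairs default to 0.
    """
    # On-policy bootstrap: use the action actually taken in the next state.
    bootstrap = 0.0 if terminal else q[(s_next, a_next)]
    td_error = reward + gamma * bootstrap - q[(s, a)]
    q[(s, a)] += alpha * td_error
    return td_error


# Toy usage with hypothetical state labels.
q_table = defaultdict(float)
q_table[("s1", "smash")] = 0.5
sarsa_update(q_table, "s0", "clear", 0.0, "s1", "smash")
```

Because the policy is fixed (the observed play), the update only evaluates the behavior recorded in the data; it does not search for better actions.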
Our network is trained by minimizing a loss L using the Adam optimizer with a learning rate of $10^{-4}$ for 90 epochs. The discount factor was set to γ = 0.3.
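As a point of reference, a common loss for SARSA with function approximation is the squared temporal-difference error summed over the hits of a rally. The form below is an assumption for illustration and is not necessarily the exact loss used in this study; $r_{t+1}$ denotes the reward observed after action $a_t$, T the number of hits in the rally, and q(s, a; w) the LSTM approximation.

```latex
L(w) = \sum_{t=1}^{T} \Big( r_{t+1} + \gamma \, q(s_{t+1}, a_{t+1}; w) - q(s_t, a_t; w) \Big)^{2}
```

With a small discount factor such as γ = 0.3, the target is dominated by the immediate reward and the next estimate, so the influence of outcomes far in the future decays quickly.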

C. STROKE EVALUATION
For each action of the front/back player in a rally, action value A can be computed as the change in the Q-value due to the action:
$$A_{\text{front/back}}(s_t, a_t) = Q_{\text{front/back}}(s_t, a_t) - Q_{\text{front/back}}(s_{t-1}, a_{t-1}).$$
We set the probability that the front/back player will win the next point (Q-value) before service as 0. We calculated the action value of each stroke type and the corresponding number of each stroke type that a player takes in each game, and calculated the average action value of each stroke type performed by a player in a game, as shown in Fig. 4.
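A minimal sketch of this computation, assuming the action value is the difference between consecutive Q-values (the change in Q-value due to the action) with the Q-value before service set to 0. The function names and the toy Q-values are hypothetical, and a single Q-value track from one player's perspective is used for simplicity.

```python
from collections import defaultdict


def action_values(q_values, q_before_serve=0.0):
    """Action value of each stroke as the change in Q-value it causes."""
    values, previous = [], q_before_serve
    for q in q_values:
        values.append(q - previous)
        previous = q
    return values


def mean_value_by_stroke(strokes, values):
    """Average action value for each stroke type."""
    totals, counts = defaultdict(float), defaultdict(int)
    for stroke, value in zip(strokes, values):
        totals[stroke] += value
        counts[stroke] += 1
    return {stroke: totals[stroke] / counts[stroke] for stroke in totals}


# Toy rally: Q-values after each of four strokes (hypothetical numbers).
strokes = ["serve", "lift", "smash", "lift"]
values = action_values([0.5, 0.25, 0.75, 0.5])
print(mean_value_by_stroke(strokes, values))
```

A stroke that raises the estimated next-score probability receives a positive value, and one that lowers it receives a negative value, regardless of whether it ends the rally.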
We present a typical example to demonstrate the effectiveness of our approach. The graph on the left in Fig. 4 shows the dynamic changes in the Q-values of a rally in a match between Lin (back player) and Lee (front player), which was conducted on March 17, 2018. The back player Lin won this rally in the end. The figure plots the values of the three output nodes. The graph on the right in Fig. 4 shows the action value of each stroke performed by the front and back players. According to the bar chart, the fourth stroke, which was a "smash" performed by Lin, has the greatest action value, and the fifth stroke, which was a "block" performed by Lee, has a minor action value. The result of the action value is consistent with our intuitive feelings from the video that the "smash" performed by Lin was significantly powerful, and Lee failed to intercept the shuttle through the "block" stroke.
Moreover, this figure provides more insights on the rally. The intuitive feelings gained from the video may make the coach erroneously attribute the loss of a point to the poor defensive "block" performed by Lee. However, our result shows the action value of the "block" is positive, albeit a small one. Conversely, the third stroke, which was a "clear" stroke performed by Lee, has a negative action value; after this stroke, Lin seized the chance to attack using the offensive "smash" action. We examined the video and identified that Lee returned the shuttle to a position closer to Lin with the "clear" action; consequently, Lin has sufficient time to jump and attack with a "smash." The figure reflects that the previous actions in a rally have an indirect impact on the final result. Therefore, we can conclude that the result of our approach can provide useful clues for further analysis over the course of multiple strokes in a rally.

IV. EXPERIMENTS
A. DATASETS
We collected videos from the BWF TV channel. We ignored the cross-view problem, and only selected videos captured from a single view (as shown in Fig. 1). Our constructed dataset provides information regarding game contexts and player actions for the 2018-2020 BWF season, which contains 21 matches, covering 22 players and 320 rallies.

B. VERIFICATION OF OUR METHOD
We evaluated the design choices of our approach in terms of input components by comparing the performance of the LSTM networks with different inputs.
As presented in Table 1, a model trained with all the input components (shuttle position, player position, and player poses) achieves the best performance, indicating that all the input components contribute to estimating an accurate Q-function.
Next, we examined the effect of the stroke type on our model. As the action is a primary element in reinforce-

C. CHARACTERISTICS OF THE PLAYERS IDENTIFIED USING OUR METHOD
The matches considered here include one final, two semifinals, and one quarter-final men's singles matches in the 2018 BWF World Tour, which are the matches in the All England Open Tournament between Lin Dan and Shi Yuqi, Huang Yuxiang and Lin Dan, Son Wan-ho and Shi Yuqi, and Lin Dan and Lee Chong Wei. We examined the performance of the players from the perspective of the average action value of each stroke type performed by a player in a match (Fig. 5).
For example, we identified that, when Lin was playing against Huang and Lee (Figs. 5 (B) and (D)), the strokes performed by Lin were superior to those of his opponents. Lin had fewer stroke types whose action values were substantially below zero, especially when he was playing against Lee. His offensive strokes such as "smash" and "net shot" were better than those of Lee. However, when playing against Shi (Fig. 5 (A)), Lin did not display significant advantages, which would explain why he lost the final match. During the match between Shi and Son (Fig. 5 (C)), Shi showed superior performance in offensive strokes such as "smash" and "net shot." A summary of all player action values in the 2018-2020 BWF Tour is given in Table 2.

D. RELATIONSHIP WITH SCORE
Owing to the cross-view scenes in badminton videos, we could only use a portion of the rallies in each match. Therefore, to further demonstrate the effectiveness of our approach, we examined the relationship between average (maximum) action values and the score, as shown in Figs. 6 (A) and (B). Here, the average (maximum) action values refer to the average value of the maximum value of the actions performed by a player in each rally over several rallies, and the score indicates the number of points scored by the front/back player in a match.
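The two per-player summaries used in this analysis can be sketched as follows, assuming each rally is represented by the list of action values of one player's strokes. The function names and numbers are hypothetical.

```python
def mean_action_value(rallies):
    """Average over all actions a player performed across the rallies."""
    all_values = [v for rally in rallies for v in rally]
    return sum(all_values) / len(all_values)


def mean_of_rally_maxima(rallies):
    """Average, over rallies, of the largest action value in each rally."""
    return sum(max(rally) for rally in rallies) / len(rallies)


# Two toy rallies of one player (hypothetical action values).
rallies = [[0.5, -0.25], [0.25, 0.75, -0.5]]
print(mean_action_value(rallies), mean_of_rally_maxima(rallies))
```

The first summary reflects the overall quality of a player's strokes, whereas the second focuses on the single most valuable stroke in each rally.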
Spearman's rank correlation coefficient ρ was applied to quantify the correlation, because we also examined the relationship with the rank of the players, as described in the next subsection. Fig. 6 (A) shows that ρ = 0.150 (p > 0.05), indicating that there is no correlation between the score and average action value. Fig. 6 (B) shows that ρ = 0.312 (p < 0.05), which reflects a weak positive monotonic correlation between the score and average (maximum) action value.
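For readers unfamiliar with the statistic, Spearman's ρ is the Pearson correlation computed on ranks. The sketch below is a plain-Python illustration with tied values given their average rank; the function names are hypothetical, and the p-values reported in the text are not computed here.

```python
import math


def average_ranks(xs):
    """1-based ranks of xs, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples of size >= 2")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0.0 or sy == 0.0:
        raise ValueError("rho is undefined for a constant sample")
    return cov / (sx * sy)
```

Because only the ordering of the values matters, ρ captures monotonic relationships even when they are not linear, which suits comparisons involving player rankings.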
The results suggest that the average (maximum) action value could be associated with the score, and if the maximum action value of a player in a rally is greater than that of his opponent, the action can lead to obtaining the score of the rally.

E. RELATIONSHIP WITH THE RANK OF THE PLAYER
We also examined the relationship between the rank of the player and average (maximum) action values, as shown in Figs. 6 (C) and (D). Here, the average action values refer to the average value of all actions performed by a player over several rallies. For the rank of a player, we used the official ranking data of the BWF for the front/back player. Fig. 6 (C) shows that ρ = −0.353 (p < 0.05), which reflects a weak negative monotonic correlation between the rank of the player and average action value. Fig. 6 (D) shows that ρ = −0.048 (p > 0.05), which indicates that there is no correlation between the rank of the player and average (maximum) action value.
The results suggest that average action value can be associated with the rank of the player, and higher ranked players tend to perform actions with higher action values.

F. CORRELATION COMPARISON WITH BASELINE
No previous studies have used a DRL model to evaluate each stroke in racket sports. Compared with the similar method for the team sport of ice hockey [10], our study takes into account the importance of technical performance in single-racket sports. To evaluate our method, we removed the technical performance information (player pose) from the input and used the resulting model as a baseline.
Here, we computed the same Spearman's rank correlation coefficient ρ as above for the baseline model. We summarize the results in Table 3, which shows that there was no correlation between the player evaluation metrics and the standard success measures for the baseline model (all p > 0.05). Compared to the baseline, our method (full model) shows a weak monotonic correlation with the two success measures. These results suggest that the pose information included in our full model is important for evaluating the performance of the players.

V. CONCLUSIONS
In this study, we applied the DRL method to the task of player evaluation in racket sports (badminton). We developed a framework to estimate the action value of a player from an input video. We verified our approach by comparing various baselines, and validated the effectiveness of our method through use cases that analyzed the performance of top badminton players in world-class events. We also discovered valuable insights into the correlation between the action value and the score/rank. A badminton dataset with ground truth annotations of both stroke type and game scores will also be provided once the paper is ready for publication.
In our future work, we plan to improve the framework (e.g., modeling of players separately for player-specific analysis), and extend our framework to other racket sports such as tennis. We also believe that more accurate 3D feature extraction can help in modeling different sports.
