End-to-End Key-Player-Based Group Activity Recognition Network Applied to Basketball Offensive Tactic Identification in Limited Data Scenarios

In this paper, we propose an end-to-end key-player-based group activity recognition network applied to the identification of basketball offensive tactics in limited data scenarios. Our previous studies show that basketball tactics can be better recognized via key player detection with multiple instance learning (MIL) using the support vector machine (SVM). However, the SVM in that work relies on features extracted with basketball- and tactic-specific knowledge to perform well. Thus, in this study, we develop an end-to-end trainable neural network that needs no prior knowledge and integrate MIL into it. As long as a tactic label is given, MIL can train the network to identify the tactic's key players. At test time, our network can recognize the key players in a video clip and provide a tag of the tactic related to them. Like other neural network models, our network requires a large annotated dataset. However, we could collect only a small amount of labeled data, which is common in group activity recognition. To overcome this limitation, we propose a novel data augmentation framework, the tactical-based conditional generative adversarial network (GAN), for generating new labeled trajectories. The experimental results show that our method improves tactic recognition accuracy by 9.13 percentage points and key player detection accuracy by 4.965 percentage points.


I. INTRODUCTION
Group activity recognition is a widely used but challenging problem. Generalized from single-person activity recognition, group activity recognition needs to deal with complicated dynamics among people, including each individual's role, the interactions among different individuals, and each individual's behavior. Although existing human activity recognition algorithms can accurately identify individual actions, there is still much room for improvement in identifying group interactions. For example, a graphical model can adequately describe the relationships among intermediate actions through nodes and edges. However, the number of behaviors that can be included is greatly restricted. On the other hand, statistical learning has no limit on the number of actions, but its capability in analyzing the interaction among human behaviors is deficient. Owing to the rapid development of deep learning in recent years, one can use data-driven techniques to simultaneously identify a larger number of behavior patterns, capture the division of roles, and analyze their interaction.
The associate editor coordinating the review of this manuscript and approving it for publication was Jenny Mahoney.
In this work, we put our emphasis specifically on analyzing group behavior in sports. Sports include all forms of competitive physical activities or games through casual or organized participation. Such analysis improves physical ability and skills while providing enjoyment to participants and, in some cases, entertainment for spectators.
In cooperative group activities, multiple players typically act according to pre-defined tactics. Recognizing the tactics in use is essential for coaches and players. The audience can also enjoy the game more if they can identify the tactics applied by both teams. Traditionally, such recognition is done by senior sports analysts at a very slow pace. It would be useful if a computer could recognize tactics and display them to the coaches and the audience. The players can also benefit from such a computer program when learning the tactics. This study aims to develop tools for recognizing basketball tactics as an initial attempt at general group activity identification applied to sports. For a given tactic, a subset of players is required to perform particular behaviors. Based on this observation, we can separate all players into two groups. One is the key player group, which covers the core members of the tactic operation and has small intra-tactic variation. The other, the non-key player group, contains the rest of the players and usually has a larger intra-class variation. Most tactics are characterized by the activities of the key player group. Thus, we transform tactic recognition into a detection problem of the key player group. Our method provides better tactic recognition results and better interpretability, since the detected key players can be used for verification.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
We adopt MIL to detect the key players, with the identified tactic as the bag label and the moving trajectories of each player subset as an instance. In [1], MIL is combined with handcrafted spatial-temporal features named motion intensity maps (MIM), extracted from recorded video clips, and provides satisfactory recognition results.
However, using handcrafted features has several drawbacks. First, MIM features rely heavily on prior knowledge of basketball courts, which is cumbersome to prepare. Second, to represent videos of different temporal lengths as vectors of the same dimension, a fixed number of time segments is taken. A simple average operation is then applied to obtain each segment's features, and all segment features are concatenated into a global dynamic feature. This re-sampling method ignores real pace information.
To overcome those drawbacks, we propose an end-to-end trainable network. This network's input is the position, i.e., (x, y) coordinate, of each player along the temporal axis. Since our model only requires raw trajectories without any prior basketball knowledge, it is also applicable to different group sports such as soccer and volleyball and group behaviors appearing in surveillance recordings. In addition to key player pattern detection, our method can carry out temporal pattern discovery, leading to an in-depth understanding of tactic interaction.
Deep networks require a large amount of training data to train the parameters of the model. As a rule of thumb, the amount of training data should be several times the number of network weights. For still-image object recognition, a researcher can pre-train a model on large-scale datasets such as ImageNet or MS COCO, then fine-tune the network using a customized dataset. However, for a time series learning problem like ours, no comparable dataset that provides a large amount of labeled data is available. The dynamic characteristics, including the posture, trajectory, and position of the object that change over time, often yield large diversity due to varying start and end times or differences in the viewing position. To increase model robustness to temporal variation, we need more data for training the network, so the disadvantage of a small dataset is even more apparent. If we have only a small dataset, as in many application scenarios of group activity recognition, we may use data augmentation. Data augmentation can be achieved by either adding slightly modified copies of existing data or synthesizing new data from existing data. Adding slightly modified copies is easy to implement through geometric transformation (e.g., rotation, cropping) or temporal transformation (e.g., resampling, re-pacing), but the scale factor in such approaches is pre-defined and inflexible. Synthesizing new data has no restriction on the scale factor but needs a good generative model. We found that the GAN is an excellent one and is very suitable for our purpose.

II. RELATED WORK
Group behavior analysis on sports video has been explored in the literature to analyze individual and team performance. Methods of this category are used for various applications such as football play recognition [2], [3] and basketball behavior detector [4] as well as feature extraction such as team occupancy and team centroid features [5]. For basketball tactic recognition, Chen et al. [6] make a breakthrough by using dynamic time warping (DTW) with the Gaussian mixture model (GMM). Our previous work [1] first introduces the idea of tactic recognition via key player detection, where the authors extracted handcrafted basketball court MIM features with a MIL classifier.
Recently, deep neural networks have become a powerful learning algorithm owing to their strong feature extraction capability. Under the deep neural network's framework, group activity recognition shares many common layers, such as a dynamic feature extractor and a classification layer, with single-person activity recognition. The main difference between single-person and group activity recognition is that group activity recognition has an additional aggregation layer. An aggregation layer, which tries to merge multiple individual features, can be formulated in different ways, e.g., simple concatenation [7], [8], average or max-pooling [9], attention [10], [11], and the semantic graph [11], [12]. Our method introduces the so-called ''NchooseK'' layer, which is a new type of aggregation method based on an assumption of key players. It turns out that an end-to-end neural network for key player detection can be derived based on the proposed NchooseK layer.
Global average pooling (GAP) and the class activation map (CAM) are introduced by Zhou et al. [13]. Via global average pooling and a fully-connected layer with linear activation before the softmax layer, their method not only reduces the number of learnable parameters but also applies the commutative law of multiplication to evaluate the contribution of each feature vector to the specific class score. Although CAM is effective, the requirement of global average pooling placed on the last classification layer limits its usage on more complex and pre-trained neural networks. Selvaraju et al. [14] propose a modified version of CAM, called Grad-CAM. Grad-CAM uses the backpropagated gradients of succeeding layers as class feature weights without requiring a GAP layer. CAM and its variants can be used in a variety of applications. Owing to their flexibility regarding feature map size, they can also be used for sequence-to-one time series classification [15], [16], which merges time sequences of different lengths into a global feature of the same dimension. Our method uses a similar strategy to CAM, but we further leverage the mutually exclusive property of key player instances.
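The weighting argument behind CAM can be written out in one line. As a sketch in notation of our own choosing (the symbols below are illustrative assumptions, not taken from [13]), let $f_k(t)$ be the $k$-th feature at position $t$, $T$ the number of positions, and $w_k^c$ the weight connecting feature $k$ to class $c$, with the bias term omitted:

```latex
S_c \;=\; \sum_{k} w_k^{c} \left( \frac{1}{T} \sum_{t=1}^{T} f_k(t) \right)
    \;=\; \frac{1}{T} \sum_{t=1}^{T} \underbrace{\sum_{k} w_k^{c}\, f_k(t)}_{M_c(t)}
```

Because the average and the linear layer can be exchanged, $M_c(t)$ can be read as the contribution of position $t$ to the score of class $c$. When $t$ indexes time, this gives an activation map along the temporal axis.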
In this work, MIL and GAN play important roles in our method. Here we shall briefly review these two topics as follows. MIL is a type of weakly supervised learning. It is introduced by Dietterich et al. [17] for drug activity detection and has various applications to image classification [18], object detection [19], text or document categorization [20], and semantic segmentation [21], [22]. MIL algorithms originally work on pre-defined features. But with the development of deep learning, a variety of neural networks with MIL have been proposed for combining the powerful feature extraction capability of deep learning and the low-cost labeling of MIL. Zhou and Zhang [23] propose an instance-space MIL algorithm, casting instance features to instance scores followed by MIL pooling on the score layer. Wang et al. [24] propose an embedded-space MIL algorithm that performs MIL pooling directly on the feature domain. Instance-space methods allow the identification of positive instances but with a lower performance than embedded-space methods. Ilse et al. [25] use an attention layer to combine instance-space interpretation and embedded-space. Instances in this research are mutually exclusive. Thereby, instance-space max-pooling matches our requirement.
The GAN, proposed by Goodfellow et al. [26], has been a popular deep generative network in recent years. Deep generative models before the GAN are elegant in theory but not very effective in practical applications. Among them, the models belonging to the undirected graphical model family include restricted Boltzmann machines (RBMs) [27], deep belief networks [28], and deep Boltzmann machines (DBMs) [29]. This kind of Boltzmann-machine-based generative model uses maximum likelihood estimation when calculating the data distribution. The calculation is very complicated, and solutions other than trivial ones are very difficult to obtain. Using the Markov chain Monte Carlo (MCMC) method to find an approximate solution is an alternative, but MCMC is also complicated. To avoid complicated calculations such as the computation of the log-likelihood, researchers have proposed other optimization methods such as score matching [30] and noise-contrastive estimation (NCE) [31]. But on most occasions, the density function is not normalized, and estimating a normalization constant is also very time-consuming. As to the backpropagation method, although one can use labels to simplify the process, it is only feasible for tasks like pattern matching. It cannot be used directly for a task that needs recursive instructions to be executed during training. On the other hand, the GAN model uses two coupled networks and applies a min-max game to train both networks alternately with existing backpropagation methods. Because the generator in the GAN architecture has sound data generation capability, [32]-[34] show that it can be used for a wide variety of data augmentation. There are two kinds of trajectory augmentation. One is known as trajectory prediction, in which an initial sequence is given and the model generates the subsequent data from it.
The other is called trajectory simulation, where models simulate the entire trajectory. In general, trajectory simulation is more difficult to implement than trajectory prediction. Social GAN [35] and GD GAN [36] are examples of using a GAN to perform trajectory prediction. Since we need to increase the total number of training trajectories, trajectory simulation is more suitable for our needs. Crowd simulation [37] is a virtual simulation of the entire trajectory. The strategy it adopts is inherited from trajectory prediction: the neural network for trajectory prediction reads the initial trajectory produced by another independent neural network and makes the prediction. Those two neural networks work together to produce the final trajectory. The main difference between our method and the crowd simulation is that the latter adds movement constraints to the generated trajectories to simulate various pedestrian interactions. Instead, inspired by the conditional GAN [38], we add tactical information as a conditional input to the original crowd simulation GAN. We call this extended GAN the Group-Tactic-Role conditional GAN (GTRCGAN).

III. OUR APPROACH

A. NETWORK OVERVIEW
Consider a set of half-court offensive videos, each of which belongs to one of $C$ tactics. The trajectories of the five offensive players in video $i$ have been retrieved and are denoted by $\{\pi_{i,p}\}_{p=1}^{5}$. Each trajectory is a temporal sequence of that player's $(x, y)$ positions on the court, i.e.,
$$\pi_{i,p} = \{(x_{i,p}^{t}, y_{i,p}^{t})\}_{t=1}^{F},$$
where $F$ is the number of frames of video $i$. Note that the number of frames varies from video to video. Besides, the five trajectories in each video are unordered. Each player has a tactical label, which contains the tactic type $c_i$ and the role ID $r_p$. Note that the role ID $r_p$ here represents a specific player movement in a specific tactic, not a basketball position such as center, forward, or guard. As a result, players of different tactics that share the same role ID do not necessarily have similar trajectories. On the other hand, players of different tactics whose role IDs are different might have similar trajectories.
Our neural network is designed for multi-class classification. It consists of multiple subnets for various key player groups. Figure 2 (a) shows the subnet where the number of key players is $K$. This network is composed of four high-level layers: (1) an RNN auto-encoder for individual player feature extraction, (2) the NchooseK layer for group instance aggregation, (3) temporal global average pooling for dimensionality reduction, and (4) an instance-space miNet for MIL. To train the network, the loss function with two terms is described as
$$L = L_{\text{cross-entropy}} + L_{\text{auto-encoder}}, \tag{1}$$
where $L_{\text{cross-entropy}}$ represents the cross-entropy loss of tactic classification, and $L_{\text{auto-encoder}}$ represents the auto-encoder loss, which calculates the Euclidean distance between an original trajectory and the trajectory decoded by the auto-encoder. The details of the network components and the loss functions are given in the following.

B. RNN AUTO-ENCODER FOR PLAYER FEATURE EXTRACTION
To capture the features of each player, a single-layer recurrent neural network (RNN) is adopted, which casts the player's $(x, y)$ coordinates to $D$-dimensional hidden state features $h_p^t$ at each timestamp $t$. To maintain the correlation between the hidden state features $h_p^t$ and the original trajectory of video $i$, an auto-encoder is incorporated with the RNN as a regularizer to avoid overfitting. Its loss function, Eq. (2), measures the Euclidean distance between the original trajectory $\pi_{i,p}$ and the decoded trajectory $\hat{\pi}_{i,p}$.
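The reconstruction term can be illustrated with a few lines of plain Python. This is a minimal sketch only: the function name `reconstruction_loss` is hypothetical, and summing the per-frame Euclidean distances is an assumption about how the distance is accumulated over a trajectory.

```python
import math


def reconstruction_loss(original, decoded):
    """Sum of per-frame Euclidean distances between two trajectories.

    Both arguments are lists of (x, y) tuples with the same number of frames.
    Accumulating by a plain sum over frames is an assumption of this sketch.
    """
    if len(original) != len(decoded):
        raise ValueError("trajectories must have the same number of frames")
    return sum(
        math.hypot(px - qx, py - qy)
        for (px, py), (qx, qy) in zip(original, decoded)
    )


# Toy trajectories: only the middle frame is reconstructed incorrectly.
original = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
decoded = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]
loss = reconstruction_loss(original, decoded)
```

A perfectly decoded trajectory gives a loss of zero, so the term pushes the hidden states to retain enough information to rebuild the input.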

C. NCHOOSEK LAYER FOR GROUP INSTANCE AGGREGATION
The NchooseK layer aggregates individual player features into group features. The proposed aggregation layer is developed based on grouping the $n_c$ key players of a specific tactic $c$ out of the five players. Since trajectories are randomly ordered, we list all $\binom{5}{n_c}$ possible groups. The positive instance is the one of the $\binom{5}{n_c}$ instances that covers all key players. This layer is called the NchooseK layer. The output of this layer is the set of group features, one for each instance.
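The enumeration performed by this layer can be sketched in plain Python. The function name `n_choose_k` and the toy feature values are hypothetical; element-wise max pooling within each subset follows the implementation details of this paper, and the sketch works on a single time step for brevity, which is a simplification.

```python
from itertools import combinations


def n_choose_k(player_features, k):
    """Enumerate all k-player subsets and max-pool their feature vectors.

    player_features: list of N feature vectors (lists of floats), one per player.
    Returns a list of (player_indices, pooled_vector) pairs, one per subset.
    """
    instances = []
    for subset in combinations(range(len(player_features)), k):
        vectors = [player_features[p] for p in subset]
        # Element-wise max makes the result independent of player order.
        pooled = [max(values) for values in zip(*vectors)]
        instances.append((subset, pooled))
    return instances


# Five players with toy 2-dimensional features; choose groups of three.
features = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.3], [0.0, 0.5], [0.6, 0.6]]
instances = n_choose_k(features, 3)
```

With five players and three key players this yields ten candidate instances, exactly one of which contains all key players.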

D. TEMPORAL GLOBAL AVERAGE POOLING
To overcome the problem that different videos may have different numbers of frames, temporal global average pooling (tGAP) is applied along the temporal dimension, as illustrated in Figure 3. After tGAP, the temporally pooled features of each instance $k$ are denoted by $H_k$.
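A minimal sketch of temporal pooling, assuming each instance is stored as a list of per-frame feature vectors (the name `temporal_gap` and the toy clips are hypothetical):

```python
def temporal_gap(sequence):
    """Average a (frames x D) feature sequence over time into one D-vector."""
    if not sequence:
        raise ValueError("sequence must contain at least one frame")
    n_frames = len(sequence)
    return [sum(column) / n_frames for column in zip(*sequence)]


# Two clips of different lengths map to vectors of the same dimension.
short_clip = [[1.0, 2.0], [3.0, 4.0]]
long_clip = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [6.0, 4.0]]
pooled_short = temporal_gap(short_clip)
pooled_long = temporal_gap(long_clip)
```

The output dimension depends only on the feature size, not on the number of frames, which is what allows clips of arbitrary length to share one classifier.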

E. INSTANCE-SPACE MINET FOR MULTIPLE INSTANCE LEARNING
A fully-connected layer is used to cast the high-dimensional features $H_k$ to the class prediction probability $s_k^c$ of whether instance $k$ is positive for tactic $c$. Each video has just one positive instance, which is supposed to be the one with the maximum score value $s_{k^*}^c$. Regarded as MIL pooling, a max-pooling is used to obtain the final class prediction $Y^c$, i.e., $Y^c = \max_k s_k^c$. Finally, we compute $Y^c$ for each tactic $c$ and concatenate them into the tactic prediction vector $Y$ of this video. Since we have the ground-truth tactic label for each training video, cross-entropy is used to define the loss $L_{\text{cross-entropy}}$ in Eq. (1). Note that our method can predict the key players since the positive instance is found via max-pooling. Figure 2 (c) shows that the key instance prediction is obtained by adding an argmax pooling at inference time.
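The max-pooling and argmax steps can be sketched as follows. The function name `mil_predict` and the score table are hypothetical, and for simplicity the sketch assumes all tactics share the same set of candidate instances, whereas in the network each tactic may have its own number of key players.

```python
def mil_predict(instance_scores):
    """Instance-space MIL pooling over a table of scores.

    instance_scores[k][c] is the score of instance k for tactic c.
    Returns (tactic_scores, predicted_tactic, key_instance), where
    tactic_scores[c] is the maximum over k of instance_scores[k][c].
    """
    n_tactics = len(instance_scores[0])
    tactic_scores = [
        max(row[c] for row in instance_scores) for c in range(n_tactics)
    ]
    predicted_tactic = max(range(n_tactics), key=lambda c: tactic_scores[c])
    # The key instance is the argmax over instances for the predicted tactic.
    key_instance = max(
        range(len(instance_scores)),
        key=lambda k: instance_scores[k][predicted_tactic],
    )
    return tactic_scores, predicted_tactic, key_instance


# Three candidate instances scored against three tactics (toy values).
scores = [
    [0.10, 0.20, 0.05],
    [0.30, 0.85, 0.10],
    [0.25, 0.40, 0.60],
]
tactic_scores, tactic, key_instance = mil_predict(scores)
```

The same pass yields both the tactic label and the player subset responsible for it, which is why key player detection comes at no extra labeling cost.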

F. GENERATING TRAJECTORIES USING GAN
As illustrated in Figure 4, to augment enough data, we take the architecture proposed in [37] and make two modifications: (1) we change the representation of a trajectory $\pi$ from position-based coordinates to displacement-based coordinates. This modification is reasonable because a displacement-based feature can better characterize the causal relationship than a position-based feature; (2) we combine both the tactical label $c_i$ and the tactical role $r_p$ as the conditional input of the GAN. In the implementation, we combine $c_i$ and $r_p$ into a one-hot encoding vector $\mathbb{1}_{c_i, r_p}$ of size $C \times 5$. In the original architecture in [37], the generation of the entire trajectory is divided into two tasks, namely one for the initial path (entry-points part) and the other for the subsequent path (continual-points part). The entry-points part refers to the beginning of the track, which contains $n_e$ data points. In its generator, an entry fully-connected layer integrates the latent variable $z$ and the conditional input $\mathbb{1}_{c_i, r_p}$ to generate a displacement sequence $\{d_1, \cdots, d_{n_e}\}$. The discriminator takes the sequence $\{d_1, \cdots, d_{n_e}\}$ as the input of its entry fully-connected layer and outputs a value within the range $[0, 1]$. We denote this sequence-to-one output as $v_e(d^{\pi}_{1:n_e} \mid \mathbb{1}_{c_i, r_p}; D)$ for judging whether the initial path is real or not. As for the continual-points part, which contains $F - n_e$ data points, we consider a sequence-to-sequence model. The generator uses a gated recurrent unit (GRU) to read the displacement sequence from time 1 to $t - 1$, the latent variable $z$, and the conditional input $\mathbb{1}_{c_i, r_p}$.
Then we transform the hidden state at time $t$ into the displacement $d_t$ through a continual fully-connected layer; this transformation can be represented as $g(z \mid d^{\pi}_{1:t-1}, \mathbb{1}_{c_i, r_p}; G)$. As to the discriminator, we use a GRU and a continual fully-connected layer to calculate the probability that $d_t$ is real given the displacement sequence $\{d_1, \cdots, d_{t-1}\}$, which can be represented as $v_c(d^{\pi}_{1:t} \mid \mathbb{1}_{c_i, r_p}; D)$. During training, the generator produces a set of $N \times 5$ trajectories, where $N$ is the batch size, from different sequences of noise vectors in each iteration. The total loss function includes four terms:
1. The recognition success rate of the initial path. It consists of two terms: the probability that a trajectory $\pi$ from the real database is judged as real, and the probability that a trajectory $\hat{\pi}$ from the generator's augmented database is judged as fake.
2. The recognition success rate of the subsequent path. Likewise, its first term is the probability that a trajectory $\pi$ from the real database is judged as real, and its second term is the probability that a trajectory $\hat{\pi}$ from the generator's augmented database is judged as fake.
3. The Euclidean distance between the generated trajectory $\hat{\pi}_{i,p}$ and the real trajectory $\pi_{i,p}$, summed over the five players: $\sum_{p=1}^{5} \lVert \hat{\pi}_{i,p} - \pi_{i,p} \rVert_2$.
4. The boundary condition of the court, imposed on each coordinate $m \in \{x, y\}$, where $LB_m$ and $UB_m$ are the lower and upper boundaries of the court. We adopt a soft boundary, i.e., the generated trajectory $\hat{\pi}_{i,p}$ is allowed to go slightly beyond the boundary.
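The two input modifications, displacement-based coordinates and the combined tactic-role condition, can be sketched in plain Python. All function names are hypothetical, and both the zero-based index layout `tactic * n_roles + role` and the way the first point is handled are assumptions of this sketch rather than details stated above.

```python
def to_displacements(positions):
    """Convert (x, y) positions into frame-to-frame displacements."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]


def to_positions(start, displacements):
    """Rebuild positions from a start point by accumulating displacements."""
    positions = [start]
    for dx, dy in displacements:
        x, y = positions[-1]
        positions.append((x + dx, y + dy))
    return positions


def one_hot_condition(tactic, role, n_tactics, n_roles=5):
    """One-hot vector of size n_tactics * n_roles for a (tactic, role) pair."""
    vector = [0] * (n_tactics * n_roles)
    vector[tactic * n_roles + role] = 1  # assumed zero-based layout
    return vector


# Toy track of three frames and a condition for tactic 2, role 3 of 10 tactics.
track = [(10, 20), (12, 23), (15, 23)]
steps = to_displacements(track)
condition = one_hot_condition(tactic=2, role=3, n_tactics=10)
```

Accumulating the displacements from the start point recovers the original positions, so the two representations carry the same information once the first point is known.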

G. IMPLEMENTATION DETAILS
Our model is implemented using the TensorFlow deep learning framework. In the MIL Network, the RNN hidden state dimension is set to 512 and the NchooseK pooling is carried out by max pooling. We use the RMSprop optimizer for training with a learning rate of 0.001 and a batch size of 2. Each trajectory is down-sampled by a factor of 10 to reduce the training memory load and accelerate training. In the GAN model, the latent variable $z$ is drawn from a 2-dimensional uniform random distribution. In the generator G and the discriminator D, the entry-point part contains one 512-dimensional fully-connected layer, and the number of generated entry points is 1. The continual-point part contains a 100-cell GRU block and a 128-dimensional fully-connected layer. To obtain the best accuracy, the GRU block considers all trajectory points, including those generated by the entry-point part. When training the GAN, after extensive parameter tuning, we use the RMSprop optimizer as in the MIL Network with a learning rate of 0.001, but the batch size is changed to 50. Since the generator is more difficult to converge than the discriminator, we update the parameters of the generator three times for every update of the discriminator. For all fully-connected layers, we use leaky-ReLU activation to avoid the vanishing gradient problem. To achieve the best performance of the GAN, we set the Euclidean loss weight to 32.0 and the boundary loss weight to 1.0.

A. DATASET USED FOR EVALUATION
To evaluate our method's performance, we perform experiments on a dataset from [6], which contains 134 videos of the NBA 2013-2014 season. These videos are distributed over 10 half-court offensive tactics, and the offensive player trajectories are also available. The details of this tactic dataset are given in Table 1.

B. PERFORMANCE MEASURE AND EVALUATION PROTOCOL
Our method requires two parts to verify its effects. One is the MIL Network, and the other is the GAN.
A simple yet effective accuracy measure is adopted to evaluate and compare the performance of different methods with our MIL Network. Average tactic accuracy, which is abbreviated as tactic accuracy, first calculates each tactic's accuracy and then averages over the accuracy of all tactics. Similar to the average tactic accuracy, the average key player accuracy, which is abbreviated as key player accuracy, is obtained by first calculating key player accuracy of each tactic and then averaging over all tactic key player accuracy.
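The average tactic accuracy can be sketched as a per-tactic accuracy followed by an unweighted mean. The function name and the toy labels below are hypothetical and are not results from our experiments.

```python
from collections import defaultdict


def average_tactic_accuracy(true_labels, predicted_labels):
    """Return (mean of per-tactic accuracies, per-tactic accuracy dict)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    per_tactic = {tactic: correct[tactic] / total[tactic] for tactic in total}
    return sum(per_tactic.values()) / len(per_tactic), per_tactic


# Toy example with an unbalanced number of clips per tactic.
truth = ["EV", "EV", "EV", "EV", "F23"]
predicted = ["EV", "EV", "EV", "F23", "F23"]
macro, per_tactic = average_tactic_accuracy(truth, predicted)
```

In this toy case the per-tactic accuracies are 0.75 and 1.0, so the average tactic accuracy is 0.875, while the plain fraction of correct clips would be 0.8. Averaging per tactic keeps tactics with few clips from being dominated by frequent ones.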
The evaluation protocol in the experiments is 5-fold cross-validation. Due to the small size of the dataset, average tactic accuracy and key player accuracy are computed multiple times for each hyperparameter value to further reduce metric variations caused by random initialization.
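A minimal sketch of a 5-fold split over the 134 videos is given below. The helper `k_fold_indices` is hypothetical, and the plain shuffled split (without stratification by tactic) and the fixed seed are assumptions of this sketch, not a description of the exact protocol.

```python
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and return a list of (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every n_folds-th index goes to the same fold, so sizes differ by at most 1.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


splits = k_fold_indices(134, n_folds=5, seed=0)
```

Repeating the procedure with different seeds and averaging the metrics is one way to reduce the variation caused by random initialization on such a small dataset.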
In using GAN to augment the data, we use Euclidean distance of (x, y) coordinate (ED) to calculate the similarity between the augmented trajectory and the referenced trajectory. However, the augmented trajectory should be similar to the referenced trajectory but not precisely the same (i.e., ED is small yet not 0). We will use a visualization tool to check the quality of the augmented trajectory. As for selecting the best weight, we adopt the same protocol as the one in MIL Network, i.e., using 5-fold cross-validation for verification.

C. COMPARISON WITH THE STATE-OF-THE-ART METHODS
Four different methods are compared. The first one learns the spatial-temporal template with an unsupervised Gaussian mixture model (GMM) [6]. The second one is the same as the first one except that the ground-truth tactic labels are provided to train the Gaussian mixture model. The third one adopts multiple-instance-learning mi-SVM with handcrafted spatial-temporal features named motion intensity maps (MIM), where the feature dimension is set to 1040. The fourth one is our method, called RNN-tCAM-miNet+GTRCGAN, whose overall structure is described in Figure 2, where the feature dimension is 512. Table 2 lists the tactic recognition accuracy and key player detection accuracy of the different methods. The unsupervised and supervised GMM models are not able to provide key player results. The tactic accuracy of the unsupervised GMM is 0.8550, and that of the supervised GMM is 0.8867. The method using mi-SVM reaches a tactic accuracy of 0.8933 with a key player accuracy of 0.7143, serving as a baseline of supervised methods. The proposed model achieves a tactic accuracy of 0.9846 with a key player accuracy of 0.76395. Compared to the mi-SVM model, our method improves tactic accuracy by 9.13 percentage points and key player accuracy by 4.965 percentage points. From the quantitative results, our deep neural network model outperforms conventional SVM models with fewer heuristic parameters and a lower feature dimension. Table 3 reports ablation studies of our method against other popular group-aggregation methods and handcrafted features with miNet. The first competing method, MIM + miNet, replaces the RNN auto-encoder in our model with the handcrafted feature MIM and obtains a tactic accuracy of 0.8868 with a key player accuracy of 0.6100. Our method improves tactic accuracy by 9.78 percentage points and key player accuracy by 15.395 percentage points, which confirms that the end-to-end neural network provides better features than the handcrafted MIM.
The second competing method, RNN-LastStep-Net, replaces our temporal global average pooling layer with the last step of the RNN auto-encoder and achieves a tactic accuracy of 0.9489 and a key player accuracy of 0.7173. Our method improves tactic accuracy by 3.57 percentage points and key player accuracy by 4.665 percentage points, which results from replacing the last-step pooling with temporal global average pooling. The third method, RNN-CAM-Net, uses global average pooling as the aggregation layer on every single player, which removes the NchooseK aggregation layer, resulting in a tactic accuracy of 0.9443 and a key player accuracy of 0.6311. Our method improves tactic accuracy by 4.03 percentage points and key player accuracy by 13.285 percentage points. The large improvement in key player accuracy demonstrates the effectiveness of the proposed NchooseK layer. The fourth competing method is RNN-GMP-Net, which replaces the temporal global average pooling layer with global maximum pooling (GMP), obtaining a tactic accuracy of 0.8263 and a key player accuracy of 0.7137. The proposed method improves tactic accuracy by 15.83 percentage points and key player accuracy by 5.025 percentage points. The significant improvement in tactic accuracy results from tCAM, which considers the whole trajectory time interval, while GMP refers only to the time step with the maximum value. The fifth competing method is RNN-tCAM-miNet, which uses only real trajectories in model training, obtaining a tactic accuracy of 0.98135 and a key player accuracy of 0.70585. The proposed method improves tactic accuracy by 0.325 percentage points and key player accuracy by 5.81 percentage points. The improvement from the fifth method to the proposed method demonstrates the effectiveness of using GAN-augmented data in training complex models. The reason that the GAN mainly improves accuracy on key player detection is that, in MIL, a tactic can be classified correctly even if the key instance is not selected. The GAN greatly increases the discriminability of key players and reduces the chance of a non-key-player subgroup being selected as the key instance.
Without GAN-augmented data, RNN-tCAM-miNet can still recognize the correct tactic, but the rate of detecting the wrong key instance increases. As a result, the GAN has little effect on tactic accuracy. From the ablation study, we know that the RNN is a powerful feature extraction tool. But to exploit the full capability of the RNN, we design the NchooseK and tCAM layers, which allow tactic information to be properly backpropagated into the RNN. The GAN provides more role-player trajectory information, which allows RNN-tCAM-miNet to predict the trajectories of key players more precisely.
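The improvements quoted in this comparison are absolute differences between accuracies, expressed in percentage points. The short check below recomputes them from the accuracies reported above; the variable and function names are ours.

```python
# (tactic accuracy, key player accuracy) as reported in the text.
proposed = (0.9846, 0.76395)
baselines = {
    "mi-SVM": (0.8933, 0.7143),
    "MIM + miNet": (0.8868, 0.6100),
    "RNN-LastStep-Net": (0.9489, 0.7173),
    "RNN-CAM-Net": (0.9443, 0.6311),
    "RNN-GMP-Net": (0.8263, 0.7137),
    "RNN-tCAM-miNet": (0.98135, 0.70585),
}


def gains(baseline):
    """Absolute gain of the proposed model in percentage points."""
    return tuple(round((p - b) * 100, 3) for p, b in zip(proposed, baseline))


improvements = {name: gains(values) for name, values in baselines.items()}
```

For example, the gain over mi-SVM is (0.9846 - 0.8933) x 100 = 9.13 points in tactic accuracy and (0.76395 - 0.7143) x 100 = 4.965 points in key player accuracy.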

E. VISUALIZATION
To analyze the impact of our model on each tactic thoroughly, we report the accuracy per tactic in Table 4. For tactic accuracy, the differences among tactics are small, and all tactics have an accuracy of more than 90%. For key player accuracy, our model shows relatively large differences among tactics, ranging from 33% to 100%. The reason for this difference is that, although MIL finds a consistent instance for each tactic, in some tactics the instance found is not the key player instance defined by experts. To explain this phenomenon intuitively, we design a visualization tool for illustration. Figure 5 shows the result on video F23-002, which is classified as the correct tactic with correct key player detection. We notice that tactic accuracy is higher than key player accuracy in every model. After a thorough examination, errors can be separated into two categories. The first one is a wrong tactic classification. As shown in Figure 6, column (a) displays the trajectories of the five offensive players. Column (b) plots the ground-truth key instance labeled by professionals, and player 4 (yellow) on HK-039 has a very different trajectory compared to the lower part of column (b). This long trajectory also makes another instance, r2r3r4 (role 2, role 3, and role 4), at the upper part of column (c) look like the positive instance of tactic F23 at the lower part of column (c).
The second category is a correct tactic classification with a wrong instance detection. As displayed in Figure 7, video EV-018 is correctly predicted as tactic EV. However, instead of detecting the correct positive instance r1r2r3 in the middle column, the proposed system chooses another instance, r2r3r5. This is because our features sometimes cannot separate non-key players from key players, since non-key players may also have regular trajectories. Even humans cannot distinguish key players from non-key players without extra information (e.g., basketball trajectory or player action). This semantic gap typically cannot be solved in weakly-supervised approaches.
More classification results are shown on our website.

F. QUALITY OF GAN-AUGMENTED TRAJECTORY
The basketball court in our test images is 348 × 326 pixels. For a GAN started from random initialization, the root mean squared error (RMSE) is about 126 pixels. After hyperparameter tuning, the RMSE drops to 23 pixels. Figure 8 shows a comparison between GAN-augmented trajectories and the template generated from spatial-temporal clustering [6]. We can see that the GAN-augmented data does fall within the permitted range defined by the template.

V. CONCLUSION
Group activity recognition is a difficult but strongly demanded topic. In this paper, based on the concept of key-player-based tactic classification, we propose an end-to-end trainable neural network to automatically learn players' dynamic features. To overcome pace variations of players' trajectories, global average pooling (GAP) is applied. GAP also reveals an activation map on the time axis of each key player's trajectory, which allows experts to study the temporal pattern of each key player subgroup. By adopting deep neural networks, our approach significantly increases both tactic and key player accuracy without prerequisite knowledge on the basketball field, which generalizes the system to other group activity recognition applications. Furthermore, to solve the problem of insufficient raw data, we also design a GAN that can generate group tactical behavior as augmented data. Like the key-player-based group activity recognition network, our GTRCGAN can also be used to simulate