ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network

In this paper a pure-attention bottom-up approach, called ViGAT, that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features for the task of event recognition and explanation in video, is proposed. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to capture effectively both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, Mini-Kinetics, ActivityNet).


I. INTRODUCTION
DUE to the explosion in the creation and use of video data in many sectors, such as entertainment and social media, there is a great demand for analyzing and understanding video content automatically. Towards this direction, the recognition of high-level events and actions in unconstrained videos plays a crucial role for improving the quality of provided services in various applications, e.g. [1]-[7].
The introduction of deep learning approaches has offered major performance leaps in video event recognition [5]-[14]. Most of these methods operate in a top-down fashion [6], [7], [10]-[14], i.e. they utilize a network architecture to directly extract patch-, frame- or snippet-level features; and, through an appropriate loss function (e.g. cross-entropy), exploit the class labels to learn implicitly the video regions that are mostly related with the specified action or event. For instance, state-of-the-art Transformers [10], [12], [14] segment image frames using a uniform grid to produce a sequence of patches, as shown in the first row of Fig. 1. A similar image partitioning is also imposed implicitly by convolutional neural networks (CNNs), where the patch size is determined by the CNN's receptive field [10]. This "patchifying" is context-agnostic and usually only a small fraction of the patches contains useful information about the underlying event. During the supervised learning procedure the Transformer or CNN learns to disregard patches irrelevant to the target event, while extracting and synthesizing information from the patches that are related to the target event. Considering that the real action or event may be occurring in only a small spatiotemporal region of the video, this procedure is expensive; it is also suboptimal to start by treating all image patches equally, as a large amount of them is irrelevant and does not need to be thoroughly analyzed [15]-[19].
FIGURE 1. Illustration of how top-down (1st row) and bottom-up (2nd row) approaches learn to focus on the salient frame regions, using a video labelled as "Walking the dog" event. Top-down approaches explicitly (e.g. Transformers) or implicitly (e.g. CNNs) "patchify" each frame to generate patch proposals in a context-agnostic way; the video labels are then used to train the network so that it learns to focus on the patches mostly related with the event (e.g. the 32 blue patches in this example) while ignoring the rest of them (the red patches). Instead, the proposed bottom-up approach supports the classifier by providing the main objects depicted in the frames. Such an approach can also facilitate the generation of object- and frame-based explanations about the event recognition outcome. An example of this is shown in the second row of the figure.
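The context-agnostic patchifying discussed above can be made concrete with a small sketch. The frame size, patch size and object bounding box below are hypothetical values chosen only for illustration (they are not taken from the paper or from Fig. 1); the sketch simply counts how many cells of a uniform grid overlap a single detected object.

```python
def grid_patches(height, width, patch):
    """Boxes (top, left, bottom, right) of a uniform, context-agnostic grid."""
    boxes = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            boxes.append((top, left, top + patch, left + patch))
    return boxes


def overlap_area(a, b):
    """Area of the intersection of two (top, left, bottom, right) boxes."""
    h = min(a[2], b[2]) - max(a[0], b[0])
    w = min(a[3], b[3]) - max(a[1], b[1])
    return max(h, 0) * max(w, 0)


def relevant_fraction(patches, object_boxes):
    """Fraction of grid patches that overlap at least one object box."""
    hits = sum(
        1 for p in patches if any(overlap_area(p, o) > 0 for o in object_boxes)
    )
    return hits / len(patches)


# Hypothetical 224x224 frame, 16x16 patches, one assumed detector box.
patches = grid_patches(224, 224, 16)
objects = [(64, 64, 128, 128)]
frac = relevant_fraction(patches, objects)
print(len(patches), round(frac, 3))
```

Under these assumed values only a small share of the grid cells touches the object, which is the situation a bottom-up approach exploits by handing the object regions to the classifier directly.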
Studies in cognitive science suggest that humans interpret complex scenes by selecting a subset of the available sensory information in a bottom-up manner, most probably in order to reduce the complexity of scene analysis [16], [20], [21]. It has also been shown that the same brain area is activated for processing object and action information for recognizing actions [22], [23]. Finally, psychological studies suggest that events may be organized around object/action units encoding their relations, and that this structural information plays a significant role in the perception of events by humans [24]-[26].
Motivated by cognitive and psychological studies as described above, recent bottom-up action and event recognition approaches [5], [9] represent a video frame using not only features extracted from the entire frame but also features representing the main objects of the frame. More specifically, they utilize an object detector to derive a set of objects depicting semantically coherent regions of the video frames, a backbone network to derive a feature representation of these objects, and an attention mechanism combined with a graph neural network (GNN) to classify the video. In this way, the classifier is guided to process in much finer detail the main video regions that are expected to contain important information about the underlying event [15]. The experimental evaluation in these works has shown that the bottom-up features constitute strong indicators of the underlying events and are complementary to the features extracted from the entire frames. In particular, in [9], an I3D video backbone model is applied to extract spatiotemporal features, object proposals are generated using RoIAlign [27], an attention mechanism [28] is used to construct the adjacency matrix of the spatiotemporal graph whose nodes are the object proposals, and a GNN is used to perform reasoning on the graph. However, the use of 3D convolutions in the above work to represent the video may not be adequate for describing actions or events that require long-term temporal reasoning, as for instance is explained in [10]-[12], [14], [16], [29], [30]. Moreover, a large graph is constructed that captures the spatiotemporal evolution of the objects along the overall video, which imposes strict limitations in terms of memory requirements and also makes it difficult to sample a larger number of frames to improve recognition performance (see [12]: Fig. 7 and the related ablation study concerning the effect of the number of frames in the action recognition performance).
In [5], the 3D-CNN backbone of [9] is replaced by a 2D-CNN (i.e. ResNet [31]), and an attention mechanism [32] together with a GNN is used to encode the bottom-up spatial information at each frame only; the sequence of feature vectors is then processed by an LSTM [33] to classify the video. Therefore, in contrast to [9], the above architecture factorizes the processing of the video along the spatial and temporal dimensions, thus effectively removing the memory restrictions imposed in [9] by the use of an expensive 3D-CNN and the construction of the large spatiotemporal attention matrix. Moreover, the authors in [5] make a first attempt at exploiting the weighted in-degrees (WiDs) of the graph convolutional network's (GCN's) adjacency matrix to propose eXplainable AI (XAI) criteria and provide object-level (i.e., spatial) explanations concerning the recognized event [5]. However, despite the fact that this architecture can process long sequences of video frames, it is well known that the LSTM struggles to model long-term temporal dependencies [10]-[12], [14], [16], [29], [30]. Additionally, only qualitative results of ObjectGraphs' explanation approach are presented in [5].
Recently, pure-attention top-down approaches, i.e. methods that aggregate spatiotemporal information via stacking attention for modelling more effectively the long-term dependencies in videos, have achieved superior video action recognition [10]-[12], [16], [29], [30] or activity anticipation [14] performance over previous methods that use CNN or LSTM layers in their processing pipeline. In this work, inspired by the above findings and building on the bottom-up approach of [5], we replace the hybrid GNN-LSTM head of [5] with a graph attention network-based (GAT-based) head network to process both the spatial (object) features and the sequence of features derived from the multiple frames. Our resulting head network, called hereafter ViGAT head, utilizes attention along both the spatial and temporal dimensions to process the features extracted from the video. Moreover, we use the Vision Transformer (ViT) as backbone (instead of a ResNet backbone, used in [5]) to derive a feature representation of both the frames and the detected objects. Therefore, in our work attention is factorized along three dimensions, i.e., i) spatially among patches within each object (by using ViT), ii) among objects within each frame, and iii) temporally along the video. Overall, due to the use of the ViT backbone (instead of a ResNet one) and the employment of a fully attention-based head (instead of a hybrid attention-LSTM one), the proposed ViGAT can extract much richer features and model more effectively the long-term dependencies of video events in comparison to ObjectGraphs [5]. Additionally, in contrast to [5], which learns an adjacency matrix with respect to the objects at individual frames, and can thus derive only object-based explanations, we also derive an adjacency matrix along the temporal dimension, i.e. with respect to individual frames. Thus, the WiDs calculated from the different learned adjacency matrices in the ViGAT head (i.e. along the spatial and temporal dimensions) facilitate the derivation of multilevel explanations regarding the event recognition result, i.e., the extraction of not only the salient objects but also of the most salient frames explaining the model's outcome. We should also note that despite the fact that the extraction of bottom-up (object) information increases the computational complexity of the proposed approach, during training this is only done once using the pretrained object detector and ViT backbone; thus, compared to the majority of other methods, which typically train the employed backbone end-to-end along with the rest of their components, ViGAT has a significantly lower training complexity. Finally, following other works in the literature [34]-[36], we also explore the weight-tying of the individual GAT blocks in the ViGAT head of the proposed model to further reduce its memory footprint. Extensive experiments demonstrate that the proposed approach provides state-of-the-art performance on three popular datasets, namely, FCVID [37], Mini-Kinetics [38] and ActivityNet [39].
Summarizing, our main contributions are the following:
• We propose the first, to the best of our knowledge, bottom-up pure-attention approach for video event recognition. A ViT backbone derives feature representations of the objects and frames, obtaining rich bottom-up information about the video scenes; and, an attention-based network head (called ViGAT head) is factorized along the spatial and temporal dimensions in order to identify the most interesting scene parts and thus capture effectively the long-term dependencies of events in video.
• We contribute to the field of explainable AI by demonstrating how to exploit the WiDs of the adjacency matrices at the various levels of the ViGAT head in order to derive explanations along the spatial and temporal dimensions for the event recognition outcome; and, by successfully adapting popular XAI measures from the image recognition domain, being the first to quantitatively document the goodness of temporal (frame) explanations for video event recognition.
The structure of the paper is the following: Section II presents the related work. The proposed method is described in Section III. Experimental results are provided in Section IV and conclusions are drawn in Section V.

II. RELATED WORK
A. VIDEO EVENT AND ACTION RECOGNITION
In this section, a survey of deep-learning-based video event and action recognition approaches is presented. For a broader literature survey on this topic the interested reader is referred to [40], [41].

1) Top-down approaches
The majority of event and action recognition approaches are top-down. We further categorize these methods according to their design choices in relation to feature extraction.
Convolutional 2D. These approaches utilize architectures with 2D convolutional kernels to extract features at frame level. In [42], a two-stream network is proposed that utilizes a spatial and a temporal branch to process independently RGB and optical flow frames. This architecture can utilize deep CNNs pretrained on large-scale datasets, but can only operate on single frames and is computationally expensive due to performing dense video sampling. TSN [43] extends the above work by extracting sparsely sampled snippets, i.e. dividing the video into a few segments of equal length and selecting randomly one frame from each segment, yielding a significantly lower computational cost. The above techniques operate at frame level to derive a classification score; then, simple late fusion, i.e. average pooling of these scores, is applied to classify the video. Average pooling, however, ignores the temporal ordering and other higher-order rich statistical information which is useful for capturing complex dynamics of actions in video. To go beyond late fusion, in [44], a factorized bilinear operator is incorporated into the network's convolutional layers to capture pairwise interactions among CNN features of adjacent frames and utilize more effectively the temporal relations across frames. In [30], the non-local module [29] (which is a kind of self-attention mechanism for modeling the correlation between spatial positions in feature maps) is generalized to model the interactions between positions across channels in ResNet backbones, resulting in a modified backbone that captures more effectively the long-term dynamics of actions in videos. In [45], a new attentive pooling mechanism is integrated in various CNN backbone networks to combine frame-level action recognition scores. In [46], VLAD pooling [47], which has shown state-of-the-art performance in combining hand-crafted features, is utilized to aggregate the features derived from the temporal and spatial CNN-based streams.
In [48], ActionVLAD derives a global feature descriptor for the entire video, using learnable pooling (NetVLAD [49]) aggregating both appearance and temporal information along the video. In contrast to the above works, where the exact temporal ordering of the descriptors is ignored, spatiotemporal VLAD (ST-VLAD) [50] reformulates the VLAD optimization problem using Lagrange multipliers to impose the minimization of the difference between the VLAD descriptors corresponding to neighboring frames. As a result, the derived VLAD descriptors of the video signal vary smoothly along the temporal dimension. Similarly, in [51], a 2D descriptor, called VideoMap, which is a row-wise layout of the per-frame vectorized CNN features, is learned for action classification. Several works have also used recurrent neural networks to process the extracted CNN time series features. In [52], a pretrained ResNet is used to derive a feature representation for each frame and an LSTM to process the temporal information. In [3], a 2D-CNN and LSTM are used to process the spatiotemporal video information, and in addition, shot boundary detection is applied to segment and predict multiple actions occurring in a video. PivotCorrNN [53] introduces contextual gated recurrent units (cGRUs) to exploit time-varying information among different modalities (MFCC, IDT, etc.). Although many of the above approaches utilize rather sparsely sampled frames, the extraction of a feature representation for each sampled frame using a rather deep CNN is still a computationally expensive process.
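The sparse snippet sampling and the average-pooling late fusion described earlier in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [43]: the function names `sparse_sample` and `late_fusion` are hypothetical, segments are made as equal as integer division allows, and the per-frame class scores are assumed to be already available.

```python
import random


def sparse_sample(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of num_segments equal-length segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    # Segment i covers the half-open index range [bounds[i], bounds[i + 1]).
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def late_fusion(frame_scores):
    """Average per-frame class scores; the temporal order of frames is ignored."""
    n = len(frame_scores)
    return [sum(column) / n for column in zip(*frame_scores)]


indices = sparse_sample(300, 3, random.Random(0))  # one index per 100-frame segment
video_scores = late_fusion([[0.2, 0.8], [0.6, 0.4]])
print(indices, video_scores)
```

Because `late_fusion` is a plain mean, permuting the rows of `frame_scores` leaves the result unchanged, which is exactly the loss of temporal ordering noted above.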
In response to the above drawback, techniques that use reinforcement learning and/or a gating network in order to further reduce the number of video frames being processed have also emerged. In [54], AdaFrame exploits a policy gradient method to select future frames for faster and more accurate video predictions. In [55], a frame sampling strategy is learned using multi-agent reinforcement learning (MARL). In [56], instead of a complex reinforcement learning policy network, ListenToLook introduces the audio modality to build a video skimming mechanism for selecting the most salient clips for the recognition task. The above approaches utilize a fixed-size network (i.e. with fixed memory footprint) irrespective of the video's complexity. In contrast, LiteEval [57] determines dynamically the frame resolution and utilizes a coarse and a fine LSTM cooperating through a binary gating module that decides whether additional high-resolution frames are necessary, thus leveraging network capacity dynamically. Furthermore, the adaptive resolution network (AR-Net) [7], instead of an expensive reinforcement learning mechanism or an additional audio modality, utilizes a lightweight policy network that learns to compute the optimal frame resolution on-the-fly, allowing the recognition of multiple video actions efficiently. In contrast to the above, AdaFocus [58] utilizes a reinforcement learning policy network to leverage spatial redundancy, i.e., it selects the most salient regions in the video frames with respect to the action recognition task. In [19], AdaFocusV2 extends [58] by replacing reinforcement learning with a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. The above methods operate on untrimmed videos (i.e., videos that contain many frames irrelevant to the underlying action), where it is much easier to identify and discard less-significant image regions or entire frames.
In [59], differently from the above methods, the so-called SMART approach leverages a multi-frame attention and relation network to select the most informative frames in short trimmed videos. In another line of research in the efficient video recognition paradigm, in [2], a low-cost CNN implemented in an embedded platform is used for violence recognition in video.
Regardless of whether the emphasis is on exploiting temporal and other statistical information or on improving efficiency, none of the top-down methods discussed in this section extracts and uses representations at a finer-than-frame level (e.g. for individual objects within a frame).
Convolutional 3D. This category includes approaches with 3D convolutional kernels in the network architectures, operating at clip- or entire-video level. C3D+LSVM [60] is one of the first works demonstrating that 3D convolutional kernels constitute a good descriptor for action recognition in video. In [61], a two-stream architecture called I3D, which combines an optical flow and a 3D-CNN stream, is introduced. Additionally, [61] describes how to leverage discriminant information from 2D-CNNs trained on ImageNet; it also shows that, when pretrained on a large-scale dataset (e.g. Kinetics), I3D provides recognition performance that is competitive to 2D-CNN approaches. Both optical flow and 3D-CNN have high computational cost, limiting the applicability of the above two-stream architectures in real-world applications [38]. In order to reduce the computational overhead of the optical flow computation, [62] employs a distillation approach during training to ingest optical flow stream information to a student network that operates on RGB frames. Most works described above utilize relatively shallow networks, restricting the capacity of the networks for adequately learning a large number of complex video actions. In [63], ResNet-like 3D-CNN architectures of various depths are examined, with the authors concluding that carefully designed 3D-CNNs of large depth can improve the recognition performance when trained on large-scale datasets. However, even when trained on large-scale datasets such heavy architectures can still suffer from overfitting. To mitigate this problem, in [64], a multiplicative regularization approach, called random mean scaling, perturbs the low-frequency components of feature maps, effectively alleviating overfitting in deep 3D-ResNet architectures. Similarly, in [16], a bilinear attentional mechanism (i.e. a bilinear matrix multiplication operator with learnable weights) is introduced between network layers, extending the idea of non-local operators for 2D-CNNs [29] to the 3D-CNN paradigm; via directly connecting all locations of input feature maps, it is shown that this mechanism can capture the long-range spatiotemporal dynamics of video actions. In SlowFast [65], a low- and a high-frame-rate pathway, consisting of 3D-ResNets of different depth, are used to capture the spatial frame information and rapidly changing motion, respectively. In [66], a 3D-CNN is first used to produce a feature representation for each video segment; these representations are then processed using an attention network with fast and slow pathways. In [67], 3D-CNN architectures are built using a temporal one-shot aggregation module to capture multiple temporal receptive fields, and depth-wise spatiotemporal factorized components for modeling short- and long-term motion dynamics. In [68], a local and a global branch are utilized, using asymmetric convolution and two parallel 1D-like convolutional blocks, to extract semantic and temporal action information, respectively; moreover, a supervised and a self-supervised loss are combined to ingest information from labelled and unlabelled videos, respectively. In contrast to the above methods that leverage multi-scale spatiotemporal information, in [69] a dynamic equilibrium module is inserted into a 3D-CNN backbone to directly suppress the influence of spatiotemporal variations of actions in video. In another line of research, in [70], a self-knowledge distillation approach is used to boost the performance of baseline 3D-CNN models (3D ResNet-18 and -50) for the task of action recognition.
Further to the above, several works also investigated how to reduce the high computational cost of using 3D-CNNs. In [71], separable 3D-CNNs are introduced, factorizing the 3D convolutional filters into a 2D spatial and a 1D temporal convolutional component, allowing faster processing of video sequences. In [38], the above work is further extended by adding a feature gating mechanism, which is a simple self-attention operation. In [72], a differentiable similarity guided sampling module is introduced in the architecture of 3D-CNNs that measures the similarity of temporal feature maps and adaptively adjusts the temporal resolution. In [1], an efficient architecture is proposed, consisting of a 2D-CNN and two lightweight 1D-CNN-based branches to capture spatial information, short- and long-term motion dynamics, respectively, and a 3D-CNN feature enhancement module to obtain more fine-grained spatial and temporal cues. This architecture is much more efficient than SlowFast, which uses two 3D-ResNets in its branches. SCSampler [73] extracts C3D features and, as in [56] for 2D-CNNs, the audio modality is exploited to build a lightweight saliency model that selects short temporal clips within a long video that represent the latter well. In another direction, a multigrid approach is proposed in [74] to derive variable mini-batch shapes (i.e. number of videos, frames and spatial resolution) during training, accelerating the training procedure and improving the generalization performance of 3D-CNNs. In [75], similarly to EfficientNet [76], the X3D family of networks progressively expands a base network along different network dimensions (spatiotemporal resolution, frame rate, etc.) to derive powerful and efficient models. In [17], Ada3D trains a two-head network to learn frame and convolutional layer activation policies conditioned on the input video clip, thus reducing the computational cost of 3D-CNN models.
In [6], FrameExit utilizes X3D [75] for feature representation and applies conditional early exiting to further improve the efficiency of the backbone network, i.e., it stops processing video frames when a sufficiently confident decision is reached.
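To make the saving from factorizing a 3D filter into a 2D spatial and a 1D temporal component (as in the separable 3D-CNNs mentioned above) more tangible, the following sketch compares weight counts for a single layer. It is a back-of-the-envelope calculation under assumptions of our own: bias terms are ignored, the function names are hypothetical, and the intermediate channel width `c_mid` between the two factorized convolutions is assumed equal to the output width unless given.

```python
def full_3d_weights(c_in, c_out, k, t):
    """Weights of one t x k x k 3D convolution (biases ignored)."""
    return t * k * k * c_in * c_out


def separable_weights(c_in, c_out, k, t, c_mid=None):
    """Weights of a 1 x k x k spatial conv followed by a t x 1 x 1 temporal conv."""
    if c_mid is None:
        c_mid = c_out  # assumed intermediate width, not taken from the paper
    spatial = k * k * c_in * c_mid
    temporal = t * c_mid * c_out
    return spatial + temporal


full = full_3d_weights(64, 64, 3, 3)
factorized = separable_weights(64, 64, 3, 3)
print(full, factorized)
```

For this assumed layer (64 input and output channels, 3 x 3 x 3 kernel) the factorized form needs fewer than half of the weights of the full 3D kernel; the exact ratio depends on the chosen widths and kernel sizes.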
In general, despite efforts to reduce the computational cost of using 3D-CNNs, such approaches typically continue to be much more expensive in terms of computational complexity and power consumption in comparison to their 2D-CNN counterparts.
Transformers. Convolutional or recurrent-based operations can only process a local neighborhood of the video in space and time; in order to model long-range dependencies, deep CNN or RNN architectures are utilized that stack several layers implementing the above operations, effectively extending the receptive field of the overall network. However, the repetition of such local operations is computationally inefficient and causes optimization difficulties [16], [29], [77]. In contrast, Transformers utilize global self-attention to obtain a larger receptive field, thus capturing more effectively the long-term dependencies in action videos [77]. In [78], inspired by the success of Transformers in natural language processing, the so-called vision transformer (ViT) was introduced, outperforming convolutional-based approaches in popular image recognition benchmarks. Subsequently, several attention-based architectures were also introduced concurrently for modeling the spatiotemporal contextual information of actions in videos [10]-[14]. TimeSformer [11] applies temporal and spatial attention, demonstrating that in comparison to 3D convolutional networks the attention-based architecture is faster and can be applied to much longer video clips. Similarly, Video ViT (ViViT) [12] factorizes attention to spatial and temporal dimensions to efficiently process long video sequences and proposes effective training strategies for ViViT by ablating different tokenization and regularization methods. The spatiotemporal separable-attention video Transformer (VidTr) [10] performs spatial and temporal attention separately and utilizes a deviation-based topK pooling operator to focus on the most representative frames of the video sequence. In [13], similarly to SlowFast [65] and X3D [75] in the 3D convolutional paradigm, multiscale ViT (MViT) introduces several channel resolution scale stages into transformer models.
A common characteristic of all the above transformer-based approaches is that they rely on a context-agnostic extraction of a multitude of image patches, using a uniform grid, in order to learn the video actions; they do not take advantage of the inherent object-based composition of a visual scene and of the varying importance of specific objects for recognizing an event.

2) Bottom-up approaches
The methods described so far are top-down, i.e. entire frames or context-agnostic image patches corresponding to equally-sized receptive fields are processed along with the action class label of the video, to train a neural network to learn to attend to the input frames or patches thereof that are related to the underlying action class. In contrast, bottom-up approaches utilize a more human-like mechanism to select a subset of the visual stimuli corresponding to salient image regions [20], [21]. These methods typically use an object detector to provide bottom-up information for training an event classifier [5], [8], [9], [26], [79], [80]. For instance, in [80], a person detector (Faster R-CNN), a long-term feature bank and a 3D-CNN applied on short video segments are used to provide long-term supportive information for action recognition. In [81], scene- and object-class pseudo-labels are derived for each video using networks (ResNet-50) pretrained on the Places365 and MS-COCO datasets, respectively; a multi-scale deformable 3D convolutional network and an actor-object-scene attention model are then used for action recognition and factorization of actions into an actor, co-occurring objects, and scene cues. In [26], the Action Genome dataset is introduced, containing videos manually annotated with events, objects and their relationships, i.e., rich bottom-up information is provided, in contrast to [81] where the objects are annotated at video clip level. This dataset is used to learn a spatiotemporal scene graph feature-bank for action recognition; during inference the Faster R-CNN and RelDN [82] are used to extract objects and visual relationships for building the spatiotemporal graph.
Since video object annotation is a labor-intensive and time-consuming process, in [8], in contrast to [26] where object annotations are provided in the training set, a region proposal network (R-FCN [83]) and KLT trackers [84] are used to derive and track video objects, and build a semantic graph for each frame; subsequently, a hierarchical RNN is used to process the graph information and recognize group actions in video. Instead of bounding boxes, semantic segmentation masks are extracted in [79] using RefineNet-152; this bottom-up information is combined with optical flow features derived using FlowNet2 in a two-stream architecture for the task of short-term action recognition. In [9], features extracted using a 3D-ResNet backbone with an object detector (RoIAlign [27]) are used to train an attention-based GNN, which in comparison to RNNs or dense classification heads used above can learn more effectively the long-term dependencies of video actions. In [5], object features are extracted at frame level using an object detector with 2D-ResNet; these features are then used by a network head, composed of an attention mechanism, a GNN and an LSTM, factorizing the spatial and temporal dimensions. In comparison to [9], this factorization allows the efficient processing of long video signals. Moreover, weighted in-degrees (WiDs) derived from the graphs' adjacency matrix are utilized to identify the most salient objects in the video that can explain the event recognition result. Despite the considerable performance gains obtained by [5], [9], the use of 3D-CNN [9] or LSTM [5] may not be adequate to fully capture the long-term dynamics of actions or events in video, as explained in [10]-[12], [14].
In this work, to benefit from bottom-up video information while mitigating the above limitations, we propose a pure-attention bottom-up model utilizing an attention head network factorized along the spatial and temporal dimensions. Additionally, using the temporal GAT components of our model, we are able to derive not only explanations at spatial level (i.e. objects, as in [5]) but also at temporal level (i.e. frames). Furthermore, we explore the possibility of tying the weights of the various GAT blocks to further reduce the memory footprint of the model, similarly to works in other domains [34]-[36].

B. GNN DECISION EXPLANATION
There have been only limited works studying the explainability of GNNs. In contrast to CNN-based approaches where explanations are usually provided at pixel-level [85], for graph data the focus is on the structural information, i.e., the identification of the salient nodes and/or edges contributing the most to the GNN classification decision [86]. In the following, we briefly survey techniques most relevant to ours, i.e., targeting graph classification tasks and providing nodelevel (rather than edge-level) explanations. For a broader survey of various works on explainability the interested reader is referred to [86]. In [87], for each test instance the so-called GNNExplainer maximizes the mutual information between the GNN's prediction and a set of generated subgraph structures to learn a soft mask for selecting the nodes explaining the model's outcome. However, the explanation masks in [87] are optimized individually for each input graph and thus may lack a global view [86], [88]. In [89], a surrogate, probabilistic graphical model that can learn the non-linear relationships of the input graph, as captured by the underlying GNN, is proposed. More specifically, the so-called PGM-Explainer consists of a random perturbation approach to generate a synthetic dataset of graph data and respective predictions, a filtering step to discard unimportant graph data, and a learning step that trains the probabilistic graphical model using a Bayesian information criterion (BIC) score objective to provide explanations for the derived predictions. Both the above approaches learn to derive explanations by minimizing an objective function -mutual information [87] or BIC score [89] -whose relevance to explainability is unclear, as discussed in [90]. 
In contrast, in [90], a new explainability measure called RDT-Fidelity is introduced, satisfying the desired properties of good explanations; subsequently, a combinatorial procedure called ZORRO is proposed that uses a greedy forward selection algorithm to select the subgraphs that directly maximize the RDT-Fidelity score. The approaches discussed above have shown promising results; however, they add a high computational cost to the overall procedure, either because an additional training step is introduced [87], [89] or because a greedy evaluation of a large number of possible node combinations is necessary [90]. To avoid this cost, [91] extends popular gradient-based CNN methods to the GCN setting. These methods are efficient, as only one forward pass of the network is required; however, they suffer from the well-known gradient issues [92].
In this paper, to counter the described drawbacks of both gradient-based and computationally expensive learning- or perturbation-based methods, we propose deriving WiD scores from the adjacency matrices at the various levels of the proposed attention head network; these WiD scores exhibit more stable behavior and improved explanation quality, and obtaining them introduces a very limited computational overhead that is comparable to that of [91].

III. VIDEO GAT

A. VIDEO REPRESENTATION
Let us assume an annotated video training set of $C$ event classes. A video is represented with $N$ frames sampled from the video, and a backbone network extracts a feature representation $\gamma^{(n)} \in \mathbb{R}^{F}$ for each frame $n = 1, \dots, N$. The feature representations are stacked row-wise to obtain matrix $\Gamma \in \mathbb{R}^{N \times F}$,
$$\Gamma = [\gamma^{(1)}, \dots, \gamma^{(N)}]^T. \tag{1}$$
Similarly to recent bottom-up approaches [5], [9], we additionally use an object detector to derive $K$ objects from each frame; each object is represented by an object class label, a degree of confidence (indicating how confident the object detector is for this specific detection result), and a bounding box. The backbone network is then applied to extract a feature representation $x^{(n)}_k \in \mathbb{R}^{F}$ for each object $k$ in frame $n$. Sorting the feature representations in descending order according to their respective degree of confidence and stacking them row-wise, we obtain the matrix $X^{(n)} \in \mathbb{R}^{K \times F}$ representing frame $n$,
$$X^{(n)} = [x^{(n)}_1, \dots, x^{(n)}_K]^T. \tag{2}$$
Although various backbones can be used, similarly to works in other domains, we use a Vision Transformer (ViT), which has shown excellent performance as backbone in a pure-attention framework [14].

FIGURE 2. [...] are concatenated and the resulting feature is fed to layer U() to produce a score for each event class. Additionally, the WiDs derived from the adjacency matrices of the three blocks provide comprehensive explanations (in terms of salient objects and frames) for the recognized event.
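The construction of the frame and object matrices described above can be illustrated with a minimal NumPy sketch. The function names `build_frame_matrix` and `build_object_matrix`, the toy sizes and the random features are assumptions made only for illustration; in the actual pipeline the features come from the backbone network and the confidences from the object detector.

```python
import numpy as np

def build_frame_matrix(frame_feats):
    """Stack N frame feature vectors row-wise into an (N, F) matrix."""
    return np.stack(frame_feats, axis=0)

def build_object_matrix(obj_feats, confidences, K):
    """Keep the K most confident objects of a frame, sorted by descending
    degree of confidence, and stack their features row-wise into (K, F)."""
    obj_feats = np.asarray(obj_feats)
    order = np.argsort(-np.asarray(confidences), kind="stable")[:K]
    return obj_feats[order]

# Toy example (hypothetical sizes): N = 3 frames, 4 detections, K = 2, F = 5.
rng = np.random.default_rng(0)
gamma = build_frame_matrix([rng.normal(size=5) for _ in range(3)])
feats = rng.normal(size=(4, 5))          # one feature vector per detection
conf = np.array([0.2, 0.9, 0.5, 0.7])    # detector confidences
X_n = build_object_matrix(feats, conf, K=2)
```

Here the rows of `X_n` are the features of the detections with confidence 0.9 and 0.7, in that order.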

B. VIGAT HEAD
The ViGAT head depicted in Fig. 2 is used to process the features extracted from the backbone network. It is composed of three GAT blocks, $\Omega_1$, $\Omega_2$ and $\Omega_3$, where each block consists of a GAT and a graph pooling layer (the structure of the GAT block is described in detail in the next subsection). Each GAT block is applied separately to a different feature type, effectively factorizing attention along the spatial and temporal dimensions. This is a major advantage over the method of [5], where attention was utilized only along the spatial dimension and the temporal video information was encoded using a less effective LSTM structure. More specifically, the feature representations of the video frames (1) and of the objects of frame $n$ (2) at the input of the ViGAT head are processed by the blocks $\Omega_1$ and $\Omega_2$, respectively,
$$\delta = \Omega_1(\Gamma), \tag{3}$$
$$\eta^{(n)} = \Omega_2(X^{(n)}), \tag{4}$$
where $\delta, \eta^{(n)} \in \mathbb{R}^{F}$ are new feature representations for the entire video and frame $n$, respectively. Subsequently, the $N$ outputs of $\Omega_2$ (which correspond to the $N$ video frames) are stacked row-wise to obtain a new matrix $H \in \mathbb{R}^{N \times F}$ for the overall video,
$$H = [\eta^{(1)}, \dots, \eta^{(N)}]^T. \tag{5}$$

VOLUME X, 2022
This matrix is then fed to the block $\Omega_3$ to obtain a second new feature representation $\epsilon \in \mathbb{R}^{F}$ for the entire video,
$$\epsilon = \Omega_3(H). \tag{6}$$
The derived features $\delta$ and $\epsilon$ are then concatenated to form a new feature $\zeta \in \mathbb{R}^{2F}$ for the video,
$$\zeta = [\delta; \epsilon]. \tag{7}$$
Finally, $\zeta$ is passed through a dense layer U() in order to derive a score vector $\hat{y} = [\hat{y}_1, \dots, \hat{y}_C]^T$, where $\hat{y}_c$ is the classification score obtained for the $c$th event class. Using an annotated training set, an appropriate loss function and a learning algorithm, the ViGAT head can be trained end-to-end. Moreover, in case the weights of the three GAT blocks are tied (i.e. $\Omega_1 = \Omega_2 = \Omega_3$), the gradient updates for the GAT block parameters are simply the sum of the updates obtained for the $N + 2$ roles (see Fig. 4) of the GAT block in the network, as in [34], [36], [93].
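The data flow through the head can be summarized in a few lines of NumPy. This is only a shape-level sketch: the three blocks are passed in as callables, and the `mean_pool` placeholder and the linear map standing in for U() are assumptions used so that the example runs, not the attention-based blocks and dense layer of the actual model.

```python
import numpy as np

def vigat_head(Gamma, X, omega1, omega2, omega3, U):
    """Gamma: (N, F) frame features; X: (N, K, F) object features per frame."""
    delta = omega1(Gamma)                              # (F,) video feature from frames
    H = np.stack([omega2(X_n) for X_n in X], axis=0)   # (N, F) one feature per frame
    eps = omega3(H)                                    # (F,) video feature from objects
    zeta = np.concatenate([delta, eps])                # (2F,) concatenated feature
    return U(zeta)                                     # (C,) class scores

def mean_pool(M):
    """Placeholder block (assumption): averages the rows of its input."""
    return M.mean(axis=0)

# Toy sizes (hypothetical): N = 4 frames, K = 3 objects, F = 6, C = 5 classes.
rng = np.random.default_rng(0)
N, K, F, C = 4, 3, 6, 5
Gamma = rng.normal(size=(N, F))
X = rng.normal(size=(N, K, F))
W_u = rng.normal(size=(2 * F, C))
# Passing the same callable three times mimics the weight-tied configuration.
scores = vigat_head(Gamma, X, mean_pool, mean_pool, mean_pool, lambda z: z @ W_u)
```

In the tied setting the same block is invoked once on the frame features, $N$ times on the per-frame object features and once on $H$, which is where the $N + 2$ roles come from.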

C. GAT BLOCK
The GAT block structure $\Omega$ depicted in Fig. 3 is the building block of the ViGAT head. To avoid notation clutter, we use in this section block $\Omega_2$ (4) as an example for defining the GAT block (blocks $\Omega_1$, $\Omega_2$, $\Omega_3$ are identical). The input to $\Omega_2$ is matrix $X^{(n)} \in \mathbb{R}^{K \times F}$ (2), i.e. the feature representations of the $K$ objects of the $n$th frame. The first component of the GAT block is an attention mechanism that is used to compute the respective matrix $E^{(n)} \in \mathbb{R}^{K \times K}$ as follows [5], [32], [94],
$$e^{(n)}_{i,j} = \langle \tilde{W} x^{(n)}_i + \tilde{b},\ \bar{W} x^{(n)}_j + \bar{b} \rangle,$$
where $\tilde{W}, \bar{W} \in \mathbb{R}^{F \times F}$ and $\tilde{b}, \bar{b} \in \mathbb{R}^{F}$ are the weight matrices and biases of the attention mechanism, $\langle \cdot, \cdot \rangle$ is the inner product operator and $e^{(n)}_{i,j}$ is the attention coefficient at the $i$th row and $j$th column of $E^{(n)}$. The attention coefficients are then normalized across each row of $E^{(n)}$ to derive the adjacency matrix $A^{(n)} \in \mathbb{R}^{K \times K}$ of the graph [5], [9], [32], [94],
$$a^{(n)}_{k,l} = \frac{\exp(e^{(n)}_{k,l})}{\sum_{\iota=1}^{K} \exp(e^{(n)}_{k,\iota})}, \tag{11}$$
where $a^{(n)}_{k,l}$ is the element of $A^{(n)}$ at row $k$ and column $l$. The derived adjacency matrix and the node features are then forwarded to a GAT head of $M$ layers [5], [9], [95],
$$Z^{[m]} = \sigma(A^{(n)} Z^{[m-1]} W^{[m]}), \tag{12}$$
where $m$ is the layer index (i.e. $m = 1, \dots, M$), $\sigma()$ denotes a nonlinear operation (here, layer normalization [96] followed by an element-wise ReLU operator), and $W^{[m]} \in \mathbb{R}^{F \times F}$, $Z^{[m]} \in \mathbb{R}^{K \times F}$ are the weight matrix and output of the $m$th layer, respectively. The input of the first layer is set to the input of the GAT block, i.e. $Z^{[0]} = X^{(n)}$, and the output of the GAT head, $\Xi^{(n)} \in \mathbb{R}^{K \times F}$, is set to the output of its last layer, i.e. $\Xi^{(n)} = Z^{[M]}$. Subsequently, graph pooling [97] is applied to produce a vector representation of the graph at the output of the GAT block,
$$\eta^{(n)} = \sum_{k=1}^{K} \xi^{(n)}_k, \tag{13}$$
where $\xi^{(n)}_k \in \mathbb{R}^{F}$ is the $k$th row of $\Xi^{(n)}$. We note that (12) resembles the layer-wise propagation rule of GCNs [95].
However, as the exploitation of the attention mechanism to create the graph's adjacency matrix is central in our approach and due to the fact that this matrix is not symmetric (which violates the symmetry assumption in [95]), we resort to the more general message passing framework [98] and GAT [94] to describe our model.
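The sequence of operations inside a GAT block (attention coefficients, row normalization, $M$ propagation layers as in (12), graph pooling) can be sketched in NumPy as follows. Several details are assumptions made for this illustration rather than facts from the text: the row normalization is taken to be a softmax, the layer normalization has no learnable scale and shift, the pooling is a sum over the nodes, and all weights are random.

```python
import numpy as np

def softmax_rows(E):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    E = E - E.max(axis=1, keepdims=True)
    P = np.exp(E)
    return P / P.sum(axis=1, keepdims=True)

def layer_norm(Z, eps=1e-5):
    """Per-row layer normalization without learnable parameters (simplification)."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    return (Z - mu) / np.sqrt(var + eps)

def gat_block(X, W_t, b_t, W_b, b_b, layer_weights):
    """X: (K, F) node features. Returns the pooled (F,) vector and the (K, K) adjacency."""
    left = X @ W_t.T + b_t            # row i holds W_t x_i + b_t
    right = X @ W_b.T + b_b           # row j holds W_b x_j + b_b
    E = left @ right.T                # e_ij = <W_t x_i + b_t, W_b x_j + b_b>
    A = softmax_rows(E)               # each row of A sums to one (assumed softmax)
    Z = X                             # input of the first layer
    for W in layer_weights:           # M propagation layers, cf. (12)
        Z = np.maximum(layer_norm(A @ Z @ W), 0.0)   # layer norm, then ReLU
    return Z.sum(axis=0), A           # sum pooling over the K nodes (assumed)

# Toy run with hypothetical sizes: K = 4 nodes, F = 8 features, M = 2 layers.
rng = np.random.default_rng(0)
K, F, M = 4, 8, 2
X = rng.normal(size=(K, F))
params = [rng.normal(scale=0.1, size=(F, F)) for _ in range(2 + M)]
pooled, A = gat_block(X, params[0], np.zeros(F), params[1], np.zeros(F), params[2:])
```

Because the two projections use different weights, `A` is in general not symmetric, which is the property that motivates describing the model through message passing rather than the GCN framework.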

D. VIGAT EXPLANATION
Considering that during the inference stage the multiplication with the adjacency matrix in (12) amplifies the contribution of specific nodes, and that the resulting video representation gives rise to the trained model's event recognition decision, the adjacency matrix can be used for deriving indicators of each node's importance in said model's decision. This was first attempted in [5], where the importance of object $l$ at frame $n$ was estimated using the associated WiD value,
$$\omega_2^{(l,n)} = \sum_{k=1}^{K} a^{(n)}_{k,l}, \tag{14}$$
where $a^{(n)}_{k,l}$ is the element of $A^{(n)}$ at the $k$th row and $l$th column. The qualitative results presented in [5] demonstrated the usefulness of WiDs to produce explanations about the recognized video event. However, the use of an LSTM in [5] to process the frame features restricted the computation of WiDs to objects at static frames, and thus the derivation of explanations to the object level. In contrast, here we extend the utilization of WiDs to the temporal dimension. Specifically, the use of temporal attention through blocks $\Omega_1$ and $\Omega_3$ to process the frame features enables us to derive two WiDs for the $n$th video frame,
$$\omega_1^{(n)} = \sum_{\tau=1}^{N} \pi_{\tau,n}, \tag{15}$$
$$\omega_3^{(n)} = \sum_{\tau=1}^{N} \delta_{\tau,n}, \tag{16}$$
where $\pi_{\tau,n}$, $\delta_{\tau,n}$ are the elements of matrices $\Pi \in \mathbb{R}^{N \times N}$ and $\Delta \in \mathbb{R}^{N \times N}$ at row $\tau$ and column $n$, and $\Pi$, $\Delta$ are the adjacency matrices of blocks $\Omega_1$ and $\Omega_3$, respectively (similarly to $A^{(n)}$ being an adjacency matrix of block $\Omega_2$, as computed in (11)). A large $\omega_1^{(n)}$ or $\omega_3^{(n)}$ indicates that the contribution of frame $n$ to the event recognition outcome is high. In order to derive a single indicator for each frame, we average the above values to obtain a new indicator $\beta^{(n)}$ for the importance of frame $n$,
$$\beta^{(n)} = \frac{1}{2}\left(\omega_1^{(n)} + \omega_3^{(n)}\right). \tag{17}$$
Equation (17) is our proposed XAI criterion, i.e. we propose that the top-$\Upsilon$ frames with the highest $\beta^{(n)}$ values constitute an explanation of the network's event recognition outcome.
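As a small worked example of the frame-level criterion, the sketch below computes the weighted in-degrees as column sums of two toy adjacency matrices, averages them, and selects the most salient frames. The matrices and the helper names (`weighted_in_degree`, `frame_saliency`, `top_frames`) are made up for illustration.

```python
import numpy as np

def weighted_in_degree(A):
    """WiD of node n: sum of column n of a row-normalized adjacency matrix."""
    return A.sum(axis=0)

def frame_saliency(Pi, Delta):
    """Average of the two frame-level WiDs (from blocks Omega_1 and Omega_3)."""
    return 0.5 * (weighted_in_degree(Pi) + weighted_in_degree(Delta))

def top_frames(beta, upsilon):
    """Indices of the upsilon frames with the highest saliency, best first."""
    return np.argsort(-beta, kind="stable")[:upsilon]

# Toy adjacency matrices for N = 3 frames (each row sums to one).
Pi = np.array([[0.1, 0.6, 0.3],
               [0.2, 0.7, 0.1],
               [0.3, 0.5, 0.2]])
Delta = np.array([[0.2, 0.5, 0.3],
                  [0.1, 0.8, 0.1],
                  [0.4, 0.4, 0.2]])
beta = frame_saliency(Pi, Delta)     # approximately [0.65, 1.75, 0.60]
explanation = top_frames(beta, 2)    # frames 1 and 0, in that order
```

Since every row of a row-normalized adjacency matrix sums to one, the saliency values of a video always sum to $N$; a frame is salient when it attracts more than its uniform share of attention.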

A. DATASETS
We run experiments on three large, publicly available event/action video datasets: i) FCVID [37] is a multilabel video dataset consisting of 91223 YouTube videos annotated according to 239 categories. It covers a wide range of topics, with the majority of them being real-world events. The dataset is evenly split into training and testing partitions with 45611 and 45612 videos, respectively. Among them, 436 videos in the training partition and 424 videos in the testing partition were corrupt and thus could not be used. ii) MiniKinetics, which comes in two variants, one comprising approximately 130K video clips (121215 for training and 9867 for testing) [7] and one with approximately 85K clips (an 80K/5K training/testing split) [38]. Both variants contain instances of 200 event/action classes and originate from the Kinetics dataset [99]. Each clip has been sampled from a different YouTube video, has a duration of 10 seconds and is annotated with a single class label. iii) ActivityNet v1.3 [39] is a popular multilabel video benchmark consisting of 200 classes (including a large number of high-level events), and 10024, 4926 and 5044 videos for training, validation and testing, respectively. As the testing-set labels are not publicly available, the evaluation is performed on the so-called validation set, as typically done in the literature.

B. SETUP
Uniform frame sampling is one of the most commonly used strategies in video action recognition due to its simplicity, efficiency and effectiveness, and has offered state-of-the-art results in this domain (e.g. see [6], [7], [29], [42], [43], [48], [50], [54], [55], [59] and references therein). For this reason, uniform sampling is also applied here to represent each video with a sequence of N frames at the input of the proposed ViGAT. The number of sampled frames N per video is selected based on the videos' average duration and the complexity of the actions in the respective dataset, also considering the number typically used in the relevant literature. The average duration of the videos in MiniKinetics and FCVID is 10 and 167 seconds, respectively [37], [99]. On the other hand, most videos in ActivityNet are much longer, i.e., with a duration between 5 and 10 minutes [39]. Concerning the complexity of the events/actions in the different datasets, FCVID mostly contains generic categories, such as "baseball", "fire fighting" and "birthday". On the other hand, MiniKinetics and ActivityNet contain a broader variety, spanning from high-level events to short-term actions that are more difficult to differentiate, such as "applauding" and "clapping", "cleaning shoes" and "shining shoes" (MiniKinetics), "drinking beer" and "drinking coffee", and "long jump" and "triple jump" (ActivityNet). Based on the above analysis and following other works in the literature, we set N to 9 frames for FCVID (e.g. as in [5], [7], [54], [59]) and 30 frames for MiniKinetics (e.g. similarly to [64], [68]). For ActivityNet, due to both video length and event complexity, we decided to sample a larger number of frames, i.e. N = 120 (e.g. similarly to [55]); in this way, we want to ensure that the complex events/actions, especially the ones that resemble each other, as well as those covering only a small portion of the longer videos in this dataset, are adequately represented.
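One common way to realize uniform sampling is to split the video into N equal-length segments and take the frame at the centre of each one. The sketch below follows this convention, which is an assumption: the exact sampling positions are not specified above, and the helper name `uniform_frame_indices` is hypothetical.

```python
def uniform_frame_indices(num_frames, n):
    """Return n frame indices spread uniformly over a video of num_frames frames.

    The video is split into n equal-length segments and the index at the centre
    of each segment is returned. If the video has fewer than n frames, some
    indices are repeated.
    """
    if num_frames <= 0 or n <= 0:
        raise ValueError("num_frames and n must be positive")
    return [min(num_frames - 1, int((i + 0.5) * num_frames / n)) for i in range(n)]

# A 90-frame clip represented with N = 9 frames (the FCVID setting).
fcvid_like = uniform_frame_indices(90, 9)
# A longer video represented with N = 120 frames (the ActivityNet setting).
activitynet_like = uniform_frame_indices(9000, 120)
```

For the 90-frame clip this yields the indices 5, 15, ..., 85, i.e. one frame every ten frames.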
The object detector is used to extract a set of K = 50 objects from each frame (the ones with the highest degree of confidence). Thus, each object is represented with a bounding box, an object class label (which we only use for visualizing the object-level explanations) and an associated degree of confidence. As the object detector we use Faster R-CNN [100] with a ResNet-101 [31] backbone, where feature maps of size 14 × 14 are extracted from the region of interest pooling layer. The Faster R-CNN is pretrained on ImageNet1K [101] and fine-tuned on Visual Genome [102].
ViGAT utilizes a pre-trained backbone network to derive a feature representation for each object in a frame as well as for the overall frame, as described in (1), (2). We experimented with two backbones: i) ViT: the ViT-B/16 variant of the Vision Transformer [78], pretrained on ImageNet11K and fine-tuned on ImageNet1K [101], is our main backbone; specifically, the pool layer prior to the classification head output of the transformer encoder is used to derive a feature vector of F = 768 elements; ii) ResNet: a ResNet backbone is also used in order to compare directly with other literature works that use a ResNet backbone, and to quantify the performance improvement of the proposed pure-attention model (i.e. the effect of using attention also at the object pixel level through the ViT backbone); specifically, the pool5 layer of a ResNet-152 pretrained on ImageNet11K is used to derive an F = 2048 dimensional feature vector.
Concerning the ViGAT head (Fig. 2), the parameters of the three GAT blocks are tied, and M = 2 layers (12) are used in each GAT head. Moreover, U() is composed of two fully connected layers with a dropout layer between them (drop rate 0.5). The number of units in the first and second fully connected layer is F and C, respectively, where C (the number of event classes) is equal to 239, 200 and 200 for the FCVID, MiniKinetics and ActivityNet datasets, respectively; the second fully connected layer is equipped with a sigmoid or softmax nonlinearity for the multilabel (FCVID, ActivityNet) or single-label (MiniKinetics) datasets, respectively.
We performed in total eight main experiments, one for each possible combination of dataset (FCVID, the two variants of MiniKinetics, ActivityNet) and backbone (ViT, ResNet). In all experiments, the proposed ViGAT is trained using the Adam optimizer with cross-entropy loss and an initial learning rate of $10^{-4}$ (e.g. as in [78]). Following other works in the literature (e.g. [12]), a batch size of 64 is utilized, except for the experiment on ActivityNet with the ResNet backbone, where we reduced the batch size to 36 due to GPU memory limitations. For the proposed ViGAT with the ViT backbone, the learning rate is multiplied by 0.1 at epochs 50 and 90 for FCVID; 20 and 50 for MiniKinetics; and 110 and 160 for ActivityNet. The total number of epochs is set to 100 for MiniKinetics and 200 for FCVID and ActivityNet. For the ViGAT variant with the ResNet backbone, the learning rate is similarly reduced at epochs 30 and 60, and 90 epochs are used in total for each dataset. We should note that in all experiments the proposed method exhibited a very stable performance with respect to different learning rate schedules. All experiments were run on PCs with an Intel i5 CPU and a single NVIDIA GPU (either RTX3090 or RTX2080Ti).
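The step-wise learning rate schedules described above can be written as a small function of the epoch index. The milestone values for the ViT backbone are taken from the text; the function name `learning_rate`, the dictionary `VIT_MILESTONES` and the convention that the decay takes effect at the start of the milestone epoch (with 0-based epochs) are assumptions of this sketch.

```python
# Decay epochs for the ViT-backbone experiments, as listed in the text.
VIT_MILESTONES = {
    "FCVID": (50, 90),
    "MiniKinetics": (20, 50),
    "ActivityNet": (110, 160),
}

def learning_rate(epoch, milestones, base_lr=1e-4, gamma=0.1):
    """Learning rate at a given epoch: base_lr is multiplied by gamma once for
    every milestone that has been reached (assumed to apply from that epoch on)."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# Learning rate of the FCVID run at the first epoch, after one and after two decays.
lr_start = learning_rate(0, VIT_MILESTONES["FCVID"])
lr_mid = learning_rate(50, VIT_MILESTONES["FCVID"])
lr_end = learning_rate(95, VIT_MILESTONES["FCVID"])
```

Under these assumptions the FCVID run uses a rate of about 1e-4, 1e-5 and 1e-6 in the three phases of training.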

C. EVALUATION MEASURES
Similarly to other works in the literature and in order to allow for comparison of the proposed ViGAT with them, the event recognition performance is measured using the top-1 accuracy and mean average precision (mAP) [103] for the single-label (MiniKinetics) and multilabel (FCVID, ActivityNet) datasets, respectively.
The explainability performance of ViGAT is measured using the top $\Upsilon$ frames of the video selected by it to serve as an explanation. We use two XAI evaluation measures used extensively for the explanation of CNN models, i.e., Increase in Confidence (IC) and Average Drop (AD) [104],
$$\mathrm{IC} = \frac{1}{Q} \sum_{q=1}^{Q} \delta\left(\bar{y}_{q,\hat{u}_q} > \hat{y}_{q,\hat{u}_q}\right), \tag{18}$$
$$\mathrm{AD} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{\max\left(0,\ \hat{y}_{q,\hat{u}_q} - \bar{y}_{q,\hat{u}_q}\right)}{\hat{y}_{q,\hat{u}_q}}, \tag{19}$$
where $Q$ is the total number of evaluation-set videos, $\delta(a)$ is one when the condition $a$ is true and zero otherwise, $\hat{u}_q \in \{1, \dots, C\}$ is the event class label estimated by the ViGAT model using all $N$ frames, and $\hat{y}_{q,\hat{u}_q}$, $\bar{y}_{q,\hat{u}_q}$ are the model's scores for the $q$th video and estimated class $\hat{u}_q$, obtained using all or just the top $\Upsilon$ frames identified as explanations by the employed XAI criterion (17), respectively. That is, IC is the portion of videos for which the model's confidence score increased, and AD is the average drop of the model's confidence score, when just the $\Upsilon$ most salient frames are used to represent the video. Higher IC and lower AD indicate a better explanation. Additionally, we utilize two more general explainability measures, fidelity minus ($F^-$) and fidelity plus ($F^+$), as defined in [86].

TABLE 1. FCVID: mAP (%) of the compared methods.

| Method | mAP (%) |
|---|---|
| ST-VLAD [50] | 77.5 |
| PivotCorrNN [53] | 77.6 |
| LiteEval [57] | 80.0 |
| AdaFrame [54] | 80.2 |
| SCSampler [73] | 81.0 |
| AR-Net [7] | 81.3 |
| SMART [59] | 82.1 |
| AR-Net (EfficientNet backbone) [7] | 84.4 |
| ObjectGraphs [5] | 84.6 |
| AdaFocusV2 [19] | 85.0 |
| ViGAT (proposed; ResNet backbone) | 86.0 |
| ViGAT (proposed; ViT backbone) | 88.1 |
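To make the IC and AD explainability measures introduced above concrete, the following sketch computes them for a handful of made-up scores. It follows the commonly used form of these measures from [104] (the drop is normalized by the score obtained with all frames, and both measures are returned as fractions rather than percentages); the function names and the toy scores are assumptions.

```python
def increase_in_confidence(full_scores, top_scores):
    """Fraction of videos whose score for the predicted class increases when
    only the selected most salient frames are used."""
    q = len(full_scores)
    return sum(1 for y, y_top in zip(full_scores, top_scores) if y_top > y) / q

def average_drop(full_scores, top_scores):
    """Average relative drop of the predicted-class score (increases count as
    zero drop). Scores obtained with all frames are assumed to be positive."""
    q = len(full_scores)
    return sum(max(0.0, y - y_top) / y for y, y_top in zip(full_scores, top_scores)) / q

# Toy scores for Q = 4 videos: with all frames vs. with the selected frames only.
full = [0.8, 0.5, 0.9, 0.4]
top = [0.6, 0.7, 0.9, 0.2]
ic = increase_in_confidence(full, top)   # one video out of four improves
ad = average_drop(full, top)             # (0.25 + 0 + 0 + 0.5) / 4
```

In this toy case IC is 0.25 and AD is about 0.1875; a better frame selection criterion would push IC up and AD down.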
ii) Concerning our ViGAT variant that utilizes a ResNet backbone pretrained on ImageNet, this outperforms the best-performing literature approaches that similarly use a ResNet backbone on FCVID and ActivityNet (see Tables 1 and 3). Specifically, on FCVID we observe a significant performance gain of 1% over AdaFocusV2 [19], which is the previous state-of-the-art method. We also see that ViGAT provides a performance improvement of 1.4% over ObjectGraphs [5], which is the best previous bottom-up method. The above result clearly demonstrates the advantage of our architecture, i.e. the use of a pure-attention head in order to capture effectively both the spatial information and the long-term dependencies within the video, instead of using an attention-LSTM structure as in [5]. We also observe a large gain of 3.1% over AdaFocusV2 (the previous top-performing approach with a ResNet backbone) on ActivityNet. We should also note that in some cases ViGAT, even with a ResNet backbone, outperforms methods utilizing a stronger backbone, e.g. the AR-Net with the EfficientNet backbone on FCVID and ActivityNet [7]. On the other hand, this is not the case on the MiniKinetics dataset. This is attributed to the fact that our ImageNet-pretrained backbone is frozen and used only as a feature extractor, whereas the above methods train or fine-tune the employed ResNet backbone on the larger MiniKinetics dataset, leading naturally to improved performance.
iii) The use of the ViT instead of the ResNet backbone in ViGAT, i.e. the proposed pure-attention approach, provides a considerable performance boost: 2.1% on FCVID, and an impressive 8.2%, 7.8% and 6% on MiniKinetics 130K, MiniKinetics 85K and ActivityNet, respectively. The latter may be explained by the fact that ActivityNet and MiniKinetics contain a more heterogeneous mix of short- and long-term actions, and thus a stronger backbone that provides a better representation of the objects can facilitate the discrimination of a larger variety of action/event types. This behavior has also been observed in other methods, e.g., AR-Net (using ResNet and EfficientNet) and FrameExit (using ResNet and X3D-S), as illustrated in Tables 1 and 3. Concerning computational complexity, the Fvcore Flop Counter [105] is used to compute the FLOPs (floating point operations) of the ViGAT head and the ViT backbone. For the Faster R-CNN object detector, due to its inherent randomness during the inference stage, we utilize the GFLOPs per frame reported in [106]. Using the above tool, we verified that the proposed ViGAT head is very lightweight, with 3.85 million parameters and only 3.87 GFLOPs to process a video in MiniKinetics. On the other hand, counting also the execution of the Faster R-CNN [100] object detector and the ViT backbone [78] applied on each object and frame increases the total complexity of our method to 34.4 TFLOPs. The latter figure is comparable with the complexity of some of the most recent top-down approaches of the literature, such as ViViT Large and Huge [12] with 11.9 and 47.7 TFLOPs, respectively. However, we should note that during ViGAT training, the pre-trained Faster R-CNN and ViT backbone, which are the most computationally expensive components of ViGAT, are executed only once per video, substantially reducing the FLOPs of the overall training procedure.
Thus, compared to the video transformer models mentioned above, which were trained on dedicated high-performance tensor processing accelerators, ViGAT has a significantly lower training complexity that allowed all reported experiments to run on single-GPU PCs. Moreover, the overall complexity of ViGAT can be optimized by using more efficient pre-trained networks for object detection and feature representation, such as the ones presented in [107], [108], which report a considerably smaller number of GFLOPs than [78], [100].

E. EVENT RECOGNITION ABLATION STUDY
In order to gain a further understanding of the proposed event recognition approach, results of two ablation experiments are presented in this section. These experiments are performed using the ViGAT with ViT backbone and following the training procedure described in Section IV-B. Specifically, we perform:
• Assessment of the impact of the weight sharing scheme, as well as the relative importance of the object and frame feature information, on the performance of our model.
• Investigation of the effect of using a different number of layers within the GAT blocks of the proposed architecture.
In the first ablation experiment, we utilize MiniKinetics 85K to evaluate the performance of four different variants of our method: i) ViGAT: our proposed model (Section IV-D), i.e. with weight-tying applied across the three GAT blocks, ii) NoWT-ViGAT: this model has the same architecture as ViGAT, with the difference that the weights are not shared across the three GAT blocks (i.e. the blocks Ω1, Ω2 and Ω3 of Fig. 2 have different weights), iii) Global-ViGAT: this model utilizes only the GAT block Ω1 to process only the frame feature representations (1), iv) Local-ViGAT: in contrast to the above, this model employs only the GAT blocks Ω2 and Ω3, i.e. the branch of the ViGAT head that processes the object feature representations (2). The evaluation performance in terms of top-1 (%) for all models along the different epochs is shown in Fig. 5. From the obtained results we observe the following:
i) The Local-ViGAT model outperforms Global-ViGAT with a high absolute top-1 gain of 4.58%, demonstrating the significance of the bottom-up information (represented by the object features) and the effectiveness of our approach in exploiting this information. Moreover, we observe that the object and frame features are to some extent complementary, as shown by the 1.66% absolute top-1 performance gain of ViGAT (which exploits both features) over Local-ViGAT.
ii) ViGAT outperforms NoWT-ViGAT on MiniKinetics 85K by 0.26% absolute top-1, showing that the use of shared weights across the different GAT blocks may act as a form of regularization that stabilizes the training procedure, as for instance has been observed in [34]-[36]. However, we should note that this is not necessarily always the case, i.e. for other datasets a larger network capacity may be beneficial. Besides potentially improved event recognition results, the use of shared weights leads to a reduced memory footprint: using the Fvcore Flop Counter [105] we can see that NoWT-ViGAT has 8.426 million parameters. In comparison, the proposed ViGAT (3.85 million parameters) achieves an approximately 2.2× lower memory footprint.
In a second ablation experiment, the influence of the number of GAT layers M (12) on the performance of ViGAT is examined. Specifically, M within each block (Fig. 2) is varied from 1 to 4 and the performance is recorded. From the results shown in Table 4, we observe that M = 2 is optimal or nearly optimal across all three datasets (for simplicity, concerning MiniKinetics we run this ablation experiment only on its 85K variant), and that the performance starts to decrease for M > 3. This behaviour has often been observed in the literature and is attributed to the well-known oversmoothing problem of GNNs [109].

F. EVENT EXPLANATION RESULTS AND ABLATION STUDY
In this section, the proposed explainability approach (Section III-D) with the ViT backbone is evaluated on the ActivityNet dataset. This dataset is selected here because its videos are represented with a large number of frames (i.e. N = 120), allowing for a thorough evaluation of different XAI criteria.
Firstly, we perform a quantitative evaluation using the XAI measures described in Section IV-C. Specifically, the various criteria are evaluated based on their ability to select the Υ most salient frames explaining the model's outcome, where Υ is set to 1, 2, 3, 5, 10 and 20.
We assess the following four ViGAT-based criteria (which can also be considered as a form of ablation study [...]): i) the mean of the global and local frame-level WiDs, $\beta^{(n)}$ (17), ii) a second combination of $\omega_1^{(n)}$ and $\omega_3^{(n)}$ using a different operator [...], iii) Local Only, $\omega_3^{(n)}$ (16), and iv) Global Only, $\omega_1^{(n)}$ (15). Additionally, the above criteria are compared against i) GCN-Grad-Cam [91], which is the closest approach to ours and can be applied to the ViGAT architecture, and ii) random frame selection, as a baseline. For the latter (denoted hereafter simply as Random), random selection is repeated five times and the average is reported for each individual XAI measure.
The evaluation results in terms of AD (19), IC (18), F− (20) and F+ (21) are depicted in Figs. 6, 7, 8 and 9, respectively. From the obtained results we observe the following: i) In all cases the proposed WiD-based XAI criteria outperform random frame selection by a large margin. Therefore, it is clear that the WiDs derived from the learned adjacency matrices in the proposed ViGAT architecture can provide valuable information for explaining the model's decision.
ii) The proposed criteria also outperform GCN-Grad-Cam across all performance measures. For instance, for Υ = 1 (i.e. when only the single most salient frame is considered) our proposed XAI criterion ($\beta^{(n)}$) provides an absolute explanation performance improvement of approximately 25%, 9% and 18% over GCN-Grad-Cam in terms of AD, IC and F−, respectively.
iii) The local WiDs are powerful explainability indicators, outperforming the global ones; this further highlights that bottom-up (i.e. object) information is crucial for the recognition of events in video.
iv) The combination of the local and global WiDs (using either operator) in most cases offers a small but noticeable performance gain, showing that these indicators are to some degree complementary. For instance, we observe in Fig. 7 that the mean WiDs consistently provide an absolute 2% IC performance gain over using any of the individual WiD indicators alone.
v) Generally, in terms of AD, IC and F−, GCN-Grad-Cam exhibits a performance close to the random baseline. In contrast, it achieves a much better F+ performance than the random baseline, as shown in Fig. 9. This is in agreement with similar results in the literature, e.g. in [91]. More specifically, we note that the computation of AD, IC and F− is based on the selection of the Υ most salient frames, while, in contrast, F+ is based on the remaining N − Υ least salient ones. Based on this observation, we can say that AD, IC and F− correspond to the notion of sparsity (a measure of the localization of an explanation in a small subset of the graph nodes) and F+ resembles the notion of fidelity (a measure of the decrease in classification accuracy when the most salient graph nodes are occluded), as sparsity and fidelity are defined in [91]. In the experimental evaluation of the above work it is shown that GCN-Grad-Cam provides explanations of high fidelity but poor sparsity, similarly to the results obtained here.

FIGURE 10. [...] (17) is provided at the top of the figure. The two and six video frames with the lowest and highest $\beta^{(n)}$ (depicted with red and green bars, respectively) are shown below the barplot. The video frame corresponding to the highest $\beta^{(n)}$ is placed within a green rectangle. We see that the model focuses on the frames that contain at least one bike and ignores other irrelevant ones (e.g. the computer graphics frame, appearing first from the left in the figure). It is also worth noting that the frame selected as the most salient (i.e., with the highest $\beta^{(n)}$) is the one that depicts multiple BMX vehicles.
FIGURE 12. Explanation example for a video correctly categorized into the class "Waxing Skis". This is a hard example because, as we can see from the two left-most frames in this figure, frames showing a skier and snow are part of the video and are even assigned high $\beta^{(n)}$ values (17); these could mislead the classifier into labeling the video as "Skiing" (which is among the events included in this dataset). However, thanks to the highest $\beta^{(n)}$ values being assigned by the proposed ViGAT to frames that depict waxing skis, the classifier correctly recognizes this event.

FIGURE 13. [...] values correctly indicate the frames that are irrelevant to the recognized class, e.g. the two frames with the lowest $\beta^{(n)}$ depict a computer graphics image and an empty bowl, respectively. On the other hand, the two frames with the highest $\beta^{(n)}$ show human hands cutting lemons, thus providing a convincing explanation why this video was misrecognized as "Making lemonade" by the proposed model.
In order to gain further insight into the proposed explainability approach, qualitative results (examples) are also given in Figs. 10 to 16. In Figs. 10, 11 and 12, we show the six most salient and the two least salient frames selected using our explainability criterion β (n) from correctly recognized videos belonging to the classes "BMX", "Rock Climbing" and "Waxing Skis", respectively. In Fig. 10, we see that all selected frames contain at least one BMX bike, while the one with the highest β (n) contains several bike instances. Similarly, in Fig. 11 the climber and the climbing wall, and in Fig. 12 instances of a person waxing skis, are clearly shown in the selected frames. Regarding Fig. 12, although this video is a difficult example because it contains instances of two related events ("Waxing Skis" and "Skiing"), the classifier correctly gives more attention to the frames related to the actual "Waxing Skis" event than to the ones depicting the skiers and snow, and thus classifies the video correctly. On the other hand, in all figures, the frames assigned a low β (n) depict information irrelevant to the recognized event and thus are correctly dismissed as potential explanations by our approach.

FIGURE 14. Explanation example for a video belonging to class "Chopping wood" but miscategorized as "Starting a campfire". The most salient frames (based on the β (n) values, see (17)) are the ones depicting a person chopping wood next to a campfire. These frames provide a convincing explanation of why the classifier has mistakenly labeled this video. On the other hand, we see that the frames most irrelevant to the classification decision (receiving a β (n) value close to zero) are the ones with overlay text on black frames. This video has many such frames, yielding a barplot that looks quite different from that of other videos.

FIGURE 15. Explanation example for a video belonging to class "Removing ice from car" but miscategorized as "Shoveling snow". We observe that the most salient frame (based on the β (n) values, see (17)) depicts a person removing ice from a car, thus not providing enough evidence of why this video has been misclassified. To this end, we resort to the object-level explanations, provided in the second row of the figure. Specifically, the eight most salient objects are depicted for the most salient frame of the video, at the left side of the row, and the respective WiD values (ω (l,n) 2, see (14)) are shown in the barplot in the middle of the row. We also show a respective object detection barplot at the right side of the row, depicting the eight objects detected with the highest degree of confidence (DoC) value. Concerning the bar colors: a green bar in the WiDs barplot indicates that the corresponding object did not appear in the top-8 DoC list but was promoted by our approach; a red bar indicates that this object is completely irrelevant to the recognized event. We see that the most salient objects identified by our approach, i.e. "person", "snow", "man", etc., are not characteristic enough to differentiate between the two classes. However, we observe that the object "car", which is a differentiating factor between the two classes, is not detected by the object detector; additionally, the frame regions that the classifier focuses on do not include the car region, convincingly explaining the network's recognition decision in this failure example.
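The frame selection used in the examples of Figs. 10 to 12 (six most salient and two least salient frames according to β (n)) can be illustrated with a minimal sketch. It assumes that one saliency value per sampled frame is already available; the function name `select_salient_frames` and the numeric values are hypothetical and serve only as an illustration, not as the authors' implementation.

```python
from typing import List, Sequence, Tuple


def select_salient_frames(
    beta: Sequence[float], n_top: int = 6, n_bottom: int = 2
) -> Tuple[List[int], List[int]]:
    """Return the indices of the n_top highest and n_bottom lowest values."""
    if n_top + n_bottom > len(beta):
        raise ValueError("not enough frames for the requested selection")
    # Sort frame indices by decreasing saliency; ties keep temporal order
    # because Python's sort is stable.
    order = sorted(range(len(beta)), key=lambda n: -beta[n])
    most_salient = order[:n_top]
    least_salient = order[len(order) - n_bottom:]
    return most_salient, least_salient


# Hypothetical frame-level saliency values for a video with 9 sampled frames.
beta_values = [0.02, 0.31, 0.27, 0.00, 0.45, 0.12, 0.38, 0.01, 0.22]
top_frames, bottom_frames = select_salient_frames(beta_values)
print("most salient frames:", top_frames)
print("least salient frames:", bottom_frames)
```

In this toy example the frames with values close to zero end up in the least salient group, mirroring the frames that are dismissed as potential explanations in the figures.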
In contrast to the above examples, Figs. 13, 14 and 15 show failure cases: videos of the classes "Preparing salad", "Chopping wood" and "Removing ice from car" that have been miscategorized as "Making lemonade", "Starting a campfire" and "Shoveling snow", respectively. As in the previous examples, we observe that the frames associated with the lowest β (n) are visually irrelevant to the recognized events and thus were correctly dismissed. On the other hand, most of the frames associated with high β (n), as well as the ones corresponding to the top β (n) value, contain objects relevant to the recognized class, explaining why the classifier mislabelled these videos. For instance, the most salient frames of the videos in Figs. 13 and 14 depict pieces of lemons and a campfire, thus providing an explanation of why the classifier misclassified these videos as "Making lemonade" and "Starting a campfire", respectively. Similarly, in Fig. 15, utilizing the object-level explanations for the most salient frame of the video (provided in the second row of this figure), we discover that the object "car" is not detected by the object detector, misleading ViGAT to miscategorize this video as "Shoveling snow".

FIGURE 16. Each row of this figure provides an explanation example produced using our approach for a video belonging to a different event category (from top to bottom): a) "Assembling a bike", b) "Skiing", c) "Cleaning windows", d) "Getting a haircut", e) "Brushing teeth". An explanation example consists of the video frame associated with the highest β (n) (frame-level WiDs) and the four objects in this frame corresponding to the highest object-based WiDs. The two barplots in the middle and right of each row depict the objects in the frame corresponding to the eight highest WiD or degree of confidence (DoC) values, respectively. A green bar in the WiDs barplot indicates that the corresponding object did not appear in the top-8 DoC list but was promoted by our approach and convincingly explains the network's recognition decision, e.g. see the "skier" and "baby" objects in the examples of the second and fifth row. On the other hand, a red bar in the barplots indicates that this object is completely irrelevant to the recognized event, e.g. see the "tree" objects in the examples of the second and third row. We observe that in most cases our approach indicates objects very relevant to the recognized event as explanations for the event recognition result ("dog" in the fourth example is a notable exception). In contrast, objects with high DoC, although they may indeed be depicted in the frame, are often not related to the event recognized by the model and are correctly not considered by our WiD-based approach as good explanations.
Finally, Fig. 16 depicts several examples of the frame- and object-level explanations generated by our model: in each row, the selected best video frame explanation, as well as the top four object explanations within each frame, as identified by our approach, are shown. Additionally, two barplots per row are provided, depicting the eight objects with the highest WiD (ω (l,n) 2, see (14)) and degree of confidence values (the latter being an output of the employed object detector), respectively. The same type of information is also provided in the second row of Fig. 15 to help us understand why ViGAT misclassified that video. We observe that the objects associated with the highest WiDs are well correlated with the recognized event. Moreover, in most cases (i.e. when the object detector provides a correct object class detection) the class names of the objects can be used to provide a sensible semantic recounting [110] that describes the event detected in the video in a human-comprehensible format. On the other hand, the same cannot be said for the objects associated with high degree of confidence values; these provide a general overview of the various objects depicted in the frame, rather than an insight into which of the depicted objects led to the event recognition decision.
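The comparison between the WiD and DoC barplots can be illustrated with a short sketch. It uses the generic graph-theoretic definition of a weighted in-degree (the sum of the weights of all edges arriving at a node) and assumes that `adjacency[i][j]` holds the weight of the edge from object i to object j; the exact definition and any normalization used in (14) may differ. The helper names, object labels and numeric values are hypothetical.

```python
from typing import List, Sequence


def weighted_in_degrees(adjacency: Sequence[Sequence[float]]) -> List[float]:
    """Sum each column: total edge weight arriving at each node.

    Assumption: adjacency[i][j] is the weight of the edge from node i to node j.
    """
    n = len(adjacency)
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def top_k(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest scores, in decreasing order of score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def promoted_objects(
    wid: Sequence[float], doc: Sequence[float], k: int
) -> List[int]:
    """Objects in the top-k WiD list that are absent from the top-k DoC list
    (the objects that would be drawn with a green bar)."""
    doc_top = set(top_k(doc, k))
    return [i for i in top_k(wid, k) if i not in doc_top]


# Hypothetical frame with four detected objects.
labels = ["skier", "tree", "sky", "snow"]
adjacency = [
    [0.5, 0.1, 0.1, 0.3],
    [0.6, 0.1, 0.1, 0.2],
    [0.5, 0.2, 0.1, 0.2],
    [0.4, 0.1, 0.1, 0.4],
]
doc = [0.40, 0.95, 0.60, 0.90]  # hypothetical detector confidences

wid = weighted_in_degrees(adjacency)
promoted = [labels[i] for i in promoted_objects(wid, doc, k=2)]
print("top-2 by WiD:", [labels[i] for i in top_k(wid, 2)])
print("top-2 by DoC:", [labels[i] for i in top_k(doc, 2)])
print("promoted by WiD:", promoted)
```

In this toy example the detector is most confident about "tree" and "snow", whereas the in-degree ranking puts "skier" first, so "skier" is the object that would be marked as promoted.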

G. LIMITATIONS
As shown by the experimental results, due to the extraction of bottom-up information and the utilization of attention at various levels of ViGAT, our method attains improved event recognition performance and has the ability to provide comprehensive explanations about the decision of the classifier. However, as expected, in comparison to efficient top-down approaches, the above achievements come with a high cost in memory consumption and inference time. To this end, we have tied the weights of the three GAT blocks of ViGAT, achieving more than 2× improvement in memory utilization (see Section IV-E). However, the computational overhead is mainly due to the use of the object detector at each sampled frame, to extract a set of objects, and the subsequent use of a backbone network (ViT) to provide a feature representation of them (i.e. to derive the bottom-up information). To reduce this overhead, inspired by the relevant literature [6], [7], [59], we plan on investigating techniques for selecting only a small fraction of the sampled frames to use for extracting bottom-up information.
Another limitation of the proposed approach relates to the accuracy of the employed object detector. More specifically, we observe that although the objects derived by our approach focus on the area where the event is taking place and explain the event classifier's decision well, their labels are not always correct (e.g. see the red colored bars in the WiD barplots of Figs. 15 and 16). This limitation in the provided explanations is attributed to the imperfection of the object detector. Nevertheless, we observe that our WiD-based explanation approach highlights the detected objects that are most related to the recognized event, and which are usually more accurately labeled by the object detector; in this way, it realizes a kind of error-correcting mechanism on the object detection results (e.g. compare the left and right barplots in Figs. 15 and 16, depicting the objects detected with the highest WiD and DoC values, respectively). To address the object detector accuracy limitation, we plan on experimenting with newer object detectors (e.g. [107], [111]), aiming to further improve the overall accuracy and efficiency of ViGAT as well as the quality of the produced object-level explanations.

V. CONCLUSION
We presented a new pure-attention bottom-up method for video event recognition, composed of three GAT blocks to process effectively both bottom-up (i.e. object) and frame-level information. Moreover, utilizing the learned adjacency matrices at the corresponding GAT blocks, WiD-based explanation criteria at object- and frame-level were proposed. Experimental results on three large, popular datasets showed that the proposed approach achieves state-of-the-art event recognition performance and at the same time provides powerful explanations for the decisions of the model.
As future work, we plan to investigate techniques for further optimizing the efficiency of ViGAT, for instance, techniques for discarding early in the processing pipeline the objects/frames less correlated with the depicted event, similarly to [6]; and to investigate the utilization of more efficient object detectors and network backbones, such as [107], [108], [111], as well as alternative frame sampling strategies [7], [17].

Hellas, Information Technologies Institute. He has coauthored more than 40 journal articles, 20 book chapters, 180 conference papers, and three patents. His research interests include multimedia understanding and artificial intelligence, in particular, image and video analysis and annotation, machine learning and deep learning for multimedia understanding and big data analytics, multimedia indexing and retrieval, and applications of multimedia understanding and artificial intelligence. He serves as a Senior Area Editor for IEEE Signal Processing Letters and served as an Associate Editor for IEEE Transactions on Multimedia.