Graph Neural Networks for Contextual ASR With the Tree-Constrained Pointer Generator

Incorporating biasing words obtained through contextual knowledge is paramount in automatic speech recognition (ASR) applications. This paper proposes an innovative method for achieving end-to-end contextual ASR using graph neural network (GNN) encodings based on the tree-constrained pointer generator method. GNN node encodings facilitate lookahead for future word pieces in the process of ASR decoding at each tree node by incorporating information about all word pieces on the tree branches rooted from it. This results in a more precise prediction of the generation probability of the biasing words. The study explores three GNN encoding techniques: namely the tree recursive neural network (Tree-RNN), the graph convolutional network (GCN), and GraphSAGE, along with different combinations of the complementary GCN and GraphSAGE structures. The performance of the systems was evaluated using both Librispeech and the AMI corpus with a visual-grounded contextual ASR pipeline. The findings indicate that using GNN encodings achieved consistent and significant reductions in word error rate (WER), particularly for words that are rare or have not been seen during the training process. Notably, on LibriSpeech test sets, the combined GNN proposed in this paper achieved a 20% relative rare word error rate reduction compared to Tree-RNN, 30%–40% compared to standard TCPGen and 60% compared to standard ASR systems without TCPGen.


I. INTRODUCTION
E ND-TO-END ASR systems often struggle with the accu- rate recognition of rare "long-tail" words that were not present in the training data.To combat this issue, Contextual biasing which applies external contextual knowledge to the ASR system during inference, becomes a crucial factor in addressing the long-tail word problem in various applications [1]- [14].Contextual knowledge is often represented as a list (referred to as a biasing list) of words or phrases (referred to as biasing words) that are likely to appear in a given context.Biasing lists can be sourced from various resources, such as a user's contact book or playlist, recently visited websites and visual information from presentation slides etc.Despite their infrequent occurrence and thus the limited influence on the overall word error rate (WER), biasing words significantly impacts understanding the content as biasing words are mostly content words such as nouns or proper nouns, which are crucial for downstream tasks.The inclusion of a word in a biasing list increases its likelihood of being correctly recognised, making contextual biasing a crucial component for the accurate recognition of rare content words.
G. Sun and P.C. Woodland are with the Department of Engineering, University of Cambridge, Trumpington St., Cambridge.Email: {gs534, pcw}@eng.cam.ac.uk C. Zhang is with the Department of Electronic Engineering, Tsinghua University, Beijing, China.Email: cz277@tsinghua.edu.cnEnd-to-end trainable ASR systems [15], [20] are designed to encapsulate all necessary knowledge within a single, static model, making it difficult to integrate dynamic context-specific knowledge at test-time.To overcome this challenge, specialised contextual biasing methods have been proposed, such as shallow fusion (SF) with a weighted finite-state transducer (WFST) or an adapted language model (LM) that incorporates contextual knowledge [1]- [3], [11], [12], [14], attention-based deep context approaches [4]- [8], as well as deep biasing (DB) with a prefix tree for improved efficiency when dealing with large biasing lists [6], [9], [10].More recently, contextual biasing components with a pointer generator mechanism [46]- [48] that directly modifies the output distribution have been proposed [34], [38], which can be jointly optimised with ASR systems.In particular, the tree-constrained pointer generator (TCPGen) component proposed in [34] builds a neural shortcut by directly interpolating the original model distribution with the TCPGen distribution estimated from contextual knowledge structured as a prefix-tree, based on a dynamic interpolation weight predicted by the TCPGen component.TCPGen performance was further boosted by encoding the prefix-tree using a tree recursive neural network (tree-RNN) [35].
This paper substantially extends the work in [35] and proposes to use three types of graph neural networks (GNN) for prefix-tree encoding in TCPGen 1 .These include tree-RNN, graph convolutional network (GCN) [43] with its variant GCNII [42], and GraphSAGE with the max-pooling aggregator [44].While tree-RNN is a representative GNN model with a single recursive layer, GCN and GraphSAGE are two popular and effective multi-layer GNN designs.Specifically, GCN encodes the tree by utilising spectral representation, while GraphSAGE, as a spatial method, directly explores the graph topology [28].To further enhance the performance of GNN tree encodings, this paper proposes attentive and bilinear combination approaches to exploit the complementarity between GCN and GraphSAGE.Additionally, this paper introduces an effective parameter-tying scheme for both GCN and Graph-SAGE to improve their performance with deeper structures.
GNN encodings provide more powerful node representations in the prefix tree of TCPGen, allowing for "lookahead" functionality where each node contains not only its own word piece information but also information about its child branches.This improved node representation in TCPGen leads to more accurate generation probability predictions for biasing words, enabling better contextual biasing by incorporating information about future word pieces during each ASR decoding 1 The main code is at https://github.com/BriansIDP/espnet/tree/GNN.arXiv:2305.18824v1[cs.CL] 30 May 2023 step.TCPGen with GNN encodings, as a generic component for end-to-end ASR, is integrated into both attention-based encoder-decoder (AED) [15]- [19], [26] and neural transducer architecture (N-T) [20]- [24].
Experiments were conducted with two different setups: (1).a simulated contextual ASR task using LibriSpeech audiobook data, and, (2).an audio-visual speech recognition pipeline [35] with the AMI meeting data.In addition to the consistent and large reductions in error rates achieved by TCPGen with tree-RNN in [35], using the proposed GNN encodings, especially combined ones achieved further significant improvements in the word error rate (WER) of rare content words.
The remainder of this paper is organised as follows.Section II reviews related work.Section III introduces the TCPGen component.Section IV describes the details of applying proposed GNNs.Section V and VI present the experimental setup and results.Section VII gives conclusions.

A. End-to-end contextual speech recognition
Recently, various contextual biasing algorithms have been developed for end-to-end ASR.One prominent research stream focuses on representing biasing lists as external weighted finite-state transducers (WFSTs), which are integrated into a class-based language model (LM) via shallow fusion (SF) [1]- [3], [11].These methods often depend on context prefixes such as "call" or "play", limiting their ability to handle the diverse grammar in natural speech.On the other hand, deep context approaches, often using attention mechanisms, have also been proposed, which encode the biasing list into a vector to use as input for the end-to-end ASR models [4]- [8].While deep context approaches eliminate the dependence on syntactic prefixes seen in SF methods, they require more memory and are less effective for handling large biasing lists.
The study in [9] combined the use of deep context and shallow fusion of a WFST in an N-T, leading to improved efficiency by limiting the biasing vector extraction to a subset of word pieces determined by a prefix tree representation of the biasing list, which is referred to as deep biasing (DB).The prefix-tree-based method was further expanded in [10] to include RNN LMs for shallow fusion, resulting in improved recognition of biasing words.Prior research only analysed industry datasets, however, [10] proposed and validated a simulation of contextual biasing on open-source data by incorporating a large number of distractors into the list of biasing words in the utterance.More recently, [38] and [34] simultaneously proposed a neural shortcut between the biasing list and the final model output distribution.In contrast to [38], TCPGen [34] adopted a structured prefix-tree representation for biasing lists, which also enabled the future development of tree-RNN encodings [35].

B. GNN for speech and language processing
GNNs have been extensively employed in a multitude of speech and language tasks.In language-related applications, such as sentence-level text classification or word-level sequence labelling, GNNs are often utilized to capture the syntactic dependencies or semantic relations among words in a sentence.Furthermore, the encoding of subword-unit-based tree structures using GNNs [39], [40] has been explored for the purpose of generating more effective word representations.GNNs have also been applied to named-entity recognition [52] and neural machine translation [53], [54].
GNNs also have numerous applications in speech processing, such as text-to-speech synthesis, where GNNs model the syntactic and semantic relationships in the text.In GraphTTS [45], the authors structured the sentence into a hierarchical tree by dividing the utterance into words and then further into characters.This allowed the system to capture prosodic relationships among different parts of the input.Additionally, GNNs are used in paralinguistic tasks as the syntactic encoder, including sentiment classification and hate speech detection tasks [49]- [51].In [35], a tree-RNN structure was used to encode a word piece prefix-tree in the TCPGen component for contextual biasing.

III. TREE-CONSTRAINED POINTER GENERATOR
TCPGen is a neural network-based module that employs a combination of symbolic prefix-tree search and a neural pointer generator for contextual biasing, allowing for end-toend optimisation.The biasing list is structured as a prefix tree at the word piece-level.At each output stage, TCPGen computes a probability distribution over a set of valid word pieces that are constrained by the prefix tree.TCPGen also predicts a dynamic generation probability, signifying the amount of contextual biasing required at the specific output step.The final output distribution is obtained by taking a weighted summation of the TCPGen distribution and the original AED or N-T output distribution (see Figure 1).×(1 − P gen i ) P mdl (y i ) P ptr (y i ) P (y i ) Fig. 1.Illustration of interpolation in TCPGen with corresponding terms in Eqn.(3).P ptr (y i ) is the TCPGen distribution.P mdl (y i ) is the distribution generated by a standard end-to-end model.P (y i ) is the final output distribution.
P gen i and P gen i are the scaled and unscaled generation probabilities.
The key symbolic representation of the external contextual knowledge in TCPGen is the prefix tree.For simplicity, examples and equations in this section are presented for a specific search path, which can be generalised easily to beamsearch with multiple paths.In the example prefix tree with biasing words (turner, vignette and turin) shown in Fig. 2, if the previously decoded word piece is Tur, word pieces in_ and n form the set of valid word pieces Y tree i .Denoting x 1:T and y i as input acoustic features and output word piece, q i as the query vector carrying the decoding history and acoustic information, K = [..., k j , ...] as the key vectors, scaled dot-product attention is performed between q i Ab i ga Ma an ri o_

Start
Attn.
i l_ an o_ P ptr (y i ) ne_ Fig. 2.An example of prefix tree search and attention in TCPGen.With previous output Tur, in_ and n are two valid word pieces on which attention will be performed.A word end unit is denoted by _.
and K to compute the TCPGen distribution P ptr and the output vector h ptr i as shown in Eqns.( 1) and ( 2).P ptr (y i |y where d is the size of q i (see [33]), Mask(•) sets the probabilities of word pieces that are not in Y tree i to zero, and v j is the value vector relevant to j.
TCPGen can be employed in both AED and N-T.In AED, the query is the combination of the context vector and the embedding of the preceding word piece, while the keys and values are derived from the decoder word piece embedding using a shared projection matrix.The generation probability in AED is computed from the concatenation of decoder hidden states and TCPGen output vectors h ptr i , followed by Sigmoid activation function to be constrained to (0, 1).In N-T, the pointer generator is applied to each combination of the encoder and the predictor step, where the TCPGen distribution is calculated using the concatenation of the corresponding encoder and predictor hidden states as the query.Keys and values in N-T are computed from the predictor word piece embeddings.The generation probability for N-T is derived from the joint network output and the TCPGen output vector h ptr i .To ensure that the probability of the null symbol in N-T is unchanged, P ptr i (∅|x 1:T , y 1:i−1 ) is set to 0 and the generation probability is scaled by 1 − P mdl i (∅|x 1:T , y 1:i−1 ) where P mdl i is the original model distribution before interpolation.
For better flexibility, an out-of-list (OOL) token is included in Y tree i indicating that no suitable word piece can be found in the set of valid word pieces.To ensure that the final distribution sums to 1, the generation probability, P gen i , is scaled as , and the final output can be calculated as shown in Eqn.(3).
where dependencies, y 1:i−1 , x 1:T , are omitted for clarity.P mdl (y i ) represents the output distribution from the standard end-to-end model.

A. Biasing-driven LM discounting (BLMD) for TCPGen
Log-linear interpolation is often used as a technique to incorporate an external LM via SF.Define the source domain data as the text of the training data for the end-to-end model, and the target domain data as the data used to train an external LM such that it generates better probability estimates for the test data.The LM discounting is defined as Eqn.(4).
where P mdl (Y ) is the probability from the end-to-end system, P src (Y ) is the probability of the source domain LM and P tgt (Y ) is the target domain LM probability.Extending this idea to contextual biasing with TCPGen, BLMD can be applied as shown in Eqn.(5).
where P ptr is the TCPGen distribution, and the same source and target LMs are used for both distributions, but with different sets of hyper-parameters α 1 , β 1 and α 2 , β 2 tuned on the validation set.

IV. GNN TREE ENCODINGS FOR TCPGEN
While looking ahead into future branches of the paths being searched on the prefix tree is greatly beneficial to the correct prediction of the generation probability, node representations in standard TCPGen only contain information about the word piece on that node.To achieve lookahead functionality, GNNs are used to encapsulate future branch information into each node representation.The pipeline of applying GNN encodings in TCPGen is shown in Fig. 3.
The word piece prefix-tree is first encoded with a GNN to obtain encodings associated with each node.Then, the tree with GNN encodings is used by TCPGen, where the key and value for the TCPGen distribution are computed using the encoding of nodes in the set of valid word pieces, in place of word piece embeddings as shown in Eqn.(6).
where W V and W K are parameter matrices, and h gnn is the GNN node encoding obtained using different types of GNN.This paper explores three different types of GNNs, namely the tree-RNN, GCN (including its variant, GCNII) and GraphSAGE with max pooling, together with combinations of GCN and GraphSAGE as two complementary types of GNN.Details of GNN structures applied in TCPGen together with modifications are described in the following sections.

A. Tree-RNN
Tree-RNN recursively encodes the tree from leaf nodes to the root using an RNN structure.Specifically, at node n j which contains child nodes n 1 , ..., n k , ..., n K , the vector representation of n j can be written as Eqn.(7).
where h trnn n k is the vector representation of node n k , and y j is the embedding vector of the word piece of node n j .W 1 and W 2 are parameter matrices jointly optimised with the ASR system by allowing gradient back-propagation through h trnn n k .In this way, each node recursively encodes information from its child nodes, such that the information of the entire branch rooted from it can be incorporated in the node encoding h trnn n k .Before the forward pass of the main ASR model, the encoding of each node is obtained by applying Eqn.(7) recursively from leaf to root.Then, for the same example shown in Fig. 2, at the node of Tur, node encodings h trnn in_ and h trnn n are used to calculate the TCPGen distribution and h ptr i .Therefore, if Turner appears in the utterance, TCPGen is aware of this entire word as early as in the encoding of Tur.Such lookahead functionality achieves a more accurate prediction of the generation probability to determine when contextual biasing is needed.
Although Tree-RNN achieves the lookahead functionality, it uses a rather simple RNN structure to encode the information of all succeeding nodes on that branch into a single vector representation.Therefore, more powerful and flexible GNN encodings are explored in order to improve performance.

B. Graph Convolutional Network (GCN)
As an alternative method to Tree-RNN, GCN is applied for tree encodings to achieve better node representations with controllable lookahead distance.GCN is a multi-layer network where each layer computes the encoding of a node as a function of its neighbours based on the graph Laplacian matrix.Each layer of GCN conducts one message passing from immediate neighbouring nodes.For a GCN with L layers, the encoding of a node covers information from a node that is an L-hop ahead on branches rooted from it.
Specifically, define H gcn (l) = [h gcn n1 (l), ..., h gcn n N (l)] as the node encoding matrix of layer l whose rows h gcn nj (l) are the encoding of the n j nodes, the GCN layer computation is where W (l) is the parameter matrix of l, f (•) is the activation function, Â = A + I N is the adjacency matrix with self-loops to enable the information of the current node to be included in the node representation, and D is the degree matrix of Â.
A specific form of normalised graph Laplacian [43], is used to address the vanishing/exploding gradient problem.Note that as only future branch information is needed in TCPGen, Â only contains edges that lead to child nodes, and hence D is computed based on this modified Â. TCPGen then takes the node encodings of the final layer, H gcn (L), to compute key and value vectors in the same way as tree-RNN.As a practical consideration for deep networks in general, residual connections and layer normalisation are added for any GNNs with multiple layers.
Although lookahead with a configurable scope can be achieved by varying the number of layers L, recent research found that the performance of GCN starts to degrade with more than three layers.Apart from the fact that deeper networks are more difficult to train, it was pointed out that the representations of the nodes in GCN are inclined to converge to a certain value and hence become indistinguishable, which is referred to as the over-smoothing problem.One promising method to address this problem is to build a shortcut directly linking to the first GCN layer to ensure a certain fraction of the final representation comes from the current node itself.Thus, GCNII is also investigated in this paper with each layer computed in Eqn (10): (10) where H gii (l) is the l-th layer output of GCNII, hyperparameter α scales the shortcut to the first layer, and W β (l) is the parameter matrix defined as: where β l is a layer-dependent hyper-parameter which is ln(1/l + 1) in this paper.Although GCNII has shown improved performance with deeper GCN of more than 4 layers, the maximum length in our biasing list is usually less than 10.With this depth for tree structures, the over-smoothing problem is less of a concern compared to the best structure of GCNII with 64 layers, and the network complexity is more problematic.Therefore, a simple parameter-sharing scheme is also proposed in this paper for deep GNNs to reduce network complexity.Specifically, the parameter matrices in the first K layers are shared: In contrast to other complex graph structures, message-passing operations from child nodes to the root in a tree were similar across different layers.Therefore, having the same weight matrix representing this process effectively reduces the model complexity to a degree that is adequate for tree encoding.
In particular, K = L − 1 for GCN in this paper so that there are effectively two layer parameters to be trained while maintaining the depth of the network.The first K layers act as universal message passing and the last layer performs a final information aggregation from neighbouring nodes.

C. GraphSAGE with Max Pooling
Node encodings for Tree-RNN and GCN are based on summation, whereas previous research [56], [57] has found that using max pooling also achieves competitive performance for word representation based on subword units.Hence, as an alternative GNN structure, GraphSAGE with a max pooling aggregator function is studied in this paper.GraphSAGE is a multi-layer GNN with each layer performing an information aggregation over a sampled set of child nodes followed by an update to the representation of the current node.Although one of the innovations in GraphSAGE is fixed-size sampling, as the training time biasing list is already a sampled subset from the full biasing list, the sampling of GraphSAGE is omitted in this paper.The computation of each layer is ) where max(•) and Concat(•) denote the element-wise max pooling and concatenation operators, N j is the set of child nodes of node n j .Although slightly better than GCN, Graph-SAGE also degrades when adding more layers.Therefore, it is proposed in this paper to apply parameter-sharing for Graph-SAGE, where both TCPGen with GNN encodings still achieves high efficiency in handling large biasing lists.In training, with a large biasing list of 1000 words, TCPGen with a tree-RNN was 3.5 times slower than the standard AED or N-T model, with a negligible increase in space complexity.Among the three GNNs, GCN achieved the highest efficiency for training as its computation can be parallelised most, whereas the recursive computation in tree-RNN and the max-pooling in GraphSAGE hinder their training speed respectively.As a result, GCN in training is 2.5 times slower than the standard AED model, while GraphSAGE is 3 times slower.Moreover, by generating GNN encodings offline before decoding once the biasing list is available, the time and space complexity during inference is close to the standard AED or N-T for biasing lists of thousands of words.

D. Combination of GNN Encodings
Combinations [29], [41] of GCN and GraphSAGE are explored in this paper for tree encodings, as they are conceptually complementary.GCN adopts a spectral approach where the graph Laplacian is used to aggregate information, while GraphSAGE directly exploits the graph structure and performs a max-pooling aggregation.To exploit the complementarity between the two GNNs, both additive and multiplicative combination methods are investigated here.
Additive combination performs a weighted sum of node encodings from GCN and GraphSAGE as the final node encodings before being processed by TCPGen (see Eqn. ( 14)).
U 1 and U 2 are two parameter matrices to rearrange the orders of the element in each GNN encoding as they may not encode information in the same order [41].Note that α gcn + α sage = 1 are weights that are either fixed or predicted via attention.The attention calculation is performed on each node n separately, as shown in Eqn.(15).
where q i is the same query vector used to calculate the TCPGen distribution.In this way, different sets of weights are assigned to different nodes at different decoder steps.The multiplicative combination is performed via a low-rank approximation of the bilinear pooling method.The combination is shown in Eqn. ( 16) where U 1 , U 2 and U 3 are parameter matrices, and ⊙ is the element-wise product between two vectors.Following [41], a shortcut connection from each individual GNN encoding was provided to form the final combined encoding for TCPGen, as shown in Eqn.(17).
where U 4 and U 5 are another two parameter matrices.

A. Data
Experiments were conducted on two distinct datasets, namely the LibriSpeech audiobook corpus and the AMI meeting data, where the latter followed the audio-visual speech recognition pipeline.The LibriSpeech corpus [32], consists of 960 hours of read English from audiobooks, was used for evaluation purposes, with the dev-clean and dev-other sets used for validation, while test-clean and test-other were employed for evaluation.To investigate the impact of critical hyper-parameters, small-scale experiments were carried out using the train-clean-100 subset as the training set.Moreover, models trained on the LibriSpeech dataset were fine-tuned and evaluated on the AMI dataset in accordance with the approach proposed in [35].
The AMI meeting corpus [25] consists of 100 hours of meeting recordings involving 4-5 individuals, which were divided into the train, dev, and eval sets.To demonstrate the effectiveness of contextual biasing on data from another domain with limited training resources, a subset comprising 10% of the utterances from the AMI training set corresponding to 8 hours of audio was used to fine-tune the models previously trained on the LibriSpeech 960-hour data.There were 14 meetings from the dev set and 8 meetings from the eval set accompanied by slides that were selected to formulate the new test set for the audio-visual contextual ASR pipeline.
The 80-dim FBANK features at a 10 ms frame rate concatenated with 3-dim pitch features were used as the model input.SpecAugment [30] with the setting (W, F, m F , T, p, m T ) = (40, 27, 2, 40, 1.0, 2) was used without any other data augmentation or speaker adaptation.

B. Biasing list selection
To simulate real-world scenarios in LibriSpeech, the complete list of rare words comprising 200,000 distinct words as suggested in [10] was used.The full rare word list consisted of over 60% out-of-vocabulary (OOV) words that were absent from the LibriSpeech speech training set.Consistent with the method proposed in [10], the biasing lists were extracted by identifying words from the full rare word list that appeared in the reference transcription of each utterance, followed by the addition of a specific number of distractors.During inference, 10.3% of the word tokens in the test sets belonged to the full rare word list.
Fig. 4 shows the visual-grounded contextual ASR pipeline for AMI that utilises optical character recognition (OCR) output for slides.The Tesseract 4 OCR engine, equipped with LSTM models 2 , was first applied to the slides of each meeting series (e.g., ES2011[a-d]).Subsequently, distinct word tokens were extracted from the OCR output text files, and words in the full rare word list, which also occurred fewer than 100 times in the AMI training set, were selected to form the biasing list for that particular meeting series.These meetingspecific biasing lists were then used for the recognition of all utterances in that meeting series.The sizes of the biasing lists vary between 175 to 576, and the total number of word tokens covered by these lists was 1,751 out of 112,110 word tokens (1.5%).As shown in Fig. 4, these words mainly consisted of highly valuable content words whose accurate recognition was crucial for comprehending the utterance.Therefore, although the biasing lists had a minor impact on the overall word error rate (WER), they were essential for improving the recognition performance of critical words.Details of the meetings with slides and the extraction pipeline are available 3 .

C. Model specification
The ESPnet toolkit [31] was used for developing the systems.A unigram word piece model comprising 600 unique word pieces was created on the LibriSpeech data and was applied directly to the AMI data.Both the AED and N-T models employed a Conformer [27]  LM shallow fusion and BLMD were implemented using a two-layer LSTM-LM with 2048 hidden units trained on the 800 million-word text training corpus of LibriSpeech as the target domain LM for the LibriSpeech experiments.Each source domain LM, trained on the text of the audio training data, used a single-layer LSTM with 1024 hidden units.It is worth noting that each LM had the same word pieces as the corresponding ASR system.

D. Training specifications
During training, biasing lists with 1000 distractors were used for the experiments conducted on the LibriSpeech dataset, while 100 distractors were used for the AMI data.To create these lists, biasing words were selected from the reference transcription, and additional distractors were added.To prevent the AED model from becoming overly confident about TCPGen outputs, a dropout-inspired technique was employed during training, as described in [10].Specifically, biasing words that were presented in the reference transcription had a 30% probability of being removed from the biasing list.The Conformer was optimised using the Noam optimiser [33].Additionally, the hyper-parameters for the BLMD model were determined based on the respective dev sets for each dataset.

E. Evaluation metrics
In addition to WER, the rare word error rate (R-WER) was used to evaluate the system performance on biasing words that were "rare" in the training data for that system.R-WER is the total number of error word tokens that belong to the biasing list divided by the total number of word tokens in the test set that belong to the biasing list.Insertion errors were counted in R-WER if the inserted word belonged to the biasing list, in contrast to [38].In addition, OOV WER was also computed in the same way as R-WER but for OOV words in the biasing list.There are altogether 443 such words in LibriSpeech testclean and test-other sets.Moreover, the slides' rare word error rate (R s -WER) is reported for the AMI experiments calculated in the same way as R-WER, but for the rare words in slides.Insertions of slides biasing words were included in R s -WER.
As rare words are scarce in the dataset, significance tests were performed to ensure that the improvements found by using GNN encodings were statistically significant.Specifically, independence was assumed at the book level for LibriSpeech and at the speaker level for AMI.The alternative hypothesis was defined as the GNN system performing better than the standard TCPGen (i.e.one-tailed sign test).

A. LibriSpeech train-clean-100 Results
Experiments were first performed on the train-clean-100 set to find the best-performing GNN setups.The first investigation was on the number of GNN layers which determines how much lookahead is needed.The plots of WER and R-WER against the number of layers for N-T are shown in Fig. 5, and those for AED are shown in Fig. 6.Note that parameter tying only had effects when the layer number exceeded two.
For N-T, both GCN and GraphSAGE had a clear trend that the performance tended to degrade when the number of layers was increased beyond three and degraded significantly when it reached twelve layers.GraphSAGE inherently suffered less from this problem than GCN as the max pooling operator enabled the gradient for salient nodes to remain at its original value instead of being over-smoothed [42].This degradation was mitigated by either using the GCNII structure or the parameter-tying scheme proposed in this paper.The best performance was achieved by 6-layer systems using WER as the selection criterion, where GCN and GraphSAGE with tied parameters achieved better performance than GCNII.
Similar observations were found with AED, where GCN and GraphSAGE with tied parameters achieved slightly better performance than GCNII.However, the best performance was achieved using 2-layer GCN and 3-layer GraphSAGE models.This difference is mainly caused by the label-synchronous nature of AED, where the model required knowledge for predicting only the next token, and information further into the future is less useful for this prediction in contrast to N-T.
The best-performing GCN and GraphSAGE with tied parameters were used for model combination.For additive combination, different fixed combination weights gave results shown in Fig. 7, together with dynamic combination weights.As a result, dynamic combination performed better than most fixed-weight combinations, whereas the best performance was still obtained by the fixed-weight combination for both AED and N-T, with α sage = 0.2.By examining the predicted dynamic weights, it was found that unless at the root node of the tree, the dynamic weight almost completely ignored GraphSAGE encodings and hence suffered from mode collapse, and hence GraphSAGE encodings were not properly trained.In fact, since GCN is usually better at handling nearfuture information, the model learnt to only rely on GCN.

B. LibriSpeech 960-hour Results
The main results for LibriSpeech full-set experiments were summarised in Table I for AED and in Table II for N-T respectively.Compared to the standard TCPGen, all three types of GNNs achieved significantly better WER and R-WER (at p values smaller than 0.01) on both test sets for both AED and N-T.In particular, using multi-layer GNN, such as GCN and GraphSAGE, achieved clearly better performance to tree-RNN on AED, whereas the performance difference among those three GNN types was less obvious on N-T.The bestperforming GNN structure on both AED and N-T was GCN with tied parameters.For AED, GCN achieved 16% relative WER reduction with a 20% R-WER reduction on the testclean set and 13% relative WER reduction with 19% relative R-WER reduction on the test-other set (comparing row 4 to row 2 in Table I).For N-T, GCN achieved a 9% relative WER reduction with 21% relative R-WER reduction on the test-clean set, and a 7% WER reduction with a 17% relative R-WER reduction on the test-other set (comparing row 4 to row 2 in Table II).With similar levels of R-WER reduction, AED achieved a higher reduction in WER.As analysed in [36], TCPGen produced a much more confident prediction of P gen with AED than N-T, where the main reductions in overall WER were attributed to the reduction in R-WER.The improvements using GNN indicated that the GNN encoding improved the prediction of P gen , which was more beneficial for the overall WER in AED.
Both additive and bilinear combinations of GNN encodings achieved superior performance to individual GNN encodings with both AED and N-T models.For the AED model, the best performance was achieved by the additive combination with the best set of fixed weights previously found on train-clean-100 experiments, while the bilinear combination achieved very similar performance to the additive one.This led to a total of 31% relative R-WER reduction compared to the standard TCPGen (comparing row 7 to row 2 in Table I), and a total of 56% relative R-WER reduction compared to the baseline Conformer AED model.The performance improvement with the GNN combination was smaller on N-T, with the best results achieved by the bilinear pooling combination.This resulted in a 30% relative R-WER reduction compared to the standard TCPGen (comparing row 9 to row 2 in Table II), and an overall 56% relative R-WER reduction compared to the baseline Conformer N-T model.
Finally, selected systems were evaluated with BLMD, where TCPGen with GNN encodings achieved further performance improvements.The best-performing system for AED was TCPGen with the additive combination of GNN encodings, which achieved an overall 66% relative R-WER reduction on the test-clean set and 58% reduction on the test-other compared to the baseline.For N-T, an overall 57% relative R-WER reduction was achieved on test-clean, and 54% was achieved on test-other.Moreover, the OOV-WER had the same reduction pattern as R-WER for both AED and N-T.The best AED and N-T systems with combined GNN encodings for TCPGen reduced the OOV-WER by over 60%.Notably, BLMD was particularly beneficial for GNN encodings in AED systems, reducing the OOV-WER to 1/3 of the baseline value.This confirmed that even though GNN required more parameters to encode the prefix-tree, TCPGen still generalises well to unseen branches (i.e.OOV words) on the tree.The performance of various GNN encodings for TCPGen was further integrated into the audio-visual contextual ASR pipeline, and results were shown in Table III.In general, reductions in R-WER had a much smaller influence on the overall WER than with LibriSpeech, as rare words only occupied a much smaller portion.The findings were consistent with Lib-riSpeech, where the best-performing combination methods for AED and N-T were additive and bilinear respectively.However, the relative R-WER improvement was smaller compared to that in the LibriSpeech experiments, as the best-performing system here already had an R-WER that was very close to the overall WER whereas the R-WER in LibriSpeech was still twice as high as the overall WER.Compared to the baseline standard systems, TCPGen in AED achieved over 35% relative R-WER reduction using the best combined GNN encodings, while TCPGen with the best combined GNN encodings in N-T achieved over 30% relative R-WER reduction both with and without BLMD.

VII. CONCLUSION
This paper proposes three different types of GNN encodings in TCPGen for end-to-end contextual ASR, including tree-RNN, GCN and GraphSAGE.GNN encodings gave a lookahead functionality by incorporating future information on branches starting from the current node.Combination methods that take advantage of the complementarity between GCN and GraphSAGE were also explored.Experiments on LibriSpeech and AMI following an audio-visual contextual ASR pipeline showed consistent and significant WER and R-WER improvement for both AED and N-T systems using GNN encodings.The best combined GNN encodings achieved over 60% R-WER and OOV-WER reductions compared to the baseline standard systems.

Fig. 3 .
Fig. 3. Pipeline of encoding prefix-tree with GNN for TCPGen.The prefix-tree is first encoded by a GNN, and the GNN-encoded tree is used by TCPGen to generate the TCPGen distribution where key and value vectors are GNN-based node encodings.The lookahead content for a 2-layer GCN as an example is denoted in {}, i.e. for node ri, h ri covers information about an ne_ and o_ are separately shared across all layers respectively.

Fig. 4 .
Fig. 4. Illustration of the visual-grounded contextual ASR pipeline for the meeting series ES2011 containing meetings ES2011a to ES2011d.
encoder, which comprised 16 conformer blocks comprising 4 attention heads of size 512.The AED used a single-layer LSTM decoder of size 1024 and a location-sensitive attention mechanism featuring 4 heads of size 1024.The N-T, on the other hand, employed a 1024-dimensional predictor and a joint network consisting of 2 OCR implementation at https://github.com/tesseract-ocr/tesseract 3 https://github.com/the-anonymous-bs/AMIslidesbiasing a single fully-connected layer of size 1024.GNN encodings for LibriSpeech train-clean-100 experiments used 256-d GNN encodings, whereas the LibriSpeech full-scale experiments used 1024-d for GNN encoders.

TABLE I WER
AND R-WER ON LIBRISPEECH TEST-CLEAN AND TEST-OTHER SETS USING CONFORMER AED AND TCPGEN TRAINED ON LIBRISPEECH 960-HOUR DATA WITH VARIOUS GNN ENCODINGS.NOTE THAT BOTH GCN (2-LAYER) AND GRAPHSAGE (3-LAYER) ADOPTED PARAMETER TYING.GCN AND ADDITIVE COMBINATION (FIXED) WERE SELECTED AS THE REPRESENTATIVE GNNS TO BE EVALUATED WITH BLMD.TABLE II WER AND R-WER ON LIBRISPEECH TEST-CLEAN AND TEST-OTHER SETS USING CONFORMER N-T AND TCPGEN TRAINED ON LIBRISPEECH 960-HOUR DATA WITH VARIOUS GNN ENCODINGS.NOTE THAT BOTH GCN (6-LAYER) AND GRAPHSAGE (6-LAYER) ADOPTED PARAMETER TYING.GCN AND BILINEAR COMBINATION WERE SELECTED AS THE REPRESENTATIVE GNNS TO BE EVALUATED WITH BLMD

TABLE III WER
(Rs-WER) ON AMI TEST SET USING THE AUDIO-VISUAL CONTEXTUAL ASR PIPELINE WITH 10% OF AMI TRAINING SET.BASELINE IS THE STANDARD AED AND N-T SYSTEMS, AND AMI BASELINE (THE FIRST ROW) SPECIFICALLY REFERS TO THE STANDARD SYSTEM TRAINED FROM SCRATCH ON THE FULL AMI TRAINING SET.