Context Conditioning via Surrounding Predictions for Non-Recurrent CTC Models

Connectionist Temporal Classification (CTC) loss has become widely used in sequence modeling tasks such as Automatic Speech Recognition (ASR) and Handwritten Text Recognition (HTR) due to its ease of use. Recent sequence models that incorporate CTC loss have focused on speed by removing recurrent structures, thereby losing important context information. This paper presents an extensive study of the Contextualized Connectionist Temporal Classification (CCTC) framework, which induces prediction dependencies in non-recurrent and non-autoregressive neural networks for sequence modeling. CCTC allows the model to implicitly learn a language model by predicting neighboring labels via multi-task learning. Experiments on ASR and HTR tasks in two different languages show that CCTC models improve over CTC models by 2.2-8.4% relative without incurring extra inference costs. We also find that higher-order context information can help the model produce better predictions.


I. INTRODUCTION
Context has been extensively proven useful for various kinds of sequence modeling problems, such as Automatic Speech Recognition (ASR) [1], Text-to-Speech (TTS) [2], Handwritten Text Recognition (HTR) [3], Neural Machine Translation (NMT) [4], and Language Modeling [5]. In NMT, the same word can have different meanings in different contexts. An NMT system must be aware of the context of an input word to infer its correct meaning. Moreover, when producing an output translation, the model must also produce words that are coherent together. Context is usually modeled with contextual representations and context-aware inference algorithms, covering both the input and output spaces. For ASR and HTR, context awareness is used for making consistent predictions, reducing misspellings, and handling ambiguous input samples [6], [7], [8]. Nevertheless, encapsulating contexts usually comes with additional computational complexity, especially for temporal contexts, which are typically processed sequentially.
Incorporating contextual information into models can be done using recurrent models and/or contextual hidden representations. Since strong dependencies between letters within target sequences are usually found in sequence modeling, predictions made by recurrent approaches such as autoregressive (AR), iterative, and beam search decoding are generally better than predictions that are produced independently [6], [9], [10]. However, recurrent models have to predict letters sequentially, since they condition each prediction on the previous ones. This sequential nature can be detrimental to inference time because the computation cannot be done in parallel. On the other hand, non-recurrent models can produce context-dependent predictions based on contextualized hidden representations, which are distilled from recurrent models [11], [12] or trained with context-dependent objective functions [6], [13], [14]. Although these methods do not directly control the predictions in the output layer, improvements over context-independent decoding can still be observed while retaining parallel decoding capabilities.
Connectionist Temporal Classification (CTC) has been commonly used for training non-autoregressive (NAR) ASR and HTR models due to its effectiveness and efficiency [15], [16]. CTC estimates the probability of the alignment between frame-level predictions and character-level ground truths without the need for expensive frame-level labels. To make the computation feasible, CTC assumes independence between the frame-level outputs. Such an assumption limits the model, especially in ambiguous cases that require contextual information to resolve. Early works typically pair CTC with recurrent networks, which helps alleviate this drawback. However, recent works use CTC with non-recurrent models instead to maximize throughput [17]. Even though this combination worsens the correctness of the predictions, the runtime improvement is often worth the trade-off in latency-sensitive situations. Many works have tried to re-introduce context into these models [9], [18], [19].
Previously, in [6], we proposed Contextualized CTC (CCTC), a multi-task learning framework with a context-dependent training objective, to incorporate contexts into non-recurrent and non-autoregressive (NAR) ASR models. CCTC increases dependencies between predictions by mitigating the conditional independence assumption of the regular CTC loss [15] in a way that preserves parallel decoding capability. Concretely, a CCTC model predicts the left and right letters as well as the middle letter. We showed that CCTC models produce promising results, especially when dealing with ambiguous predictions. However, we only used CCTC models that considered adjacent letters and left larger context sizes uninvestigated. Moreover, the effectiveness of CCTC on other sequence modeling tasks is unclear, especially tasks with modalities different from ASR, such as handwritten text recognition (HTR).
Non-recurrent NAR CTC-based models are also increasingly adopted for HTR with the aim of achieving lower latency [20], [21]. These models are fast and generally perform impressively. Nonetheless, they suffer from the same disadvantages as non-recurrent NAR ASR models. The lack of dependencies can hinder non-recurrent NAR text recognition models from correctly recognizing letters in ambiguous cases. Models with context-dependent prediction usually perform better in such situations [8], [22], [23]. We believe that CCTC has the potential to improve HTR performance in the same manner as it has for ASR.
In this work, we present an extension of [6] that searches for optimal context sizes and studies the generalizability of CCTC. We conduct experiments on four corpora covering two different tasks, ASR and HTR, and two different languages, Thai and English. We chose a Thai YouTube dataset [6] for Thai-English ASR and LibriSpeech [24] for English ASR. As for HTR, we used BEST [25], [26] for Thai and IAM [27] for English. Experimental results show that CCTC models outperform the baseline CTC models, especially when no external language models are applied. A larger context further improves the performance of the models. Further analysis using character-level perplexities shows that, during inference, CCTC models give higher priority to language-related information (in other words, context) than regular CTC models do.
To summarize, we make the following contributions.
• We reduce the error rates of CCTC systems by increasing their context sizes, which were limited to one in our previous publication.
• We demonstrate the ability of CCTC to generalize across tasks by showcasing its performance on HTR in addition to ASR, which we have already established as effective in our previous work.
• We formulate a heuristic that effectively reduces the context weight search space, which grows significantly with the context size. This eases hyperparameter tuning when training CCTC with large context sizes.
• We present analyses of CCTC behaviors, including the relation between CCTC and language modeling, the trade-off between training costs and performance gains, and the impact of forced alignments.

The rest of the paper is organized as follows. We highlight related works, including the regular CTC, in Section II. We present the details of our method, the proposed CCTC, in Section III. We describe our experimental setups in Section IV, show the results in Section V, and provide discussions in Section VI.

II. RELATED WORKS
This section describes how traditional ASR and HTR models handle contexts, how they adopt CTC, and how they deal with the shortcomings of context-independent training.

A. CONTEXT FOR ASR
Context modeling has always been an important component of ASR models. Traditional HMM-based ASR models comprise three components: an acoustic model (AM), a lexicon model, and a language model (LM), which model context on different levels. The AM is typically based on context-dependent (CD) units, which model several acoustic units together. On the other hand, the lexicon and LM focus on the words and the grammatical structure of the sentence, disambiguating homophones and imperfections in the pronunciations [28].
For models based on deep learning, CTC has been proposed for training end-to-end models. CTC typically produces letters instead of CD units. Context dependencies are modeled implicitly using recurrent hidden states [15]. Transducers [29] and sequence-to-sequence models [30], introduced later, explicitly model context by making predictions sequentially based on previous outputs. However, sequential prediction can become a computational bottleneck as the model size increases. Language model rescoring or beam search can be further introduced to reinforce context modeling, at the cost of additional computation.
Recently, end-to-end ASR models have become enormous and require extensive computational resources [31]. Community interest has increasingly shifted towards non-recurrent and NAR models, in which no sequential decoding or post-processing is applied. Though non-recurrent CTC models have low decoding latency, they suffer from performance degradation. Potential remedies include rescoring or iterative decoding, where successive refinements are performed on the previous outputs [9], [17], [32].

B. CONTEXT FOR HTR
HTR frameworks heavily rely on contexts as they have to extract a sequence of dependent characters from an image. One of the early approaches is an HMM-based framework, which is analogous to lexicon-free ASR models [33], [34].
The combination of CTC and RNNs was first adopted as an alternative to the existing HMM frameworks [16], [35], [36]. As CTC lacks dependencies, it was used with context-rich models such as RNNs and multi-dimensional RNNs [37], [38], [39], [40]. Recently, non-recurrent models have shown promising results while reducing latency. This has prompted a wave of research based on non-recurrent models, as they have no computational bottlenecks [20], [21], [41].

C. CONNECTIONIST TEMPORAL CLASSIFICATION
CTC [15] is an alignment-free objective function for sequence-to-sequence tasks such as ASR and HTR, where the input sequence length, T, is greater than or equal to the output sequence length, U. Given a set, A, that contains all letters allowed in the language, the goal of CTC models is to produce a variable-length letter sequence, y = (y_1, y_2, ..., y_U) : y_u ∈ A, from an input sequence, x = (x_1, x_2, ..., x_T) : x_t ∈ R^M. In order to do so, an extra blank symbol, ϵ, is introduced for handling blank spaces in images or silences in audio. CTC models produce a frame-level intermediate output path, π = (π_1, π_2, ..., π_T) : π_t ∈ A′ = A ∪ {ϵ}. A letter that spans many consecutive input frames will cause consecutive duplicate outputs. Blank tokens can also be output in between double letters to distinguish them from consecutive duplicates. Finally, the output path, π, is post-processed into an inferred sequence, y = B(π), using a mapping function B : A′ → A, which merges consecutive duplicates and removes the remaining blank tokens.
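The mapping B can be sketched in a few lines of Python (the blank symbol and function name here are ours, for illustration only):

```python
# A minimal sketch of the mapping function B: merge consecutive duplicates,
# then drop the blank tokens. "EPS" is our stand-in for the blank symbol.
EPS = "<b>"

def ctc_collapse(path):
    """Apply B to a frame-level path and return the inferred sequence."""
    merged = [tok for i, tok in enumerate(path) if i == 0 or tok != path[i - 1]]
    return [tok for tok in merged if tok != EPS]

# The blank between the two e's keeps the double letter from being merged:
# f f r <b> e e <b> e  ->  f r e e
print(ctc_collapse(["f", "f", "r", EPS, "e", "e", EPS, "e"]))
```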
CTC models are trained using the CTC loss, which is the negative log probability of all valid paths for the ground truth. The idea is to strengthen the probability of any path that can be mapped to the target sequence instead of relying on a ground truth alignment. CTC assumes conditional independence between tokens within a path to ease the calculation. Thus, the probability of a path, P(π|x), can be factorized as a product of the probabilities at each position, as shown in (1):

P(π|x) = ∏_{t=1}^{T} P(π_t|x).  (1)

The CTC loss is then computed over all valid paths, as shown in (2):

L_CTC = −log Σ_{π ∈ B⁻¹(y*)} P(π|x),  (2)

where the inverse mapping, B⁻¹, is used to find the valid paths, and y* = (y*_1, y*_2, ..., y*_U) is the ground truth sequence. The independence assumption encourages the model to isolate its predictions. Consequently, CTC models mostly produce blank tokens, π_t = ϵ, and only predict the actual alphabets, π_t ∈ A, when they are extremely confident. Thereby, the actual alphabets in a path have low dependencies, as they are surrounded by non-informative blanks, making context conditioning without external tools difficult.
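For a tiny example, the sum over valid paths in (2) can be enumerated by brute force; the per-frame distributions below are invented purely for illustration:

```python
# A brute-force illustration of Eqs. (1)-(2): P(path|x) factorizes per frame,
# and the CTC loss is the negative log of the summed probability of every
# path that collapses to the ground truth.
import itertools
import math

EPS = "-"
probs = [  # made-up per-frame distributions over {EPS, "a", "b"}, T = 3
    {EPS: 0.6, "a": 0.3, "b": 0.1},
    {EPS: 0.2, "a": 0.7, "b": 0.1},
    {EPS: 0.5, "a": 0.3, "b": 0.2},
]

def collapse(path):
    merged = [t for i, t in enumerate(path) if i == 0 or t != path[i - 1]]
    return "".join(t for t in merged if t != EPS)

def ctc_loss(target):
    total = 0.0
    for path in itertools.product(*[p.keys() for p in probs]):
        if collapse(path) == target:
            # Eq. (1): conditional independence across frames
            total += math.prod(p[t] for p, t in zip(probs, path))
    return -math.log(total)  # Eq. (2)

print(round(ctc_loss("a"), 4))
```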

D. CONTEXT-DEPENDENT CTC
Incorporating context dependencies into CTC models has mostly been based on subword modeling, such as Byte-Pair Encoding [42], WordPiece [43], and context-dependent output units [44], which are compositions of several letters. A natural extension towards contextualized CTC is to use these subwords as the alphabet. The transcriptions are pre-tokenized based on the available output units and used as ground truths for training CTC models. Moreover, different letter segmentations, such as different letter-grams [3] or different WordPiece sizes [45], can be used together to train a single CTC model via multi-task learning, capturing different scales of context. Unlike our work, the distinct prediction heads have no relationship between them, because they are trained by dedicated CTC losses on different pre-tokenized targets.
CTC has also been extended to model inter-dependencies between output letters. Gram CTC [46] introduces a modification of the CTC loss that aggregates the different possible segmentations of the output tokens on-the-fly. The recurrent transducer [47] autoregressively wraps the posteriors of a CTC encoder with a language model and trains both modules together. Imputer [32] iteratively predicts missing letters in previous outputs. The Imputer model is trained using a modified CTC that is suitable for partial transcriptions, which mimic incomplete predictions. However, this line of work explicitly models inter-dependencies and is very distinct from our work, which implicitly encourages context dependencies for CTC. The works closest to ours are [48] and [49], which predicted future ground truths for recurrent hybrid ASR models.

III. METHODOLOGY
In this section, we present the CCTC framework, which simultaneously predicts the middle and surrounding letters. We use CTC and cross entropy (CE) losses for training the main and context predictions, respectively. CE training generally requires an alignment between input frames and output tokens, which is not available in the CTC framework. The main contribution of CCTC is an algorithm that obtains target labels for the CE loss on-the-fly. We provide further details of CCTC training and context label generation in III-A and III-B, respectively.

A. CONTEXTUALIZED CONNECTIONIST TEMPORAL CLASSIFICATION LOSS

CCTC softens the strength of independent predictions by implicitly introducing context conditioning to non-recurrent NAR models without the need for sequential processing. CCTC allows the model to predict the output as well as estimate the contexts for its own predictions. The context estimation raises the awareness of surroundings, which helps mitigate the interference between consecutive outputs and improves the coherency of the predicted sequence. The contexts are predicted simultaneously with the actual prediction in a multi-task manner, since we do not want the model to wait for the previous outputs as in sequential decoding. An overview of our framework is depicted in Fig. 1. A CCTC model has three groups of prediction heads: middle, left context, and right context. Given an input x_t, the output from the middle head (π_t) is the main output found in a typical CTC model. The left (l_t) and right (r_t) characters, predicted by the context heads, are the likeliest contexts for the middle letter. We gather the outputs from the middle head to form a single sequence for CTC training. The context heads are separately optimized for each input frame using the context loss.
The context loss is the negative log probability of the context labels, which is the CE loss against the frame-level context references. We denote l*_t and r*_t as the labels for the left and right context heads, respectively. The context loss, L_CT, can be defined as shown in (3):

L_CT = −Σ_{t=1}^{T} [α log P(l_t = l*_t | x) + β log P(r_t = r*_t | x)],  (3)

where α and β are the weights for the left and right contexts. In addition to contextualizing adjacent letters, the context size can be further increased to any arbitrary size. For a context size of K ∈ Z, the CCTC(K) model predicts the K consecutive letters to the left and the K consecutive letters to the right. We introduce the superscript (k) for l*_t, r*_t, α, and β to indicate that l*_t^(k) and r*_t^(k) are the k-th left and right labels for the input x_t, respectively. The α^(k) and β^(k) are the weights for l*_t^(k) and r*_t^(k). We can construct the general form of the context loss as follows:

L_CT = −Σ_{t=1}^{T} Σ_{k=1}^{K} [α^(k) log P(l_t^(k) = l*_t^(k) | x) + β^(k) log P(r_t^(k) = r*_t^(k) | x)].  (4)

Finally, the training loss for CCTC frameworks is the summation of the CTC loss and the normalized context loss, as shown in (5):

L_CCTC = L_CTC + L_CT / Σ_{k=1}^{K} (α^(k) + β^(k)).  (5)
For inference, only the middle head is kept. Thus, CCTC models have the same runtime as the base model, which is especially important for low latency applications. CCTC can also be incorporated into any model structure and decoding scheme without any additional changes since CCTC only affects the training stage. This is the main advantage of CCTC over other methods that also try to incorporate context.

B. ACQUIRING CONTEXT LABELS
The CTC algorithm is alignment-free, which means that there are no explicit frame-level ground truths for the context heads. Therefore, the frame-level labels for the context losses are obtained from the paths predicted by the middle head. In other words, the context heads aim to predict the outputs produced by the middle head for the neighboring frames. However, the model may learn little to no context information from learning to predict blank tokens. Thus, we opt to train the context heads with dense character supervision from the prediction, y = B(π), instead. Concretely, the contexts l*_t^(k) and r*_t^(k) are the k-th-nearest characters to the left and right of π_t that are not blank tokens or consecutive duplicates. The labels can be retrieved by conducting a naive search on a path. However, a naive search is computationally expensive, as it has a time complexity of O(T) for every position t, which results in a total of O(T²). Alternatively, we propose to obtain the labels using an efficient algorithm that operates in Θ(KT).
To reduce the time complexity of the label obtaining procedure, we propose to search on a dense path, h = (h_1, h_2, ..., h_L) : L ≤ T, instead of the usual CTC path, π. A dense path, h, is an intermediate result of applying B, in which all consecutive duplicates are already merged but blanks are not yet removed. In order to search on a dense path, we have to know where the surroundings of π_t are in h. To do so, we store the relation between a path, π, and a dense path, h, in an index list, p = (p_1, p_2, ..., p_T) : p_t ∈ [1, L], p_t ≤ p_{t+1}. An index p_t indicates that the letter h_{p_t} is derived from the path token π_t. As we know that h_{p_t} is the representative of π_t, we can directly conduct a naive search on h using the predetermined start position p_t. Since a dense path has no consecutive duplicates and contains only two categories of tokens, blanks and actual alphabets, we can obtain the k-th non-blank characters within 2K operations, regardless of the input length. In total, we can obtain the labels for an input of length T within a tight bound of Θ(KT). We use blanks as context labels for the edge cases where the dense path has insufficient context on either side. Since blanks have no other uses in context prediction, the model can learn to exclusively use blanks for no-context scenarios. We summarize the algorithm in Algorithm 1 and demonstrate the process with an example in Fig. 2.
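The label search can be rendered as follows (a Python sketch under our own naming; the paper's Algorithm 1 may differ in bookkeeping details):

```python
# A sketch of the Theta(KT) label search: build the dense path h (consecutive
# duplicates merged, blanks kept), record each frame's position in h, then
# walk outwards in h collecting the k-th non-blank neighbors.
EPS = "-"  # blank token

def context_labels(path, K):
    h, p = [], []                       # dense path and frame -> h index
    for tok in path:
        if not h or tok != h[-1]:
            h.append(tok)
        p.append(len(h) - 1)
    left = [[EPS] * len(path) for _ in range(K)]
    right = [[EPS] * len(path) for _ in range(K)]
    for t, pos in enumerate(p):
        k, i = 0, pos - 1               # scan left; at most 2K steps needed
        while i >= 0 and k < K:
            if h[i] != EPS:
                left[k][t] = h[i]
                k += 1
            i -= 1
        k, i = 0, pos + 1               # scan right
        while i < len(h) and k < K:
            if h[i] != EPS:
                right[k][t] = h[i]
                k += 1
            i += 1
    return left, right                  # blanks remain where context runs out

left, right = context_labels(["f", "f", "-", "r", "e", "e"], K=1)
print(left[0], right[0])
```

Because h alternates between at most one blank and a letter, each of the two scans touches no more than 2K positions, which gives the Θ(KT) total.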

C. MODEL INITIALIZATION
Since the labels for context heads are obtained from the main head predictions on-the-fly, optimizing context loss for random networks may cause training difficulties due to noisy labels. In our experiments, we used pretrained weights as the initialization to prevent the issue and reduce the training time. This also highlights the use case where one might choose to further improve an existing CTC-trained model by incorporating additional CCTC training afterwards.

IV. EXPERIMENTAL SETUPS
In this section, we present the experimental setups designed to highlight the effectiveness of the proposed CCTC model over the baseline regular CTC model. In order to show the generalizability of the proposed method, we evaluate CCTC on two sequence modeling problems, namely ASR and HTR, in two different languages, Thai and English.
The section is organized as follows. We describe the datasets in IV-A and provide the training details for each dataset in IV-B. The methods we used for selecting context loss weights are described in IV-C. The setups for LMs, metrics, statistical testing, and baselines are presented in IV-D, IV-E, IV-F, and IV-G, respectively.

A. CORPORA
For our experiments, we used two ASR corpora and two HTR corpora as follows.

1) ASR CORPORA
We investigated two kinds of ASR tasks, monolingual and code-switching (CS). Monolingual speech is spoken audio in which only one language is present. In contrast, many languages can be found within a single utterance in code-switched speech. Context is especially important for the code-switched dataset in order to produce coherent spellings. We used LibriSpeech [24] as the monolingual English corpus and Thai YouTube [6] as the Thai-English code-switching corpus. Both datasets have 16 kHz sampling rates and 16-bit depth audio. Table 1 summarizes the ASR corpora.
LibriSpeech [24] is a common ASR corpus for English. We used the train-clean-100 subset for small-scale experiments and the total 960 hours for large-scale comparisons. The evaluations were conducted on dev-clean and test-clean.
Thai YouTube [6] is a 200-hour speech corpus taken from public Thai podcast YouTube channels. The training subset

Algorithm 1 Acquiring Labels for Context Losses
Given: π (CTC path), K (context size). Result: l* (left context labels), r* (right context labels).

Fig. 2. Label generation procedure for the CCTC context losses. Given a path from the output of the middle head's softmax, the labels for the left and right heads can be found by searching on a dense path h. For frame index t, the first targets to the left and right are f and e; the second targets are o and e, respectively. The correspondences between π and y are color coded: a letter, y_i, is derived from the token, π_t, with the same background color.
of Thai YouTube contains both monolingual Thai (TH) and CS Thai-English utterances. As for the validation and testing sets, monolingual and CS utterances were separated into TH and CS subsets, respectively. Concretely, the dataset has one TH-CS training set, two validation sets, dev-th and dev-cs, and two testing sets, test-th and test-cs. Unlike the previous work [6], we included additional CS utterances in test-cs, tripling its size. All hyperparameter tuning was done on the combined validation set.
2) HTR CORPORA

We used two line-level handwritten text corpora: the Thai dataset, BEST, and the English dataset, IAM. An overview of the two corpora is presented in Table 2.
IAM is a standard English HTR dataset [27]. It is composed of grayscale line-level handwritten images from 657 writers. IAM has 79 characters in total, including 26 lowercase English letters, 26 capital English letters, 10 Arabic numerals, and 17 special symbols. There are several data splits for IAM. We chose the one with 6482 training images, 976 validation images, and 2915 test images, similar to [20]. We also followed the data preprocessing methods in [20], including grayscaling the images, normalizing the heights to 64 pixels, and standardizing the intensity values.
BEST is a standard benchmark for handwritten text recognition provided by the National Electronics and Computer Technology Center (NECTEC) [25], [26] as part of the 2019 Thai handwritten text recognition contest 1 . The BEST corpus has a total of 3550 images, covering 417 unique sentences, 71 unique Thai characters, and 10 Arabic numerals. We combined the original BEST dataset with an additional 220 images. We resized the height and width of the images to 64 and 625 pixels, respectively. We converted them to grayscale and standardized their intensities, following the same procedure as for IAM. We applied brightness and contrast augmentation to the images. The dataset has two test sets, test-seen and test-unseen. The test-seen set comes from the same source as the training and validation sets, whereas test-unseen comes from a completely different domain.

B. IMPLEMENTATION DETAILS
In this section, we present the models, training conditions, and other details for the experiments conducted on each dataset.

1) LIBRISPEECH
We chose the pretrained Wav2Letter+ model from [6] as the base model for train-clean-100 experiments. We added context heads to the base model and additionally trained for 100 epochs using a learning rate of 1e−4.
The Adam optimizer [50] and the exponential learning rate scheduler were kept identical to the previous work. We re-tuned the LM weights for the baseline CTC and CCTC(1) on the validation set using grid search with a wider boundary and a finer step size.
As for LibriSpeech 960 hours, we used the implementation of QuartzNet 5×5 [18] from the NeMo toolkit [51]. We trained the base model for 300 epochs from scratch using the default configuration 2 . Afterwards, we added the extra heads and context losses and resumed the training for a total of 600 epochs. The training was done using 8 GPUs with a batch size of 32 per GPU.

1 https://www.nectec.or.th/
2 https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels

2) THAI YOUTUBE
The pretrained Wav2Letter+ model from [6] was used for Thai YouTube. We attached context heads to the pretrained network and additionally trained the model for 100 epochs. Following [6], we used an initial learning rate of 4e−5 with an exponential decay rate of 0.98 per epoch and a mini-batch size of 64. We took the baseline CTC and CCTC(1) models from [6] and tuned new LM weights for them using the larger search space.

3) IAM
We used Gated Fully Convolutional Networks (GFCN) proposed in [20] as the base model for the IAM dataset. The model comprises 12 2D-depthwise separable convolutional [52] and 18 2D-convolutional layers. We added the context heads to the pretrained networks, provided by the authors 3 , and resumed the training for an additional 400 epochs. We followed the training batch size of 2, the learning rate of 1e−4, and all other training hyperparameters as described in the original paper. The model was implemented using PyTorch [53]. We selected the checkpoint with the best validation CER for the comparison.

4) BEST
The base model for BEST consists of 12 layers of 2-dimensional (2D) convolution, 2 layers of 2D depthwise separable convolution [52], and 4 layers of 1-dimensional convolution. We followed the two-step training methodology proposed in [6]. We initially trained the base model using only CTC loss for 300 epochs. We attached context heads to the pretrained CTC network and additionally trained the model for another 400 epochs. Adam optimizer [50] was used with the fixed learning rate of 1e−4 and batch size of 64. The checkpoints with the best validation CER were selected for the comparison.

C. CONTEXT LOSS WEIGHTS
For higher-order context losses, tuning the loss weights by grid search becomes impractical. In order to reduce the search space of context loss weights, we propose to derive the higher-order weights through closed-form formulas based on the weight of the 1st-order context losses, α^(1). We simply set α^(1) = 1 in most of our experiments, as we found this value performs well in general. For maximum gains, one can obtain a better α^(1) through grid search within the range of 0.5 to 2.5. As for the weights of the higher-order context losses, α^(k) : k > 1, we have tried several heuristic approaches and found two effective methods.
The first approach assigns α^(1) equally as the weight for every context loss order, ∀k : α^(k) = α^(1). The second approach sets the highest-order context loss weight to one, α^(K) := 1, and then exponentially decreases the weights as the order of the context loss shrinks, ∀k : α^(k) = α^(K)/2^(K−k). Note that we set the weights of the right and left context losses to be the same, ∀k : β^(k) = α^(k).
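The two heuristics can be written down directly (the function names are ours):

```python
# The two weight heuristics above, sketched as formulas. Index k runs from
# 1 (adjacent context) to K (farthest context).
def equal_weights(K, alpha1=1.0):
    """First heuristic: every order gets the 1st-order weight alpha^(1)."""
    return [alpha1] * K

def decayed_weights(K):
    """Second heuristic: alpha^(K) = 1, halved for each step down in order."""
    return [1.0 / 2 ** (K - k) for k in range(1, K + 1)]

print(equal_weights(3))    # [1.0, 1.0, 1.0]
print(decayed_weights(3))  # [0.25, 0.5, 1.0]
```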

D. LM RESCORING
For LibriSpeech, we used the official word-based 3-gram LM. The LM was applied to beam search decoding with a beam size of 64. The LM weights and insertion penalties of each model were tuned on the validation set using grid search from 0.0 to 2.0 and 0.0 to 5.0, respectively. The step size was 0.1 for both hyperparameters. As for Thai YouTube, the word-level 3-gram LM was trained using text corpora from Thai Wikipedia and the Thai Q&A forum [6]. The beam size was set to 64. LM weights and insertion penalties were obtained through grid search in the same manner as LibriSpeech.
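The grid search over decoding hyperparameters can be sketched as follows; `validation_wer` is a hypothetical callback that decodes the validation set with the given settings and returns its WER:

```python
# A sketch of the grid search described above: sweep the LM weight over
# 0.0-2.0 and the insertion penalty over 0.0-5.0 in steps of 0.1, keeping
# the pair with the lowest validation WER.
def tune_lm(validation_wer):
    best = (float("inf"), None, None)
    for wi in range(21):                # LM weight: 0.0, 0.1, ..., 2.0
        for pi in range(51):            # insertion penalty: 0.0, ..., 5.0
            w, p = wi / 10, pi / 10
            wer = validation_wer(w, p)
            if wer < best[0]:
                best = (wer, w, p)
    return best                         # (best WER, LM weight, penalty)
```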

E. EVALUATION METRICS
As for the evaluation, we opted for the standard metrics, Character Error Rate (CER) and Word Error Rate (WER). Both metrics are defined as the Levenshtein distance for correcting the hypothesis into the ground truth, divided by the length of the ground truth. A concrete definition is given in (6):

ErrorRate = Lev(hyp, ref)/R = (I + D + S)/R,  (6)

where Lev is the Levenshtein distance and R is the length of the reference. The numbers of insertion, deletion, and substitution operations are written as I, D, and S, respectively. The operations are calculated at the character level for CER and at the word level for WER.
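The metric in (6) can be sketched with a standard dynamic-programming edit distance (our own minimal rendition, not a toolkit implementation):

```python
# A sketch of Eq. (6): the error rate is the Levenshtein distance from the
# hypothesis to the reference over the reference length. Character-level
# units give CER; splitting both strings into words first gives WER.
def levenshtein(hyp, ref):
    prev = list(range(len(ref) + 1))        # distances for an empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,               # delete a character
                           cur[j - 1] + 1,            # insert a character
                           prev[j - 1] + (h != r)))   # substitute (or match)
        prev = cur
    return prev[-1]

def error_rate(hyp, ref):
    return levenshtein(hyp, ref) / len(ref)

print(round(error_rate("recogniton", "recognition"), 3))  # one missing letter
```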

F. STATISTICAL SIGNIFICANCE
We used the Matched-Pairs Sentence-Segment Word Error (MAPSSWE) two-tailed test to evaluate statistically significant differences [54] between the baselines and the proposed methods. For word-level testing, we used the off-the-shelf SCTK implementation 5 . For character-level testing, we treated each character unit as a word. We reported the significant differences between two ASR systems using a significance threshold of 0.05.

G. BASELINES
To evaluate the effectiveness of the proposed method, we compared the performance of CCTC models against the non-contextualized regular CTC model. We used similar architectures and training conditions for both methods. The only distinction between the baseline and the proposed methods was the auxiliary context prediction heads, which were not used in the inference stage.

5 https://github.com/usnistgov/SCTK

V. RESULTS

A. AUTOMATIC SPEECH RECOGNITION
We start by comparing the effect of CCTC on ASR in two corpora with three different decoding algorithms: argmax, beam search (beam), and beam search with 3-gram language modeling (3-gram).

1) LIBRISPEECH
CCTC models consistently outperform the baseline CTC in the scenario where LM rescoring is unavailable, as illustrated in Table 3. A larger context can be beneficial, as shown by CCTC(1) being outperformed by CCTC(2). The CCTC(2) model tends to achieve the best results in this setting, with relative improvements over the baseline of 3.8% and 3.3% on the dev and test sets, respectively. On the other hand, the baseline CTC is slightly superior to the CCTC(2) model on the development set when the 3-gram language model is applied during beam search. Since additional context information is included in the decoding via the 3-gram language model, the benefits of CCTC can become smaller in this setting. However, the CCTC(2) model still outperforms the baseline on the test set. Similar results can also be found for the larger 960-hour setup, as depicted in Table 4. Overall, the CCTC(2) model is consistently superior to the baseline by around 4.2% and 3.0% relative using argmax and beam search decoding, respectively. If LM rescoring is applied, CTC and CCTC models are more comparable with each other. We discuss this discrepancy further in Section VI-B.

2) THAI YOUTUBE

Table 5 shows the performance of the models on the Thai dataset. CCTC(2) is also the best model without LM rescoring. The CCTC(2) model outperforms the baseline by 2.5% relative on test-th and 2.0% relative on test-cs. With the 3-gram LM, CCTC(1) is slightly better than CCTC(2). This is expected, since the extra contextual constraint provided by the LM helps reduce the dependency on the contexts from the model side. Note that, unlike on the English dataset, CCTC outperforms CTC in all decoder settings. This is due to the fact that context is more important in code-switching data than in monolingual data in order to correctly predict the language being spoken.

B. HANDWRITTEN TEXT RECOGNITION
In this part, we compare the performance of models trained with the CTC and CCTC losses on both Thai and English HTR datasets. We also present qualitative analyses to study the effects of adding the context losses.

1) IAM
The results on the IAM dataset are summarized in Table 6. On IAM, it is common to report both CER and WER with only greedy decoding to measure the performance of the model as a standalone system. As the context size increases, the model improves up to a context size of 3. CCTC (3) improves over the baseline CTC model by 5.3% and 7.5% relative on CER and WER, respectively. Figure 3 shows examples of the differences between model outputs. We found that CCTC models do noticeably better on hard-to-read handwriting. In Fig. 3a, the letter t is highly ambiguous and looks like A. CCTC (1) observes only the left space and the right letter h, which are inadequate for predicting the correct transcription. After we increased the context width, the higher-order CCTC models were able to fix the issue. The sample in Fig. 3b is also very vague, and wider context is needed to mitigate the problem. In Fig. 3c, CCTC encourages consistent spelling of numerals. This is expected, since context is very important for deciphering ambiguous handwriting. In a sense, CCTC is able to embed a language model into the recognition model without requiring an explicit LM. It is also interesting to note that the base model has a horizontal receptive field of 240 pixels, which covers roughly 4-7 characters for this dataset. This coincides with the best context size of three.

2) BEST
For the BEST dataset, CCTC models consistently outperform the CTC model, as shown in Table 7. We opted for CER as the only evaluation metric since the BEST transcriptions were not properly tokenized, and Thai text has no word-segmentation standard. The CCTC (4) model achieves the lowest validation CER of 9.7%. However, this superior performance does not hold on the test sets. The test-unseen set, which comes from a completely different domain, does not benefit from the implicit LM learned by the model. The best-scoring model on the unseen test set is CCTC (2), which offers a good middle ground. Note that a context of two characters is still a considerably weak LM and should not be detrimental even under domain mismatch, since it mostly learns which character sequences are legal in the language.
Further inspection suggests that adding context losses helps with character segmentation and ambiguous handwriting. Fig. 4a depicts handwriting with very narrow spacing between characters; the CTC model predicts an extra character whose shape resembles the combination of two characters. Fig. 4b shows a similar case in which CCTC helps disambiguate hard-to-decipher handwriting. These errors might be corrected with an LM. However, we would like to emphasize again that CCTC yields this improvement without any extra computational cost during inference.

VI. DISCUSSIONS

A. CCTC LEARNS AN IMPLICIT LANGUAGE MODEL
Our experiments have shown that CCTC can help improve the performance of ASR and HTR systems in various settings. In this section, we present supporting evidence that a model trained with CCTC also learns an LM in the process, thereby improving the model on sequence prediction tasks. To detect this effect, we computed the perplexity of the prediction outputs on the test set using a language model learned from the training text. If the model learns any sequence information during training, this perplexity should decrease.
We used 7-gram character LMs trained on the training sets to compute perplexity. We chose 7-grams so that the LM context covers up to the context size of CCTC (3). The text used to train the LM was deduplicated. Fig. 5 illustrates perplexities on argmax-decoded predictions for each test set. The baseline CTC model generally has the highest perplexity, and the value tends to decrease as the context size increases. The lower perplexities of the CCTC models indicate that their predictions are more congruent with the LM than the baseline CTC's, supporting the claim that CCTC learns an implicit LM. Note that for Thai YouTube, the ASR models were trained on the entire training set but tested separately on two different test subsets. The Thai YouTube dataset is mostly monolingual Thai, causing the implicit LM learned by the CCTC models to focus more on Thai. Thus, perplexity on the code-switching test set can increase, which is the case for CCTC (3). BEST is another dataset that does not exhibit the expected trend: a large portion of its training data is based on the same set of patterns, which differ slightly from those in the seen test set.
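The measurement above amounts to scoring each model's test-set predictions under a character n-gram LM fit on training text. As a rough sketch of the idea (not the paper's toolchain; a real 7-gram LM would use proper smoothing such as Kneser-Ney, whereas this toy uses add-one smoothing and a `^` padding symbol of our own choosing):

```python
import math
from collections import Counter

class CharNgramLM:
    """Toy add-one-smoothed character n-gram LM, an illustrative
    stand-in for the 7-gram LMs used for the perplexity probe."""
    def __init__(self, order: int, train_text: str):
        self.order = order
        self.ngrams = Counter()
        self.contexts = Counter()
        padded = "^" * (order - 1) + train_text
        for i in range(len(train_text)):
            gram = padded[i:i + order]
            self.ngrams[gram] += 1
            self.contexts[gram[:-1]] += 1
        # +1 for the padding symbol
        self.vocab = len(set(train_text)) + 1

    def perplexity(self, text: str) -> float:
        """exp of the average negative log-probability per character."""
        padded = "^" * (self.order - 1) + text
        log_sum = 0.0
        for i in range(len(text)):
            gram = padded[i:i + self.order]
            p = (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + self.vocab)
            log_sum += math.log(p)
        return math.exp(-log_sum / len(text))

lm = CharNgramLM(3, "abababab")
```

A prediction string that matches patterns seen in training (e.g. "abab" here) scores a lower perplexity than one that does not ("bbbb"), which is exactly the signal used to argue that CCTC outputs are more LM-like.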

B. DISCREPANCY BETWEEN CONTEXT PREDICTIONS AND LM RESCORING
Even though increasing the context size of CCTC models yields performance gains under argmax and beam search decoding, CCTC models with a shallower context window tend to be more suitable for external LM rescoring than wider ones. In Table 3 and Table 5, CCTC (1) is generally the most effective model when the external LM is used. However, as the context size increases, performance can drop, especially in the monolingual setup, where CCTC models are sometimes inferior to the baseline CTC. Fig. 6 depicts selected samples in which the CCTC model provides a better prediction with argmax decoding but underperforms the baseline with the 3-gram LM. In Fig. 6a, a sample from dev-cs of Thai YouTube shows that LM rescoring cannot fix the CCTC model's bad prediction but can fix CTC's. Fig. 6b depicts an example from LibriSpeech dev-clean in which the LM introduces an error into the CCTC model's output.
Aggressive context dependencies from the context heads may cause conflicts between the internal language representations and the external LM rescorer. Further investigation into methods that learn contexts jointly with the external LM might reduce this discrepancy.

C. RELATIONSHIP BETWEEN PERFORMANCE GAIN AND TRAINING TIME
Since the inference time of CCTC is always the same as that of the regular base model, we discuss the trade-off between the gain in evaluation metrics and the increase in training time in this section. Table 8 summarizes the trade-offs between gains and runtimes in the argmax decoding setup. The gains are shown as relative improvements in WER and CER for the ASR and HTR corpora, respectively. Runtime was measured as the ratio between CTC and CCTC training time (higher is faster), averaged over ten batches of training and excluding data-loading steps.
As expected, the training speed of CCTC models decreases as the context size increases. Considering the trade-off, a context size of two seems to offer most of the benefit. The increase in training time varies across datasets due to differences in encoders, input lengths, and alphabet sizes. For datasets with more input frames (LibriSpeech), a large proportion of the computation is spent on gradient computation, lessening the effect of the added context heads on the overall cost. Note that the label-generation process is not optimized to run in GPU memory in our implementation. With a proper implementation, the slowdown should be further reduced, as is the case for the CTC loss itself [31].
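The runtime ratio described above can be measured by timing only the training step itself, so that data loading stays out of the measurement. A minimal sketch (the `ctc_step`/`cctc_step` callables and the warm-up count are hypothetical, not from the paper's code):

```python
import time

def speed_ratio(ctc_step, cctc_step, batches, warmup: int = 2) -> float:
    """Average per-batch training-time ratio CTC/CCTC (higher means
    CCTC is closer to CTC speed). Timing wraps only the step call,
    so data loading is excluded; the first `warmup` batches are ignored."""
    def mean_step_time(step):
        times = []
        for i, batch in enumerate(batches):
            t0 = time.perf_counter()
            step(batch)                        # forward/backward/update
            dt = time.perf_counter() - t0
            if i >= warmup:                    # skip warm-up batches
                times.append(dt)
        return sum(times) / len(times)
    return mean_step_time(ctc_step) / mean_step_time(cctc_step)

# Toy stand-ins: the "CCTC step" sleeps longer, so the ratio is below 1.
batches = list(range(12))
ratio = speed_ratio(lambda b: time.sleep(0.001),
                    lambda b: time.sleep(0.003),
                    batches)
```

Warm-up batches are excluded because the first few steps typically pay one-off costs (kernel compilation, allocator growth) that would bias a ten-batch average.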

D. APPLYING ADAPTIVE WEIGHT ASSIGNMENTS FOR CONTEXT LOSSES
To reduce the effort of manual weight searching, we attempted to use adaptive task-balancing methods such as DTP [55] and DWA [56]. However, we found no improvement, owing to the distinctive role of the context prediction subtasks compared to subtasks in other multi-task learning setups, which require each subtask to perform well independently on its respective metric [57], [58]. In the CCTC framework, only the performance of the main task is considered. The context prediction tasks are side tasks that are intended to complement the main CTC task and depend heavily on it. Degradation of the side tasks is acceptable if it improves the performance of the main task. To the best of our knowledge, existing frameworks with similar setups also employ fixed-weight tuning [59], [60].

E. CCTC TRAINING FROM RANDOM INITIALIZATION
We found that starting CCTC training from the first iteration (from-scratch) was not effective, as shown in Table 9. We observed training instabilities and worse WER compared to both the regular CCTC model, which used pre-trained weights for initialization, and the baseline CTC model. Since the CCTC labels are computed on the fly using the current model, low-quality predictions at the early stage hurt the overall performance.
We ensured that every setup had the same number of model updates. For from-scratch, a randomly initialized model was trained with the CTC and context losses for 400 epochs. The pre-train approach was trained for 300 epochs with the CTC loss, followed by another 100 epochs of CCTC training. Table 9 shows that using forced-alignment labels as context labels (with g.t. labels) incurred a 2.5% relative degradation in the WER of CCTC models compared to using the labels obtained from argmax predictions (pre-train CCTC). We hypothesize that mismatches between the alignments and the CCTC model's predictions were responsible for the degradation. Note that we computed the forced alignments from the wav2letter CTC model using the CTC-Segmentation method described in [61].
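The on-the-fly label generation discussed above takes the model's own argmax path and derives neighbor targets from it. One plausible construction for order-1 left-context targets is sketched below; the exact labeling rule in the paper may differ in details (e.g. how blank frames are handled), so treat this as an assumption-laden illustration:

```python
def left_context_labels(frame_labels, blank: int = 0):
    """Order-1 left-context targets from an argmax path: for each frame,
    the label of the last non-blank segment that ended before the frame's
    current segment (blank if no such segment exists). This is one
    plausible reading of CCTC's self-generated context labels, not a
    verbatim reproduction of the paper's rule."""
    targets = []
    prev = object()       # previous frame's label (sentinel at start)
    cur = blank           # label of the segment the cursor is inside
    last_left = blank     # last non-blank segment before the current one
    for lab in frame_labels:
        if lab != prev:               # a new segment starts at this frame
            if cur != blank:
                last_left = cur       # the finished segment becomes context
            cur = lab
        targets.append(last_left)
        prev = lab
    return targets
```

Because these targets are recomputed from the current model's predictions at each step, an untrained model produces noisy context labels, which is consistent with the from-scratch instabilities reported above.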

VII. LIMITATIONS
There are some considerations when applying CCTC. First, determining the optimal loss weights and context sizes requires hyperparameter tuning, which can be mitigated by search-space reduction methods. Second, although inference time remains the same, CCTC increases training time. Finally, the impact of CCTC is less pronounced when used in conjunction with a language model. To address this limitation, we plan to investigate training CCTC together with a language model in future work.

VIII. CONCLUSION
CCTC is a framework that incorporates context information into the CTC loss for training non-recurrent NAR models. Experiments on Thai and English ASR and HTR benchmarks have shown that CCTC improves performance by learning an implicit character language model. CCTC models with a wider context are generally superior up to a certain size but can have difficulties when used in conjunction with an external LM. In the future, we plan to investigate joint training with the language model in order to fully utilize the implicit LM learned by CCTC. Table 10 lists the context loss weights assigned to each model. The weights vary due to differences in the magnitudes of the CTC losses. For the LibriSpeech 100-hour, Thai YouTube, and IAM models, the CTC losses were normalized by the length of the transcriptions; the remaining models used unnormalized CTC losses.