A Residual-Attention Offline Handwritten Chinese Text Recognition Based on Fully Convolutional Neural Networks

Offline handwritten Chinese text recognition is one of the most challenging tasks in that it involves various writing styles, complex character-touching, and a large number of character categories. In this paper, we propose a residual-attention offline handwritten Chinese text recognition method based on fully convolutional neural networks; it is a segmentation-free approach that avoids the impact of incorrect character segmentation. By designing a smart residual attention gate block, our model can extract important features and effectively support the training of deep convolutional neural networks. Furthermore, we deploy an expansion factor that expresses the trade-off between the computing resources required for model training and the ability of a gradient to propagate across multiple layers, allowing model training to adapt to different computing platforms. Experiments on the CASIA-HWDB and ICDAR-2013 competition datasets show that our method achieves competitive performance on offline handwritten Chinese text recognition. On the CASIA-HWDB test set, the character-level accurate rate and correct rate reach 97.32% and 97.90%, respectively.


I. INTRODUCTION
Offline handwritten Chinese text recognition (OHCTR) is a challenging problem that has received significant attention from researchers [1]-[3]. The reason can generally be attributed to two important factors. The first is the rapid growth of OHCTR application requirements, including office handwritten document processing, mailing address recognition, and precious historical manuscript recognition. The second is the inherent, long-standing complexity of OHCTR, which involves various writing styles, complex character-touching, and a large number of character categories. Although the development of neural networks has advanced OHCTR to a certain extent, the existence of highly similar characters and the significant differences in individual handwriting styles keep it an open problem.
The recognition of offline handwritten Chinese text usually adopts sequential pattern recognition techniques [4]-[6], which can in general be divided into two categories: over-segmentation approaches and segmentation-free approaches. Over-segmentation approaches that integrate a linguistic context model, a character classifier, and topological geometry have been demonstrated to be successful in offline handwritten text processing; both character shape modeling and the linguistic context model play very important roles. They first acquire candidate segmentation-recognition paths from consecutive over-segments, and then perform an optimal path search by integrating the linguistic context model and the character classifier [7], [8]. Some over-segmentation based systems incorporate a neural network language model in place of the conventional character classifier, segmentation, and geometric models, and they have achieved the best performance among over-segmentation based methods on the CASIA-HWDB 2.0-2.2 datasets [9]. However, these over-segmentation methods face a data sparseness problem that hinders stable estimation of the optimal path, as the number of parameters grows exponentially with the length of the topological path. Moreover, these methods are designed for specific situations, and it is difficult for them to deal with overlapping and touching characters in general. Segmentation-free approaches, by contrast, do not need to explicitly segment text lines into individual characters.
One early approach models text lines with the Gaussian mixture model; the hidden Markov model (HMM) is the representative, which was first applied in speech recognition and then introduced into the recognition of offline handwritten Chinese text lines, achieving good performance [4], [10], [11]. As the length of the recognized character sequence increases, HMM-based methods involve excessive parameters, which degrades their recognition performance. Different from these models, neural networks have completely changed many fields of machine intelligence, making the challenging task of image recognition successful with superhuman accuracy. One recent approach utilized recurrent neural networks (RNN) for the recognition of handwritten English, which has a small number of character categories. The RNN approach is quite flexible and avoids explicit segmentation, largely thanks to connectionist temporal classification (CTC) [12]. Suryani et al. [13] employed a CNN and an LSTM under the HMM framework to obtain a significant improvement over the traditional LSTM-HMM model. Wu et al. [14] used separable MDLSTM and RNN with CTC loss instead of the traditional LSTM-CTC method. Although all of these methods utilize recurrent architectures to properly model and learn serial context information, they lack parallelization ability in the training phase and demand significant computing resources. Furthermore, such handwritten text recognition is largely applied only to near-neighbor character recognition or single-character recognition, and long-range dependencies have not been accounted for so far.
For the above reasons, there has been a recent shift to recurrence-free neural network architectures in most sequence recognition modeling works. We can see the trend of convolutional neural networks with connectionist temporal classification (CNN+CTC) in handwritten recognition problems [15], [16]. Peng et al. [17] proposed an end-to-end offline handwritten Chinese text recognition method using fully convolutional networks. This method is composed of three computation modules, including location, detection, and classification, but its efficiency is not ideal. Liu et al. [18] proposed an efficient and effective offline handwritten text recognition method with convolutional neural networks for the challenging OHCTR task. This method performs convolution operations with equal weights on all input pixels, resulting in a decrease in recognition performance on handwritten text with background noise. Mohamed et al. [19] proposed a novel handwritten recognition method based on fully convolutional neural networks. It consists mostly of depthwise separable convolution operations with residual connections and softmax gating, trained on word or text-line labels using the connectionist temporal classification loss function. However, the method has a very complex residual attention module and verifies its effectiveness only on samples with a small number of categories.
Inspired by recent works on convolutional neural networks [19]-[22], we propose a residual-attention offline handwritten Chinese text recognition method based on fully convolutional neural networks; it is a segmentation-free approach that avoids the impact of incorrect character segmentation. The method introduces a novel residual attention convolution to increase the importance of representative features and reduce the negative influence of background or noise. More importantly, an expansion factor adjusts the number of tensor channels in the convolution process, balancing the ability of a gradient to propagate across multiple layers against the computing resources required. Our main contributions are as follows: 1) A recurrence-free architecture for offline handwritten Chinese text recognition is presented, which avoids the degradation of recognition performance caused by character segmentation errors, exploits parallelization in the training phase, and avoids the large delays caused by recurrent iterative operations. 2) A novel smart residual attention gate block is designed to combine the advantages of the residual framework and the attention framework. It extracts representative features and alleviates the problems of gradient explosion and vanishing gradients in deep convolutional neural networks.
3) The expansion factor is introduced to quantitatively analyze the performance of the convolutional neural network, so that the model parameters can be set to achieve a balance between model performance and computing resources. 4) A set of extensive experiments is conducted on two widely adopted Chinese benchmark datasets, CASIA-HWDB and ICDAR-2013. The proposed method achieves competitive character error rates both without and with a language model, demonstrating its validity. The rest of the paper is organized as follows: Section 2 briefly reviews related works. In Section 3, we explain the proposed residual-attention offline handwritten Chinese text recognition method. Experimental studies are given in Section 4. Finally, we draw conclusions and discuss future work in Section 5.

II. RELATED WORKS
In this section, some basic concepts of the attention mechanism and connectionist temporal classification are discussed. We first briefly explain most of the mathematical symbols used in this work. Let X = {x_1, x_2, ..., x_n} be an offline handwritten text line image set, where the i-th image x_i ∈ R^{h×w×3}. L = {l_1, l_2, ..., l_n} represents the sequence label set of the image databases, where l_i = {l_{i1}, l_{i2}, ..., l_{in}} and l_{ij} denotes the j-th character of the sequence label of the i-th image. Y = {y_1, y_2, ..., y_m} is the tensor set, where y_i ∈ R^{h'×w'×c'} denotes the input tensor of the i-th layer of our model, h' is the height, w' is the width, and c' is the number of channels.

A. ATTENTION MECHANISM
The attention mechanism can be considered a kind of guidance that shifts the allocation of available processing resources toward the most discriminative parts of the input document image, and it mitigates the partial information loss caused by indispensable down-sampling under constraints such as limited computing resources, network transmission, and storage space [21]. Generally, it can reduce or even filter out the influence of background noise in lower-resolution feature maps on the results, and strengthen the important parts of the input document image. Recently, a number of tentative studies have combined the attention mechanism with deep neural networks [23], [24], ranging from localization and understanding in complex images [25] to sequence-based neural networks [26]-[28]. It is also usually integrated with a nonlinear activation function, such as Softmax, or with Dropout, to down-sample or up-sample the feature maps.
Each attention module can be divided into two sub-modules: the trunk sub-module and the mask sub-module. The trunk sub-module implements feature processing and can be introduced into any advanced neural network structure. Given the trunk sub-module output feature map F_c(y_i) for input feature y_i, the mask sub-module utilizes a common bottom-up top-down calculation structure [29], [30] to obtain an output feature map M_c(y_i) of the same size, which is used as a soft weight on the output feature map F_c(y_i). In the mask sub-module, the design of the bottom-up top-down calculation structure is derived from the fast feedforward and feedback processes of neural networks. The output feature map of the mask sub-module determines the nonlinear gates for the neurons of the trunk sub-module. The output feature map of the attention module, H_c(y_i), is:

H_c(y_i) = M_c(y_i) × F_c(y_i)    (1)
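The soft weighting of the trunk output by the mask output can be illustrated with a minimal sketch, using plain Python lists in place of feature maps (illustrative only, not the paper's implementation):

```python
def attention_gate(trunk, mask):
    """Soft attention gating: each trunk feature F_c(y_i) is weighted
    elementwise by the mask output M_c(y_i) of the same size."""
    return [[m * f for m, f in zip(mask_row, trunk_row)]
            for mask_row, trunk_row in zip(mask, trunk)]

# Mask values near 1 pass features through; values near 0 suppress them.
F = [[2.0, 4.0], [6.0, 8.0]]   # trunk sub-module output
M = [[1.0, 0.5], [0.0, 1.0]]   # mask sub-module output (soft weights)
H = attention_gate(F, M)       # [[2.0, 2.0], [0.0, 8.0]]
```

In a real network the mask values come from a sigmoid or softmax over the bottom-up top-down branch, so they lie in [0, 1].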

B. CONNECTIONIST TEMPORAL CLASSIFICATION
Connectionist temporal classification (CTC) is a training criterion that automatically aligns an unknown input feature sequence with a known label sequence; it was devised for solving sequence labeling problems [31], [32]. It has been widely used in many fields, such as human speech recognition [33], [34], handwriting recognition [1], [19], gesture language recognition [35], and continuous image segmentation and classification [36]. CTC has proven to be effective in connectionist sequence recognition tasks.
Given the input feature sequence Y and the known label sequence L, CTC can learn the alignment without using frame-level alignment information; its implementation requires four steps. First, CTC deals with repeated labels in the known label sequence; for most Chinese text recognition tasks, CTC does not require the blank symbol to distinguish different phrases, but it is essential in English tasks. Second, the processed label sequence is employed as the label for the unlabeled feature sequence. It is worth noting that the nonlinear softmax function normalizes the outputs to obtain the distribution over labels l_t given Y at every timestep.
Third, the complete output label sequence is employed to generate a distribution over all possible alignments, where each alignment is described as a possible path π. The path π is constituted of possible label sequences over L. Assuming that the output label at each timestep is independent of the other timesteps, the probability of one label sequence path π is:

p(π | Y) = ∏_{t=1}^{T} y^t_{π_t}    (2)

where π_t indicates the label at timestep t for path π. Finally, CTC defines a many-to-one mapping F, which merges consecutive identical labels to obtain the predicted label sequence from a label sequence path. The probability of label L is then the sum over the probabilities of all label sequence paths that map to it:

p(L | Y) = Σ_{π ∈ F^{-1}(L)} p(π | Y)    (3)

As discussed, although much effort has been dedicated to offline handwritten Chinese text recognition methods based on convolutional neural networks, the noted algorithms suffer from the following limitations and challenges.
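The collapse mapping F and the path-summing computation can be made concrete with a toy example; the blank symbol '-' and the brute-force enumeration over paths are illustrative only (real CTC implementations use the forward-backward algorithm):

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """CTC many-to-one mapping F: merge consecutive identical labels,
    then delete blanks."""
    merged = []
    for sym in path:
        if not merged or merged[-1] != sym:
            merged.append(sym)
    return [s for s in merged if s != BLANK]

def path_prob(path, probs):
    """p(pi | Y) = product over t of the softmax output for pi_t at t (Eq. 2)."""
    p = 1.0
    for t, sym in enumerate(path):
        p *= probs[t][sym]
    return p

def label_prob(label, T, alphabet, probs):
    """p(L | Y): sum path probabilities over every length-T path pi
    with F(pi) = L (Eq. 3).  Brute force, only sensible for tiny examples."""
    return sum(path_prob(p, probs)
               for p in product(alphabet, repeat=T)
               if collapse(p) == list(label))

# Uniform per-timestep outputs over {'a', '-'} and T = 2:
probs = [{"a": 0.5, BLANK: 0.5}] * 2
# Paths collapsing to "a": (a,a), (a,-), (-,a)  ->  3 * 0.25 = 0.75
```

Training minimizes -log p(L | Y); the forward-backward recursion computes the same sum in O(T·|L|) instead of enumerating the exponentially many paths.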
1) Over-segmentation-based methods have the problem of data sparseness: they suffer from unstable context estimation because the length of the context leads to an exponential number of parameters. 2) Most methods use recurrent architectures to model and learn successive sequence information, but they lack parallelization during the training stage and require significant computing resources. 3) CNN-based methods usually rely on complex residual attention modules, and their effectiveness has only been verified on samples with a small number of categories.

III. METHODOLOGY
In this section, we first introduce the framework of the residual-attention offline handwritten Chinese text recognition based on fully convolutional neural networks, and then give the concepts of the residual attention gate block, which is the core computation of our method. Finally, we provide the model design and algorithm implementation.

A. FRAMEWORK
By introducing an attention mechanism and a residual network structure into convolutional neural networks, our method enhances the ability to extract meaningful features from the text line image and improves the convergence of deep neural network training. More importantly, it includes only fully convolutional operations, effectively avoiding the storage and computing costs of loop or recursive operations. Therefore, our method can achieve a significant improvement in recognition accuracy and efficiency. Figure 1 shows the flowchart of residual-attention offline handwritten Chinese text recognition, which has three main stages: (1) Initialization stage: the input text line image of size h × w × 3 is processed with a 1 × 1 convolution to obtain a tensor with 31 channels, which is normalized with softmax to improve the convergence speed of network training. Each channel is then preprocessed independently with a 13 × 13 filter using a depthwise convolution operation, and the result of preprocessing is concatenated with the layer from the normalized original image.
(2) GateBlock stage: the core computational stage of our method, composed of 12 GateBlocks. Each GateBlock consists of the attention mechanism and the residual structure, and adopts separable convolution operations to realize fast high-level feature extraction. Following the model structure parameters in Table 1, there are four layers, conv1.x-conv4.x, each of which includes two GateBlocks and a max pooling operation. The tensor size goes from h × w to (h/8) × (w/16), and the number of channels grows from 32 to 1024. The conv5.x layer of the model contains four GateBlocks, and the tensor size remains unchanged at (h/8) × (w/16) × 1024. Through these GateBlocks, we obtain a high-level representation of the input text line image.
(3) CTC stage: the two input sequences required by the CTC function are both one-dimensional, so we perform a post-convolution operation on the tensor of size (h/8) × (w/16) × 1024 to obtain a tensor of size (h/8) × (w/16) × C, where C represents the number of character categories. We then apply average pooling along the h direction followed by Softmax to obtain a prediction tensor of size 1 × (w/16) × C. CTC then aligns this tensor with the label, producing the prediction result and the loss value. In addition, to improve the convergence speed and normalize the tensor data during training, we use multiple normalization methods in each layer [37], [38], such as batch normalization and layer normalization.
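As a sanity check on the shape bookkeeping of the three stages, the following sketch traces the tensor sizes for an assumed input of 128 × 2400 pixels and a placeholder class count (the exact pooling placement follows Table 1):

```python
def pipeline_shapes(h, w, num_classes):
    """Trace tensor shapes through the three stages: initialization,
    GateBlocks (height /8, width /16, 1024 channels), and the CTC head
    (post-convolution to C channels, average-pool over height, softmax)."""
    return [
        ("input image",            (h, w, 3)),
        ("after GateBlocks",       (h // 8, w // 16, 1024)),
        ("after post-convolution", (h // 8, w // 16, num_classes)),
        ("CTC prediction tensor",  (1, w // 16, num_classes)),
    ]

# With h = 128 and w = 2400 as in the experiments, the CTC head sees a
# 1 x 150 x C prediction sequence (C = 7000 is a placeholder here).
for name, shape in pipeline_shapes(128, 2400, 7000):
    print(f"{name:24s} {shape}")
```

The width-wise sequence of 150 per-position class distributions is what CTC aligns against the label sequence.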

B. RESIDUAL ATTENTION GATE BLOCKS
Our method is constructed by stacking multiple GateBlocks as the main computational blocks. It uses attention gates to process the flow of information between layers, increasing the importance of representative features and reducing the importance of irrelevant features by weighting; this approach has received significant interest from many researchers recently. In deep convolutional neural networks, the attention mechanism is usually used together with a residual structure to improve the convergence of the network. We opted to base the improved attention gates on the gating mechanism proposed in Deep Residual Learning [20]. Figure 2 shows the detailed structure of the i-th GateBlock.
Let Y = {y_1, y_2, ..., y_m} be the output tensor set of the neural network layers, where the i-th tensor y_i ∈ R^{h'×w'×2^k}, and h', w' and k are determined by the corresponding network layer. Table 1 gives the parameters of our model. H(·) represents a mapping function from the input tensor to the output tensor of a network layer; that is, for the i-th network layer's input tensor y_i, the output tensor is y_{i+1} = H_c(y_i). To achieve the goal of the GateBlock, we propose a novel mapping function for the residual attention gate block, whose formal expression is:

H_c(y_i) = (1 + M_c(y_i)) × F_c(y_i) + y_i    (4)

where H_c(y_i) represents the c-th channel of the output tensor mapped from the input tensor, M(·) represents a mask branch function, and F(·) represents a trunk branch function. To further understand Eq. (4), we perform an identity transformation and obtain the expanded expression:

H_c(y_i) = M_c(y_i) × F_c(y_i) + F_c(y_i) + y_i    (5)

In Eq. (5), let H'_c(y_i) = M_c(y_i) × F_c(y_i) and H''_c(y_i) = F_c(y_i) + y_i, and we get:

H_c(y_i) = H'_c(y_i) + H''_c(y_i)    (6)

Eq. (6) is composed of two parts, where H'_c(y_i) represents the attention module. For the i-th layer, the gate block from the input tensor y_i to the output tensor y_{i+1} contains two branches, M_c(y_i) and F_c(y_i): M_c(y_i) is the mask branch of the attention module, and F_c(y_i) is the trunk branch. They work together in the feature extraction process to encourage representative feature information and depress irrelevant feature information. In addition, H''_c(y_i) indicates the residual network structure; it is also composed of two parts, F_c(y_i) and y_i, where F_c(y_i) is the output tensor obtained by the mapping function or convolution operation, and y_i is the input tensor transmitted directly to the next layer without any processing.
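The identity transformation between the single gate mapping and its attention-plus-residual decomposition can be checked numerically for scalar values (it holds elementwise for tensors); this is a verification sketch, not part of the model:

```python
def gate_factored(m, f, y):
    """H = (1 + M) * F + y: the mask raised by 1, multiplied by the
    trunk output, plus the residual input."""
    return (1.0 + m) * f + y

def gate_decomposed(m, f, y):
    """H' + H'': attention part M * F plus residual part F + y."""
    return m * f + (f + y)

# The two forms agree for any mask, trunk, and input values.
for m, f, y in [(0.3, 2.0, 1.0), (0.0, -1.5, 4.0), (1.0, 0.0, 0.5)]:
    assert abs(gate_factored(m, f, y) - gate_decomposed(m, f, y)) < 1e-12
```

The factored form is what the block computes; the decomposed form makes the attention and residual contributions explicit.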
To make effective use of highway gates even for wide and deep networks, we need to consider the dimensionality problem for the residual connection transformation. M_c(y_i) is first increased by 1, then multiplied by the convolution function F_c(y_i), and finally added to the input signal y_i. Here, we use the dual transformation mappings of Ref. [19]. Let P_1 be a negative transformation mapping x ∈ R^{H×W×C} to x' ∈ R^{H'×W'×C'}, and P_2 be a positive transformation mapping x' ∈ R^{H'×W'×C'} to x ∈ R^{H×W×C}. We can then rewrite Eq. (4) as:

y_{i+1} = P_2[(1 + M_c(P_1(y_i))) × F_c(P_1(y_i))] + y_i    (7)

Using the negative transformation mapping P_1 allows us to maintain the optimization benefit of residual connections while computing residual attention on different dimensional representations of y_i. P_1 and P_2 are implemented as depthwise separable convolutions, with the Exponential Linear Unit (ELU) as the activation function. In addition, we set C' = tC, H' = H, and W' = W. This means that down-sampling or up-sampling is applied only along the channel dimension of the input data, while its spatial dimensions keep their original size. The expansion factor t takes powers of 2, t ∈ {1/8, 1/4, 1/2, 1, 2}: t = 1 means the number of tensor channels is constant; t < 1 means the number of tensor channels is reduced, so the GateBlock executes faster and uses fewer computational resources; t > 1 means the number of tensor channels increases, so the GateBlock has more available information and better feature representation ability.
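A small helper (illustrative, not from the paper) makes the channel arithmetic of the expansion factor C' = tC concrete:

```python
EXPANSION_FACTORS = (1/8, 1/4, 1/2, 1, 2)

def expanded_channels(c, t):
    """Channel count inside the GateBlock after applying the expansion
    factor t (C' = t * C); the spatial dimensions H and W are unchanged."""
    if t not in EXPANSION_FACTORS:
        raise ValueError(f"unsupported expansion factor: {t}")
    c_new = int(c * t)
    if c_new < 1:
        raise ValueError("expansion factor collapses the channel count")
    return c_new

# t < 1: fewer channels, faster and lighter; t > 1: more channels,
# richer representation at a higher compute and memory cost.
print(expanded_channels(1024, 1/2))  # 512
```

With the conv5.x width of 1024 channels, t = 1/2 halves the internal width of every gate, which is the balance point chosen in the experiments.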

C. MODEL DESIGN AND ALGORITHM IMPLEMENTATION
Our design is inspired by residual attention networks [19], [21], which are widely used as backbone neural networks and have proven to achieve excellent performance on various tasks. The architecture of our method is depicted in Table 1; the first and last rows represent the pre-convolution and post-convolution operations, and the others represent the 12 GateBlocks and the max pooling operations. The network consists of successive convolution layers with 3 × 3 kernel size and residual attention gate blocks implemented as depthwise separable convolutions. Each GateBlock layer is followed by batch normalization and layer normalization during training, and the exponential linear unit is used as the nonlinear activation function, owing to its robustness with low-precision computation. The shape of the line image is h × w × 3, where h = 128, meaning the height of the line image is uniformly scaled to 128 pixels during text line preprocessing. Our model can adapt to text line recognition with variable lengths.
Algorithm 1 gives the pseudocode of the residual attention gate block, which has four input parameters: the tensor y_i, the number of channels c, the size of the convolution kernel k, and the expansion factor t. The implementation includes not only the operations needed to realize Eq. (4), but also three user-defined functions: the forward depthwise separable convolution function forwardConv(·), the tensor duplication function nGates(·), and the backward depthwise separable convolution function backwardConv(·). Finally, the output tensor y_{i+1} is returned.
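A heavily simplified sketch of the GateBlock forward pass follows. The function names mirror the paper's forwardConv(·), nGates(·) and backwardConv(·), but the bodies here are toy stand-ins (linear scalings and a sigmoid on lists of floats) rather than depthwise separable convolutions:

```python
import math

# Toy stand-ins for Algorithm 1's user-defined functions; the real
# versions are depthwise separable convolutions (P1 and P2) and a
# mask-producing branch.
def forward_conv(y, w=0.5):
    return [w * v for v in y]

def n_gates(z):
    # Squash to (0, 1) so the values act as soft gate weights.
    return [1.0 / (1.0 + math.exp(-v)) for v in z]

def backward_conv(z, w=2.0):
    return [w * v for v in z]

def gate_block(y):
    """Sketch of the GateBlock forward pass:
    y_{i+1} = P2[(1 + M(P1(y_i))) * F(P1(y_i))] + y_i."""
    z = forward_conv(y)                        # P1: into the expanded space
    mask = n_gates(z)                          # mask branch M_c
    trunk = forward_conv(z, w=1.0)             # trunk branch F_c
    gated = [(1.0 + m) * f for m, f in zip(mask, trunk)]
    out = backward_conv(gated)                 # P2: back to input dims
    return [o + v for o, v in zip(out, y)]     # residual connection
```

Because P2 returns the gated tensor to the input dimensions, the final residual addition is always well defined regardless of the expansion factor.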

IV. EXPERIMENTAL RESULTS
To facilitate comparative studies against existing models in this field, we evaluated the performance of our handwritten Chinese text recognition method on two widely adopted datasets: a large database of offline Chinese handwriting called CASIA-HWDB [39], and a small dataset from the ICDAR 2013 Chinese handwriting recognition competition, abbreviated as ICDAR-2013 [40]. The method was implemented on a desktop computer with an Intel Core i9-9900K 3.60 GHz CPU, programmed in Python 3.6 using PyCharm. For training the CNN models, we used NVIDIA RTX 2080 Ti GPUs for acceleration.

A. EXPERIMENTAL PREPARATION
In the experiments, we compared our method against six well-known offline handwritten text recognition methods [8], [9], [11], [12], [14], [17]. These methods involve text recognition technologies such as traditional character over-segmentation, CNN, and CNN-LSTM, and they have all shown their advantages in their respective aspects. The experiments are on two datasets, CASIA-HWDB and ICDAR-2013; Table 2 gives their detailed information. As an unconstrained handwritten text database, CASIA-HWDB is divided into a training set and a testing set, and includes 3,118,447 isolated character samples; ICDAR-2013 contains 91,519 isolated character samples. Each text line image is preprocessed into 128 × 2400 pixels, where '0' represents character handwriting and '1' represents the blank background; when the height or width is less than these values, the blank part is padded with '1'. Our model is trained with segmentation-free pairs of text lines and corresponding label sequences, without any character- or frame-level alignment. In each epoch, training examples are sampled from the training set without replacement. 90% of the training samples from CASIA-HWDB are used to train the classifiers, and the remaining 10% are used to estimate the confidence parameter. Although neural networks show good performance at handwriting recognition, the available training data is often insufficient to capture the wide handwriting variation from writer to writer, so we introduce the grid distortion method [43] to augment the CASIA-HWDB database, as shown in Figure 4(b). We implemented our system on the TensorFlow [44] deep learning framework with the Adam [45] optimizer. All experiments use an initial learning rate of 5 × 10^-3, which is exponentially decayed to 1 × 10^-3 after 1 × 10^6 batches, with the maximum batch size allowed by our platform (not less than 4). A 2 × 2 max pooling is applied successively between layers until the height of the feature map reaches 8 pixels.
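The decay form of this learning-rate schedule is not fully specified, so the following sketch assumes a plain exponential interpolation from 5 × 10^-3 to 1 × 10^-3 over 10^6 batches, held constant afterwards:

```python
def learning_rate(step, lr0=5e-3, lr_final=1e-3, decay_steps=1_000_000):
    """Exponentially decay from lr0 to lr_final over decay_steps batches,
    then hold at lr_final (one plausible reading of the schedule)."""
    if step >= decay_steps:
        return lr_final
    return lr0 * (lr_final / lr0) ** (step / decay_steps)
```

With this form the rate decays smoothly by a constant factor per step and reaches exactly 1 × 10^-3 at the millionth batch.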
The architecture details of our method are shown in Table 1. In addition, training stops when the maximum of 1 million iterations is reached or the value of the loss function has not improved for 50 consecutive iterations.
The Levenshtein edit distance [41] is used to measure the performance of the model at the character level, normalized by the length of the label sequence; this is commonly known as the character error rate (CER). In this paper, following the literature [7], [9], [12], [14], the accurate rate (AR) and correct rate (CR) are employed to evaluate our model. Their formal expressions are:

AR = (N_t − D_e − S_e − I_e) / N_t
CR = (N_t − D_e − S_e) / N_t

where N_t represents the length of the label sequence in the transcript, and S_e, D_e and I_e represent the numbers of substitution errors, deletion errors, and insertion errors, respectively.
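Counting S_e, D_e and I_e amounts to backtracking the Levenshtein dynamic-programming table; a self-contained sketch (an illustrative implementation, not the paper's code) is:

```python
def edit_ops(ref, hyp):
    """Edit distance between reference and hypothesis sequences,
    returning (S_e, D_e, I_e): substitutions, deletions, insertions."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,
                           dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1)       # insertion
    # Backtrack one optimal path, counting each operation type.
    s = d = ins = 0
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            s += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return s, d, ins

def ar_cr(ref, hyp):
    """AR = (N_t - D_e - S_e - I_e) / N_t;  CR = (N_t - D_e - S_e) / N_t."""
    s, d, ins = edit_ops(ref, hyp)
    n = len(ref)
    return (n - d - s - ins) / n, (n - d - s) / n
```

Note that AR penalizes insertions while CR does not, so CR ≥ AR always holds.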

1) EXPANSION FACTOR
The expansion factor determines the number of channels in the GateBlock calculation process. It can down-sample or up-sample the original representation into a lower- or higher-dimensional space, perform a lightweight depthwise convolution on the new representation, and then map the representation back to the size of the original dimensional space [42]. Table 3 gives the results with different expansion factors on the CASIA-HWDB test set. In addition, Figure 3 shows the trends of accurate rate, training time, and trained model size on the CASIA-HWDB test set under different expansion factors. The horizontal axis represents the expansion factor, and the vertical axes represent the accurate rate, per-epoch time, and trained model size, respectively. As can be seen from Figure 3(a), the growth of the accurate rate is relatively gentle and tends to converge to a certain value. In contrast, the per-epoch time and trained model size increase rapidly with the expansion factor in Figures 3(b) and 3(c). It is worth noting that a reasonable setting of the expansion factor is important for handwriting recognition. When the expansion factor is set too small, the model can complete the training task with less computing space and time, but the accurate rate is not high; when it is set too large, the trained model obtains a higher accurate rate, but more computing space and time are needed to complete training. Through multiple experimental verifications, we set the expansion factor to t = 1/2, which achieves a balance to a certain extent.

2) RECOGNITION ACCURACY
To verify the proposed module, we evaluate it against six state-of-the-art methods from the offline handwritten Chinese text recognition literature. These methods involve text recognition technologies such as traditional character over-segmentation, CNN, and CNN-LSTM, as well as data augmentation or language model adaptation used to enhance recognition accuracy. The recognition results of the compared methods on the CASIA-HWDB and ICDAR-2013 datasets are shown in Table 4 and Table 5. Each row represents a method, and the best rates are indicated in bold face.
From Table 4, our method achieves the best recognition performance on CASIA-HWDB without using any language model, with an AR of 96.85% and a CR of 97.46%. Compared with the suboptimal method, the AR of our method is improved by 2.05%. On the ICDAR-2013 dataset, our method obtains 91.30% AR and 91.96% CR, a competitive result among the seven methods in comparison. It is meaningful that our method, a fully convolutional neural network combining a residual structure and an attention mechanism, can achieve such results on offline handwritten Chinese texts involving various writing styles, character-touching, and a large number of character categories, without the involvement of any language model.
To further increase the performance of our system, an explicit language model is integrated to explore the semantic relationships between characters. By incorporating lexical constraints and prior knowledge about the language, a language model can rectify some obvious semantic errors and thus improve the recognition result. In this paper, we only considered a character tri-gram language model in the experiments. As shown in Table 5, our method with the language model obtains the best AR and CR on the CASIA-HWDB test set, at 97.32% and 97.90% respectively. Similarly, the suboptimal values of AR and CR obtained on the ICDAR-2013 dataset are 96.51% and 96.76%, respectively. Although the language model can improve the recognition accuracy of offline handwritten Chinese text recognition methods, we should recognize its two limitations. First, the effectiveness of the language model is limited by the dataset itself: when the character relationships in the text labels of the dataset conform to the language model, it helps to improve the recognition accuracy; otherwise, it has little impact on the recognition accuracy, and may even decrease it. Second, the language model has a weaker optimization effect on models with higher recognition accuracy. The method of Du et al. gains an AR of 83.89% without a language model and achieves an AR of 93.50% with one, a relative improvement of 11.46%. Correspondingly, our method only improves the recognition performance by 5.71% using the language model compared with its performance without it.
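A character tri-gram model of the kind used here can be sketched in a few lines; the add-one smoothing and the tiny toy corpus below are our own illustrative choices, not the paper's configuration:

```python
from collections import defaultdict

class TrigramLM:
    """Minimal character tri-gram language model with add-one smoothing,
    usable for rescoring competing recognition hypotheses."""
    def __init__(self, corpus, vocab):
        self.vocab = set(vocab)
        self.tri = defaultdict(int)
        self.bi = defaultdict(int)
        for line in corpus:
            chars = ["<s>", "<s>"] + list(line)
            for a, b, c in zip(chars, chars[1:], chars[2:]):
                self.tri[(a, b, c)] += 1
                self.bi[(a, b)] += 1

    def prob(self, a, b, c):
        # p(c | a, b) with add-one smoothing over the vocabulary.
        return (self.tri[(a, b, c)] + 1) / (self.bi[(a, b)] + len(self.vocab))

    def score(self, text):
        chars = ["<s>", "<s>"] + list(text)
        p = 1.0
        for a, b, c in zip(chars, chars[1:], chars[2:]):
            p *= self.prob(a, b, c)
        return p

# A hypothesis whose character sequence matches the corpus statistics
# scores higher, which is how the LM rectifies semantic errors.
lm = TrigramLM(["abab", "abab"], vocab="ab")
assert lm.score("abab") > lm.score("abba")
```

In rescoring, the recognizer's per-position scores are combined with lm.score (usually in log space with a weighting factor) before picking the final transcription.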

Through further analysis of the CASIA-HWDB and ICDAR-2013 datasets, we also find that the datasets restrict the recognition accuracy of the compared methods to a certain extent. On the one hand, there are some incomplete or unclear characters in the offline handwritten Chinese text dataset, which affects the text recognition results. On the other hand, although data augmentation was performed in the experiments, there are still many categories of Chinese characters, large differences in writing styles, and a huge gap between the amount of available data and the amount of data required for model training.

3) MODEL ROBUSTNESS ANALYSIS
The residual attention gate block increases the importance of representative features (handwriting pixels) and reduces the importance of irrelevant features (background pixels) by weighting in the feature extraction process. To verify the robustness of our model, we introduce grid distortion augmentation [43] and Gaussian noise on the CASIA-HWDB database, as shown in Figure 4. Recall that character handwriting pixels are black (represented by '0') and background pixels are white (represented by '1'). Subfigure 4(a) shows the original image and labels; the image size is 128 × 2400 pixels. Subfigure 4(b) shows the augmented images with grid distortion, where the interval is set as {32, 64, 128} pixels and the variance is set as {6, 12, 24}. Subfigure 4(c) gives the Gaussian noise images with mean m = 0 and variance σ ∈ [0.1 : 0.1 : 0.8]. As σ increases, the amount of noise contained in the image increases, which poses a greater challenge for text recognition.
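The Gaussian-noise corruption can be reproduced with a short helper; pixel values are assumed normalized to [0, 1] with '0' as ink and '1' as background, and the clipping back to [0, 1] is our own assumption:

```python
import random

def add_gaussian_noise(image, sigma, mean=0.0, seed=None):
    """Add i.i.d. Gaussian noise N(mean, sigma) to every pixel of a
    grayscale image (list of rows of floats in [0, 1]), clipping the
    result back into [0, 1]."""
    rng = random.Random(seed)
    return [[min(1.0, max(0.0, px + rng.gauss(mean, sigma))) for px in row]
            for row in image]

# sigma follows the experiment's range [0.1 .. 0.8]; larger sigma buries
# more of the ink/background contrast under noise.
clean = [[1.0] * 8 for _ in range(4)]
noisy = add_gaussian_noise(clean, 0.3, seed=0)
```

Seeding the generator makes the corrupted test set reproducible across evaluation runs.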
As can be seen from Figure 5, the character error rate of our model on the Gaussian noise database increases from 3.15% to 13.85%, an increase of less than 11 percentage points. Furthermore, when σ ≤ 0.3 in the Gaussian noise, the CER of our method is still better than those of the compared methods. This shows that our model can recognize noisy offline handwritten texts to a certain extent and has good robustness.

V. CONCLUSION
In this paper, we have proposed a residual-attention offline handwritten Chinese text recognition method based on fully convolutional neural networks. A smart residual attention gate block has been designed to increase the importance of representative features and reduce the importance of irrelevant features by weighting, and it helps to alleviate the problems of gradient explosion and vanishing gradients in deep convolutional neural networks. The expansion factor adjusts the number of tensor channels in the GateBlock convolution process, balancing the computing resources needed for model training against the ability of a gradient to propagate across multiple layers. Experiments show that our method exhibits superior performance on the CASIA-HWDB database.
In the future, we will continue to design and optimize the structure of the fully convolutional neural networks for offline handwritten Chinese text recognition, so that the method can be deployed under practical computational and other resource constraints.