A Transformer-Based Framework for Scene Text Recognition

Scene Text Recognition (STR) has become a popular and long-standing research problem in the computer vision community. Almost all existing approaches adopt the connectionist temporal classification (CTC) technique; however, these approaches are not very effective for irregular STR. In this article, we introduce a new encoder-decoder framework, built on the transformer architecture, to recognize both regular and irregular natural scene text. The proposed framework is divided into four main modules: Image Transformation, Visual Feature Extraction (VFE), Encoder and Decoder. First, we employ a Thin Plate Spline (TPS) transformation in the image transformation module to normalize the original input image and reduce the burden of subsequent feature extraction. Second, in the VFE module, we use ResNet as the Convolutional Neural Network (CNN) backbone to extract text image feature maps from the rectified word image. However, the VFE module generates one-dimensional feature maps that are not suitable for locating multi-oriented text on two-dimensional word images, so we propose 2D Positional Encoding (2DPE) to preserve the sequential information. Third, feature aggregation and feature transformation are carried out simultaneously in the encoder module. We replace the scaled dot-product attention of the standard transformer with an Optimal Adaptive Threshold-based Self-Attention (OATSA) model to filter noisy information effectively and focus on the most contributive text regions. Finally, we introduce a new architectural-level bi-directional decoding approach in the decoder module to generate a more accurate character sequence. We evaluate the effectiveness and robustness of the proposed framework on both horizontal and arbitrary text recognition through extensive experiments on seven public benchmarks: IIIT5K-Words, SVT, ICDAR 2003, ICDAR 2013, ICDAR 2015, SVT-P and CUTE80. We also demonstrate that our proposed framework outperforms most existing approaches by a substantial margin.

attention mechanism in its prediction module for the STR problem. The attention mechanism primarily learns the input text image pattern, the output text sequence pattern and the mapping between them by examining the encoded feature vectors and the final character sequences. A wide variety of attention-based approaches has evolved in the STR field, influenced by the growth of machine translation frameworks. However, the attention mechanism has several flaws: it needs additional storage space and computation power, it suffers from the attention drift problem, and the latest attention mechanism research is primarily focused on languages with small character sets (e.g., English, French).

The Transformer [16], a modern attention alternative, has been extensively used to increase parallelization and minimize complexity for STR. Some efforts were made to replace recurrent neural networks with non-recurrent structures in the domain of regular text recognition, such as convolution-based and attention-based approaches. Nevertheless, both approaches depend on seq2seq structures, which are inadequate for handling arbitrarily shaped text. It is worth mentioning that the Transformer was originally designed for language translation tasks, such as English to French, and takes one-dimensional sequences as its input. Since position information is not naturally encoded within the input sequences, the model is less sensitive to the positioning of input sequences than RNN and LSTM frameworks with their associative bias. The Transformer is permutation equivariant because its Self-Attention (SA) and Feed-Forward Network (FFN) layers compute the result for each component of the input sequence separately. While the 1D Positional Encoding (PE) approach employed in the Transformer can handle the permutation equivariance issue for the 1D sequences found in NLP, it cannot preserve the horizontal and vertical features produced by CNNs for a 2D input image.

In short, the primary contributions of our research work are as follows:

• The Transformer [16] in Natural Language Processing (NLP) takes only 1D sequences as its input, whereas scene text recognizers must handle 2D images. To solve this permutation equivariance problem and to preserve the order of sequential information, we modify the conventional transformer architecture to recognize text in scene images. Specifically, we introduce a mechanism that converts the spatial encoder from 1D to 2D by expanding the standard transformer's 1D Positional Encoding (1DPE) to 2D Positional Encoding (2DPE).

• Input word images in natural scenes take different forms, including curved and skewed text. If such input word images are passed on unchanged, the feature extraction step must learn a representation invariant to such geometry. To remove distortion from the input word images and make text recognition easier, we employ a Thin Plate Spline (TPS) transformation [4]. The rectified or normalized images improve text recognition accuracy, particularly on datasets dominated by arbitrary and perspectively distorted text. TPS can be enabled or disabled in our framework.

• We propose a new mechanism called the Optimal Adaptive Threshold-based Self-Attention (OATSA), which replaces the standard scaled dot-product attention of the transformer to filter noisy information effectively and focus on the most contributive text regions.

text recognizer adopted to classify and capture character sequences.

Recent deep networks can learn robust representations that are tolerant to imaging distortions and changes in text style, but they still have trouble handling scene text with viewpoint and curvature distortions. To deal with such issues, Zhan and Lu [25] established an end-to-end STR network called ESIR, which iteratively reduces viewpoint distortion and text line curvature and thereby improves the performance of STR systems. The pose of text lines in scenes is estimated using an innovative rectification network that introduces a novel line fitting transformation. In addition, an iterative rectification mechanism corrects scene text distortions into a fronto-parallel view. Litman et al. [26] presented a new encoder-decoder architecture named Selective Context ATtentional Text Recognizer (SCATTER) for predicting character sequences against complicated image backgrounds. A deep Bi-LSTM encoder is designed to encode contextual dependencies, and a two-step 1D attention method is used to decode the text.

Instead of rectifying the complete text image, Liu et al. [27] suggested using a Character-Level Encoder (CLE) to identify and rectify individual characters in the word image. The Arbitrary Orientation Network (AON) was developed by Cheng et al. [28] to directly capture deep feature representations of irregular text in four directions along with character location clues. A filter gate mechanism was designed to integrate the four-direction character feature sequences, and an attention-based decoder was employed to generate the character sequence.

Transformer'' to handle string copying and other rational interpretation with string lengths that are longer than those seen during training. There have also been numerous attempts to interpret scene text without the use of recurrent networks.

Based on the Transformer model, Chen et al. [35] developed a new non-recurrent seq2seq framework for STR, which includes a self-attention block as a fundamental component in both the encoder and decoder architecture to capture character dependencies. Yang et al. [36] proposed a simpler yet more powerful STR network based on holistic representation-guided attention.

An attention-based sequence decoder is linked directly to two-dimensional CNN features, and the holistic representation steers the attention-based decoder to concentrate more precisely on text regions. Because of their inherent model architectures, most of these existing approaches focus on regular STR and find it hard to recognize irregular text.

In contrast to approaches built purely on convolution networks or attention mechanisms, we propose a simple but powerful STR model with an Optimal Adaptive Threshold-based Self-Attention (OATSA) mechanism in this paper. This method directly maps word images into character sequences and works well on both horizontal and arbitrarily shaped scene text images.

We propose a modified Transformer-based architecture to recognize arbitrarily shaped text in natural scene images. Fig. 1 illustrates the overall pipeline of the proposed framework. The modified transformer can be categorized into four main modules: Image Transformation, VFE, Encoder and Decoder. Both the encoder and decoder use a multi-layer stack of transformers. The encoder module is designed to obtain a high-level feature representation from a scene text image, while the decoder block generates the sequence of characters from the feature maps while attending to the encoder output. Transformers are attention-based deep-learning architectures that use a self-attention module to scan through each constituent of a sequence and update it by accumulating information from the entire sequence. The attention mechanism greatly helps the transformers capture global dependencies between the input and output sequences, relationships that previous deep-learning approaches find challenging to capture.
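To make the composition of these modules concrete, the following minimal PyTorch sketch shows how the four stages could be wired together. The class and argument names (`tps`, `backbone`, `encoder`, `decoder`) are illustrative placeholders under our reading of Fig. 1, not the exact implementation used in this work.

```python
import torch
import torch.nn as nn

class STRFramework(nn.Module):
    """High-level composition of the four modules (illustrative sketch only)."""
    def __init__(self, tps, backbone, encoder, decoder, num_classes, d_model=512):
        super().__init__()
        self.tps = tps            # Image Transformation: optional TPS rectification
        self.backbone = backbone  # VFE: ResNet trunk producing an H x W x d feature map
        self.encoder = encoder    # transformer encoder stack (with OATSA blocks)
        self.decoder = decoder    # bi-directional transformer decoder stack
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, images, target_tokens):
        rectified = self.tps(images)                   # normalize curved / perspective text
        feats = self.backbone(rectified)               # (B, d, H, W) visual feature map
        memory = self.encoder(feats)                   # (B, H*W, d) encoded features
        dec_out = self.decoder(target_tokens, memory)  # attend over the encoder output
        return self.classifier(dec_out)                # per-step character logits
```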

Text recognizers are most effective when their input images contain tightly bounded regular text. This encourages us to apply a spatial transformation before recognition to convert input images into ones that recognizers can read easily. We employ a TPS transformation to transform an input text image (I) into a normalized image (I'), as shown in Fig. 2. Text images come in various shapes, for example tilted, perspective and curved text, and such complex-shaped text images force the feature extraction step to learn a representation invariant to such geometry. The TPS rectification algorithm is an alternative to the STN that has been used on text lines of various aspect ratios to reduce this complexity. TPS interpolates between a collection of fiducial points using a smooth spline: it identifies several fiducial points along the upper and lower envelopes of the text, and then normalizes the character region to a predefined rectangle.

ResNet eases the training of very deep networks through its ''identity shortcut connection''. Therefore, we use ResNet as the CNN backbone for the VFE.

Self-attention is then computed over all positions in the input F, resulting in the attention score tensor S with dimension H×W×d'. The multi-head self-attention layer can better capture 2D spatial information using the positionally encoded map F. By introducing a fixed 2D positional encoding P(.) as given in Eq. (1)-Eq. (4), we generalize the original transformer's 1D encoding to be suitable for the 2D image feature.
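As an illustration of the VFE stage (not the paper's exact configuration), a ResNet backbone truncated before global pooling can produce the 2D feature map F discussed below; the 1x1 projection to d channels and the torchvision ResNet-50 trunk are assumptions made for this sketch.

```python
import torch
import torch.nn as nn
from torchvision import models

class VisualFeatureExtractor(nn.Module):
    """Sketch of the VFE: ResNet trunk producing a 2D feature map F of shape (B, d, H, W)."""
    def __init__(self, d_model=512):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # keep everything up to the last residual stage; drop avgpool and fc
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)  # project to d channels

    def forward(self, x):
        f = self.trunk(x)    # (B, 2048, H, W); e.g. H=1, W=8 for a 32x256 crop
        return self.proj(f)  # (B, d_model, H, W) feature map F

# usage: F = VisualFeatureExtractor()(torch.randn(2, 3, 32, 256))
```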

PE(hor, ver, 2i) = sin(pos(hor) · fre_i)    (1)
PE(hor, ver, 2i + 1) = cos(pos(hor) · fre_i)    (2)
PE(hor, ver, 2j + ch/2) = sin(pos(ver) · fre_j)    (3)
PE(hor, ver, 2j + 1 + ch/2) = cos(pos(ver) · fre_j)    (4)

where pos(hor) and pos(ver) represent the horizontal and vertical positions respectively, fre_i, fre_j ∈ R are the learnable frequencies of the 2D positional encoding signal, ch is the number of channels in F and i, j ∈ [0, d/4]. The position code P and the feature map F are combined so that the model can notice each character's positional information; the 2D encoding map P is added to F, represented by F = F + P. The Transformer's encoder only takes a set of vectors as input, so the d channels of F are vectorized and stacked together to create a single feature matrix M, as shown in Eq. (5). The function Mat2Vec converts each channel of the feature map into a vector, represented by x_{i,j} = Mat2Vec(F(:, :, j)) ∈ R^(1×HW).

The self-attention module computes a weighted sum over the values matrix V ∈ R^(W×d) (see Eq. (9)) using the corresponding key matrix K ∈ R^(W×d) (see Eq. (8)) and query matrix Q ∈ R^(W×d) (see Eq. (7)). The queries, keys and values for the self-attention modules are all generated from the same sequence, and the attention output matrix is derived as in Eq. (6).

The self-attention operation in the classic Transformer architecture, on the other hand, has an evident flaw: it distributes credit to all context components. This is inappropriate, since a lot of credit may be given to information that is irrelevant and should be discarded. For example, the traditional self-attention method calculates attention weights by multiplying the given query with the keys from several modalities, and the weighted sum is then obtained by applying the attention matrix to the values. However, many irrelevant words may have only a minimal association with the encoded image attributes, leading to very small values after multiplying the given query by the keys. When attention scores are relatively close, a SOTA approach such as constrained local attention cannot filter irrelevant information and will break the long-term dependency.
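A minimal sketch of one way to realize the 2D positional encoding of Eqs. (1)-(4): the first ch/2 channels encode the horizontal position and the last ch/2 channels the vertical position. For simplicity the frequencies here follow the usual fixed 1/10000^(2i/d) schedule, whereas the text above describes learnable frequencies; treat this as an assumption of the sketch.

```python
import torch

def positional_encoding_2d(height, width, channels):
    """Sketch of 2DPE: channels [0, ch/2) carry horizontal sinusoids, [ch/2, ch) vertical."""
    assert channels % 4 == 0, "channels must be divisible by 4"
    pe = torch.zeros(channels, height, width)
    half = channels // 2
    # assumed fixed frequency schedule, analogous to the 1D Transformer encoding
    freq = torch.exp(torch.arange(0, half, 2).float() * (-torch.log(torch.tensor(10000.0)) / half))

    pos_w = torch.arange(width).float().unsqueeze(1)    # horizontal positions
    pos_h = torch.arange(height).float().unsqueeze(1)   # vertical positions

    # Eqs. (1)-(2): horizontal sinusoids fill channels [0, ch/2)
    pe[0:half:2, :, :] = torch.sin(pos_w * freq).t().unsqueeze(1).repeat(1, height, 1)
    pe[1:half:2, :, :] = torch.cos(pos_w * freq).t().unsqueeze(1).repeat(1, height, 1)
    # Eqs. (3)-(4): vertical sinusoids fill channels [ch/2, ch)
    pe[half::2, :, :] = torch.sin(pos_h * freq).t().unsqueeze(2).repeat(1, 1, width)
    pe[half + 1::2, :, :] = torch.cos(pos_h * freq).t().unsqueeze(2).repeat(1, 1, width)
    return pe  # added to the feature map: F = F + pe

# usage: F = F + positional_encoding_2d(H, W, d_model).to(F.device)
```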

In our proposed work, we integrate a new threshold module, namely Optimal Adaptive Threshold-based Self-Attention (OATSA), into the standard self-attention calculation, as shown in Fig. 5. The elements lower than the threshold multiplicative factor (mean(p_iw) · t) are assigned negative infinity (see Eq. (11)), so that they receive zero weight after the softmax.

The attention vectors are then taken ''one at a time'' by the Feed-Forward Network. The finest part is that, unlike with an RNN, each of these attention vectors is independent of the others. As a result, parallelization can be exploited here, which makes a huge impact: we can feed all of the words into the encoder block at the same time and obtain the set of encoded vectors for each word simultaneously.
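As a small illustration of this position-wise independence (a generic sketch, not the paper's exact layer sizes), the feed-forward network can be applied to all positions of the encoded sequence in a single batched call:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Applies the same two-layer network to every position independently."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x):       # x: (batch, seq_len, d_model)
        return self.net(x)      # every position is transformed in parallel

# all H*W attention vectors are processed in one call, with no recurrence across positions
out = PositionwiseFFN()(torch.randn(2, 64, 512))
```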

The decoder is made up of N identical layers of transformer decoders, and the embeddings of the decoded output character sequence are fed into it. Each layer contains three sub-layers. Masked MHSA is used in the first sub-layer; the masking prevents the model from seeing future data and ensures that only the previously generated characters are used to produce the current one. An MHSA layer without masking makes up the second sub-layer: it applies multi-head attention over the first sub-layer's result and serves as the foundation for correlating text and image information. A position-wise fully connected FFN forms the third sub-layer. Following layer normalization, the Transformer establishes a residual connection around all three sub-layers. To convert the Transformer's output into probabilities for each character in the sequence, we attach a Fully Connected (FC) layer and a softmax layer on top. Unlike the LSTM, all the characters in the phrase can be generated simultaneously. The top encoder's output is converted into a set of K and V attention vectors, which are utilized by each decoder layer in its ''encoder-decoder attention'' sub-layer and help the decoder focus on the appropriate positions in the input sequence.
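A minimal sketch of one decoder layer with the three sub-layers described above. For brevity it uses PyTorch's standard scaled dot-product attention module, whereas the proposed framework substitutes the OATSA variant sketched later; the dimensions and post-norm placement are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Masked self-attention -> encoder-decoder attention -> position-wise FFN,
    each wrapped in a residual connection followed by layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt, memory):
        T = tgt.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tgt.device), diagonal=1)
        # 1) masked self-attention: each character may only look at previous characters
        x = self.norm1(tgt + self.self_attn(tgt, tgt, tgt, attn_mask=causal)[0])
        # 2) encoder-decoder attention: queries from the decoder, K/V from the encoder output
        x = self.norm2(x + self.cross_attn(x, memory, memory)[0])
        # 3) position-wise feed-forward network
        return self.norm3(x + self.ffn(x))
```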

FIGURE 5.
Working procedure of the proposed Adaptive Threshold-based attention mechanism, which retains the most participative elements by assigning them higher probabilities. a) The attention matrix p_ij is obtained by computing the dot product between the key (K) and the query (Q). b) The attention matrix p_ij is split into n (here 3) chunks. c) The mean value is computed for each chunk, and element values are set to −∞ when the mean is less than the threshold value (0.7). d) The softmax function is applied to p_t, replacing −∞ with 0. The final matrix contains the most contributive elements.
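A minimal sketch of the adaptive-threshold masking, following the per-row reading in which score entries below mean(p_iw) · t are masked before the softmax. The threshold factor t = 0.7, the row-wise granularity (rather than explicit chunking) and the safeguard that never masks a row's maximum are assumptions of this sketch, and the exact form of Eq. (11) may differ.

```python
import torch
import torch.nn.functional as F

def oatsa_attention(q, k, v, t=0.7):
    """Sketch of Optimal Adaptive Threshold-based Self-Attention.

    Score entries below mean(row) * t are set to -inf, so the softmax assigns
    them zero weight and attention concentrates on the most contributive elements.
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # raw attention scores p_ij
    thresh = scores.mean(dim=-1, keepdim=True) * t        # adaptive per-row threshold
    is_row_max = scores == scores.amax(dim=-1, keepdim=True)
    drop = (scores < thresh) & ~is_row_max                # never drop a row's maximum
    weights = F.softmax(scores.masked_fill(drop, float('-inf')), dim=-1)  # -inf -> 0
    return weights @ v

# usage with (batch, heads, seq, d) tensors: out = oatsa_attention(Q, K, V, t=0.7)
```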
For instance, in certain fonts, a decoder that recognizes the character sequence from left to right (L2R) may have trouble choosing the initial letter between an upper-case 'I' and a lower-case 'l'. These initial characters are difficult to differentiate visually, and the decoder has no memory of previously deciphered characters. Such challenging characters can be recognized easily by a right-to-left (R2L) decoder, since the succeeding characters suggest the initial character based on the surrounding language. Decoders that operate in opposite directions can therefore be complementary.

We propose an architectural-level bi-directional decoder (see Fig. 6), which comprises decoders with opposing directions, to make use of the dependencies in both ways. The decoder is designed to predict text from both directions (L2R and R2L), so two recognition results are generated after running it. During inference, to aggregate the outcomes, we simply choose the result with the highest log-softmax recognition score, which is the sum of all predicted symbols' recognition scores, as sketched below. In addition to positional embedding and token embedding, we introduce a direction embedding during decoding to add more contextual information. The framework is instructed to decipher the text string from L2R or R2L using this direction embedding.
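A minimal sketch of this inference-time fusion, under the assumption that each direction returns its characters together with per-step log-softmax scores; `decode_l2r` and `decode_r2l` are hypothetical helpers standing in for the two decoding directions.

```python
import torch

def bidirectional_decode(memory, decode_l2r, decode_r2l):
    """Run both decoding directions and keep the hypothesis with the higher total score."""
    chars_f, logp_f = decode_l2r(memory)   # forward (L2R) characters and per-step log-probs
    chars_b, logp_b = decode_r2l(memory)   # backward (R2L) characters and per-step log-probs
    score_f = logp_f.sum()                 # sequence score = sum of per-symbol scores
    score_b = logp_b.sum()
    if score_f >= score_b:
        return chars_f
    return list(reversed(chars_b))         # put the R2L output back into reading order

# usage (hypothetical): text = bidirectional_decode(encoder_out, decode_l2r, decode_r2l)
```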

In this paper, we train our framework with only two synthetic datasets: Synth90k [45] and SynthText [46]. We evaluated the significance and robustness of our proposed STR framework on seven standard benchmark datasets: four regular scene text datasets and three irregular scene text datasets.
Synth90k is a synthetic text dataset proposed by Jaderberg et al. [45]. A total of 9 million word images was generated from a lexicon of 90k common English words. The entire dataset is used only for training. Each image in Synth90k has a word-level ground-truth annotation. These images were generated with a synthetic text engine and are quite realistic.

SynthText is another synthetic text dataset used only for training, proposed by Gupta et al. [46]. The image generation process is similar to that of [45]. Unlike [45], however, SynthText was originally created for text detection, and characters are rendered onto full-size images.

IIIT5K-Words [38] (IIIT5K) contains 3,000 cropped word test images collected from the internet. Each image is associated with a 50-word and a 1,000-word lexicon; some lexicon words were generated randomly and the rest were taken from a dictionary.

The Street View Text [39] (SVT) dataset comprises 647 cropped text images acquired from Google Street View (GSV). Each image is associated with a 50-word lexicon. Most of the images in the SVT dataset are severely distorted, noisy, blurred or of low resolution.

ICDAR 2003 [40] (IC03) consists of 251 scene text images with text-labelled bounding boxes. For a fair comparison, we excluded word images containing non-alphanumeric characters or fewer than 3 characters, as suggested by Wang et al. [20]. The updated dataset comprises 867 cropped word images. Images in the IC03 dataset include both a 50-word lexicon and a ''full lexicon''.

ICDAR 2013 [41] (IC13) inherits most of its image samples from its predecessor, IC03. For a fair comparison, words with non-alphanumeric characters were removed from the dataset. The filtered test dataset contains 1,015 cropped word images with no lexicon associated with them.

We implemented our method using the PyTorch framework.

For image encoding and feature extraction, we experimented with various CNN models (see Table 3), such as VGG16, ResNet18, ResNet34, ResNet50 and ResNet164. Among these, ResNet50 provides an excellent balance between recognition accuracy and computational cost.

To capture relationships between distant image patches, we initially built a self-attention layer on top of the convolutional layers. Conversely, we dropped the self-attention layer from the decoder to analyze its influence on the decoder side. The recognition accuracy of the ablated model is lower than that of the standard system (91.6% vs. 97.7% on the IIIT5K dataset and 83.3% vs. 90.6% on the SVT-P dataset), as shown in Table 5, but it is still comparable to earlier approaches.

In contrast to language translation tasks, we identified that the self-attention mechanism on the encoder side has a smaller impact on STR performance. We believe there are three possible reasons. First, the character sequences in standard STR tasks are typically shorter than those handled in machine translation. Second, the CNN-based encoder already represents long-range relationships effectively; for example, the receptive field of ResNet50's final feature layer has a strong influence on long-term dependencies. Third, self-attention in machine translation is employed to model relationships between words in a phrase or even a paragraph.

TABLE 5. Encoder and decoder performance comparison with and without the self-attention block. Comparing row 1 and row 2, dropping the self-attention block from the decoder side of our framework produces a significant performance drop. Rows 2 and 3 show that adding the self-attention block to the encoder side yields a slight improvement.

Normal surpasses Reversed on SVT, IC15 and CUTE, while Reversed excels on IIIT5K, IC03 and SVT-P. Even in the worst case, the difference in recognition accuracy between the Normal and Reversed decoders is minimal, and when they are combined they provide a significant performance improvement.

We also carried out experiments to see how text rectification impacts our framework, employing the image normalization technique described earlier as the rectification approach. Even without an image normalization block, the proposed Transformer-based 2D attention mechanism can locate individual characters scattered in 2D space; in this context, the image normalization block has a minimal effect on our framework (see Table 7).

We provide a novel technique called Optimal Adaptive Threshold-based Self-Attention (OATSA) that effectively realizes an explicitly sparse Transformer. The performance and importance of the OATSA technique are shown in Table 8. The OATSA technique preserves the long-term dependencies, which are defined by the distribution of neighbouring nodes, and focuses the attention of the standard Transformer on the most contributive components.

The Optimal Adaptive Threshold is integrated into self-attention and acts as the attention mechanism in the decoder, allowing the model to produce more accurate words.

TABLE 8. Performance comparison among four variations. ResNet50 is used as the proposed model's feature extraction module. ''Rectification'' and ''no rectification'' indicate whether the image normalization step was performed. ''2D positional encoding'' represents the model that keeps track of each character's position during recognition. OATSA denotes the Optimal Adaptive Threshold-based Self-Attention algorithm. For the decoders, ''Normal'' denotes the L2R direction, ''Reversed'' the R2L direction and ''Bidirectional'' their combination.

TABLE 9. Overall performance of our STR model compared with previous state-of-the-art approaches on several benchmarks. All values are percentages (%). All results are reported in the lexicon-free setting, denoted ''None''. ''90K'', ''ST'', ''SA'' and ''Wiki'' stand for Synth90k, SynthText, SynthAdd and Wikitext-103, respectively; ''word'' and ''char'' denote the use of word-level or character-level annotations; and ''self'' denotes the use of a self-designed convolution network or self-made synthetic datasets.
on IC15 and +2.8% on the CUTE dataset). Our method outperforms the prior SOTA approach by Yang et al. [36] by a margin of +7.7% on SVT, +4.8% on IC13, +14.2% on IC15, +9.7% on SVT-P and +5.9% on CUTE. This significant improvement validates the effectiveness of our method. We are only 0.6% behind Zhang et al. [31] on CUTE.

In this paper, we presented a new, simple yet powerful framework for both regular and irregular STR based on the transformer architecture. The proposed framework breaks down into four modules: image transformation, feature extraction, encoder and decoder. First, the transformation module utilizes a Thin Plate Spline (TPS) transformation to normalize an irregular or arbitrary word image into a more readable one, which greatly helps to reduce the complexity of extracting text features. Second, the Visual Feature Extraction (VFE) module uses ResNet as the CNN backbone to extract well-defined feature representations and expands the standard transformer's 1D Positional Encoding (1DPE) to a 2D Positional Encoding (2DPE) to capture the order of sequential information from the 2D rectified word image. Third, the Multi-Head Self-Attention (MHSA) and Feed-Forward Network Layers (FFNL) in the encoder module perform feature aggregation and feature transformation concurrently. Finally, we proposed a new Optimal Adaptive Threshold-based Self-Attention (OATSA) model and an architectural-level bi-directional decoding approach in the decoder module, which together help the framework generate a more accurate character sequence. The OATSA model replaces the standard scaled dot-product attention; it can be used in both the encoder and decoder modules to filter noisy information effectively and choose the most contributive components, focusing on image text regions. The proposed framework is trained with word-level annotations and can handle words of any length in the lexicon-free mode. Comprehensive experimental results on challenging standard benchmarks demonstrate the effectiveness and robustness of the proposed framework.