Toward Arbitrary-Shaped Text Spotting Based on End-to-End

At present, text spotting in natural scenes has become one of the research hotspots. Curved text and long text are the main difficulties of text spotting in natural scenes. To better solve these two types of problems, we propose a novel end-to-end text spotting model. The model includes three parts: a shared convolution module, a text detector module and a text recognizer module. For the problem of long text, we adopt a corner attention mechanism to extract the features of long text more effectively. For the problem of curved text, we feed the rectified feature map into the SA-BiLSTM decoder to recognize curved text more effectively. More importantly, a joint optimization strategy allows the text detection task and the text recognition task to promote each other. Experimental results on the TotalText, ICDAR2015, ICDAR2013, CTW1500, COCO-Text and MLT datasets show that our method achieves excellent performance and robustness in end-to-end text spotting in natural scenes.


I. INTRODUCTION
Text spotting in natural scenes has high research value and a wide range of application scenarios, and has become one of the hot research topics. On the one hand, text is an important carrier for the spread of human civilization; on the other hand, its rich high-level semantic information helps us understand the world better. Moreover, text information in natural scenes has a very wide range of applications, such as image search, instant translation, robot navigation, assisted reading for the blind, and industrial automation. Therefore, text spotting in natural scenes has important research value.
However, the inherent characteristics of text in natural scenes increase the difficulty of text spotting. First, the diversity of text: text varies in language, font, font size, shape, etc., as shown in Figure 1.a. Second, the complexity of the background: the background may contain many objects similar to text, such as leaves, bricks, windows and fences, as shown in Figure 1.b. Finally, unsatisfactory data quality: imperfect imaging conditions often lead to low-quality data, such as low resolution, distortion and blur, as shown in Figure 1.c.
The associate editor coordinating the review of this manuscript and approving it for publication was Huazhu Fu.

Traditional OCR (optical character recognition) methods generally decompose text spotting in natural scenes into two independent subtasks: text detection and text recognition [1]. Text detection determines whether there are text instances in the picture; text recognition recognizes the content of the text. This OCR pipeline has achieved good results on text spotting problems in natural scenes. However, it ignores the inherent connection between text detection and text recognition. First, training errors accumulate: errors in the text detection stage are passed to the text recognition process, leading to worse recognition performance. Second, the text detection task and the text recognition task cannot be optimized at the same time.
To address the issues of current OCR methods, end-to-end OCR processing has been proposed by Li et al. [2], Liu et al. [3], He et al. [4], Sun et al. [5] and Lyu et al. [6]. Their common idea is that the text detection branch and the text recognition branch share a feature extraction network. Therefore, the text detector and the text recognizer can be optimized jointly, which effectively alleviates the problem of error accumulation.
The end-to-end method proposed by Li et al. [2] achieves good performance on horizontal text datasets, but it cannot handle curved text in natural scenes well. To better deal with curved text, Liu et al. [3], He et al. [4] and Sun et al. [5] proposed similar solutions. Their common pipeline is: first, a feature extraction network extracts the features of the text area; second, the feature maps are rectified; finally, the rectified feature maps are fed into the text recognizer.
Although the method of Lyu et al. [6] improves text detection performance, it loses the potential order information between characters. On the one hand, even if each character is detected correctly, it is difficult to connect them into words correctly. On the other hand, the authors assume by default that text is read from left to right, so non-traditional text directions cannot be handled. The paper [7] adopts a joint optimization strategy that makes good use of the potential internal connection between text detection and text recognition, but it cannot well avoid the interference of complex backgrounds.
To better solve the above problems, we propose a novel end-to-end text spotting framework, which uses a joint optimization strategy to effectively utilize the inherent connection between the text detection and text recognition tasks. First, the text detector uses a corner attention mechanism [8], which better handles long texts. Second, the TPDM (Text Point Detection Module) better avoids the interference of complex backgrounds. In addition, a feature rectification network rectifies the feature maps, and the rectified feature maps are then fed into the recognizer, which benefits the recognition of curved text. Most importantly, SA-BiLSTM (spatial attention mechanism with BiLSTM) is a text recognition model combining a spatial attention mechanism with BiLSTM, which more effectively extracts the semantic information between characters.

II. RELATED WORK

A. TEXT DETECTION IN NATURAL SCENE
The main difficulty of text detection in natural scenes stems from the characteristics of such texts, for example the diversity of text directions and languages. The paper [9] transformed the text detection task into a series of text box detections and introduced RNNs (recurrent neural networks) to improve the detection effect; it works well on horizontal text but is not suitable for non-horizontal text. The paper [10] first cuts each word into small directional text segments that are easier to detect, and then connects neighboring segments with links to form words, which is conducive to detecting words and text lines with a wide range of lengths and orientations. The paper [11] first uses an FCN (fully convolutional network) to generate multi-scale fused feature maps, and then directly performs pixel-level text block prediction on this basis, supporting two types of text area annotations: rotated rectangles and arbitrary quadrilaterals.

B. TEXT RECOGNITION IN NATURAL SCENE
Natural scene text recognition includes two categories: CTC-based methods and attention-based methods. The paper [12] used the CTC-based method for the first time in a recognition system and achieved good results. The papers [13], [14], [15], [16] and [17] adopted improved CTC methods, further verifying the effect of CTC. Paper [18] first proposed the attention mechanism to solve machine translation, and it is now widely used in natural scene text recognition. Paper [19] proposed an encoder-decoder attention mechanism, which better adapts to the text recognition problem. For irregular text recognition (warped and curved text), the paper [20] combined an attention mechanism with a spatial transformer network to improve the performance of irregular text recognition.

C. TEXT SPOTTING BASED ON END-TO-END IN NATURAL SCENE
Text detection and text recognition are often regarded as two independent sub-problems, which ignores the intrinsic connection between them. Therefore, end-to-end text spotting has become one of the research trends. The paper [21] uses a text detector based on SSD (single shot multibox detector) [22] and a text recognizer based on CRNN (convolutional recurrent neural network) [23]. Paper [2] uses a text detector based on an RPN (region proposal network) [24] and a text recognizer based on an attention LSTM (long short-term memory) mechanism. Papers [2] and [7] jointly optimize the text detector and the text recognizer to achieve overall performance improvement.
The advantages of our end-to-end text spotting framework are as follows. First, the joint optimization model effectively uses the potential internal connection between the text detection and text recognition tasks to improve overall performance. Second, the corner attention mechanism better handles long texts; in addition, the TPDM better avoids interference from complex backgrounds. Finally, the rectified feature map is fed into the SA-BiLSTM recognizer, which more effectively extracts the semantic information between characters and is conducive to text recognition.
VOLUME 8, 2020

III. MODEL DESIGN

A. MODEL
We propose an end-to-end text spotting model for natural scenes. The model consists of three parts: a shared convolutional network, a text detector and a text recognizer. The scene text spotting flow chart is shown in Figure 2. First, the preprocessed image is fed into the shared convolutional network to extract shared feature maps. Then the shared feature maps are fed into the text detector and the text recognizer, which promote each other through Boxes and TPDM. Finally, a joint optimization strategy makes full use of the inherent connection between the text detection and text recognition tasks to improve the overall performance of text spotting in natural scenes.

B. MODEL FRAMEWORK
This paper proposes an end-to-end text spotting framework, as shown in Figure 3. It includes three parts: a shared convolutional feature extraction network, a text detector and a text recognizer. First, the picture is fed into the shared convolutional feature extraction network for feature learning, and the obtained feature map is input to the text detector and the text recognizer. The IRM (Iterative Refinement Module) from [8] can better adapt to the detection of long texts. In addition, the text detector uses boundary points to represent text instances, which is more suitable than rectangular boxes for detecting and recognizing text of any shape. The text recognition model based on the SA-BiLSTM decoder can more effectively extract the semantic information between characters. Each component is described below.

1) SHARED CONVOLUTION MODULE
The shared convolution module adopts the ResNet-50 [25] structure to extract shared features. Since texts in natural scenes usually have different sizes, a large receptive field and richer features are needed to adapt to texts of various sizes. We use dilated convolution to maintain a larger receptive field. Inspired by FPN (feature pyramid networks) [26], we concatenate low-resolution and high-resolution feature maps to extract richer text features. The size of the final output feature map is 1/4 of the input picture.
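As an illustration of the FPN-style fusion described above, the following sketch (our own simplification, not the paper's code) upsamples a low-resolution feature map to the high-resolution map's spatial size and concatenates them along the channel axis; the function name and shapes are hypothetical:

```python
import numpy as np

def fuse_features(f_low, f_high):
    """Upsample f_low (C1, h, w) to f_high's (C2, H, W) spatial size by
    nearest-neighbor repetition, then concatenate along the channel axis.
    Returns a (C1 + C2, H, W) fused feature map."""
    C1, h, w = f_low.shape
    C2, H, W = f_high.shape
    ry, rx = H // h, W // w          # integer upsampling factors
    f_up = f_low.repeat(ry, axis=1).repeat(rx, axis=2)
    return np.concatenate([f_up, f_high], axis=0)

# Example: a 1/8-scale map fused into a 1/4-scale map.
f_low = np.arange(12, dtype=float).reshape(3, 2, 2)
f_high = np.zeros((2, 4, 4))
fused = fuse_features(f_low, f_high)  # shape (5, 4, 4)
```

In the actual model the concatenation would be followed by a convolution (e.g. a dilated 3×3) to reduce channels while keeping a large receptive field.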

2) TEXT DETECTOR
The text detector is composed of three parts: the text regressor module, the iterative refinement module and the text point detection module, as shown in Figure 3.

a: TEXT REGRESSOR MODULE
Inspired by [11], TRM (Text Regressor Module) uses a fully convolutional sub-network as the text regressor. Based on the shared convolutional feature map, two channels are predicted pixel-wise: text and non-text. Following common practice, pixels in the text area are defined as positive samples, and pixels in the non-text area as negative samples. For each positive sample, 8 channels predict the four corners of the text box (one (x, y) offset per corner). TRM thus has two functions: the classification of text versus non-text, and text localization.
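As a sketch of how the 8 regression channels can be decoded, assuming (as is common in such pixel-wise regressors, though the paper does not spell it out) that each positive pixel predicts the four corner points as (x, y) offsets relative to its own location:

```python
import numpy as np

def decode_corners(px, py, offsets):
    """Given a positive pixel (px, py) and its 8 regression values
    (dx1, dy1, ..., dx4, dy4), recover the four corner points of the
    predicted text box as absolute coordinates. Returns a (4, 2) array."""
    off = np.asarray(offsets, dtype=float).reshape(4, 2)
    return off + np.array([px, py], dtype=float)

# A pixel at (10, 10) predicting a 10x6 axis-aligned box around itself.
corners = decode_corners(10, 10, [-5, -3, 5, -3, 5, 3, -5, 3])
```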
For the classification task, we use the scale-invariant dice-coefficient function proposed in [8], which can be written as $L_{cls} = 1 - \frac{2\,\mathrm{sum}(w \odot y \odot \hat{y})}{\mathrm{sum}(w \odot y) + \mathrm{sum}(w \odot \hat{y})}$, where sum is a cumulative function over the two-dimensional space, $y$ is the binary label map, $\hat{y}$ is the predicted map, and $w$ is a two-dimensional weight map.
For the text localization regression task, due to the robustness of the smooth L1 loss [26], we use smooth L1 to optimize the text regression loss $L_{loc}$. The loss function of TRM is therefore defined as $L_{trm} = L_{cls} + \alpha L_{loc}$, where $\alpha$ is a hyper-parameter used to balance the two sub-losses. In the experiments, $\alpha$ is set to 0.01.
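A minimal NumPy sketch of the two sub-losses, assuming the standard weighted dice form for the classification loss (the exact formulation in [8] may differ) and the usual smooth-L1 definition:

```python
import numpy as np

def dice_loss(y, y_hat, w, eps=1e-6):
    """Weighted dice loss: y is the binary label map, y_hat the predicted
    map, w a per-pixel weight map. Standard weighted-dice form; the
    scale-invariant variant in [8] may differ in detail."""
    inter = np.sum(w * y * y_hat)
    union = np.sum(w * y) + np.sum(w * y_hat)
    return 1.0 - 2.0 * inter / (union + eps)

def smooth_l1(x):
    """Elementwise smooth-L1: quadratic near zero, linear in the tails."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)

def trm_loss(y, y_hat, w, reg_err, alpha=0.01):
    """L_trm = L_cls + alpha * L_loc, with alpha = 0.01 as in the paper.
    reg_err holds the regression residuals for the positive pixels."""
    return dice_loss(y, y_hat, w) + alpha * smooth_l1(reg_err).mean()
```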

b: ITERATIVE REFINEMENT MODULE
To better detect long texts, we adopt the IRM (iterative refinement module) proposed in [8]. Since positions close to the corners of a text region can obtain more accurate boundary information within the same receptive field, a corner attention mechanism is introduced to regress the coordinate offset of each corner point. The loss function of IRM can be written as $L_{irm} = \frac{1}{K}\sum_{k=1}^{K}\sum_{j}\mathrm{SmoothL1}(C_k^j, \hat{C}_k^j)$, where $K$ is the number of detected text boxes selected from the output of the TRM step, $C_k^j$ is the offset of the $j$-th coordinate of the $k$-th text box, and $\hat{C}_k^j$ is the corresponding predicted value.

c: TEXT POINT DETECTION MODULE
Using ROI-Align to extract the features of the text quadrilateral introduces a lot of background noise, which affects the recognition network. Using boundary points to represent arbitrary-shaped text effectively avoids this problem. First, boundary points describe the precise text shape and eliminate the impact of background noise. Second, boundary points make it easy to rectify arbitrary-shaped text into horizontal text, which benefits the text recognition network.
TPDM consists of four stacked 3 × 3 convolutional layers and a fully connected layer. Inspired by RPN, where proposals are regressed based on default anchors, we use a similar method and set a group of default points for the text boundary. Specifically, N points are sampled at equal distances on each long side of the text instance as target boundary points. The corresponding default points are placed equidistantly along the long sides of the smallest enclosing quadrilateral. Instead of directly predicting the coordinates of the boundary points, the module first generates the offset of each point from its associated default point. It predicts a 4N-dimensional vector containing the 2-d coordinate offsets of the 2N boundary points. Given the coordinate offsets $(\Delta x, \Delta y)$, the boundary points $(x_b, y_b)$ can be calculated as $x_b = x_d + w_0 \Delta x$, $y_b = y_d + h_0 \Delta y$, where $(x_d, y_d)$ is the preset default point, and $w_0$, $h_0$ are the width and height of the text box output by the IRM, respectively.
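The boundary-point decoding above can be sketched as follows; `boundary_points` is a hypothetical helper that applies the predicted offsets, scaled by the IRM box size, to the default points:

```python
import numpy as np

def boundary_points(defaults, offsets, w0, h0):
    """Recover the 2N boundary points from the TPDM offsets.
    defaults: (2N, 2) default points placed equidistantly along the long
    sides of the IRM box; offsets: (2N, 2) predicted (dx, dy); w0, h0:
    width and height of the IRM text box.
    Implements x_b = x_d + w0*dx and y_b = y_d + h0*dy."""
    defaults = np.asarray(defaults, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    scale = np.array([w0, h0], dtype=float)
    return defaults + offsets * scale

# A default point at the origin shifted by (0.1, 0.2) in a 100x10 box.
pts = boundary_points([[0, 0], [1, 1]], [[0.1, 0.2], [0, 0]], 100, 10)
```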
The loss function $L_{tpdm}$ of TPDM is defined as $L_{tpdm} = \frac{1}{2N}\sum_{i=1}^{2N}\mathrm{SmoothL1}\left((x_{b,i}, y_{b,i}), (\hat{x}_{b,i}, \hat{y}_{b,i})\right)$, where $(x_{b,i}, y_{b,i})$ is the $i$-th predicted boundary point, whose associated target boundary point is $(\hat{x}_{b,i}, \hat{y}_{b,i})$.

3) TEXT RECOGNIZER

a: ArbitraryRoIAlign
In order to better adapt to curved text, we use a rectification network similar to that in [18] to rectify the features. Specifically, TPS (Thin-Plate-Spline) can rectify deformed images (affine, perspective, curved arrangement, etc.) to obtain rectified feature maps, which is convenient for text recognition.

b: SA-BiLSTM DECODER
SAM (spatial attention mechanism) was proposed in [27]. In contrast, we combine the spatial attention mechanism with BiLSTM to better extract the semantic information between characters. The structure of SA-BiLSTM is shown in Figure 4. Suppose T iterations are needed, and the predicted character sequence is $y = (y_1, \ldots, y_T)$. There are three inputs at step $t$: the input feature $F$, the hidden state $s_{t-1}$ of the previous iteration, and the character category $y_{t-1}$ predicted in the previous iteration.
Firstly, the hidden state $s_{t-1}$ is expanded into a feature map $s'_{t-1}$ of shape $(V, H_p, W_p)$, where $V$ is the size of the RNN hidden layer and is set to 256.
Secondly, the attention weights $\alpha_t$ are calculated as $e_t = W_f \tanh(W_t s'_{t-1} + W_s F + b)$, $\alpha_t = \mathrm{softmax}(e_t)$, where the shapes of $e_t$ and $\alpha_t$ are $(H_p, W_p)$, and $W_t$, $W_s$, $W_f$ and $b$ are trainable weights and a bias. Thirdly, the weighted feature $g_t$ is calculated as $g_t = \sum_{i,j} \alpha_t(i, j)\, F(i, j)$. The character category $y_{t-1}$ predicted in the previous iteration is embedded and concatenated with $g_t$ to form the RNN input $r_t$: $r_t = [W_y\, \mathrm{onehot}(y_{t-1}) + b_y;\ g_t]$, where $W_y$ and $b_y$ are a weight and a bias, and $N_c$ is the number of character classes of the sequence decoder. We set $N_c$ to 79, covering upper- and lower-case English letters, Arabic numerals, and several special characters. Then $r_t$ and $s_{t-1}$ are fed into the RNN (BiLSTM): $s_t = \mathrm{BiLSTM}(r_t, s_{t-1})$. Finally, the prediction result of the $t$-th iteration is $y_t = \mathrm{softmax}(W_o s_t + b_o)$. The loss function $L_{recog}$ of the text recognizer is defined as $L_{recog} = -\sum_{t=1}^{T} \log p(y_t)$, where $T$ is the length of the label sequence.
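The attention step of the decoder can be sketched in NumPy as follows; shapes are illustrative, and the score map `e_t` is assumed to have already been computed from the hidden state and the feature map:

```python
import numpy as np

def softmax2d(e):
    """Softmax over a (Hp, Wp) score map, numerically stabilized."""
    e = e - e.max()
    p = np.exp(e)
    return p / p.sum()

def attention_step(F_map, e_t):
    """One spatial-attention step: F_map is the (C, Hp, Wp) rectified
    feature map, e_t the (Hp, Wp) score map. Returns the attention
    weights alpha_t and the glimpse g_t = sum over positions of
    alpha_t * F, a (C,)-dimensional weighted feature."""
    alpha = softmax2d(e_t)                        # (Hp, Wp), sums to 1
    g = (F_map * alpha[None]).sum(axis=(1, 2))    # (C,)
    return alpha, g

# With a uniform score map, the glimpse equals the spatial mean of F.
F_map = np.random.rand(8, 3, 5)
alpha, g = attention_step(F_map, np.zeros((3, 5)))
```

In the full decoder, `g` would then be concatenated with the embedding of the previous character and fed to the BiLSTM.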

4) JOINT OPTIMIZATION AND LOSS FUNCTION
Our proposed text spotting framework uses a joint optimization strategy: the text detection and text recognition tasks share features and are optimized simultaneously. This saves computing time and makes better use of the internal connection between the two tasks. The loss function $L$ is therefore defined as $L = \sigma_1 L_{trm} + \sigma_2 L_{irm} + \sigma_3 L_{tpdm} + \sigma_4 L_{recog}$, where $\sigma_1$, $\sigma_2$, $\sigma_3$, $\sigma_4$ are used to balance the four sub-modules and are all set to 1.0 in the experiments.

IV. EXPERIMENTAL DESIGN AND ANALYSIS
A. DATA SETS

The data sets used in our experiments are introduced as follows. SynthText is a synthetic data set that contains 800,000 synthetic images with a large number of multi-oriented text instances.
TotalText is a comprehensive scene text dataset. It contains 1255 training images and 300 test images, covering horizontal, oriented and curved texts. The dataset provides word-level annotations.
Different from TotalText, CTW1500, proposed in 2017, is a scene text dataset containing arbitrary-shaped Chinese and English texts, with 1000 training images and 500 test images.
ICDAR2015 is a natural scene text dataset introduced in the ICDAR 2015 competition. Its images contain multi-oriented text, with 1000 training images and 500 test images. All pictures provide character-level and word-level annotations.
The COCO-Text dataset has a total of 63,686 images, of which 43,686 are used for training and 10,000 for testing.
The ICDAR2013 dataset contains only horizontal text. The training set contains 229 images, and the test set contains 233 images. The dataset provides both character-level and word-level annotations.
MLT is a multi-language scene text dataset. It contains 7200 training images, 9000 test images and 1800 validation images.

B. EXPERIMENTAL DETAILS
Different from the previous strategy of training text detection and text recognition independently or alternately, we use joint optimization based on the end-to-end text spotting model. The entire training process includes two stages: first, pre-training on the SynthText dataset, and then fine-tuning on real data (TotalText, ICDAR2015, ICDAR2013, CTW1500, COCO-Text and MLT).
The experiment uses the SGD optimization algorithm with a weight decay of 0.0001 and a momentum of 0.9. During the pre-training phase, the model is trained for 300K iterations; the initial learning rate defaults to 0.01 and is reduced to one-tenth at the 150K and 300K iterations.
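The step schedule described above can be expressed as a simple function (a sketch of the stated schedule, not the authors' code; in PyTorch this would typically be `torch.optim.lr_scheduler.MultiStepLR` with `gamma=0.1`):

```python
def learning_rate(step, base_lr=0.01, drops=(150_000, 300_000), factor=0.1):
    """Step learning-rate schedule for pre-training: start at base_lr
    and multiply by `factor` at each drop step (150K and 300K)."""
    lr = base_lr
    for d in drops:
        if step >= d:
            lr *= factor
    return lr
```

For the fine-tuning phase the same function would be used with `base_lr=0.001` and a single drop at 150K.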
During the fine-tuning phase, the initial learning rate is set to 0.001 and then reduced to one-tenth at 150K iterations. The fine-tuning process stops at 200K iterations. Our experimental model is implemented in PyTorch.
Label Generation: Since the training stage requires equidistant boundary points to train TPDM, we use the algorithm in [28] to sample along the long sides of the text boundary. N is set to 7.
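Equidistant sampling of the N target boundary points along a long side can be sketched as follows (our own polyline arc-length sampler, not necessarily the algorithm of [28]):

```python
import numpy as np

def sample_equidistant(polyline, n):
    """Sample n points at equal arc-length spacing along a polyline
    given as an (M, 2) array of vertices. Used here to generate the N
    target boundary points along each long side of a text region."""
    pts = np.asarray(polyline, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)   # segment lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative length
    targets = np.linspace(0.0, cum[-1], n)               # equal spacings
    out = np.empty((n, 2))
    for i, t in enumerate(targets):
        j = np.searchsorted(cum, t, side="right") - 1    # containing segment
        j = min(j, len(seg) - 1)
        r = 0.0 if seg[j] == 0 else (t - cum[j]) / seg[j]
        out[i] = pts[j] + r * (pts[j + 1] - pts[j])      # linear interpolation
    return out

# Three equidistant points on a straight 10-unit side.
samples = sample_equidistant([[0, 0], [10, 0]], 3)
```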

C. EXPERIMENTAL ANALYSES

1) CURVED TEXT
We conducted experiments on the TotalText dataset to verify the effectiveness of the model on arbitrary-shaped text. In the test phase, the long side of the picture is set to 1100. For a fair comparison, this paper follows the evaluation protocol of the latest method [29].
The performance of the experimental scheme proposed in this paper on the Total-Text dataset is shown in Table 1. As can be seen from Table 1, our method achieves state-of-the-art performance on both text detection and end-to-end text recognition tasks. In particular, compared with Boundary [28], our method is 0.9% and 2.4% higher on the text detection and end-to-end text spotting (without lexicon) tasks, respectively. Compared with [27], it improves text detection by 3.0%; note that [27] requires character-level annotations. The reasons for the performance improvement are as follows: first, the corner attention mechanism is conducive to the detection and recognition of long text; second, the SA-BiLSTM-based text decoder better extracts the semantic information of the text; finally, TPDM better avoids interference from complex backgrounds.
We further verified the effectiveness of our method on the CTW1500 dataset. The experimental results are shown in Table 2, which shows that we also achieve very good performance on CTW1500. In particular, on the text detection task, our method is 1.1% higher than [30].

2) ORIENTED TEXT
The experimental scheme proposed in this paper was tested on the ICDAR2015 dataset to verify its effectiveness on oriented text. The results are shown in Table 3. Compared with [28], our method shows improvements of 1.2% and 3.7% on text detection and end-to-end text spotting with the strong lexicon, respectively. In addition, compared with [27], our method improves text detection performance by 2.8%. On the COCO-Text dataset, we further verify the effectiveness on oriented text. The results are shown in Table 4. Compared with [27], our method improves by 1.2% and 0.9% on the text detection and end-to-end text spotting tasks, respectively.

3) HORIZONTAL TEXT
We conducted tests on the ICDAR2013 dataset to verify the effectiveness of the model on the horizontal text dataset. The results are shown in Table 5. It can be seen from Table 5 that the method proposed in this paper also achieves good performance on the horizontal data set. It should be noted that [27] requires character-level annotations.

4) MULTI-LANGUAGE
In order to verify the reliability of our method across languages, we conduct experiments on the MLT dataset. The experimental results are shown in Table 6. Our method also achieves good performance on the MLT dataset.

5) VISUALIZATION

Figure 5 shows visualization results on some of the data. As can be seen from the first two rows, the model handles arbitrarily shaped text well. In the third row, the complexity of the text background and the blurred image quality lead to false and missed detections.

6) ABLATION EXPERIMENT
Ablation experiments further verify our proposed model. The compared settings are as follows.
FULL: our full end-to-end text spotting framework.
No-TPDM: a model, named "No-TPDM", that removes the TPDM part from FULL. It is used to verify the effectiveness of TPDM.
SAM: a model, named "SAM", that uses the SAM proposed in [27] as the text recognizer of our model. It is used for comparison with our proposed SA-BiLSTM.
As shown in Table 7, on the ICDAR2015 dataset, FULL improves over No-TPDM by 4.2% and 1.9% on the text detection and end-to-end text spotting tasks, respectively. On the TotalText dataset, FULL improves over No-TPDM by 4.4% and 6.1%, respectively. Therefore, TPDM effectively shares features between text detection and text recognition, making full use of the inherent relationship between the two tasks to improve the overall performance of text spotting.
We also compare our SA-BiLSTM-based text recognizer with the SAM-based recognizer proposed in [27]. As shown in Table 7, our SA-BiLSTM text recognizer achieves good performance. In particular, on the TotalText dataset, FULL improves over SAM by 0.7% and 1.3% on the text detection and end-to-end text spotting tasks, respectively.

V. CONCLUSION
Aiming at the problem of arbitrarily shaped text spotting in natural scenes, we propose an end-to-end text spotting framework with a joint optimization strategy. Experiments show that the SA-BiLSTM-based text decoder better extracts the semantic information of the text, and that TPDM better avoids the interference of complex backgrounds. Our method achieves state-of-the-art performance on both text detection and end-to-end text recognition tasks. However, interference from image backgrounds highly similar to text still cannot be removed accurately. Therefore, text spotting against complex backgrounds remains future work for text spotting in natural scenes.