FDTA: Fully Convolutional Scene Text Detection With Text Attention

Text detection is the prerequisite for text recognition, and multi-oriented text detection is a current research hotspot. Due to the variability in size, spatial layout, color and orientation of natural scene text, natural scene text detection is still very challenging. Therefore, this paper proposes a simple and fast multi-oriented text detection method. First, we optimize the regression branch by designing a diagonal adjustment factor that makes the position regression more accurate, which increases the F-score by 0.8. Second, we add an attention module to the model, which improves the accuracy of detecting small text regions and increases the F-score by 1.2. Third, we introduce DR Loss to address the imbalance between positive and negative samples, which increases the F-score by 0.5. Finally, we conduct experimental verification and analysis on the ICDAR2015, MSRA-TD500 and ICDAR2013 datasets. The experimental results demonstrate that this method can significantly improve the precision and recall of scene text detection, and that it achieves competitive results compared with existing advanced methods. On the ICDAR2015 dataset, the proposed method achieves an F-score of 0.849 at 9.9 FPS at 720p resolution. On the MSRA-TD500 dataset, the proposed method achieves an F-score of 0.772 at 720p resolution. On the ICDAR2013 dataset, the proposed method achieves an F-score of 0.887 at 720p resolution.


I. INTRODUCTION
Scene text detection has important application value and research significance in real-time translation, image retrieval, scene analysis, geographic location and navigation for the blind. At present, scene text detection still has room for improvement. Text detection methods can be divided into traditional methods and deep learning-based methods.
Traditional text detection methods mainly use machine learning methods for classification, such as neural networks, SVM and K-Means. A neural network is a dynamic system with the topology of a directed graph, which processes information by responding to continuous or intermittent inputs. Neural networks have many applications in recent research [1]-[6]. SVM (Support Vector Machine) is a generalized linear classifier that performs binary classification through supervised learning. K-Means is an iterative clustering algorithm. Traditional text detection methods can be divided into sliding window methods and connected region methods. The sliding window-based methods [7], [8] mainly use sliding windows of different scales to search the whole image and extract the features in each window. Common connected region-based text detection methods include Maximally Stable Extremal Regions (MSER), the extremal region method and the Stroke Width Transform (SWT) [9], [10]. Traditional text detection methods usually include multiple steps: generating candidate regions, filtering candidate regions, constructing text lines and verifying text lines. Every module needs to be well designed in order to achieve good performance. These methods can achieve good performance in simple scenes. However, in complex natural scenes, such as those with uneven illumination or partial occlusion, traditional machine learning methods still cannot achieve ideal detection performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Shiping Wen.
In recent years, text detection methods based on deep learning have achieved a great breakthrough in performance compared with traditional text detection methods. Among them, the regression-based methods and the segmentation-based methods are the most widely used. The regression-based methods mainly adapt object detection frameworks, such as SSD [11], Faster-RCNN [12] and YOLO [13], to the characteristics of text. These methods obtain the detection result by regressing the shape of a horizontal rectangle, a rotated rectangle or a quadrilateral. The segmentation-based methods usually borrow the idea of semantic segmentation: they divide text pixels into different instances and obtain pixel-level text localization results through post-processing. However, the post-processing is usually complex and consumes a lot of computing resources. Both regression-based methods and segmentation-based methods have their limitations.
Because of the above problems, this paper proposes a simple and efficient multi-oriented text detection method, which is robust to changes in the text scale. Because FCOS [14] has good performance in object detection based on anchor-free prediction, we propose a multi-oriented scene text detection method based on the attention mechanism.
Our main contributions can be summarized as follows:
1) A diagonal adjustment factor is designed for the regression loss function, making the position regression more accurate.
2) A text attention module is added, which makes the network pay more attention to useful information and suppress useless information.
3) DR Loss is used to alleviate the imbalance between positive and negative samples, which further improves network performance.
4) Our method achieves competitive results in terms of speed and accuracy on several standard text detection benchmarks.

II. RELATED WORKS
In recent years, with the rapid development of deep learning, scene text detection has made great progress. This section focuses on the work most relevant to the method proposed in this paper.
The EAST [15] method combines multi-scale feature maps for dense pixel-by-pixel prediction. EAST also uses the rotated box (RBOX) and arbitrary quadrilateral (QUAD) geometries for position prediction. However, due to its limited receptive field, EAST cannot detect long text effectively. The TextBoxes++ [16] method modifies the anchors of SSD to improve performance. TextBoxes++ also combines text recognition to improve accuracy. The RRD [17] method separates the classification and regression of text detection to better detect long texts. The FTSN [18] method is an end-to-end trainable multi-oriented text detector based on instance-aware segmentation. The IncepText [19] method proposes a deformable PSROI pooling module to handle multi-oriented text. The Mask TextSpotter [20] method proposes a mask-based text detector, which can detect and recognize text of any shape. In addition, some methods regard text detection as semantic segmentation of text regions. In [21], Shi et al. predict the links between text instance segments and then connect them into text instances. In [22], Deng et al. fuse multi-layer deep features to improve detection accuracy. The Corner [23] method integrates detection and segmentation into a comprehensive score. The TextSnake [24] method uses ordered overlapping disks to represent text lines of arbitrary shape.
FCOS is a fully convolutional one-stage object detection algorithm, which solves the object detection problem by per-pixel prediction. It is both anchor-free and proposal-free, and it introduces the idea of center-ness. At the same time, its recall approaches or even exceeds that of many anchor-based object detection algorithms. By removing the predefined anchors, FCOS completely avoids the complex anchor-related computation and reduces memory usage during training. Different from the above methods, this paper proposes a multi-oriented scene text detection method based on FCOS.
Based on the FCOS anchor-free prediction, the regression method is designed for multi-direction text. On this basis, the diagonal adjustment factor is designed, which makes the position regression more accurate. In addition, when extracting features of texts, a text attention module is added to improve important features and reduce the interference of irrelevant information. Compared with directly using the object detection models, this method can ensure that small text areas are not lost, and ensure the integrity of narrow text areas detection.

III. THE PROPOSED METHOD
Our method is based on the FCOS network, which uses pixel-by-pixel prediction to detect text. It directly predicts the offsets from each spatial location on the feature map to the four vertices of the text object. This method does not need anchors and thus avoids the complicated anchor-related computation. Because we use a high-resolution feature map, our method can detect very small text. In this section, we describe our method in detail.

A. NETWORK STRUCTURE
Due to the good performance of FCOS, we design a text detection network based on it. Figure 1 shows our network structure, which consists of three parts: a feature extraction network, a feature-merging branch and an output layer. The feature extraction network uses Darknet53 [25] as the backbone for extracting text features. Darknet53 has five levels of feature maps. As shown in Figure 1, this paper mainly uses four of these levels, whose sizes are 1/32, 1/16, 1/8 and 1/4 of the input image, respectively. The feature map extracted from Conv Stage5 is upsampled to 1/16 of the input image size and then merged with the feature map from Conv Stage4. After merging, a 1 × 1 convolution fuses the feature maps and reduces the number of channels. After four Conv blocks, the final output of the feature-merging branch is obtained and used as the input of the attention module. The attention module weights the extracted features, highlighting important feature information and suppressing irrelevant information. The module is introduced in detail in Section III-C.
Finally, the output layer contains three branches, which are the 1-channel text score [...]
In our approach, each target is expressed as [...], where [x, y] is the location on the image, [xs, ys] is the location on the feature map, and s is the stride of the feature map. The regression branch predicts the object offsets and outputs the 8-D vector $[x_{lt}, y_{lt}, x_{rt}, y_{rt}, x_{rb}, y_{rb}, x_{lb}, y_{lb}]$. It is defined for every position in the feature map with respect to the corresponding position on the image, and is calculated as follows: [...] In addition, to speed up convergence, the regression targets are divided by the stride in both training and inference.
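The relationship between feature-map locations, image locations and the 8-D regression vector can be illustrated with a minimal Python sketch. The location mapping follows the FCOS convention (the centre of the receptive field of each feature location). The encoding of the four vertices as stride-normalised offsets is an assumption made for illustration, and the function names are hypothetical.

```python
def feature_to_image(x, y, stride):
    """Map a feature-map location (x, y) to a point on the input image,
    following the FCOS convention (centre of the receptive field)."""
    return (stride // 2 + x * stride, stride // 2 + y * stride)


def encode_quad(point, quad, stride):
    """Encode the vertices (lt, rt, rb, lb) of a quadrilateral as an 8-D
    offset vector relative to an image point. Dividing by the stride is an
    assumed normalisation, not necessarily the one used in the paper."""
    px, py = point
    target = []
    for vx, vy in quad:
        target.extend([(vx - px) / stride, (vy - py) / stride])
    return target


def decode_quad(point, target, stride):
    """Invert encode_quad: recover the four vertices from the 8-D vector."""
    px, py = point
    return [(px + target[i] * stride, py + target[i + 1] * stride)
            for i in range(0, 8, 2)]


point = feature_to_image(10, 5, stride=4)
quad = [(30, 10), (70, 12), (68, 36), (28, 34)]  # lt, rt, rb, lb
target = encode_quad(point, quad, stride=4)
restored = decode_quad(point, target, stride=4)
```

Normalising by the stride keeps the regression targets in a similar numeric range across feature levels, which is one common reason for such a division.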
FCOS introduces center-ness to suppress low-quality detection boxes without introducing any hyperparameters, and it has proved effective in object detection. However, our regression vector is eight-dimensional. Therefore, we design the center regression for the eight-dimensional regression vector $[x_{lt}, y_{lt}, x_{rt}, y_{rt}, x_{rb}, y_{rb}, x_{lb}, y_{lb}]$. We calculate the four-dimensional vector [lt, rt, rb, lb], representing the distances from the center point to the four vertices. Each vector can be written as: [...] On this basis, we propose the quadrilateral centerness: [...] The square root is used to slow down the decay of the centerness. The centerness ranges from 0 to 1, so it is trained with the binary cross entropy (BCE) loss. This loss is added to the loss function in Equation (5). At test time, the final score is calculated by multiplying the predicted centerness by the corresponding classification score. Therefore, centerness reduces the score weight of bounding boxes far from the center of the object.
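To make the idea of a centerness defined on vertex distances concrete, the following Python sketch computes the distances [lt, rt, rb, lb] and one possible quadrilateral analogue of FCOS center-ness. The pairing of opposite vertices in `quad_centerness` is a hypothetical choice made for illustration and should not be read as the exact formula of this paper.

```python
import math


def vertex_distances(point, quad):
    """Distances [lt, rt, rb, lb] from a point to the four vertices."""
    px, py = point
    return [math.hypot(vx - px, vy - py) for vx, vy in quad]


def quad_centerness(point, quad):
    """Hypothetical quadrilateral centerness: pairs opposite vertices
    (lt with rb, rt with lb) in the same way FCOS pairs opposite sides.
    The square root slows down the decay away from the centre."""
    lt, rt, rb, lb = vertex_distances(point, quad)
    ratio_a = min(lt, rb) / max(lt, rb)
    ratio_b = min(rt, lb) / max(rt, lb)
    return math.sqrt(ratio_a * ratio_b)


quad = [(0, 0), (40, 0), (40, 20), (0, 20)]      # lt, rt, rb, lb
center_score = quad_centerness((20, 10), quad)   # point at the centre
edge_score = quad_centerness((4, 10), quad)      # point near the left edge

# At test time the class score is multiplied by the centerness, so
# predictions made far from the centre are down-weighted.
class_score = 0.9
final_score = class_score * edge_score
```

Under this assumed definition the score equals 1 at the centre of a rectangle and decreases towards the boundary, which matches the qualitative behaviour described in the text.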

B. LOSS FUNCTION
This section describes the loss function of our model. The overall loss function is as follows: [...] where $L_{cls}$ is the score prediction loss, $L_{reg}$ is the quadrilateral regression loss, and $L_{center}$ is the centerness loss. $N_{pos}$ is the number of positive samples in the ground truth. λ and ω are balance factors, both set to 1.
Class imbalance can reduce the classification performance of text detection models. Each image produces a large number of candidate boxes, usually about 10K. However, a real image may contain only one ground-truth bounding box, or even none, which leads to an imbalance between positive and negative samples. Meanwhile, negative samples consume a lot of computing resources.
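For orientation, a loss built from the three terms named at the start of this subsection is often written in the FCOS style shown below. This is only a sketch: it assumes that each term is summed over feature-map locations i and normalised by $N_{pos}$, which the text does not confirm.

```latex
L = \frac{1}{N_{pos}} \sum_{i} L_{cls}(i)
  + \frac{\lambda}{N_{pos}} \sum_{i} L_{reg}(i)
  + \frac{\omega}{N_{pos}} \sum_{i} L_{center}(i)
```

With λ = ω = 1, as stated in the text, the three terms would contribute with equal weight.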
Deep learning models usually address this problem by data augmentation or hard negative mining during training. However, these practices introduce extra steps or a non-differentiable stage into the detection pipeline. In contrast, we introduce DR Loss to solve this problem. Because DR Loss transforms the classification problem into a ranking problem, the imbalance between positive and negative samples is alleviated. $L_{cls}$ is expressed as follows: [...] where $P_{-}$ and $P_{+}$ are the distributions of negative and positive samples, respectively. L controls the approximation error of the function. γ ensures that the positive and negative samples can be separated. In this experiment, L is set to 6 and γ to 2. More details of DR Loss can be found in [26].
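The idea of turning classification into ranking can be illustrated with a simplified Python sketch. It is written in the spirit of DR Loss but is not the formulation from [26]: the softmax-style re-weighting and its temperature `tau` are assumptions made for illustration, while the defaults for `L` and `gamma` are the values quoted in the text.

```python
import math


def smoothed_ranking_loss(pos_scores, neg_scores, L=6.0, gamma=2.0, tau=1.0):
    """Simplified ranking-style classification loss (illustrative only).

    Instead of classifying every candidate, the loss compares one
    representative score per class and asks the positive one to exceed
    the negative one by a margin gamma.
    """
    # Re-weight negatives so that high-scoring (hard) negatives dominate.
    neg_w = [math.exp(s / tau) for s in neg_scores]
    neg_mean = sum(w * s for w, s in zip(neg_w, neg_scores)) / sum(neg_w)
    # Re-weight positives so that low-scoring (hard) positives dominate.
    pos_w = [math.exp(-s / tau) for s in pos_scores]
    pos_mean = sum(w * s for w, s in zip(pos_w, pos_scores)) / sum(pos_w)
    # Smooth approximation of max(0, z); larger L gives a tighter fit.
    z = neg_mean - pos_mean + gamma
    return math.log1p(math.exp(L * z)) / L


easy = smoothed_ranking_loss([0.9, 0.8], [0.1, 0.2, 0.05])
hard = smoothed_ranking_loss([0.4, 0.3], [0.6, 0.7, 0.5])
```

Because the loss depends on only two aggregated scores per image, the number of negative candidates does not directly scale the loss, which is the intuition behind using ranking to counter class imbalance.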
For the regression loss, we use the smoothed-L1 loss [7]. Q is the ordered set of all coordinate values, which can be written as follows: [...] and then the loss is calculated as: [...]
However, in natural scenes, the aspect ratio of text varies greatly. The smoothed-L1 loss does not consider the correlation between the coordinate points, which leads to inaccurate boundary prediction. In object detection, IOU Loss [27] is usually used to solve this problem. However, IOU in object detection is usually computed for rectangular boxes, and computing IOU for quadrilaterals is more time-consuming. Therefore, inspired by CIOU Loss [28], we make the diagonal proportion of the predicted object fit the diagonal proportion of the ground truth. The introduced diagonal adjustment factor significantly improves the accuracy of boundary prediction. More details can be found in the ablation study (Table 4). The diagonal adjustment factor is as follows: [...] where $d_1$, $d_2$ are the diagonal lengths of the predicted box, and $d_1^{gt}$, $d_2^{gt}$ are the diagonal lengths of the ground truth. The formula is equivalent to adjusting the aspect ratio, which makes the prediction more robust. Finally, the regression loss can be written as: [...]
Following FCOS [14], we construct the centerness loss by extending the standard binary cross entropy loss to centerness. The purpose of the centerness loss is to encourage the network to choose regression points close to the object center. Besides, the centerness also affects the confidence of the predicted object. The loss can be written as:
$$L_{center} = \mathrm{BCE}(P_{centerness}, G_{centerness}) \tag{13}$$
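The ingredients of the regression loss can be sketched in Python as follows. The smoothed-L1 function is the standard definition, here averaged over the eight coordinates (the averaging is an assumption). The diagonal lengths follow directly from the 8-D vertex vector. The `diagonal_penalty` function is a hypothetical stand-in that only illustrates the idea of matching predicted and ground-truth diagonals; it is not the diagonal adjustment factor of this paper.

```python
import math


def smooth_l1(pred, target):
    """Standard smoothed-L1 loss, averaged over the coordinate values."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total / len(pred)


def diagonals(q):
    """Diagonal lengths (d1, d2) of a quadrilateral given as
    [x_lt, y_lt, x_rt, y_rt, x_rb, y_rb, x_lb, y_lb]."""
    d1 = math.hypot(q[4] - q[0], q[5] - q[1])  # lt to rb
    d2 = math.hypot(q[6] - q[2], q[7] - q[3])  # rt to lb
    return d1, d2


def diagonal_penalty(pred, target):
    """Hypothetical penalty: relative mismatch between the predicted and
    ground-truth diagonal lengths (illustration only)."""
    d1, d2 = diagonals(pred)
    g1, g2 = diagonals(target)
    return abs(d1 - g1) / g1 + abs(d2 - g2) / g2


gt = [0, 0, 40, 0, 40, 30, 0, 30]
pred = [0, 0, 40, 0, 43, 34, 0, 30]   # right-bottom vertex is off
coord_loss = smooth_l1(pred, gt)
shape_loss = diagonal_penalty(pred, gt)
```

The coordinate loss treats each of the eight values independently, whereas a diagonal term couples opposite vertices, which is why it can help when the aspect ratio varies strongly.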

C. ATTENTION MODULE
The attention mechanism is inspired by research on human vision. Its main purpose is to make rational use of limited resources to represent the whole. The attention mechanism can highlight important features and reduce the interference of irrelevant information on the detection results. Therefore, we introduce the attention mechanism to improve prediction accuracy. The Dual Attention Network [29] adaptively integrates local features with their global dependencies, which enhances the feature representation for scene segmentation. We take its channel attention module and use it in the text detection model, which improves prediction accuracy without increasing the computation or reducing the inference speed.
As shown in Figure 2, the channel attention map $X \in \mathbb{R}^{C \times C}$ is calculated directly from the original features $A \in \mathbb{R}^{C \times H \times W}$. First, A is reshaped to $\mathbb{R}^{C \times N}$, where N = H × W. Then, matrix multiplication is performed between A and the transpose of A. Finally, the channel attention map is obtained through a softmax layer. It can be written as: [...] where $x_{ij}$ indicates the influence of the i-th channel on the j-th channel. Next, matrix multiplication is performed between X and the transpose of A: [...] where β is initialized to 0 and is gradually learned during training. Equation (14) shows that the final feature of each channel is a weighted sum of the features of all channels plus the original features. As shown in Figure 1, our channel attention module is denoted as text attention; it processes the output of feature fusion and enhances the feature representation.
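The computation described in this subsection can be sketched in plain Python. The sketch follows the channel attention formula as published for the Dual Attention Network [29] (softmax over channel affinities, then a β-scaled weighted sum plus the input); it operates on features already reshaped to C rows of N = H × W values, and the function name is hypothetical.

```python
import math


def channel_attention(A, beta):
    """Channel attention on features A given as C rows of N values.
    Returns E with the same shape, where
    E_j = beta * sum_i X[j][i] * A_i + A_j."""
    C = len(A)
    N = len(A[0])
    # Channel affinities: energy[j][i] is the inner product of A_j and A_i.
    energy = [[sum(a * b for a, b in zip(A[j], A[i])) for i in range(C)]
              for j in range(C)]
    # A row-wise softmax turns the affinities into the C x C attention map.
    X = []
    for row in energy:
        m = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(e - m) for e in row]
        total = sum(exps)
        X.append([e / total for e in exps])
    # Weighted sum over all channels, scaled by beta, plus the input.
    return [[beta * sum(X[j][i] * A[i][n] for i in range(C)) + A[j][n]
             for n in range(N)] for j in range(C)]


A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # C = 3 channels, N = 2 positions
E_init = channel_attention(A, beta=0.0)     # beta = 0 at initialisation
E_learned = channel_attention(A, beta=1.0)
```

With β = 0 the module returns its input unchanged, so training starts from the behaviour of the network without attention and the attention contribution is learned gradually.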

IV. EXPERIMENTS
To evaluate our method, we test it on ICDAR2013 [30], ICDAR2015 [31] and MSRA-TD500 [32]. We compare our method with the latest text detection methods and verify that it achieves state-of-the-art-level performance. Moreover, we also carry out ablation experiments to study the effects of text channel attention, the diagonal adjustment factor and DR Loss on the overall model performance.

A. BENCHMARK DATASETS
ICDAR2015 [31] is the open-source dataset of the ICDAR 2015 competition. The images in the dataset are captured by Google Glass, and the backgrounds are random natural scenes. The ICDAR2015 dataset contains 1000 images for training and 500 images for testing.
Compared with other image databases, the image quality of the ICDAR2015 dataset is not high, and the size and orientation of the text vary across images. The text is annotated at the word level.
MSRA-TD500 [32] contains 500 images, 300 for training and 200 for testing. All images are indoor (office, shopping mall) or outdoor (street) scenes captured by portable cameras. The dataset contains not only English text but also Chinese text, annotated at the text-line level.
ICDAR2013 [30] contains 229 training images and 233 testing images. The text instances are mostly horizontal.
Text detection is evaluated with Precision (P), Recall (R) and F-score (F), which are defined as:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F = \frac{2 \times P \times R}{P + R}$$
where TP, FP and FN are the numbers of correct detections, wrong detections and missed detections, respectively. In addition, frames per second (FPS) is usually used to measure the speed of text detection algorithms.
For training, 100k images are randomly selected from the Synth 800k dataset for pre-training. Then, the model is fine-tuned on a real dataset such as ICDAR2015. The initial learning rate is 0.001. At 25k iterations, the learning rate decays to 0.0001. At 30k iterations, the learning rate decays to 0.00001. Finally, training continues until convergence.
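Precision, recall and F-score can be computed from the counts of correct, wrong and missed detections with a few lines of Python. The function name and the example counts below are illustrative, and returning 0 for empty denominators is a convention chosen here.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F-score from detection counts.
    Returns 0.0 for a metric whose denominator would be zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Illustrative counts: 80 correct, 20 wrong and 20 missed detections.
p, r, f = detection_metrics(tp=80, fp=20, fn=20)
```

The F-score is the harmonic mean of precision and recall, so it is high only when both are high.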

B. IMPLEMENTATION DETAILS
We use the model trained on the ImageNet dataset [33] as our pre-trained model. As shown in Figure 3, the training process includes two steps: we first randomly extract 100k images from the Synth800k [34] dataset and train for 10 epochs, and then fine-tune the model on the real datasets until convergence.
Our data augmentation methods include random horizontal flip, vertical flip and random rotation between −5 and 5 degrees. We optimize the model using SGD with momentum. The weight decay is set to $5 \times 10^{-4}$, the momentum to 0.9 and the batch size to 16. The initial learning rate is set to 0.001 and is reduced by a factor of 10 at 25K and 30K iterations. We resize the input image to a width of 1280 pixels and a height of 720 pixels. All experiments are implemented in Python using PyTorch 1.3. Our model runs on Ubuntu 16.04 with an Nvidia Tesla V100.
Table 1 shows the comparison on ICDAR2015. Our method achieves the highest F-score of 84.9% at a speed of 9.9 FPS without multi-scale testing. Compared with EAST*, our method is 2.9%, 5.8% and 4.2% higher on R, P and F, respectively. This is because the receptive field of EAST is very small, which leads to poor detection of long texts. Our network expands the receptive field, making the detection of long text more accurate. It also demonstrates the effectiveness of the diagonal adjustment factor for the regression branch. Compared with the latest method proposed by Richard [35], our method is 0.7% higher in F-score and its inference speed is 2.8 times faster. This indicates that our method is competitive with the latest text detection methods.
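The step learning-rate schedule given in the implementation details (initial rate 0.001, divided by 10 at 25K and 30K iterations) can be written as a small Python function. Whether the decay takes effect exactly at or just after each milestone is an assumption here, and the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=1e-3, milestones=(25000, 30000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at every milestone
    that has been reached (assumed to apply from the milestone onward)."""
    lr = base_lr
    for milestone in milestones:
        if iteration >= milestone:
            lr *= factor
    return lr


schedule = [learning_rate(it) for it in (0, 24999, 25000, 30000)]
```

In PyTorch, the same behaviour is typically obtained by attaching a multi-step scheduler to the SGD optimizer rather than computing the rate by hand.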

C. COMPARISON WITH OTHER METHODS
We also evaluate our method on MSRA-TD500. Because the text instances in MSRA-TD500 are large and complex, existing text detection methods perform worse on MSRA-TD500 than on ICDAR2015. As shown in Table 2, the machine learning-based text detection method proposed by Liang is 7.2% lower than our method in F-score. The method of Zhang et al. is an earlier advanced multi-oriented text detection method. Compared with it, our method is 4.2%, 1.2% and 3.2% higher on the three metrics, respectively. The F-score of our method is 1.2% higher than that of the EAST method.
We also conduct experiments on ICDAR2013, which is a popular horizontal text dataset. The comparison results are shown in Table 3. Our method achieves the best F-score on ICDAR2013. FASText is a machine learning-based text detection method, which is 11.9% lower than our method in F-score. Compared with SegLink, our method is 1.6%, 5.5% and 3.4% higher on the three metrics, respectively.
In Figure 4, we compare our method with EAST and TextBoxes++. In the left image, TextBoxes++ mistakes the ceiling for a text area, whereas our method is not confused. In the middle image, EAST and TextBoxes++ miss a small text area, which our method still detects. In addition, in the right image, EAST breaks the long text into fragments and TextBoxes++ cannot completely enclose the text.
Our method performs well on long text, and its detection boxes are more accurate. In Figure 5, we show some detection results of our method on ICDAR2015, MSRA-TD500 and ICDAR2013. The examples on MSRA-TD500 and ICDAR2013 show that our method detects not only English text but also Chinese text with different fonts. Therefore, our method is robust to different fonts and irregular structures.
Figure 6 shows some detection examples of large-scale text. In the left image, our detector completely detects ordinary large-scale text. In the middle image, our detector cannot completely detect the edges of long large-scale text. In the right image, the detected boundary is incomplete for very large-scale text. There are two reasons for the loss of large-scale text boundaries. On the one hand, the receptive field of the network is limited. On the other hand, there are few very large-scale text samples in training, so the learned features are incomplete.

D. ABLATION STUDY
In order to directly observe the function of each component in the model, we conduct ablation experiments in this section. Because ICDAR2015 is the most influential and difficult dataset, results on it better reflect the practicality of the method. Therefore, all ablation experiments are carried out on this dataset to study the influence of three components on multi-oriented text detection: the text attention mechanism, the diagonal adjustment factor and DR Loss. All experimental settings are the same except for the controlled variables, and the results are shown in Table 4. We choose the Darknet53 model as the baseline. From Table 4, we can see that our text attention module increases the F-score by 1.2%. This shows that the attention module helps the model learn more important features and enhances its representation ability. Second, our diagonal adjustment factor also increases the F-score by 0.8%. In addition, DR Loss is introduced as the classification loss, which increases the F-score by 0.5%. This group of experiments shows that the imbalance between positive and negative samples affects network performance, and that DR Loss alleviates this problem and improves network performance. When the diagonal adjustment factor and DR Loss are combined, the F-score is improved by 1.1%. Finally, when all three components are used, the F-score is improved by 2.1% compared with the baseline.

V. CONCLUSION
In this paper, a multi-oriented text detection method combined with an attention module is proposed. This method is based on the FCOS network and uses pixel-by-pixel prediction to detect text. It directly predicts the offsets from each spatial location on the feature map to the four vertices of the text object. In order to make the position regression more accurate, a diagonal adjustment factor is designed for the regression loss function, which increases the F-score by 0.8. In addition, a text attention module is added, which pays more attention to useful information and increases the F-score by 1.2. Then, to address the problem of sample imbalance in text detection, DR Loss is introduced to enhance the detection performance of the network, which increases the F-score by 0.5. Finally, we perform experimental comparison and model analysis on the scene text detection datasets ICDAR2015, MSRA-TD500 and ICDAR2013. On the ICDAR2015 dataset, the proposed method achieves an F-score of 0.849 at 9.9 FPS at 720p resolution. The results show that the method achieves an advanced level of performance. At the same time, we have verified the function of each component through ablation experiments. However, our detector still has room for improvement. First, for very large-scale text, the boundaries detected by our detector are incomplete. We plan to further address this problem. In addition, our detector is currently designed for multi-oriented text and is not suitable for curved text. We plan to further improve our network processing and output to handle this case.