STELA: A Real-Time Scene Text Detector with Learned Anchor

To achieve high coverage of target boxes, conventional one-stage anchor-based detectors commonly place multiple priors at each spatial position, especially in scene text detection. In this work, we present a simple and intuitive method for multi-oriented text detection in which each location of the feature maps is associated with only one reference box. The idea is inspired by the two-stage R-CNN framework, which can estimate the location of objects of any shape by using learned proposals. Our method integrates this mechanism into a one-stage detector: a learned anchor, obtained through a regression operation, replaces the original anchor in the final predictions. Based on RetinaNet, our method achieves competitive performance on several public benchmarks at real-time speed (26.5 fps at 800p), which surpasses all anchor-based scene text detectors. In addition, because it requires less attention to anchor design, we believe our method can be readily applied to other analogous detection tasks. The code will be made publicly available at https://github.com/xhzdeng/stela.


Introduction
Text in natural scenes usually conveys valuable semantic information. Detecting text in natural images has therefore attracted increasing attention in the computer vision community, because perceiving such information is a critical part of artificial general intelligence. It has been widely used in applications such as multilingual translation, automotive assistance and image retrieval. Previous works [5,26,1,33] were dominated by sliding-window or connected-component approaches with hand-crafted features, which divided the task into a sequence of distinct steps and used a bottom-up strategy to search for characters and words. Although these methods have shown promising performance, they may be limited in complex situations due to the diversity of text instances and poor image quality.
With the remarkable progress in object detection driven by deep learning [15], recent methods treat text as a specific kind of object and extend general object detection frameworks [29,27,20] to hypothesize word or text locations. These approaches can be divided into two major groups: two-stage proposal-driven methods and one-stage proposal-free methods. Although two-stage frameworks [37,24,10] consistently achieve top accuracy on the public benchmarks [12,11,25], recent works [16,22,2,9] based on one-stage frameworks have demonstrated faster text detectors with comparable accuracy.
Unlike two-stage detectors, which can classify boxes at any position and of any shape by using learned proposals [29] and a region pooling operation [6], one-stage detectors rely heavily on how densely the anchors cover the space of possible target locations [19]. A popular approach for achieving high coverage is to use multiple anchors covering boxes of various scales and aspect ratios, especially in scene text detection. TextBoxes++ [16] was based on SSD [20] and defined 7 specific aspect ratios (1, 2, 3, 5, 1/2, 1/3 and 1/5) for the default boxes at each location of the feature maps. To achieve multi-oriented text detection, DMPNet [22] added several rotated anchors, for a total of 12 (6 regular and 6 inclined), to find the best match for an arbitrary-oriented text instance. Instead of choosing priors by hand, DeepTextSpotter [2] followed YOLOv2 [28] and ran k-means clustering (k = 14) on the training set bounding boxes to find suitable priors automatically.
Given the anchor designs of the above detectors, a natural question is: can we decrease the number of anchors while maintaining similar accuracy? Such a change would bring a twofold benefit: less manual effort on anchor design and better efficiency at inference. First, the shapes and scales of anchors have to be predefined for each task, and this must be done carefully because a poor design may harm detection performance [35]. Second, most anchors correspond to false candidates that are irrelevant to the targets, and a large number of anchors can lead to significant computational cost when the network involves heavy heads. Besides, although not mentioned in many papers, anchor generation itself takes a certain amount of time.
With computational efficiency in mind, in this work we investigate the anchor design issue described above within one-stage detectors for multi-oriented text detection. In one-stage methods, the optimization target in training and the prediction reference in testing are both based on the coverage between the original anchors and the target boxes. The quality of those prior boxes therefore has a critical impact on the performance of a detector. Normally, as the number of anchors increases, the coverage of targets increases, but it can still saturate in some situations, as shown in Figure 1(b). We therefore need a better way to choose priors, one that makes it easier for the network to learn to predict good detections. Inspired by the learned proposal mechanism [29] in the two-stage R-CNN framework, we use a learned anchor, obtained through a regression operation, to replace the original anchor in the final predictions.
It is worth noting that, unlike the region proposal network (RPN) in two-stage detectors, which reduces the number of possible locations to one or two thousand, we keep the original quantity of anchors and leave the remaining parts of the one-stage detector's architecture unchanged. To validate its effectiveness, we adopt the state-of-the-art RetinaNet [19] as our baseline model and present a simple and intuitive text detector named STELA (Scene TExt Detector with Learned Anchor), in which each location of the feature maps is associated with only one anchor. Following the standard evaluation protocol of each benchmark, our method achieves competitive performance, with F-measures of 0.887 on ICDAR 2013 [12], 0.833 on ICDAR 2015 [11] and 0.715 on ICDAR 2017 MLT [25]. Besides, our method is a real-time scene text detector running at 26.5 fps at 800p, which surpasses all anchor-based methods. Finally, because it requires less attention to anchor design, we believe our method can be readily applied to other analogous detection tasks. All of our training and testing code will be open-sourced soon.

One-Stage Object Detection
In this section, we first review the one-stage detection pipeline. OverFeat [31] is one of the first modern one-stage object detectors based on deep neural networks. The more recent SSD [20] and YOLOv2 [28] have renewed interest in one-stage methods. Their key idea is to associate a set of pre-defined anchors, centered at each location of the feature maps, and to make final predictions based on those reference boxes [14]. As shown in Figure 2(a), such a detector basically contains a backbone network for feature extraction over the entire image, followed by two parallel sub-networks: one predicts the probability distribution over multiple categories for each anchor, and the other regresses the offset from each positive candidate to a nearby ground-truth box, if one exists.
Compared with two-stage R-CNN methods (Figure 2(b)), one-stage detectors skip the region proposal generation step and give final predictions (classification and regression) directly from the original anchors. However, their detection accuracy usually trails that of two-stage approaches. One of the main reasons is that they must process a much larger set of candidate object locations regularly sampled across an image, so an extreme foreground-background class imbalance arises during training and hampers the resulting performance. More recently, RetinaNet [19] proposed the focal loss (FL) to address this class imbalance, enabling one-stage detectors to match the accuracy of existing two-stage ones. The focal loss is modified from the standard cross entropy (CE) loss:
$$\mathrm{CE}(p, y) = \begin{cases} -\log(p) & \text{if } y = 1 \\ -\log(1 - p) & \text{otherwise} \end{cases}$$
In the above, $y = 1$ specifies the ground-truth class and $p \in [0, 1]$ is the predicted probability of that class. Normally, we define $p_t$:
$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise} \end{cases}$$
and rewrite $\mathrm{CE}(p, y) = -\log(p_t)$. Then, the classification loss is defined as:
$$L_{cls}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
where $\alpha_t$ is a balancing weight factor and $\gamma$ is a focusing parameter. The loss applies a modulating term to the cross entropy loss in order to focus learning on hard examples and down-weight the numerous easy negatives. In our implementation, we follow the original focal loss and set $\alpha_t = 0.25$ and $\gamma = 2.0$.
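As a minimal sketch of the focal loss described above, the following Python snippet evaluates it for a single anchor and compares it with plain cross entropy. The function names and the `eps` clamp are illustrative assumptions, not part of the original implementation; the weight for negatives is taken as 1 − α, following the convention of the RetinaNet paper.

```python
import math

def cross_entropy(p, y, eps=1e-12):
    """Binary cross entropy for one anchor; p is the predicted probability of the text class."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, eps))

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Focal loss for one anchor: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    # alpha_t is defined analogously to p_t (assumption: 1 - alpha for negatives).
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = max(p_t, eps)  # clamp to avoid log(0); illustrative safeguard
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

if __name__ == "__main__":
    # An easy negative (p = 0.1, y = 0) is down-weighted far more than a hard positive (p = 0.1, y = 1).
    print("easy negative:", cross_entropy(0.1, 0), focal_loss(0.1, 0))
    print("hard positive:", cross_entropy(0.1, 1), focal_loss(0.1, 1))
```

The modulating factor (1 − p_t)^γ is what shrinks the contribution of well-classified anchors, so the many easy background anchors no longer dominate the total loss.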

Rotated Bounding Box Regression
As depicted in [4], using axis-aligned rectangular bounding boxes to localize multi-oriented text may introduce redundant background noise and unnecessary overlap. Thus, we adopt rotated rectangular boxes to match arbitrary-oriented text instances. Each bounding box $b$ is represented by a five-tuple $b = (x, y, w, h, \theta)$, where $(x, y)$ is the center point, $w$ and $h$ are the width and height, and $\theta$ is the angle to the horizontal. The task of the regression operation is to predict the distance of each item from a positive anchor to the nearby ground truth. Normally, to encourage a regression invariant to scale and location, the distance vector $\Delta = (\delta_x, \delta_y, \delta_w, \delta_h, \delta_\theta)$ is defined by:
$$\delta_x = (g_x - b_x)/b_w, \quad \delta_y = (g_y - b_y)/b_h, \quad \delta_w = \log(g_w/b_w), \quad \delta_h = \log(g_h/b_h), \quad \delta_\theta = g_\theta - b_\theta$$
where $b$ and $g$ represent a bounding box and its target ground truth respectively. The regression task loss $L_{loc}$ is calculated from the regression target $\Delta_t$ and the predicted tuple $\Delta_p$:
$$L_{loc}(\Delta_t, \Delta_p) = \sum_{i \in \{x, y, w, h, \theta\}} \mathrm{smooth}_{L1}(\delta^t_i - \delta^p_i)$$
where $\mathrm{smooth}_{L1}$ is the robust $L_1$ loss defined in [6]. Usually, to improve the effectiveness of multi-task learning, $\Delta$ is normalized by its mean and variance. In our experiments, the mean is set to (0, 0, 0, 0, 0) and the variance is set to (0.1, 0.1, 0.2, 0.2, 0.1).
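The rotated-box regression can be sketched as an encode/decode pair. This is an illustration under stated assumptions rather than the authors' code: it assumes the standard R-CNN style parametrization for center and size, a plain angle difference for the θ term, and that the normalization values quoted above are used as per-component divisors. The names `encode`, `decode`, `smooth_l1` and `loc_loss` are hypothetical.

```python
import math

# Assumption: the normalisation values quoted in the text act as per-component divisors.
STD = (0.1, 0.1, 0.2, 0.2, 0.1)

def encode(anchor, gt, std=STD):
    """Regression target from anchor b = (x, y, w, h, theta) to ground truth g."""
    bx, by, bw, bh, bt = anchor
    gx, gy, gw, gh, g_theta = gt
    raw = ((gx - bx) / bw, (gy - by) / bh,
           math.log(gw / bw), math.log(gh / bh),
           g_theta - bt)  # assumption: plain angle difference
    return tuple(d / s for d, s in zip(raw, std))

def decode(anchor, delta, std=STD):
    """Inverse of encode: apply normalised offsets to an anchor."""
    bx, by, bw, bh, bt = anchor
    dx, dy, dw, dh, dt = (d * s for d, s in zip(delta, std))
    return (bx + dx * bw, by + dy * bh, bw * math.exp(dw), bh * math.exp(dh), bt + dt)

def smooth_l1(x):
    """Smooth L1 as defined in Fast R-CNN: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

def loc_loss(target, pred):
    """Sum of smooth L1 over the five components."""
    return sum(smooth_l1(t - p) for t, p in zip(target, pred))
```

Because the center offsets are divided by the anchor size and the size offsets are log-ratios, the same target values describe the same relative adjustment regardless of the anchor's scale or position.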

Learned Anchor
Normally, the detector needs to search for the true positives among thousands of anchors and adjust their shapes and locations to fit the targets more tightly. It is difficult to determine whether a bounding box $b$ is a positive candidate, because it usually includes an object together with some amount of background. In practice, this is decided by the IoU between box $b$ and the nearest ground truth $g$:
$$y = \begin{cases} 1 & \text{if } \mathrm{IoU}(b, g) \ge \mu \\ 0 & \text{otherwise} \end{cases}$$
Commonly, the threshold $\mu$ is a constant set to 0.5. If the IoU reaches the threshold $\mu$, bounding box $b$ is considered a positive example, and $y = 1$ specifies the ground-truth class. It is worth noting that the conventional IoU based on axis-aligned rectangular boxes is unsatisfactory for our task, so we modify it to compute the overlap between rotated rectangles. Given all this, the optimization target in training is determined by the overlaps between the original anchors and the ground-truth boxes.
However, the original anchors, with fixed scales and aspect ratios pre-defined manually, may not be optimal. We argue that, compared with one-stage detectors, the most important part of the proposal scheme in two-stage detectors is that the selected proposals are chosen by learning. This allows two-stage methods to reduce the search space of targets while also improving the quality of candidates. Inspired by this, we integrate this mechanism into one-stage detectors. We simply add an extra regression branch for anchor refining and use the learned anchor in the final classification and regression, as shown in Figure 2(c).
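One way to compute the overlap between rotated rectangles is to clip one box polygon against the other and measure the resulting area. The sketch below uses Sutherland-Hodgman clipping in pure Python; it is an illustrative stand-in, since the text does not say which algorithm the authors use, and it assumes θ is given in radians. All function names are hypothetical.

```python
import math

def corners(box):
    """Corner points of a rotated box (x, y, w, h, theta), theta in radians, in counter-clockwise order."""
    x, y, w, h, t = box
    c, s = math.cos(t), math.sin(t)
    offsets = ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2))
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in offsets]

def polygon_area(pts):
    """Shoelace formula for a simple polygon."""
    total = 0.0
    for i in range(len(pts)):
        x1, y1 = pts[i]
        x2, y2 = pts[(i + 1) % len(pts)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def clip(subject, clipper):
    """Sutherland-Hodgman: clip convex polygon `subject` by convex counter-clockwise polygon `clipper`."""
    output = subject
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]

        def side(p):
            # Positive when p lies to the left of the directed edge a -> b (inside).
            return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

        current, output = output, []
        for j in range(len(current)):
            p, q = current[j], current[(j + 1) % len(current)]
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
                t = sp / (sp - sq)  # intersection of segment p-q with the clip edge
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
        if not output:
            break
    return output

def rotated_iou(box_a, box_b):
    """IoU of two rotated rectangles."""
    poly = clip(corners(box_a), corners(box_b))
    inter = polygon_area(poly) if len(poly) >= 3 else 0.0
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def assign_label(anchor, gt, mu=0.5):
    """Positive (1) when the rotated IoU reaches the threshold mu, else 0."""
    return 1 if rotated_iou(anchor, gt) >= mu else 0
```

For example, a 2×2 square and the same square rotated by 45° intersect in a regular octagon, so their rotated IoU is noticeably below 1 even though their axis-aligned bounding boxes would suggest a much looser match.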
In particular, the regression targets of the learned anchor are not arbitrary. As noted in [35], one of the general rules for a reasonable anchor design is alignment: to use convolutional features as anchor representations, the center of an anchor needs to be well aligned with the feature map pixels. To this end, we only regress the offsets $\Delta = (\delta_w, \delta_h, \delta_\theta)$, which keeps the anchors aligned with the feature map, as depicted in Figure 3. Following the regression task, the anchor refining loss $L_{ref}$ is defined as:
$$L_{ref}(\Delta_t, \Delta_p) = \sum_{i \in \{w, h, \theta\}} \mathrm{smooth}_{L1}(\delta^t_i - \delta^p_i)$$
Unlike two-stage R-CNN methods, which filter anchors with an objectness score, we only adjust the shape of each anchor and keep the quantity of anchors unchanged. Evaluated on public benchmarks, the coverage of targets is greatly improved after the anchor refining stage compared with the original anchors, as shown in Figure 1.
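A minimal sketch of the refining step follows, assuming the predicted offsets are decoded with the usual log-ratio rule for width and height and an additive rule for the angle. The normalization values are reused from the regression section as an assumption, and `refine_anchor` is a hypothetical name.

```python
import math

def refine_anchor(anchor, delta, std=(0.2, 0.2, 0.1)):
    """Apply predicted (delta_w, delta_h, delta_theta) to an anchor (x, y, w, h, theta).

    The center (x, y) is deliberately left untouched so the learned anchor stays
    aligned with its feature map location.
    """
    x, y, w, h, theta = anchor
    dw, dh, dt = (d * s for d, s in zip(delta, std))  # undo normalisation (assumed divisors)
    return (x, y, w * math.exp(dw), h * math.exp(dh), theta + dt)

if __name__ == "__main__":
    square = (100.0, 60.0, 16.0, 16.0, 0.0)
    # A positive delta_w stretches the anchor horizontally; the center does not move.
    print(refine_anchor(square, (5.0, -1.0, 2.0)))
```

Keeping the center fixed means the refined box can still be classified and regressed from the features at the same spatial position, which is the alignment constraint described above.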

Network Architecture
As a trade-off between efficiency and accuracy, all of our experiments are implemented on RetinaNet [19] with ResNet-50 [8] as the backbone, though other networks are also applicable. We also adopt the Feature Pyramid Network (FPN) from [18] to construct a rich, multi-scale feature pyramid from a single-resolution input image. The FPN provides feature maps at levels $P_3$ to $P_7$, with corresponding base anchor sizes from $16^2$ to $256^2$ to detect small text instances ($32^2$ to $512^2$ in the original implementation). In the original RetinaNet, the two sub-networks (heads) are deeper, with 5 convolutional layers each. To improve running speed, we streamline the heads by reducing the number of layers from 5 to 2. This may cause a slight loss in accuracy, but gives us a real-time text detector in return. Based on the above definitions, the model is trained to simultaneously minimize the losses on anchor refining, final regression and classification. Overall, the loss function is a weighted sum of three losses:
$$L = \lambda_{ref} L_{ref} + \lambda_{loc} L_{loc} + \lambda_{cls} L_{cls}$$
where $\lambda_{ref}$, $\lambda_{loc}$ and $\lambda_{cls}$ are user-defined constants indicating the relative strength of each component defined above. To keep the different loss types balanced, we set them to 0.5, 0.5 and 1.0 respectively.
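To make the cost of multiple anchors concrete, the sketch below counts feature map locations over the pyramid levels P3 to P7 and multiplies by the number of anchors per location. It assumes the usual RetinaNet strides of 8 to 128 for these levels, ceiling rounding of the feature map size, and a square 800×800 input; the last two are simplifications and the function names are hypothetical.

```python
import math

def num_locations(height, width, strides=(8, 16, 32, 64, 128)):
    """Total number of feature map positions over pyramid levels P3..P7."""
    return sum(math.ceil(height / s) * math.ceil(width / s) for s in strides)

def num_anchors(height, width, anchors_per_location):
    """Number of anchors the heads must classify and regress."""
    return num_locations(height, width) * anchors_per_location

if __name__ == "__main__":
    for k in (1, 9, 15):
        print(k, "anchor(s) per location:", num_anchors(800, 800, k))
```

Under these assumptions, a single anchor per location leaves the heads with roughly thirteen thousand boxes to score, against well over a hundred thousand for a nine-anchor design, which is where much of the speed-up comes from.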

Implementation Details
The backbone of the network is initialized from a model trained on ImageNet [30] for classification, and the other layers are initialized following [19]. The network is trained with the Adam [13] optimizer. Restricted by the hardware, the batch size is set to 4 and the initial learning rate is set to $10^{-4}$. We randomly pick 100,000 images from SynthText [7] to pretrain the network for 5 epochs, and collect real data from ICDAR 2013 [12], 2015 [11] and 2017 [25] to finetune the final model for 25 epochs. The learning rate is decayed to $10^{-5}$ after 15 epochs of finetuning. We use a multi-scale training scheme that randomly resizes the input to between 480 and 800 pixels. Random flipping is also used for data augmentation.
In particular, to capture more regression target candidates, we set the IoU threshold to 0.3 when training the anchor refining branch. At inference, a confidence threshold of 0.3 and a non-maximum suppression threshold of 0.3 are applied to yield the final outputs. The proposed method is implemented in PyTorch¹ and all experiments are carried out on a standard PC with an Intel i7-6800K CPU and a single NVIDIA TITAN Xp GPU.
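The inference-time filtering can be sketched as a score threshold followed by greedy non-maximum suppression. This is a generic illustration, not the authors' code: the IoU function is passed in as a parameter, and the axis-aligned IoU below is only a stand-in for the rotated IoU the method actually needs. Whether the comparisons are strict or inclusive is also an assumption.

```python
def axis_aligned_iou(a, b):
    """Stand-in IoU for boxes given as (x1, y1, x2, y2); a rotated IoU would be used in practice."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def filter_and_nms(dets, iou_fn, score_thr=0.3, nms_thr=0.3):
    """dets is a list of (box, score). Returns the kept detections, highest score first."""
    # Step 1: drop low-confidence detections and sort the rest by score.
    candidates = sorted((d for d in dets if d[1] >= score_thr),
                        key=lambda d: d[1], reverse=True)
    # Step 2: greedily keep a box only if it does not overlap a kept box too much.
    kept = []
    for box, score in candidates:
        if all(iou_fn(box, k[0]) <= nms_thr for k in kept):
            kept.append((box, score))
    return kept

if __name__ == "__main__":
    dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8),
            ((20, 20, 30, 30), 0.5), ((40, 40, 50, 50), 0.2)]
    print(filter_and_nms(dets, axis_aligned_iou))
```

Passing the IoU function as an argument keeps the suppression logic independent of the box representation, so the same routine works for rotated boxes once a rotated IoU is supplied.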

Ablation Study
To investigate the effectiveness of our method, we conduct several ablation studies. Each model is evaluated on ICDAR 2013 [12] and 2015 [11] benchmarks.
Anchor Design: We first investigate the impact of different anchor designs on performance, including accuracy and efficiency. The baseline models are directly extended from RetinaNet by simply changing the regression strategy introduced in Section 2.2. The aspect ratios of the anchors at a single location of the feature maps are selected from {0.25, 0.5, 1, 2, 4}, and the scales are chosen from $2^{k/3}$, $k < 3$. As shown in Table 1, as the number of anchors increases, the F-measure improves from 0.447 to 0.863 on ICDAR 2013, but there is no significant improvement (0.621 to 0.753) on ICDAR 2015. Based on Figure 1, we argue that the coverage of targets saturates on ICDAR 2015, but not on ICDAR 2013. This again suggests that the most important design factor in a one-stage detector is how densely it covers the space of target boxes.
Attaching the anchor refining operation to the first baseline, the resulting model obtains a large improvement in recall on both benchmarks (shown in Figure 1), and its F-measure rises from 0.447 to 0.887 on ICDAR 2013 and from 0.621 to 0.833 on ICDAR 2015. This strongly demonstrates the effectiveness of our approach. In addition, we also assess the other baselines with anchor refining; they run slower and show no obvious improvement.
Table 1. The impact of different anchor designs. "#sc" is the number of scales and "#ar" is the number of aspect ratios; "la" means learned anchor. All input images are resized to 800 pixels.
¹ https://pytorch.org/
Number of Stages: Like Cascade R-CNN [3], we add more stages of anchor refining to compare their influence. We also increase the IoU threshold at each refining stage, following Cascade R-CNN. The results are summarized in Table 2. Adding more refining stages does not lead to a significant improvement and can even decrease accuracy. Besides, increasing the number of stages reduces the running speed. Therefore, one refining stage is the best choice for our method.
Table 2. The impact of the number of stages. "#stage" means the number of anchor refining stages.
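The baseline anchor sets can be enumerated with a short sketch. It assumes the common convention that an aspect ratio is height divided by width and that each scale multiplies the base size while preserving the anchor's area across ratios; neither detail is stated in the text, and `anchor_shapes` is a hypothetical helper.

```python
import math

def anchor_shapes(base_size, ratios, scales):
    """Return (w, h) for every scale/ratio pair, keeping the area fixed within each scale."""
    shapes = []
    for s in scales:
        area = (base_size * s) ** 2
        for r in ratios:
            w = math.sqrt(area / r)  # assumption: ratio = h / w
            h = w * r
            shapes.append((w, h))
    return shapes

if __name__ == "__main__":
    scales = [2 ** (k / 3) for k in range(3)]  # 2^(k/3) for k < 3
    ratios = [0.25, 0.5, 1, 2, 4]
    dense = anchor_shapes(16, ratios, scales)  # densest baseline: 3 scales x 5 ratios
    single = anchor_shapes(16, [1], [1])       # one square anchor per location
    print(len(dense), "shapes vs", len(single))
```

The densest setting yields fifteen shapes per location, while the learned-anchor model starts from a single square anchor and lets the refining branch supply the shape.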

Comparison to State of the Art
We evaluate our method on several public benchmarks and compare it with recent state-of-the-art methods. Figure 4 shows some detection results from each dataset.
The ICDAR 2013 [12] dataset consists of 229 training and 233 testing images, which were captured by users explicitly directing the focus of the camera at the text content of interest. It is a standard benchmark for evaluating horizontal or nearly horizontal text detection. In this benchmark, we set the scale of input images to 800 for single-scale testing. We also evaluate multi-scale testing, in which the scales are set to 320, 480, 640 and 800. As depicted in Table 3, the proposed method outperforms all anchor-based methods, including DeepText [37], FCRN [7] and CTPN [34], which are mainly designed for nearly horizontal text detection. For single-scale testing, our method achieves a real-time running speed of 26.5 fps. Even with multi-scale testing, our method runs at 10.5 fps. Compared with recent methods [21,17,23,4], our method is comparable in accuracy and efficiency.
The ICDAR 2015 [11] benchmark was released during the ICDAR 2015 Robust Reading Competition. It provides 1000 training and 500 testing images, which were collected without any specific prior attention to the text. It was designed for multi-oriented text detection, so all images are annotated with word-level quadrangles. To evaluate the adaptability of our learned anchor, we again set the input size to 800 pixels. As shown in Table 4, our method achieves an F-measure of 0.833, which also surpasses all of the anchor-based methods [22,32,9,24,16], including both one-stage and two-stage frameworks. Compared with other approaches [38,21], which use a deep regression network that directly predicts text regions, our method still holds a clear lead in running speed.
ICDAR 2017 MLT [25] is a large-scale multi-lingual text dataset, which includes 7200 training, 1800 validation and 9000 testing images in 9 languages. It was proposed to verify the generalization ability of each method and is therefore more difficult than previous ICDAR challenges. Because this dataset contains a larger number of small text instances, we double the scale of the testing images to 1600 pixels. Our method achieves 0.655, 0.787 and 0.715 in recall, precision and F-measure, as measured by the official online evaluation system and shown in Table 5. These results demonstrate that our method can be applied in practice to multi-lingual text detection.
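As a quick consistency check on the reported ICDAR 2017 MLT numbers, the F-measure is the harmonic mean of precision and recall. The snippet below is only a worked example of that arithmetic; `f_measure` is an illustrative name.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # Reported ICDAR 2017 MLT values: recall 0.655, precision 0.787.
    print(round(f_measure(0.787, 0.655), 3))
```

With recall 0.655 and precision 0.787 this gives 2 × 0.787 × 0.655 / 1.442 ≈ 0.715, matching the reported F-measure.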

Conclusion and Future Work
In this work, we propose a simple and intuitive method based on RetinaNet for multi-oriented text detection, in which each location of the feature maps is associated with only one anchor. Our method integrates the learning mechanism of the two-stage R-CNN framework into a one-stage detector and uses the learned anchor to replace the original one in the final predictions. Experimental results on public benchmarks confirm that the proposed method achieves performance comparable with state-of-the-art methods. Besides, it is a real-time scene text detector. In the future, we are interested in integrating the detector with a text recognizer to form an end-to-end text reading system. In addition, we plan to evaluate it on other detection tasks to test the generality of our approach.

Table 5. Results on ICDAR 2017 Multi-lingual Scene Text Detection. All results are reported with a single testing scale. "" means the result is obtained from the ICDAR 2017 MLT leaderboard.
Method         Recall   Precision   F-measure
[23]           0.556    0.838       0.668
FOTS [21]      0.575    0.809       0.672
Border [21]    0.621    0.777       0.690
SPCNET [36]    0.669    0.734       0.700
Ours           0.655    0.787       0.715