Arbitrary-Shaped Text Detection with Adaptive Text Region Representation

Text detection/localization, as an important task in computer vision, has witnessed substantial advancements in methodology and performance with convolutional neural networks. However, the vast majority of popular methods use rectangles or quadrangles to describe text regions. These representations have inherent drawbacks, especially relating to dense adjacent text and loose regional text boundaries, which usually cause difficulty detecting arbitrarily shaped text. In this paper, we propose a novel text region representation method, with a robust pipeline, which can precisely detect dense adjacent text instances with arbitrary shapes. We consider a text instance to be composed of an adaptive central text region mask and a corresponding expanding ratio between the central text region and the full text region. More specifically, our pipeline generates adaptive central text regions and corresponding expanding ratios with a proposed training strategy, followed by a newly proposed post-processing algorithm which expands central text regions to the complete text instance with the corresponding expanding ratios. We demonstrate that our new text region representation is effective, and that the pipeline can precisely detect closely adjacent text instances of arbitrary shapes. Experimental results on common datasets demonstrate the superior performance of the proposed method.


I. INTRODUCTION
As a fundamental task in computer vision, accurate text detection is applicable to many fields in the real world, including automatic identity recognition, financial document analysis and recognition, and environmental understanding. In the era of deep learning, the community has witnessed substantial advancements in the methodology and performance of text detection. This task, however, faces many challenges arising from various image attributes, such as complex backgrounds, lighting conditions and arbitrary shapes. Like previous state-of-the-art approaches, our proposed method focuses on arbitrary-shaped text detection.
Existing scene text detection methods use two forms to represent text regions: quadrangular representation and segmentation mask representation. To the best of our knowledge, most existing regression-based methods are specifically designed to use quadrangular representation to describe text regions, as is the case with general object detection approaches. However, this representation fails to deal with text instances of arbitrary shapes, e.g., the curved texts shown in Figure 1(a), as it contains extraneous information in addition to the text region. Segmentation mask representation, as shown in Figure 1(b), is on the other hand appropriate for arbitrarily shaped text. However, tiny intervals between text regions are very common in natural scenes, and it is challenging to separate dense adjacent text instances using this representation.

[VOLUME 4, 2016; arXiv:2104.00297v1 [cs.CV] 1 Apr 2021]

To address the above problems, we propose a robust pipeline with a novel text representation. We consider a text instance to be composed of a central text region and an expanding ratio between the central text region and the full text region, an example of which is shown in Figure 1(c). Our proposed method has the following three benefits. First, the central text region mask has a similar shape to the original text instance, which allows the proposed method to represent texts of arbitrary shapes and locate each text instance precisely. Second, the central text region map only captures the central area of each text, which allows separation of spatially close text instances. Third, we propose a post-processing algorithm called the "polygon expansion algorithm", with which the identified full text instances can be successfully expanded from the central text region maps.
Moreover, to further improve the representation robustness against text instances of various scales and aspect ratios, we introduce a training strategy that can adaptively adjust the central text region map and the expanding ratio based on multiple expanding ratios of the same text region. In our experiments, the proposed method achieves 81.3%, 82.5%, 84.8% and 83.6% F-measure (state-of-the-art or competitive performance) on the DAST1500, CTW1500, TotalText and MSRA-TD500 evaluation datasets, where the proposed novel text representation provides 1.9%, 0.3%, 0.1% and 0.7% improvement, respectively. The contributions of this paper can be summarized as follows:
• A novel text region representation method, which can precisely describe dense adjacent text instances with arbitrary shapes.
• A training strategy to obtain adaptive text region representations for text instances of various scales and aspect ratios.
• A polygon expansion algorithm which is able to accurately expand the central text region map to the full text map.
• Superior performance on representative scene text datasets.

II. RELATED WORK
As an important research area in computer vision, scene text detection has inevitably been influenced by the deep learning revolution. Most previous deep learning methods can be roughly divided into three categories according to the text region representation: quadrangular representation, segmentation mask representation, and hybrid methods which predict the segmentation mask of scene texts and regress the bounding boxes of the text instances at the same time. Quadrangular representation mainly draws inspiration from general object detection frameworks, such as Faster R-CNN [1] and SSD [2]. These methods use quadrangles to represent text regions, and can be further divided into one-stage and two-stage approaches, similar to object detection methods. Following Faster R-CNN, CTPN [3] detected long horizontal text by connecting rectangular anchors of consistent width and different heights. To locate multi-oriented text regions, R2CNN [4] refined the axis-aligned boxes and predicted the inclined minimum-area boxes with pooled features of different pool sizes. RRPN [5] added rotation to both the anchors and RoIPooling in the Faster R-CNN pipeline. Among one-stage text detectors based on SSD, Textboxes [6] modified the anchors and kernels of SSD to detect large-aspect-ratio scene texts. SegLink [7] proposed to predict text segments and then group these segments into text boxes by predicting segment links. To represent curved text regions, TextSnake [8] proposed to regress a sequence of disks with different radii, but it still requires complicated post-processing, and radius regression may result in a drop in precision. As mentioned in [9], Wang et al. proposed the use of polygons with an adaptive number of points to represent text regions. However, the representation is still a polygon and cannot describe curved text regions with smooth boundaries.
Most quadrangular representation methods need to carefully design aspect ratios of anchor boxes which depend heavily on experience and may have multiple stages. Moreover, for curved text instances, which are common in application, quadrangular representation is unsatisfactory.
Segmentation mask representation is mainly based on classifying each pixel in the image and then clustering pixels into different text instances. To the best of our knowledge, semantic segmentation mask representation is quite suitable for text regions of arbitrary shapes. Zhang et al. [10] detected multi-oriented text by semantic segmentation and MSER-based algorithms. Pixel-Link [11] performs text/non-text and link prediction on an input image, then applies post-processing to get text boxes and filter noise. To separate different text instances, PSENet [12] outputs various text kernels and uses a progressive scale expansion algorithm to obtain the final text boxes. However, it has many output kernels, which may have negative effects on localization results. [13] adopts a mirror symmetry of FPN [14] to produce embedding features and text foreground masks, and uses cluster processing to detect texts. DB [15] proposes a Differentiable Binarization module to predict the shrunk regions, which are then dilated with a constant expanding ratio. However, as shown in Figure 2, using a constant expanding ratio is inaccurate according to the Vatti clipping algorithm [16].
Hybrid approaches combine quadrangular representation methods and segmentation mask representation methods: they predict the segmentation mask of scene texts and regress the bounding boxes of the text instances at the same time. EAST [17] and DeepReg [18] adopt FCNs to predict shrinkable text score maps and perform per-pixel regression, followed by NMS post-processing. Mask TextSpotter [19] detected arbitrary-shaped text instances in an instance segmentation manner based on Mask R-CNN.
While quadrangular representation is often not flexible enough, segmentation mask representation fails to separate dense adjacent text instances. In our proposed method, the novel text region representation can flexibly detect dense adjacent text of arbitrary shape, and our proposed Polygon Expansion Algorithm needs only one clean and efficient step.

[Figure 2: the detection boundary of [15] is inaccurate, especially for the text "naUGHTY".]

III. PROPOSED METHOD
This section is organized as follows: we first introduce the overall structure of the proposed method and the novel text region representation, then the training strategy. Next, we illustrate the details of the polygon expansion algorithm. Then, we introduce how to generate the central text region mask and expanding ratios label. Lastly, we explain the loss function design.

A. OVERALL PIPELINE
The overall structure of our proposed network is illustrated in Figure 3. First, ResNet50 [20] is used as the backbone of our network. As mentioned before, text instances in natural images usually have arbitrary shapes. The superior performance of Deformable Convolutional Networks [21] arises from their ability to adapt to the geometric variation of objects. Therefore, we add deformable convolutional networks (DCN) to the ResNet50 backbone to extract features. Similar to [21], we apply deformable convolutional layers at stages 3 to 5 of ResNet50. Specifically, we use deformable 3×3 convolutions instead of regular 3×3 convolutions in the bottleneck.
Second, we use a feature merging strategy similar to that of U-Net [22] to combine low-level texture features and high-level semantic features. The high-level semantic features are upsampled and concatenated with the low-level texture features. Therefore, the fused feature has various receptive fields, which are adaptive to texts of arbitrary shapes. Next, the fused feature map is fed into three branches to produce the complete text segmentation results, along with the central text region maps and expanding ratios. After obtaining these outputs, we use the polygon expansion algorithm to expand the central text regions to their complete text regions using the corresponding expanding ratios.

B. TEXT REGION REPRESENTATION
As mentioned in the introduction section, text targets are commonly represented in two forms: rectangles or quadrangles, and segmentation masks. The first representation falls short when handling curved texts; the other encounters difficulty separating dense adjacent text instances. Therefore, we propose a new representation for text regions. As shown in Figure 1(c), we apply the central text region map, which has a similar shape to the original text region, together with the expanding ratio, to represent the text region. On the one hand, the representation retains the advantage of segmentation mask representation, precisely describing text regions of arbitrary shapes. On the other hand, the representation solves the problem of separating dense adjacent text regions owing to the large geometrical margins among central text regions.

C. TRAINING STRATEGY
For our proposed text region representation method, the representation of each text region is not unique. Specifically, we can represent the same text instance with different expanding ratios and corresponding central text regions, as shown in Figure 4. Moreover, multiple text instances can appear in the same image, and their expanding ratios can differ. Therefore, we need to find an adaptive representation for each text region. In every training iteration, we propose to train our network on different text instances of the same image labeled with different expanding ratios and corresponding central text regions. A complementary strategy is to use different expanding ratios of the same text instance (e.g., Figure 4) in different training iterations. Over multiple iterations, the network learns adaptive representations for text instances of various scales and aspect ratios.
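The sampling idea above can be sketched in a few lines. The candidate ratio set here is an illustrative assumption, not the paper's chosen values:

```python
import random

# Candidate expanding ratios used to generate alternative central-region
# labels for the same text instance (illustrative values, not the
# paper's exact set).
CANDIDATE_RATIOS = [0.3, 0.4, 0.5, 0.6]

def sample_training_labels(instances, rng=random):
    """Pick one expanding ratio per instance at random, so that across
    training iterations the network sees the same instance represented
    by different (central region, ratio) pairs."""
    return [{"instance": inst, "ratio": rng.choice(CANDIDATE_RATIOS)}
            for inst in instances]
```

Calling this once per iteration yields a fresh assignment of ratios to instances, which is all the strategy requires.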

D. POLYGON EXPANSION ALGORITHM
As aforementioned, we adopt the central text region map and the expanding ratio to represent the text region. However, our ultimate goal is to detect the entire text region. Therefore we propose a polygon expansion algorithm to obtain the complete text region detection results.
The polygon expansion algorithm is basically an isometric scaling of polygons: corresponding boundaries of the original and expanded polygon are equidistant. For rectangles such as the one in Figure 5(a), it is simple to derive the new coordinates with the expanding ratio d; each corner is moved outward by d along both axes, i.e., (x', y') = (x ± d, y ± d), as in Eqn. (1). However, text instances in natural images often have arbitrary shapes. Conveniently, methods like findContours in OpenCV can be applied to obtain the boundaries of central text region maps. More specifically, using the findContours method, we obtain the boundary points that represent the central text region map. Moreover, for curved texts, more points are adopted for text region representation, as shown in Figure 5(b). The procedure of expanding the central text region to the complete text instance with the corresponding expanding ratio is illustrated in Figure 5(c). It shows how to obtain one of the full text region points q_i from the central text region point p_i and the expanding ratio d. First, we sort all the central text region boundary points in a clockwise direction to get P. Second, we compute the vectors from point p_i to its two adjacent points, denoted v_1 and v_2. Then we derive the sine of the angle between v_1 and v_2 through their cross product, as formulated in Eqn. (2). Third, we compute the unit vectors of v_1 and v_2 to obtain the unit vector of the angle bisector, as shown in Eqn. (3).
Finally, according to the geometric relationship, the vector from p_i to q_i can be obtained using the predicted expanding ratio d, as shown in Eqn. (4). Therefore, we can obtain the full text region point q_i from p_i and this vector. Similarly, we can derive all the remaining full text region points as a result.
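For reference, the quantities in these steps can be written compactly as follows. This is our reconstruction from the surrounding description (the displacement magnitude d/sin(θ/2) follows from the equidistant-offset geometry, and may differ in form from the paper's original equations):

```latex
\sin\theta = \frac{\lvert v_1 \times v_2 \rvert}{\lvert v_1\rvert\,\lvert v_2\rvert} \qquad (2)
\qquad
u = \frac{\dfrac{v_1}{\lvert v_1\rvert} + \dfrac{v_2}{\lvert v_2\rvert}}
         {\left\lvert \dfrac{v_1}{\lvert v_1\rvert} + \dfrac{v_2}{\lvert v_2\rvert} \right\rvert} \qquad (3)
\qquad
\overrightarrow{p_i q_i} = -\,\frac{d}{\sin(\theta/2)}\, u \qquad (4)
```

Here u is the unit vector of the interior angle bisector at p_i, so moving against it by d/sin(θ/2) shifts every edge outward by exactly d.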
The details of the polygon expansion algorithm are summarized in Algorithm 1. In the pseudocode, P is the collection of central text region boundary points, D is the corresponding expanding ratio, and Q represents the detected full text region points.
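As a concrete sketch of Algorithm 1, the corner expansion can be implemented in pure Python. This is a simplified version assuming a convex boundary ordered as in the text; the straight-corner branch and the exact numerical guards are our additions:

```python
import math

def expand_polygon(points, d, eps=1e-9):
    """Expand a convex polygon outward so that every edge moves by
    distance d.  For each vertex p_i, the new vertex q_i lies along the
    (outward) angle bisector at distance d / sin(theta/2), where theta
    is the corner angle formed by the vectors to the adjacent vertices."""
    n = len(points)
    out = []
    for i in range(n):
        px, py = points[i]
        ax, ay = points[i - 1]          # previous vertex
        bx, by = points[(i + 1) % n]    # next vertex
        v1 = (ax - px, ay - py)
        v2 = (bx - px, by - py)
        l1 = math.hypot(*v1)
        l2 = math.hypot(*v2)
        u1 = (v1[0] / l1, v1[1] / l1)   # unit vectors, cf. Eqn. (3)
        u2 = (v2[0] / l2, v2[1] / l2)
        sx, sy = u1[0] + u2[0], u1[1] + u2[1]
        s_len = math.hypot(sx, sy)
        if s_len < eps:
            # Straight (180-degree) corner: move along the edge normal.
            # The normal direction assumes the vertex ordering used here.
            nx, ny = -u2[1], u2[0]
            out.append((px + nx * d, py + ny * d))
            continue
        # sin(theta/2) from cos(theta) = u1 . u2 (half-angle identity)
        cos_t = u1[0] * u2[0] + u1[1] * u2[1]
        sin_half = math.sqrt(max((1.0 - cos_t) / 2.0, eps))
        scale = d / sin_half / s_len
        # Move against the interior bisector, i.e. outward
        out.append((px - sx * scale, py - sy * scale))
    return out
```

For a unit square and d = 1, every corner moves by 1 along both axes, matching the rectangle case of Figure 5(a).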

E. LABEL GENERATION
As illustrated in Figure 3, our network produces the central text and full text maps, as well as the expanding ratio. Therefore, it requires corresponding ground truths with segmentation masks and expanding ratios. In practice, we can construct these ground truth labels simply and effectively by shrinking the original text labels. As shown in Figure 4, the polygon with blue boundaries is the original text label, which denotes the full text mask label. To obtain the central text mask labels, we use the Vatti clipping algorithm [16] to shrink the original text labels. The three green masks in Figure 4 are the central text labels obtained with different expanding ratios from the original text label. Subsequently, each central text region and original text region is converted into a 0/1 binary mask for segmentation ground truth labelling. For the expanding ratio label, similar to the rotation angle labels in EAST [17], we fill the central text region with the expanding ratio value.
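A minimal sketch of the shrink-offset computation is below. The formula d = A(1 − r²)/L (area A, perimeter L, ratio r) is borrowed from related shrink-mask work and is an assumption here; the actual polygon clipping would then be done with a Vatti-style library such as pyclipper:

```python
import math

def shrink_distance(points, ratio):
    """Offset distance for shrinking a text polygon into its central
    region, using the common choice d = A * (1 - r^2) / L, where A is
    the polygon area (shoelace formula) and L its perimeter."""
    n = len(points)
    area = 0.0
    perim = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        area += x1 * y2 - x2 * y1                 # shoelace term
        perim += math.hypot(x2 - x1, y2 - y1)     # edge length
    area = abs(area) / 2.0
    return area * (1.0 - ratio ** 2) / perim
```

For a unit square and r = 0.5, this gives d = 1 · 0.75 / 4 = 0.1875, i.e. each edge moves inward by roughly a fifth of the side length.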

F. LOSS FUNCTION
For training our network, the loss function can be formulated as:

L = λ1·Lc + λ2·Ls + λ3·Ld,

where Lc represents the loss for the complete text region maps, Ls represents the loss for the central ones, and Ld represents the expanding ratio regression loss. λi (i = 1, 2, 3) are hyper-parameters which balance the importance of the different losses.
Most text instances in natural images occupy only an extremely small region, which biases the predictions of the network towards the non-text regions if binary cross entropy [23] is used. To address the imbalance of positive and negative samples, we adopt the dice coefficient [24] in our experiments. It directly uses the segmentation evaluation protocol as the loss to supervise the network and also ignores a large number of background pixels when calculating the Intersection-over-Union (IoU). The dice coefficient D(R, G) is formulated as follows:

D(R, G) = 2 Σx,y (Rx,y · Gx,y) / (Σx,y R²x,y + Σx,y G²x,y),

where R and G represent the prediction result and ground truth, respectively. Furthermore, as stated in [25], detection tasks usually contain an overwhelming number of easy examples and a small number of hard examples. Automatic selection of these hard examples can make training more effective and efficient. Therefore, we apply Online Hard Example Mining (OHEM) [25] to Lc during training to obtain better performance.
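The hard-example selection can be sketched as follows. The 1:3 positive-to-negative ratio matches the training details given later; selecting per-pixel losses from flat lists is our simplification:

```python
def ohem_mask(losses, labels, neg_pos_ratio=3):
    """OHEM-style training mask: keep every positive pixel and only the
    hardest negatives (those with the highest loss), at a 1:3
    positive-to-negative ratio."""
    pos = [i for i, l in enumerate(labels) if l == 1]
    neg = [i for i, l in enumerate(labels) if l == 0]
    neg_sorted = sorted(neg, key=lambda i: losses[i], reverse=True)
    keep = set(pos) | set(neg_sorted[: max(1, len(pos)) * neg_pos_ratio])
    return [1 if i in keep else 0 for i in range(len(labels))]
```

Pixels outside the returned mask simply contribute nothing to Lc, so easy background dominates the gradient far less.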
Lc focuses on segmenting text and non-text regions. We obtain the training mask M using OHEM, so Lc can be formulated as:

Lc = 1 − D(Rc · M, Gc · M),

where Rc and Gc denote the complete text segmentation result and ground truth. Ls is the loss for central text regions. As in PSENet [12], we ignore the pixels of the non-text regions in the segmentation result Rc to avoid redundancy. Therefore, Ls is formulated as:

Ls = 1 − D(Rs · W, Gs · W),

where Rs and Gs denote the central text segmentation result and ground truth, and W is a mask which ignores the pixels of the non-text region in Rc. Ld is the regression loss of the expanding ratio, which is the key part of obtaining the final detection from the central text region. We employ a pixel-wise smooth L1 loss [26] to optimize it:

Ld = Σi smoothL1(Pi − Qi), with smoothL1(x) = 0.5x² if |x| < 1, and |x| − 0.5 otherwise,

where P and Q represent the regression result and ground truth, respectively.
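The loss components described in this section can be sketched directly. The λ weights here are placeholders, not the paper's tuned values:

```python
def dice_coefficient(pred, gt, eps=1e-6):
    """D(R, G) = 2*sum(R*G) / (sum(R^2) + sum(G^2)) over flattened maps;
    eps avoids division by zero on empty masks."""
    inter = sum(r * g for r, g in zip(pred, gt))
    denom = sum(r * r for r in pred) + sum(g * g for g in gt)
    return (2.0 * inter + eps) / (denom + eps)

def smooth_l1(x):
    """Pixel-wise smooth L1 term: quadratic near zero, linear beyond."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def total_loss(l_c, l_s, l_d, lambdas=(1.0, 1.0, 1.0)):
    """L = lambda1*Lc + lambda2*Ls + lambda3*Ld (placeholder weights)."""
    return lambdas[0] * l_c + lambdas[1] * l_s + lambdas[2] * l_d
```

The dice losses Lc and Ls would then be 1 minus the dice coefficient of the masked prediction and ground-truth maps.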

IV. EXPERIMENT
In this section, we first briefly introduce the datasets and explain the implementation details. Then we discuss our proposed method and conduct ablation studies. Finally, five standard benchmarks are used for performance evaluation: TotalText, CTW1500, MSRA-TD500, ICDAR 2015 and DAST1500, on which we achieve results on par with or better than state-of-the-art methods.

A. BENCHMARK DATASETS
ICDAR 2017 MLT [27] is a large-scale multi-lingual text dataset covering 9 languages. It contains three parts: 7200 training images, 1800 validation images, and 9000 test images. The text regions are annotated by the 4 vertices of the word quadrangle. We use both the training set and the validation set to pretrain our model.

ICDAR 2015 [28] is a dataset which focuses on natural scene text images. Text regions are multi-oriented. The benchmark consists of 1500 images, 1000 of which are used for training and the remainder for testing. Image annotations are labelled as word-level quadrangles. In our training, we simply ignore instances affected by motion blur and other problems, marking them as 'DO NOT CARE'.
CTW1500 [29] is a commonly used dataset for challenging long curved text detection. It contains a total of 1500 images: 1000 for training and 500 for testing. In contrast to multi-oriented text instance labels (e.g., ICDAR 2015, ICDAR 2017 MLT), the text regions in CTW1500 are annotated using polygons with 14 vertices.
MSRA-TD500 [30] consists of 300 training images and 200 test images collected from natural scenes. It is a multi-language dataset, which includes English and Chinese. The scene texts have arbitrary orientations and are labeled with inclined boxes made up of 4 points at the sentence level. There are some long straight text lines.
TotalText [31] is a newly released dataset which focuses on text of various shapes. This dataset includes horizontal, multi-oriented and curved text instances. There are 1255 training images and 300 testing images. The scene texts in these images are labeled at the word level with an adaptive number of corner points.
DAST1500 [32] is a dense and arbitrary-shaped text detection dataset, which contains 1538 images and 45,963 line-level detection annotations (including 7441 curved bounding boxes). The images are manually collected from the Internet and are of size around 800 × 800. This dataset is multi-lingual, including mostly Chinese, with a few English and Japanese texts. The images are divided as follows: 1038 images for training and 500 images for testing.
Following the ICDAR evaluation protocol, we evaluate our proposed method in terms of Recall, Precision and F-measure. Recall is the fraction of correct detection regions among all the text regions in the dataset, while Precision is the fraction of correct detection regions among all the detection regions. F-measure considers both recall and precision to compute the score. A detection region is considered correct if the overlap between the prediction and ground truth is larger than a given threshold. The computation of the three evaluation terms usually differs across datasets. The ICDAR datasets can be evaluated via the ICDAR robust reading competition platform, and the other datasets can be evaluated with their corresponding evaluation methods.
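Given counts of correct detections (those whose IoU with a ground-truth region exceeds the dataset's threshold), the three metrics reduce to a few lines:

```python
def f_measure(num_correct, num_pred, num_gt):
    """Precision, recall and F-measure from counts of correct
    detections, total predictions, and total ground-truth regions."""
    precision = num_correct / num_pred if num_pred else 0.0
    recall = num_correct / num_gt if num_gt else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

The IoU matching step that produces `num_correct` is dataset-specific and is deliberately left out of this sketch.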

B. IMPLEMENTATION DETAILS
Training The backbone of our network is ResNet50 [20] with deformable convolutional networks (DCN) [21], where we replace regular 3×3 convolutions with deformable 3×3 convolutions in stages 3 to 5 of ResNet50. The network layers are initialized with a ResNet50 model pretrained for ImageNet [33] classification, while the other new layers in our network are randomly initialized via He's method [34]. For ICDAR 2017 MLT, we use the 7200-image training set and the 1800-image validation set to train our network. We use stochastic gradient descent (SGD) to optimize our network. We train the network with a batch size of 8 for 600 epochs; the initial learning rate is set to 1 × 10^−3 and is divided by 10 at 200 epochs and 400 epochs. We initially adopted the warmup strategy of [35], but found that the network converges quickly even without warmup. We use a weight decay of 5 × 10^−4 and a Nesterov momentum [36] of 0.99 without dampening. Our implementation also includes batch normalization [37] and OHEM [25], with a ratio of positive to negative samples of 1:3.
There are two training strategies for the rest of the datasets: training from scratch and fine-tuning on the IC17-MLT model. When training from scratch, the training settings are the same as above. For fine-tuning on the IC17-MLT model, we train the network for 300 epochs. The initial learning rate is 1 × 10^−4 and is divided by 10 at 100 and 200 epochs.
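The step schedule described above is straightforward to express in code; the milestone defaults below follow the from-scratch setting:

```python
def learning_rate(epoch, base=1e-3, milestones=(200, 400)):
    """Step learning-rate schedule: the base rate is divided by 10 at
    each milestone epoch (200 and 400 for from-scratch training; 100
    and 200 with base 1e-4 for fine-tuning)."""
    lr = base
    for m in milestones:
        if epoch >= m:
            lr /= 10
    return lr
```

Swapping `base=1e-4, milestones=(100, 200)` reproduces the fine-tuning schedule.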
Data Augmentation is used in our training. We first randomly rescale the input image with a ratio chosen from {0.5, 1.0, 2.0, 3.0}. Then random rotations, transpositions, and flipping are performed. Finally, 640 × 640 samples are randomly cropped from the transformed images.
Post Processing For quadrangular texts like those of ICDAR 2015, we can use minAreaRect in OpenCV to get the central text boundaries and then apply our proposed Polygon Expansion Algorithm to obtain the final detection result. For curved text (e.g., CTW1500), methods like findContours in OpenCV can be applied to obtain the boundaries of central text regions, followed by the Polygon Expansion Algorithm to complete the text detection task.

C. ABLATION STUDY
To verify the effectiveness of our proposed method, we conduct a series of comparative experiments on the ICDAR 2015 test set.
Influence of the full text region map. As shown in Figure 3, in order to get the final detection result using the polygon expansion algorithm at the inference step, we only need the central text region map and the expanding ratio. Therefore, it is not strictly necessary to predict the full text region map. However, if we only predict the central text map without the full text map prediction, the detection performance is unsatisfactory, as illustrated in Table 1. The models are evaluated on the ICDAR 2015 dataset. The central text region map loses a lot of positive sample information, and text instances in natural images occupy only an extremely small region; without the full text map, the imbalance of positive and negative samples impacts the results more seriously.

Influence of multi-task learning. As shown in Figure 3, our pipeline is a multi-task network, which includes both expanding ratio regression and pixel classification tasks. A deep multi-task network, which produces multiple predictive outputs, can offer faster speed and better performance than its single-task counterparts. Such a network makes multiple inferences in one forward pass, so the speed is faster; on the other hand, the model shares weights across multiple tasks, which induces more robust regularization and boosts performance as a result. We use ICDAR 2015 for these experiments. Table 2 shows the experimental results, from which we find that with multi-task learning, the performance of each single task on the test set is more satisfactory.
Influence of the training strategy. We investigate the effect of a fixed expanding ratio versus multiple expanding ratios on the performance of our method. As shown in Figure 3, we use the polygon expansion algorithm to expand the central text region map by the expanding ratio. If we fix the expanding ratio (e.g., 0.5) to generate central text maps for training, there is no need to regress the expanding ratio: we can directly expand the central text map with the fixed expanding ratio. However, as explained in [13], this may fail in the following case: boxes dilated from central areas with a fixed expanding ratio are sometimes smaller or larger than the ground truth boxes. The models are evaluated on the ICDAR 2015 dataset. We can see from Figure 6 that the F-measures on the test set drop when the fixed expanding ratio is too large or too small. When the fixed expanding ratio is too small, separating text instances lying close to each other is difficult; when it is too large, the predicted central mask is sometimes split into several parts. When the fixed expanding ratio is 0.4, the performance on the test set is best, with an F-measure of 79.5%. Note that when setting the fixed expanding ratio to 0, we simply use the full text segmentation maps as the final result, without the polygon expansion algorithm. When we instead use multiple expanding ratios to train the network, the performance is much higher than with a fixed expanding ratio (82.62% vs 79.54%). Table 3 shows the experimental results of the multiple expanding ratio method and the fixed expanding ratio (0.4) method. This justifies that the multiple expanding ratio training strategy recovers complete text regions more accurately.

Influence of the post-processing. We verify the effectiveness of our proposed post-processing method in terms of detection performance and speed.
As shown in Table 4, our method performs much faster than the previous leading method. PSE [12] has many output kernels to merge which may have negative effects on the speed. Notably, our post-processing method outperforms the previous method by 2%. The comparisons clearly demonstrate that our post-processing method is simple and efficient.
Influence of the backbone. In our proposed method, ResNet50 with deformable convolutional networks (DCN) is the backbone, while plain ResNet50 is usually used to extract deep features in other state-of-the-art methods. To better analyze the effectiveness of the backbone network, we test the proposed method with different backbone networks (ResNet50, ResNet101, ResNet152 and ResNet50-DCN) on the ICDAR 2015 test set. As shown in Table 5, under the same settings, ResNet50-DCN outperforms the others. Deformable convolutional networks clearly improve performance.

D. COMPARISON WITH STATE-OF-THE-ART METHODS
To show the effectiveness of our proposed approach on datasets of different types, we conduct a series of experiments on several benchmarks. We first evaluate our method on CTW1500 and TotalText, which contain challenging multi-oriented and curved texts, and compare its performance with state-of-the-art methods. Next, to test its ability for oriented text detection, we compare the methods on the widely used ICDAR 2015 benchmark. Then, to test the robustness of the proposed method to multi-language and long straight text, we evaluate it on the MSRA-TD500 dataset. Finally, to show that the proposed method works well for dense adjacent text lines in natural scene images, we evaluate it on the DAST1500 dataset.
Curved text detection. As shown in Table 6, we compare the proposed method with state-of-the-art methods on CTW1500. We use two training strategies in the experiment, as explained in the Implementation Details section: training from scratch and fine-tuning on a model trained with extra data. Notably, when other methods and our method train with [...]

[Table 2: Ablation study on the multiple tasks. "F-Seg" represents the full text region map segmentation task, "C-Seg" the central text map segmentation task, and "Ratio-R" the expanding ratio regression task. "Ext" indicates external data. Results are from the ICDAR 2015 test set; we observe an improvement in performance when training with the multi-task loss.]

[Table 6: Single-scale results on CTW1500. "R", "P" and "F" represent the recall, precision, and F-measure, respectively. "Ext" refers to external data.]
Multi-oriented text detection. Although our method is segmentation-based and specifically designed for text detection of curved shapes, our approach also adapts to oriented text detection. The detection performance of our method and other state-of-the-art methods on IC15 is given in Table 8. Similar to the CTW1500 training methods and settings, there are two training strategies: training from scratch and fine-tuning on a model trained with extra data. When training from scratch, the performance of our method surpasses the state-of-the-art results by more than 0.1%. Moreover, when fine-tuning on a model trained with extra data, our approach achieves competitive performance with state-of-the-art methods, which means it can also process multi-oriented text well.

[Table: Single-scale results on TotalText. "R", "P" and "F" represent the recall, precision, and F-measure, respectively. "Ext" refers to external data.]
Multi-language and long straight text detection. Our method adapts to multi-language and long straight text detection. As shown in Table 9, the F-measure of the proposed method is 78.9% and 83.6% when training without and with external data, respectively. The method surpasses the previous state-of-the-art F-measure by 2.8% when training without extra data; when training with extra data, the F-measure is comparable with the previous best performance. Thus, our method can indeed be deployed in complex natural scenarios.
Dense adjacent text detection. As shown in Table 10, we compare the proposed method with state-of-the-art methods on DAST1500. For the F-measure, the method surpasses the previous state-of-the-art method by 1.9%. The performance on DAST1500 demonstrates the solid superiority of the proposed method in detecting dense and arbitrary-shaped scene text.
Across all the table results, there is a clear trend that the performance of the proposed method improves when extra data is used during training, compared to the results without extra data. There may be several reasons: on the one hand, according to statistical learning theory, obtaining a model with small test error requires more data to suppress the model complexity penalty; on the other hand, more data covers more varied scene texts, which improves the robustness of the deep learning model.

E. SPEED
Although our method is segmentation-based, we take the speed of the entire pipeline into consideration. Optimized along with the proposed text region representation and the polygon expansion algorithm, the segmentation network not only simplifies the post-processing but also enhances the performance of arbitrary-shaped text detection. The speed of the proposed method is compared with two other methods in Table 11, all of which are able to deal with arbitrary-shaped scene text. From the table, we can see that the speed of our method is much faster than that of the other two methods. While complicated post-processing is needed in PSENet [12] and TextSnake [8], the post-processing in the proposed method is efficient. Figure 7 illustrates qualitative results on CTW1500 and TotalText. Figure 8 shows some examples on MSRA-TD500 and ICDAR 2015, and Figure 9 shows examples on DAST1500. We find that our proposed method adapts to texts of arbitrary shapes, including horizontal, multi-oriented and curved text.

V. CONCLUSION
In this paper, we propose a robust deformable framework with a novel text region representation to successfully detect text instances of arbitrary shapes in natural images. With the proposed training strategy, we improve the representation robustness against text instances of various scales and aspect ratios. By using expanding ratios to expand the central text regions with our proposed polygon expansion algorithm, we can detect text instances effectively, and it is quite easy to separate spatially close text instances. The performance on scene text detection benchmarks demonstrates the effectiveness of the text region representation and the post-processing algorithm. Possible future work includes extending our method to other segmentation tasks, addressing challenges of arbitrarily oriented text detection, and shortening the running time of segmentation-based methods.

[Table 8: Single-scale results on IC15. "R", "P" and "F" represent the recall, precision, and F-measure, respectively. "Ext" refers to external data.]