TSER: A Two-Stage Character Segmentation Network With Two-Stream Attention and Edge Refinement

Segmenting characters in an image is a classic yet challenging task in computer vision. Correctly determining the boundaries of adhesive characters with various scales and shapes is essential for character segmentation, especially for separating handwritten characters. Nevertheless, little work in the literature achieves satisfactory performance. In this article, by leveraging the ability of deep neural networks, we propose a two-stage character segmentation network with two-stream attention and edge refinement (TSER) to tackle this problem. TSER first locates every character by object detection, then extracts the corresponding contours. In the process, a novel two-stream attention mechanism (TSAM) is proposed to make the network focus more on the discrepancies at character boundaries. Furthermore, a novel generating method dynamically generates anchors on different feature levels to improve the model's sensitivity to the shapes and scales of characters. Finally, a cascaded edge refinement network obtains the contour of each character. To demonstrate the effectiveness and generalization ability of our model, we compared TSER with traditional algorithms and other deep learning models on two commonly used datasets in different segmentation tasks. The comparative results indicate that TSER reaches state-of-the-art performance.


I. INTRODUCTION
Character segmentation is a vital step in the traditional optical text recognition process. Although its importance for text recognition has declined with the development of deep-learning-based methods, such as encoder-decoder based models [1], [2], it is still a core technology in many important real-world applications. For instance, recognizing CAPTCHAs [3], an important technology in modern automatic systems, demands a robust algorithm to segment characters accurately for further recognition. In addition, recognizing inscriptions and handwritten documents also needs an effective segmentation method. However, correctly segmenting characters in various scenarios is challenging and remains unsolved. The main reason is that the boundaries of adhesive characters with different scales and shapes are difficult to determine, especially for handwritten and pictographic characters. Figure 1 shows some typical mistakes that prediction models can easily make.
For a long time, rule-based algorithms were the mainstream choice. For example, connected component analysis is usually regarded as an effective method. The key concept in this method is the connected component: a region consisting of adjacent pixels with the same pixel value. Le et al. [4] found all the connected components and labeled them. Another conventional approach of great concern is the projection algorithm. Kesiman et al. [5] segmented characters by projecting images from two directions. When the spacing between all characters is apparent, these methods are effective and time-saving. However, when characters overlap or adhere to each other, under-segmentation or over-segmentation often occurs. Moreover, traditional methods cannot learn from data with a variety of distributions, which makes the above problems hard to solve. Deep learning, a representative approach to signal processing [6], [7], has been developing rapidly and can address many tricky problems. For character segmentation, semantic segmentation networks such as FCN [8], SegNet [9], U-Net [10], etc. are preferred. However, none of them can obtain the subtle boundaries of characters, and therefore they cannot accurately segment characters. Meanwhile, various instance segmentation methods, such as FCIS [11], Mask R-CNN [12], and PANet [13], can possibly tackle this problem by predicting coarse segmentation masks for characters within fixed-size ROI patches. However, these methods are designed to detect and delineate each distinct object of interest in an image and do not focus on character segmentation. The character segmentation task differs from the common instance segmentation task: for example, the features and arrangement of characters may differ from those of common objects. Therefore, directly applying these instance segmentation networks to character segmentation cannot achieve good performance.
In this paper, we propose a two-stage character segmentation network that can accurately segment characters from text line images with random noise under various situations: normal spacing, subtle spacing, adhesive characters, partially overlapping characters, characters with deflection angles, and characters with different scales and shapes.
The main contributions of this paper are summarized as follows: • We propose a novel character segmentation network based on a two-stage detection network. This model is effective at segmenting adhesive characters and can be generalized to other similar character segmentation tasks.
• A two-stream attention mechanism is proposed to guide the feature selection process of the model. This attention mechanism helps distinguish the boundaries of adhesive characters and assists in finding every character instance.
• A guided anchoring method is applied to produce sparse and appropriate anchors instead of the dense and redundant ones of the traditional region proposal network. This module reduces computational cost and generates anchors that fit various character shapes and scales.
• A cascaded edge refinement network is put forward as an auxiliary module to generate contours for characters from every bounding box, because contours surround characters better than rectangular boxes in most cases. The remainder of this article is structured as follows: In Sect. 2, relevant research in this field is presented. In Sect. 3, we present formal definitions for the character segmentation task and elaborate our solution in detail. In Sect. 4, we describe model training details. In Sect. 5, we show the results of the comparative and ablation experiments. Finally, in Sect. 6, some conclusions are drawn.

II. RELATED WORK
A. TRADITIONAL SOLUTION
Almost all traditional character segmentation methods are based on rules and conventional machine learning techniques. They are relatively simple and easy to understand, which made them popular in the past. The projection algorithm uses horizontal projection to slice the text line and vertical projection to slice individual characters. Le et al. proposed a connected-component-based method, introducing a classifier to further evaluate whether a connected component is text or not. Phan et al. [14] proposed a gradient-vector-flow based method for video character segmentation, which finds the spaces between characters by minimum cost path estimation. Sharma et al. [15] proposed a character segmentation method for multi-oriented video that is sensitive to dominant points. Liang et al. [16] proposed a novel wavelet Laplacian method to segment characters with arbitrary orientation, which explores zero-crossing points to find spaces between words or characters. These approaches perform well only on regular text with little interference; if the image contains interference on the text, their performance degrades significantly.

B. SEMANTIC SEGMENTATION
Deep-learning-based character segmentation methods can handle images with various kinds of interference. For semantic segmentation, FCN was first proposed to split the input image into different semantically interpretable categories. The encoder-decoder architecture of FCN was efficient and easy to extend, so more FCN-based approaches were proposed, such as U-Net, Deeplab [17], RefineNet [18], and so on. All of them can segment characters with subtle spacing in images accompanied by slight interference. However, these networks still fail to achieve satisfactory segmentation results because they cannot fully handle adhesive characters with different scales, especially when there is apparent interference in the images.

C. OBJECT DETECTION
In terms of object detection networks (ODN), almost all of them can be classified into single-stage or two-stage networks. Some popular two-stage ODNs such as R-CNN [19], Fast R-CNN [20], and Faster R-CNN [21] can be used to segment characters with particular improvements. For instance, MS-CNN [22] introduced FPN [23] to boost detection on multi-scale images. R-FCN proposed a position-sensitive pooling method to accelerate the training process. To obtain better multi-scale feature maps, PANet applied an improved FPN with deeper sampling layers to its backbone network. In the meantime, many single-stage detection networks have also been proposed, such as SSD [24], YOLO [25], and RetinaNet [26]. They are faster but slightly less accurate than the two-stage models.

D. INSTANCE SEGMENTATION
Instance segmentation networks (ISN), which creatively combine semantic segmentation networks with ODNs, can be used for character segmentation. An effective way is to first use an ODN to find the different instances, and then classify each pixel in every instance to segment instances precisely. To distinguish between different instances of the same category, an ISN also creates an independent mask for each object. At present, the most popular ISNs are FCIS, Mask R-CNN, TensorMask [27], etc. ISNs offer the possibility of segmenting adhesive characters; we adopt the same idea in this work.

E. ATTENTION MECHANISM
Attention mechanisms are widely used in computer vision [28], [29] and natural language processing [30]. Wang et al. [31] proposed a non-local network for video classification based on a space-time dependency attention mechanism. Hu et al. [32] focused on the relationship between channels and features, introducing an SE module to adaptively obtain channel-wise features. Similarly, CCNet [33] proposed a novel criss-cross attention module, which captures contextual information from long-range dependencies in an effective way. Such mechanisms can therefore be used to make a model focus more on the boundaries of characters.

F. ANCHOR
Anchors are a set of manually designed candidate boxes used for object classification and box regression in ODNs. In both single-stage and two-stage detectors, anchors give the approximate locations of objects. A common way to generate anchors is the sliding window method: it first defines a certain number of anchors with specific scales and aspect ratios, then slides them across the image from top to bottom and left to right with a certain stride. This technique is widely used in conventional detection networks such as Faster R-CNN, SSD, and RetinaNet. However, as object detection tasks become more complex, it becomes necessary to generate sparse anchors with high flexibility in shape and position. Therefore, guided anchoring was proposed by Wang et al. [34] to generate anchors with neural networks. Anchors generated in this way have adaptive scales, and redundant anchors are also reduced. In pursuit of an efficient anchor generation approach, our method also employs neural networks to produce anchors automatically.
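To see why sliding-window anchors are dense, consider the following minimal enumeration sketch (the function name and parameters are illustrative, not from the paper): every grid position is crossed with every scale and aspect ratio, so the anchor count grows multiplicatively.

```python
def sliding_window_anchors(img_w, img_h, stride, scales, ratios):
    """Enumerate dense anchors the classic way: one (cx, cy, w, h) anchor per
    (grid position, scale, aspect ratio) combination, centred on a stride grid."""
    anchors = []
    for cy in range(stride // 2, img_h, stride):
        for cx in range(stride // 2, img_w, stride):
            for s in scales:
                for r in ratios:
                    # aspect ratio r = w / h with area preserved at s * s
                    w, h = s * r ** 0.5, s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

# Even a tiny 64x64 image with 2 scales and 3 ratios yields 96 anchors.
dense = sliding_window_anchors(64, 64, stride=16, scales=[32, 64], ratios=[0.5, 1, 2])
```

On full-resolution text images this enumeration produces tens of thousands of mostly empty boxes, which is exactly the redundancy that guided anchoring is meant to avoid.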
So far, almost all character segmentation methods fail to accurately find the boundaries of multi-scale and adhesive characters. Moreover, each method is designed for a specific task; there is no universal method that generalizes across different character segmentation tasks. In this paper, we propose a novel attention mechanism to improve the feature representation ability of our model, and simultaneously utilize the guided anchoring approach to generate high-quality sparse anchors. Finally, an edge refinement network is proposed to obtain accurate character contours for further post-processing.

A. PROBLEM DEFINITION
An RGB image of arbitrary size is the input to TSER, which can be defined as follows:
X ∈ R^{h×w×c} (1)
where h and w represent the height and width of the input image, and c ∈ {1, 2, . . . , n} indexes its channels; for RGB images, n is equal to 3. The objective of TSER is twofold: first, to locate every character precisely by providing a unique bounding box for each character, and then to segment every character from the image by semantic segmentation. The whole process can accordingly be split into two stages. For the first stage, the output is the set of located boxes:
B = {(x_i, y_i, w_i, h_i) | i = 1, 2, . . . , I} (2)
where B is a set of bounding boxes marking the location of every character, (x, y) denotes the coordinates of the center point of a box, (w, h) denotes its width and height, and I is the number of characters in the image. For the second stage, the output is the set of final character contours:
C = {c_1, c_2, . . . , c_I} (3)
where C is a set of polygons representing all the character contours in the image. Note that TSER also outputs a score list S as in Eq. (4):
S = {s_1, s_2, . . . , s_I} (4)
where S represents the probability that each detected box is valid. It is an auxiliary value for determining whether the boxes are expected results.
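The three outputs defined above can be summarized in a small container type; the following sketch is illustrative only (the class name `TSEROutput` and field layout are our own, not from the paper), but it makes the per-character correspondence between boxes, contours, and scores explicit.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TSEROutput:
    """Outputs of TSER for one image: boxes B (Eq. 2), contours C (Eq. 3),
    and validity scores S (Eq. 4), one entry of each per detected character."""
    boxes: List[Tuple[float, float, float, float]]  # (x, y, w, h) center + size
    contours: List[List[Tuple[float, float]]]       # polygon vertices per character
    scores: List[float]                              # probability each box is valid

# One detected character: its box, a 4-point polygon contour, and a score.
out = TSEROutput(
    boxes=[(10.0, 10.0, 5.0, 8.0)],
    contours=[[(8.0, 6.0), (12.0, 6.0), (12.0, 14.0), (8.0, 14.0)]],
    scores=[0.97],
)
```

The lists are index-aligned: boxes[i], contours[i], and scores[i] all describe the i-th character, so I = len(boxes) matches the character count in the problem definition.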

B. OVERALL NETWORK ARCHITECTURE
Like most two-stage detection networks, TSER includes a region proposal network and a target detection network. As can be seen from Figure 2, the entire network consists of three components. The first is a multi-scale feature extraction network, ResNet-FPN-TSAM, which adds TSAM to ResNet-FPN so that the obtained multi-scale feature maps retain more information from the different feature layers. This mechanism is necessary for an ODN, which requires both semantic and location information. Moreover, since the extracted features are shared between the two stages, their quality directly affects the final detection results. The second component is GA-RPN, which generates sparse anchors and obtains regions of interest (ROI). Generally speaking, the traditional region proposal network (RPN) is the default choice for generating ROIs, although the anchors it produces are dense and redundant, and their quality depends on hyperparameter settings.
Using a neural network instead can generate sparser, more appropriate anchors that fit characters better, since the location and shape of the anchors are predicted rather than enumerated, effectively eliminating redundant ones. The third component is a character segmentation network that detects the final character positions and generates a contour for every character.
This component takes the proposals generated by GA-RPN as input, adjusts them to obtain precise final detection results, and eventually performs binary mask segmentation in each ROI to obtain contours. Moreover, with the help of a cascaded edge refinement network, the inter-class difference is amplified so that the obtained contours fit the characters more closely.

C. TWO-STREAM ATTENTION MECHANISM
High-level features contain rich semantic information and have larger receptive fields, making them suitable for detecting larger objects. In contrast, low-level features contain detailed information and have smaller receptive fields, and are thus suitable for detecting smaller objects. In traditional FPN-based models, features from different levels are directly added. However, these methods ignore the influence of different channels in the feature maps. Different channels usually describe different features, which means each channel demands a different attention weight. For character segmentation, most attention should be paid to the channels containing important information about characters, such as boundary, shape, aspect ratio, etc.
To address this issue, TSER uses a novel two-stream attention mechanism (TSAM) to improve feature expression ability by combining information from feature maps at different levels. On the one hand, TSAM uses the efficient feature representation ability of high-level feature maps to guide the feature selection of low-level feature maps. On the other hand, TSAM takes advantage of the detailed information in low-level feature maps to change the attention distribution over the pixels of high-level feature maps. In this way, TSAM enables the feature extraction process to retain both detailed and semantic information as much as possible. The architecture of TSAM and how it is embedded in ResNet-FPN are shown in Figure 3 and Figure 4, respectively. First, a channel-wise attention module generates a probability vector via a global pooling operation and a softmax activation function. This vector describes the relative importance of the features in different channels and is used to influence the feature distribution of the low-level feature maps. As shown in Eq. (5) and Eq. (6), the high-level feature maps are squeezed to a weight vector that affects the low-level feature maps channel-wise:
Z_c = (1/m) Σ_{j=1}^{m} P_j^c (5)
CAG = Concat_{c=1}^{n} (softmax(Z)_c · X_L^c) (6)
where CAG represents the output guided by the channel-wise attention. Z denotes the result of global average pooling, which squeezes the feature map to a weight vector. n denotes the number of channels of the input high-level feature map, m is the number of pixels in each channel, and P_j^c represents the pixel values of channel c in the high-level feature map. X_L is the low-level feature map. Because the combination of different-level feature maps is processed channel by channel, Concat_{c=1}^{n} denotes concatenating the processed feature maps from channel 1 to channel n. As shown in Eq. (7), the resulting weight matrix makes the model focus on important features.
SAG = σ(FCN(X_L)) ⊗ X_H (7)
where SAG denotes the output guided by the spatial-wise attention, FCN is a fully convolutional network, σ is the sigmoid function, and ⊗ denotes element-wise multiplication. Similarly, X_H is the high-level feature map. Since the features in the low-level feature map are sensitive to position, we use it to influence the high-level feature map for better feature selection. Eventually, the final attention fusion is obtained by combining the outputs guided by the channel-wise and spatial-wise attention, as shown in Eq. (8):
AFFM = Conv_{1×1}(CAG + SAG) (8)
where AFFM represents the attention fusion feature map guided by the two kinds of attention, and Conv_{1×1} denotes a convolutional layer with 1 × 1 kernel size to adjust the dimension.
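To make the two streams concrete, the following NumPy sketch implements a simplified TSAM fusion under stated assumptions: the learned FCN of the spatial stream is stood in for by a parameter-free channel mean followed by a sigmoid, and the resolution matching is done by strided sampling and nearest-neighbour upsampling. It illustrates only the data flow of Eqs. (5)-(8), not the trained module.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def tsam_fuse(x_high, x_low):
    """Simplified two-stream attention fusion.

    x_high: high-level feature map, shape (C, Hh, Wh)
    x_low : low-level feature map,  shape (C, Hl, Wl), with Hl/Wl multiples of Hh/Wh
    """
    C = x_high.shape[0]
    # Channel stream (Eqs. 5-6): squeeze the high-level map to a weight vector
    z = x_high.reshape(C, -1).mean(axis=1)       # global average pooling per channel
    w = softmax(z)                                # relative channel importance
    cag = x_low * w[:, None, None]                # reweight low-level channels

    # Spatial stream (Eq. 7): low-level detail gates high-level pixels
    # (a channel mean + sigmoid stands in for the learned FCN here)
    att = 1.0 / (1.0 + np.exp(-x_low.mean(axis=0)))          # (Hl, Wl)
    sh = x_low.shape[1] // x_high.shape[1]
    sw = x_low.shape[2] // x_high.shape[2]
    sag = x_high * att[::sh, ::sw][None]          # downsample gate, reweight pixels

    # Fusion (Eq. 8): upsample SAG (nearest-neighbour) and combine with CAG;
    # the real module uses a 1x1 convolution here instead of a plain sum.
    sag_up = sag.repeat(sh, axis=1).repeat(sw, axis=2)
    return cag + sag_up
```

In the real network the pooling, gating, and fusion are all learned layers; the sketch only shows how the channel weights from the high-level map and the spatial gate from the low-level map act on each other's streams.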

D. GUIDED ANCHORING REGION PROPOSAL NETWORK
The generation and selection of anchors strongly affect the performance of an ODN. For character segmentation, the text in an image is usually concentrated in a certain area. If anchors are generated by the sliding window method, as in the traditional RPN, the large number of dense anchors will unbalance the positive and negative samples and increase the computational cost. Moreover, the anchor scales and aspect ratios in RPN are set manually, which is not a flexible way to generate adaptive anchors; especially when character shapes are unusual, small-scale characters are prone to being neglected.
To relieve this problem, TSER uses a novel guided anchoring method inspired by Wang et al.'s approach. Our guided anchoring approach uses neural networks to predict the center point and shape of anchors. These two steps are independent of each other, so the process can be divided into anchor location prediction and anchor shape prediction. We define (x, y, w, h) as an anchor, where (x, y) is the coordinate of the center point and (w, h) is the width and height. The distribution of all anchors can be defined as follows:
P(x, y, w, h | I) = P(x, y | I) · P(w, h | x, y, I) (9)
The anchor location prediction uses a probability map of the same size as the input feature map to decide where the center points of candidate anchors may be located. This module can be easily implemented with a 1 × 1 convolution kernel and a sigmoid function. The anchor shape prediction obtains the most appropriate (w, h) by maximizing the generalized intersection over union (GIOU) between anchors and the ground truth box, as shown in Eq. (10):
(w, h) = argmax_{w>0, h>0} GIOU(anchor, gt) (10)
However, the raw values of w and h can be very large and may cause gradient explosion, so the predicted values are rescaled to [−1, 1] as in Eq. (11):
w = σ · s · e^{dw}, h = σ · s · e^{dh} (11)
where σ is an empirical scale factor, s is the stride of the feature map, and dw and dh are the predicted values for w and h. This nonlinear transformation projects the output range of roughly [0, 1000] onto approximately [−1, 1]. This module is also implemented with a 1 × 1 convolution kernel. After the prediction of anchor location and shape, TSER uses a deformable convolutional layer with a 3 × 3 kernel to make the feature map more adaptive to the varying anchor shapes. The architecture of the whole GA-RPN is shown in Figure 5, and some comparative results of RPN and GA-RPN are shown in Figure 7.
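Under the guided-anchoring parameterisation w = σ · s · e^{dw} described above, the shape encoding and its inverse can be sketched as follows; the `sigma` and `stride` defaults here are illustrative, not the paper's settings.

```python
import math

def encode_shape(w, h, sigma=8.0, stride=16):
    """Map an anchor's width/height to bounded regression targets (dw, dh).

    Inverting w = sigma * stride * exp(dw) keeps the network's targets in a
    small range around zero, avoiding the exploding values that raw pixel
    sizes (up to ~1000) would cause.
    """
    dw = math.log(w / (sigma * stride))
    dh = math.log(h / (sigma * stride))
    return dw, dh

def decode_shape(dw, dh, sigma=8.0, stride=16):
    """Recover pixel width/height from the predicted offsets."""
    return sigma * stride * math.exp(dw), sigma * stride * math.exp(dh)
```

For example, a 100 × 40 anchor on a stride-16 map with σ = 8 encodes to roughly (−0.25, −1.16), i.e. values of the order the prediction head can regress stably, and decoding recovers the original size exactly.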
As shown in Figure 7, a large number of anchor boxes are generated by the traditional RPN, but most fail to surround characters well, and these redundant boxes add much computational cost. By contrast, the anchor boxes generated by the guided anchoring RPN (GA-RPN) are sparse and accurate: almost all boxes are generated near the corresponding characters, and their scales and aspect ratios are more appropriate. Therefore, TSER with GA-RPN can locate characters faster and more accurately than other two-stage ISNs with a traditional RPN.

E. CASCADED EDGE REFINEMENT
Since there is only one character in an ROI, contour segmentation can also be regarded as semantic segmentation. We find that the contours predicted by FCN do not match the corresponding ground truth well because the boundary of a character is usually ambiguous. To better distinguish characters from the background, we propose a cascaded edge refinement network that refines the results during the segmentation process, making the predicted contours conform more closely to the characters. This module is composed of several refining blocks with a residual structure, each applying an attention mechanism to determine whether an area belongs to a character or not. In Figure 6, the input of this module is a set of feature maps and the output is the corresponding contour. Eq. (13) describes the calculation in every refining block.
where x is the input feature map and w represents the trainable weight matrix. n denotes the channel index, with n ∈ {1, 2, . . . , N}. RB denotes the output of a refining block. This module can be cascaded, and the number of cascaded units influences the performance. We observe that performance improves as units are added at first; however, if too many units are cascaded, performance plateaus while the number of parameters grows considerably. Thus, in our experiments we set the number of cascaded units to 5 after balancing performance and efficiency.
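As a hedged sketch of what one refining block might compute, the following NumPy code implements an attention-gated residual update: a learned feature transform is gated by a per-position attention map and added back to the input. The matrix multiplies stand in for the real 1 × 1 convolutions, so this shows the residual-plus-attention structure rather than the paper's exact Eq. (13).

```python
import numpy as np

def refine_block(x, w_feat, w_att):
    """One refining block (sketch): attention-gated transform + residual.

    x: feature map of shape (C, H, W); w_feat, w_att: (C, C) weight matrices
    standing in for 1x1 convolutions.
    """
    C, H, W = x.shape
    flat = x.reshape(C, -1)
    feat = (w_feat @ flat).reshape(C, H, W)        # 1x1-conv-like transform
    gate = 1.0 / (1.0 + np.exp(-(w_att @ flat)))   # per-position attention in (0, 1)
    return x + feat * gate.reshape(C, H, W)        # residual connection

def cascade(x, blocks):
    """Stack several refining blocks; the paper settles on 5 units."""
    for w_feat, w_att in blocks:
        x = refine_block(x, w_feat, w_att)
    return x
```

Because each block only adds a gated correction to its input, stacking more units sharpens the boundary estimate gradually, which matches the observation above that gains saturate as the cascade grows.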
F. OTHER REMEDIES
1) GIOU [35]
Usually, intersection over union (IOU) is the first choice for evaluating the overlap between a proposal box (PB) and the corresponding ground truth (GT). However, when there is no overlap between PB and GT, the IOU and its gradient are both 0, which makes the model hard to optimize further. Additionally, the accuracy of the coverage, i.e. how well PB covers GT, can be totally different even with the same IOU value; for example, in Figure 8, the accuracy of the coverage clearly decreases from left to right. Therefore, TSER uses GIOU to evaluate the overlap between PB and the corresponding GT. GIOU is defined in Eq. (14):
GIOU = IOU − |C \ (A ∪ B)| / |C| (14)
where A and B are two boxes and C is the smallest enclosing box that contains both. C \ (A ∪ B) denotes the area included in C but excluded from A ∪ B. The main idea is to consider the influence of both the joint and disjoint areas. GIOU is scale-invariant and GIOU ∈ [−1, 1]. In particular, if two boxes do not overlap at all, the value of GIOU approaches −1 instead of 0, which avoids the vanishing gradient problem to some extent. Common regression loss functions such as MSE and smooth L1 never take the influence of the IOU into consideration; hence, we use GIOU to estimate the overlap between two boxes and take it as the bounding box regression loss. The pseudocode of the loss calculation is shown in Algorithm 1.
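Eq. (14) maps directly to a few lines of code. The following plain-Python sketch (the function name and the (x1, y1, x2, y2) corner convention are ours) shows the two properties discussed above: identical boxes score 1, and disjoint boxes score negative instead of a flat 0.

```python
def giou(box_a, box_b):
    """Generalized IoU between two axis-aligned (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C of both inputs
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    area_c = (cx2 - cx1) * (cy2 - cy1)
    # Eq. (14): subtract the fraction of C not covered by the union
    return iou - (area_c - union) / area_c
```

For two disjoint boxes the IoU term is 0 but the enclosing-box penalty still grows with the gap between them, so the gradient signal points the proposal toward the ground truth even from far away.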

2) SOFT-NMS [36]
Although TSER combined with TSAM improves the performance to some extent, we still found the recall of the model unsatisfactory. By checking the failure cases, we observed that errors often occur when multilingual characters are adhesive. For instance, in Figure 9, the network should output two boxes for two different characters, but normal NMS may remove the yellow box because the IOU surpasses a certain threshold. Thus, we replace normal NMS with Soft-NMS. The main process of Soft-NMS is defined as follows:
s_i = s_i · e^{−IOU(M, b_i)² / σ} (15)
where s_i denotes the score of bounding box b_i, i.e. the probability that the box contains a character, and M is the highest-scoring box in the current iteration. In Eq. (15), Soft-NMS applies a penalty to the original scores: the penalty is zero if there is no overlap between the two boxes and grows with higher overlap. In this way, the situation in which neighboring boxes are mistakenly suppressed can be avoided. The pseudocode of Soft-NMS is shown in Algorithm 2.
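The Gaussian decay of Eq. (15) can be sketched as follows (a simplified stand-in for Algorithm 2; the helper names and default σ are illustrative). Where hard NMS would delete a heavily overlapping neighbour outright, Soft-NMS only decays its score, so adjacent adhesive characters can both survive.

```python
import math

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of deleting boxes."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        i = max(range(len(scores)), key=scores.__getitem__)  # current best box M
        best, s = boxes.pop(i), scores.pop(i)
        keep.append((best, s))
        # Eq. (15): penalty grows smoothly with overlap; zero overlap = no change
        scores = [sj * math.exp(-iou(best, bj) ** 2 / sigma)
                  for bj, sj in zip(boxes, scores)]
        boxes = [b for b, sj in zip(boxes, scores) if sj >= score_thresh]
        scores = [sj for sj in scores if sj >= score_thresh]
    return keep
```

With two boxes at IoU ≈ 0.82, hard NMS at a 0.5 threshold would drop the second box entirely; here it is kept with a reduced score, which is exactly the behaviour needed for adhesive characters.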

IV. MODEL TRAINING
Since TSER is a two-stage segmentation network, the loss of the whole network consists of the GA-RPN loss and the character segmentation loss. Therefore, we use a joint training strategy, defining the objective function for these two parts separately and training them respectively.
The loss of GA-RPN can be treated as four independent terms: the classification loss L_cls−GARPN, the regression loss L_reg−GARPN, the anchor location loss L_LOC, and the anchor shape loss L_Shape:
L_GA-RPN = L_cls−GARPN + L_reg−GARPN + λ_1 L_LOC + λ_2 L_Shape (16)
We then map the ground truth box (x_gt, y_gt, w_gt, h_gt) to the feature map at each scale as (x′_gt, y′_gt, w′_gt, h′_gt) and sample points from different regions as labels for location prediction. We denote by R(x, y, w, h) a rectangular region with center point (x, y), width w, and height h. Each box can be separated into a center region, an ignore region, and an outside region, defined as follows:
CR = R(x_gt, y_gt, σ_1 w_gt, σ_1 h_gt), 0 < σ_1 < 1 (17)
IR = R(x_gt, y_gt, σ_2 w_gt, σ_2 h_gt) − CR, σ_1 < σ_2 < 1 (18)
where CR denotes the center region of an anchor and IR denotes the ignore region, which is excluded from the training process. The outside region OR is the remaining region excluding IR and CR. We randomly pick points from the center region as positive samples and points from the outside region as negative samples to train GA-RPN. Since the ratio between positive and negative samples is kept at 2:1, we apply a simple cross-entropy loss to train this location branch:
L(y, y′) = −[y′ log y + (1 − y′) log(1 − y)] (19)
L_LOC = (1/k) Σ_{i=1}^{k} L(y_i, y′_i) (20)
where y and y′ respectively denote the predicted center point score and the corresponding ground truth, and k is the number of predicted location points. After the anchor location has been determined, the anchor shape is predicted as the pair of width and height that best matches the corresponding ground truth box. Additionally, we restrict the number of generated anchors to 9 and choose the one with the largest GIOU. The formal expression of L_Shape is:
L_Shape = (1/m) Σ_{j=1}^{m} [1 − GIOU((w_j, h_j), (w′_j, h′_j))] (21)
where (w, h) and (w′, h′) represent the predicted anchor shape and the corresponding ground truth, and m denotes the number of predicted shape boxes. The classification module in GA-RPN judges whether an anchor is a foreground box or not. L_cls−GARPN is defined as in Eq. (19), the same as L_LOC. L_reg−GARPN is the GIOU loss depicted in Algorithm 1.
L_reg(v, u) = 1 − GIOU(v, u) (24)
where v is the predicted proposal and u is the corresponding ground truth box. After GA-RPN outputs the predicted proposals, we randomly sample part of them as input to the training process. In the character segmentation network (described in Figure 2), the loss can be separated into L_reg, L_cls, and L_mask. Since the classification here only distinguishes character from background, a simple binary cross-entropy, as in Eq. (19), is enough to evaluate its loss. Similarly, the bounding box regression in this stage is the same as the regression in GA-RPN, so the definition of L_reg is shown in Eq. (24). Semantic segmentation is employed to obtain the contours of every character. During the experiments, we observe that within each bounding box detected by the network, the numbers of positive and negative pixels are unbalanced, so directly using cross-entropy as the loss function leads to poor segmentation performance. We therefore apply the focal loss proposed by Lin et al. [26] to tackle this problem:
L_mask = −(1/n) Σ_{i=1}^{n} α (1 − p_t,i)^γ log(p_t,i), with p_t,i = u_i if u′_i = 1 and 1 − u_i otherwise (25)
where u represents the pixel-wise classification result and u′ denotes the corresponding ground truth; both take values of either 0 or 1. n is the number of predicted masks, and each instance corresponds to one mask. According to the above loss functions, the character segmentation network is jointly optimized by the following function:
L = L_cls + L_reg + L_mask (26)
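The effect of the focal loss on the unbalanced mask pixels can be seen in a per-pixel sketch (the function name and the standalone-pixel formulation are ours; the real loss averages over all pixels in a mask): well-classified pixels are down-weighted by the (1 − p_t)^γ factor, so the abundant easy background pixels no longer dominate the gradient.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single pixel.

    p: predicted foreground probability in (0, 1); y: ground-truth label in {0, 1}.
    alpha balances the two classes; gamma down-weights easy examples.
    """
    pt = p if y == 1 else 1.0 - p          # probability assigned to the true class
    a = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    return -a * (1.0 - pt) ** gamma * math.log(pt)
```

A confidently correct pixel (p = 0.9, y = 1) contributes roughly three orders of magnitude less loss than a badly misclassified one (p = 0.1, y = 1), which is what lets the sparse character-boundary pixels drive the mask training.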

V. EXPERIMENTS
A. EXPERIMENT SETTINGS
To prove the performance and generalization capacity of TSER, we choose two different datasets, Arti-Text and CASIA-HWDB, to evaluate performance on optical character segmentation and handwritten character segmentation.

1) ARTI-TEXT
To verify TSER's performance on mixed characters, and because such datasets are particularly rare, we create a text character segmentation dataset with mixed characters and complicated cases. This dataset is generated by the OpenCV toolkit in conjunction with a character generation algorithm. The number of characters in each image ranges from 5 to 15. All characters are randomly drawn from 6583 characters (including Chinese, English, punctuation, numbers, etc.) and placed into the image with unequal spacing and random deflection. The font in each image is randomly selected from seven font types. Moreover, every image is accompanied by a random degree of Gaussian blur and a different background color, purposely added as interference to make correct segmentation more difficult. Some sample text images and annotation images are shown in Figure 11. We set the ratio of training set, validation set, and test set to 4:1:2, as shown in Figure 10. For convenience, we name this dataset Arti-Text; it is used for the optical character segmentation task.
2) CASIA-HWDB
CASIA-HWDB [39] is an offline Chinese handwriting database built by the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA). The original data are not handwritten text images but handwritten character images; therefore, we concatenate these characters randomly to form complete text images. Some examples of CASIA-HWDB are shown in Figure 12.

3) IMPLEMENTATION DETAILS
In this work, we use ResNet101-TSAM-FPN pre-trained on ImageNet as the backbone network and FCN as the basic semantic segmentation network. TSAM is flexible enough to be embedded in other advanced models; therefore, we pay more attention to the architecture of TSAM (Figure 3) than to the performance improvement brought by different backbone networks. To make convergence faster, we pre-train the backbone network with TSAM on Arti-Text and CASIA-HWDB for a character recognition task. The input image is resized to 1024 × 800 without any other preprocessing. We set σ_1 = 0.2, σ_2 = 0.5 in GA-RPN and λ_1 = 1, λ_2 = 0.1 in the loss function. For the focal loss parameters, we set α = 0.25, γ = 2.
We train TSER with the Adam optimizer on a single GPU with 2 images per batch, setting β1 = 0.9, β2 = 0.999 and ε = 10−8 in Adam. The model is trained for 30 epochs of 2000 iterations each, with an initial learning rate of 0.0001 and a decay rate of 0.004, on a GTX2080Ti.
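The Adam update with these hyperparameters can be sketched for a single scalar parameter (a standard textbook formulation, not the actual training loop):

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-4,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter theta.

    m and v are the running first/second moment estimates and t is the
    1-based step count; returns the updated (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# With lr = 1e-4, the very first step moves theta by about lr
# in the direction opposite the gradient.
theta, m, v = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
```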

B. COMPARATIVE RESULTS
In this section, we present quantitative results. First, we compare TSER with traditional methods and popular ODNs and ISNs on Arti-Text, evaluating performance by precision (P) and recall (R). Moreover, we define Mask R-CNN as the baseline model on which the architecture of TSER is built. In Table 1, we observe that P50 (R50) is much larger than P75 (R75): as the IoU threshold increases, correctly and completely locating every character becomes trickier. It is therefore inappropriate to choose the P or R at a single IoU threshold as the key metric. In this paper, we select P and R averaged over IoU thresholds from 0.5 to 0.75 as the decisive metrics for evaluating all the methods.
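As a concrete illustration of these metrics, the following single-image sketch computes P and R at one IoU threshold with greedy one-to-one matching (illustrative only; not the paper's evaluation code):

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds, gts, thr):
    """Greedily match predictions to unmatched ground truths at IoU >= thr."""
    matched, tp = set(), 0
    for p in preds:
        for i, g in enumerate(gts):
            if i not in matched and iou(p, g) >= thr:
                matched.add(i)
                tp += 1
                break
    return tp / len(preds), tp / len(gts)

gts = [(0, 0, 10, 10), (20, 0, 30, 10)]
preds = [(1, 0, 10, 10), (40, 0, 50, 10)]  # one good hit, one false alarm
p50, r50 = precision_recall(preds, gts, 0.5)
```

Averaging P and R over thresholds from 0.5 to 0.75 then simply repeats this computation at each threshold and takes the mean.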
As shown in Table 1, TSER outperforms the other deep learning models and rule-based methods by a clear margin. In particular, compared with Mask R-CNN, TSER improves P by 3.8% and R by 0.7%. In addition, we replace the original backbone network with ResNeXt101-TSAM-FPN and use OHEM to obtain a stronger model than the base TSER; from Table 1, this improved TSER outperforms the base TSER by about 1.1% in P and 0.3% in R. Figure 13 intuitively shows the performance comparison between TSER and the other approaches: except for Mask R-CNN, TSER far exceeds the other methods. To further explore the difference between TSER and Mask R-CNN, we present their loss curves in Figure 14. The loss of TSER descends faster and more markedly than that of Mask R-CNN, indicating that TSER converges more readily to the optimum.
Second, to evaluate model performance on the handwritten character segmentation task, we compare TSER with other common methods on CASIA-HWDB. As shown in Table 2, TSER achieves about 1.5 points higher P than Liu et al. Although its R is 5% lower than that of Xu et al., TSER still performs effectively on the handwriting database because its P and R are balanced. This experimental result demonstrates that TSER performs well in handwritten character segmentation, which can also be observed intuitively in Figure 15.
Additionally, we test the performance of TSER on different cases of Arti-Text. As shown in Table 3, TSER performs well when the data have only slight deflection and adhesion. Nevertheless, its performance drops noticeably when characters are seriously adhesive or have a certain deflection angle, and mixed character types further increase the difficulty of segmentation. TSER achieves a P of 81.5% and an R of 83.9% in the best condition (pure Chinese characters without any interference) but a P of 58.8% and an R of 66.2% in the worst condition (mixed characters with interference). This apparent disparity shows that mixed characters with interference constitute the trickiest case in character segmentation.

C. ABLATION STUDY
In this section, we remove different components of TSER to investigate the influence of each, including TSAM, GA-RPN and the edge refinement network. From Table 4, it is apparent that TSAM and GA-RPN contribute greatly to the P and R of TSER, leading to an improvement of 2% to 6% over the baseline model. The impact of the edge refinement network appears smaller than that of the other modules. The reason is that it is designed for cases where characters are so adhesive that their bounding boxes overlap, and such cases are not common in the datasets. We show such a case in Figure 16; by obtaining the refined contours, the characters can be better segmented.
In TSER, TSAM is introduced to improve the feature extraction ability and enhance the semantic boundary. From Table 4, our attention mechanism has a great influence on both character location and character segmentation. We also consider whether the operations pointed to by the two red arrows in Figure 3 can be replaced by other operations; as shown in Table 5, we try four possible combinations and find the last one to be the most appropriate. To understand intuitively how the attention in TSAM works, we visualize the intermediate result of the final layer in Figure 17. TSAM guides the attention to concentrate on the boundary, shape and location of characters, thus enhancing the feature expression ability of the model and helping TSER locate characters accurately.

With the introduction of GA-RPN, it is important to decide the number of generated anchors, because it directly affects the amount of computation and the efficiency of the whole network. We use a location threshold to control this: locations whose prediction scores fall below the threshold are discarded as negative samples. As shown in Table 6, as the threshold increases, the number of generated anchors drops rapidly; recall suffers a slight decrease, but the computational cost also declines sharply. After balancing performance against efficiency, we consider it worthwhile to promote overall efficiency at the cost of some performance, and therefore set the threshold to 0.01. In Figure 18, we compare GA-RPN in TSER with other cases under 200, 500 and 1000 proposals. The model with GA-RPN behaves better than the one with the traditional RPN, which proves the efficiency of GA-RPN. Moreover, more proposals bring a better recall, but we do not need too many, because GA-RPN in TSER already performs well enough and the extra computational cost of excessive proposals is a heavy burden.
Therefore, in this experiment, we set the number of generated proposals to 500.
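The score-based anchor filtering described above amounts to the following (a minimal sketch with illustrative names; the real GA-RPN operates on dense per-location score maps):

```python
def filter_anchor_locations(scores, threshold=0.01):
    """Keep only locations whose predicted objectness score clears the
    threshold; everything below it is discarded as a negative sample.
    Returns the indices of the surviving locations."""
    return [i for i, s in enumerate(scores) if s >= threshold]

scores = [0.002, 0.05, 0.4, 0.009, 0.8]
kept = filter_anchor_locations(scores)
```

Raising the threshold shrinks `kept` rapidly, which is exactly the recall-versus-computation trade-off reported in Table 6.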
The cascaded edge refinement network is a semantic segmentation network that generates contours fitting the corresponding characters. We observe that cascading the unit improves the quality of the generated character contours. In this module, we use δ to denote the number of times the unit shown in Figure 6 is cascaded. According to the results in Figure 19, segmentation performance improves as δ increases; however, we set δ to 5, because the extra computational expense of further increasing δ is not worthwhile. To prove that the remedies applied in TSER are meaningful, we test the performance of TSER under different combinations: NMS-IOU, NMS-GIOU, Soft-NMS-IOU and Soft-NMS-GIOU. From Table 7, it is apparent that Soft-NMS-GIOU does help to increase P and R.

D. TIME COST
1) TRAINING
TSER does not require too much training data, and the training process is fast. Pre-training ResNet101-FPN-TSAM on Arti-Text and CASIA-HWDB takes approximately 32 hours. Training TSER takes 27 hours on Arti-Text (about 1.5∼2 hours per epoch).

2) INFERENCE
We use the ResNet101-FPN-TSAM model to extract features, apply GA-RPN to generate proposals, and finally adopt the character segmentation network to segment the contours of characters. The model runs at 3.3 s per 1024 × 800 image on a single GTX2080Ti.

E. QUALITATIVE RESULTS
In this section, we show some qualitative results on Arti-Text and CASIA-HWDB. As shown in Figure 20, each case has three rows: the first row is the original image, the second is the result of the baseline model, and the third is the result of TSER. In the second row, many characters are missed and the obtained contours do not fit the corresponding characters well; in the third row, almost all characters are found and the contours fit much better. We also show results on the handwritten dataset in Figure 21. Handwritten characters can also be well segmented by TSER, although under-segmentation still occurs in some cases. We can thus conclude that TSER is an effective text line segmentation method for both optical character segmentation and handwritten character segmentation.

VI. CONCLUSION
In this paper, we proposed a novel character segmentation network, TSER, which segments characters from text lines in images. TSER makes four major contributions: 1) a novel two-stage segmentation network focused on the character segmentation task; 2) a two-stream attention mechanism that improves feature expression ability; 3) compared with other ISNs, the use of GA-RPN to generate fewer but better anchors; 4) a cascaded edge refinement network for more accurate pixel-wise character segmentation. On our benchmark dataset (Arti-Text), TSER achieved about 3.8% higher precision and 1.0% higher recall with 90% fewer anchors than the baseline model. On CASIA-HWDB, TSER also outperformed traditional and other deep-learning-based methods. The experimental results prove that TSER is an effective method with strong generalization for the character segmentation task. Possible future improvements include reducing the number of parameters and simplifying the whole architecture to make it lighter.