A Novel Semantic Segmentation Model for Chinese Characters

Character segmentation plays an important role in optical character recognition (OCR). Because of their limited feature representation, traditional methods based on image analysis cannot reliably segment characters with connected or broken strokes, especially Chinese characters, which usually have complex structures. To address this issue, this paper proposes a novel segmentation model based on fully convolutional neural networks (FCN). The model first uses convolutional neural networks to extract spatial features, which are then shared throughout the whole model. Two FCNs extract character information to form a score map. Finally, the character features are reused to refine the segmentation points in the score map. In addition, to strengthen feature representation, a novel compound character feature that describes the outline of characters is proposed. The proposed method is validated on two datasets, GBSD and CASIA-HWDB-MT, against methods proposed in the literature. Experimental results show that the proposed model outperforms state-of-the-art methods.


I. INTRODUCTION
Character segmentation is an important task in the optical character recognition (OCR) process. The quality of the segmentation not only affects OCR performance, but also plays a vital role in a variety of related applications. For example, reading a CAPTCHA by machine (a CAPTCHA is a challenge-response test that determines whether or not the user is human) requires a sophisticated character segmentation algorithm to correctly isolate each character from a row of characters with sticky strokes. Character segmentation is therefore a complicated task, especially for pictographic characters such as Chinese. Compared with alphabetic characters and digits, Chinese characters have more complex structures and are easily segmented into wrong blocks by computers. Several OCR techniques have been developed during the past decade. However, these techniques focus on improving recognition rather than the performance of character segmentation. Therefore, it is of great significance to study how to make character segmentation more accurate. In this paper, we treat Chinese character segmentation as an independent task.
The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.
Contemporary character segmentation methods can be divided into two categories: methods based on image analysis and methods based on deep learning. The former extract structural and statistical features, such as the height and width of the characters, to segment character images into character blocks. However, the features extracted by these methods are too shallow. To extract high-level features, deep learning based techniques have been adopted in recent years. Deep learning has a variety of applications, such as ontology [1], object detection [28] and image segmentation [20]. Compared with segmentation methods based on image analysis, these methods have a strong feature learning ability. However, they treat segmentation as a detection task and are usually used for text line segmentation [14]; therefore, they cannot segment Chinese characters well. In 2015, Long et al. [20] first regarded the text detection problem as a semantic segmentation problem and proposed fully convolutional neural networks (FCN), which classify each pixel in the image and thus take the details of characters into account. But this method still fails to obtain fine segmentation results, because it lacks spatial relations between pixels and produces sparse features after up-sampling.
In this article, we propose a Chinese character segmentation method (CCSeg). CCSeg is based on fully convolutional neural networks and consists of three parts. First, it uses a convolutional neural network to extract image-level features, which are later shared throughout the network. Then, two sub-networks extract multiple character features. Finally, the character features are reused to refine the segmentation points in the score map. Moreover, CCSeg has a novel multi-object feature extraction mechanism that exploits regional multiple character features. Traditionally, extracted character features describe only rough character morphology, whereas the regional multiple character features in this paper describe the outline of characters well. Experimental results show that the proposed method outperforms traditional methods.
This paper is organized as follows. Section II introduces the related work. Section III describes the problem. Section IV illustrates our method in detail. Section V evaluates the experimental results and compares them with several state-of-the-art methods. Finally, the work is concluded in Section VI.

II. RELATED WORK

A. SEGMENTATION METHOD BASED ON IMAGE ANALYSIS
Segmentation methods based on image analysis can be divided into four categories: the water droplet segmentation method [2], the projection method [21], the connected domain analysis method [22], and the clustering method [36]. These methods underwent long-term development from the 1990s to the 2000s. In 1995, the drip-water segmentation algorithm was first proposed by Congedo G et al. This algorithm mimics the process of water dripping from a high point to a low point and mainly aims to segment characters with adhesions. However, for tasks such as segmenting the characters of a CAPTCHA, it is impossible to determine an accurate drip path because of the twisted strokes and concave parts. In 2003, Pal and Datta [3] presented a projection analysis-based algorithm. To deal with the variability of writing styles between individuals, the whole text is divided into vertical stripes. The horizontal histograms of these stripes are then used to segment text lines. Finally, these lines are segmented into words based on the vertical projection profile. This method is computationally simple and is widely used in various systems. Furthermore, to improve the accuracy of segmentation, morphological processing [6] and multi-level projection [8] have been applied to segmentation tasks. Notably, instead of multi-level projection, Manmatha and Rothfeder (2005) [9] performed projection only in the horizontal direction and used the contour of the resulting profile to locate text lines; the results need to be corrected during the process. In addition, several other projection segmentation methods based on image histograms were proposed (U.V. Marti 2002 [10], B. Gatos 2007 [11], and Rodolfo P 2009 [14]). Besides the projection method, Shi Z et al. (2009) [5] also used filters with various directions to construct a generalized adaptive local connected graph, mainly used for text lines with different directions. Unlike the above methods, the clustering method uses basic components in text images, such as pixels, connected blocks and stroke segments, to make up characters.
[34] extended this structure to a three-dimensional scale, which can be applied to brain tumor segmentation. To extract 'thicker' features, U-Net combines feature maps in the channel dimension. In addition to semantic segmentation, Mask R-CNN also performs instance segmentation to obtain more accurate classes of the objects.

III. PROBLEM DESCRIPTION
Traditional character segmentation methods cannot segment characters with connected or broken strokes because of their limited feature representation. In recent years, deep learning methods have begun to be used in OCR tasks and have achieved better results than traditional methods. However, such work pays more attention to character recognition and largely neglects the importance of segmentation. In this work, we regard character segmentation as an independent semantic segmentation task on images, and propose an improved fully convolutional neural network model.
Our solution for character segmentation in text images is formalized as follows. An input image is defined as
$$G = (g_{11}, g_{12}, \ldots, g_{1j}, \ldots, g_{i1}, g_{i2}, \ldots, g_{ij}),$$
where $g_{ij}$ is the value of the pixel at location $(i, j)$ of the input image. After processing by the convolutional networks, we obtain two feature tensors, CMF and CNF, and a semantic segmentation map $F$:
$$F = (g_{11}, g_{12}, \ldots, g_{1j}, \ldots, g_{i1}, g_{i2}, \ldots, g_{ij}), \tag{4}$$
where CMF denotes the character morphology feature, represented by the width of each character, and CNF denotes the character number feature, represented by the number of characters in each text line. $w_i$ and $n_i$ represent the CMF and CNF of the $i$th image, respectively. In $F$, $g_{ij}$ is the value of the pixel at location $(i, j)$ of the output image. The output $F$ has the same size as the input $G$. Finally, accurate segmentation points are marked in $F$ with CMF and CNF.

A. COMPOUND CHARACTER FEATURE EXTRACTION
Character feature extraction in this paper is based on convolutional neural networks (CNN). Unlike traditional methods based on image analysis, CNNs can extract advanced features. The basic components of a CNN are convolution, pooling and activation functions. At each location $(i, j)$, we use $x_{ij}$ as the input data vector in one particular layer and $y_{ij}$ as the output:
$$y_{ij} = f_{ks}\left(\{x_{si+\delta_i,\, sj+\delta_j}\}_{0 \le \delta_i, \delta_j \le k}\right),$$
where $k$ is the kernel size, $s$ is the stride factor, and $f_{ks}$ determines the layer type. Traditional character segmentation methods often use shallow information, so they cannot find precise segmentation points, which greatly reduces recognition accuracy. In addition, a traditional FCN is not sensitive to image details and neglects the relationship between pixels. To obtain precise results in complex cases such as adhesion and missing strokes, this paper proposes compound character feature extraction. The proposed extraction method extracts features from both character morphology and character number for the final semantic segmentation in the deconvolution process. In addition, as shown by the yellow arrow in Figure 1, since CMF affects the range of the character region, CMF is merged in when extracting CNF. CMF and CNF are introduced in Section III. We use the concatenate operation to merge features:
$$Z = \sum_{i=1}^{c} X_i * K_i + \sum_{i=1}^{c} Y_i * K_{i+c},$$
where $X_{1 \ldots c}$ and $Y_{1 \ldots c}$ are the two feature channels that need to be fused, $K$ represents the convolution kernel, $*$ represents the convolution operation, and $Z$ represents the output.
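The concatenate operation described above can be checked numerically. The following is a minimal sketch, assuming NumPy and toy random feature maps: it illustrates that concatenating two c-channel feature maps and convolving once gives the same result as convolving each map with its own half of the kernels and adding. The helper names, array sizes and channel count are hypothetical and are not taken from the paper's implementation.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation, the operation CNN layers call convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def conv_over_channels(features, kernels):
    """Sum of per-channel correlations: sum_i features[i] * kernels[i]."""
    return sum(correlate2d_valid(f, k) for f, k in zip(features, kernels))

rng = np.random.default_rng(0)
c = 3                                  # channels per feature map (hypothetical)
X = rng.normal(size=(c, 8, 8))         # stand-in for one feature map
Y = rng.normal(size=(c, 8, 8))         # stand-in for the other feature map
K = rng.normal(size=(2 * c, 3, 3))     # one 3x3 kernel per concatenated channel

# Route 1: concatenate along the channel axis, then convolve once.
Z_concat = conv_over_channels(np.concatenate([X, Y], axis=0), K)

# Route 2: convolve X with K_1..K_c and Y with K_{c+1}..K_{2c}, then add.
Z_split = conv_over_channels(X, K[:c]) + conv_over_channels(Y, K[c:])

print(np.allclose(Z_concat, Z_split))
```

Because each input channel keeps its own kernel, concatenation lets the network weight the two feature sources independently, which element-wise addition of the feature maps would not allow.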
To extract these two different features, we construct two sub-networks, named CMFEN and CNFEN. CMFEN is constructed to reduce the interference of irrelevant pixel information in the semantic segmentation task. CNFEN is constructed to minimize cases such as over-segmentation caused by missing strokes and under-segmentation caused by sticky strokes.
Inspired by the structure of Faster R-CNN [30], we add a feature sharing mechanism to the proposed network. We send the final feature map from the backbone network to the above two sub-networks, as shown in Figure 1. In addition, CMFs are shared between the sub-networks. Candidate character regions in an image differ from each other because they contain different character shapes, so the morphology of the characters indirectly influences their number. Conversely, for a given character area, a difference in the number of characters also affects the morphology of each character.

B. CCSeg BASED ON COMPOUND CHARACTER FEATURE
This paper treats character segmentation as a semantic segmentation task. Inspired by FCN, the proposed architecture uses deconvolution layers to upsample the feature map of the last convolutional layer until it has the same size as the input image. Each pixel can therefore be classified, which describes the outline of characters better than proposal-based methods. The main operation of the proposed network can be described as:
$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\, ss'},$$
where $k$ and $k'$ are kernel sizes, $s$ and $s'$ are strides or subsampling factors, and $f_{ks}$ and $g_{k's'}$ determine the layer type. Compared with traditional convolutional neural networks, the input size of FCN is unlimited. Because of the pixel-level operation, FCN efficiently preserves the spatial information of the original input image. However, since the upsampling process uses simple padding, the results obtained by FCN are still coarse. To produce accurate and detailed segmentation, this paper proposes an improved FCN model based on the character information extraction method. The improved FCN uses multiple features as input. After processing by the deconvolution layers, the extracted character information is reused for the final semantic segmentation task. Compared with the basic FCN method, the extracted character features provide more information for the deconvolution process.
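The way kernel size and stride combine when layers are stacked can be made concrete with a short calculation. The sketch below assumes the standard FCN composition rule, in which applying a layer with kernel k and stride s after a layer with kernel k' and stride s' behaves like a single layer with kernel k' + (k - 1)s' and stride ss'. The 3x3 convolutions and 2x2 down-sampling are hypothetical values chosen for illustration; the paper's actual settings are those of its Table 3.

```python
def compose(inner, outer):
    """Effective (kernel, stride) of `outer` applied after `inner`.

    Each layer is a (kernel_size, stride) pair; the result follows
    kernel = k' + (k - 1) * s' and stride = s * s'.
    """
    k_in, s_in = inner
    k_out, s_out = outer
    return (k_in + (k_out - 1) * s_in, s_in * s_out)

def stack(layers):
    """Fold a list of layers, first-applied layer first."""
    total = (1, 1)  # identity layer: 1x1 kernel, stride 1
    for layer in layers:
        total = compose(total, layer)
    return total

# Hypothetical block: three 3x3 stride-1 convolutions, then 2x2 stride-2 down-sampling.
block = [(3, 1), (3, 1), (3, 1), (2, 2)]

one_block = stack(block)        # effective kernel and stride of one block
four_blocks = stack(block * 4)  # four such blocks in sequence

print(one_block, four_blocks)
```

Under these assumed layer sizes, the overall stride of four blocks is the factor by which the deconvolution layers must upsample to restore the input resolution.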
As shown in Figure 1, the improved FCN has three outputs: CMF, CNF and the score map F. In the score map F, the value of each pixel represents the probability that the pixel belongs to the character class. In this paper, the pixels of the whole image are divided into two categories: the character class and the non-character class.
Semantic segmentation research normally improves the architecture of neural network models so that they perform well in complex tasks. However, it is difficult for a semantic segmentation model to deal with image details. To solve this problem, CRFs [35] have been broadly used in semantic segmentation to combine class scores with the extracted low-level information [37]. This paper proposes a new edge-optimizing algorithm based on CRFs. Typically, CRFs are used to smooth noisy segmentation maps. But CCSeg already produces quite smooth score maps and homogeneous classification results, so traditional short-range CRFs are not suitable. The edge-optimizing algorithm proposed in this paper does not further smooth the results; instead, it models the relationship between all the pixels in the graph to recover detailed local structure. Additionally, CNF and CMF are reused in this part. The proposed model employs the following energy function:
$$E(x) = \sum_i \theta_i(x_i) + \sum_{ij} \theta_{ij}(x_i, x_j),$$
$$\theta_i(x_i) = -\log P(x_i),$$
$$\theta_{ij}(x_i, x_j) = \mu(x_i, x_j)\left[ w_1 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\alpha^2} - \frac{\|I_i - I_j\|^2}{2\sigma_\beta^2}\right) + w_2 \exp\left(-\frac{\|p_i - p_j\|^2}{2\sigma_\gamma^2}\right) \right], \tag{10}$$
where $P(x_i)$ is the class score of pixel $i$ computed by the improved FCN, and $\theta_{ij}(x_i, x_j)$ is defined over all pairs of image pixels. In Eq. (10), $\mu(\cdot)$ is a penalty function: it is set to 1 when $x_i \neq x_j$ and to 0 otherwise. $p_i$ and $I_i$ denote the position and the RGB color of the $i$th pixel, $w_1$ and $w_2$ weight the two Gaussian kernels, and $\sigma_\alpha$, $\sigma_\beta$, $\sigma_\gamma$ are hyperparameters. Experiments show that this algorithm can significantly improve the accuracy of character segmentation.
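To illustrate how such a pairwise term behaves, the following sketch evaluates a unary and a pairwise potential for a single pair of pixels. It assumes the standard fully connected CRF form (a Potts-style penalty that applies only when two labels differ, an appearance kernel over position and color, and a smoothness kernel over position). The kernel weights `w1` and `w2` and all sigma values are hypothetical, and the sketch does not include the reuse of CMF and CNF described in the text.

```python
import math

def unary_potential(class_score):
    """Unary term: negative log of the class score P(x_i) from the network."""
    return -math.log(class_score)

def pairwise_potential(label_i, label_j, pos_i, pos_j, rgb_i, rgb_j,
                       w1=1.0, w2=1.0,
                       sigma_alpha=10.0, sigma_beta=10.0, sigma_gamma=3.0):
    """Pairwise term for one pixel pair (all default parameter values are assumptions)."""
    if label_i == label_j:
        return 0.0  # the penalty only applies to pixel pairs with different labels
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_rgb = sum((a - b) ** 2 for a, b in zip(rgb_i, rgb_j))
    appearance = w1 * math.exp(-d_pos / (2 * sigma_alpha ** 2)
                               - d_rgb / (2 * sigma_beta ** 2))
    smoothness = w2 * math.exp(-d_pos / (2 * sigma_gamma ** 2))
    return appearance + smoothness

# Two neighbouring pixels with the same color but different labels are penalised heavily.
near_same_color = pairwise_potential(0, 1, (10, 10), (10, 11), (0, 0, 0), (0, 0, 0))

# Two distant pixels with very different colors and different labels are barely penalised.
far_diff_color = pairwise_potential(0, 1, (10, 10), (90, 90), (0, 0, 0), (255, 255, 255))

print(near_same_color, far_diff_color)
```

Minimising an energy built from such terms pushes label boundaries toward places where nearby pixels also differ in color, which is the kind of edge behaviour the text describes.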

C. MULTI-OBJECT DROPOUT MECHANISM
Because the self-built dataset GBSD is of limited size, training with the traditional dropout mechanism easily leads to overfitting. We therefore introduce a new multi-object dropout mechanism. It allows each neuron to be set to zero with a certain probability during forward propagation. The dropped neurons neither process the data passed from the previous layer nor pass data to the next layer. In this way, the amount of forward-propagated data and the computation of some neurons are reduced. In the next forward propagation, the neurons of all hidden layers are sampled again. During back propagation, stochastic gradient descent is used to optimize the neurons that were not dropped.
The traditional dropout mechanism [45] trains only a single data feature. For multiple character feature extraction, we use a multi-object dropout mechanism. This new mechanism can process both CMFEN and CNFEN, which are introduced in Section IV-A.
Take the $l$th layer as an example. For simplicity, CMF is denoted by $M$ and CNF by $N$. The output of the neurons in the $l$th layer is denoted $a^{l}_{\{M,N\}}$. $w^{l+1}_i$ and $b^{l+1}_i$ are the weight and bias of the $i$th neuron in the $(l+1)$th layer, respectively.
I. Without the multi-object dropout mechanism, the network computes
$$z^{l+1}_i = w^{l+1}_i a^{l}_{\{M,N\}} + b^{l+1}_i, \qquad y^{l+1}_i = f\left(z^{l+1}_i\right),$$
where $z^{l+1}_i$ is the calculation result of the $i$th neuron in the $(l+1)$th layer and $y^{l+1}$ is the output after the activation $f$.
II. With the dropout mechanism, the network computes
$$r^{l}_i \sim \mathrm{Bernoulli}(p), \qquad \tilde{a}^{l}_{\{M,N\}} = r^{l} \odot a^{l}_{\{M,N\}}, \qquad z^{l+1}_i = w^{l+1}_i \tilde{a}^{l}_{\{M,N\}} + b^{l+1}_i, \qquad y^{l+1}_i = f\left(z^{l+1}_i\right),$$
where $\mathrm{Bernoulli}(p)$ represents the following process: a 0-1 vector $r^{l}$ is randomly generated with probability $p$, and this vector is then applied to the output $a^{l}_{\{M,N\}}$. $\tilde{a}^{l}_{\{M,N\}}$ is the processing result.
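The two cases above can be sketched in a few lines. This is a minimal illustration, assuming NumPy, and it rests on three assumptions that the text does not confirm: that `keep_prob` plays the role of the Bernoulli probability p, that the same 0-1 mask is applied to both the M (CMF) and N (CNF) outputs, and that the activation is a ReLU. All function and variable names are hypothetical.

```python
import numpy as np

def multi_object_dropout(m_out, n_out, keep_prob, rng):
    """Apply one shared Bernoulli mask to the M and N outputs of a layer."""
    mask = rng.binomial(1, keep_prob, size=m_out.shape)  # r ~ Bernoulli(p)
    return m_out * mask, n_out * mask, mask

def next_layer(activations, weights, bias):
    """z = w . a + b, followed by a ReLU activation (the activation is an assumption)."""
    z = weights @ activations + bias
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
m_out = rng.normal(size=8)          # stand-in for the layer-l output on the CMF branch
n_out = rng.normal(size=8)          # stand-in for the layer-l output on the CNF branch
weights = rng.normal(size=(4, 8))   # weights of layer l+1
bias = np.zeros(4)

# Case I: no dropout.
y_plain = next_layer(m_out, weights, bias)

# Case II: mask the layer-l outputs first, then apply layer l+1.
m_drop, n_drop, mask = multi_object_dropout(m_out, n_out, keep_prob=0.5, rng=rng)
y_m = next_layer(m_drop, weights, bias)
y_n = next_layer(n_drop, weights, bias)

print(mask)
```

Calling `multi_object_dropout` again draws a fresh mask, which corresponds to re-sampling the neurons in every forward pass.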
The trained network is then used for prediction.
GBSD: CASIA-HWDB-MT was created primarily to represent sticky strokes. To verify the effectiveness of the proposed method when strokes are missing or broken, we also create a dataset, GBSD, with an increased proportion of missing or broken strokes. GBSD includes 30,000 pictures in total. They are generated from the 3,755 Chinese characters in the GB2312 first-level national standard. All the experimental images have a uniform size of 512 × 512 pixels. The size of the characters in each picture is between 70px and 80px. Each image contains 2 to 5 characters. For data augmentation, white noise and interference textures are manually added when constructing the character pictures. Table 1 lists the key parameters of our dataset. Table 2 shows several examples from our dataset. There are four types of data in the dataset: besides the normal type, there are three types that represent complex tasks. Each image is labeled in the form ''number_width'', where the width is the average width of the characters in the image.
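As a small illustration of the ''number_width'' labeling scheme, the sketch below parses such a label into a character count and an average width. The example strings, the function name and the choice to store the width as a float are assumptions; only the ''number_width'' layout and the range of 2 to 5 characters per image come from the text.

```python
def parse_label(label):
    """Split a ''number_width'' label into (character count, average width in pixels)."""
    count_text, width_text = label.split("_")
    count = int(count_text)
    width = float(width_text)
    if not 2 <= count <= 5:
        # The dataset description states that each image holds 2 to 5 characters.
        raise ValueError(f"unexpected character count: {count}")
    return count, width

# Hypothetical labels, for illustration only.
print(parse_label("3_75"))
print(parse_label("5_72.5"))
```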

B. IMPLEMENTATION DETAILS
base CNN: For spatial feature extraction, four convolution operation units are constructed. Each convolution operation unit is called a block. The basic structure of each unit is: convolution layer, activation layer, convolution layer, activation layer, convolution layer, activation layer, down-sampling layer. Detailed parameter settings of the base CNN are shown in Table 3.
CMFEN: This network uses the spatial features extracted by the base CNN. As before, one convolution operation unit is a block, which contains several convolution layers and pooling layers. A flatten layer then squeezes the features into a 1D sequence of values. This sequence is used for feature fusion and is sent to a dense layer. Finally, we get 11 outputs. Table 4 describes this network in more detail.
CNFEN: Like CMFEN, it receives spatial features from the base CNN. After processing by three blocks and a flatten layer, a concatenate layer is used to fuse features. Finally, the dense layers produce 6 outputs. Table 5 shows all the parameter settings.
SSN: Before upsampling, a concatenate layer fuses all the features obtained by the above three networks. The data is then restored to the same size as the input image by deconvolution layers. The value of each pixel in the score map indicates the probability of the character class. The detailed parameter settings of this network are given in Table 6.

C. TRAINING
Training settings: All the networks are optimized using Adaptive Moment Estimation (Adam). The loss weight of the character information extraction network is set to 0.5. The loss weight of SSN is set to 1.0. We use cross-entropy loss to evaluate the loss of CMFEN, CNFEN and SSN. The initial learning rate is set to 0.0001 (1e-4). The exponential decay rate of the first moment estimate is set to 0.9 and that of the second moment estimate is set to 0.999. In addition, we set epsilon to 1e-08 to prevent division by zero in the calculation, and set decay to 0.0.
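For reference, a single Adam update with the hyperparameters listed above can be written out as follows. This is a scalar sketch of the standard Adam rule with bias correction, not the authors' training code. The `total_loss` helper additionally assumes that the 0.5 weight applies to CMFEN and CNFEN individually, which the text does not state explicitly.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

def total_loss(cmfen_loss, cnfen_loss, ssn_loss, info_weight=0.5, ssn_weight=1.0):
    """Weighted sum of the three cross-entropy losses (the weight assignment is an assumption)."""
    return info_weight * (cmfen_loss + cnfen_loss) + ssn_weight * ssn_loss

# First update from zero-initialised moments with a hypothetical gradient of 2.0.
param, m, v = adam_step(param=0.0, grad=2.0, m=0.0, v=0.0, t=1)
print(param)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly the learning rate regardless of the gradient's magnitude.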
Training: Our experiment runs 5000 training iterations with a batch size of 50. The whole training takes about 57 hours. Figure 3 shows how the losses of the different networks change. All the results are recorded with TensorBoard.

D. EXPERIMENTAL RESULTS
Several examples of segmentation results obtained on GBSD are shown in Table 7. The images in the left column contain a certain amount of random noise interference. Besides, there are adhesions between these Chinese characters, which make them hard to segment, for example '' '', '' '', '' ''. Unlike English, Chinese characters have a special structure: a left-right structure. Characters with the left-right structure are easily over-segmented, for example '' '', '' '', '' '', '' '', and '' ''.
As listed above, there are many common but difficult character segmentation problems. For these cases, qualitative samples of semantic segmentation produced by the traditional FCN are shown in the middle column of Table 7. Pixels closer to white in the figure are classified into the character class, while those closer to black are classified into the non-character class. After training, this network cannot learn the segmentation information between characters well, so we cannot obtain an accurate result from the score map. Moreover, after several hundred iterations, the loss of FCN remains low and it is difficult to further improve this network. As shown in the right column of Table 7, we can achieve a better result just by adding good features to the FCN model. These segmented regions are marked with red boxes. On this basis, the proposed method reuses the character information features in the final segmentation process. These segmented regions are marked with green boxes.
The proposed method is also verified on CASIA-HWDB-MT, and the quantitative results are shown in Figure 4. Three types of data are taken for illustration: single-touching strings, multiple-touching strings and single-touching strings with more than two characters. As shown in Figure 4, both of the other two methods focused on Chinese character segmentation suffer from over-segmentation and inaccurate segmentation points. The proposed method, however, handles these three situations very well.
To handle Chinese characters with left-right structure, we add CMF to our model, so that the characters are not over-segmented. To handle adhesions between characters, we add CNF to our model, so that a specific number of individual character blocks can be segmented in the character region. In summary, the proposed method can deal with common and difficult character segmentation problems.

E. ANALYSES AND DISCUSSIONS
In this paper, the experimental results are evaluated by three metrics: ''Over-segmentation'' (Os), ''Under-segmentation'' (Us), and ''Positive-segmentation'' (Ps). Os means that the segmentation block contains less than one ''complete'' character. Us means that more than one ''complete'' character is included in the segmentation block. Ps means that the segmentation block contains exactly one ''complete'' character. Here, cIoU is used to judge whether a segmented character is ''complete''.
cIoU is calculated from S, the area of the segmented character block, and C, the area of the original character in the image. The threshold is set to 0.95: a segmentation block is considered to contain a ''complete'' character if its cIoU is greater than 0.95. In the comparative experiment, 6000 images in the dataset are randomly chosen for character segmentation. Results are shown in Table 8. In addition to the proposed method, we implement 10 other methods on the same dataset. Pictures in the dataset are divided into several categories: adhesion (C1), missing strokes (C2) and left-right architecture (C3). As can be observed, some early methods such as Vertical projection and Connected domain are extremely prone to ''Over-segmentation'', because neither of them has a solution for characters with left-right architecture. Meanwhile, their ''Under-segmentation'' ratio is also higher than that of other methods. Compared with these two methods, Water droplet and Clustering take morphological features of characters into consideration, so they achieve better results. But they only optimize local features of characters, so they cannot obtain accurate results either. The multi-layer perceptron method, proposed later, differs from the morphological processing methods in that it extracts low-level features from the image. Semantic or instance segmentation methods such as FCN, U-Net, DeepLab and Mask R-CNN segment at the pixel level. These methods effectively reduce the ratios of ''Over-segmentation'' and ''Under-segmentation''. Similar to them, our proposed method uses information from the original image for character segmentation. The most noticeable trend appears in the bottom row of Table 8, which shows that our proposed method performs well on all three metrics. In particular, for missing strokes and left-right architecture, even the semantic segmentation methods mentioned above cannot obtain good results.
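The ''complete'' test based on cIoU and the 0.95 threshold can be sketched as follows. The exact cIoU formula is not reproduced in this copy of the text, so the sketch assumes the usual intersection-over-union of the segmented block S and the original character region C, with both regions simplified to axis-aligned boxes `(x0, y0, x1, y1)`. The function names and box coordinates are hypothetical.

```python
def box_area(box):
    """Area of an axis-aligned box (x0, y0, x1, y1); empty boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)

def c_iou(segment, character):
    """Assumed cIoU: intersection area of S and C divided by their union area."""
    overlap = (max(segment[0], character[0]), max(segment[1], character[1]),
               min(segment[2], character[2]), min(segment[3], character[3]))
    inter = box_area(overlap)
    union = box_area(segment) + box_area(character) - inter
    return inter / union if union else 0.0

def is_complete(segment, character, threshold=0.95):
    """A block holds a ''complete'' character if its cIoU exceeds the threshold."""
    return c_iou(segment, character) > threshold

# A block that covers almost exactly the character passes the 0.95 threshold.
print(is_complete((0, 0, 100, 100), (0, 0, 100, 98)))

# A block that is clearly larger than the character does not.
print(is_complete((0, 0, 100, 100), (0, 0, 100, 90)))
```

With a threshold this strict, even a small amount of missing or extra area prevents a block from counting as a positive segmentation.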
Figure 6 depicts the results of the comparative experiments for left-right architecture. We also compare the proposed method with other Chinese character segmentation methods on the CASIA-HWDB-MT dataset. The results are shown in Table 9. Our method obtains more accurate results than the others.
To study the effect of character spacing on character segmentation, we conduct another experiment, as shown in Table 10. [...] positive-segmentation is 92.57%. When the character spacing is set to [3px], the larger character spacing may interfere with information extraction, resulting in a slight decrease in Ps to 94.7%. But the method still maintains high accuracy. The bottom of Table 10 shows that the proposed method still performs well even with long-range character spacing. Figure 7 depicts the rate of ''Positive-segmentation'' for 11 methods with character spacing.
To show the difference visually, Figure 8 illustrates character segmentation using the projection approach and CCSeg. There are projections in two directions: vertical and horizontal. Since the boundary between black and white pixels is used directly as the segmentation line, characters with left-right structure are over-segmented, such as '' '' in (a) and '' '' in (b). In another case, '' '' and '' '' in (b) have adhesions in the vertical direction. As a result, they cannot be segmented correctly. With the regional multiple character feature, our method obtains better results. As shown at the bottom of Figure 8, CCSeg outperforms the traditional approach.
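The failure mode of the projection approach can be reproduced on a toy example. The sketch below is a plain vertical-projection splitter, not the implementation compared in the paper: it sums ink pixels per column and cuts wherever a column is empty. The tiny binary image is hypothetical and encodes one left-right character followed by two touching characters.

```python
def column_projection(image):
    """Number of ink pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]

def split_on_gaps(profile):
    """Return (start, end) column ranges of consecutive non-empty columns."""
    segments, start = [], None
    for x, value in enumerate(profile):
        if value > 0 and start is None:
            start = x                      # a new run of ink begins
        elif value == 0 and start is not None:
            segments.append((start, x))    # an empty column closes the run
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments

# Columns 0-4: one left-right character whose two parts are separated by a blank column.
# Columns 6-10: two different characters whose strokes touch, so no blank column separates them.
row = [1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0]
image = [row] * 3

segments = split_on_gaps(column_projection(image))
print(segments)
```

The left-right character is cut into two blocks and the two touching characters are returned as one block, which mirrors the over-segmentation and adhesion problems described for Figure 8.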
However, since the proposed method aims at optimizing the segmentation lines between characters, it cannot solve problems when there is large overlap between characters. This is also the direction we will keep working on.

VI. CONCLUSION
This paper proposes a character segmentation method called CCSeg to conduct Chinese character segmentation task. CCSeg consists of three major components. First, a convolutional network is used to extract spatial information. Next, CMFEN and CNFEN are used to extract character features to form a score map. At last, character features are reused to adjust the accurate segmentation points in the score map. A novel compound character feature is further proposed to describe the outline of characters. To effectively fine-tune our CCSeg, a multi-object dropout mechanism is also proposed. Based on the self-built dataset GBSD and CASIA-HWDB-MT (a dataset built based on CASIA-HWDB-1.0), we evaluate performance of CCSeg by focusing on Chinese character segmentation. Experimental results show that CCSeg effectively outperforms other methods in the literature. In the future, we will further extend our model to handle segmentation tasks when there are large overlaps between Chinese characters.

ZHENYU GAO received the B.S. degree in software engineering from Nantong University, Nantong, in 2018. She is currently pursuing the M.E. degree with Shanghai Maritime University. Her current research interests are deep learning, as well as object and text detection.

HUIHUA HE received the Ph.D. degree from Washington State University, in 2007. She is currently an Associate Professor with the College of Education, Shanghai Normal University. Her research interests include ICT uses in education, affective science, social-emotional learning, and related topics. She is a member of the CSE.

VOLUME 8, 2020