A Text Detection Algorithm for Image of Student Exercises Based on CTPN and Enhanced YOLOv3

Intelligent learning systems (ILSs) have become popular learning tools for students. An ILS can collect students' wrong answers in exercises and mine their unskilled knowledge points so that it can recommend personalized exercises. Detecting text accurately from images of students' exercises is therefore significant and essential in an ILS. However, a big challenge is that traditional text detection algorithms cannot detect complete text lines in an exercise scene: their detection boxes often break between Chinese characters and mathematical symbols. In this article, we propose a deep-learning-based approach for text detection, which improves You Only Look Once version 3 (YOLOv3) by changing the regression object from a single character to a fixed-width text region and applies a stitching strategy that constructs text lines from a relation matrix, improving accuracy by 9.8%. Experimental results on both the RCTW Chinese text detection dataset and a real exercise scenario show that our model improves detection effectiveness. In addition, we compare our method with two state-of-the-art approaches for exercise text detection and discuss its capabilities and limitations. We also provide a platform that implements the proposal for detecting text lines in students' daily homework and examination papers, which greatly enhances the user experience.


I. INTRODUCTION
With the development of artificial intelligence, the way students learn has changed dramatically. A variety of online education products from kindergarten to twelfth grade (K12) have emerged, such as Ape tutoring and Baidu homework help. These new education products usually provide functions like online learning, online practicing, and searching questions by photos. Among these functions, searching questions by photos has been the most widely used. It helps students find correct answers for wrong questions by searching online databases in real time, thus increasing the efficiency of solving problems. However, this procedure is instantaneous, which means students cannot save their wrong questions. Moreover, these products cannot recommend appropriate exercises based on each student's wrong questions, which could enhance learning performance.
We propose an intelligent learning system (ILS) that recognizes images uploaded by users and converts them to text; the results are then saved in each user's online question set. In order to provide users with appropriate exercises, we also leverage online question sets to mine users' unskilled knowledge points and incorporate their personal information to construct user profiles. We then adopt a recommendation system to provide users with appropriate exercises so that they can remedy their weaknesses more efficiently and thereby achieve higher scores in examinations.
To let users quickly and easily upload and synchronize questions from exercise books or exam papers, a naive idea is to pre-store index information (such as the catalog of learning materials) in the system so that users can use the catalog to find questions and manually add them to online question sets. However, this method requires maintaining a catalog for each learning material, which brings huge labor costs. For users, this approach is neither simple nor attractive.
To solve the above problem, we first adopt the optical character recognition (OCR) technique to extract texts from question images uploaded by users; we then match the extracted texts against the question database in real time and add the matched results directly to each user's online question set, which is far more convenient for users. The key elements of our project can be divided into the following four steps (sketched in code below):
1) Text detection [1]–[5]: analyzes the layout of the input image, locates the positions of texts, and provides the text regions to the recognition step;
2) Text recognition [6]–[8]: converts the text regions of the input image into machine-readable strings;
3) Database matching: finds the exercises in the database that are consistent with the recognized results;
4) Exercise recommendation: uses each user's question set to mine their unskilled knowledge points and recommends appropriate practice questions.
This article focuses on text detection in images and proposes a solution to problems encountered in the text detection phase. Text detection is the basis of the whole project: only when the text detection stage locates texts in the image precisely can the recognition and matching stages output accurate results.
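To make the division of labor concrete, the following sketch wires the four steps together. Every function here is an illustrative placeholder rather than a real API; the stubs merely mark where each stage's model would plug in.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)

def detect_text(image) -> List[Box]:
    """1) Text detection: locate text regions (the focus of this article)."""
    raise NotImplementedError  # e.g. the enhanced YOLOv3 detector of Section IV

def recognize(image, region: Box) -> str:
    """2) Text recognition: convert a text region into a string."""
    raise NotImplementedError  # e.g. a DenseNet + CTC recognizer

def match_database(text: str):
    """3) Database matching: find the exercise matching the recognized text."""
    raise NotImplementedError

def recommend_exercises(matched):
    """4) Exercise recommendation from the user's online question set."""
    raise NotImplementedError

def process_exercise_image(image):
    regions = detect_text(image)
    strings = [recognize(image, r) for r in regions]
    matches = [match_database(s) for s in strings]
    return recommend_exercises(matches)
```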
The biggest challenge in applying a text detection model to exercise images is that exercise images often contain many types of characters, such as number characters and Chinese characters, which increases the difficulty of text detection. As shown in FIGURE 1, different types of characters have different spacing. Besides, exercise images are often characterized by a large aspect ratio, and the texts are generally very long. Therefore, most existing text detection methods fail to detect complete text lines in examination scenarios, and bounding boxes usually break at the junction of number text and Chinese text. Thus, one text line gets multiple detection boxes, and those boxes usually overlap or omit some characters, which adversely affects the subsequent text recognition stage. In our work, we solve this problem with two components: one detects fixed-width text regions, and the other splices the fragmented bounding boxes into complete text lines.
In this article, an enhanced YOLOv3 model is presented to detect texts in primary school exercise images. In this scheme, a strategy based on detecting text regions rather than individual characters, together with a splicing algorithm, is proposed to improve text detection in examination paper scenarios. Specifically, we first fix the width of YOLOv3's anchor boxes and make them smaller; we then apply score threshold filtering and non-maximum suppression to the output bounding boxes. Besides, we construct a relation matrix to splice text lines instead of detecting them line by line as traditional text detection methods do. We compare the effectiveness of our proposed method with several recent state-of-the-art text detection models under different improvement strategies on exercise images, and it improves the accuracy of the subsequent database matching stage. The contributions of this work are summarized as follows:
• We improve the YOLOv3 algorithm by changing the regression object to a fixed-width text region, so as to increase the accuracy of the subsequent text recognition and database matching processes.
• A splicing strategy is proposed that concatenates detected text areas based on a relation matrix, which brings a significant improvement in exercise scenarios.
• An open platform is built for text detection and text recognition in images. The platform is based on our model and supports different improvement strategies.
The rest of the article is organized as follows. Section II discusses related work and Section III reviews the basic principles behind our proposed method. Section IV presents our novel model, which performs well in exercise scenarios. Experiments are described in Section V, and Section VI introduces our OCR open platform. Section VII concludes this article and outlines possible future work.

II. RELATED WORK
Various types of text detectors have been proposed in recent years, which can be broadly classified into two categories: traditional text detectors and deep learning-based text detectors. Traditional text detectors leverage handcrafted low-level features and prior knowledge to distinguish text from non-text in a scene image. However, these algorithms lack robustness to varied fonts and degraded images. To mitigate this problem, much research on deep learning-based text detection has been done, achieving high performance. The majority of popular deep learning-based text detection approaches are inspired by semantic segmentation [9], [10] and object detection [11]–[15]. For example, YOLOv3 [16] and Faster R-CNN [17] are commonly used object detection algorithms that have been applied directly to text detection tasks.
Within text detection, several works deserve mention. Deng et al. proposed PixelLink, which realizes text detection through instance segmentation. It performs two pixel-wise predictions: text/non-text prediction and link prediction. By setting two different thresholds, a positive pixel set and a positive link set are obtained, and positive pixels are then connected according to the positive links to form a set of connected components (CCs), each of which represents a text instance. Wang et al. [19] proposed the Progressive Scale Expansion Network (PSENet), which expands small-scale kernels to the final text line size via breadth-first search (BFS). These segmentation-based methods can handle multi-oriented texts in real scene images. However, when characters in an image are very close together, separating them using only text/non-text semantic segmentation becomes extremely difficult.
Tian et al. [20] proposed CTPN, which detects small text boxes and judges whether they are text regions; when all small text boxes in an image have been detected, those belonging to the same text line are merged to obtain the complete text boxes. Although the idea of CTPN is novel and attractive, its time cost is high. Thereafter, Joseph Redmon et al. proposed the YOLO series [16], [21], [22], fully convolutional detectors that cast object detection as a regression problem so as to quickly and accurately determine both the position and the type of the detected object. Baek et al. [18] proposed Character Region Awareness for Text Detection (CRAFT), which locates individual character regions and links the detected characters into text instances. Liao et al. proposed TextBoxes++ [23], which is built on an end-to-end fully convolutional network and can detect texts in any direction. Such object detection-based detectors need to design anchors or default boxes of various scales and aspect ratios in advance.
Our text detector is also designed based on object detection. We set the detected target to a fixed-width text area and integrate the detected text areas using a splicing algorithm. Compared with text detection algorithms that use a single character as the detected target, our method has higher accuracy and covers the text more completely.

III. BASIC PRINCIPLES FOR TEXT DETECTION
A. YOLOv3
YOLOv3 [16] is an end-to-end object detection method that directly regresses the position and category of objects in its output [24], thus achieving fast detection. The network structure of YOLOv3 mainly includes two stages. The first is feature extraction, which uses the Darknet-53 network to obtain feature maps and sets the grid cells to the same size as the feature map; for each grid cell, three bounding boxes are predicted. In addition, the feature maps used for object detection come at three scales, namely 13 × 13, 26 × 26, and 52 × 52. The second stage is output processing, which produces location information (i.e., x, y, w, h, and p_r). Unqualified bounding boxes are then removed by a filtering algorithm, and a text line construction algorithm generates the text lines that are passed to the next stage for text recognition. YOLOv3 uses a construction similar to the feature pyramid [25] to obtain the three feature maps through two upsampling layers; it can detect objects of different sizes and meets real-time requirements. The specific network architecture of YOLOv3 is shown in FIGURE 2 [26].
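As a rough illustration of the multi-scale output (a sketch, not the authors' code), the three detection scales yield tensors of the following shapes for a 416 × 416 input, where each grid cell predicts three boxes:

```python
# Illustrative: shapes of YOLOv3's three detection outputs for a 416x416 input.
NUM_ANCHORS = 3   # boxes predicted per grid cell
NUM_CLASSES = 1   # reduced to a single text class in our setting (Section IV)

for grid in (13, 26, 52):
    # each box carries (x, y, w, h, objectness) plus class scores
    shape = (grid, grid, NUM_ANCHORS, 5 + NUM_CLASSES)
    print(f"{grid}x{grid} scale -> output tensor of shape {shape}")
```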

B. CTPN
Tian et al. [20] proposed the connectionist text proposal network (CTPN) to localize text. In CTPN, a vertical anchor mechanism is developed to predict text locations at a fine scale [27], which greatly improves robustness and reliability for multi-scale and multi-language texts. It converts text detection into localizing fine-scale text proposals, each covering only a small slice of the text line, and determines whether each proposal is part of a text region (the yellow boxes in FIGURE 3). After all proposals are detected, an in-network recurrent architecture merges these small text proposals into full text boxes, completing the text detection task (the red box in FIGURE 3).

IV. OUR APPROACH
In this section, we introduce a text detection algorithm for student exercise images that consists of two parts: the improved anchor box and the improved text line construction.

A. IMPROVED ANCHOR BOX
In our project, we found that most existing text detection methods fail to detect complete text lines in exercise scenarios: bounding boxes usually break at the junction of number text and Chinese text. Thus, one text line gets multiple detection boxes, and those boxes usually overlap or omit some characters, which forces us to deal with missing text [14], [28] and repeated text. To detect a complete line of text, we present an enhanced version of YOLOv3 that solves the long text detection problem, with the following main innovations:
• We modify the aspect ratio of the anchor boxes to make them suitable for detecting long texts.
• We turn the multi-class detection problem into a two-class text/background detection problem.
• We utilize non-maximum suppression to improve the detection effect.
YOLOv3 is not good at detecting small objects or objects that are close to each other. It detects objects by locating each character, so errors in the detection of individual characters gradually accumulate, leading to performance degradation. When the aspect ratios of objects are unusual, the generalization ability of YOLOv3 is weak. Therefore, we first improve the algorithm by changing the detection object: we only detect whether a region is a text area instead of locating individual characters.
In view of the fact that the text objects in exercise images are mostly small targets, we improve the anchor box, which has a great effect on object detection in YOLOv3. For predicting bounding boxes, we change the original three anchor boxes with different widths and heights into three-scale detections with fixed width but different heights, which assigns more accurate anchor boxes for text detection. This idea is inspired by CTPN, which holds the view that predicting the vertical position of text is easier than predicting its horizontal position.
The improved algorithm only regresses the upper and lower edges of the text instead of single characters; hence it only needs to detect text areas rather than individual characters. We take nine fixed-width anchor boxes. For each cell, the network predicts the offsets $t_x$, $t_y$ relative to the upper-left corner of the cell and the scales $t_w$, $t_h$ relative to the anchor box. Suppose the cell is offset from the upper-left corner of the image by $(c_x, c_y)$, the widths of the input layer and the output layer are $i_w$ and $g_w$, and the anchor box has height $p_h$; then the predicted bounding box's coordinates $(b_x, b_y)$, width $b_w$, and height $b_h$ are computed as
$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = \frac{i_w}{g_w}, \quad b_h = p_h e^{t_h}.$$
The localization results contain many partially overlapping boxes, so we use non-maximum suppression (NMS) [29] to select the bounding boxes with the highest scores while ensuring that every target keeps a localization result. We evaluate the probability that a bounding box contains text. If the overlap ratio of a bounding box with the ground-truth bounding box is greater than that of any other bounding box, the objectness of this anchor box is 1. If the overlap ratio is greater than 0.5 but not the maximum, the prediction is ignored. A bounding box not considered to contain a text area has no effect on the loss function.
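A minimal NumPy sketch of this decoding and filtering step is given below. It assumes the fixed box width equals one cell's horizontal stride ($i_w / g_w$), as in the formula above; the function names and the (x1, y1, x2, y2) box layout in the NMS helper are our own conventions, not a released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_box(t_x, t_y, t_h, c_x, c_y, p_h, i_w, g_w):
    """Decode one fixed-width box from the network outputs, per the formula above.

    (c_x, c_y): offset of the cell from the image's upper-left corner (grid units);
    p_h: anchor height; i_w, g_w: widths of the input and output layers.
    """
    b_x = sigmoid(t_x) + c_x      # box center x (grid units)
    b_y = sigmoid(t_y) + c_y      # box center y (grid units)
    b_w = i_w / g_w               # width is fixed to one cell's horizontal stride
    b_h = p_h * np.exp(t_h)       # only the height is regressed
    return b_x, b_y, b_w, b_h

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression over (x1, y1, x2, y2) boxes."""
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    order, keep = np.argsort(scores)[::-1], []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # intersection of the current highest-scoring box with the rest
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        order = rest[inter / (area_i + area_r - inter) <= iou_thresh]
    return keep
```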
The loss function is an important basis for parameter optimization and updating of the deep network. For our algorithm, the text detection loss is calculated by combining the predicted values and the ground truth. The network predicts the coordinates of the bounding box, an objectness score, and class predictions. Usually, YOLOv3 performs multi-object classification, i.e., the default box at each level scores all object classes (plus background). We modify this into a binary text/background classification problem to better fit the text recognition requirements, with the class loss
$$\text{loss}_{class} = \sum_{r} \left( \mathbb{1}[r = \text{class}_{truth}] - \text{predict}_{class_r} \right)^2 .$$
The total loss of the network is composed of four parts:
$$\text{loss}_{total} = \text{loss}_{xy} + \text{loss}_{wh} + \text{loss}_{confidence} + \text{loss}_{class}.$$
Among them, $\text{loss}_{xy}$, $\text{loss}_{confidence}$, and $\text{loss}_{class}$ all take values between 0 and 1, so we calculate them with the binary cross-entropy loss. For $\text{loss}_{wh}$, the mean squared error (MSE) is used instead, due to the wide range of its values.
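The combined loss can be sketched as follows, with binary cross-entropy for the three bounded terms and MSE for the width/height term. The dictionary layout and mask handling are illustrative assumptions, not the exact training code.

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Summed binary cross-entropy between predictions p in (0, 1) and targets y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)).sum())

def total_loss(pred, truth, obj_mask, noignore_mask):
    """loss_total = loss_xy + loss_wh + loss_confidence + loss_class.

    pred/truth: dicts of NumPy arrays; obj_mask selects the cells responsible
    for a ground-truth box; noignore_mask drops predictions whose overlap is
    above 0.5 but not maximal, which the text above says are ignored.
    """
    loss_xy = bce(pred["xy"][obj_mask], truth["xy"][obj_mask])
    loss_wh = float(((pred["wh"][obj_mask] - truth["wh"][obj_mask]) ** 2).sum())  # MSE
    loss_conf = bce(pred["conf"][noignore_mask], truth["conf"][noignore_mask])
    loss_class = bce(pred["class"][obj_mask], truth["class"][obj_mask])
    return loss_xy + loss_wh + loss_conf + loss_class
```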

B. IMPROVED TEXT LINE CONSTRUCTION
We propose a bounding box stitching strategy based on the relation matrix, which improves the robustness of the model in different scenarios. It adapts better to oblique text and recovers complete text lines: text boxes no longer break due to the large gaps between different types of characters. The algorithm contains two main parts: construction of the relation matrix of the bounding boxes and calculation of the text lines.
The relation matrix of the bounding boxes is a two-dimensional Boolean matrix that records the right adjacency box of each bounding box. As shown in FIGURE 5(b), $box_1$ and $box_2$ are adjacent in the red text line, and $box_2$ is the right adjacency of $box_1$, so the position $(box_1, box_2)$ in the relation matrix is assigned True, as shown in FIGURE 5(a).

1) CONSTRUCT THE RELATION MATRIX OF THE BOUNDING BOXES
For each target bounding box (with abscissa $x$), we find its right adjacency box by building a relationship set for it: the set of bounding boxes whose abscissas lie in $[x, x + mhg]$ and that meet the overlap-similar condition defined below. Here $mhg$ is a manually set horizontal threshold, usually 0.08 times the image width. To ensure that the computed right adjacency box and the target bounding box are in the same text line, the overlap-similar calculation is performed on the candidate boxes, with thresholds given manually.
Take $box_1$ and $box_2$ as an example; their overlap and similarity are calculated from
$$y_{min} = \max(y_{10}, y_{20}), \qquad y_{max} = \min(y_{13}, y_{23}),$$
where $y_{10}$ and $y_{13}$ are the top and bottom ordinates of $box_1$, and $y_{20}$ and $y_{23}$ those of $box_2$.
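A sketch of the overlap-similar test and the relation matrix construction follows. Since only $y_{min}$ and $y_{max}$ are defined above, we assume CTPN-style ratios (vertical intersection over the shorter height, and the ratio of the two heights) with the 0.7 threshold used in our experiments; the (x0, y0, x1, y1) box layout is our convention.

```python
import numpy as np

def overlap_similar(b1, b2, thresh=0.7):
    """Vertical overlap-similar test between boxes (x0, y0, x1, y1)."""
    y_min = max(b1[1], b2[1])            # y_min = max(y_10, y_20)
    y_max = min(b1[3], b2[3])            # y_max = min(y_13, y_23)
    h1, h2 = b1[3] - b1[1], b2[3] - b2[1]
    overlap = max(0.0, y_max - y_min) / min(h1, h2)  # assumed CTPN-style ratio
    similar = min(h1, h2) / max(h1, h2)              # height similarity
    return overlap >= thresh and similar >= thresh

def build_relation_matrix(boxes, img_w, mhg_ratio=0.08):
    """Boolean matrix M with M[i][j] True iff box j is box i's right adjacency."""
    mhg = mhg_ratio * img_w              # horizontal search threshold
    n = len(boxes)
    M = np.zeros((n, n), dtype=bool)
    for i, bi in enumerate(boxes):
        # candidates: abscissa in (x, x + mhg] that pass the overlap-similar test
        cands = [j for j, bj in enumerate(boxes)
                 if j != i and bi[0] < bj[0] <= bi[0] + mhg
                 and overlap_similar(bi, bj)]
        if cands:
            M[i][min(cands, key=lambda j: boxes[j][0])] = True  # closest one
    return M
```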

Algorithm 1 Subgraph
Require: graph = [[g_11, g_12, ..., g_1n], ..., [g_n1, g_n2, ..., g_nn]]
Ensure: sub_graphs = [line_1, line_2, ..., line_m]
We take the box closest to the target bounding box that meets the overlap-similar condition in the relationship set as its right adjacency box. After the relation matrix is established, we determine text lines according to it by grouping the bounding boxes belonging to the same text line into one set. The text lines of a picture are thereby created.
We briefly describe this process in Algorithm 1. The general idea is to find the bounding boxes that start each text line (line 2), then find their right adjacency boxes in the matrix (lines 3-9), and iterate until all bounding boxes have been assigned to a set (line 10).
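A compact Python rendering of Algorithm 1's idea, assuming the relation matrix is acyclic (each box has at most one right adjacency, as constructed above):

```python
def sub_graphs(M):
    """Group boxes into text lines by walking the relation matrix (Algorithm 1).

    Assumes each box has at most one right adjacency, so the matrix is acyclic.
    """
    n = len(M)
    # a box starts a text line if no other box lists it as a right adjacency
    has_left = [any(M[i][j] for i in range(n)) for j in range(n)]
    lines = []
    for j in range(n):
        if has_left[j]:
            continue
        line, cur = [j], j
        while True:
            nxt = [k for k in range(n) if M[cur][k]]
            if not nxt:
                break
            cur = nxt[0]                 # follow the right adjacency chain
            line.append(cur)
        lines.append(line)
    return lines
```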

2) TEXT LINE CALCULATION
We calculate the coordinate information of a text line from its bounding box set, as depicted in FIGURE 6. We first fit the center line $L_1$ of the text line. Next, we calculate the distance $top$ from points $T_{21}$ and $T_{22}$ to the line $L_1$ and the distance $down$ from points $T_{31}$ and $T_{32}$ to $L_1$, where $d(\cdot)$ denotes the distance from a point to a line. Assuming the equation of $L_1$ is $y = k_1 x + b_1$, we solve for the parallel lines $L_4$ and $L_5$ as
$$L_4: y = k_1 x + b_1 - top, \qquad L_5: y = k_1 x + b_1 + down,$$
where $top$ and $down$ are also the vertical distances from $L_4$ and $L_5$ to the line $L_1$, respectively (lines 12-15). Finally, the foot points of $T_{11}$ on $L_4$ and $L_5$ and the foot points of $T_{12}$ on $L_4$ and $L_5$ are the four corner coordinates of the text box (line 16), as shown in FIGURE 6(c). In line 16, foot points are denoted by $fp(\cdot)$; for example, $fp(T_{11} \to L_5)$ denotes the coordinates of the foot point of $T_{11}$ on $L_5$. These four foot points give the position of the $i$th text line, denoted $tc_i$ in Algorithm 2.
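The following sketch traces the computation for one text line: fit $L_1$, shift it vertically to the extreme top and bottom corners to obtain $L_4$ and $L_5$, and project the line's end points onto them. Fitting $L_1$ by least squares through the box centers and using vertical offsets for $top$/$down$ are our simplifying assumptions; at least two member boxes are assumed.

```python
import numpy as np

def text_line_box(boxes):
    """Quadrilateral for one text line from its member boxes (x0, y0, x1, y1).

    Fits the center line L1 through box centers (assumes >= 2 boxes), shifts it
    vertically to the extreme top/bottom corners (L4/L5), and projects the ends
    of L1 onto L4 and L5 to get the four corners.
    """
    boxes = np.asarray(boxes, dtype=float)
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    k1, b1 = np.polyfit(cx, cy, 1)                    # L1: y = k1*x + b1

    # vertical offsets from L1 to the farthest top/bottom corners (y grows down)
    top = float(np.max(k1 * cx + b1 - boxes[:, 1]))
    down = float(np.max(boxes[:, 3] - (k1 * cx + b1)))
    b4, b5 = b1 - top, b1 + down                      # L4 above, L5 below

    def foot_point(px, py, k, b):
        # perpendicular foot of (px, py) on the line y = k*x + b
        fx = (px + k * (py - b)) / (k * k + 1.0)
        return (fx, k * fx + b)

    x_l, x_r = float(boxes[:, 0].min()), float(boxes[:, 2].max())
    T11 = (x_l, k1 * x_l + b1)                        # left end of L1
    T12 = (x_r, k1 * x_r + b1)                        # right end of L1
    return [foot_point(*T11, k1, b4), foot_point(*T12, k1, b4),
            foot_point(*T12, k1, b5), foot_point(*T11, k1, b5)]
```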

V. EXPERIMENTAL RESULTS
We first study the impact of different settings of the improvement strategies on detection performance and then evaluate our proposed model against two state-of-the-art approaches.

A. DATASETS AND PREPROCESSING
We use the RCTW Chinese text detection dataset [20] of the ICDAR 2017 competition and a real exercise dataset from primary school exercise books. RCTW-17 is a large-scale dataset consisting of various kinds of images, including street views, posters, menus, indoor scenes, and screenshots. The real exercise dataset was obtained with image capturing devices (i.e., scanners) and has two subsets ('math', 'Chinese') classified by subject. We labeled it with LabelMe. There are 8033 textual instances in total, 7229 of which are used for training and the remainder for testing.
The labels contain coordinates and character information, and characters that are difficult to recognize are marked with '#', as shown in FIGURE 7(a). Because YOLOv3 and our proposed method regress different objects, the labeled text lines of the dataset are segmented into equal parts in two ways: images for training YOLOv3 are labeled with single-character text boxes, as shown in FIGURE 7(b), and images for training our model are labeled with fixed-width text regions, as shown in FIGURE 7(c).

B. IMPLEMENTATION DETAILS
We trained the network for 5 epochs, each containing 7229 steps. The learning rate is set to 0.005 and the decay to 0.3. Overlap and similarity are applied to determine adjacent bounding boxes, with a threshold of 0.7. All comparison methods in the text detection experiments are pre-trained on the ICDAR 2017 dataset and then fine-tuned on the exercise dataset.
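For reference, the reported hyperparameters can be grouped into a single configuration (the dictionary layout itself is purely illustrative):

```python
# Reported training settings, grouped for reference.
TRAIN_CONFIG = {
    "epochs": 5,
    "steps_per_epoch": 7229,            # one step per training instance
    "learning_rate": 0.005,
    "lr_decay": 0.3,
    "overlap_similar_threshold": 0.7,   # for adjacent bounding box determination
    "pretrain_dataset": "ICDAR 2017",
    "finetune_dataset": "exercise dataset",
}
```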
The loss curve on the training set is shown in FIGURE 8. The loss decreases rapidly within the first 2000 steps, and the rate of decline then slows. Over the entire training process, the loss decreases from 31025.2422 to about 253.6912, which indicates that training is effective. We evaluate the trained model on the test set: FIGURE 9(a) shows the trend of the total loss over epochs, and FIGURE 9(b) details the trends of the four sub-losses. After the first epoch, loss_wh and loss_class approach zero, while loss_confidence continues to converge as training progresses and eventually stabilizes around 350.

C. EVALUATION ON THE EXERCISE IMAGES
We focus on the impact of the different strategies on detection performance. Three models are tested on the exercise dataset: the original YOLOv3 model, a model that only improves the detection object, and a model that combines all strategies.
[Metrics] The ultimate goal of our method is to obtain the text of exercise images uploaded by users, in combination with a text recognition algorithm, and then perform database matching: the exercise matching step searches the database for the corresponding exercises based on the detected and recognized results. We therefore evaluate performance with Search Accuracy, Search Time, and Time (Text Box Detection). Search Accuracy reflects how many texts were searched correctly. Since text recognition is not the focus of this article, a basic DenseNet + CTC model is adopted to recognize the texts.
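Search Accuracy can be sketched as follows, where `match` stands in for the database matching step; this helper is hypothetical and not part of the described system:

```python
from typing import Callable, List

def search_accuracy(recognized_texts: List[str],
                    truth_ids: List[str],
                    match: Callable[[str], str]) -> float:
    """Fraction of exercises whose recognized text retrieves the correct entry.

    `match` is a hypothetical callable used only for illustration.
    """
    hits = sum(match(t) == gt for t, gt in zip(recognized_texts, truth_ids))
    return hits / len(truth_ids)
```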
[Results] The results are summarized in TABLE 1, which compares (1) YOLOv3, the basic model; (2) IMP-Obj, the model that only improves the detection object; and (3) IMP, the model that combines all strategies. The Search Accuracy of the IMP model increases by 9.8% compared with the YOLOv3 model, and improving the detection object alone (IMP-Obj) also brings a significant improvement in Search Accuracy. As the input layer is enlarged from 416 × 416 to 608 × 608, the detection times of the two new models naturally increase from 0.86 s to 1.65 s and 1.58 s, respectively. Although the improved model is not the fastest, its speed is acceptable in our project. IMP-Obj performs better than the original YOLOv3 because it uses the improved anchor boxes and loss function to obtain detection boxes with higher text confidence. IMP performs better still because it improves the method of generating text lines: it uses height and distance indicators to comprehensively determine the adjacency matrix and then uses that matrix as the basis for merging text lines. YOLOv3 adopts a line-by-line detection method to cluster detection boxes into a collection, which is not suitable for text with multiple fonts and a large aspect ratio. FIGURE 10(b)(d) shows that the text boxes (i.e., bounding boxes) detected by YOLOv3 are relatively irregular and have large overlaps, which greatly hinders the subsequent text splicing algorithm. Our proposed model yields regular-sized detection boxes with minimal overlap, so the spliced text lines cover the text area completely, as shown in FIGURE 10(c)(e).

D. COMPARISON WITH STATE-OF-THE-ARTS
Finally, we compare our model with Character Region Awareness for Text Detection (CRAFT) [18] and Shape Robust Text Detection with Progressive Scale Expansion Network (PSENet) [19]. CRAFT is a text detection algorithm robust to scale changes and long text; PSENet uses a progressive expansion algorithm to expand small-scale kernels to the final text lines. The comparison in this section focuses on the detection performance of these two state-of-the-art methods and our model. With CRAFT, PSENet, or our method as the detector, horizontal Chinese text detection yields good results. However, exercise text usually includes numbers and mathematical symbols, and the inconsistent spacing between numbers and Chinese characters makes a difference in detection performance.
FIGURE 11 shows the detection results of the different algorithms when the text contains both Chinese characters and other characters such as numbers. It can be seen that CRAFT's detection boxes separate and fail to form a complete text box at the junction of number characters and Chinese characters, whereas our method successfully detects complete lines of text. This is because CRAFT determines the final text line based on the connection relationships among detected characters; when the spacing among characters is uneven or large, the algorithm judges that there is no connection and cannot obtain a complete text box. The results of PSENet are also very good, but some text lines still cannot be fully detected. PSENet predicts different kernel sizes of a text line and then uses a progressive expansion algorithm to expand the small-scale kernel to the final size of the text line. Because there is a relatively large margin between small-scale kernels, it distinguishes adjacent text lines well; at the same time, however, it divides a text line with large character intervals into multiple boxes.

VI. OCR PLATFORM
Our platform uses optical character recognition to directly convert text in images into editable text. It can recognize the content of students' daily homework and examination papers, automate the entry of this content, and improve the efficiency of teachers' teaching and students' learning.
Users can also choose different parameters to find a more suitable configuration and view details of the models. FIGURE 12 shows the functional framework of the OCR platform, including the home page, the setting page, and the demo page.
As shown in FIGURE 13(a), the home page shows details of the models in this paper, including the network architecture, the performance of the project, a summary outlook on future real-time OCR, and the current experimental results, which makes the structure of the model clearer. FIGURE 13(b) is the setting page, where the structural parameters of our models can be set intuitively, i.e., whether to perform image preprocessing, whether to improve the object detection, whether to enable box stitching, and which thresholds to use. Multiple models can be trained with different settings.
After configuring the setting page, we can run the corresponding model on the demo page, shown in FIGURE 13(c). We first upload a picture and then test the model to obtain the result. This page can also be used to understand, debug, and demonstrate the model.

VII. CONCLUSION
This paper focuses on text detection in exercise scenarios and proposes a text detection algorithm that improves the advanced real-time object detection network YOLOv3 to enhance the accuracy of text detection. The new algorithm includes two main parts: one changes the detection object by improving the anchor box; the other is a bounding box splicing algorithm based on the relation matrix. The algorithm makes detection boxes more regular, reduces their overlap, covers text areas more fully, and is more robust to oblique text. Compared with the original YOLOv3 algorithm, our algorithm improves accuracy by 9.8%, and the running time of the entire model is 1.58 s. The experiments use ICDAR 2017's RCTW Chinese dataset and a manually labeled exercise dataset for weight optimization. Through score threshold filtering, NMS filtering, and the text line construction algorithm, precise positions of text lines are obtained and provided to the subsequent text recognition algorithm.
Although the existing algorithms have been successfully applied in most application scenarios, some follow-up work remains. First, our algorithm sometimes recognizes background as text, and it mainly targets horizontally distributed text; further research is needed on detecting curved or circularly distributed text. Finally, although the current detection speed of our algorithm has reached an industrial level, real-time text detection in scenarios such as autonomous driving, and even real-time OCR, remains a big challenge; improvements are needed in hardware, computation speed, and model compression.
LANGCAI CAO received the Ph.D. degree in automation from Xiamen University, in 2011. He is currently an Associate Professor with Xiamen University. His research interests include research and development of information systems, process intelligence, and machine learning.
HONGWEI LI received the bachelor's degree in automation from Northeast Petroleum University, in 2019. Her research interests include optical character recognition and social networks.
RONGBIAO XIE received the bachelor's degree in automation from Xiamen University, in 2019. His research interests include optical character recognition and reinforcement learning.
JINRONG ZHU received the master's degree in automation from Xiamen University, in 2020. Her research interests include optical character recognition and community search.