A Page Object Detection Method Based on Mask R-CNN

Page object detection is crucial for document understanding, and the granularity at which objects are defined affects detection performance. In this study, block-level region detection is considered within the inherent hierarchical structure of document images. Inspired by the Mask R-CNN (Region-based Convolutional Neural Networks) method, an end-to-end network is proposed to perform object classification, bounding box identification, and page object mask generation at the same time. LaTeX-based synthetic document generation is designed to enlarge the training data, and a large number of synthetic page images are generated to alleviate the shortage of labeled data. Compared with existing page object detection competition methods, the proposed method achieves better results, with an mAP of 0.917 on the detection of page objects such as tables, figures, and maths.


I. INTRODUCTION
Document image processing has become an important technology for machine understanding and artificial intelligence (AI) tasks. Its sustained development helps AI algorithms and robots obtain relevant image information from documents, which embody human intellectual labor. Generally, the pipeline of document image processing consists of three steps: pre-processing (binarization, noise/blur removal, rectification, etc.), page layout analysis (detecting and identifying Regions of Interest, RoI), and logical understanding (gaining application-specific information from each RoI). Therefore, document image processing technology plays a vital role in machine understanding and AI tasks.
In machine understanding, various application scenarios, such as information retrieval and mobile reading, are based on page object extraction from document images. Traditionally, the process consists of two successive parts: layout analysis and logical understanding. Layout analysis aims to detect and segment the document page geometrically into regions, and logical understanding subsequently classifies the segmented regions semantically into tables, figures, formulae, text, and other page parts. However, recognition performance depends heavily on the front-end layout segmentation results: an error in the first segmentation stage tends to propagate as misclassification in the second recognition stage.
The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Cheng.
Recently, deep learning has become the most popular solution to object detection and semantic segmentation in natural scene images. It makes it possible to detect, segment, and classify objects in an end-to-end manner. State-of-the-art methods fall into two kinds: one-stage detectors and two-stage detectors. One-stage detectors treat detection as a regression task, such as SSD (Single Shot MultiBox Detector) [1], YOLOv4 (You Only Look Once v4) [2], and YOLACT++ [3]. Two-stage methods use two steps, region proposal followed by classification/regression, such as R-CNN (Region-based Convolutional Neural Networks) [4], Fast R-CNN [5], and Faster R-CNN [6]. There are also attempts to apply deep learning based methods to document images. Some are designed for end-to-end pixel-level analysis, while others aim to detect and classify regions with bounding boxes. FCN (Fully Convolutional Network), for example, is able to simultaneously segment and classify document images at pixel level [7], [8]. It first extracts feature maps with a convolutional neural network, and deconvolution is then performed to obtain full-resolution semantic segmentation results. Another category is based on the region proposal scheme, where the results are presented as regional bounding boxes. R-CNN [4] is popular among the region proposal methods. Various Faster R-CNN based methods [5] have competed in the page object detection competition to detect object bounding boxes [9]. In Faster R-CNN [6], shared features are extracted by the backbone network. The RPN (Region Proposal Network) is designed to produce candidate regions of interest, which makes inference faster than with the original R-CNN. Faster R-CNN outputs both a class label and a bounding box offset for each candidate object.
With further advancements, Mask R-CNN [10] adds a mask branch output on top of Faster R-CNN. Features are first extracted by the backbone network, and the proposals are predicted and further refined to regress the bounding boxes for object detection and to produce segmentation masks. The mask provides pixel-level semantic segmentation for each candidate object. Thus, Mask R-CNN achieves instance segmentation, which involves both object detection and semantic segmentation. It integrates the improvements of both Faster R-CNN and FCN (Fully Convolutional Network). It applies RoI Align to preserve the spatial locations of features without losing information when downsampling. As a two-stage object recognition method, it has become increasingly popular in various applications.
Inspired by previous works, in this paper we apply the Mask R-CNN architecture to page object detection in document images. A ResNet101 backbone with a Feature Pyramid Network is trained for document images. Mask R-CNN performs FCN only within each predicted region of interest, from which segmentation masks are produced. Both bounding-box-level recognition and pixel-level classification are available. Insufficient training data is one of the problems in page object recognition, so we generate a large number of synthetic data for this purpose. Various experiments on six different datasets, including the POD2017 (Page Object Detection) dataset, are designed to evaluate our method. Compared with previous page object detection methods, the proposed method achieves better AP results on page objects such as tables, figures, and maths.
Our method extends the general Mask R-CNN framework to document image processing. It performs page layout analysis and logical understanding simultaneously. The output of the method comprises three parts, bounding box, mask, and classification, as shown in Fig. 1: the bounding box and mask represent the RoI as the layout analysis output, and the classification represents the logical understanding output. The aspect ratio of page objects is analyzed for the RPN, and synthetic page images are generated for training. Experiments show that our method, designed for document images, leads to better performance in page object detection tasks than a network designed for natural scene images.
The rest of the paper is organized as follows. Related work on document image analysis is introduced in Section II. The adapted network architecture is proposed in Section III. Experimental results are presented in Section IV. The conclusions are given in Section V.

II. RELATED WORK
Page layout analysis aims at detecting and segmenting text from non-text, followed by logical layout understanding to recognize logical classes such as paragraph, figure, and table. Traditionally, layout analysis and logical understanding are two separate, successive parts. For page segmentation, bottom-up and top-down methods are the major approaches [21]-[24]. As for logical understanding, various classifiers have been used, such as Support Vector Machine [25], [26] and CRF [27]-[29].
To detect, segment, and classify objects in an end-to-end manner, deep learning has been used as a basic method for object detection and semantic segmentation. Convolutional Neural Networks (CNNs) are powerful in representing hierarchical features. Deep networks naturally integrate low-, mid-, and high-level features and classifiers. A large number of methods have been proposed. CNN-based networks have been applied to document classification and recognition, such as MobileNetV2 [30] and the dilated convolutional network [7]. Improvements in document image detection, segmentation, and classification have been made with CNNs [31]. It was claimed that various CNN architectures, including VGGNet, ResNet, GoogLeNet, and DeconvNet, were the most frequently used in document image processing [32].
Recently, ICDAR (International Conference on Document Analysis and Recognition) held the Page Object Detection (POD) competition, focusing on detecting tables, mathematical equations, and figures. In the POD competition of the 2017 ICDAR conference, almost all participating teams used deep learning for object detection, including the popular SSD (Single Shot MultiBox Detector) and Faster R-CNN based models [9]. It was also stated that detection precision could be improved beyond what Faster R-CNN alone provides. Multiple models also contributed to better performance of deep networks, with the help of extra information from OCR, or CRF unary and binary features. Besides Faster R-CNN based methods, which detect the outer bounding boxes of document page objects, another approach is to predict class labels at pixel level. The fully convolutional network (FCN) was used for semantic segmentation [33]. Based on a coarse feature map extracted by a CNN, deconvolution was performed to obtain a full-resolution segmentation mask. The Dilated Residual Network (DRN) [34] replaced subsampling layers with dilation, which could be applied to semantic segmentation of document pages.
By adding FCN on proposed region candidates, Mask R-CNN added mask branch output on Faster R-CNN. The mask provided pixel-level semantic segmentation for each candidate object [10]. In this paper, our model utilizes Mask R-CNN as basic network architecture on document image page object detection and recognition. A ResNet101 backbone with Feature Pyramid Network is trained for document images. Mask R-CNN performs FCN only for region of interest predicted, from which segmentation masks are produced. Both bounding box level recognition and pixel level classification are produced. Our mask can detect and segment the region as well as semantically label the region.
In the existing studies on page object detection and recognition, high-level representation of documents remains an open and challenging problem. The granularity is crucial for performance. In this paper, the block level is considered. The four page object classes are text, figure, table, and maths. Small objects such as maths refer to isolated formulae; maths embedded in text lines is considered part of the text block region in the ground truth. Tables are of various types: some have three lines, and others have no lines. For tables and figures, captions are also regarded as text blocks for better performance. To alleviate the problem of limited document ground truth data for deep network training, we generate synthetic document images to enlarge the dataset. Experiments on six datasets are conducted to evaluate the Mask R-CNN based page object detection method.

III. PROPOSED METHOD
A. NETWORK ARCHITECTURE
Mask R-CNN is a general framework for object detection and segmentation. Object shape masks follow object contours better than bounding boxes do, and semantic segmentation is better suited to depicting regions of interest. In this paper, Mask R-CNN is adapted for page object recognition in document images. As shown in Fig. 1, the overall framework consists of several parts: a Convolutional Neural Network (CNN) backbone with a Feature Pyramid Network (FPN) [35], a Region Proposal Network (RPN) [6], RoI (Region of Interest) feature extraction using RoI Align, bounding box regression, label classification, and mask prediction. As illustrated in Fig. 2, ResNet-101 [36] is used as the CNN in Fig. 1. As seen in Fig. 2(a), ResNet-101 forms a bottom-up path, along which the resolution of the feature maps is reduced. In contrast, FPN is a top-down process, in which the resolution of the feature maps increases.
Lateral connections between ResNet-101 and FPN combine features of the same resolution from ResNet-101 and FPN, respectively, to generate new features in FPN [10]. Fig. 2(b) shows the feature extraction process, in which the resolution-reducing pathway in ResNet101 and its four features are shown, whereas FPN has a resolution-increasing pathway. Two features of the same resolution, one from ResNet101 and one from FPN, are combined to generate a new feature along the FPN pathway.
A ResNet101 backbone with FPN is used to train the model for document images. FPN is a top-down feature pyramid architecture for detecting objects at multiple scales. The last four residual blocks {C2, C3, C4, C5} are used as feature outputs. The lateral connections semantically enhance the feature maps. FPN outputs the final set of feature maps (P2, P3, P4, P5, P6).
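The top-down pathway with lateral connections described above can be sketched as follows. This is a minimal illustration rather than the network used in this work: it assumes single-channel feature maps stored as nested lists, nearest-neighbour 2× upsampling, and identity lateral connections (the convolutions of a real FPN are omitted). All function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def merge(lateral, top_down):
    """Element-wise sum of a lateral map and the upsampled coarser map."""
    up = upsample2x(top_down)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lateral, up)]


def top_down_pathway(c_maps):
    """c_maps is [C2, C3, C4, C5], fine to coarse. Returns [P2, P3, P4, P5]."""
    p_maps = [c_maps[-1]]                  # the coarsest level starts the pathway
    for c in reversed(c_maps[:-1]):
        p_maps.insert(0, merge(c, p_maps[0]))
    return p_maps


def constant_map(size, value):
    """Toy square feature map filled with one value."""
    return [[value] * size for _ in range(size)]


# Toy inputs whose resolution halves at every level, as in the bottom-up path.
c2, c3, c4, c5 = (constant_map(8, 1.0), constant_map(4, 2.0),
                  constant_map(2, 3.0), constant_map(1, 4.0))
p2, p3, p4, p5 = top_down_pathway([c2, c3, c4, c5])
```

Each output level keeps the resolution of its lateral input while accumulating information from all coarser levels, which is the property that makes the pyramid useful for objects of different sizes.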
Based on the feature maps extracted by the backbone network, appropriate proposals for page objects need to be generated. The Region Proposal Network (RPN) is adapted to document page objects. In this work, proposals for page objects, including figure, table, maths, and text, are at block level instead of fragment level. Fig. 3 shows that the aspect ratio distribution varies across page objects, including text, table, figure, and maths regions in this scenario. Aspect ratio is measured as the proportion of width to height. The aspect ratio of tables and figures mostly stays within 10, while text and maths regions have larger values, reaching up to 70. The multiple scales require multiple feature representations and multi-scale region proposal candidates. FPN outputs five stages (P2, P3, P4, P5, P6), and anchors are set to (0.5 ... Figures and tables are large objects compared with small maths objects. Text blocks account for the majority of the distribution. FPN is suitable for extracting features of page objects at various scales, since a small object appears only in a small area of the final feature maps. After the RPN, RoI Align [10] is used to extract accurate features in the proposed method. An RoI is positive when its IoU (Intersection over Union) with a ground-truth box is at least 0.5; otherwise, it is considered a negative RoI. All the anchor boxes over the image can be classified as positive or negative according to the object score. The ratio of positive, negative, and neutral anchors is 1:1:1. The number of sampled RoIs is 2000 for the FPN backbone.
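The IoU rule for labelling RoIs and the width-to-height aspect ratio can be illustrated with a short sketch. It assumes axis-aligned boxes given as (x1, y1, x2, y2) tuples; the function names and the example boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_roi(roi, gt_boxes, threshold=0.5):
    """An RoI is positive if its best IoU with any ground-truth box
    reaches the threshold, and negative otherwise."""
    best = max((iou(roi, gt) for gt in gt_boxes), default=0.0)
    return "positive" if best >= threshold else "negative"


def aspect_ratio(box):
    """Width divided by height."""
    x1, y1, x2, y2 = box
    return (x2 - x1) / (y2 - y1)


gt = [(0, 0, 100, 20)]                     # one wide ground-truth block
print(label_roi((0, 0, 100, 10), gt))      # covers half of the block
print(label_roi((50, 0, 150, 20), gt))     # shifted half a width to the right
print(aspect_ratio((0, 0, 700, 10)))       # a very wide text-line-like box
```

The second RoI overlaps the ground truth by half its width, yet its IoU is only one third, which shows how quickly wide, thin page objects fall below the 0.5 threshold.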
For bounding boxes regression and classification, Faster R-CNN extracts RoI features from each level of FPN feature maps, and it provides candidate boxes. Positive anchors do not necessarily cover the whole object. RPN regresses a refinement to the anchors in order to correct the object boundaries. For a given positive proposal, to obtain the best matched horizontal rectangle, the matched boxes are shifted and resized to align with the proposal and target map.
For the mask branch, a fully convolutional network is used to produce the region segmentation maps and to make predictions. The mask target is the intersection between an RoI and its associated ground-truth mask. In our scenario, mask maps of size 32 × 128 for the four page object classes, together with a background map, can be predicted.

B. EVALUATION
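As orientation for this subsection, the three loss terms it names (average binary cross-entropy for the mask, cross-entropy for classification, and Smooth L1 for the bounding box) can be sketched in their standard textbook forms. This is an assumption-laden illustration: the function names, the clamping constant, and the unweighted sum are hypothetical choices, and the notation used in this paper may differ.

```python
import math

EPS = 1e-12  # clamp to avoid log(0); an illustrative choice


def mask_loss(probs, targets):
    """Average binary cross-entropy over an m x m mask.
    probs holds sigmoid outputs in (0, 1); targets holds 0/1 labels."""
    total, count = 0.0, 0
    for p_row, t_row in zip(probs, targets):
        for p, t in zip(p_row, t_row):
            p = min(max(p, EPS), 1.0 - EPS)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def cls_loss(class_probs, true_class):
    """Cross-entropy for one RoI: negative log-probability of the true class."""
    return -math.log(max(class_probs[true_class], EPS))


def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5


def bbox_loss(pred, target):
    """Smooth L1 summed over the four box regression offsets."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def total_loss(l_cls, l_bbox, l_mask):
    """Multi-task loss as a plain sum of the three terms."""
    return l_cls + l_bbox + l_mask


l_mask = mask_loss([[0.9, 0.2], [0.8, 0.1]], [[1, 0], [1, 0]])
l_cls = cls_loss([0.1, 0.7, 0.2], true_class=1)
l_bbox = bbox_loss((0.5, 2.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0))
print(total_loss(l_cls, l_bbox, l_mask))
```

Because the mask term applies a per-pixel sigmoid and binary cross-entropy, each class mask is scored independently of the others, which is the property contrasted with a multinomial (softmax) loss in [10].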
To accomplish the goals of object detection, object recognition, and semantic segmentation, the multi-task loss includes the classification loss Loss_cls, the bounding-box loss Loss_bbox, and the mask loss Loss_mask. The mask branch produces a 3 × m × m dimensional output for each RoI. After a sigmoid function is applied, the loss of the mask branch is the average binary cross-entropy, referred to as (1); it is claimed that binary cross-entropy is better than multinomial cross-entropy loss [10]. The loss of the classification branch is the cross-entropy, referred to as (2), and the loss of the bounding-box branch is the Smooth L1 loss, referred to as (3).

C. SYNTHETIC DATA GENERATION
Insufficient training data may cause the network to overfit. To alleviate the problem of limited document ground truth data for deep network training, we selected various datasets used in existing studies and enriched our dataset with synthetic documents. There have also been several attempts at generating synthetic documents. Yang et al. produced document images by scraping data from the internet and applying LaTeX [7]. Yi et al. used a semi-automatic method to label ground truth [31]. In this paper, a large number of synthetic data is generated to enrich the training data.
The predefined parameters for layout generation include font size, font color, page size, line space, margin space, figure size, page number, and total region number. Given the set of possible page objects DO_i, i = 1, 2, 3, ..., N, the corresponding class labels are defined as y_i, i = 1, 2, 3, ..., N, where y_i ∈ {Type_j | j = 1, 2, 3, ..., M}. Here, Type_j represents the text, figure, maths, and table objects in this dataset. For each Type_j, the data source is denoted as Set_j, j = 1, 2, 3, ..., M, which can be crawled from the internet or parsed from PDF pages. The generation applies a top-down method, from the overall layout to the page objects. The generation process is summarized as follows:
• Generate header;
• Set single column, double columns, or multiple columns;
• Starting from the first column, according to y_i, generate random page objects DO_i and record the spatial coordinates DO_i-Coors and the specific contents DO_i-Content;
• Generate page objects randomly until there is no space left in the last column;
• Generate footer and page number. It is unnecessary to have all the page objects appearing in one page. y_i decides whether the header, footer, page number, or other page objects are absent. It is also possible to configure certain page objects to be included in the page;
• Use the TeX markup language to generate the code for the target PDF document page, which can be exported as a document image at the same time.
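The layout steps listed above can be sketched as a small generator that places blocks column by column and records their coordinates and labels. This is a simplified illustration, not the generator used in this work: the page size, margins, and block height ranges are hypothetical values, block contents are omitted, and the final TeX emission step is left out.

```python
import random

# Hypothetical height ranges (in points) for each block type.
HEIGHT_RANGE = {"text": (40, 160), "figure": (100, 200),
                "maths": (20, 40), "table": (80, 180)}


def generate_page(n_cols=2, page_w=612, page_h=792, margin=54,
                  gap=18, header_h=24, footer_h=24, seed=0):
    """Return a list of {"type", "box"} records; box is (x1, y1, x2, y2)."""
    rng = random.Random(seed)
    left, right = margin, page_w - margin
    # Step 1: header.
    objects = [{"type": "header",
                "box": (left, margin, right, margin + header_h)}]
    top = margin + header_h + gap              # first usable y in a column
    bottom = page_h - margin - footer_h - gap  # last usable y in a column
    # Step 2: column layout.
    col_w = (right - left - (n_cols - 1) * gap) / n_cols
    # Steps 3 and 4: fill each column until the next block no longer fits.
    for c in range(n_cols):
        x1 = left + c * (col_w + gap)
        y = top
        while True:
            label = rng.choice(sorted(HEIGHT_RANGE))
            height = rng.randint(*HEIGHT_RANGE[label])
            if y + height > bottom:
                break
            objects.append({"type": label,
                            "box": (x1, y, x1 + col_w, y + height)})
            y += height + gap
    # Step 5: footer.
    objects.append({"type": "footer",
                    "box": (left, page_h - margin - footer_h,
                            right, page_h - margin)})
    return objects


page = generate_page(n_cols=2)
for obj in page:
    print(obj["type"], obj["box"])
```

Since the coordinates and labels are recorded while the page is laid out, the ground truth comes for free, which is the main appeal of synthetic generation over manual labeling.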
The data source Set_j for each Type_j can come either from data crawled from the internet or from block data exported by an eligible PDF parser.
• DSSE-200 provides 200 labeled document images, which were used in Yang's work [7]. This dataset originally has 6 classes, within which text, section, caption and list are aggregated into text blocks in our work. Hence, there are 3 classes including text, figure and table blocks involved in training and testing.
• POD2017 has a total of 2417 document images selected from CiteSeer scientific papers, with 3 manually labeled classes: table, figure, and maths. This dataset has a variety of page layouts, including single-column, double-column, and multi-column scientific papers. It comes from the page object detection (POD) competition of the 2017 ICDAR conference. There are 9422 objects in total, of which around 58% are formulas, 31% figures, and 11% tables.
• RDCL2019 has a total of 478 images from scanned magazines and technical articles. It was provided for the recognition of documents with complex layouts. In this paper, 3 aggregated classes are used: Text, Table, and Figure.
• Marmot, selected from 35 English and Chinese books, has 244 page images, which were also used in our previous work [28]. It can be accessed through http://www.icst.pku.edu.cn/cpdp/sjzy. Our ground-truthing tool based on wxPython is able to label the document images at a given granularity. In this paper, we mark the document pages at block level. The set of 4 classes includes text, figure, table, and maths.
• Doc2020 has 195 document images manually labeled from scientific papers. As in the previous dataset, the 4 classes are text, figure, table, and maths.
• SynDoc document images are generated automatically by applying LaTeX. 3 classes, text, figure, and table, are used for training and testing.
These datasets are summarized in Table 1. In total, there are 5337 document images with 45901 blocks. As expected for most documents, the text block class dominates with 62% of the blocks, followed by around 14% figures, 13% maths, and 11% tables. The data distribution among these 4 classes is therefore not balanced. All the document images are divided into training and testing sets with a ratio of approximately 4:1.
In most cases, labeling is done with an open source tool called VIA (VGG Image Annotator) [38]. Rectangular boxes and semantic classes are marked, and the ground truth is stored in JSON format. Our self-developed labeling tool, called Marmot [39], is able to label the classes hierarchically. The ideal solution should be hierarchical, covering the block level, the fragment level, and the relationships between regions of different granularity; the hierarchical bounding boxes can be stored in a tree structure. Our own data are produced with the assistance of a self-developed ground-truthing tool called CLAW. In this work, only the block level is considered. The PRImA dataset used the PAGE format for the 2019 ICDAR competition on recognition of documents with complex layout [37], and non-rectangular regions are annotated in that dataset. Accordingly, the mask ground truth of the RDCL2019 dataset uses non-rectangular shapes, while for the other datasets rectangular bounding boxes mark the ground truth. Generally, regions are defined as rectangular areas, and the ground truth is stored in XML format. An XML-based ground-truth data format is designed, which can be transformed into COCO format for a unified interface.
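A conversion from an XML ground-truth file to COCO-style annotations can be sketched as follows. The XML schema shown here (a page element with region children carrying type, x, y, w, and h attributes) is purely hypothetical, since the actual format is not specified in the text; only the COCO field names (images, annotations, categories, bbox as [x, y, width, height], polygon segmentation) follow the public COCO annotation format.

```python
import json
import xml.etree.ElementTree as ET

CATEGORIES = ["text", "figure", "table", "maths"]


def xml_to_coco(xml_pages):
    """Convert a list of XML strings (hypothetical schema) to a COCO dict."""
    cat_id = {name: i + 1 for i, name in enumerate(CATEGORIES)}
    coco = {"images": [], "annotations": [],
            "categories": [{"id": i, "name": n} for n, i in cat_id.items()]}
    ann_id = 1
    for image_id, xml_text in enumerate(xml_pages, start=1):
        page = ET.fromstring(xml_text)
        coco["images"].append({"id": image_id,
                               "file_name": page.get("image"),
                               "width": int(page.get("width")),
                               "height": int(page.get("height"))})
        for region in page.iter("region"):
            x, y, w, h = (float(region.get(k)) for k in ("x", "y", "w", "h"))
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": cat_id[region.get("type")],
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
                # A rectangular region expressed as a 4-point polygon,
                # so that it can also serve as mask ground truth.
                "segmentation": [[x, y, x + w, y, x + w, y + h, x, y + h]],
            })
            ann_id += 1
    return coco


sample = ('<page image="p1.png" width="600" height="800">'
          '<region type="table" x="10" y="20" w="100" h="50"/>'
          '<region type="maths" x="10" y="100" w="200" h="20"/>'
          '</page>')
coco = xml_to_coco([sample])
print(json.dumps(coco["annotations"][0], indent=2))
```

Non-rectangular regions, such as those annotated in RDCL2019, would need their own polygon in the segmentation field instead of the rectangle corners used here.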

B. IMPLEMENTATION DETAILS
Our network architecture is implemented in PyTorch. All input images are scaled to 800 pixels on the short edge. Popular models pretrained on natural images are not suitable for document images. Therefore, the convolutional network is trained from scratch with random initialization.
We train on 8 GPUs for 36000 iterations (108 epochs), with 2 images per GPU. The learning rate is 0.005, with weight decay 0.0001 and momentum 0.9. Adam optimization is used. The RPN has 5 scales and 4 aspect ratios, where the 5 scales are (8, 16, 32, 64, 128) and the 4 aspect ratios are (0.5, 1, 2, 3). The IoU threshold for an RoI to be considered positive is 0.5. The ratio of positive to negative RoIs is 0.25. Each image has 2000 sampled RoIs for training. The candidate boxes are obtained after non-maximum suppression. Among the 100 highest-scoring boxes, the mask branch is applied to predict K masks per RoI, where K is the predicted class. At test time, the proposal number is 300 for the C4 backbone and 1000 for FPN.
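The anchor configuration above yields 5 × 4 = 20 anchor shapes, which can be enumerated with a short sketch. It makes two assumptions that the text does not confirm: each aspect ratio is taken as width divided by height, and each anchor keeps an area of scale squared in the units of the scale. The function names are hypothetical.

```python
import math

SCALES = (8, 16, 32, 64, 128)
RATIOS = (0.5, 1, 2, 3)


def anchor_shapes(scales=SCALES, ratios=RATIOS):
    """Return (scale, ratio, width, height) for every scale/ratio pair.
    Assumes ratio = width / height and area = scale ** 2."""
    shapes = []
    for s in scales:
        for r in ratios:
            w = s * math.sqrt(r)
            h = s / math.sqrt(r)
            shapes.append((s, r, w, h))
    return shapes


def anchor_box(cx, cy, w, h):
    """Axis-aligned box (x1, y1, x2, y2) centred at (cx, cy)."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


shapes = anchor_shapes()
for s, r, w, h in shapes[:4]:
    print(f"scale={s} ratio={r} width={w:.2f} height={h:.2f}")
```

Under these assumptions the widest anchor is only three times as wide as it is tall, so very elongated text and maths blocks rely on the bounding-box regression step to reach their final extent.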
The IoU (Intersection over Union) threshold is set to 0.6 and 0.8, the same as in [9]. Given precision P and recall R, the AP (Average Precision) metric is applied to evaluate the performance. AP is the mean precision over 101 recall points, defined as
$$\mathrm{AP} = \frac{1}{101} \sum_{R \in \{0.00, 0.01, \ldots, 0.50, \ldots, 0.99, 1.00\}} \max_{\tilde{R} : \tilde{R} \geq R} P(\tilde{R}).$$
We train the model with 8 GPUs (TITAN XP 12G). Training takes about 5 hours for 36000 iterations. If only 1 GPU is used, training also takes 5 hours for 36000 iterations, but the batch size is 2 (2 images per GPU), whereas the batch size is 16 with 8 GPUs. That is to say, with 8 GPUs the model processes 8 times as much data as with 1 GPU in the same time; in other words, the model is trained faster using 8 GPUs than 1 GPU.
For demonstration, Fig. 4 illustrates the loss curves using 1 GPU. In this figure the losses are normalized. "loss_mask", "loss_classifier", and "loss_box_reg" are defined as (1) to (3), respectively, and "total_loss" is the sum of the three. "total_loss" and "loss_mask" decline dramatically within the first 5000 iterations. Although "loss_box_reg" and "loss_classifier" do not fall as steeply, they decrease steadily throughout training. Overall, our method converges very quickly, within 5000 iterations. As for inference, the model runs at 0.089 s per image. A better speed could be obtained with an optimized implementation.
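The 101-point AP can be computed with a few lines of code. This sketch assumes that the precision and recall values have already been measured along the ranked list of detections, and it uses a non-strict comparison (recall at least as large as the threshold), as is common for interpolated AP; the function names are hypothetical.

```python
def average_precision(recalls, precisions):
    """101-point interpolated AP.
    recalls and precisions are parallel lists, one pair per prefix
    of the detections sorted by decreasing confidence."""
    total = 0.0
    for i in range(101):
        threshold = i / 100
        # Best precision achievable at a recall of at least the threshold.
        reachable = [p for r, p in zip(recalls, precisions) if r >= threshold]
        total += max(reachable, default=0.0)
    return total / 101


def mean_average_precision(ap_per_class):
    """mAP as the plain mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Two detections: the first is correct (recall 0.5, precision 1.0),
# and full recall is reached only at precision 0.5.
ap = average_precision([0.5, 1.0], [1.0, 0.5])
print(round(ap, 4))
```

In this example the 51 thresholds up to 0.5 contribute a precision of 1.0 and the remaining 50 contribute 0.5, so the result is 76/101.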

C. EVALUATIONS USING POD2017
To evaluate our method, we compare it with 8 methods from the ICDAR2017 POD competition (POD2017) [9] and, additionally, with two recent methods: Li et al. [40] (in 1998) and YOLACT++ [3] (in 2020). As shown in Table 2, ... and YOLACT++ [3] are trained with the same configuration as ours.
All 11 methods are evaluated with the mean of AP (mAP) and the AP for each object class. With an IoU threshold of 0.6, our method obtains the best mAP of 0.917, which is slightly higher than that of Li et al. Overall, at an IoU of 0.6 the results of our method are better than those of the others in Table 2, except on Figure. NLPR-PAL, icstpku, Li et al., and HustVision all consider the inherent characteristics of page objects in document images, and therefore perform well. YOLACT++ was not designed for document images. Vislnt, SOS, UITVN, and Matiai-ee are Faster R-CNN based methods or variations thereof, which output bounding boxes for page objects. By contrast, our method, based on Mask R-CNN, considers not only bounding box based detection but also semantic detection that outputs a pixel-wise mask for each object, and therefore performs better on small objects such as Maths.
With an IoU threshold of 0.8, our method still outperforms the other methods on Maths with mAP 0.901, and Li et al. takes second place (mAP 0.863). Averaging the mAP over the IoU thresholds of 0.6 and 0.8, our method obtains 0.897, which is slightly better than the average mAP of 0.8935 of Li et al. We therefore conclude that our method ranks first according to the evaluation in Table 2.
To compare a network designed for document images with a network designed for natural scene images, Fig. 5 shows the mAP curves of our method and YOLACT++ during training, with IoU of 0.6 and 0.8. In Fig. 5(a), the mAP of our method gets a better start, exceeding the mAP of YOLACT++ at 5000 iterations. After 10000 iterations, the two curves remain parallel. Fig. 5(b) shows a similar trend. A network designed for document images can represent the inherent characteristics of the dataset; hence, it results in better performance in Table 2.

D. PIXEL-WISE DETECTION BETTER THAN REGION DETECTION
To further investigate the detection accuracy of our method, datasets with more complex layout structures are used to visualize detection results. As can be seen in Fig. 6, Fig. 6 (a) and (c) have overlapping areas between text and figure blocks when bounding box recognition results are used. The pixel-wise mask predictions, marked with colored pixels, are generally better because FCN is applied within each proposed candidate region. Within tables or figures, text lines may appear as table cells or illustrative text. These text blocks are not misclassified in Fig. 6 (b).
High-confidence text proposals within tables or figures are not misclassified. Small regions are still challenging, however: the equation number following maths can be missed, as in Fig. 6 (d). This method is capable of handling page objects of various shapes in documents with complex layouts. Unlike the full-page FCN method, its FCN is carried out only within each region candidate instead of the whole page, since pixel-level segmentation has a higher computational cost. It is also unnecessary to apply extra post-processing to clean the segmented masks. The bounding boxes and the mask predictions for page object blocks are produced by two branches at the same time.

V. CONCLUSION
In this study, to detect hierarchical page objects in document images, a Mask R-CNN based network was proposed to output end-to-end results, including object classification, bounding box identification, and page object mask generation. Among the various granularities, block-level region recognition was considered. LaTeX-based synthetic generation was designed to enlarge the training dataset. The proposed method achieved an mAP of 0.917 on the POD2017 dataset, which was better than the existing ICDAR page object detection competition methods.
CANHUI XU received the Ph.D. degree from Central South University, in 2011. She has been a Visiting Scholar with Arizona State University, USA, from 2019 to 2020. She is currently working with the Qingdao University of Science and Technology. Her research interests include document image processing and deep learning.
CAO SHI received the Ph.D. degree from Central South University, in 2011. He is currently working with the Qingdao University of Science and Technology. His research interests include image and video processing and artificial intelligence.
HENGYUE BI is currently pursuing the master's degree majoring in computer science and technology with the Qingdao University of Science and Technology. His research interests include image processing and deep learning.
CHUANQI LIU received the bachelor's degree from the Qingdao University of Science and Technology, in 2021.
YONGFENG YUAN received the bachelor's, master's, and Ph.D. degrees from the Harbin Institute of Technology, from 1998 to 2010. He is currently working with the Harbin Institute of Technology, as an Associate Professor. His research interests include image processing, computer vision, and computational biology.
HAOYAN GUO received the Ph.D. degree from the Harbin Institute of Technology, in 2016. She is currently working with the Harbin Institute of Technology. Her research interests focus on data processing, including data analysis, multidimensional data processing, and software development.
YINONG CHEN received the Ph.D. degree from the Karlsruhe Institute of Technology (KIT), University of Karlsruhe, Germany, in 1993. He is currently working with Arizona State University. His research interests include service-oriented computing, visual programming, robotics, and artificial intelligence. VOLUME 9, 2021