FCENet: An Instance Segmentation Model for Extracting Figures and Captions From Material Documents

A central idea of the Materials Genome Project is to apply data and artificial intelligence to accelerate material innovation. However, a lack of data hinders the development of novel materials. The figures and captions in the materials literature carry essential information about each document and provide abundant image samples for research. Accordingly, extracting figures and captions from the literature is critical to alleviating the data shortage. Although some PDF parsing tools can extract information from documents, they generally identify a document's figures by parsing the document into a concrete structure; because the layouts of different journals are inconsistent, they often produce wrong recognition results. Therefore, this study proposes an efficient figure and caption extraction network, FCENet. Unlike existing extraction tools, this study is the first attempt to adopt an instance segmentation model to detect figures and their captions and then extract them. FCENet builds upon BlendMask and introduces horizontal and vertical attention modules. The BlendMask detection head is split into two branches, i.e., figure detection and caption detection, which increases the final detection accuracy and speed. Nearly 3000 material documents are collected for model training and testing. Experimental results show that FCENet significantly outperforms other existing instance segmentation models: its box and mask mAP (mean Average Precision) are 8.51% and 12.59% higher than those of BlendMask, respectively. It is hoped that considerable material image data can be acquired via FCENet, providing sufficient image data for machine learning and data mining in the materials area.


I. INTRODUCTION
Scientific research literature represents the most cutting-edge research results across a range of research areas. Its research directions largely determine each area's development and boost progress in science and technology. When these research results flow from science and technology to industry, they can drive the progress of human society. The materials area is one of the essential areas at present, and its development affects many other areas; a wide variety of areas require materials in practice. The development and application of several critical materials lay the cornerstones of social progress.
Research and development of novel materials originate from considerable test data, and the process can be accelerated by analyzing and handling such data. However, this also leads to two difficulties currently facing the research and development of novel materials: (1) Lack of data. For some material data, large amounts of experimentally obtained data cannot be acquired because of the enormous expense in test costs and time. (2) Influence of subjective human factors. Even when considerable material data exist, some test links involve manual participation and are interfered with by subjective human factors, reducing the accuracy of test results to a certain extent.
In the current data-driven era, the computer area has become ever more important. As artificial intelligence is developed and applied continuously, it has gradually expanded into more areas. Many areas exploit artificial intelligence to allow computers to process data (e.g., biomedicine [1]-[3], agriculture [4], [5], remote sensing [6], materials [7], and physics [8]) and have achieved effective results. This avoids the interference of human factors while lowering time costs.
To address the lack of data, researchers have begun to explore how to extract figures and captions from the materials literature. After all, the figures and captions of scientific documents carry essential information about the entire document; collecting and fully exploiting such information can enrich experimental data and boost the development of novel materials. However, owing to the complexity and diversity of journal layouts and the randomness of form, shape, and scale, figures and captions are difficult to extract directly from documents.
In the present study, a novel and more versatile method is proposed to extract figures and captions from material documents. Unlike other methods, existing instance segmentation models exhibit high performance on datasets such as COCO, but good performance on document images is not ensured. This study is the first to apply instance segmentation to the extraction of figures and captions from document images. Given the particularity of document images, the proposed instance segmentation model FCENet (Figure Caption Extract Net) achieves better results than other models. The work here is summarized below: 1) This study collects over 3,000 material documents, converts them into an image format page by page, and subsequently selects 10,000 from nearly 20,000 images to build an instance segmentation dataset.

The authors of [13], in the area of high-energy physics, proposed PDFPlotExtractor, and Clark et al. [14] developed PDFfigures2 in the area of computer science. These tools exploit clustering and classification to separate images. In the biomedical area, Li et al. [15] presented PDFigCapX, which first separates text and image regions and conducts connected-component analysis on the image regions; after the image content is identified, the layout information of the PDF is used to obtain the final figures and captions. These methods achieve effective results in their respective areas. However, owing to the diversity of structured documents across areas, the results are generally ineffective elsewhere. Especially when extracting figures and captions from journals with considerable differences in content and layout, the error becomes more prominent, and most methods do not associate figures with their respective captions. Accordingly, numerous shortcomings remain in the extraction of figures and their captions from documents.

B. ATTENTION MECHANISM
In brief, an attention mechanism directs the neural network's attention to specific information in the image. SENet [16] (Squeeze-and-Excitation Network) applied the attention mechanism in the channel dimension: it enhances important features and suppresses unimportant ones by multiplicative weighting. CBAM [17] (Convolutional Block Attention Module) proposed channel attention and spatial attention to weight features, introducing the attention mechanism into both the channel and spatial dimensions. DANet [18] (Dual Attention Network) also introduced channel and spatial attention mechanisms; unlike CBAM, which weights the entire feature map, DANet weights each pixel of the feature map, but it adds many parameters and thus sacrifices speed while improving performance. HMANet [19] (Hybrid Multiple Attention Network) introduced a category attention mechanism to weight the category of each pixel and achieve more accurate classification results. The box attention of BlendMask [20] cleverly extends channel-level weighting to the position level; this design makes BlendMask surpass MaskRCNN [21] in both speed and accuracy, and it has become a novel benchmark in instance segmentation. The horizontal and vertical attention modules proposed in the present study are more suitable for processing document images and are investigated in later sections.
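The channel-attention idea behind SENet can be sketched in a few lines of NumPy. This is a minimal illustration of squeeze-excitation weighting under assumed shapes and randomly initialized weights, not the published implementation:

```python
import numpy as np

def se_channel_attention(x, w1, w2):
    """SENet-style channel attention (a minimal sketch).

    x: feature map of shape (C, H, W); w1, w2: weights of the two
    fully connected layers in the squeeze-excitation bottleneck.
    """
    # Squeeze: global average pooling over spatial dims -> (C,)
    z = x.mean(axis=(1, 2))
    # Excitation: FC -> ReLU -> FC -> sigmoid gives per-channel weights
    h = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h)))
    # Multiplicative weighting: rescale each channel of the feature map
    return x * s[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 8, 4, 4
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 4, C))   # reduction ratio 4 (an assumption)
w2 = rng.standard_normal((C, C // 4))
y = se_channel_attention(x, w1, w2)
```

Each output channel is the input channel scaled by a learned weight in (0, 1), which is how important features are enhanced and unimportant ones suppressed.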

C. MULTI-SCALE TARGET ALLOCATION
Because targets in images appear at different scales, FPN [22] (Feature Pyramid Network) was introduced to address the multi-scale problem effectively: small targets correspond to large-scale feature maps, and large targets correspond to small-scale feature maps. The ROIPooling of FasterRCNN [23] maps all target regions to feature maps of a single scale, whereas CenterMask [24] and the ROIAlign of MaskRCNN employ a formula to directly calculate the feature map corresponding to each target region. BlendMask's detection network uses FCOS [25] (Fully Convolutional One-Stage object detector).
Since FCOS is an anchor-free detection framework, it differs from the aforementioned anchor-based frameworks: it limits the target's height and width to solve the multi-scale target allocation problem, so each feature map is responsible for a range of scales. However, this method does not suit targets with a massive difference between height and width. For instance, it will assign a long caption to a coarse, small-scale feature map, although the caption itself is a small-scale target, causing missed detections. Thus, FCENet adopts a distribution strategy more suitable for document images.
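The contrast between the two allocation strategies can be sketched as follows. The level ranges below are illustrative assumptions (the paper does not publish its thresholds); the point is that scale-based assignment sends a long, thin caption to a coarse level, while ratio-based assignment keeps it on a fine level:

```python
# Hypothetical per-level ranges. FCOS-style assignment uses the target's
# maximum side (a proxy for max(l, r, t, b)); the aspect-ratio strategy
# uses long side / short side. Level 0 is the finest feature map.
FCOS_RANGES = [(0, 64), (64, 128), (128, 256), (256, 512), (512, 1e8)]
RATIO_RANGES = [(8, 1e8), (4, 8), (2, 4), (1, 2)]

def assign_by_scale(box):
    x0, y0, x1, y1 = box
    m = max(x1 - x0, y1 - y0)
    return next(i for i, (lo, hi) in enumerate(FCOS_RANGES) if lo <= m < hi)

def assign_by_aspect_ratio(box):
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    r = max(w, h) / min(w, h)
    return next(i for i, (lo, hi) in enumerate(RATIO_RANGES) if lo <= r < hi)

caption = (0, 0, 400, 20)   # long, thin caption: ratio 20, max side 400
figure = (0, 0, 300, 280)   # roughly square figure: ratio ~1.07
```

Under scale-based assignment the caption lands on the coarse level 3 because of its length; under ratio-based assignment it lands on the fine level 0, matching its small height.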

D. INSTANCE SEGMENTATION
Instance segmentation, which combines target detection and semantic segmentation, is a challenging task. There have long been two directions: bottom-up methods based on semantic segmentation and top-down methods based on detection. Detection-based methods still dominate; the most classic is MaskRCNN, which adds a semantic segmentation branch to the object detection framework FasterRCNN to segment each instance. The subsequent MaskScoringRCNN [26], based on MaskRCNN, introduced a mask-quality scoring unit to improve segmentation accuracy. However, instance segmentation models based on two-stage detection frameworks are significantly time-consuming; with the introduction of detection frameworks based on the anchor-free idea, this situation has been dramatically improved. CenterMask segments each instance by combining position attention. PolarMask [27] abandoned detection boxes and the Cartesian coordinate system, instead regressing the contour of each instance in a polar coordinate system. DeepSnake [28] starts from detection boxes and uses a circular convolution structure to regress the instance contour and obtain each instance mask. Unlike these two directions, Yolact [29] (You Only Look At CoefficienTs) proposed a novel idea: process the detection task and the semantic segmentation task in parallel, and then linearly combine high-level semantic information with underlying spatial information to generate the final mask. The subsequent Yolact++ [30] also introduced a mask scoring branch to improve segmentation accuracy. BlendMask introduces an attention mechanism to better integrate the semantic information of the detection branch with the spatial information of the semantic segmentation branch, and it surpasses MaskRCNN in accuracy and speed. Accordingly, the more robust BlendMask is taken here as the basic structure of FCENet.
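The linear-combination idea from Yolact can be sketched in NumPy: spatial prototype masks from the segmentation branch are blended with per-instance coefficients from the detection branch. Shapes and values here are assumptions for illustration, not the library's API:

```python
import numpy as np

def linear_combine_masks(prototypes, coeffs):
    """Yolact-style mask assembly (a minimal sketch under assumed shapes).

    prototypes: (K, H, W) spatial prototype masks from the segmentation
    branch; coeffs: (N, K) per-instance coefficients from the detection
    branch. Returns (N, H, W) instance masks after a sigmoid.
    """
    # Each instance mask is a coefficient-weighted sum of the prototypes
    m = np.tensordot(coeffs, prototypes, axes=([1], [0]))
    return 1.0 / (1.0 + np.exp(-m))

rng = np.random.default_rng(1)
protos = rng.standard_normal((4, 8, 8))   # K = 4 prototypes
coeffs = rng.standard_normal((2, 4))      # N = 2 detected instances
masks = linear_combine_masks(protos, coeffs)
```

This parallel design is what makes one-stage approaches like Yolact and BlendMask fast: the expensive per-pixel work is shared across all instances, and only the cheap blending is per-instance.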

III. FCENet (FIGURE CAPTION EXTRACT NET)
Though BlendMask has achieved effective results on the COCO dataset, it remains insufficient for document images; after all, document images differ from natural scene images. FCENet therefore makes several changes to BlendMask for document images. The present section elucidates FCENet, whose overall structure is illustrated in Fig. 1.
The attention module refers to a vital part of the model FCENet in this study. It weights feature maps to improve the network's overall performance. The detail of this module will be discussed in Subsection B.
The detection network of FCENet is similar to that of FCOS. The difference is that FCENet divides the detection network into two sub-networks, i.e., the figure detection network and the caption detection network. Considering the diversity of the document's figures and captions, the figure detection network takes P4, P5, and P6 as the input, and the caption detection network takes P2 and P3 as the input. The aspect ratio supersedes the multi-scale target allocation strategy of FCOS: it limits which targets are detected on the feature map of each scale, so that the feature map of each scale represents a limited range of aspect ratios (l + r)/(t + b), where t, b, l, and r respectively denote the target point's distance to the upper, lower, left, and right borders. In the structure diagrams, f_{1×1} represents a 1×1 convolution, f²_{5×5} represents two 5×5 convolutions, f_{7×1} represents a 7×1 convolution, f_{1×7} represents a 1×7 convolution, f_HAM denotes the horizontal attention module, f_VAM denotes the vertical attention module, and f_CONCAT denotes concatenation.

1) HORIZONTAL ATTENTION MODULE
The horizontal attention module is illustrated in Fig. 3.
It processes the horizontal feature map H = {h_j | j = 1, 2, ..., C} ∈ R^{C×H×W} at the channel and position levels, respectively, obtaining H^c = {h^c_j | j = 1, 2, ..., C} and H^s = {h^s_j | j = 1, 2, ..., C}; these are added to H to obtain the final horizontal attention-weighted feature map H̃ ∈ R^{C×H×W}. The overall operation is defined below, where j indexes the channels of the feature map, j = 1, 2, ..., C:
where α and β denote two learnable parameters that determine the weights of the channel and position attention, ⊙ denotes element-wise multiplication, and ⊗ denotes matrix multiplication. For the position attention map S ∈ R^{1×H×W}, GAP and GMP represent global average pooling and global maximum pooling, respectively, and fc represents a fully connected layer. The channel attention map A = (a_jl) ∈ R^{C×C} is obtained by multiplying the reshaped feature map H_reshape ∈ R^{C×(H×W)} by H^T_reshape ∈ R^{(H×W)×C} and applying the softmax function; a_jl represents the effect of the l-th horizontal feature channel on the j-th horizontal feature channel, and the resulting per-channel weights form A' = {a_j ∈ R^{C×1×1} | j = 1, 2, ..., C}.
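The channel-affinity computation described here (reshape, matrix product with the transpose, softmax) can be sketched in NumPy. This is an illustration of the DANet-style formulation the module follows, under assumed shapes, not the authors' released code:

```python
import numpy as np

def channel_attention_map(x):
    """Channel affinity map (a sketch of the reshape -> matmul -> softmax step).

    x: feature map of shape (C, H, W).
    Returns A of shape (C, C) with softmax-normalized rows, where A[j, l]
    weights the effect of feature channel l on feature channel j.
    """
    C = x.shape[0]
    x_r = x.reshape(C, -1)                 # H_reshape in R^{C x (H*W)}
    e = x_r @ x_r.T                        # raw channel affinities, (C, C)
    e = e - e.max(axis=1, keepdims=True)   # subtract row max for stability
    a = np.exp(e)
    return a / a.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
A = channel_attention_map(rng.standard_normal((6, 5, 5)))
```

Each row of A is a probability distribution over channels, so weighting the reshaped feature map by A redistributes information among channels rather than changing its overall magnitude.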

2) VERTICAL ATTENTION MODULE
The overall structure of the vertical attention module is illustrated in Fig. 4. Its basic structure follows that of the horizontal attention module, and the final feature map Ṽ ∈ R^{C×H×W} is weighted by vertical attention. The overall operation is defined below:

v^s_j = α × S ⊙ v_j (10)

v^c_j = β × Σ_l (y_jl · v_l) (11)

where α and β again denote two learnable parameters and the position attention map is S ∈ R^{1×H×W}. The channel attention map Y = (y_jl) ∈ R^{C×C} is obtained by multiplying the reshaped feature map V_reshape ∈ R^{C×(H×W)} by V^T_reshape ∈ R^{(H×W)×C} and applying the softmax function; y_jl represents the effect of the l-th vertical feature channel on the j-th vertical feature channel, and the resulting per-channel weights form Y' = {y_j | j = 1, 2, ..., C}.

IV. EXPERIMENTS AND RESULTS
Dataset and assessment indicators. In this study, 3000 material documents are collected to generate an instance segmentation dataset in the COCO dataset format. It consists of 10,000 images: 6000 in the training set, 2000 in the validation set, and 2000 in the test set. Among them, 1000 images come from documents in other areas (e.g., biomedicine, agriculture, and remote sensing). There are three instance categories, excluding the background category. The assessment indicators of this experiment are identical to those of the COCO dataset: the average precision (AP) of the box and mask, AP at IOU 0.5 (AP50), and AP at IOU 0.85 (AP85), together with four more indicators (Precision, Recall, F1-score, and Accuracy). Training details. FCENet uses ResNet-50 and FPN as the backbone network. The number of channels in the bottom module is 128, and the number of channels in the detection network is 64. The model is trained on an RTX 2060 GPU for 180K iterations with a batch size of 1; the maximum learning rate is 3e-5, and the learning rate at the end reaches 1e-6. Before iteration 21.6K, the learning rate rises linearly, and then it decreases slowly following a cosine function. The learning-rate curve is illustrated in Fig. 5 (left). Input images are resized to a maximum side length of 832 and a minimum side length of 672. All other hyperparameters are set identically to those of BlendMask. The loss curve is illustrated in Fig. 5 (right).
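The schedule described above can be sketched as linear warmup to the peak learning rate over the first 21.6K iterations, then cosine decay to the final learning rate by iteration 180K. The exact warmup and decay formulas are assumptions; the paper gives only the endpoints:

```python
import math

MAX_ITER, WARMUP_ITER = 180_000, 21_600
LR_PEAK, LR_END = 3e-5, 1e-6

def lr_at(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay."""
    if it < WARMUP_ITER:
        return LR_PEAK * it / WARMUP_ITER
    t = (it - WARMUP_ITER) / (MAX_ITER - WARMUP_ITER)  # decay progress in [0, 1]
    return LR_END + 0.5 * (LR_PEAK - LR_END) * (1.0 + math.cos(math.pi * t))
```

The schedule hits 3e-5 exactly at iteration 21.6K and 1e-6 exactly at iteration 180K, matching the stated endpoints.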
Verification details. The unit of inference time is seconds ('s') in all tables. The present study verifies the performance of FCENet through a series of ablation experiments.

A. ABLATION EXPERIMENTS
In this study, ablation experiments are performed to verify FCENet's effectiveness with respect to the attention module, the aspect ratio allocation strategy, and the detection network modules.

1) DISTRIBUTION STRATEGY
FCENet vs. BlendMask. The allocation strategy used by BlendMask is the same as that of FCOS: it limits a scope box on the feature map of each scale to solve the allocation problem of multi-scale targets. In the present study, FCENet instead limits the range of target aspect ratios detected on each scale's feature map and splits the detection network into a figure detection network and a caption detection network. The methods of BlendMask and FCOS primarily aim at detecting conventional objects in natural scene images; in contrast, this study's method aims at document images and targets with noticeable differences between length and width.
The results of the mentioned methods are listed in Table 1. The method in the present study performs better on document images. Because the scope box limits only the long side of the target, it assigns targets with noticeable differences between length and width to feature maps that do not match their scale. In contrast, this study uses the aspect ratio to solve this problem and transfer such targets to feature maps of the appropriate scale, thereby avoiding missed detection. The histogram of the aspect ratio distribution is illustrated in Fig. 6.

2) DETECTION TOWER
BlendMask uses a single shared detection tower that takes the feature maps of all scales as the input and shares parameters among them. FCENet divides the detection tower into a figure detection tower and a caption detection tower. The figure detection tower uses the feature maps (P4-P6) of the FPN network in Fig. 1 as the input, and the caption detection tower uses the feature maps (P2-P3) as the input.
The results of the mentioned methods are listed in Table 2. FCENet's figure and caption detection towers perform better because they do not share parameters. Compared with the shared detection tower used by BlendMask, although the number of parameters increases, the method in this study is more flexible for document images and targets with large scale differences.

3) THE NUMBER OF CHANNELS IN THE DETECTION TOWER
The addition of the attention module increases the time cost. The present study reduces this cost by compressing the number of channels in the detection tower. To find a balance between time and performance, different channel numbers are measured: 64, 128, and 256 channels are tried on FCENet and compared with BlendMask. The results are listed in Table 3. The performance of FCENet decreases as the number of channels increases, and the time cost also rises considerably. Even FCENet with 64 channels is much better than BlendMask in performance. Considering performance and time cost comprehensively, unless otherwise specified, the number of detection tower channels used in the experimental model is 64.

4) ATTENTION MODULE
FCENet's horizontal and vertical attention modules enhance horizontal and vertical features. Horizontal attention alone, vertical attention alone, and the combination of both are compared to verify the modules' effectiveness. The results are listed in Table 4: both horizontal attention and vertical attention dramatically improve the model's performance. Owing to the neat layout of document images, the horizontal and vertical attention modules enhance the model's performance by weighting the horizontal and vertical features, so FCENet exhibits high performance on document images.

5) TOP ATTENTION
In the top module of FCENet, to better distinguish between figures and captions, this study expands the box attention of BlendMask into two, i.e., fig-box attention and caption-box attention. The results compared with single top attention are listed in Table 5; the two-box attention method is better.

6) FEATURE FUSION
This study also compares ways of fusing the base feature map with the feature maps weighted by horizontal and vertical attention. The present study tried element-wise addition and concat. The results are listed in Table 6: the concat method outperforms the element-wise addition method. It more effectively integrates the horizontal feature map, the vertical feature map, and the base feature map, retaining the features after attention weighting while ensuring the model's performance. In contrast, the element-wise addition method may cause overlapping effects and lose some features, which adversely affects model performance.
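The difference between the two fusion methods can be sketched on assumed shapes: the base feature map and the two attention-weighted maps, each (C, H, W). Concat preserves all three maps along the channel axis (a following 1×1 convolution, omitted here, would restore C channels), while element-wise addition collapses them into one map and can let features overlap:

```python
import numpy as np

def fuse_concat(base, horiz, vert):
    """Concat fusion: stack along channels, keeping every feature intact."""
    return np.concatenate([base, horiz, vert], axis=0)   # (3C, H, W)

def fuse_add(base, horiz, vert):
    """Element-wise addition fusion: features can overlap and cancel."""
    return base + horiz + vert                           # (C, H, W)

rng = np.random.default_rng(3)
b, h, v = (rng.standard_normal((8, 16, 16)) for _ in range(3))
```

After concat, the original base features are still recoverable from the first C channels; after addition they are mixed irreversibly with the attention-weighted maps, which matches the overlap effect described above.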

B. MAIN RESULTS
On the validation set, FCENet is compared with BlendMask, MaskRCNN, and Yolact. As listed in Table 7, the all-around performance of FCENet surpasses that of BlendMask, MaskRCNN, and Yolact. On an RTX 2060 GPU, the running speed of FCENet reaches 0.43 s/image, that of BlendMask is 0.54 s/image, that of MaskRCNN is more than 1 s/image, and that of Yolact is 0.35 s/image.
On the test set, this study compares the results of FCENet with BlendMask, MaskRCNN, and Yolact. The visualized masks are illustrated in Fig. 9. FCENet produces higher-quality masks than BlendMask, MaskRCNN, and Yolact, some of which omit targets (e.g., captions). Compared with the other models, FCENet is more suitable for processing document images.
After the final masks and boxes are generated, the figures and captions of the documents can be extracted. The overall process is presented in Fig. 7. The extracted results are illustrated in Fig. 8, where Fig. 8(a) is the document's original image and Fig. 8(b) shows the results extracted from the document.
Moreover, to verify the versatility of FCENet, this study collects documents from other areas to test it. The results are illustrated in Fig. 11. FCENet is not merely prominent on material documents; it also performs well in other areas.

C. DISCUSSIONS
Compared with BlendMask, FCENet adds horizontal and vertical attention modules to enhance the horizontal and vertical features, which allows it to process document images more effectively. It adopts the aspect ratio to limit the target's distribution across multi-scale feature maps, avoiding the missed detections caused by scale mismatch. Moreover, FCENet splits the detection tower into a figure detection tower and a caption detection tower, which is more flexible for differentiated targets (e.g., figures and captions). These methods together enhance the performance of FCENet.
Another advantage of FCENet is that it generates higher-quality masks and boxes for document images because the model is designed specifically for them. Accordingly, with high-quality masks and boxes, the figures and captions of material documents can be better extracted. In comparison with other methods, FCENet ensures quality, speed, and versatility; even on low-resolution images, its mAP drops by only 1.34%.
Lastly, this study visualizes the comparison results of FCENet (top) and BlendMask (bottom) on the test set (Fig. 10); two sets of images are listed in rows. Moreover, the figure and caption extraction results are visualized (Fig. 12), where the left side is the original image and the right side is the extracted result. The present study's model can also segment a figure's sub-figures, which other extraction methods currently cannot do.

V. CONCLUSION
Given the lack of image data in some material areas, this study collects image data from material documents. Instance segmentation is adopted here to extract the figures and captions of material documents for the first time. First, this study develops an instance segmentation model, FCENet, based on BlendMask and suitable for processing document images. Second, this study collects and produces an instance segmentation dataset in COCO format and selects box and mask mAP, consistent with the COCO dataset, as assessment indicators. Besides, this study verifies the model's performance through ablation experiments. Compared with BlendMask: (1) FCENet adds horizontal and vertical attention modules that enhance the features in the horizontal and vertical directions; its box and mask mAP increase by 7.69% and 9.95%, respectively. (2) FCENet adopts an aspect ratio distribution strategy that avoids the missed detection of targets with significant differences between length and width (e.g., captions); its box and mask mAP increase by 1.63% and 3.84%, respectively.
(3) FCENet splits the shared detection tower into a figure detection tower and a caption detection tower, and splits the top box attention into fig-box attention and caption-box attention; this is more flexible for recognizing targets with noticeable differences (e.g., figures and captions), and its box and mask mAP increase by more than 8%. In addition, FCENet is compared with other instance segmentation models: it outperforms BlendMask and MaskRCNN in accuracy and speed, and it also exceeds Yolact in accuracy. The box and mask mAP of FCENet are 8.51% and 12.59% higher than those of BlendMask, respectively, and its speed is 0.11 s faster; compared with MaskRCNN and Yolact, the mAP increases by over 10%. Experimental results show that the mask mAP of FCENet reaches 76.39%, performing better on document images. This instance segmentation framework offers a novel way to collect material image data.
YINGLI LIU received the Ph.D. degree in materials science from the Kunming University of Science and Technology, in 2017. Since 2005, she has been a Teacher with the Computer Science Department of Kunming University of Science and Technology. Her research areas include machine learning, natural language processing, and materials genome projects.
CHANGKAI SI is currently pursuing the master's degree with the Kunming University of Science and Technology. His research areas include material image processing, semantic segmentation, instance segmentation, and materials genome projects.
KAI JIN is currently pursuing the master's degree with the Kunming University of Science and Technology. His research areas include material image processing, semantic segmentation, and materials genome projects.
TAO SHEN (Member, IEEE) received the Ph.D. degree from the Illinois Institute of Technology, in 2013. He is currently a Professor with the College of Information Engineering and Automation, Kunming University of Science and Technology. His research interests include blockchain technology, smart contracts, and the Energy of Things (IoT).
MENG HU is currently pursuing the master's degree with the Kunming University of Science and Technology. His research areas include blockchain, peer-to-peer transactions, and the power Internet of Things.