Vietnamese Document Analysis: Dataset, Method and Benchmark Suite

Document image understanding is increasingly important as the number of digital documents grows day by day and the demand for automation rises. Object detection plays a significant role in locating vital objects and layouts in document images and contributes to a clearer understanding of the documents. Nonetheless, previous research mainly focuses on English document images, and studies on Vietnamese document images are limited. In this study, we extensively benchmark state-of-the-art object detectors and analyze the performance of each method on Vietnamese document images. Moreover, we also investigate the effectiveness of four different loss functions on the experimental object detection methods. Extensive experiments on the UIT-DODV dataset are conducted to provide insightful discussions.


I. INTRODUCTION
Understanding the content of documents is a key task in the Fourth Industrial Revolution (4IR) [1]. Document Image Understanding (DIU) is an automatic process that extracts useful information from the image of a document page. DIU combines image analysis techniques and pattern recognition to process and extract information from image documents; it refers to the logical and semantic analysis of image documents to extract information that humans can understand and encode it into a machine-readable form. The DIU problem can be divided into several subproblems (e.g., Table Detection [2], Document Image Classification [3], etc.), with each problem arising from the result of the prior one. Two common and significant stages of these subproblems are segmentation, i.e., defining feature regions (also known as page physical structure analysis), and labeling, i.e., assigning labels to the defined regions (also known as page logical structure analysis) [4]. Once solved, these two stages are extremely meaningful and serve as baselines for other complex problems, such as document forgery detection [5], document image retrieval [6], and visual document question answering [7]. However, the Document Image Understanding field still faces many major challenges and receives attention from the document recognition, analysis, and information & database communities.
In this study, we focus on the object detection problem in document images using the Vietnamese UIT-DODV dataset [8]. The dataset includes 2,394 scanned images of Vietnamese documents with four object classes: Table, Figure, Caption, and Formula. By observation and analysis, we recognize that the challenges of detecting each object type in a document come not only from external factors but also from internal factors of the documents.
• External factors result from the quality of the images, such as blurred images, obscured objects, low resolution, and distorted objects. Moreover, the difference in quality between scanned images and PDF images is very large.
• In addition to external factors, the problem faces challenges from within, such as page layout variation, uneven object distribution, elongated spacing between objects, and diversity in object morphology, such as bordered and borderless categories. Moreover, unlike English documents, extracting objects from Vietnamese document images faces significant difficulties due to the particular expressions of the text language. Most obviously, caption objects are introduced by a variety of Vietnamese terms meaning "Caption". Separately, the Formula object class, in addition to the usual mathematical formulas containing equations and math symbols, can also be represented as plain text (not belonging to a math area), which is a further major challenge for the problem.
In this study, we focus on Vietnamese document images. There are several different characteristics between Vietnamese documents and English documents. First, although English and Vietnamese both use Roman characters, the Vietnamese language additionally uses diacritics, displayed with UTF-8 characters. This observation means that Vietnamese documents use many more characters than English documents, and this difference can make models trained on English documents work poorly on Vietnamese. The reason can be technically explained by the fact that CNN-based backbones trained on English documents cannot produce feature maps that describe the diacritics' information in images, leading to poor performance in caption detection. Moreover, caption objects are often placed near tables or figures; this may affect the pattern recognition characteristics of deep CNN models, and the detection performance on tables and figures may also suffer. We also conduct experiments to confirm this hypothesis. Second, the Vietnamese document layout is also different. While English documents often use a borderless style with large tables (Figure 3a), Vietnamese documents have smaller bordered tables (Figure 3b). In addition, English documents place the table either at the top or at the bottom of the document, whereas small tables can be found in arbitrary positions in a Vietnamese document. The positions of captions are also worth discussing; they are sometimes placed next to figures or tables (Figure 4a) instead of above or below them (Figure 4b). Therefore, there is a legitimate need to explore and develop a specific deep learning-based object detection model for page object detection in Vietnamese document images.
Our prior work on object detection in Vietnamese document images was published in CAIP 2021 [8]. In this journal version, we further extend the conference version and would like to highlight the novelty and contributions of our paper. We expand the experiments to nine object detectors published in the last three years. We review all available loss functions in the MMDetection toolbox on these object detectors and propose a combined loss function for improvement. SABL-Cascade achieves the highest results in our experiments; therefore, we extend the investigation by replacing the default RoIAlign with PrRoI Pooling in the RoI pooling module on SABL-Faster. The state-of-the-art performance demonstrates the efficiency of this change. Our contributions are summarized as follows.
• To the best of our knowledge, we are among the first to conduct research on Vietnamese document image understanding.
• We conduct an extensive benchmark on the UIT-DODV dataset, which is the first Vietnamese document image dataset. Specifically, recent advanced models, namely AutoAssign [9], ATSS [10], Double Head [11], GRoIE [12], SABL-Cascade [13], Faster R-CNN [14], Generalized Attention [15], Libra R-CNN [16], Weight Standard [17], and CARAFE [18], are investigated in this article. In Figure 1, we briefly compare the object detection methods on the UIT-DODV dataset regarding AP score and the number of parameters. Finally, we conduct experiments using four different loss functions for the classification task: cross-entropy loss, focal loss [19], fused loss [8] and GHM loss [20].

II. RELATED LITERATURE

A. EXISTING DATASETS
Detecting objects in image documents is one of the problems that has received the research community's interest in document layout analysis and document image understanding.
There are many related studies as well as benchmarks for this problem that have been published worldwide. Details of these datasets are described in Table 1. The POD contest dataset [22] includes 2,000 images of document pages selected from 1,500 scientific articles via CiteSeer. The dataset represents diverse formats for both page layout and object types, including single-column, two-column and multi-column pages and different types of formulas, tables, graphics, and figures.
The TableBank dataset [23] includes more than 278,000 images with more than 47,000 table objects. A total of 200,000 images come from LaTeX-edited scientific articles collected from the arXiv.org site.
PubLayNet [24] is the largest document image dataset to date, including 358,353 images from research documents and scientific articles in medical fields with five object classes. The objects cover the important elements of document layout: title, text, figure, table and list. PubLayNet was used in the document layout recognition and detection tasks of the ICDAR 2021 competition.
In the ICDAR 2019 competition, cTDaR 2019 [25] is the dataset used, with two new editions covering modern printed materials and archives. This is the first dataset that contains historical documents with handwritten and printed tables. The number of images in the cTDaR dataset depends on the tracks of the competition; the maximum is 799 and 840 images for the historical and modern datasets, respectively.
DocBank [26] is an extended version of the TableBank dataset that covers linguistic units and additional semantic structures for document layout analysis. The following semantic structures are annotated in DocBank: Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table and Title. UIT-DODV [8] is the first Vietnamese document dataset, with 2,394 document images and 4 object classes: Table, Figure, Caption, and Formula.

B. RELEVANT WORKS
In 2018, Kerwat et al. [28] used SSD [29] for object detection tasks in document images on the ICDAR 2013 dataset. YOLOv3 [30] is a well-known algorithm for real-time object detection, which Huang et al. [31] used for a table detection task in 2019. Later, Ren et al. [32] combined context information for document layout detection to improve region detection performance. The experimental results show that the proposed method achieves 23.9% better mAP and 14 times faster processing speed than the text-based technique. Sun et al. [33] proposed the combination of the Faster R-CNN method and corner locating for table detection. The proposed method includes two stages: 1) the table detection results and the corner coordinates of the original object are predicted by Faster R-CNN; and 2) a coordinate matching algorithm groups the corner coordinates that belong to the same table object. The result achieves an F1 score of 94.9% on the ICDAR 2017 dataset, 2.8% higher than the traditional Faster R-CNN. Siddiqui et al. [34] introduced a method based on the encoder-decoder structure to convert PDF documents to HTML structures; TED measurements are also suggested to evaluate effectiveness. Zheng et al. [36] introduced an end-to-end framework that not only detects but also recognizes table structures in document images. The proposed Global Table Extractor (GTE) module can be placed on top of object detection methods. The study also introduces the FinTabNet dataset in the financial field. Agarwal et al. [37] presented the composite deformable cascade network (CDeC-Net) method to solve table detection in document images. This study is based on a novel cascade Mask R-CNN [38] and a dual backbone architecture [39].
Many studies have applied and improved common object detection methods such as SSD, YOLOv3, Faster R-CNN, and Mask R-CNN in the document image field. The evaluation and analysis of recent methods on the Vietnamese document image dataset UIT-DODV promises to provide much helpful information as a foundation for future extensive studies.

III. OBJECT DETECTION MODELS
Page object detection can be regarded as an object detection task in document images based on popular object detection algorithms. Therefore, in this section, we review the state-of-the-art object detection methods that are leveraged for page object detection in our research.
In general, object detection algorithms can be divided into one-stage object detection and two-stage object detection.

A. ONE-STAGE METHOD
In this subsection, we review the AutoAssign [9] and ATSS [10] object detection methods. AutoAssign is a dense object detector considered a one-stage anchor-free detector. ATSS is not a complete object detection method but a module that can be integrated into any one-stage anchor-free method (such as FCOS) or one-stage anchor-based method (such as RetinaNet). However, for simplicity, we list them both as ''one-stage'' methods.

1) AutoAssign
AutoAssign [9] is a single-stage object detection method. It requires very little prior knowledge (thresholds for selecting positive and negative samples) and is highly efficient through a weight distinction mechanism.
As shown in Figure 5, the grey framework illustrates the network architecture. It first follows an anchor-free method such as FCOS (fully convolutional one-stage) [40] to remove predefined anchors and directly predict objects at each feature position. The network architecture has three outputs: the classification score, the implicit objectness score, and the localization offsets. During training (the blue framework below), all predictions of the above architecture are first converted into a common confidence score. On top of this, a weighting mechanism is proposed, consisting of a center weighting module and a confidence weighting module. The center weighting module is designed to respond to the prior centrality property inherent in the data and to adapt to each class's specific patterns. It starts from the standard center prior and then learns the distribution of each class in the data. The confidence weighting module assigns the most appropriate positions for each sample based on its appearance and scale. Both modules combine to generate positive and negative weights for each position i in the ground-truth bounding box. Finally, the positive and negative loss functions are calculated, and the positive-negative sample labelling is optimized along with the network architecture.
From a positive-negative labelling point of view, for an object, AutoAssign can automatically find its appropriate scale on FPN (feature pyramid network) levels and spatial locations based on the network's output. As a result, labeling is appropriately resolved in a uniform, recognizable and distinguishable manner.

2) ADAPTIVE TRAINING SAMPLE SELECTION (ATSS)
ATSS [10] is a method that automatically selects positive and negative samples based on the statistical characteristics of proposal regions.
For each ground-truth box g in the image, Zhang et al. first look for candidate positive samples. At each feature level, they choose the k anchor boxes whose center coordinates are closest to the center of g based on the L2 distance. Assuming there are L feature levels, the ground-truth box g has k × L candidate positive regions. Then, the IoU (intersection over union) between these candidate boxes and the ground truth is calculated. With these statistics, the IoU threshold for the ground-truth box g is set by the formula t_g = m_g + v_g, where m_g and v_g are the mean and standard deviation of the candidate IoUs. Finally, they select the candidate boxes with IoU ≥ t_g as the final positive samples.
Note that ATSS also restricts the centers of positive samples to lie inside the ground-truth box g. In addition, if an anchor box is assigned to more than one ground-truth box, the one with the highest IoU is selected, and the rest are treated as negative samples.
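The adaptive threshold procedure above can be sketched in NumPy as follows. This is an illustrative implementation, not the authors' code; the function name, the data layout, and the assumption that anchor-to-ground-truth IoUs are precomputed are ours:

```python
import numpy as np

def atss_select(anchor_centers_per_level, anchor_ious, gt_center, k=9):
    """Sketch of ATSS positive-sample selection for a single ground-truth box.

    anchor_centers_per_level: list of (N_l, 2) arrays of anchor centers per FPN level.
    anchor_ious: (N,) IoUs of all anchors (concatenated over levels) with the gt box.
    gt_center: (2,) center of the ground-truth box.
    Returns indices (into the concatenated anchor list) of the positive anchors.
    """
    candidates = []
    offset = 0
    for centers in anchor_centers_per_level:
        # top-k anchors per level by L2 distance between centers
        d = np.linalg.norm(centers - gt_center, axis=1)
        topk = np.argsort(d)[:k] + offset
        candidates.extend(topk.tolist())
        offset += len(centers)
    candidates = np.array(candidates)
    ious = anchor_ious[candidates]
    # adaptive threshold t_g = mean + standard deviation of candidate IoUs
    t_g = ious.mean() + ious.std()
    return candidates[ious >= t_g]
```

A full implementation would additionally filter the candidates so that their centers lie inside the ground-truth box, as described above.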

B. TWO-STAGE METHODS
In this subsection, we review the state-of-the-art two-stage object detection methods. Note that the methods in this section are mostly improved versions of Faster R-CNN [32], focusing on balanced sampling [16], deep convolutional networks for feature extraction [17], [15], the feature pyramid network (FPN) [16], [12], [18], and the regression and classification tasks [11], [13].

1) FASTER R-CNN
Faster R-CNN is the improved version of Fast R-CNN. Ren et al. [32] proposed a region proposal network (RPN) to replace selective search and generate better proposal regions; this architecture is then trained jointly with Fast R-CNN. These improvements reduce the number of proposal regions and increase the inference speed to near real-time, approximately 5 fps on a single GPU, with the best performance at the time. Faster R-CNN is the foundation for many later object detection methods. Within Faster R-CNN, an input image passes through a CNN architecture to produce a feature map. This feature map then goes through the RPN to generate proposal regions with or without objects. These regions pass through the RoI pooling layer to be resized to the same size and are then classified and location-refined by Fast R-CNN.

2) WEIGHT STANDARD
Batch normalization is a data normalization technique that gives outstanding results. However, Qiao et al. [17] argue that this method has limitations under micro-batch training. The reason is that when training with micro-batches on multiple GPUs, each GPU only receives 1-2 images, causing batch normalization to degrade performance significantly. Indeed, a GPU receiving too few images is a common problem due to insufficient resources in the computer vision field. Therefore, the Weight Standard was proposed to overcome this issue.
The main idea of the Weight Standard is to normalize the weights of the kernel. Typically, a convolution is defined as Y = W ∗ X, where W ∈ R^{O×I} is the weight of the kernel, X is the input and Y is the output of the convolution. O is the number of output channels, and I is the number of input connections of the kernel for each output channel. The Weight Standard normalizes the weight W as Ŵ_{i,j} = (W_{i,j} − μ_{W_i}) / σ_{W_i}, where μ_{W_i} = (1/I) Σ_{j=1}^{I} W_{i,j} and σ_{W_i} = sqrt((1/I) Σ_{j=1}^{I} (W_{i,j} − μ_{W_i})² + ε). The output of the convolution is then computed as Y = Ŵ ∗ X.
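The normalization can be sketched as follows, assuming the (O × I) weight layout used in the text (illustrative NumPy code, not the authors' implementation):

```python
import numpy as np

def weight_standardize(W, eps=1e-5):
    """Sketch of Weight Standardization: for each output channel, normalize the
    kernel weights to zero mean and unit variance before the convolution.
    W has shape (O, I): output channels x flattened input connections."""
    mu = W.mean(axis=1, keepdims=True)    # per-output-channel mean
    sigma = W.std(axis=1, keepdims=True)  # per-output-channel std
    return (W - mu) / (sigma + eps)
```

The standardized weight Ŵ is what actually enters the convolution; the raw W remains the learnable parameter.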

3) GENERALIZED ATTENTION
Generalized Attention [15] is a unified study of different spatial attention factors under a general attention formulation, including the attention mechanism of the Transformer architecture. Given a query element and a set of key elements, an attention function aggregates the key contents based on attention weights that measure the compatibility of each query-key pair. To allow the model to attend to key contents from different representation subspaces and positions, the outputs of multiple attention functions are linearly combined with learnable weights. Let q index a query element with content z_q, and let k index a key element with content x_k. The multi-head attention feature y_q is then computed as y_q = Σ_{m=1}^{M} W_m [Σ_{k∈Ω_q} A_m(q, k, z_q, x_k) ⊙ W'_m x_k], where m indexes the attention head, Ω_q specifies the key region supporting the query, A_m(q, k, z_q, x_k) denotes the attention weights in the m-th attention head, and W_m and W'_m are learnable weights. The attention weights are usually normalized within Ω_q, such that Σ_{k∈Ω_q} A_m(q, k, z_q, x_k) = 1, where the key region supporting q spans the key elements (e.g., the whole input sequence). In the Transformer-style attention studied in [15], the attention weight of each query-key pair is computed as the sum of four terms {ε_j}_{j=1}^{4} based on different attention factors. By default, 8 attention heads are used in the research.
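The aggregation y_q = Σ_m W_m [Σ_k A_m(·) W'_m x_k] can be sketched as below. This is an illustrative NumPy version that uses only the query-and-key-content attention factor, with random matrices standing in for the learned projections W_m and W'_m:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(z_q, x_k, M=2):
    """Sketch of multi-head attention for one query.

    z_q: (d,) query content; x_k: (K, d) key contents; M: number of heads.
    The random projections stand in for the learned weights W_m and W'_m."""
    d = z_q.shape[0]
    rng = np.random.default_rng(0)
    y = np.zeros(d)
    for m in range(M):
        Wm = rng.standard_normal((d, d)) / np.sqrt(d)    # stands in for W_m
        Wm_p = rng.standard_normal((d, d)) / np.sqrt(d)  # stands in for W'_m
        # A_m(q, k): content-based weights, normalized over the key region
        A = softmax(x_k @ (Wm_p @ z_q) / np.sqrt(d))
        # weighted aggregation of projected key contents, then output projection
        y += Wm @ (A @ (x_k @ Wm_p.T))
    return y
```

The softmax enforces the normalization Σ_k A_m = 1 mentioned above; the other three attention factors (relative position, key content only, etc.) would each add a term to the logits.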
Zhu et al. [15] incorporate various attention mechanisms into deep networks to explore their effects. For object detection tasks, ResNet50 is chosen as the backbone for feature extraction, and only the self-attention mechanism is involved. In detail, the self-attention mechanism is incorporated into the residual block, called the ''attended residual block,'' and only applied in the last two stages (conv4 and conv5 stages). Faster R-CNN with a feature pyramid network is chosen as the baseline detector for experiments.

4) LIBRA R-CNN
Libra R-CNN [16] is an innovative object detection method that addresses imbalances in the training process. Pang et al. suggested that this imbalance lies at three levels: the sample level, the feature level, and the objective level. To solve this, IoU-balanced sampling, a balanced feature pyramid, and the balanced L1 loss function were proposed, responding to the three imbalance problems above.

a: BALANCE IoU SAMPLING
Based on the observation that imbalance causes hard samples to be masked by thousands of easy samples, IoU-balanced sampling is proposed to mine more hard samples at no extra cost.
Suppose we need to select N hard negative samples from M proposal regions. The probability of each sample being selected under random sampling is p = N/M. To increase the selection rate of hard negatives, the sampling interval is evenly split into K bins according to IoU, and the N demanded hard negatives are distributed equally among the bins. Samples are then selected uniformly from each bin, so the selection probability is redefined as p_k = (N/K) · (1/M_k), k ∈ [0, K), where M_k is the number of proposals in the k-th bin. In the original paper, K is set to 3.
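The sampling scheme can be sketched as follows (illustrative NumPy code; the bin edges and the handling of under-populated bins are our simplifications):

```python
import numpy as np

def iou_balanced_sample(ious, N, K=3, seed=0):
    """Sketch of IoU-balanced negative sampling: split the IoU range into K
    equal bins and draw roughly N/K negatives uniformly from each bin, so hard
    negatives (higher IoU) are no longer drowned out by the many easy ones.

    ious: (M,) IoUs of the negative proposals with their nearest ground truth."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(0, ious.max() + 1e-6, K + 1)
    per_bin = N // K
    picked = []
    for k in range(K):
        in_bin = np.where((ious >= edges[k]) & (ious < edges[k + 1]))[0]
        take = min(per_bin, len(in_bin))  # a bin may hold fewer than N/K samples
        picked.extend(rng.choice(in_bin, size=take, replace=False).tolist())
    return np.array(picked)
```

Compared with uniform random sampling, each IoU bin now contributes equally, raising the probability of picking high-IoU (hard) negatives.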

b: EQUALIZING PYRAMID FEATURES
Unlike previous studies on FPN, such as PANet, which combines multilevel features using two-way connections, the idea here is to reinforce the multilevel features using the same balanced semantic features. The full integration includes four steps: rescaling, integrating, refining and strengthening.
c: OBTAIN BALANCED SEMANTIC FEATURES
The feature at resolution level l is denoted as C_l. The number of multilevel features is denoted as L, and the lowest and highest levels are denoted by l_min and l_max, respectively. Here, C_2 denotes the feature with the highest resolution. To integrate the multilevel features while keeping their semantic hierarchy, the multilevel features {C_2, C_3, C_4, C_5} are first resized to an intermediate size. Once the features are rescaled, the balanced semantic feature is obtained by simple averaging: C = (1/L) Σ_{l=l_min}^{l_max} C_l. The obtained feature is then rescaled using the same but inverse process to enhance the original features.
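The rescale-and-average step can be sketched as follows (illustrative NumPy code; nearest-neighbor resizing stands in for the interpolation/pooling used in practice):

```python
import numpy as np

def balanced_semantic_feature(features, target_shape):
    """Sketch of the 'integrate' step of the balanced feature pyramid:
    rescale the multi-level features {C_l} to a common intermediate size and
    average them, C = (1/L) * sum_l resize(C_l).

    features: list of 2-D arrays (one channel per level, for simplicity).
    target_shape: (H, W) intermediate size."""
    H, W = target_shape
    resized = []
    for C in features:
        h, w = C.shape
        rows = np.arange(H) * h // H  # nearest-neighbor row indices
        cols = np.arange(W) * w // W  # nearest-neighbor column indices
        resized.append(C[np.ix_(rows, cols)])
    return np.mean(resized, axis=0)
```

The inverse process (resizing C back to each level's resolution and adding it to C_l) produces the strengthened features.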

d: REINFORCE SEMANTIC FEATURES
The balanced semantic features are then consolidated. This consolidation step helps enhance discriminative features to improve the results. With this method, low-level to high-level features are aggregated at the same time. The outputs {P_2, P_3, P_4, P_5} are used for object detection in the same way as in FPN.

e: BALANCED L1 LOSS FUNCTION
In this paper, the balanced L1 loss function is proposed. Balanced L1 is used to increase the contribution of inliers (accurate samples); the loss is defined as L_b(x) = (α/b)(b|x| + 1) ln(b|x| + 1) − α|x| if |x| < 1, and L_b(x) = γ|x| + C otherwise, where C is a constant and b ensures that balanced L1 is continuous at |x| = 1, which determines the relationship between the coefficients as α ln(b + 1) = γ. As recommended in the original article, α and γ are set to 0.5 and 1.5, respectively.
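Under the continuity constraint α ln(b + 1) = γ, the loss can be sketched as (illustrative NumPy code):

```python
import numpy as np

def balanced_l1(x, alpha=0.5, gamma=1.5):
    """Sketch of the balanced L1 loss.

    b follows from the gradient-continuity constraint alpha * ln(b + 1) = gamma
    at |x| = 1; the constant C makes the two loss branches meet there."""
    b = np.exp(gamma / alpha) - 1.0
    ax = np.abs(x)
    # inlier branch (|x| < 1): promoted gradient for accurate samples
    inner = (alpha / b) * (b * ax + 1) * np.log(b * ax + 1) - alpha * ax
    # outlier branch (|x| >= 1): clipped, L1-like gradient
    C = (alpha / b) * (b + 1) * np.log(b + 1) - alpha - gamma
    outer = gamma * ax + C
    return np.where(ax < 1, inner, outer)
```

With α = 0.5 and γ = 1.5, b = e³ − 1 ≈ 19.09; the loss is zero at x = 0 and continuous at |x| = 1.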

5) CONTENT-AWARE REASSEMBLY OF FEATURES (CARAFE)
Feature upsampling is a common operation in dense prediction problems such as object detection or object segmentation. It is an integral part of high-to-low or low-to-high feature fusion architectures such as FPN, U-Net, and Stacked Hourglass. Content-Aware ReAssembly of Features (CARAFE) [18] is a universal, simple and highly efficient operator for this purpose.
CARAFE acts as a reassembly operator with content-aware kernels, consisting of two steps. The first step is to predict a reassembly kernel for each target location based on its content, and the second step is to reassemble the features with the predicted kernels. Given a feature map X of size C × H × W and an upsampling ratio σ, CARAFE produces a new feature map X′ of size C × σH × σW. Any target location l′ = (i′, j′) of X′ corresponds to a source location l = (i, j) in the input feature X, where i = ⌊i′/σ⌋ and j = ⌊j′/σ⌋. Here, N(X_l, k) denotes the k × k subregion of the input feature X centered at position l, the so-called neighborhood of X_l.
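The reassembly step can be sketched as follows (illustrative NumPy code; the content encoder that predicts the kernels is omitted, and the kernels are assumed to be given and already softmax-normalized):

```python
import numpy as np

def carafe_reassemble(X, kernels, sigma=2, k=3):
    """Sketch of the CARAFE reassembly step.

    X: (C, H, W) input feature map.
    kernels: (sigma*H, sigma*W, k, k) per-location reassembly kernels.
    Each upsampled location l' = (i', j') is a weighted sum over the k x k
    neighborhood of its source location l = (i'//sigma, j'//sigma)."""
    C, H, W = X.shape
    r = k // 2
    Xp = np.pad(X, ((0, 0), (r, r), (r, r)))  # zero-pad the borders
    out = np.zeros((C, sigma * H, sigma * W))
    for ip in range(sigma * H):
        for jp in range(sigma * W):
            i, j = ip // sigma, jp // sigma
            patch = Xp[:, i:i + k, j:j + k]  # k x k neighborhood N(X_l, k)
            out[:, ip, jp] = (patch * kernels[ip, jp]).sum(axis=(1, 2))
    return out
```

With a delta kernel (all weight at the center), this reduces to nearest-neighbor upsampling; content-aware kernels instead blend the neighborhood differently at each target location.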

6) DOUBLE HEAD
The two-head structure (one fully connected head and one convolutional head) is used extensively in R-CNN-based object detection methods for two tasks: proposal classification and coordinate regression. However, Wu et al. [11] suggest that there is a certain lack of understanding of how these two head structures work for the two tasks. Their results show that the fully connected head is more suitable for classification, while the convolutional head is more suitable for coordinate regression, since the output of the fully connected head is more spatially sensitive than that of the convolutional head. Therefore, the Double Head method was proposed, i.e., using a fully connected head for classification and a convolutional head for box regression.

7) GENERIC RoI EXTRACTOR (GRoIE)
In two-step object detection methods such as Faster R-CNN, the region of interest layer plays an important role. Specifically, it is used to extract a consistent subset of features from an FPN network layer placed at the top of the architecture. Realizing that previous RoI classes only selected the best layer from the FPN as a limitation, Rossi et al. [12] proposed the Generic RoI Extractor (GRoIE), which introduces nonlocal building blocks and an attention mechanism to increase performance.
Specifically, GRoIE consists of the following four modules:

a: RoI POOLER MODULE
This module applies RoI Align to the heterogeneous proposal regions to obtain fixed-size representations.

b: PREPROCESSING MODULE
The goal of this module is to preprocess the pooled regions before aggregation. It is usually implemented by a convolutional layer associated with each branch.

c: AGGREGATION MODULE
This module defines how the single RoIs coming from each branch can be aggregated. The most commonly used operators are concatenation and summation.

d: POSTPROCESSING MODULE
This is an additional postprocessing step applied to the aggregated features before they are returned. It allows the network to learn global features considering all dimensions.

8) SIDE-AWARE BOUNDARY LOCALIZATION (SABL)
Existing object detection methods depend on bounding box regression for object localization. Although there have been attempts at improvement in recent years, the accuracy of bounding box regression remains unsatisfactory, which is a limitation of object detection. Wang et al. [13] found that previous approaches focused only on predicting the center coordinates and dimensions (x, y, w, h), which is not an efficient way to regress bounding boxes, especially when large displacements and variances exist between the anchor boxes and the ground truth. Therefore, the Side-Aware Boundary Localization (SABL) method is proposed, where each side of the bounding box is located in turn by a dedicated network branch.
SABL first extracts the horizontal and vertical features (F_x and F_y) by aggregating the RoI feature F along the X and Y axes, respectively. F_x and F_y are then divided into the side-aware features F_left, F_right, F_top and F_down. On each side of the bounding box, SABL first divides the target space into buckets and searches for the bucket containing the boundary by taking advantage of the side-aware features. It then refines the boundary coordinates x_left, x_right, y_top and y_down by further predicting their offsets. Such a two-step bucketing scheme reduces the regression variance and eases the prediction difficulty. Furthermore, the reliability of the estimated buckets can also help to rectify the classification score and further improve the performance. SABL is applicable not only to two-stage methods but also to one-stage detection methods.
In Figure 5, we depict the processing of input document images by two methods, AutoAssign and SABL, representing the one-stage and two-stage detectors in our experiments. AutoAssign improves the label assignment task in anchor-free detectors by proposing two modules, center weighting and confidence weighting, to calculate positive and negative weights for adjusting the category-specific prior distribution and the instance-specific sampling strategy in both the spatial and scale dimensions. Meanwhile, SABL focuses on improving the localization task, where each side of the bounding box is respectively localized with a dedicated network branch.

C. LOSS FUNCTIONS
Loss functions are an essential factor affecting the detection performance in object detection tasks. The loss functions of object detection are categorized into classification loss and localization loss. In our research, we focus on exploring the effect of the classification loss function on object detectors. This improves the precision of classifying semantic classes, which is a challenging problem in analyzing document images.

1) CROSS ENTROPY LOSS (CE)
Let p be the label probability, q be the prediction probability, and C be the number of classes. Cross-entropy loss is defined as L_CE = −Σ_{i=1}^{C} p_i log(q_i). The CE loss is used for classifying proposal boxes as positive or negative, which means the number of classes is 2 (C = 2). CE assumes that the class distribution is balanced; however, we would like to consider the unbalanced scenario, where a different loss function is needed so that the minority classes are classified more accurately. This case is problematic for the object detector, since the positive proposal regions are few, whereas the negative proposal regions dominate.
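The definition can be sketched as follows (illustrative NumPy code; the eps term, which guards against log(0), is our addition):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Sketch of the cross-entropy loss L_CE = -sum_i p_i * log(q_i),
    with p the (one-hot) label distribution and q the predicted probabilities."""
    return -np.sum(p * np.log(q + eps))
```

For a uniform prediction over two classes, the loss equals ln 2, the familiar entropy of a fair coin.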

2) FOCAL LOSS (FL)
Originally introduced by Lin et al. [19] in an attempt to improve single-stage methods, this loss function was applied in the method named RetinaNet. Focal loss is defined as L_FL = −Σ_{i=1}^{C} (1 − q_i)^γ p_i log(q_i). As shown, focal loss adds the factor (1 − q_i)^γ to the CE function. This multiplier is very effective in adjusting the effect of labels on the loss function and on gradient descent simultaneously. For classes with many samples, the predicted probabilities are usually correct and large, so (1 − q_i)^γ tends to be small and has almost no impact on the loss function. For classes with few samples, the predicted probabilities are small, making (1 − q_i)^γ closer to 1, so their impact is larger.
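The modulating factor can be sketched as follows (illustrative NumPy code; the α-balancing term of the original paper is omitted for brevity):

```python
import numpy as np

def focal_loss(p, q, gamma=2.0, eps=1e-12):
    """Sketch of focal loss: the modulating factor (1 - q_i)^gamma down-weights
    well-classified (high-confidence) predictions relative to cross-entropy."""
    return -np.sum(((1 - q) ** gamma) * p * np.log(q + eps))
```

For a confident correct prediction (q = 0.9 on the true class, γ = 2), the loss is scaled by 0.01 compared with plain cross-entropy, while near-miss predictions keep almost their full weight.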

3) FUSED LOSS
To take advantage of both the CE and FL loss functions, we combine them with a trade-off parameter λ: L_Fused = λ L_CE + (1 − λ) L_FL. By default, this combined loss function considers the contributions of the classes fairly while additionally emphasizing the contributions of the minority classes.
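A fused loss of this kind can be sketched as follows (illustrative NumPy code; the trade-off parameter lam and the exact convex-combination form are our assumptions, since the paper's own equation is not reproduced here):

```python
import numpy as np

def fused_loss(p, q, lam=0.5, gamma=2.0, eps=1e-12):
    """Sketch of a fused classification loss: a convex combination of
    cross-entropy (fair class treatment) and focal loss (minority emphasis)
    with a trade-off parameter lam."""
    ce = -np.sum(p * np.log(q + eps))                       # cross-entropy term
    fl = -np.sum(((1 - q) ** gamma) * p * np.log(q + eps))  # focal term
    return lam * ce + (1 - lam) * fl
```

Setting lam = 1 recovers pure cross-entropy; lam = 0 recovers pure focal loss.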

4) GHM LOSS
In an attempt to solve the class imbalance of proposal boxes, the gradient harmonizing mechanism for classification (GHM-C) loss is proposed in [20]: L_GHM-C = (1/N) Σ_{i=1}^{N} β_i L_CE(p_i, p*_i), where p_i is the predicted probability of the i-th proposal box, p*_i is its ground-truth label, β_i = N / GD(g_i) is the harmonizing parameter of the i-th sample, and GD(g_i) is the gradient density at the gradient norm g_i of sample i.
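The harmonizing weights can be sketched as follows (illustrative NumPy code; the number of bins and the histogram-based density approximation are our simplifications of the unit-region approximation in [20]):

```python
import numpy as np

def ghm_c_weights(p, p_star, bins=10):
    """Sketch of GHM-C weighting: the gradient norm g = |p - p*| is
    histogrammed, and beta_i = N / GD(g_i) down-weights samples falling in
    dense gradient regions (the very easy or dominant samples)."""
    g = np.abs(p - p_star)                 # gradient norm per sample
    N = len(g)
    edges = np.linspace(0, 1, bins + 1)
    idx = np.clip(np.digitize(g, edges) - 1, 0, bins - 1)
    counts = np.bincount(idx, minlength=bins)
    gd = counts * bins                     # density ~ count / bin width (1/bins)
    beta = N / np.maximum(gd[idx], 1e-12)  # harmonizing parameter beta_i
    return beta
```

Samples in a crowded gradient bin (e.g., thousands of easy negatives with g ≈ 0) receive small weights, while rare hard samples keep weights near or above 1.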

IV. BENCHMARK SUITE
In this section, we present the benchmark for Vietnamese document analysis. We mainly discuss the benchmark dataset for Vietnamese document images and the experimental process based on the aforementioned methods in Section III.

A. DATASET
The UIT-DODV dataset [8] is used for our extended experiments. There are four classes in this dataset: formula, figure, table, and caption, and the dataset is split into three sets: training (1,440 images), validation (234 images), and testing (720 images). The UIT-DODV dataset is collected from various sources, i.e., PDF (1,696 images), scanned by smartphone (451 images) and scanned by a physical scanner (247 images). Due to the variety of images, UIT-DODV poses many challenges for object detectors to work well across multiple domains. In addition, the articles come from many sources, so there are significant differences in layout. For example, a page can be organized into one column or two columns, depending on the template of the conference or journal. As a result, the locations of objects (e.g., tables, figures) are not fixed across pages. Moreover, the primary language of UIT-DODV is Vietnamese, whose character set differs through tone marks (as in á, à, ả, ã, ạ) and derived characters (ă, â, ê, ô, ơ, ư, đ). This contributes to the significant challenge of detecting semantic classes (e.g., formula, caption).
Moreover, we also visualize the distribution of object sizes in the training, validation, and testing sets of the UIT-DODV dataset. The distributions are quite similar across the three sets. Most width values lie between 0 and approximately 1,250 px, and the height values tend to be stable. Formula objects are similar to caption objects, but their range of width values is shorter. The width values of table objects cluster at approximately 1,200 px and 500 px, whereas their height values vary widely, mostly from under 250 px up to 1,000 px, with some tables reaching almost 2,000 px and a maximum of approximately 3,000 px. Figure objects show the most regular distribution among the object types; it is roughly linear along both the width and height axes. This analysis shows that UIT-DODV reflects the natural distribution of Vietnamese documents in reality quite well. The number of objects is adequate to evaluate the performance of object detectors, and the experimental results on this dataset are worth discussing.

B. EXPERIMENTAL SETTINGS
In this study, we run all experiments using the MMDetection toolbox [41], which implements many recent methods. Our configuration is: an Intel(R) Xeon(R) CPU @ 2.30 GHz with 2 cores; 25 GB of RAM; and one NVIDIA Tesla P100-PCIE 16 GB GPU. For a fair comparison, we use ResNet-50 as the backbone for feature extraction in all ten object detectors. All models mentioned in Section III are trained for 24 epochs, and the best checkpoint of each is recorded; training for 24 epochs is shown to help the models converge when trained on the MS-COCO dataset with the MMDetection toolbox [41]. Each detector is tested with the four different classification loss functions mentioned in Section III-C. However, several detectors do not adapt well to some loss functions; in particular, the loss value becomes NaN in the very first epochs. Therefore, not every method is tested with all four loss functions.
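For illustration, swapping the classification loss of a two-stage detector in MMDetection is done through its Python config files. The fragment below is a hedged sketch following MMDetection's config conventions; the field values are assumptions, not necessarily the exact settings used in this study:

```python
# Illustrative MMDetection-style config fragment: replacing the
# classification loss of a Faster R-CNN box head with focal loss.
# Field names follow MMDetection conventions; hyperparameter values
# here are placeholders.
model = dict(
    roi_head=dict(
        bbox_head=dict(
            type='Shared2FCBBoxHead',
            num_classes=4,  # table, figure, caption, formula
            loss_cls=dict(type='FocalLoss', use_sigmoid=True,
                          gamma=2.0, alpha=0.25, loss_weight=1.0),
            loss_bbox=dict(type='L1Loss', loss_weight=1.0))))
```

Changing `loss_cls` (or `loss_bbox`) in this nested dict is all that is needed to re-run a detector under a different loss, which is what makes a loss-function sweep like ours practical.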

C. EVALUATION METRICS
We calculate the average precision metric using the COCO API. We calculate the AP scores of all classes and take their average as the mean AP score (mAP). This process is formally defined as follows:

mAP = (1/|C|) Σ_{c ∈ C} AP_c,  with  AP_c = (1/|T|) Σ_{t ∈ T} AP_c(t),

where AP_c is the average precision of the c-th class; C is the set of all classes in the dataset; and T is the set of IoU thresholds T = 0.50 : 0.05 : 0.95. In addition, we also calculate the mAP scores at IoU = 0.5 and IoU = 0.75, which are called AP@50 and AP@75, respectively.
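The averaging step can be sketched as follows (hypothetical function name; in practice the per-class, per-threshold AP values come from the COCO API):

```python
def coco_map(ap_table):
    """COCO-style mAP. `ap_table` maps class name -> {iou_threshold: AP}.
    AP_c averages over the IoU thresholds T = 0.50:0.05:0.95, and mAP
    averages AP_c over all classes."""
    per_class = [sum(aps.values()) / len(aps) for aps in ap_table.values()]
    return sum(per_class) / len(per_class)
```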

D. EXPERIMENTAL RESULTS
All the aforementioned methods are available in the MMDetection toolbox. Furthermore, we perform experiments with different loss functions for the classification and regression tasks in the second stage of the detectors. Note the following modifications:
• AutoAssign: This detector does not expose a classification loss in its configuration; thus, we only perform experiments on the regression loss.
• For the classification loss, we perform experiments on cross-entropy loss, focal loss, and fused loss by default. We replace any incompatible loss in a detector with GHM loss. Except for AutoAssign, all detectors use the L1 loss in the regression task; for AutoAssign, we experiment with different IoU losses because the L1 loss shows very low performance on this detector.
• Regarding the fused loss function, we emphasize the focal loss term. As shown in Section III-C3, focal loss handles class imbalance better; therefore, the classification performance is theoretically better. In particular, we apply α = 0.6 in all methods where fused loss is applied. We also try different α values (0.4, 0.5, and 0.6) with the Double Head method, which confirms that α = 0.6 is appropriate.
As a result, the highest result of each method ranges from 64.7% to 77.2% in terms of mean average precision. The three methods that give the highest results are SABL (Cascade), Faster R-CNN, and Double Head, whereas ATSS yields the lowest result (39.9%). We visualize the results of these four methods in Figure 12.
As shown in Figure 12, these four methods give good predictions for the Table class, while the predictions for the three other classes exhibit different mistakes depending on the detector. SABL (Cascade) detects all objects correctly, and its bounding boxes tightly surround objects of all four categories. The reason is that SABL (Cascade) is based on Cascade R-CNN, which contains multiple stages, each refining the output of the previous one; therefore, its predictions can be very accurate. Moreover, SABL enhances the regression task by taking advantage of side-aware features; consequently, it is much more precise than the other methods. As shown in Table 2, the AP of Double Head is higher than that of Faster R-CNN in the Figure class and vice versa.
The ATSS method only detects figures and tables, omits the two caption objects, and gives the lowest results among the methods. The predicted bounding boxes of the figure objects also do not surround the objects well. The reason is that ATSS is a single-stage detector; thus, its performance is obviously lower than that of the other detectors. However, the prediction time of ATSS is much faster, which is commonly seen in single-stage detectors.
The SABL (Cascade) method yields the highest AP, up to 77.2%. Among the four classes, the AP scores of Table and Figure are above 85%, while Formula is 50.1% and Caption is 76.2%; to improve the results on this dataset, it is therefore necessary to focus on these two weaker classes. The Formula class has highly diverse expressions, which is a big challenge of the UIT-DODV dataset. The SABL (Cascade) method achieves the highest mAP of 77.2% when using cross-entropy loss and 76.2% when using GHM loss.
We continue by discussing the loss functions in our experiments. In general, the detectors maintain good performance in predicting tables and figures. However, as mentioned in Section IV-A, the detectors struggle with the challenges of the UIT-DODV dataset, which leads to poor performance on captions and formulas. Specifically, formula and caption are semantic classes that are part of the text in the document. Therefore, the problem that must be overcome is noisy proposed regions (i.e., an ordinary piece of text can be proposed as a caption or formula by the region proposal network). These problems relate to classifying background and foreground. We leverage focal loss to handle them; however, it is not as robust as expected. Fused loss is another solution that leverages both cross-entropy and focal loss to enhance classification precision. According to the experimental results, fused loss gives better results than cross-entropy loss in almost all methods (except the GRoIE and Double Head methods); this shows that applying fused loss is more effective than applying traditional cross-entropy loss. In addition, among the different α values examined on the Double Head method, α = 0.6 gives the highest result. Fused loss also contributes to improving the AP on the caption class. Libra R-CNN expresses its effectiveness with GHM loss, achieving 64.7% mAP. Regarding the regression task, replacing the default regression loss function (DIoU) of ATSS yields better results in three classes: caption, figure, and formula.
Since SABL (Cascade) achieves the best results, we further conduct experiments on SABL (Faster R-CNN) with two different pooling methods: RoI Align and PrRoI Pooling. The evaluation results show that applying PrRoI Pooling gives results 1% higher than RoI Align. With PrRoI Pooling, SABL (Faster R-CNN) is only slightly behind SABL (Cascade), with 74.5% AP. The results are reported in Table 3 and Figure 13. From this observation, using PrRoI Pooling as the RoI pooling module is more effective than RoI Align on document images. Different from RoI Align, PrRoI Pooling uses full integration-based average pooling instead of sampling a constant number of points. Therefore, the localization task is improved, helping the predicted boxes overlap more exactly with the ground-truth boxes and leading to a better AP score.
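To illustrate the distinction, the 1-D sketch below (with hypothetical helper names; real PrRoI Pooling integrates 2-D bilinear interpolation in closed form rather than by dense sampling) compares a fixed-sample average, as in RoI Align, with a dense average that approaches the exact integral:

```python
def interp(feat, x):
    """Linear interpolation of a 1-D feature map at continuous x."""
    x0 = min(int(x), len(feat) - 2)
    w = x - x0
    return (1 - w) * feat[x0] + w * feat[x0 + 1]

def bin_average(feat, lo, hi, n_samples):
    """Average of interpolated values at n_samples midpoints in [lo, hi].
    A small fixed n_samples mimics RoI Align's point sampling; a large
    n_samples approaches PrRoI's integration-based average."""
    step = (hi - lo) / n_samples
    pts = [lo + (i + 0.5) * step for i in range(n_samples)]
    return sum(interp(feat, x) for x in pts) / n_samples
```

On a feature map with sharp local peaks, the few-point average can deviate noticeably from the exact integral, which is one intuition for why the integration-based pooling localizes boxes more precisely.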
Moreover, different loss functions have various impacts on the detection and classification results of the models. We visualize the results in Figure 10 to show the impact of the classification loss functions, and we observe that the fused loss function produces better localization performance than cross-entropy in some cases. Figure 11 illustrates the performance of the different regression loss functions.
The trade-off between performance and complexity is explored in Figure 1. Note that we only use the highest AP among the loss functions of each object detection model to illustrate this trade-off. It is not difficult to recognize that the larger the model is, the higher the detection performance becomes. SABL-Cascade outperforms the others because it is a multistage model whose proposal boxes are refined within three stages; moreover, its regression task is improved by the side-aware boundary localization module. However, this design also increases complexity. Among the one-stage and anchor-free methods, AutoAssign operates quite well; although its average precision cannot match the results achieved by two-stage methods, it is an acceptable choice if real-time speed is required. Two-stage methods tend to cluster together because they are all modified versions of Faster R-CNN; the differences lie in the FPN, externally learned branches, or other modules. Double Head, with its two branches for regression and classification, outperforms the other two-stage models.
We also explore the performance of the transfer learning technique, which is training the detector on an existing benchmark dataset for page object detection and then calculating the AP score on the testing set of the UIT-DODV dataset. The DocBank dataset is selected for this experiment because it includes all four classes of the UIT-DODV dataset (caption, figure, table, formula) and also consists of document images from research papers; therefore, the AP scores for predicted detections on these four classes can be calculated. However, we only take 10,000 samples and use the ground-truth bounding boxes of objects belonging to the same four classes as in the UIT-DODV dataset for training. These 10,000 samples are split into training and testing sets of 5,000 samples each; we refer to this subset as DocBank10K. Faster R-CNN with cross-entropy loss is selected for the transfer learning experiment. First, we train the Faster R-CNN model on the DocBank10K dataset and use the trained weights to evaluate detection performance on the testing set of the UIT-DODV dataset. Second, we use the weights trained on DocBank10K as pretrained weights, continue training on the training set of the UIT-DODV dataset, and then evaluate again on the testing set of UIT-DODV. The results are reported in Table 4.
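The two-phase procedure above might be expressed as MMDetection-style config fragments; all file names and paths below are hypothetical, and MMDetection's `load_from` key is what initializes a run from previously trained weights:

```python
# Phase 1: train Faster R-CNN on DocBank10K from scratch.
# (config file names and work_dir paths are hypothetical)
phase1 = dict(
    config='faster_rcnn_r50_fpn_docbank10k.py',
    work_dir='work_dirs/docbank10k')

# Phase 2: fine-tune on UIT-DODV, initializing from the phase-1
# checkpoint via MMDetection's `load_from` mechanism, then evaluate
# on the UIT-DODV testing set.
phase2 = dict(
    config='faster_rcnn_r50_fpn_uitdodv.py',
    load_from='work_dirs/docbank10k/latest.pth',
    work_dir='work_dirs/uitdodv_finetune')
```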
We note that training on DocBank10K and evaluating on UIT-DODV does not achieve the expected results: the AP scores are much lower than those of the model trained directly on the UIT-DODV training set. Moreover, fine-tuning the Faster R-CNN model on the UIT-DODV training set from the pretrained weights obtained on DocBank10K also does not perform well; its AP score is lower than that of the Faster R-CNN model reported in Table 2 (−12.5% AP). These observations show that training on an existing English POD document dataset and fine-tuning on UIT-DODV is ineffective. The main reason may be the language gap: the model pretrained on DocBank10K sees only Latin characters and never observes samples containing the UTF-8 characters used in the Vietnamese language, which confuses the pretrained model. In addition, there are various formula writing styles in English papers, such as the position and style of the indexing numbers: the indexing numbers can be placed at the formula's right or left, and the indexing style also differs slightly (numbers only, or numbers together with a character). Meanwhile, Vietnamese documents usually use only numbers to denote formulas and put them on the right. These differences cause poor performance on the Formula class, which is recorded at only 0.3%. Regarding document style, English research papers commonly contain non-bordered tables, while Vietnamese researchers habitually use bordered tables; this aspect leads to the low AP score on the Table class (35.2%). We provide some qualitative results in Figure 14 and Figure 15, which compare the performance of three versions of Faster R-CNN: directly trained on UIT-DODV (reported in Table 2), trained on the DocBank10K dataset, and fine-tuned on UIT-DODV using pretrained weights from a model trained on DocBank10K. Based on the visualizations, directly using the weights pretrained on DocBank10K cannot detect caption objects (Figure 15b).
Even when fine-tuned on the UIT-DODV dataset, the detector still predicts some false-positive caption objects (Figure 14c). Besides, the detection performance on formula objects is extremely poor: while Faster R-CNN trained on the UIT-DODV training set predicts them almost correctly (Figure 14a), its counterpart trained on the DocBank10K dataset not only misses objects but also regresses bounding boxes inaccurately (Figure 14b). Even after fine-tuning on the UIT-DODV training set, the detector still misses one formula object (Figure 14c). However, Faster R-CNN trained on DocBank10K shows quite good performance on table objects (Figure 14b). Nevertheless, the predicted bounding boxes occupy many redundant parts, leading to a low AP score on the Table class. The same problem also appears with Faster R-CNN fine-tuned on the UIT-DODV training set (Figure 14c).

E. ERROR ANALYSIS
To address the current problems that state-of-the-art detectors face in the page object detection problem, we visualize the detection results from the best-performing model (SABL-Cascade) and find some cases in which the detector commonly fails. Since SABL-Cascade performs well in detecting table objects (95.9% AP), we focus only on its performance in detecting the other types of objects. After visualizing all images in the testing set, we observe the following problems.

1) CAPTION
The detector mainly ignores captions if they appear to the left or right of tables or figures. In Figure 7a, the captions are long sentences lying to the right of figure objects; in this case, the detector cannot detect their bounding boxes. The reason may be that these samples are outliers, as there are not many samples whose captions lie beside figure objects. On the other hand, in scanned images, the captions may be wavy; in these cases, caption objects are also easily ignored (Figure 7b).

2) FIGURE
Predicted bounding boxes of figure objects commonly overlap in cases where subimages appear within a figure or where many figures are placed beside each other (Figure 8a). At the same time, abnormal figure objects are easily ignored. In Figure 8b, the figure object contains a long string, which makes it extremely difficult for the model to recognize whether it is a figure, a table, or a caption.

3) FORMULA
Formula objects may include formula numbers; in our opinion, this is a pattern that helps the model recognize whether an object is a formula. Some formulas that do not contain numbers may thus become hard objects, which are commonly missed by the detector (Figure 9a). Like figure objects, however, when formula objects are located too close to each other, the model also predicts some redundant boxes (Figure 9b).

V. CONCLUSION AND FUTURE WORK
In this paper, we conduct comprehensive assessments of the UIT-DODV dataset with ten state-of-the-art object detection methods for Vietnamese document analysis. For each method, we train with two to three different loss functions for the proposal-box classification task: in addition to cross-entropy loss, focal loss, and GHM-C loss, we conduct experiments with fused loss (a combination of cross-entropy and focal loss). We assess not only the impact of different loss functions but also the impact of two different RoI pooling methods; in particular, we replace the default RoI Align with PrRoI Pooling to further improve the performance.
In the future, we will diversify the UIT-DODV dataset by collecting more images from lectures, textbooks, and receipts. In addition, we aim to address more problems in the document understanding problem, such as recognizing captions below figure or table objects and visual question answering based on text contents in document images.

ABBREVIATIONS
The following abbreviations are used in this manuscript: see Table 5.

TAM V. NGUYEN (Senior Member, IEEE) received the Ph.D. degree from the National University of Singapore, in 2013. He was a Research Scientist and a Principal Investigator at the ARTIC Research Centre, Singapore Polytechnic. He was also an Adjunct Lecturer at the National University of Singapore. He is currently an Associate Professor with the Department of Computer Science, University of Dayton. His research interests include computer vision, applied deep learning, multimedia content analysis, and mixed reality.