Classification of Financial Tickets Using Weakly Supervised Fine-Grained Networks

Facing the rapid growth in the issuance of financial tickets, traditional manual invoice reimbursement methods are imposing an increasing burden on financial accountants and consuming excessive manpower. There are many categories of financial tickets that need to be classified with high accuracy. Therefore, we propose a Financial Ticket Classification (FTC) network based on a weakly supervised fine-grained discriminative filter learning network, which greatly improves the work efficiency of financial accountants. The FTC network adopts an end-to-end structure and uses a deep convolutional network to extract highly descriptive features. By using a fully convolutional network (FCN), the method reduces the depth and width of the whole network and avoids the over-duplication of features and the overconsumption of system memory. To obtain more accurate classification results, we use the large-margin softmax (L-softmax) loss function, which makes the features learned within a class more compact, makes it easier to separate subclasses, and effectively prevents overfitting. Experimental results show that the proposed FTC network achieves both high accuracy (up to 99.36%) and high processing speed, meeting the accuracy and real-time requirements of financial accounting applications.


I. INTRODUCTION
In recent years, with the rapid development of computer hardware, computer vision and other technologies, deep learning is being adopted by an ever-widening group of fields [1]-[5]. Finance-and-tax is an important field for deep learning applications [6], [7]. Traditionally, accounting is usually performed manually as follows. First, the different types of financial tickets, such as value-added tax (VAT) invoices (common invoices, electronic invoices, and special invoices), bank tickets, and toll tickets (highway passenger tickets, vehicle occupation fees, and highway tolls), are manually sorted. Second, the basic information of these financial tickets is manually input into the financial software to produce accounting vouchers for the corresponding category. Then, each financial ticket is sequentially attached to the accounting voucher for the corresponding category. Finally, the accountant must repeatedly check whether the ordering of the tickets is correct and whether there are any missing tickets. However, this approach is obviously slowed by the lack of automation. Due to the large number and variety of financial tickets, the process results in massive classification workloads, time consumption, and labor effort on the part of the accounting staff, leading to high labor costs and low work efficiency. The accuracy of the input information is also greatly affected. Therefore, in order to make accounting more accurate, more efficient and highly automated, optical character recognition (OCR) technology has gradually been applied to the field of financial ticket recognition [7]-[9].
The associate editor coordinating the review of this manuscript and approving it for publication was Zhenhua Guo.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The ticket information identification system can not only reduce work tasks and pressure and improve work efficiency but also resolve contradictions caused by rising labor costs and labor shortages. Additionally, it can promote the digitalization, maintenance and intelligent accounting and storage of accounting information, making it more convenient for accountants to review. The highly accurate and efficient classification of tickets is a very important step in the ticket recognition system. The correct classification of tickets can result in more accurate OCR and structured information extraction. Through this method, the work efficiency of financial accountants can be greatly improved.
To meet the practical requirements of financial accounting systems, our solution should implement the following four improvements: 1. Reduce the cost of manual labeling in the model training process; 2. Improve the classification accuracy of the system as much as possible; 3. Support more financial ticket types, that is, more categories; and 4. Increase the classification processing efficiency.
To implement the improvements mentioned above, we propose the Financial Ticket Classification (FTC) network based on a weakly supervised fine-grained discriminative filter learning network. This paper proposes: 1. Using DenseNet as the basic network, which has a strong feature description capability and can effectively improve the classification accuracy; 2. A fully convolutional network (FCN) to replace the fully connected (FC) structure, which can effectively reduce the depth and width of the entire network, the number of training parameters and the system resource consumption; and 3. The large-margin softmax (L-softmax) loss function, which can effectively improve the classification accuracy by enforcing small intraclass variance and large interclass variance. Compared with existing financial ticket classification methods, the proposed method can support more financial ticket types and achieve higher classification accuracy and processing efficiency.
The structure of this manuscript is as follows: Section I introduces the background and motivation. Section II describes related work. Section III presents the overall framework of the proposed system, including detailed information about the dataset, the data preprocessing, the FTC network structure, and the layer configuration of the model used in the experiments. Section IV describes the verification of the FTC network proposed in this work and analyzes the experimental results. Section V summarizes the overall research.

II. RELATED WORK
Currently, there are two main automatic financial ticket classification methods. One method is based on combining artificially designed features (such as SIFT and HOG) with machine learning classifiers (such as SVM and KNN) [10]-[12]. The other method is based on deep neural networks, which extract discriminative features for classification [13], [14]. The artificially designed features depend on the layout of the tickets, such as frame lines, headers, text areas, and other discerning feature information [15]. The descriptiveness of the features extracted using this method is very limited. Indeed, this method is intended mainly for a certain type of ticket and adapts poorly to tickets with new and different structures. The features extracted with methods based on deep neural networks are comparatively highly descriptive and do not need to be manually designed, so such methods have been widely used. Due to the low similarity between the major categories of financial tickets, there is a large interclass variance. Tickets within the same major category have high similarity and small variance. Therefore, we use the fine-grained classification technology of convolutional neural networks to classify financial tickets. At present, fine-grained classification is mainly divided into strongly supervised and weakly supervised fine-grained classification. Strongly supervised fine-grained classification methods such as Part RCNN [16] require image category labels, bounding boxes and local area locations for the training samples, but the manual labeling cost is very high. Weakly supervised fine-grained classification methods, such as B-CNN [17] and DFL-CNN [18], only need the category labels.
Ahmad S. Tarawneh et al. [19] used AlexNet to extract convolutional features and then separately used Random Forests, K-nearest neighbors (KNN), and Naive Bayes to classify the tickets. In their method, the invoices are divided into three categories: handwritten, machine-printed and receipts. Among these classifiers, KNN performed best, with the classification accuracy reaching 98.4%. However, this method only supports three categories, so it is unable to support subsequent OCR and structured information extraction for tickets.
Jie Yang et al. [20] proposed an intelligent reimbursement system. The invoice classification module in this system divides invoices into three categories: VAT (value-added tax) invoices, common printed invoices and train tickets. The invoice image is obtained with a scanner. The smallest-size tickets are directly judged to be train tickets. Since VAT invoices and common printed invoices have similar sizes, they are distinguished by extracting keywords from the invoice title. However, this method places highly restrictive requirements on the source of the data, as it must be generated by the same type of scanner, and there are few classification categories.
Yingyi Sun et al. [21] classified invoices into three categories: VAT (value-added tax) invoices, train tickets, and taxi invoices. This method proposes an optimized network based on the VGG-16 network. The optimized network model includes 8 convolutional layers and 3 fully connected layers. To prevent the gradient from vanishing, the method adds batch normalization (BN) layers and regularization constraints. The invoice data source is photographs taken with a smartphone. However, this method only classifies 3 categories, making it difficult to meet the requirements of a practical financial accounting application.

III. THE NETWORK MODEL
The basic process of classification is shown in Figure 1. Different types of financial tickets are collected to make a training set, which is then preprocessed. The model is trained on the GPU server and the parameters are adjusted accordingly. The trained Best Weight is deployed to the server to classify the ticket samples online. Its input is a sample of the tickets to be classified, and its output is the corresponding label.

A. THE DATASET
Through long-term collation, we divide financial tickets into 12 major classes with a total of 482 subclasses. Our actual financial software has accumulated more than 10 million real tickets, which continue to increase by approximately 10,000 every day. The names of the major classes of tickets and the number of corresponding subclasses are shown in Table 1. There are 85 categories of tolls, which mainly include vehicle occupation fees, highway tolls, parking fees, and passenger tickets in most places. There are 51 categories of quota tickets collected from most of the regions. Admission tickets consist of 25 categories of common scenic tickets. There are 21 categories of general machine-printed invoices issued by tax bureaus in various regions. There are 21 categories of taxi tickets obtained from various regions. Insurance policies include 7 categories, such as HUA Insurance, CPIC Insurance, and Life Insurance. There are two categories of receipts: machine-printed receipts and handwritten receipts. The Other class includes 15 categories, such as payroll, tax payment certificates, and patent fees. As shown in Figure 2, the major categories of financial tickets have few similarities and large interclass variance, while tickets within each major category have high similarity and small intraclass variance. At the same time, there are many uncertain factors that increase the difficulty of classification, such as the large number of categories of financial tickets, similar structures, incompleteness, occlusions, folds, different sizes, low or uneven brightness, similar colors, geometric distortions, complex shooting backgrounds, and blurring or deformation. Therefore, this work needs to preprocess the dataset to improve the accuracy of classification.

B. TICKET PREPROCESSING
The purpose of ticket preprocessing is to clean and enhance the ticket samples in the dataset. The task consists of the following steps:

1) SEGMENTATION
Some of the single samples in the dataset may contain multiple financial ticket images, or the background of the ticket images may be very complicated. To ensure that the input sample contains only a single ticket image and to reduce background interference, it is necessary to segment the largest area containing a single ticket image. Considering the different styles and sizes of financial tickets, we used the SSD (Single Shot MultiBox Detector) method for automated financial ticket segmentation. Because SSD predicts detection results from CNN feature maps at different levels, it can better detect and segment tickets of different sizes. Figure 3 shows an example of the segmentation of financial tickets.
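As an illustration of this segmentation step, the following sketch (not the paper's actual code) picks the largest sufficiently confident detection from SSD-style outputs, assuming detections are given as (score, box) pairs in pixel coordinates:

```python
def crop_largest_ticket(image_w, image_h, detections, min_score=0.5):
    """Pick the highest-area ticket box above min_score from SSD-style
    detections [(score, (x1, y1, x2, y2)), ...], clamped to the image."""
    best = None
    for score, (x1, y1, x2, y2) in detections:
        if score < min_score:
            continue
        # clamp the box to image bounds before measuring its area
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(image_w, x2), min(image_h, y2)
        area = max(0, x2 - x1) * max(0, y2 - y1)
        if best is None or area > best[0]:
            best = (area, (x1, y1, x2, y2))
    return None if best is None else best[1]

dets = [(0.9, (10, 10, 200, 120)), (0.8, (0, 0, 400, 300)), (0.3, (0, 0, 999, 999))]
print(crop_largest_ticket(640, 480, dets))  # → (0, 0, 400, 300)
```

The returned box would then be used to crop the sample so that only one ticket remains in the image.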

2) REMOVING NON-CLASSIFIED AND DUPLICATE TICKET IMAGES
A large number of repeated samples will affect the training results and may lead to overfitting. We use the similarity between images to remove duplicate ticket images, ensuring the uniqueness of each ticket image in the training dataset.
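One common way to implement similarity-based deduplication is a perceptual average hash; the following minimal sketch (an assumption, since the paper does not specify its similarity measure) hashes 8 * 8 downsampled grayscale grids and drops near-duplicates:

```python
def average_hash(gray):
    """64-bit average hash of an 8x8 grayscale grid (list of 8 rows of 8 ints)."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def deduplicate(images, max_hamming=4):
    """Keep only images whose hash differs from every kept one by > max_hamming bits."""
    kept, hashes = [], []
    for img in images:
        h = average_hash(img)
        if all(bin(h ^ other).count("1") > max_hamming for other in hashes):
            kept.append(img)
            hashes.append(h)
    return kept

a = [[10] * 8 for _ in range(8)]                                 # uniform image
b = [row[:] for row in a]                                        # exact duplicate of a
c = [[(r * 8 + col) * 4 for col in range(8)] for r in range(8)]  # gradient image
print(len(deduplicate([a, b, c])))  # → 2
```

In practice each full-resolution ticket image would first be resized to the 8 * 8 grid; any robust image-similarity measure could be substituted here.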

3) ROTATION CORRECTION
The experimental results show that the classification accuracy on a non-rotated training set is poor. Therefore, without changing the original features of the ticket images, we rotate the samples in all training sets to the canonical ticket angles of 0°, 90°, 180°, and 270° before training.
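The rotation to these canonical angles can be sketched as repeated 90-degree turns of the pixel grid (a simplified stand-in for an image-library call):

```python
def rotate90(grid):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def rotate(grid, angle):
    """Rotate by 0, 90, 180, or 270 degrees using repeated 90-degree turns."""
    assert angle % 90 == 0
    for _ in range((angle // 90) % 4):
        grid = rotate90(grid)
    return grid

img = [[1, 2],
       [3, 4]]
print(rotate(img, 90))   # → [[3, 1], [4, 2]]
print(rotate(img, 180))  # → [[4, 3], [2, 1]]
```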

4) INCREASE THE NUMBER OF RARE TICKET SAMPLES IN THE CLASSIFICATION CATEGORY
As the number of types of financial tickets continues to increase, the number of tickets in some categories will be relatively small, which will seriously affect the accuracy of classification. The dataset therefore needs to be constructed and expanded automatically to ensure that there are enough samples to improve the accuracy of the model, ensuring that the characteristics of even rare categories of tickets can be learned. The number of ticket images for each category in the training set must not be less than k (in our experiments, k = 300). When this condition is not satisfied, enhancement operations are carried out to increase the number of samples for these classes, such as random contrast changes, Gaussian blur, random rotation (−45° to 45°), affine transformation, dirty-point simulation, and deformation. For training the network, each ticket image in the dataset counts as a different input. Increasing the diversity of the dataset helps to achieve more accurate classification results for the different tickets and enhances the generalization ability of the model.
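The oversampling rule above (at least k = 300 images per category) can be sketched as follows; the augment function here is a hypothetical placeholder for the listed enhancement operations:

```python
import random

def augment(sample, rng):
    """Stand-in for the real augmentations (contrast change, Gaussian blur,
    random rotation, affine transform, ...); here it just tags a copy."""
    return (sample[0], "aug-%04d" % rng.randrange(10000))

def balance(dataset, k=300, seed=0):
    """Ensure every class label has at least k samples by adding augmented copies."""
    rng = random.Random(seed)
    by_class = {}
    for label, img in dataset:
        by_class.setdefault(label, []).append((label, img))
    for label, samples in by_class.items():
        while len(samples) < k:
            samples.append(augment(rng.choice(samples), rng))
    return [s for samples in by_class.values() for s in samples]

data = [("taxi", "img%d" % i) for i in range(5)] + \
       [("toll", "img%d" % i) for i in range(400)]
balanced = balance(data, k=300)
counts = {}
for label, _ in balanced:
    counts[label] = counts.get(label, 0) + 1
print(counts)  # → {'taxi': 300, 'toll': 400}
```

Classes already above the threshold (here "toll") are left untouched; only rare classes are topped up.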

C. NETWORK ARCHITECTURE
This section mainly describes the network structure and analyzes the improvements over the weakly supervised fine-grained classification network based on the Discriminative Filter Learning (DFL-CNN) [18] model. In the financial ticket classification scenario, starting from the general DFL-CNN model, we make the following two targeted improvements: 1. higher classification accuracy; 2. faster processing speed.
Because a fully connected structure not only increases the depth and width of the entire network but also causes too many repeated features, increases background noise interference, wastes computing resources, and increases system memory consumption, we use a fully convolutional network (FCN) to replace the fully connected (FC) structure and extract local salient features instead of the features of the entire image. The DFL model uses a softmax loss function in the G-stream, P-stream, and Side-branch modules. The softmax loss function is good at optimizing the distance between major classes, but its accuracy is not high when classifying within subclasses. After analyzing the principles of several different loss functions (Softmax, L-Softmax, A-Softmax, AM-Softmax) [22], [27], [28] and considering the characteristics of the financial ticket dataset, this work uses the large-margin softmax (L-softmax) loss function [22] to replace the softmax loss function of the original network for more accurate classification. The L-softmax loss function makes the features learned within a class more compact, makes it easier to separate the subclasses, and effectively prevents overfitting. Algorithm 1 describes the calculation process of the L-softmax loss function.
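For reference, the L-softmax loss of [22] replaces the angle term cos(θ) in the softmax logit of the target class with a harder margin function ψ; with m the margin hyperparameter (m = 1 recovering standard softmax), the loss for sample x_i with label y_i is:

```latex
L_i = -\log
  \frac{e^{\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})}}
       {e^{\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})}
        + \sum_{j \neq y_i} e^{\|W_j\|\,\|x_i\|\cos(\theta_j)}},
\qquad
\psi(\theta) = (-1)^k \cos(m\theta) - 2k,
\quad \theta \in \left[\tfrac{k\pi}{m}, \tfrac{(k+1)\pi}{m}\right],\;
k \in \{0, \dots, m-1\}.
```

Since ψ(θ) ≤ cos(θ) for m ≥ 2, the target-class logit is penalized unless the angular margin is large, which is what compacts the intraclass features.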
The basic network module is used to extract features. AlexNet [23], VGGNet [24], ResNet [25] and other networks can be used for feature extraction. However, the AlexNet network is shallow, so higher-dimensional features cannot be extracted. The large number of parameters of VGGNet results in a long training time and requires a device with a large storage capacity. ResNet is deeper than the other two networks, and a large number of parameters are saved by not using a fully connected structure. Compared with the above feature extraction networks, the DenseNet [26] network has fewer parameters and is easier to train. It has a greater network depth, can extract high-dimensional feature information, and can easily prevent overfitting. Therefore, we use DenseNet, with its ability to extract highly descriptive features, as the basic network for feature extraction. To reduce the training time while guaranteeing the accuracy of the extracted features, we use some layers of DenseNet121 as the basic network, including an initialization layer, 3 DenseBlock layers, and 3 Transition layers. The initialization layer consists of a 7 * 7 convolution (Conv) with stride 2, a BN, a rectified linear unit (ReLU), and a 3 * 3 max pooling layer (maxPool) with stride 2. The pooling layer is used to reduce the dimensions of the data. DenseBlock 1, DenseBlock 2 and DenseBlock 3 are composed of 6, 12 and 24 dense units, respectively, each of which in turn contains a BN, a ReLU, a 1 * 1 Conv, a BN, a ReLU, and a 3 * 3 Conv. Each transition layer consists of a BN, a ReLU, a 1 * 1 Conv, and a 2 * 2 average pooling layer (avgPool) with stride 2. The 1 * 1 convolution reduces the number of output channels and improves the compactness of the model. The G-Stream module focuses on global information. Its input is a 512 * 14 * 14 feature map output by Transition Layer 3. Then, a 3 * 3 Conv with stride 1 is used to obtain a feature map of size 256 * 12 * 12.
Then, the features are transformed into an nclass * 1 * 1 column vector through an FCN layer. The FCN layer contains a 1 * 1 Conv, a BN, a ReLU and a 1 * 1 AdaptiveAvgPool2D. Finally, the loss loss_g is obtained by passing the nclass * 1 * 1 feature vector through the L-Softmax loss function.
The P-Stream module focuses on key local information. Its input is the feature map output by Transition Layer 3. The feature map of size 512 * 14 * 14 is convolved by a 1 * 1 Conv with stride 1 to obtain a feature map of size (k * nclass) * 14 * 14, which is followed by global max pooling with a kernel size of 14 * 14. Then, the output is passed to an FCN layer, which contains a 1 * 1 Conv and a 1 * 1 AdaptiveAvgPool2D. Finally, the L-Softmax loss function is used to obtain the loss loss_p.
The S-Stream module is a supervision module that highlights the distinguishing parts of the feature map. Its input is the globally max-pooled feature map from the P-Stream module. It uses average pooling on the (k * nclass) * 1 * 1 vector, with k as the growth rate, to obtain an nclass * 1 * 1 vector. This operation is called Cross-Channel pooling. Finally, the L-Softmax loss function is used to obtain the loss loss_s.
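The Cross-Channel pooling operation can be expressed compactly; the following sketch averages each group of k channel responses (the k discriminative filters assigned to one class) into a single per-class score:

```python
def cross_channel_pool(vec, k, nclass):
    """Average each consecutive group of k channels of a (k * nclass)-long
    vector into one score per class, as in the S-Stream's Cross-Channel pooling."""
    assert len(vec) == k * nclass
    return [sum(vec[c * k:(c + 1) * k]) / k for c in range(nclass)]

# k = 3 discriminative filters per class, nclass = 2 classes
pooled = cross_channel_pool([1.0, 2.0, 3.0, 10.0, 20.0, 30.0], k=3, nclass=2)
print(pooled)  # → [2.0, 20.0]
```

The resulting nclass-long vector is what the S-Stream passes to its L-Softmax loss.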

IV. EXPERIMENTAL RESULTS
To evaluate the effectiveness of the FTC network in the classification of financial tickets, we designed two sets of experiments; the dataset used is shown in Table 3. The input image for the experiments is an RGB image.

A. EXPERIMENTAL SETUP
The algorithm in this work is implemented in Python using the PyTorch deep learning framework. The model is deployed on a Linux server running the CentOS Linux release 7 (Core) operating system. The GPU is a single NVIDIA Tesla P40 with 24 GB of video memory, the CPU frequency is 2.20 GHz, and the server has 32 GB of memory. Details of the layer configuration of the model in the experiments are shown in Table 2.

B. THE DATASET
The dataset samples in these experiments were obtained with imaging equipment such as scanners, mobile phones, and digital cameras, and were also collected from the SaaS financial software platform. The experiments use 10 major classes of financial tickets, comprising a total of 100 subclasses of commonly used tickets, each of which consists of 300 ticket samples, for a total dataset of 30,000 samples. The training data and validation data are split 5:1, that is, there are 250 training samples and 50 validation samples in each subclass. The training data total 25,000 samples, and the validation data total 5,000 samples. The dataset used in the experiments is shown in Table 3, which covers 86 kinds of bills from 48 banks as well as VAT, toll, taxi, quota, and train tickets. 13 of the 100 classes contain multiple ticket styles; therefore, the fine-grained classifier can fully show its advantages. To improve the classification accuracy, the aspect ratio of each input image is maintained during the training and testing stages so that the features of the image are not distorted. We resize the short side of each input image to 448 pixels, and the long side is resized to maintain the original aspect ratio.
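The 5:1 per-subclass split can be sketched as follows (the shuffling and the seed are illustrative assumptions; the paper does not state how samples are assigned to the two sets):

```python
import random

def split_per_subclass(samples_by_class, train_per_class=250, seed=42):
    """Split each subclass 5:1 into train/validation (250/50 of its 300 samples)."""
    rng = random.Random(seed)
    train, val = [], []
    for label, samples in samples_by_class.items():
        shuffled = samples[:]
        rng.shuffle(shuffled)
        train += [(label, s) for s in shuffled[:train_per_class]]
        val += [(label, s) for s in shuffled[train_per_class:]]
    return train, val

# 3 illustrative subclasses of 300 samples each
data = {("bank", i): ["img%d" % j for j in range(300)] for i in range(3)}
train, val = split_per_subclass(data)
print(len(train), len(val))  # → 750 150
```

Splitting inside each subclass, rather than over the pooled dataset, keeps the 5:1 ratio exact for every category.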

C. THE METRICS
We use accuracy and recall as the basic indices to evaluate the algorithm and model, and we focus on the time consumption per sample from the perspective of financial accounting application requirements. The metrics are shown in Table 4.
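Since Table 4 is not reproduced here, the following sketch shows one possible reading of the threshold-dependent accuracy and recall used in the experiments, where samples whose confidence falls below the threshold are left unclassified (this interpretation is an assumption, not the paper's stated definition):

```python
def thresholded_metrics(predictions, threshold):
    """predictions: list of (true_label, predicted_label, confidence).
    Samples below the threshold are left unclassified; accuracy is computed
    over the classified samples and recall over all samples."""
    classified = [(t, p) for t, p, c in predictions if c >= threshold]
    correct = sum(1 for t, p in classified if t == p)
    accuracy = correct / len(classified) if classified else 0.0
    recall = correct / len(predictions) if predictions else 0.0
    return accuracy, recall

preds = [("vat", "vat", 0.999), ("taxi", "taxi", 0.95),
         ("toll", "taxi", 0.999), ("vat", "vat", 0.50)]
print(thresholded_metrics(preds, 0.99))  # → (0.5, 0.25)
```

Raising the threshold trades recall (more unclassified samples routed to manual handling) for accuracy on the samples the system does classify, which matches the low fault-tolerance requirement discussed in Section IV.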

D. THE RESULTS
We use stochastic gradient descent (SGD) with momentum to optimize the model. The initial value of the learning rate (lr) is set to 0.001, the momentum is set to 0.9, the weight decay is set to 0.000005, the training batch size is set to 8, and the maximum number of iterations is set to 7,546.

1) EXPERIMENT 1 CLASSIFICATION WITH DIFFERENT NETWORK STRUCTURES
To verify the accuracy and real-time classification of the deep learning algorithm on the financial ticket training set, we use four different mainstream CNN networks to classify financial tickets: ResNet50, VGG16, DenseNet121, and DFL-DenseNet121. All networks use the softmax loss function. The loss coefficients λ, β, γ of DFL-DenseNet121 are 1, 1, and 0.1, respectively. Figure 5 (a) shows a plot of the number of samples that were successfully classified as the threshold was increased for the different network structures. Figure 5 (b) shows a plot of the number of misclassified samples for the different network structures with increasing threshold, based on Figure 5 (a). Due to the need for a low fault-tolerance rate in the financial accounting application, it is necessary to reduce the amount of manual labor while ensuring high accuracy. Figure 5 shows that the number of correctly classified samples for the four networks is similar when the confidence threshold is less than 99%. However, at a 100% confidence level, the DFL-DenseNet121 network yielded only a small number of unclassified samples compared to the abovementioned networks, accounting for only 3.7% of the total test samples.
The analysis of the experimental results in Table 5 suggests that the feature extraction capabilities of the four CNN networks are very good, and they can successfully classify financial tickets. With a confidence threshold requirement of 99%, the prediction accuracy and recall of ResNet50, VGG16 and DenseNet121 have reached more than 93%. The prediction accuracy and recall of DFL-DenseNet121 are 98.16% and 98.68%, respectively. When the confidence threshold is 100%, the classification accuracy of the DFL-DenseNet121 network is 96.10%, and the recall is 96.35%, which is much higher than those of the other three networks. In contrast, the classification accuracies of the ResNet50, VGG16 and DenseNet121 networks are only 9.88%, 52.78% and 1.88%, respectively, and the recalls are only 9.81%, 52.38% and 1.85%, respectively. Based on the above, this study chooses the fine-grained network structure DFL-DenseNet121 as the network model for financial ticket classification.

2) EXPERIMENT 2 COMPARISON OF DIFFERENT PARAMETERS
One of the goals of this work is to improve the weakly supervised fine-grained classification network DFL-CNN. To verify the improvement, the following comparative experiments were performed. It can be seen from Table 6 that the fully connected structure uses the features of the whole image, which contain more background noise, reducing the accuracy. In this work, the detailed information of the salient areas (locally significant features) is combined with fully convolutional layers to remove the interference of background noise and achieve an accuracy of 99.36%. Table 7 shows that our accuracy is improved by 0.96% and 0.31% compared to references [19] and [21], respectively. The number of classification categories in this work is much larger and is increased while maintaining similar accuracy. In short, this work is far superior to the methods in references [19]-[21] in terms of the categories of financial tickets supported and the input sources of these supported financial tickets.

V. CONCLUSION
The classification of a vast number of types of financial tickets results in a heavy classification workload and low work efficiency for accounting staff. Therefore, we proposed an FTC network to automatically classify financial tickets captured by scanners, smartphones, digital cameras and an actual financial accounting software platform. We used an end-to-end network structure and a deep convolutional network, DenseNet, as the basic network to extract highly descriptive features, as well as a fully convolutional network (FCN) to replace the fully connected (FC) structure of DFL-CNN, which can effectively reduce the depth and width of the entire network, the number of training parameters and the system resource consumption. Additionally, we used the L-Softmax loss function, which is good at optimizing the distance between classes, to improve the classification accuracy. To further improve the accuracy of classification, the input ticket images were preprocessed during the training and testing stages, including ticket background segmentation, cleaning, rotation, dataset enhancement, and some further operations. Additionally, the short sides of the input images were resized to 448 pixels, and the aspect ratio was maintained to avoid distortion. To improve the generalization ability of the model, 448 * 448 image blocks were randomly cropped from the resized images. The experimental results indicate that the proposed FTC network can achieve both high accuracy (up to 99.36%) and high processing speed. Compared with existing financial ticket classification methods, this method can support a greater number of financial ticket types and achieve higher classification accuracy and processing efficiency.