Multi-Instance Learning Algorithm Based on LSTM for Chinese Painting Image Classification

To address weakly supervised learning in traditional Chinese painting image classification, a novel multi-instance learning (MIL) algorithm based on a Long Short-Term Memory neural network with an attention mechanism (ALSTM-MIL) is proposed. First, using Pyramid Overlapping Grid Division (POGD), a multi-instance modeling scheme is designed to convert Chinese painting images into multi-instance bags, thereby transforming Chinese painting image classification into a MIL problem. Second, an efficient sequence generator is designed. It selects discriminative instances from the positive bags, constructs a discriminative instance set (DIS), and converts multi-instance bags into equal-length ordered sequences. Third, an LSTM network model with an attention mechanism is designed to perform semantic analysis on multi-instance bags and obtain their memory coding features, which are then passed to a Softmax classifier to achieve semantic classification of traditional Chinese painting images. Experimental results on the Chinese painting (CP) image set show that building an LSTM network on the visual feature set is feasible and that the proposed MIL algorithm outperforms the compared classification algorithms.


I. INTRODUCTION
With the development of the Internet and high-fidelity imaging technology, many art galleries and museums now provide online viewing services for digital artworks [1], [2]. Among these works, Chinese painting, an important expression of Chinese culture, has attracted growing attention from online visitors. Research on the classification of traditional Chinese painting images also helps to preserve and carry forward traditional culture. Therefore, to help art museums manage Chinese painting images efficiently, and to help visitors browse them and better understand the connotation of Chinese painting, research on the automatic classification of Chinese painting images has become an important topic in the field of computer vision.
The associate editor coordinating the review of this manuscript and approving it for publication was Jeon Gwanggil.
Chinese painting is created with a brush dipped in water, ink, and pigment on silk or paper, and has a unique artistic style. Unlike Western painting, Chinese painting focuses on artistic conception. The artist conceives and imagines by observing things and outlines them with lines, reflecting a subjective grasp of objects and artistic refinement, whereas Western painting focuses on real objects and usually regards an artwork as a faithful copy of its subject. By subject, Chinese painting can be divided into figures, landscapes, flowers and birds, etc. By painting technique, it can be divided into Gongbi and Xieyi. The traditional approach to Chinese painting classification is to extract and fuse global and local features of the image and then send them to a classifier. Combining the advantages of the contourlet transform and the gray-level co-occurrence matrix, Wang et al. [3] proposed a multi-scale and multi-color gamut texture feature extraction method to classify traditional Chinese painting images. To find a better image feature representation in a new signal domain, Sheng [4] proposed obtaining the depth information of Chinese painting art in the wavelet domain, using the different expressions of artistic style displayed by image structures at different resolutions and frequency bands. Beyond the traditional support vector machine classifier, convolutional neural networks (CNN) have been widely used in pattern recognition with the recent development of machine learning [5], and they achieve very good image classification performance. However, traditional Chinese painting images have inherent characteristics: the shape, texture, light and shade, and mood of the painted object are rendered mainly with lines, supplemented by points and surfaces. Classification therefore relies mainly on local brush strokes, that is, on the combination of lines, points, surfaces, and stroke thickness as features. Consequently, with a limited number of training images, CNNs cannot be applied well to the classification of Chinese painting images.
To address Chinese painting image classification, this article proposes a MIL algorithm based on LSTM. First, multi-instance modeling is used to extract the local brush-stroke features of a traditional Chinese painting image, and the image is represented as a bag that characterizes its local point and line features. Second, a sequence generator is designed that extracts the most discriminative set of brush-stroke features from the multi-instance bags, both to reduce redundancy and to provide equal-length ordered semantic sequences for the LSTM network. Third, an LSTM model with an attention mechanism is designed to memorize the dependencies between the local brush strokes of the image, and its output is finally sent to the Softmax layer for classification. The flowchart of the proposed algorithm is shown in Figure 1, and its main contributions are as follows:
(1) A pseudo-sequence generator is designed to convert unordered visual feature sets (bags) of unequal size into equal-length sequence signals.
(2) A novel MIL algorithm based on LSTM with an attention mechanism (ALSTM-MIL) is proposed for Chinese painting classification. To the best of our knowledge, this is the first work to introduce an LSTM network to the MIL problem. Compared with other bag embedding methods, the LSTM is better able to capture the causal relationships among instances (brush-stroke features) and thus to achieve semantic parsing of a bag (image).
(3) To demonstrate the performance of the proposed algorithm on the Chinese Painting (CP) image set, we conducted a sensitivity test and classification accuracy experiments.
The remainder of the paper is organized as follows. Section II introduces related work on Chinese painting image classification and MIL algorithms. Section III describes the details of the multi-instance modeling. Section IV provides the details of the sequence generator. Section V presents the LSTM model with an attention mechanism. The experimental results are presented in Section VI, Section VII discusses future work, and Section VIII concludes the paper.

II. RELATED WORK
A. CHINESE PAINTING IMAGE CLASSIFICATION ALGORITHM
Retrieval-oriented image classification has long been an active research topic in multimedia and has produced many effective results. However, there is less research specifically on Chinese painting image classification. At present, research on Chinese painting mainly covers image authentication and classification based on the author's style, painting technique, and content. In the authenticity analysis of artwork, Buchana et al. [6] were able to distinguish the original from the fake in low-quality digital representations by training an optimal trade-off synthetic discriminant function (OTSDF) filter on the coarsely segmented difference image and the location of the original painting. Polatkan et al. [7] used hidden Markov trees to model the wavelet coefficients of painting images and used the model parameters as input features for supervised machine learning to distinguish a copy from the original work. In the classification of paintings based on the author's style, Sun et al. [8] used a Monte Carlo convex hull feature selection model to integrate basic feature descriptors and then used support vector machines to classify the works of different artists. Li and Wang [9] designed a general framework for the classification of Chinese paintings and used a mixture of two-dimensional multi-resolution hidden Markov models (MHMM) to represent the stroke attributes of different artists. Sheng and Jiang [10] proposed extracting histogram-based local features to express the style of Chinese painting images and designed a fusion scheme of window and entropy balance to optimize the classification results. According to painting technique, Chinese painting can be divided into two categories, namely Xieyi painting and Gongbi painting. In this line of work, Gao et al. [11] integrated SIFT feature detectors and edge detection to obtain key areas of Chinese paintings, described the visual characteristics of the key areas and the differences within them to obtain image features, and used a cascade classification strategy over features of different dimensions to classify Gongbi and Xieyi (freehand) painting. Jiang et al. [12] combined the Discrete Cosine Transform (DCT) and a Convolutional Neural Network (CNN) to propose a classification model. Chinese painting images can also be classified by content, for example into ancient tree paintings, figure paintings, flower and bird paintings, Jiangnan water villages, etc. Bao et al. [13] classified themes by extracting the semantic information of Chinese painting image scripts and adopting a multi-task joint sparse method.

B. MIL RELATED ALGORITHMS
In image classification, each image has its own region of interest, which may be one or several regions of the image, and the remaining regions are uninteresting regions. This assumption is similar to multi-instance learning [14]. Therefore, multi-instance learning is widely used in image classification tasks [15]. Assuming that any image is a multi-instance bag and the image is divided into multiple sub-blocks, the region of interest of the sub-block is taken as an instance.
After MIL was proposed, many classic MIL algorithms were presented [16], including axis-parallel hyper-rectangles, Citation-kNN, Diverse Density (DD), DD with Expectation Maximization (EM-DD), and the MI/mi-SVM algorithms [17]. Because the training samples in MIL are bags, i.e. unordered sets composed of different low-level visual features of images, a very effective family of MIL algorithms converts each bag into a single representation vector and then uses standard single-instance learning (SIL) methods (e.g. SVM) to solve the MIL problem. Examples include DD-SVM [18], Multi-Instance Learning via Embedded Instance Selection (MILES) [19], LSA-MIL [20], MI-J-SC [21], EC-SVM [22], MILDM [23], and miFV/miVLAD [24]. Unfortunately, few existing feature representation methods describe the semantics of images (bags) effectively, so it is difficult to adapt well-known SIL methods to MIL problems. In recent years, owing to the excellent performance of CNNs in image classification, some MIL algorithms combined with CNNs have also been proposed. For example, Xu et al. [25] used a deep learning method to obtain the CNN features of images and then trained MIL classifiers on these features for medical image analysis. Wu et al. [26] focused on image classification and image annotation: a pre-trained deep learning model was used to predict the labels of instances in the MIL framework, and the labels of all instances in a bag were then combined to predict the final label of the bag. To address medical image classification, He et al. [27] designed a multi-instance convolutional neural network algorithm based on prototype learning and a bag feature transformation function. To find an effective and efficient representation for image classification, Tang et al. [28] extended traditional MIL methods to explicitly learn more than one multi-instance deep discriminative pattern (MiDDP) in the positive class by stochastic gradient descent, and proposed a novel MIL algorithm named MiDDP. To classify remote sensing scene images, Ajjaji et al. [29] divided each scene image into multiple sub-images (the four corners and the center) to generate instances and used SqueezeNet to extract highly descriptive features from each instance. The features are sent to a deep neural network that learns an appropriate weight for each instance feature, and the features are fused by a weighted average to obtain the final representation. Kausik et al. [30] proposed a deep convolutional multi-instance algorithm. First, the instances in a bag are sent to VGGNet to obtain high-level features. Then, a multi-instance pooling (MIP) layer is introduced after the eighth (fully connected) layer of the network. It consists of two branches, a multi-instance max pooling layer and an instance-level layer. The two branches generate bag-level MIP features and instance-level features, respectively, and map them to the decision layer to generate decisions.

III. MULTI-INSTANCE MODELING
In multi-instance learning, each bag contains specific regional features, that is, positive instances. Their exact location does not need to be known: as long as a bag contains positive instances, its type can be judged. The brushwork of each Chinese painter is unique, and the characteristics of strokes are difficult to capture. Besides, most Chinese paintings focus on artistic conception and are strongly subjective, and the depicted objects change with the author's emotions and are therefore more abstract. For these reasons, the multi-instance approach is well suited to weakly labeled and ambiguous Chinese painting images. To capture both the global and the local characteristics of Chinese painting images, that is, the overall style and the details of local brush strokes, this article designs a blocking method called ''Pyramid Overlapping Grid Division (POGD)'' to achieve multi-bag multi-instance modeling, as illustrated in Figure 2. In the subsequent experiments, the block size of the POGD method is set to pixels, and the block moves from left to right and from top to bottom in steps of 30 pixels. The image reduction ratio is 0.8, and the blocking stops when the width or height of the image is reduced to 244 pixels or less. Brush-stroke features are then extracted from each sub-block and expressed as an instance, and the instances of an image together form a bag.
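The blocking procedure described above can be sketched as follows. This is only an illustrative sketch: `pogd_blocks` and all of its parameter names are hypothetical, `block` is deliberately left without a default value, and the stop rule is interpreted here as "a pyramid level is processed only while both sides are larger than 244 pixels", which is an assumption about the intended behavior.

```python
def pogd_blocks(width, height, block, stride=30, ratio=0.8, min_side=244):
    """Enumerate overlapping square blocks over an image pyramid.

    Returns a list of (level, x, y, size) tuples, where (x, y) is the
    top-left corner of a block inside the image rescaled for that level.
    Assumption: a level is skipped (and blocking stops) as soon as its
    width or height is min_side pixels or less.
    """
    blocks = []
    level = 0
    while True:
        # Size of the image at this pyramid level (shrunk by ratio per level).
        w = round(width * ratio ** level)
        h = round(height * ratio ** level)
        if w <= min_side or h <= min_side:
            break
        # Slide the block left-to-right, top-to-bottom with a fixed stride.
        for y in range(0, h - block + 1, stride):
            for x in range(0, w - block + 1, stride):
                blocks.append((level, x, y, block))
        level += 1
    return blocks


if __name__ == "__main__":
    demo = pogd_blocks(400, 400, block=100)
    print(len(demo), "blocks over", len({b[0] for b in demo}), "levels")
```

Each returned tuple identifies one sub-block, i.e. one instance of the bag built for the image; feature extraction would then be applied per block.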
For each instance, texture features, color features, and SIFT features are extracted separately. To take the spatial location information of the local Chinese painting image into account, a 4-level wavelet decomposition with the bior4.4 wavelet is applied to each instance, and the mean and variance of the energy distribution of each sub-band at each decomposition level are extracted as texture features [31], so that each instance yields a 26-dimensional texture feature vector. Color moments are used to extract color features for each instance: denoting the pixels of the i-th HSV color channel by $\{p_i(t) \mid t = 1, 2, \ldots, N\}$, the first-order (mean), second-order (variance), and third-order (skewness) color moments are extracted for each channel. Each instance therefore yields a 9-dimensional color feature vector.
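As a small illustration of the 9-dimensional color feature, the sketch below computes three moments for each of the three HSV channels. The function names are hypothetical, the input is assumed to be a plain list of (h, s, v) tuples, and the third moment is taken as the signed cube root of the third central moment, which is one common convention and an assumption here.

```python
import math


def color_moments(channel):
    """Return (mean, variance, skewness-like moment) of one color channel."""
    n = len(channel)
    mean = sum(channel) / n
    variance = sum((p - mean) ** 2 for p in channel) / n
    third = sum((p - mean) ** 3 for p in channel) / n
    # Signed cube root keeps the moment in the same units as the pixel values.
    skew = math.copysign(abs(third) ** (1.0 / 3.0), third)
    return mean, variance, skew


def color_feature(hsv_pixels):
    """Concatenate the three moments of the H, S and V channels (9 values)."""
    feature = []
    for c in range(3):
        feature.extend(color_moments([p[c] for p in hsv_pixels]))
    return feature


if __name__ == "__main__":
    pixels = [(0.1, 0.5, 0.9), (0.2, 0.4, 0.8), (0.3, 0.6, 0.7)]
    print(color_feature(pixels))
```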
Traditional Chinese paintings have a long history. When they are scanned into digital images, defects in the state of the painting are inevitable, and poor preservation conditions over the generations lead to a low image registration rate. Therefore, this article also uses the scale-invariant feature transform (SIFT) [32] to describe each instance. SIFT was first proposed by Lowe [33] and is also known as a local image feature descriptor. It is invariant to image rotation, scaling, and brightness changes. It finds extreme points in the scale space, filters them to obtain feature points, and extracts 128-dimensional feature vectors that are invariant to position, scale, and rotation around the feature points.

IV. SEQUENCE GENERATOR
Because the training samples in MIL are bags, which contain varying numbers of instances, a key module of the proposed algorithm is the ''sequence generator'', whose function is to transform each bag into an equal-length pseudo-sequence signal.
According to the definition of MIL-based image classification, a bag $B_i$ corresponds to an image, and an instance $X_{ij}$ corresponds to the visual feature of a local region of that image. In the instance feature space, if several images share the same semantics, they must contain some unique instances that reflect the essential commonality of this semantics and have strong discriminative ability compared with the other images.
Therefore, to transform each bag into a pseudo-sequence signal, and inspired by the DD function [23], this article defines a new criterion function to pick out such unique instances from the bags in the training set. We call them the ''discriminative instance set (DIS)'' and use them to guide the construction of pseudo-sequence signals.
Specifically, each image category is regarded in turn as the positive class, and the images of the other categories are regarded as negative bags. As a result, the multi-instance training set L can be rewritten as $L = \{(B_i, y_i) : i = 1, 2, \ldots, N\}$, where $y_i \in \{-1, +1\}$ is the label.
In the multi-instance training set L, let X denote an instance from any positive bag. Its ''uniqueness'' function is defined by Equation (1), where $y_i^* = (1 - y_i)/2$ (i.e. $y^* = 0$ for any positive bag and $y^* = 1$ for any negative bag), $X_i^*$ denotes the instance closest to X in the i-th bag $B_i$, and $\lVert X_i^* - X \rVert_2$ denotes the Euclidean distance between $X_i^*$ and X. The geometric meaning of Equation (1) is as follows: for an instance X in any positive bag, if every positive bag contains at least one instance close to X and all instances in all negative bags are far from X, then the uniqueness of X is large, and the more X deserves to be selected as a discriminative instance.
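To make the geometric meaning concrete, the sketch below scores an instance with a Diverse-Density-style product. This specific functional form, $\prod_i \lvert y_i^* - \exp(-\lVert X_i^* - X \rVert_2^2) \rvert$, is an assumption chosen only because it is consistent with the definitions of $y_i^*$ and $X_i^*$ above; it should not be read as the exact form of Equation (1). All function names are hypothetical.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def uniqueness(x, bags, labels):
    """DD-style uniqueness of instance x (assumed form, see lead-in).

    bags   : list of bags, each a list of feature vectors
    labels : list of +1 / -1 bag labels
    The factor of a positive bag is large when its closest instance is near x;
    the factor of a negative bag is large when its closest instance is far.
    """
    score = 1.0
    for bag, y in zip(bags, labels):
        y_star = (1 - y) / 2            # 0 for positive bags, 1 for negative
        d2 = min(sq_dist(inst, x) for inst in bag)
        score *= abs(y_star - math.exp(-d2))
    return score


if __name__ == "__main__":
    bags = [[(0.0, 0.0), (5.0, 5.0)], [(0.1, 0.0), (9.0, 9.0)], [(4.0, 4.0)]]
    labels = [1, 1, -1]
    print(uniqueness((0.0, 0.0), bags, labels))
    print(uniqueness((5.0, 5.0), bags, labels))
```

In the toy data, (0, 0) has a near neighbor in every positive bag and lies far from the negative bag, so it scores much higher than (5, 5).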
Finally, using the above ''uniqueness'' function, we calculate the uniqueness of each instance in all training bags and collect the top-T instances with the largest uniqueness as the final DIS, recorded as DIS = $\{V_1, V_2, \ldots, V_T\}$. Under the guidance of the DIS, any bag $B_i$ can be transformed into a pseudo-sequence composed of T signals with attention, i.e. T pairs $(X_t, w_t)$, $t = 1, 2, \ldots, T$, where $X_t \in \mathbb{R}^d$ and $w_t$ represent the t-th signal and its attention in the pseudo-sequence, respectively. Specifically, we calculate the distance between each instance in $B_i$ and $V_t$. Assuming that the distance between instance $F_{i,J}$ and $V_t$ is the smallest, then
$$X_t = F_{i,J}, \qquad w_t = \frac{1}{\lVert F_{i,J} - V_t \rVert_2},$$
that is, $X_t$ is the instance of $B_i$ closest to $V_t$ ($t = 1, 2, \ldots, T$) according to the Euclidean distance, and $w_t$ is the inverse of this minimum distance (the smaller the distance, the greater the attention). The detailed steps for constructing pseudo-sequence signals from multi-instance bags are summarized in Algorithm 1.

V. LSTM WITH ATTENTION MECHANISM
With the increase in the number of network layers, an RNN suffers a large loss during information synthesis: it tends to focus on the content learned in the last stage and cannot achieve long-term memory. LSTM is an improved RNN structure. It overcomes the vanishing and exploding gradient problems of the RNN by changing the hidden layer and adding control gates and a memory unit $c^{\langle t \rangle}$. The memory unit is the key to the network, since it stores and transmits useful sequence information. The control gates consist of an input gate $i^{\langle t \rangle}$, a forget gate $f^{\langle t \rangle}$, and an output gate $o^{\langle t \rangle}$, which respectively control what is input to, forgotten from, and output from the current state. When a new input arrives, the input gate works to store the new information.
At the same time, the forget gate determines how much of the information from before that moment is forgotten, and the output gate controls how much of the cell state $c^{\langle t \rangle}$ is transmitted to the final state $h^{\langle t \rangle}$. The internal structure of a single-layer LSTM network is shown in Figure 3, and the LSTM activations are calculated with the standard gate and memory-unit update equations.
The architecture of the proposed network is shown in Figure 4. An attention mechanism is introduced in this architecture. Through learning, it assigns different weights to the inputs of the upper LSTM layer and can dynamically capture the key parts of the input. Inputs that are helpful for classification receive higher weights so that useful information is retained, while inputs that do not help the decision receive low or zero weights so that their information is discarded. The weight of each input vector is calculated by the attention model function. Let $H \in \mathbb{R}^{d \times n}$ be an input sequence $\{h^{\langle 1 \rangle}, h^{\langle 2 \rangle}, \ldots, h^{\langle n \rangle}\}$ in which each element is an input vector, d is the size of the hidden layer, and n is the length of the sequence. The attention model function maps H to one weight per time step, and these weights form the attention vector $\alpha_i$ of the i-th layer. The network architecture consists of three LSTM layers with attention mechanism layers and a softmax classification layer. The input $x^{\langle t \rangle}$ of the first LSTM layer is an ordered semantic sequence. The input of each upper LSTM layer is the output of the layer below it, that is, the final state $h^{\langle t \rangle}$ of that layer. A multi-layer LSTM can fully capture the long-term dependencies of the input sequence. To provide a confidence score at the last time step n, a softmax layer is added on top of the highest LSTM layer.
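For reference, the gate and memory-unit behavior described in this section matches the standard LSTM update, which can be written with the symbols used here as shown below. The weight matrices $W_\ast$, the bias vectors $b_\ast$, and the candidate state $\tilde{c}^{\langle t \rangle}$ are notational assumptions introduced for this sketch; $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product.

```latex
\begin{aligned}
i^{\langle t \rangle} &= \sigma\!\left(W_i\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_i\right), \\
f^{\langle t \rangle} &= \sigma\!\left(W_f\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f\right), \\
o^{\langle t \rangle} &= \sigma\!\left(W_o\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o\right), \\
\tilde{c}^{\langle t \rangle} &= \tanh\!\left(W_c\,[h^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c\right), \\
c^{\langle t \rangle} &= f^{\langle t \rangle} \odot c^{\langle t-1 \rangle} + i^{\langle t \rangle} \odot \tilde{c}^{\langle t \rangle}, \\
h^{\langle t \rangle} &= o^{\langle t \rangle} \odot \tanh\!\left(c^{\langle t \rangle}\right).
\end{aligned}
```

The fifth line shows why the memory unit supports long-term memory: when $f^{\langle t \rangle}$ is close to 1 and $i^{\langle t \rangle}$ close to 0, $c^{\langle t \rangle}$ is carried forward almost unchanged.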
In the first layer, the input is $x_i$, $i = 1, 2, \ldots, n$. After processing by the LSTM unit and the attention mechanism, the output state $h_1$ and the attention vector $\alpha_1$ are obtained, and their product is used as the input of the next layer. By analogy, the input to the softmax classifier is the product of the attention vector and the output states at all time steps of the third LSTM layer. The choice of activation function affects the nonlinear expression ability of the network model. In this article, tanh is used as the activation function because it is symmetric about the origin, which improves training on the input data and speeds up convergence. The tanh activation function is applied to the output of the third LSTM layer. After the softmax layer, the sequence X receives a probability score for each category $C_i$, $i = 1, 2, \ldots, |L|$.
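The final stage described above (attention-weighted LSTM states, tanh, softmax) can be sketched in plain Python as follows. The scoring rule `w · tanh(h)` used to produce the attention vector, the scoring vector `w`, and the classifier `weights`/`biases` are all assumptions made for illustration, since the attention model function is not spelled out in the text; only the overall flow follows the description.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(states, w):
    """Weight the n hidden states (each of size d) by an attention vector.

    w is a hypothetical learned scoring vector of size d. Returns the
    attention vector alpha (one weight per time step) and the weighted
    sum of the states.
    """
    scores = [sum(wk * math.tanh(hk) for wk, hk in zip(w, h)) for h in states]
    alpha = softmax(scores)
    d = len(states[0])
    pooled = [sum(a * h[k] for a, h in zip(alpha, states)) for k in range(d)]
    return alpha, pooled


def classify(pooled, weights, biases):
    """Apply tanh to the pooled state, then a linear layer and softmax."""
    hidden = [math.tanh(v) for v in pooled]
    logits = [sum(wk * hk for wk, hk in zip(row, hidden)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # h<1>, h<2>, h<3>
    alpha, pooled = attention_pool(states, w=[1.0, 0.0])
    print(alpha)
    print(classify(pooled, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0] * 3))
```

In a trained model `w`, `weights`, and `biases` would be learned jointly with the LSTM layers; here they are fixed toy values.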

VI. EXPERIMENTAL RESULTS
A. IMAGE SET AND EXPERIMENTAL SETUP
To verify the performance of the proposed ALSTM-MIL algorithm in Chinese painting image classification, we applied it to the CP image set. The CP image set contains 2,000 images in JPEG format with size 256 × 384 or 384 × 256. There are five categories in total, each containing 400 images: ancient trees, people, flowers and birds, Jiangnan water-bound towns, and ink paintings. Figure 5 shows samples from the CP image set.
In the experiments, we also chose the Corel image library [37] for comparison. It is divided into 20 categories, namely Africa, Beach, Building, Buses, Dinosaurs, Elephants, Flowers, Horses, Mountains, Food, Dogs, Lizards, Fashion models, Sunset scenes, Cars, Waterfalls, Antique furniture, Battle ships, Skiing, and Desserts. Each category contains 100 color images, for a total of 2,000 images. We refer to the 1,000 images of the first 10 categories as COREL 1000 [38], [39], and to the entire 2,000 images as the COREL 2000 image set.
In the subsequent experiments, 60% of the images in each category are randomly selected to form the training set, and all the remaining images form the test set. In the ALSTM-MIL algorithm, the number of LSTM network neurons is set to 1024, and the weights are initialized with uniform random numbers between −0.05 and 0.05. The learning rate, one of the input parameters of the algorithm, is set to 0.001. The batch size is set to 64, Adam [40] is adopted to adjust the learning rate automatically, and training is run for 500 epochs. All experiments were performed under Windows 10 and Python 3.6 on an Intel i7 2.8 GHz CPU with 16 GB of RAM.
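The 60/40 split described above can be sketched as a per-category random partition. The function name and the `seed` argument are hypothetical, and applying the 60% fraction separately within each category is an assumption about how the selection was carried out.

```python
import random


def split_per_category(labels, train_fraction=0.6, seed=0):
    """Randomly split sample indices into train/test sets, category by category.

    labels : list with one category label per image
    Returns (train_indices, test_indices).
    """
    rng = random.Random(seed)          # fixed seed only for reproducibility
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return train, test


if __name__ == "__main__":
    # Five categories of 400 images each, as in the CP image set.
    labels = [c for c in range(5) for _ in range(400)]
    train, test = split_per_category(labels)
    print(len(train), len(test))
```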

B. SENSITIVITY TEST
When Algorithm 1 is used to construct the DIS and the pseudo-sequences, one important parameter, T, must be predefined. To examine the influence of T on the proposed algorithm, we varied T from 200 to 450 with a step size of 50. The average classification accuracies over ten rounds of repeated training and testing on the CP image set, COREL 1000, and COREL 2000 are shown in Figure 6.
As can be seen from Figure 6, the value of T affects the performance of the proposed algorithm. When T is too small, the constructed pseudo-sequences are incomplete, which weakens the expressive ability of the memory coding features. Conversely, when T is too large, it only increases the redundancy of the pseudo-sequences and does not help to improve classification accuracy. Therefore, in the subsequent comparative experiments, T was set to 200.
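To make the role of T concrete, the DIS selection and the bag-to-sequence conversion of Section IV can be sketched as follows. The function names are hypothetical, the uniqueness scores are assumed to be precomputed, and the small constant `eps` that guards against division by zero is an addition of this sketch, not part of the described method.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_dis(instances, scores, T):
    """Keep the T instances with the largest uniqueness scores."""
    ranked = sorted(range(len(instances)), key=lambda k: scores[k], reverse=True)
    return [instances[k] for k in ranked[:T]]


def bag_to_sequence(bag, dis, eps=1e-12):
    """Turn a bag of any size into a pseudo-sequence of len(dis) pairs.

    For each discriminative instance V_t, X_t is the closest instance of the
    bag and w_t is the inverse of that minimum distance.
    """
    sequence = []
    for v in dis:
        nearest = min(bag, key=lambda inst: euclidean(inst, v))
        sequence.append((nearest, 1.0 / (euclidean(nearest, v) + eps)))
    return sequence


if __name__ == "__main__":
    dis = select_dis([(0.0, 0.0), (3.0, 3.0), (10.0, 10.0)], [0.9, 0.1, 0.8], T=2)
    small_bag = [(1.0, 0.0)]
    large_bag = [(1.0, 0.0), (9.0, 10.0), (5.0, 5.0)]
    print(len(bag_to_sequence(small_bag, dis)), len(bag_to_sequence(large_bag, dis)))
```

Whatever the number of instances in a bag, the resulting sequence has exactly T elements, which is why T directly sets the input length of the LSTM network.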

C. ACCURACY AND CONFUSION MATRIX ON PROPOSED ALGORITHM
To verify the effectiveness of the proposed algorithm, we compared it with several other MIL algorithms, namely MI/mi-SVM [17], DD-SVM [18], and MILES [19], and with some classic deep learning algorithms, namely CNN [25], MiDDP [28], and LSTM [32]. The average AUC values obtained in the comparison experiments on the three data sets are shown in Table 1.
The results in Table 1 show that the classification performance of the proposed algorithm on all three data sets is better than that of the existing classic algorithms.
The main reasons are as follows. First, the sequence generator builds semantic sequences from the instance features and therefore has a strong ability to represent image semantics. Second, when the semantic sequence is passed to the multi-layer LSTM network, the model's ability to process long sequences lets it take the weight relationships between the different semantics into account, which makes the algorithm adaptive and robust. The confusion matrices for the CP image set and COREL 1000 are shown in Figure 7. They show that our model performs well on most image categories.
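For readers who wish to reproduce an AUC figure of the kind reported in Table 1, the sketch below computes the AUC as the fraction of correctly ordered positive/negative pairs and averages it over categories in a one-vs-rest fashion. The one-vs-rest averaging and both function names are assumptions; the text does not state how the average AUC was obtained.

```python
def auc(scores, labels):
    """AUC as the probability that a positive sample outranks a negative one.

    scores : predicted score per sample (higher means more positive)
    labels : 1 for positive samples, anything else for negative samples
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_one_vs_rest_auc(prob_rows, labels, n_classes):
    """Mean of the per-category AUCs, treating each category as positive in turn."""
    total = 0.0
    for c in range(n_classes):
        scores = [row[c] for row in prob_rows]
        binary = [1 if y == c else 0 for y in labels]
        total += auc(scores, binary)
    return total / n_classes


if __name__ == "__main__":
    print(auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```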

D. COMPARISON MODEL OF ALGORITHM WITH AND WITHOUT ATTENTION MECHANISM
To verify the importance of the attention mechanism in the proposed algorithm, we conducted an ablation study that compares the classification accuracy of the Baseline model, the model with the attention mechanism, and the model without the attention mechanism on the CP image set. Table 2 shows the results of the three models.
Table 2 shows that the classification accuracy of the proposed ALSTM-MIL algorithm is 3% higher than that of the model without the attention mechanism and 9% higher than that of the Baseline model. The reason is that the attention mechanism assigns different weights to different instances in the network and increases the proportion of useful instances, thereby improving the accuracy of Chinese painting image classification. In addition, a multi-layer LSTM can learn more memory information than a single-layer LSTM.

E. AVERAGE RUNNING TIME COMPARISON
We compared the average running time of the POGD method of this article with that of blobworld [41] and of the bag generator of DD-SVM on the CP data set, as shown in Table 3.
Table 3 shows that the POGD method is 29 s faster than blobworld. The reason is that blobworld extracts 230-dimensional features for each instance, which is more than the number of dimensions extracted by POGD. In addition, the running time of POGD is much shorter than that of the bag generator in DD-SVM. This is because DD-SVM finds the concept points of positive and negative instances with the diverse density function, which is time-consuming.
Having compared the classification performance of our method with that of other methods, it is also interesting to know the average running time of each MIL algorithm. We ran each method 20 times and took the average as its running time. Figure 8 shows the average running time of the MIL algorithms ALSTM-MIL, mi-SVM, and DD-SVM on the COREL 1000, COREL 2000, and CP image sets.
According to the experimental results, ALSTM-MIL has no obvious advantage in time efficiency. This is because the LSTM model with an attention mechanism requires iterative training, and the training batches extend the running time. In addition, the running times of ALSTM-MIL on the COREL 2000 and CP image sets differ very little, because the two data sets are similar in size.

VII. DISCUSSION
Several aspects of the Chinese painting image classification model can still be optimized. For example, when designing the sequence generator, other distance metrics, such as the Levenshtein distance [42], could be tried and their effect on classification performance verified.
The proposed algorithm can be applied not only to image classification and retrieval, but also to image surface defect detection and target tracking. In the training phase of the image defect detection in [43], the algorithm of this article can be used to perform multi-instance modeling on surface images and construct multi-instance bags. The uniqueness function can then be used to select the top-T instances with the largest uniqueness as the discriminative instance set for surface defect detection. The proposed algorithm can also be used for target tracking [44], [45]. Since the target deforms as it moves, the multi-instance modeling method should treat the instances near the target as positive bags. Moreover, although the ALSTM-MIL algorithm discussed here is aimed at the classification of Chinese painting images, for other image retrieval or classification problems the classification accuracy can be improved simply by adapting the extracted feature vectors to the specific images.
In future work, other factors inherent in the classification of traditional Chinese painting images can also be considered to further improve performance.

VIII. CONCLUSION
To classify Chinese painting images under the MIL framework, this article makes three main contributions: (1) a new multi-instance modeling scheme is designed, in which the Pyramid Overlapping Grid Division method converts Chinese painting images into multi-instance bags; (2) a sequence generator is designed to construct a discriminative instance set and to transform each multi-instance bag into an ordered, equal-length semantic sequence; (3) a novel MIL algorithm based on LSTM with an attention mechanism (ALSTM-MIL) is proposed for Chinese painting classification, which performs semantic analysis on the sequences built from the discriminative instance set. To the best of our knowledge, this is the first time that a long short-term memory neural network has been combined with a MIL algorithm. The network captures and memorizes the correlation information between instances over the long term and is finally combined with the Softmax layer to classify Chinese painting images, which is a novel direction in research on MIL algorithms. The experimental results show that the proposed algorithm is a very effective MIL algorithm and that its classification performance on the COREL image sets and the Chinese painting image set is better than that of the other algorithms.