GFF-CARVING: Graph Feature Fusion for the Recognition of Highly Varying and Complex Balinese Carving Motifs

The recognition of Balinese carving motifs is challenging because the motifs are highly varying and interrelated and because Balinese carving data are scarce. This study proposes a method named GFF-CARVING for the recognition of Balinese carving motifs. GFF-CARVING is a deep learning architecture that uses a Convolutional Neural Network (CNN) to extract image features and a Graph Convolutional Network (GCN) to extract graph features. GFF-CARVING applies feature fusion to improve the discriminative ability of the model, which addresses these challenges and improves recognition performance. The proposed method consists of three main modules: the image representation learning module, the graph representation learning module, and the prediction module. The image representation learning module is based on ResNet and extracts the image features using global max pooling. The graph representation learning module is based on GCN and extracts the graph features. The graph features are handcrafted features built from the occurrence relationship between the constituent sub-motifs of Balinese carvings. The feature fusion generates new features that take into account the occurrence relationship between the sub-motifs. These new features are used in the prediction module to recognize the Balinese carving motifs. In the experiments, GFF-CARVING achieved a recognition accuracy of 98.93%, the highest among the compared state-of-the-art models. These results indicate that feature fusion based on the handcrafted graph features and image features improves the discriminative ability of GFF-CARVING in recognizing Balinese carving motifs.


I. INTRODUCTION
Balinese carvings are works of art that are considered a cultural heritage of Bali. Balinese carvings found in sacred temples have unique motifs that adorn each element of the temple. The motifs are carved on a compressed sand medium. Most temples in Bali that were built in the past have unique motifs. Preservation efforts have been carried out by digitally collecting and archiving various motifs of Balinese carvings, in which the first step is the automatic and accurate recognition of these motifs.
However, the recognition of Balinese carving motifs is challenging for two reasons. Firstly, a single motif may vary in appearance when it is present in different carvings. Furthermore, Balinese carvings are composed of sub-motifs that are mostly interrelated. Secondly, no public Balinese carving dataset is currently available. The complex characteristics of the motifs and the scantiness of Balinese carving data make the recognition process very challenging.
The associate editor coordinating the review of this manuscript and approving it for publication was Li Zhang.
Several methods have been proposed to overcome the challenges of limited or scant data and high data variation [1], [2], [3]. In the context of Balinese carving, Darma et al. [4] proposed a data augmentation technique based on generative adversarial networks (GANs) and geometric transformation to generate synthetic data to improve recognition performance. A transfer learning approach was proposed by Darma et al. [5] to improve the performance of several pre-trained convolutional neural networks (CNNs) for the recognition of Balinese carvings. Mahawan and Harjoko [6] proposed a feature extraction method based on histogram of oriented gradient (HOG) and principal component analysis (PCA), in which the features of the training data are stored into a table using learning vector quantization (LVQ) and used for the recognition of Balinese carvings. However, these methods have yet to achieve a significant recognition performance and also have yet to fully overcome the problems faced in the recognition of Balinese carvings, namely the highly varying and interrelated sub-motifs.
Furthermore, several studies have proposed the use of graph features from images to improve recognition performance. Zhang et al. [7] proposed a modularity-based graph learning module to build the graph representation of features extracted using a CNN. With the use of a graph convolutional network (GCN) module, independent CNN features and mutual GCN features are integrated to represent the retinal images and boost the recognition performance. Zhang et al. [8] proposed a structure-feature fusion adaptive GCN (SFAGCN) for skeleton-based action recognition, in which SFAGCN was shown to surpass the accuracy of state-of-the-art methods by more than 0.6% on average. Mou et al. [9] proposed a nonlocal GCN for the classification of hyperspectral images, which exhibited competitive results compared to other spectral classifiers.
Based on the studies above, this study proposes a method named GFF-CARVING, which is based on GCN and CNN, to address the challenges faced in the recognition of Balinese carving motifs, namely the highly varying and interrelated motifs of Balinese carvings and the scantiness of Balinese carving data. The proposed GFF-CARVING applies feature fusion to improve the discriminative ability of the model and, in turn, its recognition performance. The main contributions of this study to the recognition of Balinese carving motifs are summarized below:
• We propose a hybrid model that combines CNN and GCN into a unified architecture to extract image features and graph features for the recognition of Balinese carving motifs.
• We built handcrafted graph features based on the occurrence relationship between the constituent sub-motifs of Balinese carvings that represent the Balinese carving images.
• We propose feature fusion of the image and graph features to improve the discriminative ability of GFF-CARVING in recognizing Balinese carving motifs.
To the best of our knowledge, there are only a few studies that conduct the recognition of Balinese carving motifs. The experimental results show that GFF-CARVING can overcome the challenges faced in the recognition of Balinese carving motifs, namely the highly varying and interrelated motifs of Balinese carvings and the scantiness of Balinese carving data.
The rest of the paper is organized as follows: Section II discusses the related works. Section III presents the proposed GFF-CARVING method for Balinese carving motif recognition. Section IV discusses the experimental results. Finally, Section V presents the conclusion and future works.

II. RELATED WORKS
Image recognition is the process of identifying images and classifying them into classes. The classification process is carried out based on features of the objects present in the images. An image feature representation is the set of values used to distinguish the classes of the objects. Ling et al. [10] proposed a self-residual attention-based CNN for deep face recognition. This study used ResNet-50 and ResNet-101 as the backbone networks and implemented a self-residual spatial attention block and a self-residual channel attention block to decrease the redundancy between channels and to focus on the more significant parts of the face images. Wang et al. [11] proposed a method to classify pulmonary images based on Inception-v3 and transfer learning. Kui et al. [12] proposed a depthwise separable residual neural network (ResNet) for the classification of hyperspectral images that distinguishes the spectral and spatial information of the images and reduces the network size to prevent overfitting. Pal et al. [13] proposed a deep metric learning-based framework that is configured into CNNs to generate class-distinctive image feature descriptors for the classification of cervical images. Sutramiani et al. [14] proposed a data augmentation technique named MAT-AGCA to improve the performance of CNNs for the recognition of Balinese characters. MAT-AGCA addresses the challenge of limited availability of Balinese character datasets.
Transfer learning is a strategy that adapts pre-trained models to improve their performance in recognizing objects. Zhou et al. [15] conducted transfer learning based on the Inception-v3 and VGG19 models to differentiate benign and malignant breast tumors. This research examined various depths of transfer learning and evaluated the effects on the classification performance. Huo et al. [16] used deep transfer learning and semisynthetic training data for the classification of underwater objects in sonar images. This research applied fine-tuning and transfer learning to the VGG19 model. Sutramiani et al. [17] conducted transfer learning based on the MobileNet model for the recognition of Balinese characters. This research fine-tuned the number of trainable parameters of the pre-trained model and achieved an accuracy of 86.23%. Fan et al. [18] carried out the recognition of rock lithology using transfer learning based on the SqueezeNet and MobileNet models. The research achieved the highest recognition accuracy of 94.55% compared to other state-of-the-art methods.
Ensemble learning is a technique that combines several models or classifiers in an attempt to improve classification performance. Several studies have applied ensemble learning in various fields. Ali et al. [19] used ensemble learning for lung nodule classification. They extracted deep features using several CNN models, and these features were used to train two classifiers, namely SVM and AdaBoostM2. It was shown that their method with the use of SVM outperformed other state-of-the-art methods and achieved an accuracy of 90.46%. Kusetogullari et al. [20] proposed DIGITNET-rec, an ensemble of three CNN models that recognizes digit strings based on majority voting. Nanni et al. [21] proposed an ensemble of CNNs for bioimage classification, in which the scores of the models were combined using sum rules. This study compared the performance of several models based on learning rates, batch sizes, and topologies. Banerjee et al. [22] proposed an ensemble of selected features of several CNNs based on a two-stage feature selection algorithm, namely fuzzy entropy (FE) and total contribution score (TCS), for erythrocyte detection. Liu et al. [23] proposed a deep ensemble model for facial expression recognition. A hybrid feature representation method was used to acquire high-level discriminative features, and a lightweight backbone fusion based on VGG16 and ResNet was constructed to achieve low-calculation training. The model achieved an accuracy of over 94% on four benchmark datasets. Patel et al. [24] proposed feature fusion based on several modalities, features, and classifier decision scores for human action recognition. Liu et al. [25] proposed a deep feature fusion ResNet for insect pest recognition. Based on these studies, feature fusion can improve recognition performance by combining several features.
Several studies have proposed other techniques to improve the recognition performance of various objects. Lee et al. [26] proposed a data augmentation method using conditional GAN (cGAN) to address the scarcity of labeled iris image data. It was shown that the method improved the iris recognition accuracy. Man et al. [27] proposed a method to classify breast histopathological images named DenseNet121-AnoGAN, which utilized anomaly detection with GAN to screen mislabeled patches and DenseNet to extract multi-layered features of the discriminative patches. Liu et al. [28] proposed a method for Covid-19 diagnosis from CT images based on a two-dimensional sparse matrix profile and DenseNet. This study used the sparse matrix profile method to generate anomaly-enhanced CT images, which were used to train the DenseNet model. Furthermore, this study also used data augmentation techniques to achieve the best classification performance. Wang et al. [29] proposed a method to classify single chromosome images into 24 types based on an extended ResNet. This study used the Hausdorff distance to calculate the vector of the input image and the 24 label feature vectors. Furthermore, Lu et al. [30] proposed an efficient algorithm based on the ResNet model to predict protein-protein interactions. Other research proposed a Dimension-Based Generic Convolution Block to improve the recognition accuracy and optimize the model [31].
In this study, we adopted several of the approaches above to improve the recognition of Balinese carving motifs and to address its challenges, namely the highly varying and interrelated motifs of Balinese carvings and the scantiness of Balinese carving data. The transfer learning approach can be used to improve the recognition performance but does not overcome the challenges encountered in the recognition of Balinese carving motifs. Furthermore, ensemble learning by combining several models and features can also improve recognition performance. However, in the context of Balinese carving motifs, a more suitable approach that takes into consideration the complex characteristics of Balinese carving motifs is needed. Therefore, we built handcrafted features that exploit the occurrence relationship between the sub-motifs as the graph features. Based on these handcrafted features, we applied feature fusion to enrich the image features and thereby improve recognition performance. This study proposes a method to recognize Balinese carving motifs named GFF-CARVING, which applies feature fusion on the handcrafted graph features extracted using GCN and the image features extracted using ResNet. This feature fusion approach improves the discriminative ability of the model to recognize Balinese carving motifs.
Fig. 1 shows the different variants of Balinese carving motifs and the interrelated unique sub-motifs. The Balinese carvings consist of seven sub-motifs, i.e., Barong, Gajah, Karang Goak, Karang Daun, Patra Punggel, Patra Cina, and Keketusan Kakul kakulan. A single sub-motif may vary in appearance when it is present in different carvings because Balinese carvings are carved by different craftsmen and not printed, as shown in Fig. 1a. The scantiness of Balinese carving data is also a challenge in the recognition task. To our knowledge, there is currently no publicly available Balinese carving dataset.
In addition, Balinese carvings have unique sub-motifs that are interrelated, as shown in Fig. 1b. Each Balinese carving is composed of several interrelated sub-motifs. For example, Karang Barong is composed of three sub-motifs, i.e., Barong, Patra Cina, and Keketusan Kakul kakulan. The Barong and Keketusan Kakul kakulan sub-motifs also appear in Karang Barong 2. Hence, these sub-motifs are interrelated. The combination of these constituent sub-motifs is described in more detail in Table 1. We exploit these unique characteristics of Balinese carving sub-motifs to construct graph features. The graph features are combined with the image features to produce new features that improve the model's discriminative ability in classifying Balinese carving motifs on a highly varying and limited dataset.
Fig. 2 shows the proposed Balinese carving recognition method, which consists of four steps. The first step is to construct a directed graph based on a weighted adjacency matrix to represent the occurrence relationship between the sub-motifs. The second step is dataset vectorization. In this step, we applied word embedding to the BaliCarv dataset based on the identified sub-motif labels. We utilized FastText to generate a vectorized form of the BaliCarv dataset. The third step is the proposed GFF-CARVING method, which consists of an image representation learning module, a graph representation learning module, and a prediction module. In this step, we applied ResNet to extract image features and GCN to capture graph features, and then predicted the labels of the motifs based on feature fusion. We combined the image and graph features to improve the discriminative ability of the model in recognizing Balinese carving motifs. The last step is performance evaluation. We compared the performance of the proposed GFF-CARVING against other state-of-the-art CNN models on the BaliCarv dataset.

A. WEIGHTED ADJACENCY MATRIX FOR GRAPH REPRESENTATION
The occurrence relationships between the sub-motif labels are represented by a weighted adjacency matrix and shown as a graph. A graph is a structure that encodes object connections. Objects in a graph are represented by nodes, while edges that connect nodes reflect the relationship between nodes. Weights can be assigned to edges to indicate the strength of the link between nodes. A weighted directed graph is used to represent the graph in this scenario. Each Balinese carving contains sub-motifs that constitute the carving as a whole. The weighted graph is based on the occurrence of two different sub-motifs in one Balinese carving.
We constructed handcrafted graph features based on the occurrence relationships between the sub-motif labels. Each sub-motif label is represented as a node in the graph, and each directed edge between two nodes carries a weight that represents the probability that another sub-motif label occurs when a particular sub-motif label is present. Furthermore, each sub-motif label has a probability of occurrence, which is calculated as the number of sample images that contain this particular sub-motif divided by the total number of sample images. A conditional probability can be used to represent the occurrence relationship between sub-motif labels, namely $P(L_j \mid L_i)$, which indicates the probability that the $L_j$ label appears when the $L_i$ label appears.
TABLE 1. Balinese carving motif combinations with their constituent sub-motifs, the number of motif images, and the number of sub-motif images.
TABLE 2. Co-occurrence matrix of sub-motif pairs ($A \in \mathbb{R}^{C \times C}$) and number of sub-motif occurrences in the Balinese carving motif combinations ($N$).
The graph and the corresponding weighted adjacency matrix are constructed from the Balinese carving motif combinations. First, we identified the different motifs present in the Balinese carving images, which are shown in Table 1. Each Balinese carving motif is composed of sub-motifs. The combinations of these sub-motifs are spread over 2,364 images in the BaliCarv dataset, which contain 9,326 sub-motifs. Afterward, we counted the number of occurrences of sub-motif pairs in the Balinese carving motif combinations to obtain the co-occurrence matrix of sub-motif pairs $A \in \mathbb{R}^{C \times C}$, where $C$ is the number of labels, as shown in Table 2. Then, we counted the number of occurrences of each sub-motif in the Balinese carving motif combinations ($N$). Finally, the weighted adjacency matrix of the occurrence relationship between sub-motif pairs, which is based on the conditional probabilities of sub-motif pairs, is constructed by dividing each row of the co-occurrence matrix $A$ by the number of occurrences of the corresponding label ($N$). The conditional probability $P(L_j \mid L_i)$ for the pair of sub-motif labels $L_j$ and $L_i$ is calculated using the following formula:
$$P(L_j \mid L_i) = \frac{A_{ij}}{N_i} \quad (1)$$
where $A_{ij}$ is the number of occurrences of the sub-motif pair $L_i$ and $L_j$, and $N_i$ is the number of occurrences of $L_i$, so that row $i$ of $A$ is divided by $N_i$. We calculated the conditional probabilities for each pair of sub-motif labels using Eq. 1. The conditional probabilities for each pair of sub-motif labels are shown in Table 3, and the final weighted adjacency matrix is shown in Fig. 3c. The conditional probability that the $L_i$ label appears when the label itself appears is represented with a weight of 1. Fig. 3a depicts the nodes of the sub-motif labels, the directed edges that connect the nodes, and the weight given to each edge based on the conditional probability between the sub-motifs. It can be seen that when the Gajah label appears, the probability that the Patra Punggel label appears is represented with a weight of 1. However, as the edges are directed, this does not apply the other way around. When the Patra Punggel label appears, the probability that the Gajah label appears is represented with a weight of 0.5. Another example is the relationship between the Patra Cina label and the Keketusan label. When the Patra Cina label is present, the probability of occurrence of the Keketusan label is represented with a weight of 1, but when the Keketusan label is present, the probability of occurrence of the Patra Cina label is represented by a weight of 0.3. Fig. 3b illustrates the weighted directed graph of the occurrence relationship between sub-motif labels. Fig. 3c is the final weighted adjacency matrix that represents the weighted directed graph. This graph is used in the graph representation learning module.
VOLUME 10, 2022
Figure caption: The GFF-CARVING model. The model consists of three main modules, namely the image representation learning module to extract the image features, the graph representation learning module to capture the graph features, and the prediction module that utilizes feature fusion to predict the motif labels.
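The row normalization described above can be illustrated with a minimal Python sketch. The function name `weighted_adjacency` and the toy counts are hypothetical and are not the BaliCarv statistics; the counts were chosen only so that the result reproduces the asymmetric Gajah and Patra Punggel weights (1 and 0.5) mentioned in the text.

```python
def weighted_adjacency(cooccurrence, occurrences):
    """Row-normalize a co-occurrence matrix into conditional probabilities.

    Entry [i][j] estimates P(L_j | L_i): the pair count divided by the
    number of occurrences of the conditioning label L_i.
    The diagonal is fixed to 1, as stated in the text.
    """
    size = len(occurrences)
    weights = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if i == j:
                weights[i][j] = 1.0  # a label always appears with itself
            elif occurrences[i] > 0:
                weights[i][j] = cooccurrence[i][j] / occurrences[i]
    return weights


# Hypothetical toy counts (assumption), NOT taken from the BaliCarv dataset.
labels = ["Gajah", "Patra Punggel"]
cooccurrence = [[0, 10],
                [10, 0]]      # the pair occurs together 10 times
occurrences = [10, 20]        # Gajah occurs 10 times, Patra Punggel 20 times

weights = weighted_adjacency(cooccurrence, occurrences)
for name, row in zip(labels, weights):
    print(name, row)
```

Because each row is divided by a different count, the resulting matrix is generally not symmetric, which is why the graph is directed.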

B. SUB-MOTIF LABEL VECTORIZATION
We applied sub-motif label vectorization to enrich the carving motif features and thereby improve the discriminative ability of the model. Fig. 4 shows the sub-motif label vectorization through the word embedding process and an illustration of the vector values of the carving sub-motif labels. The vectorized form of the BaliCarv dataset was generated by applying word embedding based on the sub-motif labels. Word embedding maps each label of the BaliCarv dataset into a dense vector. The dense vector is a numerical representation of the semantic meaning of each sub-motif label. Each sub-motif label in the Balinese carving dataset is encoded into a dense, high-dimensional vector. The abstract meaning and relationship of each label are coded numerically. Fig. 4a shows the sub-motif label vectorization through the word embedding process. The BaliCarv dataset consists of seven sub-motif classes. The vectorization process is applied to all seven sub-motif labels of the BaliCarv dataset, which consists of 2,364 images containing 9,326 sub-motif labels. Table 4 shows the sub-motif labels of the BaliCarv dataset along with the number of occurrences of each sub-motif label within the images. The FastText model was applied to map each sub-motif label into a vector to generate the vectorized form of the BaliCarv dataset. The training process using FastText was carried out for 10,000 iterations using a vector dimension of 300. This vectorized form of the BaliCarv dataset and the weighted adjacency matrix are used as input data to the graph representation learning process. As an illustration, Fig. 4b shows the vector values that represent each carving sub-motif.
Fig. 5 shows in detail the GFF-CARVING process, which consists of three main modules. The first module is the image representation learning module, which is responsible for extracting image features. An image feature is a part of the pattern of an object in an image that is used to recognize the object or differentiate it from other objects. The second module is the graph representation learning module, which is responsible for extracting graph features based on the interrelationships between the sub-motif labels. The third module is the prediction module, which is responsible for the feature fusion of the extracted image and graph features and for predicting the Balinese carving motifs. The purpose of extracting image and graph features is to produce new features of Balinese carvings by combining the two, thereby enriching the features and increasing the model's discriminative ability in classifying Balinese carving motifs.

1) IMAGE REPRESENTATION LEARNING
The first module aims to extract features from the images. In our experiment, we used ResNet as the base model. The resolution of the input image is 500 × 500 pixels. To obtain the features from the images, we used global max pooling with the following formula:
$$x = f_{\mathrm{GMP}}\left(f_{\mathrm{cnn}}(I; \theta_{\mathrm{cnn}})\right) \in \mathbb{R}^{D} \quad (2)$$
where $x$ is the feature vector extracted from the image, $I$ is the input image, $\theta_{\mathrm{cnn}}$ denotes the model parameters, $f_{\mathrm{cnn}}$ is the CNN backbone, $f_{\mathrm{GMP}}$ is the global max pooling operation, and $D$ is the dimension of the feature map, where $D = 2048$. To optimize the neural network, we used SGD as the optimizer with a momentum of 0.9, L2 regularization with a weight decay of $1 \times 10^{-4}$, and an initial learning rate of $1 \times 10^{-2}$, which decays by a factor of 10 every 40 epochs. The image representation learning module generates a 2048-dimensional feature map.
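As a rough illustration of the two mechanisms described in this module, the following standard-library Python sketch shows global max pooling over a tiny feature map and the step-decay learning-rate schedule. The function names `global_max_pool` and `step_decay_lr` and the toy feature map are assumptions made for illustration; the actual module applies these operations to the ResNet output inside a deep learning framework.

```python
def global_max_pool(feature_map):
    """Reduce a D x H x W feature map to a D-dimensional vector.

    Each channel (an H x W grid) is replaced by its maximum activation.
    """
    return [max(max(row) for row in channel) for channel in feature_map]


def step_decay_lr(epoch, initial_lr=1e-2, step=40, factor=10.0):
    """Learning rate that is divided by `factor` every `step` epochs."""
    return initial_lr / (factor ** (epoch // step))


# Toy feature map with D = 2 channels of size 2 x 2 (hypothetical values).
feature_map = [
    [[0.1, 0.7], [0.3, 0.2]],
    [[-1.0, -0.5], [-0.2, -0.9]],
]
pooled = global_max_pool(feature_map)
print(pooled)

# Learning rate at the start of each 40-epoch stage of a 150-epoch run.
for epoch in (0, 40, 80, 120):
    print(epoch, step_decay_lr(epoch))
```

With these settings the learning rate stays at the initial value for epochs 0 to 39 and is reduced tenfold at epochs 40, 80, and 120.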

2) GRAPH REPRESENTATION LEARNING
The second module aims to extract graph features. This module implements a GCN consisting of two layers. Each GCN layer receives the node representation from the previous layer and outputs a new node representation. The first layer takes the weighted adjacency matrix and the label-level word embeddings as input, where $d$ is the vector dimension of the word embedding and $d = 300$. The output of the first layer goes through the LeakyReLU activation function, which allows the model to learn complex relationships between labels when multiple GCN layers are stacked. The graph convolution in the first layer produces an output with a dimension of 1024. The output of the first layer and the weighted adjacency matrix are used as the input for the second layer, which produces a feature vector with a dimension of 2048. The final output of the GCN module is a 2048-dimensional feature map that is based on the graph of interrelated sub-motif labels.
The weighted adjacency matrix and the vectorized form of the BaliCarv dataset are used as the input for the graph representation learning module. The weighted adjacency matrix represents the occurrence relationship between the sub-motifs that constitute the Balinese carvings. The vectorized form of the BaliCarv dataset is built through the word embedding process.
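The data flow through the two GCN layers can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each layer is taken to compute the product of the adjacency matrix, the node features, and a weight matrix, without any further normalization of the adjacency matrix; the LeakyReLU negative slope of 0.2, the random weights, and the reduced dimensions (6, 8, and 16 standing in for 300, 1024, and 2048) are hypothetical choices, not values confirmed by the text.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def leaky_relu(matrix, negative_slope=0.2):
    """Element-wise LeakyReLU; the slope value is an assumption."""
    return [[v if v > 0 else negative_slope * v for v in row] for row in matrix]


def gcn_layer(adjacency, features, weight):
    """One graph convolution: aggregate over neighbours, then project."""
    return matmul(matmul(adjacency, features), weight)


def gcn_forward(adjacency, embeddings, w1, w2):
    """Two stacked GCN layers with LeakyReLU after the first layer."""
    hidden = leaky_relu(gcn_layer(adjacency, embeddings, w1))
    return gcn_layer(adjacency, hidden, w2)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Seven label nodes; small stand-in dimensions instead of 300 -> 1024 -> 2048.
num_labels, d_in, d_hidden, d_out = 7, 6, 8, 16
# Hypothetical adjacency: weight 1 on the diagonal, 0.5 elsewhere.
adjacency = [[1.0 if i == j else 0.5 for j in range(num_labels)]
             for i in range(num_labels)]
embeddings = random_matrix(num_labels, d_in, rng)   # stand-in label embeddings
w1 = random_matrix(d_in, d_hidden, rng)
w2 = random_matrix(d_hidden, d_out, rng)

graph_features = gcn_forward(adjacency, embeddings, w1, w2)
print(len(graph_features), len(graph_features[0]))
```

The output has one row per sub-motif label, so with the dimensions reported in the text the module would yield seven 2048-dimensional vectors.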

3) PREDICTION BASED ON FEATURE FUSION
The prediction module combines the extracted image features and graph features. Feature fusion is implemented on both features by matrix multiplication. We used the torch.matmul function to combine image features and graph features instead of torch.mm or torch.bmm. The torch.mm and torch.bmm functions do not broadcast the matrix product, so they cannot handle arrays with different shapes during arithmetic operations. In contrast, torch.matmul handles arrays of different shapes: the smaller array is broadcast across the larger array so that the shapes are compatible. In addition, torch.matmul can perform tensor multiplication with high-dimensional input. The feature fusion is calculated with the following formula:
$$\begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix} \cdot \begin{bmatrix} y_1 & y_2 & \dots & y_n \end{bmatrix}^{\top} = x_1 y_1 + x_2 y_2 + \dots + x_n y_n \quad (3)$$
In Eq. 3, $x_1, x_2, \dots, x_n$ is the first vector and $y_1, y_2, \dots, y_n$ is the second vector. The torch.matmul function multiplies both vectors to perform feature fusion and generate a new vector value. After applying feature fusion, the results are fed through a fully connected layer to predict the labels of the Balinese carving motifs. In addition, the feature fusion generates features that take into account the occurrence relationship between sub-motifs, which can improve the discriminative ability of the model and, therefore, its performance in recognizing Balinese carving motifs.
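To make the fusion step concrete, the sketch below computes in plain Python what a matrix product between one image feature vector and a set of per-label graph feature vectors produces. It assumes, hypothetically, an image feature vector of length D and a C × D graph feature matrix; the exact tensor shapes used in the model are not stated in the text. The helper name `fuse` and the toy numbers are illustrative only. In PyTorch, the analogous call would be torch.matmul on a 1-D tensor of length D and a 2-D tensor of shape D × C, which returns a 1-D tensor of length C.

```python
def fuse(image_features, graph_features):
    """Fuse an image feature vector with per-label graph feature vectors.

    image_features: list of length D
    graph_features: C rows, each of length D (one row per label node)
    Returns C values, each the dot product of the image vector with one row.
    """
    if any(len(row) != len(image_features) for row in graph_features):
        raise ValueError("feature dimensions must match")
    return [sum(x * y for x, y in zip(image_features, row))
            for row in graph_features]


# Toy example with D = 3 and C = 2 (hypothetical values).
image_features = [1.0, 2.0, 3.0]
graph_features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
]
fused = fuse(image_features, graph_features)
print(fused)
```

Each fused value grows when the image features align with the graph features of that label node, which is how the occurrence relationships can influence the input to the fully connected layer.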

A. DATASET
We used the BaliCarv dataset, composed of 2,364 images containing 9,326 sub-motif labels in seven sub-motif classes. The BaliCarv dataset was built through a data generation process using neural style transfer and geometric transformation described in [4]. We applied K-fold cross-validation to the model training process. First, we split the dataset into five folds, so that each model was trained and tested on five different partitions of the data. Then, each training process was conducted on 1,892 training images and 472 test images.
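A minimal sketch of a 5-fold split over 2,364 samples is given below, using only the Python standard library. The helper `k_fold_indices` is hypothetical, and it uses contiguous, unshuffled folds for simplicity; whether and how the data were shuffled is not stated in the text.

```python
def k_fold_indices(num_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds.

    Fold sizes differ by at most one when num_samples is not divisible by k.
    """
    base, extra = divmod(num_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, num_samples))
        yield train, test
        start += size


splits = list(k_fold_indices(2364, k=5))
sizes = [(len(train), len(test)) for train, test in splits]
print(sizes)
```

Since 2,364 is not divisible by five, the test folds in this sketch contain 473 or 472 images; the fold with 472 test images leaves 1,892 images for training, which matches the split sizes reported above.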

B. PERFORMANCE EVALUATION
We evaluated the performance using precision, recall, F1, and accuracy on an RTX 3060 GPU. We evaluated GFF-CARVING by comparing its performance with that of other benchmark CNN models. These models were trained using the BaliCarv dataset, which consists of 2,364 Balinese carving images. The performance is evaluated with the following formulas:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall/True\ Positive\ Rate\ (TPR)} = \frac{TP}{TP + FN}$$
$$F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where TP is the number of correctly predicted positive labels, TN is the number of correctly predicted negative labels, FP is the number of incorrectly predicted positive labels, and FN is the number of incorrectly predicted negative labels. Accuracy is the ratio of correct predictions to the overall data. Precision is the ratio of TP to the total number of positive predictions. Recall/TPR is the ratio of TP to the total number of positive data. The F1 score is the harmonic mean of the precision and recall. In addition, we applied K-fold cross-validation, receiver operating characteristic (ROC) curves, and area under the curve (AUC) scores to evaluate each model.
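The four metrics defined above can be computed from the confusion counts as in the following Python sketch. The function names and the example counts are hypothetical and serve only to illustrate the definitions.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that are found (true positive rate)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical confusion counts, not results from the experiments.
tp, tn, fp, fn = 90, 95, 5, 10
p = precision(tp, fp)
r = recall(tp, fn)
print(accuracy(tp, tn, fp, fn), p, r, f1_score(p, r))
```

As a consistency check on the definitions, `f1_score(0.9385, 0.9881)` evaluates to roughly 0.9627, which agrees with the F1 of 96.27% reported for a precision of 93.85% and a recall of 98.81%.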

C. BALINESE CARVING RECOGNITION
We conducted the recognition of Balinese carving motifs by applying our proposed method, GFF-CARVING, to the BaliCarv dataset. The Balinese carving recognition process was carried out by first extracting the image features and the graph features. Subsequently, feature fusion was applied to combine the image and graph features to generate features that take into account the occurrence relationship between the sub-motifs. Finally, these features are fed through a fully connected layer for label prediction.
We used ResNet as the backbone model for the image representation learning module. We trained the model for 150 epochs. In the graph representation learning module, we used the final weighted adjacency matrix as the graph data and the vectorized form of the BaliCarv dataset to train the GCN model. The graph representation learning module is built on a GCN with two layers. The first layer accepts input data with a vector dimension of 300. Convolution in the first layer produces a 1024-dimensional feature vector based on the graph data and the vectorized form of the BaliCarv dataset. We used LeakyReLU to learn and model the interrelated nodes in the graph convolution process in the first layer. In the second layer, the graph convolution process generates a 2048-dimensional feature vector. The features generated in the graph representation learning module contain information on the occurrence relationship between the sub-motif labels.
We experimented with three variants of ResNet as the GFF-CARVING backbone. Table 5 shows a detailed summary of the proposed models. Each model has a different number of parameters, in particular ResNet-101, which has almost twice as many parameters. Table 6 shows the experimental results of the GFF-CARVING method using the different ResNet variants. We performed K-fold cross-validation to evaluate the model's performance, splitting the dataset into five folds for each training run. In the first scenario, we used ResNet-50 as the backbone network to learn the image features. The experimental results in the first scenario show that the model can recognize Balinese carvings with an accuracy of 97.88%. Furthermore, we also evaluated the performance of GFF-CARVING with ResNet-50 as the backbone network using precision, recall, and F1, for which it achieved best results of 93.85%, 98.81%, and 96.27%, respectively.
In the second scenario, we applied ResNet-101 as the backbone network to extract image features. ResNet-101 has more parameters than ResNet-50. This second scenario aims to determine the effect of the number of network parameters on the feature learning process. The experimental results show that the use of ResNet-101 produced better recognition accuracy than the use of ResNet-50 as the backbone network. The GFF-CARVING method with ResNet-101 achieved a best precision, recall, and F1 of 94.50%, 97.26%, and 95.86%, respectively. The 5-fold cross-validation yielded an average accuracy of 98.06%.
In the third scenario, we carried out the recognition of Balinese carving motifs using ResNeXt-50 as the backbone network. ResNeXt-50 is a variant of ResNet that repeats a building block that aggregates a set of transformations with the same topology [32]. The accuracy in the third scenario was higher than in the previous two scenarios: the 5-fold cross-validation yielded an average accuracy of 98.93%. GFF-CARVING with ResNeXt-50 achieved a precision, recall, and F1 of 92.04%, 98.18%, and 95.01%, respectively. Fig. 6 shows the confusion matrices of the three ResNet variants used as the GFF-CARVING backbone. The three variants showed almost the same performance, with a per-class recognition error of at most 5%. Karang Barong 1 is the motif with the lowest per-class recognition result, at 95% in the GFF-CARVING-ResNet-50 and GFF-CARVING-ResNet-101 models. In GFF-CARVING-ResNeXt-50, the recognition performance for Karang Barong 1 increased to 98%. Overall, the recognition accuracy was high across the ResNet variants, reaching 98.93%. The highest accuracy was yielded by the GFF-CARVING-ResNeXt-50 model. GFF-CARVING-ResNet-101 outperforms GFF-CARVING-ResNeXt-50 on precision and F1, but it has more parameters and higher FLOPs than the other ResNet variants. Fig. 7 compares the ROC curves and AUC scores of the three ResNet variants. The ROC curve of GFF-CARVING-ResNet-50 shows that Karang Barong 1 has the lowest AUC score, 0.913. In contrast, GFF-CARVING-ResNet-101 and GFF-CARVING-ResNeXt-50 achieved higher AUC scores for Karang Barong 1, 0.972 and 0.994, respectively. Based on the AUC scores, GFF-CARVING-ResNeXt-50 produced the best performance, with a higher AUC score than the ResNet-50 and ResNet-101 variants.
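As a quick consistency check on the precision, recall, and F1 values reported for the three backbones, the short sketch below recomputes F1 as the harmonic mean of precision and recall. It assumes the reported F1 is derived from the reported precision and recall in this way; the paper does not state the averaging scheme, so this is illustrative only.

```python
def f1_score(precision, recall):
    """F1 as the harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) in percent, taken from the text above.
reported = {
    "ResNet-50": (93.85, 98.81, 96.27),
    "ResNet-101": (94.50, 97.26, 95.86),
    "ResNeXt-50": (92.04, 98.18, 95.01),
}

for backbone, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{backbone}: computed F1 = {f1:.2f}, reported F1 = {f1_reported:.2f}")
```

Under this assumption, the computed values agree with the reported F1 scores to two decimal places.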
Therefore, based on the evaluation metrics and parameter size, ResNeXt-50 as the backbone offered the best overall performance. The higher FLOPs of GFF-CARVING-ResNet-101 arise because the ResNet-101 variant used in the image representation learning module has more parameters. The ResNeXt-50 model was therefore chosen as the primary model in GFF-CARVING. In the next section, we extend our experiment to compare the performance of the GFF-CARVING model with benchmark CNN models. Table 7 compares the recognition performance of the benchmark CNN models and the GFF-CARVING model. The benchmark CNN models are drawn from previous research that applied transfer learning and data augmentation strategies to overcome limited or scant data and high variation. Following the previous methods, we applied a fine-tuning strategy and data augmentation to each benchmark model to improve recognition performance on limited and highly varying data. Each benchmark CNN model was also trained using the BaliCarv dataset. In addition, we compared the performance of GFF-CARVING with our previous method [4], which applied neural style transfer and geometric transformation as data augmentation methods. Finally, we conducted an ablation study on the effect of the handcrafted graph features on the final prediction.

D. RECOGNITION PERFORMANCE COMPARISON
The MobileNet model achieved the lowest performance, with an accuracy of 81.70% and a precision, recall, and F1 of 87%, 78%, and 79%, respectively. MobileNet is a model with a small number of parameters. The VGG16 and VGG19 models, which applied transfer learning strategies with larger parameter sizes and higher model complexity, performed better than MobileNet, achieving accuracies of 85% and 84.52%, respectively. The ResNet-101 model achieved a higher recognition performance of 88.37%, while MobileNetV2 achieved an accuracy of 85.46% and a precision, recall, and F1 of 91%, 84%, and 85%, respectively. We also performed experiments using several other architectures, i.e., DenseNet169, InceptionResNetV2, Xception, and EfficientNetB4, which yielded accuracies of 88.76%, 87.16%, 89.31%, and 88.21%, respectively.
The highest accuracy achieved by the benchmark models was only 89.31%. Our previous study proposed a data augmentation technique based on neural style transfer and geometric transformation to address the scantiness of Balinese carving data [4], which increased the recognition accuracy to 91.60%. These results indicate that the transfer learning and data augmentation strategies applied in previous studies could not fully overcome the challenges faced in the recognition of Balinese carving motifs. The GFF-CARVING model proposed in this study was designed to overcome these challenges. The proposed graph-based feature fusion increased the recognition accuracy of Balinese carving motifs to 98.93%. Compared to our previous data augmentation technique, GFF-CARVING improved the recognition accuracy by 7.33 percentage points.
We further conducted an ablation study on the effect of the handcrafted graph features on the final prediction. To determine the performance improvement of the GFF-CARVING model, we compared our GFF-CARVING models with the baseline ResNet-50, ResNet-101, and ResNeXt-50, which used only image features. Table 7 shows that GFF-CARVING outperforms the baseline ResNet variants, whose accuracy reaches at most 90.46%. These results indicate that the handcrafted features built from the occurrence relationship between the sub-motif labels enhanced the recognition accuracy of Balinese carving motifs. Furthermore, feature fusion of the image and graph features increased the model's discriminative ability to recognize Balinese carving motifs. However, in terms of model complexity, our proposed method has higher FLOPs than the MobileNet variants due to the use of ResNet variants in the image representation learning module. Hence, it requires more computing resources.
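To make the difference between the image-only baselines and the fused model concrete, the following NumPy sketch shows one plausible way to fuse a 2048-dimensional image feature with 2048-dimensional graph features: a dot product that yields one score per graph node. The fusion operator is not specified in this section, so the dot product, the batch size, the number of nodes, and the random stand-in features are all assumptions rather than the authors' method.

```python
import numpy as np

def fuse(image_features, graph_features):
    """Dot-product fusion (assumed): one score per graph node for each image."""
    return image_features @ graph_features.T    # shape (batch, num_nodes)

rng = np.random.default_rng(0)
batch_size = 4                                   # hypothetical batch of carving images
num_nodes = 6                                    # hypothetical number of sub-motif labels

# Random stand-ins for the pooled ResNet output and the second GCN layer output.
image_features = rng.standard_normal((batch_size, 2048))
graph_features = rng.standard_normal((num_nodes, 2048))

fused = fuse(image_features, graph_features)
print(fused.shape)
```

In an image-only baseline, the 2048-dimensional image feature would instead pass directly to a classifier, so no information about the occurrence relationship between sub-motifs would enter the prediction.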
Based on these experiments, GFF-CARVING outperformed the other state-of-the-art models. Handcrafted features built from the characteristics of Balinese carving motifs can enrich the learned features, thereby increasing recognition performance. In other research domains, a similar strategy of exploiting data characteristics to build new graph features may likewise improve recognition performance.

V. CONCLUSION
In this study, we proposed GFF-CARVING, a Graph Feature Fusion method for the recognition of Balinese carving motifs. It addresses two challenges in this task: the highly varying and interrelated sub-motifs of Balinese carvings and the scantiness of Balinese carving data. GFF-CARVING consists of three modules, namely the image representation learning module, the graph representation learning module, and the prediction module. It combines CNN and GCN into a unified architecture to extract image and graph features for Balinese carving recognition. We built handcrafted graph features based on the occurrence relationship between the constituent sub-motifs of Balinese carvings and extracted image features from images of Balinese carvings using ResNet. Feature fusion was then applied to the image and graph features to improve the discriminative ability of GFF-CARVING in recognizing Balinese carving motifs. Based on the experimental results, GFF-CARVING outperforms the benchmark CNN models and achieved an accuracy of 98.93%. The experimental results indicate that the handcrafted graph features can significantly enhance the recognition of Balinese carving motifs. In addition, the proposed fusion of the image and graph features generates enriched features that improve the discriminative ability of GFF-CARVING, thereby overcoming the challenges faced in the recognition of Balinese carving motifs. However, the proposed model has a higher computational cost than the MobileNet variants. Therefore, it cannot yet be applied to mobile devices with limited computing resources.
In future work, the model complexity of the hybrid deep learning approach still needs to be reduced so that the hybrid model can be applied to mobile devices with limited computing resources. In addition, we will apply GFF-CARVING to image retrieval of Balinese carvings in order to digitally archive the carvings in various temples and preserve cultural heritage. This study contributes to the conservation of Balinese carvings by enabling accurate automatic recognition of their motifs. Further research can also apply GFF-CARVING to other domains by exploiting the characteristics of their data.