Semantic and Context Information Fusion Network for View-Based 3D Model Classification and Retrieval

In recent years, with the rapid development of 3D technology, view-based methods have shown excellent performance in both 3D model classification and retrieval tasks. In view-based methods, how to aggregate multi-view features is a key issue. There are two commonly used solutions in existing methods: 1) using a pooling strategy to merge multi-view features, which ignores the context information contained in the continuous view sequence; 2) leveraging a grouping strategy or long short-term memory (LSTM) networks to select representative views of the 3D model, which easily neglects the semantic information of individual views. In this paper, we propose a novel Semantic and Context information Fusion Network (SCFN) to compensate for these drawbacks. First, we render views from multiple perspectives of the 3D model and extract the raw feature of each individual view with a 2D convolutional neural network (CNN). Then we design a channel attention mechanism (CAM) to exploit the view-wise semantic information. By modeling the correlation among view feature channels, we can assign higher weights to useful feature attributes while suppressing the useless ones. Next, we propose a context information fusion module (CFM) to fuse multiple view features into a compact 3D representation. Extensive experiments are conducted on three popular datasets, i.e., ModelNet10, ModelNet40, and ShapeNetCore55, which demonstrate the superiority of the proposed method compared with the state-of-the-art on both 3D classification and retrieval tasks.


I. INTRODUCTION
In recent years, with the wide application of 3D technology in virtual reality, 3D printing, medical diagnosis, and other fields [1]-[4], the number of 3D models is proliferating, which makes the 3D model classification and retrieval tasks receive a surge of attention. The most critical step in these tasks is to learn a discriminative 3D model descriptor. Current 3D model descriptor extraction methods can be divided into two mainstreams: model-based methods and view-based methods. Model-based methods [5]-[11] describe the 3D model by its raw representation, i.e., point cloud, voxel, or mesh, which can preserve more complete structural information. However, complex computation restricts their application in real scenarios. View-based methods [12]-[19] usually first place virtual cameras around the 3D model to obtain multiple views, then extract features of each view through a 2D CNN, and finally fuse those view features into a compact 3D model descriptor. Since remarkable progress of deep learning has been achieved in the 2D image recognition field [20], [21], view-based methods have proved more successful than model-based methods.

A. MOTIVATION
Although many works focus on the 3D model classification and retrieval tasks, there still exist some issues to be solved.
• How to exploit the view-wise semantic information contained in individual views. Since the latest image feature extraction techniques [20], [21] can be directly employed to encode multiple views of the 3D model, current view-based 3D model analysis methods mainly focus on how to aggregate the multiple view features into a compact 3D model descriptor. For example, some view-based methods [17] employ a pooling strategy to fuse multi-view features, while others [22], [23] attempt to select representative views to depict the 3D model. These methods only focus on multi-view feature fusion; they neglect the view-wise semantic information contained in individual views, which plays an essential role in feature representation. As introduced in [24], different channels of a 2D CNN feature focus on different regions of an image, and regions with rich semantic information contribute more to the view feature. For instance, a rendered view of a 3D car model typically consists of two parts: the background and the car. Some channels in the view feature pay more attention to the background, while others focus on the car. If we can capture more effective information from each view, the feature of each view will be more distinctive, which is crucial for generating a compact 3D model descriptor. Therefore, it is necessary to exploit more effective view-wise semantic information contained in individual views.

FIGURE 1. The overview of the proposed method (SCFN). First, we place virtual cameras around the 3D model to capture multiple views from different directions. Then a 2D CNN is applied to extract the raw view features, and a channel attention mechanism (CAM) is designed to update the raw view features by channel-wise weights. Next, these updated view features are concatenated and processed by the context information fusion module (CFM) to explore spatial information. At last, the fused features are extracted for the classification or retrieval tasks.
• How to mine the multi-view context information contained in the view sequence. Although deep learning has been thoroughly studied in the field of 2D image recognition, its application to 3D models is still in its infancy. Some networks and strategies [8], [17] adopted by view-based methods do not fully consider the characteristics of 3D models. For instance, Su et al. [17] adopt a max-pooling layer to fuse view features. However, such an operation may discard critical information in the informative views since it only retains the maximum value of each view feature. Besides, the fused feature does not consider the multi-view context information contained in the view sequence, which has a significant influence on how humans observe and recognize objects. Some previous works use RNNs or LSTMs [25], [26] to aggregate multi-view features and mine context information among views, but this sequential mechanism requires much more computing resources, which substantially impairs the ultimate performance. Therefore, mining the context information contained in the view sequence is crucial for 3D model classification and retrieval tasks.

To address the aforementioned problems, we propose a novel semantic and context information fusion network (SCFN) for 3D model classification and retrieval. As shown in Fig. 1, SCFN mainly consists of two modules: a view-wise semantic information learning module and a multi-view context information learning module. In the first module, we employ a CNN to encode individual views of the 3D model and obtain multiple raw view features. Then we propose a channel attention mechanism (CAM) to exploit the useful semantic information of the raw view features. Specifically, by explicitly modeling the correlation among feature channels, we can enhance the useful attributes according to the channel-wise weights and suppress the attributes that are not useful for the current task.
In the second module, we propose a context information fusion module (CFM) to acquire a compact 3D representation. Since the rendered views of the 3D model are sequential, context information exists between adjacent 2D view features. To effectively synthesize spatial and context information, a 3D CNN followed by global average pooling is adopted in the CFM. By employing a 3D CNN, the motion and appearance among multiple consecutive views are modeled simultaneously. Finally, we use this unified 3D descriptor to complete the classification and retrieval tasks. Extensive experiments on three benchmarks, ModelNet10, ModelNet40, and ShapeNetCore55, verify the superiority of the proposed method compared with state-of-the-art approaches on both 3D model classification and retrieval tasks.

B. CONTRIBUTION
The main contributions of this paper are summarized in the following three aspects:
• Unlike previous view-based 3D model analysis methods, which only focus on multi-view feature aggregation but neglect the feature representation capability of individual views, we propose a channel attention mechanism (CAM) to enhance the useful semantic information contained in individual views.
• We employ a context information fusion module (CFM) to aggregate multi-view features with a 3D CNN and a global average pooling layer, which can fully explore the context and spatial information among view sequences.
• We conduct extensive experiments on ModelNet10, ModelNet40, and ShapeNetCore55 [27]. The experimental results validate the superiority and effectiveness of the proposed method compared to the state-of-the-art methods.

This paper is organized as follows. In Section II, we introduce related works. The specific details of the method are explained in Section III. The relevant experimental settings are presented in Section IV. Experimental results and discussion are provided in Section V. Finally, Section VI concludes the paper.

II. RELATED WORK
In this section, we briefly review representative methods proposed for the 3D model classification and retrieval tasks. According to the data formats used by these 3D model description methods, we can divide them into two categories: model-based methods and view-based methods.
• Model-based methods: The model-based methods take the raw representation of the 3D model as input, such as mesh, volume, and point cloud. Handcrafted features are widely used in many fields of computer vision [28]-[31] because they can reflect the characteristics of the data well. In previous work, researchers tended to design handcrafted features, like point feature histograms [32] and local surface feature descriptors [33], to recognize these raw representations. Mian et al. [34] propose a fully automatic 3D model-based free-form object recognition and segmentation algorithm; it is a multiview correspondence algorithm that automatically registers unordered views of an object with O(n) complexity. Fang et al. [35] propose a temperature distribution (TD) descriptor, which is capable of exploring the intrinsic geometric features of a shape. Zeng et al. [36] propose 3DMatch, which learns a local volumetric patch descriptor to match partial 3D data. However, handcrafted features have to be redesigned when the data source changes, which limits their application scenarios. In recent years, more and more researchers have investigated deep learning methods to process 3D models and achieved great success. Wu et al. [27] propose 3D ShapeNets, which represents the 3D model as a probability distribution of binary variables on a 3D voxel grid. Riegler et al. [37] present OctNet, where a set of unbalanced octrees is employed to divide the sparse input data space and the leaf nodes store the pooled features, which greatly reduces the computation and memory cost. Brock et al. [38] design a voxel-based variational autoencoder to explore the latent space of 3D shapes, together with a deep convolutional neural network architecture for object classification, addressing the unique challenges of voxel-based representations. Although the VRN Ensemble performs well, it requires more computation cost. Wu et al.
[11] propose a disordered graph convolutional neural network (DGCNN). This architecture can dynamically update the graph by EdgeConv, which captures local geometric information while ensuring permutation invariance. Qi et al. [9] propose PointNet, a structure that extracts key points to represent the 3D model; this capability makes PointNet robust to noise and data loss. However, this method cannot capture the local structure of the 3D model because the correlation between local points is not learned. To solve this problem, Qi et al. propose PointNet++ [10], which employs a hierarchical neural network to recursively apply PointNet on the point cloud and has achieved satisfactory results. Zhi et al. [39] propose LightNet, which uses multi-task learning to solve real-time 3D object recognition problems. Extensive experiments have proven LightNet's superior object recognition accuracy and computational efficiency in real-time tasks; the method provides a fairly basic structure, and its recognition performance can be further improved by adding more effective modules. From the above introduction, it can be seen that model-based methods implicitly require model information. According to the model information, these model-based methods can be further divided into mesh-based methods, volume-based methods, point-cloud-based methods, and so on.
• View-based methods: The view-based methods leverage a set of views to depict the 3D model, and researchers construct a discriminative 3D descriptor from these views. In early work, Chen et al. [13] propose a visual 3D model retrieval system, which employs the Zernike moments and the Fourier descriptor to encode views for the 3D model retrieval task. Liu et al. [40] propose a multi-view latent variable model (MLVM), which designs an undirected graph structure to discover the potential spatial context information of a given 3D model. Su et al. [17] propose the multi-view convolutional neural network (MVCNN). This method employs a 2D CNN to extract the feature of each view and fuses the multi-view features by a max-pooling strategy, which has achieved great success in 3D model classification and retrieval tasks. Bai et al. [12] propose a 3D shape search engine (GIFT), which employs a 2D CNN to extract view features and matches them to calculate the similarity between 3D models. Yavartanoo et al. [41] propose a stereographic projection neural network (SPNet), which learns the feature representation of a 3D model by transforming the input 3D model into 2D planar images using stereographic projection. Sfikas et al. [15] propose an ensemble of PANORAMA-based 2D CNNs (PANORAMA-NN). This method takes the panoramic view of the 3D model as the input to the convolutional neural network and employs the SYMPAN method to normalize the pose. Jiang et al. [42] propose a multi-loop-view convolutional neural network (MLVCNN), which introduces a hierarchical view-loop-shape architecture to deal with multiple groups of views. It makes better use of the local features of a view within a loop while taking the global feature representation into account; however, its view rendering settings and network structure are more complicated than those of most view-based methods. Kanezaki et al. [43] propose RotationNet.
It treats the viewpoint labels ignored by other methods as latent variables and jointly estimates the object category and viewpoint from each single-view image, achieving excellent results in the 3D classification task. However, it has the limitation that each image should be observed from one of the pre-defined viewpoints. These view-based methods have achieved promising progress in the 3D model classification and retrieval tasks, but some problems remain. For example, MVCNN [17] adopts a max-pooling layer to fuse multi-view features, which considers neither the view-wise semantic information nor the context information contained in the consecutive view sequence. In recent years, some researchers have proposed multimodal fusion methods [8], [44], [45]. For example, Hegde et al. [44] propose FusionNet, which combines a volumetric representation and a pixel representation to learn new features. Qi et al. [8] design a new method to improve the existing volumetric CNNs and multi-view CNNs. These methods make full use of the characteristics of different data representations, but they are also limited by the gap between those representations.

III. METHODS
In this section, we first give the problem definition and an overview. In the remainder, we introduce the SCFN network architecture and explain the main modules of SCFN in detail.

A. PROBLEM FORMULATION AND OVERVIEW
1) PROBLEM FORMULATION
In the 3D model classification and retrieval tasks, view-based methods usually represent the 3D model with a set of rendered views. Given a 3D model, we render N views from different perspectives. In the training process, we train a feature extraction network to extract the feature of each view. Then we select an appropriate strategy to aggregate the features of all views into a compact 3D descriptor. In the 3D classification task, we use the 3D descriptor as the input of the softmax function to predict the category label of the 3D model. In the 3D retrieval task, we adopt the Euclidean distance between descriptors to measure the similarity of two models. Based on this similarity, we select the candidate set that meets the query criteria as the retrieval result.
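The retrieval step described above can be sketched in a few lines: rank all gallery models by their Euclidean distance to the query descriptor. The toy descriptors and model identifiers below are illustrative assumptions, not values from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two 3D model descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, gallery):
    """Rank gallery models by ascending distance to the query descriptor.

    `gallery` maps a model id to its descriptor; returns ids, nearest first.
    """
    return sorted(gallery, key=lambda mid: euclidean(query, gallery[mid]))

# Toy 3-dimensional descriptors (hypothetical; real descriptors are 1024-d).
gallery = {"chair_01": [1.0, 0.0, 0.0],
           "chair_02": [0.9, 0.1, 0.0],
           "table_01": [0.0, 1.0, 0.5]}
ranking = retrieve([1.0, 0.05, 0.0], gallery)  # nearest model first
```

In practice the top of this ranking (or all models within a distance threshold) forms the candidate set returned for the query.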

2) OVERVIEW
As shown in Fig. 1, our method comprises the following two modules to complete these two tasks:
• View-wise Semantic Information Learning. To exploit the useful semantic information of the raw view features, this module consists of two parts: raw view feature extraction and a channel attention mechanism. First, we leverage virtual cameras to capture a set of views of the 3D model and use a 2D CNN to extract the raw feature of each view. Second, we enhance the raw view features by channel-wise attention modeling.
• Multi-view Context Information Learning. The purpose of this module is to fuse the multiple enhanced view features and generate a compact 3D descriptor. Inspired by relevant methods in the field of video processing [46]-[48], we use a 3D CNN as the essential component of the CFM to mine context information.

B. VIEW-WISE SEMANTIC INFORMATION LEARNING
1) RAW VIEW FEATURE EXTRACTION
Given a 3D model, we extract multiple 2D views from different viewpoints, as shown in Fig. 1. In the view capturing process, we fix an upright orientation axis as the rotation axis. Then we place virtual cameras pointing at the centroid of the 3D model at intervals of θ [17]. We use AlexNet [20] as the backbone CNN for 2D view feature extraction. AlexNet is composed of five convolutional layers (conv1-conv5) and three fully connected layers (fc6-fc8), and we take the output of conv5 as the raw view feature. Therefore, the raw multi-view feature set can be written as

V = {v_1, v_2, ..., v_N}, v_i ∈ R^(W×H×C),

where v_i represents the raw view feature of the i-th view, and W, H, and C denote the width, height, and number of channels of the feature, respectively.
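The camera placement can be illustrated with a small sketch: N cameras at equal azimuth intervals θ = 360°/N around the upright axis, all aimed at the centroid. The elevation angle and radius below are hypothetical values chosen for illustration; only the equal-interval azimuth follows the text.

```python
import math

def camera_positions(n_views, elevation_deg=30.0, radius=2.0):
    """Place virtual cameras around the upright (z) axis at equal azimuth
    intervals theta = 360 / n_views, all pointing at the model centroid
    at the origin. Elevation and radius are illustrative assumptions."""
    theta = 360.0 / n_views
    elev = math.radians(elevation_deg)
    cams = []
    for i in range(n_views):
        azim = math.radians(i * theta)
        cams.append((radius * math.cos(elev) * math.cos(azim),
                     radius * math.cos(elev) * math.sin(azim),
                     radius * math.sin(elev)))
    return cams

cams = camera_positions(12)  # theta = 30 degrees, as used in the experiments
```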

2) CHANNEL ATTENTION MECHANISM
Previous methods [8], [17], [23] usually employ the raw view feature v_i directly in the subsequent multi-view feature fusion procedure. However, we notice that each channel of v_i carries certain channel-wise statistics; in fact, the view feature channels are the responses to different convolutional filters. By explicitly modeling the correlation among channels, the network can increase its sensitivity to the useful statistics. Therefore, we design a novel channel attention mechanism (CAM) that calculates the channel-wise importance according to the statistics contained in each channel. Based on the channel-wise importance, we re-weight the channels to get a more informative view feature. First, we apply an average pooling operation to each channel of v_i to obtain a statistic z_i ∈ R^(C×1). During this procedure, the c-th element of z_i is formulated as

z_i^c = (1/(H×W)) Σ_{h=1}^{H} Σ_{w=1}^{W} v_i^c(h, w),

where v_i^c represents the c-th channel of v_i, and H and W refer to its height and width, respectively. Since the statistic z_i alone lacks the non-linearity needed to learn more complex correlations among channels, we adopt a gate mechanism with a sigmoid activation function:

α_i = φ(T_2 ψ(T_1 z_i)),
where the sigmoid function and the ReLU function are denoted by φ and ψ, respectively. The transformations T_1 and T_2 are performed by two fully connected layers, which act as a bottleneck with a dimension reduction ratio to reduce the training cost. Then we use α_i ∈ R^(C×1) to update the raw view feature v_i and obtain the updated view feature v̂_i. The c-th channel of v̂_i is obtained by

v̂_i^c = α_i^c · v_i^c,

where α_i^c represents the c-th element of α_i. In the end, the enhanced multi-view feature set is defined as

F = {v̂_1, v̂_2, ..., v̂_N}.
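A minimal NumPy sketch of the CAM computation: squeeze by per-channel average pooling, a two-layer bottleneck (T_1, T_2) with ReLU and sigmoid, then channel re-weighting. The toy dimensions, reduction ratio, and random weights are illustrative assumptions, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, r = 8, 6, 6, 2            # channels, spatial size, reduction ratio (toy values)

def channel_attention(v, T1, T2):
    """Channel attention following the equations above.
    v: raw view feature of shape (C, H, W); T1, T2: bottleneck FC weights."""
    z = v.mean(axis=(1, 2))                       # squeeze: per-channel average pooling
    hidden = np.maximum(T1 @ z, 0.0)              # psi: ReLU after dimension reduction
    alpha = 1.0 / (1.0 + np.exp(-(T2 @ hidden)))  # phi: sigmoid, weights in (0, 1)
    return alpha[:, None, None] * v               # re-weight every channel of v

v = rng.standard_normal((C, H, W))
T1 = rng.standard_normal((C // r, C))             # dimension-reduction layer
T2 = rng.standard_normal((C, C // r))             # dimension-restoration layer
v_hat = channel_attention(v, T1, T2)
```

Because the sigmoid keeps every weight in (0, 1), the update can only attenuate channels, emphasizing informative ones relative to the rest.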

C. MULTI-VIEW CONTEXT INFORMATION LEARNING
After obtaining the enhanced multi-view feature set F, how to fuse the multi-view features into a unified 3D feature vector is vital to the final performance. Previous works [8], [17] usually employ a pooling strategy to fuse the multi-view features, which cannot discover the multi-view context information. We introduce the context information fusion module (CFM) to tackle this challenge. The module is inspired by successes in video processing [46], [47], [49], where many approaches divide the whole video into frames and explore the context information among frames with a 3D CNN. The CFM consists of two 3D convolution modules and a global average pooling operation. Each 3D convolution module contains three parts: a 3D convolution layer with 1024 filters, a batch normalization layer, and a ReLU layer.
In the first 3D convolution module, the kernel size is set to 1 × 1 × 1 to increase the network's non-linearity. The kernel size of the second 3D convolution module is 3 × 3 × 3 to explore the context information among the multi-view features. After the 3D convolutions, the output is a high-dimensional feature F_high ∈ R^(1024×N×H×W).
We use the high-dimensional feature F_high as the input of a global average pooling layer to obtain the final fused feature:

F_fusion = GAP(F_high) ∈ R^1024.

The context information fusion module (CFM) proposed in this paper aims to explore the spatial and context information contained in the view sequence and to enhance the sensitivity of the network to input changes. Since there are numerous similar objects in 3D model databases, such as vases and bottles, their high similarity often leads to wrong classification results. Given some variation in the input, our 3D CNN module can notice the change and feed it back to the output. On the contrary, a single max-pooling operation produces the same output unless the input variation is obvious enough to alter the maximum value. The ability of the 3D CNN to capture subtle changes helps to enhance the discrimination ability of the network. Besides, the final global average pooling helps SCFN resist local region noise.
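The shapes involved in the fusion step can be sketched as follows. We fake the output of the two 3D convolution modules with random values, since only the tensor shapes matter here; the spatial size and view count are toy assumptions.

```python
import numpy as np

N, H, W = 12, 6, 6     # number of views and spatial size of conv5 features (toy values)
C_out = 1024           # filters in each 3D convolution module, as in the paper

# Stacked multi-view features form a (channels, views, height, width) volume.
# After the two 3D convolution modules (1x1x1 then 3x3x3, assumed padded so
# the view/spatial dims are preserved), the output F_high has this shape;
# here we stand in a random tensor for it.
rng = np.random.default_rng(0)
f_high = rng.standard_normal((C_out, N, H, W))

# Global average pooling over the view and spatial dimensions collapses the
# volume into the final 1024-dimensional 3D descriptor F_fusion.
f_fusion = f_high.mean(axis=(1, 2, 3))
```

The descriptor length is thus fixed by the filter count of the 3D convolutions, independent of how many views N are rendered.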
We use F_fusion as the final 3D descriptor, which is combined with the softmax function for 3D model classification. For 3D model retrieval, we define the Euclidean distance between two 3D model descriptors as the similarity of the two models. According to this similarity, we select the candidate set that meets the query criteria as the retrieval result.
IV. EXPERIMENTAL SETTINGS
A. DATASETS
• ModelNet40: ModelNet40 is composed of 12,311 3D CAD models from 40 classes. In ModelNet40, 9,843 3D models are used as the training set and the remaining 3D models are used as the test set, following the split of Su et al. [17].

B. EVALUATION CRITERIA
On ModelNet10 and ModelNet40, we adopt the eight most representative and widely used criteria to validate the performance of our method, which are defined as follows [23], [50]:
• Nearest Neighbor (NN): It represents the percentage of queries whose closest retrieved 3D model matches the query.
• First Tier (FT): It represents the recall over the top Q retrieved models, where Q is the number of relevant models in the query's category.
• Second Tier (ST): It represents the recall over the top 2Q retrieved models, where Q is the number of relevant models in the query's category.
• Mean Average Precision (mAP): It is a ranking measurement that overcomes the single-point-value limitation of precision and recall.
• F_measure: It is a comprehensive measurement considering both precision and recall for the top 20 retrieval results.
• Average Normalized Modified Retrieval Rank (ANMRR): It considers the ranking information of the relevant 3D models to measure the performance of the ranking list.
• Discounted Cumulative Gain (DCG): It is a statistical measurement that assigns higher weights to related 3D models while assigning lower weights to unrelated 3D models.
• Precision-Recall curve (PR curve): It is a key plot that visualizes the trade-off between precision and recall.

On ShapeNetCore55 [27], we adopt the evaluation code and indicators provided by SHREC'17. The indicators include Precision (P@N), Recall (R@N), F-score (F1@N), Mean Average Precision (mAP), and Normalized Discounted Cumulative Gain (NDCG), where N is the length of the retrieval list. The micro-averaged versions of these indicators give a weighted mean according to the size of each category, while the macro-averaged versions give the unweighted mean regardless of category size.
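As a worked example, NN, FT, and ST for a single query can be computed directly from the ranked list of retrieved class labels. The toy ranking and class size below are illustrative assumptions.

```python
def nn_ft_st(ranked_labels, query_label, class_size):
    """Nearest Neighbor, First Tier, and Second Tier for one query.

    ranked_labels: class labels of the retrieved models, best match first
    class_size:    number of relevant models Q in the query's category
    """
    relevant = [lab == query_label for lab in ranked_labels]
    nn = 1.0 if relevant[0] else 0.0                 # is the top hit correct?
    ft = sum(relevant[:class_size]) / class_size     # recall over top Q
    st = sum(relevant[:2 * class_size]) / class_size # recall over top 2Q
    return nn, ft, st

# Toy ranking for a query of class "chair" with Q = 3 relevant models.
ranking = ["chair", "chair", "table", "chair", "sofa", "table"]
nn, ft, st = nn_ft_st(ranking, "chair", 3)   # nn = 1.0, ft = 2/3, st = 1.0
```

Dataset-level scores are the means of these per-query values over all queries.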

C. IMPLEMENTATION DETAILS
We employ a standard backpropagation strategy to train the entire network in an end-to-end manner. During the training stage, we utilize strategies such as dropout and weight decay to prevent over-fitting. In our experiments, the learning rate is fixed at 0.0001. Our method is implemented with the PyTorch framework. All experiments are conducted on a server with two GeForce GTX 1080 GPUs with 12G of memory, one Intel(R) Xeon(R) CPU, and 32G of RAM.

V. EXPERIMENTAL RESULTS AND DISCUSSION
A. COMPARISON WITH STATE-OF-THE-ART METHODS
As shown in Table 1 and Table 2, we compare SCFN with the state-of-the-art methods on ModelNet10, ModelNet40, and ShapeNetCore55. The experimental results on ShapeNetCore55 are all taken from the SHREC'17 competition. We choose accuracy and mAP as the main indicators to analyze the 3D model classification and retrieval performance.
On the ModelNet10 and ModelNet40 datasets, our method outperforms most of the compared methods, except VRN Ensemble [38], MLVCNN [42], and RotationNet [43]. Compared with the remaining methods, on ModelNet10, SCFN gains 1.4% to 21.2% in accuracy and 0.4% to 34.1% in mAP; on ModelNet40, SCFN improves accuracy and mAP by 0.9% to 36.5% and 2.7% to 155.6%, respectively. In summary, we can draw the following conclusions:
• SCFN is superior to other representative view-based methods. Previous methods [8], [17] usually employ a pooling strategy to fuse multi-view features into a unified 3D model representation. Although this operation is robust to the view input order, the pooling strategy is too simple to capture the multi-view spatial context information. In contrast, our method handles this challenge well with the help of the 3D CNN.
• The performance of SCFN is better than that of most model-based methods, in which raw representations such as point clouds, voxels, and volumes are directly utilized to capture the structure information of the 3D model. However, these methods [6], [9], [10] cannot capture the latent visual semantic information of the 3D model, whereas SCFN captures this information through the channel attention mechanism (CAM).
• Compared with SCFN, VRN Ensemble [38], MLVCNN [42], and RotationNet [43] achieve relatively better performance. VRN Ensemble designs a method for training voxel-based autoencoders to solve the unique challenges of voxel-based representations, but it is not scalable for real applications since voxel data are difficult to capture. Comparatively, 2D images are more readily available, which further boosts the demand for view-based 3D shape retrieval methods. MLVCNN designs a hierarchical view-loop-shape structure, taking into account the local features of the views in a loop and the global feature representation. However, the multiple loop views (3 loops × 8 views) require more computation cost and memory storage. Comparatively, SCFN can achieve satisfactory performance with fewer views, and the experiments in Section V-D further validate the effectiveness of the proposed method. RotationNet treats the viewpoint labels ignored by other methods as latent variables and jointly estimates the object category and viewpoint from each single-view image. However, it strongly depends on a specific camera array setting, where the input view sequence should be in a fixed order, which restricts its application in real scenarios. Comparatively, the proposed SCFN is camera-constraint-free: both the view number and the view order can be arbitrary, as detailed in Section V-C and Section V-D.

As can be seen from Table 2, on ShapeNetCore55, the proposed SCFN obtains better accuracies than the other state-of-the-art algorithms on six indicators, such as R@N, mAP, and NDCG (both micro and macro), but relatively poor performance on the other four. Intuitively, high precision and high recall together denote good performance, but in practice these two criteria often move in opposite directions: high precision usually comes with a passable but not high recall. Therefore, the F1 score and mAP have been introduced to measure them comprehensively.
The F1 score is computed as the harmonic mean of precision and recall; however, it suffers from a single-value limitation and is thus easily affected by the sample distribution. This problem can be alleviated by mAP, which is determined by the area enclosed by the precision-recall curve; hence, mAP is usually regarded as the most important criterion for measuring retrieval performance. This explains the lower P@N and F1@N but higher R@N and mAP achieved by the proposed method. Besides, NDCG takes into account the position of each result in the retrieval list and has thus also been regarded as an important evaluation criterion. Compared with the other state-of-the-art algorithms, SCFN achieves gains of 3.7%-317.2%, 7.2%-169.4%, 1.9%-218.4%, and 20.9%-137.0% in terms of micro mAP, macro mAP, micro NDCG, and macro NDCG, respectively, which verifies the effectiveness of the proposed SCFN. Fig. 2 presents some samples of the retrieval results. The overall search results are satisfactory: given the query model, our method obtains correct results in most categories, such as "car", "chair", and "guitar". Only a few categories contain errors, such as "cup" and "nightstand". We find that the categories with poor retrieval performance have great similarity in appearance and are very difficult to distinguish even manually.
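The contrast between the single-point F1 score and the position-aware mAP can be illustrated with a small example. `average_precision` below is the standard per-query AP (mean precision at each relevant hit, i.e. the area under the query's precision-recall curve); averaging it over queries gives mAP. The toy relevance lists are illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of a single precision/recall operating point."""
    return 2 * precision * recall / (precision + recall)

def average_precision(relevant):
    """AP over a ranked relevance list (1 = relevant, 0 = not):
    the mean of the precision values at each relevant hit."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / max(hits, 1)

# Two rankings with the same relevant items at different positions:
ap_front = average_precision([1, 1, 0, 0])   # relevant items ranked first -> AP = 1.0
ap_back = average_precision([0, 0, 1, 1])    # same items ranked last -> AP = 5/12
```

F1 at a fixed cutoff cannot distinguish these two rankings (both retrieve the same items), while AP rewards placing relevant models earlier, which is why mAP is the more informative retrieval criterion.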

B. ABLATION STUDY
We conduct an ablation study to show the effect of the different parts of our model. The classification and retrieval results of four architectures are reported in Table 3 and Table 4, respectively, which demonstrates that each part is crucial to the final performance. The four architectures are detailed as follows:
• MVCNN: It renders the 3D model from different perspectives to get multiple views, extracts the view features by a 2D CNN, and aggregates these view features into a unified 3D descriptor by the max-pooling operation.
• MVCNN + CAM: It renders views from multiple perspectives of the 3D model and extracts the raw view features by a 2D CNN; then the channel attention mechanism (CAM) is employed to exploit the channel-wise statistics and update the raw view features by re-weighting the channels. Finally, the learned view features are passed through the max-pooling process to get a 3D descriptor.
• MVCNN + CFM: On the basis of MVCNN, it replaces the max-pooling strategy by a context information fusion module (CFM) to acquire a compact 3D representation.
• SCFN (MVCNN + CAM + CFM): It first places virtual cameras around the 3D model to capture multiple views from different directions. Then a 2D CNN is applied to extract the raw view features, and the channel attention mechanism (CAM) is designed to update the raw view features with channel-wise weights. Next, these updated view features are concatenated and processed by the context information fusion module (CFM) to explore the spatial information.

It can be observed that the method with the channel attention mechanism (MVCNN + CAM) outperforms the basic method (MVCNN), since it can learn effective deep visual features by re-weighting the feature channels, which benefits our model greatly. The use of the context information fusion module (MVCNN + CFM) is also better than MVCNN: given some variation in the input, the 3D CNN module in the CFM can capture the variance, whereas the single max-pooling operation produces the same output unless the input variation changes the maximum value. The ability to capture subtle changes helps to enhance the discrimination ability of our model. As expected, the combination of CAM and CFM further improves the performance consistently.

C. SENSITIVITY ANALYSIS ON VIEW ORDER
To test whether our model is affected by the input order of views, we conduct 50 experiments on ModelNet10 and ModelNet40. Each experiment randomly disturbs the input order of views, and we employ a series of indicators to measure classification and retrieval performance. The results are reported in Table 5. Across the 50 experiments, we observe that, compared with sequentially ordered input (ordered views), the out-of-order setting (disordered views) has little effect on the experimental results. Our method introduces a 3D CNN in the fusion part, which enhances the sensitivity of our network to input changes and captures the spatial and context information of the input data. Due to the characteristics of the 3D CNN, our method is affected by the view input order to some degree, but combined with the global average pooling in the fusion part, this impact is limited.
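The limiting role of the averaging step can be sketched in a few lines: whatever order-sensitive transform precedes it, a global average over the view axis is itself permutation-invariant, which bounds how much a view shuffle can change the final descriptor. The snippet below is an illustrative reduction over raw view features, not the full CFM (which applies 3D convolutions before pooling).

```python
def global_average_fuse(view_feats):
    """Sketch of the permutation-invariant tail of the fusion:
    average each channel over the view axis. The 3D convolutions that
    precede it in CFM are order-sensitive; this pooling is what keeps
    the overall effect of view shuffling limited.
    view_feats: list of views, each a list of channel activations."""
    n_views = len(view_feats)
    n_ch = len(view_feats[0])
    return [sum(v[c] for v in view_feats) / n_views for c in range(n_ch)]
```

Feeding the same views in any order into this reduction yields an identical fused vector, so only the convolutional stage contributes order sensitivity.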

D. SENSITIVITY ANALYSIS ON THE NUMBER OF VIEWS
Because the number of views rendered from the 3D model may affect the performance of 3D model classification and retrieval, we conduct several comparative experiments to explore the impact of the number of views on classification and retrieval performance. As described in Section III-B, we set the interval angle θ of the virtual cameras to 30°, 36°, 45°, 60°, 90°, and 180°, to generate 12, 10, 8, 6, 4, and 2 views for each 3D model, respectively. The experimental results are shown in Table 6, Table 7, and Fig. 4. From the experimental results, we can draw the following conclusion: as the number of views increases, the amount of information carried by the multiple views increases and the classification and retrieval performance improves. As the number of views approaches 12, the PR curves gradually approach each other, and the improvement in the indicators becomes smaller. When the number of views reaches 12, SCFN achieves the best performance.
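The relation between the camera interval angle and the number of rendered views is simply 360° divided by θ, which a one-line helper makes explicit:

```python
def num_views(theta_deg):
    """Number of rendered views when virtual cameras are placed every
    theta_deg degrees around the 3D model, as in the Section III-B setup."""
    assert 360 % theta_deg == 0, "interval must evenly divide a full circle"
    return 360 // theta_deg
```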

VI. CONCLUSION
In this paper, we propose a novel network (SCFN) to learn a discriminative 3D descriptor for 3D model classification and retrieval tasks. We design a channel attention mechanism (CAM) to enhance the useful semantic information in view features and suppress the useless information, acquiring a more effective visual feature. For multi-view feature fusion, we propose the context information fusion module (CFM) to replace the traditional pooling strategy. Compared with other methods, the CFM can better exploit multi-view context information. We compare SCFN with state-of-the-art methods on three challenging datasets: ModelNet10, ModelNet40, and ShapeNetCore55. The experimental results verify the superiority and effectiveness of SCFN in the 3D model classification and retrieval tasks. In the related work, we find that the existing view-based methods have done a lot of work on the correlation between views and the fusion of multi-view features, and have achieved good results. However, they rarely consider the semantic information contained in the view itself. SCFN has made some progress in this direction, but there is still room for improvement, and we will continue to pursue this aspect in follow-up work. Besides, the proposed SCFN is designed to model the multi-view information of the 3D model, but it does not incorporate other modality information, such as point clouds, voxels, and meshes. In the future, we will also try to combine the view information of the 3D model with other modality information to generate a more discriminative 3D descriptor.
WEN-HUI LI received the M.S. and Ph.D. degrees from the School of Electrical and Information Engineering, Tianjin University. He was an Intern Student with the SeSaMe Center, National University of Singapore. His research interests are in the field of computer vision, machine learning, and 3D model retrieval.
DAN SONG received the Ph.D. degree in computer science and technology from Zhejiang University, China. Her research interests include computer graphics, computer vision, 3D human body reconstruction, and virtual fitting.