Revisiting the Effectiveness of 3D Object Recognition Benchmarks

Recently, 3D computer vision has grown rapidly and become an essential topic in both research and industrial applications. Yet no large-scale 3D benchmark comparable to ImageNet is available for many 3D computer vision tasks such as 3D object recognition, 3D body motion recognition, and 3D scene understanding. Existing 3D benchmarks are insufficient in the number of classes and the quality of data samples, and reported performances on these datasets are nearly saturated. Furthermore, 3D data obtained with existing 3D sensors are noisy and incomplete, causing unreliable evaluation results. In this work, we revisit the effectiveness of existing 3D computer vision benchmarks. We propose to refine and re-organize existing benchmarks to provide cheap and easily accessible, yet challenging, effective, and reliable evaluation schemes. Our work includes data refinement, class category adjustment, and improved evaluation protocols. Biased benchmark subsets and new challenges are suggested. Our experimental evaluations on ModelNet40, a 3D object recognition benchmark, show that our revised benchmark datasets (MN40-CR and MN20-CB) provide improved indicators for performance comparison and reveal new aspects of existing methods. State-of-the-art 3D object classification and data augmentation methods are evaluated on MN40-CR and MN20-CB. Based on our extensive evaluation, we conclude that carefully re-organized existing benchmarks are good alternatives to a large-scale benchmark, which is very expensive to build and whose data quality is difficult to guarantee in the current immature 3D data acquisition environment. We make our new benchmarks and evaluations public.


I. INTRODUCTION
As deep learning approaches have shown performance saturation in many 2D computer vision tasks, 3D data based approaches have received much attention. Three-dimensional information plays a key role in many vision tasks: from 3D object recognition and reconstruction to autonomous vehicle driving, robot navigation, and immersive realistic media interaction. In particular, object recognition from 2D images and 3D data is a core task for many other related image understanding applications. Large public benchmarks such as ImageNet [1] and PASCAL VOC [2] have been introduced, leveraging rapid development of deep learning approaches in the field. Although, in general, deep neural networks pursue generalization and avoid over-fitting on a particular dataset, their algorithm development and performance evaluation are largely influenced by popular benchmarks. Therefore, the existence of high-quality and large benchmarks is an essential factor for the prosperity of a research field. A benchmark is not only an evaluator of existing methods, but also a guide toward future development directions.
Our main interest is the effectiveness of benchmarks for 3D object classification. Classification is the most basic task for perceiving data. Despite various 3D object classification approaches being introduced, recently reported quantitative gains on existing 3D benchmarks are not significant. It is difficult to determine whether the various new methods have been properly verified in particular tasks and environments. A better benchmark with extended versatility is needed to reveal diverse aspects of the performance of 3D recognition models. Different models and corresponding applications may have different requirements (real vs. CAD data, complete vs. partial 3D shape, small vs. large scale scene, etc.).
We also consider 3D data augmentation. Naturally, researchers want to find generally effective data augmentation methods, rather than methods effective only for specific datasets. A problem is that we have no choice but to rely on quantitative measurements to evaluate data augmentation methods. Unlike judging how well samples produced by generative models are generated, it is very difficult to assess the value of augmented samples as training data with the human eye. Therefore, the reliability of the quantitative measurements provided by benchmarks is all the more important for data augmentation. When a dataset lacks quality or quantity, quantitative measurements (e.g., accuracy) are less likely to be improved by correct or generally effective data augmentations than by incorrect or dataset-specific ones. For quantitative evaluation of data augmentation, there should be a sufficient amount of test data that is not yet accurately recognized by the recognition models but can be improved with better data augmentation. Otherwise, an augmentation that warps decision boundaries to recognize a few more outliers, at the cost of overall generalization, is more likely to score well. For example, with a baseline already at 94% accuracy, only 6% headroom remains; even an augmentation that removes a tenth of the remaining errors yields only a 0.6% gain, easily buried in run-to-run variance. Thus, datasets already reaching their limits are likely to mislead future data augmentation studies.
Therefore, in this study, we carefully examine existing 3D datasets that are popular but saturated in performance evaluation, and revisit their effectiveness as benchmarks. Many large-scale benchmarks have depended heavily on the open internet for data collection. However, it is significantly difficult to create a new large-scale 3D dataset of improved data quality and size. Instead of building a new large-scale 3D benchmark at great cost, time, and effort, we propose to carefully refine and expand an existing benchmark to enhance the quality of data samples and add more reasonable challenges. In this work, we clean the most popular 3D object dataset, ModelNet40 [3], and create two new datasets, MN40-CR and MN20-CB, according to novel effective challenges on the cleaned dataset. We also demonstrate that our proposal enables the verification of recognition models, including classification and data augmentation, using more reliable and challenging standards. For example, we compare accuracy gains of diverse classification models on ModelNet40, MN40-CR, and MN20-CB.

II. RELATED WORK
A. 3D OBJECT RECOGNITION BENCHMARKS
Various 3D object recognition benchmarks exist in the literature, as summarized in Table 1. Existing 3D datasets cover varying scopes of instances: single/multiple objects, indoor/roadway/urban-scale scenes, and scanned/synthetic data. Complex and diverse information contained in an instance makes the recognition task more challenging. Indoor-scale 3D datasets are relatively easy to obtain using existing sensors such as time-of-flight depth cameras. Most indoor objects can be scanned at close distance with little interference from daylight. There also exist indoor synthetic datasets [4], [5]. On the other hand, roadway datasets such as KITTI [6] and SemanticKITTI [7] are obtained with sensors mounted on moving vehicles. Detecting, segmenting, and tracking 3D objects on these datasets have been evaluated for autonomous driving applications. Outdoor 3D data scanned from a moving vehicle contains only the visible side of buildings and objects from the viewpoint. Urban-scale datasets are scanned by aviating devices such as drones and contain large-scale dense 3D topographic information of a region. For example, SensatUrban [8] consists of 2,847 million points, which is very difficult to synthesize.
For the description of an object, completeness of the 3D shape is important. In addition, whether other close objects or attached background are included in the data is critical for object recognition performance. To avoid such problems, some scanned real object datasets provide segmented instances or boundary labels [9], although they still have only partial 3D surface data of objects. In this regard, a synthetic dataset having a single complete 3D object for each instance shows better reliability in performance evaluation.
Annotation is another important factor in determining the characteristics of datasets. Annotations include class labels, part or semantic segmentation labels, object locations indicated by bounding boxes, and sequence labels for tracking ('C', 'S', 'B', and 'T' in Table 1, respectively). A class label defines which category each instance belongs to. Some datasets [16], [24] provide class labels for each segmented or bounding-boxed object. ShapeNet [32] provides multiple hierarchical class labels for each instance. By contrast, ModelNet40 [3] has a single class label for each instance. Part or segmentation labels define local regions [7] or semantic parts [33], [34] of an object. Sequence labels make it possible to perform spatio-temporal recognition tasks such as motion prediction and object tracking [24], [25], [26].
3D point clouds are obtained either with devices such as stereo cameras, laser scanners, and time-of-flight depth sensors, or by manually creating graphical 3D models. Both acquisition methods have pros and cons for a benchmark. When scanning real surroundings, data characteristics vary significantly depending on the scanning device, method, acquisition environment, and post-processing. This limited normalization of data quality imposes a difficulty in cross comparison of evaluation performance. Due to the unexpected or unknown aspects of real data, analytical performance evaluation can be limited with a real data benchmark. Real data may contain systematic or non-systematic noise, background clutter, spatial distortion, and occlusion. Sparse data can be interpolated and missing parts can be inferred, but the true data cannot be restored. However, relatively more complex and large-scale objects can be collected at low cost. Scanned data contains detailed reality information of the target scene, which we miss or intentionally ignore with synthetic data. It gives practical verification of a method in real world applications. On the other hand, synthetic data contains only controlled aspects of target 3D objects. There are no unexpected noises or aspects in synthetic data. In most cases, 3D objects of complete shape are created, and incomplete partial shapes can be synthesized as evaluations demand. Therefore, recognition methods come to focus more on the 3D shape of the object itself. Synthetic data generated by different producers for different goals has different characteristics and levels of descriptive detail. Some data show unrealistic graphical appearances. They may have overly simple and abstracted shapes.

TABLE 3. 3D object classification methods and overall accuracy (OA, %) on ModelNet40: Input data types in the first column 'I.' are 3D 'Voxel', 2D 'Image', or un-ordered arrays of 'XYZ' 3D point cloud coordinates. [60], [61], [62], [63] use normals or mesh information. [64] is pre-trained on ImageNet.
ShapeNet [32] has a large amount of annotated synthetic objects. However, it does not provide a formal classification protocol such as a predetermined train/test split. Even though we could define a classification protocol on ShapeNet, there would be no previous evaluation to compare with. ScanObjectNN [9] is an object classification benchmark of scanned 3D data. However, the number of instances is not enough for proper performance evaluation, and the samples suffer from occlusions. A ModelNet40 [3] instance is a complete single object without lens distortion, background, or noise. The numbers of classes and labeled instances are appropriate for a benchmark, and it provides official training and test splits for object classification. Thanks to these qualities, ModelNet40 is the most widely used benchmark in the field. One negative point is that the dataset is getting dated and reported performance is becoming saturated.
Table 2 shows several instance samples. Synthetic samples tend to contain unusual objects that are difficult to place in the general categories of reality: for example, hand-shaped and shoe-shaped chairs, beds with unusual mechanical devices, or a mannequin labeled as 'Person'. Many such cases were probably created for use in a virtual space such as a game. They may belong to the class in the virtual world, but they are far from reality. If overall consistency is guaranteed, datasets consisting of objects created and labeled with standards different from reality are also valuable. The problem, however, is that objects labeled with different standards are mixed together when the dataset is collected. In this case, the true class boundary becomes very ambiguous. In addition, there exist instances whose amount of information is too small. Models trained with such data are likely to make generalization errors, and if such instances are in the test data, rigorous evaluation becomes difficult.
Despite the importance of the 3D object classification task, there are only a few reliable benchmarks with an appropriate number of instances, such as ModelNet40 [3] and ScanObjectNN [9]. Both provide an appropriate class distribution and an official training/test split for use as classification benchmarks.

B. 3D OBJECT CLASSIFICATION MODELS
Many 3D recognition models have been proposed and evaluated on the 3D object classification benchmarks. The models can be categorized by input data format, as shown in Table 3. Some methods [32], [35] accept 3D voxels as input and employ 3D convolutions. Beyond them, a study [36] achieved improved accuracy using 3D convolution with multi-view 3D input. However, a point cloud of an object surface is not truly volumetric data, and increasing the spatial resolution causes the computational cost to grow cubically. Improved approaches [37], [38] were proposed to mitigate this inefficiency when working with point cloud data. Some other approaches [39] adopt 3D-to-2D data projection and 2D CNNs, in which 3-dimensional geometry information is lost. This approach is reasonable when the spatial information of one dimension is not very important [40]. With properly interpreted multi-view inputs, 3D information can be recovered [41], but taking multi-view inputs requires increased computational cost and a complex network structure. Thus some work [42], [43] improves computational efficiency by removing the limitation of projection at a fixed view position.
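To make the resolution cost concrete, here is a minimal sketch (ours, not from the cited works; all names are illustrative) that converts a point cloud into a dense binary occupancy grid. The grid always allocates resolution³ cells, while a surface occupies only on the order of resolution² of them, which is the inefficiency described above.

```python
import numpy as np

def voxelize(points, resolution):
    """Convert an (N, 3) point cloud into a binary occupancy grid.

    The dense grid allocates resolution**3 cells no matter how few are
    occupied, which is why dense 3D convolutions get expensive quickly.
    """
    # Normalize coordinates into the unit cube [0, 1).
    mins = points.min(axis=0)
    spans = points.max(axis=0) - mins
    normalized = (points - mins) / (spans + 1e-9)

    # Map each point to a voxel index and mark that cell occupied.
    idx = np.minimum((normalized * resolution).astype(int), resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=bool)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

# Points on a unit sphere surface occupy ~O(r^2) of the O(r^3) cells.
dirs = np.random.randn(2048, 3)
cloud = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
for r in (16, 32, 64):
    g = voxelize(cloud, r)
    print(r, g.size, int(g.sum()))  # total cells vs. occupied cells
```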
The most recent methods accept a 3D point cloud as an array of 3D coordinate values. PointNet [44] is one of the earliest works for 3D understanding with the 3D point cloud representation, and proposed symmetric aggregation. Many following methods such as PointNet++ [45] and ShellNet [46] have improved the description ability of local and global geometry. A convolution layer traditionally requires a predetermined input structure, whereas a point cloud is an unstructured set of 3D points. To apply convolution to point cloud data, other approaches [47], [48] compute the convolution weights of a 3D point from the coordinates of its neighboring points. MKConv [49] tries to unify voxelization, symmetric aggregation, and point convolution into one framework. Meanwhile, several works [50], [51] employ GCNs (Graph Convolutional Networks) for the description of 3D structure from point clouds. GCNs treat data as a graph of node-to-node relations rather than as a shape in space. A point cloud is a set of points without any order and does not provide locality information between the points. The Transformer [52] concentrates on context without locality or order. For this reason, many recent studies [53], [54], [55], [56], [57], [58], [59] apply the transformer architecture to 3D recognition.
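To illustrate symmetric aggregation concretely, here is a heavily simplified PyTorch sketch in the spirit of PointNet; it omits the input and feature transformation networks of the real model, and all names are ours. A shared per-point MLP followed by max pooling yields an output invariant to the ordering of the input points.

```python
import torch
import torch.nn as nn

class SymmetricAggregation(nn.Module):
    """Minimal sketch of PointNet-style order-invariant aggregation."""

    def __init__(self, in_dim=3, feat_dim=128, num_classes=40):
        super().__init__()
        # Shared MLP applied independently to every point.
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, points):  # points: (batch, n_points, 3)
        per_point = self.point_mlp(points)          # (batch, n_points, feat_dim)
        global_feat = per_point.max(dim=1).values   # symmetric max pooling
        return self.classifier(global_feat)         # (batch, num_classes)

model = SymmetricAggregation()
logits = model(torch.randn(8, 1024, 3))  # same output for any point order
```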

C. 3D DATA AUGMENTATION METHODS
As shown in Table 4, 3D data augmentation methods have reported improved overall accuracy over conventional augmentation schemes (e.g., rotation, translation, point jittering) on ModelNet40. PointAugment [97] employs an end-to-end neural network which generates challenging training samples from source data. However, it is difficult to define a loss function that guarantees good training samples. Mixing samples and labels [98], [99] is a popular approach in 2D image augmentation. PointMixup [100] performs a similar task for 3D point cloud augmentation: it produces new point cloud samples that minimize the sum of distances (EMD or CD) to existing samples in 3D geometric space. Since then, recent work [101], [102], [103] has also studied mixing samples in 3D geometric space to generate new training samples. One of them, PointCutMix [101], proposed two methods named PointCutMix-R and PointCutMix-K. PointCutMix-R replaces random points of an input point cloud with points from another point cloud. PointCutMix-K replaces the K nearest neighbor points of a random point in the first input point cloud with points from the second point cloud, themselves selected as the K nearest neighbors of that same point. As a result, the new sample consists of spatially local parts of two point clouds. Some approaches apply local (instead of whole-object) geometric transformations in 3D space; PointWOLF [104] performs multiple local geometric transformations anchored at sampled points with smoothly varying weights.
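The following numpy sketch illustrates the PointCutMix-K scheme as described above. It is our reading of the method rather than the official implementation, and all function and variable names are ours.

```python
import numpy as np

def point_cutmix_k(pc_a, pc_b, label_a, label_b, k):
    """Sketch of a PointCutMix-K-style mix of two point clouds.

    pc_a, pc_b: (N, 3) point clouds; label_a, label_b: one-hot labels.
    Replaces the k nearest neighbors of a random query point in pc_a
    with the k points of pc_b nearest to the same query point, and
    mixes the labels in proportion to the replaced fraction.
    """
    n = pc_a.shape[0]
    query = pc_a[np.random.randint(n)]             # random anchor point in A
    dist_a = np.linalg.norm(pc_a - query, axis=1)  # distances within A
    dist_b = np.linalg.norm(pc_b - query, axis=1)  # distances from B to anchor
    idx_a = np.argsort(dist_a)[:k]                 # local region of A to cut
    idx_b = np.argsort(dist_b)[:k]                 # local region of B to paste
    mixed = pc_a.copy()
    mixed[idx_a] = pc_b[idx_b]                     # spatially local replacement
    lam = k / n                                    # fraction of replaced points
    mixed_label = (1 - lam) * label_a + lam * label_b
    return mixed, mixed_label

a, b = np.random.rand(1024, 3), np.random.rand(1024, 3)
la, lb = np.eye(40)[0], np.eye(40)[5]              # one-hot labels, 40 classes
mixed_pc, mixed_label = point_cutmix_k(a, b, la, lb, k=256)
```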

III. MODELNET40
ModelNet40 [3] is a representative benchmark for the evaluation of 3D object recognition. It is designed for multi-class single-label classification. ModelNet40 consists of a relatively large number of classes (40 object types) with complete 3D object shapes (13,200 CAD point cloud instances) including mesh information. Even though ModelNet40 has paved the way for the literature as the first object classification benchmark, recent state-of-the-art models have shown saturated accuracy around 94% (Fig. 1). Therefore, quantitative comparison is getting difficult on the dataset. Considering varying environments and experimental factors, accuracy gains at decimal places on ModelNet40 hardly signify any real improvement. This makes us doubt whether ModelNet40 is still a valid benchmark. Current and near-future researchers may try to close the remaining roughly 6% gap to 100% accuracy with novel methods. However, we have observed that most incorrectly predicted instances on ModelNet40 are actually difficult to classify even for humans: they are either unusual cases or mislabeled samples. Therefore, it is difficult to count on much further improvement in accuracy on ModelNet40.
Although ModelNet [3] has conducted sample cleaning, there still exist instances corresponding to the bad samples that its authors claimed to remove (e.g., floors, thumbnail images, persons standing next to the object, unrealistic or miscategorized 3D models). Moreover, ModelNet40 contains non-exclusive classes such as 'Flower_pot'-'Plant'-'Vase', in which some instances of one class show strong characteristics of another (e.g., a flower pot sample with a plant). It also contains outlier samples that affect evaluation precision and reliability.
We have taken a closer look at the bad samples of (but not limited to) ModelNet: 1) 'Mislabeled instances' that do not show a clear appearance of the labeled class, 2) 'Non-exclusive class categories' that cause large overlaps among some classes, 3) 'Simplified object shapes' that hardly show class-specific clues characterizing them as the given class, and 4) 'Inconsistent range of an instance' that confuses the object boundary.

A. MISLABELED INSTANCES
ModelNet40 contains mislabeled instances. For example, in Fig. 2 (a), an instance of class 'bed' in the training set looks more like a cargo truck. Even if this could be an extreme case of the class (a cargo-truck-shaped bed), a trained network may incorporate features of cargo trucks into the class 'bed'. If such an instance is included in the test set, the corresponding classification accuracy may misrepresent network performance. Sometimes this problem occurs because a category name has a double meaning: for example, there are benches for sitting and benches as workstations.

B. NON-EXCLUSIVE CLASS CATEGORIES
As exemplified in Fig. 3, there are non-exclusive categories. Some samples of 'Desk' and 'Table' are quite similar to each other. 'Flower_pot' and 'Plant' also contain very similar samples. Ambiguity in class definition hampers and misdirects network training through ambiguous truth class domains and decision boundaries. As a result, we cannot rely on the achieved network performance. Therefore, class categories need more rigorous consideration, and the collection of instances should adhere strictly to carefully defined category criteria.

C. SIMPLIFIED OBJECT SHAPE
A large number of ModelNet40 instances have simplified object shapes, as shown in Fig. 4. One of the most common cases is a rectangular cuboid with no other details. Relying only on the simple 3D shape, without color or texture information, it is not appropriate to assign such instances their given class labels. In such a case, a test sample of a very simple cuboid would have to be counted correct for any inference among the 'Glass_box', 'Xbox', 'Wardrobe', or 'Radio' classes. In fact, synthetic CAD datasets suffer from the simplified object shape problem more than scanned real object datasets. This raises the further issue of varying levels of detail across different 3D object datasets.

D. INCONSISTENT RANGE OF OBJECT REGION
ModelNet40 is a basic and expandable dataset because most of its samples are single complete 3D objects. However, ModelNet40 has exceptional cases: there are instances including other nearby objects or background. In such cases, a network learns the shape of nearby objects or background under the main object label. In fact, the criterion for an object's range is sometimes unclear. A 'bookshelf' may contain books. If there is no 'book' class (as in ModelNet40), this does not cause a non-exclusive category problem; if there were, book shapes could be an important clue for classifying both 'bookshelf' and 'book' against other classes. In many cases, it is very difficult to determine to what extent a 3D shape belongs to the main object. Fig. 5 shows some example cases of inconsistent object range. The 'Tv_stand' sample in Fig. 5c contains not only a TV stand, but also a TV, a wall, and other furniture. In terms of relative size and structure, it is difficult to tell whether the TV stand is the main object of the instance.
On the other hand, there are several cases of incomplete objects, such as the samples in Fig. 5g, 5h, 5i, and 5j. Incomplete objects caused by occlusion or limited viewpoint may be useful for real applications. However, the cases shown in Fig. 5 are not natural situations in real applications.

IV. PROPOSED BENCHMARK DATASET
A large dataset such as ImageNet allows consistent performance evaluation for real world applications. However, building such a huge dataset is expensive and time-consuming. In order to obtain improved evaluation ability for 3D object recognition without building an expensive large dataset, we revise an existing dataset (ModelNet40 in this work). To this end, we carefully redefine the expected role of a benchmark in the field and establish the following attributes.
Attribute 1: Definitions of class categories have to be clear and exclusive. The scale and scope of the classes have to be chosen at a similar level. Any ambiguity in these attributes makes evaluation results unreliable and weakens the significance of experimental comparison across methods.
Attribute 2: Each class has to be balanced in the number of instances and their diversity. The degree of shape detail and the difficulty of the corresponding challenges have to be properly adjusted. In particular, a benchmark should configure multiple challenges suited to the verification of diverse aspects of applied methods. Building multiple subsets of a benchmark enables more specific verification. For example, MultiDigitMNIST [106] adds a new specific challenge to traditional MNIST [107] by containing multiple digits in an image.
Attribute 3: A benchmark and its related challenges have to be general and analyzable. The challenging point of a benchmark has to be reasonable to address for any type of approach. The difficulty levels of tasks have to be well distributed so that the performance of applied methods can develop successively.

A. DATA CLEANING
Regarding attribute 1, we conduct data cleaning on ModelNet40. As part of the data cleaning, we readjust the class definitions of ModelNet40. Based on observation of instances, 'Flower_pot', 'Plant', and 'Vase' contain many similar samples, showing non-exclusive class definitions. We define three new classes, 'Plant', 'Planted_pot', and 'Empty_pot', and reassign all instances of the three old classes to one of the three new classes. Then we carefully check the appropriateness of the given labels of all 13,200 instances. Each unanimously identified mislabeled sample is either removed or reassigned to a more relevant class. As a result, the cleaned dataset 'MN40-C', consisting of 9,342 training and 2,287 test instances, is constructed (the original ModelNet40 has 9,843 for training and 2,468 for testing).

B. NEW CHALLENGES
MN40-C is a base benchmark for the verification of general multi-class single-label classification with complete 3D object samples. Considering attributes 2 and 3, we perform two additional dataset reorganizations (constructing the reduced set MN40-CR and the biased set MN20-CB), suggesting new challenges and corresponding leader boards. MN40-CR and MN20-CB open up not only enhanced precision of performance comparison, but also diversified aspects of model performance.

1) REDUCED SET
Large-scale training data allows generalization performance in real applications. However, it is not always available for all computer vision tasks. Frequently, it is very important to verify generalization performance with a small benchmark. Showing good performance despite a small amount of training data is a very desirable feature of 3D recognition models. On the other hand, training data has to consist of enough samples to represent the diverse characteristics of each class. Based on MN40-C (cleaned ModelNet40), we repeatedly remove the training sample of each class whose removal decreases the variance of the class samples least, as sketched below. Test accuracy of PointNet on gradually reduced training data is shown in Fig. 6. Test accuracy drops significantly from 20 to 15 training samples, indicating a significant drop in training data quality after removing five more samples from 20. In this specific setup, we decide to build the 'MN40-CR' dataset with 20 training instances for each class (800 training instances in total). MN40-CR, however, can be implemented with a different number of reduced samples under a different environment.
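The paper does not fix the exact per-sample descriptor or search procedure for this reduction, so the following is only a sketch under the assumption of a greedy criterion: repeatedly drop the sample whose removal lowers the remaining per-class variance the least. The feature extraction and all names are ours.

```python
import numpy as np

def reduce_class(features, keep):
    """Greedy sketch of variance-preserving training-set reduction.

    features: (N, D) per-sample descriptors for one class (how samples
    are embedded is an assumption). At each step, drop the sample whose
    removal lowers the total variance of the remaining set the least,
    i.e. keep the most diverse remainder.
    """
    kept = list(range(len(features)))
    while len(kept) > keep:
        best_idx, best_var = None, -1.0
        for i in kept:
            rest = features[[j for j in kept if j != i]]
            var = rest.var(axis=0).sum()   # total variance without sample i
            if var > best_var:             # removal of i hurts diversity least
                best_var, best_idx = var, i
        kept.remove(best_idx)
    return kept

# Example: keep 20 of 100 synthetic 128-D descriptors, as in MN40-CR.
feats = np.random.randn(100, 128)
subset = reduce_class(feats, keep=20)
print(len(subset))
```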

2) BIASED SET
We expect a machine learning model to construct a continuous and focused distribution in feature space by effectively learning the aspects of given sparse training samples, and then to predict the labels of unseen instances correctly at test time. This implicitly assumes that training and test data ideally show similar distributions in feature space. In real applications, however, this rarely holds, and we try to keep networks from overfitting the training set for the sake of generalization. In the construction of a 3D dataset, it is especially difficult to collect or create a training set that represents the true population of a class, due to limitations of the data acquisition or synthesis environment. 3D samples that are easy to capture, reconstruct, or synthesize tend to be preferred and included in datasets, unlike 2D samples that can simply be photographed regardless of complexity. It would be beneficial to have a benchmark evaluating the robustness of a network against such biases.
To this end, we define a biased dataset 'MN20-CB' consisting of 8 super-classes and 12 original classes of MN40-C. Two to five original classes of similar characteristics are grouped to build a super-class of larger intra-class variation with a non-overlapping class definition. The training set of each super-class includes instances from only part of its sub-classes, while its test set includes instances of all its sub-classes. The remaining 12 original classes show relatively independent characteristics or already larger intra-class variation than others. All configurations are summarized in Table 5. As a result, MN20-CB consists of 6,197 training instances and 5,432 test instances.
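To make the protocol concrete, here is a tiny hypothetical sketch of how such a biased split can be realized. Only the 'Doorway' grouping ('Door' in training; 'Curtain' and 'Door' in testing) comes from Table 5; the data layout, names, and the `relabel` helper are ours.

```python
# A super-class is trained on only some of its sub-classes but tested
# on all of them, so test time includes shapes never seen in training.
TRAIN_MAP = {"door": "Doorway"}                       # training sees doors only
TEST_MAP = {"door": "Doorway", "curtain": "Doorway"}  # testing adds curtains

def relabel(samples, mapping):
    """Replace sub-class labels with super-class labels, keeping only
    samples whose sub-class appears in the split's mapping.

    samples: iterable of (point_cloud, sub_class_name) pairs.
    """
    return [(pc, mapping[cls]) for pc, cls in samples if cls in mapping]

# train_set = relabel(mn40c_train, TRAIN_MAP)  # 'mn40c_train' is assumed
# test_set = relabel(mn40c_test, TEST_MAP)     # probes unseen sub-classes
```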

V. QUANTITATIVE EVALUATION & ANALYSIS
We conduct diverse quantitative evaluations of 3D object classification and augmentation methods on MN40, MN40-C, MN40-CR, and MN20-CB.

A. OBJECT CLASSIFICATION
As shown in Table 6, the overall test accuracy of classification models on MN40-C generally increases compared with the original MN40. This is mainly because additional outliers, which may have hindered classification models from learning the true features of their classes, are removed from the training set. Because there exist diverse viewpoints on the correctness of an instance's class label, some outliers we removed could have contributed to classification performance as extreme (but correct) training samples; therefore some methods show decreased accuracy on MN40-C. We follow the original cleaning policy of ModelNet40 and conduct rigorous but careful cleaning of each sample. The effect of data cleaning is also shown in Figs. 10 and 11, which plot the class accuracy of the classification models PointNet and PointNet++ on MN40 (grey dots) and MN40-C (blue dots). In MN40, there exist classes of abnormally low accuracy such as 'f.pot/p.pot' ('flower_pot' in MN40, 'planted_pot' in our category adjustment). This is the result of non-exclusive class categories. Table 6 also summarizes evaluation results on the reduced dataset MN40-CR. All models show decreased accuracy on MN40-CR compared to the original MN40 and MN40-C. Note that the amount of decrease varies across classification models. For example, PointMLP [94] shows more than 18% overall test accuracy decrease (MN40-C vs. MN40-CR), while others such as RepSurf [95] and DGCNN [83] show around 8% (Fig. 7a). Overall test accuracies on MN40-CR show larger variance than on MN40-C. The maximum difference of test accuracy on MN40-CR is over 12%, compared with around 5.5% on MN40 and around 3.1% on MN40-C. Accordingly, the accuracy gain over the baseline (PointNet) is also bigger on MN40-CR for many models (Fig. 7b). This indicates that MN40-CR provides a more challenging evaluation and reveals the respective characteristics of each model with higher precision than the saturated MN40 or MN40-C.
Evaluation results on MN20-CB in Table 6 show bigger drops in the overall test accuracies of the classification models, from around 31% to 48%. The accuracy gains of some models on MN20-CB are remarkable: RepSurf [95] and DGCNN [83] achieve more than 10% accuracy gain compared to the baseline (PointNet). Fig. 7 compares evaluation results of various classification models on the proposed benchmarks to show the effect of the added challenges. Compared with MN40-C, the quantitative results reveal the generalization performance of each classification model against the challenges of MN20-CB. In Fig. 7a, MN20-CB shows better precision (performance differences) in quantitative evaluation than MN40-C. Accuracy gains also show bigger differences across classification models, as shown in Fig. 7b.
Fig. 8 shows the class accuracy of PointNet [44] and PointNet++ [45] on MN20-CB. In the training set of MN20-CB, the super-class 'Seats' includes 'Chair' and 'Toilet' of MN40-C, and its test set additionally includes 'Bench' and 'Sofa' of MN40-C, as explained in Table 5. Bench test samples generally show a horizontally elongated shape that cannot be observed in a training set consisting of 'Chair' and 'Toilet' samples. To predict a bench test sample correctly as 'Seats', classification models have to learn more about the shape features that enable people to sit on the object. In this regard, PointNet++ shows higher accuracy than PointNet on the 'Seats' class, as shown in Fig. 8. MN20-CB may require classification models to learn the semantics of training samples. In Fig. 7b, we observe that classification models capturing rich information from the input point cloud, such as RepSurf [95] and RSCNN [88], or having residual connections, such as DGCNN [83] and PointMLP [94], achieve better accuracy than others.
We further test with two split test sets of MN20-CB: 1) 'In tr', which only includes sub-classes that are also included in the training set, and 2) 'Out tr', which only includes sub-classes that are not included in the training set. As shown in Table 6, the classification accuracies of all methods on 'In tr' are almost identical, showing no performance difference between the methods. On the other hand, classification accuracies on 'Out tr' show meaningful differences in generalization performance. For example, PointNet and PAConv obtain significantly lower accuracy on 'Out tr' (34.9% and 36.4%), while other classification models with similar 'In tr' accuracy, such as PointNet++, DGCNN, and CurveNet, obtain more than 40%, and DGCNN obtains the highest accuracy of 48.9% on 'Out tr'.

B. DATA AUGMENTATION
Data augmentation is often employed when a dataset does not have enough training samples. The proposed MN40-CR and MN20-CB are useful for the evaluation of data augmentation methods. Table 7 shows classification results of various augmentation methods on MN40, MN40-C, MN40-CR, and MN20-CB. Accuracy gains of all data augmentation methods on MN40-C decrease compared to MN40, because MN40-C is already clean and good enough for most classification methods to generalize in training without data augmentation. For example, PointCutMix [101] shows a decreased gain from 2.7↑ to 0.9↑, while its accuracy increases from 93.4% to 95.9%. These results suggest that our data cleaning may have removed challenges previously posed by extreme outlier samples. On the other hand, all accuracy gains on MN40-CR and MN20-CB increase compared to MN40-C. For example, PointCutMix obtains 3.5↑ on MN40-CR and 4.5↑ on MN20-CB, versus 0.9↑ on MN40-C. This shows that our additional challenges are activated and the augmentation methods achieve additional improvements. On MN40-CR, gains of methods show a different pattern from MN40. With PointNet++, PointMixup (w/ MM) [100] and PointCutMix obtain larger gains on MN40-CR than on MN40 (1.0↑, 2.7↑ → 1.6↑, 3.5↑). However, the gains of PatchAugment [105] and PointWOLF [104] decrease (1.7↑, 2.5↑ → 0.3↓, 0.5↑).
Sample/label mixing approaches to data augmentation (such as PointMixup and PointCutMix) seem to be more helpful on a dataset with a small training set than local geometric transformation approaches (such as PatchAugment and PointWOLF). On MN20-CB, the changes in gain of data augmentation methods are much more significant than on the other datasets. PointMixup (w/ MM), PointCutMix, and PatchAugment achieve larger gains (4.4↑, 4.5↑, and 4.1↑) on MN20-CB, increased from 1.0↑, 2.7↑, and 1.7↑ on MN40, respectively. However, PointWOLF obtains only a 1.0↑ gain, decreased from 2.5↑ on MN40. MN20-CB thus leaves room for extended improvement by future approaches.
Fig. 9 shows a visual comparison between various data augmentation methods. Accuracy on our challenging benchmarks (MN40-CR and MN20-CB) decreases significantly compared with MN40 or MN40-C, as shown in Fig. 9a, because of the added challenges. On the other hand, accuracy gains generally increase on MN40-CR and MN20-CB, making the methods easier to distinguish, as shown in Figs. 9b and 9c.

VI. CONCLUSION
A benchmark dataset should consist of samples of sufficient quality and proper challenges to validate the studies of its time. In this study, we carefully examine data samples and perform data cleaning for more reliable evaluation of 3D object classification tasks. We introduce new benchmarks (MN40-CR and MN20-CB) for more challenging, effective, and reliable evaluation by re-organizing the original data, adjusting class categories, and adding improved evaluation protocols to ModelNet40. Our new benchmarks can evaluate future 3D object classification models with improved precision.
It is not possible to verify every necessary aspect of 3D recognition methods with the proposed benchmarks.
They are examples showing that this kind of expansion is meaningful. In addition, more detailed standards and challenges may be required in the future. For example, a future training data reduction challenge could be conducted by careful sampling instead of our random sampling. Nevertheless, we believe that our approach of considering the effectiveness of datasets will encourage future validation to be more correct even after our benchmarks become outdated. The 3D recognition field has huge potential that can be realized by resolving several limitations. We hope that 3D benchmarks will continue to be verified and developed, reaching beyond our contributions.

APPENDIX A DEFINITION OF CLASSES
The definition of classes arises implicitly in the process of collection and cleaning; since it is done by people, it is defined by their intuition. ModelNet40 was collected by querying online 3D data websites with keywords from the SUN database categories. The collected data were then cleaned by the authors and AMT workers, so the intuition of the people who created and cleaned the dataset defined the classes. We defined super-classes that have clear differences from other classes, considering classes that can be semantically integrated into the same category. We tried to make the class definitions more correct, as we have argued. Since machine learning is about machines learning human intuition, we minimized rules expressed as explicit sentences and proceeded intuitively, but we provide sentence-level interpretations to explain the definitions. Table 8 gives a linguistic interpretation of our class definitions; they are easier to understand when viewed together with the actually included instances.

APPENDIX B PATCHAUGMENT
PatchAugment has properties that are difficult to compare on an equal footing with other data augmentations. For example, its augmented samples cannot be expressed as single point cloud data, and it operates inside the model. We found that PatchAugment exhibits a very low accuracy gain on MN40-CR with PointNet++, and therefore conducted an additional experiment as a reference. As can be seen in Fig. 12, PatchAugment shows lower accuracy than conventional augmentation when there are fewer than around 30 training instances per class. In addition, we noticed that the official code of PatchAugment performs augmentation even in the test phase; when we disable it, the accuracy actually decreases.

FIGURE 1. Overall accuracy of classification models on the ModelNet40 benchmark: The horizontal axis denotes release date and the vertical axis denotes overall accuracy (%).

FIGURE 2. Mislabeled instances in ModelNet40: (a) An instance of class 'Bed' that looks like a truck. (b) An instance of class 'Car' that looks like a city. (c) An instance of class 'Glass_box' that is a simple rectangular object. (d) An instance of class 'Radio' that looks like a small drawer.

FIGURE 4. Simplified shape samples in ModelNet40: Instances in the first row are just flat planes rather than three-dimensional objects with volume. The second row shows similar rectangular samples from different classes.

FIGURE 5. Inconsistent range examples: (a) contains a small bookshelf, a monitor, and a desk together with a bed. (b) contains a person with a bed. (c), (d), and (f) commonly contain a wall, a bookshelf, a TV, and a TV stand, but their labels differ. Only a small part of (e) is a tent. (g)-(j) are examples of incomplete objects: (g) looks like part of a bottle frame, (h) is just the grip of a door, (i) is a guitar without its front face, and (j) is a set of keycaps of a keyboard without the main body.

FIGURE 6. Overall accuracy of PointNet, the baseline classification model, tested on varying amounts of reduced training samples: The X-axis indicates the number of instances in the training set of each class; 'All' means the original MN40-C. The Y-axis indicates overall test accuracy on each dataset.

TABLE 5. Super-classes of MN20-CB: MN20-CB contains 12 original classes and 8 super-classes, each consisting of several sub-classes. For example, the super-class 'Doorway' consists of 'Door' in the training set, but of 'Curtain' and 'Door' in the test set. More detailed information is shown in Table 8 of Appendix A.
TABLE 6. Overall accuracy (OA) in % and gain of classification models on ModelNet40 (MN40), MN40-C, MN40-CR, and MN20-CB: 'Gain' is the accuracy gain over the baseline (b.line), which is PointNet. 'In tr' and 'Out tr' are results on the split test sets of MN20-CB: 'In tr' is tested with sub-classes included in training, 'Out tr' with sub-classes not included in training.

FIGURE 7. The effect of the challenges in our benchmarks on each classification model: In (a), we can see that MN40-CR and MN20-CB provide challenging benchmarks for every classification model. In (b), MN40-CR and MN20-CB reveal clear differences in the improvements of methods.

FIGURE 9. (a) Effect of adding challenges on various data augmentation methods: Results on the cleaned set (MN40-C) are difficult to interpret because the overall accuracies are similar to each other and uniformly high. In contrast, the error rates on the proposed challenges (MN40-CR, MN20-CB) are generally larger and their causes are clearer, making analysis easier. Experiments marked with * are augmentation methods on PointNet, while the rest are on PointNet++. (b) Accuracy gain of various data augmentation methods compared with conventional data augmentation on PointNet. (c) Accuracy gain of various data augmentation methods compared with conventional augmentation on PointNet++.

TABLE 8. Interpretation of our class definitions expressed in sentences: The true definitions of classes are made implicitly by human intuition; this table just helps in understanding the implicit definitions more easily. Our MN40-C and MN40-CR have the classes in groups 'A' and 'B', while MN20-CB has 'B' and the super-classes in 'C'.

FIGURE 12. Overall accuracy comparison between augmentation methods on various reduced training sets: PointNet++ is used as the classifier model. The value 20 on the X-axis corresponds to MN40-CR.

TABLE 1. 3D datasets: * indicates synthetic data. In the annotation column ('Annot.') showing provided annotations (labels), C means instance (or segment) class label, S means segmentation label, B means bounding box, N means surface normal, and T means sequence label for tracking.