3DSliceLeNet: Recognizing 3D Objects using a Slice-Representation

Convolutional Neural Networks (CNNs) have become the default paradigm for classification problems, especially, but not only, in image recognition, mainly due to the high success rates they provide. Although there are approaches that apply deep learning to the 3D shape recognition problem, they are either too slow for online use or too error-prone. To fill this gap, we propose 3DSliceLeNet, a deep learning architecture for point cloud classification. Our proposal converts the input point clouds into a two-dimensional representation by performing a slicing process and projecting the points onto the principal planes, thus generating images that are used by the convolutional architecture. 3DSliceLeNet achieves both high accuracy and low computational cost. An extensive set of experiments has been conducted to validate our system on the ModelNet challenge, a large-scale 3D Computer Aided Design (CAD) model dataset. Our proposal achieves a success rate of 94.37% and an Area Under the Curve (AUC) of 0.978 on the ModelNet-10 classification task.


I. INTRODUCTION
Object recognition is one of the key problems to be solved for the development of a complete scene understanding system and is the main focus of this work. Although this problem has traditionally been addressed using RGB cameras, in recent years many new approaches have encouraged the use of 3D data. The advent of commodity 3D sensors such as the Microsoft Kinect, and the creation of large, real and synthetic 3D data repositories [1]-[4], have opened new trends for this research problem. In particular, many works have addressed the problem of 3D shape classification using deep learning techniques, and a large number of papers have applied Convolutional Neural Networks to the field of 3D object recognition [5]-[7].
While the best results so far are obtained with methods based on 2D deep learning, their extension to 3D still presents many problems. For example, the methods that obtained the best performance in the ModelNet challenge are mostly based on 2D views, usually obtained as projections of the 3D data. For instance, [8] features a CNN architecture that combines information from multiple views of a 3D shape into a single, compact shape descriptor. This method obtained a 90% classification accuracy on the ModelNet-40 dataset. However, focusing on 2D visual features can lead to ambiguities in a real scenario: the external appearance of an object may not capture its internal structure. In addition, most of the best performing techniques in this challenge rely on the use of multiple 2D views. The main reason why volumetric or 3D approaches do not currently produce results as good as those of 2D multi-view methods is related to the 3D data discretization process and the large amounts of sparse data that the volumetric domain must handle. In addition, the cost of handling 3D data is higher than that of processing 2D data, which limits the amount of detail that can be captured. Similar conclusions are presented in [9]. Also, as stated in [10], inserting volumetric representations into a deep Convolutional Neural Network (CNN) pipeline requires large amounts of memory and is very time consuming.
In this work, which is an extension of the doctoral thesis of Dr. Francisco Gomez-Donoso [11] and of our previous work LonchaNet [12], we present an approach that applies multiple 2D views acquired from 3D models to 3D object recognition. The proposed 2D representations are based on cross sections of the 3D models. Compared with LonchaNet, this work explores different convolutional backbones and uses more datasets for validation; the proposed method outperforms LonchaNet and most of the existing approaches that participated in the ModelNet challenge, currently obtaining 94.37% classification accuracy. Our proposal focuses on learning a discriminative representation that is able to distinguish between most of the categories in the ModelNet dataset. We also contribute a new architecture that builds on the existing GoogLeNet network [13]: we use three independent GoogLeNet networks to learn features specific to each cross section or slice of the 3D model. Finally, we evaluated the model trained on ModelNet using a subset of the IKEA dataset, obtaining a classification accuracy of 70.45% and demonstrating that the trained model is not overfitted to the ModelNet dataset.
The rest of the paper is organized as follows: Section II reviews existing works that use deep learning-based techniques for 3D object recognition. Next, Section III presents the proposed deep learning architecture for 3D object recognition based on 2D renderings of cross sections of 3D models. Section IV presents the experiments and discusses the results obtained with our novel approach. Finally, Sections V and VI give our conclusions and directions for future work.

II. RELATED WORKS
Today, deep learning in general and CNNs in particular have surpassed traditional computer vision methods in many tasks, including 3D object recognition [14]-[16]. This exponential and continuous growth has been made possible by three factors: (1) the creation of large-scale 3D object databases, (2) accessible deep learning frameworks for designing, developing, and training CNNs, and (3) the democratization of Graphics Processing Units (GPUs) to accelerate those networks for both inference and training. Although all three were equally vital to the development of deep learning approaches, it is important to note that the challenges or benchmarks associated with these datasets allow researchers to objectively evaluate their proposals against other work. As such, high-quality datasets gather a plethora of competing cutting-edge works striving for the best recognition results. In this regard, the ModelNet database is arguably the most relevant challenge, and so are the methods that make use of it. ModelNet (http://modelnet.cs.princeton.edu/) is a large-scale database of 3D Computer Aided Design (CAD) objects with two subsets or challenges: ModelNet-10 and ModelNet-40, with 10 and 40 classes respectively. More details on the composition of the dataset and specific information on the challenges are provided in Section IV.
In this work, we will focus on ModelNet, so in the following lines we report the state-of-the-art methods for that challenge, together with recent methods not included there whose novelty or results are remarkable. We introduce these methods by grouping them into four big categories according to their 3D data representations.
Voxelization. Input data is a discretization of the original point cloud, grouping points into clusters according to a neighborhood criterion that serves as an approximation of the original shape. Commonly, every voxel is represented as a binary value, 0 or 1, indicating the presence of points in the region of space that the voxel covers.
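As a reference for this representation, a binary occupancy grid can be computed in a few lines. The sketch below is our own illustration, not code from any of the cited works, and the resolution and names are arbitrary choices:

```python
import numpy as np

def voxelize(points, resolution=32):
    """Binary occupancy grid: a voxel is 1 if at least one point of the
    cloud falls inside it, 0 otherwise (the representation described above).
    Illustrative sketch; the resolution is an arbitrary choice."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    scale = (resolution - 1) / np.maximum(hi - lo, 1e-9)
    idx = ((points - lo) * scale).astype(int)   # map each point to a voxel index
    grid = np.zeros((resolution,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

grid = voxelize(np.random.rand(1024, 3))        # a (32, 32, 32) binary grid
```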
The seminal work by Wu et al. [5] introduced the ModelNet dataset and a Convolutional Deep Belief Network (CDBN) to represent and learn 3D shapes as probability distributions of binary variables on volumetric voxel grids. They achieved an 83.50% accuracy, a modest figure by today's standards, but they paved the way for future research.
Another approach to this problem was introduced by Xu and Todorovic in their work "Beam Search for Learning a Deep Convolutional Neural Network of 3D Shapes" [17], in which a beam search for optimal CNN hyperparameters and architecture is proposed. This system models different network configurations as states, connected in a directed graph fashion. The system traverses the graph using a heuristic function that produces the next best state, an improved version of the architecture and hyperparameter set. This system achieves an accuracy of 88% on the ModelNet-10 classification task and 81.3% on the ModelNet-40 classification task.
On another note, PointGrid [18] creates grid cells with a constant number of points using a point quantization technique, keeping the points' coordinates to improve the representation of the local geometry of the object.
Another relevant work is the 3D Generative Adversarial Network (GAN) proposed by Wu et al. [19], which combines a 3D CNN with a GAN to capture 3D shape descriptors. Although initially intended to generate or sample 3D objects, these descriptors can be effectively reused for classification. By making use of these descriptors, learned without supervision, they demonstrated that their model could achieve 91.00% accuracy in the challenge.
Maturana and Scherer proved that bringing together a pure 3D CNN and a volumetric occupancy grid representation was helpful to recognize 3D shapes in an efficient manner; their proposal, VoxNet [6], achieved a 92.00% success rate on the benchmark.
Other works, such as the Octree-based Convolutional Neural Network (O-CNN) [20] and the Octree Generating Network (OGN) [21], combine octree representations with 3D convolutions to lower memory consumption and improve performance.
The next significant leap was taken by Sedaghat et al. [22] who introduced object orientation prediction, in addition to the class label itself, to increase classification accuracy; their ORION network is a 3D CNN that produces class labels and orientations as outputs and uses both to contribute to training. By adding orientation estimation as an auxiliary task during training, they were able to learn orientation invariance and raise the accuracy to 93.80%.
The Voxception-ResNet (VRN) ensemble introduced by Brock et al. [23] achieves a remarkable 97.14% accuracy. That architecture is based on ResNet [24] but uses inception-like blocks produced by concatenating bottleneck and standard ResNet blocks. A voxelized volumetric input is fed to an ensemble of those VRNs, whose predictions are summed to generate the output.
Finally, in order to bring the results closer to real-life scenarios, Par3DNet [25] used 3D CNNs to perform object recognition over tridimensional partial views of the objects, and presented a thorough analysis of the easiest and hardest views from which to classify an object.
2D projections. In this category, input data is represented as multiple 2D projections of the tridimensional data. Traditionally, these have been the most common approaches, and they usually rely on a 2D CNN to carry out the processing.
DeepPano [7] achieved 85.45% by converting 3D shapes into panoramic views, using a cylindrical projection around their principal axes, and learning them with a CNN specifically designed for that purpose.
Sinha et al. [26] propose a system in which a geometry image is created by mapping the mesh surface to a spherical parametrization, which is then projected onto an octahedron and cut and unfolded to create a square image. This approach achieves accuracies of 88.4% and 83.9% on the ModelNet-10 and ModelNet-40 classification tasks, respectively.
Bai et al. proposed GIFT, a real-time shape matching method that combines projective images of 3D shapes with a CNN to extract features that are later matched and ranked to provide a candidate list; using this approach, they improved slightly over VoxNet, reaching a 92.35% recognition rate.
The method by Johns et al. [27] exploited multi-view image sequences to boost accuracy to 92.80%; they used a CNN to independently classify image pairs from sequences, and then classified them again, weighting the contribution of each pair.
The Multi-View Convolutional Neural Network (MVCNN) approach, introduced by Su et al. [8], used a CNN to learn to classify objects from a collection of rendered views of each one; however, they did not report any result for the ModelNet-10 challenge. In a subsequent work [28], the authors improved their results by modifying the architecture and using shaded images as input.
In the Multi-Loop-View Convolutional Neural Network (MLVCNN) [29], the authors generated a view-loop-shape structure that represents 3D shapes in a hierarchical way. They analysed the view features using a Long Short-Term Memory (LSTM) with Loop Normalization, exploring the relationship among the views in each loop.
Finally, RotationNet [30] addressed the problems of pose and object category estimation jointly, using a CNN over a partial set of multi-view images. This network predicts viewpoint-specific category likelihoods corresponding to all predefined discrete viewpoints for each input image, and then selects the object pose that maximizes the object category likelihood.
Point Cloud. 3D data is represented as a raw unordered point cloud. These methods usually extract features by analyzing the neighborhood of every point within a radius.
The most representative proposal in this case is PointNet++ [31]. It generates a feature vector for the whole cloud by applying order-invariant transformations to every point, generating local hierarchical features that are sampled and grouped, and uses this vector to segment and classify the scene.
Some proposals are based on the previous architecture. This is the case of VoteNet [32], a novel technique based on Hough voting, that uses PointNet++ layers as the backbone. This approach selects a set of interesting points, with their corresponding features, as seed points to generate clusters of object instances based on their votes. Finally, these clusters are transformed into 3D bounding boxes with their corresponding categories.
Another work, SplatNet [33], extends the concept of 2D SPLAT images to 3D. It uses hash tables as an efficient implementation of neighborhood filtering, which provides an easy mapping of 2D points into 3D space. Then, bilateral convolutions are used to extract a set of features.
In the case of SO-Net [34], the authors propose a method that guarantees invariance to point permutations. It builds a Self-Organizing Map (SOM) by modelling the spatial distribution of the point cloud and uses the neighborhood of every point to extract hierarchical features. As a final step, the method generates a global feature vector for the whole cloud.
Alternative and fusion approaches. These approaches use other types of data representations, alternative data transformations, or mix the results of different alternatives.
FusionNet [35] fuses volumetric representations (binary voxel grids) and pixel representations (projected images); both representations are used to feed two volumetric CNNs and an MVCNN, achieving a 93.11% accuracy.
The Point-Voxel Convolutional Neural Network (PVCNN) [36] combines the sparse representation of the data with voxelized convolutions that speed up data access and improve the locality of the method. This work introduces a new efficient primitive, the Point-Voxel Convolution (PVConv), which converts points into voxel grids, aggregates neighboring points with voxel-based convolutions, and transforms them back to points. In order to obtain features with a higher level of detail, point-based feature transformations are included.
In NurbsNet [37], the authors propose a method based on local similarities between surfaces, modelled as NURBS. They fit a NURBS surface around the neighborhood of every point, calculate similarity scores with the pretrained surfaces, select the best similarity score for every part of the object, and generate a feature vector to perform the classification.
Hypergraph Neural Networks (HGNN) [38] present a novel data representation in the form of a hypergraph, using a hyperedge convolution operation to handle data correlation during representation learning. The authors generate 12 different views of each 3D object at intervals of 30 degrees and create the hypergraph as a probability graph based on the distance between nodes.
Finally, the Voxelized Fractal Descriptor (VFD) [39] proposes a novel global descriptor based on the fractal dimension: the fractal dimension is computed for every voxel of the object, and a feature descriptor is generated by concatenating the results. The voxel-based computation of the fractal dimension, as stated by the authors, is agnostic to the density of points, the number of points in the input cloud, and the sensor of choice, and is robust to noise up to a level.
In light of this literature review, our proposed method presents a novel approach for 3D model recognition which bundles a multi-view object slicing approach, based on the method of Setio et al. [40] for Computed Tomography (CT) images, with a modified version of the GoogLeNet [13] CNN architecture to achieve state-of-the-art performance while keeping the computational cost at bay.

III. APPROACH
As mentioned above, we propose a method for 3D object recognition using deep learning and 2D CNNs. First, for each sample in the dataset, we take three sections of the object, one per 3D axis, and project the 3D points onto a plane, obtaining three images per sample. Each of these three images is fed into a Convolutional Neural Network: our novel deep architecture features three GoogLeNets, one for each image, joined in a layer before the classification layer. The classifier receives the information from the three previous independent networks and performs the classification. This gives us great expressiveness and a high success rate.

FIGURE 2: A comparison of a slice before the dilation process (a, with scattered pixels) and after it (b, with dilated pixels). The dilated image provides a more accurate representation of the object.

A. SLICING A MODEL
3DSliceLeNet takes point clouds as input, but the neural architecture itself uses three images corresponding to three slices, so we first have to extract the slices from the 3D point cloud.
To do this, we load the point clouds and calculate the centre point of each axis. Then we take a slice in each of the XY, XZ and YZ planes with a thickness of 5% of the model size. This thickness was determined empirically, allowing the system to capture enough data to produce a faithful representation. Points that fall within these sections are isolated and projected onto their planes to generate a 500-pixel image. These images are binary maps in which the background is black and the projected points are white. This process is shown in Figure 1.
Due to the inconsistent point density produced by the sampling process described in Section IV-A, there are some slices where the points are very scattered, so the projection does not faithfully represent the object. To deal with this problem, a post-processing step is performed for each projection in which we apply 10 pixel dilations using a square structuring element. This step fills the gaps between the sparse points and produces a more adequate representation of the object, as shown in Figure 2. This process is performed for each sample in the dataset, so that each point cloud has three corresponding images, one per slice.
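To make the slicing, projection and dilation steps concrete, the following is a minimal sketch in Python/NumPy. The slice thickness (5% of the model extent), the 500-pixel image side and the 10 dilation iterations follow the description above, while the 3 × 3 square structuring element and all names are our own assumptions rather than details of the actual implementation:

```python
import numpy as np
from scipy.ndimage import binary_dilation

def slice_to_image(points, axis, thickness_ratio=0.05, img_size=500, dilations=10):
    """Extract the central slice perpendicular to `axis` and project it
    onto the remaining two axes as a binary image.
    Sketch following Section III-A; names are illustrative."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    center = (lo + hi) / 2.0
    half_thick = thickness_ratio * (hi[axis] - lo[axis]) / 2.0

    # Keep only the points that fall inside the slice.
    mask = np.abs(points[:, axis] - center[axis]) <= half_thick
    proj_axes = [a for a in range(3) if a != axis]
    proj = points[mask][:, proj_axes]

    # Normalize the projected coordinates to pixel indices.
    span = np.maximum(hi[proj_axes] - lo[proj_axes], 1e-9)
    pix = ((proj - lo[proj_axes]) / span * (img_size - 1)).astype(int)

    # Binary map: black background, white projected points.
    img = np.zeros((img_size, img_size), dtype=bool)
    img[pix[:, 1], pix[:, 0]] = True

    # Dilate with a square structuring element to fill sparse regions.
    square = np.ones((3, 3), dtype=bool)
    return binary_dilation(img, structure=square, iterations=dilations)

cloud = np.random.rand(2048, 3)  # placeholder point cloud
slices = [slice_to_image(cloud, axis) for axis in range(3)]
```

Calling `slice_to_image` once per axis yields the three binary maps that feed the three network branches described in the next subsection.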
This slice representation of the object allows us to train and test a 3D recognition system in a 2D way. It provides the high success rate and the fast training and testing that usually characterise deep image recognition neural networks. Furthermore, it preserves and exploits the 3D information that is implicitly embodied in the slicing method.

B. 3DSLICELENET ARCHITECTURE
The main architecture of 3DSliceLeNet is composed of three isolated GoogLeNets that are joined at the end by a concatenation layer prior to a fully connected layer, which acts as the final classifier, as shown in Figure 3.
As mentioned above, GoogLeNet is a state-of-the-art deep network for image recognition tasks, providing top accuracy in several challenges, which is why we chose it over other network architectures.
In this architecture, all convolutions, including those inside the Inception modules, use Rectified Linear Unit (ReLU) activations. The receptive field of the network is 224 × 224, taking RGB channels with mean subtraction, although in the 3DSliceLeNet ensemble we use binary maps without mean normalisation. The GoogLeNet network consists of 22 layers if we count only layers with parameters (or 27 layers if we also count pooling layers). The total number of layers (independent building blocks) used in the construction of the network is approximately 100, although this number depends on the machine learning framework used. The use of an average pooling step prior to the classifier is based on [41], although this implementation differs in the use of an extra linear layer, which allows the network to be adapted and tuned for other datasets.
3DSliceLeNet has three independent GoogLeNets (we will refer to them as "branches") that learn the features that define an object in each slice, forcing each branch to specialise its filters on the particular features of that slice. By independent, we mean that the parameters of each branch are separate and are updated independently in the backpropagation stage. Finally, the responses of the branches are concatenated into a single output that is fed to a fully connected layer acting as the classifier.
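A compact sketch of this three-branch design, written in PyTorch with torchvision's GoogLeNet as the backbone, is given below. Replicating the single-channel binary map across three input channels and using the 1024-dimensional pooled features of each branch are plausible choices on our part; the original implementation may differ:

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet

class ThreeBranchSliceNet(nn.Module):
    """Sketch of the 3DSliceLeNet topology: three independent backbones,
    one per slice, concatenated before a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        # aux_logits disabled so each branch returns a single feature vector
        self.branches = nn.ModuleList(
            [googlenet(aux_logits=False, init_weights=True) for _ in range(3)]
        )
        for b in self.branches:
            b.fc = nn.Identity()           # expose the 1024-d pooled features
        self.classifier = nn.Linear(3 * 1024, num_classes)

    def forward(self, xy, xz, yz):
        feats = [b(x.repeat(1, 3, 1, 1))   # binary map -> 3-channel input
                 for b, x in zip(self.branches, (xy, xz, yz))]
        return self.classifier(torch.cat(feats, dim=1))

model = ThreeBranchSliceNet(num_classes=10)
dummy = torch.zeros(2, 1, 224, 224)        # batch of binary slice images
logits = model(dummy, dummy, dummy)        # shape: (2, 10)
```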

IV. EXPERIMENTS
In order to validate the precision of 3DSliceLeNet, we tested our approach on the ModelNet challenge. The Princeton ModelNet project intends to provide researchers with a comprehensive clean collection of 3D CAD models of objects and sets a framework to test and compare the different approaches to the 3D object recognition task. First, we describe how we convert this dataset from meshes to a point cloud format; then we describe the methodology and our test bench; and finally we show how 3DSliceLeNet performs on the ModelNet-10 and ModelNet-40 classification tasks and draw some conclusions. We also use a model trained on the ModelNet-10 dataset to perform inference over the IKEA dataset and report the results.

A. DATASETS FROM MESHES TO POINT CLOUDS
The ModelNet and IKEA [1] databases provide CAD models as polygonal meshes, either in Object File Format (OFF) or in Object (OBJ) format. However, our architecture takes point clouds as input to generate the slice-based representation; the point cloud format is closer to the usual input data provided by consumer depth sensors. To bridge this gap, a converter stage shifts the OFF and OBJ mesh representations to Point Cloud Data (PCD) clouds. This process involves placing each mesh inside a 3D truncated icosahedron (tessellated sphere); a virtual camera is then placed on each vertex pointing to the sphere's center, and raytracing is used to capture the points of the mesh surface that are visible from each viewpoint, which together form the model cloud. Later on, those model clouds are sliced and provided as input to the network for training (in randomized order) and testing, using the corresponding splits provided by ModelNet. Figure 4 illustrates the aforementioned conversion process, and a simplified sketch of it is given at the end of this section.

TABLE 1: Number of samples per class in the ModelNet-10 dataset.

Class  Category    Training set  Test set
1      Desk        200           86
2      Table       392           100
3      Nightstand  200           86
4      Bed         515           100
5      Toilet      344           100
6      Dresser     200           86
7      Bathtub     106           50
8      Sofa        680           100
9      Monitor     465           100
10     Chair       889           100

B. METHODOLOGY
We trained and tested 3DSliceLeNet with the ModelNet-10 and ModelNet-40 datasets, converted as described in subsection IV-A. It is worth noting that the training and test splits are defined by the dataset itself, and that the number of samples per class is not balanced, as seen in Table 1 (both subsets, ModelNet-10 and ModelNet-40, are unbalanced). This fact harms the accuracy of the system, biasing the learning and the classification towards the classes with more samples, as stated by [42].
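As referenced above, the mesh-to-point-cloud conversion stage of subsection IV-A can be approximated with the trimesh library as sketched below. Instead of a full virtual-camera rig, rays are cast from each vertex of a subdivided icosphere towards random points of the mesh, and only the first hits are kept, so that only the externally visible surface survives; the library choice, camera count and ray budget are our own assumptions, not the exact pipeline:

```python
import numpy as np
import trimesh

def mesh_to_cloud(path, rays_per_cam=500):
    """Approximate the mesh-to-point-cloud conversion of Section IV-A:
    cameras on a tessellated sphere around the model, raytracing to keep
    only the visible surface. Illustrative stand-in, not the exact pipeline."""
    mesh = trimesh.load(path, force='mesh')
    mesh.apply_translation(-mesh.bounding_box.centroid)   # center the model

    # Camera positions: vertices of an icosphere enclosing the mesh.
    cams = trimesh.creation.icosphere(subdivisions=1,
                                      radius=2.0 * mesh.scale).vertices

    targets = mesh.sample(rays_per_cam)          # random points on the surface
    hits = []
    for cam in cams:
        origins = np.tile(cam, (rays_per_cam, 1))
        dirs = targets - origins
        locs, _, _ = mesh.ray.intersects_location(origins, dirs,
                                                  multiple_hits=False)
        hits.append(locs)                        # first hit = visible surface
    return np.vstack(hits)

cloud = mesh_to_cloud('chair_0001.off')          # hypothetical ModelNet sample
```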

C. RESULTS FOR MODELNET-10
Regarding the parameters that affect the learning process, we trained the architecture with a base learning rate of 0.00001, multiplying the current learning rate by 0.75 every 10000 iterations. To compute the weight updates we use the ADAM [43] solver with β1 = 0.9 and β2 = 0.999. In the ModelNet-10 experiment the training process ran for 20000 iterations, and the best weight set was produced at iteration #18300, yielding a test accuracy of 94.3709%, the second best score on the leaderboard of the ModelNet-10 challenge.
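In PyTorch terms, this training configuration corresponds roughly to the sketch below. The learning rate, decay schedule and ADAM betas come from the text above; the stand-in model and the random batches are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(3 * 1024, 10)    # stand-in for the network of Section III-B
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))
# Multiply the current learning rate by 0.75 every 10000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.75)
criterion = nn.CrossEntropyLoss()

for iteration in range(100):       # 20000 iterations in the real experiment
    feats = torch.randn(30, 3 * 1024)         # placeholder batch of 30 samples
    labels = torch.randint(0, 10, (30,))
    optimizer.zero_grad()
    loss = criterion(model(feats), labels)
    loss.backward()
    optimizer.step()
    scheduler.step()               # the schedule is stepped once per iteration
```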
It is also worth noting the low run-time of our architecture. One training iteration with a batch size of 30 samples takes an average of 2.25 seconds, and the classification of a new sample takes only 0.0896 seconds.
We cannot compare 3DSliceLeNet with the VRN Ensemble method, which provides a success rate of 97.14% in the ModelNet-10 challenge, in terms of speed, because no time measurements are provided in the paper. Nevertheless, we contacted Andrew Brock (author of the mentioned method), who said: "I don't recall the test time/batch, and all my logs are buried in an external hard drive somewhere and in inference mode would probably take several seconds per batch". In the light of these statements, the VRN Ensemble is, in fact, impractical for real-time applications. Figure 6 shows the classification rate per class. It can be seen that the classes with the lowest success rates, desk and nightstand, are the very classes that have fewer samples. In light of this fact, we can expect that with a balanced dataset the error rate of these classes would decrease significantly.
In Figure 5, we can observe the confusion matrix of the classification accuracy for the test split of ModelNet-10. The confusion matrix, alongside the precision-recall curves presented in Figure 7, confirms the stability and reliability of the system, which fails only on samples that look very similar from a visual perspective. Our system achieves an AUC of 0.978 on this test.
That is the case for the desk and table classes, which the system confuses in one direction but not the other. This is possibly caused by the fact that a desk is a type of table, meaning that desks have minor differences (visual features) that go unnoticed in some samples, making those samples hard to distinguish from table samples, as shown in Figures 8a and 8b. In addition, Table 1 shows that the desk class has a reduced number of samples compared with other classes such as sofa or chair.
In order to examine the desk and table ambiguity, a user study was conducted in which human participants had to classify random desk and table samples. This experiment is detailed later in Section IV-D. In addition, the proposed system fails to distinguish between the nightstand and dresser classes. This kind of problem is very common in Convolutional Neural Network architectures, because their learned features are mainly based on the visual features of an object and, as seen in Figures 8c and 8d, these two classes are visually similar. Figure 8 highlights the ambiguity of the visual features between the desk and table classes and between the nightstand and dresser classes, and shows the difficulty of the problem.

D. EXPLORING DESK AND TABLE AMBIGUITY WITH HUMANS
Reviewing the desk and table samples of the ModelNet dataset, we noticed that there is no substantial contrast in the visual features that describe each class, so it is understandable that 3DSliceLeNet (and any other system based on visual features) would tend to fail when classifying samples of these classes. In order to evaluate the scope of the aforementioned desk and table ambiguity, we carried out a new experiment involving humans. It consisted of displaying 20 random samples of tables and desks (10 of each class) to a set of 9 people with different professional and academic profiles. Each test subject had to classify the samples into either the desk or the table class. The results are shown in Table 2. With an overall accuracy of 83.75%, the subjects were unable to classify all samples correctly. Specifically, the humans achieved an accuracy of 88.89% for the desk class and, as expected, 3DSliceLeNet performed similarly with an accuracy of 79.56%.
These results confirm that there are no definitive visual features that allow 3DSliceLeNet to precisely discriminate samples of the desk and table classes and, therefore, it is prone to fail, just as humans are.
We also questioned the test subjects about the reasons that led them to classify an object into either the desk or the table class. With this experiment we intended to demonstrate the main reason behind the lower accuracy of 3DSliceLeNet for the desk and table classes: their visual features are not clear enough to discern these two classes, even for a human being.

E. RESULTS FOR MODELNET-40
We also trained and tested our system with the extended version of the ModelNet dataset (40 classes). The neural architecture remained the same with one minor modification: the number of neurons in the output layer was changed from 10 to 40 in order to match the number of classes of the ModelNet-40 dataset. No further modifications were applied to the 3DSliceLeNet architecture. The training hyperparameters are the same as those used for the ModelNet-10 experiment: a base learning rate of 0.00001, multiplying the current learning rate by 0.75 every 10000 iterations, and the ADAM [43] solver with β1 = 0.9 and β2 = 0.999 to compute the weight updates.
In the ModelNet-40 experiment, the training process ran for over 250000 iterations, obtaining the best weight set at iteration #144500, which yielded a test success rate of 79.8529%. The timings for this experiment are roughly the same as those of the ModelNet-10 experiment: a training iteration of 30 samples took about 2.25 seconds, whilst the classification of a new sample takes only 0.0893 seconds.
Our method is rather competitive, achieving a high success rate, yet it is highly dependent on the orientation of the samples. Whilst in ModelNet-10 the models share the same pose, in ModelNet-40 they do not. This is caused by the slicing method described in Section III-A, which implicitly captures the pose, something clearly counterproductive in this case.

TABLE 3: Number of samples per class in the refined IKEA dataset. Note that the IKEA dataset does not contain samples for classes 3, 5, 7 or 9.

Class  Category  # of samples
1      Desk      14
2      Table     39
4      Bed       3
6      Dresser   18
8      Sofa      8
10     Chair     5
Reviewing the confusion matrix obtained in this experiment (Figure 9) and the accuracy per class (Figure 10), we can confirm the overall accuracy of the 3DSliceLeNet architecture. We also observed that several samples across all classes are often wrongly classified as bottles. The misclassified samples of the bowl, cup, flowerpot and stool classes are fairly understandable, as these classes strongly resemble a bottle, namely, a narrow neck and a wider bottom. As shown in Figure 10, these classes are the ones with the lowest success rates. There are other classes that also share some features, at a different scale, that make them look very similar once converted to the slice-based representation.
Moreover, we can spot some isolated confusion errors, such as the desk and table ambiguity introduced before, bookshelf and wardrobe, or flowerpot and bottle. These pairs of classes strongly resemble each other, leading to a high probability of misclassification, as explained in Section IV-C.

F. RESULTS FOR THE IKEA DATASET
The IKEA dataset consists of furniture CAD models gathered from the Google 3D Warehouse alongside aligned RGB images taken from Flickr. It contains 7 classes: bed, bookcase, chair, desk, sofa, table and wardrobe. The dataset is originally intended for validating fine 3D pose estimation, but we used its 3D models to validate our 3D model recognition system, 3DSliceLeNet.
As we did with the ModelNet dataset, we first had to convert the meshes to point clouds following the method explained in Section IV-A, and then convert the obtained point clouds to the sliced representation proposed in Section III-A. We manually aligned the point clouds to match the ModelNet-10 poses. Also, several meshes contained more than one object, so we split them in order to obtain single-object meshes. The samples of the bookcase class were removed, as this class is not present in the ModelNet-10 dataset. Some other meshes were removed due to incompatibilities between the given format and our mesh-to-point-cloud method. Finally, the refined IKEA dataset contains 87 samples distributed in 6 classes. The exact number of samples per class can be seen in Table 3.
Finally, we took 3DSliceLeNet with the best ModelNet-10 model and tested it with the converted IKEA samples, achieving an accuracy of 70.4545%. This success rate is promisingly high, bearing in mind that we trained our system with the ModelNet dataset and then tested it with a totally different one, which means that the trained model is not overfitted to the ModelNet dataset. Figure 11 shows the results of the classification process.
In this case, the overall accuracy is slightly lower than the one achieved on the ModelNet-10 test split, shown in Section IV-C. Although 3DSliceLeNet performed nicely for the table, dresser, sofa and chair classes, it often fails to correctly classify the desk and bed samples. Once more, we found that 3DSliceLeNet classifies the desk samples as tables, as it did in the former experiments. In addition, the desk samples of this dataset are rather odd to the system, as they do not only include the desk: the models also include bookcases or other furniture attached to them, as shown in Figure 12. Regarding the bed class, it contains only 3 samples and, although the system fails to classify all of them, we cannot draw conclusions from this result due to the reduced size of the set.
It is worth noting that we could have improved the classification accuracy by training on this dataset, but that is not the purpose of this experiment. Our target is to assess the reliability, stability and generalization capabilities of 3DSliceLeNet by training with a certain dataset and then testing the generated model with a totally different one.

G. RESULTS FOR THE SHAPENET V2 CORE DATASET
ShapeNetCore is a subset of the full ShapeNet dataset, which is an ongoing effort to establish a richly-annotated, large-scale dataset of 3D shapes, with single clean 3D models and manually verified category and alignment annotations. It covers 55 common object categories with about 51,300 unique 3D models. Table 4 shows the number of samples per category.
The samples of this dataset come in a mesh format, so they had to be converted to point clouds following the method described in Section IV-A in order to be used with 3DSliceLeNet.
Next, 3DSliceLeNet was trained on this dataset. As usual, this experiment was conducted on the machine described in Section IV-B. The test split was 20% of the dataset. Once the training process was done, the test yielded an accuracy of 88.45%. The overall accuracy, while high, is not as high as that achieved in the ModelNet-10 test. Both datasets are manually aligned but, in this case, the number of samples per class is highly unbalanced. In fact, some classes contain fewer than 100 samples whilst others contain more than 8000. This causes the learning process to skew towards the categories with more samples, as the system is more likely to make a correct prediction by predicting the most populated classes. This effect harms the overall accuracy. As seen in Figure 13, the classes with lower accuracy are those with fewer samples, according to Table 4. Figure 14 shows the confusion matrix for this experiment. The diagonal confirms that the algorithm performs as expected, with minor failure cases. For instance, almost every cellphone sample is classified as telephone. This is understandable, as the telephone class contains samples of cellphones and isolated telephone handsets that are very similar to cellphones. Besides, some examples of microphone are classified as lamp, which is also understandable, as both classes share common visual features, namely a tall post with a bigger artifact on top.
Finally, the tower class has low accuracy, as the system tends to classify its samples as lamp. Once again, tower and lamp share common visual features, namely a thin, tall structure.

H. TESTING OTHER BRANCH ARCHITECTURES
All the experiments presented so far were executed on a topology that features three GoogLeNet branches, but other architectures were considered too. In this subsection, those experiments are compiled and detailed. They show that the chosen GoogLeNet architecture outperforms several other ones, which justifies the inclusion of GoogLeNet in the 3DSliceLeNet topology.
All experiments were conducted with the parameters detailed in Section IV-B and trained and tested on the ModelNet-10 dataset.
First, ResNet50 [24] was tested. This architecture introduces the "residual" term, which consists of adding the input of a convolution block to its output. As a result, the output of a convolution block can be seen as the input image with the features activated by the filters highlighted. In contrast, the output of a convolution layer in a default convolutional neural network is only the result of the neuron activations: if a neuron is not triggered on a certain region of the input image, the output keeps low activation values there. When the network computes the weight updates in the backpropagation stage, the values in non-activated regions lead to very small updates, eventually provoking no update at all, which causes the learning to get stuck. This issue is known as the vanishing gradient problem. The inclusion of the "residual" term helps fight the vanishing gradient problem and allows the creation of even deeper architectures.
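The residual idea can be stated in a few lines of PyTorch. The block below is a generic residual block in the spirit of ResNet, not the exact bottleneck configuration used in ResNet50:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the input is added back to the convolution output,
    so gradients can flow through the identity path even when F is
    barely activated, mitigating the vanishing gradient problem."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)   # the "residual" term: F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)            # torch.Size([1, 64, 56, 56])
```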
A 3DSliceLeNet with ResNet50 branches was trained and tested on the ModelNet-10 dataset. This topology achieved an accuracy of 92.2822% and a runtime of 0.1053 seconds in inference mode. The confusion matrix of this experiment is shown in Table 5.
In addition, the Xception [44] architecture was tested. This architecture introduces the depthwise separable convolution, which consists of applying a spatial convolution to each channel independently and then a 1 × 1 convolution across channels. This feature makes the network learn spatial dependencies and cross-channel relations separately.
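In code, a depthwise separable convolution factorizes a standard convolution into a per-channel spatial convolution followed by a 1 × 1 cross-channel convolution, as in this generic sketch (not Xception's exact block):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Xception-style building block: a spatial convolution applied to each
    channel independently (groups=in_ch), then a 1x1 convolution that mixes
    channels. Far fewer parameters than a full in_ch x out_ch x 3 x 3 kernel."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 64, 64)
print(DepthwiseSeparableConv(32, 64)(x).shape)   # torch.Size([1, 64, 64, 64])
```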
A 3DSliceLeNet with Xception branches was trained and tested on the ModelNet-10 dataset. This topology achieved an accuracy of 91.9514% and a runtime of 0.0942 seconds. The confusion matrix of this experiment is shown in Table 6.
Lastly, the VGG16 [45] architecture was also considered for inclusion in the 3DSliceLeNet topology. This architecture features a stack of convolutional layers (of different depths in the different VGG configurations) followed by three fully connected layers: the first two have 4096 neurons each, and the third performs the classification and thus contains 1000 channels (one per class), followed by a soft-max layer. The configuration of the fully connected layers is the same in all VGG networks. To integrate VGG16 into 3DSliceLeNet, the fully connected layers were removed, so the outputs of the last convolution blocks of the three branches are fed to the 3DSliceLeNet classification block.
A 3DSliceLeNet with VGG16 branches was trained and tested on the ModelNet-10 dataset. This topology achieved an accuracy of 90.6284% and a runtime of 0.2248 seconds. The confusion matrix of this experiment is shown in Table 7.
Although the differences in accuracy are marginal, the 3DSliceLeNet with VGG16 branches was discarded because of its elevated number of parameters (over 400 million across the three branches). This makes the network run much slower and require more memory than the other tested architectures. Furthermore, Xception's main novelty relies on the intra-channel convolutions; since the point cloud projections are binary maps, it cannot take advantage of this feature, so this architecture can be discarded regardless of its accuracy. Finally, the 3DSliceLeNet with ResNet50 branches performed similarly to the GoogLeNet version, yet its accuracy is slightly lower. In summary, all the considered architectures achieved similar accuracies, but the best performer in both inference time and accuracy was GoogLeNet, so it was chosen for the final 3DSliceLeNet topology.

I. IMPACT OF THE NUMBER OF SLICES ON THE ACCURACY AND RUNTIME
As stated before, using only three slices to feed our architecture involves discarding a lot of potentially useful information, but the chosen 3DSliceLeNet topology with GoogLeNet branches cannot be tested any further due to the memory limitations of our current hardware setup. In order to find out the gain obtained by increasing the number of slices, the branches of the 3DSliceLeNet topology were replaced by the simpler LeNet5 [46], a much shallower and more naive architecture, but also a less memory-hungry one. This way, up to six slices fit in our system with no memory problems. These experiments are intended to expose the improvement in accuracy obtained by using more than three slices.
As usual, the experiments were conducted with the parameters detailed in Section IV-B and trained and tested on the ModelNet-10 dataset.
First, a 3DSliceLeNet with three LeNet5 branches was tested. This topology achieved a test accuracy of 82.24% and a runtime of 0.0122 seconds at inference time.
Then, a 3DSliceLeNet with six LeNet5 branches was tested. This topology achieved an accuracy of 82.02% and a runtime of 0.0178 seconds at inference time.
As expected, the overall accuracy dropped compared to the chosen 3DSliceLeNet topology, as the LeNet5 architecture is shallower than GoogLeNet, but the aim of these experiments was not to improve the accuracy but to find out how much is gained by using more slices to classify 3D objects.
In the light of these experiments, it cannot be concluded whether the inclusion of more slices leads to an improvement or not. In fact, the accuracy of 3DSliceLeNet with three or six slices is roughly the same, while the inference time increases by 45%. Nonetheless, this conclusion is not definitive. It is worth recalling that the expressiveness of the LeNet5 architecture is limited compared to deeper architectures such as GoogLeNet. There would likely be room to improve the classification accuracy when feeding more slices if the architectures in the branches were more powerful.

J. DISCUSSION
Using our current architecture, we reached a top-5 place on the leaderboard of the challenge, with an accuracy of 94.37% in the ModelNet-10 classification task and an accuracy of 79.85% in the ModelNet-40 classification task.
Nevertheless, we also tested another approach to this problem, which consists of concatenating several slices per 3D axis into a single image, thus generating 3 × N slices for each sample. Each slice is treated as a tile, so this approach relies on images of large resolution: the more slices are considered, the higher the resolution of the generated images. This approach is much more memory consuming than the three-branch approach. We converted the meshes to this 2D representation and fed it to a state-of-the-art GoogLeNet. We noticed that we obtained better classification results as we increased the value of N, i.e., the number of slices per 3D axis. This approach was tested with 3 × 3, 3 × 5 and 3 × 7 slices, observing an improvement in each test over the former. We could not test any further due to memory limitations.
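The tiled-image alternative can be pictured with a short NumPy sketch: the 3 × N slices are laid out as tiles of a single large image, one row per axis, so the input resolution grows linearly with N. The layout and names are illustrative; the exact tiling used in the experiments may differ:

```python
import numpy as np

def tile_slices(slices, img_size=500):
    """Arrange 3*N binary slice images (3 axes x N slices each) into one
    large image with one row per axis; a sketch of the alternative of
    Section IV-J, not the exact layout used in the experiments."""
    n = len(slices[0])
    canvas = np.zeros((3 * img_size, n * img_size), dtype=bool)
    for row, axis_slices in enumerate(slices):        # one row per 3D axis
        for col, s in enumerate(axis_slices):
            canvas[row * img_size:(row + 1) * img_size,
                   col * img_size:(col + 1) * img_size] = s
    return canvas

# e.g. 3 axes x 5 slices of 500x500 pixels -> a 1500 x 2500 input image
slices = [[np.zeros((500, 500), bool)] * 5 for _ in range(3)]
print(tile_slices(slices).shape)   # (1500, 2500)
```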
This fact leads us to think that, if we could feed 3DSliceLeNet with more slices, we could expect an improvement in the accuracy rate, but we cannot test this case due to the memory limitations of our hardware (GPU memory consumption above 12 GBytes).

V. CONCLUSION
In this paper we introduced 3DSliceLeNet, a novel architecture for 3D object recognition. Our system takes three slices of the input point cloud (one per 3D axis) and projects their points onto a plane, generating three images that are used as input to the proposed deep network. The architecture consists of three independent GoogLeNet branches whose activations are concatenated and fed to a fully connected layer; each of these branches learns particular features of one slice. This method allows us to take advantage of fast 2D computation whilst preserving the 3D information. 3DSliceLeNet achieved a success rate of 94.37% in the ModelNet-10 classification task and an accuracy of 79.85% in the ModelNet-40 classification task, while providing extremely fast computation times: once the model is trained, classifying a 3D object takes only 0.0896 seconds.

VI. FUTURE WORK
Following on from this work, we plan to address the generalization problem caused by inconsistent poses across the models, which was revealed during the ModelNet-40 experiments. This issue can be addressed by applying data augmentation methods, so further research along this line should be conducted.
In addition, we plan to extend this system to a 3D object recognizer for point clouds captured in the real world with low-cost depth sensors such as the Microsoft Kinect. This introduces new challenges, as these sensors do not obtain a whole point cloud representation of the scene but only a partial view of it. The real world also presents some peculiarities that affect the classification process, such as dealing with complete scenes filled with different objects and with occlusion problems.
In addition, we plan to test a new version of 3DSliceLeNet that uses several slices per 3D axis; for that reason, we are currently exploring methods to circumvent GPU memory limitations.
FRANCISCO GOMEZ-DONOSO received a BS degree in Computer Science from the University of Alicante (Spain) in 2014 and a Master's degree in Robotics and Automation the following year. He is currently enrolled in a PhD programme in Computer Science at the same university. His main interests are human-computer interaction, deep learning and machine learning, and tridimensional data processing. As a scientist, he has published more than 35 papers in high-impact journals and conferences.