Material Type Recognition of Indoor Scenes via Surface Reflectance Estimation

Obtaining the material type of an arbitrary object with traditional sensors is fundamentally difficult. Existing material type recognition methods mostly rely on color-based visual features and object priors. Surface reflectance is another critical cue for characterizing a material type, and it can be observed by traditional sensors such as a color camera and a time-of-flight depth sensor. A material type is well characterized by its surface reflectance, which, together with conventional visual appearance, provides a better description for material type recognition. In this work, we propose a material type recognition method based on both color and reflectance features using a deep neural network. The proposed method is evaluated on both public data sets and our own, and shows promising material type recognition results.


I. INTRODUCTION
Objects are composed of multiple distinguishable materials. Material type is an essential characteristic in determining the class of an object. Furthermore, the materials composing an object reveal detailed sub-classes, such as a wooden vs. an iron table or a fabric vs. a leather couch. Recognized material types also enable realistic rendering of a reconstructed three-dimensional model. Material type recognition has therefore been an important research topic in computer vision and graphics. Earlier studies on material type recognition largely depend on visual features such as the color patterns and texture shapes of each material type. Sharan et al. [1] investigate various combinations of contemporary visual descriptors to suggest an optimal color feature set for material recognition, evaluated on FMD (Flickr Material Database). Bell et al. [2] show that context information encoded by a general convolutional neural network helps material recognition in indoor scenes. The Materials in Context (MINC) database proposed in that work has been widely used to compare material recognition methods. FVCNN [3] proposes a learning-based texture descriptor using CNN filter banks and Fisher Vector pooling. Degol et al. [4] show that using 3D geometry-based features improves material recognition performance in large-scale scenes.
There have been many approaches to material type recognition using a ToF camera [5]-[11]. Su et al. [7] capture subsurface scattering, which specifies the characteristics of a material surface. However, their method covers only a small number of material classes and requires an additional light source. Tanaka et al. [8] propose an exemplar-based material classification method that uses the distortion of ToF camera observations. They claim that the depth distortion caused by varying modulation frequencies and camera-object distances provides rich information on material class types. However, the method relies on a device-dependent characteristic observed during the data acquisition process. Both prior works [7], [8] provide classification results only in controlled experimental environments. Kim et al. [9] propose a surface roughness estimation method with a ToF camera. An off-the-shelf depth camera such as the Microsoft Kinect is used to estimate surface roughness together with color features.
There are several attempts to use physically motivated features such as 3D geometry and surface reflectance for material recognition [12]-[14]. Due to limited data acquisition environments and experimental costs, the performance of these earlier methods is limited compared to color-based approaches. Recent material recognition methods have addressed such issues by properly designing deep neural networks for material type recognition. Zhang et al. [15] propose Deep TEN, a material recognition framework that uses generalized encoders such as VLAD [16] and Fisher Vector [17] in the form of an encoding layer. The DEP (Deep Encoding Pooling) network [18] improves on Deep TEN [15] by adding a local spatial feature encoder. Zhai et al. [19] suggest MAP (Multiple Attribute Perceived)-Net, which learns a bag of texture attributes representing texture-specific characteristics. Their improved work, DSR (Deep Structure-Revealed)-Net [20], extracts the inherent spatial dependency of a texture, which helps textures to be recognized and deformed properly. However, both methods work only with close-up surface texture images. Various attempts to estimate or reconstruct surface reflectance have been made [12], [21]. Recent studies [22]-[26] have used multi-view images of a material object to implicitly obtain physically motivated features instead of using highly restricted experimental setups [13], [27]-[30].
There has been a large body of studies on capturing reflectance from a set of color images under varying experimental conditions [30]-[37]. Sengupta et al. [38] suggest a neural inverse rendering method that enables scene-attribute estimation from a single indoor image. Despite several practical advantages, the method handles real images in a self-supervised manner and therefore depends on its training data. Li et al. [39] suggest an indoor scene-wise SVBRDF (Spatially-Varying Bidirectional Reflectance Distribution Function) estimation framework that works from a single synthetic image. The method is limited in that it cannot be trained on real data, since it requires synthetic ground truth for training. Murmann et al. [40] propose the Multi-illumination Images in the Wild (MIW) data set, with a per-image illumination condition and per-pixel material type labels. Our color-image-based reflectance estimation network is trained on this data set.
Wang et al. [23] suggest a light-field camera based approach for joint reconstruction of the 3D shape and surface reflectance of an object. Xue et al. [25] suggest DAIN (Differential Angular Imaging Network), which uses the small angular variance between closely captured image pairs as a reflectance-related material feature. However, these methods are impractical because they need customized devices or conditioned environments. On the other hand, thanks to recent advances in deep learning, several attempts have been made at few-shot surface reflectance estimation [31]-[34], [37], [41], [42]. These methods are limited in that they apply only to close-up texture-like images or object-centered photos. Some recent works [38], [39] have achieved scene-wise reflectance estimation, but their application to real-world data is limited by their dependency on inverse rendering based methods.
In this work, we propose surface reflectance estimation methods using either a ToF (time-of-flight) depth camera (IR reflectance) or multiple color images (visible light reflectance) taken under varying practical conditions. The proposed methods collect pixel-wise surface reflectance, which enables dense material type classification of an entire scene. Based on extensive experimental evaluation, we suggest an optimal network structure for multi-modal material type recognition.

II. MATERIAL RECOGNITION WITH COLOR FEATURES
The Materials in Context Database (MINC) is a large material data set consisting of 23 material classes with context information. In total, 3 million sample patches are extracted from the 435,749 images of the data set. Each material sample patch includes the following attributes: the material label, the id of the original photo, and the 2D coordinates of the center pixel used to extract the patch. MINC-2500 is a subset of MINC with patch-wise material classification labels. The subset has a balanced class distribution (2,500 samples per class), whereas the original data set has a biased distribution of material classes. MINC has been widely used to evaluate color-based material classification because of its diversity and large number of samples. Zhao et al. [10] point out a problem with MINC samples, namely invalid values caused by pixels that fall outside the image border, and re-generate new material patches based on the extraction rules of MINC. They propose a real-time 3D material segmentation framework trained on their newly sampled 917,839 patches. Jurado et al. [43] propose a semantic segmentation method for natural materials on a point cloud using multi-spectral features. Xue et al. [18] show improved performance on MINC-2500 (about 82.00% classification accuracy) using a deep encoding and pooling network called DEP-Net.
In order to perform cross-data set verification, we have generated two novel variants of MINC-2500, called the MINC-NEW and OUR-NEW data sets. For MINC-NEW, samples are extracted with the same extraction strategy as MINC-2500. Each patch is extracted from its corresponding original image, and no two patches are extracted from the same image. Following that strategy, the patch size is set to 32.9% of the smaller dimension of the original image, and the patch is then resized to 362x362. 500 patches per material class (11,500 patches in total) are extracted. For the OUR-NEW data set, original images are collected from two different image sources. The first group consists of web-based images collected from the internet. In order to keep the context of the MINC data set, the keywords 'interior', 'office', 'cafe', and 'school' are used for searching. The keyword 'Asia' is also used to check whether MINC is biased toward western environment context rather than material-specific context features. The second group includes images of material objects located in the natural environment of a furniture showroom, captured with a mobile-phone camera. Each group consists of 300 to 400 original images, and patches are obtained without overlap. Five commonly observed material classes (fabric, glass, leather, metal, wood) are collected, giving 1,334 patches, using a patch extraction process identical to that of the first group. Figure 1 shows examples of the 3 different material data sets used in our experiment. The two proposed data sets have contextual characteristics similar to those of the original MINC-2500.

Figure 1: 3 different material data sets used in our experiment and their corresponding reference images. Each sample shares the same patch extraction strategy and contains a single material label at the center of the image.
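The patch geometry described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the patch is assumed to be square and centered on the labeled pixel, and the final resize to 362x362 (which needs an image library) is left out.

```python
def patch_box(width, height, cx, cy, scale=0.329):
    """Square patch centred on (cx, cy); side = scale * smaller image dimension.

    Returns (left, top, right, bottom) in pixel coordinates (right/bottom exclusive).
    The 0.329 scale is the 32.9% figure quoted in the text.
    """
    side = round(min(width, height) * scale)
    half = side // 2
    left, top = cx - half, cy - half
    return left, top, left + side, top + side


def inside_image(box, width, height):
    """True if the box lies fully inside the image, so no border pixels are missing."""
    left, top, right, bottom = box
    return left >= 0 and top >= 0 and right <= width and bottom <= height


# Hypothetical 1100x800 image with a labelled pixel at its centre.
box = patch_box(1100, 800, cx=550, cy=400)
```

Rejecting boxes for which `inside_image` is false is one simple way to avoid the out-of-border pixels that Zhao et al. [10] report for MINC samples; whether the data sets here were filtered that way is not stated in the text.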
In this evaluation, we choose DenseNet-121 [44] trained on MINC-2500, aiming to verify whether typical material properties are sufficiently represented in MINC. The patch-wise material classification model is trained on the training set of MINC-2500 (48,875 instances). Table 2 shows the material recognition performance of DenseNet-121 on our three different (but contextually similar) material data sets. The model gives 81.13% recognition accuracy on MINC-2500, which is comparable to Xue et al. [18] (82.00%). However, the model shows limited performance on our proposed data sets: 69.27% on MINC-NEW and 49.78% on OUR-NEW. Considering the contextual similarities shown in Figure 1, the performance decrease indicates that the trained model favors MINC-2500 images. Color-based features of an object can easily be contaminated by real conditions such as lighting and viewing direction. Therefore, material recognition that depends only on color features has a clear limitation in real-world applications.

A. SURFACE IR REFLECTANCE FEATURE
Surface reflectance of visible light has conventionally been obtained by exhaustive scanning over all lighting or viewing directions. Methods based on an anisotropic surface assumption show limitations in real-world applications [28]. Kim et al. [9] propose surface roughness features for material recognition using a single Kinect depth camera. Infrared reflectance obtained from the time-of-flight depth camera is used for material feature extraction. However, its application is still limited because it requires a large angular range (360°). By employing several practical assumptions, Lee et al. [11] propose a simple and practical acquisition setup. The isotropic surface assumption is reasonable, since an imaging setup with a Kinect is limited to a single imaging device with a light source. Based on Helmholtz reciprocity, they vary the camera and light source positions and, as a result, vary the incidence and reflectance angles. Our aim is not a precise reconstruction of the BRDF of visible light. Infrared reflectance separates diverse material types well enough, as verified in prior work [9], [11]. Furthermore, emitting visible light onto target objects or persons to acquire reflectance is not practical, whereas emitting invisible infrared light is free from this problem. Therefore, our proposed method employs IR reflectance for material type classification.

B. ACQUISITION SETUPS
In order to make the acquisition of the surface IR reflectance feature more practical than in the previous work [11], we use the same assumptions explained above while collecting reflected IR from real-world samples. Note that some of our device setups are based on [11]. Figure 2 shows 3 different acquisition setups. Figure 2-(a) is the object-rotation setup proposed by [11]. Reflected IR from the center of the rotating target is acquired by a camera fixed in front of the object. In order to obtain a complete IR distribution, the target surface has to be rotated to vary the incidence angle. Figure 2-(b) shows one-shot acquisition under the assumption of a uniform surface material type.
Consequently, it collects reflected IR from multiple surface points, assuming that they share the same reflectance characteristic. In particular, the reflected IR intensities of a surface with sufficient surface normal variation (a curved surface) can be acquired in one shot. Figure 2-(c) shows the camera-rotation setup, which requires point cloud registration. Newly acquired data from the ToF camera are accumulated in a dynamic voxel space. The camera-rotation acquisition setup is able to get pixel-wise reflectance, while the others get a single reflectance per target object.
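All three setups index the reflected IR intensity by incidence angle, which is derived from the surface normal. A minimal sketch of that computation is given below. It assumes that the IR emitter is co-located with the ToF camera, so that the incidence angle equals the angle between the surface normal and the direction from the surface point to the camera; the function name and coordinate convention are illustrative and not taken from [11].

```python
import math


def incidence_angle_deg(normal, point, camera=(0.0, 0.0, 0.0)):
    """Angle in degrees between a surface normal and the point-to-camera direction.

    Assumes the IR light source sits at the camera position (typical for a ToF sensor).
    """
    to_camera = [c - p for c, p in zip(camera, point)]
    n_len = math.sqrt(sum(n * n for n in normal))
    v_len = math.sqrt(sum(v * v for v in to_camera))
    if n_len == 0.0 or v_len == 0.0:
        raise ValueError("normal and viewing direction must be non-zero")
    cos_theta = sum(n * v for n, v in zip(normal, to_camera)) / (n_len * v_len)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding error
    return math.degrees(math.acos(cos_theta))


# A surface point 2 m in front of the camera whose normal is tilted by 45 degrees.
angle = incidence_angle_deg(normal=(1.0, 0.0, -1.0), point=(0.0, 0.0, 2.0))
```

Pairing each measured IR intensity with such an angle gives one sample of the reflectance distribution; a curved surface supplies many angles in a single frame, which is what makes the one-shot setup possible.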

C. COLOR-IR MATERIAL DATA SET
We employ the Color-IR Material Data Set [11]. Figure 3 shows examples from the data set, which covers 7 common material types. A total of 116 flat-surfaced samples are collected. Both color and the proposed reflectance features are acquired using the object-rotation method described in Figure 2-(a).

A. INDOOR SCENE-WISE SURFACE REFLECTANCE ESTIMATION
The proposed method gives reliable results both with and without color features in real-world environments. However, it has several problems. First, the reflectance feature collected by the one-shot acquisition method (Figure 2-(b)) has a narrow incidence angle variation. Camera-rotation (Figure 2-(c)) covers a wider range of incidence angles but still suffers from registration noise. To alleviate the problem, seven individual voxel spaces are collected from seven different viewing directions. Figure 4 compares these discrete observations with camera-rotation. In camera-rotation, well-registered point clouds segmented by surface normal are first obtained to get object boundaries. Since the point-wise IR distribution is noisy, we form clusters of neighboring points to get robust point-wise reflectance features. As a result of this clustering, points inside the same cluster get the identical reflectance of a single material type. For further enhancement, each segment is assigned the dominant material type among the predictions of all pixels inside it.
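The final refinement step, assigning each segment its dominant predicted material, amounts to a majority vote. The sketch below illustrates it under the assumption that every point already carries a segment id and a per-point predicted label; the function name and data layout are hypothetical.

```python
from collections import Counter


def refine_by_segment(segment_ids, predictions):
    """Replace each point's predicted label with the most frequent label in its segment."""
    votes = {}
    for seg, label in zip(segment_ids, predictions):
        votes.setdefault(seg, Counter())[label] += 1
    dominant = {seg: counts.most_common(1)[0][0] for seg, counts in votes.items()}
    return [dominant[seg] for seg in segment_ids]


# Two segments; one point in segment 0 was misclassified as paper.
refined = refine_by_segment(
    [0, 0, 0, 1, 1],
    ["wood", "wood", "paper", "fabric", "fabric"],
)
```

How ties between equally frequent labels should be broken is not specified in the text; this sketch simply keeps whichever label `Counter.most_common` returns first.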

B. MULTI-MODAL MATERIAL RECOGNITION NETWORK
Two-stream networks are widely used to fuse multi-modal features. Xue et al. [25] suggest a two-stream convolutional neural network called DAIN, which entangles the spatial and angular gradients of material cues. Lee et al. [11] use matrix concatenation at the end of the network to fuse the reflectance stream and the color stream for material recognition. Similar to the previous work, we apply feature fusion at the final layer, considering the different levels of the features. We adopt a SkipRNN [45] network. Lee et al. [11] define a partial gate ($p_t$) that skips updates whenever noisy data is fed as input, modifying the structure to be noise-robust.
Equations show the activation conditions of the partial gate used in the partial skipRNN. The network is trained on our proposed reflectance data set. We use the gradients between adjacent sequence points ($x_t - x_{t-1}$, $x_{t+1} - x_t$) to find noisy or missing values, based on the continuity of the IR distribution. If the obtained gradient is greater than a defined threshold (0.5), the intensity value at that point is considered noise and set to zero. Also, if a point located at a small incidence angle has an intensity value larger than the threshold, it is also considered noise.
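The gradient-based noise test can be sketched as below. Several details are assumptions made for illustration rather than statements about the authors' code: intensities are taken to be normalized to [0, 1], gradients are compared by absolute value, an interior point is flagged only when both of its adjacent gradients exceed the threshold, and an end point is judged by its single available gradient. The second rule in the text (large intensity at a small incidence angle) is not implemented here.

```python
def suppress_noise(intensities, threshold=0.5):
    """Zero out points whose jump to every adjacent neighbour exceeds the threshold."""
    cleaned = list(intensities)
    n = len(intensities)
    for t in range(n):
        gradients = []
        if t > 0:
            gradients.append(abs(intensities[t] - intensities[t - 1]))  # x_t - x_{t-1}
        if t < n - 1:
            gradients.append(abs(intensities[t + 1] - intensities[t]))  # x_{t+1} - x_t
        if gradients and all(g > threshold for g in gradients):
            cleaned[t] = 0.0  # treated as noise
    return cleaned


# A smooth IR fall-off with one dropped reading at index 2.
cleaned = suppress_noise([0.9, 0.85, 0.1, 0.8, 0.7])
```

Requiring both gradients to be large keeps the neighbors of a spike from being flagged along with it; if the intended rule is that either gradient suffices, `all` would become `any`.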
Huang et al. [44] suggest several versions of DenseNet that share the same dense connectivity strategy. Lee et al. [11] introduce a model fine-tuned on MINC-2500 [2], which makes the network suitable for the material recognition task. Within their two-stream network structure with concatenation-based feature fusion, this model is used as the feature extractor of the color feature stream. Concatenation, however, has two drawbacks. First, since the feature of each stream is subject to bias under concatenation, the fused feature cannot describe reflectance and color characteristics in balance. Second, concatenation cannot guarantee that the correlation between the two features is sufficiently encoded. To get a better correlated fusion of multi-modal features, recent work [46], [47] uses outer product fusion for two-stream networks instead. By multiplying the color and reflectance feature matrices, correlated multi-modal features are obtained.
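The difference between the two fusion schemes can be seen on small vectors. The sketch below is only an illustration with hypothetical function names and toy feature values; it uses plain lists rather than the network's actual tensors.

```python
def concat_fusion(color_feat, refl_feat):
    """Concatenation: the two feature vectors are placed side by side, unmixed."""
    return list(color_feat) + list(refl_feat)


def outer_product_fusion(color_feat, refl_feat):
    """Outer product, flattened row by row: one entry per (color, reflectance) pair."""
    return [c * r for c in color_feat for r in refl_feat]


color = [1.0, 2.0]        # toy color-stream feature
refl = [3.0, 4.0, 5.0]    # toy reflectance-stream feature

concatenated = concat_fusion(color, refl)   # length 2 + 3 = 5
fused = outer_product_fusion(color, refl)   # length 2 * 3 = 6
```

Every entry of the outer product depends on both modalities, which is why it encodes cross-modal correlation explicitly, at the cost of a feature size that grows with the product of the two stream dimensions rather than their sum.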

V. MATERIAL RECOGNITION WITH MULTIPLE COLOR IMAGES

A. MULTI-ILLUMINATION IMAGES
The Multi-illumination Images in the Wild (MIW) [40] data set consists of more than 1000 scenes with 25 illumination variations and per-pixel material labels. Using a customized camera setup with an attached, automatically controlled light source, MIW provides 25 images per scene with varying lighting conditions and shared (fixed) scene geometry. The per-pixel material labels cover 41 material classes, including a wide range of materials such as fabric, glass, metal, paper, plastic, marble, linoleum, and wicker. For better representation of real-world environments, photos are captured in 95 different rooms located in 12 different residential and office areas. Since all of the provided photos are taken as HDR images with corresponding light probes, the data set is suitable for applications such as lighting prediction or relighting, as its authors have demonstrated. In our work, we focus on the per-pixel material labels provided by the data set. Since MIW provides illumination-varying images of static scenes, it reveals per-pixel reflectance characteristics of target objects.

B. TWO-STREAM NETS WITH ATTENTION MODULE
Our network consists of two neural networks that encode color and brightness variation features, as shown in Figure 5. In the color feature network (Figure 5 (b)), each patch is fed to DenseNet-121 fine-tuned on the MIW data set. Since each patch has n different illumination conditions, the network encodes color features from all n patches. In the brightness variation network (Figure 5 (c)), we build a new patch set $\tilde{P}$ by taking the difference patch of two consecutive original patches in the patch set $P$.
The patch set $P$ consists of n patches with different brightness at the same viewpoint. The new patch set $\tilde{P}$ therefore isolates the brightness changes of the original patches, which suits the extraction of brightness variation features. The extracted brightness variation reveals the unique surface reflectance characteristic of each material type. $\tilde{P}$ is fed to ResNet-34 fine-tuned on the MIW data set to extract brightness variation features. We concatenate the color and brightness variation features along the sorted sequence. Each concatenated feature vector $f_i$ has 192 channels.
The concatenated feature vectors $f_1, \dots, f_n$ are fed to the attention module (Figure 5 (d)) to emphasize features that contribute to class separation in the following Long Short-Term Memory (LSTM) network. Note that the LSTM extracts sequentially varying features, such as surface reflectance, from the illumination-varying inputs. Our attention module learns channel-wise significance, assigning the same attention to the corresponding feature set ($f_1(j), \dots, f_n(j)$) of the j-th channel. Then, each feature vector $f_i$ obtained from the corresponding input patches $p_i$, $\tilde{p}_i$ is fed to each unit of the LSTM. Consequently, the LSTM classifies material types from the color and brightness variation features observed in the illumination-varying input patches.
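The difference patch set can be sketched as follows. This is an assumption-laden illustration: patches are represented as 2D lists of brightness values, the difference is taken as later minus earlier along the brightness-sorted order, and the function name is hypothetical. Taking consecutive differences of n patches yields n - 1 difference patches; how the original work pairs these with n feature vectors is not stated in the text.

```python
def difference_patches(patches):
    """Pixel-wise difference between each pair of consecutive patches."""
    diffs = []
    for earlier, later in zip(patches, patches[1:]):
        diffs.append(
            [[b - a for a, b in zip(row_e, row_l)] for row_e, row_l in zip(earlier, later)]
        )
    return diffs


# Three 2x2 patches of the same location under increasing illumination.
patches = [
    [[1, 2], [3, 4]],
    [[2, 4], [3, 5]],
    [[5, 5], [5, 5]],
]
diffs = difference_patches(patches)
```

A pixel whose brightness rises sharply between consecutive illumination conditions (for example a specular surface) produces large entries in the difference patch, while a diffuse surface changes more gradually.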

C. PATCH EXTRACTION & SORTING
Our network takes multiple images captured at the same location under varying illumination. We hypothesize that the reflectance of a material obtained from these observations helps to characterize the material type, thanks to its unique mesosurface and BRDF (Bidirectional Reflectance Distribution Function) characteristics. We randomly extract m patches per scene for patch-wise classification. Multiple material classes may exist inside a patch, so we set the label of the center pixel of a patch as its class label. Figure 5 (a) shows that a patch set $P = \{p_1, \dots, p_n\}$ consists of n illumination-conditioned patches.
After the patches $p_1, \dots, p_n$ are obtained, they are sorted by average pixel brightness. As a result, we extract m (number of patches) × n (illumination conditions) patches per scene.
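The sorting step can be sketched as below, assuming patches are 2D lists of brightness values and that the order is ascending (the text does not state the direction); the function names are illustrative.

```python
def mean_brightness(patch):
    """Average pixel value of a 2D patch."""
    pixels = [value for row in patch for value in row]
    return sum(pixels) / len(pixels)


def sort_by_brightness(patches):
    """Order the n illumination variants of one patch from darkest to brightest."""
    return sorted(patches, key=mean_brightness)


# The same 2x2 location under three illumination conditions, in capture order.
captured = [
    [[5, 5], [5, 5]],
    [[1, 1], [1, 1]],
    [[3, 3], [3, 3]],
]
ordered = sort_by_brightness(captured)
```

Sorting gives the sequence model a consistent ordering across scenes, so that position in the sequence carries the same meaning regardless of the order in which the light directions were captured.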

A. MATERIAL RECOGNITION FROM COLOR AND DEPTH
The data set used for experimental evaluation includes reflected IR, RGB, vertex normals, and 3D points. The incidence angle is calculated from the vertex normal [9], [11]. Frame acquisition and calibration functions are implemented on top of the Kinect v2 SDK. However, our proposed method can be applied to any other type of imaging device. Based on the acquisition setup in Figure 4-(a), we construct multi-object scenes. With these test objects, we minimize the errors caused by poor registration and segmentation. Figure 6-(a) shows sample material segmentation results. The proposed feature successfully smooths out the noise caused by poor registration. For comparison, combinations of different network structures are evaluated. The one-dimensional Gradual CNN consists of 4 convolution layers whose kernel sizes gradually increase (1x3, 1x5, 1x7, 1x11). For a fair comparison with [11], which uses a 2-layer RNN-based structure, the total number of convolution layers of our network is fixed to 4. DenseNet-121 is first trained with MINC-2500 [11]. Table 3 compares the combinations of different network structures. In the test with the reflectance feature alone, Gradual CNN shows the best performance (72.66%) compared to the previous work (64.67%). The results indicate that trained LSTM cells are affected by noisy inputs carried over from the previous state, whereas the proposed CNN structure isolates noisy inputs better within each kernel. Applying dilation to the network decreases performance from 72.66% to 68.67%, although this is still higher than the previous work. The performance gain of the outer product over the concatenation-based method is 7.34% in the two-stream network with partial skipRNN. Outer product fusion encodes features of different modalities better. Gradual CNN with the outer product shows the best performance of 86.00%.
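One way to read the Gradual CNN design is through its receptive field along the incidence-angle axis. The short calculation below assumes stride-1 convolutions with no pooling, which the text does not confirm, and the dilation rate of 2 in the second call is purely hypothetical, since the rates actually tested are not given.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Receptive field of stacked stride-1 1D convolutions: 1 + sum((k - 1) * d)."""
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        field += (kernel - 1) * dilation
    return field


gradual = receptive_field([3, 5, 7, 11])                 # 1 + 2 + 4 + 6 + 10 = 23
dilated = receptive_field([3, 5, 7, 11], [2, 2, 2, 2])   # 1 + 2 * 22 = 45
```

Under these assumptions each output of the fourth layer sees 23 consecutive reflectance samples, and early layers see only 3, so an isolated noisy sample influences a bounded window rather than the whole sequence.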

B. 3D MATERIAL-AWARE SEGMENTATION
In this test, we perform 3D point cloud segmentation by material type. A neural network trained on 4 common material classes (fabric, leather, paper, wood) is used. Compared to the previous experiments, this test is performed in noisy, unconstrained real-world conditions. Seven point clouds from seven different viewing directions are acquired without registration. For each segment inside the angle-wise point clouds (from $\theta_1$ to $\theta_7$), a total of seven reflectance features are obtained and merged following the refinement process described in Figure 4. As illustrated in Figure 6, our framework shows meaningful classification performance despite the challenging environmental conditions.

C. MATERIAL RECOGNITION FROM MULTIPLE COLORS
Implementation details and results of our experiments using multiple color images are as follows. Our model is implemented with the PyTorch framework. The ADAM optimizer is used for training with a batch size of 32. The learning rate is initialized to 0.001 and scheduled with ReduceLROnPlateau (patience = 10, factor = 0.95). The backbone networks are initialized with pre-trained models, and training ends after 300 epochs. We use the MIW data set for the evaluation. The MIW data set is randomly divided into 5 splits with 985 training scenes and 30 test scenes. We randomly extract 40 (= m) patches per scene. There are 41 classes in the MIW data set. We group classes of similar material types into 8 super-classes (Fabric, Glass, Leather, Metal, Paper, Plastic, Stone, and Wood). We perform 5 different experiments (Table 4) to verify the effects of multi-illumination conditions, the two-stream network, the new patch set $\tilde{P}$, and the attention module, respectively. Compared to the single-illumination case, the accuracy with multi-illumination is around 9% higher. This shows that the difference in the surface appearance of materials across illumination conditions improves material type classification. The result of the two-stream network using only patch set $P$ is 1.87% higher than that of the single-stream network. The result of using $P$ for the color feature network and $\tilde{P}$ for the brightness feature network is 75.64%, which is 2.73% higher than using only patch set $P$. This implies that brightness features from the new patch set $\tilde{P}$ help the network to classify material types by better encoding surface reflectance. Finally, adding the attention module gives 0.86% higher accuracy than the two-stream network without it.
Contextual relation in our material recognition task indicates how often one material type is observed together with other material types within a patch. We hypothesize that the material types of neighboring regions reveal critical information about the material type of the target region. As the patch size varies, there is a trade-off between surface reflectance and color information in the extraction of material type features. Each pixel of a patch has its own reflectance characteristic; within a patch, however, the integrated reflectance of multiple pixels is obtained. What we obtain is therefore the reflectance of all pixels within the patch, which we call a smoothed reflectance. If the patch size is large, the reflectance of a larger area is smoothed, so it is relatively difficult to define a common reflectance. On the other hand, color and contextual information are extracted better from a bigger patch. Conversely, if the patch size is small, a common reflectance label works relatively well, but the extraction of other visual features may be limited. Based on this relationship, we test increasing patch sizes. A patch is extracted only when the portion of pixels carrying its material label is higher than 65% within the patch. The 'Normal' column of Table 5 shows the test accuracy for varying patch sizes. Larger patch sizes perform better than the 11x11 and 21x21 patch sizes. However, performance saturates at a patch size of 31x31 and above, indicating that no better color and contextual information can be extracted from patches larger than 31x31.
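The 65% extraction rule can be sketched as a purity check on the per-pixel label map of a candidate patch. The sketch assumes that the patch label is the label of the center pixel, as described in the patch extraction section; the function names and the list-of-lists label map are illustrative.

```python
def label_fraction(label_patch, label):
    """Fraction of pixels in the patch that carry the given material label."""
    pixels = [lab for row in label_patch for lab in row]
    return pixels.count(label) / len(pixels)


def accept_patch(label_patch, min_fraction=0.65):
    """Return (accepted, patch_label); accepted when the centre label covers > min_fraction."""
    height, width = len(label_patch), len(label_patch[0])
    center_label = label_patch[height // 2][width // 2]
    return label_fraction(label_patch, center_label) > min_fraction, center_label


mostly_wood = [
    ["wood", "wood", "paper"],
    ["wood", "wood", "wood"],
    ["wood", "paper", "wood"],
]
mixed = [
    ["paper", "paper", "wood"],
    ["paper", "wood", "wood"],
    ["wood", "paper", "wood"],
]
kept = accept_patch(mostly_wood)   # 7 of 9 pixels are wood
dropped = accept_patch(mixed)      # 5 of 9 pixels are wood
```

As the patch grows, fewer candidate locations pass this test, which is consistent with the reduced diversity of large patches noted in the augmentation discussion.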

1) Training Data Augmentation
We conduct training data augmentation. The key point is to keep the contextual relation within a patch when more than one class type exists in the patch. Since the class label of the major region in a patch is the material label of the patch, our augmentation keeps that region and replaces the remaining minor region with content from another patch that has the same minor class label. To preserve contextual information, we create a new minor part of the same material class by randomly selecting it from another training image. Figure 7 shows an example of the proposed augmentation. Figure 7-(a) is a patch whose class label is paper, and the minor part at the top left is bright-colored wood. Figure 7-(b) shows a patch in which the minor part, wood, is replaced by dark-colored wood from another training image. Table 5 shows that the best accuracy is obtained with augmentation at a patch size of 31x31. When the patch size is small, augmentation has little effect because the contextual information contained within the patch is limited. When the patch size is increased, a patch is extracted only when the pixels corresponding to the class label make up 65% or more of the patch, so the surface diversity of the extracted patches, and hence the diversity of the data set, is reduced.
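A simplified sketch of this augmentation is given below. It is an illustration under assumptions, not the authors' procedure: patches are 2D lists of pixel values with a matching label map, and `donors` is a hypothetical dictionary that maps a material class to a same-sized texture patch of that class taken from another training image. How the donor region is located and aligned in the original work is not described in the text.

```python
def augment_minor_region(pixels, labels, major_label, donors):
    """Keep the major-class region; swap minor-class pixels for same-class donor pixels."""
    augmented = [row[:] for row in pixels]  # copy so the source patch stays intact
    for y, label_row in enumerate(labels):
        for x, label in enumerate(label_row):
            if label != major_label and label in donors:
                augmented[y][x] = donors[label][y][x]
    return augmented


pixels = [[10, 10], [200, 10]]                      # bright wood pixel at bottom left
labels = [["paper", "paper"], ["wood", "paper"]]
donors = {"wood": [[50, 51], [52, 53]]}             # darker wood from another image

augmented = augment_minor_region(pixels, labels, "paper", donors)
```

Because the label map is unchanged, the augmented patch keeps the same material co-occurrence (paper next to wood) while varying the appearance of the minor material.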

VII. CONCLUSION
In this work, we propose a material type recognition method for indoor scenes via surface reflectance estimation. A novel two-stream network that extracts both reflectance and color features obtains pixel-wise material types from objects in real-world indoor environments. Diverse experimental evaluations on public data sets show that combining conventional color features with the reflectance feature outperforms prior approaches.