UrbanLF: A Comprehensive Light Field Dataset for Semantic Segmentation of Urban Scenes

As one of the fundamental technologies for scene understanding, semantic segmentation has been widely explored in the last few years. Light field cameras encode geometric information by simultaneously recording the spatial and angular information of light rays, which provides a new way to approach this task. In this paper, we propose a high-quality and challenging urban scene dataset containing 1074 samples composed of real-world and synthetic light field images as well as pixel-wise annotations for 14 semantic classes. To the best of our knowledge, it is the largest and most diverse light field dataset for semantic segmentation. We further design two new semantic segmentation baselines tailored for light field and compare them with state-of-the-art RGB, video and RGB-D-based methods on the proposed dataset. The stronger results of our baselines demonstrate the advantages of the geometric information in light field for this task. We also provide evaluations of super-resolution and depth estimation methods, showing that the proposed dataset presents new challenges and supports detailed comparisons among different methods. We expect this work to inspire new research directions and stimulate scientific progress in related fields. The complete dataset is available at https://github.com/HAWKEYE-Group/UrbanLF.

Semantic segmentation plays an important role in visual scene understanding. Accurate and reliable semantic segmentation has benefited many popular and challenging applications like autonomous driving [1], medical image analysis [2] and geographic information systems [3].
Existing semantic segmentation methods can be divided into four categories based on the type of input data. Early works [4]-[7] focus on exploiting the visual cues from a single RGB image with hand-crafted features or feature learning techniques. With the rising demand for real-time applications, semantic segmentation has also been applied to video sequences [8]-[11]. The key is to make effective use of temporal context in the video to balance the trade-off between quality and speed. Other approaches utilize geometric and structural information in 3D data to further improve accuracy, and generally fall into two categories. One is RGB-D semantic segmentation [12]-[15], which leverages depth data to recalibrate the RGB features. The other, dubbed point-cloud semantic segmentation [16]-[19], concentrates on extracting more representative features directly from 3D point sets.
However, these algorithms have some shortcomings. For single-image and video-based methods, the limited information in RGB images does not allow geometric constraints to be fully analyzed, making it difficult to achieve promising results in challenging scenes with low color discrimination and complex occlusion. For RGB-D-based methods, depth maps captured by sensors are partially noisy and hard to align accurately with the RGB pixels, which may lead to undesirable results. The limited distance measurement range can make this problem more pronounced in outdoor scenes. For point-cloud-based methods, the dataset size is generally small since acquiring and annotating points is much more complicated than annotating images, which restricts the development of deep learning methods. In addition, the quality of the data cannot be well controlled.
In this paper, we propose a new comprehensive light field (LF) dataset named UrbanLF for semantic segmentation. A 4D LF [20] not only contains intensity but also direction of light rays. The additional directional information implicitly defines the geometry of the scene. Fig. 1 parametrizes the LF as L(x, y, s, t) with two parallel planes, where (x, y) and (s, t) are the spatial coordinates and angular coordinates respectively. Theoretically speaking, LF benefits semantic segmentation in several ways. First, LF can be seen as sub-aperture images where the viewpoints are arranged on a regular grid in angular plane. Some occluded pixels in target view can be obtained from other views. Second, LF contains depth information [21] which has been proven useful for this problem. Third, the pixel parallax of LF is available for reflection layer removal [22] and rain streak removal [23] so as to reduce the performance degradation of image understanding.
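To make the two-plane parametrization L(x, y, s, t) concrete, the following minimal sketch stores a toy LF as a nested list and extracts one sub-aperture image by fixing the angular coordinates. The storage order `lf[t][s][y][x]`, the helper names and the toy pixel values are assumptions made for illustration only; they do not describe the file layout of UrbanLF.

```python
# Toy 4D light field stored as nested lists: lf[t][s][y][x] holds one sample.
# (s, t) index the angular (viewpoint) plane, (x, y) the spatial (image) plane.
# The storage order and all names here are illustrative assumptions.

def make_toy_light_field(n_s, n_t, width, height):
    """Build a toy LF whose 'pixel' simply records its own coordinates."""
    return [
        [
            [[(x, y, s, t) for x in range(width)] for y in range(height)]
            for s in range(n_s)
        ]
        for t in range(n_t)
    ]


def sub_aperture_image(lf, s, t):
    """Fixing the angular coordinates (s, t) yields one 2D view."""
    return lf[t][s]


def central_view(lf):
    """Return the view in the middle of the regular angular grid."""
    n_t = len(lf)
    n_s = len(lf[0])
    return sub_aperture_image(lf, n_s // 2, n_t // 2)


lf = make_toy_light_field(n_s=9, n_t=9, width=4, height=3)
view = central_view(lf)
num_views = len(lf) * len(lf[0])
```

With a 9 × 9 angular grid the toy LF contains 81 views, and the central view is the one at angular index (4, 4) under zero-based indexing.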
Standard benchmarks [24], [25] have proven their importance for the development in the respective fields. They can offer guidance on research by giving detailed evaluations and objective comparisons of different methods. Moreover, driven by the big success of deep learning, most top-performing methods are nowadays built upon deep neural networks. A major factor is the availability of large-scale datasets which allow networks to develop full potential. In brief, it is necessary to create a large benchmark dataset to support LF semantic segmentation.
To date, almost no existing LF datasets contain annotations for semantic segmentation, which hinders the development of the field. Commercial plenoptic cameras that can capture an LF in a single shot are available, and continuous progress in rendering engines guarantees the quality of synthetic data. With the help of these imaging technologies, we collect 1074 LF images of complex urban scenes together with pixel-wise annotations. The dataset contains both real-world and synthetic samples. In addition, we design two baselines specifically for LF semantic segmentation and provide a performance evaluation of semantic segmentation methods on the proposed dataset. The results show that our baselines achieve better performance than non-LF-based methods, demonstrating the effectiveness of LF for this problem. We also apply the dataset to other tasks such as super-resolution and depth estimation. We hope that our dataset can contribute to the LF community.
In summary, the contributions of our work are as follows:
• We create a large-scale LF dataset called UrbanLF, including 824 real-world samples and 250 synthetic samples annotated with 14 classes. To the best of our knowledge, this is the first LF dataset of this scale proposed for semantic segmentation.
• We propose two new LF semantic segmentation baselines and conduct systematic experiments on UrbanLF to compare them with state-of-the-art RGB, video and RGB-D-based methods. The solid performance of our baselines confirms the advantage of LF for this problem.
• We design a comparative experiment to explore the effectiveness of synthetic data. The experimental results demonstrate that synthetic data can complement real-world data to boost model performance.
• We provide further experiments to explore the potential of UrbanLF for other tasks and give a comprehensive analysis of the advantages and drawbacks of different methods. The results show that UrbanLF can also be used for super-resolution and depth estimation.

A. Light Field Datasets
In the past few years, LF has gradually developed into one of the mainstream research fields of the computer vision community. Owing to the potential of the additional directional information of light, a large variety of fields have tried using LF as input rather than a single image. This has introduced a series of datasets that can be classified into real-world LF captured by a camera array, a gantry or a plenoptic camera, and synthetic LF rendered by Blender [26] or other software. In some applications like depth estimation [24], [27]-[29], super-resolution [30]-[32], saliency detection [33]-[35] and view synthesis [36], [37], LF datasets have been widely used with remarkable results. In other applications like quality assessment [38], [39], video processing [40] and intrinsic decomposition [41], the use of LF is still at a preliminary stage with initial success. The summary of these LF datasets is shown in Table I.
Very recently, [42] proposed the first LF dataset for semantic segmentation. It provides 400 real-world macropixel images and corresponding central view images with annotations for 3 foreground objects. The small dataset size and number of semantic classes constrain its application. In comparison, our UrbanLF has more samples and richer annotations, which allows deep learning models to generalize well and presents more challenges to this field.

B. Semantic Segmentation Datasets
Various types of datasets have been proposed for semantic segmentation and gradually improved in terms of size, annotation richness, scene variability and complexity. These datasets play an important role in the overall progress in this field.
CamVid [51] is the first collection of videos with semantic labels. It includes five sequences captured from the perspective of a driving automobile. Cityscapes [25] contains a larger set of stereo videos recorded in street scenes. The inaccurate depth maps obtained through stereo image pairs are rarely used for RGB-D semantic segmentation. It also has 20000 weakly annotated images as extra training data. Mapillary Vistas [52] collects single images captured at various conditions by different devices. It covers a wide variety of street scenes and is only used for RGB semantic segmentation. SYNTHIA [53] is a synthetic collection of photo-realistic images rendered from a virtual city created by Unity. It has 3 subsets with different types of data that can be selected depending on the needs. NYUDv2 [54] and SUNRGBD V1 [55] are widely used for RGB-D semantic segmentation. The former consists of RGB-D images taken from the Kinect and the latter combines images from [54], [56], [57] with new samples captured from 4 different sensors. As public standard point-cloud datasets, Semantic3D [58] consists of dense and complete points from  [59] includes richly annotated points from cities with available RGB color. The summary of these datasets is shown in Table II. In contrast, our UrbanLF has some distinguishable features. It includes a large number of densely and regularly sampled LF images which contribute to new LF semantic segmentation methods. It has two kinds of data with a small domain gap, in which the synthetic samples can be regarded as a good supplement to real-world samples and provide accurate depth information for RGB-D semantic segmentation.

C. Semantic Segmentation Algorithms
1) RGB Semantic Segmentation: Semantic segmentation has reached a new stage with the introduction of FCN [60] that leverages convolutional layers instead of fully connected layers to get the final predictions. Standard FCN segmentation model utilizes the encoder-decoder structure so as to split the task into two stages. Firstly, the encoder uses ConvNets like ResNet [61] to encode semantic information into feature maps, then the decoder recovers the prediction details gradually through the context information. In order to further improve accuracy, some approaches focus on solving the problem of limited receptive field. Yu et al. [6] exploit atrous convolution to enlarge the receptive field while keeping the resolution of the feature maps to preserve the spatial information. HRNet [62] generates reliable high-resolution features through repeatedly fusing the representations from multi-resolution convolution streams. Other approaches achieve this by capturing multi-scale context information. PSPNet [5] proposes the pyramid pooling module (PPM) that adopts average pooling layers with different scales. DeepLabV2 [63] proposes the atrous spatial pyramid pooling (ASPP) that adopts atrous convolutions with different rates. Ca-crfs Net [64] uses spatial pyramid pooling to ensemble multi-scale features and proposes cascaded conditional random fields to learn boundary information. Besides, some recent works combine Transformer [65] with segmentation to achieve state-of-the-art performance. SETR [66] is a new segmentation model that replaces the traditional stacked convolution layers with a pure transformer. In OCR [67], a transformer encoder-decoder framework is used to rephrase the object-contextual representation scheme.
2) Video Semantic Segmentation: Video semantic segmentation aims to generate real-time predictions for each frame. The most straightforward approach is to apply an RGB semantic segmentation model to each frame. However, this strategy brings an excessive computational burden. Existing approaches concentrate on exploring the temporal relation between video frames to avoid unnecessary computation. One way is to reuse the features from the key frame to current frame. The challenge is how to propagate information robustly. Carreira et al. [8] directly reuse stable features extracted from deep layers to share information across frames. [9], [10], [68] apply an optical flow network to guide the propagation process. The other way is to use the same model for each frame and aggregate them through temporal context for better features. TDNet [11] applies several sub-networks to extract sub-feature groups and gets full features via grouped knowledge distillation loss and attention propagation module. TMANet [69] treats past frames as memory and builds long-range temporal context information to enhance the representation power of features from current frame.
3) RGB-D Semantic Segmentation: RGB-D semantic segmentation takes depth data into consideration to achieve better performance. The majority of approaches treat the depth as an additional input of the network. A two-stream network is used to process RGB images for color and texture information as well as depth images for geometry information, then fuses them for final prediction. ACNet [13] proposes a third branch to process and propagate the fusion features from RGB and depth branches. SA-Gate [14] performs feature aggregation and transfers the fusion features back into RGB branch and depth branch to recalibrate information at each stage. ESANet [70] adopts shallow encoder branches and a 3×1 along with a 1×3 convolution for faster inference with competitive results. Some approaches directly incorporate depth data into explicit operations. DCNN [12] proposes depth-aware convolution and depth-aware average pooling to seamlessly incorporate geometry into CNN. SGNet [71] proposes a S-Conv operator that can adaptively adjust the convolution weights and distributions based on the spatial information. Other approaches [15], [72] treat depth data as a supervised signal and use a multi-task learning framework to jointly train segmentation and depth estimation to improve single-task performance.

4) Point-Cloud Semantic Segmentation:
Point-cloud semantic segmentation adopts points in 3D space instead of pixels in 2D images and assigns each point with a label. Existing approaches are sorted into three categories according to the data format. 2D-based approaches [16], [73] first convert data into multi-view 2D images, apply 2D CNN architectures to generate downsampled 2D labels and transfer them back to 3D form. Voxel-based approaches [17], [74] first voxelize raw points, apply 3D CNN frameworks for subsequent processing and restore the result to the original 3D point labels. Unlike the aforementioned work, point-based approaches directly process point-cloud without data pre-transformation operation. PointNet [18] applies multi-layer perceptrons to extract point features that aggregate both global and local knowledge and finally outputs per point scores. PointNet++ [75] further explores the local relationship among points to augment features for improving performance.

5) Light Field Semantic Segmentation:
Previous works mainly focus on LF segmentation, which aims at grouping pixels of different objects without considering semantic information. Wanner et al. [76] propose globally consistent multi-label assignment for the first time. Hog et al. [77] decrease the running time of Markov random field graph-cut by using a ray-based graph structure. Inspired by superpixel segmentation of 2D images, Zhu et al. [78] propose the light field superpixel (LFSP) and develop a refocus-invariant LFSP segmentation method. Khan et al. [79] segment horizontal and vertical epipolar plane images (EPIs) and combine the angular segmentations in them through view-consistent clustering. Lv et al. [80] build a hypergraph representation with LFSPs and present a method based on graph-cut optimization. Hamad et al. [81] propose an automatic, adaptive and view-consistent method based on normalized LF cues and K-means clustering. [42] is currently the only work that uses LF to explore semantic segmentation. It investigates the advantage of LF angular-spatial information combined with a designed convolutional neural network. The network has an angular model to learn angular features from macropixel images and applies ASPP to extract multi-scale context features.

III. THE URBANLF DATASET
For providing sufficient data for LF semantic segmentation, we create a new large-scale LF dataset called UrbanLF which includes 824 real-world and 250 synthetic samples. As shown in Fig. 2, each sample is composed of 81 sub-aperture images with an angular resolution of 9 × 9 and high-quality pixelwise annotation of central view. The synthetic sample further contains annotation, depth map and disparity map of all views.
We choose urban scenes as the subject of UrbanLF. With the development of urbanization, urban scene understanding has become a research hotspot and has been widely used in advanced applications like crowd detection and traffic analysis. Consequently, it is meaningful to further understand complex urban scenes through the rich information in LF to improve the practical system performance and reliability, offering a good alternative choice for depth data.

A. Data Capture Process
The real-world data are captured by a Lytro Illum, which is widely used because it is easy to carry and operate. To ensure the quality of the data, we collect LF during periods with sufficient light so that all objects in the scene can be clearly captured. The density of foreground objects is kept high to prevent background classes from occupying most of the pixels in a single image. We avoid unfavorable weather conditions such as heavy rain or snow because of equipment limitations. We also avoid overly complicated scenes to reduce the adverse impact of unclear structure on annotation due to the limited image resolution. The Lytro Illum stores the original data in LF Raw format, which is processed with the MATLAB Light Field Toolbox [82] in this work. Note that the depth maps obtained from the toolbox are discarded because they are prediction results rather than ground truth data.
The synthetic data are created by Blender using the Cycles and Eevee renderers. We design a virtual urban environment and add various elements to it to acquire images. In order to increase the diversity and complexity of the data, each element has multiple models with different textures, colors and shapes, and we place many instances in a scene to avoid leaving large
To keep the real-world and synthetic data consistent, the resolutions of the two parts should be as close as possible. The limited number of sensors in Lytro Illum makes the spatial resolution of the real-world images only 623 × 432, so we select 640 × 480 as the spatial resolution of the synthetic images. In addition, the synthetic data contain densely sampled LF with disparity ranging from −0.47 to 1.55 between adjacent views, which is basically similar to the real-world data.

B. Class Selection and Image Annotation
Taking into account practical applications, the frequency of objects and the compatibility with existing urban scene datasets, we define 14 classes for evaluation, i.e., bike, building, fence, others, person, pole, road, sidewalk, traffic sign, vegetation, vehicle, bridge, rider and sky. Please refer to the supplementary material for detailed definition. We provide fine annotations that accurately reflect details in the scene, including the contour of the object, the scale of the object and the occlusion relation between different objects.
For real-world data, the annotations of central sub-aperture images are realized by human labour via LabelMe [83].
To guarantee the quality, the annotation time is more than one hour on average for an image. Furthermore, three participants are responsible for checking all annotations so as to avoid inconsistencies caused by the different understanding of the label scheme and definition of classes among annotators.
For synthetic data, Blender generates completely accurate label maps, depth maps and disparity maps of the scene, greatly reducing the demand for manual effort. The extra semantic annotations and depth information for all 81 views broaden the scope of application of synthetic data.

C. Dataset Splitting
UrbanLF is split into training, validation and test sets at a ratio of approximately 7:1:2. Following this scheme, the real-world data consist of 580 training, 80 validation and 164 test samples, while the corresponding numbers for the synthetic data are 172, 28 and 50 respectively. The training and validation sets are publicly available and the test set is withheld for benchmarking. We divide the data by stratified sampling instead of random sampling. Specifically, each set is composed of samples with the same distribution ratio in the following properties: 1) the light condition, 2) the number of instances, 3) the shooting angle. This balanced split helps to train and test models comprehensively.
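The stratified 7:1:2 split described above can be sketched as follows. This is only an illustration of stratified sampling in general: the attribute names (`light`, `density`, `angle`), their values, the rounding rule and the toy sample list are assumptions, not the procedure or metadata actually used to build UrbanLF.

```python
import random
from collections import defaultdict


def stratified_split(samples, key, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split samples into train/val/test so every stratum keeps roughly the same ratio."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for sample in samples:
        strata[key(sample)].append(sample)

    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_val = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # the remainder goes to the test set
    return train, val, test


# Hypothetical metadata: 120 samples spread evenly over 12 strata.
samples = [
    {
        "id": i,
        "light": ["sunny", "overcast"][i % 2],
        "angle": ["low", "eye-level", "high"][i % 3],
        "density": ["few", "many"][(i // 6) % 2],
    }
    for i in range(120)
]

train, val, test = stratified_split(
    samples, key=lambda s: (s["light"], s["density"], s["angle"])
)
```

Because each stratum is split separately, the proportion of, say, low-angle samples stays the same in all three sets, which plain random sampling does not guarantee for small strata.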

D. Statistical Analysis
We conduct statistical analysis from three aspects to give a comprehensive introduction to UrbanLF.

1) Distribution of Classes:
We compare our UrbanLF with widely used datasets that focus on urban scenes with pixel-wise semantic annotations, i.e., Cityscapes [25], Mapillary Vistas [52], CamVid [51] and SYNTHIA-Rand-Cityscapes [53]. The original labels of each dataset are remapped to the aforementioned 14 classes for a unified comparison. As shown in Fig. 3, the statistical results of these datasets are relatively consistent: background entities like building and road occupy more pixels than foreground objects like bike, traffic sign and rider. This imbalanced class distribution is typical of urban scenes. The differences come from the characteristics of the scenes in each dataset. UrbanLF mainly covers traffic and street scenes, resulting in more vehicle, more building and less vegetation. Cityscapes, with inner-city traffic on roads and intersections, contains the most road. Mapillary Vistas, with a wide vertical field of view, contains the most sky. SYNTHIA-Rand-Cityscapes, with lots of street blocks, contains the most sidewalk. Fig. 4 shows the proportion of images in UrbanLF that contain each class. It can be observed that the majority of classes appear in at least half of the images, and only bridge, rider and sky appear in less than 20% of them.
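The two statistics discussed above, the share of pixels per class and the proportion of images in which a class appears, can be computed from label maps as in the sketch below. The integer class IDs follow the order in which the 14 classes are listed in this paper, which is an assumption about the encoding; the two tiny label maps are made-up toy data.

```python
from collections import Counter

# Class order as listed in the text; the integer encoding is an assumption.
CLASSES = ["bike", "building", "fence", "others", "person", "pole", "road",
           "sidewalk", "traffic sign", "vegetation", "vehicle", "bridge",
           "rider", "sky"]


def class_statistics(label_maps, num_classes=len(CLASSES)):
    """Return (pixel share per class, share of images containing each class)."""
    pixel_counts = Counter()
    image_counts = Counter()
    for label_map in label_maps:
        present = set()
        for row in label_map:
            for class_id in row:
                pixel_counts[class_id] += 1
                present.add(class_id)
        for class_id in present:
            image_counts[class_id] += 1

    total_pixels = sum(pixel_counts.values())
    pixel_share = [pixel_counts[c] / total_pixels for c in range(num_classes)]
    image_share = [image_counts[c] / len(label_maps) for c in range(num_classes)]
    return pixel_share, image_share


# Two toy 2 x 3 label maps (1 = building, 6 = road, 10 = vehicle, 13 = sky).
toy_maps = [
    [[1, 1, 6], [6, 6, 10]],
    [[1, 13, 13], [6, 6, 6]],
]
pixel_share, image_share = class_statistics(toy_maps)
```

In the toy data, road covers half of all pixels and appears in both images, while sky appears in only one of the two images.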
2) Scene Complexity: We measure scene complexity by the number of semantic classes per image. As shown in Fig. 5, UrbanLF has a high diversity of scene complexity: the number of semantic classes per image covers a wide range of [1, 14], compared with at most 3 classes in the only published LF dataset for semantic segmentation [42]. Moreover, nearly half of the images in UrbanLF contain at least 8 classes, meaning that hard samples occupy a considerable share of the dataset.
3) Shooting Angle: The shooting angle is one of the key factors that strongly influence the appearance of an image. A low-angle shot mainly takes the sky as background and creates a sense of depth. An eye-level shot is the standard shooting angle and matches human viewing habits. A high-angle shot captures the object from above and makes it look flat. Changes of shooting angle within a dataset may present new challenges. With this in mind, our UrbanLF covers various shooting angles to achieve a comprehensive visual effect, consisting of 89, 767 and 218 images at low, eye-level and high shooting angles respectively. Fig. 6 shows some representative samples.

IV. BENCHMARKS
In this section, we first introduce the algorithm benchmarking for semantic segmentation, including representative baselines, experimental setup and result analysis. With the rich resources provided by UrbanLF, our benchmark extends to super-resolution and depth estimation. The benchmark website will be available online after publication. We also apply the proposed dataset to other tasks like LF segmentation and LFSP; please refer to Appendix C of the supplemental material.
A. Semantic Segmentation
1) Representative Baselines: We evaluate 12 state-of-the-art methods on UrbanLF, including 4 RGB-based methods: PSPNet [5], DeepLabv3+ [7], OCR [67], SETR [66]; 4 video-based methods: Accel [10], TDNet [11], DAVSS [68], TMANet [69]; and 4 RGB-D-based methods: ACNet [13], MTI-Net [72], SA-Gate [14], ESANet [70]. They cover most of the representative methods and offer open source code. Note that we do not evaluate point-cloud-based methods due to the limitation of data content. We also design two new LF-based methods to demonstrate the benefits of LF for semantic segmentation. They build on PSPNet and OCR respectively and add a spatial branch along with a feature fusion module to the original architecture. Different from [42], we explore the possibility of using only part of the sub-aperture images as input to reduce memory consumption. As illustrated in Fig. 7, our baselines apply the encoder-decoder structure. There are two independent branches in the encoder. One is the RGB branch, which extracts color features from the central view image. The other is the spatial branch, which extracts spatial features from image stacks in four directions: horizontal, vertical, π/4 and 3π/4. PSPNet-LF adopts ResNet as the backbone and OCR-LF adopts HRNet as the backbone. We apply a channel attention operation to further refine the two features and feed their element-wise sum to the decoder, which converts the fused feature into the final segmentation result.
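To illustrate how image stacks in four directions use only part of the 81 sub-aperture images, the sketch below lists the angular indices of each stack on a 9 × 9 grid. It assumes that every stack passes through the central view, which the text does not state explicitly; the function name, the (row, column) indexing and the assignment of the labels π/4 and 3π/4 to the two diagonals are likewise assumptions.

```python
def directional_view_indices(angular_size=9):
    """Angular (row, col) indices of four view stacks through the central view.

    Assumption: each stack is the full line of views crossing the grid centre.
    Which diagonal is called pi/4 or 3*pi/4 depends on the axis convention.
    """
    n = angular_size
    c = n // 2
    return {
        "horizontal": [(c, col) for col in range(n)],         # centre row
        "vertical": [(row, c) for row in range(n)],           # centre column
        "diagonal_pi_4": [(n - 1 - i, i) for i in range(n)],  # bottom-left to top-right
        "diagonal_3pi_4": [(i, i) for i in range(n)],         # top-left to bottom-right
    }


stacks = directional_view_indices(9)
used_views = set()
for indices in stacks.values():
    used_views.update(indices)
```

Under these assumptions the four stacks share only the central view, so 4 × 9 − 3 = 33 of the 81 views are loaded, which is consistent with the stated goal of reducing memory consumption.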
2) Experimental Setup: We conduct three experiments on UrbanLF in total. The first experiment is performed on the real-world part of the proposed dataset. In the second experiment, the model is additionally trained with synthetic samples. The aim is to show that synthetic data help to improve segmentation results on real-world data. To achieve this, we crop the synthetic images to the same resolution as the real-world images, then build batches with images from the two domains. Since the real-world data do not provide depth information, we conduct the third experiment on the synthetic part alone to extend the evaluation scope to RGB-D-based methods. The details of the experiments are shown in Table III.
We follow the original experimental settings of each method. For RGB-based methods, we only use the central view sub-aperture image and the corresponding annotation. For RGB-D-based methods, we additionally use the depth map. For video-based methods, we follow the order in [84], which scans the sub-aperture images horizontally from left to right starting from the top-left view, to create the pseudo video. For our two baselines, an SGD optimizer with initial learning rate 0.01, momentum 0.9 and weight decay 0.0005 is used to train the network. We employ a poly learning rate policy where the learning rate is multiplied by (1 − iter/max_iter)^0.9. As for data augmentation, we apply random scaling, flipping and cropping to both the central view image and the image stacks. The comparison is done with pixel accuracy (Acc), mean pixel accuracy (mAcc) and mean intersection-over-union (mIoU) at the full resolution of the central view. We report both single-scale and multi-scale testing. The latter uses horizontal flipping and multi-scale scaling with factors (0.75, 1.0, 1.25, 1.5). We also report average inference time in the third experiment to evaluate speed. For a fair and consistent comparison, all speed measurements are conducted with the single-scale testing strategy and a batch size of one on an NVIDIA RTX 2080Ti. For video-based methods, we report the inference time per frame of the pseudo video.
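The three accuracy metrics can all be derived from a confusion matrix, as the following sketch shows. It uses the common definitions of Acc, mAcc and mIoU; skipping classes that never occur in the ground truth is an assumption of this sketch, and the toy predictions are invented for illustration.

```python
def confusion_matrix(pred, target, num_classes):
    """cm[t][p] counts pixels of ground-truth class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return (Acc, mAcc, mIoU) computed from a confusion matrix."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    acc = sum(cm[i][i] for i in range(n)) / total

    class_acc, class_iou = [], []
    for i in range(n):
        gt_pixels = sum(cm[i])
        pred_pixels = sum(cm[j][i] for j in range(n))
        if gt_pixels == 0:
            continue  # assumption: ignore classes absent from the ground truth
        true_pos = cm[i][i]
        class_acc.append(true_pos / gt_pixels)
        class_iou.append(true_pos / (gt_pixels + pred_pixels - true_pos))

    return acc, sum(class_acc) / len(class_acc), sum(class_iou) / len(class_iou)


# Flattened toy label maps with three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
acc, macc, miou = segmentation_metrics(confusion_matrix(pred, target, 3))
```

In this toy case four of six pixels are correct, so Acc is 2/3; the per-class IoUs are 1/3, 2/3 and 1/2, giving an mIoU of 0.5, which shows how mIoU penalizes false positives that Acc alone does not expose.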
3) Result Analysis: The quantitative results of the first two experiments are presented in Table IV. Our modification to OCR and PSPNet, which additionally uses image stacks as input, is particularly effective. OCR-LF achieves the highest scores on almost every metric. PSPNet-LF shows remarkable performance with multi-scale testing. They are followed by OCR and SETR. All methods improve on Acc, mAcc and mIoU when exploiting the extra synthetic images for training. The specific increments are highlighted in bold. Table V presents the results of the third experiment. In terms of accuracy, OCR-LF obtains the highest scores on mAcc and mIoU. PSPNet-LF also improves on PSPNet. When depth data are available, ACNet achieves superior performance on Acc and SA-Gate obtains the second highest scores on every metric. As for speed, DAVSS achieves the shortest inference time by reusing and warping keyframe features. ESANet, with an efficient ResNet-34-based encoder, is the second fastest. PSPNet-LF and OCR-LF have the longest inference time because they use many sub-aperture images, taking about twice as long as PSPNet and OCR respectively. On the whole, LF-based methods can obtain comparable results by leveraging the implicit geometry information in the sub-aperture images. However, they achieve state-of-the-art performance at the cost of low inference speed. It is worth exploring how to further reduce memory usage while retaining high accuracy. Because they effectively use the extra depth information, RGB-D-based methods generally outperform other methods, which is consistent across different datasets. Furthermore, adding synthetic data and multi-scale testing help to boost the performance.
The qualitative results of experiment I and experiment III are shown in Fig. 8 and Fig. 9. From Fig. 8, it is observed that our proposed baselines improve the case of inaccurate  Fig. 9 provides the results on the synthetic samples. Benefiting from the depth cue, the RGB-D-based methods distinguish wheels and clothes from road with similar colors at high accuracy, achieving better performance than other methods. Our baselines get similar visual results through utilizing the implicit depth information in LF. It's worth noting that all methods fail to recover the spokes of the bike wheel and the exact boundary of arms and fingers, leaving a margin for future research.
B. Super-Resolution
1) Representative Baselines: We select 3 representative LF spatial super-resolution (LFSSR) methods, including LF-ATO [85], LF-InterNet [86] and LF-DFnet [87]. LF-ATO applies an all-to-one architecture and appends structural consistency regularization to preserve the parallax relationship. LF-InterNet combines the separately extracted spatial and angular features through repetitive interactions. LF-DFnet incorporates and encodes the angular information through deformable convolution. They are all deep learning methods and have proven effective on many LF datasets.
2) Experimental Setup: Following the general setting, we train the models with both real-world and synthetic samples from UrbanLF, then validate and test them on the two parts independently. Considering that sharing the same test set with other tasks would expose the ground truth, we additionally collect 80 real-world and 30 synthetic samples as new test data. Bicubic interpolation with factors of 2 and 4 is applied to generate low-resolution images at different scales. All these methods have open source code and we follow the original settings. The comparison is done with peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) averaged over all sub-aperture images.
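As a reference for the evaluation protocol, the sketch below computes PSNR for one view and averages it over a set of sub-aperture images. It assumes single-channel images with intensities in [0, 1] stored as nested lists; the color space and value range used in the actual evaluation are not specified here, and SSIM is omitted for brevity.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    squared_error = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for r, e in zip(ref_row, est_row):
            squared_error += (r - e) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def mean_psnr_over_views(reference_views, estimated_views, peak=1.0):
    """Average the per-view PSNR over all sub-aperture images."""
    scores = [psnr(r, e, peak) for r, e in zip(reference_views, estimated_views)]
    return sum(scores) / len(scores)


# Two toy 2 x 2 views: constant errors of 0.1 and 0.01 respectively.
reference_views = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
estimated_views = [[[0.6, 0.6], [0.6, 0.6]], [[0.51, 0.51], [0.51, 0.51]]]
average_psnr = mean_psnr_over_views(reference_views, estimated_views)
```

A constant error of 0.1 corresponds to about 20 dB and an error of 0.01 to about 40 dB, so the toy average is about 30 dB.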
[Figure: Qualitative results on UrbanLF-Syn for ×2 LFSSR. The red rectangle is zoomed for better viewing.]
[Figure: Qualitative results on UrbanLF-Real for ×4 LFSSR. The red rectangle is zoomed for better viewing.]
leading part in the development of this field. SPO finds the lines indicating depth information from EPIs. EPINet uses the FCN framework to exploit the characteristics of epipolar geometry. LFattNet introduces a view selection module to infer the contribution of each view by generating an attention map.
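To make the notion of an epipolar plane image (EPI) concrete, the sketch below slices a horizontal EPI out of a 4D light field array and builds a toy scene in which the EPI line slope is set by a constant disparity. It is only an illustration under our own assumptions (the (U, V, H, W) layout, the function names, integer disparity and the sign convention are hypothetical) and does not reproduce SPO, EPINet or LFattNet.

```python
import numpy as np

def horizontal_epi(lf, u, y):
    """Slice a horizontal EPI from a light field of shape (U, V, H, W).

    Fixing the angular row u and the spatial row y leaves a 2D image of
    shape (V, W); a scene point traces a line whose slope encodes disparity.
    """
    lf = np.asarray(lf)
    return lf[u, :, y, :]

def synthetic_plane_lf(texture, n_views, disparity, height=4):
    """Toy light field of a fronto-parallel plane with constant integer
    disparity. All image rows share the same 1D texture (illustration only)."""
    texture = np.asarray(texture)
    width = texture.shape[0]
    center = n_views // 2
    lf = np.empty((n_views, n_views, height, width), dtype=texture.dtype)
    for v in range(n_views):
        # Each step away from the central view shifts the texture by
        # `disparity` pixels (sign convention is an assumption).
        shifted = np.roll(texture, -disparity * (v - center))
        lf[:, v, :, :] = shifted  # broadcast over u and over image rows
    return lf
```

In the resulting EPI, each row is the central row shifted by a fixed amount per view, so the texture forms straight lines whose slope is determined by the disparity.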
2) Experimental Setup: The experiment is performed only on the synthetic part of UrbanLF. We exclude all samples that contain sky because the true depth of this class cannot be accurately measured. We also create a new test set to avoid depth data leakage caused by data sharing among tasks. After redistributing the data, there are 170 samples for training, 30 for validation and 30 for testing, each with a corresponding disparity map. We adopt the settings of the original publications, and the disparity label range is set to 64 for SPO. For evaluation, we estimate only the disparity of the central view and use the mean square disparity error (MSE) and the bad pixel ratio (BadPix) with three thresholds (0.01, 0.03 and 0.07 pixels). For these metrics, smaller values indicate better performance. Since EPINet applies convolutional layers without zero-padding, we crop 15 bordering pixels for a fair comparison.
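The two metrics and the border crop described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation script: the function name, the dictionary keys and the choice to report BadPix as a fraction in [0, 1] without any scaling are our assumptions.

```python
import numpy as np

def disparity_metrics(pred, gt, thresholds=(0.01, 0.03, 0.07), border=15):
    """MSE and BadPix for a central-view disparity map of shape (H, W).

    `border` pixels are removed on every side before scoring, mirroring the
    15-pixel crop applied for a fair comparison with unpadded networks.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if border > 0:
        pred = pred[border:-border, border:-border]
        gt = gt[border:-border, border:-border]
    abs_err = np.abs(pred - gt)
    metrics = {"mse": float(np.mean(abs_err ** 2))}
    for t in thresholds:
        # Fraction of pixels whose absolute disparity error exceeds t.
        metrics[f"badpix_{t}"] = float(np.mean(abs_err > t))
    return metrics
```

Because BadPix only counts how many pixels exceed a threshold while MSE is dominated by large errors, a method can rank well on one metric and poorly on the other.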
3) Result Analysis: Table VII shows the quantitative results. LFattNet achieves the best MSE and SPO achieves the best BadPix. However, there is still much room for improvement. The BadPix scores are generally high, indicating that the predictions for most pixels are not accurate enough. Fig. 12 shows the qualitative results. All methods exhibit high errors caused by shadows. SPO has difficulty recovering weakly textured areas, so it fails to make predictions for roads, bridges and other areas with similar colors, resulting in high MSE scores. EPINet and LFattNet struggle to reconstruct fine structures such as the outlines of trees and the thin gaps between leaves. Their performance also deteriorates to varying degrees on car surfaces with specular highlights. From these results, we conclude that our meticulously designed urban scenes include various combinations of open challenges that can further stimulate advanced research in depth estimation.

V. DISCUSSION
LF semantic segmentation is a challenging and meaningful topic. However, due to the lack of large-scale datasets, it has not yet been well explored. The key to constructing such a dataset is to ensure both the quantity and the quality of the data, and UrbanLF fills this gap.
Considering the characteristics of the LF geometry, our baselines encode sub-aperture image stacks to learn the angular and spatial information for semantic segmentation. Since sub-aperture images share information and complement one another, our baselines address the problem of inaccurate prediction in occluded regions of the central view. The implicit depth information is also useful for distinguishing different objects with similar colors in RGB space. The results on UrbanLF are better than those of RGB and video-based methods and are comparable to those of RGB-D-based methods, demonstrating that LF does benefit this topic.
Although applying LF to semantic segmentation is the main contribution of UrbanLF, the dataset is applicable to other fields of research as well. The complex urban scenes present challenges for super-resolution and depth estimation. In future work, we plan to introduce additional data types, such as intrinsic layers, to make UrbanLF suitable for more tasks.

VI. CONCLUSION
In this paper, we introduce a brand new LF dataset called UrbanLF, including 824 real-world and 250 synthetic urban scene samples with ground truth pixel-wise annotations. By evaluating several state-of-the-art methods on three tasks, we show that the proposed dataset supports detailed comparisons among different methods. Furthermore, we design two baselines specifically for LF semantic segmentation and obtain outstanding performance. We also find that synthetic samples can supplement real-world samples, mitigating the limited availability of data caused by cumbersome and error-prone manual annotation. We hope that UrbanLF, as the largest and most diverse LF dataset for semantic segmentation, attracts more researchers to related fields.