A General Framework for Depth Compression and Multi-Sensor Fusion in Asymmetric View-Plus-Depth 3D Representation

We present a general framework which can handle different processing stages of the three-dimensional (3D) scene representation referred to as “view-plus-depth” (V+Z). The main component of the framework is the relation between the depth map and the super-pixel segmentation of the color image. We propose a hierarchical super-pixel segmentation which keeps consistent boundaries across hierarchical segmentation layers. Such segmentation allows for a corresponding depth segmentation, decimation, and reconstruction with varying quality and is instrumental in tasks such as depth compression and 3D data fusion. For the latter we utilize a cross-modality reconstruction filter which is adaptive to the size of the refining super-pixel segments. We propose a novel depth encoding scheme, which includes a specific arithmetic encoder and handles misalignment outliers. We demonstrate that our scheme is especially applicable for low bit-rate depth encoding and for fusing color and depth data, where the latter is noisy and has lower spatial resolution.


I. INTRODUCTION
Representation and processing of real-world three-dimensional (3D) visual scenes has been of increasing interest recently in the light of new forms of immersive visualization achieved by the advancement of 3D display technology. The geometrical information about scenery can be sensed into an intensity image-like representation referred to as a “depth map”. Each pixel of a depth map represents the distance to a particular point in 3D space as seen from a particular view perspective. Depth maps are combined with confocal captures of 2D color images to form a 3D representation, referred to as “View-plus-depth” (V+Z) [1], [2], where both images have the same size and are pixel-to-pixel aligned to augment each color pixel with its position in space. V+Z can be used for various applications, such as virtual view synthesis by Depth-Image-Based Rendering (DIBR) [3], computational photography effects of refocusing, vertigo or synthetic aperture [4], and mixed reality [5]. The format has been standardized in 3D video compression standards (3DVC) [1]. Figure 1 illustrates the color and depth modalities in a blended transparent combination (i.e. the actual color is shown in the upper left corner and depth is shown pseudo-color coded in the lower right corner). As seen in the figure, the depth modality is a piecewise-smooth function, where edges are formed by objects situated at different distances. The blended transparency reveals a certain alignment congruency between the edges of both modalities (i.e. scene objects are at a certain depth).
Depth maps of real scenes are captured and estimated by, generally, two groups of techniques, referred to as passive and active sensing. The passive “structure-from-stereo” approach estimates depth by matching similar (corresponding) pixels between two or more images captured from different perspectives. Dedicated (i.e. active) range sensors employ Time-of-Flight (ToF) principles to directly capture depth [6], [7]. In all cases, depth estimation or measurement usually comes degraded by various artifacts. For example, in passive sensing, degradation is caused by ambiguity in texture-less areas or repetitive patterns. Furthermore, depth resolution is degraded by the non-linear conversion (quantization) of matched disparities [8]. In ToF approaches, depth data is limited by the low sensor resolution, e.g. 120 × 160 [9]. It is constrained by the requirement that the photo-elements work in high-sensitivity conditions, which is ensured by increasing the sensing element area. ToF sensing elements typically have a plate size of 150 µm, compared to the size of modern color sensor elements, which is about 2 µm [7]. On the other hand, ToF sensors provide better depth resolution quality; however, they are usually non-confocally located with respect to the companion color sensors. 3D data fusion is required to mix the modalities into a confocal representation. Such a processing stage includes projection alignment, non-uniform data resampling, denoising, and depth enhancement filtering [10], [11]. Figure 2 illustrates the fusion process for a non-confocal asymmetric V+Z setup.
In this work, we focus on the problem of optimally representing V+Z data. Our inspiration is based on the fact that the depth is a piecewise-smooth function aligned to scene object edges, which opens possibilities for its sparse representation. We consider two cases. First, we consider an already aligned V+Z representation, where depth and color maps have the same resolution, and we target the smallest decimated depth map representation which would ensure a faithful full-resolution depth reconstruction. Such an approach is instrumental for depth compression and streaming in the form of auxiliary data. Second, we consider a case where the depth comes as a low-resolution, noise-degraded map and the task is to restore it to full resolution. Such a case is instrumental in non-confocal ToF/color data fusion systems.

A. DEPTH AND VIEW-PLUS-DEPTH COMPRESSION
Depth compression schemes can be roughly separated into two categories according to whether the depth maps are compressed independently from or jointly with the aligned color images [12]–[23]. Methods for direct depth map compression include decomposition techniques for effective prediction of the underlying piecewise-smooth function [12]–[15] and techniques for representing and compressing depth contours [16]–[18]. The inter-relation between the V and Z modalities has been explored in several works utilizing different cross-segmentation approaches [18]–[20]. Other works have considered block partitioning and “wedgelet” edge modeling of non-rectangular intra-block segments [2], sometimes combined with inter-component prediction [1]. Some of the tools in image/video compression standards such as JPEG/JPEG2000 [21], [22] or H.264/AVC/HEVC [23] are also effectively applicable for depth compression.

B. 3D FUSION OF ASYMMETRIC VIEW-PLUS-DEPTH DATA
The 3D data fusion problem has been considered in different research settings, aiming at aligning the edges of the two modalities while enforcing piecewise smoothness of the depth. A layered Markov Random Field (MRF) model has been proposed in [24] with the purpose of correlating a continuous smooth surface with the given samples of depth data. The MRF formalization has been further advanced in [25], [26] and [27]. In [28], the problem has been cast as anisotropic diffusion of dissipated heat, where the heat sources are the available data samples. Simultaneous surface fitting and denoising have been considered in a number of works, employing either joint-geodesic distance [29], moving least squares [30], or multi-point regression [31]. Cross-modality filters such as bilateral [32] and non-local [33], [34] have been implemented so as to utilize the high-resolution color map as a guiding modality in the depth reconstruction process. Solutions based on bilateral filtering have been proposed in [35]–[38], and solutions based on non-local filtering have been proposed in [39] and [40]. Other forms of edge-preserving guided filtering have been proposed as well [41]. A method based on total generalized variation (TGV) for optimization of an anisotropic diffusion tensor structure has been proposed in [42]. The article also provides a benchmark data set for 3D fusion resampling quality evaluation for real-case data of an asymmetric V+Z capturing setup, where depth maps are obtained by a noisy ToF sensor.

C. RELATION WITH PREVIOUS WORK
Previously, we have proposed techniques for depth resampling and 3D fusion for the case of an asymmetric non-confocal V+Z camera setup, where the depth is sensed in noisy, low-resolution conditions [43], [44], as well as techniques for near-lossless depth encoding [45], [46]. In the present work, we present a general framework which addresses both cases.
We further extend the technical stages of super-pixel (SP) segmentation, resampling, regularization, encoding, and 3D fusion. More specifically, we modify the segmentation clustering stage proposed in [44], [45] to ensure border congruency across hierarchical refinement levels and to seed the SP clusters from non-uniform data samples, so as to serve the case of projected data. Furthermore, we address the problem of possible misalignment between the V and Z modalities caused by sensing artifacts. Such misalignment produces edge outliers that concentrate a high amount of error in the global cost metrics and thus mislead the error optimization in the coding process. To this end, we propose an efficient encoding scheme for such outliers in a so-called “yield-flow” protocol. A modification of the adaptive regularized reconstruction is proposed as well.
The article is organized as follows: Section II provides some preliminaries and notation conventions along with a description of basic super-pixel clustering; Section III describes the proposed general framework; Section IV describes application realizations for depth encoding and for 3D fusion of asymmetric V+Z sensor data, utilizing the proposed multi-layer congruent super-pixel clustering mechanism; Section V provides experimental results; and the manuscript is finalized in Section VI with some concluding remarks.

II. PRELIMINARIES
A. NOTATION CONVENTIONS
Consider a color image in some three-component color space, for example CIELAB [47]. Each pixel with index $j$ is a three-component vector $V_j = [l, a, b]_j$, $j = 1, \ldots, J$. When needed, the pixel is given with its coordinates related to the camera projective system, $\mathbf{x} = (x, y)$, $\{x, y\} \in \mathbb{R}^2$ [48]. The associated depth value is denoted by $Z_j$.
When sensed by active sensors, the depth map relates to the range data $D$, which represents distances from pixels to scene points [42]. When estimated from stereo, the depth values relate to disparity values $d$ showing the shifts between corresponding pixels [8]. In many encoding applications, depth is quantized as “inverse depth” [7], $z = 1/Z$. When sensed by some active range sensor, depth maps are non-confocal to the color maps and can come with lower spatial resolution and a higher, floating-point range, e.g. $Z_l$, $l = 1, \ldots, L$, $L \ll J$, $Z \in [0, 2^{16})$. In such a case, the output V+Z representation is calculated by projective alignment and depth resampling, referred to as “3D fusion”.
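As a side illustration of the inverse-depth convention mentioned above, the following Python sketch quantizes metric depth on a uniform inverse-depth grid, which gives near geometry finer quantization than far geometry. The range limits, level count, and function names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def quantize_inverse_depth(Z, z_near=0.5, z_far=10.0, levels=256):
    """Quantize metric depth Z (meters) as 'inverse depth' z = 1/Z
    on `levels` uniform steps (assumed range [z_near, z_far])."""
    Z = np.clip(Z, z_near, z_far)
    inv = 1.0 / Z                                  # inverse depth
    inv_min, inv_max = 1.0 / z_far, 1.0 / z_near   # range of 1/Z
    q = np.round((inv - inv_min) / (inv_max - inv_min) * (levels - 1))
    return q.astype(np.uint16)

def dequantize_inverse_depth(q, z_near=0.5, z_far=10.0, levels=256):
    """Invert the quantization back to metric depth."""
    inv_min, inv_max = 1.0 / z_far, 1.0 / z_near
    inv = q / (levels - 1) * (inv_max - inv_min) + inv_min
    return 1.0 / inv
```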

B. SUPER-PIXEL CLUSTERING
Super-pixel (SP) based segmentation plays an essential role in the proposed framework. Super-pixels are segments that have a near-isotropic and compact representation with low computational overhead. A typical super-pixel behaves as a raster pixel on a low-resolution near-regular grid. Perceptually, SP areas are homogeneous in terms of color and texture. Two main approaches for generating super-pixels can be cited, namely Simple Linear Iterative Clustering (SLIC) [49], [50] and Super-pixels Extracted via Energy-Driven Sampling (SEEDS) [51]. Hereafter we adopt the SLIC approach.
An elegant feature of super-pixel segmentation is that it takes the desired number of SPs as an input parameter and that, for this number, it is reproducible in terms of the same SP areas (clusters) and indexing that follow the edge shape between color textures. For that reason, SP segmentation is instrumental for finding object shapes in a scene, see the pear example in Figure 3 (b). The SP clustering is initialized by defining $K$ seed locations of color points $Q_k$, $k = 1, \ldots, K$. These points are chosen to be equidistantly sampled in image coordinates $\mathbf{x}_k = \{x, y\}_k$ for roughly calculated sampling shifts [50]
$$s_{\{R,C\}} \approx \sqrt{HW/K},$$
where $H$ and $W$ are the pixel dimensions of the sensor (c.f. blue dots in Figure 3 (a)). Pixels $V_j$ are clustered to SP segments $C_k$, where each segment $C_k$ spans $N_k$ pixels, as follows.
For each image pixel $V_j$, a neighborhood $S_j$ (i.e. a seeding support region) is associated. The seeding support region spans a rectangular area of dimensions $2s_{\{R,C\}}$ around $V_j$. The seeding point within $S_j$ most similar to $V_j$ is found by applying, e.g., a bilateral cost
$$\kappa = \arg\min_k \left( \lambda_\rho \,\|\mathbf{x}_j - \mathbf{x}_k\| + \lambda_C \,\|V_j - Q_k\| \right), \qquad (3)$$
which assigns $V_j$ to segment $C_\kappa$, where $\lambda_\rho, \lambda_C$ are weighting constants. The clustering is iterated by updating the seeding points $Q_k$ with the arithmetic mean of the pixels assigned to the associated cluster $C_k$,
$$Q_k = \frac{1}{N_k} \sum_{V_j \in C_k} V_j.$$
A polishing step that enforces connectivity of the points of each segment is applied at the end [50].
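To make the clustering concrete, the following minimal Python sketch implements the SLIC-style assignment and update steps described above. The weighting constants `lam_rho` and `lam_c`, the iteration count, and the function names are illustrative assumptions rather than the reference implementation of [50]; the final connectivity polishing step is omitted.

```python
import numpy as np

def slic_superpixels(lab, K, lam_rho=0.5, lam_c=1.0, n_iter=10):
    """Simplified SLIC-style clustering of a CIELAB image `lab` (H x W x 3)
    into roughly K super-pixels, using the bilateral cost (3)."""
    H, W, _ = lab.shape
    lab = lab.astype(float)
    s = max(1, int(np.sqrt(H * W / K)))              # sampling shift s_{R,C}
    sy, sx = np.meshgrid(np.arange(s // 2, H, s),
                         np.arange(s // 2, W, s), indexing="ij")
    seeds_xy = np.stack([sy.ravel(), sx.ravel()], axis=1).astype(float)
    seeds_lab = lab[sy.ravel(), sx.ravel()]

    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    labels = np.zeros((H, W), dtype=int)

    for _ in range(n_iter):
        dist = np.full((H, W), np.inf)
        for k, ((cy, cx), q) in enumerate(zip(seeds_xy, seeds_lab)):
            # search only inside the 2s x 2s seeding support region S_j
            y0, y1 = max(0, int(cy) - s), min(H, int(cy) + s + 1)
            x0, x1 = max(0, int(cx) - s), min(W, int(cx) + s + 1)
            d_sp = np.hypot(yy[y0:y1, x0:x1] - cy, xx[y0:y1, x0:x1] - cx)
            d_cl = np.linalg.norm(lab[y0:y1, x0:x1] - q, axis=-1)
            d = lam_rho * d_sp + lam_c * d_cl        # bilateral cost (3)
            upd = d < dist[y0:y1, x0:x1]
            dist[y0:y1, x0:x1][upd] = d[upd]         # slice is a view: in-place
            labels[y0:y1, x0:x1][upd] = k
        for k in range(len(seeds_xy)):               # move seeds to cluster means
            m = labels == k
            if m.any():
                seeds_xy[k] = (yy[m].mean(), xx[m].mean())
                seeds_lab[k] = lab[m].mean(axis=0)
    return labels, seeds_xy
```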

III. PROPOSED GENERAL FRAMEWORK FOR V+Z RESAMPLING AND FUSION
A. DEPTH RESAMPLING SCHEME
We propose a general depth resampling scheme (DRS) to be used as a building block in various applications. The aim is to find an optimal representation of the depth map, for either compression or depth reconstruction. The block diagram of the proposed scheme is given in Figure 4. It takes as input the color image $V$, a set of initial seeding points $Q$, and a depth map $Z$, which may or may not have the same resolution as the color image. The color image is segmented by an SP clustering operator $\mathcal{S}$ (5). A masking operator $\mathcal{M}$ fills each segment $C_k$ with a constant depth value $\bar{Z}_k$, thus generating a depth map $\bar{Z}$ with the same resolution as the color image $V$. The values $\bar{Z}_k$ are selected or calculated depending on the application. A cross-modality adaptive reconstruction filter $\mathcal{B}$ reconstructs an estimate $\hat{Z}$ of the depth map (7). Furthermore, a depth down-sampling operator $\mathcal{D}$ turns either $Z$ or $\hat{Z}$ into a low-resolution depth map $\check{Z}$. The scheme is general and can be integrated in other techniques requiring depth resampling and refinement.
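The composition of the four operators can be summarized in a short schematic. The Python fragment below sketches one DRS pass for the aligned, same-resolution case; the operator callables are stand-ins, and the median choice for the constant fill values $\bar{Z}_k$ is our assumption for illustration, since the paper leaves $\bar{Z}_k$ application-dependent.

```python
import numpy as np

def depth_resampling_scheme(V, seeds, Z, segment, reconstruct, downsample):
    """One schematic DRS pass, assuming the operators described above:
    `segment`    : SP clustering operator S -> label map from color image V
    `reconstruct`: cross-modality filter B  -> refined depth from masked depth
    `downsample` : decimation operator D    -> low-resolution depth map."""
    labels = segment(V, seeds)                       # S: super-pixel clustering
    Z_bar = np.zeros_like(Z, dtype=float)
    for k in np.unique(labels):                      # M: fill each segment C_k
        Z_bar[labels == k] = np.median(Z[labels == k])  # one plausible choice
    Z_hat = reconstruct(Z_bar, V, labels)            # B: cross-modality filtering
    Z_low = downsample(Z_hat, labels)                # D: decimated representation
    return Z_bar, Z_hat, Z_low
```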
We develop two such techniques, one related to near-lossless depth encoding and one related to asymmetric V+Z data fusion. However, we first propose a modification of the SP segmentation which better serves the targeted applications.

B. MULTI-LAYER CONGRUENT SUPER-PIXEL CLUSTERING
In order to facilitate the operations in the DRS, we propose a novel multi-layer SP clustering to serve as the operator $\mathcal{S}$ (5). It is based on the SLIC method [50] and aims at finding a segmentation that has contour congruency among different refinement levels, in the sense that a refinement level with a smaller number of segments has SP boundaries that form a subset of those of a refinement level with a higher number of segments (c.f. Figure 5 (a, b)).
In the proposed solution, the clustering for some desired number $K$ of SPs is done through several refinement stages $\rho$, starting from an initial very fine mosaic, $\rho = 0$. Assume the initial number of segments $K^0$ and the corresponding seeds $Q^0_k$ are selected in a way that only a few points define each cluster $C^0_k$ (e.g. $M^0_k = 4$). For each iterative step $\rho > 0$, the number of SPs is chosen to be smaller (e.g. halved in each iteration). The clustering process for $\rho > 0$ combines segments of the SP clustering $C^{\rho-1}$ obtained in the previous iteration.
Denote the actual seeding points at iteration $\rho$ by $G^\rho_k$. In the general case, these are at non-uniform locations $\mathbf{x}_k$, $k = 1, 2, \ldots, K^\rho$. In order to find the new seeders for iteration $\rho$, one first sets a coarser uniform grid with steps $s^\rho_{\{H,W\}} \approx \sqrt{HW/K^\rho}$ (see the blue points in Figure 3 (c)). Seeders $G^\rho_k$ closest to this grid are considered as attractors, as they are meant to attract other super-pixel centroids to form the new segments $C^\rho_k$ (see the red points in Figure 3 (c)). This is done by calculating the bilateral distances (3). Each new seeding point is placed at the mass center of its combined segment, and its intensity value is calculated as the arithmetic mean of the segment points. Essentially, the operation repeats the basic SLIC, but working at each layer with super-pixels instead of pixels and maintaining non-uniform seeding positions to better describe the properties of the embedded super-pixels. It is important to mention that the resulting clustered segments $C^\rho_k$ combine pixels from the sub-segments $C^{\rho-1}$ (c.f. Figure 3 (d)), thus ensuring border congruency. The iterations end upon reaching the desired number of super-pixel segments $K^\rho$.
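One coarsening step of the multi-layer clustering can be sketched as follows. This simplified Python fragment chooses attractors as the existing centroids nearest to the coarser uniform grid and merges the fine super-pixels by a single bilateral-distance assignment pass; the iterative centroid update is omitted, and all names and weights are illustrative assumptions.

```python
import numpy as np

def coarsen_superpixels(centroids_xy, centroids_lab, sp_labels, H, W, K_next,
                        lam_rho=0.5, lam_c=1.0):
    """One refinement step rho-1 -> rho: merge fine super-pixels (given by
    their centroids) into ~K_next coarser segments whose boundaries remain
    a union of the finer ones."""
    s = max(1, int(np.sqrt(H * W / K_next)))         # coarser uniform grid step
    gy, gx = np.meshgrid(np.arange(s // 2, H, s),
                         np.arange(s // 2, W, s), indexing="ij")
    grid = np.stack([gy.ravel(), gx.ravel()], axis=1)

    # attractors: existing centroids closest to the coarse grid nodes
    d2grid = np.linalg.norm(centroids_xy[None, :, :] - grid[:, None, :], axis=2)
    attractors = np.unique(d2grid.argmin(axis=1))

    # assign every fine super-pixel to its bilaterally closest attractor (3)
    d_sp = np.linalg.norm(centroids_xy[:, None] -
                          centroids_xy[attractors][None], axis=2)
    d_cl = np.linalg.norm(centroids_lab[:, None] -
                          centroids_lab[attractors][None], axis=2)
    assign = (lam_rho * d_sp + lam_c * d_cl).argmin(axis=1)

    # relabel pixels: coarse boundaries are unions of the fine boundaries
    coarse_labels = assign[sp_labels]
    return coarse_labels, attractors
```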
The proposed modification of the SP clustering brings a few benefits. First, it leads to considerably better modeling of texture transitions (c.f. Figure 5). Second, using the mass-center locations as seeding points prevents misaligned clustering on finer mosaic scales in the consecutive iterations. The congruency of SP boundaries is of vital importance for simplifying the encoding approach and improving the speed and quality performance of the originally proposed compression methods [44], [45].

C. DEPTH RECONSTRUCTION
The operator $\mathcal{B}$ (7) is expected to exploit the relation between color and depth through a cross-modality guided reconstruction filter [32], [33]. In practice, we adopt the cross-bilateral filter as modified in [37]. Two weighting laterals are applied per pixel $V_j$ in a pixel neighborhood (e.g. a square block) $\psi_j$:
$$w_{ij} = \lambda_s\!\left(\|\mathbf{x}_i - \mathbf{x}_j\|\right) \lambda_c\!\left(\|V_i - V_j\|\right), \quad i \in \psi_j,$$
where $\lambda_s, \lambda_c$ are parametrized Gaussian smoothing kernels [32] for spatial proximity and intensity similarity, respectively. Then a bilateral weighted average
$$\hat{Z}_j = \frac{\sum_{i \in \psi_j} w_{ij} \bar{Z}_i}{\sum_{i \in \psi_j} w_{ij}}$$
is applied to each depth pixel to form the reconstructed map $\hat{Z}$ (7). The neighborhood $\psi_j$ (c.f. Figure 6 (a)) is selected to be proportional to the segment size of the current refinement level $\rho$. The spatial proximity kernel $\lambda_s$ must be related to the size of the neighborhood $\psi_j$. An example of the filter performance is demonstrated by the visual outputs given in Figure 6 (b–d).
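A minimal implementation of the cross-bilateral reconstruction may look as follows. The Gaussian parameters and the wrap-around border handling (via `np.roll`) are simplifications for the sketch, not the tuned filter of [37]; `radius` stands for the neighborhood $\psi_j$, chosen proportional to the super-pixel size at the current refinement level.

```python
import numpy as np

def cross_bilateral_depth(Z_bar, V, radius, sigma_s, sigma_c):
    """Cross-bilateral sketch: the masked depth Z_bar (H x W) is smoothed
    with weights computed from the *color* image V (H x W x 3)."""
    num = np.zeros_like(Z_bar, dtype=float)
    den = np.zeros_like(Z_bar, dtype=float)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            Zs = np.roll(np.roll(Z_bar, dy, axis=0), dx, axis=1)
            Vs = np.roll(np.roll(V, dy, axis=0), dx, axis=1)
            # lambda_s: spatial proximity kernel, tied to the window size
            w_s = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
            # lambda_c: intensity similarity kernel on the color modality
            w_c = np.exp(-np.sum((V - Vs) ** 2, axis=-1) / (2 * sigma_c ** 2))
            num += w_s * w_c * Zs
            den += w_s * w_c
    return num / den    # den > 0: the center tap always contributes weight 1
```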

IV. APPLICATION CASES
A. DEPTH MAP ENCODING APPLICATION
The first application is encoding of the depth map in the V+Z representation, where the color and depth modalities are already aligned. With reference to Figure 4, this means that the input color and depth maps have the same spatial resolution. Figure 7 illustrates the proposed technique. The decimated depth map $\check{Z}$, being the output of the DRS, undergoes arithmetic encoding by a dedicated encoding operator, which outputs an encoded binary sequence $P$. The reconstructed depth map $\hat{Z}$ is compared against the original one by means of the Sum of Squared Errors (SSE) on the super-pixel level, and regions of high reconstruction error are split into finer, embedded super-pixels. Their centroids are returned to the DRS module, which updates the outputs for the next-iteration reconstructed depth map $\hat{Z}$ and its decimated version $\check{Z}$. The latter, along with the localization information for partitioned SP segments, is stored in a predictive sequence unified with $P$. The refinement process is applied iteratively subject to an encoding bit-budget, i.e. while the bit length $T(P)$ of the sequence $P$ stays within the budget.

1) DEPTH REFINEMENT BY SUPER-PIXEL PARTITIONING
The SSE is calculated for each super-pixel,
$$\varepsilon^\rho_k = \sum_{j \in C^\rho_k} \left( Z_j - \hat{Z}_j \right)^2. \qquad (14)$$
SPs with the highest errors $\varepsilon^\rho_k$ are marked for further refinement by descending to the finer scale $\rho - 1$ retained from the multi-layer clustering. The seeding points $G^{\rho-1}_k$ and the associated segments $C^{\rho-1}_k$ are fed back to the DRS.
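The following Python sketch computes the per-super-pixel SSE of eq. (14) and marks the worst segments for partitioning. The fixed `n_split` selection rule is our assumption for illustration; in the actual scheme the number of refined SPs is governed by the bit budget.

```python
import numpy as np

def mark_superpixels_for_refinement(Z, Z_hat, labels, n_split):
    """Per-super-pixel SSE, eq. (14), followed by marking the n_split
    segments with the highest error for refinement at the finer scale."""
    sse = {k: float(np.sum((Z[labels == k] - Z_hat[labels == k]) ** 2))
           for k in np.unique(labels)}
    worst = sorted(sse, key=sse.get, reverse=True)[:n_split]
    return worst, sse   # `worst` SPs are fed back to the DRS for splitting
```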

2) ENCODING SCHEMES
We encode three components: (A) the uniformly-decimated depth map $\check{Z}$ produced at iteration $\rho$ is encoded in a predictive sequence $P_Z$; (B) the depth values corresponding to partitioned SPs are encoded in a predictive sequence $P_{pt}$; and (C) the partitioned SP structure is encoded in a binary sequence $B$.
(A) The decimated depth map $\check{Z}^\rho$ at stage $\rho$ has an isotropic structure with dimensions $s^\rho_{\{H,W\}}$ and values $\check{Z}^\rho_k$ corresponding to each segment $C^\rho_k$, as illustrated in Figure 8. The segmentation structure comes from the color modality and can be reproduced at the decoder, thus it does not need to be encoded. The map itself is encoded in a predictive sequence $P_Z$, similarly to the JPEG-LS standard [21], as described in detail in our previous work [45].
(B) The depth values of the partitioned SPs form the sequence $P_{pt}$, which is subsequently encoded by an adaptive multi-alphabet range coder [45], [52]. (C) The partitioning is encoded in a binary sequence $B$, formed by two sub-sequences $\{B_{im}, B_{pt}\}$, which encode the partitioning of the isotropic map $\check{Z}^\rho$ and the partitioning tree structure, respectively, as shown in Figure 9. Partitioned SPs are indexed by 1 (split) and non-refined SPs are indexed by 0 (no split). The map $B_{im}$ encodes the partitioned SPs for the initial stage $\rho$, and its shape and indexing follow those of the isotropic map $\check{Z}^\rho$. The binary map is scanned column-wise to initialize the first index tree level in $B_{pt}$. The next level is for partitioned SPs belonging to the consecutive refinement stage $\rho - 1$. Those are encoded in a concatenated sequence in $B_{pt}$. Note that there is no need to store information about the number of children of refined SPs, as this is automatically found when the SP clustering for a refinement level is run. For the last refinement level $\rho = 0$, there is an exception: SPs which are marked for partitioning are not indexed further, since the segment is entirely encoded from the original depth values. The sequence $B_{im}$ is encoded separately from the rest of the tree by a Context-Adaptive Binary Range Coder (CABRC) [53]. For the context modeling, it is assumed that the “split/no-split” state of the current SP depends on the “split/no-split” states of its neighbors. Under this assumption, the value of a binary element $B_{im,k}$ is assigned to one of four possible binary sub-contexts indexed by the sum of the neighboring elements.
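One plausible reading of the four-context modeling for $B_{im}$ is sketched below: the context index is the sum of three previously coded neighbor flags (left, top, top-left), giving values 0–3. The exact neighbor set is our assumption, as the text only states that the index is a sum over neighbors.

```python
import numpy as np

def split_flag_contexts(B_im):
    """Context-index sketch for the binary split map B_im (H x W of {0,1}):
    each flag is coded under one of four contexts given by the number of
    already-coded neighbors (left, top, top-left) that were split."""
    H, W = B_im.shape
    ctx = np.zeros((H, W), dtype=int)
    for y in range(H):
        for x in range(W):
            nb = [B_im[y, x - 1] if x > 0 else 0,                    # left
                  B_im[y - 1, x] if y > 0 else 0,                    # top
                  B_im[y - 1, x - 1] if x > 0 and y > 0 else 0]      # top-left
            ctx[y, x] = sum(nb)          # context index in {0, 1, 2, 3}
    return ctx   # fed to the binary range coder, one probability model per index
```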

3) ENCODING EDGE OUTLIERS BY A “YIELD-FLOW” PROTOCOL
The efficiency of the proposed depth encoding approach relies on ideal consistency between the color and depth modalities. However, in real cases of V+Z capture, depth maps can come with various artifacts caused by stereo-correspondence errors and by low-resolution non-confocal depth sensors, along with projection misalignment and resampling ambiguities introduced by measurement errors [43], [44]. Examples of regions with such artifacts are given for a frame of the “Ballet” data set in Figure 10 (c, d). While artifacts of the above-mentioned types affect a relatively small number of pixels, the encoding residual error will be concentrated precisely around them (c.f. Figure 10 (e)). We denote such problematic areas as edge-consistency outliers (ECO). The SPs which contain ECO will indicate high SSE values (14); the refinement partitioning will then concentrate on those SPs attempting better quality, which might continue until the last refinement stage is reached and pixels are encoded individually. Apparently, the refinement scheme applied in such a manner will be inefficient and could fail to produce an optimal encoding output for the given bit-budget.

To tackle the problem, we propose an optional ECO binary encoding scheme called the “yield-flow” (YF) protocol. It indicates an encounter with a possible ECO if the partitioned SP children have at least two members with the same depth value as the parent SP. In such a case, the encoding system activates the YF process, which consists of a sequence of {“YES” – 1, “NO” – 0} flags (c.f. Figure 10 (a, b)). The first bit of YF tells the decoder whether the SP is to be encoded for ECO. If YES, then the YF follows the internal pixel boundary of the SP counter-clockwise (c.f. Figure 10 (a, b)), starting from the lowest-left boundary node of the neighboring SPs. A positive bit value indicates that the corresponding boundary segment has to be processed. The positive bits in YF trigger a “yield” procedure: the depth value of the processed pixel $Z_j$ is replaced with the value of the nearest horizontal or vertical neighboring pixel that belongs to another SP cluster of the same refinement stage. In case of multiple choices, the decision is made for the neighboring pixel that forms the smallest angle $\omega$ between the neighborhood direction and the direction to the SP centroid $G^\rho_k$.

In our realization, the yield process meets the following requirements. First, the SP should belong to a refinement stage higher than a certain threshold, $\rho > t_\rho$ (e.g. $t_\rho = 5$). Second, pixels considered for the yield process are those that have no error compared to the ground truth (GT) for the newly assigned depth value. Since the yield-processed pixels match the GT, they are excluded from the SP clusters of all higher refinement stages and should also be skipped by the regularization filtering step. The performance of the proposed edge outlier encoding is exemplified in Figure 10 (f), where it is shown that most ECO are suppressed for a significant quality metric gain (c.f. Figure 10 (c–f)).
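The two core decisions of the yield-flow protocol can be sketched as follows, under simplifying assumptions: the counter-clockwise boundary walk and the angle-based tie-break are omitted, and all names are illustrative.

```python
import numpy as np

def yield_candidate(parent_value, child_values, min_matches=2):
    """YF trigger sketch: an ECO is suspected when at least `min_matches`
    children of a partitioned SP carry the same depth as the parent SP."""
    return int(np.sum(np.isclose(child_values, parent_value)) >= min_matches)

def yield_pixel(Z, labels, y, x, Z_gt):
    """One 'yield' step sketch: replace the depth of boundary pixel (y, x)
    with a horizontally/vertically nearest value from a *different* SP,
    accepting it only when the new value matches the ground truth."""
    for dy, dx in ((0, -1), (0, 1), (-1, 0), (1, 0)):   # 4-neighborhood
        ny, nx = y + dy, x + dx
        if (0 <= ny < Z.shape[0] and 0 <= nx < Z.shape[1]
                and labels[ny, nx] != labels[y, x]
                and Z[ny, nx] == Z_gt[y, x]):           # zero-residual test
            Z[y, x] = Z[ny, nx]
            return True                                  # emit a '1' flag in YF
    return False                                         # emit a '0' flag
```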

B. FUSION FOR ASYMMETRIC V+Z CAMERA SETUP
The general DRS can be applied for 3D fusion of asymmetric V+Z data, provided some pre-processing is performed before feeding the DRS module, as shown in Figure 11. In the above-mentioned setting, the two modalities are not aligned, as they come from two non-confocal sensors, and the dedicated depth sensor is usually of lower resolution. The depth pixels $Z_k$, $k = 1, \ldots, K$, $K < J$, should undergo a re-projection step $Z^p_k = \mathcal{P}(Z_k; f)$ to locate them onto the image grid of the color sensor, where $Z^p_k$ are the re-projected samples and $f$ is a set of camera parameters related to some multi-view geometry model [48], [54]. At the initial SP clustering stage, there are strictly $K$ seeding points $Q^p_k$ coinciding with the projected locations $\mathbf{x}^p_k$ of $Z_k$. The same association is done for the output samples $Z_k$. The projected locations $\mathbf{x}^p_k$ appear non-uniformly located with respect to the color map grid. Therefore, $Q^p_k$ are found by a standard interpolation $L$ (e.g. by bi-cubic splines [55]). The size of the seeding support region $S_j$ for the SP clustering operator $\mathcal{S}$ (5) is fixed by the scale difference between the dimensions of the two sensors, $\{W, H\}_V / \{W, H\}_Z$. Furthermore, a Richardson-Lucy-type scheme [56] is applied iteratively,
$$\hat{Z}^{i+1} = \hat{Z}^i + \lambda_L\, \mathcal{B}\!\left(E^i\right), \quad E^i_k = Z^p_k - \hat{Z}^i\!\left(\mathbf{x}^p_k\right),$$
where $\lambda_L$ is a regularization constant. For each iteration $i$, the error residual $E^i_k$ is used as a feedback input to the DRS, and the reconstructed result from $\mathcal{B}$ is accumulated onto the initial reconstruction $\hat{Z}^0$. Usually, very few iterations (e.g. $i \sim 3$) are enough to converge to an optimal output $\hat{Z}$.
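The iterative refinement loop admits a compact sketch. Here `drs_reconstruct` is a stand-in for one DRS pass (clustering, masking, and cross-modality filtering) seeded at the projected sample locations; its signature and the default `lam_L` are assumptions for illustration.

```python
import numpy as np

def fuse_asymmetric_vz(Z_p, sample_idx, V, drs_reconstruct, lam_L=0.5, n_iter=3):
    """Richardson-Lucy-style back-projection sketch for asymmetric V+Z fusion:
    `Z_p`        : re-projected ToF depth samples (length K)
    `sample_idx` : their (rows, cols) positions on the color grid
    `drs_reconstruct(values, sample_idx, V)` -> full-resolution map."""
    rows, cols = sample_idx
    Z_hat = drs_reconstruct(Z_p, sample_idx, V)      # initial reconstruction
    for _ in range(n_iter):                          # very few iterations suffice
        E = Z_p - Z_hat[rows, cols]                  # residual at sample sites
        Z_hat = Z_hat + lam_L * drs_reconstruct(E, sample_idx, V)  # accumulate
    return Z_hat
```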

V. EXPERIMENTAL RESULTS
We present experiments demonstrating the utilization of the proposed framework in two cases: depth encoding and fusion of asymmetric V+Z data. To quantify the performance, we use the standard mean absolute error (MAE), root mean squared error (RMSE), and the related Peak Signal-to-Noise Ratio (PSNR) in [dB] between the processed and ground-truth depth maps. For datasets where geometry is represented by disparity maps, we also use the percentage of bad pixels (BAD), which shows the percentage of disparities that differ from the ground-truth disparity map by more than one pixel [8]. The following datasets are used in the experiments: Microsoft's “Breakdancer” and “Ballet” [57]; Middlebury's “Aloe”, “Art”, “Baby”, “Dolls”, “Teddy”, “Cones”, and “Bowling” [8]; and ToF data. The latter contains scenes captured by an asymmetric non-confocal V+Z stereo-camera setup, where the depth sensor is a noisy Time-of-Flight (ToF) camera with 120 × 160 pixels spatial resolution, while the color camera has a resolution of 610 × 810 pixels.
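For reproducibility, the four reported metrics can be computed as below. The peak value and the BAD threshold follow common conventions (255 for 8-bit depth, 1 pixel for disparity) and are assumptions where the text does not state them explicitly.

```python
import numpy as np

def depth_metrics(Z_hat, Z_gt, peak=255.0, bad_thresh=1.0):
    """MAE, RMSE, PSNR [dB] and BAD [%] between a processed map and the
    ground truth. BAD is meaningful when the maps hold disparities."""
    diff = Z_hat.astype(float) - Z_gt.astype(float)
    mse = max(float(np.mean(diff ** 2)), 1e-12)      # guard against log(0)
    mae = float(np.mean(np.abs(diff)))
    rmse = float(np.sqrt(mse))
    psnr = 10.0 * np.log10(peak ** 2 / mse)
    bad = 100.0 * float(np.mean(np.abs(diff) > bad_thresh))
    return mae, rmse, psnr, bad
```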

A. DEPTH COMPRESSION FOR VIEW-PLUS-DEPTH DATA
The quality metrics are calculated versus the encoding (compression) rate in bits-per-pixel (bpp) measured on the encoded isotropic map $P_{im}$. The first experiment characterizes the gain obtained by applying the reconstruction filter $\mathcal{B}$ (7). The results are shown in Figure 12 (a). The quality is varied by varying the SP segmentation granularity; each point on the plot indicates the PSNR between either the masked map $\bar{Z}$ or the reconstructed (regularized) $\hat{Z}$ and the non-compressed depth. By increasing the number of SP elements $K$, one gets higher quality at the price of a higher bit rate. No further optimization is applied. Still, the proposed technique reaches a PSNR of about 40 dB for rates ≤ 0.1 bpp, with an additional improvement of at least 2 dB when the reconstruction filter is applied.
In the second experiment, the proposed technique is compared against the works denoted as: “Platelet” [12], “Milani” [18], “P80” [15], “GSOs+CCLV”, “GSOm+CERV” [16], “H.264” [23], and “JPEG2000” [22]. Since “GSOs+CCLV” and “GSOm+CERV” perform optimally in different bitrate zones, a single plot holding the better metric value is given for those. The results are given in the plots of Figure 12 (b–f) for different test data. The proposed method is clearly highly competitive and performs best in the very low bitrate region (e.g. ≤ 0.05 bpp), where the quality of the decompressed output is above 45 dB. This is considered near-lossless for most of the rendering applications utilizing depth maps [59]. A depiction of the decoded and reconstructed depth map is shown for the “Art” data set at a bit budget ≈ 0.008 bpp in Figure 12 (h) and for the “Breakdancer” data set at a bit budget ≈ 0.0014 bpp in Figure 12 (i). When YF is applied for test sets with problematic zones (e.g. “Ballet”), the results are highly competitive over the entire range. In another test, we calculate the BAD metric as plotted in Figure 12 (g). The curves show that, for a wide range of tested stereo-matching datasets of Middlebury [8], the proposed technique robustly reduces the BAD percentage to about 3–5% for bitrates below 0.2 bpp [60]. Such performance is on par with the performance of the highest-ranked stereo-matching estimation algorithms [8]. The performance of our method is slightly inferior for datasets of low depth contrast and low resolution (e.g. “Bowling”, c.f. Figure 12 (f)).

B. 3D RESAMPLING AND FUSION OF ASYMMETRIC VIEW-PLUS-DEPTH DATA
For this experiment, we use the datasets from [42], which are commonly adopted benchmarking datasets. The datasets provide projected irregular data samples $Z^p_k$ ready to be applied for 3D fusion and resampling. The GT depth maps have been captured by another high-end, high-definition depth sensor. The scenes are referred to as “Shark”, “Books” (c.f. Figures 13, 14), and “Devil”.
The performance of the proposed framework has been compared to that of a number of state-of-the-art 3D fusion methods: “BF” [36], “AD” [28], “GF” [41], “Hyp” [37], “Yang” [38], “JGF” [29], “IMLS” [30], “CLMF” [31], and “TGV” [42]. The code scripts for all the referenced methods have been obtained online and run with tuned or default settings, where the authors of the particular approach provide code scripts for the evaluation of the same benchmarking test. The calculated MAE and RMSE are given in Table 1; visual outputs of some of the methods and scenes are given in Figure 13, along with a depiction of the absolute difference maps (maps of residuals) with respect to the GT data. The visual outputs for the zoomed region (shown with a black edge in Figures 13 (a) and 14 (a)) of a miniature elephant sculpture are provided in Figure 14. The proposed framework has been tested in two configurations: “Proposed (SP)”, with no iterative refinement applied, and “Proposed 3 iter.”, where iterative refinement has been applied for $i = 3$ iterations. The results can be analyzed as follows: the proposed framework in its basic form provides a balanced output in terms of error metrics when compared to similarly performing methods, e.g. “Hyp”, “TGV” and “GF”, where “TGV” has the most competitive results. However, “TGV” is slow and took about 10 minutes on our computing platform, while our proposed technique offers real-time performance. Basic interpolation methods involving no cross-modality filtering, e.g. “Voronoi (NN)” and “Bilinear”, perform surprisingly well in some cases (c.f. Table 1), which can be explained by the imperfectly aligned data modalities of the GT data. Cross-modality filtering methods aim at finding edge congruency between the V and Z modalities, and any initial misalignment leads to high error (c.f. Figure 13 (f–l)), which is not manifested in the direct resampling methods (c.f. Figure 13 (b, c)). However, the visual appearance of the latter is not good overall (c.f. Figure 14 (d, e)).

VI. CONCLUSIONS
The presented work improved and streamlined our previous depth compression method [45] into a more general framework for treating View-plus-depth data. Specifically, we relate the depth representation with the underlying color modality in terms of super-pixels. To this end, we have proposed a novel hierarchical super-pixel segmentation which keeps the boundary congruency of successive layers. In this way, the segmentation structure is very suitable for depth modelling in terms of constant-depth segments, and for its subsequent down-sampling for effective encoding or for its color-adaptive reconstruction. More specifically, the SPs also allow for embedding the down-sampled isotropic depth maps and thus achieving better performance of the encoding scheme. The reconstruction filter, which leads to smoothed and well-aligned depth maps, has been made adaptive to the size of the refining SPs. We have added a boundary correction in terms of the proposed edge outlier encoding protocol. Apart from effectively avoiding code redundancies related to misaligned V+Z data, such boundary correction provides a suitable alignment of the two modalities, which is important for rendering virtual views.
The proposed encoding technique is highly competitive in the very low bit rate region. The general framework is also suitable for fusing non-confocal sensor data with asymmetric spatial resolution. It is easily tunable for other image processing tasks such as segmentation and multi-sensor data sparsification.