Millimeter Wave MIMO based Depth Maps for Wireless Virtual and Augmented Reality

Augmented and virtual reality (AR/VR) systems are rapidly becoming key components of the wireless landscape. For an immersive AR/VR experience, these devices should be able to construct an accurate depth perception of the surrounding environment. Current AR/VR devices rely heavily on RGB-D depth cameras to achieve this goal. The performance of these depth cameras, however, has clear limitations in several scenarios, such as those with shiny objects, dark surfaces, and abrupt color transitions. In this paper, we propose a novel solution for AR/VR depth map construction using mmWave MIMO communication transceivers. This is motivated by the deployment of advanced mmWave communication systems in future AR/VR devices for meeting the high data rate demands and by the interesting propagation characteristics of mmWave signals. Accounting for the constraints on these systems, we develop a comprehensive framework for constructing accurate and high-resolution depth maps using mmWave systems. In this framework, we develop new sensing beamforming codebook approaches that are specific to the depth map construction objective. Using these codebooks, and leveraging tools from successive interference cancellation, we develop a joint beam processing approach that can construct high-resolution depth maps using practical mmWave antenna arrays. Extensive simulation results highlight the potential of the proposed solution in building accurate depth maps. Further, these simulations show the promising gains of mmWave based depth perception compared to RGB-based approaches in several important use cases.


I. INTRODUCTION
Wireless augmented and virtual reality (AR/VR) applications have recently been attracting increasing interest. Realizing wireless AR/VR in practice can open the door for a wide range of interesting applications and use cases. Enabling an immersive AR/VR experience, however, requires high-resolution and accurate depth perception. This can potentially allow wireless AR/VR users to move freely within their indoor or outdoor environment. Current depth perception approaches for AR/VR systems rely mainly on RGB-D (depth) cameras for constructing the depth maps. While RGB-D based depth map construction approaches can generally provide good accuracy, they suffer from critical limitations in scenarios with bright, shiny, or transparent surfaces, dark objects, and large rooms, among others. These limitations stem from the fundamental properties of the way visible light propagates and interacts with different surfaces.
The associate editor coordinating the review of this manuscript and approving it for publication was Yunlong Cai.
In order to overcome these limitations, we propose to leverage mmWave systems and signals for improving the depth map estimation accuracy. This is motivated by the interesting characteristics of mmWave signals and by the observation that mmWave systems will be deployed in future AR/VR devices anyway for meeting the wireless communication requirements [1]. In terms of the mmWave signal characteristics, the propagation of these signals is not affected by interference from light sources, which makes mmWave systems capable of detecting both bright and dark objects. Further, the mmWave diffuse scattering and specular reflection properties could help in detecting transparent objects as well as rough surfaces. These aspects, among others, motivate exploring the potential of leveraging mmWave transceivers for complementing the RGB-D depth maps in AR/VR systems, which is the focus of this paper.
VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A. PRIOR WORK
Previous depth map construction approaches focused on leveraging: (i) monocular images using RGB cameras [2], (ii) passive/active stereo images using either RGB-D depth cameras [3], [4] or infrared (IR) stereo cameras [5], [6], and (iii) gated images using active gated imaging cameras [7], [8]. In [2], a monocular depth estimation approach capable of capturing the object boundaries is proposed. In [3], RGB images along with sparse depth samples, acquired from depth cameras or computed via Simultaneous Localization and Mapping (SLAM) algorithms, are used jointly to reconstruct the depth maps. An alternative approach for depth estimation was proposed in [4], where a monocular structured-light camera (a calibrated stereo set-up with one camera and one laser projector) is leveraged for estimating the disparity. As for the active stereo systems, in [5], an IR pattern projected from stereo IR cameras is adopted for depth estimation through active stereo matching. The IR images are acquired from the Intel Realsense camera [9]. Also, the IR pattern characteristics needed for active stereo matching are described in [6].
In addition, high-resolution depth images can be achieved for far objects using active gated imaging systems, as in [7], [8]. These depth map construction approaches [2]-[8], [10], however, have several important limitations, as follows. (i) First, these depth map construction approaches normally fail to sense the depth for shiny, dark, transparent, and distant surfaces. While there are some attempts at solving these challenges using IR stereo cameras [5] or excessive processing of the RGB-D images [11], there is no complete and general solution yet to this problem. (ii) Further, these IR and RGB-D based depth map construction algorithms suffer from a critical limitation, which is the depth ambiguity for far objects/surfaces. The depths of distant surfaces cannot be resolved by the algorithms in [5], [11]. (iii) Another key challenge is the additional bill of materials (BOM) cost incurred from integrating the IR stereo camera systems into the wireless AR/VR device architectures. In contrast, the existing mmWave systems in the wireless AR/VR device architectures incur no additional BOM cost when leveraged for depth map estimation jointly with their primary purpose of wireless communication. (iv) The field of view coverage is also a main challenge. The depth map coverage is limited by the camera field of view, which is constrained by the camera lens and by the light sensor. In mmWave MIMO systems, however, the field of view is constrained by the array radiation pattern, as will be explained in Section VI, and can typically be larger than the typical camera field of view.
These challenges motivate the search for other technologies to complement RGB-D cameras in accurately sensing the VR/AR environment. One promising technology for this goal is wireless millimeter wave (mmWave) systems. Since mmWave antenna arrays will be used to satisfy the high data rate communication demands of wireless VR/AR, it is interesting to investigate whether they could also be useful for VR/AR-relevant sensing functions, such as depth estimation. Initial studies on using mmWave communication arrays for radar and sensing were presented in [12], [13]. These studies, however, focused only on the ranging problem (of one or multiple targets), not on the depth map construction problem. Other mmWave sensing and tracking work that was not restricted to communications hardware was presented in [14], [15]. The research in [14], [15], though, targeted tracking a single object over a small distance, and cannot be directly applied to depth estimation of surrounding surfaces in VR/AR. Further, the work in [12]-[15] did not study the trade-offs between estimation accuracy and different system parameters, such as the number of antennas and the adopted bandwidth, and did not compare the system performance under transceiver architecture constraints, such as those imposed on analog phased-array transceiver architectures. The mmWave MIMO based scene depth map construction framework thus raises interesting research challenges, ranging from beam codebook design to scene depth estimation. These challenges are addressed in this work and explained in detail in Section V.

B. CONTRIBUTION
In this paper, we consider the mmWave MIMO based depth map construction problem for AR/VR systems, adopting mmWave communication hardware and frame structure. The contributions of this paper can be summarized as follows.
• mmWave MIMO depth map construction framework: We formulate the mmWave MIMO depth map construction problem and propose a general framework for building depth maps under the constraints imposed by mmWave communication hardware and frame structure.
• A design for depth-map suitable sensing beamforming codebook: We define the characteristics of the desirable mmWave sensing beamforming codebook for efficient depth map construction and develop a codebook construction approach that meets these characteristics.
• High-resolution depth map construction approach: Given the designed beamforming codebook, we develop a novel signal processing approach for jointly processing the signals received by the sensing beams and building high-resolution depth maps.
The proposed solution is extensively evaluated using accurate ray-tracing channels generated from Wireless InSite [16], and ground truth depth images generated from Blender [17]. The simulation results show the promise of mmWave MIMO sensing in becoming a viable depth estimation solution for communication-constrained sensing systems, either as a standalone approach or as an integrated approach with RGB-D depth cameras. These simulation results can be useful for various applications; they can be generally applied to AR/VR devices, smart home devices, or autonomous driving devices.
Notation: We use the following notation throughout this paper: $\mathbf{A}$ is a matrix, $\mathbf{a}$ is a vector, $a$ is a scalar, $\mathcal{A}$ is a set of scalars, and $\boldsymbol{\mathcal{A}}$ is a set of vectors/matrices. $\|\mathbf{a}\|_p$ is the $p$-norm of $\mathbf{a}$. $|\mathbf{A}|$ is the determinant of $\mathbf{A}$, $\|\mathbf{A}\|_F$ is its Frobenius norm, whereas $\mathbf{A}^T$, $\mathbf{A}^H$, $\mathbf{A}^*$, $\mathbf{A}^{-1}$, $\mathbf{A}^{\dagger}$ are its transpose, Hermitian (conjugate transpose), conjugate, inverse, and pseudo-inverse, respectively. $[\mathbf{A}]_{r,:}$ and $[\mathbf{A}]_{:,c}$ are the $r$th row and $c$th column of the matrix $\mathbf{A}$, respectively. $\mathrm{diag}(\mathbf{a})$ is a diagonal matrix with the entries of $\mathbf{a}$ on its diagonal. $\mathbf{I}$ is the identity matrix. $\mathbf{1}_N$ and $\mathbf{0}_N$ are the $N$-dimensional all-ones and all-zeros vectors, respectively. $\mathbf{A} \otimes \mathbf{B}$ is the Kronecker product of $\mathbf{A}$ and $\mathbf{B}$, $\mathbf{A} \bullet \mathbf{B}$ is their Khatri-Rao product, and $\mathbf{A} \odot \mathbf{B}$ is their Hadamard product. $\mathcal{N}(\mathbf{m}, \mathbf{R})$ is a complex Gaussian random vector with mean $\mathbf{m}$ and covariance $\mathbf{R}$. $E[\cdot]$ is used to denote expectation. $\mathrm{vec}(\mathbf{A})$ is a vector whose elements are the stacked columns of the matrix $\mathbf{A}$.

II. SYSTEM AND CHANNEL MODELS
In this section, the system model for the adopted communication-constrained sensing framework is first formulated, followed by the characterization of the adopted channel model.

A. SYSTEM MODEL
In this paper, we propose to reuse the same AR/VR mmWave communication system/circuits to do the sensing and depth map construction, as shown in Fig. 1. Hence, we adopt a sensing model that accounts for the mmWave communication system/circuit constraints. This communication-constrained sensing model consists of a transmitter and a receiver, both connected through self-isolation circuitry to a shared N-antenna array, as depicted in Fig. 2. This type of operation is commonly referred to as MIMO in-band full-duplex operation [22]. We assume that the transmitter and receiver chains are well isolated by the isolation circuitry so that self-interference can be neglected. This assumption is reasonable given the recent developments in self-interference cancellation systems. One example of these systems is the magnetic-free non-reciprocal circulators (i) based on coupled-resonator loops [23] or (ii) based on CMOS circulators operating in the 28 GHz mmWave band [24]. Another example is the receiver with an integrated magnetic-free non-reciprocal circulator and baseband self-interference cancellation operating in the sub-6 GHz band [25]. A third example is the magnetic-free SOI CMOS circulator operating in the 60 GHz mmWave band [26]. Accounting for residual self-interference, however, is an important direction for future extensions.
Further, for the sake of having mmWave transceivers with low cost and low power consumption, we adopt an analog-only architecture for the N-antenna array used for transmission and reception [27], [28], where the beamforming/combining is done in the analog domain using a network of phase shifters. Next, we summarize the transmit and receive signal models.

1) TRANSMIT SIGNAL MODEL
We consider a wideband single-carrier waveform comprising multiple time frames. These frames are transmitted over an aggregated time interval of $T$ seconds during which the environment is assumed to be relatively static. This time interval is commonly referred to as a coherent processing interval (CPI) [29]. Each frame consists of both data and preamble sequences designed for the wireless communication function. The co-existing sensing model also uses these preamble sequences to sense the environment and build the depth maps, as will be explained in detail in the following sections. This can be achieved by either splitting the frames between sensing and communication or by designing the sensing and communication beam training operations to share the same preamble sequences.
FIGURE 2. A block diagram of the communication-constrained sensing model. The sensing framework consists of (a) the beam codebook design $\mathcal{P}$ and (b) the post-processing design $g(\cdot, \mathcal{P})$, to estimate the scene depth map $\mathbf{D}$. The upper path represents the transmitter path, while the lower path represents the receiver path.
Next, for ease of exposition, we assume that $M$ frames/preamble sequences are dedicated for sensing. If $s_m[n]$ denotes the $n$th transmitted symbol at the $m$th frame, with $E[|s_m[n]|^2] = 1$, then the complex-baseband representation of the transmit waveform can be written as [30]
$$x(t) = \sqrt{\mathcal{E}_s} \sum_{m=1}^{M} \sum_{n=1}^{N_m} s_m[n]\, \delta\big(t - (n-1) T_S - (m-1) T_F\big), \quad (1)$$
where $\mathcal{E}_s$ represents the average energy per symbol, $T_S$ is the symbol time, and $T_F$ is the frame duration. $N_m$ is the number of symbols in the $m$th frame, which is divided into a preamble sequence of length $N_p$ and a set of data symbols of length $N_d^m$. Further, we assume that the same preamble sequence $s_p[n]$, $n \in \{1, \ldots, N_p\}$, is transmitted in the first $N_p$ symbols of each frame. Note that for the sake of simplifying the transmit and receive signal representation, we incorporated the transmit pulse shaping and receive filtering functions into the channel model.
Finally, if a beamforming vector $\mathbf{f} \in \mathbb{C}^{N \times 1}$ is used to transmit the signal at the AR/VR device, the complex-baseband representation of the transmitted signal can be expressed as
$$\mathbf{x}(t) = \mathbf{f}\, x(t). \quad (2)$$
This transmitted signal will interact with the environment (through reflection, scattering, etc.) and will be received back by the AR/VR device. Next, we describe the receive signal model.

2) RECEIVE SIGNAL MODEL
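Let $G_{tar}$ denote the number of targets/scatterers in the environment. Then, focusing on the preamble sequence transmission/reception (i.e., the first $N_p$ symbols of each frame), the receive sensing signal of the $m$th frame can be written as
$$y_m[n] = \sqrt{\mathcal{E}_s} \sum_{d} \mathbf{w}^H \mathbf{H}_d\, \mathbf{f}\, s_p[n-d] + \mathbf{w}^H \mathbf{v}_m[n], \quad (3)$$
where $\mathbf{w} \in \mathbb{C}^{N \times 1}$ is the combining vector at the AR/VR device, and $\mathbf{v}_m[n] \sim \mathcal{N}_{\mathbb{C}}(\mathbf{0}, \sigma_n^2 \mathbf{I})$ is the receive noise with variance $\sigma_n^2$. The sum over $d$ runs over the channel delay taps, and $\mathbf{H}_d = \sum_{g=1}^{G_{tar}} \mathbf{H}_{d,g} \in \mathbb{C}^{N \times N}$ is the delay-$d$ channel matrix between the transmission from and the reception by the AR/VR antenna array, which is described in the following subsection.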
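As an illustration of the receive signal model above, the following Python sketch shows how an analog-only transceiver collapses each delay-d channel matrix into a single scalar tap, w^H H_d f, before the preamble is convolved with these taps. The function name `receive_preamble`, the `(D, N, N)` layout of `H`, and all toy values are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def receive_preamble(H, f, w, s_p, e_s=1.0, noise_std=0.0, seed=0):
    """Simulate y[n] = sqrt(E_s) * sum_d w^H H_d f s_p[n-d] + w^H v[n].

    H has shape (D, N, N): one N x N channel matrix per delay tap d
    (an assumed layout for this sketch).
    """
    rng = np.random.default_rng(seed)
    n_ant = len(f)
    # Analog beamforming/combining reduces each matrix tap to one complex scalar.
    taps = np.array([w.conj() @ H_d @ f for H_d in H])
    clean = np.sqrt(e_s) * np.convolve(s_p, taps)[:len(s_p)]
    # Circularly symmetric complex Gaussian noise per antenna, combined by w.
    v = noise_std * (rng.standard_normal((len(s_p), n_ant))
                     + 1j * rng.standard_normal((len(s_p), n_ant))) / np.sqrt(2)
    return clean + v @ w.conj()

# Toy example (all values hypothetical): 4 antennas, one echo at delay tap d = 2.
N = 4
f = np.ones(N) / np.sqrt(N)              # transmit beam
w = np.ones(N) / np.sqrt(N)              # receive combiner
H = np.zeros((3, N, N), dtype=complex)
H[2] = 0.5 * np.eye(N)
s_p = np.array([1.0, -1.0, 1.0, 1.0])    # stand-in preamble
y = receive_preamble(H, f, w, s_p, e_s=4.0)
```

In this noiseless toy case the output is simply the preamble delayed by two samples and scaled by the effective tap, which is the quantity the delay estimators of Section IV operate on.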

B. CHANNEL MODEL
Given that the depth sensing problem highly relies on the accurate modeling of the surrounding environment and its geometry, we adopt a geometric channel model in this work. More specifically, we consider the extended Saleh-Valenzuela wideband geometric channel model [31]-[34]. Based on that, the $g$th target contribution in the delay-$d$ channel, $\mathbf{H}_{d,g}$, can be modeled as
$$\mathbf{H}_{d,g} = \sqrt{G_g} \sum_{\ell=1}^{L_{ray}} \alpha_\ell\, p(dT_S - \tau_\ell)\, \mathbf{a}_R(\phi_{R,\ell,g}, \theta_{R,\ell,g})\, \mathbf{a}_T^H(\phi_{T,\ell,g}, \theta_{T,\ell,g}), \quad (4)$$
where $L_{ray}$ is the number of channel clusters; each cluster contributes one ray with complex channel coefficient $\alpha_\ell$, time delay $\tau_\ell$, and azimuth/elevation angles of departure and arrival, $(\phi_{T,\ell,g}, \theta_{T,\ell,g})$ and $(\phi_{R,\ell,g}, \theta_{R,\ell,g})$, respectively. $\mathbf{a}_T(\cdot, \cdot)$ and $\mathbf{a}_R(\cdot, \cdot)$ represent the transmit and receive array response vectors associated with the angles of departure and arrival, respectively. The transmit and receive pulse shaping signals are included within $p(t)$ such that $p(t) = p_T(t) * p_R(t)$. The path gain associated with the $g$th target is denoted by $G_g$ and can be expressed as
$$G_g = \frac{G_T\, G_R\, \lambda^2\, \sigma_g^{RCS}}{(4\pi)^3\, \rho_g^{2\,PL}}, \quad (5)$$
where $G_T$ and $G_R$ are the transmitter and receiver gains, $\lambda$ is the operating wavelength, and $PL$ is the path loss exponent. Finally, $\rho_g$ denotes the distance (range) between the AR/VR device and the $g$th target/scatterer and $\sigma_g^{RCS}$ denotes the radar cross section of this target.

III. PROBLEM DEFINITION
Our objective in this paper is to efficiently estimate the depth/range map of the surrounding environment using the communication-constrained mmWave MIMO sensing model in Section II. Before delving into the formal problem definition, it is important to distinguish between the range and the depth of a certain target. As depicted in Fig. 3, the range of a target with respect to the AR/VR camera (which is aligned with the AR/VR antenna array) is the linear radial distance from the camera center (focal point) to the target. The depth, in contrast, is the distance along the y-axis between the camera center (focal point) and the x-z plane of the target. Given that the range and depth can be calculated from one another, we focus our formulation on the depth estimation problem.
Next, we define the depth map of the surrounding environment with respect to an AR/VR device. In this paper, we express this depth map as an $M_h \times M_w$ matrix $\mathbf{D}_{map} \in \mathbb{R}^{M_h \times M_w}$. Further, we use $M_{res} = M_h M_w$ to denote the total number of pixels in the depth map. The range map $\mathbf{R}_{map} \in \mathbb{R}^{M_h \times M_w}$ is similarly defined. Now, given the system and channel models in Section II, the AR/VR device constructs the estimated mmWave-based depth map through two main steps: (i) sensing the environment using several beamforming and combining sensing vectors and (ii) post-processing the receive sensing signal to construct the estimated depth map. More formally, if a beamforming-combining pair $(\mathbf{f}_m, \mathbf{w}_m)$ is used to transmit and receive the $N_p$ symbols of the $m$th preamble sequence, then the receive sensing signal can be expressed as
$$y_m[n] = \sqrt{\mathcal{E}_s} \sum_{d} \mathbf{w}_m^H \mathbf{H}_d\, \mathbf{f}_m\, s_p[n-d] + \mathbf{w}_m^H \mathbf{v}_m[n]. \quad (6)$$
We collect these samples, for all frames $m$ and preamble symbols $n$, in the receive sensing matrix $\mathbf{Y}$. For ease of exposition, we define the sensing beamforming codebook $\mathcal{P}$ as the codebook that includes the $M$ beamforming-combining pairs, i.e., $\mathcal{P} = \{(\mathbf{f}_m, \mathbf{w}_m) : m \in \{1, \ldots, M\}\}$. Finally, given the receive sensing matrix $\mathbf{Y}$, a post-processing step is applied for estimating the depth map. If $g(\cdot)$ denotes the post-processing function, the estimated depth map $\hat{\mathbf{D}}_{map} \in \mathbb{R}^{M_h \times M_w}$ can be written as
$$\hat{\mathbf{D}}_{map} = g(\mathbf{Y}, \mathcal{P}). \quad (7)$$
Our objective in this paper then is to design the sensing beamforming codebook $\mathcal{P}$ and the post-processing $g(\cdot)$ so that the estimated depth map $\hat{\mathbf{D}}_{map}$ is as close as possible to the actual depth map $\mathbf{D}_{map}$. To evaluate the performance of the proposed approaches, we will adopt the root-mean squared estimation error (RMSE) and the mean absolute error (MAE) between the depth maps, which are defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{M_{res}} \sum_{i=1}^{M_h} \sum_{j=1}^{M_w} \left( [\hat{\mathbf{D}}_{map}]_{i,j} - [\mathbf{D}_{map}]_{i,j} \right)^2}, \quad (8)$$
$$\mathrm{MAE} = \frac{1}{M_{res}} \sum_{i=1}^{M_h} \sum_{j=1}^{M_w} \left| [\hat{\mathbf{D}}_{map}]_{i,j} - [\mathbf{D}_{map}]_{i,j} \right|. \quad (9)$$
In Section V, we will present the general framework of our proposed depth map estimation approach. This will be followed by a detailed description of the two main components in this framework, namely the beamforming codebook design $\mathcal{P}$ and the post-processing solution $g(\cdot)$, in Sections VI and VII.
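The two evaluation metrics named above, the RMSE and the MAE between the estimated and actual depth maps, can be computed with a few lines of NumPy. This is a minimal sketch using the standard definitions of both metrics averaged over all pixels; the function name `depth_map_errors` and the 2 x 2 toy maps are hypothetical.

```python
import numpy as np

def depth_map_errors(d_est, d_true):
    """Return (RMSE, MAE) between two depth maps of identical shape."""
    d_est = np.asarray(d_est, dtype=float)
    d_true = np.asarray(d_true, dtype=float)
    if d_est.shape != d_true.shape:
        raise ValueError("depth maps must have the same shape")
    err = d_est - d_true
    rmse = float(np.sqrt(np.mean(err ** 2)))   # penalizes large pixel errors more
    mae = float(np.mean(np.abs(err)))          # average absolute pixel error
    return rmse, mae

# Toy 2 x 2 depth maps in meters: a single pixel is off by 2 m.
d_true = np.array([[1.0, 2.0], [3.0, 4.0]])
d_est = np.array([[1.0, 2.0], [3.0, 6.0]])
rmse, mae = depth_map_errors(d_est, d_true)
```

Because the RMSE squares the per-pixel error, a single badly estimated pixel (for example one locked onto a wrong path) affects it more strongly than it affects the MAE, which is why reporting both is informative.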

IV. BACKGROUND
Before going into the proposed framework for estimating depth/range maps using mmWave MIMO, we provide a brief background on the basis of the single-target range estimation problem. For a preliminary model, consider one target in the free space with a Line-of-sight (LoS) path to the AR/VR device. Further, consider the case when one mmWave beam is perfectly steered towards that target, as depicted in Fig. 3. Adopting this preliminary model, the target range estimation accuracy bound will be first examined. Then, a description of the main algorithms used in the literature to approach this problem is provided.

A. TARGET RANGE ESTIMATION ACCURACY
Our main objective is to find the fundamental limit for mmWave MIMO based depth estimation, which can be considered as range estimation at every possible viewing direction, i.e., at every azimuth angle $\phi \in [0, 2\pi)$ and every elevation angle $\theta \in [0, \pi)$. For the range estimation accuracy, one useful metric is the Cramer-Rao lower bound (CRLB) on the range estimation. For white Gaussian noise, the CRLB provides a lower bound on the mean-squared error of any unbiased estimator, hence it is used as a benchmark for the performance analysis of parameter estimation [35]. Considering the case of range estimation for a single target, the CRLB of this single target range is formulated as [13], [29], [35]
$$\mathrm{var}(\hat{\rho}) \geq \frac{\varsigma^2}{8\, \eta^2 B^2 P_{int}\, \mathrm{SNR}_{rad}}, \quad (10)$$
where $\varsigma$ is the speed of light, $B$ is the transmission bandwidth, $P_{int}$ is the integration gain, which is equal to the number of preamble symbols used for the estimation, and $\eta$ depends on the power spectral density shape of $a(t)$ over the preamble duration. Under the assumption of a flat spectral density for $a(t)$, $\eta^2 = (2\pi)^2/12$. The radar signal-to-noise ratio for this target can then be expressed as $\mathrm{SNR}_{rad} = \mathcal{E}_s G_{rad} / \sigma_n^2$, where $G_{rad}$ denotes the path gain associated with the target.
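To give a feel for the scale of this bound, the sketch below evaluates it numerically. It assumes the bound takes the standard time-delay CRLB form mapped to range through the round-trip relation (range equals speed of light times delay over two), i.e. var >= c^2 / (8 eta^2 B^2 P_int SNR_rad); the constant 8 is an assumption of this sketch, as are the function name `range_crlb_std` and the operating point used in the example.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def range_crlb_std(bandwidth_hz, n_preamble, snr_linear,
                   eta_sq=(2 * math.pi) ** 2 / 12):
    """Square root of the assumed range CRLB, in meters.

    eta_sq defaults to the flat-spectrum value (2*pi)^2 / 12.
    """
    var = SPEED_OF_LIGHT ** 2 / (
        8 * eta_sq * bandwidth_hz ** 2 * n_preamble * snr_linear)
    return math.sqrt(var)

# Hypothetical operating point: 1 GHz bandwidth, 512 preamble symbols, 0 dB SNR.
sigma = range_crlb_std(1e9, 512, 1.0)
```

Under these assumed values the bound is on the order of a few millimeters. The scaling is the more robust takeaway: the standard deviation is inversely proportional to the bandwidth and to the square root of both the integration gain and the SNR.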

B. TARGET RANGE ESTIMATION ALGORITHMS
Estimating the round-trip delay $\hat{\tau}$ is equivalent to finding the range estimate $\hat{\rho}$, since they are directly related through $\hat{\tau} = 2\hat{\rho}/\varsigma$. Given the extensive research on delay estimators in the literature [29], [36], we restrict the scope of this paper to the magnitude based delay estimators in [37] for simplicity. In a general sense, given a known transmit preamble sequence $x_0[n]$ and the received baseband sequence $z[n]$, the receiver can estimate the round-trip delay by maximizing an objective function, the cross-correlation function between the two time sequences, over a range of possible delays. Based on this notion, two delay estimators are formulated as follows [37].

1) BASIC CORRELATOR
The basic correlator is a coarse delay estimator that performs the maximization at the same sampling frequency, $f_S$, used by the AR/VR communication system. Assume that the length of the received baseband sequence, $z[n]$, is $L_z$ samples, where the last $N_z$ samples are non-zero. The range estimate can then be formulated as
$$\hat{\rho}_{BC} = \frac{\varsigma\, T_S}{2}\, q_{BC}, \qquad q_{BC} = \arg\max_{q \in \mathcal{Q}} \left| \sum_{n} z[n+q]\, x_0^*[n] \right|, \quad (11)$$
where $T_S = 1/f_S = 1/B$ denotes the sampling time, $\mathcal{Q}$ represents the set of possible discrete sample delays, and the optimal $q$ solution is denoted by $q_{BC}$. Unfortunately, the accuracy of this range estimate is limited by the sampling frequency $f_S$. One way of improving the estimation accuracy is to perform the maximization at a higher sampling frequency. This, however, increases the computational complexity dramatically, which motivates the next delay estimator, the massive correlator [37].
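The coarse estimator described above can be sketched as a search over integer sample delays for the largest cross-correlation magnitude. The sketch below is illustrative only: the function name `basic_correlator` and the `max_delay` search window are hypothetical, and a Barker-13 sequence is used purely as a stand-in for the actual preamble because its autocorrelation sidelobes are small.

```python
import numpy as np

C = 299_792_458.0  # speed of light (m/s)

def basic_correlator(z, x0, f_s, max_delay):
    """Return (q_bc, range_estimate) from the peak of |cross-correlation|.

    The search runs over integer delays q = 0..max_delay at the native
    sampling frequency f_s, so the range resolution is c / (2 f_s).
    """
    z = np.asarray(z)
    x0 = np.asarray(x0)
    n = len(x0)
    delays = np.arange(max_delay + 1)
    # np.vdot conjugates its first argument: sum_n conj(x0[n]) * z[n + q].
    metric = np.array([abs(np.vdot(x0, z[q:q + n])) for q in delays])
    q_bc = int(delays[np.argmax(metric)])
    return q_bc, C * q_bc / (2 * f_s)   # round trip: range = c * delay / 2

# Toy echo (values hypothetical): preamble arrives 7 samples late with a complex gain.
barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)
f_s = 1e9
z = np.zeros(40, dtype=complex)
z[7:7 + 13] = 0.3 * np.exp(0.8j) * barker13
q_bc, rho_bc = basic_correlator(z, barker13, f_s, max_delay=20)
```

With a 1 GHz sampling frequency each delay step corresponds to roughly 15 cm of range, which illustrates why a finer, fractional-delay stage is needed for accurate depth maps.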

2) MASSIVE CORRELATOR
The primary function of the massive correlator is to perform the maximization of the objective function at a higher sampling frequency without the computational burden of computing the shift in real time. For this reason, [37], [38] introduced the solution of pre-designing a specific correlator bank that contains shifted versions of the reference sequence, x 0 [n]. The receiver will then multiply the received sequence by the correlator bank to compute the objective function.
We describe the steps of the massive correlator algorithm as follows [37], [38].
(a) Upsample $x_0[n]$ to a sampling frequency higher than $f_S$, denoted as $f_{est}$.
(b) Define the correlator bank matrix, $\mathbf{X}_0$, where each row of this matrix is a shifted version of the upsampled $x_0[n]$. Let the number of rows in $\mathbf{X}_0$ be equal to $(2\delta + 1)$, where $\delta$ is the largest lag/advance discrete fractional delay in the receive sequence, such that $\delta = \frac{f_{est}}{2 f_S}$.
(c) Downsample each row of the correlator matrix independently to the lower sampling frequency $f_S$, and let the resulting matrix be $\mathbf{B}_0$. The reason for this step is to test delays at the higher sampling frequency, $f_{est}$, while only applying multiplications at the lower sampling frequency, $f_S$.
(d) Shift back the receive sequence, $z[n]$, by the coarse discrete delay estimate, $q_{BC}$, such that $\tilde{z}[n] = z[n + q_{BC}]$, and then concatenate the sequence into one row vector, $\tilde{\mathbf{z}}$.
(e) Calculate the fractional range estimate, $\hat{\rho}'$, from the row of $\mathbf{B}_0$ that maximizes the correlation magnitude with $\tilde{\mathbf{z}}$, and add it to the coarse estimate, so that the final range estimate is $\hat{\rho}_{BC} + \hat{\rho}'$.
With the end goal of constructing depth maps, these range estimation algorithms will then be leveraged by the mmWave MIMO based depth estimation, as explained in the upcoming sections. In the next section, we formulate a general framework for scene depth estimation.
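Steps (a) to (e) can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from [37], [38]: sinc interpolation is used for the upsampling, the bank rows are normalized to unit norm so that their scores are comparable, the rows are ordered from the largest advance to the largest lag, and the helper names (`sinc_upsample`, `build_bank`, `fractional_range`) are hypothetical. A Barker-13 sequence again stands in for the preamble.

```python
import numpy as np

C = 299_792_458.0  # speed of light (m/s)

def sinc_upsample(x0, up, pad=1):
    """Step (a): sinc-interpolate x0 onto a grid `up` times finer than f_S.

    `pad` coarse samples of margin are kept on both sides so that shifted
    copies can be read without wrapping around.
    """
    n = np.arange(len(x0))
    t = np.arange(-pad * up, (len(x0) + pad) * up) / up  # time in units of T_S
    return np.sinc(t[:, None] - n[None, :]) @ x0

def build_bank(x0, up, pad=1):
    """Steps (b)-(c): 2*delta+1 shifted copies, each read back at rate f_S."""
    delta = up // 2                              # delta = f_est / (2 f_S)
    fine = sinc_upsample(x0, up, pad)
    base = pad * up + up * np.arange(len(x0))    # fine-grid indices of the f_S samples
    rows = np.array([fine[base - i] for i in range(-delta, delta + 1)])
    # Unit-norm rows (a choice of this sketch) so that scores are comparable.
    rows /= np.linalg.norm(rows, axis=1, keepdims=True)
    return rows, delta

def fractional_range(z_tilde, bank, delta, f_est):
    """Step (e): pick the best-matching row and convert its lag to range."""
    scores = np.abs(bank.conj() @ z_tilde)       # multiplications happen at rate f_S
    lag = int(np.argmax(scores)) - delta         # signed lag in fine samples
    return lag, C * lag / (2 * f_est)

barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], dtype=float)
f_s, up = 1e9, 8
f_est = up * f_s
bank, delta = build_bank(barker13, up)
# Synthetic echo, already shifted back by q_BC (step (d)), with a residual
# lag of 3 fine samples and an arbitrary complex gain.
fine = sinc_upsample(barker13, up)
z_tilde = 0.3 * np.exp(0.8j) * fine[up + up * np.arange(13) - 3]
lag, rho_frac = fractional_range(z_tilde, bank, delta, f_est)
```

The bank is built once offline, so the online cost is a single matrix-vector product at the native sampling rate; the returned `rho_frac` would be added to the coarse estimate from the basic correlator.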

V. GENERAL FRAMEWORK FOR SCENE DEPTH ESTIMATION
In this section, we highlight the key elements of the proposed depth map estimation approach, namely the sensing beamforming codebook $\mathcal{P}$ and the post-processing $g(\cdot)$, and discuss the challenges associated with designing these elements. As depicted in Fig. 4, we first design the sensing beamforming codebook $\mathcal{P}$ offline based on the desired AR/VR properties such as the field of view, the scene aspect ratio, and the number of horizontal and vertical beams covering the scene view. To build the depth map of a certain scene, the beam pairs of the designed codebook are used to sense the environment and acquire the receive sensing matrix $\mathbf{Y}$ in (6). These receive signals are then jointly processed using the proposed post-processing approach to build the depth map. In the remainder of this section, we explain the challenges associated with designing the codebook and the post-processing operations. Then, we will present how our proposed solutions overcome these challenges in Sections VI and VII.

A. CODEBOOK DESIGN CHALLENGES
To effectively sense the surrounding environment and build efficient depth maps, the beams of the sensing codebooks should be designed to scan the full scene. Since the mmWave MIMO based depth maps may potentially complement the RGB-D based maps, our objective is to build a beamforming codebook that scans the full rectangular grid of the typical depth sensors of the AR/VR cameras. However, the classical beam steering codebooks, such as the DFT codebooks [39], that independently sample the azimuth and elevation directions do not normally fit a rectangular grid. They instead form a parabolic grid, i.e., for a fixed elevation angle, the grid lines of these codebook beams are parabolic curves, as shown in Fig. 5(a). This mismatch between the mmWave MIMO-based and camera-based depth grids could lead to clear distortion in the joint mmWave/RGB-D depth map construction and make it hard to complement the RGB-D depth map using mmWave MIMO sensing.
One possible solution is to estimate the depths on the parabolic grid using the classical beamforming codebook and then interpolate/extrapolate to calculate the rectangular depth map. The main disadvantage of this solution, however, is that the interpolation can potentially lead to considerable loss in the depth map accuracy as the changes of the depth are not normally smooth in nature. Hence, in order to avoid the interpolation loss, the more persuasive solution is to develop a depth map compatible beamforming codebook that fits exactly the desirable rectangular sensor grid. With this motivation, we propose a beamforming/combining design approach in Section VI to overcome the codebook mismatch challenge.
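The grid mismatch discussed in this subsection can be checked numerically by projecting beam directions that share one elevation angle onto a flat camera plane. The sketch below assumes a particular angle convention (azimuth measured from the y-axis boresight within the x-y plane, elevation measured from that plane) and a unit focal length; the function name `project_to_camera_plane` is hypothetical.

```python
import numpy as np

def project_to_camera_plane(az, el, focal_length=1.0):
    """Intersect the ray with direction (cos(el) sin(az), cos(el) cos(az), sin(el))
    with the camera plane y = focal_length and return its (x, z) coordinates."""
    x = focal_length * np.tan(az)
    z = focal_length * np.tan(el) / np.cos(az)
    return x, z

# Beams that sample azimuth independently at one fixed elevation of 20 degrees.
az = np.radians(np.linspace(-40, 40, 9))
el = np.radians(20.0)
x, z = project_to_camera_plane(az, el)
row_bend = z.max() - z.min()   # would be 0 if the beams lay on one sensor row
```

On a rectangular sensor grid all pixels of a row share the same z coordinate, whereas here z grows toward the edges of the azimuth sweep, so a fixed-elevation beam row bends away from the corresponding pixel row.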

B. SCENE DEPTH ESTIMATION CHALLENGES
The sensing beamforming codebook is used to sense the surrounding environment. Now, given the receive sensing matrix $\mathbf{Y}$, the objective of the post-processing is to construct an accurate depth map of the facing scene. This process, however, has several challenges. In order to explain these challenges, let us first consider the case when the environment has only a single target. In this case, the sensing/scanning beam that is directed towards the region that includes this target will result in some backscattered signal. This signal can be used for calculating the round-trip time of flight and consequently the range of this target, leveraging the MIMO radar concepts [29], [40] and the algorithms detailed in Section IV-B. In terms of the range/depth map, the pixel that includes the region of this target will simply have the value of the estimated range/depth. In practice, however, the environment has several targets/surfaces and the mmWave arrays have strict constraints on their hardware: power budget, computational complexity, etc. These limitations lead to critical challenges for our objective of building accurate depth and range maps of the environment. More specifically, if we adopt an approach that scans the surrounding scene using a beamforming codebook and processes the receive sensing signal of each beam independently to estimate the depth of the region defined by this beam, then this approach will have the following key drawbacks.
• Low-resolution depth-maps: The low resolution drawback is mainly due to (i) the limitation on the number of AR/VR antennas, which is controlled by many factors in the AR/VR device such as the device dimensions, computational complexity, circuit routing, power consumption, etc., and (ii) the number of beams in the sensing codebook P, which is limited by the time allocated for the depth estimation process.
• Inter-target interference: The constraints on the number of antennas at the AR/VR device limit the system spatial resolution. This makes it hard to differentiate between the ranges/depths of the different targets/surfaces that are close to each other. In other words, when measuring the depth of the object in a particular region/direction, multiple objects/surfaces may reflect the incident signal at the same time. The interference between these reflected/scattered signals may highly affect the accuracy of the range/depth estimation. Hence, if a certain pixel has multiple objects/surfaces, it will be difficult to estimate the shortest depth of the objects in this pixel (to follow the depth map definition in Section III).
• Inter-path interference: When sensing the range/depth of a certain target, the optimal situation (in terms of depth estimation accuracy) is when the target backscatters a single ray to the receiving array. In practice, however, the signal incident on a certain target may experience more than one phenomenon, such as scattering, reflection, diffraction, etc., which results in multiple rays. More than one of these rays could traverse the environment in different ways/directions, especially in indoor environments, before reaching the receiver. This means that they may reach the receiving array from multiple angles and with different times of flight. This causes inter-path interference, which makes it hard to accurately estimate the range/depth of the target of interest. For example, if the receiver estimates the range/depth based on a wrong path, this may noticeably degrade the accuracy of the depth map estimation. This challenge is depicted in Fig. 6. As illustrated, the challenge is how to design the sensing framework (the codebook and post-processing) to detect the desired channel path (the path in blue) while filtering out all the undesired channel paths. Examples of undesired paths are paths 2-4. Path 2 is transmitted and received within the main lobe. Path 3 is transmitted and received within the side lobe. Path 4 experiences multiple reflections, instead of back-scattering, before reaching the receiver. It has to be noted that the diffuse scattering and specular reflection properties of the mmWave signals are still crucial for constructing depth maps despite their contribution to the inter-path interference challenge. Without these properties, the sensing framework may not be able to construct a meaningful depth scene of the surrounding environment. In the next two sections, we design the two elements of our proposed depth map sensing framework, namely the sensing codebook and the post-processing, to address these challenges.

VI. DEPTH MAP BASED DESIGN FOR SENSING CODEBOOKS
As discussed in Section V-A, our objective is to design a sensing beamforming codebook that fits the rectangular grid of the depth camera. In this section, we first present our codebook design that achieves this objective. Then, we incorporate a new side-lobe reduction approach to ameliorate the inter-path interference problem.

A. PROPOSED CODEBOOK DESIGN
Since the objective of the beamforming-combining pair codebook design is for the codebook grid to match the desired rectangular grid of a range/depth scene, we start with the relevant camera geometry equations. The scene definition starts by defining the key quantities of the field of view, FoV, and the scene aspect ratio, $A_R$. Let the field of view be centered around the boresight direction of the antenna array. It is worth noting that the separation distance of the camera plane from the antenna array reference point, aka the focal length, is irrelevant in our codebook design. This is based on the observation that the beamforming/combining codebook design normally depends on angles rather than distances.
In a general sense, for any chosen value of focal length, the sensor grid points' coordinates are first calculated to determine the codebook angles accordingly. More specifically, assume that the focal length is set to a certain value, $F_L$. The camera plane width, aka the sensor grid width in the horizontal dimension, $S_H$, and the camera plane height, aka the sensor grid height in the vertical dimension, $S_V$, can be calculated as

$$S_H = 2 F_L \tan\left(\frac{\mathrm{FoV}}{2}\right), \qquad S_V = \frac{S_H}{A_R}.$$

For designing a beamforming-combining pair codebook, let $N_H$ and $N_V$ denote the number of horizontal and vertical beams. Let the x- and z-axes be aligned in the direction of the sensor grid width and height, respectively, and let the y-axis be the direction of the depth. The $(x, y, z)$ rectangular coordinates of the sensor grid points on the camera plane then form the set $\mathcal{C}$ defined in (14), where we note that $|\mathcal{C}| = N_V N_H = M$. After defining the $(x, y, z)$ coordinates of every grid point on the camera plane, their $M$ corresponding $(\theta_z, \theta_x)$ angles with respect to the z- and x-axes can now be calculated using the mapping from rectangular to spherical coordinates, such that

$$\theta_z = \arccos\left(\frac{z}{\sqrt{x^2 + y^2 + z^2}}\right), \qquad \theta_x = \arccos\left(\frac{x}{\sqrt{x^2 + y^2 + z^2}}\right), \qquad (x, y, z) \in \mathcal{C}. \tag{15}$$

Finally, after calculating the $(\theta_z, \theta_x)$ angles for every grid point, the beamforming codebook, $\mathcal{F}$, for an $N_H \times N_V$ transmit UPA is then expressed as in (16) in terms of the vectors $\mathbf{b}_H$ and $\mathbf{b}_V$, where $\kappa = 2\pi/\lambda$ is the wave number, $\lambda$ is the operating wavelength, and $d_s$ is the spacing between adjacent UPA elements in meters. $\mathbf{b}_H \in \mathbb{C}^{N_H \times 1}$ and $\mathbf{b}_V \in \mathbb{C}^{N_V \times 1}$ are the horizontal and vertical basic vectors used for constructing the beamforming codebook. We will call these vectors, $\mathbf{b}_H$ and $\mathbf{b}_V$, the constituent horizontal and vertical beamforming vectors, respectively. In our depth estimation problem, the receive combining codebook, $\mathcal{W}$, can be similarly defined for the $N_H \times N_V$ receive UPA. In this case, the cardinalities of the sets are equal, $|\mathcal{W}| = |\mathcal{F}| = |\mathcal{C}| = |\mathcal{O}| = M$. Further, let $\mathbf{F} \in \mathbb{C}^{N \times M}$ and $\mathbf{W} \in \mathbb{C}^{N \times M}$ be the matrices that consist of the codebook beams of $\mathcal{F}$ and $\mathcal{W}$. Then, the proposed sensing beamforming-combining pair codebook $\mathcal{P}$ can be expressed as

$$\mathcal{P} = \left\{ (\mathbf{f}_m, \mathbf{w}_m) : m \in \{1, \dots, M\} \right\}, \tag{17}$$

where $\mathbf{f}_m$ and $\mathbf{w}_m$ are the $m$-th columns of $\mathbf{F}$ and $\mathbf{W}$.

A comparison between the classical and the proposed beam codebook designs is demonstrated in Fig. 7 for a scene of 100° field of view and 16/9 aspect ratio, using 16 × 16 UPAs. The top figures are the 3D codebook radiation patterns, while the bottom figures are the 2D codebook grids at a plane at 13.32mm depth. As shown, the proposed beam codebook eliminates any grid mismatch distortion.
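The grid-matched codebook construction described in this subsection can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the exact codebook of the paper: the function name `sensing_codebook` is hypothetical, the grid points are assumed to sit at pixel centers of the camera plane, the constituent vectors are assumed to be unit-norm steering vectors with half-wavelength spacing, and the Kronecker ordering (vertical, then horizontal) is an assumption as well.

```python
import numpy as np

def sensing_codebook(fov_deg, aspect_ratio, n_h, n_v,
                     focal_length=1.0, spacing_wl=0.5):
    """Illustrative grid-matched codebook (assumptions noted inline)."""
    # Camera-plane width and height from the field of view and aspect ratio.
    s_h = 2.0 * focal_length * np.tan(np.deg2rad(fov_deg) / 2.0)
    s_v = s_h / aspect_ratio
    # ASSUMPTION: grid points are the pixel centers of an n_v x n_h grid.
    x = (np.arange(n_h) + 0.5) / n_h * s_h - s_h / 2.0   # left to right
    z = s_v / 2.0 - (np.arange(n_v) + 0.5) / n_v * s_v   # top to bottom
    xx, zz = np.meshgrid(x, z)                           # shape (n_v, n_h)
    yy = np.full_like(xx, focal_length)                  # depth axis
    r = np.sqrt(xx**2 + yy**2 + zz**2)
    theta_x = np.arccos(xx / r)                          # angle from x-axis
    theta_z = np.arccos(zz / r)                          # angle from z-axis
    # ASSUMPTION: unit-norm steering vectors; kappa * d_s = 2*pi*spacing_wl.
    kd = 2.0 * np.pi * spacing_wl
    idx_h, idx_v = np.arange(n_h), np.arange(n_v)
    beams = []
    # Row-major sweep: beam index m = v * n_h + h (zero-based).
    for tz, tx in zip(theta_z.ravel(), theta_x.ravel()):
        b_h = np.exp(1j * kd * idx_h * np.cos(tx)) / np.sqrt(n_h)
        b_v = np.exp(1j * kd * idx_v * np.cos(tz)) / np.sqrt(n_v)
        beams.append(np.kron(b_v, b_h))                  # ASSUMED ordering
    F = np.stack(beams, axis=1)                          # shape (n_h*n_v, M)
    return theta_z, theta_x, F
```

Calling `sensing_codebook(100, 16/9, 16, 16)` with two different values of `focal_length` returns the same angles, which illustrates why the focal length is irrelevant to the codebook design.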

B. SIDELOBE REDUCTION APPROACH
As discussed in Section V-B, to rectify the inter-path interference problem, the sensing framework needs to filter out the undesired channel paths. As illustrated in Fig. 6, one type of undesired channel path is the path transmitted from, or received by, the sidelobes of a codebook beam. For this reason, we propose an efficient sidelobe reduction (SLR) approach. In [41], [42], an SLR approach was proposed for low sidelobe beamforming in uniform circular arrays. Inspired by their work, we propose a new efficient sidelobe reduction approach for uniform planar arrays (UPAs) to reduce the beamforming/combining sidelobe levels.
The key idea of this approach is that, by applying different weights to the beamforming/combining vector elements, the sensing framework can shape the beam radiation pattern to increase the power difference between the mainlobe and the sidelobes. Specifically, let $\mathbf{c}_H \in \mathbb{R}^{N_H \times 1}$ and $\mathbf{c}_V \in \mathbb{R}^{N_V \times 1}$ represent the horizontal and vertical weight vectors for sidelobe reduction. Let $\tilde{\mathbf{b}}_H$ and $\tilde{\mathbf{b}}_V$ denote the horizontal and vertical constituent beamforming vectors after sidelobe reduction, i.e., after the element-wise multiplication of $\mathbf{b}_H$ and $\mathbf{b}_V$ with the corresponding weight vectors. The updated beamforming codebook, $\mathcal{F}$, for an $N_H \times N_V$ transmit UPA can then be rewritten in terms of $\tilde{\mathbf{b}}_H$ and $\tilde{\mathbf{b}}_V$, where $\delta_H$, $\delta_V$ denote the sidelobe reduction control variables; the higher the values, the greater the reduction in the sidelobe power levels compared to the mainlobe power level. The updated combining codebook $\mathcal{W}$ can be similarly defined. The beam codebook $\mathcal{P}$ follows the same definition in (17). Fig. 8 illustrates the radiation pattern in dB for one beamforming vector out of the updated beamforming codebook, $\mathcal{F}$, for different values of the sidelobe reduction control variables, $\delta_H$, $\delta_V$. As depicted, increasing the values of the control variables increases the power gap between the mainlobe level and the sidelobe levels. To take into consideration the phase quantization of the RF phase shifters in the AR/VR transceiver architecture previously shown in Fig. 2, we examine the effect of 2-bit phase quantization on the power radiation pattern. The 2-bit discrete phase shift set is $\{0, \pi/2, \pi, 3\pi/2\}$. Fig. 9 compares the normalized power radiation pattern between the case of continuous phase shifts and the case of 2-bit quantized phase shifts. As depicted, the phase quantization affects the beam pattern shape of the sidelobes.
One main advantage of this approach is its computational efficiency; as formulated, only two element-wise multiplications between the weight vectors and the constituent beamforming vectors, $\mathbf{b}_H$, $\mathbf{b}_V$, are needed to update the beam radiation pattern. This multiplication, however, requires an analog beamforming architecture with the capability of changing both the phase and magnitude. In the results section, we only used this SLR-based beam codebook in the simulations of Fig. 24. In future work, it would be interesting to explore phase-only approximations of this SLR-based beam codebook structure. On the downside, reducing the sidelobe levels dramatically increases the beamwidth of the mainlobe, as depicted in Fig. 8. The increased mainlobe beamwidth, however, can be mitigated by the other solutions proposed for rectifying the inter-path interference problem, e.g., the successive interference cancellation (SIC) algorithm and the joint processing (JP) solution, as will be described in the following section.
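The tradeoff between sidelobe level and mainlobe width, and the effect of 2-bit phase quantization, can be illustrated with a one-dimensional sketch. The weights below are a hypothetical Gaussian amplitude taper chosen only for illustration; they are not the weight vectors of the proposed SLR approach, and `delta` here is merely an analogous control knob. All function names are assumptions.

```python
import numpy as np

def taper(n, delta):
    # HYPOTHETICAL weights (not the paper's): Gaussian amplitude taper.
    # delta = 0 gives uniform weights; larger delta tapers the edges more.
    k = (np.arange(n) - (n - 1) / 2.0) / (n / 2.0)
    return np.exp(-delta * k**2)

def pattern_db(weights, steer_cos=0.0, spacing_wl=0.5, n_grid=4001):
    """Normalized power pattern (dB) of a weighted linear array."""
    idx = np.arange(len(weights))
    u = np.linspace(-1.0, 1.0, n_grid)                  # direction cosine
    b = weights * np.exp(1j * 2 * np.pi * spacing_wl * idx * steer_cos)
    a = np.exp(1j * 2 * np.pi * spacing_wl * np.outer(u, idx))
    gain = np.abs(a.conj() @ b) ** 2
    return u, 10 * np.log10(gain / gain.max() + 1e-12)

def mainlobe_and_sidelobe(p_db):
    """Mainlobe width (in grid samples) and peak sidelobe level (dB)."""
    peak = int(np.argmax(p_db))
    right = peak
    while right + 1 < len(p_db) and p_db[right + 1] < p_db[right]:
        right += 1                                      # walk down to first dip
    left = peak
    while left - 1 >= 0 and p_db[left - 1] < p_db[left]:
        left -= 1
    outside = np.concatenate([p_db[:left], p_db[right + 1:]])
    return right - left, float(outside.max())

def quantize_phase(b, bits=2):
    # Snap each phase to the nearest multiple of 2*pi / 2**bits,
    # e.g. {0, pi/2, pi, 3*pi/2} for bits = 2; magnitudes are kept.
    step = 2 * np.pi / 2**bits
    return np.abs(b) * np.exp(1j * step * np.round(np.angle(b) / step))

_, p_uniform = pattern_db(taper(16, 0.0))
_, p_tapered = pattern_db(taper(16, 2.0))
width_uniform, sll_uniform = mainlobe_and_sidelobe(p_uniform)
width_tapered, sll_tapered = mainlobe_and_sidelobe(p_tapered)
```

In this sketch the tapered array shows a lower peak sidelobe level and a wider mainlobe than the uniform array, which mirrors the qualitative behavior described for Fig. 8.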

VII. PROPOSED SCENE RANGE/DEPTH ESTIMATION
In this section, given a pre-designed beamforming/combining codebook, $\mathcal{P}$, we propose an efficient approach for the scene range/depth estimation in AR/VR devices. As depicted in Fig. 4, once the beamforming/combining codebook has been designed, the AR/VR device transmits the sensing signal while sweeping over all the beamforming-combining vector pairs. Specifically, for a beamforming-combining vector pair $(\mathbf{f}_m, \mathbf{w}_m)$, where $m \in \{1, \dots, M\}$, the receive sensing signal, $\mathbf{y}_m \in \mathbb{C}^{N_p + L_d}$, can be modeled as in (6) and (7). After reception, the acquired sensing signals are processed to estimate the range and depth maps, as will be thoroughly explained in this section. Our proposed post-processing solution has three main elements: (i) the use of oversampled/overlapped beams, (ii) the successive interference cancellation based management of inter-target and inter-path interference, and (iii) the joint processing of the signals received using the codebook beams to realize high-resolution and accurate depth maps. Next, we explain these three elements in Sections VII-A to VII-C before presenting the scene range/depth map construction approach in Section VII-D.

A. OVERLAPPED BEAMS
With the objective of increasing the resolution of the mmWave MIMO based depth maps, we propose to adopt oversampled sensing codebooks to scan the surrounding environment. In particular, for the sensing codebook, we adopt the developed codebook in Section VI-A with oversampling factors of F OS H and F OS V in the azimuth and elevation directions. While the oversampled codebook has the potential of enhancing the depth map resolution, it is important to note that advanced post-processing (for the receive signals using these oversampled beams) needs to be incorporated to achieve this goal. The reason mainly goes back to the wide beamwidth (and low spatial resolution) of the codebook beams, which is fundamentally limited by the number of AR/VR antennas. This wide beamwidth leads to a number of challenges: (i) The spatial regions scanned by the oversampled beams have high overlap. This makes it hard to differentiate between the depths of the different objects in the depth map pixels, which challenges the objective of realizing high-resolution depth maps. (ii) Since the codebook beams still have wide beamwidth, the inter-target interference problem discussed in Section V-B still exists.
To address these challenges, we propose a novel post-processing approach based on successive interference cancellation and joint-beam processing. This approach is summarized in two main steps as follows. In the first step, a successive interference cancellation (SIC) based algorithm is used to detect the most dominant channel paths contributing to the range/depth estimation of the region covered by each codebook beam. These paths form a set of candidate ranges/depths for the scene range/depth estimation. In the second step, a joint-beam processing solution selects one range/depth out of the set of candidate ranges/depths formed by the SIC algorithm. These two sequential algorithms are discussed in detail in the following two subsections.

B. SUCCESSIVE INTERFERENCE CANCELLATION
The main goal of the successive interference cancellation (SIC) algorithm is to detect all the dominant paths that might contribute to the range estimation of the region of interest. This is motivated by its good performance in multi-target detection problems [43]. The SIC algorithm is applied in the discrete-time domain and is summarized in Algorithm 1. The algorithm is described as follows. Let the length of the receive sensing sequence $y_m[n]$ be $L_y = N_p + L_d$ symbols, and let $\mathcal{Q}$ be the set of possible delays. First, as shown in Fig. 10, for every codebook beam, the delay position of the maximum cross-correlation magnitude value is detected. Second, the SIC algorithm encodes the transmit preamble signal, shifts it to this delay position, and subtracts it from the received signal. Afterwards, the algorithm repeats itself to detect the second local maximum above the threshold value. Finally, the SIC algorithm stops iterating when all the local maxima above the threshold value are detected. The output of this algorithm is a set of candidate delays for every codebook beam. These sets are passed as input to the next algorithm, the joint processing solution, as will be explained in the next part. Note that the cross-correlation magnitude in Fig. 10 is drawn as a continuous plot for illustration purposes only. The actual cross-correlation magnitude is defined over discrete time delays.
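The detect-and-subtract loop described above can be sketched as follows. This is a simplified illustration, not a reproduction of Algorithm 1: the function name `sic_candidate_delays`, the least-squares gain estimate used when subtracting a path, the `max_paths` cap, and the way the threshold is supplied are all assumptions.

```python
import numpy as np

def sic_candidate_delays(y, s, threshold, max_paths=8):
    """Return candidate delays (in samples) for one codebook beam.

    y: received sequence of length N_p + L_d, s: preamble of length N_p.
    """
    y = np.asarray(y, dtype=complex).copy()
    s = np.asarray(s, dtype=complex)
    n_p = len(s)
    energy = np.vdot(s, s).real
    delays = []
    for _ in range(max_paths):
        # corr[q] = sum_n y[n + q] * conj(s[n]) for every admissible delay q.
        corr = np.correlate(y, s, mode="valid")
        q = int(np.argmax(np.abs(corr)))
        if np.abs(corr[q]) < threshold or q in delays:
            break                          # nothing left above the threshold
        delays.append(q)
        # ASSUMPTION: least-squares estimate of the complex path gain.
        gain = corr[q] / energy
        y[q:q + n_p] -= gain * s           # cancel the detected path
    return sorted(delays)
```

For example, with a random binary preamble and two paths at delays 5 and 17, calling the function with a threshold equal to a quarter of the preamble energy is expected to return both delays as candidates.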

C. JOINT PROCESSING SOLUTION
The purpose of the joint processing (JP) solution between the overlapped beams is to estimate the transitions in depth/range maps more accurately. The proposed JP solution is summarized in Algorithm 2. The algorithm is described in detail as follows. First, the JP solution works on the candidate delay sets output by the SIC algorithm, $\{\mathcal{T}_m\}_{m=1}^{M}$, to choose one range estimate out of each candidate delay set. This processing, however, is carried out relative to the 2D codebook grid, as illustrated in Fig. 11. Following this notion, the linear indices in $\mathcal{T}_m$ are now converted into matrix subscripts $\mathcal{T}_{h,v}$ through the transformation $m = (v - 1) N_H + h$, such that $\mathcal{T}_m = \mathcal{T}_{h,v}$, where $v$ is the elevation beam index (vertical grid index) and $h$ is the azimuth beam index (horizontal grid index). The objective is to calculate the scene range estimates across all beam directions, $\rho^{h,v}_{\mathrm{SRE}}$, $\forall h, v$.
As shown in Fig. 11, the JP solution sweeps from left to right, then from top to bottom. For each grid point, the JP solution uses (i) the set of the current grid point, named the ''current set'', and (ii) the sets of the previous adjacent grid points to construct a ''common adjacent set''. This common adjacent set is the union of the sets of all previous adjacent grid points. Then, to investigate whether a new object/surface transition appears, the current set is compared with the common adjacent set to detect if there is any set difference. This is based on the notion that the difference set likely contains the new edges that appear in the range map while sweeping. If the set difference is not empty, then the solution chooses the path with the least time-of-flight from the set difference. Otherwise, if the set difference is empty, then the solution chooses the path with the least time-of-flight from the current set.

Algorithm 2 (joint processing solution), recoverable steps:
3: for h = 1 to N_H do
4: Construct the common adjacent set
If the difference set is not empty: choose the least delay from the difference set
Otherwise: choose the least delay from the candidate set
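The sweep described above can be sketched as follows. This is an illustrative reading of the text rather than a transcription of Algorithm 2: the function name `joint_processing` is hypothetical, and the choice of ''previous adjacent grid points'' (left, top-left, top, and top-right neighbors) is an assumption.

```python
def joint_processing(candidates):
    """Pick one delay per grid point from the SIC candidate sets.

    candidates[v][h] is an iterable of candidate delays for grid point
    (h, v); the result has the same layout (None for an empty set).
    """
    n_v = len(candidates)
    n_h = len(candidates[0])
    chosen = [[None] * n_h for _ in range(n_v)]
    for v in range(n_v):                      # top to bottom
        for h in range(n_h):                  # left to right
            current = set(candidates[v][h])
            # ASSUMPTION: the previously visited neighbors are the left,
            # top-left, top, and top-right grid points.
            common = set()
            for dv, dh in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                vv, hh = v + dv, h + dh
                if 0 <= vv < n_v and 0 <= hh < n_h:
                    common |= set(candidates[vv][hh])
            difference = current - common
            pool = difference if difference else current
            chosen[v][h] = min(pool) if pool else None
    return chosen
```

As a usage example, a single row of candidate sets `[{20}, {20}, {10, 20}, {10, 20}, {10}]`, where a closer surface enters the scene at the third beam, yields `[20, 20, 10, 10, 10]`: the new delay is picked up at the first grid point where it appears.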

D. RANGE/DEPTH MAP CONSTRUCTION
In this section, we formulate the depth map construction approach, the last step in Fig. 4. To summarize the overall procedure, the mmWave MIMO sensing based range/depth map estimation framework is outlined in Algorithm 3. The algorithm steps are summarized as follows.
Step 1 refers to the design of the beamforming-combining pair codebook $\mathcal{P}$, which was covered in Section VI.
Step 4 refers to the successive interference cancellation described in Section VII-B.
Step 5 refers to the joint processing solution detailed in Section VII-C. After that, in Step 6, the fine range estimate can be calculated, such that $\rho^{m}_{\mathrm{MC}} = \rho^{m}_{\mathrm{SRE}} + \Delta\rho^{m}$, where $\Delta\rho^{m}$ is computed from the algorithm described in Section IV-B2. Next, after calculating the range estimates, the upcoming steps (Steps 7 and 8) focus on constructing the range and depth maps. Note that the range of an object is actually the radial distance in spherical coordinates. Fortunately, the $(x, y, z)$ rectangular coordinates of the sensor grid points on the camera plane were already calculated for the design of the beamforming-combining pair codebook using (14). These rectangular coordinates in (14) can then be converted to spherical coordinates, such that

$$\mathcal{S} = \left\{ (\theta_z, \phi) : \theta_z = \arccos\left(\frac{z}{\sqrt{x^2 + y^2 + z^2}}\right),\ \phi = \arccos\left(\frac{x}{\sqrt{x^2 + y^2}}\right),\ (x, y, z) \in \mathcal{C} \right\}.$$

In order to construct the matrices for the range and depth maps, let $\boldsymbol{\Theta}, \boldsymbol{\Phi} \in \mathbb{R}^{N_V \times N_H}$ be the matrices that represent the angles of the spherical coordinates $(\theta_z, \phi) \in \mathcal{S}$, respectively. Following Step 6 in Algorithm 3, the range map estimate $\mathbf{R}_{\mathrm{map}} \in \mathbb{R}^{N_V \times N_H}$ can be expressed as

$$[\mathbf{R}_{\mathrm{map}}]_{v,h} = \rho^{m}_{\mathrm{MC}},$$

where $m \in \{1, \dots, M\}$ and $m = (v - 1) N_H + h$. Given the angles in spherical coordinates and the range map estimate, the depth map estimate $\mathbf{D}_{\mathrm{map}} \in \mathbb{R}^{N_V \times N_H}$ can then be expressed as

$$\mathbf{D}_{\mathrm{map}} = \mathbf{R}_{\mathrm{map}} \odot \sin(\boldsymbol{\Theta}) \odot \sin(\boldsymbol{\Phi}),$$

where $\odot$ denotes the element-wise product and the sine is applied element-wise. Finally, since the range and depth map resolutions are set to $N_H \times N_V$, two-dimensional image interpolation can be employed to scale the maps to the desired resolutions, $M_h \times M_w$. Examples of interpolation methods are the nearest neighbor interpolation and the bicubic interpolation. Although the bicubic interpolation is likely the method of choice for higher estimation accuracy, the nearest neighbor interpolation is more computationally efficient. In the simulation results of Section VIII, we evaluate the two interpolation approaches for our mmWave MIMO based depth map construction problem.

Algorithm 3 mmWave MIMO Sensing Based Range/Depth Estimation Framework
Inputs: Field of view FoV, aspect ratio $A_R$, number of horizontal and vertical beams $N_H$, $N_V$.
Outputs: Range map $\mathbf{R}_{\mathrm{map}}$, depth map $\mathbf{D}_{\mathrm{map}}$.
1: Design the beamforming-combining pair codebook, $\mathcal{P}$, following Section VI.
2: for m = 1 to M do (for each pair $(\mathbf{f}_m, \mathbf{w}_m)$)
3: Acquire receive sensing signal, $y_m[n]$, as in (6)
4-8: Successive interference cancellation (Step 4), joint processing (Step 5), fine range estimation (Step 6), and range/depth map construction (Steps 7 and 8), as described above.
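The range-to-depth conversion and the nearest neighbor scaling discussed in this subsection can be sketched as below. The sketch assumes that depth is the y-coordinate of each point, that the azimuth angle is measured from the x-axis in the x-y plane, and that both helper names (`range_to_depth`, `nearest_neighbor_resize`) are hypothetical; bicubic interpolation is not shown.

```python
import numpy as np

def range_to_depth(range_map, theta_z, phi):
    # Depth is taken as the y-coordinate: y = r * sin(theta_z) * sin(phi),
    # with theta_z measured from the z-axis and phi from the x-axis (ASSUMED).
    return range_map * np.sin(theta_z) * np.sin(phi)

def nearest_neighbor_resize(img, out_h, out_w):
    # Map each output pixel center to the input pixel that contains it.
    in_h, in_w = img.shape
    rows = ((np.arange(out_h) + 0.5) * in_h / out_h).astype(int)
    cols = ((np.arange(out_w) + 0.5) * in_w / out_w).astype(int)
    return img[np.ix_(rows, cols)]

# Sanity check on a flat wall 7 m in front of the device: every grid
# direction sees a different range, but the same depth.
x, z = np.meshgrid(np.linspace(-3.0, 3.0, 5), np.linspace(2.0, -2.0, 4))
y = 7.0
r = np.sqrt(x**2 + y**2 + z**2)
theta_z = np.arccos(z / r)
phi = np.arccos(x / np.sqrt(x**2 + y**2))
depth = range_to_depth(r, theta_z, phi)
depth_scaled = nearest_neighbor_resize(depth, 1080, 1920)
```

The flat-wall check shows why the range map and the depth map differ: the range grows toward the image corners, while the depth stays constant at 7 m.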

VIII. SIMULATION RESULTS
In this section, we evaluate the performance of the proposed mmWave based depth estimation approach. First, we describe the adopted simulation framework in Section VIII-A before extensively studying the estimation accuracy of the proposed approach under various scenarios and system parameters. The presented simulation results can be useful for various applications, including AR/VR devices, smart home devices, and autonomous driving devices.

A. SIMULATION FRAMEWORK
Since the depth estimation heavily depends on the environment under test, it is crucial to evaluate the performance of the proposed solution based on realistic channels. This motivates using channels generated by accurate ray-tracing to capture the sensing dependence on the environment geometry, scatterers' materials, AR/VR position, etc. This is why we designed the simulation models using Remcom Wireless InSite [16], which is an accurate 3D ray-tracing simulator. Further, to efficiently incorporate diffuse scattering models, we need highly detailed floor plans with a sufficient number of faces. To achieve this objective, we resorted to the high-fidelity game engine, Blender [17], to build accurate floor plans. These plans/models are then exported to Wireless InSite to obtain the ray-tracing outputs, and finally to MATLAB to construct the channel models in (4) and implement the proposed depth estimation approach. The proposed evaluation framework is illustrated in Fig. 12. For benchmarking, we also use the Blender floor plans to obtain the ground truth depth maps, which are essential to evaluate the accuracy of our solutions. The ground truth maps are generated by placing a Blender camera at the same position as the UPA reference antenna element, and adjusting the Blender camera parameters to capture the same field of view.

1) SIGNAL MODEL
We adopt the signal model described in Section II with a focus on the sensing system performance. The AR/VR device is assumed to be fixed in position. Unless otherwise mentioned, the UPA size is 16 × 16 antennas (N H = N V = 16) at the mmWave 60GHz operating band with transmission bandwidth of 2GHz. The antenna elements have a gain of 0dBi with half-wavelength antenna spacing. The transmit power is set to 30dBm. The preamble sequence is the same as the one in the single carrier PHY packet preamble of the IEEE 802.11ad standard (3328 symbols). M preamble sequences are used to sense the environment via M beamforming-combining pairs. For the sake of calculating a rough estimate of the time allocated for environment sensing through transmission and reception, assume that all the M preamble sequences are transmitted sequentially with guard intervals in between. The highest M value reported in the upcoming simulation results is 4096 beams. Assuming a sampling rate of 2Gsps, the estimate of the longest sensing time is then ≈ 7ms.
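The rough sensing time quoted above follows from simple arithmetic, sketched below. The sketch assumes one preamble symbol per sample at 2Gsps and ignores the guard intervals, whose length is not specified here; the variable names are illustrative.

```python
# Rough sensing-time estimate for the largest codebook in the simulations.
n_beams = 4096             # highest M reported in the results
preamble_symbols = 3328    # IEEE 802.11ad single carrier PHY preamble length
sample_rate_hz = 2e9       # ASSUMPTION: one symbol per sample at 2 Gsps

# Guard intervals between preambles are ignored in this estimate.
sensing_time_s = n_beams * preamble_symbols / sample_rate_hz
print(f"sensing time: {sensing_time_s * 1e3:.2f} ms")
```

This gives about 6.8 ms, consistent with the ≈ 7ms figure once guard intervals are accounted for.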

2) CHANNEL GENERATION
The channel matrix, $\mathbf{H}_d$, is generated in two steps. The first step is generating the channel rays using the ray-tracing software, Wireless InSite. The Wireless InSite propagation model is set to 'X3D' with 0.1° ray-spacing and with diffuse scattering enabled. Up to three reflections, one diffraction, and one transmission are allowed for each ray in the Wireless InSite simulation. The diffuse scattering model used is ''directive with backscatter''; this model is fixed across all materials in all the testing scenarios. The chosen diffuse scattering model creates two scattering lobes: a forward lobe of diffuse scattered power centered on the direction of specular reflection and a backward lobe centered on the opposite direction of incidence. The diffuse scattering parameters of the different materials are summarized in Table 1. The values reported in Table 1 follow the ITU default parameter values at 60GHz. The second step in the sensing channel generation is calculating the delay-d channel matrix out of the channel paths using the DeepMIMO dataset generation code [44]. Using these channels and following (3) and (4), the noisy receive sensing sequences are generated. The noise power is calculated based on a 2GHz bandwidth and a receiver noise figure of 7dB.
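For reference, the noise power implied by a 2GHz bandwidth and a 7dB noise figure can be computed with the standard thermal noise formula. The reference temperature of 290 K is an assumption, since the text does not state it, and the function name is illustrative.

```python
import math

def noise_power_dbm(bandwidth_hz, noise_figure_db, temperature_k=290.0):
    """Thermal noise power k*T*B in dBm plus the receiver noise figure."""
    boltzmann = 1.380649e-23                      # J/K
    thermal_w = boltzmann * temperature_k * bandwidth_hz
    return 10.0 * math.log10(thermal_w / 1e-3) + noise_figure_db

# ASSUMPTION: 290 K reference temperature.
noise_dbm = noise_power_dbm(2e9, 7.0)
```

Under this assumption the noise power is close to −74dBm.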

3) mmWave BASED DEPTH ESTIMATION PARAMETERS
The beamforming-combining pair codebook is designed based on a 100° field of view centered on the antenna array boresight, a 16/9 scene aspect ratio, and horizontal and vertical oversampling factors of unity. The ground truth depth maps are generated from Blender using a Blender camera with a 100° field of view and a focal length of 13.43mm, corresponding to a sensor width of 32mm. The ground truth depth map image quality is set to 1080p resolution; i.e., 1920 × 1080 pixels. Concerning the massive correlator, $f_{\mathrm{est}}$ is set to 100 times the sampling frequency $f_S$; i.e., $\delta = f_{\mathrm{est}} / (2 f_S) = 50$. Unless mentioned otherwise, the massive correlator is adopted for range estimation. Throughout this paper, two performance metrics are used: (i) the root-mean-square error (RMSE) between the estimated map and the ground truth map, to indicate the standard deviation of the estimation error, and (ii) the mean absolute error (MAE), to denote the expected value of the estimation error. The two metrics are defined in (9). Next, we evaluate the performance of our proposed mmWave MIMO depth estimation approach in four main scenarios: (i) a one wall scenario in Section VIII-B, (ii) a two walls scenario in Section VIII-C, (iii) a room with two pillars scenario in Section VIII-D, and (iv) a conference room scenario in Section VIII-E.
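The two error metrics can be computed from a pair of maps as sketched below. This assumes that both metrics are averaged over all pixels of maps with equal resolution, which is the usual convention; the exact definitions are those of (9), and the function names are illustrative.

```python
import numpy as np

def rmse(estimated_map, ground_truth_map):
    # Root-mean-square error over all pixels (ASSUMED averaging convention).
    err = np.asarray(estimated_map) - np.asarray(ground_truth_map)
    return float(np.sqrt(np.mean(err**2)))

def mae(estimated_map, ground_truth_map):
    # Mean absolute error over all pixels.
    err = np.asarray(estimated_map) - np.asarray(ground_truth_map)
    return float(np.mean(np.abs(err)))

truth = np.full((2, 2), 7.0)                          # flat wall at 7 m
estimate = truth + np.array([[0.1, -0.1], [0.3, -0.3]])
```

Because the RMSE weights large errors more heavily, it is never smaller than the MAE, which matches the pattern of the values reported in the following subsections.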

B. ONE WALL SCENARIO
The one wall scenario consists of an AR/VR transceiver facing a wall in free space propagation. Unless otherwise mentioned, the separation distance between the wall and the transceiver is 7 meters and the wall building material is concrete. In Fig. 13, we show the estimated range and depth maps for the one wall scenario compared to the ground truth maps. Fig. 13(a) and Fig. 13(b) show that the range map estimation error has an average MAE of 0.098m and RMSE of 0.127m. Further, Fig. 13(c) and Fig. 13(d) show that the depth map estimation error has an average MAE of 0.12m and RMSE of 0.153m. Overall, these figures show that the proposed approaches can accurately estimate the range/depth maps for a wall at 7m distance from the AR/VR device with around 10cm error, which highlights the effectiveness of this approach.
Impact of the Important System Parameters: Next, we briefly evaluate the impact of the various system parameters on the performance of the proposed mmWave depth map estimation solution.
• Number of antennas and sensing codebook beams: In Fig. 14, we plot the estimated and ground-truth depth maps for a different number of antennas and codebook oversampling factors. As illustrated, the depth estimation accuracy can generally improve by increasing the number of antennas and/or the codebook oversampling factors. This comes with the cost of deploying more antennas at the AR/VR device or employing more beams, which translates to a longer sensing time. In Fig. 15, we plot the estimated and ground-truth depth maps for different antenna configurations using the same number of antenna elements. As depicted, the depth estimation accuracy depends on the UPA configuration, with the best configuration being the 6 × 4 UPA because of its closeness to the 1080p aspect ratio.
• RF phase shift quantization: As previously described in Section VI-B, the phase quantization of the RF phase shifters in the AR/VR transceiver architecture produces a noticeable change in the radiation pattern shape of the sidelobes. To examine the effect of this phase quantization on the estimated depth maps, Fig. 16 compares the estimated depth maps for two cases of the RF phase shifters at the AR/VR device: (a) continuous phase shifts and (b) 2-bit quantized phase shifts. As depicted, the phase quantization has a small negative impact on the depth map estimation accuracy for the one wall scenario at a separation distance of 7 meters.
• Transmit sensing power: In Fig. 17(a), we investigate the effect of changing the transmit power on the depth map estimation accuracy. The SNR value of 0dB corresponds to a transmit power of 15dBm. This figure shows that a transmit power of 5dBm (SNR of −10dB) could be sufficient to reach around 10cm error for the depth estimation accuracy.
• Preamble sequence length: The estimation error versus transmit power is depicted in Fig. 17(b) for different values of preamble sequence lengths, namely preambles with 50, 100, 1000, and 3000 symbols. As shown in this figure, increasing the preamble sequence length improves the depth estimation accuracy at the expense of increased sensing time and post-processing complexity.
• Separation distance between the AR/VR device and the facing wall: Fig. 18 investigates the impact of increasing the depth value on the depth estimation accuracy. As shown in this figure, the larger the distance between the AR/VR device and the facing surface, the larger the error in the depth estimate, which is expected. This figure also highlights some advantage for the bicubic interpolation compared to the other interpolation methods.
• The surface material: Now, we evaluate the performance of the proposed approach for different surface materials. More specifically, we summarize in Table 2 the range map MAE for different candidates of the wall material. Overall, we can notice some correlation between the estimation accuracy and the scattered to incident power ratio property of the materials, which are summarized in Table 1.

C. TWO WALLS SCENARIO
The two walls scenario consists of one AR/VR device facing two walls in free space propagation, as depicted in Fig. 19(a). The separation distance between the front wall and the AR/VR device is 1m while the separation between the back wall and the AR/VR device is 2m. The walls' building material is concrete. Each wall consists of 2,048 faceted faces, and each face contributes at most one backscattered ray. The purpose behind studying this scenario is to test the alignment of the estimated map compared to the ground truth depth map. The results of this test are illustrated in Fig. 19(b), where the estimated range and depth maps are compared to the ground truth maps. As shown in Fig. 19(b), the two edges of the front wall in the estimated maps align reasonably well with those displayed in the ground truth maps. This highlights the promising performance of the proposed mmWave based depth estimation solution.

D. A ROOM WITH TWO PILLARS
In this scenario, we consider a 5m×5m room where one AR/VR device is centered at the front door of the room, as depicted in Fig. 20. The room consists of a concrete floor plan with two wood pillars in the middle of the room. The wood pillars are at 2 meters distance from the AR/VR transceiver. The floor plan consists of 15,488 faceted faces whereas each of the wood pillars consists of 3,072 faceted faces. Note that the ceiling of the floor plan is set to the invisible mode for visibility purposes only. For the estimation error assessment of the indoor space scenario, Fig. 21 shows the comparison between estimated and ground truth maps for 16 × 16 UPAs with codebook oversampling factors of four in both the azimuth and elevation dimensions. First, Fig. 21(a) and Fig. 21(b) show the estimated and ground truth range maps, which have an MAE of 0.139m and an RMSE of 0.355m. For the depth maps, Fig. 21(c) and Fig. 21(d) represent 1080p maps with an estimation error of (i) 0.126m for the MAE and 0.356m for the RMSE with nearest neighbor interpolation, and (ii) 0.123m for the MAE and 0.328m for the RMSE with bicubic interpolation. Observing the difference between the maps, the mmWave system reasonably recovers most of the depth information of the scene with a low codebook resolution (16 × 16) compared to the ground truth 1080p resolution. With narrower transmit and receive beams, i.e., more antenna elements, the estimation accuracy is expected to further improve. The depth map estimation accuracy for this scenario is also evaluated at different SNRs in Fig. 22. In this figure, we adopt the model and system parameters used in Fig. 21 with 16×16 UPAs and oversampling factors of four. It is also worth mentioning that 0dB SNR corresponds to −20dBm transmit power in our setup. As shown in Fig. 22, the estimated depth maps have an MAE of almost 10cm at 0dB, which highlights the promising performance of our proposed depth map estimation approach at relatively low SNRs and in an indoor room with several surfaces and different materials. This will be further emphasized in the following subsection.

E. CONFERENCE ROOM SCENARIO
In this scenario, we consider the conference room shown in Fig. 23. The ceiling of the indoor space is set to the invisible mode for visibility purposes only. The 10m×10m indoor space has a 6m×6m conference room with glass walls. The indoor space walls are made of layered drywall, the ceiling is made of ceiling board, and the floor is made of floorboard. The conference room chairs and tables are made of wood. The conference room door opening is 1m in width and 2.7m in height. The number of facets for each item in the indoor space is as follows: 2,048 facets for the layered drywall, 2,048 facets for the floorboard, and 2,048 facets for the ceiling board. In addition, the number of facets for each item in the conference room is as follows: 1,568 facets for the glass wall, 4,446 facets for the table, and 21,192 facets for the office chairs. The conference room scenario consists of two AR/VR devices for the two scenes under study: the first device is centered at the front door of the conference room while the second device is placed outside the conference room facing the other glass facet. The scenes captured by the AR/VR camera for the two cases are shown in Fig. 23(b) and Fig. 23(c).
One main motivation for leveraging mmWave MIMO to estimate the depth maps (compared to RGB based depth estimation approaches) is the expected higher efficiency in detecting transparent and dark objects. In Fig. 24, we compare our mmWave MIMO based depth estimation approach with the RGB based depth estimation approach, detailed in [2], for the two considered conference room scenes. It is worth emphasizing here that the algorithms in [2] achieve good depth accuracy when tested on the NYU depth V2 dataset [45]. As shown in Fig. 24, the mmWave MIMO based estimator outperforms the RGB based estimator in recognizing transparent and dark objects. For the first scene, the glass wall was not detected by the RGB estimator. Also, in a scene with low illumination, the mmWave MIMO based estimator remains robust in its estimation accuracy compared to the RGB based estimator. Panels 1(c) and 2(c) of Fig. 24 were generated with the aid of the SLR approach in Section VI-B, with $\delta_H = 2$, $\delta_V = 3$. For this reason, the depth maps constructed by the mmWave MIMO system seem coarser than those constructed by RGB cameras, which can be resolved using morphological image processing operations, e.g., the erosion operation. As for the second scene, the RGB based estimator is unable to detect the transparent glass, unlike the mmWave MIMO based estimator. Interestingly, even though the glass scattering ratio is 0% based on Table 1, the conference room glass wall is partially recovered by the mmWave MIMO based estimator because of the boresight reflection path. This makes the wireless AR/VR experience safer by providing the ability to detect transparent surfaces. All these promising results highlight the potential of leveraging the proposed mmWave MIMO based depth map estimation approaches for immersive AR/VR experience.

IX. CONCLUSION
In this paper, we considered the problem of estimating accurate depth maps for AR/VR devices, which is an essential goal for immersive mixed-reality experience. For this problem, we proposed leveraging the mmWave communication systems that are deployed on the AR/VR devices to estimate and build high-resolution depth maps. We formulated the communication-constrained depth map sensing problem and proposed a comprehensive framework for realizing this objective. The proposed framework includes (i) the construction of depth map specific sensing codebooks using practical mmWave antenna arrays and (ii) the development of efficient post-processing solutions for jointly processing the receive signals from the multiple sensing beams and estimating high-resolution depth maps. Simulations using accurate 3D ray-tracing models confirmed the promising accuracy of our proposed mmWave based depth map estimation approach in various environment scenarios. In particular, the results show that the proposed approach can construct relatively high-resolution depth maps with less than 10cm error using practical mmWave systems. This highlights the potential of leveraging this solution to complement RGB-D based depth maps and realize immersive depth perception for wireless virtual/augmented reality systems.