HUMAN4D: A Human-Centric Multimodal Dataset for Motions and Immersive Media

We introduce HUMAN4D, a large and multimodal 4D dataset that contains a variety of human activities simultaneously captured by a professional marker-based MoCap, a volumetric capture and an audio recording system. By capturing 2 female and 2 male professional actors performing various full-body movements and expressions, HUMAN4D provides a diverse set of motions and poses encountered as part of single- and multi-person daily, physical and social activities (jumping, dancing, etc.), along with multi-RGBD (mRGBD), volumetric and audio data. Despite the existence of multi-view color datasets captured with the use of hardware (HW) synchronization, to the best of our knowledge, HUMAN4D is the first and only public resource that provides volumetric depth maps with high synchronization precision due to the use of intra- and inter-sensor HW-SYNC. Moreover, a spatio-temporally aligned scanned and rigged 3D character complements HUMAN4D to enable joint research on time-varying and high-quality dynamic meshes. We provide evaluation baselines by benchmarking HUMAN4D with state-of-the-art human pose estimation and 3D compression methods. For the former, we apply 2D and 3D pose estimation algorithms both on single- and multi-view data cues. For the latter, we benchmark open-source 3D codecs on volumetric data respecting online volumetric video encoding and steady bit-rates. Furthermore, qualitative and quantitative visual comparison between mesh-based volumetric data reconstructed in different qualities showcases the available options with respect to 4D representations. HUMAN4D is introduced to the computer vision and graphics research communities to enable joint research on spatio-temporally aligned pose, volumetric, mRGBD and audio data cues. The dataset and its code are available at https://tofis.github.io/myurls/human4d.

technologies, comprise the core elements for human-centric 4D media production, a domain essential in several technological and industrial sectors.
On the one hand, these technologies constitute key elements in immersive experiences that provide remote virtual presence and co-presence (e.g. XR conferencing [2], XR museums [3], etc.). The experiences are further enhanced by augmenting the virtual and immersive worlds with photorealistic representations that enable highly natural and realistic audiovisual communication between multiple users.
The advancement of shape and motion computer vision techniques, the development of immersive media technologies, as well as the interest of the industry in human-centric 4D media production, rapidly increase the need for large, high-quality datasets that will act as cornerstones for their continuous development, also enabling their joint evolution. Nevertheless, at the moment, only a few datasets exist, and each of them focuses only partially on some aspects of these challenging tasks.
On top of that, several computer vision methods approach 3D/4D research tasks from monocular or HW-SYNCed multi-view color (i.e. 2D) streams. However, by definition, 2D data cannot capture the intricacies of 3D/4D shape or form, at least not to the extent that volumetric data can. This reliance on 2D streams is probably due to the lack of HW-SYNCed depth/volumetric data in public resources. For instance, the lack of HW-SYNCed volumetric data along with ground-truth 3D poses for supervision precludes data-driven approaches to 3D pose estimation from volumetric data.
To this end, we create HUMAN4D, a dataset that fills these gaps by providing professional motion capture along with volumetric data captured in 3D character and mesh- and point-based volumetric representations. In particular:
• We introduce a publicly available 4D dataset containing a large corpus of annotated spatio-temporally aligned multi-view RGBD (mRGBD), volumetric and motion capture data, in order to enable extensive research on several computer vision and graphics topics.
• To the best of our knowledge, HUMAN4D is the first dataset that provides HW-SYNCed mRGBD frames along with marker-based motion capture and audio data cues, with the use of recent consumer-grade depth sensing devices, cutting-edge optical motion capture technologies and body-worn audio recording, respectively.
• We provide pose estimation baselines by applying data-driven 2D and 3D pose estimation algorithms on single- and multi-view data sequences, along with insights with respect to the advantages of HUMAN4D for training such methods.
• We perform and report a detailed study on volumetric data compression using 3D codecs, examining the rate distortion from several perspectives, while respecting online volumetric video encoding and steady bit-rates.
• We conduct and report objective visual quality evaluation on various volumetric representations, i.e. mesh-based volumetric data evaluation across various reconstruction qualities.
The remainder of this paper is organized as follows: Sec. II overviews related datasets that include 4D data of a similar nature; Sec. III describes the HUMAN4D dataset in detail, covering its creation and statistics; Sec. IV benchmarks data-driven 2D and 3D pose estimation models on HUMAN4D; Sec. V benchmarks 3D codecs and compares mesh-based 4D representations with respect to visual quality using well-known objective metrics; in Sec. VI, we discuss the impact of this dataset on the research community and beyond; finally, Sec. VII concludes the paper and discusses future work.

II. RELATED WORK
Over the past few decades, the computer vision research community has shown an increased interest in virtual human related technologies. A variety of traditional and learning-based computer vision methods target open research problems using motion, volumetric, image and action-based data. In this section, we discuss relevant datasets [37]-[42], providing details and explaining the nature of the data they offer to the research community. A brief overview of these datasets follows, while Table 1 summarizes their features and modalities.
MHAD [37]: One of the first publicly available datasets offering MoCap and RGBD data is (Berkeley) MHAD. The MHAD dataset contains spatio-temporally aligned data cues captured with a professional MoCap system with active markers [43] along with 12 RGB and 2 MS Kinect v2 (RGBD) cameras, 6 wearable inertial sensors (accelerometers only) and 4 microphones, recording the audio signals during the performance of the actions. The dataset consists of 659 data sequences from 11 human actions performed by 12 subjects. Although MHAD enables research on multi-view pose estimation and beyond, it includes only 2 MS Kinect v2 devices, which are not HW-SYNCed, resulting in spatio-temporal offsets between the deprojected depth maps
[6], HUMBI provides mesh-based 3D geometry of the subjects along with their respective texture atlases. For HUMBI, the use of depth sensors was out of scope, thus multi-view depth sensing was not considered.
HUMAN4D aims to address the gaps in existing, publicly available 4D datasets. HUMAN4D consists of a large corpus of spatio-temporally aligned mRGBD, volumetric and motion capture data cues, providing high synchronization precision between the multiple RGBD streams by exploiting the HW-SYNC capabilities of the sensors. On top of that, HUMAN4D contains (social) activities between two subjects, enabling research on challenging computer vision tasks under the multi-person aspect (e.g. occlusions, multiple person instances in the field of view, larger volumetric areas, etc.).
HUMAN4D is meant to provide the computer vision research community with data that will enable the research and development of novel approaches in intensively active human-centric research domains. It is worth noting that the consumer-grade depth sensing devices used for the RGBD data capturing are commercially available, allowing the experimentation and development of computer vision algorithms applicable even for production purposes.

A. 4D CAPTURING SETTING
The capturing of the dataset took place in a professional motion capture studio (Artanim Foundation) where, beyond the motion capture system, special portable equipment for volumetric capturing was set up, as depicted in Fig. 1. In particular, 24 motion capture (MoCap) cameras along with 4 stereo-based depth sensors and microphones using HW and software (SW) synchronization (see Sec. III-C1 for details) were used to capture the whole dataset. All 24 motion capture cameras were rigged on the walls to maximize the effective experimentation volume. The high number of motion capture cameras (24) increases the accuracy of the motion capture by reducing occlusions, thereby providing high-precision ground-truth poses for the dataset. The actual capturing space was set in an area of approximately 4 m × 4 m so that the bodies of the actors were at least partially in the field-of-view of the RGBD cameras during the performances. These cameras were placed at the 4 corners of the stage in a cross schema. The floor-plan of the whole capturing setup is illustrated in Fig. 2. Finally, a 3D body scanner was used to obtain an accurate 3D mesh-based volumetric model of one of the actors.
FIGURE 1: Pictures taken during the preparation and capturing of the HUMAN4D dataset (in Artanim's facilities). The room is equipped with 24 Vicon MXT40S cameras rigidly placed on the walls, a portable volumetric capturing system with 4 Intel RealSense D415 depth sensors temporarily set up to capture the RGBD data cues and wearable microphones for the actors.

B. DATASET CREATION
For the creation of the dataset, 4 professional actors, 2 female and 2 male, were recruited in order to pursue the highest possible quality of the captured actions with respect to the authenticity of the performances. Within HUMAN4D, excluding the post-processing products (i.e. volumetric data), we captured and introduce the following:
• Multimodal data of 14 single-person and 5 two-person actions (19 in total), including physical exercises, daily and social activities, totalling 56 single-person and 10 two-person sequences, respectively. Table 2 lists the details of the HUMAN4D activities.
• Projection matrices and external calibration camera parameters retrieved using an anchor-based calibration method to reduce pairwise accumulating errors, enabling 2D projection of 4D data to the various camera views and vice versa.
• 30 audio cues for some of the activities where the actors had to talk and act based on specific scripts and scenarios (see Table 2).
• Synchronization between the modalities by providing timestamped data.
• 1 scanned and rigged 3D model of one of the professional actors.
• A set of benchmarks to facilitate comprehensive evaluation of 2D and 3D pose estimation methods, along with evaluation of volumetric video production and compression quality.
In the following, we describe in detail the modalities we used and the techniques we applied to capture and create the dataset.

1) SPATIO-TEMPORALLY ALIGNED mRGBD CAPTURE
To the best of our knowledge, HUMAN4D is the first publicly available dataset that offers HW-synchronized multi-view RGBD data captured in a real-time manner. Most of the existing datasets use synchronized RGB cameras [38] or previous versions of Microsoft Kinect for RGBD capturing [39], which do not support HW triggering and therefore require SW-based soft synchronization solutions. In HUMAN4D, we instead use the Intel RealSense D415 sensor, which offers this functionality [45]. D415 sensors can be configured in either master or slave synchronization mode, eliminating the need for external HW triggering when connected in a device cluster. One device can be set as "master", providing the synchronization signal, and the rest as "slaves" that receive and follow it. The impact of HW-SYNCed mRGBD capture for volumetric- and pose-related tasks is depicted in Fig. 6, where point-clouds extracted by deprojecting mRGBD frames from HUMAN4D and CMU [39] are compared, showcasing the improved temporal alignment of the HW-SYNCed HUMAN4D data against the CMU data. It is worth noting that CMU is currently the only existing dataset that provides synchronized depth maps by means of a HW modification on the Kinect v2 devices.
Regarding depth capturing, the sensors were used in "high accuracy" mode, which keeps only the high-confidence depth estimates and therefore produces accurate but sparse depth data. We also configured the spatial filtering and exposure adjustment capabilities of the sensors to capture the best possible depth quality. We captured the mRGBD data using the capturing system proposed by Sterzentsenko et al. [46], while spatial alignment between the sensors was achieved using the multi-sensor calibration schema proposed by Papachristou et al. [47]. HW-SYNCed mRGBD samples are depicted in Fig. 3.

2) 3D SCANNED AND RIGGED CHARACTER
To obtain an animatable mesh, one of the actors was scanned using a custom photogrammetry-based body scanning rig (Fig. 4). The rig consisted of 96 Canon Powershot A1400 cameras controlled using SW based on the Canon Hack Development Kit (CHDK) [49]. Lighting was provided by LED strips mounted on the rig. All cameras were triggered in a synchronized manner. To aid the photogrammetric reconstruction of the bodyscan, the dark MoCap suit worn by the actor was temporarily augmented with colored paper markers, which were removed before the MoCap process.
FIGURE 4: Using a custom photogrammetry rig with 96 cameras, photos were taken of the actor (left) and reconstructed into a 3D textured mesh using Agisoft Metashape [48] (right).
Using a commercial photogrammetry SW tool, Agisoft Metashape [48], the individual photos were aligned to reconstruct a textured 3D mesh. After the cleanup of mesh artifacts from the reconstruction process, the mesh was rigged and skinned for animation, using a standard full-body humanoid skeleton created by a professional 3D animator.

3) OPTICAL MARKER-BASED MOTION CAPTURE
To obtain reference animation of the 4 actors performing the various activities, a professional motion capture setup was used. The setup consisted of 24 Vicon MXT40S cameras (Vicon, Oxford Metrics, UK) sampling at 120 Hz. Each actor wore a dedicated motion capture suit with 53 attached retroreflective markers. This dense marker set along with the high number of motion capture cameras (24) allowed us to capture highly accurate and precise MoCap data to serve as ground-truth for training, supervising and evaluating data-driven approaches and beyond.
For the purpose of subject calibration, each actor was asked to perform a full range of motion of all joints. The procedure ensured that the joint locations were correctly mapped to the set of the tracked markers. Before each activity, the actors were asked to start in a T-pose and then proceed to their assigned activity.
The captured animations of the actor whose body was subsequently scanned, underwent a retargeting process by a professional 3D animator. The goal of this process was to

4) AUDIO RECORDING
The use of audio and its fusion with visual data have shown significant results in various research tasks such as human emotion recognition [50], scene analysis [51], human activity recognition [52] and more. To this end, also targeting the capture of social activities, we recorded audio during the performance of some of the actions. In particular, 30 of the activities (see Table 2) include audio either as a monologue (single-person) or conversation between two subjects, based on the related scripts and scenarios. For this purpose, wireless body-worn microphones were used to record the audio cues. The audio recording was performed at the frequency of 48 kHz.

C. DATASET PROCESSING AND ANNOTATIONS

1) SYNCHRONIZATION AND CALIBRATION
Inter- and intra-modality synchronization is a prerequisite for such datasets. The motion capture cameras operate in inter-camera synchronization by default. With respect to the mRGBD capturing setting, as already mentioned, Intel RealSense D415 sensors offer intra- and inter-sensor HW synchronization as well. With respect to the inter-modality synchronization, considering the motion capture clock as reference for the full system, along with the mRGBD and audio data timestamping, a SW-based synchronization technique was applied to temporally align the data. In particular, given the motion capture frequency of 120 Hz, the MoCap sample temporally closest to every mRGBD frame timestamp was considered the matching pose, giving a low temporal difference $t_d$, where $t_d \le \frac{1}{2} \cdot \frac{1}{120}\,\mathrm{s} \approx 4.17\,\mathrm{ms}$. The initial temporal offset between the modalities was detected with the use of a clapperboard equipped with 2 markers at the beginning of each sequence, enabling all the modalities to capture the time instance of the clapping event. In detail, for the motion capture data sequences, the 3D position signals of the clapperboard markers were analyzed to detect the clap event by identifying the time instance when the Euclidean distance between the markers is minimal; for the audio signals, the clap event caused an easily detectable peak in the amplitude of the audio signals, while for the RGBD data, the event was manually detected.
FIGURE 6: Colored point-clouds from CMU [39] (Left) and HUMAN4D (Right) datasets showcase the benefits of HW-SYNC. In CMU, where the Kinect devices are modified for synchronization purposes, the leg of the subject is corrupted in a slow movement (i.e. slow leg lifting) due to the existence of temporal offsets between the devices. In HUMAN4D, the leg is appropriately captured in a fast movement (i.e. punching and kicking).
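The nearest-sample matching and the marker-based clap detection described above can be sketched as follows. This is a minimal illustration and not the tooling released with the dataset: the function names and the example values are hypothetical, timestamps are assumed to be in seconds on a common clock after the clap-based offset has been removed, and the MoCap timestamps are assumed to be sorted.

```python
import bisect
import math

MOCAP_HZ = 120.0  # MoCap sampling rate stated in the text


def nearest_mocap_index(frame_ts, mocap_ts):
    """Index of the MoCap sample closest in time to frame_ts.

    mocap_ts must be sorted in ascending order (seconds).
    """
    i = bisect.bisect_left(mocap_ts, frame_ts)
    if i == 0:
        return 0
    if i == len(mocap_ts):
        return len(mocap_ts) - 1
    before, after = mocap_ts[i - 1], mocap_ts[i]
    return i if (after - frame_ts) < (frame_ts - before) else i - 1


def detect_clap(marker_a, marker_b):
    """Frame index at which the two clapperboard markers are closest."""
    dists = [math.dist(a, b) for a, b in zip(marker_a, marker_b)]
    return min(range(len(dists)), key=dists.__getitem__)


# Hypothetical data: 1 s of MoCap timestamps and one RGBD frame timestamp.
mocap_ts = [k / MOCAP_HZ for k in range(120)]
frame_ts = 10 / 30.0 + 0.003
idx = nearest_mocap_index(frame_ts, mocap_ts)
t_d = abs(mocap_ts[idx] - frame_ts)  # residual temporal difference

# Hypothetical clapperboard marker trajectories (metres), 3 frames.
clap_frame = detect_clap(
    [(0.0, 0.0, 0.0)] * 3,
    [(0.0, 0.30, 0.0), (0.0, 0.01, 0.0), (0.0, 0.20, 0.0)],
)
```

Because the matched sample is always the closest one, the residual `t_d` cannot exceed half a MoCap period for frames that fall inside the MoCap recording.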
For the spatial alignment of the modalities, the MoCap system was calibrated once before the captures, while the mRGBD system was calibrated per subject (every subject performed all the actions at once). The spatial alignment between MoCap and mRGBD was achieved by applying a semi-automatic technique, capturing short sequences of moving retro-reflective markers using both modalities before the capturing of each subject. For these sequences, the infrared (IR) stream of the sensors was enabled instead of the color. The details of the inter-modality spatial calibration go beyond the scope of this paper.

2) 2D AND 3D POSE FROM MOTION CAPTURE
The spatio-temporal alignment between the modalities and the high-frequency, precise 3D motion capture enable the extraction of 3D poses accurately mapped on the RGBD data cues. With a set of $J = 33$ joints $j$, as depicted in Fig. 5, a 3D pose per frame $f$ and skeleton $s$ is mapped to every single mRGBD frame. Then, by applying the inverse transformation per camera pose and projecting the 3D positions of the joints on the RGBD views, the 2D keypoints are calculated by:
$$\mathbf{k}_{f,s,j} = \pi\left(K_s,\; T_{g \to l}\, \mathbf{x}_{f,s,j}\right) \qquad (1)$$
where $\mathbf{x}_{f,s,j} \in \mathbb{R}^3$ is the 3D position of joint $j$ and $T_{g \to l}$ is the transformation from the global ($g$) coordinate system to the local ($l$) one of sensor $s$, with the arrow showing the direction of the transformation. $\pi$ denotes the projection function that transforms the 3D coordinates to pixels, using the sensor's intrinsic parameters matrix $K_s$. The 2D outcomes of this processing are depicted in Figs. 7 and 8. Furthermore, considering the MoCap marker 3D positions and their corresponding 2D projections on the sensor views (using the projection of Eq. (1)), we extract the 3D and 2D bounding boxes containing each subject per frame by fitting a slightly padded (2% of the dimension size per side) rectangular prism and box around the 3D positions and 2D projections, respectively.
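As an illustration of the projection of Eq. (1) and of the padded 2D bounding box, the following sketch assumes an ideal pinhole camera without lens distortion. The helper names, the intrinsics and the joint positions are hypothetical and chosen only to make the arithmetic easy to follow; they are not values from the dataset.

```python
def transform(T, x):
    """Apply a 4x4 rigid transform T (nested lists) to a 3D point x."""
    return tuple(
        T[r][0] * x[0] + T[r][1] * x[1] + T[r][2] * x[2] + T[r][3]
        for r in range(3)
    )


def project(K, x_local):
    """Pinhole projection of a point in sensor coordinates to pixels."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    X, Y, Z = x_local
    return (fx * X / Z + cx, fy * Y / Z + cy)


def padded_bbox_2d(points, pad=0.02):
    """Axis-aligned box around 2D points, padded by `pad` of each side length."""
    us = [p[0] for p in points]
    vs = [p[1] for p in points]
    w, h = max(us) - min(us), max(vs) - min(vs)
    return (min(us) - pad * w, min(vs) - pad * h,
            max(us) + pad * w, max(vs) + pad * h)


# Hypothetical intrinsics and global-to-local pose (sensor 2 m from the origin).
K = [[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]]
T_g2l = [[1, 0, 0, 0.0], [0, 1, 0, 0.0], [0, 0, 1, 2.0], [0, 0, 0, 1]]

joints_3d = [(0.0, 0.0, 0.0), (0.2, 0.3, 0.0)]  # hypothetical joints, metres
keypoints = [project(K, transform(T_g2l, x)) for x in joints_3d]
bbox = padded_bbox_2d(keypoints)
```

The same padding rule extends to the 3D prism by treating each of the three axes independently.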

3) VOLUMETRIC DATA FROM MULTI-VIEW RGBD
Real-time 4D reconstruction is evolving into a cutting-edge component of XR applications and beyond, especially for challenging dynamic data such as rigid and non-rigid human motions. A key concept of this dataset is the exploitation of the mRGBD cues of human activities to produce and provide volumetric data captured in a real-time manner, in the form of colored point-cloud and colored/textured 3D mesh instances for every single mRGBD frame.
Point-cloud: An RGBD image is composed of a color image $I$ and a depth image $D$, which, after the application of a local transformation between them, are registered to the same coordinate frame. Then, given the depth sensor poses $T_s := \begin{bmatrix} R_s & t_s \\ 0 & 1 \end{bmatrix}$, known in a common coordinate system, where $R_s$ and $t_s$ denote rotation and translation, respectively, we transform every depth pixel $p \in D_s$ from the depth image domain coordinates of each view to a global coordinate system by:
$$\mathbf{x}_p = T_{l \to g}\, \pi^{-1}\left(K_s,\; p\right)$$
where $T_{l \to g}$ is the relative pose from the local ($l$) coordinate system of sensor $s$ to the global ($g$) one, with the arrow showing the direction of the transformation. $\pi^{-1}$ denotes the deprojection function that transforms the pixel to 3D coordinates, using the sensor's intrinsic parameters matrix $K_s$. Merging the transformed partial point clouds from each view in the global space results in the colored point cloud data. The outcome of this process is illustrated in Fig. 9.
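The deprojection and merging step can be sketched as below, again assuming an ideal pinhole model. The function names, the toy intrinsics and the 2×2 depth images are hypothetical; zero depth is treated as an invalid pixel (an assumption), and color is omitted for brevity.

```python
def deproject(K, u, v, depth):
    """Back-project pixel (u, v) with metric depth to sensor coordinates."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(T, x):
    """Apply a 4x4 rigid transform T (nested lists) to a 3D point x."""
    return tuple(
        T[r][0] * x[0] + T[r][1] * x[1] + T[r][2] * x[2] + T[r][3]
        for r in range(3)
    )


def merge_point_clouds(views):
    """Merge per-view depth images into one cloud in the global frame.

    views: iterable of (K, T_local_to_global, depth_image), where
    depth_image is a list of rows of depths in metres (0 = no estimate).
    """
    cloud = []
    for K, T_l2g, depth in views:
        for v, row in enumerate(depth):
            for u, d in enumerate(row):
                if d > 0:  # skip pixels without a valid depth estimate
                    cloud.append(transform(T_l2g, deproject(K, u, v, d)))
    return cloud


# Two hypothetical views sharing toy intrinsics; the second is shifted 10 m in x.
K = [[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]]
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shifted = [[1, 0, 0, 10], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
depth = [[0.0, 2.0], [2.0, 0.0]]

cloud = merge_point_clouds([(K, identity, depth), (K, shifted, depth)])
```

With sparse "high accuracy" depth maps, the validity check discards many pixels, which is why the merged cloud is accurate but not dense.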
3D Mesh: Beyond point-based volumetric data, watertight colored and textured 3D mesh instances are reconstructed in a real-time manner (up to the frequency of the sensor acquisition, i.e. 30 fps) by applying the GPU-based implementation proposed by Alexiadis et al. [8], based on the fast Fourier transform (FFT)-based approach proposed by Kazhdan [53]. The 3D geometry reconstruction relies on a scalar volume function $V(q)$ containing the splatted 3D surface information, as given by the point cloud calculated using the depth maps, defined over a 3D grid of $N_X \times N_Y \times N_Z$ voxels inside the foreground object's bounding box. This 3D grid of $V(q)$ is considered the volume resolution of the 3D reconstruction, used with power-of-2 components for the FFT, i.e. $2^r \times 2^{r+1} \times 2^r$, $r \in \mathbb{N}$. Applying then the marching cubes algorithm [54], the 3D surface is extracted in the form of triangular meshes (vertex positions, normal vectors and connectivity). The coloring and texturing of each triangle of the surface is based on a weighted average between the cameras for which the specific part is not occluded. The weight estimation depends on the visibility angle between the camera and the respective area.
Applying [8] in voxel grid resolutions with r = 5, r = 6, r = 7, we extract textured and colored triangular 3D mesh instances for all the mRGBD frames of the dataset in three (3) different resolutions. Color-per-vertex and textured 3D mesh instances are depicted in Fig. 10.
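To give a sense of scale for the three reconstruction resolutions, the short sketch below evaluates the grid dimensions $2^r \times 2^{r+1} \times 2^r$ for r = 5, 6, 7. The function name is hypothetical, and the mapping of the doubled dimension to a particular axis is not implied.

```python
def grid_shape(r):
    """Voxel-grid dimensions 2^r x 2^(r+1) x 2^r for resolution level r."""
    return (2 ** r, 2 ** (r + 1), 2 ** r)


for r in (5, 6, 7):
    nx, ny, nz = grid_shape(r)
    print(f"r={r}: {nx} x {ny} x {nz} = {nx * ny * nz} voxels")
```

Each increment of r multiplies the number of voxels by 8, from 65,536 at r = 5 to 4,194,304 at r = 7.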

D. HUMAN4D BENCHMARKING SUBSETS
For benchmarking on HUMAN4D, we divide the dataset into two subsets, a single-(H4D1) and a two-person one (H4D2), in order to reduce the amount of data processing, as well as to evaluate samples that represent varying human poses. At the beginning of each sequence, the subjects were standing in T-Pose for calibration purposes. To that end, we decided to remove the first 100 frames of each sequence to avoid the collection of many similar poses (T-Pose) and to randomly sample 100 frames from the remaining part of each sequence, totaling 5600 and 1000 single-person and multi-person frames, respectively. Given that we benchmark HUMAN4D with pre-trained models or non data-driven encoders, both subsets, H4D1 and H4D2, are used as testing sets. The rest of the data can be considered as training and validation sets to allow the experimentation and development of new data-driven approaches on HUMAN4D. We benchmark HUMAN4D with respect to pose estimation and volumetric video compression by applying state-of-the-art approaches of the respective fields. In the following sections (Sec. IV and V), we evaluate pre-trained models as well as 3D codecs for pose estimation and 3D compression respectively, on the benchmarking subsets of the dataset. An overview of the benchmarking flow and methodology we follow and present in the following sections is depicted in Fig. 11.
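The frame selection for the benchmarking subsets can be sketched as follows. This is only an illustration of the procedure described above: the function name, the fixed seed and the example sequence length are hypothetical, and the seed or sampling routine actually used for H4D1 and H4D2 is not specified here.

```python
import random


def sample_benchmark_frames(num_frames, skip=100, k=100, rng=None):
    """Drop the first `skip` frames (T-pose), then sample `k` distinct frames."""
    rng = rng or random.Random(0)  # hypothetical seed, for reproducibility only
    candidates = range(skip, num_frames)
    return sorted(rng.sample(candidates, k))


# Hypothetical sequence of 1500 frames.
frames = sample_benchmark_frames(1500)

# Subset sizes implied by the sequence counts of the dataset.
h4d1_size = 56 * 100  # 56 single-person sequences
h4d2_size = 10 * 100  # 10 two-person sequences
```

A sequence shorter than `skip + k` frames would make `rng.sample` raise a `ValueError`, so such sequences would need separate handling.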

IV. POSE ESTIMATION
HUMAN4D enables research on human pose-related computer vision tasks by providing spatio-temporally aligned RGBD data from multiple views under a HW-SYNC setting, along with accurate 3D and 2D poses. Recent research efforts are devoted to various single- and multi-person pose estimation approaches, from single RGB in the wild [18], [57]-[59], depth [60], [61], multi-view RGB [23], [62] and multi-view RGBD [22], [63], among others. However, the selection criteria for the methods we benchmark are that they are open-source and applicable to HUMAN4D, so that they produce baseline results for our dataset. Finally, it is worth noting that the mRGBD frames of the evaluation set that go beyond the capabilities of the pre-trained models (for instance, frames with several body parts outside at least one of the views) are excluded, preventing wrong and unfair evaluation of the effectiveness of the methods.

A. SINGLE-VIEW 2D POSE ESTIMATION
Considering the 2D poses per view, we assess state-of-the-art methods for 2D pose estimation from color images. We apply the methods on the color views of all 4 RGBD cameras, extracting the overall error metrics per mRGB frame by averaging the errors per view.
Methods. We select 2 widely known 2D pose estimation methods, a bottom-up and a top-down one, to assess their effectiveness on HUMAN4D color images. Firstly, we select OpenPose by Cao et al. [21], a deep bottom-up pose estimation method that combines confidence maps with part affinity fields to predict multi-person 2D poses in real-time. For the evaluation on HUMAN4D, we used the latest version of the method as found in the official code repository. Secondly, we evaluate AlphaPose, another data-driven approach proposed by Fang et al. [55]. AlphaPose is a top-down, real-time 2D pose estimation method that has been continuously supported and updated over the last years. For the present experiments, we used the latest version of the method as found in the official repository of the authors. Finally, we also experimented with the official code of VNect, by Mehta et al. [20], one of the first data-driven methods that approached 3D pose estimation from single RGB images, and A2j, by Xiong et al. [60], for 3D pose estimation from single depth maps. However, these methods did not transfer well to our dataset, probably due to the differences between the characteristics of the training sets used to train the models and HUMAN4D. For A2j, for instance, the depth data used to train the body pose estimation model were captured with the Asus Xtion PRO, a structured-light depth sensor that provides depth maps of different resolution and depth noise in comparison with the stereo-based Intel RealSense D415 depth sensor. Consequently, we do not report these results; however, the related tools for experimentation are available in the code repository of our dataset.
Metrics. To measure the body joint localization accuracy, we measure mean Average Precision (mAP) for the common joints between the 2 methods and the ground-truth annotations considering the Percentage of Correct Keypoints-head (PCKh) metric, as defined in [64]. PCKh is a slight modification of Percentage of Correct Keypoints (PCK) [65], defining a matching threshold $\alpha$ as the percentage of the head segment length (from neck to head top), instead of the long edge of the bounding box that contains the subject, aiming to make the metric independent of specific body posture and articulation. To this end, a prediction for a frame $f$ and a skeleton $s$ is considered correct if its Euclidean 2D distance error $\epsilon_{f,s}$ falls within a circular pixel region around the ground-truth keypoint with radius $r = \alpha L_{head}$, i.e.:
$$\epsilon_{f,s} \le \alpha L_{head}$$
where $L_{head}$ is the length of the head segment and $\alpha$ is the scalar that controls the relative threshold for correctness consideration.
Results. We separately present the results of the methods on H4D1 and H4D2 to better distinguish their effectiveness on single- and multi-person color data. At first, similarly to the outcomes on other public datasets, AlphaPose outperforms OpenPose, showing higher accuracy in both the single- and multi-person benchmarking sets of HUMAN4D. Nevertheless, even though both methods showcase lower accuracy on the multi-person data of H4D2, which is much more challenging due to the occlusions between the subjects, it is worth noting that the difference between the single- and multi-person results of OpenPose is low (∼1.5%), while AlphaPose presents a higher drop of approximately 9%. Taking into account that the distance between the subjects and the sensors is short, from 1 to 2 meters, and that in most of the two-person samples there are severe occlusions for some of the sensors, we can probably assume that OpenPose, as a bottom-up approach, behaves more robustly under occlusions, whereas AlphaPose, as a top-down approach, is more accurate but is strongly affected by occlusions. In order to provide extra information to the reader, along with the results on HUMAN4D, we also indicate the related outcomes of the methods on other datasets, i.e. MPII [42] and COCO [56], using PCKh with $\alpha = 0.5$, as presented in Table 3. Finally, a plot depicting PCKh mAP against the $\alpha$ threshold for both methods on both subsets is illustrated in Fig. 12.
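The PCKh correctness test can be sketched for a single skeleton in a single view as follows. This is a simplified illustration and not the evaluation code of the benchmark: the function name and the keypoint values are hypothetical, and the averaging over views, frames and joints that leads to the reported mAP is omitted.

```python
import math


def pckh(pred, gt, head_len, alpha=0.5):
    """Fraction of keypoints within alpha * head_len pixels of ground truth."""
    threshold = alpha * head_len
    correct = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return correct / len(gt)


# Hypothetical ground-truth and predicted keypoints (pixels).
gt = [(100.0, 100.0), (200.0, 100.0), (150.0, 300.0)]
pred = [(103.0, 104.0), (230.0, 100.0), (150.0, 309.0)]

score = pckh(pred, gt, head_len=20.0, alpha=0.5)  # threshold of 10 pixels
```

Here the per-joint errors are 5, 30 and 9 pixels, so two of the three keypoints count as correct. Sweeping `alpha` and recomputing the score gives the kind of accuracy-versus-threshold curve referred to in the text.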

B. MULTI-VIEW 3D POSE ESTIMATION
Subsequently, we evaluate multi-view 3D pose estimation on HUMAN4D, exploiting the multi-view color images along with the respective intrinsic and extrinsic camera parameters and using HUMAN4D 3D poses as ground truth.
Methods. We choose a recent state-of-the-art method proposed by Iskakov et al. [23], which is a novel solution for multi-view single-person 3D human pose estimation based on a learnable triangulation (LT) technique, combining 3D information from multiple spatio-temporally aligned 2D color views. In particular, LT (alg.) [23] is a top-down 3D pose estimation method based on end-to-end differentiable algebraic triangulation with the addition of confidence weights estimated from the input images. We ran the experiments only on the H4D1 benchmarking subset of the dataset, since the method estimates single-person 3D poses, using the latest version of the code published by the authors.
where $J_s$ is the total number of joints of skeleton $s$. Finally, we also use mean AP with the 3D PCK metric [66] per joint, where an estimate is considered correct when the 3D Euclidean distance error, i.e. $\epsilon_{f,s}(j)$, is less than a distance threshold $\alpha_{3D}$, as:
$$\epsilon_{f,s}(j) < \alpha_{3D}$$
for a frame $f$ and skeleton $s$, correspondingly.
Results. Classic triangulation algorithms assume that the 2D point coordinates from each view contribute equally to the estimation of the triangulated 3D point coordinates. The major advantage of the LT approach is that the contribution of the 2D joint positions that cannot be estimated reliably (e.g. due to joint occlusions) to the final triangulation outcome is controlled by a neural network. In particular, learnable weights have been added to the coefficients of the matrix corresponding to different views. A limitation of the LT approach is that it fails when some of the body parts are out of the field of view of the cameras, leading to erroneous estimates. Another limitation is that the LT approach supports only single-person 3D pose estimation, and for that reason it was applied only on H4D1. Quantitative results of the method on HUMAN4D, complemented with results on the CMU [39] dataset, are reported in Table 4. Fig. 14 illustrates the mAP against the $\alpha_{3D}$ threshold on HUMAN4D.
Qualitative results regarding the predicted 3D poses against ground-truth on HUMAN4D are illustrated in Fig. 13, where LT (alg.) seems accurate in "clean" poses where self-occlusions are limited (success cases on top rows), while the accuracy is limited in the presence of self-occlusions (failure cases on bottom rows).
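The per-joint 3D error and the 3D PCK test can be sketched for one skeleton in one frame as follows. The mean over the $J_s$ joints is assumed here to be a mean per-joint position error, which is a common choice but an assumption on our side; the function names and the joint coordinates are hypothetical.

```python
import math


def joint_errors(pred, gt):
    """3D Euclidean distance per joint between prediction and ground truth."""
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mean_joint_error(pred, gt):
    """Mean of the per-joint errors over the joints of the skeleton."""
    errs = joint_errors(pred, gt)
    return sum(errs) / len(errs)


def pck_3d(pred, gt, alpha_3d):
    """Fraction of joints whose 3D error is below the threshold alpha_3d."""
    errs = joint_errors(pred, gt)
    return sum(e < alpha_3d for e in errs) / len(errs)


# Hypothetical 3-joint skeleton, coordinates in millimetres.
gt = [(0.0, 0.0, 0.0), (0.0, 500.0, 0.0), (300.0, 500.0, 0.0)]
pred = [(30.0, 40.0, 0.0), (0.0, 500.0, 120.0), (300.0, 500.0, 200.0)]

mean_err = mean_joint_error(pred, gt)   # errors are 50, 120 and 200 mm
pck = pck_3d(pred, gt, alpha_3d=150.0)  # two of three joints below 150 mm
```

Evaluating `pck_3d` over a range of thresholds gives the accuracy-versus-threshold relation of the kind plotted for $\alpha_{3D}$.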

V. VOLUMETRIC VIDEO
Beyond pose estimation, we benchmark a set of state-of-the-art static 3D codecs, in the context of a live streaming scenario. Moreover, we assess the visual quality of textured 3D mesh instances to demonstrate the positive correlation between the objective visual quality and the FFT voxel-grid resolution.

A. VOLUMETRIC VIDEO COMPRESSION
Compression of volumetric data produced in a real-time manner is thought to be a key enabler of a wide variety of applications, such as XR teleconference, real-time dense surface mapping in AR devices and free-viewpoint videos.
A key contribution of HUMAN4D is that it enables future benchmarking in static and temporal volumetric video compression, by offering a large dataset of samples and sequences of point- and mesh-based volumetric data. In contrast with motion pictures, where solutions are mature and proven, real-time varying geometry coding is still an open challenge, frequently addressed using only intra-frame coding and ignoring temporal relations between volumes of consecutive frames. Such an endeavour is presented in [67] by Doumanoglou et al. In a similar manner, for the purpose of this work, the codecs are tested in various profiles, aiming at specific bit-rates, using appropriate metrics on HUMAN4D point- and mesh-based volumetric data cues. For consistency, we define common codec profiles for both the H4D1 and H4D2 dataset subsets. A matching procedure between different codecs for the same target bit-rate was adopted, defining the acceptable deviation margin between target and achieved bit-rate to be ±10%.
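The ±10% matching rule can be expressed in a few lines. The following is only a sketch: the function names, the profile names and the bit-rate values are hypothetical and are not taken from Tables 5 or 6.

```python
def within_margin(target_rate, achieved_rate, margin=0.10):
    """True when the achieved bit-rate deviates from the target by at most `margin`."""
    return abs(achieved_rate - target_rate) <= margin * target_rate

def matching_profiles(target_rate, achieved_by_profile, margin=0.10):
    """Names of the profiles whose achieved bit-rate falls inside the margin."""
    return [name
            for name, rate in achieved_by_profile.items()
            if within_margin(target_rate, rate, margin)]

# Hypothetical measurements (same unit as the target, e.g. Mbps).
measured = {"profile_a": 9.2, "profile_b": 11.4, "profile_c": 10.9}
accepted = matching_profiles(10.0, measured)
```

Only profiles accepted by such a check for the same target would be compared against each other in a rate-distortion analysis.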

1) MESH-BASED VOLUMETRIC VIDEO COMPRESSION
Initially, we benchmark 3D codecs on mesh-based volumetric data using the benchmarking subsets of meshes reconstructed in three different voxel-grid resolutions (i.e. r = {5, 6, 7}) by applying the real-time 3D reconstruction method by Alexiadis et al. [8], as reported in Section III-C3.

Codecs. We employ Corto [68] and Draco [69], two 3D codecs chosen for their high-quality real-time performance. Targeting specific bit-rates for real-time mesh-based volumetric video transmission, we constructed a series of compression profiles with varying compression level, quantization parameter per attribute and different compression methods for specific attributes. HUMAN4D mesh-based compression benchmarking focuses on three different per-vertex attributes: geometry and normals represented in floating points, and color in unsigned integers. The Corto codec [68] configuration consists of four parameters: one quantization value for each of the mesh attributes, i.e. Geometry (GQ), Normal (NQ) and Color (CQ) Quantization bits, and one switch to denote the normal prediction method. We select between two different normal prediction methods, the Normals Quantized Coding (NQC) and the Normals Delta Coding (NDC). In the former, the differences between the normals estimated from the quantized geometry and the quantized actual normals are stored, using an octahedron projection representation [70]. In the latter, the quantized normals in the octahedron projection representation are solely delta coded, with respect to a neighboring quantized normal belonging to a quad incident to the normal's vertex.
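To make the octahedron projection and delta coding of normals more concrete, the sketch below maps a unit normal to two coordinates in [-1, 1], quantizes them to a given number of bits and computes a delta against a reference quantized normal. This is a generic illustration of the technique under our own assumptions (uniform quantization, hypothetical function names); it is not Corto's actual implementation.

```python
import math

def _sign(a):
    return 1.0 if a >= 0.0 else -1.0

def oct_encode(n):
    """Project a unit normal onto the octahedron and unfold it to (u, v) in [-1, 1]."""
    x, y, z = n
    s = abs(x) + abs(y) + abs(z)
    u, v = x / s, y / s
    if z < 0.0:  # fold the lower hemisphere over the diagonals
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    return u, v

def oct_decode(u, v):
    """Inverse mapping from (u, v) back to a unit normal."""
    z = 1.0 - abs(u) - abs(v)
    if z < 0.0:
        u, v = (1.0 - abs(v)) * _sign(u), (1.0 - abs(u)) * _sign(v)
    norm = math.sqrt(u * u + v * v + z * z)
    return (u / norm, v / norm, z / norm)

def quantize(u, v, bits):
    """Uniformly quantize (u, v) to integers in [0, 2^bits - 1]."""
    levels = (1 << bits) - 1
    return (round((u * 0.5 + 0.5) * levels), round((v * 0.5 + 0.5) * levels))

def dequantize(qu, qv, bits):
    levels = (1 << bits) - 1
    return (qu / levels * 2.0 - 1.0, qv / levels * 2.0 - 1.0)

def delta(q, q_ref):
    """Delta of a quantized normal against a reference (e.g. neighbouring) one."""
    return (q[0] - q_ref[0], q[1] - q_ref[1])

def roundtrip(n, bits):
    """Encode, quantize, dequantize and decode a normal."""
    return oct_decode(*dequantize(*quantize(*oct_encode(n), bits), bits))
```

With 10 quantization bits, `roundtrip` returns a normal that deviates from the input by a small fraction of a degree, and the deltas between neighbouring normals tend to be small integers, which is what makes them cheap to entropy code.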
Regarding the Draco codec [69], the configurable parameters are the compression level (CL), which adjusts the trade-off between compression speed and output size, the geometry quantization bits (GQ), the normals quantization bits (NQ) and the color quantization bits (CQ). Contrary to Corto, Draco does not expose any option for adjusting how normals are handled.
Beyond these conventional open-source codecs, novel 3D and 4D data compression approaches have appeared, such as the one proposed by Tang et al. [36]. This method is a block-based 3D compression model and the first deep 3D compression method that can be trained end-to-end with entropy coding. It offers lossless compression of the surface topology and a novel block-based texture parametrization that inherently promotes temporal consistency without tracking and without the need to compress UV coordinates. This codec achieves superior results in comparison to conventional 3D codecs, such as Draco and Corto, with regard to the rate-distortion (RD) balance. Specifically, it is reported to achieve on average 66% lower bit-rate for the same level of distortion in 4D data. For the purpose of this work, we did not benchmark this particular codec since it is not currently open-source.

Metrics. With respect to the metrics, we use the RMS, HausdorffAbs and HausdorffRel metrics to compare the compressed and raw mesh-based representations. For the extraction of RMS and Hausdorff distance, we use a tool implemented based on [71]. This tool provides numerical metrics for the similarity of source and target triangle or quadrilateral meshes. It is worth mentioning that, for the same pair, swapping the source and target meshes can lead to different numerical values; thus, as is usual for these metrics in the literature, we define the correct value to be the maximum of the two, for all metrics.
The Hausdorff distance metric is used in two variations. The HausdorffAbs metric is defined as the maximum, over all uniformly sampled points of the source surface, of the minimum distance to the target surface. The HausdorffRel metric is a variation of the HausdorffAbs metric which tackles the comparison of surfaces with different scales. For the RMS calculation, we need a set of minimum distances between two surfaces. Given a source surface $S$ and a target surface $S'$, with $d(p, S')$ the minimum distance from a point $p \in S$ to $S'$, the mean distance $E_m$ can be calculated by:

$$E_m(S, S') = \frac{1}{|S|} \int_{S} d(p, S') \, dS \quad (10)$$

where $|S|$ denotes the area of $S$. Using the mean distance formula, the root mean square error is defined by:

$$E_{RMS}(S, S') = \sqrt{\frac{1}{|S|} \int_{S} d(p, S')^2 \, dS} \quad (11)$$

Results. For a fair comparison between the codecs, we choose to employ a testing scheme based on rate-distortion terms. In that direction, we keep the bit-rates steady for the pairs and evaluate the corresponding distortion introduced by each codec. As can be seen in Fig. 15, Draco consistently outperforms Corto in terms of the distortion induced, for any tested bit-rate. The profiles used for the benchmarking are depicted in Table 5.
Having tested the same codec profiles on both the single- and multi-person subsets of the HUMAN4D dataset, we noticed that the bit-rates achieved by both codecs on the multi-person subset are slightly greater than those on the single-person one. That is probably because the additional information introduced by the second subject leads to larger surfaces that, despite using the same voxel-grid areas and resolutions, are more challenging to compress with regard to element count and connectivity information.
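The distance metrics used in this benchmark can be approximated on point samples of the two surfaces. The sketch below is a simplification under stated assumptions: each surface is represented by a list of uniformly sampled 3D points, all samples carry equal weight (replacing the area integral), and nearest neighbours are found by brute force. It is not the tool based on [71], and the relative (scale-normalized) variant is not covered.

```python
import math

def one_sided_distances(source, target):
    """Minimum distance from every source sample to the target samples."""
    return [min(math.dist(p, q) for q in target) for p in source]

def hausdorff_abs(a, b):
    """Symmetric Hausdorff distance: the larger of the two one-sided maxima."""
    return max(max(one_sided_distances(a, b)),
               max(one_sided_distances(b, a)))

def rms_distance(a, b):
    """Symmetric RMS distance: the larger of the two one-sided RMS values."""
    def one_sided_rms(src, dst):
        d = one_sided_distances(src, dst)
        return math.sqrt(sum(x * x for x in d) / len(d))
    return max(one_sided_rms(a, b), one_sided_rms(b, a))

# Toy example: the second set has one extra sample far from the first set.
raw = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
decoded = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
h = hausdorff_abs(raw, decoded)
r = rms_distance(raw, decoded)
```

Taking the maximum over both directions mirrors the convention stated in the Metrics paragraph, since swapping source and target generally changes the one-sided values.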

2) POINT-BASED VOLUMETRIC VIDEO COMPRESSION
To benchmark point cloud compression, beyond the reconstruction of the raw point-cloud instances from the mRGBD samples described in Section III-C3, we also use another point-cloud reconstruction approach. The raw point-cloud instances typically contain ∼25,000 points per frame for the single-subject sequences and ∼40,000 points for the two-subject ones. This alternative reconstruction approach allows us to create denser point clouds by sampling points from the surface of the high-resolution meshes (i.e. using voxel-grid resolution with r = 7). Points are sampled from the mesh surface with a probability proportional to the area of the underlying mesh faces using the Point Cloud Library (PCL) [72]. We set the algorithm to generate point cloud instances containing 300,000 points per frame.

Codecs. To benchmark the performance of point cloud compression, we perform a rate-distortion analysis for the codecs Draco, Corto and CWIPC, the MPEG anchor codec proposed in [33] and evaluated in [73]. CWIPC is parameterizable with respect to the Octree Depth (OD) and the JPEG Quantization Parameter (JPEGQP). We perform the analysis on 4 target bit-rates. Note that, for all codecs, we first identified the compression parameters that achieve the target bit-rates within a 10% tolerance. Details on these profiles are listed in Table 6.

Metrics. To measure the distortions introduced by compression to the point-cloud samples, we used standard, well-established, full-reference metrics, as released by the standards body MPEG [74], [75]. More specifically, we measure Peak Signal-to-Noise Ratio (PSNR) using the maximum of the nearest neighbor Euclidean distances amongst all points in the reference point cloud as the peak value $v_p$ by:

$$PSNR = 10 \log_{10}\left(\frac{v_p^2}{MSE}\right) \quad (12)$$

where $MSE$ denotes the mean squared error between the decoded and the ground-truth point clouds. The same process is then applied to the point cloud colors at each of the corresponding points between the decoded and the ground-truth point clouds. Metrics are collected utilizing the MPEG PCC-DMETRIC tool [76]$^{9}$ to calculate these distortions for each frame in the dataset.

Table caption: Codec configurations used to achieve the targeted bit-rates for the voxel-grid resolutions of the reconstructed 3D mesh instances, i.e. for r = 5, r = 6 and r = 7.

Results. Analyzing the experimental results, the CWIPC codec achieves lower geometry distortions for the same bit-rate in comparison with Draco and Corto, while at higher bit-rates all the benchmarked codecs show similar efficiency. CWIPC exploits octree occupancy to encode geometry positions, and is thus able to retain more points from the original point cloud. Details with respect to point-cloud compression benchmarking are illustrated in Fig. 16, while the codec profiles used for the experiments are listed in Table 6. For the sake of clarity, we summarize the abbreviations of codec configuration parameters in Table 7.
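As an illustration of the geometry PSNR described in the Metrics paragraph, the following sketch computes the peak value as the largest nearest-neighbour distance inside the reference cloud and the error as the mean squared nearest-neighbour distance between the two clouds. The brute-force search and the choice of taking the larger of the two directional errors are assumptions made for this example; it is not the MPEG PCC-DMETRIC tool, and all function names are hypothetical.

```python
import math

def nearest_distance(p, cloud):
    """Distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)

def peak_value(reference):
    """Largest nearest-neighbour distance among the points of the reference cloud."""
    return max(
        min(math.dist(p, q) for j, q in enumerate(reference) if j != i)
        for i, p in enumerate(reference)
    )

def mean_squared_error(source, target):
    """Mean squared nearest-neighbour distance from source to target."""
    return sum(nearest_distance(p, target) ** 2 for p in source) / len(source)

def geometry_psnr(reference, decoded):
    """PSNR in dB, using the symmetric (worst-direction) mean squared error."""
    mse = max(mean_squared_error(decoded, reference),
              mean_squared_error(reference, decoded))
    if mse == 0.0:
        return float("inf")
    return 10.0 * math.log10(peak_value(reference) ** 2 / mse)

# Toy example: every decoded point is displaced by 0.1 along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
decoded = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1), (3.0, 0.0, 0.1)]
psnr_db = geometry_psnr(reference, decoded)
```

For clouds with hundreds of thousands of points, the brute-force search would be replaced by a spatial index such as a k-d tree.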

TABLE 7. Abbreviations of codec configuration parameters (columns: Codecs, Parameter, Abbreviation).

B. MESH-BASED VOLUMETRIC VIDEO VISUAL QUALITY
In this section, we assess the visual quality of HUMAN4D textured 3D mesh instances across the three different resolutions of the underlying voxel-grid. The aim is to demonstrate the positive correlation between the objective visual quality and the voxel-grid resolution used to reconstruct the mesh-based volumetric data. As mentioned in Section III-C3, the reconstruction of the mesh-based volumetric data is achieved by applying the real-time method proposed by Alexiadis et al. [8], parameterized in three different voxel-grid resolutions to produce watertight textured 3D mesh instances of varying vertex and face counts. Higher resolution grids lead to meshes of higher element count that are, per se, expected to capture the observed subjects more photorealistically and precisely.

$^{9}$ http://mpegx.int-evry.fr/software/MPEG/PCC/mpeg-pcc-dmetric
Apart from the self-evident impact of higher resolution sampling on the reconstructed hull's spatial fidelity, additional benefits may arise with regard to the accurate colorization of its surface. To showcase and quantify this effect, we firstly project the examined mesh on its respective RGB images and sample the color of its fragments based on a weighted contribution of the corresponding pixels. Then, we render the mesh from the exact same viewpoints that the aforementioned images were captured and compare the synthesized images to their respective silhouette-cropped textures, using conventional image quality metrics.
We conduct the assessment separately on the H4D1 and H4D2 benchmarking subsets. The former, consisting of 4 subjects with 14 sequences each, and each of these sequences with 100 sampled mRGBD frames, reconstructed in 3 voxel-grid resolutions (i.e. r = {5, 6, 7}) and rendered from 4 viewpoints, results in a total of 67,200 rendered views of 16,800 mesh instances. Similarly, the latter includes 2 couples, with 5 sequences of 100 frames each, reconstructed in the same 3 voxel-grid resolutions and rendered from corresponding viewpoints, giving a total of 12,000 views of 3,000 3D meshes.

Metrics. For the visual quality assessment, we opted to use Peak Signal-to-Noise Ratio (PSNR) (Eq. 12) and Structural Similarity Index (SSIM) as metrics to objectively quantify the photometric and photorealistic consistency between the captured, raw color (RGB) views and the rendered views of the mesh-based 4D representations in the various voxel-grid resolutions.
SSIM is a full-reference metric conceived as an improvement over the traditional PSNR and MSE-family metrics and is widely referenced in the video and photography industry, as it is believed to better capture the human perception of visual quality. Instead of decomposing the input signals and then estimating absolute errors, as in the case of MSE-like metrics, SSIM incorporates into its calculations the fact that images are inherently highly structured, and thus their topology and the relations that arise between their elements due to that fact should not be ignored. Luminance Masking and Contrast Masking are two well-known visual perception phenomena that are taken into account during the process of obtaining SSIM measurements. The former is about the low visibility of distortions in bright regions, while the latter is about the masking of distortions in highly textured, non-smooth areas of an image. The SSIM formula is composed of three individual measurements of "structural similarity", luminance $l$, contrast $c$ and structure $s$, between two windows $x$ and $y$ of similar size. The individual comparison formulas are:

$$l(x, y) = \frac{2\mu_x\mu_y + c_1}{\mu_x^2 + \mu_y^2 + c_1} \quad (13)$$

$$c(x, y) = \frac{2\sigma_x\sigma_y + c_2}{\sigma_x^2 + \sigma_y^2 + c_2} \quad (14)$$

$$s(x, y) = \frac{\sigma_{xy} + c_3}{\sigma_x\sigma_y + c_3} \quad (15)$$

with $\mu_x$ the average of $x$, $\mu_y$ the average of $y$, $\sigma_x^2$ the variance of $x$, $\sigma_y^2$ the variance of $y$, and $\sigma_{xy}$ the covariance of $x$ and $y$. The terms $c_1 = (k_1 L)^2$, $c_2 = (k_2 L)^2$ and $c_3 = c_2/2$ are three variables that stabilize the division when the denominator is weak, where $L$ is the dynamic range of the pixel values and $k_1 = 0.01$, $k_2 = 0.03$ by default. SSIM is then a weighted combination of these comparative measures:

$$SSIM(x, y) = l(x, y)^{\alpha} \cdot c(x, y)^{\beta} \cdot s(x, y)^{\gamma} \quad (16)$$

where $\alpha, \beta, \gamma > 0$ are parameters used to adjust the relative importance of the three components. More on SSIM and its development can be found in [77].
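The luminance, contrast and structure terms described above can be computed for a single pair of windows as in the following sketch. It assumes that each window is given as a flat list of pixel intensities of equal length, that population statistics (division by the number of pixels) are used, and that $\alpha = \beta = \gamma = 1$ by default; the function name `ssim_window` is hypothetical.

```python
import math

def ssim_window(x, y, L=255.0, k1=0.01, k2=0.03,
                alpha=1.0, beta=1.0, gamma=1.0):
    """SSIM between two equally sized windows given as flat intensity lists."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    # Stabilizing constants for weak denominators.
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    c3 = c2 / 2.0

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov_xy + c3) / (sd_x * sd_y + c3)
    return luminance ** alpha * contrast ** beta * structure ** gamma

# Toy example: a window compared with itself and with its reversed copy.
window = [10.0, 20.0, 30.0, 40.0]
same = ssim_window(window, window)
flipped = ssim_window(window, window[::-1])
```

In practice the window is slid over the image and the per-window values are averaged; with exponents other than 1, a negative structure term would need special handling.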
Results. As can be seen in Tables 8 and 9, the experiments validate the claim that increasing the voxel-grid resolution of a textured mesh increases its objective visual quality. For both the single- and multi-person evaluation sets, PSNR increases with mesh resolution. From r = 5 to r = 6 the increase is more pronounced, while from r = 6 to r = 7 it diminishes, indicating that a further increase in 3D mesh voxel-grid resolution may be futile, at least as regards the texture fidelity in terms of PSNR.
The SSIM case generally follows the same trend, with the exception of the S3 and S4 subjects from the single-person subset, where increasing the resolution beyond r = 6 does not seem to further improve the SSIM of the textures. In these cases, the r = 6 and r = 7 SSIM values are approximately equal, exhibiting a difference of less than $10^{-4}$.

In Fig. 17, volumetric samples from the single- and multi-person subsets are illustrated, rendered in the 3 different voxel-grid resolutions along with the corresponding RGB images from the same viewpoint. The increase in texture quality we want to highlight in these views is most apparent in the eye area of the multi-person renderings. As can be seen, for r = 5 the right eye of the male subject is blurry and barely visible. As the voxel-grid resolution increases, the eye gets crisper and better defined. Such behaviour can be noticed in other areas of the volumetric data as well.
In a nutshell, experimental results indicate that increasing the 3D mesh voxel-grid resolution indeed leads to an increase in objective quality, though with diminishing returns. This latter observation, together with the near real-time capabilities of the mesh-based volumetric reconstruction pipeline for r = 6 and its lower bandwidth requirements compared with the r = 7 case, makes the r = 6 voxel-grid resolution the most sensible choice for a volumetric live-streaming setup.

VI. DISCUSSION
We created HUMAN4D to provide the research community with a public resource that fills identified gaps in publicly available human-centric 4D datasets, consisting of motion capture and HW-SYNCed volumetric data. In the recent literature, a plethora of algorithms and deep models focus on 3D pose estimation; however, only a few methods approach the task with the use of multi-view depth and volumetric data. That is probably due to the complexity and time-consuming setup of multi-view capturing settings, as well as the lack of spatio-temporally aligned multi-view depth maps with ground-truth data. To this end, we aim to enable research in that direction, encouraging the computer vision community to develop and experiment with new 3D pose estimation approaches on HUMAN4D by providing HW-SYNCed depth and volumetric data along with ultra-accurate ground-truth 3D poses for supervision and evaluation. With regard to volumetric data, volumetric video is an emerging immersive medium, unique due to its fully three-dimensional nature and its capability to enable six degrees of freedom (6DoF) spectating when used in 4D environments. HUMAN4D has been created on the principle of providing spatio-temporally aligned mRGBD data captured to produce point- and mesh-based volumetric videos, reconstructed and compressed respecting online encoding and steady bit-rates. On top of that, in most public datasets, the temporal misalignment between the multiple color and depth streams adds extra noise to the already noisy depth and color data, reducing the quality of the volumetric video. In HUMAN4D, this noise is absent due to the high synchronization precision (HW-SYNC).

VII. CONCLUSION
In this paper we introduced HUMAN4D, a new multimodal human-centric 4D dataset containing a large corpus with more than 50K samples from daily, physical and social activities of annotated spatio-temporally aligned multi-view RGBD, volumetric and motion capture data along with audio recordings. To the best of our knowledge, HUMAN4D is the first dataset that provides HW-SYNCed mRGBD frames with the use of recent consumer-grade depth sensing devices. We also provide evaluation benchmarks based on discriminative pose estimation and volumetric data compression methods. We make all the data$^{10}$ and code$^{11}$ available online, including the respective synchronization, calibration and camera parameters, along with data loaders and other processing, visualization and evaluation tools, for academic use and further research. In that scope, the authors commit to continuously maintaining the dataset for the community by adding new tools, baselines and captures. Despite the continuous maintenance of the dataset, the benchmarking subsets will remain constant to allow the assessment and comparison of new state-of-the-art methods on the same datasets. We believe that HUMAN4D and its associated tools will stimulate further research in computer vision and data-driven approaches, enabling research on human pose estimation and real-time volumetric video reconstruction and compression with the use of consumer-grade RGBD camera sensors.

VIII. ACKNOWLEDGEMENTS
We gratefully acknowledge the work conducted by the team of the Artanim Foundation Motion Capture Studio, providing high quality motion capture and 3D scanning services. We also want to give special thanks to Sylvain Chagué and Valérie Juillard, members of the Artanim team, for scanning, post-processing and rigging of the 3D character and for post-processing and retargeting of the animations, respectively.
Finally, we also acknowledge financial support by the H2020 EC project VRTogether under contract 762111.