3DHR-Co: A Collaborative Test-Time Refinement Framework for In-the-Wild 3D Human-Body Reconstruction Task

Figure 1: Left column: the strategy of our approach (lower branch) compared to recent work (upper branch) on test-time refinement for 3DHR. Our approach acts as a refinement tool for given pre-trained 3DHR backbones, a use that previous works [10,9] were not intended for. Right column: given arbitrary input data, our approach produces better output as it resolves pose ambiguity in the final 3DHR result. The recent work of DynaBOA [9] utilized the ResNet-50 [11] backbone commonly used for the 3DHR task. TTR stands for test-time refinement.

Abstract
The field of 3D human-body reconstruction (abbreviated as 3DHR), which utilizes parametric pose and shape representations, has witnessed significant advancements in recent years. However, the application of 3DHR techniques to real-world, diverse scenes, known as in-the-wild data, still faces limitations. The primary challenge is that accurate 3D human pose ground truth (GT) for in-the-wild scenes remains difficult to curate due to various factors. Recent test-time refinement approaches for 3DHR leverage 2D human keypoints from off-the-shelf detectors to compensate for the lack of 3D supervision on in-the-wild data. However, we observed that additional 2D supervision alone can cause overfitting on common 3DHR backbones, making the 3DHR test-time refinement task seem intractable. We answer this challenge by proposing a strategy that complements 3DHR test-time refinement with a collaborative approach. Specifically, we first apply a pre-adaptation approach in which various 3DHR models collaborate in a single framework to directly improve their initial outputs. This approach is then combined with test-time adaptation under specific settings that minimize overfitting, to further boost 3DHR performance. The whole framework is termed 3DHR-Co. Experimentally, we show that the proposed work can significantly enhance the scores of common classic 3DHR backbones, reducing their pose error by up to 34 mm and placing them among the top performers on in-the-wild benchmark data. This achievement shows that our approach helps unveil the true potential of common classic 3DHR backbones. Based on these findings, we further investigate various settings of the proposed framework to better elaborate the capability of our collaborative approach in the 3DHR task.

Introduction
3DHR, the task of estimating 3D human pose and shape information, has gained significant attention in various applications, including AR/VR gaming, health and fitness tracking, and virtual cloth try-on. Despite notable advancements in 3DHR works [16,20,28,8,3], one challenge remains: the availability of 3D data labels for learning is still limited. This issue arises because acquiring 3D labels in real-world environments, a.k.a. the in-the-wild scenario, necessitates specialized equipment such as body-mounted sensors [39] or multi-view cameras [27], which incur substantial data collection expenses. Consequently, this scarcity of data poses a significant challenge to achieving robust 3DHR in in-the-wild scenarios.
To tackle the issue above, 3DHR works embraced the strategy of non-fully-supervised learning [16,17,20]. Prior to this trend, however, early works mostly focused on human pose estimation, where [31,32,45] handled the scarcity of 3D labels by feeding multi-view images as input. In the following years, [40,14,41] leveraged a weakly-supervised learning strategy that utilizes unrelated 2D- and 3D-labeled data together for training the human pose estimation network.
As the human pose estimation works above showed considerable performance, pioneering works [16,17,20] were proposed to directly infer human pose and shape outputs from image or video input via a parametric model (e.g. SMPL [25], MANO [33]). These parametric works are also known for a straightforward benefit during learning: the estimated 3D human pose parameters can be projected to 2D representations and supervised by richly available 2D ground truth datasets (such as MSCOCO [24]). These works then evolved into temporal strategies (e.g. [18,26,3,42]) that utilize features of neighboring frames of the video input to compensate for the lack of feature information in particular frames or sequences due to unseen body parts [19].
As those recent 3DHR works [16,17,20] showed considerable ability in predicting human pose and shape outputs by including 2D data supervision, recent approaches such as BOA [10] and DynaBOA [9] utilized 2D pose information detected by an off-the-shelf detector to firmly guide the 3DHR model during inference-time learning, also known as test-time adaptation. Such strategies pave a novel way to achieve top performance on in-the-wild benchmark data (such as 3DPW [39]) without much modification at the network architecture level. These works, however, focus only on the task of domain adaptation. As shown in the upper branch of the left column of Figure 1, they (BOA or DynaBOA) require an initial training step with a specific dataset (e.g. Human3.6M [13]) that is limited in variety but provides 3D ground truth, and are later adapted at test time with both Human3.6M [13] (as the source domain) and in-the-wild data (e.g. 3DPW [39] as the target domain).
Based on the matter above, our observation in Figure 2 shows that the SPIN [20] backbone plugged directly into BOA's framework (BOA-plugged, middle bars of Figure 2) only performs on par with the original BOA, which has its own modified backbone version (left bars). This is undesirable, as BOA-plugged, pre-trained with various datasets, should have shown better performance than the original BOA, which was solely trained with Human3.6M [13]. This issue is caused by overfitting in BOA-plugged: it directly adapts over the whole video data in the target domain, so video frame sequences unrelated to the initially fetched frames are adapted insufficiently.
We tackled the issue above with a straightforward objective: proposing a refinement framework into which the original pre-trained 3DHR model f_0 can be plugged, and whose adapted version f_a directly shows better refined 3D body results at test time than the initial version. The general idea of this setup is displayed in the lower branch of the left column of Figure 1. The knowledge of each backbone is represented with different colors, and our test-time refinement strategy is meant to increase the knowledge of each to predict better 3DHR outputs. Following the idea of BOA (or DynaBOA), which utilizes off-the-shelf 2D pose information during adaptation, our work was able to show better visual results (right column of Figure 1) than DynaBOA [9], the recent adaptation-based method on the in-the-wild 3DPW [39] benchmark.
Our strategy, termed 3DHR-Co, is a test-time refinement approach in which currently available 3DHR methods work collaboratively to distill information from one model (as a teacher) into another (as a learner). This is made possible by feeding the learner side of the 3DHR model with the perturbed version of the input data and the teacher side with the non-perturbed version during test time. The difference between the 3DHR outputs of the perturbed and non-perturbed versions provides implicit cues for learning during inference time, similar to the strategy of self-supervised zero-shot learning [36]. In the following sections of this manuscript, we provide further discussion of related works and the detailed strategies for better understanding. To summarize, our contributions are as follows:
• The proposal of a collaborative test-time refinement strategy for the 3DHR task (3DHR-Co) that can refine 3DHR models on the go.
• Top-level performance on in-the-wild benchmark data, obtained by only using classic common 3DHR backbones run via our collaborative test-time refinement framework.
• A study that provides discussions and recommendations regarding the optimal solution for running 3DHR test-time refinement under a collaborative strategy.

Figure 2: Error scores of BOA [10] with the original SPIN [20] backbone pre-trained on various datasets (middle) and with the modified SPIN backbone pre-trained on a single dataset (left); the two are on par. We tackled this issue by performing a refinement strategy that pushed the error score lower (right).

Related Works
We describe several prior works that are relevant to our main studies. They are classified and described in the following subsections:

3D Human-body Reconstruction
Recently, 3D human-body reconstruction has gained popularity in the computer vision research community. Early works focused on estimating the pose of humans in a stick-figure representation, mostly known as the task of human pose estimation [31,32,45,40,14,41]. In the past few years, parametric approaches have been proposed to simplify the prediction task of 3DHR by estimating the human pose and shape outputs directly. Among these parametric approaches, one particular work, the skinned multi-person linear model (SMPL) [25], was proposed to represent the unclothed human mesh with those parameters. An early work termed Human Mesh Recovery (HMR) [16] utilized a deep learning architecture [11] to provide features for regressing the SMPL parameters and camera outputs to render the final human mesh output. Subsequent works utilized a similar strategy but with several modifications at the backbone level to support better 3DHR performance from a single input image [44,4]. Realizing that image-based works can be extended to the temporal (video-based) domain, various works were also proposed to tackle the limitations of single-image 3DHR [17,26,3,42]. These approaches are naturally better, as the temporal strategy provides features from neighboring frames that help recognize human poses that might be undetected in a particular frame. More recently, some works [15,30,43] involved expressive parameters such as hand and face details. Moreover, current developments [34,35,12,1] also include clothing details, which improves visual realism.

Test-time Learning
Test-time learning, also known as test-time training, is an approach that performs learning on a deep network model during inference time, aiming to increase the knowledge of the model itself so that it becomes more robust in solving a certain task [38]. Other non-3DHR works [36,5,21] have experimented with test-time learning to boost their performance. One interesting work, zero-shot image restoration [36], showed that test-time learning is possible from an untrained model. This strategy was further improved by recent works that mainly utilized a meta-learning approach to first train the untrained model and then run the test-time learning procedure for fast adaptation [29,37,22]. More recently, however, the paradigm of test-time learning has intersected with the idea of domain adaptation, which aims to adapt a model learned on a specific domain to be robust on certain target domains. This idea later evolved into domain generalization, where the model is adapted to generalize well to various out-of-distribution target data. In relation to 3DHR, the recent studies of BOA [10] and DynaBOA [9] implemented the idea of test-time training for domain adaptation. DynaBOA [9] utilized additional information retrieval and dynamic adaptation to surpass its predecessor (BOA [10]). These prior works differ from ours in terms of motivation, as they focus on solving domain shift issues via test-time adaptation.

Preliminary
In this passage, we describe the general idea of our approach. In contrast to prior works [10,9], ours focuses on refining available pre-trained 3DHR models. This objective led us to the strategy of collaborative learning, where two different 3DHR models are placed directly into our refinement framework. As shown in Figure 3, our framework provides 2 specific branches (white and blue regions), with one model (f_0, placed in the white-colored upper-branch region) acting as the teacher and the other one (f_s, placed in the blue-colored lower-branch region) acting as the learner. The knowledge of the teacher model, which receives the non-perturbed input, is transferred to the learner model, which receives the perturbed input. The perturbed version is produced by adding noise at the image level, which acts as a partial occlusion of the human body. By teaching the learner model with the knowledge from the non-perturbed version, it is natural that the learner model improves during refinement (visualized by the increased green-colored content of f_s in Figure 3). Based on the idea above, we provide a detailed elaboration of our test-time refinement mechanism for the 3DHR task in the following.

Figure 3: Our straightforward yet intuitive test-time refinement framework. The framework above denotes our pre-adaptation strategy (Line 6-13 of Algorithm 1), which aims to provide the pre-adapted weight (f_s). The form of f_s is already within our test-time refinement framework; however, it is best to further refine it using bilevel adaptation [10,9].

For simplicity, we make use of common 3DHR methods that utilize the SMPL parametric model [25]. The SMPL model is constructed from the body pose parameter θ and the shape parameter β. Besides predicting the SMPL parameters, recent common 3DHR methods are also tasked with estimating the camera parameters c. The SMPL parameters (θ, β) can be used to generate the corresponding human mesh representation M, along with the 3D keypoints representation J via the mesh-to-3D-skeleton mappings provided in the SMPL function. The camera parameter c can be used to perform a weak-perspective projection of J from 3D to 2D space. This nomenclature is used in the following elaboration.
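The weak-perspective projection mentioned above can be illustrated with a minimal sketch. The text does not spell out the layout of c, so the parameterization c = (s, t_x, t_y) (one global scale plus a 2D translation, as commonly used in HMR-style methods), the formula (u, v) = s · (x + t_x, y + t_y), and the function name `project_weak_perspective` are all assumptions made for illustration. Plain Python lists are used instead of tensors to keep the example self-contained.

```python
from typing import List, Sequence, Tuple


def project_weak_perspective(
    joints_3d: Sequence[Sequence[float]],
    cam: Tuple[float, float, float],
) -> List[Tuple[float, float]]:
    """Project 3D joints J (N x 3) to 2D with an assumed camera c = (s, tx, ty).

    Weak perspective ignores per-joint depth and applies a single global
    scale and a 2D translation: (u, v) = s * (x + tx, y + ty).
    """
    s, tx, ty = cam
    return [(s * (x + tx), s * (y + ty)) for x, y, _z in joints_3d]


# Two toy joints; the depth values (third coordinate) do not affect the result.
joints = [(0.0, 0.0, 2.0), (0.5, -0.5, 2.5)]
print(project_weak_perspective(joints, cam=(2.0, 0.1, 0.0)))
```

Because depth is dropped, two joints that differ only in their z coordinate map to the same 2D point, which is one reason 2D supervision alone leaves pose ambiguity unresolved.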

3DHR-Co test-time refinement algorithm

Motivation
Our test-time refinement framework is explicitly defined in algorithm form, and its representation is outlined in Algorithm 1. As illustrated in Figure 3 earlier, our method is designed to directly refine the pre-trained 3DHR model. To achieve this, we employ two refinement strategies: (i) the pre-adaptation stage (Line 6-13 in Algorithm 1), and (ii) the full scope of our 3DHR-Co test-time refinement, which includes further refinement through our regeneration-based bilevel adaptation (Line 15-18 in Algorithm 1). The purpose of our pre-adaptation strategy (depicted in Figure 3) is to provide a ready-to-adapt weight that avoids the overfitting that occurs when the bilevel adaptation algorithm is plugged directly into the 3DHR backbone. As the bilevel strategy works in a sequential manner, the backbone has the tendency to preserve knowledge of certain sequential frames, leaving the remaining incoming streams sub-optimal during adaptation. Our strategy tackles this issue by introducing sampled target data during the pre-adaptation stage. The effect is crucial: as demonstrated in Figure 2, our method (right bars) with the BOA function and the pre-adapted SPIN weight achieves better results than the BOA function with the pre-trained SPIN weight (middle bars). With this motivation, the whole 3DHR-Co test-time refinement algorithm is constructed to suit the plugged pre-trained model.

Pre-adaptation Stage
In this passage, we provide the technical elaboration of the pre-adaptation strategy. The strategy is visually depicted in Figure 3, and its pseudocode is written in Line 6-13 of Algorithm 1. To perform pre-adaptation, our work relies on 2 intermediate outputs, namely the test-time intermediate label and the intermediate perturbed data. The intermediate label data is generated from the non-perturbed version of the test input data, while the perturbed version acts as the intermediate perturbed data for the learner module. This approach comes with 2 benefits: (a) arbitrary in-the-wild input data can be used directly to extract the perturbed data for refinement purposes, and (b) various models can be learned on the go during test time without modification. The remaining task is then to create a reliable test-time refinement framework.
At the algorithm level, our pre-adaptation approach first fetches the current batch data B_i and its neighboring frames (B_{i−j}, ..., B_{i+j}) for extracting the temporal features V_l. These features are used to extract the intermediate label data (θ_l, β_l, C_l in Line 10), and in this work, common temporal 3DHR works [3,42] are directly utilized as the extractor. Meanwhile, the intermediate perturbed data (θ_r, β_r, C_r in Line 11) is obtained as the body and camera parameters predicted from the test input data B_r corrupted via Gaussian noise (Line 8).
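The Gaussian corruption step (Line 8) can be sketched as follows. This is only an illustration: it assumes that the noise is zero-mean, that σ is expressed on the 0 to 255 pixel scale, and that corrupted values are clipped back to the valid range; none of these details are stated in the text. The helper name `perturb_with_gaussian_noise` and the single-channel list-of-lists image are hypothetical.

```python
import random
from typing import List

Image = List[List[float]]  # one channel, pixel values assumed to lie in [0, 255]


def perturb_with_gaussian_noise(image: Image, sigma: float, seed: int = 0) -> Image:
    """Return a copy of `image` with zero-mean Gaussian noise of std `sigma`.

    Values are clipped back to [0, 255] so the result is still a valid image.
    """
    rng = random.Random(seed)  # local generator keeps the example reproducible
    return [
        [min(255.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


clean = [[128.0] * 4 for _ in range(4)]                 # toy 4 x 4 gray patch
noisy = perturb_with_gaussian_noise(clean, sigma=35.0)  # learner-side input
```

In the framework described above, `clean` would go to the intermediate label extractor and `noisy` to the learner, so that the learner is supervised on inputs that are harder than the ones the labels came from.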
This strategy is then learned via an MSE loss (Line 12, Eq. (1)), and its visual representation is denoted by the red arrow in Figure 3, where each of the output parameters is used for supervision. J_r is the 3D keypoint information extracted from the SMPL output parameters of the perturbed version, projected into 2D via C_r. The approach above simulates continual, step-by-step knowledge improvement, as depicted by the increased green-colored content in the lower-branch scope of the student network (f_s) in Figure 3.

Algorithm 1 (excerpt):
12: L = loss(θ_l, β_l, C_l, θ_r, β_r, C_r, G) by using Eq. (1)
13: Update learner: f_s ← ADAM(f_s, ∇_f L)
P = B ▷ Update buffer P with the latest iteration data

Note that the batch of data in each step is sampled randomly. Thus, the data at step k + 1 can be unrelated to the data at the previous step k, as shown in Figure 3. Once the pre-adaptation stage is finished, the pre-adapted weight f_s is transferred to the next scope (Line 15-18), which performs bilevel test-time refinement to obtain f_a.
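The supervision described above can be sketched as a weighted sum of MSE terms. Since Eq. (1) itself is not reproduced in the text, everything about the structure below is an assumption: one MSE term each for pose, shape, and camera parameters, plus a 2D keypoint term comparing the projected J_r against G, which is treated here as a reference set of 2D keypoints. The mapping between these terms and the four coefficients is likewise not recoverable, so the weights are passed in explicitly under hypothetical names.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pre_adaptation_loss(
    theta_l: Sequence[float], beta_l: Sequence[float], cam_l: Sequence[float],
    theta_r: Sequence[float], beta_r: Sequence[float], cam_r: Sequence[float],
    kp2d_ref: Sequence[float], kp2d_r: Sequence[float],
    w_pose: float = 1.0, w_shape: float = 1.0,
    w_cam: float = 1.0, w_kp2d: float = 1.0,
) -> float:
    """Assumed form of the loss: weighted MSE between teacher labels (subscript l)
    and learner predictions on perturbed input (subscript r), plus a 2D term.

    `kp2d_r` stands for J_r already projected to 2D via C_r and flattened;
    `kp2d_ref` stands for G, assumed to be reference 2D keypoints.
    """
    return (
        w_pose * mse(theta_l, theta_r)
        + w_shape * mse(beta_l, beta_r)
        + w_cam * mse(cam_l, cam_r)
        + w_kp2d * mse(kp2d_ref, kp2d_r)
    )


# Only the pose differs here, so only the pose term contributes.
loss = pre_adaptation_loss(
    [0.0, 0.0], [0.0], [1.0], [1.0, 1.0], [0.0], [1.0], [0.0], [0.0], w_pose=10.0
)
```

Keeping the weights as explicit arguments makes it easy to try different balances between parameter-space and keypoint-space supervision without touching the loss definition.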
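To make the teacher-learner loop with randomly sampled batches concrete, the following toy sketch replaces the 3DHR networks with one-parameter functions. It is not the paper's implementation: the teacher `y = 2x`, the learner `y = w·x`, the noise level, and the plain SGD update (used here instead of ADAM to keep the code dependency-free) are all illustrative assumptions.

```python
import random


def teacher(x: float) -> float:
    """Stand-in for the frozen teacher: maps a clean input to a pseudo-label."""
    return 2.0 * x


def pre_adapt(w: float, steps: int = 500, sigma: float = 0.1,
              lr: float = 0.05, seed: int = 0) -> float:
    """Fit a one-parameter learner y = w * x on perturbed inputs, supervised
    by teacher labels computed from the corresponding clean inputs."""
    rng = random.Random(seed)
    for _ in range(steps):
        x_clean = rng.uniform(0.5, 1.5)           # randomly sampled test input
        x_pert = x_clean + rng.gauss(0.0, sigma)  # perturbed copy for the learner
        target = teacher(x_clean)                 # intermediate label
        pred = w * x_pert                         # learner output on perturbed input
        grad = 2.0 * (pred - target) * x_pert     # derivative of squared error w.r.t. w
        w -= lr * grad                            # plain SGD step (assumption)
    return w


w0 = 0.5
w_adapted = pre_adapt(w0)
print(f"initial gap: {abs(w0 - 2.0):.3f}, adapted gap: {abs(w_adapted - 2.0):.3f}")
```

Each step draws an unrelated sample, mirroring the observation that the data at step k + 1 need not be related to the data at step k, yet the learner's parameter still moves steadily toward the teacher's behavior.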

Regeneration-based bilevel test-time refinement
The full scope of the 3DHR-Co refinement framework includes the initial pre-adaptation strategy above and the subsequent regeneration-based bilevel refinement approach (Line 15-18). The objective of the latter is to refine the weights of the pre-adapted 3DHR backbone while also avoiding overfitting. To run the regeneration strategy, we employ a straightforward weight refreshment mechanism (Line 16) that operates per frame sequence. The data buffer P is utilized to store the latest adapted weight that learns from previous stream sequence data (Line 18). For the adaptation, the bilevel function (Line 17) is applied, and in this step, f_a is always updated along the batch input. The bilevel function acts as a refinement tool that can adapt the pre-adapted 3DHR model at test time, and we refer to the original implementation of the respective authors [10,9]. The regeneration strategy is proposed to avoid the overfitting issue mentioned earlier. The procedure is straightforward (Line 16): it re-initializes the model with the weight obtained from the pre-adaptation stage (Line 6-13) when a new sequence is fetched. Doing so helps the model re-learn directly from the pre-adapted version f_s, which already has knowledge of the tested data, rather than from the pre-trained version f_0. This step is important, as the f_s version has already secured an initial improvement without being overfitted to a particular frame or sequence of the test data. These capabilities are measured in the following Experiment section.
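The regeneration mechanism can be sketched as a per-sequence weight reset wrapped around an adaptation step. This is a simplified illustration: the names `refine_stream` and `toy_adapt` are hypothetical, the bilevel function of [10,9] is replaced by a trivial stand-in, and the buffer P is omitted.

```python
import copy
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Weights = Dict[str, float]
Frame = Tuple[str, float]  # (sequence id, toy observation)


def refine_stream(
    pre_adapted: Weights,
    stream: Iterable[Frame],
    adapt_step: Callable[[Weights, float], Weights],
) -> List[Tuple[str, Weights]]:
    """Adapt frame by frame, resetting to the pre-adapted weights per sequence."""
    history: List[Tuple[str, Weights]] = []
    current_seq: Optional[str] = None
    weights = copy.deepcopy(pre_adapted)
    for seq_id, obs in stream:
        if seq_id != current_seq:
            # Regeneration: a new sequence starts from the pre-adapted weights,
            # not from weights adapted on the previous sequence.
            weights = copy.deepcopy(pre_adapted)
            current_seq = seq_id
        weights = adapt_step(weights, obs)  # stand-in for the bilevel update
        history.append((seq_id, copy.deepcopy(weights)))
    return history


def toy_adapt(weights: Weights, obs: float) -> Weights:
    """Toy update: move the single weight halfway toward the observation."""
    return {"w": weights["w"] + 0.5 * (obs - weights["w"])}


stream = [("seq_a", 1.0), ("seq_a", 1.0), ("seq_b", -1.0)]
history = refine_stream({"w": 0.0}, stream, toy_adapt)
```

In this toy run the weight drifts toward `seq_a` over two frames, then restarts from the pre-adapted value when `seq_b` begins, so whatever was fitted to `seq_a` does not leak into the adaptation of `seq_b`.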

Experimental Setup
In this discussion, we provide a comprehensive overview of our implementation. Our algorithm is implemented using the PyTorch framework. To ensure consistency, two recent and common backbone models are utilized for refinement, SPIN [20] and RSN [44], as they provide the same SMPL parameters and similar PyTorch scripts. The backbones of these works are directly plugged into our algorithm. The intermediate label data is extracted via the temporal works (T) of MPN [42] and TCMR [3]. Eq. 1 uses constant coefficients (λ_1 = 10, λ_2 = 0.1, λ_3 = 1, λ_4 = 1) and is optimized via the ADAM optimizer with a learning rate of 0.00001 during the pre-adaptation stage (from f_0 to f_s). Our regeneration-based bilevel adaptation function (from f_s to f_a) follows the recent setting of the bilevel function of DynaBOA [9]. Our refinement strategy is run on an NVIDIA RTX 3090 GPU, and all experiments are focused on the benchmark of the in-the-wild scenario (via 3DPW [39]).

Table 1: Quantitative scores from an ablation study on the number of sampled data (percentage) and noise information in our pre-adaptation strategy. The score improvements of SPIN [20] are shown. MPN [42] and TCMR [3] are selected as the intermediate label data generator. Our algorithm allows all methods to work collaboratively in the test-time refinement mechanism. Orange mark defines error gap < -20 mm.

By accumulating the additional functions proposed in this work, the classic SPIN backbone can achieve a state-of-the-art result on in-the-wild benchmark data (3DPW [39]). Legends: Intrm. = intermediate, pert. = perturbed, pread. = pre-adaptation, reg. = regeneration, dy. = dynamic, and blv. = bilevel.

Finding the Way for Improvement
The main challenge of our 3DHR-Co framework is to find reliable data to use during test-time learning. As described in the Method Section above, we opt to gener-  version is then used in our full-scope strategy along with the bilevel (Reg. Blv.) [10] and dynamic bilevel (Reg. Dy. Blv.) [9] functions. Their scores are shown in Figure 4 accordingly. The recent setting of [9] is utilized in our bilevel function as it shows the best score via the full-scope 3DHR-Co framework (refer to the right-most result with the score of 63.72 mm in Figure 4). Based on this experiment, we further explore the benefit of both pre-adaptation and the full scope of 3DHR-Co in the following discussions.

Pre-adaptation Test-time Refinement Experiment
The effect of our pre-adaptation strategy is showcased in the following experiment. This task is performed by working with the intermediate-label and intermediate-perturbed data settings. The intermediate-perturbed data are generated with Gaussian noise levels of σ_1 = 35, σ_2 = 50, and σ_3 = 65. The intermediate label version is characterized by the number of sampled data (percentage) taken from the tested in-the-wild dataset. In this work, each epoch samples 3 video sequences of the 3DPW benchmark set, and a batch of 8 random frames is drawn from each video; thus, around 24 frames are utilized per epoch out of the total 35K frames (0.07%) in the 3DPW [39] test set. We empirically set a maximum of 600 epochs for the pre-adaptation experiments (corresponding to 42% sampled data).

Figure 6: Qualitative results of the RSN method adapted using our pre-adaptation strategy. Significant pose changes are highlighted with red arrow marks. MPN and TCMR act as the intermediate label extractor during pre-adaptation.
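The sampling percentages above follow from simple arithmetic, sketched below. The total of 35,000 frames is the rounded "35K" from the text rather than an exact count, and because frames are drawn at random the result is an upper bound on the fraction of distinct frames seen. With these rounded inputs, one epoch corresponds to about 0.07% and 600 epochs to roughly 41%, close to the quoted 42%; the small gap presumably comes from rounding of the frame count.

```python
SEQUENCES_PER_EPOCH = 3
FRAMES_PER_BATCH = 8
TOTAL_TEST_FRAMES = 35_000  # "35K" in the text; the exact count is not given


def sampled_fraction(epochs: int) -> float:
    """Upper bound on the fraction of test frames seen after `epochs` epochs.

    Random sampling may revisit frames, so the true coverage can be lower.
    """
    frames = epochs * SEQUENCES_PER_EPOCH * FRAMES_PER_BATCH
    return min(1.0, frames / TOTAL_TEST_FRAMES)


print(f"1 epoch:    {sampled_fraction(1):.2%}")
print(f"600 epochs: {sampled_fraction(600):.2%}")
```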

Working with intermediate-label and intermediate-perturbed data
We show the quantitative pre-adaptation scores for the SPIN [20] and RSN [44] models in Table 1 and Table 2, respectively. The scores are reported using the MPJPE metric, varied over the perturbation settings in the horizontal direction and over the intermediate label extractor (MPN [42] or TCMR [3]) in the vertical direction. As shown in the tables, the more data is sampled, the larger the improvement, with errors suppressed by a large gap. In the case of the SPIN model in Table 1, our pre-adaptation strategy achieves an error gap of up to -22.49 mm (orange mark at ep-601), while the RSN error is reduced with a gap of -20.11 mm (orange mark in Table 2). This indicates that both of these works have hidden potential that can be reached without modifications at the architecture level. The RSN model, although more focused on the lower-resolution 3DHR task, is still capable of test-time refinement, as shown in Table 2. SPIN's qualitative improvements over its initial version are demonstrated in Figure 5, while RSN's results are shown in Figure 6. The red arrow marks highlight the significant pose differences between the initial outputs of the pre-trained model and the versions refined via our strategy. MPJPE and PA-MPJPE scores are shown directly below the respective image.
Figure 7: Results on internet data using our approach, compared directly with the recent top performer among 3DHR adaptation methods (DynaBOA). Occluded body parts are plausibly estimated using our approach.
We notice that the appreciable improvements mainly appear in body parts that are partially occluded. This phenomenon indicates that our test-time refinement approach is already capable of handling occlusion at the pre-adaptation stage. In a few cases, however, we observed that larger noise (σ_3) might lead to sub-optimal results (shown in the fourth-row image of Figure 5 and Figure 6). Thus, it is recommended to use as low a noise value as possible for the 3DPW [39] case. This is in line with our quantitative finding, as both SPIN and RSN demonstrated their best performance with σ_1 = 35, as shown in Table 1 and Table 2, respectively.
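For reference, the MPJPE metric used above can be sketched in a few lines. The version below aligns both skeletons at a root joint before averaging the per-joint Euclidean distances, which is a common evaluation convention but an assumption here, as the text does not describe its exact protocol. PA-MPJPE additionally applies a Procrustes (rigid plus scale) alignment before measuring the error and is not implemented in this sketch.

```python
import math
from typing import Sequence

Joint = Sequence[float]  # (x, y, z), e.g. in millimeters


def mpjpe(pred: Sequence[Joint], gt: Sequence[Joint], root: int = 0) -> float:
    """Mean per-joint position error after aligning both skeletons at `root`."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    pred_root, gt_root = pred[root], gt[root]
    total = 0.0
    for p, g in zip(pred, gt):
        p_rel = [a - b for a, b in zip(p, pred_root)]  # root-relative prediction
        g_rel = [a - b for a, b in zip(g, gt_root)]    # root-relative ground truth
        total += math.dist(p_rel, g_rel)
    return total / len(pred)


# Toy two-joint skeleton: the root matches after alignment, the other joint is 5 mm off.
pred = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
gt = [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)]
error_mm = mpjpe(pred, gt)
```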

Full Scope 3DHR-Co Test-time Refinement Experiment
The next important step in our experiment is to explore the full-scope version of our 3DHR-Co strategy. Using the pre-adapted weight, our algorithm runs the bilevel function along with the regeneration strategy (Line 15-18). This approach gives top performance on the current 3DPW [39] benchmark using only the classic SPIN backbone (ResNet-50 [11]). The quantitative scores are shown in Table 3 and Table 4, where MPN and TCMR act as the intermediate label extractor, respectively. Similar to the discussion above, a large error suppression gap (-34.53 mm) is obtained by 3DHR-Co (SPIN) with the smallest noise configuration (σ_1, shown in Table 3). This leads the classic SPIN model to achieve superior performance on the current 3DPW [39] benchmark (MPJPE = 63.72 mm as highlighted in Table 3) with MPN as the intermediate label data extractor. When TCMR is plugged in instead as the intermediate label data extractor, the SPIN model achieves a similar result on the 3DPW benchmark (MPJPE = 63.92 mm as highlighted in Table 4). These results show that using 3DHR intermediate data extractors together with the pre-adapted classic backbone enables reliable collaborative test-time refinement in 3DHR.

In-the-wild Real World Experiment
To further show the validity of our work, 3DHR-Co is demonstrated on real-world internet data where no ground truth is available. We follow the DynaBOA setup, which first generates the 2D pose via an off-the-shelf 2D human pose estimator (AlphaPose) [6,23,7]. Our method is run with the pre-trained SPIN plugged directly into the 3DHR-Co framework and with MPN as the intermediate-label extractor. As shown in Figure 7, stream inputs with severe occlusions of the human body are recovered better with our approach than with DynaBOA [9]. The visible errors of [9] (red arrows in Figure 7) stem from human pose ambiguity, where occluded paired body parts are often switched. Our 3DHR-Co approach resolves this issue while showing considerable temporal results on the input stream (refer to supplementary files).

Execution Time
Our approach has the additional advantage of a small computational cost. Our pre-adaptation approach takes approximately 0.15 seconds for a batch of data (each batch contains 8 frames of size 3 × 224 × 224). An epoch that samples 3 × 8 frames takes around 3 seconds per pre-adaptation step. As shown in Table 1, running approximately 400 epochs of pre-adaptation takes around half an hour or less to boost the performance of the pre-trained SPIN [20] by up to -21 mm error suppression on the whole 3DPW [39] test set. Such an approach is favorable compared to re-training a new 3DHR model, which takes hours or days. The regeneration-based bilevel steps share similar timing with the original source [9], taking approximately 1 second to adapt a batch of data (1 batch holds 1 frame).
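The pre-adaptation time budget above can be checked with a short calculation. The per-epoch cost of 3 seconds is the approximate figure quoted in the text; the constant and function names are illustrative.

```python
SECONDS_PER_EPOCH = 3.0  # "around 3 seconds" per epoch of 3 x 8 frames


def pre_adaptation_minutes(epochs: int) -> float:
    """Approximate wall-clock time of the pre-adaptation stage, in minutes."""
    return epochs * SECONDS_PER_EPOCH / 60.0


print(f"400 epochs: about {pre_adaptation_minutes(400):.0f} minutes")
print(f"600 epochs: about {pre_adaptation_minutes(600):.0f} minutes")
```

With these figures, 400 epochs come to about 20 minutes and the maximum of 600 epochs to about 30 minutes, consistent with the half-hour estimate.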

Limitations
While the experiments above show remarkable performance in boosting 3DHR models, our current development is still limited when boosting test data that lacks in-the-wild characteristics. The performance of the full-scope 3DHR-Co framework depends on the intermediate data extraction strategy applied during the pre-adaptation stage. In the pre-adaptation stage, when MPI [2] and Human3.6M [13] are tested directly, our refinement is sub-optimal, improving by only around -2 mm in both cases. We presume this happens because these datasets mostly contain homogeneous scenes from in-studio environments. Specifically, their scenes have static backgrounds and similar color information shared among videos and frames. This condition is already well fitted by both networks during the training stage (via the [13] training set), and thus adding noise to such scenes has minimal effect during test-time refinement. Future studies should consider the constraint above to optimally perform 3DHR refinement across various dataset domains.

Conclusion
In this work, we propose a collaborative test-time refinement framework (termed 3DHR-Co) specialized for boosting the 3DHR task. Running 3DHR in an in-the-wild scenario poses an important challenge, as it is hard to obtain 3D human pose labels for training the 3DHR model. This pushes recent 3DHR works and their refinement approaches to learn with limited supervision. 3DHR-Co is proposed to answer this challenge by allowing various common 3DHR models to be boosted directly at test time. This is achieved by employing 2 technical strategies: (i) extracting reliable intermediate data at test time to be used during test-time learning, and (ii) the refinement framework itself, which leverages the collaborative strategy. Specifically, the collaborative approach involves transferring the knowledge of one 3DHR model to another. Our experimental results showed that even with common classic 3DHR models, our framework obtained a remarkable performance boost, achieving top performance and thereby revealing their true potential in the in-the-wild scenario. The explorations above are accompanied by thorough discussions to help find the best test-time refined 3DHR outcomes using our method. We also describe the limitations of our work for future improvement of 3DHR refinement studies.

Figure 5: Qualitative results of the SPIN method adapted using our pre-adaptation strategy. Significant pose changes are highlighted with red arrow marks. MPN and TCMR act as the intermediate label extractor during pre-adaptation.

Table 2: Quantitative scores from an ablation study on the number of sampled data (percentage) and noise information in our pre-adaptation strategy. The score improvements of RSN [44] are shown. MPN [42] and TCMR [3] are selected as the intermediate label data meshes generator. Our algorithm allows all methods to work collaboratively in the test-time refinement mechanism. Orange mark defines error gap < -20 mm.

Table 4: Quantitative scores of our full-scope 3DHR-Co test-time refinement strategy. We use DynaBOA to perform the bilevel function (Line 17) in our algorithm. TCMR is selected as the intermediate label data extractor. Red mark below defines error gap < -30 mm.