POP: A Generic Framework for Real-Time Pose Estimation of Planar Objects

Accurate pose estimation of planar objects is a key computation in visual localization tasks, with recent studies showing remarkable progress on a handful of baseline datasets. Nonetheless, achieving similar performance on sequences in unconstrained environments remains an open problem, largely because several sources of error exist that are correlated but often only partly tackled in the literature. In this article, we propose POP, a generic real-time planar-object pose-estimation framework designed to handle these types of errors without being tied to a specific choice of keypoint detection or tracking algorithm. The essence of POP lies in activating a keypoint detection module in the background as well as adding several refinement steps in order to reduce correlated sources of error within the pipeline. We provide extensive experimental evaluations against state-of-the-art planar object tracking algorithms on baseline and more challenging datasets, empirically demonstrating the effectiveness of the POP framework for scenes with large environmental variations.


I. INTRODUCTION
Object pose estimation is central to many applications in computer vision and robotics, namely surveillance, robot manipulation and augmented reality (AR) [11], [44]. In particular, the problem of tracking pose variation of a planar object is regaining attention in AR for user localization as planar homographies provide strong point-to-point geometric constraints for relative pose estimation [16]. This is a challenging problem since users can move fast and rotate sporadically in an unpredictable way while enjoying their AR contents in the wild, inducing significant motion blurs, viewpoint shifts and illumination changes.
Estimating the pose of a planar object has traditionally been carried out using a handcrafted modular pipeline, whereby features are initially extracted and matched, and the matched features are tracked using an optical flow algorithm (such as the Kanade-Lucas-Tomasi (KLT) tracker [32]). The tracked features are then used to estimate the underlying homography in each frame. Major baseline and state-of-the-art methods adopt this framework, albeit with various modifications, with recent work [25], [44] achieving average tracking accuracies of roughly 80-99% on several public datasets. Unfortunately, such positive results do not always carry over to unconstrained scenes encompassing large environmental variations [44], leaving this as one of the remaining challenges for this type of method. This is the main motivation of our work.
The associate editor coordinating the review of this manuscript and approving it for publication was Shiping Wen.
In recent years, several works have devised end-to-end frameworks for estimating the pose of a planar object from raw images, largely motivated by the success of deep learning-based methods in estimating poses of non-trivial rigid objects [19]. Nevertheless, state-of-the-art results for planar objects are still achieved by more traditional modular pipelines [44], primarily due to the lack of training data available for end-to-end platforms, which require large variations in backgrounds and viewpoints to learn the correct representation for planar objects [38]. This is the major bottleneck behind the current deep learning-based methods, especially in AR where scenes can exhibit large environmental changes.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
FIGURE 1. Illustration of potential error sources in estimating pose of a planar object through a traditional modularized pipeline. In (a), extracted keypoints are concentrated in a specific region or have low feature scores; (b) incorrect keypoint correspondences are formed between the initial image and the current image; (c) keypoint tracking fails (e.g. due to fast motion); (d) homography is incorrectly estimated (e.g. due to outliers). We address the problem of reducing these error sources altogether.
While it is hoped that more training data would eventually yield end-to-end methods with desirable accuracy, we believe enhancing the traditional modular pipeline to work better in unconstrained environments (frequently encountered in AR) would be a valuable contribution in its own right during this period of potential transition. Modular pose estimation pipelines usually suffer from errors arising from multiple sources, namely low-quality or biased feature extraction, feature mismatches, unreliable keypoint tracking and incorrect homography estimation (see Fig. 1 for an illustration). Regrettably, these errors are not just additive but correlated, e.g. features concentrated in a specific region can trigger biased and potentially incorrect feature matches, increasing the possibility of incorrect homography estimation. While methods in the literature typically boil down to minimizing just one source of error, the highly correlated nature of these errors inspires us to develop a framework that jointly considers all the error sources. This leads to the following contributions:
+ We activate a keypoint detection module in the background to prepare for potential keypoint tracking failures. We also adapt [15]'s SSVM approach and implement PROSAC so that the weights learned by the SSVM are used to order the sampled correspondences when estimating the homography.
+ In Sec. III-A, we propose a grid-level dynamic thresholding technique for retrieving useful keypoints relatively evenly across the query image in each frame.
+ Using the above as well as other refinement and learning modules, we propose a real-time planar object pose-estimation framework (POP, see Sec. III), which is capable of reducing bias in keypoint extraction, decreasing errors arising from feature mismatches and removing outliers for more stable homography estimation (see Fig. 2).
+ We provide a thorough experimental evaluation of our POP framework against baseline and state-of-the-art planar object tracking methods across multiple challenging benchmark datasets (namely UCSB [13], Lintrack [50], EOS [15], MTSC [49], TMT [41] and POT210 [30]), empirically demonstrating the effectiveness of our scheme especially for unconstrained scenes.
+ We present our own unconstrained dataset (MSL), which emphasizes large variations in viewpoints and zoom scales. Our POP scheme is again compared against other planar object pose estimation algorithms on this dataset.
On the other hand, this work is limited to using off-the-shelf tracking modules, and consequently errors generated within the tracking process are not refined. Nevertheless, we believe this is a pioneering work in considering multiple sources of error in a single pipeline, and it is hoped that this will provide a useful direction for future research in reducing correlated errors for planar object tracking.

II. RELATED WORK
Early learning techniques were performed using a predefined classifier trained offline. These methods are inevitably limited because such a classifier does not adaptively respond to real-time environmental changes [44]. In recent years, online learning methods operating on real-time input images have been actively studied [36]. In this section, we review relevant literature on pose estimation of planar objects, summarizing the main techniques of recent methods that are based on online learning for object tracking and pose estimation.

A. PLANAR OBJECT POSE TRACKING
In the early days, a planar object's pose was estimated by attaching markers around the object [23]. Marker-based trackers typically involve a template-based method for recognition and tracking [4]. The representative keypoint detectors for marker detection are features from accelerated segment test (FAST) [40] and adaptive and generic accelerated segment test (AGAST). These detectors are used to extract the corner points of each marker, with which the tracker estimates the respective plane geometry. This type of tracker is computationally cheap, but it has the major disadvantage that it can only be used in constrained environments where markers are present.
To address the above problem, studies began to focus on markerless tracking approaches (also known as natural feature tracking (NFT)), which can recognize objects by extracting features from images [37]. For finding feature correspondences, feature detectors such as FAST and AGAST are combined with a feature descriptor which characterizes local information around each keypoint. Such descriptors include binary robust independent elementary features (BRIEF) [8], binary robust invariant scalable keypoints (BRISK) [29] and speeded-up robust features (SURF) [3]. Nowadays, most planar object tracking pipelines [25], [44] utilize NFT and optical flow to match or track features from which the pose is estimated.
Recent work has focused mostly on improving robustness to feature mismatches. [15] advanced the Struck [14] method (mentioned below) to specialize it for planar objects using an online structured support vector machine (SSVM), which updates the pose of the target object in each frame. [49] considered the temporal and spatial consistency of extracted keypoints by adopting multi-task structured keypoint-model learning over several adjacent frames. Recently, [44] proposed Gracker, a graph-based tracker that can explore the object's structural information, improving state-of-the-art performance on many available datasets except in unconstrained environments. To date, only a few studies have investigated improving keypoint extraction and final pose refinement, but even these focus on minimizing one particular type of error.
FIGURE 2. Main contributions of our POP framework. In (a), our keypoint refinement module tries to retrieve keypoints evenly across the image by dividing the image into grids and dynamically changing the detection threshold in each grid. In (b), we apply an online structured support vector machine to weigh correspondences (between the query and initial database images) through which the homography is estimated. In (c), outlier correspondences are removed through inverse homography computation, after which the homography is re-estimated.
Several works have adopted deep learning for estimating poses of planar objects [10], [46]. Most of these studies detect the object in each frame, requiring a very large number of images for training. For training, [35] used 600 images per object, and [46] used 1,000 images. [47] proposed a method of estimating pose by recognizing a planar object from another viewpoint in real-time using multiple cameras. Nevertheless, besides the large amount of training time and data required, these trackers have only been tested on a small number of non-public custom-generated datasets, making it difficult to compare them against other pipelines.

B. REMARKS ON OTHER GENERAL OBJECT TRACKERS
We briefly review some well-known tracking methods for general objects, as some of them are used for comparisons in Sec. IV. [20]-[22] proposed the Tracking-Learning-Detection (TLD) structure, which builds the foundation of modern trackers for general objects. In this framework, the object model is continuously updated through growing events and refined through pruning events, which generate positive and negative samples respectively to update the model. Derived from this, [48] attempted to track objects by constructing a multi-scale image feature space. [6], [7] proposed the dynamic graph tracker (DGT), which applies superpixels to track efficiently using color information. Henriques et al. proposed the kernelized correlation filter (KCF), together with a fast multi-channel extension of linear correlation filters using a linear kernel. [14] proposed Struck, which learns and updates the classifier using a structured support vector machine (SSVM) and estimates the object pose accordingly.
FIGURE 3. A schematic diagram of our planar object pose-estimation (POP) framework. The basic structure is similar to that of a traditional modular pipeline for planar object tracking. As mentioned in Fig. 2, we improve the keypoint extraction module by adopting grid-level dynamic thresholding (see Sec. III-A). We activate a background feature detection module to minimize the impact on accuracy when keypoint tracking (using optical flow) fails. An SSVM-based online learning module is implemented to improve robustness against feature mismatches by learning weights for correspondences. Outliers are handled and discarded, after which the homography is refined.

III. KEY COMPONENTS IN THE POP FRAMEWORK
In this section, we describe our planar object pose-estimation (POP) framework in detail (see Fig. 3 for an overview). The foundation of POP builds upon the modular pipeline described in Sec. II. Most importantly, it comprises several additional modules and steps to account for 3 sources of errors described in Fig. 1.

A. KEYPOINT REFINEMENT VIA DYNAMIC THRESHOLDING
A majority of previous work [14], [15], [20], [26], [28], [49] weighs all detected keypoints equally during feature matching and homography computation. This can result in extracting adjacent keypoints containing similar information, potentially increasing bias towards an image region with a high concentration of keypoints as well as raising the computational burden. To resolve this, we consider the spatial distribution and quality of each keypoint, dividing the image into grids and finding keypoints that are as evenly distributed as possible and whose qualities are higher than those of the suppressed keypoints in their respective local grids.
In each grid, we set the maximum number of keypoints (M ) to a constant. If it is found that the number of keypoints (N i ) in a grid i exceeds this hard threshold, we filter and extract only the maximum number of keypoints in the grid based on the intensity corner value. If N i ≥ M , we proceed to the descriptor-generation stage for that grid. If N i < M , then we enter the re-extraction mode for additional keypoint detection.
We compute M as follows: first, we set a hard threshold on the total number of keypoints allowed in an image, defining it as Q for convenience here. Then, we divide Q by the total number of grids. In terms of equations,
$$M = \frac{Q}{W \times H},$$
where W is the number of grids along the horizontal axis and H is the number of grids along the vertical axis (we assume rectangular grids). As before, $N_i$ is the number of keypoints in grid i, which is compared against M.
As we used 4 × 4 = 16 grids, we set the re-extraction threshold (M) to 700/16 = 43.75, i.e. if a grid contains more than 43.75 keypoints, we assume the grid has enough keypoints (in which case we store up to 1000/16 = 62.5 keypoints; see Fig. 5a). During the re-extraction stage, we set the attribute of each grid by considering the standard deviation of the constituent image pixel values. If the standard deviation of pixels ($\sigma_i$) is greater than a constant threshold α, we determine that the grid contains useful information and decide to extract more keypoints by lowering the utilized detector's threshold value (β). In this article, we call this process dynamic thresholding.
On the other hand, if $\sigma_i$ < α, we assume the grid contains little information and quit the re-extraction process. Note that each grid is treated independently, meaning this stage can be processed in parallel and selects keypoints relatively evenly across the image. We set α to 50 after empirically observing its slightly superior tracking accuracy over other tested pixel standard deviation (α) values (see Fig. 5b). We also initially set the threshold (β) of the AGAST detector to 40; this value is lowered to 10 during every re-extraction stage.
Fig. 4 shows the advantage of our refinement module. First, Fig. 4a shows keypoints locally concentrated in specific regions of the image, increasing the bias of extracted features. Second, Fig. 4b shows that performing grid-level keypoint extraction reduces this bias to a certain extent by suppressing a large number of keypoints. Last, Fig. 4c demonstrates that re-extracting keypoints via dynamic thresholding helps to spatially even out keypoints.
The addition of keypoint refinement does not necessarily lead to a much increased runtime. This is partially because the module limits the number of keypoints, requiring less time during feature matching. The vanilla tracking framework (Fig. 4a) requires 126.09 ms per frame, whereas considering only spatial distribution (Fig. 4b) requires 73.84 ms and utilizing our keypoint refinement (Fig. 4c) requires 83.56 ms. Hence, the refinement module can be implemented without compromising real-time performance; in these measurements it actually improves it.
After going through the above-mentioned re-extraction stage, the set of extracted keypoints (K) is used for matching and pose estimation.
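The grid-level procedure of this subsection can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `detect` is a hypothetical callable standing in for any corner detector (e.g. AGAST) that returns `(x, y, response)` tuples for an image patch, the per-grid cap of 62.5 is floored to 62, and the function and parameter names are ours. The numeric defaults (4 × 4 grids, 700 and 1000 keypoints, α = 50, β from 40 down to 10) are the values quoted in the text.

```python
import numpy as np

def refine_keypoints(image, detect, grid=(4, 4), q_reextract=700, q_store=1000,
                     alpha=50.0, beta_init=40, beta_low=10):
    """Grid-level keypoint extraction with dynamic thresholding (sketch).

    detect(patch, threshold) is a hypothetical detector callable returning a
    list of (x, y, response) tuples in patch coordinates.
    """
    rows, cols = grid
    m_reextract = q_reextract / (rows * cols)  # 700 / 16 = 43.75 in the text
    m_store = int(q_store / (rows * cols))     # 1000 / 16 = 62.5, floored (assumption)
    h, w = image.shape[:2]
    ys = np.linspace(0, h, rows + 1, dtype=int)
    xs = np.linspace(0, w, cols + 1, dtype=int)
    kept = []
    for r in range(rows):
        for c in range(cols):
            patch = image[ys[r]:ys[r + 1], xs[c]:xs[c + 1]]
            kps = detect(patch, beta_init)
            # Re-extract with a lower threshold only in sparse but textured grids.
            if len(kps) < m_reextract and patch.std() > alpha:
                kps = detect(patch, beta_low)
            # Keep the strongest responses up to the per-grid cap.
            kps = sorted(kps, key=lambda k: k[2], reverse=True)[:m_store]
            kept.extend((x + int(xs[c]), y + int(ys[r]), s) for x, y, s in kps)
    return kept
```

Because each grid is handled independently inside the double loop, the loop body could be dispatched to parallel workers, in line with the remark above that this stage can be processed in parallel.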

B. IMPROVING ROBUSTNESS TO FEATURE MISMATCHES
Matched keypoints typically contain a non-negligible portion of outlier correspondences, which need to be discarded for reliable tracking and pose estimation. We employ two modules to enhance robustness to these mismatches.
First, we activate a keypoint detection module in the background (i.e. simultaneous detection and tracking) to continuously detect image features irrespective of the optical flow-based tracking outcomes. This allows the tracker to utilize keypoints extracted by the background module when tracking fails, without having to wait for the next frame.
Second, we combine learning-based correspondence weighting with a RANSAC-variant algorithm to improve outlier detection. Previous work has implemented either a RANSAC-variant algorithm (e.g. simple tracker) or correspondence weight learning (EOS) for outlier detection. In this work, we apply a combination of both to obtain a better result. More specifically, for each frame, keypoint correspondences between the DB image and the query image are weighted using a structured support vector machine (SSVM), after which they are used for robust homography estimation through a 4-point PROSAC [9] algorithm. The estimated candidate homographies are then fed in as training data for learning correspondence weights online. Using the updated model, we iterate over the keypoint correspondences again to recompute the homography based on the new weights.

1) ROBUST HOMOGRAPHY ESTIMATION
Homographies are calculated using the modified PROSAC algorithm and the matched keypoints between the query (input) image and the database (DB) image. For elaboration, let $K_d = \{k_d^0, k_d^1, \ldots, k_d^m\}$ be a set of keypoints from the DB image and $K_q = \{k_q^0, k_q^1, \ldots, k_q^n\}$ be a set of keypoints from the query image. We additionally define $\mathcal{C}$ as a set of matched keypoints such that $(i, j) \in \mathcal{C}$ means keypoint i from the DB image matches keypoint j from the query image. Keypoint correspondences are sampled multiple times from $\mathcal{C}$, with 4 matches at a time, to yield a set of candidate homographies between the DB image and the query image (we set the number of iterations to 1000). The solution maximizing the number of inlier correspondences, each weighted by its matching score $s_{ij}$, is chosen as the homography matrix H, i.e.
$$H = \arg\max_{H'} \sum_{(i,j) \in \mathcal{C}} s_{ij}\, \rho\big(\lVert k_q^j - \pi(H' \tilde{k}_d^i) \rVert_2\big),$$
where $\rho$ is a standard RANSAC robust kernel, $\tilde{k}_d^i$ denotes $k_d^i$ in homogeneous coordinates and $\pi([x, y, z]) := [x/z, y/z]$ is a 3D to 2D projection function. We have designed the matching score $s_{ij}$ to combine the score output by the descriptor matching stage (e.g. FREAK) with that of the SSVM (Eq.(4)): $c_{ij}$ is the descriptor match score, $d_j$ is a normalized descriptor vector of keypoint j in the input image and $w_{ij}$ is the weight vector predicted by the SSVM given a correspondence (i, j). For the first frame, $w_{ij}$ is set to an all-ones vector across all correspondences.
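The selection rule described above (pick the candidate homography with the largest score-weighted inlier count) can be illustrated with a short sketch. It assumes the robust kernel is a hard threshold on reprojection distance, with a threshold `tau` that we chose for illustration; the function names are ours, and candidate generation by PROSAC sampling is left out.

```python
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography to Nx2 points and dehomogenize (the pi function)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

def weighted_inlier_score(H, pts_db, pts_q, scores, tau=5.0):
    """Sum of matching scores s_ij over correspondences within tau pixels.

    tau is an assumed inlier threshold for the indicator-style kernel.
    """
    err = np.linalg.norm(project(H, pts_db) - pts_q, axis=1)
    return float(np.sum(scores[err < tau]))

def select_homography(candidates, pts_db, pts_q, scores, tau=5.0):
    """Return the candidate homography with the highest weighted inlier score."""
    return max(candidates,
               key=lambda H: weighted_inlier_score(H, pts_db, pts_q, scores, tau))
```

Here row n of `pts_db` and row n of `pts_q` form one correspondence, and `scores[n]` plays the role of its matching score.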

2) ONLINE LEARNING OF A STRUCTURED SUPPORT VECTOR MACHINE (SSVM) FOR LEARNING DESCRIPTOR WEIGHTS
We train an SSVM that learns to weigh keypoint descriptors given a pair of keypoint correspondences as input. For this purpose, we utilize the candidate homography matrices estimated in Sec. III-B.1. Learning is carried out using the structured output maximum margin framework [42]. Given a pair of keypoint sets $\{K_d, K_q\}$ and a set of candidate homographies $\{H\}$, the weights $\{w_{ij}\}$ from Eq.(4) can be learned via a large-margin framework by solving the optimization problem of Eq.(5), where w is a stacked vector of all weights $w_{ij}$. $\delta H$ denotes the task-dependent structured error incurred by predicting output H instead of the observed output $H_g$. The slack variable ξ measures the surrogate loss for the keypoints and c is the regularization parameter. The SSVM constraints force $\delta H$ at $H = H_g$ to be always greater than $\delta H_g$ through the slack variables $\{\xi_g\}$.
In other words, δH can be interpreted as a guarantee of improved performance. As shown in Eq.(5), γ ij encourages higher weights for inlier keypoint correspondences and lower weights for outliers. The loss function expresses a finer distinction between H g and H, which plays an important role in the SSVM.
Eq.(5) can be solved offline as a batch problem, but for real-time applications, $w_{ij}$ should be updated online to adapt the model to a given environment. As a result, we redefine Eq.(7) to ensure real-time operation, formulating the learning-based optimization as a minimization over w in which τ is a leveraging parameter. For speed improvement, we have employed a binary descriptor for $d_j$. Since the SVM requires a real vector as input, we divide the descriptor into 8-bit binary numbers and convert each to a real number.
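The descriptor conversion mentioned at the end of this paragraph can be sketched as follows. The grouping into 8-bit numbers follows the text; the choice of L2 normalization for the "normalized descriptor vector" and the function name are assumptions made for illustration.

```python
import numpy as np

def binary_descriptor_to_real(bits):
    """Convert a binary descriptor (sequence of 0/1 bits) into a real vector.

    Each group of 8 bits becomes one integer in [0, 255]; the result is then
    L2-normalized (the normalization scheme is an assumption).
    """
    bits = np.asarray(bits, dtype=np.uint8)
    byte_values = np.packbits(bits)        # most significant bit first
    vec = byte_values.astype(np.float64)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```

A 512-bit descriptor would thus yield a 64-dimensional real vector, which keeps the SSVM input small compared with treating each bit as its own dimension.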

C. OUTLIER REMOVAL
Once the homography is estimated, outlier keypoints are discarded and no longer tracked. To achieve this, we check three criteria for each keypoint match: (a) keypoint error distance, (b) stability of the inlier/outlier mask and (c) stability of the computed homography. For the first criterion, we employ an inverse-homography ($H^{-1}$) computation and use each matched keypoint (i) from the query (input) image to estimate the location of the corresponding keypoint (j) in the DB image (see Eq. (8), (9)). For a pair of matches (i, j), the estimated keypoint j in the DB image ($\hat{k}_d^j$) from a keypoint i in the query image ($k_q^i$) is obtained through
$$\hat{k}_d^j = \pi\big(H^{-1} \tilde{k}_q^i\big),$$
where $\tilde{k}_q^i$ denotes $k_q^i$ in homogeneous coordinates. If the distance between the actual keypoint j and the estimated keypoint j ($\lVert k_d^j - \hat{k}_d^j \rVert_2$) exceeds the threshold value of 5 pixels, the corresponding pair of matches is discarded (see Fig. 7a).
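A minimal sketch of the first criterion is given below, assuming H maps DB-image coordinates to query-image coordinates and that correspondences are stored as row-aligned arrays; the function name is ours, while the 5-pixel threshold is the one stated above.

```python
import numpy as np

def reprojection_inlier_mask(H, pts_q, pts_d, max_dist=5.0):
    """Back-project query keypoints with H^-1 and compare with DB keypoints.

    Returns a boolean mask that is True for matches whose back-projected
    location lies within max_dist pixels of the matched DB keypoint.
    """
    H_inv = np.linalg.inv(H)
    q_h = np.hstack([pts_q, np.ones((len(pts_q), 1))])  # homogeneous coordinates
    back = q_h @ H_inv.T
    est_d = back[:, :2] / back[:, 2:3]                   # the pi projection
    return np.linalg.norm(est_d - pts_d, axis=1) <= max_dist
```

Matches for which the mask is False would be discarded before the homography is re-estimated.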
TABLE 2. Ablation study results of the POP framework. The top row shows the baseline tracker's mean accuracy (i.e. without any of the 3 components in Section III). The remaining rows have been generated by enabling different combinations of the modules proposed in Section III, namely keypoint refinement (KR), learning for improving robustness to feature mismatches (LIR) and outlier handling for homography estimation (OH).
TABLE 3. Average accuracies (%) and per-frame runtimes (s) of planar object trackers on various benchmarks. Our AGAST-FREAK-implemented POP algorithm shows the highest accuracy with the smallest standard deviation across all tested datasets.
For the second criterion, we inspect the stability of the inlier/outlier mask of matched keypoints over a window of 20 frames. For each keypoint track, we record the number of times the keypoint changes from being an inlier to an outlier or vice versa. Any keypoint track switching more than 5 times is regarded as unstable and removed. If this results in fewer than 4 inlier point tracks (i.e. the minimum number of points required to compute a homography), we utilize the background detection module's matched keypoints to recompute the homography and discard outlier keypoints based on the first criterion only. This has the effect of filling up a new pool of keypoint tracks.
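The second criterion (mask stability over a sliding window) can be sketched with a small per-track helper. The class and method names are hypothetical; the window of 20 frames and the limit of 5 switches are the values quoted in the text.

```python
from collections import deque

class TrackStability:
    """Counts inlier/outlier flips of one keypoint track over a sliding window."""

    def __init__(self, window=20, max_switches=5):
        self.history = deque(maxlen=window)  # most recent inlier flags
        self.max_switches = max_switches

    def update(self, is_inlier):
        """Record the inlier flag of this track for the current frame."""
        self.history.append(bool(is_inlier))

    def switches(self):
        """Number of inlier<->outlier transitions inside the window."""
        flags = list(self.history)
        return sum(a != b for a, b in zip(flags, flags[1:]))

    def is_stable(self):
        """A track switching more than max_switches times is unstable."""
        return self.switches() <= self.max_switches
```

One such object would be kept per keypoint track and updated once per frame; tracks reported as unstable are dropped from the pool.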
For the last criterion, we check the smoothness of the tracked object's pose by observing the magnitude of drift of the object vertices. The 4 vertices of the planar object are retrieved from the DB image and then projected to the query image. Assuming similar projections of the vertices have been made for the previous frame q − 1, we compute the sum of vertex movements as follows:
$$\sum_{j=1}^{4} \big\lVert \pi(H_q \tilde{k}_d^j) - \pi(H_{q-1} \tilde{k}_d^j) \big\rVert_2^2, \tag{10}$$
where j denotes a vertex keypoint, $\tilde{k}_d^j$ is the vertex in homogeneous coordinates, and $H_q$ and $H_{q-1}$ are the homography matrices for the current query image and the previous query image respectively. If the square root of Eq.(10) is above 30 pixels, we determine that the homography is unstable. Additionally, if the projected vertices (in the query image) are not in the correct order (due to a twisted or concave motion) or are concentrated in a very small region, the corresponding homography is also discarded. In these unstable cases, we fall back to the homography estimated from the previous frame.
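The third criterion can be sketched as below. The drift test follows the description above (square root of the summed squared vertex movements compared against 30 pixels); the vertex-order test via cross-product signs is one possible way to detect twisted or concave quadrilaterals and is our assumption, as are the function names. The check for vertices collapsing into a very small region is omitted because the text gives no threshold for it.

```python
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography to Nx2 points and dehomogenize."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]

def is_convex_quad(p):
    """True if the 4 ordered vertices turn consistently in one direction."""
    signs = []
    for k in range(4):
        a, b, c = p[k], p[(k + 1) % 4], p[(k + 2) % 4]
        cross = (b[0] - a[0]) * (c[1] - b[1]) - (b[1] - a[1]) * (c[0] - b[0])
        signs.append(cross > 0)
    return all(signs) or not any(signs)

def homography_is_stable(H_curr, H_prev, vertices_d, max_drift=30.0):
    """Check vertex drift between consecutive frames and vertex ordering."""
    p_curr = project(H_curr, vertices_d)
    p_prev = project(H_prev, vertices_d)
    drift = np.sqrt(np.sum((p_curr - p_prev) ** 2))
    return bool(drift <= max_drift) and is_convex_quad(p_curr)
```

When this returns False, the caller would keep the previous frame's homography, as described in the text.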

IV. EXPERIMENTAL RESULTS
We conducted 4 experiments to analyze the performance of our POP framework in detail. First, we conducted an ablation study to empirically show the performance gain brought by each stage of the framework and indicate the need for all the stages introduced to achieve top performance. Second, we compared the tracking accuracies achieved by different combinations of feature detectors and descriptors when using POP to demonstrate the generic nature of the framework. Third, we compared our winning POP combination from the second experiment against other baseline and state-of-the-art planar tracking algorithms on various datasets, empirically demonstrating the scheme's robustness to unconstrained tracking environments. Last, we selected the two best performing algorithms from the 3rd experiment and compared their accuracies on more challenging datasets (TMT and POT210) comprising various unconstrained tracking environments. All the experiments were conducted on a PC with an Intel Core i7-4790 (3.6 GHz) CPU and 16 GB RAM.

A. ABLATION STUDY
To empirically observe how each component in Section III affects the tracking performance, we tested all $2^3 = 8$ combinations, which were created by disabling different sets of the 3 proposed modules. The corresponding results are shown in Table 2. From Table 2, we verify that adding each component to the baseline planar object tracker improves mean tracking accuracy, justifying the need for each stage in POP.
For measuring the tracking success rate, we reported the proportion of frames satisfying two criteria. First, the fraction of matched keypoints over the total number of keypoints must exceed η, which we set to 20%. Second, the value of the scoring function $S(\hat{H}, H)$ has to be smaller than the threshold value $T_d$, set to 10. We used [15]'s scoring function, defined as
$$S(\hat{H}, H) = \frac{1}{4} \sum_{k=1}^{4} \big\lVert \pi(\hat{H} c_k) - \pi(H c_k) \big\rVert_2,$$
where $\hat{H} \in \mathbb{R}^{3 \times 3}$ is the ground truth homography matrix, $H \in \mathbb{R}^{3 \times 3}$ is the predicted homography matrix [15] and $\{c_k\} \subset \mathbb{R}^3$ are the homogeneous vertices of a square of side length 2 centered at the origin (for the purpose of normalization). We reported the number of frames with $S(\hat{H}, H) < T_d$.
Table 4 summarizes the tracking accuracies achieved by each combination. AGAST-FREAK achieved the best tracking success rate, reaching 99.92%, while the AGAST-SURF pair showed the worst result of 62.03%. Looking at the average accuracy of each detector and descriptor, AGAST performs the best across all detectors and BRIEF achieves top performance amongst the tested descriptors (see Table 5). The average tracking accuracy achieved across all combinations is 91.93%, indicating that our POP framework is generic and not necessarily dependent on a particular selection of detector and descriptor.
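The two success criteria can be sketched as follows. The averaging over the four vertices reflects our reading of the scoring function of [15] and should be treated as an assumption, as should the function names; η = 20% and the threshold of 10 are the values given in the text.

```python
import numpy as np

# Homogeneous vertices of a square of side length 2 centered at the origin.
SQUARE = np.array([[-1.0, -1.0, 1.0],
                   [1.0, -1.0, 1.0],
                   [1.0, 1.0, 1.0],
                   [-1.0, 1.0, 1.0]])

def homography_score(H_gt, H_pred):
    """Mean distance between the square's vertices mapped by both homographies."""
    def proj(H):
        p = SQUARE @ H.T
        return p[:, :2] / p[:, 2:3]
    return float(np.mean(np.linalg.norm(proj(H_gt) - proj(H_pred), axis=1)))

def tracking_success(H_gt, H_pred, match_ratio, eta=0.2, t_d=10.0):
    """A frame counts as tracked if both criteria of the text hold."""
    return bool(match_ratio > eta and homography_score(H_gt, H_pred) < t_d)
```

The success rate over a sequence is then the fraction of frames for which `tracking_success` returns True.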
For measuring the tracking accuracy, we used a region-of-interest (ROI) overlap ratio [12], which computes the proportion of the area correctly detected by the tracker with respect to the ground truth. This is a widely used metric for comparing pose estimation accuracy [44]. For the UCSB datasets, we used the area within the black-and-white frame as ground truth. For the others, we used the ground truth regions provided by the dataset authors. Table 3, Table 8 (detailed pose estimation results), Fig. 8 and Fig. 11 show that our POP tracker overall outperforms the other trackers, achieving the best tracking accuracy on most benchmarks. GPF, EOS and DGT were found to fail more frequently during fast motion or rapid scale changes. In particular, GPF and DGT showed weaknesses for fast movement and rotation.
For most benchmarks, we found that our POP tracker and Gracker showed comparatively high accuracies. As shown in Fig. 11, most of the baseline algorithms were vulnerable to scenes with strong perspective distortion. Gracker in particular showed worsened performance as the viewpoint angle increased. DGT and GPF showed significantly lower accuracies in unconstrained, varied scene conditions. KCF showed weaknesses for datasets with occlusions. This is most likely caused by the algorithm being confused by similar texture in the neighborhood.
Our real-time performance comparisons can be found in Table 3 and Table 8. The fastest runtime is achieved by KCF, although its accuracy shows large variations. On the other hand, the slowest object tracker was DGT which also showed poor accuracies. Gracker showed similar but slightly slower runtime than ours. POP took the third longest time out of the tested algorithms, but its accuracy is overall consistently high and still can be run in real-time across all datasets (see Table 3 and Table 8).

D. TRACKING ACCURACIES IN TMT AND POT210
In the last experiment, we selected the two best performing algorithms from the 3rd experiment (see Section IV-C), namely Gracker and POP, and compared their performances on more challenging datasets, namely TMT and POT210. The TMT dataset consists of the sequences named tilt, zoom, occlusion, rotation, translation and unconstrained, and the POT210 dataset comprises scale change, rotation, perspective distortion, motion blur, occlusion, out-of-view, and unconstrained. Table 6, Table 7, Fig. 9 and Fig. 10 show that the POP-based tracker overall outperforms Gracker. POP achieved higher accuracies in 7 out of 9 sequences in TMT. POP's mean accuracy was 4.1% higher than that of Gracker (see Table 6).
On the POT210 dataset, POP showed better accuracies in 4 out of 7 sequences, with a mean accuracy 5.8% above that of Gracker. The reason for such a large difference in the mean accuracy is that POP showed relatively consistent results across the sequences, while Gracker showed extremely poor performance on the motion blur and unconstrained sequences. We believe POP's ability to recover from tracking failures through its parallel detection and tracking mechanism, as well as the dynamic thresholding of grid-level keypoint detection, helps to produce consistent results on challenging scenes (see Table 7 for details).

V. CONCLUSION
In this article, we have proposed POP, a generic planar object pose-estimation framework for mitigating several correlated sources of error arising in unconstrained environments. More specifically, we introduced a keypoint refinement step to improve the quality of keypoints, added a detection module working in the background as well as an online learning framework for more accurate homography estimation to reduce feature matching errors, and attached an outlier removal step to minimize errors from homography estimation. Through extensive experimental comparisons, we empirically demonstrated the effectiveness of POP in various situations including unconstrained environments. We believe this suggests a need to consider all sources of error simultaneously to achieve stable performance in such dynamic scenes. Future work will focus on improving each added module in POP, for instance by introducing a learning technique for pose estimation or transferring advantages of other tracking algorithms. We will also investigate reducing errors arising from unreliable keypoint tracking, aiming to reach the full potential of POP in versatile AR environments.