Registration of Consecutive Frames From Wireless Capsule Endoscopy for 3D Motion Estimation

Wireless Capsule Endoscopy (WCE) is a non-invasive medical procedure devised for painless in vivo inspection of the gastrointestinal (GI) tract. It is especially valuable for the examination of the small intestine since it is difficult to reach by traditional endoscopic procedures. The setup includes a camera with an embedded light source and a circuit capable of acquiring and transmitting the video. The main challenge of this technology is the identification of the position and trajectory of the capsule as it travels through the GI tract, which is particularly relevant during the detection of anomalies in the tissue. Given only the information provided by the recorded images, it is possible to estimate the 3D motion of the camera capsule and provide a full trajectory reconstruction. A critical yet difficult step in this process is the image registration between sequential frames. Therefore, being able to determine accurate correspondences between points, regions or features in two consecutive frames is crucial for the computation of the relative rotation and translation of the capsule. This paper comprises a comparative assessment of methodologies to address this problem with a porcine colon dataset obtained with our experimental setup.


I. INTRODUCTION
Endoscopic capsules are currently used in a variety of medical exams to inspect the full length of the GI tract. They constitute a non-invasive approach that avoids the risks associated with sedation and the risk of perforation posed by standard endoscopes. This makes WCE an attractive alternative exam in gastroenterology, especially for patients who require repeated inspection at regular intervals [1] and for the examination of the small intestine, which is not easily reached with other conventional endoscopic procedures [2]. During a standard WCE exam, the patient ingests a capsule that travels along the GI tract moved by peristalsis. Each capsule is equipped with light-emitting diodes and one or more cameras that acquire a sequential set of frames that is then transmitted to a recorder. The frames are low-resolution images affected by significant geometric and radiometric distortion due to the small-sized lenses and poor lighting conditions. These images are inspected by a clinician to identify and locate possible lesions such as polyps and ulcers [2].

The associate editor coordinating the review of this manuscript and approving it for publication was Larbi Boubchir.
One of the difficulties of the WCE procedure is the length of the reading and video reporting time. The average video report time is 30 to 60 minutes, depending on the trajectory of the capsule along the GI tract and the pathology in question. An extra challenge is the risk of missing particular pathologies, including subtle mucosal bulges, given the complex nature of lesions and the fatigability of the human eye. Another complication regarding WCE is the possibility of retention, which requires endoscopic or surgical intervention to retrieve the capsule. The retention rate is 1.2-2.1% in patients with suspected small bowel bleeding; 2.35% for suspected small bowel Crohn's disease; 4.63% for established small bowel Crohn's disease; 2.2% for patients with abdominal pain and diarrhoea; and 2.1% in patients with neoplastic lesions [1]. Lastly, one of the main drawbacks of the WCE technique is the lack of information regarding both the position and orientation of the endoscopic capsule as it moves throughout the GI tract [3]. Information regarding the localization and motion of the capsule is particularly valuable when an abnormality is detected in the tissue. The development of automatic methods to overcome this limitation is therefore essential. These methods can be based on a variety of principles and techniques. Several computer vision techniques are based on image analysis alone, using the images acquired by the capsules to estimate motion and displacement over time. Recently, artificial intelligence has produced notable progress in this field [1].

VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

A. PAPER ORGANIZATION
In this paper, after discussing the related work, a novel experimental setup is presented in the data acquisition section. The approach chosen for the identification and localization of the capsule, presented in the methodology section, is based on image analysis alone. The image registration step of the localization process is addressed within the framework of WCE video frames by exploring different image registration approaches. The main goal is to estimate robust correspondences between overlapping regions of closely spaced frames from the acquired data. Lastly, for a quantitative evaluation of the results, the computation of a residual error using the fundamental matrix and the computation of the corresponding rotation and translation errors using the essential matrix are presented. The results obtained are then presented and discussed.

II. RELATED WORK
WCE localization systems are broadly classified into three types according to the sensing method: magnetic-field-strength methods, electromagnetic wave and field-based methods, and image-based methods [4], [5], [8], [9]. Magnetic localization techniques can be implemented using an internal permanent magnet in the capsule and a sensing module outside the capsule. An alternative to this approach is to use a magnet outside and a sensing module inside the capsule [15]. Other methods combine magnetic localization with magnetic actuation. The PillCam capsule from Medtronic, for example, can perform localization using a set of 8 receivers located on the patient's abdomen. The intensity of the Radio Frequency (RF) signals is used for the estimation of the location of the capsule, an approach that does not require any additional equipment [16]. RF-based approaches can use various principles for capsule localization: radio frequency identification (RFID); time of arrival (TOA); direction of arrival (DOA); time difference of arrival (TDOA); angle of arrival (AOA); and received signal strength indicator (RSSI) [4], [6], [7], [17]. There are also approaches combining RF localization and computer vision to determine 3D motion [18] and capsule orientation [19] that are complementary to magnetic techniques.

A. COMPUTER VISION METHODOLOGIES
Since endoscopy capsules are equipped with cameras and light-emitting diodes, other approaches are based on image analysis and computer vision [20]. Computer vision methodologies measure the displacement of the capsule inside the GI tract as the rigid motion of the capsule [21] by retrieving, for example, visual features or image intensity changes between video frames [22]. In the first part of this process, after the extraction of points of interest, features or visual cues, the image registration between video frames is performed. Next, the 3D rigid motion of the camera capsule between frames is estimated to allow for odometry and capsule localization. This computation is obtained relative to the capsule itself given internal landmarks and taking into account the luminal geometry [21], [23]. The estimation of the 3D motion of the capsule relies on the computation of the Essential matrix [24]. Given two images A and B acquired by a calibrated camera, with $I_A$ and $I_B$ representing the homogeneous coordinates of the pixels of the images, the following relationship applies:

$$I_B^{T} \, E \, I_A = 0 \qquad (1)$$

where $E$, the Essential matrix, is a $3 \times 3$ matrix of rank 2.
Matrix $E$ can be expressed as the product of a 3D rotation matrix $R$ and a skew-symmetric matrix $T$ built from the elements of the translation vector $t = (t_x, t_y, t_z)$, where $R$ and $t$ describe the rotation and translation between the two camera positions of A and B. Given the matrix $E$, the 3D rotation $R$ and translation $t$ can be computed up to a scale factor. Most of the localization methods mentioned require a module external to the capsule, which complicates the process. For this reason, computer vision approaches that only require the information provided by the recorded frames are quite promising. Still, the main difficulty of these techniques lies in the search for sufficient and robust corresponding points, regions or features between frames in order to accurately compute the essential matrix.
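The relationship between the essential matrix, the rotation and the translation can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the convention $X_B = R X_A + t$ (for which $E = [t]_\times R$), and the pose and 3D point are hypothetical values chosen only for illustration.

```python
import math

def skew(t):
    """Skew-symmetric matrix [t]_x, so that [t]_x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def det3(m):
    """Determinant of a 3x3 matrix (zero when the matrix is rank deficient)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def essential_from_pose(R, t):
    """Assumed convention: X_B = R X_A + t, which gives E = [t]_x R."""
    return matmul(skew(t), R)

# Hypothetical relative pose: 5 degree rotation about the optical axis
# plus a small translation (arbitrary units).
angle = math.radians(5.0)
R = [[math.cos(angle), -math.sin(angle), 0.0],
     [math.sin(angle), math.cos(angle), 0.0],
     [0.0, 0.0, 1.0]]
t = [0.1, 0.0, 0.02]
E = essential_from_pose(R, t)

# Project one hypothetical 3D point into both cameras (normalized coordinates).
X_a = [0.3, -0.2, 2.0]
X_b = [r + ti for r, ti in zip(matvec(R, X_a), t)]
x_a = [c / X_a[2] for c in X_a]
x_b = [c / X_b[2] for c in X_b]

# Epipolar constraint x_B^T E x_A; it should vanish up to rounding error.
residual = sum(xb * e for xb, e in zip(x_b, matvec(E, x_a)))
print(abs(residual) < 1e-12, abs(det3(E)) < 1e-12)
```

Scaling `t` by any non-zero factor leaves the constraint satisfied, which is why the translation can only be recovered up to a scale factor.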

B. IMAGE REGISTRATION
The estimation of matrix $E$ requires that corresponding geometrical entities such as points, lines or regions are determined. Therefore, the estimation of the 3D capsule displacement requires image registration. The registration process is an alignment problem, and it can be viewed as a spatial transformation of matching points between two sets of data [25]. The registration process involves recovering the spatial transformation $T$ that maps $I_B$ to $I_A$:

$$I_A = T(I_B) \qquad (3)$$

As a result of Equation (1), the mapping depends on the depth of the 3D points. For capsule endoscopy, image registration depends on image changes due to both capsule motion and intestinal motion. The intrinsic camera parameters and the distortion coefficients associated with the capsule's camera are extracted prior to the registration step by appropriate calibration in order to remove distortion in the endoscopy frames [25].
Image registration methods can be grouped into direct (or pixel-based) strategies and feature-based strategies [11]-[13].

1) DIRECT (PIXEL-BASED) VS FEATURE-BASED METHODS
Direct or pixel-based methods choose a suitable motion model to define the alignment between a pair of images, compute its parameters, shift or warp the images relative to each other and measure how well the pixels agree [13]. An error metric is chosen for the comparison and a search technique is also devised. The simplest technique is a full search that tries all possible alignments, which can be computationally expensive. Alternatively, approaches that resort to Fourier transforms and hierarchical coarse-to-fine approaches based on image pyramids can be used to speed up the computation [14]. Other approaches are based on the Taylor series expansion of the image function to obtain sub-pixel precision in the alignment [13].
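The full-search strategy described above can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the motion model is restricted to integer translations, the error metric is the mean squared difference over the overlapping region, and the synthetic `pattern` image and function names are hypothetical.

```python
def shift_error(ref, tmpl, dx, dy):
    """Mean squared difference between ref[y][x] and tmpl[y + dy][x + dx],
    evaluated only where the shifted pixel falls inside the image."""
    h, w = len(ref), len(ref[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            ys, xs = y + dy, x + dx
            if 0 <= ys < h and 0 <= xs < w:
                diff = ref[y][x] - tmpl[ys][xs]
                total += diff * diff
                count += 1
    return total / count if count else float("inf")

def full_search(ref, tmpl, max_shift):
    """Try every integer shift in [-max_shift, max_shift]^2 and keep the best."""
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            err = shift_error(ref, tmpl, dx, dy)
            if best is None or err < best[0]:
                best = (err, dx, dy)
    return best[1], best[2]

def pattern(x, y):
    """Synthetic, non-periodic intensity pattern used only for this example."""
    return x * x + 3 * y * y

size = 8
ref = [[pattern(x, y) for x in range(size)] for y in range(size)]
# Template: the same pattern displaced by 2 pixels in x and 1 pixel in y.
tmpl = [[pattern(x - 2, y - 1) for x in range(size)] for y in range(size)]

best_shift = full_search(ref, tmpl, max_shift=3)
print(best_shift)
```

The cost grows with the square of `max_shift` times the number of pixels, which is the motivation for the Fourier-based and coarse-to-fine alternatives mentioned in the text.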
The other main registration strategy, in contrast to direct methods, is the feature-based approach. In this technique, the algorithms first extract distinguishing features from both images, match the individual features and then determine a global correspondence in order to compute a robust geometric transformation between them [12].
In older feature-based methods, when the images were poorly textured, the features ended up being unevenly distributed and the algorithms were not able to provide accurate matches for pairs that should have been aligned [12]. Additionally, in some of these feature-based methods, the matching relied solely on the cross-correlation between regions comprising the features, which failed to produce a good alignment when the images were rotated. In contrast, direct methods use all available information, since every pixel contributes. However, these methods have a limited range of convergence. To overcome this challenge, coarse-to-fine techniques are generally used, but the addition of more levels to the pyramid often ends up blurring important image details. Recent feature-based methods operate in scale-space and use orientation-invariant descriptors to match images that differ in scale and orientation. These descriptors are designed for repeatability and the extracted features end up being well distributed, which produces enough correspondences [12].

2) ARTIFICIAL INTELLIGENCE STRATEGIES
Artificial Intelligence (AI) methodologies devised for image registration are considered feature-based methods given the search technique for the correspondences. In machine learning strategies, image features are first extracted by the user and then an artificial neural network system is used in order to predict and/or classify the new data [1]. This learning process can be performed in a supervised or unsupervised manner, depending on whether or not ground truth information is available. Deep learning refers to a class of artificial neural network systems with several layers that have the advantage of automatically extracting features. In the medical image analysis field, the most used deep neural network structure is the convolutional neural network (CNN) [1].

III. DATA ACQUISITION
The data used for this assessment is the Mirocam dataset obtained with our novel experimental setup [10]. Since the colon is not in-vivo and the capsule cannot be moved through peristaltic movements, it was crucial to develop a different approach for the movement of the capsule along the ex-vivo porcine colon. The colon was cut longitudinally and fixed into a foam with a previously excavated path. This novel experimental setup allows for the recording of video frames along the entire length of a fixed ex-vivo porcine colon by a camera capsule that is moved by a robotic manipulator while it stores 3D motion information at each instant [10]. This setup provides valuable ground truth information regarding the sequence of camera poses at any given instant and consequently the camera's trajectory.
Hence, as presented in Figure 1, the experimental setup includes an ex-vivo porcine colon attached to a scaffold, a camera capsule, a capsule holder, a robotic arm, a data belt, a receiver and a receiver cradle. The capsule used was the MC1000 Mirocam Capsule from IntroMedic, which has a static frame rate of 3 FPS. The camera is attached to the gripper of a robotic arm with a two-piece capsule holder and moved through a preprogrammed path along a harvested ex-vivo porcine colon previously sutured into an excavated foam scaffold. In a normal exam, the patient wears the belt around the waist and the signal is transmitted from the capsule to the belt through the skin. In this case, since there is no patient, the signal cannot be transmitted through the skin, so it is carried to the belt through double-ended alligator clamps.
Throughout the experiment, the robotic arm recorded the orientation and position of the gripper that holds the camera along the predefined trajectory at regular time intervals, information that is inaccessible in WCE exams. Hand-eye calibration was previously performed to estimate the rigid transformation between the gripper and the capsule camera.
A pinhole camera model with radial distortion was considered for the calibration of the capsule camera. The calibration parameters are presented in Table 1. Given that the frame rate of the capsule is fixed, it is possible to compare the orientation and location registered by the robot at each time interval with the rotation and translation computed from the data extracted from the image sequence with each registration method, to identify the most suitable methodology for trajectory reconstruction of WCE frames.

IV. METHODOLOGY
An overview of the methodology used for this assessment is presented in Figure 4. Both direct (or pixel-based) and feature-based methods were explored. Some of these registration approaches are agnostic since they do not use a parametric model, while others do. For the direct (or pixel-based) method, a hybrid multi-scale elastic model with an affine pre-registration (MEIR/MPIR) developed especially for WCE video frames [32] was explored. The feature-based methods explored were chosen according to the results of the comparative assessment in [26], performed specifically with images from wireless capsule endoscopy. These methods are Scale-Invariant Feature Transform (SIFT), Speeded Up Robust Features (SURF), Maximally Stable Extremal Regions (MSER) and Local Intensity Order Pattern (LIOP). Lastly, a commonly used deep-learning method for optical flow computation (PWC-Net) was also explored as a feature-based method for image registration.
In order to evaluate and compare the explored image registration methods, two approaches were used. One is based on the computation of the Fundamental matrix and the estimation of the distances between the corresponding features and the epipolar lines. The other approach is based on the estimation of the 3D motion that the endoscopic capsule undergoes. For that purpose, the Essential matrix is computed and the trajectory data acquired by the robot manipulator responsible for the movement of the capsule is used as ground truth.

A. HYBRID MULTI-SCALE ELASTIC MODEL WITH AN AFFINE PRE-REGISTRATION (MEIR/MPIR)
This registration procedure is formulated as a minimization problem requiring a multiple-scale description of the input frames that aims to reduce or eliminate possible local minima and to expedite the convergence of the method [32]. Since in a normal WCE exam the capsule is driven by peristalsis, the model assumes that the overall movement is a combination of the rigid movement of the capsule itself and the non-rigid deformation of the small intestine, which is an elastic and deformable organ.
This method relies on the grey-scale version of the WCE video frames and defines the relationship between a pair of images $(I_R, I_T)$, where $I_R$ is the reference, kept unchanged, $I_T$ is the template and $x = (x_1, x_2)$ is an arbitrary pixel in the domain $\Omega$. The aim is to find the geometric transformation $\varphi$ that minimizes the distance $D = D(I_R, I_T(\varphi))$, defined in (4), involving the space of square-integrable functions $L^2(\Omega)$, between the transformed template image $I_T(\varphi)$ and the reference image $I_R$ [32].
The normalized dissimilarity measure (NDM) between images $R$ and $T(\varphi)$ is defined in Equation (5).
The multi-scale approach refers to a multi-scale representation of the data, that is, of the reference $R$ and template $T$ images. $R_{\theta_i}$ and $T_{\theta_i}$ represent the interpolated reference and template images respectively, obtained with spline interpolation, for a pre-defined increasing sequence of scales, denoted by $\theta_i$, with $i = 0, 1, \ldots, n$. At a coarse scale, only the most noticeable features in both images are preserved, while small details become more visible at finer scales.
Then, the multi-scale image registration (MEIR) defined in [32] consists of an affine pre-registration at the initial and coarse scale $\theta_0$, defined by Equation (6), followed by a sequence of elastic image registration steps at subsequent and increasingly finer scales $\theta_i$, for $i = 1, \ldots, n$. To speed up the total optimization process and prevent possible local minima, the solution at scale $\theta_{i-1}$ is used as the starting point for the elastic registration at the finer scale $\theta_i$, and the unknown transformation $\varphi$ is split into an identity part $I_{3 \times 3}$ and a deformation part $u$ as shown in (7).
$S(u)$ is formulated as a function of the Lamé constants $\lambda$ and $\mu$, which characterize the elastic properties of the tissue. The regularization parameter $\alpha$ balances the impact of the similarity term and of the elastic regularization term $S(u)$, defined in (8), in the final cost function. The regularization term makes the optimization problem well-posed and restricts the solution to a linear elastic transformation.
A multi-scale affine image registration approach (MPIR) is also defined and can be thought of as a particular case of the MEIR approach with α = 0 [32].

B. SCALE-INVARIANT FEATURE TRANSFORM (SIFT)
The SIFT algorithm, described in Figure 3.1), implements a cascade filtering procedure for the identification of stable points in the scale space [27]. Each keypoint descriptor is extracted from a set of reference frames, stored in a database and compared with a new input frame; the points that minimize the Euclidean distance between feature vectors are then selected. The final subset of correspondences is assigned based on position, scale, and orientation [33]. Keypoint descriptors are created from local geometric deformations represented by blurred difference of Gaussians (DoG) image gradients in various orientation planes at multiple scales, by determining both the magnitude and the orientation of the gradient around each position. Although SIFT is quite slow and not as effective for low-powered devices [36], its features are partially invariant to illumination and distortion, are resistant to image noise and remain invariant to scaling, rotation and translation [27], [28].
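The matching step, which selects the points that minimize the Euclidean distance between feature vectors, can be sketched as a nearest-neighbour search. This is an illustrative sketch only: the three-dimensional descriptors are hypothetical (real SIFT descriptors are much longer), and the rejection of ambiguous matches through a distance ratio `max_ratio` is an added assumption, not a step stated in the text.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_descriptors(desc_ref, desc_new, max_ratio=0.8):
    """For each new descriptor, find the closest reference descriptor.

    A match is discarded when the best distance is not clearly smaller than
    the second-best one (assumed ratio test, threshold max_ratio).
    Returns a list of (index_in_new, index_in_ref) pairs."""
    matches = []
    for i, d in enumerate(desc_new):
        dists = sorted((euclidean(d, r), j) for j, r in enumerate(desc_ref))
        best_dist, best_j = dists[0]
        if len(dists) > 1 and best_dist > max_ratio * dists[1][0]:
            continue  # ambiguous: two reference descriptors are about equally close
        matches.append((i, best_j))
    return matches

# Hypothetical descriptors from a reference frame and a new frame.
desc_ref = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
desc_new = [[0.9, 0.1, 0.0],   # close to reference 1
            [0.5, 0.5, 0.0],   # equally close to references 1 and 2
            [0.0, 0.05, 1.0]]  # close to reference 0

matches = match_descriptors(desc_ref, desc_new)
print(matches)
```

In this example the second descriptor is rejected because it lies at the same distance from two reference descriptors, which is the kind of ambiguity that tends to occur in poorly textured regions.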

C. SPEEDED UP ROBUST FEATURES (SURF)
The SURF algorithm, described in Figure 3.2), presents a fast point-extraction and description scheme that has been shown to be highly robust to changing lighting conditions [29]. Although SURF is not very stable to rotation [36] and provides fewer keypoints than SIFT, it is faster and more robust against different image transformations [29], [30].

D. MAXIMALLY STABLE EXTREMAL REGIONS (MSER)
Image sections that remain nearly unchanged across an extensive range of thresholds are designated Maximally Stable Extremal Regions (MSERs). The MSER algorithm, described in Figure 3.3), achieves correspondences between frames from different viewpoints based on the extremal regions obtained with a local binarization technique using predefined threshold values. These features are popular for fast blob detection and their description is rotation-invariant, given that the information exploited is local [34]. MSER also has a limited performance on blurred and/or textured images, since blur can distort the shapes of the extracted MSERs [35].

E. LOCAL INTENSITY ORDER PATTERN (LIOP)
The Local Intensity Order Pattern (LIOP) algorithm, described in Figure 3.4), uses intensity order instead of raw intensities and exploits the fact that the relative order of pixel intensity is unchanged with monotonic variations. The feature descriptors described so far are sufficiently robust to multiple lighting and distortion variations but fail to produce the best results in a few particular cases with more complex lighting changes such as specular reflections and exposure time variations, which are very common in endoscopic datasets [31].
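The property that LIOP exploits, namely that the relative order of pixel intensities is unchanged by monotonic variations, can be illustrated in a few lines. This is a deliberately simplified sketch and not the full LIOP descriptor: the sample values and the two intensity transformations are hypothetical.

```python
def intensity_order(values):
    """Indices of the samples sorted by increasing intensity."""
    return sorted(range(len(values)), key=lambda i: values[i])

# Hypothetical intensities sampled around a keypoint.
patch = [12, 40, 7, 33, 25]

# Two monotonically increasing intensity changes: gain plus offset, and a gamma curve.
brighter = [2 * v + 30 for v in patch]
gamma = [v ** 0.5 for v in patch]

print(intensity_order(patch))
print(intensity_order(brighter) == intensity_order(patch))
print(intensity_order(gamma) == intensity_order(patch))
```

The raw values differ in all three cases, yet the ordering is identical. Non-monotonic changes such as saturated specular highlights would break this invariance.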

F. DEEP-LEARNING FOR OPTICAL FLOW USING PYRAMID, WARPING, AND COST VOLUME (PWC-NET)
PWC-Net is a compact CNN model for optical flow estimation designed according to pyramidal processing, warping and cost volume. The combination of deep learning and domain knowledge reduces model size and improves performance [37].
Firstly, since raw images are prone to variations in lighting conditions, this method uses learnable feature pyramids. In this architecture, given two images $I_1$ and $I_2$, a pyramid of feature representations with $L$ levels is constructed. The bottom level corresponds to the features of the input images and each upper level $l$ to the $l$th downsampled representation of the features. Secondly, the warping operation from traditional approaches is incorporated as a layer in the network to estimate large motion. Then, there is another layer to construct the cost volume, which is a more appropriate representation for optical flow estimation. These two layers for warping and cost volume have no learnable parameters, in order to reduce model size. The optical flow estimator is a multi-layer CNN that takes as input the cost volume, the features from the first image and the upsampled optical flow, as shown in Figure 2. The output is the optical flow estimate at the $l$th level [37]. Finally, as a post-processing step, PWC-Net uses a context network to exploit contextual information for optical flow refinement.
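The cost volume layer can be illustrated with a small example. The sketch below assumes that the matching cost is the channel-averaged correlation between a feature vector of the first image and the feature vectors of the (already warped) second image within a limited search range; the tiny one-hot feature maps and all names are hypothetical, and the code is not the TensorFlow implementation used in this work.

```python
def cost_volume(feat1, feat2_warped, search_range):
    """Correlation between features of image 1 and warped features of image 2.

    feat[y][x] is a list of channel values. Returns a dict mapping each pixel
    (x, y) to a dict {(dx, dy): correlation}; out-of-image displacements get 0."""
    h, w = len(feat1), len(feat1[0])
    n_channels = len(feat1[0][0])
    volume = {}
    for y in range(h):
        for x in range(w):
            costs = {}
            for dy in range(-search_range, search_range + 1):
                for dx in range(-search_range, search_range + 1):
                    y2, x2 = y + dy, x + dx
                    if 0 <= y2 < h and 0 <= x2 < w:
                        dot = sum(a * b for a, b in
                                  zip(feat1[y][x], feat2_warped[y2][x2]))
                        costs[(dx, dy)] = dot / n_channels
                    else:
                        costs[(dx, dy)] = 0.0
            volume[(x, y)] = costs
    return volume

def one_hot(index, length):
    return [1.0 if i == index else 0.0 for i in range(length)]

# Hypothetical 2 x 3 feature map with a distinct feature vector per pixel.
h, w = 2, 3
feat1 = [[one_hot(y * w + x, h * w) for x in range(w)] for y in range(h)]
# Second feature map: the first one moved one pixel to the right.
feat2 = [[feat1[y][x - 1] if x >= 1 else [0.0] * (h * w) for x in range(w)]
         for y in range(h)]

volume = cost_volume(feat1, feat2, search_range=1)
costs_at_origin = volume[(0, 0)]
print(max(costs_at_origin, key=costs_at_origin.get))
```

Because the layer only computes correlations, it has no learnable parameters, and a small `search_range` keeps the number of candidate displacements per pixel low.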
Other network architectures have also been designed using principles both from the stereo and optical flow information. These use image pyramids or three-level feature pyramids, while PWC-Net learns deeper feature pyramids to achieve better performance. Other architectures also warp the input images instead of the features, which hinders the information propagation. Thus, the PWC-Net is able to construct a multiresolution cost volume and uses a low search range to reduce the computation [38].
Using the TensorFlow-based implementation tutorial of PWC-Net [37], available at [51] and selecting a pre-trained model (pwcnet-lg-6-2-multisteps-chairsthingsmix), an optical flow estimation was conducted for all pairs of sequential frames. The optical flow angle and magnitude values obtained with PWC-Net are stored in the RGB images presented in Figure 5.

G. ESTIMATION OF POINT MATCHES
For the MEIR and MPIR methods, for each pair of frames $k$ and $k+1$, a subset of points $P_k$ and the computed parameters (scale $s$, rotation angle $\theta$, and translation components $t_x$ and $t_y$) were used to determine the rotation matrix $R(\theta)$, the translation vector $T(t_x, t_y)$ and the respective positions $P_{k+1}$ of the transformed points $P_k$ in frame $k+1$.
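The mapping of points from frame $k$ to frame $k+1$ with these parameters can be sketched as follows. The composition order (scale and rotation about the image origin, followed by translation) is an assumption made for illustration, as are the numerical values; the actual parameterization is the one defined in [32].

```python
import math

def transform_points(points, scale, theta, tx, ty):
    """Apply P' = scale * R(theta) * P + T to a list of (x, y) points.

    Assumed composition: rotation and scaling about the origin, then translation."""
    c, s = math.cos(theta), math.sin(theta)
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty) for x, y in points]

# Hypothetical points in frame k and registration parameters.
points_k = [(1.0, 0.0), (0.0, 1.0)]
points_k1 = transform_points(points_k, scale=2.0, theta=math.pi / 2,
                             tx=10.0, ty=-5.0)
print(points_k1)
```

Each pair (`points_k[i]`, `points_k1[i]`) then acts as a point match for the subsequent estimation of the fundamental and essential matrices.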
Regarding PWC-Net, for each pair of consecutive frames, given the subset of points $P_k$ from the initial frame and the angle and magnitude obtained for each pixel from the optical flow estimation, the coordinates of the corresponding points $P_{k+1}$ in the consecutive frame were computed in order to be used as matching points.
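The conversion from a per-pixel flow angle and magnitude to point matches can be sketched as below. The angle convention (radians, measured from the positive x axis of the image) and the small flow fields are assumptions for illustration; the convention of the actual PWC-Net output should be checked before reuse.

```python
import math

def points_from_flow(points, angle, magnitude):
    """Displace integer pixel positions (x, y) by the optical flow stored as
    angle[y][x] (radians, assumed from the +x axis) and magnitude[y][x] (pixels).
    Returns a list of ((x, y), (x_next, y_next)) matches."""
    matches = []
    for x, y in points:
        a, m = angle[y][x], magnitude[y][x]
        matches.append(((x, y), (x + m * math.cos(a), y + m * math.sin(a))))
    return matches

# Hypothetical 2 x 2 flow fields.
angle = [[0.0, math.pi / 2],
         [math.pi, 0.0]]
magnitude = [[2.0, 1.0],
             [1.0, 0.0]]

flow_matches = points_from_flow([(0, 0), (1, 0), (0, 1)], angle, magnitude)
for p_k, p_k1 in flow_matches:
    print(p_k, p_k1)
```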
For the feature-based registration methods described above (SIFT, SURF, MSER and LIOP algorithms), the sets of features extracted from each pair of sequential frames $k$ and $k+1$ were used to compute the $P_k$ and $P_{k+1}$ point matches.
In addition to the correspondences determined with the registration methods, a preliminary manual annotation was also used in a subset of video frames to provide an additional reference benchmark to be used in the computation of the distances to the epipolar lines. This manual annotation is also useful to visually compare the quality of the matching points obtained from each approach. The 15 pairs of consecutive WCE images with the highest number of matches with non-zero displacement were chosen and manually annotated. All sets of point matches $P_k$ and $P_{k+1}$ from each registration procedure were corrected for lens distortion with the calibrated camera parameters.
All registration methods explored for the search of robust correspondences were implemented in Matlab 2019a, except for the pre-trained PWC-Net, which was explored with Python 3.6 and a TensorFlow implementation. All registration results obtained were then compared through the computation of the fundamental and essential matrices, also in Matlab 2019a. All tests were performed on a computer with a 3.4 GHz Intel Core i7 processor and 16 GB of RAM.

H. FUNDAMENTAL MATRIX ESTIMATION
The set of initial points $P_k$ from frame $k$ and points $P_{k+1}$ from frame $k+1$, obtained with each image registration method, are used to estimate the fundamental matrix $F$. For that purpose, the normalized eight-point algorithm [24] was used. Given the epipolar lines in both frames, computed using Equation (10) and its dual (for the backward correspondence), the distances between the matched points and the corresponding epipolar lines can be computed using Equations (11) and (12).

$$l_{k+1} = F P_k \qquad (10)$$

with $l_{k+1} = [a_{k+1} \;\; b_{k+1} \;\; c_{k+1}]$ defining an epipolar line. If $P_{k+1} = [x_{k+1} \;\; y_{k+1} \;\; 1]$, then

$$a_{k+1} x_{k+1} + b_{k+1} y_{k+1} + c_{k+1} = 0 \qquad (11)$$

which means that each point should belong to its corresponding epipolar line. Ideally, for all $i = 1, \ldots, n$, the absolute epipolar distances $d^{F}_{i}$ between each point $P_{k+1}$ and each epipolar line $l_{k+1}$ obtained with $F$, given by Equation (12), should be equal to zero.

$$d^{F}_{i} = \frac{\left| a_{k+1} x_{k+1} + b_{k+1} y_{k+1} + c_{k+1} \right|}{\sqrt{a_{k+1}^{2} + b_{k+1}^{2}}} \qquad (12)$$
The distances $d^{F^{T}}_{i}$ in the opposite direction, from frame $k+1$ to frame $k$, can be obtained with the same procedure but using the transpose of the fundamental matrix, $F^{T}$, instead of $F$, and the corresponding epipolar lines $l_k = [a_k \;\; b_k \;\; c_k]$. Consequently, the root mean squared distances associated with each pair of frames $k$ and $k+1$ would ideally also be equal to zero. In order to determine the registration error between frames $k$ and $k+1$ in both directions, the residual error is computed as suggested in [24] and presented in Equation (13).
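The epipolar distances in both directions can be computed as in the sketch below. The residual shown is one commonly used symmetric form (mean of the squared forward and backward distances) and is an assumption that may differ in detail from Equation (13); the fundamental matrix and the point matches are hypothetical and correspond to a purely horizontal camera translation.

```python
import math

def epipolar_line(F, p):
    """Epipolar line l = F p for a homogeneous point p = (x, y, 1)."""
    return [sum(F[i][k] * p[k] for k in range(3)) for i in range(3)]

def point_line_distance(p, line):
    """Absolute distance between point (x, y, 1) and line (a, b, c)."""
    a, b, c = line
    return abs(a * p[0] + b * p[1] + c) / math.hypot(a, b)

def transpose(F):
    return [list(row) for row in zip(*F)]

def symmetric_residual(F, pts_k, pts_k1):
    """Mean over all matches of the squared forward and backward epipolar
    distances (assumed symmetric form of the residual error)."""
    Ft = transpose(F)
    total = 0.0
    for p, q in zip(pts_k, pts_k1):
        d_forward = point_line_distance(q, epipolar_line(F, p))
        d_backward = point_line_distance(p, epipolar_line(Ft, q))
        total += d_forward ** 2 + d_backward ** 2
    return total / len(pts_k)

# Hypothetical fundamental matrix for a pure translation along x:
# epipolar lines are horizontal, so matched points share the same y.
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]
pts_k = [(10.0, 20.0, 1.0), (30.0, 5.0, 1.0)]
pts_exact = [(14.0, 20.0, 1.0), (33.0, 5.0, 1.0)]
pts_noisy = [(14.0, 21.0, 1.0), (33.0, 5.0, 1.0)]  # first match is 1 pixel off

residual_exact = symmetric_residual(F, pts_k, pts_exact)
residual_noisy = symmetric_residual(F, pts_k, pts_noisy)
print(residual_exact, residual_noisy)
```

Perfect matches give a zero residual, while a one-pixel vertical error in a single match already raises it, which makes this quantity a convenient way to rank registration methods without ground truth.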

I. ESSENTIAL MATRIX ESTIMATION
The estimation of the essential matrices $E$ allows for the recovery of the 3D rotation matrix $R$ and the translation vector $T$ (up to a scale factor) throughout the trajectory [24]. The matrices were estimated using the M-estimator sample consensus (MSAC) algorithm [50] with bundle adjustment, for the sets of point matches $P_k$ and $P_{k+1}$ from all 15 pairs of frames, obtained with the image registration methods described (and the calibrated camera parameters from Table 1). The orientation and location of the calibrated camera relative to its previous pose were also obtained. For each pair of registered frames $k$ and $k+1$, the relative rotation matrices $R_{k,k+1}$ and the relative translation vectors $T_{k,k+1}$ were obtained. These were compared against the relative rotation matrices $R^{robot}_{k,k+1}$ and the relative translation vectors $T^{robot}_{k,k+1}$ obtained from the robot data (ground truth).

1) ROTATION MATRIX ERROR
Firstly, for the evaluation of the estimated rotation matrices, a rotation error matrix $R^{err}_{k,k+1}$ for each pair of frames $k$, $k+1$ was computed as shown in Equation (14).
This matrix is still a rotation matrix and can be represented using the axis-angle representation, which parameterizes the rotation in a 3D Euclidean space by a vector corresponding to the axis of rotation and an angle of rotation [24]. For all pairs of consecutive frames, the rotation error matrices $R^{err}$ obtained for each registration method (MEIR, MPIR, SIFT, SURF, MSER, LIOP and PWC-Net) were converted into the vector of the axis of rotation $v_{R_{k,k+1}}$ and the angle of rotation $\theta_{R_{k,k+1}}$. The weighted average of each $v_{R_{k,k+1}}$ over all pairs of consecutive frames was estimated for each registration method and expressed in polar coordinates as described in [40] and represented in Equation (15), where $x_0$ is a unit vector and $R \geq 0$, so that $R = \lVert x \rVert$ and $x_0 = \lVert x \rVert^{-1} x$.
The vector $x_0$ is called the Mean Direction Axis (MDA) and $R$ is called the Mean Resultant Length (MRL) [40]. The MDA $(e_x, e_y, e_z)$ for each registration method was estimated along with the standard mean deviation $(\sigma_{e_x}, \sigma_{e_y}, \sigma_{e_z})$ for all its components. The mean error angle $\theta_R$ was also obtained by computing the average of each $\theta_{R_{k,k+1}}$ obtained with each pair of consecutive frames, along with the corresponding standard mean deviation $\sigma_{\theta_R}$.
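The conversion of a rotation error matrix into axis-angle form and the computation of the MDA and MRL can be sketched as follows. Several details are assumptions: the weights of the average default to uniform because the weighting scheme is not specified here, the axis extraction is only valid for angles strictly between 0 and π, and the example rotation and axes are hypothetical.

```python
import math

def axis_angle(R):
    """Convert a 3x3 rotation matrix into (unit axis, angle in radians).

    Valid for 0 < angle < pi; otherwise the axis is returned as zeros."""
    cos_a = (R[0][0] + R[1][1] + R[2][2] - 1.0) / 2.0
    angle = math.acos(max(-1.0, min(1.0, cos_a)))
    v = [R[2][1] - R[1][2], R[0][2] - R[2][0], R[1][0] - R[0][1]]
    norm = math.sqrt(sum(c * c for c in v))
    if norm < 1e-12:
        return [0.0, 0.0, 0.0], angle  # axis undefined for angle 0 or pi
    return [c / norm for c in v], angle

def mean_direction(axes, weights=None):
    """Weighted mean x of unit vectors, returned as (x0, R) with
    R = ||x|| (mean resultant length) and x0 = x / ||x|| (mean direction)."""
    if weights is None:
        weights = [1.0] * len(axes)  # assumption: uniform weights
    w_sum = sum(weights)
    x = [sum(w * a[i] for w, a in zip(weights, axes)) / w_sum for i in range(3)]
    mrl = math.sqrt(sum(c * c for c in x))
    mda = [c / mrl for c in x] if mrl > 0.0 else [0.0, 0.0, 0.0]
    return mda, mrl

# Hypothetical rotation error: 0.1 rad about the z axis.
c, s = math.cos(0.1), math.sin(0.1)
R_err = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
axis, angle = axis_angle(R_err)

# Hypothetical error axes from two frame pairs.
mda, mrl = mean_direction([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(axis, angle, mda, mrl)
```

An MRL close to 1 indicates that the error axes of the individual frame pairs point in nearly the same direction, whereas a value close to 0 indicates widely scattered axes.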

2) TRANSLATION VECTOR ERROR
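For the comparison of the translation vectors, the cosine of the angle between the estimated translation vector $T_{k,k+1}$ and the translation vector registered by the robotic arm $T^{robot}_{k,k+1}$ was estimated for each pair of frames $k$, $k+1$ as shown in Equation (16).

$$\cos \theta_{T_{k,k+1}} = \frac{T_{k,k+1} \cdot T^{robot}_{k,k+1}}{\lVert T_{k,k+1} \rVert \, \lVert T^{robot}_{k,k+1} \rVert} \qquad (16)$$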
Ideally, the values of the cosine should be equal to 1 and the corresponding angle should be equal to zero. The translation angles $\theta_{T_{k,k+1}}$ between the two translation vectors were extracted from the cosine values. Finally, the weighted average of each $\theta_{T_{k,k+1}}$ along with the corresponding standard mean deviation $\sigma_{\theta_T}$ were computed.
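The angle between the estimated and the ground-truth translation vectors can be computed as in the sketch below; the vectors are hypothetical. Since the estimated translation is only known up to a scale factor, the comparison uses direction only.

```python
import math

def translation_angle(t_est, t_robot):
    """Cosine of the angle, and the angle in radians, between two 3D vectors."""
    dot = sum(a * b for a, b in zip(t_est, t_robot))
    norms = (math.sqrt(sum(a * a for a in t_est))
             * math.sqrt(sum(b * b for b in t_robot)))
    cos_t = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return cos_t, math.acos(cos_t)

# Same direction, different scale: the ideal case.
cos_same, angle_same = translation_angle([2.0, 0.0, 0.0], [1.0, 0.0, 0.0])
# Perpendicular directions: a large translation error.
cos_perp, angle_perp = translation_angle([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(cos_same, angle_same, cos_perp, angle_perp)
```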

V. RESULTS
In Figure 5, a sample of the images from the experimental dataset is shown regarding the machine-learning PWC-Net strategy for the registration step. The input is made up of a subset of pairs of consecutive frames and the output consists of the components of the optical flow vectors, stored in the RGB channels. A few selected point matches obtained with each registration method (MEIR, MPIR, SIFT, SURF, MSER, and LIOP) in the first pair of consecutive frames from the experimental dataset are compiled in Figure 6, along with the manually annotated matches for visual comparison. The residual errors obtained with the use of the fundamental matrix for each registration method for all 15 pairs of consecutive frames are shown in Figure 7. The mean axis of rotation, the mean resultant length and the mean angle of the rotation error matrix are presented in Table 2, along with the mean angle between translation vectors and the corresponding standard deviations for each registration procedure.

VI. DISCUSSION
The main contributions of this paper are the development of the experimental setup and the approach for estimating the capsule's relative motion given only the common information provided by sequential images. This experimental setup differs from a real exam because, when the camera moves with the peristaltic movements of the GI tract, it is not possible to obtain a ground truth for the position of the camera along the path. Since our goal is to determine a methodology that efficiently reconstructs the trajectory of the capsule at each instant of the route, it was necessary to develop a setup where this ground truth was accessible. In this case, the camera is moved by a robotic arm that registers the position of the camera at all times. The objective is to extract the rotation and translation of the camera between each pair of frames using only the information from the images.
The matches obtained with LIOP, shown in Figure 6 (panel 7), are affected by significant errors. The residual errors obtained are high and inconsistent and the number of matches between frames is quite low, which is why in some image pairs it was not even possible to compute the essential matrix for the estimation of the rotation and translation errors. The results obtained with MEIR and MPIR are visually more plausible than those obtained with SIFT, SURF, and MSER. In the example shown in Figure 6 (panels 4, 5 and 6), it can be seen that the latter methods do not produce suitable matches. Additionally, SURF does not produce enough matches to compute the fundamental matrix and, consequently, the residual error is significant in all pairs of frames. The MDA, MRL and $\theta^{R}$ values for SURF, along with LIOP, are also high. Most likely, the results obtained with these methods correspond to non-moving features. This can be concluded by comparing the point correspondences obtained with these methods with the manually annotated points in Figure 6.
On the other hand, the fundamental matrices estimated using MEIR and MPIR were calculated with a small sample of matches, unlike SIFT, SURF and MSER, which yielded random and more extensive point matches. For the purpose of image comparison, only the points with manually annotated matches were displayed in Figure 6. It is reasonable to assume that the residual errors, MDA, MRL and error angles would decrease if the estimates of the fundamental and essential matrices were obtained with a larger set of matches. MEIR, which models deformation/elasticity, might be expected to yield better results with the porcine colon images, but that is not the case: MEIR and MPIR perform similarly with this dataset. Our experimental setup is different from real-case WCE videos since, in our case, the colon is fixed and the capsule does not move as a result of the peristalsis of the small intestine. The capsule's movement is guided by a robotic arm, so it is plausible that the method that accounts for elasticity does not yield better results than the one that disregards it.
MEIR and MPIR perform relatively well, considering the visual comparison of the matches in Figure 6 (panels 2 and 3). The pairs are closer to the annotated ones and the values of their residual errors are similar. The residual errors obtained with both MEIR and MPIR are lower and more consistent throughout the successive frames than the ones obtained with the feature-based methods. Additionally, the MDA, MRL and error angle results for both methods are also similar. The results obtained with the PWC-Net are the best in both the visual analysis regarding the accuracy of the matches and the quantitative analysis regarding the values of the residual errors, MDA, MRL, $\theta^{R}$ and $\theta^{T}$.

VII. CONCLUSION
One of the main challenges of WCE technology is the identification of the location and the trajectory of the capsule as it moves through the GI tract, which is especially relevant for the detection of anomalies. This paper explores a few image registration approaches between sequential frames as the first step in the process of overcoming this difficulty.
In this article, through comparison with ground truth information provided by the robot, it is shown that the relative movement of the capsule between frames can be determined, up to a scale factor, through the computation of the essential matrix, provided that the image registration method produces enough good matches between sequential frames of this difficult colon dataset. The handling of this dataset was a challenge due to the intrinsic difficulty of achieving suitable point matches with any registration method. Using SIFT, SURF, MSER, and LIOP, in some cases, it was very difficult to extract enough features and matching points for the estimation of the fundamental and essential matrices. The results obtained with this evaluation allow the following conclusions: (1) the most common feature matching approaches used in computer vision are not adequate for these datasets; (2) MPIR and MEIR perform similarly, since in this experimental procedure the capsule is moved by the robotic arm and there is no need to account for elasticity; (3) the best estimates of the capsule trajectory were obtained using the PWC-Net for the image registration of consecutive frames, which yielded the smallest residual errors, MDA, MRL and error angles.
In future work, this procedure can be applied to conventional datasets, where the camera is moved by peristalsis and no ground truth information is available, in order to reconstruct the full trajectory of the capsule. Additionally, further experiments will be performed with datasets obtained with a variety of endoscopic capsules from different manufacturers.
MARINA OLIVEIRA received the M.Sc. degree in biomedical engineering from the Physics Department, in 2018. She is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, University of Coimbra. Her research interests include 3-D computer vision, visual odometry, localization and mapping techniques, 3-D reconstruction, capsule endoscopy, and medical image analysis.
HELDER ARAUJO (Member, IEEE) is currently a Professor with the Department of Electrical and Computer Engineering, University of Coimbra. In the last few years, he has been working on non-central camera models, including aspects related to pose estimation and their applications. He has also developed work in active vision and on control of active vision systems. Recently, he has started work on the development of vision systems applied to medical endoscopy. His research interests include computer vision applied to robotics, robot navigation, and visual servoing.
LUÍS PINTO received the B.S. and M.S. degrees in applied mathematics and the Ph.D. degree in mathematics from the University of Coimbra, Coimbra, Portugal. He studied numerical methods for solving integro-partial differential equations at the University of Coimbra, where he has been with the Center for Mathematics (CMUC), since 2014. He has been involved in several research projects in both image processing and numerical methods for partial differential equations.
EVA CURTO received the M.Sc. degree from the University of Coimbra, in 2018, where she is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering. Her research interests include 3-D computer vision, including medical endoscopy, pose estimation, and 3-D reconstruction of non-rigidly deforming objects.
LUÍS PERDIGOTO received the Ph.D. degree in electrical and computer engineering from the University of Coimbra, Portugal, in 2015. He is currently an Assistant Professor with the Department of Electrical Engineering, Polytechnic Institute of Leiria, Portugal. He is also a Researcher with the Institute for Systems and Robotics-Coimbra. His research interests include computer vision, robotics, and automation. VOLUME 9, 2021