Learning-Based Visual-Strain Fusion for Eye-in-Hand Continuum Robot Pose Estimation and Control

Image processing has significantly extended the practical value of the eye-in-hand camera, enabling and promoting its applications for quantitative measurement. However, fully vision-based pose estimation methods sometimes encounter difficulties in handling cases with deficient features. In this article, we fuse visual information with the sparse strain data collected from a single-core fiber inscribed with fiber Bragg gratings (FBGs) to facilitate continuum robot pose estimation. An improved extreme learning machine algorithm with selective training data updates is implemented to establish and refine the FBG-empowered (F-emp) pose estimator online. The integration of F-emp pose estimation can improve sensing robustness by reducing the number of times that visual tracking is lost given moving visual obstacles and varying lighting. In particular, this integration solves pose estimation failures under full occlusion of the tracked features or complete darkness. Utilizing the fused pose feedback, a hybrid controller incorporating kinematics and data-driven algorithms is proposed to accomplish fast convergence with high accuracy. The online-learning error compensator can improve the target tracking performance with a 52.3%–90.1% error reduction compared with constant-curvature model-based control, without requiring fine model-parameter tuning and prior data acquisition.


I. INTRODUCTION
R ECENT advances in computer vision enable the detection of the robot configuration in unstructured environments [1], [2], similar to human visual perception that allows us to interpret body movement relative to our surroundings. In computer vision, camera pose estimation is a fundamental problem that has been widely studied in the areas of structure from motion (SfM), visual odometry [3], simultaneous localization and mapping (SLAM), and even commonly applied in augmented reality [4] as well as autonomous navigation [5]. Pose estimation by means of temporally coherent features in a sequence of twodimensional/three-dimensional (2-D/3-D) images [6] can avoid the complicated integration of additional positional sensors.
However, feature-based estimation using cameras is inherently subject to the image quality, which is inevitably affected by unstable light exposure, vision occlusion, and rapid viewpoint changes [3]. This weakness is made more apparent with cameras used in the eye-in-hand configuration, where the camera (i.e., the eye) is fixed on the robot end-effector (the hand) to see its surroundings. Although the eye-in-hand approach is intuitive and provides active visual perception, it requires effective end-effector movement for pose detection [7], [8] and greatly demands for consistent robot motion patterns. In addition, as the camera usually points closer toward the objects of interest [9], the effect of local lighting variations and specular reflection will be dominant in the camera view. To compensate for the pose error induced by the lack of high-quality image features, fusing computer vision data with other sensing feedback has become a promising option.
The most prevalent type of fusion approach is to integrate cameras with inertial measurement units (IMUs) [4], [10], that is, the visual-inertial system (VINS), which has been generally developed for rigid-link robots. Detected acceleration and angular velocity could be utilized by employing statistical filtering techniques, such as extended Kalman filters [4] or learningbased fusion methods, e.g., long short-term memory [11] and convolutional neural networks [12]. Although the extrinsic calibration and accumulated drift in VINS were widely discussed [13], [14], [15], residual and nonquasi-static vibrations in soft robots would induce increased or accumulated positional errors much more than their rigid counterparts. Note that the IMUs accuracy and reliability are also bounded by its limited acceleration/velocity sensing range. Mechanical integration of IMUs would also require tailor-made or compact packing with the camera at the soft robot's tip, whereas rigid robots have the freedom to fix the IMUs anywhere along their links. To this end, there remains a demand for alternative sensors that can directly measure the pose of soft robots.
Soft robots usually involve relatively large deformation of which the strain changes on its body surface would give strong cues to estimate its configuration. Real-time strain sensing achieved with fiber Bragg grating (FBG) optical fiber is a potential candidate that can utilize these strain changes for feedback [16], [17], [18]. FBG sensors provide several advantages over electronic strain sensors, including the capability for dense strain measurements with a single connection and insusceptibility to water submersion and electromagnetic (EM) fields. As a result, FBGs have been investigated in thin surgical tools, such as biopsy needles [19] or even in magnetic resonance imaging environments [20], [21]. Continuous-grating multicore fiber with optical frequency-domain reflectometry (OFDR) interrogation is one form of FBG sensing that is capable of stand-alone 3-D curvature sensing with a single fiber and is typically integrated in manipulators or instruments with strict diameter requirements [22], [23]. For pose estimation of fluid-driven soft robots, single-core optic FBGs using the common wavelength division multiplexing method would be more appropriate considering its advantages of higher sensing sampling rate (100-3000 Hz) and significantly lower cost. When helically wound onto the robot surface [24], [25], the fiber can sensitively detect small deformations at high frequencies enabling reliable closed-loop robot control. Task space control of the soft robot using absolute FBG-detected strain would be more reliable than using IMU feedback, which needs to calculate the integral of relative acceleration/velocity.
The mapping from FBGs measurement to continuum robot configuration can be established using either analytical modeling [21] or machine learning [26] approaches. Sefati et al. [25] had compared their tip positional sensing accuracy of a planar bending continuum manipulator equipped with three parallel FBG fibers. The results demonstrated improved sensing performance using the data-driven method without prior information of the FBG allocation. In learning-based methods, positional markers need to be employed as the ground truth to complete the mapping. In our previous work [27], we also proposed a flexible surface sensing system in which only one single-core fiber inscribed with FBGs was embedded in a soft substrate. Offline learning was needed to "train" the mapping between FBG strains and the surface morphology detected by motion capture cameras.
Considering the small form factor of FBG fibers and their ease of integration with devices/instruments, researchers have also aimed to leverage them with various camera configurations.
In other previous work, we employed a single-core FBG fiber on a continuum robot to enhance the 2-D motion estimation and path tracking in the endoscopic camera view [28]. However, these types of 3-D shape and 2-D motion estimators need to be trained by additional positional sensors in advance, heavily relying on prior data exploration and accurate ground truth data. Alambeigi et al. [29] also proposed a sensor fusion technique to address the shape/position estimation of continuum robots. As an illustration, the intermittent external information provided by an eye-to-hand camera calibrated the continuous imperfect FBG feedback to achieve the accurate 2-D positional sensing in obstructed environments. This work is one example of the few visual-strain fusion combinations for positional sensing, with even fewer examples using cameras integrated into the robot tip for eye-in-hand feedback.
Therefore, our concern in this article is to utilize a selfcontained camera to serve as the pose ground truth in ordinary cases, while the online initialized and updated FBG sensor can be fused to settle estimation error caused by poor-quality images. No external sensors would be applied to the algorithm since we would like to simplify the employed devices. A widely adopted sensor may be used in the test but just to prove the accuracy of camera-based pose estimation as the training ground truth. The sensing dimension is also extended from 3-D position/shape to 6-D pose, offering more flexibility in robot applications, such as spatial image stitching.
With real-time learning-based sensing feedback available, the design of the continuum robot controller can also leverage the advantages of data-driven refinement [30], [31], [32]. Our previous work had utilized online-learning locally weighted projection regression and Gaussian process regression (GPR) for orientation control [33] and visual servoing control [34], respectively. Such pure data-driven control has encouraging potential in soft robots but is subject to time-consuming data-exploration procedures. Although analytical kinematics modeling encounters challenges in parameter characterization due to nonlinear fluid and elastomer's dynamics, the convergence of analytical solutions can usually be guaranteed as it is calculated from the inverse kinematics mapping. The combination of kinematics model-based and learning-based approaches could leverage both of their respective advantages. The constant-curvature (CC) assumption can be used to establish a rough kinematics model, saving the time for data collection. Once the robot starts manipulation, the online data exploration could be activated, with which an error compensator is learned/updated to reduce the positional error induced by the model-based control.
In this article, we aim at accurate eye-in-hand pose estimation of soft robots, which is realized by sensing fusion of an integrated single-core FBG fiber and a monocular camera. The fusion result can be further used as feedback for position control. The proposed framework is depicted in Fig. 1. The major contributions of this work are summarized as follows: 1) online-learning-based pose estimation using sparse strain measurement of single-core FBG fiber and sensing fusion with monocamera SLAM; 2) hybrid control combining model-based and data-driven methods for accurate position tracking using soft robots, Fig. 1. Robot control architecture. Hybrid controller combining kinematics model and data-driven-trained compensator is implemented, with the pose feedback obtained by sensing fusion of FBG strain measurement and visual feedback. Monocamera ORB-SLAM2 is used, which serves as the ground truth to initialize and update the ELM-based model using FBG strain data. The FBG-estimated portion in the sensing fusion will be more heavily weighted in visual feature-deficient scenarios.
without the need for precise parameter tuning and prior data collection; 3) Experimental validation of the proposed sensing fusion modality under poor visual conditions and validation of the robust hybrid controller via target tracking tasks. The rest of this article is organized as follows. Section II focuses on the pose estimation aspect, which acts as the sensing feedback in the robotic system. It briefly introduces the state-of-the-art camera-based pose estimation method Oriented FAST and Rotated BRIEF (ORB) SLAM2, after which our proposed online-training FBG-empowered (F-emp) pose estimator by extreme learning machine (ELM) and the visual-strain fusion scheme are explained in detail. Section III presents the hybrid kinematics controller comprising the CC-based model and GPR-based error compensator. Experimental validation for both sensing and control is summarized in Section IV, including comparisons between fusion-based and camera-based pose estimations in handling visual obstacles, as well as comparisons between controllers using only a CC model versus combining the model with an error compensator. In addition, the overall performance of path following by integrating our proposed visual-strain fusion sensing and hybrid controller is demonstrated. The effect of physical contacts on our system is also investigated. Finally, Section V concludes this article.

II. POSE ESTIMATION OF SOFT MANIPULATOR
To improve the eye-in-hand pose estimation stability by integrating a single-core FBG fiber, the FBGs can be evenly distributed on the robot body. In our case, the fiber is helically wrapped on the cylindrical surface, thus reflecting the robot's overall deformation [see Fig. 2(a)] via wavelength shifts [see Fig. 2 We hypothesize that the camera-based estimation in featureabundant scenarios is the primary choice of sensing information but may suffer from inadvertent poor image quality. For such circumstances, the F-emp pose estimation model, trained by accurate camera-based estimations, could act as a stable backup and guarantee the operation of the entire framework.

A. Task Space Definition
The eye-in-hand camera is fixed on the robot end-effector, therefore sharing the same pose with the robot tip. There have been various approaches utilizing visual feedback to analyze the camera pose. We take the initial position of the end-effector without robot actuation as the origin of a global (measuring) coordinate system (green coordinate frame in Fig. 2), and the robot central axis as the z-axis with downward as the positive direction. In the SLAM algorithm, the initial pose is taken by default as the origin of its measuring frame. The task space related to the end-effector position is defined in the 3-D global frame. The camera pose estimated by SLAM at time step k is defined as z c = p c (k) q c (k) , including the position p c (k) ∈ R 3 and quaternion-represented orientation q c (k) ∈ R 4 . The actual pose becomes z = p(k) q(k) . Under stable and smooth movement, the SLAM estimation z c in feature-abundant camera views can be considered as the approximation of robot endeffector pose, i.e., z c ≈ z. It is worth noting that SLAM would not be the only option to provide z c ; the feedback from other pose measuring approaches will also be valid to train the FBG estimation model, such as EM tracking. Here, the reason to adopt the SLAM-based pose is to utilize the integrated camera without the need for any external sensing devices. The actuator input is represented as u(k) ∈ R m (at equilibrium state), where m denotes the dimension of actuation space. The control objective is to generate an actuation command Δu(k), achieving the desired movement Δp * (k) or Δq * (k). The single-core FBG fiber is helically wrapped along with the continuum robot. The multiplexing l units of FBGs inscribed are independent of each other, providing the corresponding l wavelength/strain measurement points. Wavelength shift vector λ(k) ∈ R l depicts the difference between wavelength vector at time step k and the original wavelength vector λ 0 corresponding to the initial robot configuration [see Fig. 1

B. Camera Pose Estimation Via ORB-SLAM2
ORB-SLAM2 has three common modules: tracking, local mapping, and loop closing [35]. The camera pose can be obtained at each input image frame by building a perspective-npoint model through the tracking thread. After ORB features in the image are extracted, pose estimation can be conducted by matching features in two consecutive frames and refined by minimizing the reprojection error with motion-only bundle adjustment optimization. This reprojection error, represented as e S here, is defined as the Euclidean distance between the image projection of 3-D map points and the corresponding observed feature in the image plane. That is to say, the precision of e S represents the matching accuracy of feature correspondences and then the quality of the pose estimation. The monocular camera was calibrated with OpenCV via robot operating system. It should be noticed that the pose estimation result using SLAM has a different measurement scale than that of common positional sensors. An affine transformation on the raw measurement would be needed to proceed with its usage in robot control. We utilized the EM tracker to calibrate the scaling parameter in this transformation in advance such that the online training of F-emp model and sensor fusion is both in the metric scale.

C. Learning-Based Pose Estimation Using FBGs
The consideration of optic fiber integration is that the wavelength shift/strain sequence of all FBGs should be mapped uniquely to reflect the end-effector pose but not altering the original soft robot mechanical properties (see Fig. 3). Details about the fiber placement can be found in our previous work [28]. Simply, the fiber was wound helically on the robot body (see Fig. 2). No rigorous requirements on the wrapping structure were set as long as the FBGs can be dispersedly distributed. Distances between adjacent turns could also vary without strict consistency. The details of the fiber used in our experiments can be found in Section IV-A. With an appropriate sampling rate of pose estimation provided by the SLAM algorithm (i.e., 20-50 Hz, related to the camera and computer performance), both the wavelength shift λ(k) and the pose estimation z c (k) (see Section II-B) at time step k could be obtained correspondingly. These sensing feedback pairs enable the establishment of a mapping relationship. A pose estimation model using FBG feedback can, thus, be trained.
ELM is an online-updated algorithm employing singlehidden-layer feedforward networks (SLFNs) with randomly assigned input biases and weights. It can facilitate rapid initialization and updates of a trained model, outperforming tuning-based training methods (e.g., fully connected network, FCN for short) in the rapid weight initialization, extremely fast learning speed (thousands of times faster than FCN [36], [37]), and strong generalization performance. Here, we improved an existing adaptive online sequential ELM (FOS-ELM) [38], making it capable of dynamic adaptation as well as outlier exclusion. Different from the standard ELM, our method enables online parameter updates with the adaptive forgetting scheme inherited from FOS-ELM [38] and selectively adopts the newly obtained samples based on the SLAM reprojection error. The main unknown parameter to be tuned during the training procedure is the output weights, which can be automatically calculated by a mathematical transformation. This part is for determining the mapping between wavelength shifts λ(k) and end-effector poses z c (k) of the soft robot.
Training: A few sample pairs are collected for the pose estimation model's initialization, where the robot is actuated by a predefined sequence U = u(1) u(2) · · · u(N 0 ) with N 0 steps of exploration. The corresponding sequences of wavelength shift and camera-based pose estimation are as follows: is to be learned. Consider the training set with input Λ and output Z c , with N 0 distinct training samples. The output of an SLFN with N hidden nodes can be represented by [39], [40] where for the ith hidden node, a i = a i1 a i2 · · · a il T and T are the weighting vectors linking to the input nodes and output nodes, respectively. The activation It is set as radial basis functions here, i.e., Represent the inner product of vectors a i and λ(j) as a i λ(j). Equation (2) can be written compactly as follows: Here, Φ is called the output matrix of the hidden layer. Obviously, proper values of parameters in Φ will result in The goal during training is to minimize the network cost function O − Z c , i.e., to find a solution vector that includes the to-be-tuned a T i , β T i , and b T i , i = 1, 2, …, N. Several algorithms can be utilized to search this solution, e.g., gradientbased iteration and least-square solution. Here, we use the latter method. An advantage of the ELM algorithm is that the values of input weights a i and hidden-layer threshold b i could be assigned randomly without having to consider the input data; thus, the output matrix Φ could then be obtained. Given an input set Λ, the least-square solution β of linear system (5) could then be determined by Finally, the output weights β will be analytically determined as follows: where Φ † means the Moore-Penrose (MP) generalized inverse of Φ. After these steps, the global nonlinear mapping model (1) generated by ELM is ready for prediction. The initialization step hereto is the whole procedure of standard ELM, which is an offline training method. Its robustness is determined by the MP inverse, possibly resulting in low overall estimation accuracy. However, the MP inverse is only employed in the initialization phase for network weights calculation, and the subsequent online update can weaken the adverse effects of MP inverse. Prediction: Provided with the wavelength shift λ(k) at the kth time step obtained, the corresponding pose of robot end-effector can be calculated by Since during the prediction procedure, the model is independent of camera-based pose estimation z c (k) (see Section II-B), it could be regarded as another pose estimator that could be further fused with z c (k).
Updating: Suppose the existing prediction vector β (0) is obtained by initial training dataset D (0) composed of Λ and Z c with N 0 distinct sample pairs. The expression of β (0) based on (7) could be rewritten as (9) When a new set of training data D (1) with N 1 distinct sample pairs is available for ELM, the weighting vector β (1) corresponding to both D (0) and D (1) can be calculated as follows: where Following this iteration, the ELM model would be updated after the kth training dataset D (k) as follows [36]: In consideration of the possible deteriorated camera-based estimations due to the poor image quality, it is necessary to set an activation threshold of the model updating mechanism. The reprojection error e S mentioned in Section II-B can be utilized as such an indication to determine whether the newly obtained sample is incorporated for online learning. When e S is larger than the threshold (> 1.4), the matrix β (k) will keep the value as in the last iteration step.
The reduction of effects from old data in the update procedure of ELM model can be achieved by introducing and adjusting several weight parameters for the old measurements. Equation (11) can be expressed in the form of where Weighting w is added to the variables related to old training samples; thus, the two factors (13) and (14) will be The recursive expression of (15) can be obtained by Sherman-Morrison formula as follows [38]: where .

D. Camera-FBG Sensing Fusion
In the ORB-SLAM2 algorithm, the reprojection error e S (introduced in Section II-B) can reflect the pose estimation accuracy. The fusion result can be regarded as a combination of SLAM and F-emp portions, with an adjustable weighting that characterizes the visual sensing accuracy. Based on this error e S , the weighting of SLAM portion in sensing fusion can be determined. The final pose estimation can, thus, be obtained by the following criteria (18) shown at the bottom of this page, where E L and E U are the error bounds distinguishing whether or not to entirely trust or discard the SLAM estimation, respectively; K S is an adjusting factor.

A. Kinematics Initialization by CC Model
In consideration of the limited dimensions of the actuation space compared with the task space, only the position p or orientation q can be controlled. In this section, the control objective is illustrated by the end-effector position p. As in this article, we utilize such a single-segment soft manipulator to demonstrate the idea of visual-strain fusion and hybrid control, and the controllable number of DoFs would be limited to 2. This results from a curved surface workspace with negligible thickness. Here, positional DoFs x and y are controlled as an example. The robot kinematics can be initialized based on the CC assumption, which is widely applied to continuum robots. The CC model is constructed based on the assumptions of zero torsion and uncoupling among actuation chambers. That is, the bending robot is assumed as no torsion involved, and the length variation of inflated chambers will not affect the other chambers. Two main parts need to be considered during the modeling, namely mappings from configuration space to task space and from joint space to configuration space. The former mapping is robot independent, while the latter mapping is robot specific as it is related to the robot actuation mechanism. In the CC-based model, the three parameters in the configuration space can be interpreted as r, θ, and φ, representing the radius, the central angle, and the bending direction (rotation angle) of the arc, respectively [see Fig. 4(a)].
The lengths of the three fluidic chambers [see Fig. 4(b)] are denoted as u = l 1 l 2 l 3 T . The angle φ representing the Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
bending direction could be obtained as follows: as well as the backbone arc curvature κ as The central angle of the backbone arc can be obtained by θ = κl, while the axial length l of robot yields l = (l 1 + l 2 + l 3 )/3. To summarize, the expression of three parameters r, θ, and φ in the configuration space can be represented as follows: where δ = (l 2 The position of end-effector p in the coordinate, as shown in Fig. 4(a), can be found as follows: The corresponding differential format can be expressed as follows:ṗ = J˙ u (23) where the Jacobian matrix J can be calculated by differentiating the position p with respect to the input u. Proved with the matrix J, we could establish its inverse function as follows: where J † is the generalized inverse of J. A singularity point exists on the initial status, that is, when the robot body is straight and aligns with the z-axis as in Fig. 4(a). This circumstance could be handled by adding tiny-value (e.g., 10 −5 ) variations on the chamber lengths during the calculation of J inverse. To involve constraints during generating motion commands for solving redundancy, e.g., to maintain the closest status to a preferred configuration, this scheme could be extended by an additional factor as [41] follows: ) is used to find a redundant solution approaching the preferred robot configuration, which could be set as the initial configuration u 0 without actuation air pressure or only with prepressure. The original chamber length is denoted by u 0 and the length in current time step k is u (k) .
Therefore, during the runtime, once a command of the desired displacement Δp * is given, the corresponding change of three chambers Δ u could be obtained to calculate the actuation command of stepper motors controlling the chamber air pressure. Thus, for the kth time step, the new chamber lengths can be formed as follows: where K p is a proportional gain to adjust the change of chamber length. To simplify the modeling for fluid dynamics, we assume that the extension of chamber lengths has a linear positive correlation relation with the stepper motors' output u, i.e., where α is a diagonal matrix, including the three multiples for three chambers. However, in actuality, this linearization could not describe the transformation well. Nonlinear elongation of elastic chambers and transmission of fluids, as well as other modeling uncertainties, would induce errors in the robot control. Thus, a learning-based component to compensate tracking deviations is investigated in the Section III-B.

B. Online Data-Driven Error Compensator
The online update of an additional error compensator enables the controller to compensate steady errors and even adapt to mechanical property changes, e.g., material fatigue. Once the actuation change Δu(k) is executed in a new step, the corresponding actual motion vector Δp(k) could be estimated as in Section II. Thereby, a set of new sample pairs, including input x = [u(k − 1) T , Δz(k) T ] T and output y = Δp e (k) would be produced. Here, the variable Δp e (k) represents the difference between desired motion and actual motion, i.e., Δp e = Δp * − Δp r (28) where Δp r is the actual motion in task space corresponding to the desired Δp * . The purpose of our proposed error compensator is to predict this error in advance and consider this potential deviation together with the desired robot movement. Thus, Δp * would be improved after compensation as follows: This newly collected sample could reflect the latest robot mechanical status and is added into the model training dataset, i.e., input matrix X and output Y. GPR is utilized here for the model training. The working principle of GPR has been introduced in our previous work [34]. For each step of motion, the model would be retrained for updating. A size limitation N max r for this dataset is predefined to keep the prediction fast and effective. If the current size N (k) r > N max r , the oldest sample [x 1 , y 1 ] will be discarded.
No prior data exploration is needed in robot manipulation. The robot can be actuated using the CC model-based controller first, while feedback for compensator initialization is being collected. After several motion steps N c (≤ N r ), the GPR-based compensator will be first trained and updated in the following steps. For time step k > N c , an additional compensated component is added as in (29).  [42]. The chamber inflation was regulated by an actuation unit comprising of three pairs of stepper motors and cylinders. Precise angular position control could be implemented on the motors, thus adjusting the volumes of sealed cylinders connected to chambers. An endoscopic camera (depth of field: 8 to 150 mm) and a LED module were fixed on the tip cap [see Fig. 4(c)]. A single-core optical FBG fiber with 17 FBGs (6-mm long gratings, 20-mm spacing) was helically wrapped and adhered on the silicone continuum body. For the convenience of fabrication, the distances between adjacent turns of fiber were set at approximately 16.5 mm. The robot outer diameter was 20 mm, and the bendable part was 90 mm in length. As the robot base was fixed in the experiment and the twisting is negligible, the roll orientation would not be controllable.

B. Pose Estimation by ORB-SLAM2
Pose estimation accuracy using ORB-SLAM2 was validated in a LEGO-constructed scenario. The robot was actuated according to a predefined sequence U for three stepper motor sets. This actuation sequence was expected to steer the robot to follow a trajectory that spreads out from the initial position.
A pair of EM trackers were attached on the robot tip to record the actual (ground truth) pose z. Intrinsic calibration of the monocamera was performed first. Extrinsic calibration of monocular metric scale was also required in the initialization procedure. Before each time of manipulation, the robot would move slowly along one or two direction(s), until the initialization for visual features was ready. With the robot actuated by sequence U with N steps, the set of SLAM estimation Z c = z c (1) z c (2) · · · z c (N ) and EM tracker measurement Z = z(1) z(2) · · · z(N ) can be obtained. The affine transformation from P to P c could be calculated; thus, the SLAM position measurement is calibrated as follows: where R, k, and p are the rotation matrix, scale factor along all dimensions, and translation vector, respectively. Measurement errors of P c comparing with P were calculated. The mean absolute errors of ORB-SLAM2 along x, y, and z axes were, respectively, 0.508 mm, 0.596 mm, and 0.385 mm, while the root-mean-square error (RMSE) was 0.998 mm [see Fig. 5(a)]. The trajectories constructed by P c and P could be found in Fig. 5(b). As can be seen in conditions of abundant visual Different from SfM or other feature-based methods, this reconstruction is simplified by positioning the images according to the estimated end-effector/camera pose. This will effectively increase the function of sparse SLAM algorithm as well as the mosaicking efficiency.

C. Sensor Fusion Pose Estimation
The ELM was chosen to train and update the F-emp pose estimation model in real time. With the same actuation sequence as in Section IV-B, wavelength shifts were collected and trained by ELM to estimate end-effector pose at the same time. We set the numbers of hidden nodes and initialization samples as N = 200 and N 0 = 450, respectively. These parameters were roughly tuned referring to the ELM estimation accuracy. The interval between adjacent steps was set as 0.05 s. It could be found in our experiments that the prediction result of ELM would be improved with the increment of sample number N 0 . Once a necessary number to guarantee the convergence of prediction was satisfied, the accuracy would not significantly increase. The ELM model was updated every newly obtained sample. The prediction results were compared with the measurement using ORB-SLAM2, which was the benchmark in training. The trajectory reconstructed by ELM estimation approached that of the SLAM measurement closely. The mean estimation errors along x, y, and z axes of ELM were 1.82×10 -4 mm, 3.95×10 -4 mm, and 4.39×10 -4 mm, respectively, while the mean spatial error was 8.28×10 -4 mm. This result demonstrates that ELM is capable of learning the pose information utilizing wavelength feedback. To test the pose estimation effectiveness when SLAM is unable to achieve consistent and stable estimation, we validated the sensing fusion methodology in the following two conditions, which are under moving visual obstacles and under the varying lighting condition, respectively.

1) Under Moving Obstacles:
The robot was actuated with a similar predefined spiral sequence, as shown in Section IV-B. Moving or stable obstacles would disturb camera view, with examples of the camera view and features, as illustrated in Fig. 6(a). In the first 100 s [i.e., first 2000 time steps, marked with 1 in Fig. 6(a)], no disturbances in the camera view were applied. For the testing scenario, this period would guarantee around 600 features in each frame, and the mean error of SLAM and fusion was 0.840 mm and 0.768 mm, respectively. After that, a hand was positioned statically in front of the camera or moving quickly to partly shield the field-of-view (marked with 2 and 3 ). Neither case would consistently cover all features in the camera view. During these periods, the number of features was less than or even reduced to 0 in rare frames, the mean SLAM error was 1.694 mm (4.2% to the largest distance to starting point, 40.23 mm) and the maximum error was 26.272 mm (65.3% to the largest distance), while the fusion errors had a mean of 1.132 mm (2.8% to the largest distance) and max. of 2.573 mm (6.4% to the largest distance). For the last case, the hand would statically cover the whole field-of-view (marked with 4 ) for ∼2 s each time. During this period, the visual features would be consistently lost; thus, the SLAM procedure would be stagnated (max. error 38.483 mm, 95.7%), but the fusion results could be maintained (max. error 4.747 mm, 11.8%). In the moving-obstacle period, there would also be moments where all the features in the camera view were blurred 3 , resulting in the SLAM estimation pausing. However, in our sensing modality, the F-emp estimator would compensate for the lack of visual sensing and guarantee the acquisition of sensing feedback. In the control and plot, if no SLAM estimation was provided, we would set the SLAM pose as the latest valid value to avoid the lack of feedback. As shown in Fig. 6(b) and (c), including the plots of errors, the trajectory of fused pose would not be affected by the moving obstacles, while pure SLAM-based pose would be deteriorated as expected. However, there is still a defect of the SLAM estimation that could not be obviously resolved by the fusion method. As can be seen in the lower subfigure of Fig. 6(b) demonstrating the error percentage to the corresponding motion step, the smaller step lengths (e.g., during first 50 s, mean step size 0.12 mm) will result in poorer estimation (mean errors 0.82 mm, 1074.4% and 0.75 mm, 966.8% for SLAM and fusion results, respectively). The noise of SLAM dominated the estimation error in such tiny steps, and this is also a reason for the high percentage throughout the whole procedure. Even in the outer trajectory (e.g., during the last 175-200 s), the step size was only 0.29 mm. Therefore, we may conclude that the advantage of SLAM is the overall localization rather than incremental estimation, and this behavior would also be maintained in the fusion method based on SLAM.
2) Under Varying Lighting Conditions: Varying lighting conditions were also tested in the pose estimation experiment (see Fig. 7). Under the lighting provided by incandescent lamp [marked with 1 in Fig. 7(a)], the robot started the same series of movement as in Section IV-C1.
During manipulation of the robot, the lighting condition was changed gradually or abruptly to weak lighting (marked with 2 ) or complete darkness during 75-150 s (i.e., time step 1500-3000), and then returning to the initial lighting. An additional moving LED source was tested from 150 to 200 s (step 3000-4000), which was oriented directly toward the endoscopic camera ( 3 ). It could be seen that low-light level would reduce the number of visual features in the camera view and induce minor noise in the SLAM-based pose estimation [see Fig. 7(b)]. However, once the lighting was fully removed in the camera view (which resulted in entire black image feedback), the SLAM procedure would be interrupted due to the absence of features. The moving lighting source would also bring consistent noise to SLAM estimation. When the LED was directly facing the camera, most of the features (especially on the lighting spot area) would be lost and SLAM estimation error increased. The fusion-based estimation could keep a stable estimation level (RMSE: 1.324 mm, 3.4% to the largest distance to starting point, 40.23 mm), largely improving the estimation accuracy compared with SLAM (RMSE: 3.116 mm, 3.4% to the largest distance). A similar limitation on the tiny incremental motion estimation, as discussed at the end of Section IV-C1, also appeared in this varying lighting test.
Scene reconstructions under conditions in Section IV-C1 and C2 were also conducted (see Fig. 8). Poses for image stitching were provided by the sensing fusion result. The blurs caused by the moving hand and varying lighting were reflected in the reconstruction, several of which are marked with a dotted outline. Although the images were blurred, the whole mosaicking could still successfully stitch together, accredited to the stable and consistent feedback of sensing fusion.

D. Tracking: Hybrid Control Versus Model-Based Control
Control performance comparisons were tested by target tracking along two kinds of paths: a pentagram trajectory comprising of straight lines and sharp angles (see Fig. 9); and a circle trajectory (see Fig. 10). It should be noticed that the control task aims at the positions on the x and y axes. Paths shown in the following figures are projections on the x-y plane, while the actual end-effector trajectory is distributed on the spatial curve surface. EM trackers were used to provide the sensing feedback in this section. During this tracking task, the target was moving at a consistent speed. Each iteration of control loop was set as 0.05 s. As we know, model-based control could guarantee stable performance but needs parameter tuning to achieve higher accuracy. Besides those related to robot structures, others, such as the proportional-integral-derivative (PID) factors, would also affect the convergence performance in tracking tasks. The following two experiments are to validate the effects of online-learningbased portion in the hybrid control.
Proportional gain K p in (26) is to adjust the calculated chamber length change, which plays an important role in the CC model-based control. The smaller its value is, the smaller the actuation change will be. We roughly set several different values of K p and tested if the learning-based portion could compensate

1) Path With Sharp Angles:
The performance of tracking the moving target along the pentagram path is shown in Fig. 9. Gain K p was defined as 0.1, 0.08, 0.06, and 0.04, respectively. As the target would switch on every time step, the curve named with "Error" represents the distance between the current target and actual end-effector position. "Deviation" is calculated as the distance from the actual end-effector position to the closest point on the desired path. When K p was tuned as a proper value [see Fig. 9(a)], the pure model-based controller's tracking performance was roughly acceptable. However, for the cases that K p could not be tuned well [see Fig. 9(b)], the model-based method was not capable of adapting to the speed of moving target, i.e., the "Error" obviously increased. Mean tracking errors (i.e., step errors) and path deviations (e.g., the closest distances to the desired path) with different values of K p are listed in Table I, accompanied with the standard deviation (STD). Both the value and percentage of the errors are provided. The percentage for tracking error was obtained by averaging all tracking (error/step size), where the step size is 3 mm that for the path deviation was the average value of all (deviation/corresponding distance to the starting point). The column "Improv." indicates the percentage that hybrid controller outperforms model-based controller in the where e m and e h are the tracking errors (or path deviation) using model-based controller and hybrid controller, respectively. For these four cases, the hybrid controller could improve the performance (68.9%-86.4% in tracking error, 65.8%-80.9% in path deviation) compared with the model-based one, even under the precondition that the tracking performance of mode was far from an acceptable standard.
2) Smooth Path: Similar to the pentagram path tracking, the performance when following a circle (see Fig. 10) is shown according to the same arrangement of Fig. 9. The difference with Section IV-D1 is that this path was constructed with smooth curves. When K p was not fine-tuned, the compensated part could steer the end-effector to approach the desired path quickly [see Fig. 10(b)]. Both the tracking error and the path deviation in tracing results can be largely reduced (52.3%-90.1% in the tracking error, 78.7%-94.1% in path deviation, Table II) under four values of K p .
Both experiments in Section IV-D demonstrated that the hybrid control scheme enables the tracking convergence and gradually increased tracking accuracy without fine modeling tuning and data exploration. Taking the path deviation in pentagram tracking (see Table I) as an example, 0.1 was the optimal value of K p among the four values; however, the distinction of deviation was effectively reduced after using the hybrid controller (e.g., path deviation: 0.361-0.696 mm), greatly outperforming the model-based controller (1.057-3.646 mm). It could be noticed that there were deviations at the lower middle of the pentagram path (see Fig. 9(a) and (b), coordinate around [0, −10]) as well as on three areas with 120°interval in the circle path [see Fig. 10(a) and (b)], when K p was set as both 0.1 and 0.06. This kind of error results from the highly nonlinear mapping from stepper motor positions (i.e., air cylinder volumes) to elastic chamber elongations. The rough linearization (27) to correlate the motor command and chamber elongation could not meet the nonconstant change of factor α. However, even with such an insufficiently tuned kinematics for control, the onlinelearning-based error compensator still enables enhancement of the tracking performance [see Fig. 10(c)].

E. Hybrid Control With Sensing Fusion
Experiments integrating the visual-strain fusion-based pose estimation and hybrid controller were also conducted. The robot end-effector was instructed to track a complicated closed path along an elephant-shaped path [see Fig. 11(a)]. The desired path involved 1500 target points, with approximately equal distances. For the pose estimation part, the ELM model was initialized after the first N 0 = 450 steps before which the SLAM estimation was acting as the sensing feedback for robot control. After initialization, the pose information would be provided by the fusion result of SLAM and FBG measurement. The ELM model was also updated online. For the closed-loop target tracking, model-based control was used for the first cycle (0-85 s, step 1-1700). Considering the initial end-effector position [0, 0] was not on the path, 200 more time steps after the 1500th step were included in the validation of model-based control to compensate for the initial approaching procedure. During this period, data Fig. 12. Tracking performance with the same subfigure meanings as in Fig. 11. The nonlinear relationship between chamber elongation and motor actuation for each chamber was fitted using real-robot data. The third-cycle performance was added, where a moving obstacle created visual disturbances in the camera view. collection and initialization of the GPR-based error compensator were also finished. Thus, in the second cycle (85-160 s, step 1701-3200), the robot was actuated utilizing hybrid control, which includes model-based control and the online-updated error compensator. The improvement using hybrid controller compared with the model-based controller can then be reflected.

1) Using Linear Actuation-Elongation Mapping:
We tested the performance using the same hybrid controller as in Section IV-D (K p = 0.1), the result of which is demonstrated in Fig. 11. The path of hybrid controller is marked by dense small filled circles, while that of the model-based method is marked by unfilled triangles. It could be seen that similar to the previous tests, the learning-based part in the hybrid controller can effectively compensate for most of the deviations of modeling uncertainties. However, when switching the inflated chambers [marked red in Fig. 11(a)], the hybrid controller still met difficulties in totally correcting the tracking trajectory to the desired path. Although, if we further fine-tune parameters in the model-based controller, it is feasible to bring a satisfactory tracking result with less path deviation. However, such performance is still valuable to be discussed, which shows that switching chambers is one of the main cases of concern for pneumatic-driven soft actuators, especially in the continuous-path-following tasks. In Fig. 11(b) and (c), the height difference of two controllers' error peaks [e.g., 13.322 mm and 7.944 mm, respectively, for model-based and hybrid controllers in Fig. 11(b)] as well as the mean values [2.625 ± 2.501 mm and 1.604 ± 1.487 mm in Fig. 11(b)] demonstrate that the error compensator can decrease the tracking error under various regions.
2) Using Nonlinear-Fitted Actuation-Elongation Mapping: In this experiment, the robot-specific mapping (27) was modified. Instead of a linear magnification, a nonlinear relationship between the chamber length and actuation motor was fitted using cubic spline data interpolation. With this specific-mapping correction, the control performance of model-based method could also be obviously increased with a fine-tuned K p . The three deviations [see Fig. 12(a)] when switching inflated chambers could be eliminated, as no obvious error peaks are found [see Fig. 12(b) and (c)]. This time, we supplemented one more cycle (160-235 s, step 3201-4700), where the hybrid control continued being used, but moving obstacles were applied as visual disturbances in the camera view, similarly in Section IV-C1. In the last cycle with visual disturbances, the tracking error (1.751 ± 0.473 mm) and path deviation (0.336 ± 0.221 mm) [see Fig. 12(b) and (c)] could also maintain in a low level. These two experiments demonstrate that the fusion-based pose estimation could provide valid feedback for the hybrid controller or other controllers.

F. Effects of Physical Collisions
Considering that the general purposes of soft robot use involved physical interaction with surrounding objects, we intend to investigate how the proposed fused sensing scheme, even under the interaction disturbance, can encounter deteriorated feedback either from visual or strain sensors. As a result, the synergetic use of both sensing approaches would give rise to the overall control performance. We intentionally made the robot even more susceptible to the contact interaction such that the FBGs wrapped on the robot cylindrical surface (without outer layer protection) would reflect the disturbance to the measured robot configuration as to which the proper visual sensing is expected to compensate this FBG sensing disturbance in order to maintain the control performance. Besides the LEGOconstructed setting, an abdominal simulator was built using swine viscera, acting as the surrounding for the camera in these experiments.
1) Slow Push on the Robot: Statistical results (in the two setups) are summarized in Table III. When the force was applied on the rigid tip [see Fig. 13(a)], its effect on the pose estimation can roughly be compensated by the fusion method ( Fig. 13(a) 1 , mean error: 1.216 mm). The appearance and removal of  During the smooth and continuous pushing period, the difference between SLAM-, FBG-, and fusion-based estimations is not obvious (see Table III). Although sometimes the mean errors using the above three methods are slightly larger ( Fig. 13(a) 2 , ∼2.5 mm), the fusion result can also reflect the pattern of movement.
Contact tests were also conducted on the soft body [see Fig. 13(b)]. As the local contacts usually resulted in only a small portion of sparse FBGs being affected, the F-emp estimation accuracy did not deviate too much (e.g., Table III  The fusion sensing performance can be comparable to the baseline that is the EM tracking directly on the robot end-effector (video 3:17-3:30). When the force was intentionally applied to the location wrapped with fiber, noise was observed from the F-emp pose estimations. This correlates with our hypothesis, i.e., if direct contacts exist between the fiber and surroundings, the measured FBG wavelength will involve external-force-caused strain variation and induce estimation error. One method to avoid this is to add a protective sheath/bellow externally in order to isolate or weaken the intensity of forces on fiber gratings. Such protection can be considered when applying the proposed sensor fusion method in specific applications.
2) Fast Flick on the Robot: To address more complicated circumstances involving both FBG and visual sensing disturbances, we conducted a set of finger flick tests in which we can observe how the F-emp estimation behaves when the visionbased estimation deteriorated by motion blur (see Tables IV  and V). Different from the case with continuous pushing, the SLAM algorithm was unable to stably measure the motion since visual features were lost (video 3:30-3:39). Although the accuracy of the fusion-based estimation was slightly reduced relative to finger push, the fusion-based estimation was still stable (Table IV, error: 1.221 ± 0.860 to 1.769 ± 1.434 mm). Although, due to the robot hyperelasticity, a large deviation (e.g., 10.911 mm) would appear at the moment when an abrupt force was applied, our fusion-based estimation could gradually adapt to the vibration. Table V took two extreme examples to discuss the limitation of this robot structure. They were chosen when the robot was executing in the abdominal simulator. It can be seen that the error is larger than in Table IV, particularly the error maximum, due to the increased force and corresponding bigger vibration amplitude. Although the fusion result occasionally fell from the optimal estimation, its mean error and STD were still close to the best one (e.g., fusion error 2.654 mm, min. error 2.275 mm). An interesting observation is that, here, the mean error of pushing on the rigid cap was even larger than on the body. One possible reason for this issue is the online-trained ELM model that did not involve enough related samples to accommodate such large range of motions. Aiming at fast model establishment, the   ELM models in our experiments were initialized online with a limited number of samples, which were in large probability of dense local distribution. Our model focused more on incremental local motions instead of cross-workspace movements. Such a mode was set for fast readiness of the F-emp prediction. The tradeoff between global accuracy and training samples should be adjusted if similar cases (large vibration) take a great proportion in the specific applications. In the Appendix, we discussed the robustness of the proposed sensing framework trained by various numbers of training samples to fast flick, where we can see that more and denser initialization samples will benefit to higher accuracy under physical collisions. The effect of protection on the FBG fiber was also simply tested.

3) Fixed Obstacles in the Workspace:
Besides the validation on sensing, we also tested the controller's performance in the abdominal simulator. The same controller, as shown in Section IV-E2, was used. In Fig. 14(a), a zoomed-in area (yellow outline) demonstrated the improvement of our proposed controller again compared with the model-based controller. Noted that to extend the types of physical collision, an aluminum rod was fixed in the robot workspace, limiting the robot bending on one side [see Fig. 14(b)]. The paths marked by an ellipse indicate the area that the robot body was blocked by such a constraint. It can be observed that the actual paths of robot deviated from the desired path, even using the hybrid controller. This is due to the friction between the hyperelastic robot body and rod. As the target on the desired path switched for each control step, the correction effect of closed-loop controller could not totally compensate for the friction. However, the actual trajectory could keep convergent to the desired. Once the robot left the blocked area, the hybrid controller enables to bring the path back to the desired path in high accuracy again [mean error: 0.399 ± 0.401 mm, Fig. 14(c)], while when using pure model-based control, the robot motion could not return smooth. The image stitching of the abdominal-simulator scene [see Fig. 14(d)] clearly demonstrates the details of the scene.

V. CONCLUSION
In this article, we proposed an integrated soft robot control system, integrating visual-strain fusion-based pose sensing and online-updated hybrid control. All the data-driven models used in the system could be conducted online, without prior data collection. Sparse strain measurements along a single-core FBG fiber wrapped on the robot were trained online as a pose sensor. SLAM estimation using the monocular camera on the robot end-effector was used for the FBG sensor training. The fusion result of SLAM and FBG was able to provide robust feedback of the end-effector pose and accomplish 6-D image stitching. Sensing accuracy and continuity under extreme visual conditions, such as moving obstacles and varying lighting conditions, were resolved, even when encountering full shielding or absolute darkness. The sensing fusion proved immune to failures in SLAM caused by poor feature quality in images. The mean estimation error could be increased and stabilized from RMSE 3.116 to 1.324 mm. For the control scheme, the hybrid controller combining model-based kinematics and learning-based error compensator enabled steady control in target tracking tasks. The learning-based compensator in the hybrid controller reduced the tracking error by >80%. This controller can relax the requirement on modeling accuracy and effectively accommodate unmodeled nonlinearity.
The proposed framework integrated visual-strain fusion sensing modality and hybrid controller could be extended to other robot designs, including multisegment prototypes, although, in this article, we only validate it on the single-segment continuum robot. The application of single-core FBG fiber was not limited by the number of segments, as long as the adjacent segments were connected by a continuous joint that was smooth for wrapping the fiber. However, for the hybrid controller, the kinematics model should be changed or modified according to the specific manipulator used. The learning-based error compensator had the potential to be implemented by the same means as in Section III-B and enhanced the model-based control performance, if the feedback variable and actuation command could be collected and trained.
It is worth noting that our learning-based FBG model incorporates sparse FBGs to predict the robot pose based on its configuration, which has been proved capable of adapting to common local contacts. The further advanced multicore fiber using OFDR technique can even eliminate the local/global interaction effect on a similar pose/configuration estimation, such as impulsive or continuous interaction-induced deformation. Examples can be found in bronchoscopy (e.g., ion endoluminal system, Intuitive Surgical, Inc.) and catheterization platforms [22], [23], [43]. The FBG sensing could still be robust against pulsatile liquid flow or sudden contact with the surrounding. In light of the increasing use of FBGs in soft robotics, we can foresee the syngenetic and practical value of using both camera image and FBG strain data as the closed-loop control feedback.
In the aspect of the proposed algorithm, the combination of vision and FBG strain sensing can be further explored. The FBG fiber can be calibrated offline as a position or orientation sensing device and integrated with a monocular camera to compose a visual-FBG soft-robot SLAM framework, similar to the VINS SLAM [15]. Well-calibrated FBGs can take on the role of IMUs in a new enhanced visual SLAM system, therefore recovering the metric scale to enlarge their usage in soft robotic applications. FBGs could resolve challenges in processes, such as estimator initialization, extrinsic calibration, online loop detection, and tightly coupled relocalization, thus generating a new SLAM architecture for continuum robots. This visual-FBG SLAM system would have a great potential to be used in endoscopic robot localization, navigation, and control.

APPENDIX
As a summarized analysis of Section IV-F1 and F2, the possible factors that affect the sensing accuracy under collisions could be categorized into two types, namely, algorithmic and physical factors. The hypothesis is that more samples for F-emp model training and external protection for FBG fiber will increase the robustness to external contacts. We conducted two simple tests as straightforward validations. First, experiment to test the effect of training samples' number and density under fast flick on the robot was conducted. As mentioned in Section IV-C, the number of samples for ELM model initialization was set as N 0 = 450 throughout this article. It is hypothesized that when the number of samples increases with an incremental density, the ELM model can be trained to deal with abrupt and irregular collisions more effectively. Therefore, N 0 = 450, 900, and 1800 were used separately under the same scene condition (see Table VI). Their corresponding distribution densities were also constant to, double, and quadruple the original density, respectively. After initialization, the samples for model's incremental training also maintained such densities. It can be found that the model initialized by 1800 samples and updated accordingly has the minimum estimation error (0.712 ± 0.173-0.808 ± 0.316 mm) among the three settings. The maximum error also has a decreasing trend when the density of samples increases. This is consistent with our hypothesis. Note that the time for model initialization and updating in each step is not obviously affected, showing that our proposed model is able to improve the robustness to collisions by increasing the number of samples.
To test FBG fiber's susceptibility to contact, we compared the wavelength shift with and without a silicone sheet cover (∼2-mm thickness) for protection (see Table VII). The unit of wavelength shift is nanometer (nm). The average range of absolute wavelength shift during the spiral sensing test was 0-0.158 nm (mean 0.024 nm). The noise in wavelength shift under static status (without external force applied) was measured as max. 0.004 nm (2.53% of the max. wavelength change 0.158 nm; the following brackets indicate the same meaning) and mean 9.07×10 −4 nm (0.57%). The fiber was placed in straight, C-shaped, and S-shaped grooves separately, where the maximum wavelength shift under finger push (∼2 N) was recorded. In Table VII, the "%" represents the proportion of such maximum shift in the sensing range 0.158 nm. It can be seen that the silicone protection can effectively isolate external forces on the fiber, with the sensing noise on wavelength shift reduced from 46.14%-58.35% to 9.18%-24.49%. Other fiber integration methods (e.g., placing a multicore FBG fiber in the inner channel of a continuum robot) will also facilitate the improvement of sensing accuracy under collisions.