DeepNav: Joint View Learning for Direct Optimal Path Perception in Cochlear Surgical Platform Navigation

Although much research has been conducted in the field of automated cochlear implant navigation, the problem remains challenging. Deep learning techniques have recently achieved impressive results in a variety of computer vision problems, raising expectations that they might be applied in other domains, such as identifying the optimal navigation zone (OPZ) in the cochlea. In this paper, a 2.5D joint-view convolutional neural network (2.5D CNN) is proposed and evaluated for the identification of the OPZ in cochlear segments. The proposed network consists of two complementary sagittal and bird-view (or top-view) networks for 3D OPZ recognition, each utilizing a ResNet-8 architecture consisting of five convolutional layers with rectified linear unit (ReLU) activations, followed by average pooling with a size equal to the size of the final feature maps. The last fully connected layer of each network has four indicators, equivalent to the classes considered: the distance to the adjacent left and right walls, collision probability and heading angle. To demonstrate this, the 2.5D CNN was trained using a parametric data generation model and then evaluated using anatomically constructed cochlea models from micro-CT images of different cases. Prediction of the indicators demonstrates the effectiveness of the 2.5D CNN; for example, the heading angle has less than 1° error, with computation delays of less than 1 millisecond.


I. INTRODUCTION
The cochlear implant [1] is one of the most successful implantable devices in clinical practice. It helps to restore lost hearing by delivering electrical impulses to the auditory nerves via an electrode array inserted into the cochlea in the inner ear [2]. Cochlear implant navigation involves inserting a wire containing an array of stimulating electrodes into the delicate spiral (or snail) shaped tube, which varies in diameter and height along the Z plane and imposes geometrical limitations on cochlear implant surgery, as shown in Fig. 1. The quality of restored hearing sensation is strongly related to the efficacy of cochlear implant surgery, in particular the optimum positioning and the insertion depth of the electrode array inside the cochlea without further damaging the remaining hearing [3]. The standard technique relies on the surgeon's manual skills when pushing the electrode array down the spiral-shaped cochlea and requires the surgeon to identify the optimal insertion path largely by feel. Although the tip of the electrode array is not sharp enough to pierce through the bony wall of the cochlea, extreme pressure may increase the risk of the electrode tip crossing the auditory nerve or the modiolus. Medical-imaging techniques such as MRI, computerized tomography (CT) [4] and X-rays [5] are not practical options for guidance in implantation surgery, as they cannot provide real-time imaging and are impractical given the very small volume of the cochlea. The systems in [6], [7], [8], [9], [10], [11], and [12] for cochlear implant navigation derive information from impedance measurements on the electrodes at the end of the electrode array. While useful for identifying the position of the electrode tip, performance is compromised by the limited accuracy of the measured impedance values. Integration of a robotic arm [14] does not lead to better navigation performance, as it similarly receives the guidance parameters from imprecise impedance calculations. The present limited accuracy of identification of the position of the tip would be improved by embedding intelligence, which would require accurate navigation.
The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan.
FIGURE 1. Guidance of cochlear implant electrode array. The mean heights at the basal, middle and apical turns are 2.3 mm, 1.2 mm and 0.8 mm, respectively [13]. The quality of restored hearing sensation is strongly related to the optimum positioning and the insertion depth of the electrode array inside the cochlea without further damage to the remaining hearing.
To avoid adverse consequences such as crossing the anatomical wall as a result of the extreme geometrical limitations, computer-assisted surgery has been used [15]. These methods automatically identify the extremely precise centerline trajectory required inside a three-dimensional (3D) reconstructed cochlea as a priori knowledge (or post-processing) for cochlear implant electrode array insertion. Electrode insertion algorithms have been designed based on the type of electrode: 1) the lateral wall electrode [16], which slides along the spiral ligament; and 2) the modiolar-hugging electrode [17], which tends to go closer to the inner wall (the modiolus).
This paper proposes a method to significantly enhance cochlear implant navigation by rapidly identifying an interactive safe insertion zone in real time using a novel 2.5D convolutional neural network (CNN), providing very high insertion resolution accuracy. The proposed 2.5D algorithm navigates the tip of the electrode safely along the centerline coordinates to ensure minimal insertion risk while the rest of the electrodes slide along the cochlear wall. The electrode array model used is based on a commercially available array with 16 platinum electrodes [Advanced Bionics HiFocus™ SlimJ electrode (Hannover, Germany)].
The rest of the paper is organized as follows. Section II presents the prior art and the cochlear implant navigation algorithm proposed in this work. Section III describes the methods used for data generation and a framework to derive the navigation indicators. It also discusses the design of the 2.5D CNN and the joint 3D operator. Section IV examines the efficacy of the 2.5D CNN in different scenarios and visualizes the navigation steps for an anatomical cochlea model. Concluding remarks are drawn in Section V.

II. RELATED WORK: CENTERLINE TRACING ALGORITHMS
There are a variety of approaches that can be utilized to identify the centerline of tubular structures. One category consists of skeletonization approaches [18] and those using multiscale enhancement, morphological reconstruction and segmentation methods [19], [20], [21], [22]. They require processing the full 3D volume, with numerous operations per image pixel.
A second category tracks the centerline based on a filter or an assumed model. Commonly used filters are based on the eigenstructure of the local Hessian [23], idealized tubular models of vessels [24] and Hough transforms [25] to locate the vessel direction and its cross vectors at a reference frame. For example, the Hessian of the image is computed from the second-order partial derivatives of 3D sub-images at reference nodes, which requires extensive computation time. Cylindroidal superellipsoids [26] are an advanced model for probing 3D tubular shapes using recursive fitting methods. Although the fitting-based approaches perform well across morphological complexities, they derive model parameters using maximum likelihood, which is an extremely complex and lengthy process.
A third category utilizes vectorization algorithms [27], [28], [29] for tubular structure boundary analysis and centerline tracing, where only pixels close to the border are processed. They are well suited to real-time and robust tracing in large image sets. The sparse exploration of the boundaries yields low computational overhead, but it introduces higher sensitivity to discontinuities and geometrical complexities. An algorithm utilizing the vectorization approach to handle 3D (volumetric) data is described in [30]. It is a fully automatic tracing algorithm that emulates a 3D cylinder model and recursively explores the boundary of tubular structures. Simulations using the 3D cylinder algorithm on constructed cochlea models illustrate that the centerline tracing does not perform reliably when faced with high-order tubular changes.
Machine learning offers an alternative approach to identify and trace the central coordinates [31], [32], [33]. Steerable features and randomized decision trees are used in [31] to perform centerline extraction by learning the structural patterns of a tubular-like object. The approach in [33] uses an orientation flow field and a classifier to extract blood vessel centerlines. Tracing all coronaries takes about 1 minute on average on an Intel Core i7 2.8 GHz processor with 32 GB RAM, as reported in [33].
CNNs are a class of deep learning algorithms that have recently been utilized in 3D tubular structure tracing [34], [35], [36]. In [34], a 3D dilated CNN [37] was trained to predict the most likely direction and radius of an artery at any given seed point. The tracing scheme in [34] was developed based on determining a posterior probability distribution over a discrete set of possible directions as well as an estimate of the radius. The drawback with this design is that the optimal direction determination is posed as a classification problem. The possible directions are distributed on a sphere where each point corresponds to a class. The best classification performance was obtained for the directions {500, 1000 or 2000}. The design in [34] demands excessive computational cost in classifying directions and is not suitable for real-time applications; it requires 20 seconds for fully automatic coronary tree extraction using the Nvidia Titan Xp GPU. In [35] and [36], 3D CNNs were proposed to trace the cardiovascular tree structure. They require 58 and 25 seconds using a 12 GB GPU and a Tesla P40 GPU, respectively.
120594 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
This paper proposes a novel low-computation deep navigation method using 2.5D multi-view CNNs that can better transform the input image to a small number of key perception indicators to recover 3D tracing information on the tubular structures, as shown in Fig. 2. The 3D cochlea segment is pre-processed and projected onto sagittal and bird-view planes, and each projection is then applied to a separate CNN for a mapping process. Each view decodes the relevant navigation (or tracing) information, and the outputs are fused to give the location distribution of the joint-view 3D tracing operator.
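The projection of a sampled 3D segment onto two orthogonal planes can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the segment is a list of (x, y, z) points already normalized to the unit cube, the function name `project_views` is hypothetical, and the binary rasterization onto a 64 × 64 grid is an assumed choice that matches the input size quoted in Section III-C.

```python
def project_views(points, size=64):
    """Rasterize 3D points onto the X-Y and X-Z planes as binary images.

    Assumes every coordinate lies in [0, 1]; names and rasterization are illustrative.
    """
    def to_index(value):
        # Map a coordinate in [0, 1] to a pixel index in [0, size - 1].
        return min(size - 1, max(0, int(value * size)))

    xy_view = [[0] * size for _ in range(size)]
    xz_view = [[0] * size for _ in range(size)]
    for x, y, z in points:
        col = to_index(x)
        xy_view[to_index(y)][col] = 1  # X-Y plane: the z coordinate is dropped
        xz_view[to_index(z)][col] = 1  # X-Z plane: the y coordinate is dropped
    return xy_view, xz_view


# Three hypothetical sample points inside the unit cube.
points = [(0.0, 0.0, 0.0), (0.5, 0.25, 0.75), (1.0, 1.0, 1.0)]
xy_view, xz_view = project_views(points)
```

Each returned image would then be passed to the network of the corresponding view; discarding one coordinate per view is what reduces the 3D problem to two 2D ones.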
The proposed tracing method has the following contributions:
1) A 2.5D tracing algorithm that offers a significant trade-off between performance and processing time by removing a dimension of an image. The algorithm is a good fit for tracing-related tasks in real-time image processing.
2) A compact residual convolutional architecture used for each projected 2D image. It predicts the steering angle and the indicators, including the collision probability and the distance to the left and right walls, in real time.
3) A direct perception approach that maps an input image to a small set of indicators used to identify the optimal tracing path or insertion zone for navigation, for example, of the electrodes inside a cochlea. The mapping framework abstracts the images by keeping only a set of compact yet complete descriptors, resulting in real-time optimal path identification.
4) A comprehensive physiologically inspired tubular dataset that provides a very diverse set of virtual environments for training the 2.5D tracing algorithm. Through extensive evaluation, it is shown that the trained model is efficient and can be applied to real cochlea models. The training set-up can be completely generalized for unseen scenarios.
5) A joint 3D operator for navigation in 3D set-ups.

III. METHODS
In this section, the datasets used in this study are described, followed by the deep mapping framework for extraction of the navigation indicators and the CNN architecture. The definition of the input data and desired outputs provides a better understanding of the methods. Finally, the joint 3D navigation operator is discussed.

A. DATA
To learn the navigation indicators (or parameters) in cochlea tracing, two types of dataset are utilized. The first dataset is composed of synthetic MATLAB-generated images for training purposes. The second dataset contains anatomical cochlea models. Both are used to quantitatively analyze the navigation performance of the proposed 2.5D CNNs.

1) SYNTHETIC IMAGES
For the sagittal and bird views of the cochlea structure, a parametric segment model of the cochlea is proposed to accommodate all the navigation features for training the 2.5D CNNs. The model shown in Fig. 3(a) has a deformation capability to emulate all the variations along the cochlea, such as bend, rotation, and length-width variation. For example, in the bird-view mode (i.e., looking at the cochlea from the top), the bend intensity changes constantly along the cochlea. The bend in each cochlea segment (either bird-view or sagittal) is composed of two crucial parameters: the arc intensity and the turning effect, which are evident when there are sharp turns. Both effects are shown in Fig. 3(a)-(b). A closer look at Fig. 3(b) shows that the inner arc between nodes A and B is smaller than the outer arc between nodes C and D, which defines the turning points along a cochlea. The length and the width vary radically along the cochlea path (e.g., the mean widths at the basal, middle and apical turns are 2.3 mm, 1.2 mm and 0.8 mm, respectively [13]). The length and the width are, therefore, generated for various sizes to cover all the variations along the cochlea. In Fig. 3(d), the width of the cochlea segment is tuned by stacking a number of length-adjusted arcs. Orientation information is important for cochlear tracing. In the proposed parametric model, a rotation-invariant representation of cochlea segments is required. To make the model more robust to orientation variations, the generated images are also rotated about the z-axis over [0°, 360°] to emulate the bird-view of the cochlea and about the y-axis over [0°, 90°] to generate the sagittal tracing segments. The rotation step size is 5°. Overall, the most practical point in data generation is to design edges that correlate highly with the cochlea projections onto the sagittal and bird-view planes. Generating the right edges greatly helps to identify the navigation inferences, through the generalization capability and the noise-artefact robustness of the 2.5D CNN.
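The rotation sweep described above can be illustrated with a short sketch. It is not the MATLAB generator used in this work: the circular-arc wall model, the radii, the 60° arc sweep and all function names are hypothetical, and only the 5° step and the two angular ranges are taken from the text. Both range endpoints are included here, which is an assumption.

```python
import math


def arc_points(radius, sweep_deg, n=32):
    """Sample n points on a circular arc with the given radius and angular sweep."""
    return [
        (radius * math.cos(math.radians(t)), radius * math.sin(math.radians(t)))
        for t in (sweep_deg * i / (n - 1) for i in range(n))
    ]


def rotate(points, angle_deg):
    """Rotate 2D points about the origin by angle_deg (in-plane rotation)."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def rotation_sweep(max_deg, step_deg=5):
    """Angles from 0 to max_deg inclusive, using the 5-degree step quoted in the text."""
    return list(range(0, max_deg + 1, step_deg))


inner = arc_points(radius=1.0, sweep_deg=60)  # inner wall (nodes A to B), hypothetical size
outer = arc_points(radius=1.5, sweep_deg=60)  # outer wall (nodes C to D), hypothetical size
bird_angles = rotation_sweep(360)             # bird-view sweep
sagittal_angles = rotation_sweep(90)          # sagittal sweep
bird_segments = [(rotate(inner, a), rotate(outer, a)) for a in bird_angles]
```

With the same angular sweep, the outer arc is longer than the inner arc, which mirrors the turning effect described for Fig. 3(b).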

2) COCHLEA MODELS
The 2.5D CNN and tracing algorithms were examined with a set of three synthetic cochlea models (Synth model1 ... Synth model3). The purpose of utilizing synthetic data is to provide an analysis of the algorithms under controlled conditions that mimic the cochlea structure. The averaged model used for the synthetic cochlea models was generated in MATLAB R2022b from a parametric equation in which s ranges from 6.5 to 21.25, to resemble the anatomical human cochlea with a mean length of 41.5 mm and diameter of 2 mm for parametric sweeping purposes [38]. The synthetic 3D cochlea models were constructed within a 10 mm × 10 mm × 10 mm volume comprising the cochlea model and the pad arrays to obtain consistent (x, y, z) dimensions for evaluation of tracing performance. In a similar manner, three anatomical cochlea models were constructed from micro-CT images (Anatom model1 ... Anatom model3). These evaluate the centerline tracing algorithm against a "gold standard," a centerline hand-traced by surgeons in realistic reconstructed cochlea models. The realistic cochlea models were derived from micro-CT images of 512 × 512 pixels per slice. A manually defined ground truth was used to quantify traversal performance. The micro-CT data were imported into Simpleware ScanIP v2016.09 (Synopsys, Mountain View, USA) for image processing and data segmentation by defining regions in the image data that belong to the same anatomical layers [6]. Smoothing filters utilizing recursive Gaussian, median, and mean filters were used to adjust the grayscale range. Manual segmentation was performed in the ScanIP software by editing the morphology or filling cavities (i.e., using the dilate, erode, open and close functions). To obtain appropriate boundaries and remove any overlapping sections between the tissue layers, Boolean operations were applied [6]. The volume conductor of the cochlea and the layers in its vicinity were generated based on a high-resolution (2.24 µm × 2.24 µm × 5 µm voxel size) micro-CT image stack of a human cochlea. Due to limited computation memory, the effective operative field of the scans was rescaled to include only the cochlea and its immediate surroundings and was subsequently downsampled to an isotropic resolution of 9.6 µm, giving a volume of 930 × 930 × 1014 voxels.
The constructed synthetic and anatomical models represent height (h) and width (w) variations (h < w) in human cochlea anatomy. For example, the h/w ratios of Anatom model1, Anatom model2 and Anatom model3 are 45/62, 35/55 and 50/67, respectively. It should be noted that the reported ratios are the initial height over width, as shown in Fig. 4, and decrease along the cochlea.

B. DEEP MAPPING FROM AN IMAGE TO INDICATORS
A framework is laid out to map the generated images to a set of typical navigation indicators shown in Fig. 5. Three types of indicator are proposed to represent optimal path navigation: the distance to the adjacent walls, the distance to the frontal wall (i.e., collision probability) and the heading angle. The electrode array insertion is concerned with the two adjacent anatomical walls for following the centerline when the tip of the array is pushed inside the tubular structure. This is shown in Fig. 5(a) by identifying the distance of the electrode to the left and right walls, indicated by (Dist-LW) and (Dist-RW), respectively. Collision probability (Collision) is the next indicator; it shows the maximum allowable navigation jump that avoids crossing the anatomical walls along insertion iterations. This is a crucial indicator, as it accurately shows the stopping points, specifically in the tubular turns, before mapping the next image frame, as shown in Fig. 5(b). The navigation angles (θ, ϕ) are the next indicators; they direct the optimal rotation of the electrodes along the sagittal (θ) and bird-view (ϕ) projection planes. In total, four affordance indicators that interpret the navigation scene are extracted from each image frame using the 2.5D CNN for each view. With the 2.5D view processing, a 3D safe insertion zone can be defined using all generated height and width variations of the cochlea in sagittal and bird-view projections around the predicted centerline coordinates as a hypothetical insertion cylinder [e.g., 50% of Dist-RW, as shown in Fig. 5(d)].
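To make the four per-view indicators concrete, the sketch below groups them in a small record and derives the safe-zone margins from the wall distances. The class and function names, the numeric values and the application of the same 50% fraction to the left wall are assumptions made for illustration; only the 50% of Dist-RW example comes from the text.

```python
from dataclasses import dataclass


@dataclass
class ViewIndicators:
    dist_lw: float    # distance to the left wall (Dist-LW)
    dist_rw: float    # distance to the right wall (Dist-RW)
    collision: float  # distance to the frontal wall (Collision indicator)
    angle_deg: float  # heading angle in this view


def safe_zone(ind, fraction=0.5):
    """Return (Opt-LW, Opt-RW): the margin kept on each side of the centerline.

    The 0.5 default follows the 50% example in the text; using it for both
    walls is an assumption.
    """
    return fraction * ind.dist_lw, fraction * ind.dist_rw


# Hypothetical indicator values for one view, in arbitrary units.
view = ViewIndicators(dist_lw=20.0, dist_rw=24.0, collision=39.0, angle_deg=-5.0)
opt_lw, opt_rw = safe_zone(view)
```

Evaluating the same record for the sagittal and bird views gives the height and width margins that bound the hypothetical insertion cylinder.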

C. ARCHITECTURE OF THE 2.5D CNN
The 3D points are projected onto two views (i.e., the 2.5D view). For each view, a convolutional network is constructed with the same network architecture, architectural parameters and outputs. Based on multi-task learning [39], a ResNet [40] architecture followed by separate outputs, shown in Fig. 6, is proposed. Since residual architectures are known to help generalization in both shallow and deep networks, one is adopted here to increase model performance. The architecture of the 2.5D CNN is highly compact; the input layer has a size of 64×64 to accept the sagittal or top views. The output of each 2D convolutional layer is activated by a rectified linear unit (ReLU) with its parameter equal to 0.1, which allows for a small, non-zero gradient when the unit is saturated and inactive.
Since most parameters in the proposed network lie in the first fully connected layer, a convolutional layer and a max-pooling layer are added to improve the discrimination of the learned features and to reduce the number of parameters. Dotted lines in Fig. 6 represent skip connections, defined as 1×1 convolutional shortcuts that allow the input and output of the residual blocks to be added. After the last ReLU layer, the architecture splits into two different fully connected layers. The main branch consists of a fully connected layer and a softmax output layer to classify the collision probability (Collision) and the distance of the electrode to the left (Dist-LW) and right (Dist-RW) walls (see Section III-B). For the auxiliary branch, neurons are split to form a regression network for estimation of the tracing angles along the sagittal or bird-view planes (θ, ϕ). Mean-squared error (MSE) and cross-entropy (CEN) losses are utilized to regress the tracing angles and classify the affordance indicators, respectively. L_Tot, L_MSE and L_CEN represent the total loss of the model, the loss of tracing angle prediction and the loss of the other indicators, and α and β are the loss weights. The network was designed with a compact architecture, but the joint optimization might pose a convergence problem. Specifically, imposing no weighting between the two losses during training results in convergence to a very poor solution. This is because the MSE gradient norms are proportional to the absolute tracing angle and initially have much higher values. Therefore, α is set to 0.1, and more weight is assigned to L_MSE in later stages of training (i.e., 0.2-0.3). Without adjusting the weighting between the two losses, training would likely result in sub-optimal performance or require much longer optimization times. The Adam optimizer [41] is used with a starting learning rate of 0.001 and an exponential per-step decay equal to 10^-5.
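The expression for the total loss is not reproduced in the text above. One plausible reading, stated here purely as an assumption, is a weighted sum L_Tot = α L_MSE + β L_CEN, with α raised from 0.1 to a value in the 0.2-0.3 range later in training. The sketch below illustrates that reading; the step-shaped schedule, the switch epoch, the late value of 0.25 and β = 1 are hypothetical choices.

```python
import math


def mse(pred, target):
    """Mean-squared error between predicted and true tracing angles."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs, label):
    """Categorical cross-entropy for one sample; probs is a softmax output."""
    return -math.log(probs[label])


def alpha_schedule(epoch, switch_epoch=10, early=0.1, late=0.25):
    """Assumed step schedule: 0.1 early on, a value in the 0.2-0.3 range later."""
    return early if epoch < switch_epoch else late


def total_loss(angle_pred, angle_true, probs, label, epoch, beta=1.0):
    """Assumed weighted sum: alpha * L_MSE + beta * L_CEN."""
    alpha = alpha_schedule(epoch)
    return alpha * mse(angle_pred, angle_true) + beta * cross_entropy(probs, label)


early_loss = total_loss([0.5], [0.0], [0.5, 0.5], 0, epoch=0)
late_loss = total_loss([0.5], [0.0], [0.5, 0.5], 0, epoch=20)
```

Keeping α small at the start limits the influence of the initially large MSE gradients, which is the convergence concern raised in the text.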

D. JOINT 3D NAVIGATION OPERATOR
A joint 3D tracing operator is proposed to flexibly position the electrode array through the complex 3D tubular structure. As illustrated in Fig. 7, the 3D navigation operator is composed of three elements: 1) the bird-view axis (Y) to monitor the width variations in a tube, 2) the sagittal axis (Z) to identify the height of a tube, and 3) a navigation vector $\overrightarrow{nav}_k$. The bird-view (Y) and sagittal (Z) axes are jointly connected at node O, shown in Fig. 7, and form a unified structure that is rotated based on the angles assigned to the unit vector $\overrightarrow{nav}_k$. 3D space directions are indicated by two angles, θ and ϕ, around a unit vector $\overrightarrow{nav}_k = [\theta, \phi]$ in Fig. 7, where θ describes the bird-view rotations around the Z axis and ϕ describes the sagittal rotations around the Y axis after being rotated by θ° around the Y axis. The length of the navigation vector $\overrightarrow{nav}_k$ also defines the maximum allowable length by which the electrode array can be pushed inside the tubular structure (the cochlea in this case) in each iteration, as shown in Fig. 7. The navigation vector $\overrightarrow{nav}_k$ can be shifted along the identified distances [(Dist-LW) and (Dist-RW)] to the cochlea walls from the origin (O) in both sagittal and bird-view projections. All the defined parameters in the joint 3D tracing operator introduce considerable flexibility in different scenarios, with highly precise tuning of the navigation of the electrodes.
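A minimal sketch of how the two view angles and a step length could be combined into one 3D displacement is given below. The angle convention is an assumption: θ is treated as an azimuth about the Z axis and ϕ as an elevation out of the X-Y plane, which may differ in sign or axis order from the convention of Fig. 7. The function names are hypothetical.

```python
import math


def nav_vector(theta_deg, phi_deg, length):
    """3D step of the given length; theta is azimuth about Z, phi is elevation (assumed)."""
    theta, phi = math.radians(theta_deg), math.radians(phi_deg)
    return (
        length * math.cos(phi) * math.cos(theta),
        length * math.cos(phi) * math.sin(theta),
        length * math.sin(phi),
    )


def advance(origin, theta_deg, phi_deg, length):
    """Move the operator origin O along the navigation vector."""
    step = nav_vector(theta_deg, phi_deg, length)
    return tuple(o + s for o, s in zip(origin, step))


# Hypothetical step: a 90-degree in-plane turn with no elevation, length 2.
tip = advance((0.0, 0.0, 0.0), theta_deg=90.0, phi_deg=0.0, length=2.0)
```

In this reading the step length would be bounded by the collision indicator, so that a single insertion iteration never reaches the frontal wall.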

IV. EXPERIMENTAL SETUP AND RESULTS
This section presents and discusses the results, using the metric-based experimental setup, cochlear implant electrode insertion in noisy scenarios and the navigation indicators prediction in a real cochlea model.

A. REGRESSION AND CLASSIFICATION RESULTS
In this section, the quantitative and qualitative results of the 2.5D CNN are discussed. The 2.5D CNN includes a regression network for estimation of the tracing angles along the sagittal or bird-view planes (θ, ϕ). To quantify the regression performance, two metrics are used: root-mean-squared error (RMSE) and explained variance ratio (EVA). RMSE measures the average magnitude of the prediction error, indicating how close the observed values $\alpha$ are to those estimated by the network, $\hat{\alpha}$, over $N$ samples:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\alpha}_i - \alpha_i\right)^2}$$
The EVA measures the proportion of variation in the predicted values with respect to that of the observed values. Such variations are given by the variance of the residuals, $\mathrm{Var}(\hat{\alpha} - \alpha)$, and the variance of the observed values, $\mathrm{Var}(\alpha)$:
$$\mathrm{EVA} = 1 - \frac{\mathrm{Var}(\hat{\alpha} - \alpha)}{\mathrm{Var}(\alpha)}$$
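The two regression metrics can be sketched in a few lines, assuming the standard definitions: RMSE as the square root of the mean squared difference between predicted and observed angles, and EVA as one minus the ratio of the residual variance to the variance of the observed values. The function names and the sample angles are illustrative only.

```python
import math


def rmse(observed, predicted):
    """Root-mean-squared error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / n)


def variance(values):
    """Population variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def eva(observed, predicted):
    """Explained variance ratio: 1 - Var(residuals) / Var(observed)."""
    residuals = [p - o for o, p in zip(observed, predicted)]
    return 1.0 - variance(residuals) / variance(observed)


# Hypothetical heading angles in degrees.
observed = [0.0, 10.0, 20.0, 30.0]
predicted = [1.0, 9.0, 21.0, 29.0]
error = rmse(observed, predicted)
score = eva(observed, predicted)
baseline = eva(observed, [15.0] * 4)  # always predicting the mean
```

For these values the RMSE is 1° and the EVA is close to 1, whereas always predicting the mean angle leaves the residual variance equal to the total variance and gives an EVA of 0.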
If predicted values approximate the observed values well, the residual variance will be less than the total variance, resulting in EVA ⪅ 1. Otherwise, the residual variance will be equal to or greater than the total variance, producing EVA = 0 or EVA < 0, respectively. To assess the performance on collision prediction (Collision) and on the distance of the electrode to the left (Dist-LW) and right (Dist-RW) walls, average classification accuracy and F-1 score are used. It should be noted that training of the 2.5D CNN used the combination of the synthetic data generated by the parametric model explained in Section III-A1 and the projections of the synthetic cochlea models (Synth model1 ... Synth model3) in Section III-A2. Using these two sources, the sagittal and bird-view networks were trained on over 1 million 2D cochlear segments with different widths, lengths, inner and outer arcs and rotation directions. The trained networks have high generalization capability to data variation and can perform electrode navigation for unseen cochlea cases from different patients. The generated data were divided into a training set containing 70% of the data to optimize the parameters and the hyperparameters, and a testing set consisting of the remaining 30% to evaluate the 2.5D CNN performance on unseen data. The whole network was then examined on the anatomical models (Anatom model1 ... Anatom model3) with manually defined ground truth to quantify traversal performance. The tracing process begins by defining a sampling cube around the seed point in the scala tympani. Once a segment of the 3D cochlea has been sampled, it is projected onto the bird and sagittal views and sent to the 2.5D CNN. The sampling cube is rotated and adjusted based on the latest tracing information [θ, ϕ] for sampling the next cochlea segment. This process continues, under user control, through the sampling iterations along the cochlea to the last segment.
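The iterative tracing procedure can be outlined as a simple loop. In the sketch below, `predict` is a hypothetical stand-in for the whole per-step pipeline (cube sampling, projection onto both views and the two CNN forward passes), the angle convention (azimuth θ, elevation ϕ) is assumed, and `toy_predict` exists only to make the example runnable.

```python
import math


def trace(seed, predict, max_steps=100):
    """Advance from the seed point one segment at a time.

    predict(position, theta, phi) returns (theta, phi, step) for the next
    segment, or None when tracing should stop (last segment or user stop).
    """
    path = [seed]
    theta, phi = 0.0, 0.0
    for _ in range(max_steps):
        result = predict(path[-1], theta, phi)
        if result is None:
            break
        theta, phi, step = result
        t, p = math.radians(theta), math.radians(phi)
        x, y, z = path[-1]
        # Move along the direction given by the latest angles (assumed convention).
        path.append((
            x + step * math.cos(p) * math.cos(t),
            y + step * math.cos(p) * math.sin(t),
            z + step * math.sin(p),
        ))
    return path


def toy_predict(position, theta, phi):
    """Toy predictor: turn 10 degrees per step in-plane, stop after a quarter turn."""
    return None if theta >= 90.0 else (theta + 10.0, 0.0, 1.0)


path = trace((0.0, 0.0, 0.0), toy_predict)
```

Carrying θ and ϕ from one iteration to the next corresponds to rotating the sampling cube according to the latest tracing information before the next segment is sampled.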
Table 1 compares the average performance of the 2.5D CNN using cochlea models (Anatom model1 ... Anatom model3) with other architectures from the literature [40], [42], [43]. From these results, it is observed that the 2.5D CNN, even though 70 times smaller than the best architecture (ResNet-50), maintains considerable prediction performance while achieving real-time operation. Furthermore, the comparison against the VGG-16 architecture indicates the advantages in terms of generalization due to the residual learning scheme and parametric data generation model, as discussed in Section III-C and Section III-A1, respectively. The design succeeds at finding a good trade-off between tracing performance and the number of parameters detailed in the CNN architecture, as shown in Table 1. To enable the placement of an electrode array to react promptly to situational changes, it is necessary to reduce the network's latency as much as possible.

B. DEEP MAPPING OF NAVIGATION INDICATORS IN NOISY SCENARIOS
Cochlea navigation is a difficult task, primarily because of the noise and variability associated with real-world scenes. Computer vision has displayed promising performance and flexibility when dealing with high degrees of noise and variability. This is because, unlike most iterative methods, in which the search for the true direction is based on a local estimate of the orientation and history information, the proposed and other CNN methods consider the whole feature map and the outline of the images (i.e., the borders). Typically, the added noise corrupts the process of mapping cochlea images to the navigation indicators, including the distance to the adjacent left and right walls, collision probability and heading angle, and results in either minor or major deviations from the ground truth. The results in Fig. 8 [42], ResNet-50 [41] and VGG-16 [43].
information of the projected cochlea segments. This can be seen as a stream of images with localized amplitude (or intensity) variations, which makes border recognition extremely difficult. For σ_N > 0.4, the RMSE of all algorithms increases at a higher rate. Fig. 8(a) also shows that the ResNet-50 always has higher noise robustness for 0.05 < σ_N < 0.55.
Contiguous tracing, the ratio of trials in which the centerline is traced successfully to all trials, is calculated and shown in Fig. 8(b) for 0.05 < σ_N < 0.55. The contiguous ratio analysis considers the randomness of the 2D noise distribution. The graphs are computed from a total of 30 trials for the cochlea models (Anatom model1 ... Anatom model3). For the 2.5D CNN, the tested cochlea models are traversed contiguously because the designed ResNet architecture helps with generalization of the border recognition in the image segments. In Fig. 8(b), the ratio of centerline coordinates successfully traced by ResNet-50 is higher than that of the 2.5D CNN, but ResNet-50 has about 3× longer execution time.
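The noise experiment and the contiguous-tracing ratio can be illustrated as follows. The text does not state the noise model, so zero-mean Gaussian noise with standard deviation σ_N on images scaled to [0, 1] is assumed here; the function names, the tiny test image and the trial outcomes are hypothetical.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to a 2D image with values in [0, 1], then clip."""
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def contiguous_ratio(outcomes):
    """Fraction of trials in which the centerline was traced without a break."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[0.0, 1.0], [1.0, 0.0]]
noisy = add_gaussian_noise(image, sigma=0.4, rng=rng)

# Hypothetical outcomes of 30 trials at one noise level.
ratio = contiguous_ratio([True] * 27 + [False] * 3)
```

Repeating the trials over a range of σ_N values and plotting the ratio would give a curve of the kind described for Fig. 8(b).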

C. JOINT-VIEW PROJECTION AND NAVIGATION: STEP-BY-STEP STUDY
Fig. 9 shows the qualitative results of progressive projection and tracing in Anatom model3, its corresponding 3D operators and the identified indicators. An oriented sampling cube (OSC), which is a tight fit around a 3D point in local space, is generated at four different locations of Anatom model3 to show the performance of the 2.5D CNN. These locations capture almost all the geometrical difficulties along the navigation path (i.e., width and height variations, rotations along the Z axis, etc.). Fig. 9(a) is the start of the navigation and shows the location of the OSC around the scala tympani seed point; the seed point x-y-z coordinates are set to the center of the OSC. 3D sampled points obtained from the input depth image are projected onto the x-y and y-z planes of the coordinate system, respectively. Notice that the projections on the three orthogonal planes may be coarse because of the resolution of the depth map [44]
which shows the length of $\overrightarrow{nav}_k$ is also set to 39, the minimum of collision in both views θ and ϕ are also set to −5° and 115°. This process is then repeated for four different locations by moving the OSC along the cochlea, as shown in Fig. 9(a)-(d). 3D depth sampling is obtained by rotating the OSC to the identified θ and ϕ of the previous step (i.e., the θ and ϕ history). The height, width and depth of the OSC are also defined according to the information derived in the previous step [e.g., (Dist-LW), (Dist-RW) and the minimum of collision probability in both views θ and ϕ]. This is an automated and reliable depth sampling that converts the whole cochlea into smaller segments. The size-adjusted OSC rotates along the cochlea; the sagittal and bird views also rotate accordingly to capture the projections. The identified coordinates and the 3D operator present the optimal navigation tool for surgical purposes.

V. CONCLUSION
In this paper, a 2.5D CNN is proposed to map the projected 2D cochlea images into accurate navigation indicators, including the distance to the adjacent left and right walls, collision probability and heading angle. A novel network architecture was designed (i.e., converting a 3D network into two complementary networks) to trade off performance for processing time and enable online operation. Each network consists of 5 dense convolutional layers with {(12×12) ... (96×96)} kernels and ReLU activations, followed by just one average pooling layer, with size equal to the size of the final feature maps, and three dense layers. The training was performed by minimizing the categorical cross entropy with the Adam optimizer. Tracing of the cochlea is a laborious and dangerous task, as error margins are extremely small. The proposed method learns to react promptly to the radical directional changes, geometrical variations and overall rotations along the cochlea. It was shown through extensive evaluations of processing time, navigation accuracy and noise robustness that the proposed approach performs well with both synthetic MATLAB-generated images and anatomical cochlea models constructed from micro-CT images. The results confirm reliable navigation with an average of >98% mapping accuracy. The processing time of the navigation platform, which consists of 3D segment sampling, 2.5D projections, navigation indicator extraction and eventually the remapping to 3D navigators, is around a millisecond per insertion step. Where there are local noise and artefacts, the feature map activations clearly recognize the edges of the images generated by the parametric model. Future work will focus on integrating the proposed navigation method into a robotic arm with a real-time imaging module to implement a precise computer-aided system for virtual cochlear surgery.

FIGURE 2. Overview of the proposed joint-view navigation framework. The sagittal and bird-views are generated by projecting the 3D points onto two orthogonal planes (i.e., the X-Y and X-Z planes). Two CNNs are trained in parallel to map each view's projected image to its corresponding navigation indicators, which are then fused together to estimate the 3D joint operator.
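The projection step in Fig. 2 can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' pipeline: the function name `project_views`, the coordinate range `[-extent, extent]`, the binary rasterization and the pairing of the X-Y plane with the sagittal view are all assumptions. Only the two orthogonal planes and the 64 × 64 frame size come from the paper.

```python
def project_views(points, size=64, extent=1.0):
    """Rasterize 3D points onto the X-Y and X-Z planes.

    Returns two size x size binary images as nested lists, indexed
    [row][col], where col follows x and row follows y (or z).
    Points outside [-extent, extent] on any axis are skipped.
    """
    xy = [[0] * size for _ in range(size)]
    xz = [[0] * size for _ in range(size)]
    scale = (size - 1) / (2.0 * extent)

    def to_pixel(value):
        # Map [-extent, extent] linearly onto pixel indices 0 .. size - 1.
        return int(round((value + extent) * scale))

    for x, y, z in points:
        if not all(-extent <= v <= extent for v in (x, y, z)):
            continue
        col = to_pixel(x)
        xy[to_pixel(y)][col] = 1  # assumed sagittal view: drop the z coordinate
        xz[to_pixel(z)][col] = 1  # assumed bird-view: drop the y coordinate
    return xy, xz


# Two corner points of the sampling cube plus one point outside it.
sagittal_img, bird_img = project_views(
    [(-1.0, -1.0, -1.0), (1.0, -1.0, 1.0), (2.0, 0.0, 0.0)]
)
```

Each resulting image would then be passed to the CNN of its view. Any filtering of the projections is left out of this sketch.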

FIGURE 3. Synthetic data generation. (a) Illustration of the parametric cochlea segment model. There are two arcs, defined between the A and B nodes and between the C and D nodes. Their width and length are tunable in the proposed model. (b) shows the arc intensity change. In (c), the length of the arcs is tuned to smaller values. (d) shows the case where the width is tuned by adjusting the arc length between C and D. Cochlea segment rotation is an important factor in implant navigation, and this capability is shown in (e). (f) Combining (b), (d) and (e).

FIGURE 4. The initial height (h) and width (w) (h < w) of the anatomical models (Anatom model 1 ... Anatom model 3) around the scala tympani seed point. The models are designed to have $(h/w)_{model\,1} < (h/w)_{model\,2} < (h/w)_{model\,3}$. The models consider geometrical variations along the navigation paths.
) by identifying the distance of the electrode to the left and right walls, indicated by (Dist − LW) and
120596 VOLUME 11, 2023

FIGURE 5. Illustration of navigation indicators. (a) Electrode distance to the left and right walls, Dist − LW and Dist − RW. (b) Collision probability (Collision), which shows the distance to the front wall. (c) The navigation angle and (d) safe insertion zone for optimal navigation to the left and right walls, Opt − LW and Opt − RW.

FIGURE 6. (a) The 2.5D CNN is a joint deep mapping network from a single 64 × 64 frame to the navigation indicators, including Dist − LW, Dist − RW, collision probability (Collision) and the tracing angles along the sagittal and bird-views (θ, ϕ). The main architecture of the CNN consists of a ResNet with 4 residual blocks, (b) followed by dropout and ReLU non-linearity. Afterwards, the network branches into 4 separate fully connected and regression layers. The design notation, including the convolution kernel size, the number of filters and the residual connections, is shown in the figure.

FIGURE 7. Illustration of the joint 3D navigation operator. The operator $\overrightarrow{nav}_k = [\theta, \phi]$ is formed by identifying the navigation indicators from the sagittal and the bird-view projections. In this example, the navigation operator is shifted by θ° to the left and ϕ° upward. The length of the navigator is defined by the minimum of the collision probability of the sagittal and the bird-view projections. The distances to the walls in both projections also give margins for shifting $\overrightarrow{nav}_k = [\theta, \phi]$ left-right and up-down, as indicated by the green dotted arrows, according to the optimal safe zone.

TABLE 1. Average quantitative results on cochlea models (Anatom model 1 ... Anatom model 3): EVA and RMSE are computed on Dist − LW, Dist − RW and the tracing angles along the sagittal or top views (θ, ϕ), while Avg. accuracy and F-1 score are evaluated on the collision prediction task. Despite being relatively lightweight in terms of number of parameters, the 2.5D CNN maintains a very good performance on both tasks.
(a) show RMSE < 0.1 for a noise standard deviation (σ_N) of 0 < σ_N < 0.15. 2D Gaussian noise was embedded in the generated images, which were then used for deep extraction of the navigation indicators in noisy situations. Fig. 8(a) shows that for σ_N < 0.22, the average RMSE of (Dist − LW) (or (Dist − RW)) in the cochlea models (Anatom model 1 ... Anatom model 3) is below 0.18. Increased noise causes more variations on the border

FIGURE 8. (a) The calculated RMSE in mapping Dist − LW to the ground truth as a function of noise, compiled for the anatomical models (Anatom model 1 ... Anatom model 3). (b) Ratio of successful iterations completed by the 2.5D CNN as a function of noise, compared with AlexNet [42], ResNet-50 [41] and VGG-16 [43].
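The noise-robustness measurement summarized in Fig. 8(a) can be sketched in Python as follows. This is a minimal sketch, not the authors' evaluation code: the function names, the assumption that pixel intensities lie in [0, 1] with clipping after noise is added, and the `predict` callable (a stand-in for a trained network returning one indicator per image) are all hypothetical.

```python
import math
import random


def add_gaussian_noise(image, sigma, rng):
    """Return a noisy copy of `image` (nested lists, intensities ASSUMED to
    lie in [0, 1]) with zero-mean Gaussian noise added and clipped."""
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
        for row in image
    ]


def rmse(predictions, truths):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truths)]
    return math.sqrt(sum(squared) / len(squared))


def noise_sweep(images, truths, predict, sigmas, seed=0):
    """Map each noise level sigma to the RMSE of `predict` on noisy images.

    `predict` is a hypothetical callable standing in for the trained network.
    """
    rng = random.Random(seed)
    results = {}
    for sigma in sigmas:
        preds = [predict(add_gaussian_noise(img, sigma, rng)) for img in images]
        results[sigma] = rmse(preds, truths)
    return results


# Toy usage: the "predictor" is just the mean intensity of a 2 x 2 image.
toy_images = [[[0.0, 1.0], [1.0, 0.0]]]
toy_truths = [0.5]
mean_intensity = lambda img: sum(map(sum, img)) / 4.0
curve = noise_sweep(toy_images, toy_truths, mean_intensity, [0.0, 0.15, 0.22])
```

Plotting `curve` against the noise levels would give a curve of the same kind as Fig. 8(a). The actual values depend entirely on the predictor and the data.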

FIGURE 9. Automated tracing along a 3D cochlea using Anatom model 3. (a), (b), (c) and (d) represent the shifted OSC, shown in cyan, along Anatom model 3. The OSC superimposed along the cochlea samples different geometrical complexities at different turns. In (a), the OSC is placed around the scala tympani seed point, and the sampled 3D cochlea segment is projected onto the orthogonal sagittal and bird-view planes. The navigation indicators for both views are derived in two different columns below the projections. For example, the sagittal view of the 3D tracing algorithm starts from the seed point with θ = −5°, Collision = 45, Dist − LW = 30 and Dist − RW = 25. Rotated joint 3D operators are also superimposed in each OSC for different scenarios. The derived navigation indicators Top, Bottom, Left and Right are shown in (a), (b), (c) and (d).
, which can be improved by applying a median filter and an opening operation to the projected images. The CNNs designed for each view then process and map the input projections into the navigation indicators. For the identified indicators, including (Dist − LW), (Dist − RW) and (Collision) in each view, the distances from the left, right and frontal walls are normalized between 0 and 64 (6 neurons to quantize 64 steps, with the nearest points set to 0 and the farthest points set to 64). The navigation angles (θ and ϕ) are also indicated by two numbers. By fusing the computed navigation indicators from both the sagittal and bird-view projections, a 3D joint operator is finally formed, as shown in Fig. 9(a)-(d). The superimposed 3D navigators in each figure consist of blue and black arrows to quantify the height and width of the sampled cochlea, respectively. The red arrow shows the optimal navigation path. For example, for the OSC sample around the scala tympani seed point, the (Dist − LW) and (Dist − RW) of both views are (Dist − LW/RW)_Sagittal = 30/25 and (Dist − LW/RW)_Bird-view = 29/24. (Collision)