Neural Network Kalman filtering for 3D object tracking from linear array ultrasound data

Many interventional surgical procedures rely on medical imaging to visualise and track instruments. Such imaging methods not only need to be real-time capable, but also provide accurate and robust positional information. In ultrasound applications, typically only two-dimensional data from a linear array are available, and as such obtaining accurate positional estimation in three dimensions is non-trivial. In this work, we first train a neural network, using realistic synthetic training data, to estimate the out-of-plane offset of an object with the associated axial aberration in the reconstructed ultrasound image. The obtained estimate is then combined with a Kalman filtering approach that utilises positioning estimates obtained in previous time-frames to improve localisation robustness and reduce the impact of measurement noise. The accuracy of the proposed method is evaluated using simulations, and its practical applicability is demonstrated on experimental data obtained using a novel optical ultrasound imaging setup. Accurate and robust positional information is provided in real-time. Axial and lateral coordinates for out-of-plane objects are estimated with a mean error of 0.1mm for simulated data and a mean error of 0.2mm for experimental data. Three-dimensional localisation is most accurate for elevational distances larger than 1mm, with a maximum distance of 6mm considered for a 25mm aperture.


I. INTRODUCTION
Tracking and localisation of point-like objects is crucial for a large variety of medical applications in ultrasound (US) imaging, such as tracking of microbubbles for superresolution US imaging [1], [2] or US-guided placement of fiducial markers for radiotherapy [3]. Additionally, tracking of surgical tools (such as needles and catheters) is essential during minimally invasive procedures [4]- [6], as when placed inaccurately, these devices may cause trauma by damaging tissue or deliver ineffective treatment to the wrong location [6], [7]. As such, US is frequently used for guidance through imaging but accurate localisation in a three-dimensional target domain remains challenging. This is primarily caused by the nature of data acquisition using linear arrays, which assumes that all signals originate from within the image plane and thus only a two-dimensional B-mode image of the image plane is formed. We refer to this obtained 2D image as the US image and assume it consists of the reconstructed point, or point-like, source corresponding to the object we aim to track accurately. But, if this point source is located out-ofplane it will primarily show as aberration in the reconstructed image domain; additionally one may misinterpret features, such as a needle shaft as the tip [8]. Thus, the problem to provide an accurate positional estimate in 3D from only 2D US images is consequently a notoriously difficult task without any auxiliary information [9] and is a field of active research [10], [11]. Early approaches used speckle information to estimate out-of-plane displacements [12], [13]. Another possibility for instrumented US tracking of needles was proposed by Xia et al. [14], [15], who designed a custom-made imaging probe consisting of a central array for conventional imaging and two side arrays for 3D tracking [15]. Alternatively, one may approach the needle tracking problem in full 3D to obtain accurate positional estimates [16].
In this work, we propose an alternative, real-time capable, method of performing 3D tracking without the need for custom-made probes and using a single set of measurements per time-step from a linear array. In the following we assume a point source model for the tracked object. For the estimation of lateral and axial positions, we examine high intensity pixels in the reconstructed 2D US image, similar to [16], where tracking was performed in full 3D. For the estimation of the elevational direction, or out-of-plane distance of the point source to the image plane, we use a machine learning approach. In particular, significant markers are extracted from the measured time series and a neural network is trained using synthetic data modelled for a prototype optical ultrasound (OpUS) imaging setup [17] to predict out-of-plane distance and associated aberration in axial position in the reconstructed 2D US image. The markers used for the neural network are summary statistics extracted from the measurement data and correlated with the offset to establish a nonlinear mapping Fig. 1: Illustration of the measurement geometry. If the point source object is located out-of-plane (right side), then the object appears distorted in the reconstructed US image, i.e. too low by δ in the axial direction. This is why the axial position needs to be corrected. between the two.
Additionally, we assume regularity in the temporal evolution of the object location to improve robustness and reduce uncertainty in the estimation compared to independently analysing subsequent images. The regularity assumption mimics conditions encountered in clinical practice, where the objects, such as needle tips and microbubbles, are expected to follow a smooth trajectory without rapid jumps or jitter during insertions into soft tissue. This can be incorporated, while retaining computational efficiency, by a Kalman filter [18], [19] which is a flexible method for estimating the state (position, velocity, etc.) of a dynamic system. It has been classically utilised in engineering applications such as target tracking and navigation, but has been also extensively used in inverse problems and medical imaging [15], [19]- [23]. The underlying idea of Kalman filtering is to update the estimate of the state at time step k+1 each time new data become available as opposed to smoothing, where the whole trajectory from time step 1 to k are updated as well. It has the appealing property, as opposed to estimating the full posterior of all states, that the problem does not become intractable as the number of data points increases.
We note, that Kalman filtering has earlier been utilised for a needle tracking problem in 3D [16] as well as for microbubble tracking in 2D [24]. In [25], the authors track a wire tip using Kalman filter and perform an elevational position estimation from geometric markers. In this work we approach the problem in 3D with only a single set of measurements (per time point) from a linear array and combine it with a neural network in order to obtain reliable estimates on the elevation to correct the axial position in the 2D US imagex. We evaluate the proposed method by tracking a point-source for simulated OpUS data and object trajectories with changing elevation. Robustness is evaluated with respect to increased noise in the measurement data and accuracy compared to positional estimation using only the pixel with maximum intensity in the OpUS image. Finally, we evaluate the method on experimental OpUS measurements.

A. Image formation and object tracking
In the following we specifically consider a custom setup and simulation framework for a freehand optical ultrasound imaging system. That is, the US signal generation is modelled as pulse-echo imaging, where each source along the linear aperture emits a pressure wave, which reflects off the point scatterer and is detected by a single fibre-optic detector placed right next to the imaging aperture [17]. We note that the tracking framework here can be generalised to any linear US array, where every source element also acts as a detector. Moreover, we consider in the following the tracking problem in a 3D space, that is we aim to determine the 3D coordinate x of the point-source. Given the recorded radiofrequency (RF) time series p, reconstruction of the 2D in-plane US imagex is performed using a basic delay-and-sum algorithm (equivalent to dynamic focussing) [26].
When the location of the point-source is in-plane, then the reconstructed imagex can be directly used to estimate the location x reliably. On the other hand, if x is out-ofplane then the 2D reconstruction will lead to a distorted image, in the sense that the reconstructed axial position is located deeper in the target than the correct position [27]. The aberration occurs because the time of flight is larger from objects that are positioned out-of-plane due to the imaging geometry (see Fig. 1). Thus, this axial aberration needs to be detected and processed to simultaneously provide an estimate of the elevational distance between the point target and the image plane, and the corresponding coordinates projected onto the image plane.

B. Object tracking for in-plane objects
Most tracking applications primarily assume that the object of interest is in-plane and features extracted from the reconstructed imagex are good indicators of the actual position x. Thus, the majority of tracking algorithms are based on intensity values in the B-mode images for point marker tracking [28], [29] in combination with various image registration approaches [30]- [32]. More recent developments make use of deep learning techniques to estimate the coordinates of objects directly from the measured time series p [33]- [35]. Nevertheless, there is no clear gold-standard to perform object tracking, as the particular approach depends heavily on the application and practical need [9].
In this study we are concentrating on single object tracking and use the maximum intensity (MI) estimate for comparison and reference, since it provides highly efficient and accurate estimates under ideal assumptions, i.e., high signal-to-noise ratio without any elevation. Consequently, we will also design our tracking model in the following by using the pixels in the US image with highest intensity for the estimation of axial and lateral positions in the filtering process.

C. Kalman filtering
Kalman filtering, a class of Bayesian filtering, is especially effective in situations where the data stream is over time and one must update the state given the new data and the history of the system; as such it is ideally suited to robustly perform the object tracking considered in this study. Specifically, Kalman filtering [18] consists of closed-form update formulas for a linear Gaussian filtering problem, which will be discussed next. The estimation of axial and lateral coordinates is similar to the approaches suggested by [16], [24], we will then continue to extend our model to incorporate elevation and a correction of estimated axial coordinates.
1) Lateral and axial coordinates: Estimation of lateral and axial coordinates and corresponding velocities x k = (x lk x ak v lk v ak ) T at time step k is based on the locations of highest absolute intensity pixels in the image. We assume that these locations are spread around the location of the object. While the velocity of the object is not the main interest, it is introduced as an auxiliary variable to help in predicting the motion, as will be described below. We denote by y lk ∈ R n the n lateral and by y ak ∈ R n the n axial highest intensity locations and let y k = (y T lk y T ak ) T . We then build a model where the matrix associates given high intensity locations in y k with the actual coordinates of the object in x k . Furthermore, we assume that the noise in the observed locations is normally distributed with s 2 lk being the sample variance of y lk and s 2 ak the sample variance of y ak . This way uncertainty is naturally incorporated into the model as how spread out the high intensity locations are.
We model the motion of the object with a constant velocity model [36] x where c k ∼ N (0, diag(σ 2 l , σ 2 a )) is assumed to be a random acceleration component and σ 2 l and σ 2 a are lateral and axial process noise variances, respectively. The matrices A and G are defined as [19] where ∆t is the time between subsequent observations. The velocity and acceleration of the object at time step k−1 is used to give an accurate prediction of the position at time step k. Writing Eq. (4) explicitly for the positional variables only, we obtain This means that we assume the position at time step k to be close to the position at time step k − 1 plus the displacement given by the time between subsequent observations, velocity, and acceleration.
2) Out-of-plane offset and axial aberration: In case there is an offset between the imaging plane and the object, we observe aberration in the reconstructed axial coordinate due to the geometry of the imaging problem (see Fig. 1). In this case, location estimation based on only the high intensity pixels would result in a biased estimate of the axial coordinate. Instead, we use information in the measurement data p to estimate the offset and axial aberration and use the knowledge to correct the estimate of the axial coordinate. To do this, we train a neural network with training data obtained from an OpUS simulator. Details on the neural network will be provided in following Section II-D.1 and on the simulator in III-B. We note that instead of a neural network, other sufficiently expressive nonlinear prediction models could be used.
The filtering model is then extended to include the unfiltered out-of-plane offset and axial aberration, denoted as y ek and y δk , that are received as output from the neural network. We can then define y * k = (y T k y ek y δk ) T . Filtered out-ofplane offset and axial aberration and their velocities, denoted as x ek , δ k , v ek and v δk are included in the state vector We extract mean value µ from the highest amplitude entries in the measured time series p and use highest intensity pixels in the 2D US image. A previously trained (orange arrow) neural network estimator then uses the last state estimates of lateral x lk and axial x ak coordinates together with µ to estimate offset and axial aberration. The next state is then updated via Kalman filtering to provide robust positional estimation. where and σ 2 e and σ 2 δ are the process noise variances of the outof-plane offset and axial aberration components, respectively. Finally, we let 3) State estimation: As stated earlier, a major benefit arising from linearity and Gaussianity of the filtering models are the closed-form update formulas for mean m k and covariance P k of the state. At k = 0 we assume x * 0 ∼ N (m 0 , P 0 ), where m 0 = 0 and P 0 = 15I to serve as an uninformative prior. Note, that the mean m k corresponds to the estimated coordinates for x * k and P k is the corresponding covariance matrix. At every round, the prior predictions for mean and covariance are updated recursively as where Q * = G * diag(σ 2 l , σ 2 a , σ 2 e , σ 2 δ )G * T . We then evaluate the neural network as described in the following section to provide the estimates of offset and axial aberration in y * k , then the Kalman update can be performed by Estimates of all coordinates are then given by m k . Their variances can be found in the diagonal of P k and could be used for uncertainty quantification of the Kalman updates. The estimated axial aberration is then subtracted from the estimated axial coordinate to yield an estimate for the actual axial coordinate as

D. Application to object tracking
We apply the Kalman filtering method to the task of object tracking modelled from OpUS image reconstructions and measurement data. After generating suitable training data showing the decay in amplitude as the distance from imaging plane increases. The decay is also shown on the right for different axial depths. In general, the rate of decay decreases with increased depth. The lateral coordinate x l was set to 0 in these tests. and training the neural network, tracking can be performed as outlined in the following.
1) Neural network: training data and architecture: We use an OpUS simulator [37], [38] to generate training data for the neural network. First, we define a uniform 20 × 20 × 20 grid of coordinates. The grid has bounds ±12 mm in lateral, 0.5 -14.5 mm in axial and ±10 mm in elevational direction. Two additional grid points were placed in-plane (zero in elevational direction). In each grid point we simulate measurement data with coordinate values equal to the grid point.
To obtain a marker for the offset estimation, we note that ultrasound source elements typically emit near-omnidirectional pressure fields within the image plane, but are usually designed to emit highly directional fields in the out-of-plane direction. This is achieved through a combination of eccentric element geometries and acoustic lenses [27]. As a result, the amplitude of pulse-echo signals from point objects depend strongly on the elevational (out-of-plane) position, and generally reduces with increasing elevational offset. The shape of the decay also depends on the position of the object. In short, the out-ofplane amplitude decay decreases as the axial depth increases, as illustrated in Fig. 3. This is why both the lateral and axial coordinates are used as inputs for the neural network. Thus, the pulse-echo signal strength across the aperture can be used as an effective marker of the elevational position. To exploit this, we use the RF time series p to compute the mean absolute value µ of those time series samples belonging to either highest or lowest 1% of a Gaussian defined by the mean and variance of p. This way most of the purely noisy part of the data is ignored. We also compute the distance between the mean of apparent (reconstructed) axial coordinates of n = 15 highest intensity pixels and the real axial coordinate used to simulate the data. This distance reflects the axial aberration that needs to be corrected for.
A neural network Λ θ with parameters θ is then trained to map lateral and axial coordinates, and mean absolute value of high amplitude entries in measured time series, denoted as u = (x a x l µ) T , to a prediction of unfiltered out-of-plane offset and axial aberration w = (y e y δ ) T . Since the simulator output is almost symmetric with positive and negative offsets, we train the network with absolute offset values ≥ 0. This means that we can only estimate magnitude of the offset, not the direction. The network chosen is a standard multilayer perceptron [39] with two hidden layers and 20 nodes in each layer. Each hidden layer has a sigmoid activation function whereas for the output layer the activation function is linear. The network is trained by finding a set of parameters θ * such that mean squared error between neural network output and the ground truth is minimised, i.e., where M is the size and i is an index over the training data. We used the Levenberg-Marquardt algorithm [40], [41] to train the neural network. The dampening parameter was set to the default value of 10 −3 . The optimisation stopped when the validation performance did not improve in six epochs in a row or the relative norm of the gradient of the minimised function was smaller than 10 −7 .
2) Tracking: We track the point-source from a sequence of optical ultrasound image reconstructions. At k = 1, we set m 1 = 0 and P 1 = 15I. We find the coordinates of n = 15 highest intensity pixels and use the neural network to estimate the unfiltered out-of-plane offset and axial aberration. Since the neural network input contains the lateral and axial position of the point-source, we use the estimate from the previous time step. If the motion of the object is somewhat regular, this does not have a big impact on the estimation accuracy. The coordinate estimates are then updated with Kalman filter update formulas Eq. (16). An estimate of the true axial coordinate of the object is then obtained by subtracting the axial aberration estimate from the apparent axial coordinate obtained directly from the 2D US image. An illustration of the full tracking workflow is shown in Figure 2 and summarised as pseudo code in Algorithm 1. while new data acquired do 5: Update mean m pr k and covariance P pr k by Eq. (15) 6: Compute marker µ k and high intensity pixel locations y lk and y ak 7: Perform Kalman update with (16) 10: Perform axial aberration correction with (17) 11: Display image overlaid with coordinate estimates 12: k ← k + 1 13: end while 14: end function

A. Experimental setup
Experimental validation of the method was performed using a custom OpUS imaging system comprising a handheld imaging probe. We have chosen the OpUS imaging system for this study due to three main advantages: it offers direct access to the RF data, it was previously accurately characterised inhouse, and the system can be accurately and highly efficiently modelled numerically -thus making it an ideal fit for this study. This system, which was described in full in [17], uses scanning optics to couple excitation light sequentially into the proximal ends of 64 optical fibres arranged in a linear array. This light is delivered to an optically absorbing coating deposited at the distal ends, where it is converted into divergent ultrasound waves via the photoacoustic effect [42]. Thus, an OpUS source aperture is rapidly scanned to enable videorate and real-time imaging in a 2D imaging plane. Backscattered ultrasound waves are detected using a single fibreoptic ultrasound detector comprising an optically resonant plano-concave Fabry-Pérot cavity [43], with lateral extent of 25mm.

B. Simulated data
A highly efficient and accurate simulator of the OpUS imaging setup, as previously described in [37], [38] and based on the FOCUS ultrasound simulator [44], [45], was used to evaluate the performance of our method with synthetic data examples produced with the OpUS simulator. In total, four synthetic data sets were generated to test different properties of the tracking method. Noise amplitude was computed such that SNR for in-plane locations was 6.5 dB and decreasing SNR with elevational distance, due to decreasing signal strength.
The first data set (Exp.1) comprises 101 time points and a smooth, curved object trajectory with linear motion at constant velocity in the elevational direction to test the overall performance, see Figure 4. We remind that our proposed method extends the MI estimation with Kalman filtering and incorporation of elevational offset estimation and axial aberration correction. Thus, Exp. 1 shows the importance of the aberration correction. We then examine other factors, such as noise in the second data set (Exp. 2), which is the same as the first one, but with tenfold noise in every tenth measured time series to investigate the robustness of the method. The third data set (Exp. 3) follows also the same axial-lateral trajectory as Exp. 1, but the object is positioned in-plane for all frames. The fourth data set (Exp. 4) has stationary lateral and axial coordinates with a constant change in elevation, and is meant to test the accuracy of offset estimation.
1) Reference methods for comparison: In addition to the proposed combination of neural network tracking with Kalman filtering (NNK), we test two other reduced models: plain Gaussian random walk (NNK-RW) and independent subsequent states (NNK-I). Mathematically they differ with respect to the dynamic model: NNK-RW assumes that x k = x k−1 + c k and NNK-I that x k = c k (compare to Eq. (4)). We compared our method to MI tracking which estimates the object location as the pixel with highest intensity and thus only outputs a 2D location. To evaluate the performance of all considered methods, we computed the mean 2D Euclidean distance from the estimated axial and lateral coordinates to the ground truth using synthetic data. We additionally evaluate accuracy of the three-dimensional positional estimate with Exp. 4. Finally, we examined the localisation accuracy of NNK as a function of depth (axial coordinate) and out-of-plane offset with an axial line trajectory simulated with different values for out-of-plane offsets. This evaluation was done using 3D Euclidean distance.

C. Experimental data
To test the out-of-plane tracking abilities of the method we performed one physical experiment closely matching simulated Exp. 4 and one to test accuracy when moving further out-of-plane. In the first experiment, the tip of a metal pushpin (tip diameter: 50 µm) was used to emulate a point object and was submerged in water as a homogeneous background medium. This pin was placed centrally within the imaging aperture at an axial distance of 7.5 mm, and was attached to a manual translation stage (PT1/M, Thorlabs, Germany) to allow for controlled motion orthogonal to the image plane (i.e., "out-of-plane") and provide ground-truth positions for quantitative evaluation. The tip of this pin was placed at outof-plane positions ranging between -3 mm to +5 mm at a regular step size of 100 µm, and at each position a 2D OpUS image was acquired. For the second physical experiment, the out-of-plane and lateral positions were varied simultaneously to mimic a non-orthogonal drift of the object. The object was initially located centrally in the image at an axial depth of 7.5 mm, and moved in increments of 100 µm (lateral) and 200 µm (out-of-plane) to a total out-of-plane position of 10 mm. SNR for the experimental data is estimated to be around 4 dB.

D. Tuning parameter selection
The tracking algorithm requires the selection of four process noise variance parameters that can be used to fine-tune the process. Values that are too low (< (10 −4 mm) 2 ) may cause the estimated trajectory to be too restricted in case of rapid changes in the position or velocity of the object. With ideal data (high SNR), values that are too high have little effect, but with noisier data robustness suffers. This transition starts to take place at around the value of (0.2 mm) 2 . Thus, the parameters were chosen empirically as (0.005 mm) 2 to allow enough flexibility to recover from sudden changes in the position and velocity but at the same time provide robustness against noise.  Table I shows the errors for the four tracking experiments and different varying methods. The proposed NNK methods perform clearly better than MI with data where the tracked object is out-of-plane, due to the correction of the axial aberration caused by out-of-plane offset: the localisation error for all NNK methods is 0.13 mm, while for MI it is fivefold. For occasionally noisier data filtering-based NNK and NNK-RW retain their performance and clearly outperform NNK-I and MI that do not assume dependence between subsequent positions, this indicates that a filtering approach is necessary to provide robustness. This benefit of filtering and axial aberration correction is clearly visible in Fig. 4, where MI overestimates the axial coordinate and for some noisy images the estimate jumps off the trajectory (green spikes). Nevertheless, if we consider no out-of-plane offset without additional noise, all methods perform comparably well with localisation error around 0.11 mm. In terms of worst-case performance, NNK performs the best with maximum error less than three times the mean error in every experiment. In experiment 2 with increased noise, this ratio increases to almost five with NNK-RW and to over 10 with NNK-I and MI. Videos of tracking results with NNK for experiments 1 and 4 are presented in the supplementary material ( Supplementary Videos 1 and 2). Fig. 5 shows how the 3D error depends on depth and out-of-plane offset for lateral position at 0 mm. Interestingly the error is largest (∼1.6 mm) when offset is small and depth is large. For bigger offsets the error gets smaller. This indicates that there are no strong enough markers in the data to reliably estimate all three coordinates when the out-ofplane offset is small. However, the reliability increases with higher out-of-plane offsets. The relatively poor performance at shallow depths and large elevational offset (bottom right of Fig. 5) is caused by the directivity of the ultrasound sources and a large propagation distance, which result in poor SNR and hence poor amplitude estimates. The same pattern was obtained for different lateral positions (data not shown), with minor fluctuations in the stable region and degradation for the extreme points close to the imaging boundaries. We also  proposed method (NNK), the two reduced models using Gaussian random walk (NNK-RW) and independent subsequent states (NNK-I), and the reference method maximum intensity (MI). note, that even though the out-of-plane offset estimation is not correct for all instances, the estimated axial aberration and filtering approach still provides accurate results, as can be seen in Fig. 6: lateral/axial trajectory is very close to the ground truth.

B. Results on experimental data
Before we could apply NNK for tracking, the experimental data required normalisation to match the amplitude (in arbitrary units) of synthetic data. While the experimental data is clearly noisier than synthetic data, the tracking method performs reasonably well. The axial aberration correction works and out-of-plane offset largely follows the expected trajectory (see Fig. 7 and Fig. 8). However, when going farther than 6 mm away from the imaging plane in the second experiment, the algorithm breaks down and the localisation error increases rapidly. The axial/lateral localisation error is mostly below 0.3 mm and with mean around 0.2 mm for the first experimental dataset. For the second dataset the same holds for frames 1 to 35, after which the error starts to increase. In the first experimental dataset most of the 3D error originates from the elevational component. This effect is not as pronounced in the second dataset. We note that in the first dataset the estimation accuracy is worse for the first part with negative elevational distance. This indicates that the imaging probe suffers from a source of asymmetry (e.g., acoustic shadowing by an edge, or unexpected source or receiver directivity) that has not been accurately accounted for in the numerical model. This effect can also be observed in the video of this tracking experiment in the supplementary data (Supplementary Video 3).

C. Computation times
Performing one iteration of tracking took on average 298 ms. The time was split as follows: Reconstructing the image (269 ms), finding high intensity pixels (21 ms), neural network prediction (6.6 ms), Kalman filtering (0.43 ms), and displaying the image (83 ms). Hence reconstructing the image is clearly the most time consuming task and the NNK framework only adds a small computational overhead.
Preliminary tasks include generating training data and training the neural network. Training data with 8800 rows was generated in about an hour. Training the neural network took on average only 20 seconds with a median of 17 seconds (over 10 training attempts). Generating the training data and training the neural network has to be done only once which means that tracking is essentially performed in real-time. Computations were performed on a workstation with AMD Ryzen Threadripper 2950X processor and 32 GB RAM. The codes for NNK are written in MATLAB while the OpUS simulator uses routines compiled from C++ for CPU.

A. Discussion of markers
Our experiments show that magnitude of the simulated measurement data coupled with axial and lateral position is correlated with the out-of-plane offset and an axial positional aberration, as illustrated in Fig. 3. This correlation can be exploited with machine learning to find a nonlinear relationship between these quantities. We also found that this correlation holds with experimental data after data normalisation. Nevertheless, the tracking with experimental data is less stable and shows reduced accuracy. This can be partly attributed to reduced SNR in the measurement data as well as deviations from the ideal assumptions in the simulation, but the Kalman filtering offers a framework to partly mitigate these negative effects and is still able to provide a stable estimation of the axial/lateral coordinate.
In this work we have used an OpUS simulator and computed all markers, offset and axial aberration, from the simulated data. We note that under the assumption of a homogenous medium, we can alternatively calculate the axial aberration analytically using the point-spread function of the imaging system. Nevertheless, we have observed in conducted experiments that an analytic calculation can help for small distances in the simulated data, but will lead to decreased accuracy for the experimental datasets. Thus, computing the axial aberration from the reconstructed US images for training seems to provide more generalisable markers for the estimation process. Furthermore, this fully simulated framework can be extended to heterogeneous media.

B. Offset accuracy and range
The quantitative analysis shows that tracking accuracy is worse for small offsets and larger depths. Most of this error seems to be caused by the out-of-plane offset estimation, while axial and lateral components are tracked well. This indicates that even though the offset might be incorrectly estimated, the proposed axial aberration correction still works. We attribute the difficulty of estimating small elevations to the shallow slope of the out-of-plane amplitude decay for larger axial depths, as shown in Figure 3. This decrease in accuracy is also seen in the error matrix in Figure 5, and worsens with increasing axial distance. Thus, it is important to provide both offset and axial aberration, to provide accurate tracking results within the Kalman filtering. For the simulated data this effect shows symmetrically at roughly out-of-plane distance under 1 mm. For the experimental data the threshold distance is similar, but an asymmetric behaviour can be observed, where positive elevational distance is underestimated and negative overestimated, as seen in Fig. 7. This indicates that the purely simulated framework can in principle be transferred to the experimental case, but small asymmetries in the imaging probe would need to be investigated and accounted for to further improve the results.
The imaging system in this work uses unfocussed, weakly directional circular optical ultrasound sources that insonify a wide elevational range. Consequently, out-of-plane tracking can be performed over a large elevational range limited by the SNR of the B-scan; in this work up to 6 mm for positive elevational distance. For larger out-of-plane offsets, the RF data SNR is insufficient to reliably detect the pulse-echo signal. However, for imaging probes comprising directional sources, or in the presence of an acoustic lens, this range could be different.

C. Limitations and clinical applicability
In this study we show that one can successfully use the correlation between out-of-plane amplitude decay and axial/lateral positions to estimate 3D locations from linear array data. Nevertheless, this correlation was observed in a simplified simulated and experimental setting assuming homogeneous media, i.e., a water bath in the experimental setup. In order to move towards clinically realistic scenarios we need to consider various deviations from the ideal case. In the following we discuss limitations and extensions needed for clinical applicability.
1) Speckle: The tracking presented is based on maximum intensity pixels, as such speckle of low to moderate intensity (compared to the intensity of the image of the object; for instance in the case of a highly echogenic needle tip) is not expected to interfere with the estimation procedure. However, strong speckle could result in tracking errors if only the amplitude is used as marker µ. In this case, the NN would likely need to be adapted to not only extract amplitude information, but also its variation across the imaging aperture -as this spatial variation for the actual object would differ from that of speckle signal.
2) Inhomogeneous media: In this work, we have presented results for homogeneous media. For an application to inhomogeneous media with spatially varying speed of sound the NN needs to be trained differently. Here approximate synthetic training data could, for instance, be generated using ensemble-mean speed of sound maps observed over a group of patients, and simulated with advanced methods such as the k-Wave toolbox [46]. In addition, acoustical attenuation would affect the extracted parameter µ and hence complicate accurate out-of-plane tracking. For applications to actual tissue, this attenuation should be included in the model used to train the NN.
3) Object geometry: The results presented in this work were obtained for point-like objects, such as clinically encountered in the form of microbubbles, fiducial markers, brachytherapy seeds and radio-opaque markers on surgical instruments. In order to extend the method to finite-sized objects, their ultrasound response needs to be accurately modelled. Conceptually, the method applies to finite-sized spherically symmetric objects, such as large needle tips or spherical implants. However, more complicated object geometries, such as long needles or asymmetric beads, are complicated due to nonlinearities arising from high echogenicity, and ambiguities in differentiating between needle tips and shafts or different object orientations. Such objects would require further refinement in the NN markers and the underlying acoustical model to make accurate predictions. 4) Tracking range and accuracy: In the experimental results presented here, an out-of-plane tracking range of up to ±6 mm was demonstrated, and was limited by SNR. This range could be further extended, provided the imaging probe emits a sufficiently diverging field in the elevational direction and SNR is improved, for instance using coded excitation schemes. The optical ultrasound imaging system considered here does not apply acoustic focussing in the elevational direction, and hence is ideally suited to tracking across a wide out-of-plane range. The achieved range of ±6 mm is clinically highly relevant, as correcting object placement over larger distances is typically not possible without removal and re-entry of a surgical tool. For clinical imaging systems, which typically apply elevational focussing, geometrical distortion and signal amplitude decay resulting from out-of-plane offsets will still occur, and the proposed method can still be applied, provided it is retrained. However, the out-of-plane tracking range and accuracy will depend on the tightness of the elevational focusing, and hence will vary with both the F-number of the elevational focusing lens and the axial position of the object relative to the focal distance. 5) Experimental setup: Here, we used a prototype optical ultrasound imaging setup to perform experimental validation measurements. While these were reasonably successful (cf. Figs. 7, 8) due to the availability of a highly accurate and efficient numerical model, the developmental nature of this system limited its practicality. Slow fluctuations were observed in the efficiency of the optical ultrasound sources and the sensitivity of the detector, which resulted in unforeseen variations in the ultrasound amplitudes. As the NN estimation requires the amplitude to be accurately known, these fluctuations limited the range of object trajectories to those that could be traversed quickly. This also resulted in slight differences between simulated and experimental data. Nevertheless, the estimation network generalised well to the experimental data, and the filtering approach further stabilised the estimation process. However, in principle any ultrasound imaging system that grants access to RF data could be used, even those generating focused transmissions -although the NN would need to be retrained for each considered setup and tracking accuracy and range will vary.

D. Extensions
The presented framework can be extended to tracking multiple point sources. In that case, a data association task would have to be solved [19], [47]. This means determining which pixels belong to which target. Furthermore, this would also require a more complicated setup for training data generation and extraction of markers from the measured time series.
Another interesting avenue to pursue would be the extension to needle tracking (as opposed to point object tracking) where the shaft can be mistaken for the tip, which would require a shape detection instead of a simple tracking. Alternatively, one can overcome the shaft problem by adjusting our framework to data obtained with an active listening needle [8] that relies on the reception of ultrasound pulses by a fibre-optic hydrophone (FOH) integrated into the needle.
Finally, due to the limitations mentioned in Section V-C it might be promising to consider the full RF time series as input to a convolutional neural network that is also capable of extracting geometric markers from the data. This information can be still paired with manually extracted markers, such as amplitude, to improve the tracking accuracy and robustness for future applications.

VI. CONCLUSION
This work proposes a neural network and Kalman filtering approach to perform accurate and robust object tracking in 3D from linear array data. The essential step is that a neural network estimates the third dimension and its impact on the 2D US image, in form of aberration in the axial coordinate. Then Kalman filtering is performed for all coordinates to provide a robust estimate with respect to noise. We have shown that the framework can provide high accuracy in estimating axial and lateral coordinates for objects that are not in-plane as well as the corresponding elevational distance. If the pointsource is too close to the imaging plane, it remains difficult to provide an accurate estimate on the elevational distance, but the proposed NNK framework is still capable to provide a robust and accurate estimate on the lateral/axial coordinate.