A Human-Centered Machine-Learning Approach for Muscle-Tendon Junction Tracking in Ultrasound Images

Biomechanical and clinical gait research observes muscles and tendons in limbs to study their functions and behaviour. Therefore, movements of distinct anatomical landmarks, such as muscle-tendon junctions, are frequently measured. We propose a reliable and time efficient machine-learning approach to track these junctions in ultrasound videos and support clinical biomechanists in gait analysis. In order to facilitate this process, a method based on deep-learning was introduced. We gathered an extensive dataset, covering 3 functional movements, 2 muscles, collected on 123 healthy and 38 impaired subjects with 3 different ultrasound systems, and providing a total of 66864 annotated ultrasound images in our network training. Furthermore, we used data collected across independent laboratories and curated by researchers with varying levels of experience. For the evaluation of our method a diverse test-set was selected that is independently verified by four specialists. We show that our model achieves similar performance scores to the four human specialists in identifying the muscle-tendon junction position. Our method provides time-efficient tracking of muscle-tendon junctions, with prediction times of up to 0.078 seconds per frame (approx. 100 times faster than manual labeling). All our codes, trained models and test-set were made publicly available and our model is provided as a free-to-use online service on https://deepmtj.org/.


I. INTRODUCTION
Three examples of the MTJ in the medial gastrocnemius (MG) muscle-tendon unit, recorded with three different instruments. The MTJ is indicated by a red cross. We specify contrast-to-noise ratios (CNR) for all considered instruments. The video frame in Figure a was collected with an Aixplorer V6 US system (Aixplorer). This figure also shows the embedding (yellow color) of the MTJ in the triceps surae muscle-tendon unit (MG, lateral gastrocnemius (LG, not shown), Soleus and Achilles Tendon (AT)). The white arrow in Figure b indicates the direction of principal movement of the MTJ from distal to proximal in the x,y-coordinate system. This video frame was collected with an Esaote MyLab 60 US system (Esaote). Figure c shows an image of the MTJ collected with a Telemed ArtUs US system (Telemed). D URING human locomotion, muscle-tendon complexes of lower limbs are under cyclic concentric and eccentric stress [7]. Within these units, muscles and tendons have different properties [8], contribute differently to external loading [9] arXiv:2202.05199v1 [cs.CV] 10 Feb 2022 and adapt differently to stimuli [10]. For instance, imbalances in muscle strength or tendon stiffness may impede efficient interplay during locomotion [11] or lead to injuries [12]. In clinical populations, knowledge of alterations in muscles and tendons due to short or long-term treatments (e.g., physical therapy or surgeries) is crucial for developing efficient therapeutic strategies [13].
To investigate tissue behaviour in lower limbs (e.g., the triceps surae muscle tendon unit) and to distinguish between individual contributions of muscles and tendons, their junctions are usually visualized using ultrasound (US) imaging ( Fig. 1) while their displacements are tracked with various methods [14]- [18]. The triceps surae ( Fig. 1. a) is a major contributor to human locomotion. It consists of three heads: the medial (MG) and lateral (LG) gastrocnemius as well as the soleus (SO) muscle. Each individual head is connected via a muscle-tendon junction (MTJ) to the Achilles tendon (AT). Thus, the MTJ provides a form and force-locked interconnection between contracting muscles and passively acting tendons [19]. In US images, the MTJ is clearly visible due to the change of acoustic impedance in muscles and tendons. Moreover, due to its definable maximum displacement and primary longitudinal (distal to proximal) travel direction ( Fig. 1. b), the area of this anatomical feature is covered by standard-sized linear US arrays [20]. Therefore, this method is widely used and it improves general understanding of muscle-tendon properties and their behaviour in healthy [21] and impaired subjects [16], [22].
However, musculoskeletal US imaging depends on operators [23]. In particular, image interpretation requires trained specialists. Moreover, investigating displacements of the MTJ in US images typically needs handcrafted labeling. For this reason, several semi-automatic and automatic methods to track MTJs have been proposed [14]- [16], [24]. Image analysis in biomechanical and clinical US studies relies largely on computer vision algorithms [25]. Applied on noisy, real-world US motion data, these optical-flow or matching based methods are prone to errors [26]. This is often due to low frame-rate recordings or poor image qualities of standard medical US systems [20]. One common parameter to quantify US image quality is the contrast-to-noise-ratio (CNR) [27].
Recently, machine learning solutions for automatic detection and tracking of musculoskeletal features in biomechanical applications have been developed [28]. These methods improve performance because they can learn to extract salient features, such as anatomical landmarks, directly from annotated input images. Therefore, a neural network is trained to find a mapping between input images and manually set labels [29], [30]. With a sufficiently large dataset, neural networks can successfully map novel data (generalization). During training with real world images, the network learns to neglect noise or instrumental errors, yielding robust and accurate results compared with classical computer vision applications [31]. Leitner et al. [17] for example, used data from an Esaote US system (Esaote SpA, Genoa, Italy) and a ResNet model architecture [32] with an attention mechanism [33] to investigate MTJ predictions on 107 subjects using 7200 manually annotated labels. They found that an inclusion of healthy and impaired patients into the training dataset improved overall performance of their model. Krupenevich et al. [18] focused their work on the trackability of MTJs across several isometric movements and complex functional tasks such as walking. They trained a MobileNetV2 [34] architecture on 1200 manually annotated ground truth labels, collected from 15 subjects that were walking, with a Telemed US system (Telemed UAB, Vilnius, Lithuania).
These newly emerging machine-learning applications for MTJ tracking show that deep neural networks provide strong performance in identifying the exact MTJ positions in US images, even for small training datasets, and independent of subjects and movements. However, there is still lack of evidence on how these algorithms perform on noisy interlaboratory, inter-observer data and may be generalized to diverse settings. Furthermore, previous MTJ tracking neural network models were evaluated on inaccessible test-sets and labels of test-set and training dataset were identified by the same person. This neglects unavoidable positional variations of different observers and introduces potential bias. In particular, machine learning benchmarks need to include more than one clinical specialist to generate reliable reference test-set labels [35]- [37] with low noise [38]. Moreover, predictions across multiple-domains (e.g. data collected from different instruments) are key in generalizing machine learning algorithms [39]. For example, deep-learning has shown excellent performance if training and test-set data are drawn from the same underlying distribution. However, large domain shifts in data (e.g. using data from machines of different vendors) often cause significant performance impairments. In case of the proposed MTJ tracker by Leitner et al. [17] and Krupenevich et al. [18], evaluation and training dataset come from the same US instrument. Therefore, these networks might fail to provide similar performance on datasets obtained from other US machine types.
In this work, we present a novel deep-learning approach for the detection of MTJs in ultrasound images. We curate a large and diverse training dataset, in order to provide a universal MTJ detection method, independent of the used US instrument, movement, muscle region or noise coming from experimental setups (Sect. II-A). We use a deep neural network with U-Net architecture and attention mechanism to predict the position of the MTJ as a probability density function (Sect. II-B). An objective test-set was created and curated by four independent specialist to evaluate the average deviation of our model from specialist labels (Sect. III). In addition, we estimate the generalization to novel datasets and discuss the capabilities of our method (Sect. III).

A. Dataset and Labeling
We used five different datasets in this study (Table I). Data were collected at the University of Graz, the Graz University of Technology and the University of Queensland between 2014 and 2020 on 123 healthy and 38 impaired individuals. Our research was approved by the responsible ethics committees, and their approval numbers are given in Table I. With 1590 recordings, the isometric maximum voluntary contractions (MVC) and passive torque movements (PT) on the medial gastrocnemius (MG) had the largest share in the dataset. A smaller amount of data was collected on the MG during running (48 recordings). The measurements on the lateral gastrocnemius (LG) consist of 109 recordings. The complete and fully anonymous dataset holds 1747 video recordings with a mean length of 19.84 seconds per video. Sequences were captured at frame-rates of 30 frames per second (fps) for studies with an Aixplorer V6 (SuperSonic Imagine, Aix-en-Provence, France) US system (Aixplorer, Fig. 1. a), 25 fps for studies with the Esaote MyLab60 system (Esaote, Fig. 1. b), and 30-80 fps for the Telemed ArtUs US (Telemed, Fig. 1. c), respectively.

Training Dataset Labels
The position of the MTJ was identified as the most distal insertion of the muscle into the free tendon (Fig. 1). Datasets 1, 2, 4 and 5 (Table I) were annotated by senior investigators (>3 years of research experience, conducted >2 subject trials investigating the MTJ in the past 2 years). Three cycles of reviews on the labels were conducted by the same annotator, with a minimum interval of 2 days between reviews. Dataset 3 (Table I) was annotated by a junior investigator who was carefully trained to annotate US data of the MTJ. Labels were reviewed by a senior investigator (>10 years of research experience, conducted >4 subject trials investigating the MTJ in the past 2 years). Table II shows a detailed distribution of all labels over included instruments, muscles and movements. A labeling frequency of 10 frames per video was chosen for dataset 1. Every 5 th frame was annotated in dataset 2-5. These sampling frequencies prevent strong temporal correlations. In total, our training dataset holds 66864 ground truth labels, covering 3 different movements, collected on two muscles from 123 healthy and 38 impaired subjects, and were recorded on three US systems from different vendors.

Test-set Labels
For model testing, we first randomly selected 12 participants from our dataset and then completely separated all related 107 recordings into an independent test data pool. From there we randomly excluded 13 MVCs and 14 PT movements and cut each video file to a length of 10 seconds covering the movement. Consequently, we excluded every 5 th frame of each individual time-series. This resulted in a number of 1360 frames in the test-set. We refrained from including the running movement in the test-set because diversity (collected on 3 subjects) and the overall share (7.3%) was low. In total, our test-set holds 1360 ground truth labels, covering 2 different movements, collected on 2 muscles from 12 individual subjects, and were recorded on 3 US systems from different vendors.
The test-set was then annotated by N = 4 specialists S = {1, 2, 3, 4}. Specialists had 2-10 years of experience in biomechanical and clinical research investigating muscles and tendons in 2-9 US studies in the past 2 years. The data was labeled using a web-based annotation tool (Labelbox Inc., San Francisco, CA, USA). The manual annotation of the test-set took on average 2.76 ± 1.11 hours.  Fig. 2. This figure demonstrates the processing workflow of our machine-learning approach. In a we show pre-processing steps applied to data such as cropping, resizing and rescaling. Furthermore, we have included image augmentation to account for noise occurring in the experimental data collection. b shows a U-Net architecture with 4-layers and attention mechanism. Our model processes probability maps with soft labels. Therefore, small positional uncertainties about the MTJ did not cause large losses. Figure c demonstrates that we applied a 2D Gauss fit on predicted maps and evaluated filtered noise coming from our algorithm or large inter-specialist variability. For our data analyses depicted in Figure d, we compared soft labels set by our model (red overlay) with individual specialist labels (blue dots with white edges). The 95% confidence interval for our model label pixel position is indicated with the white dotted ellipse. We defined the specialist mean position (blue cross) as the target reference label for our performance evaluation. Figure e shows an example of a complete time-series reconstruction of one PT movement included in the test-set. We have indicated specialists (gray color traces), reference labels (blue trace), and the prediction trace of our model (red color).
To evaluate specialist performances, we used leave-oneout cross validation. We computed Euclidean distances d k between label positions P k x,y of each individual specialist k ∈ S, and mean label positionsP j x,y of the remaining three specialists j ∈ S. Mean deviations of specialistsd S and their standard deviations σ S were computed over all k-folds, and used for model evaluations (Sect. II-D): Absolute standard deviation of specialistsσ S is used as a supporting distance metric for our data analysis (Sect. II-D) and filtering (Sect. II-C). From the four specialist labels P k x,y , with k ∈ S, we computed average label positionsP S x,y , in order to obtain the most likely positions of the MTJ. These average specialist positions were used as a reference for model evaluations (Sect. II-D) and were referred to as reference labels.
To account for label noise among specialists, the interclass correlation coefficient (ICC) for inter-rater analysis was used. Our calculated ICC is based on a mean-rating (k * = 3) and absolute-agreement in a 2-way mixed-effects model. Furthermore, we calculated standard error of mean (SEM) and rootmean-square errors (RMSE) over all k-folds.

B. Model
Data Pre-processing Fig. 2a (left) shows image sizes for each instrument output (Aixplorer, Esaote, Telemed). All images were cropped by an aspect ratio of 2 : 1. Crop rectangles are indicated by dashed lines and crop rectangle sizes (x-coordinate, y-coordinate, width, height) by pixels for each instrument. Origins were set at upper left corners. All frames were resized to 256 by 128 pixels by a third order transformation (Fig. 2a center).
For our neural network training, we applied additional image augmentation, to counterbalance differences and errors in the experimental setups. During experimental trials using bulky standard medical US probes, or in special environments (e.g. data collection on impaired subjects) transducers are often displaced [20]. This causes image distortions and can also lead to out-of-plane movements of transducers. Furthermore, images of the same tissue can be mirrored in consecutive trials of the same participant due to changed probe orientations after probe reattachment. For our training, we artificially created similar image distortions by random rotations in ranges of ±20 degrees, zooming images by scale factors of 0.7 to 1.3, with shearing to maximum angles of 0.2, randomly shifting in x-and y-directions by up to 10% of image sizes and randomly flipping images vertically and horizontally (Appx. I, Fig.8). All images were normalized for zero mean and unit standard deviation before used as input for the neural network (contrast normalization as described in Goodfellow et al. [29]). From this pre-processing, we assume that our model is more robust to dissimilarities because of experimental setups and movement artifacts (see Sect. III).
Manually set labels in the training dataset denote exact pixel positions of estimated MTJ positions in the image. For our training, we used soft labeling, where we assigned probability values to each image pixel (cf., Mathis et al. [40]). We used probability maps at the same resolution as original images, where we modeled positions of the MTJ by a 2D normal distribution with a covariance of 100 pixels at positions of specialist labels. Therefore, model training is more tolerant in terms of small positional uncertainties, in agreement with uncertainties of manual labels. This also accounts for US images where the MTJ is not visible due to out-of-plane movements of transducers, by assigning no MTJ position in the image. For each probability map, we applied the same random image augmentation and resizing to 256x128 pixels as for input images.

Model Architecture
A convolutional neural network (CNN) was used to predict a probability map of the MTJ position for a given input image. We built upon a U-Net type architecture [41] that encodes the input image into a feature representation using consecutive convolutional and max-pooling layers. These layers reduced spatial dimensions while increasing network depth. This allowed the network to extract salient features while taking global contexts of images into account (e.g., the characteristic Y shape of the muscle-tendon structure). The feature representation was decoded into the probability map by the same number of consecutive upsampling and convolutional layers as for decoding. Between encoding and decoding of the features, we applied skip connections to correlate spatial information with the extracted image features [41]. We assume that only a fraction of extracted features in the encoder is relevant to determine exact MTJ positions. Consequently, we followed the approach by Oktay et al. [42] and added an attention mechanism to the skip connections of our model. Here the encoded features were used to create attention maps that enabled the network to focus on more important regions (see Fig. 2b). We used a network depth of 4, with a starting filter number of 64 that we increased by factor 2 after each max pooling layer and decreased by factor 2 after each upsampling layer. The final probability map was obtained from the final convolutional layer with 1 channel and a sigmoid activation function, such that each pixel in the input image was labeled with values within range [0, 1], where 1 denotes high probability for the position of the MTJ at the corresponding pixel position.

Training
For our model training, we used the full training dataset of 66864 manually annotated images. We used binary crossentropy as a loss function, where we weighted the 0 class with 0.1, in order to counteract class imbalance. The Adam optimizer was used for parameter optimization with a learning rate of 0.0001, and decay rates β 1 = 0.9 and β 2 = 0.999.
To account for domain generalization on different instrumental data, a sequential learning strategy was used. Within three independent training steps, we sequentially added data from the individual instruments. Here we followed the sequence: 1. {Telemed}, 2. {Telemed, Esaote} and 3. {Telemed, Esaote, Aixplorer}, in correspondence with the size of the individual datasets (largest to smallest). We trained our model for 100 epochs per step and kept the pre-trained model weights from previous steps. After each step, model weights were saved, in order to assess the influence of additional data diversity on the model performance (Sect. III). For all other evaluations, we used the final model after the third training step.

Post-processing
From the obtained probability maps, we estimated MTJ positions by fitting a 2D normal distribution to the output of the neural network. For optimization, 2D Gaussian kernels were fitted for each image by employing a trust region reflective algorithm [43]. To set initial conditions for optimization, we first identified points with the highest probability in US imagesP M x,y = arg max(p M ) and then set starting conditions for the Gaussian fit as follows: A = max(p M ) − 2 * SD (amplitude), (x 0 , y 0 ) =P M x,y (origin), σ x = σ y = 5 * σ (standard deviation in x and y direction), θ = d = 0 (angle and offset). Pixel positions with highest kernel density were identified as MTJ positions P M x,y of our network (Fig. 2d). We found that although the training was performed with a pixelwise loss, the neural network provided similar probability maps to the ground-truth labels. In most cases, this allows for an unambiguous identification of the MTJ (error-cases are discusses in Sect. II-C).

C. Label Noise Filter
To account for label noise introduced by our algorithm, as well as by inter-observer variability, we identified three errorcase scenarios. Images which are classified in either one of these categories were excluded from the test-set. The first two error-cases are related to predictions of our network. We excluded predictions where labels were either located at image borders which is the case if or where the labels are within a 10 pixel padding around borders, and probability scores are low (max p M < 25%). Such cases are given by: For the first two cases, we excluded 11 images in total (Appx. I, Fig. 9). For these cases, our neural network was not able to identify the position of the MTJ position within the image. This erroneous behaviour occurred in 0.8 % of cases.
The third error-case relates to inconsistent labels among specialists. 5 images of the test-set (Appx. I, Fig.10) were excluded because MTJ labels of at least three specialist showed distances of 20 ·σ S from reference labelsP S x,y . This reduced the test-set by a total of 16 images (1.17 %). Thus, 1344 video frames were used for further performance analyses.

D. Data Analyses
We calculated Euclidean distances d M between the reference label positionsP S x,y obtained from specialists (Sect. II-A) and label positions from our model P M x,y , and compared them to mean deviations of specialistsd S . To evaluate labeling performances between our network and specialists RMSEs, SEMs, as well as mean absolute errors were computed. We assessed the agreement between our model and the reference labels in x-and y-coordinates ( Fig. 1 b) using Bland-Altman plots. These comparisons are illustrated by differences between pairs of measurements (d M x , d M y ) as a function of the normalized mean measures (normalization to instrument image sizes). Furthermore, we introduced a tolerance distance given as a multiple n * of mean standard deviationsσ S from the reference labelsP S x,y . We estimate the number of correctly identified samples by individual specialists and our model with increasing distance form the reference labels.

Comparison with other MTJ trackers
We compared our model to recent MTJ tracking algorithms by computing RMSE for each method based on our test-set. We considered the semi-supervised computer vision algorithm from Cenni et al. [16] and deep-learning approaches from Leitner et al. [17] and Krupenevich et al. [18]. For the algorithm of Cenni et al. MTJ positions in the first frame of each video are needed to start predictions. Therefore, we used the reference labelsP S x1,i,y1,i of the first video frame, from the i th video file in the test-set, to mark starting positions for optical flow calculations. MTJ trackers from Leitner et al. [17] and Krupenevich et al. [18] are based on CNN architectures. We used their provided trained models for predictions on our test-set. For these tracking solutions, evaluation and training datasets come from the same US instruments. To evaluate differences in image qualities of US machine types, and to  1) were calculated with the method described in [44].

E. Open Source Cloud Deployment
We implemented a cloud deployment (Google LLC, Mountain View, CA, USA) of our model and provide public access to our software-as-a-service via a web application. To comply with strict data security standards, each platform user receives a private and personal 128-bit Universally Unique Identifier (UUID) to access and interact with the uploaded data. Video files are temporarily stored on cloud infrastructure for prediction. User data is not stored on servers beyond calculations and deleted after job completion or in case of lost connections. Our web application is entirely open-source and publicly available at https://deepmtj.org.

III. RESULTS AND DISCUSSION
In this section, we first investigate inter-rate reliability's and than compare our model predictions to the reference labels. Moreover, we assess performance of other recent, and opensource, available MTJ trackers on our test-set and compare them with our model results. In Table III an overview of model results is presented. Results are obtained from the full test-set, which includes data from all considered instruments (Aixplorer, Esaote and Telemed), two muscles (MG and LG) and two movements (MVC and PT). Among the considered automated methods, our approach shows the best performance both in terms of RMSE (RMSE M = 4.89 mm) and SEM (SEM M = 0.10 mm).

Evaluation of specialists
Violin plots in Fig. 3 show labeling results for each specialist. We plot Euclidean distances d k between the Specialist label positions P k x,y and the mean label positions of the remaining three specialists P j x,y . Median and SEM values are comparable between the specialists. Few larger errors of up to 35.43 mm were found for annotations of specialists 1-3. The inter-rater assessment using ICC confirms good inter-rater reliability for specialists (ICC = 0.88, 95% confidence interval = [0.87, 0.89]). Table III shows that our trained network (RMSE M = 4.89 mm) is within the range of the performance level of specialists (RMSE S = 5.05±0.62 mm) on the test-set. The SEM suggests that our model produces consistent results, comparable to the individual specialists. Furthermore, it takes 107.4 s in total (tested on a NVIDIA GeForce RTX2070 graphical processing unit (GPU)) to predict all MTJ positions of the test-set using our algorithm whereas individual specialists need on average 2.76 ± 1.11 hours to label the same dataset. This shows that our method can accelerate data analyses times by a factor of 100 on a low cost GPU.

Comparison to specialists
Bland-Altman plots in Fig. 4 show that our model has no substantial bias in x-(2.1 mm) and y-direction (0.28 mm) of the detected MTJ position. Larger deviations in the xcoordinate (Fig. 4 a) result from the leading travel direction (Fig. 1 b) of the MTJ. We assume that image quality (i.e., CNR) influences the estimation of MTJ positions for Esaote and Aixplorer, causing larger scattering in x direction. None of the measured biases were found to be proportional to averaged values, indicating that manual and automatic analyses agree equally through the range of measurements.
In Fig. 5, we plot the percentage of correct MTJ detections and correct specialist labels as function of tolerance distance. Tolerance distance is depicted as a multiple n * of absolute standard deviationσ S from reference label positionsP S x,y . Results show that the number of correct samples is similar among our model and specialists. A slightly higher performance is achieved by specialists for small n * . The neural network appears to be more robust in general, as can be seen from the close to 100% correct detections with a tolerance  Line plots (our model in red color and specialists in blue color) show valid numbers of frames in percent (y-axis) with increasing tolerance distance (x-axis). Tolerance distance is depicted as a multiple n * of absolute standard deviationσ S from reference label positions. P S x,y . distance > 2 ·σ S . In other words, the network shows no random false detections, that can occur during manual labeling (e.g., due to attention loss in monotonous tasks [45]).

Performance on different muscles and movements
We decompose our test-set with respect to included muscles and movements, and then compare results of our algorithm d M to mean deviations of specialistsd S for each category.
In Figure 6a, we show performances on images of LG and MG muscle-tendon units. In Figure 6b, we highlight all video frames for our considered movements (MVC and PT). RMSEs shown in Table IV and kernel densities in Fig. 6a and 6b are in good agreement with larger deviations for predictions on the LG. In case of the LG, the share of labeled ground truth data in comparison to the total data volume was small (9.4%). In addition, LG data of the test-set was recorded by the US system (Aixplorer) with the least amount of training data compared to other included devices. Therefore, we assume that performance shortcomings stem from the low training and test dataset volume. Furthermore, varying morphologies of gastrocnemii heads around the muscle tendon junction could also influence prediction qualities of our model. In terms of movements, the two dataset sizes were balanced. The results show that our model has no preference for a specific movement. Comparisons with manual tracking suggests that MTJs, during controlled passive and active contractions, can be reliably and accurately tracked using our model.

Comparison with other MTJ trackers
We evaluated the performance of recently proposed MTJ tracking approaches on our test-set (Table III) and found that neither could reach the level of specialists. In particular, the machine learning approaches were trained on machine specific data collected with US systems of one vendor. The approach by Leitner et al. [17] was trained on Esaote data only, while the method by Krupenevich et al. [18] is based entirely on Telemed data. When evaluating these algorithms on our test-set data from the respective machine type, the methods by Leitner et al. [17] and Krupenevich et al. [18] achieve RMSEs of 5.57 mm and 12.06 mm on Esaote and Telemed data, respectively. Therefore, we conclude that these two machine-learning approaches are domain specific and provide poor performance on our diverse test-set. We associate the lack in performance with the lack of data diversity during training. Hence, it seems that the diversity of instruments, inter-laboratory data and inter-observer settings, as represented in our test-set impacts prediction qualities of previously proposed algorithms.

Domain Generalization
We observe that image qualities differ across US machines of different vendors (Fig. 1). This can originate in the US image formation (beamforming), filtering and other signal processing techniques (e.g. time-gain compensation) implemented on US systems themselves or preset by operators. Image appearances may also vary due to different applications of US systems in experimental setups. Leitner et al. [20] have shown that, for example, the Telemed system is widely used to collect data during movement with increased frame-rates at the cost of decreasing image qualities. Other devices, such as the Esaote system, offer transducers with wide probe heads to collect data on movements where larger scale length changes occur, decreasing frame-rates. To counteract sensitivity for noise coming from experimental setups and applications we have added data augmentation methods (as depicted in Sect. II-B). Furthermore, the evaluation of methods by Leitner et al. [17]  and Krupenevich et al. [18] on our multi-instrumental test-set has demonstrated that the performance of a neural network strongly depends on instrument types used for training and testing (Table III). Therefore, our model was trained with US images from three of the four most common US instruments in biomechanical research [20], [25]. In order to analyse the generalization of our network to novel instrumental data, we sequentially introduced datasets from the individual instruments during our model training (Sect. II-B). Fig. 7 shows the mean absolute deviations of the two intermediate models (trained with images from {Telemed} and {Telemed, Esaote}) and the final model (trained with images from all instruments {Telemed, Esaote, Aixplorer}). For the model that was trained with Telemed data only, we can observe excellent performance for the Telemed images, but the model fails for the other two machine types, similar to the studies by Leitner et al. [17] and Krupenevich et al. [18].
By adding the Esaote data to the model training, we obtain a significant performance increase for Esaote frames as well as for Aixplorer samples (see Table V). The model trained with solely Telemed and Esaote images, achieves a high performance for Aixplorer data (RMSE of 5.11), although it was not included in the training set. This suggests a successful generalization to MTJ tracking in US imagery. Thus, we expect that the final training stage of our model {Telemed, Esaote, Aixplorer} provides a similar performance on novel US instrumental data. Moreover, results in Table  V show that including images from multiple instruments in the model training does not decrease performance for the individual instruments. Hence, models trained with images from {Telemed}, {Telemed, Esaote} or {Telemed, Esaote, Aixplorer} achieve similar performances on the Telemed testset.
From our multi-instrument training approach and the applied data augmentation, we expect that our model can be applied to a larger variety of different experimental setups (e.g., instrument types, transducer rotations and shifts, image distortions due to movement,...) with less susceptibility to errors.

IV. CONCLUSION
In this work, we presented a data-driven deep-learning method for the estimation of MTJ positions in US images. Our approach is based on a CNN trained to infer MTJ positions across a variety of US systems from different vendors, collected in independent laboratories from diverse observers, on distinct muscles and movements. We used data augmentation methods and soft labeling to counteract label noise introduced by the experimental setups and inter-observer variability, respectively. A diverse test-set was created and labeled by four independent specialists, in order to provide more objective estimates of MTJ positions. Our method was able to identifiy MTJ positions in 99.2% of cases, with a root-mean-square deviation of 4.89 mm. This error is in the same range as the variation of human specialists (RMSE S = 5.05 ± 0.62 mm). Therefore, our method provides human-like performance on a divers test-set and requires only a fraction of manual labeling times (approx. 100 times faster). We demonstrated that our approach is applicable to generalize to data collected on different US machine types. We made all our codes, trained models and test-set (1344 labeled ultrasound images) publicly available and provide our model as a free-to-use online service under https://deepmtj.org/.
ACKNOWLEDGMENT C. Leitner would like to acknowledge Martin Sust (University of Graz) for the support of his research.

APPENDIX I
In this section we provide additional figures supporting our work. Fig. 8 showcases three augmented frames of the same image (MTJ collected on an Esaote system) using our settings for data augmentation. In Fig. 9 and Fig. 10, we show identified and excluded images from error-cases 1 and 2 as well as error-case 3, respectively.  Figure shows three augmented frames of the same image. We used the ImageDataGenerator implemented in TensorFlow [46] to additionally generate random data with the following properties: rotation-range=20, horizontal-flip=True, vertical-flip=True, zoom-range=0.3, width-shift-range=0.1, height-shift-range=0.1, shear-range=0.2, fill-mode=reflect.