Unsupervised Bayesian Inference to Fuse Biosignal Sensory Estimates for Personalizing Care

The role of sensing technologies, such as wearables, in delivering precision care is becoming widely accepted. Given the very large quantities of sensor data that rapidly accumulate, there is a need to employ automated algorithms to label biosignal sensor data. In many real-life clinical applications, no such expert labels are available, and algorithms for processing sensor data must be relied upon without access to the "ground truth." It is therefore extremely difficult to choose which algorithms to trust or discard at any point in time, since different algorithms may be optimal for different patients, or even for different points in time for the same patient. We propose two fully Bayesian approaches for fusing labels from independent and potentially correlated annotators (i.e., algorithms or, where available, experts). These are generative models that aggregate labels (i.e., the outputs of the algorithms, such as identified ECG morphology) in an unsupervised manner, jointly estimating the assumed bias and precision of each algorithm without access to the ground truth. The fused estimate may then be used to infer the underlying ground truth. For the first time in the biomedical context, we show that modeling correlations between annotators, and fusing information concerning task difficulty (such as the estimated quality of the sensor data), improve these estimates with respect to commonly employed strategies in the literature. We also adopt a strongly Bayesian approach to inference, using Gibbs sampling, to improve estimates over the existing state of the art. We present results from applying the proposed pair of models to simulated data and to two publicly available biomedical datasets, as a proof of principle. We show that our proposed models outperform all existing approaches recreated from the literature.
We also show that the proposed methods are robust when dealing with missing values (as often occurs in real-life biomedical applications), and that they are suitably efficient for use in real-time applications, thereby providing the basis for the reliable use of sensors for personalizing the care of the individual.


I. INTRODUCTION
With the rapid increase in the volume of wearable devices being used in clinical practice, there exists the possibility of personalising the care of individuals, such that care is more closely based on their physiology. This "precision" approach to healthcare is predicated on the fact that biosignal data arising from patient-worn sensors can be used for diagnosis and prognosis in a manner that is sufficiently robust and interpretable for use in a clinical scenario. Replacement of "one-size-fits-all" treatments with personalised equivalents would be greatly supported by the use of wearable healthcare sensors. It is recognised, however, that existing systems suffer from a lack of robustness [1], typically because reliable analysis of the very large resulting datasets of sensor data is often impossible using existing methods. Given the very large quantities of sensor data that rapidly accumulate, there is a need to employ automated algorithms to label biosignal sensor data (e.g., abnormal morphology of the ECG). However, automated algorithms are typically less reliable than gold-standard expert labels; the latter are typically sparse and expensive, and it is usually not feasible to obtain expert labels for real-time data arising from actual patients.
Expert labelling of these datasets that contain sensor data is the gold standard for diagnosing many diseases and providing subsequent care. From a clinical perspective, expert labelling is defined as the result of experienced physicians annotating an area of interest in some data when the ground truth (e.g., the true time-of-occurrence or duration of an event; a numerical estimate of a physiological measurement, such as a vital sign; or the diameter of an object in a medical image) is not readily available. Expert labels are generally used for training automated algorithms when taking a supervised approach to learning. However, experts are relatively scarce, their time is expensive, and the task of labelling is time-consuming. It is thus typically impossible to obtain expert labelling in real time for most practical healthcare-sensing applications, where time-series data rapidly become too large for human interpretation. In such cases, automated algorithms must be relied upon to label real-time datasets. Expert labelling can be further complicated by the ambiguous definition of what it means to be an "expert" for a given medical application. There is no standardisation for testing measures of clinical competency above a proficient level, in terms of either accuracy or consistency, and no empirical method for measuring levels of clinical expertise, even though the quality of annotation relies heavily on the expert's experience. Therefore, large inter- and intra-annotator variabilities occur when physicians label data, depending on their respective experience and level of training [2], [3].
The "experts" in terms of the work described by this paper could be humans (such as trained clinicians), automated algorithms, or some combination of both. Empirical studies have shown that both humans and automated algorithms are likely to have their own bias values regardless of their level of "expertise" [3]–[5]. The bias of each labeller (also termed annotator) is defined as the average difference between the expert's estimate and the (unseen) true value. In the context of labelling medical data, relying on results from automated algorithms to produce a diagnosis (or to make a patient-specific decision for a certain treatment) is typically not sufficient, due to large inter- and intra-algorithm variabilities between recommendations. Furthermore, when the ground truth is unknown, it is difficult to choose which algorithms to trust or discard, or even how to merge their recommendations to form a consensus output.
In this paper, we propose two generative models for aggregating individual continuous labels in a principled manner, inferring a more reliable label than that of each individual labeller considered independently. We demonstrate our models using two datasets from exemplar biomedical applications in which subjective continuous labels of some presumed underlying ground truth are provided by "experts". These "experts" can be independent or potentially correlated. For the purpose of this paper, we adopt a broad definition of "expert" that includes automated algorithms producing continuous-valued estimates from exemplar biomedical data. In keeping with the literature, we will refer to "annotators" and "experts" interchangeably; similarly, "annotations" and "labels" both refer to the outputs of the experts.

II. RELATED WORK
There exist some key contributions in the literature for modelling continuous-valued labels, and the biases and expertise of each annotator. Raykar et al. [6] used an expectation-maximisation (EM) method to fuse continuous-valued labels for measuring the diameter of a suspicious lesion from a medical image. Welinder and Perona [4] devised a Bayesian EM framework for fusing binary, multi-valued, and continuous-valued labels, which explicitly modeled the precision of each annotator to account for varying skill levels, without modelling their bias values. We have extended this Bayesian framework to jointly model the annotators' biases and precisions using a Bayesian treatment [7]. In medical imaging, Warfield et al. [3] proposed a method for validating segmentation by estimating the bias and variance of each annotator. A similar model was described by Ouyang et al. [8], which estimated a quantitative ground truth (such as count and percentage estimation) in a "crowd sensing" setting. Xing et al. recently proposed using a Gaussian prior on the bias parameter for the identification of cardiac landmarks in two-dimensional images [9]. However, their model does not cater for missing annotations or the incorporation of physiological features into the model to further improve the estimation of the ground truth [6]. Furthermore, existing approaches have no principled means of accounting for uncertainty due to missingness and/or the quality of the data. The aforementioned studies [3], [4], [6], [9], [10] serve as the basis of comparison for the proposed models; a comprehensive comparison of such models is presented in the Supplementary Materials, Table III. Note that these models are unsupervised methodologies. Other related work concerns the use of supervised approaches. Ensemble methods, such as Bagging, Boosting, AdaBoost, and more recently Rotation Forests, combine the predictions from multiple base learners to form ensembles, which typically achieve more accurate aggregate predictions than the individual base learners. Ensemble learning, in general, is a performance-driven approach that involves a measure of the quality of the base learners (e.g., accuracy), either for training or for estimating the weights of each base learner, for which labelled training data (i.e., training data with a ground truth) are required. Also, the interpretability of ensembles is critically limited [11]. Our work differs from this body of literature in that the proposed learning method does not require a ground truth, which is commonly unavailable in medical and biomedical datasets.

III. NOVELTY
In this paper, we build on our previous model [7], and propose a fully-Bayesian approach via Gibbs sampling for fusing continuous-valued labels from independent and/or potentially-correlated annotators to form a consensus in an unsupervised manner. The labels may be, for example, QT intervals measured from an ECG, timestamps of abnormal ECG peaks from a sensor-based signal, or respiratory rate values estimated from a wearable device. This allows, for the first time, a full distribution over the posterior estimates, due to the use of a strongly-Bayesian approach, and an explicit quantification of our uncertainty in the estimates using a signal quality extension. Additionally, the proposed framework relaxes the strong assumptions of existing approaches: we take into account the precision and bias of the individual annotators, in addition to estimating the correlation that exists between annotators. All existing algorithms from the literature assume that the experts are independent, including our previously proposed model [7]; in this study, we aim to demonstrate that its performance can be improved using the newly proposed models. Allowing such correlations is an important means of representing, for example, the distinction between "experts" and "novices", or between human experts and automated algorithms. To the best of the authors' knowledge, this is the first attempt to perform such a task, and we will demonstrate that explicit modelling of these correlations improves model performance with respect to the state-of-the-art. We aim thereby to make the case that the methods proposed in this paper increase robustness in our example sensor-based healthcare applications, such that the results can support the use of patient-specific personalised treatments.

IV. PROPOSED GENERATIVE MODELS
The generative models considered in this paper are stochastic processes that are assumed to have generated the available observations, given some model parameter values. These models are fully probabilistic, in that we model the joint probability distribution over all parameters.

A. Modelling the Latent Ground Truth
Suppose that there are N records with labels (annotations). The underlying ground truth for the ith record, z_i (such as the timestamp of a physiological event, a predicted continuous value of a vital sign, or a QT interval value in the ECG), can be assumed to be drawn from a Gaussian distribution with mean a_i and variance 1/b. The probability density function (pdf) of z_i is defined as:

p(z_i | a_i, b) = N(z_i | a_i, 1/b),

where a_i can be expressed as a linear regression function f(w, x_i) with an intercept w_0:

a_i = f(w, x_i) = x_i^T w.

Here, w denotes the coefficients of the regression (which include w_0^1), and x_i is a column feature vector for the ith record containing d features (i.e., we have an (N × d)-dimensional design matrix, X = [x_1; ...; x_N]). To cater for the modelling of w_0, a scalar value of one is prepended to each feature vector (i.e., x_i := [1, x_i]). Furthermore, the precision of the ground truth, defined as the inverse variance b, is assumed to be drawn from a gamma distribution^2:

b ~ Gamma(k_b, ϑ_b),

where k_b is the shape parameter and ϑ_b is the scale parameter.
If one further assumes that the ground truth is drawn independently for each of the N records, the conditional pdf of the vector z is given by:

p(z | a, b) = ∏_{i=1}^{N} N(z_i | a_i, 1/b).

For example, z_i may represent the QT interval value of an individual, drawn from a Gaussian distribution centered around its mean x_i^T w (which is determined by features in x_i such as heart rate and age) with variance 1/b. For a total of N subjects or N recordings, there are N independent Gaussians.
^1 w_0 models the overall offset predicted in the regression, which is distinct from the annotator-specific bias φ in the proposed models described in later sections.
^2 A gamma distribution can be defined as Gamma(x | k, ϑ) = x^{k−1} e^{−x/ϑ} / (Γ(k) ϑ^k), where k is the shape of the distribution, ϑ is the scale of the distribution, and Γ(·) is the gamma function. The gamma distribution is commonly used to model positive continuous values, and it is therefore assumed that precision values are drawn from a gamma distribution.
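The ground-truth model above can be sketched as follows. The sizes and hyperparameter values here (N, d, k_b, ϑ_b, and the randomly drawn w) are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (assumed values, for demonstration).
N, d = 200, 2                     # records and features (excluding intercept)
k_b, theta_b = 3.0, 0.0002        # gamma shape and scale for the precision b

X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])  # prepend 1 for w_0
w = rng.normal(size=d + 1)        # regression coefficients, including w_0

b = rng.gamma(shape=k_b, scale=theta_b)        # precision of the ground truth
a = X @ w                                      # per-record mean a_i = x_i . w
z = rng.normal(loc=a, scale=1.0 / np.sqrt(b))  # latent ground truth z_i
```

With these hyperparameters, b has mean 6e-4, i.e., a ground-truth standard deviation of roughly 40 (on a QT-like millisecond scale).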

B. The Independent Annotator Model
Assuming the presence of N recordings, we have a dataset D in which y_i^j corresponds to the annotation provided by the jth annotator for the ith record, with a total of R annotators. In this model, it is assumed that y_i^j is a noisy version of z_i, with Gaussian distribution N(y_i^j | z_i, (σ^j)^2). The motivation for the latter comes from the central limit theorem: given the assumption that the annotations for a given annotator are independent and identically distributed, their residuals (i.e., errors in annotations) with respect to the latent ground truth will converge to a Gaussian distribution. In the absence of prior knowledge, this assumption provides a robust and generalisable model for the given data, as will be demonstrated. Here, σ^j is the standard deviation associated with the jth annotator and represents the spread of that annotator's labels around the ground truth z_i. Furthermore, the bias of each annotator, which measures the average difference between the estimate and the ground truth, can be modeled as an additional term, denoted φ^j. The pdf of y_i^j can then be written as:

p(y_i^j | z_i, φ^j, λ^j) = N(y_i^j | z_i + φ^j, 1/λ^j),

where (σ^j)^2 is replaced with 1/λ^j. Here, λ^j is the precision of the jth annotator, defined as the estimated inverse variance for annotator j. It is assumed that y_i^1, ..., y_i^R are conditionally independent given the ground truth z_i; with the further assumption that records are independent, the conditional pdf of y can be modeled as:

p(y | z, φ, λ) = ∏_{i=1}^{N} ∏_{j=1}^{R} N(y_i^j | z_i + φ^j, 1/λ^j).

The assumption of conditional independence may not hold in cases where the annotations are generated by algorithms, some of which may be variations in implementation of the same general approach. Nevertheless, this assumption was made to simplify the model and the subsequent derivation of the likelihood (see the relaxation of the independence assumption in Section IV-D). The pdf of the bias for annotator j, φ^j, is assumed to be drawn from a Gaussian distribution with mean μ_φ and variance 1/α_φ [9]:

p(φ^j | μ_φ, α_φ) = N(φ^j | μ_φ, 1/α_φ).

Although the biases of the annotators might be assumed to have other distributions, such choices are likely to be dataset-dependent. In the absence of any knowledge of the underlying distribution of biases, we adopt the strategy of assuming them to be drawn from a Gaussian distribution. As described earlier, it is assumed that precision values, such as λ^j and α_φ, are drawn from gamma distributions, with parameters (k_λ, ϑ_λ) and (k_α, ϑ_α), respectively:

λ^j ~ Gamma(k_λ, ϑ_λ),    α_φ ~ Gamma(k_α, ϑ_α).

Fig. 1. Graphical model of BCLA. y_i^j, the annotation provided by the jth annotator for the ith record, is modeled via z_i (the unknown underlying ground truth), φ^j (bias), and λ^j (precision). z_i is drawn from a Gaussian distribution with mean a and variance 1/b, where a for the ith record is a function of the feature vector x_i through the linear regression f(w, x), with w the coefficients of the regression. φ^j is modeled by a Gaussian distribution with mean μ_φ and variance 1/α_φ. b, λ^j, and α_φ are drawn from gamma distributions with parameters (k_b, ϑ_b), (k_λ, ϑ_λ), and (k_α, ϑ_α), respectively.

As an example, when predicting a QT interval for the ith subject, the QT interval y_i^j estimated by algorithm j is assumed to lie within approximately one standard deviation, (λ^j)^{−1/2}, of the true QT value for subject i (i.e., z_i), up to an offset φ^j that is specific to algorithm j.
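As a sketch of this generative process, the following draws biased, noisy annotations around a latent ground truth; the hyperparameter values and the QT-like scale of z are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

N, R = 200, 5                        # records and annotators
z = rng.normal(400.0, 40.0, size=N)  # latent ground truth, e.g. QT in ms

# Annotator-specific biases and precisions, drawn as the model assumes;
# these hyperparameter values are illustrative only.
phi = rng.normal(0.0, 10.0, size=R)      # biases phi_j
lam = rng.gamma(4.0, 0.0003, size=R)     # precisions lambda_j

# y_i^j ~ N(z_i + phi_j, 1/lambda_j); each column holds one annotator's labels.
noise = rng.normal(size=(N, R)) / np.sqrt(lam)[None, :]
Y = z[:, None] + phi[None, :] + noise
```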

C. BCLA - A Joint Model of Ground Truth and Independent Annotators
The Bayesian continuous-valued label aggregator (BCLA) [7] is a straightforward means of combining the ground truth and annotator models (see Sections IV-A and IV-B, respectively). It comprises two key contributions: (i) BCLA provides an unsupervised estimation of continuous-valued annotations, which is valuable for time-series data as well as for the durations of events in physiological data; (ii) it introduces a unified framework for combining continuous-valued annotations to infer the underlying ground truth, while jointly modelling the annotators' bias and precision values. The graphical form of BCLA is presented in Fig. 1. Under the assumption that records are independent, the likelihood of the parameters θ = {w, λ, φ, α_φ, b, z} for a given dataset D can be formulated as:

p(D | θ) = ∏_{i=1}^{N} [ N(z_i | x_i^T w, 1/b) ∏_{j=1}^{R} N(y_i^j | z_i + φ^j, 1/λ^j) ].

For the first time, we here propose a fully-Bayesian approach to BCLA modelling using Gibbs sampling, denoted BCLA-Gibbs. We note that previous methods in the literature are not fully Bayesian (using either maximum likelihood or a maximum a posteriori (MAP) approach for estimating model parameters); our previous work [7], for example, uses a MAP approach (denoted BCLA-MAP). The posterior probability of the parameters θ for a given dataset D can be written using Bayes' theorem as:

p(θ | D) ∝ p(D | θ) p(θ),

where p(θ) is the product of the priors over the parameters. Each parameter in the BCLA-Gibbs likelihood is assumed to be independent, and can therefore be updated in a fully-Bayesian manner by sampling from its conditional posterior distribution with its hyperparameters (denoted by *). The derivations and convergence criterion of BCLA-Gibbs, as well as its implementation, are explained in the Supplementary Materials.
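As one illustrative component of such a Gibbs sampler, the conditional update of z follows from Gaussian conjugacy. This is a minimal sketch assuming complete annotations (no missing labels), not the authors' full implementation (which is given in their Supplementary Materials):

```python
import numpy as np

def gibbs_update_z(Y, phi, lam, a, b, rng):
    """One Gibbs update of the latent ground truth z, vectorised over records.

    By Gaussian conjugacy, z_i | rest ~ N(m_i, 1/p) with
        p   = b + sum_j lambda_j,
        m_i = (b * a_i + sum_j lambda_j * (y_i^j - phi_j)) / p.
    Y: (N, R) annotations; phi: (R,) biases; lam: (R,) precisions;
    a: (N,) prior means x_i . w; b: scalar prior precision.
    """
    p = b + lam.sum()                      # posterior precision (shared)
    m = (b * a + (Y - phi) @ lam) / p      # precision-weighted posterior means
    return rng.normal(m, 1.0 / np.sqrt(p))
```

With very precise annotators (large λ) the sampled z collapses onto the debiased annotation average, as expected.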

Learning From Incomplete Data Using BCLA-Gibbs
For the N annotations from the R annotators, we should consider the case in which there are missing annotations from different annotators (i.e., not all annotators have labelled all recordings). In such a case, the hyperparameters of the posterior distribution of BCLA-Gibbs are re-written such that sums run only over the available annotations. Here, U^j is the set of records with annotations provided by the jth annotator, V_i is the set of annotators that provided annotations for the ith record, and N^j is the number of records annotated by the jth annotator. w can be learnt by finding the zero gradient of the expectation of the complete-data log-likelihood, giving

w = ( ∑_{i=1}^{N} x_i x_i^T )^{−1} ∑_{i=1}^{N} x_i z_i.

The above allows us to cope robustly with the commonly-encountered difficulties arising from incomplete (or even sparse) labelling, in a principled and probabilistic manner.
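The closed-form update for w above is ordinary least squares of the current ground-truth estimate on the features; a minimal sketch:

```python
import numpy as np

def update_w(X, z):
    """Closed-form update w = (sum_i x_i x_i^T)^{-1} sum_i x_i z_i,
    i.e., ordinary least squares of the current ground-truth estimate z
    on the design matrix X (first column of ones for the intercept w_0)."""
    return np.linalg.solve(X.T @ X, X.T @ z)
```

Using `np.linalg.solve` on the normal equations avoids forming the explicit matrix inverse, which is both faster and numerically safer.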

D. BCLAc - A Combined Ground Truth and Correlated Annotator Model
The BCLA model assumes that annotators are conditionally independent given the features which define the ground truth. It does not factor in the possible dependency (or correlation) between individual annotators, which might occur between sets of automated algorithms. For example, as outlined previously, a set of algorithms based on different implementations of the same analytical method might be expected to yield correlated errors in their respective labels. Incorporating a correlation measure into the annotator model could therefore allow for a better aggregation of the inferred ground truth. Annotators who are considered to be anomalous (i.e., those that are highly correlated with other annotators but which have relatively large variances and biases) should be penalised with lower weighting for their labels; conversely, expert annotators (i.e., those that are highly correlated with other annotators but which have relatively small variances and biases) should have their labels weighted more heavily in the model. A generative framework for modelling BCLA with correlated annotators (denoted BCLAc) is now described.
A multivariate normal distribution^3 can be applied to the annotator model, using the covariance matrix (denoted Σ) to describe the correlation among annotators, as well as providing a constraint on the biases φ. The Inverse-Wishart (IW) distribution^4 is used as a prior for the covariance matrix Σ, and the bias values φ for all annotators are modeled using a multivariate normal distribution with mean μ_φΣ and covariance Σ/k_0. The graphical representation of BCLAc is shown in Fig. 2.
As illustrated in Fig. 2, only the annotator model has been modified in BCLAc when compared to the BCLA model, by introducing the covariance among annotators. Assuming that each record is independent, the conditional pdf of the modified annotator model with covariance becomes:

p(y | z, φ, Σ) = ∏_{i=1}^{N} N(y_i | z_i 1 + φ, Σ),

where Σ is the covariance matrix of the R annotators and there are N recordings. The matrix Σ can be further decomposed into a correlation matrix and the precision values of the annotators. Using the separation strategy proposed by Barnard et al. [12], Σ is formulated as:

Σ = Q ρ Q,

where Q is an R-by-R diagonal matrix with entries (λ^j)^{−1/2}. Here, λ^j is the precision value for the jth annotator, and ρ is the correlation matrix of the annotation errors among the R annotators. The ρ matrix contains Pearson product-moment correlation coefficients [13], where each element ρ_pq ∈ [−1, 1]. The correlation coefficient ρ_pq of two sets of N annotations (i.e., y^p and y^q), provided by annotators p and q respectively, is the covariance of their errors normalised by the product of the errors' standard deviations. This coefficient ρ_pq = 0 when both annotators' errors are independent of each other, whereas ρ_pq ∈ [−1, 0) or (0, 1] indicates that the annotators' errors are negatively or positively correlated in a linear manner, respectively (i.e., their errors decrease or increase together throughout the recordings). The biases of individual annotators are now assumed to be drawn from a multivariate normal distribution constrained by Σ, with conditional pdf:

p(φ | μ_φΣ, Σ, k_0) = N(φ | μ_φΣ, Σ/k_0),

where μ_φΣ is the prior mean for φ, and k_0 is a positive scalar that expresses our belief in μ_φΣ. The posterior of the parameters θ_c = {φ, Σ, b, z} for a given dataset D can be written using Bayes' theorem as:

p(θ_c | D) ∝ p(D | θ_c) p(θ_c).

The Gibbs sampler can be used to estimate the covariance matrix Σ directly, without modelling the precision and correlation individually; the ground truth z_i can then be estimated using the precision values derived from the estimated Σ. The posterior of the biases φ and the covariance Σ can be modeled jointly using a conjugate prior defined by the multivariate normal inverse-Wishart distribution. Each parameter in the likelihood described in equation (15) can therefore be updated by sampling from its conjugate prior distribution with posterior hyperparameters (denoted by *). We recall that V_i is the set of annotators that provided annotations for the ith record. U is a 1-by-R vector, each of whose elements indicates the total number of annotations provided by the respective annotator, excluding missing annotations.

^3 The pdf of the d-dimensional multivariate normal distribution can be defined as N(x | μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−(x − μ) Σ^{−1} (x − μ)^T / 2), where μ is 1-by-d and Σ is a d-by-d symmetric positive-definite matrix.
^4 The pdf of an Inverse-Wishart distribution for d-by-d symmetric positive-definite matrices X and T, and where v is a scalar greater than or equal to d, is IW(X | T, v) = |T|^{v/2} |X|^{−(v+d+1)/2} exp(−tr(T X^{−1})/2) / (2^{vd/2} Γ_d(v/2)), where Γ_d(·) is the multivariate gamma function.
The sample mean difference between the inferred ground truth and the annotations across the N^j recordings provided by the jth annotator (excluding records with missing annotations) also enters these hyperparameter updates. μ_φΣ is the prior mean for φ, and k_0 defines the belief in this prior mean. S is proportional to the prior mean for Σ, and v defines our belief concerning S; v must also satisfy the condition v > R − 1. The precision values λ for the R annotators can be estimated as [diag(Σ)]^{−1}. w can be learnt from the complete-data log-likelihood as described before. After convergence of the Gibbs sampler, the value of each parameter can be approximated by calculating the mean of its samples. See the Supplementary Materials for details concerning the implementation of BCLAc.
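The separation Σ = QρQ, and the recovery of precisions as [diag(Σ)]^{−1}, can be sketched as follows (a direct implementation of the decomposition, not the authors' code):

```python
import numpy as np

def separate_covariance(Sigma):
    """Barnard-style separation Sigma = Q @ rho @ Q of an annotator
    covariance matrix into a diagonal scale matrix Q (per-annotator
    standard deviations) and the correlation matrix rho of the errors.
    Per-annotator precisions follow as 1 / diag(Sigma)."""
    sd = np.sqrt(np.diag(Sigma))
    Q = np.diag(sd)
    rho = Sigma / np.outer(sd, sd)       # unit diagonal by construction
    precisions = 1.0 / np.diag(Sigma)
    return Q, rho, precisions
```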

A. Data Description
This section introduces datasets from exemplar clinical applications involving the potential personalisation of patient care using health-sensor data, applications which would directly benefit from the improvements in robustness offered by the work described in this paper.
1) Simulated QT Datasets With Correlated Annotators: A total of 1,000 simulated patient records were generated, each with five annotators providing correlated QT interval (i.e., the distance between the beginning of the Q wave and the end of the T wave on an electrocardiogram) annotations. The numbers of annotators and annotations were selected here for the purpose of demonstration; application to other datasets using different numbers of annotators and annotations is described later. The simulated dataset was generated using the following parameter values: the true annotation for each record, z_i ~ N(400, 40), with mean a = 400 ms and standard deviation b^{−1/2} = 40 ms. The gold standard of the simulated QT dataset was defined following the ICH E14 clinical guidelines [14]. No additional features were considered in this simulation (i.e., x_i = 1), as it was used solely to investigate the performance of the model in estimating correlation between annotators, but an intercept was assumed for f(w, x). Furthermore, it was assumed that α_φ ~ Gamma(3, 0.0005) and b ~ Gamma(3, 0.0002). The simulated annotations of the five annotators for a particular record were y_i ~ N(z_i + φ, Σ), with Σ ~ IW(τ, 5) for the five annotators, where τ is a diagonal matrix whose diagonal elements are the expected mean of Gamma(4, 0.0003). The biases, φ ~ N(μ_φΣ, Σ/k_0), have a 0 ms mean vector and covariance Σ/k_0. The scalar k_0 describes the confidence in the mean μ_φΣ of the distribution of biases φ, and was set to k_0 = 1 for the purposes of illustration. The true precision values of the individual annotators, λ, can be estimated as the inverses of the diagonal elements of Σ (for which each annotator's self-correlation is ρ = 1). The correlation matrix of the annotation errors is then obtained by decomposing the Σ matrix, allowing the correlation of any pair of annotators (e.g., a_1 and a_2) to be estimated.

The estimated correlations of errors of BCLAc for five annotators are shown in Fig. 3(a). They differ from the correlation matrix derived directly from the annotations provided by these annotators, shown in Fig. 3(b): annotations from all five annotators are positively correlated, with ρ > 0. This phenomenon can be explained by the fact that the correlation measures a change in trend (i.e., an increase or decrease of two variables together, or inversely) for two sets of annotations taken from a pair of annotators; as long as the annotations both increase similarly, the correlation coefficients will be positive. However, the true correlation of errors among annotators is not based on the absolute variation of values about their respective means; instead, it is an estimate of correlation obtained from the difference between the annotations provided and the underlying ground truth. That is, Fig. 3(a) captures the variation of labels around the latent ground truth, which is different from the variation of the absolute values of labels about their means, shown in Fig. 3(b).

Fig. 3. (a) The estimated correlations of annotation errors from BCLAc for five annotators. (b) The correlation matrix derived directly from the annotations provided by these annotators. Note that each correlation matrix has the identity subtracted, to remove the correlation of an annotator with itself.
To test the robustness of the BCLAc-Gibbs approach, it was applied to six simulated QT datasets generated from the BCLAc model. Simulated datasets of 100, 500, and 1000 records were generated with the parameter values described above, each with (i) five and (ii) 10 correlated QT interval annotations from corresponding annotators.
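A sketch of this simulation, following the parameter values in the text; the degrees of freedom passed to the inverse-Wishart draw and the random seeds are assumptions for illustration:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(2)
N, R = 1000, 5

# Parameter values follow the description in the text: tau's diagonal uses
# the mean of Gamma(4, 0.0003), and k_0 = 1.
z = rng.normal(400.0, 40.0, size=N)                 # true QT intervals (ms)
tau = np.eye(R) * (4 * 0.0003)                      # diagonal scale matrix
Sigma = invwishart.rvs(df=R, scale=tau, random_state=3)  # annotator covariance
phi = rng.multivariate_normal(np.zeros(R), Sigma)        # biases (k_0 = 1)

# Correlated annotations: y_i ~ N(z_i * 1 + phi, Sigma) for each record.
Y = z[:, None] + phi[None, :] \
    + rng.multivariate_normal(np.zeros(R), Sigma, size=N)
```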
2) Publicly-Available QT Dataset: We also used the 2006 PhysioNet "Computing in Cardiology Challenge" QT dataset [15]: 548 ECG records were provided, with 38,621 annotations of the QT interval sourced from: 20 human annotators in Division 1 of the dataset; 48 automated algorithms in Division 2; and 21 automated algorithms in Division 3. An additional division (Division 4) was defined as the union of the labels from all automated algorithms from Divisions 2 and 3. Division 1 was used in the competition to generate the reference annotations, and we therefore focused on the analysis of the sets of automated labels (i.e., Divisions 2, 3, and 4). The records were obtained from 290 subjects (209 men with mean age of 55.5 and 81 women with mean age of 61.6), each represented by between one and five recordings. About 20% of the subjects were healthy controls, and the remainder had a variety of ECG morphologies, with QT intervals ranging from 256 ms to 529 ms. Diagnostic classifications are detailed elsewhere [16].
Not all annotators provided annotations for all records; a minimum of 33% of the annotators labelled any single recording. For the work described in this paper, only those annotators that provided annotations for more than 50% of records were used, corresponding to 40 annotators in Division 2, 15 annotators in Division 3, and 55 annotators in Division 4.
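The annotator-selection rule above (keeping annotators that labelled more than 50% of records) can be sketched as a simple coverage filter; `keep_frequent_annotators` is a hypothetical helper name, with missing annotations encoded as NaN:

```python
import numpy as np

def keep_frequent_annotators(Y, min_fraction=0.5):
    """Keep only annotators (columns of Y) that labelled more than the given
    fraction of records; missing annotations are encoded as NaN.
    Returns the filtered matrix and the boolean keep-mask over annotators."""
    coverage = np.mean(~np.isnan(Y), axis=0)   # fraction labelled per annotator
    mask = coverage > min_fraction
    return Y[:, mask], mask
```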
3) Capnobase Respiratory Rate (RR) Dataset: The Capnobase dataset [17] was collected from subjects undergoing elective surgery and routine anaesthesia. It consists of photoplethysmogram (PPG) recordings from a pulse oximeter and capnometry data (F_s = 300 Hz), from 59 children (median age: 9, range: 1-17 years) and 35 adults (median age: 52, range: 26-76 years). We used the set as described in [18], which has 42 recordings of 8-minute duration (336 minutes in total) containing reliable recordings of spontaneous breathing or controlled ventilation. The capnometric waveform was used as the reference for RR estimates derived from the PPG (we note that estimating RR from a pulse oximeter is an important application of physiological monitoring using wearable devices) [18].

Fig. 4. Example of a 32-second PPG window used for RR estimation. AM (green), BW (blue), and FM (red) respiratory modulations are extracted. For the FFT-based method, the power spectrum is calculated for each modulation using the FFT, and the maximum power is selected within the physiologically-plausible RR range (grey area); for the AR-based method, the poles for each modulation are determined using an AR model, and the dominant pole within the plausible range of RR (grey area) is selected. Our proposed models are then used to combine estimates and provide a final "fused" value for that window.
RR was computed for 32-second windows, with successive windows having 29 s of overlap. To extract the three respiratory-induced modulations (AM, BW, FM), beat detection was performed on the PPG using a segmentation algorithm [19]; the latter produces a series of maximum and minimum intensities for each pulse. As shown in Fig. 4, the series of maximum intensities of the PPG pulses was used for extracting the BW time-series. The (max − min) amplitude was used to derive the AM time-series. The intervals between successive beats were used to extract the FM time-series. RR was estimated using two different spectral approaches that have been used in the literature: Fourier analysis (FFT) and autoregressive (AR) modelling. Spectral analysis requires evenly-sampled data, and so each time-series (corresponding to BW, AM, and FM) was first re-sampled at 4 Hz using linear interpolation. The frequency spectra of the resulting respiratory signals were then calculated. The frequency at which the maximum intensity of each spectrum occurred, within the frequency range of interest (corresponding to 3 to 60 breaths-per-minute, or bpm), was taken as the respiratory frequency (Fig. 4). For the AR method, an AR model of order 7 was fitted to each time-series, and the respiratory frequency was identified as that corresponding to the pole with the greatest magnitude within the plausible range of frequencies for respiration. We note that for each window and each approach (FFT and AR), three RR estimates were determined (i.e., one each from the BW, AM, and FM signals).
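A minimal sketch of the FFT-based RR estimator described above (resample at 4 Hz, compute the power spectrum, pick the peak in the 3-60 bpm range); `estimate_rr_fft` is a hypothetical helper, not the authors' implementation:

```python
import numpy as np

def estimate_rr_fft(ts_times, ts_values, fs=4.0, rr_range=(3.0, 60.0)):
    """Estimate respiratory rate (breaths/min) from an unevenly-sampled
    respiratory modulation series (e.g., a BW, AM, or FM time-series).

    The series is linearly resampled at fs Hz, its power spectrum is computed
    with an FFT, and the spectral peak within the plausible RR range is
    returned as the RR estimate.
    """
    t_even = np.arange(ts_times[0], ts_times[-1], 1.0 / fs)
    x = np.interp(t_even, ts_times, ts_values)   # linear interpolation to 4 Hz
    x = x - x.mean()                             # remove DC before the FFT
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs_bpm = np.fft.rfftfreq(len(x), d=1.0 / fs) * 60.0
    mask = (freqs_bpm >= rr_range[0]) & (freqs_bpm <= rr_range[1])
    return freqs_bpm[mask][np.argmax(power[mask])]
```

For a 32-second window at 4 Hz (128 samples), the spectral resolution is 60 x 4 / 128 ≈ 1.9 bpm, which bounds the granularity of each single-modulation estimate before fusion.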

B. Methodology Evaluation
1) Simulated and Real QT Datasets:
The precision values λ and biases φ inferred by BCLA-Gibbs and BCLAc-Gibbs were compared with those estimated using BCLA-MAP. For the estimation of the ground truth, the root-mean-squared error (RMSE) was calculated using the "gold standard" reference provided. Results were additionally compared with the following methodologies, which represent the existing state of the art: (i) an EM-based formulation proposed by Raykar et al. [6] (denoted EM-R); (ii) Scalar Simultaneous Truth and Performance Level Estimation (denoted "sSTAPLE"), proposed by Warfield et al. [3]; (iii) mean voting between the annotations provided by the "experts"; and (iv) median voting between the annotations as above. To assess the performance of the proposed models as a function of the number of annotators, a random subset of annotators was selected 500 times, with the number of annotators being R = {3, 5, 7, 9} in Division 4. The minimum number of annotators was chosen to be R = 3 to allow results to be obtained via the median voting approach. A Friedman test (p < 0.01) was applied to the bootstrapped RMSEs to provide a comparison between the various methods.
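The bootstrapped comparison described above can be sketched as follows (a synthetic illustration only: the annotator biases and noise levels are invented, and the fusion rules shown are the naive mean, the median, and, as an extra baseline of our own, a trimmed mean):

```python
import numpy as np
from scipy.stats import friedmanchisquare, trim_mean

rng = np.random.default_rng(0)

def bootstrap_rmses(annotations, truth, fusers, n_boot=500, r=5):
    """Sub-sample `r` annotators with replacement `n_boot` times and
    compute the RMSE of each fusion rule against the reference, using
    the same annotator subset for every rule so the test is paired."""
    R, _ = annotations.shape
    out = {name: np.empty(n_boot) for name in fusers}
    for b in range(n_boot):
        sub = annotations[rng.integers(0, R, size=r)]
        for name, fuse in fusers.items():
            out[name][b] = np.sqrt(np.mean((fuse(sub, axis=0) - truth) ** 2))
    return out

# Synthetic "experts": 9 biased, noisy annotators labelling 500 records
truth = rng.normal(400.0, 20.0, size=500)              # e.g., QT intervals (ms)
bias = rng.normal(0.0, 10.0, size=(9, 1))
annotations = truth + bias + rng.normal(0.0, 5.0, size=(9, 500))

fusers = {
    "mean": np.mean,
    "median": np.median,
    "trimmed": lambda a, axis: trim_mean(a, 0.2, axis=axis),
}
rmses = bootstrap_rmses(annotations, truth, fusers)
stat, p = friedmanchisquare(rmses["mean"], rmses["median"], rmses["trimmed"])
```

The Friedman test then compares the paired bootstrap RMSE samples of the competing methods without assuming normality, as in the evaluation above.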
2) RR Dataset: In the context of personalised care, RRs were inferred for each subject individually over all windows. The mean absolute error (MAE) of the RR estimates inferred from the PPG across all subjects using the proposed fusion frameworks, BCLA and BCLAc, was computed with respect to the "gold standard" reference RR values from capnography. The BCLA and BCLAc models were applied to the RR estimates extracted using the conventional FFT- and AR-based algorithms outlined earlier [20] for each individual subject. Additionally, the performance of BCLA-Gibbs and BCLAc-Gibbs was compared to that of "Smart Fusion" (i.e., the benchmarking algorithm in this field, proposed by Karlen et al. [18]), the best-performing (lowest-MAE) single algorithm (denoted "Best"), EM-R, sSTAPLE, BCLA-MAP, and the traditional naïve mean and median voting approaches. To further analyse the performance of the different voting algorithms with varied signal qualities, a comparison was made in which we computed the MAE of (i) RR estimates for all windows; (ii) only RR estimates for those windows satisfying the criteria considered by "Smart Fusion" (denoted *); and (iii) only RR estimates for noisy windows (i.e., those which are rejected by "Smart Fusion"). We note that the set of windows in (i) is the union of those in (ii) and (iii).
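For reference, the window-rejection rule attributed to "Smart Fusion" in this paper (discard a window when the spread of its individual RR estimates exceeds 4 bpm, otherwise fuse by the mean) can be sketched as follows; this is our own minimal reading of the rule as described here, not the original implementation:

```python
import numpy as np

def smart_fusion_window(rr_estimates, sd_threshold=4.0):
    """Fuse the per-window RR estimates in the style of "Smart Fusion":
    reject the window as too noisy if the estimates disagree by more
    than `sd_threshold` bpm (standard deviation), else return the mean."""
    rr_estimates = np.asarray(rr_estimates, dtype=float)
    if np.std(rr_estimates) > sd_threshold:
        return np.nan          # window rejected as unreliable
    return float(np.mean(rr_estimates))
```

A window with concordant estimates, e.g. [12, 13, 12.5] bpm, is fused to 12.5 bpm, whereas a discordant window such as [5, 20, 30] bpm is rejected.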

VI. RESULTS AND DISCUSSION
A. Simulated QT Datasets
BCLA-Gibbs took approximately 350 s to generate 10,000 draws of the 2,500 annotations created from R = 5 annotators for N = 500 recordings, using MATLAB R2011a on a 3.3 GHz Intel Xeon processor. In comparison, BCLAc-Gibbs took approximately 404 s to generate the same number of draws. Half of the samples were discarded as burn-in.
When R = 5 annotators were considered, the RMSE results shown in Fig. 5(a) varied depending on how the correlation of errors among annotators was generated. In the simulated datasets described earlier, we assumed that the uncertainty of the bias values φ was controlled by the covariance matrix Σ. In comparison, the BCLA-Gibbs approach assumed that the bias values φ were modeled independently of the covariance matrix Σ. It was therefore expected that BCLA-Gibbs would provide less reliable estimation of the true bias values φ. The BCLA-Gibbs approach performed consistently worse than BCLAc-Gibbs, as it systematically over-estimated the bias values φ of the annotators, as shown in Fig. 5(b). However, when there were R = 10 annotators, the BCLA-Gibbs approach was sufficient to provide reliable estimation without modelling the correlation of errors among annotators. The results obtained with BCLA-MAP were slightly worse than those obtained with BCLA-Gibbs in terms of RMSE; BCLA-MAP also provided less reliable estimates of the bias values than BCLA-Gibbs. In contrast to BCLA-MAP, the BCLA-Gibbs model not only provides estimates but also quantifies the confidence in its estimation; this is a key advantage of fully-Bayesian inference.

B. QT Dataset
BCLA-Gibbs took approximately 152 s to generate 5,000 draws of the 548 annotations, each provided by R = 5 algorithms, using the same system as before. With the same dataset, BCLAc-Gibbs took approximately 70 s to run 2,000 draws until convergence. When fusing a large number of algorithms (e.g., R = 55 with N = 26,922 annotations), BCLA-Gibbs required 5,000 draws and took about 500 s, whereas BCLAc-Gibbs ran for 290 s to compute 3,000 draws. Both methods discarded the first half of the samples as burn-in, as before. The resulting values of the parameters and hyperparameters of the BCLA and BCLAc models for the QT dataset are shown in Table I.
In Table I: b is the precision for the estimate of the ground truth; Σ is the covariance matrix; λ_M I is an R × R diagonal matrix with entries obtained from a scalar, λ_M, as the inverse of the mean of Gamma(k_λ, ϑ_λ); and λ refers to the annotator/signal-specific precisions. For the BCLA-Gibbs approach only: α_φ is the precision for the estimate of the bias from the ground truth. RR dataset: the values marked ‡ are determined under the assumption that the RR estimates provided by the best modulation signal are within ±2 bpm of the reference [21], [22]; the values marked † are estimated from the median RR estimates provided by the algorithms. QT dataset: the values marked * are determined under the assumption that the annotations provided by the best-performing algorithm are within ±5 ms of the reference [14]; further values are derived from [23]-[25] and from [26]-[28].
Both the BCLA and BCLAc methods produced accurate estimates of the bias values for Divisions 2 and 4; an example for Division 4 is shown in Fig. 6(b). Further results are presented in the Supplementary Materials. In the case of Division 3 (see Fig. 6(b) in the Supplementary Materials), BCLAc-Gibbs produced more accurate estimates of the bias values than BCLA-Gibbs and BCLA-MAP. However, for the prediction of σ, the BCLA-Gibbs and BCLA-MAP methods were more reliable than BCLAc-Gibbs. This might be because both the correlation of errors and the precision values were jointly modeled using one distribution (i.e., the IW distribution) in the BCLAc-Gibbs model, limiting its accuracy: a small imperfection in the estimation of the correlation of errors would directly affect the estimation of the precision values. In comparison to BCLA-MAP, the BCLA-Gibbs and BCLAc-Gibbs models provide accurate estimates together with a measure of confidence in those estimates. A further advantage of the BCLAc model is its ability to measure the correlation of errors among algorithms (or experts). An example of the correlation of errors inferred using BCLAc-Gibbs is shown in Fig. 6(c) for R = 55 algorithms in Division 4. Although the exact coefficient values were not fully recovered, BCLAc-Gibbs was able to identify the key correlation structure between algorithms, while a direct estimation of the correlation of annotations failed to do so. When comparing the estimation of the ground truth, the resulting RMSEs are given in Table II, where it may be seen that BCLAc-Gibbs produced the smallest errors, outperforming all other voting strategies. Fig. 7 shows a further evaluation of the accuracies (in terms of RMSE) of the different voting strategies as a function of the number of annotators. The results were generated by sub-sampling (i.e., bootstrap with replacement) annotators 500 times, with R = {3, ..., 9} annotators selected from Division 4.
A non-parametric Friedman test was conducted to compare all methods, yielding chi-squared values of 253.27, 348.84, 427.79, and 431.68 for R = {3, 5, 7, 9} annotators, respectively, all of which were significant (p < 0.01). We further performed a post-hoc test using the Dunn-Šidák approach [29] to determine whether the mean RMSEs of BCLA-Gibbs were significantly different from those of the other methods; no significant difference (p > 0.01) was observed. Nevertheless, BCLA-Gibbs outperformed all other voting strategies, with the smallest mean and median RMSEs as the number of annotators varied from three to nine (RMSEs of 26.64 ± 10.32 ms, 21.10 ± 5.64 ms, 19.26 ± 4.83 ms, and 18.34 ± 4.03 ms were obtained for R = {3, 5, 7, 9} annotators, respectively).
Inspecting the performance of BCLAc-Gibbs, it produced smaller RMSEs than the BCLA-MAP approach when R < 9: 27.22 ± 8.34 ms versus 29.26 ± 11.41 ms, 22.49 ± 6.28 ms versus 23.73 ± 7.67 ms, 20.77 ± 6.15 ms versus 21.24 ± 6.26 ms, and 19.58 ± 5.87 ms versus 19.15 ± 4.40 ms for R = {3, 5, 7, 9} annotators, respectively. Although the performance of BCLAc-Gibbs was sub-optimal compared to that of BCLA-Gibbs, it required only approximately half of the sampling time, and it additionally modelled the correlation of errors among annotators.
When working with incomplete data, our proposed models assume that annotations are missing at random given the observed data, as no direct correlation was found between the MAEs of the QT interval estimates and the lengths of the QT intervals. Future work will focus on simulating various degrees of incompleteness, and on exploring the performance of our models as a function of the percentage of missing annotations.
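A minimal sketch of the kind of missingness simulation envisaged above (missing completely at random; the masking fractions and synthetic data are illustrative):

```python
import numpy as np

def mask_annotations(annotations, frac_missing, rng):
    """Delete a fraction of annotations uniformly at random (missing
    completely at random); missing entries are marked as NaN."""
    masked = annotations.astype(float).copy()
    masked[rng.random(masked.shape) < frac_missing] = np.nan
    return masked

rng = np.random.default_rng(1)
annotations = rng.normal(400.0, 20.0, size=(5, 100))   # 5 annotators, 100 records
for frac in (0.1, 0.3, 0.5):
    masked = mask_annotations(annotations, frac, rng)
    # A naive baseline simply ignores missing entries per record; a record
    # missing from every annotator yields NaN here, whereas the proposed
    # Bayesian models handle such gaps within the inference itself.
    fused = np.nanmean(masked, axis=0)
```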

C. RR Dataset
BCLA-Gibbs took approximately 38 s to generate 5,000 draws of the N = 900 annotations provided by the R = 6 algorithms. Similarly, BCLAc-Gibbs took approximately 68 s; the first 3,000 samples were discarded as burn-in for both methods. The average time for 5,000 iterations using BCLA-MAP was under 2 s, similar to EM-R and sSTAPLE. The resulting values of the parameters and hyperparameters of the BCLA and BCLAc models for the Capnobase dataset are given in Table I.
Fig. 8 shows the MAE results across the 42 subjects. When all windows, or only those with a standard deviation greater than four bpm, were considered, the BCLA and BCLAc methods outperformed the single best-performing algorithm with the least MAE across subjects (i.e., BA in Fig. 8). When windows with a standard deviation greater than four bpm were discarded (with 55.4% of the windows thereby remaining), the proposed BCLA and BCLAc methods outperformed "Smart Fusion" but were slightly worse than BA*.
Furthermore, the results show that BCLAc-Gibbs had similar performance to the BCLA approaches, with their mean MAEs being closest to the "theoretical best" (TB) algorithm (i.e., selecting the algorithm with the least MAE per subject, which is of course impossible in practice, because it requires knowledge of the ground truth) for the three different groups. When using all windows, the MAE improved (i.e., smaller error) with BCLA-MAP over "Smart Fusion" for 69.1% of subjects. When windows were excluded if the standard deviation of the RR estimates was greater than four bpm, BCLA-MAP improved the MAE for 81.0% of subjects. Furthermore, the MAE improved with BCLA-Gibbs over "Smart Fusion" for 73.8% of subjects when using all available windows, and for 83.3% of subjects when excluding windows with a large standard deviation, as above. In addition, BCLA-Gibbs produced smaller MAEs than BCLA-MAP for 61.9% of subjects when using all windows and fusing R = 6 algorithms. When considering BCLAc-Gibbs using all windows, 17 subjects (40.5%) had smaller MAEs than with BCLA-Gibbs, while 22 and 30 subjects (52.4% and 71.4%, respectively) had smaller MAEs than with BCLA-MAP and "Smart Fusion". In comparison to BCLA, our BCLAc-Gibbs model provides the correlation of errors among algorithms (see Fig. 9); as BCLAc-Gibbs is a more complex model that has to learn an additional parameter (i.e., the correlation of errors), a larger set of annotators should be expected to be required to better infer the ground truth.
Our proposed models assume that each window in the latent ground-truth model is independent. This potentially limits their application in areas where a label produced by an algorithm has a Markov dependence on previous labels, such as when there is time dependency between windows. Future work could extend the current ground-truth model to Gaussian process regression, where window-dependent features can be modeled using a suitable covariance function.

D. Signal Quality Extension
A signal quality index (SQI) can be seen as a measure of task difficulty: noisy records/windows are harder to label due to noise contamination of the signals, while clean records can be seen as easier tasks for labelling.Experts are able to "filter" noise to some degree and provide consistently reliable annotations across data of differing noise levels, whereas non-experts might mistake noise for intrinsic features of the signals.Hence, an independent variable can be introduced into our proposed models that acts as a probabilistic score to better infer the underlying ground truth of a physiological signal.
To demonstrate proof-of-concept, we have included a scaling factor, t_i^j, in the range (0, 1], as an indication of record-specific and annotator-specific SQI in BCLA-Gibbs (denoted BCLA-SQI), where y_i^j is drawn from a normal distribution defined as N(y_i^j | z_i + φ_j, (t_i^j λ_j)^{-1}). In contrast to BCLA-Gibbs, the estimates of z_i and φ_j are now dependent on the signal quality t_i^j. Our preliminary results on the signal quality extension are shown in the Supplementary Materials.
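The effect of the SQI scaling can be illustrated by sampling from the modified likelihood (a synthetic sketch; the numerical values of z, φ, λ, and t are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_annotation(z_i, phi_j, lam_j, t_ij):
    """Draw y_i^j ~ N(z_i + phi_j, 1 / (t_ij * lam_j)): the signal quality
    t_ij in (0, 1] scales the annotator's precision, so low-quality
    windows yield noisier annotations."""
    return rng.normal(z_i + phi_j, np.sqrt(1.0 / (t_ij * lam_j)))

# The same annotator (bias 2.0, precision 4.0) on a clean vs a noisy window
clean = np.array([draw_annotation(60.0, 2.0, 4.0, 1.0) for _ in range(5000)])
noisy = np.array([draw_annotation(60.0, 2.0, 4.0, 0.1) for _ in range(5000)])
```

Both sets of annotations are centred on z + φ = 62, but the low-SQI window has roughly √10 times the spread; this is precisely the information the inference can use to down-weight noisy windows.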

VII. CONCLUSION
This paper has proposed two Bayesian generative models for aggregating automated labels to form a consensus where subjective continuous annotations of some presumed underlying ground truth are provided, but where the desired ground truth is not readily available in practice.This is motivated by the need to improve the robustness of methods using biosignals acquired from sensor data, such that the data can be used to support precision medicine.
Simulated and two clinical datasets were considered as exemplars to validate the proposed methods for aggregating the outputs of a group of imperfect automated algorithms in an unsupervised manner. The proposed models outperformed all of the comparison voting strategies in all datasets. When the modelling of potentially-correlated annotators was incorporated, it was shown that annotators can be grouped based on their correlated decision-making; for example, we might separate sets of annotators that perform well ("trained experts") from those that do not ("novices"). Both proposed models were robust in dealing with missing values, and no additional pre-processing is needed to discard noisy data. No training data need to be "held out" for optimising the parameters of the models, and no prior knowledge of the performance of each algorithm was given. It is important to note that the increased performance of our models can be explained by their ability to fit the data in a different manner for different individuals (records or subjects). That is, the proposed models do not contain fixed parameters (i.e., the precision and bias of an algorithm); rather, they adapt to the data from an individual by, for example, assigning a higher weight to a set of labels (which may be the results of one algorithm) that are relevant to that individual, while for a different individual the set of annotations from the same labeller may have a lower weight. This leads to a better understanding of an individual's physiology which, in turn, may lead to better-informed decisions for personalised care.

Fig. 1.
Fig. 1. Graphical representation of the BCLA model [7]: y_i^j corresponds to the annotation provided by the jth annotator for the ith record, and is modeled by z_i (the unknown underlying ground truth), φ_j (bias), and λ_j (precision). Furthermore, z_i is drawn from a Gaussian distribution with mean a and variance 1/b, where a for the ith record is a function of the feature vector x_i via a linear regression function f(w, x), with w being the coefficients of the regression. φ_j is modeled by a Gaussian distribution with mean μ_φ and variance 1/α_φ. The b, λ_j, and α_φ are drawn from gamma distributions with parameters (k_b, ϑ_b), (k_λ, ϑ_λ), and (k_α, ϑ_α), respectively.
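The generative process in this caption can be sketched as a forward simulation (the hyperparameter values below are illustrative, not those fitted in the paper, and a is taken as a constant rather than the regression f(w, x)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, R = 200, 5                                   # records and annotators

# Gamma hyperparameters (illustrative values only)
k_b, th_b = 2.0, 1.0
k_lam, th_lam = 2.0, 1.0
k_alpha, th_alpha = 2.0, 1.0
mu_phi = 0.0

# Precisions drawn from their gamma priors
b = rng.gamma(k_b, th_b)                        # ground-truth precision
alpha_phi = rng.gamma(k_alpha, th_alpha)        # bias precision
lam = rng.gamma(k_lam, th_lam, size=R)          # per-annotator precisions

# Latent ground truth and annotator biases
a = 400.0                                       # constant here; f(w, x_i) in the model
z = rng.normal(a, np.sqrt(1.0 / b), size=N)
phi = rng.normal(mu_phi, np.sqrt(1.0 / alpha_phi), size=R)

# Observed annotations: y_i^j ~ N(z_i + phi_j, 1 / lam_j)
y = rng.normal(z[:, None] + phi[None, :], np.sqrt(1.0 / lam)[None, :])
```

Gibbs sampling inverts this process, alternately drawing each of b, λ, α_φ, z, and φ from its conditional posterior given the observed y.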

Fig. 2.
Fig. 2. Graphical representation of the BCLAc model: y_i corresponds to the annotations provided by all annotators for the ith record, and is modeled by z_i (the unknown underlying ground truth), φ (biases), and Σ (covariance matrix). Furthermore, z_i is drawn from a Gaussian distribution with mean a and variance 1/b, where a for the ith record can be a function of the feature vector x_i and coefficients w. The biases φ are modeled by a multivariate Gaussian distribution with mean vector μ_φ and covariance Σ scaled by a factor k_0 (i.e., Σ/k_0). Σ is drawn from an Inverse-Wishart (denoted IW) distribution with parameters v and S. The b is drawn from a gamma distribution (denoted Gamma) with parameters k_b and ϑ_b.
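The corresponding forward simulation for BCLAc, with correlated biases and jointly-drawn annotations (again with invented hyperparameter values), might look like:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)
N, R = 200, 5
k0, v = 1.0, R + 2                  # scalar factor and IW degrees of freedom
S = np.eye(R)                       # IW scale matrix (illustrative)
k_b, th_b = 2.0, 1.0
mu_phi = np.zeros(R)

b = rng.gamma(k_b, th_b)                                 # ground-truth precision
Sigma = invwishart.rvs(df=v, scale=S, random_state=0)    # annotator covariance
phi = rng.multivariate_normal(mu_phi, Sigma / k0)        # correlated biases
z = rng.normal(400.0, np.sqrt(1.0 / b), size=N)          # latent ground truth

# All R annotations of record i are drawn jointly, sharing covariance Sigma
y = np.array([rng.multivariate_normal(z_i + phi, Sigma) for z_i in z])
```

Note the key structural difference from BCLA: the off-diagonal terms of Σ couple the annotators' errors, which is what allows the model to learn their correlation.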

Fig. 3.
Fig. 3. Correlation of errors for a simulated dataset: (a) the estimated correlation matrix of errors from five annotators using equation (16); (b) the correlation matrix derived directly from the annotations provided by these annotators. Note that an identity matrix is subtracted from each correlation matrix to remove the correlation of an annotator with itself.
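The caption's distinction between the correlation of errors and the correlation computed directly from the raw annotations can be reproduced numerically (a synthetic sketch with an invented noise structure): annotators that share no error source still appear correlated in the raw annotations, because all of them track the same ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
truth = rng.normal(0.0, 1.0, size=n)

# Annotators 0 and 1 share an error source; annotator 2 errs independently
shared = rng.normal(0.0, 1.0, size=n)
errors = np.stack([
    0.8 * shared + 0.2 * rng.normal(size=n),
    0.8 * shared + 0.2 * rng.normal(size=n),
    rng.normal(size=n),
])
annotations = truth[None, :] + errors

# Correlation of errors (requires the ground truth), self-correlation removed
corr_err = np.corrcoef(annotations - truth[None, :]) - np.eye(3)
# Correlation estimated directly from the annotations: inflated everywhere
corr_raw = np.corrcoef(annotations) - np.eye(3)
```

Here `corr_err` correctly isolates the shared error between annotators 0 and 1, while `corr_raw` shows a spuriously large correlation even for the independent annotator.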

Fig. 5.
Fig. 5. (a) Plots of RMSE using different strategies for the simulated datasets. For each dataset, the number of annotators and the number of annotations are indicated on the left and right of "×", respectively (i.e., we show R × N). (b) A comparison of the simulated and inferred bias values of R = 5 annotators in the 5 × 1000 dataset. The diagonal line indicates a perfect match between the simulated and estimated results. The results are plotted with one standard deviation from the mean.

Fig. 6. Fig. 7.
Fig. 6. A comparison of the QT reference and inferred (a) σ and (b) bias of each annotator for Division 4. The precision can be estimated as 1/σ². The diagonal (grey) line indicates a perfect match between the reference and estimated results. The results are plotted with one standard deviation from the mean for BCLA-Gibbs and BCLAc-Gibbs. (c) A comparison of the reference and inferred correlation of errors among the 55 algorithms for Division 4 in the QT dataset: in each row, the reference ρ (left) is compared with its values inferred using BCLAc-Gibbs (middle) and the correlation estimated directly from the annotations provided (right). Note that an identity matrix is subtracted from each correlation matrix to remove the correlation of an algorithm with itself.

Fig. 8.
Fig. 8. Box plot of MAEs across the 42 subjects for the different fusion approaches. The median MAEs are indicated by a red '-' in each box, while the black '×' indicates the mean MAE. The span of each box indicates the interquartile MAE range over the 42 subjects for each fusion method when combining the six algorithms. Notation: BA - the single best-performing algorithm with the least MAE across all subjects; TB - the theoretical best algorithm with the least MAE selected per subject; Mean* - "Smart Fusion". Note that results marked * indicate MAEs derived from windows excluding those that have a standard deviation greater than four bpm, while those marked ** indicate results derived only from windows that have a standard deviation greater than four bpm.

Fig. 9.
Fig. 9. Example of the correlation matrix of errors among the six algorithms estimated for one subject: (a) computed with the reference provided, and (b) computed using BCLAc-Gibbs without knowledge of the reference. Note that an identity matrix is subtracted from each correlation matrix to remove the correlation of an algorithm with itself.

TABLE I
THE PARAMETERS OF BCLA AND BCLAc FOR MODELLING THE CAPNOBASE RR AND QT DATASETS