Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation

Crowdsourcing is a popular tool for collecting large amounts of annotated data, but the specific format of the strong labels necessary for sound event detection is not easily obtainable through crowdsourcing. In this work, we propose a novel annotation workflow that leverages the efficiency of crowdsourcing weak labels, and uses a high number of annotators to produce reliable and objective strong labels. The weak labels are collected in a highly redundant setup, to allow reconstruction of the temporal information. To obtain reliable labels, the annotators' competence is estimated using MACE (Multi-Annotator Competence Estimation) and incorporated into the strong label estimation through weighing of individual opinions. We show that the proposed method produces consistently reliable strong annotations not only for synthetic audio mixtures, but also for audio recordings of real everyday environments. While only a maximum 80% coincidence with the complete and correct reference annotations was obtained for synthetic data, these results are explained by an extended study of how polyphony and SNR levels affect the identification rate of the sound events by the annotators. On real data, even though the estimated annotators' competence is significantly lower and the coincidence with reference labels is under 69%, the proposed majority opinion approach produces reliable aggregated strong labels in comparison with the more difficult task of directly crowdsourcing strong labels.


Irene Martín-Morató, Member, IEEE, and Annamaria Mesaros, Member, IEEE

I. INTRODUCTION
Annotated data is a key player in the development of machine learning methods. While advanced methods may be capable of learning from data without annotations or with only partial annotations, evaluation of their performance does require annotated data. The degree of difficulty and effort necessary for producing annotated audio datasets varies depending on the task. Some tasks require classification of audio at a coarse temporal level, such as the general-purpose audio tagging of Freesound content [1] or AudioSet [2]. On the other hand, tasks like sound event detection (SED) [3] or sound event localization and detection (SELD) [4] require a fine temporal resolution output, to indicate the onset and offset of sound event instances. The textbook case for training a SED system is based on strongly annotated data, in which textual labels, onsets, and offsets are provided for the sound event instances [5]. Such annotation requires a significant effort, and as a consequence strongly-labeled datasets of real-life recordings are small in size. Synthetic strongly-labeled data can be easily created [6], [7], but often lacks the complexity and variability of real acoustic environments, which creates a mismatch for methods expected to be used in practical situations. On the other hand, weakly-annotated data, which contains only textual labels to indicate the presence of different sound events, requires less annotation effort and has become the predominant type of data in the field.

The authors are with Computing Sciences, Tampere University, 33720 Tampere, Finland (e-mail: irene.martinmorato@tuni.fi; annamaria.mesaros@tuni.fi). Digital Object Identifier 10.1109/TASLP.2022.3233468
Research on SED and SELD is continuously developing, but the acute lack of strongly annotated datasets steers the approaches towards learning based on weak labels [8], [9] and semi-supervised methods [10]. There is also a large body of work that has produced powerful, highly-performing approaches using semi-supervised methods, such as the student-teacher learning paradigm, to compensate for the weak labels in learning [11], [12], [13]. For example, training is possible using unlabeled training data together with smaller amounts of weakly-labeled data, and possibly strongly-labeled synthetic data, as proposed by Turpault et al. [11]. However, there is always a need for strongly-labeled data for evaluation, and this is often manually annotated.
The measured system performance is dependent on the quality of the evaluation data, since the reference annotations of the evaluation dataset define what is considered correctly and erroneously detected in the system output. It is therefore important that these reference annotations are reliable, in order for the measured performance to reflect reality. It is widely accepted that manual annotations are highly subjective, which manifests in variability of textual labels (when annotators are required to provide them) [14] and inaccurate timestamps for the event instances [5]. Sound event detection is evaluated with respect to the temporal location of reference event instances [15], [16], which creates a strong dependence of the system performance on the quality of the annotations.
An alternative method to manual annotation is automatic content analysis with added human verification of the proposed labels, a method that has mostly been employed for weak labeling [2], [17]. For example, the FSD50K dataset labels were proposed based on the tags provided by users and then verified manually by expert and non-expert annotators [17].

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Crowdsourcing offers a more efficient method for annotation of large amounts of data. Even though it is mostly used for weak labeling, attempts to collect strong labels using crowdsourcing exist [18], [19]. Cartwright et al. [18] employed the classical annotation approach, requiring the annotators to provide an onset, an offset, and a textual label for all event instances; the task was simplified by providing the annotators with a list of labels to choose from. In our previous work [19], the annotation was formulated as weak labeling of overlapping temporal segments, and the strong labels were reconstructed with a 1 s resolution; similar to [18], a preselected list of textual labels was provided, to simplify the annotation task.
One important factor in using crowdsourced data is the availability of multiple opinions, and the way they are aggregated. The aggregation of annotator opinions is typically based on simple strategies like majority vote (consensus) [20], [21]. Using multiple expert annotators is more common in medical imaging than audio, with different strategies employed for aggregating the expert annotator opinions. Simple aggregation methods include, similar to the audio studies, intersection and union [22]; more complex strategies estimate an optimal ground truth using expectation-maximization as done in STAPLE [23] or maximizing the joint agreement between annotators [24]. A review of these approaches indicates that the method used to estimate the ground truth has a significant effect on the evaluated performance of the system, with STAPLE causing underestimation of performance when only few annotations are available, and consensus overestimating it [25]. In our previous work [26], we proposed an extended version of MACE (Multi-Annotator Competence Estimation) [27] to predict the "true" labels for multi-labeled audio data using models of the annotators' competence. The method weighs the annotator opinions based on their competence, in contrast to majority voting which trusts and weighs all annotators equally. This approach was incorporated in the strong label estimation proposed in [19], and shown to produce better estimates than the majority vote procedure.
In this paper, we present two key contributions to the problem of strongly labeling audio data for SED. First, we propose a method for estimating strong labels, using crowdsourcing of weak labels and a processing stage to reconstruct the temporal information. While the method has been introduced in our previous work [19], we now extensively test its effectiveness on real-life recordings, to understand its applicability in practical situations. Second, we propose a novel aggregation method that we call "majority opinion," applied directly to the weak labels as provided by the annotators. This approach operates on the raw data obtained from annotators instead of estimating the tags for each annotated segment, as done in [19], and uses annotator competence to weigh the individual opinions. All the previous work on crowdsourcing strong labels has been done on synthetic data, and the methods have not been tested on real-life audio. In this work, we investigate the crowdsourced annotation outcome on two known real-life SED datasets, and also compare the outcome of our proposed method with the approach of Cartwright et al. [18]. Finally, we investigate the effect of the reference annotation generation method on the evaluated performance of a SED system, to understand what proportion of the measured errors are due to incomplete reference data.
The remainder of this paper is organized as follows: Section II presents the related work in more detail, and the novel elements of the proposed approach; Section III presents the crowdsourcing annotation procedure, annotator competence estimation and the proposed strong label estimation method. Section IV introduces the datasets and the annotator competence analysis. The experimental results for the labels estimation are presented in Section V, which includes analysis of the resulting weak labels, strong label estimation, the comparison to direct strong annotation and discusses the sound event detection experiments using estimated labels. Finally, Section VI presents conclusions and future work.

II. RELATED WORK
Manual annotation is the most obvious approach to obtaining strong labels. Because the annotation task is difficult and time-consuming, most datasets containing strong labels are very small; for example, the TUT Sound Events 2016 [28] and TUT Sound Events 2017 [3] datasets contain only about 2 h of data each, in files of length 3-5 minutes. Their reference annotation was produced by two annotators who listened to the audio and could inspect the spectrogram, and had to provide a textual label composed of a noun and a verb (object and action), and onset and offset for all audible sound event instances [28]. The obtained set of labels was later manually processed to merge some classes, and the most frequent ones were selected and provided with the data. Similarly, the MAVD-traffic dataset for SED in urban environments [29] was manually annotated using the ELAN software, displaying the audio waveform, the video, and the spectrogram of the audio signal. The dataset consists of 4 h of data in files of approximately 5 minutes, and contains 21 classes.
The largest strongly-labeled dataset to date is a portion of AudioSet consisting of around 120,000 files that were manually annotated. The annotation process consisted of several steps, in which a first-pass labeling was reviewed by a different annotator who could adjust the temporal boundaries. The verification/adjustment step was repeated, but even with 5 stages this process rarely converged to consensus [30], indicating that the annotators did not agree on the boundaries. While very large in terms of classes and size, the audio files in this dataset have a length of only 10 s, which makes it very different from the aforementioned datasets, which are more representative of the overlaps and sequentiality of sound events in everyday environments.
As mentioned earlier, crowdsourcing is a very effective way to collect or curate data because it provides immediate access to a large number of nonexpert annotators. For example, in the FSD50K dataset, selected clips were automatically assigned labels based on the tags provided by the users, mapped onto the AudioSet ontology [17]. A specifically created tool, the Freesound Annotator (FSA), was then used to curate the data: volunteer users were asked to validate whether a certain sound is present in the audio or not. The sound classes were divided according to an estimated level of difficulty, and only the easy and medium difficulty classes were validated publicly through FSA. The classes considered difficult to annotate were validated by a pool of hired raters. Crowdsourcing was used to collect annotations for many notable image datasets such as ImageNet and Microsoft COCO, and a number of recent audio datasets, for example Clotho [31] and Open-MIC [20].
When multiple opinions are available for one annotated item, they are commonly aggregated through a majority vote. As a consequence, the expertise and diligence of the annotators in the annotation task influences the result. Our previous work addressed the problem of analysing annotator behavior for generating a reliable reference annotation based on their aggregated opinions [26]. A pool of 133 annotators was used to annotate 3930 audio recordings, providing 3-5 opinions per file. Aggregation based on annotator competence estimation was found to provide the best set of labels, evaluated using annotator agreement metrics. A second experiment using synthetic data, for which the ground truth was available, confirmed that the competence-based aggregation approach is superior to majority vote, validating the connection between annotator competence and reliability of the aggregated annotation [19].
Crowdsourcing of strong labels has been studied by Cartwright et al. in a controlled experiment that aimed to investigate the effect of visualizations and complexity on the crowdsourced annotations [18]. The study used 3000 synthesized soundscapes which were 10 s in duration, each containing up to 9 sound events, and a maximum polyphony of 4. The aggregated annotation was obtained by converting the annotations to a frame-based time-series representation using a frame size of 100 ms, and majority vote: a time frame was marked as active if at least half of the participants marked it as active. The study observed a sharp increase in quality of the estimated aggregated annotation for the first 5 annotators, followed by more subtle improvements as the number of annotators considered in the aggregation increased.
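The frame-based majority-vote aggregation described above can be sketched as follows. This is a minimal illustration of the scheme reported for [18], not the authors' code; the function name and the (onset, offset) input format are assumptions.

```python
import numpy as np

def aggregate_majority_vote(annotations, clip_len=10.0, frame=0.1):
    """Frame-based majority-vote aggregation (sketch).

    annotations: one list per annotator, each holding (onset, offset)
                 pairs in seconds for a single sound class.
    Returns a binary activity vector at `frame` (100 ms) resolution.
    """
    n_frames = int(round(clip_len / frame))
    counts = np.zeros(n_frames)
    for events in annotations:
        active = np.zeros(n_frames, dtype=bool)
        for onset, offset in events:
            start = int(onset / frame)
            stop = min(n_frames, int(np.ceil(offset / frame)))
            active[start:stop] = True
        counts += active
    # a frame is marked active if at least half of the participants marked it
    return counts >= len(annotations) / 2
```

For instance, with three annotators marking overlapping but slightly different extents of the same event, only the frames supported by at least two of them end up active.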
Our previous work introduced an alternative to the crowdsourcing of strong labels by breaking the annotation task into weak labeling of consecutive audio segments, followed by postprocessing to recover the temporal connection between the labeled events [19]. Aggregation based on annotator competence was also incorporated into the strong label estimation process. The study was based on 20 synthetic soundscapes containing a maximum number of six sound event classes and a maximum polyphony of 2. The comparison of the resulting estimated strong annotation with the reference generated with the data showed that the proposed method successfully reconstructs about 80% of the ground truth information.
In this work, we continue exploring the method in [19] and propose a novel aggregation method that directly uses the segment labels as provided by the annotators, instead of estimating the true labels with MACE. The aggregation starts from the raw data and takes into account annotator competence directly in the estimation of strong labels. For the synthetic data, we perform additional analyses of the labels with respect to signal-to-noise ratio and polyphony of sounds in the audio. Most importantly, we investigate the proposed method's applicability to real-life data, which is much more complex in terms of acoustic content than the synthetically generated data. In addition, we compare the outcome of the proposed method with the strong annotation approach from [18], to understand the tradeoff between cost-effectiveness and labeling process outcome.

III. CROWDSOURCING ANNOTATIONS
A simple and well-defined annotation task is the key for successful and consistent behavior of the annotators. The typical annotation process for creating strong labels requires the annotator to listen to an audio excerpt, recognize the target sound events, and annotate their presence by marking the temporal boundaries for each instance of the target classes. Oftentimes this requires repeatedly listening to the audio example to annotate sounds that overlap, or to make corrections to the already marked temporal boundaries. Selection of the temporal boundaries is subjective, and different annotators tend to disagree on their exact location [30], which indicates that the strong labeling annotation task is a difficult one.

A. Annotation Procedure
We propose a procedure that simplifies the annotation task by dividing it into unit tasks that require only weak labeling. The files to be annotated are segmented into short, overlapping segments, which are to be annotated with weak labels by indicating binary activity of sound events within the entire segment. The list of target sound classes is selected in advance and presented to the annotator, making the labeling task as simple as possible. The proposed method is illustrated in Fig. 1. A sliding "annotation window" goes over the length of the audio file, with a high rate of overlap between consecutive segments covered by this window. The temporal sequence of these annotated segments provides the temporal activity of the sounds within the original long file, by aggregating activity indicators at each time step. If all weak annotations are correct, i.e., all annotators have correctly indicated whether a sound is active or not, the event boundaries correspond to the boundaries of the maximum-valued region in the count-based activity indicators.
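The count-based reconstruction described above can be sketched as follows. This is an illustrative simplification assuming a 1 s hop and one (correct) opinion per segment; the function name and input layout are hypothetical.

```python
import numpy as np

def reconstruct_activity(segment_labels, seg_len=10, file_len=180):
    """Count-based temporal activity from weak segment labels (sketch).

    segment_labels: dict mapping segment start time (s, at 1 s hop)
                    -> 1 if the class was marked present in that
                    10 s segment, 0 otherwise.
    Returns per-second counts of how many overlapping segments
    marked the class active; with correct weak labels, the
    maximum-valued region coincides with the event boundaries.
    """
    counts = np.zeros(file_len, dtype=int)
    for start, present in segment_labels.items():
        if present:
            # the segment covers seconds [start, start + seg_len)
            counts[start:start + seg_len] += 1
    return counts
```

For an event active during seconds 12-15 of a 30 s file, every 10 s segment overlapping the event is labeled present, and the per-second count reaches its maximum exactly over seconds 12-14.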
To facilitate accurate recognition of sound sources in the audio segments provided to annotators, we choose a segment length of 10 s. This length is motivated by studies that examined the recognition by humans of a list of 42 different sounds, and concluded that listeners need a maximum of 6.8 s to accurately identify the sounds of the studied categories [32]. A hop of one second between the segments provides a one-second resolution in the temporal reconstruction of the events' activity, which is in line with the diffuse labels created in [30] and the segment length used in the evaluation of most SED systems [3]. We formulate the annotation task as a single-pass multi-label annotation, as done in [21] and [26]. As a consequence of this procedure, the presence of a sound is explicitly indicated by selecting the corresponding label, while the absence is implicit, indicated by the label not being selected.

B. Annotator Competence and Ground Truth Estimation
When working with non-expert annotators, it is important to be able to trust their answers. We employ MACE [27] to estimate how reliable these annotators are. The method allows identification of trustworthy annotators and provides a prediction for the ground truth based on aggregation of the annotators' opinions. MACE does not necessarily require that all annotators provide answers on all data, but requires at least that a large pool of annotators partially annotate the same data, in order to learn from redundant annotations.
The model, as originally introduced by Hovy et al. [27], considers that annotator j produces label A_ij on instance i. The annotated label depends on the true label T_i, and on whether annotator j is spamming (spamming means that the annotator is selecting the answer at random). Annotator behavior is modeled by a binary variable S_ij drawn from a Bernoulli distribution with parameter (1 − θ_j). When an annotator is not spamming on instance i (S_ij = 0), the annotation A_ij corresponds to the true label. When the annotator is spamming (S_ij = 1), A_ij is sampled from a multinomial distribution with parameter vector ξ_j. The annotations A_ij are observed, while the true labels T_i and the spamming indicators S_ij are unobserved. The model parameter θ_j specifies the probability of trustworthiness for annotator j, while ξ_j determines the spamming behavior of annotator j.
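The generative story above can be illustrated for a single (instance, annotator) pair. This is a sketch of the model's sampling process, not the MACE implementation; the function name and argument layout are assumptions.

```python
import random

def sample_annotation(true_label, theta_j, xi_j, labels=(0, 1), rng=random):
    """Sample one annotation A_ij under the MACE generative model (sketch).

    theta_j: trustworthiness of annotator j; the spamming indicator
             S_ij is drawn from Bernoulli(1 - theta_j).
    xi_j:    spamming distribution over `labels` (multinomial weights).
    """
    spamming = rng.random() < (1 - theta_j)  # S_ij = 1 with prob. 1 - theta_j
    if not spamming:
        return true_label  # a faithful annotator copies the true label T_i
    # a spammer ignores the true label and draws from xi_j
    return rng.choices(labels, weights=xi_j, k=1)[0]
```

A perfectly trustworthy annotator (θ_j = 1) always reproduces the true label, while θ_j = 0 yields answers drawn purely from ξ_j, independent of the data.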
The model parameters are estimated using the expectation-maximization algorithm, to maximize the probability of the observed data:

P(A; θ, ξ) = ∏_{i=1}^{N} ∑_{T_i} ∑_{S_i} [ P(T_i) ∏_{j=1}^{M} P(S_ij; θ_j) P(A_ij | S_ij, T_i; ξ_j) ]   (1)

where A is the matrix of annotations, S is the matrix of competence indicators, and T is the vector of true labels. Here N refers to the number of instances i that are annotated, and M to the number of annotators j that provide an opinion for instance i. The method was shown to predict labels very accurately in comparison with ground truth data on several tasks. At the same time, the model parameter θ_j was shown to correlate strongly with annotator proficiency [27].
Because MACE was originally defined for single-labeled items, we extend the representation of our multi-labeled data such that each file is assigned a set of binary yes/no labels, each corresponding to one target sound class. This implies that each (file, sound label) pair is considered an independently annotated item, equivalent to a multiple-pass binary annotation [21]. The difference is that in a multi-pass binary annotation both the present (yes) and absent (no) labels would be explicitly provided by the annotator, while in the single-pass multi-label annotation such as our task, the absence is implicit. We consider that the tagging task is easy enough to allow changing the data representation without introducing significant errors. We therefore explicitly represent as absent the items that were not explicitly marked as present by the annotators. The annotations are represented as a matrix containing the answers of all annotators per file and per label, as illustrated in Table I. Each row refers to a (file, sound label) item, and each column represents the answer of one annotator in the format [0, 1, −], marking the presence (1, explicit) or absence (0, implicit) of this label within the audio file; "−" indicates that this file was not assigned to this specific annotator.
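The construction of this matrix can be sketched as follows. This is a hypothetical helper illustrating the data representation of Table I, with `None` standing in for the "−" (not assigned) marker.

```python
def build_annotation_matrix(responses, files, labels, annotators):
    """Represent single-pass multi-label answers as MACE input (sketch).

    responses: dict (file, annotator) -> set of labels marked present;
               a missing key means the file was not assigned to that
               annotator.
    Returns one row per (file, label) item and one column per annotator:
    1 = present (explicit), 0 = absent (implicit), None = not assigned.
    """
    matrix = {}
    for f in files:
        for lab in labels:
            row = []
            for a in annotators:
                if (f, a) not in responses:
                    row.append(None)  # '-' in Table I: file not assigned
                else:
                    # absence is implicit: unselected labels become 0
                    row.append(1 if lab in responses[(f, a)] else 0)
            matrix[(f, lab)] = row
    return matrix
```

Note how a label the annotator simply did not select is explicitly recorded as absent (0), which is the representation change discussed above.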
Using this representation, we estimate the annotators' competence and predict the aggregated weak labels using MACE. It is important to note that MACE does not discard annotators, but weighs their opinion based on their competence, which results in a different procedure than majority voting which trusts and weighs all annotators equally. In some experiments, we also eliminate the most unreliable annotators based on their estimated competence, to study if relying on a smaller pool of better annotators is more advantageous than using a higher number of annotators wherein low-competence annotators are also present.

C. Strong Label Estimation Based on Majority Opinion
The illustration in Fig. 1 takes into account one weak label for each segment and reconstructs the temporal activity pattern of a sound event as a count-based activity at hop-size resolution. Having multiple annotators per segment allows estimation of this weak label using MACE. The count-based activity indicators are then binarized to obtain the maximum-valued regions that correspond to the estimated temporal boundaries of the sound event instances. In [19], a threshold of 80% was used instead of the maximum, in order to allow for possible incorrect answers from the annotators.
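A sketch of this binarization step, assuming per-step coverage counts are available; the 80% threshold follows [19], and the function and variable names are illustrative.

```python
import numpy as np

def binarize_activity(counts, coverage, threshold=0.8):
    """Binarize count-based activity indicators into events (sketch).

    counts:   per-step number of segments marking the class active.
    coverage: per-step number of segments covering that step.
    A step is active when at least `threshold` of the covering segments
    agree, allowing for occasional incorrect answers; contiguous active
    regions give the (onset, offset) of event instances in hop units.
    """
    active = counts >= threshold * coverage
    events, onset = [], None
    for t, a in enumerate(active):
        if a and onset is None:
            onset = t
        elif not a and onset is not None:
            events.append((onset, t))
            onset = None
    if onset is not None:
        events.append((onset, len(active)))
    return events
```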
We propose a novel method of estimating the strong labels, in which we directly consider the labels provided by the individual annotators. This way, the method takes into account the fine-grained differences in annotators' opinions instead of first transforming them into an estimated weak label per segment. Given the procedure explained above, we consider all annotator opinions in each hop-size segment t and aggregate them such that the vote of each annotator (sound event active or not active) is weighed by their estimated competence. The individual competence associated with each annotator is the model parameter θ_j estimated using (1), in other words the probability of trustworthiness for annotator j. We calculate the activity indicators using the following expression:

a_t = ( ∑_{j=1}^{M} θ_j v_j ) / ( ∑_{j=1}^{M} θ_j )   (2)

where a_t is the activity level a for one class in segment t, M is the number of available opinions for that segment, θ_j is the competence of annotator j, and v_j indicates the annotator's opinion, being 1 for the presence and 0 for the absence of the label. The estimation is done independently for each class.
This formulation is a generalization of the majority vote: if we consider all annotators as equally and perfectly competent, their competence level θ_j is 1. With the opinions being 0 or 1, normalizing the sum of opinions by the sum of the annotators' competence results in a value higher than 0.5 only when over half of the annotators have indicated a sound as being active. If the annotator competence is not always 1, the resulting value is still a number between 0 and 1, but it can be higher than 0.5 even when fewer than half of the annotators indicated a sound as active, given that these annotators are the most trustworthy ones. This is still a consensus-based aggregation, but instead of the majority vote (over half the annotators voting 1) we are considering the majority opinion, i.e. enough weight brought by the trustworthiness of annotators.
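The weighting and its reduction to the majority vote can be sketched as follows; this is a hypothetical helper implementing the expression above for one class in one segment.

```python
def majority_opinion(opinions, competences):
    """Competence-weighted activity level for one segment (sketch).

    opinions:    0/1 votes v_j of the M annotators covering the segment.
    competences: estimated trustworthiness theta_j of each annotator.
    Returns the activity level a_t in [0, 1].
    """
    num = sum(th * v for th, v in zip(competences, opinions))
    den = sum(competences)
    return num / den

# With equal, perfect competence this reduces to the plain majority vote:
# two of three annotators voting 1 yields 2/3 > 0.5.
# With unequal competence, a single highly trustworthy annotator can
# outweigh two low-competence dissenters.
```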

IV. DATASETS AND ANNOTATION TASK SETUP
In the experiments we use both synthetic and real audio recordings. The synthetic data offers the possibility of performing a detailed analysis of how the polyphony and SNR levels of the sound events present in the soundscape affect the outcome of the annotation, and allows comparison of the method's outcome with the correct reference annotation that is generated at the same time as the audio mixtures. On the other hand, the real recordings are more complex than synthetic data, due to the unrestricted and uncontrolled distribution and overlap of sounds, and present a difficult task to the annotators. To the best of our knowledge, this is the first experiment to attempt crowdsourcing strong annotations for real recordings, and the detailed analysis of its outcome will allow us to understand how the estimation of annotator and annotation reliability translates from the highly controlled and simplified synthetic case to a real-world situation.
A. Datasets

1) Synthetic Data: The synthetic dataset used in this study is MAESTRO Synthetic (Multi-Annotator Estimated STROng labels) [33], which was created using a slightly modified version of Scaper [34]. Soundscapes were generated by iteratively placing sound events at random intervals until the desired maximum polyphony of 2 is obtained. Intervals between two consecutive events were selected at random between 2 and 10 seconds. The sound event classes and sound instances were chosen uniformly, and mixed with a signal-to-noise ratio (SNR) randomly selected between 0 and 20 dB over a Brownian noise background. The mixing procedure did not allow two overlapping sounds of the same class.
The dataset contains the following classes: car horn, children voices, dog bark, engine idling, siren, and street music. The isolated sound event instances were extracted from the UrbanSound dataset [6] based on their temporal boundaries, which were manually annotated by the dataset authors (the children playing label from the UrbanSound dataset was renamed to children voices for the annotation task, as the audio examples often contained children's laughter). Only sounds marked as being in the foreground were used. The selection of target classes was based on the intention to mimic the content of the street scenes annotated in our previous study [26] and in the real-life TUT Sound Events 2016 and 2017 datasets. The MAESTRO Synthetic dataset consists of 20 audio files, each having a length of 3 minutes. The reference annotation of this dataset is created at the same time as the audio mixtures. We consider this reference annotation correct and complete, because of the way it is produced. Dataset statistics are presented in Table II.
2) Real-Life Data: The real-life recordings used in this study include a subset of TUT Sound Events 2016 [28] and a subset of TUT Sound Events 2017 [35]. We use the residential area acoustic scene from TUT Sound Events 2016, and select six target classes: bird singing, car, children, people speaking, people walking, and wind blowing (i.e. we do not consider the object banging class of the dataset). From TUT Sound Events 2017 we use the recordings corresponding to the city center acoustic scene, with target classes brakes squeaking, car, children, large vehicle, people speaking, and people walking. We will refer to the strong annotations produced by the described method as MAESTRO Real and publish them for further study. The reference annotation for MAESTRO Real is the annotation provided with the original datasets, which was obtained through manual annotation performed by two expert annotators who each annotated half of the data [28]. While these manually annotated data cannot be considered correct and complete due to the complexity of the acoustic content, our purpose is to understand the differences between different methods of producing annotations; therefore, we use these reference annotations to evaluate how the different crowdsourced versions coincide with expert opinions. We accept the fact that the expert annotations are also subjective, and analyze the effect of different annotation procedures on the produced labels and on the evaluation of SED systems. The statistics of the data are presented in Table II. The two acoustic scenes (city center and residential area) are treated separately in all our experiments.

B. Crowdsourcing Task Setup
As explained in Section III, the audio soundscapes were cut into 10 s segments with 1 s offsets. Each individual 10 s segment was considered as an independent annotation task, provided on Amazon Mechanical Turk as one HIT (Human Intelligence Task). In order to prevent the same worker from annotating overlapping segments, the data was organized into batches containing segments located at least 15 seconds apart in the original audio. The batches were launched one at a time, and workers that had already performed at least 50 HITs in previous batch(es) were disqualified from working on the task. A payment of $0.10 was offered per HIT. Worker qualification was requested as at least 1000 completed HITs with an average approval rating of at least 85%.
One HIT consisted of listening to the provided audio excerpt and indicating which sounds are present in it, from the given list of classes or "none of the above". The number of playbacks allowed was not limited. No visualization (e.g. spectrogram) was provided. Workers were instructed to complete the task using headphones, and in a quiet environment. Before the job, they were also provided with short descriptions of every class, and four example audio excerpts that contained sounds from all target classes. Each 10 s segment was annotated by 5 workers. While MTurk requires reviewing the assignments in order to approve or reject the answers submitted by workers, we approved all assignments, irrespective of the quality of the answers, in order to study the annotator behavior.

C. Annotators Competence Analysis
Annotators' competence analysis performed with MACE is shown in Fig. 2. This analysis considers only the weak labels provided by the annotators to the 10 s segments, and the audio segments are considered as independently annotated items. The synthetic data was annotated by a pool of 680 workers, while the real data was annotated by 861 and 717 workers for the residential area and city center scenes, respectively. Each set consisted of approximately 20 thousand HITs.
Most annotators seem to have high competence for the synthetic data, with about one third of the annotators in the highest tier (competence 0.9 to 1.0). Competence of the annotators on the real data shows a completely different situation: the values are distributed over the entire range, and a high number of annotators have extremely low competence (17% for city center and 14% for residential area have an estimated competence of under 0.1). We did expect to see a deterioration of overall competence for the annotation of the real soundscapes, but such a pronounced difference was surprising. This in itself is a very good indicator of the task difficulty for a non-expert annotator.
It is important to note that the annotators of the real and synthetic soundscapes are different, and individual annotators were limited to a maximum of 50 HITs. The competence estimation is therefore applied to a large pool of annotators, and the result can be seen as an indicator of the task complexity. Previous works that studied annotation procedures all used synthetic data to draw their conclusions [18], [19], while the difficulty and subjectivity of annotating real data was always mentioned and accepted as true [5]. Fig. 2 shows histograms of the estimated annotators' competence on the different datasets. These results are the first to demonstrate in a quantifiable way that real data is much more difficult to annotate than synthetically generated data.
Inter-annotator agreement was calculated using Krippendorff's alpha and is presented in Table III along with more details about the annotation task. In the table, α_all represents Krippendorff's alpha for the entire set of annotations. The values show how much more difficult it is for annotators to agree on the annotation of the real data than on the synthetic data. Removing the less competent annotators increases the inter-annotator agreement: keeping only annotators with competence higher than 0.6 results in a 30% relative increase in agreement for the synthetic data; for the real data, the relative increase is 116% and 170%, respectively, reaching an α_comp>0.6 of 0.54.
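For reference, Krippendorff's alpha for nominal data can be computed from a coincidence matrix. The following is a minimal stdlib sketch of the standard formulation, not the exact implementation used to produce Table III:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal-data Krippendorff's alpha via a coincidence matrix.
    units: list of lists, each inner list holding the labels that the
    annotators assigned to one item (items with < 2 labels are skipped)."""
    o = Counter()  # coincidence matrix o[(c, k)]
    for values in units:
        m = len(values)
        if m < 2:
            continue
        for c, k in permutations(values, 2):  # all ordered label pairs
            o[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()  # marginal totals per label
    for (c, _), w in o.items():
        n_c[c] += w
    n = sum(n_c.values())
    d_obs = sum(w for (c, k), w in o.items() if c != k)            # observed disagreement
    d_exp = sum(n_c[c] * n_c[k] for c, k in permutations(n_c, 2))  # expected disagreement
    if d_exp == 0:
        return 1.0
    return 1.0 - (n - 1) * d_obs / d_exp
```

Alpha is 1 for perfect agreement and near 0 when agreement is what chance alone would produce, which is why the low values on real data indicate genuine disagreement rather than noise.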
When removing annotators based on their competence values (keeping only those with competence > 0.6), the number of annotators left for the agreement calculation is considerably reduced, to 532, 430 and 228 workers, respectively. As a consequence, the number of HITs that the agreement is calculated on is reduced to 20514, 19164 and 16164, respectively. The most affected subset is the city center data: 4770 of the 21264 annotated items are left without annotations, because 489 annotators have an estimated competence below 0.6.

V. EXPERIMENTAL RESULTS
The weak and strong labels estimation methods are analyzed by comparing their output with the reference annotation. We evaluate the quality of the resulting weak labels using precision, recall, and F1, and the strong labels using the most common metrics from SED.

A. Weak Label Estimation
Considering the annotated segments individually, the annotation process output is evaluated by comparing the audio tags with the reference tags for each segment. The reference tags per segment were generated based on the reference strong labels by assigning a label to a segment if the sound is active at any time within that segment.
The multiple annotations were aggregated for each segment using three different methods: union, majority vote and MACE. Union assigns a label to an item if at least one of the annotators has assigned it to that item; majority vote assigns a label to an item if most annotators have assigned it (in this case at least 3 of the 5 annotators). MACE uses the estimated competence of the annotators to predict the labels for each item, as explained in Section III. The comparison to the reference labels is done using F1, precision and recall. The results are presented in Table IV. For the synthetic data, the best F1 is obtained using MACE: 86%, with 97% precision and 77% recall. Recall values show that many sounds are not annotated: with the majority vote, only slightly over half of the tags are found, while taking into account all opinions through union aggregation brings recall close to 90%. MACE produces a good compromise between high precision and good recall.
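The union and majority-vote rules can be sketched as follows; MACE is an external tool and is not reimplemented here (the function name is hypothetical):

```python
from collections import Counter

def aggregate_tags(annotations, method="majority"):
    """annotations: list of tag sets, one per annotator, for a single segment.
    Returns the aggregated tag set for that segment."""
    counts = Counter(tag for tags in annotations for tag in set(tags))
    if method == "union":
        # any tag mentioned by at least one annotator
        return {tag for tag, c in counts.items() if c >= 1}
    if method == "majority":
        # tags chosen by more than half of the annotators (3 of 5 in our setup)
        quorum = len(annotations) / 2
        return {tag for tag, c in counts.items() if c > quorum}
    raise ValueError(method)
```

Union maximizes recall at the cost of precision, while the majority vote does the opposite, matching the behavior reported in Table IV.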
Looking at the real data, the metrics behave very similarly, although the actual values are much lower: aggregation through union produces the best recall, while majority vote produces the best precision, and MACE raises the recall level while slightly lowering precision. It is worth noting that, for the real audio recordings, the reference annotations should not be considered absolutely correct, since even though they were produced by expert annotators, each file was annotated by a single person. It is however discouraging that the aggregated opinion of multiple annotators overlaps so little with the original annotator's opinion. The results nevertheless show that MACE is the best aggregation method for both types of data, synthetic and real. For this reason, we will focus on MACE-based aggregation approaches for the remainder of the experiments.

1) Polyphony Analysis:
We analyze the influence of polyphony on the aggregated weak labels using the synthetic data, for which such details are available. The synthetic data was designed to have at most two overlapping sound events at a given time. However, a 10 s segment may have more than two labels assigned, depending on its content. We use the term "polyphony" broadly to mean the number of events present in one 10 s segment, not necessarily all overlapping in time. We also calculated the average gini-polyphony introduced in [18], defined based on the sound event polyphony at 100 ms time intervals throughout the soundscape. Interpreted as a measure of soundscape complexity, with zero representing maximal equality (low soundscape complexity) and one representing maximal inequality (high soundscape complexity) [18], the average gini-polyphony of the data is 0.74, which shows that the complexity of the soundscapes is generally high. Most of the segments in the estimated labels have one or two events, indicating a large number of missing labels and therefore explaining the lower recall. Table V indicates the number of segments with different degrees of polyphony, with column 2 corresponding to the reference labels, and column 3 to the labels estimated using MACE. The metrics in columns 5-7 compare the reference and the MACE output with respect to the number of segments in the reference (N_s^GT). For the 10 segments of polyphony 1, all labels were correctly estimated (R = 100), but some of them were assigned more than one label (P = 90.9). For the case of maximum polyphony, only half of the labels (R = 57.1) within the four segments were labeled correctly (P = 100).
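Under our reading of [18], gini-polyphony is a Gini coefficient computed over per-frame (100 ms) polyphony counts. The sketch below illustrates that interpretation (the function names are hypothetical; the exact definition is in [18]):

```python
import math

def polyphony_curve(events, duration, hop=0.1):
    """Number of simultaneously active events in each `hop`-second frame,
    given (onset, offset) pairs in seconds."""
    n_frames = int(round(duration / hop))
    curve = [0] * n_frames
    for onset, offset in events:
        first = int(onset / hop)
        last = min(n_frames, int(math.ceil(offset / hop)))
        for i in range(first, last):
            curve[i] += 1
    return curve

def gini(values):
    """Gini coefficient over non-negative counts: 0 means all frames are
    equally polyphonic; values near 1 mean activity is concentrated
    in a few frames (high soundscape complexity)."""
    vals = sorted(values)
    n, total = len(vals), sum(vals)
    if total == 0:
        return 0.0
    cum = sum((i + 1) * v for i, v in enumerate(vals))
    return (2 * cum) / (n * total) - (n + 1) / n
```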
As expected, precision and recall vary with the polyphony, with recall decreasing at a high rate when polyphony increases. A similar annotator behavior was observed by Cartwright et al. [18] in the case of strong annotations: when more than two sounds overlapped, annotators failed to recall all the concurrent sounds. This may be because it is more difficult to identify sound events when there are more than two, but it may also show a tendency of the annotators to only identify one sound, and annotating a second one only if it was clearly identifiable. Additionally, the listening conditions play a role in identification too: in reality humans have better capabilities to distinguish overlapping sound events because of using both ears, therefore the spatial perception plays a role in the process, while listening to a mono recording in headphones does not provide the necessary spatial cues for disambiguation.
2) SNR Analysis: We investigate the effect of the SNR on the precision and recall of the sound events by annotators. Because each sound instance was randomly assigned an SNR level when creating the synthetic mixtures, in most cases the 10 s segments contain more than one sound, with different SNR values. We group the segments by considering a segment to be in a specific SNR range if at least one of the sounds in the segment has an SNR within that range, and all other sounds in the segment have an SNR within that range or lower (e.g. a segment with a sound at 7 dB, one at 3 dB, and one at 4 dB is in the [5-10] dB range).
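The grouping rule reduces to locating the range that contains the segment's maximum SNR; a small sketch (the boundary handling with half-open ranges is our assumption):

```python
def snr_group(snrs, ranges=((0, 5), (5, 10), (10, 15), (15, 20))):
    """A segment falls in a range if at least one sound's SNR is inside it
    and all other sounds are in it or lower, i.e. the range that holds
    max(snrs). Ranges are treated as half-open [lo, hi), except that the
    upper bound of the last range is inclusive."""
    top = max(snrs)
    for lo, hi in ranges:
        if lo <= top < hi or top == hi == ranges[-1][1]:
            return (lo, hi)
    return None
```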
The results, presented in Table VI, show that recall increases with SNR, with a 7% absolute increase for sounds in the [10-15] dB range compared to those in the [0-5] dB range. The lower values for the [15-20] dB range are due to the definition of these groups: based on the statistics in Table V, most segments have 2 or 3 sounds, so most of the 1778 segments with sounds in the [15-20] dB interval also have other sounds at lower SNR that are missed, hence the lower F1, P and R. Of the 1778 segments, 1660 (93%) have events with SNR lower than 15 dB. In this case inter-event ratios also play an important role: when two events occur simultaneously, the louder one will strongly or at least partially mask the other one, making it harder to identify. For the other ranges this relative ratio is smaller (40% for [5-10] dB, 60% for [10-15] dB), resulting in fewer chances for masking. According to [32], identification accuracy and speed depend on the type of sound, therefore identifying the concurrent sounds will depend not only on the relative prominence of the sounds in a scene, but also on the degree of overlap, and on the familiarity of the annotator with the sounds to be annotated. If we consider only segments where all the sound events present have the SNR within the same interval, the number of evaluated segments decreases to about 20% of the total. In this case F1 for the [0-5] dB range is 88.92% (202 segments), while for the [15-20] dB range it is 95.71% (118 segments), demonstrating the ease of annotating sound events that are relatively loud compared to the background.

B. Strong Label Estimation
Following the scheme for temporal activity reconstruction of the sound events described in Section III-C, we stack the annotated segments in their original order and combine the multiple annotator opinions using the proposed majority opinion. An example of estimating the strong labels from the count-based activity curves is presented in Fig. 3. The annotation task produces 5 opinions per 10 s segment, which translates into 50 opinions per 1 s segment, due to the staggered annotation procedure. According to the estimation method explained in Section III-C, the temporal location of an event instance corresponds to the region in which all annotators have considered it active in the weakly-labeled segments. To accommodate possible incorrect answers from the annotators, in [19] we used a threshold of 80% for binarizing this representation, i.e. a sound event was considered active in a 1 s segment if at least 80% of the opinions available for that segment considered it active.
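For a single target class, the count-based activity estimation and its binarization can be sketched as follows (bookkeeping simplified to 1 s frames; the function names are hypothetical):

```python
def activity_fraction(votes, n_frames, seg_len=10):
    """votes: dict mapping a segment's start frame (1 s hop) to a list of
    booleans, one per annotator, saying whether the class was tagged in
    that 10 s segment. Returns, per 1 s frame, the fraction of 'active'
    opinions among all opinions whose segment overlaps the frame."""
    active = [0] * n_frames
    total = [0] * n_frames
    for start, opinions in votes.items():
        for f in range(start, min(start + seg_len, n_frames)):
            total[f] += len(opinions)
            active[f] += sum(opinions)
    return [a / t if t else 0.0 for a, t in zip(active, total)]

def binarize(fractions, threshold=0.8):
    """Class active in a frame if at least `threshold` of the opinions agree;
    0.8 reproduces the rule of [19], 0.5 the majority-opinion midpoint."""
    return [f >= threshold for f in fractions]
```

With fully staggered 1 s offsets and 5 annotators, each interior frame accumulates 50 opinions, which is what makes the fraction a stable estimate of the class activity.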
We compare the proposed method with the MACE estimate as presented in [19], considering that MACE provided the best estimation of the reference weak labels; in addition, we also compare it with the aggregation of data from annotators with a competence higher than 0.6. The results are presented in Table VII in the following order: (1) using only annotators with a competence higher than 0.6; in this case, low-competence annotators are eliminated, resulting in a varying number of opinions per 1 s segment (on average 37, 20, and 13 annotators for synthetic, real-residential and real-city-center, respectively); (2) using the labels estimated with MACE; in this case, each 10 s segment is assigned the labels estimated by MACE, and there is only one opinion per 10 s segment (the MACE output), which translates into 10 opinions per 1 s segment, due to the staggered annotation procedure; (3) majority opinion. For cases (1) and (2) we use the 80% threshold to binarize the count-based activity, as explained above. For majority opinion, we binarize the activity at the midpoint of 0.5, according to the definition in Section III-B.
Table VII presents the SED scores between the reference annotations and the estimated strong labels based on the three described approaches, using segment-based F1 and ER [36] and intersection-based F-score as defined for the Polyphonic Sound Detection Score (PSDS) [16]. PSDS is evaluated for two scenarios, as defined in DCASE 2021 Challenge Task 4.2 The two metrics are evaluated using the parameters defined in [16].
The error rate (ER) consists of deletions (D), events present in the reference which are missed in the output, insertions (I), events erroneously marked as present in the output, and substitutions (S), events that are mislabeled in the output compared to the reference. We observe that a large proportion of errors in ER are deletions. This means many of the sound events were not identified, which is expected based on the previously observed recall rates in the weak labels analysis. Deletions (and implicitly ER) are very high for the real data, being about twice as many in comparison to the synthetic data, for a similar amount of annotated segments. This, in particular, indicates the high difficulty in identifying the target sounds in real-life mixtures.
The strong annotations estimated for the real-life data compare rather poorly with the reference annotation. The use of MACE has a clear effect on increasing recall, with the proposed majority opinion aggregation (3) providing the best outcome. However, the higher recall is reflected in a lower precision and a significant increase in insertions, even though the overall ER decreases. The best precision is obtained by using a selected proportion of highly competent annotators according to method (1), but this means discarding large amounts of raw data, in particular for the real audio recordings. F1 values show a similar trend, with MACE helping to improve the scores significantly. The majority opinion approach provides by far the best F1 for the real data, for all three calculated versions. For the synthetic data, the proposed method does not always provide the best strong label estimates.
Here one should not forget that the synthetic data comes with correct and complete reference annotations for the sound event instances, while the real recordings were manually annotated and therefore are prone to labeling errors that arise from subjective perception of each annotator. While the superiority of the proposed method can only be demonstrated numerically on the synthetic data, this does not diminish its importance; on the contrary, it shows that the proposed competence-weighted aggregation provides consistent results across different types of datasets, and may be used as an objective and reproducible procedure for creating strong annotations. One scenario in which this method fails is when two events of the same class follow each other at short intervals, within a 10 s segment. In this case, correctly indicating presence of the sound event class in all segments that overlap any of the instances will create a situation where there are no gaps, leading to the estimation of a continuous, single instance.

C. Comparison to Direct Strong Annotation
For comparison, we reproduced the annotation method of Cartwright et al. [18] which provided workers with the spectrogram visualization along with the audio, and required annotators to produce strong annotations. We used the exact same annotation protocol through Amazon Mechanical Turk, using the code provided by the authors, 3 to collect five annotations for each audio file. We provided the visualization as a spectrogram, and explained to the annotators how it can be interpreted. The workers for this task were selected to have at least 95% accepted jobs.
Aggregation of the multiple annotations was done following the same procedure as in [18]: each annotation was transformed into a discrete sequence of 100 ms length segments; for each 100 ms segment, an event was considered active if the majority of the annotators (in this case 3 of 5) have annotated it as active. The resulting aggregated strong labels are compared with the ground truth (for synthetic data) or with the reference annotation (for real-life data). Table VIII shows information retrieval measures in 100 ms segments for the synthetic data, for comparison with the work in [18]. The F1 of 68.3% is much lower than the approximately 93% in [18] in the case of 5 annotators. We hypothesize that this large difference is due to the annotation task being more difficult: our soundscapes have a length of 3 minutes, and may exhaust the worker's attention, in comparison with a short 10 s one. While for our experiment the precision and recall are 89.6% and 55.2%, respectively, the same metrics for the 10 s soundscapes in [18] are 98% and 95%. As an attempt to increase recall to the maximum possible, we verify the outcome of a union-based aggregation instead of consensus on the 100 ms segments, and obtain a recall of 85.9%. The method does however deteriorate precision, leading also to a much higher error rate.
In line with the other experiments presented in this paper, we calculate the SED metrics between the reference annotation and the aggregated strong annotation. The results are presented in Table VII as approach (4). For the synthetic data, the segment-based F1, P and R calculated in 1 s segments are in the same range as the same metrics in 100 ms segments (Table VIII). In comparison with the evaluation of the other approaches in Table VII, we can conclude that this method provides very poor results, in particular on the real data. While precision values are comparable among the four approaches, the recall in the direct strong labeling approach is very low, also visible in the high proportion of deletions. An example of how our proposed method, based on weak labeling and majority opinion, behaves better than the direct strong annotation with majority vote aggregation is shown in Fig. 4.
While we conclude that the strong annotation crowdsourcing as studied in [18] does not seem to be suitable for minutes-long real recordings, we have to mention a peculiar behavior of the annotators: the number of annotated event instances for the real data was very high, with a visible tendency of "filling up" the length of the audio. As can be seen in Fig. 5, the spectrogram visualization of a synthetic soundscape has more prominent segments corresponding to individual sound instances that are easily noticeable on the background, which may elicit a different annotator behavior. The complexity of the spectrogram for real data, brought by the unconstrained presence and overlapping of non-target sounds might give the impression that there is always something happening that needs to be annotated. In light of this, providing the spectrogram in the annotation task may have been detrimental to the quality of annotations instead of aiding the process.
In terms of time and cost, the required annotation effort for the two methods is quite different. While the weak labels are faster to annotate, the HITs were published in batches, and the average time for completing all batches of one dataset was 4 hours. In comparison, the strong annotation took on average 7 h for each set. Cost-wise, the tagging HITs were paid $0.10 each, while the strong annotation HITs were paid $5 each, which resulted in a 4 times higher cost for tagging than for the strong labeling. However, we observed that many workers in the tagging task completed the maximum allowed HITs, while for the strong annotations most workers completed only one HIT, indicating that they considered the workload too high. This reinforces our intuition that simple unit annotation tasks like tagging are preferable to ones requiring the annotator to make complex decisions, as is the case for strong labeling.

D. Sound Event Detection Using Estimated Labels
As an additional experiment, we investigate how the reference annotations influence the evaluated performance of a SED system. Traditionally, algorithms are trained and evaluated using the reference produced through manual annotation. Accepting that such annotation is subjective means that the reference is not necessarily complete, and the differences and disagreement between annotators may be the cause of some of the measured errors. Moreover, it is difficult to create a consistent annotation protocol that results in similar annotator differences for different datasets. As a consequence, testing a method across datasets will be affected by errors caused not only by the acoustic content mismatch but also by the mismatch in the labeling procedure.
In order to observe how this mismatch in the labeling procedure affects the evaluated performance, we train a SED system using the official reference labels, and evaluate its output against differently produced strong labels. We consider as baseline the system trained and evaluated using the reference annotation (generated for the synthetic data, annotated by experts for the real data). The experiments follow a leave-one-out setup in order to use as much as possible data for training the model. For each of the three datasets, the training/test procedure uses one soundscape for testing, one for validation, and the rest of the soundscapes for training the model. All training/test experiments were run first, and the evaluation was performed on the entire data at once, to avoid possible imbalances due to averaging over file-wise results [36].
We use PANNs [37], specifically the wavegram-logmel-CNN14 model,4 consisting of six convolutional (conv) blocks. Each conv block contains two 2-D convolutional layers with a 3x3 kernel and batch normalization, followed by a ReLU nonlinearity. After each conv block, 2x2 average pooling and a dropout layer with rate 0.2 are applied. The input to the PANNs model is the concatenation of the log-mel spectrogram features and the wavegram. The wavegram is a feature representation proposed in [37], learnt by a CNN block from the raw audio waveform. Using both the mel spectrogram and the wavegram as input features has been shown to improve performance significantly compared to the mel spectrogram alone [37]. The model is pretrained on AudioSet, with all audio converted to mono, resampled to 32 kHz, and padded to 10 s.
To fine-tune the model for our experiment, a fully-connected layer consisting of six units was added to the pre-trained conv layers of the selected PANNs model, after which the complete model was further trained for a few epochs. The proposed method relies on the aggregated opinions of a large pool of annotators, therefore it has the potential of producing the labels in a more objective and reproducible manner.

VI. CONCLUSION
While crowdsourcing has been repeatedly used as a fast method to collect large amounts of labeled data, the specific format of strong labels for sound event detection is still difficult to crowdsource. In addition to the complexity of the task itself, the outcome is affected by the subjectivity of the annotators in perceiving the sounds. Collecting multiple opinions alleviates the subjectivity, but comes with the question of how to aggregate the multiple annotations for the best outcome. This paper presented two key contributions to the research problem of crowdsourcing strong labels. First, we introduced a novel workflow in the crowdsourcing task which breaks the strong annotation process into two stages: weak labeling and reconstruction of temporal information based on the weak labels. The weak labeling task is much simpler than strong labeling, and is therefore expected to produce labels of consistent quality. Second, we proposed a novel method for aggregating multiple annotator opinions, using annotator competence estimation tools. Given that some users produce more reliable annotations than others, replacing the majority vote aggregation with a majority opinion scheme was expected to produce a higher quality outcome.
Results have shown that weighing the annotators' opinions by their estimated competence produces better strong labels than any other method, including direct strong annotation. In addition, the results show that the proposed majority opinion approach produces reliable aggregated strong labels in comparison with a manually annotated reference produced by an expert annotator. Using a SED experiment, we have also shown how a model's evaluated performance is linked to the selected reference annotation. Annotations produced manually by different annotators reflect their personal biases and are prone to annotator-dependent errors, which are not separable from the system-produced errors when evaluating against such a reference. The proposed method uses multiple annotators in a crowdsourced manner and a data-independent processing chain for producing the strong labels, and therefore has the advantage of being objective and reproducible, even though the produced annotations were shown to be incomplete.
Future research may investigate incorporating additional knowledge into the workflow. The main advantage of the proposed approach is its streamlined and reproducible setup, but the drawback is its high level of redundancy. For a more efficient method, it would be useful to preprocess the audio to select regions of interest, so that only the parts expected to contain the target events are annotated with high redundancy.