Hierarchical Classification for Instrument Activity Detection in Orchestral Music Recordings

Instrument activity detection is a fundamental task in music information retrieval, serving as a basis for many applications, such as music recommendation, music tagging, or remixing. Most published works on this task cover popular music and music for smaller ensembles. In this article, we embrace orchestral and opera music recordings as a rarely considered scenario for automated instrument activity detection. Orchestral music is particularly challenging since it consists of intricate polyphonic and polytimbral sound mixtures where multiple instruments are playing simultaneously. Orchestral instruments can naturally be arranged in hierarchical taxonomies, according to instrument families. As the main contribution of this article, we show that a hierarchical classification approach can be used to detect instrument activity in our scenario, even if only few fine-grained, instrument-level annotations are available. We further consider additional loss terms for improving the hierarchical consistency of predictions. For our experiments, we collect a dataset containing 14 hours of orchestral music recordings with aligned instrument activity annotations. Finally, we perform an analysis of the behavior of our proposed approach with regard to potential confounding errors.


I. INTRODUCTION
Instrument recognition is a long-studied task in the field of music information retrieval (MIR), which aims at identifying the musical instruments that are playing in an audio excerpt. It is a difficult task, since many instruments produce sounds in overlapping pitch ranges and may exhibit similar timbral characteristics, especially those from the same instrument family. Furthermore, in real music recordings, multiple instruments may be active simultaneously (a setting also called polyphonic instrument recognition). The task is closely related to instrument activity detection (IAD), where the aim is to identify the active instruments in a frame-wise fashion, over the course of an entire music recording. IAD can be useful to inform music recommendation or auto-tagging systems, as well as aid in music editing and remixing.
The authors are with the International Audio Laboratories Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg, 91058 Erlangen, Germany (e-mail: michael.krause@audiolabs-erlangen.de; meinard.mueller@audiolabs-erlangen.de).
Digital Object Identifier 10.1109/TASLP.2023.3291506
In this article, we are concerned with IAD in complex orchestral and opera recordings. Orchestral music in general constitutes a very challenging scenario for instrument detection, as many individual instruments are playing simultaneously, creating a highly complex sound mixture. In addition to the high degree of polyphony, orchestral music is also polytimbral, i.e., sounds from various instrument groups merge to create a single texture of sound. To approach this scenario, we utilize the hierarchical relationships that exist between orchestral instruments and instrument families, as illustrated in Fig. 1. In this context, we show that hierarchical classification improves results of a deep neural network for IAD, especially when fine-grained, instrument-level annotations are unavailable, but coarse, family-level annotations exist. Moreover, we investigate the consistency of predictions across hierarchy levels and demonstrate how additional training losses can promote consistent predictions, while preserving detection quality.
In contrast to popular music settings, public datasets with instrument activity annotations for orchestral or opera music rarely exist. For our experiments, we thus collect a dataset based on a combination of existing multi-track datasets and semi-automatically annotated commercial recordings. In total, our dataset consists of 14 hours of real-life orchestral recordings and covers 18 different classes. We make the instrument annotations for these recordings publicly available.
A common pitfall of music classification systems is overreliance on confounding factors in training and test data, which may lead to poor generalization ability. A system for genre classification may, for example, make decisions based on inaudible artifacts rather than musical content [1]. Such confounding effects may also arise for our IAD system, e.g., by affecting predictions for classes that are often active simultaneously (such as brass and woodwinds). To explore the impact of such effects on our system, we perform an analysis of model predictions with regard to classes that composers often use in conjunction.
We now summarize the main contributions of this article. First, we introduce a challenging new setting for IAD, for which no standard datasets exist. Second, we show how one can improve detection results and reduce the need for instrument-level annotations by exploiting the hierarchical class structure of our scenario. Third, we show how the consistency of predictions made by our model can be improved through additional loss terms. Fourth, we perform an analysis of our model's behavior and uncover confounding effects for certain instrument classes. In a previous work [2], we explored hierarchical classification for detecting singing activity, singer gender and voice type in orchestral recordings. In terms of the hierarchical classification techniques used, this article is an extended version of [2]. Here, we go beyond [2] by considering an instrument detection scenario, involving a different and larger class hierarchy compared to [2]. Furthermore, we present additional technical details, use different datasets and models, and provide extensive additional experiments and analyses.
The remainder of the article is organized as follows: Section II discusses related work on instrument recognition, hierarchical classification, and analysis of orchestra recordings. In Section III, we formalize our problem statement, outline our main classification approach, and describe evaluation measures used. Sections IV and V cover our dataset and model architecture, respectively. Section VI contains the main experimental results. In Section VII, we describe and evaluate losses for improving the consistency of predictions. In Section VIII, we analyze our model with regard to confounding effects among instrument classes. Finally, Section IX concludes the article with an outlook on possible future work.

II. RELATED WORK
Our article draws upon related work from several fields, including the vast field of music instrument classification. In our review of these fields, we focus on key references relevant to the present paper and relate our contributions to the state of the art.
Works on instrument recognition can also be categorized according to whether only the predominant instrument in a mixture (e.g., [13], [15]) or all active instruments (e.g., [16]) are to be recognized. Furthermore, some works classify activity for an entire audio excerpt lasting several seconds (e.g., [13], [14], [15], [18], [19], [20], [21]) whereas more fine-grained approaches yield predictions at the frame level (e.g., [16], [17]). Such frame-level outputs can be used to obtain instrument predictions for every time step in a music recording, which is also called instrument activity detection (IAD). The scenario considered in this article is polyphonic IAD, where all playing instruments are to be detected. In contrast to prior work on this scenario, which usually examines popular music or works for small ensembles, we consider complex orchestral and opera music.

B. Hierarchical Classification for Audio
Some previous works have used hierarchical class structures for classification of audio data. For example, the authors in [22] propose a specialized network architecture for classifying bird calls, based on bird taxonomies. They perform classification at the excerpt level rather than at the frame level. In [23], a network for sound event detection is iteratively pretrained on successive hierarchy levels. Some papers [24], [25] employ tree hierarchies for audio representation learning (but not for classification). These works also usually do not take into account audio inputs where several classes may be active at the same time.
Fewer works use hierarchical structures for music audio classification. In [26], hierarchical classification is used in the context of singing transcription. Essid et al. [7] use hierarchies for instrument classification, but their scenario involves only synthetic audio data and their system does not yield predictions on a frame-level (which are necessary for IAD). In our previous work [2], which we extend here, we explored hierarchical classification for singing detection (see also Section I).
Hierarchical structures have also been exploited for other tasks in MIR, including musical instrument separation [27]. Garcia et al. [28] use instrument hierarchies for few-shot detection, where the model requires examples of the target class at test time (see also [29]). In contrast, we use a classification approach with a fixed set of classes, since the instruments in orchestral recordings are known a-priori. Nolasco and Stowell [30] employ instrument hierarchies for audio representation learning (not for classification).
In contrast, we perform hierarchical classification on a dataset of real orchestra and opera recordings.

C. Orchestra and Opera in MIR
Only a few works in MIR have focused on opera and orchestral music. Among these are works on singing detection [2], [36], [37], emotion identification [38], predominant melody estimation [39], as well as source separation informed by multichannel recordings [40] or by score information [41]. Taenzer et al. [42] performed instrument family classification on classical music recordings, but they only considered monotimbral pieces (i.e., works for ensembles consisting of one family only). To the best of our knowledge, there is hardly any work in MIR on IAD in real-world, polyphonic and polytimbral recordings of orchestra and opera.

III. HIERARCHICAL INSTRUMENT DETECTION
In this article, we aim to utilize the hierarchical relationships among orchestral instruments. To this end, we now formalize the hierarchical class model, classification approach, and evaluation measures used throughout this work. In this section, we follow [2], where we introduced the concepts and notation.

A. Hierarchical Class Model
We write $C$ for the set of all classes in our detection problem. We can partition these classes across a total of $H$ hierarchy levels, where $C_h \subseteq C$ are the sets of classes at hierarchy level $h \in \{1, \ldots, H\}$. Thus,
$$C = \bigcup_{h=1}^{H} C_h, \qquad C_h \cap C_{h'} = \emptyset \ \text{for} \ h \neq h'.$$
In our setting, we consider 18 different classes, corresponding to the nodes of the tree illustrated in Fig. 1. We use $H = 3$ hierarchy levels, with the lowest level $h = 1$ corresponding to fine-grained instrument classes, level $h = 2$ containing coarse instrument families, and the highest level $h = 3$ signifying any kind of instrument activity (as opposed to silence or noise). Thus, $C_1 = \{\mathrm{Fl}, \mathrm{Ob}, \mathrm{Cl}, \ldots\}$, $C_2 = \{\mathrm{WW}, \mathrm{BR}, \ldots\}$, and $C_3 = \{\mathrm{INST}\}$. It should be noted that the hierarchy could also be constructed in alternative ways. Our choice here is motivated by practical considerations, i.e., simplicity and the availability of sufficient data for each class $c \in C$. Generally, constructing appropriate instrument hierarchies can be a challenging problem, especially for electronic or non-standard instruments [43], [44].
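As an illustration of this class model, the following minimal sketch encodes a part of the hierarchy as a parent map and derives the level sets from it. It uses only class abbreviations that occur in the text; the full 18-class tree of Fig. 1 is not reproduced, and the names `PARENT`, `children`, `level`, and `LEVELS` are hypothetical.

```python
# Partial class hierarchy (assumption: only abbreviations mentioned in the
# text are included; the complete tree of Fig. 1 has 18 classes).
PARENT = {
    "Fl": "WW", "Ob": "WW", "Cl": "WW",   # some woodwind instruments
    "Hn": "BR", "Tpt": "BR",              # brass instruments
    "Vn": "ST",                           # one string instrument
    "Fe": "VOC", "Ma": "VOC",             # female and male singing
    "WW": "INST", "BR": "INST", "ST": "INST", "VOC": "INST",
    "INST": None,                         # root: any instrument activity
}


def children(c):
    """Return the set of children of class c (empty for leaf classes)."""
    return {k for k, p in PARENT.items() if p == c}


def level(c):
    """Hierarchy level of class c: 1 for leaves, increasing towards the root."""
    kids = children(c)
    return 1 if not kids else 1 + max(level(k) for k in kids)


# Group all classes by hierarchy level; the level sets partition the class set.
LEVELS = {}
for c in PARENT:
    LEVELS.setdefault(level(c), set()).add(c)

print(LEVELS[2])  # the instrument families contained in this partial tree
```

Storing only the parent of each class keeps the tree property explicit: every class has at most one parent, so each class belongs to exactly one level.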

B. Classification Approach
To approach IAD using a deep network, we pose the problem as a frame-wise, multi-label classification task. Thus, several instrument classes may be active simultaneously and we want to produce predictions for every frame in a music recording. Formally, we model both reference annotations and predictions as families of subsets of frames. We write $I$ for the set of all audio frames in our test recordings. Now, our instrument annotations are given as a family $(I^{\mathrm{Ref}}_c)_{c \in C}$ of sets $I^{\mathrm{Ref}}_c \subseteq I$ containing all frames where class $c \in C$ is active. Note that the sets $I^{\mathrm{Ref}}_c$ are generally not disjoint, e.g., in cases where instruments are playing at the same time. Frames with silence or noise are not contained in any $I^{\mathrm{Ref}}_c$. In a similar fashion, we model the estimates made by our IAD system as a family of sets $(I^{\mathrm{Est}}_c)_{c \in C}$. These outputs are obtained from a deep network that takes an audio excerpt as input and yields predictions for the center frame of that excerpt. These estimates are values in $[0, 1]$ for every class $c \in C$, which are subsequently thresholded to obtain the sets $I^{\mathrm{Est}}_c$. Since this approach jointly considers all hierarchy levels $1 \leq h \leq H$, we will refer to it as HC (hierarchical classification). More details on the network architecture used are provided in Section V.

C. Evaluation Measures
With our formulation above, we can define some classical metrics used for evaluating detection systems, such as frame-wise precision, recall, and F-measure for each class $c \in C$:
$$P_c = \frac{|I^{\mathrm{Ref}}_c \cap I^{\mathrm{Est}}_c|}{|I^{\mathrm{Est}}_c|}, \qquad R_c = \frac{|I^{\mathrm{Ref}}_c \cap I^{\mathrm{Est}}_c|}{|I^{\mathrm{Ref}}_c|}, \qquad F_c = \frac{2 P_c R_c}{P_c + R_c}. \qquad (2)$$
Intuitively, $P_c$ is the fraction of frames predicted as belonging to class $c$ that are indeed active, $R_c$ refers to the fraction of ground truth frames of class $c$ that are correctly identified, and $F_c$ is the harmonic mean of the two. Note that the three measures may be overly optimistic if $c$ is a very common class that is active in most frames (as is the case, e.g., for $c = \mathrm{INST}$). In this case, it is important to additionally consider the specificity for class $c$, i.e., the recall of non-active frames, defined as
$$S_c = \frac{|(I \setminus I^{\mathrm{Ref}}_c) \cap (I \setminus I^{\mathrm{Est}}_c)|}{|I \setminus I^{\mathrm{Ref}}_c|},$$
where $\setminus$ denotes the set difference operator.
A classification approach that is aware of class hierarchies should not produce predictions that are hierarchically inconsistent. For example, a frame $i$ may be classified as trumpet (Tpt), so $i \in I^{\mathrm{Est}}_{\mathrm{Tpt}}$, but at the same time may not be classified as brass (BR), so $i \notin I^{\mathrm{Est}}_{\mathrm{BR}}$. We will refer to this as a bottom-up inconsistency. Such an inconsistency makes the output of a detection system difficult to interpret, since it is unclear whether the frame was erroneously classified as trumpet or whether it does indeed contain brass, but the system failed to identify BR. Similarly, we may have a frame $i \in I^{\mathrm{Est}}_{\mathrm{BR}}$ that is classified as brass but at the same time is classified as neither horn nor trumpet, thus $i \notin I^{\mathrm{Est}}_{\mathrm{Hn}} \cup I^{\mathrm{Est}}_{\mathrm{Tpt}}$. We call this a top-down inconsistency. Fig. 2 gives an illustration of these inconsistencies. Ideally, a detection system produces no inconsistencies, making its output straightforward to interpret.
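The detection measures described above can be sketched directly on sets of frame indices. The following example is only an illustration: it assumes the standard set-based definitions of precision, recall, F-measure (harmonic mean), and specificity, and the function name `detection_scores` as well as the conventions for empty sets are hypothetical. The toy data shows the effect noted in the text, namely that a very common class can reach a high F-measure with zero specificity.

```python
def detection_scores(ref, est, frames):
    """Frame-wise precision, recall, F-measure, and specificity for one class.

    ref, est: sets of frame indices where the class is annotated / predicted.
    frames:   set of all frame indices under consideration.
    Empty denominators are mapped to 0.0 (assumed convention), except for
    specificity, which is set to 1.0 if there are no inactive frames.
    """
    tp = len(ref & est)
    precision = tp / len(est) if est else 0.0
    recall = tp / len(ref) if ref else 0.0
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    inactive = frames - ref                      # frames without the class
    specificity = len(inactive - est) / len(inactive) if inactive else 1.0
    return precision, recall, f_measure, specificity


# Toy example: a class that is active in 8 of 10 frames, and a system that
# simply predicts the class for every frame.
frames = set(range(10))
ref = set(range(8))
est = set(range(10))
p, r, f, s = detection_scores(ref, est, frames)
print(p, r, round(f, 3), s)
```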
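We now define additional measures that capture different kinds of inconsistencies. For a class $c \in C$, we write $c^{\uparrow}$ for the parent of $c$ in the hierarchy (e.g., $\mathrm{Fl}^{\uparrow} = \mathrm{WW}$). Similarly, $c^{\downarrow}$ denotes the set of children of $c$ (e.g., $\mathrm{BR}^{\downarrow} = \{\mathrm{Hn}, \mathrm{Tpt}\}$). Additionally, for a subset $C' \subseteq C$, we use the notation $I_{C'} = \bigcup_{c \in C'} I_c$. Now, for any $h > 1$ and $c \in C_h$, we define measures of bottom-up consistency, top-down consistency, and overall consistency as:
$$\gamma^{\uparrow}_c = \frac{|I^{\mathrm{Est}}_c \cap I^{\mathrm{Est}}_{c^{\downarrow}}|}{|I^{\mathrm{Est}}_{c^{\downarrow}}|}, \qquad \gamma^{\downarrow}_c = \frac{|I^{\mathrm{Est}}_c \cap I^{\mathrm{Est}}_{c^{\downarrow}}|}{|I^{\mathrm{Est}}_c|}, \qquad \gamma_c = \frac{|I^{\mathrm{Est}}_c \cap I^{\mathrm{Est}}_{c^{\downarrow}}|}{|I^{\mathrm{Est}}_c \cup I^{\mathrm{Est}}_{c^{\downarrow}}|}.$$
$\gamma_c$ is also called intersection-over-union or Jaccard index. Note that the consistency measures can trivially be maximized by setting, e.g., $I^{\mathrm{Est}}_c = I$ or $I^{\mathrm{Est}}_c = \emptyset$ for all $c \in C$. However, obtaining good detection results (e.g., in terms of F-measure) while simultaneously preserving consistency is non-trivial.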
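The three consistency measures can be sketched as ratios between the predicted frames of a parent class and the union of the predicted frames of its children. The reading used here is an assumption derived from the way the measures are interpreted later in the text: bottom-up consistency is the share of child-predicted frames that also carry the parent, top-down consistency is the share of parent-predicted frames that carry at least one child, and overall consistency is the Jaccard index of the two sets. The function name and the convention of returning 1.0 for empty denominators are hypothetical.

```python
def consistency(est_parent, est_children):
    """Bottom-up, top-down, and overall consistency for one parent class.

    est_parent:   set of frames predicted for the parent class.
    est_children: list of sets, one per child class.
    An empty denominator yields 1.0 (assumed convention: no prediction means
    no inconsistency).
    """
    union_children = set().union(*est_children)
    both = est_parent & union_children
    either = est_parent | union_children
    bottom_up = len(both) / len(union_children) if union_children else 1.0
    top_down = len(both) / len(est_parent) if est_parent else 1.0
    overall = len(both) / len(either) if either else 1.0
    return bottom_up, top_down, overall


# Toy example for brass (BR) with children horn (Hn) and trumpet (Tpt).
est_br = {0, 1, 2, 3}
est_hn = {0, 1}
est_tpt = {1, 2, 5}     # frame 5: trumpet without brass (bottom-up inconsistency)
print(consistency(est_br, [est_hn, est_tpt]))
```

In the toy example, frame 3 is a top-down inconsistency (brass without any child) and frame 5 is a bottom-up inconsistency, so neither directional measure reaches 1.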

IV. ORCHESTRAL DATASETS
As mentioned in Section II, orchestral and opera music are seldom explored in MIR. In particular, no standard datasets for IAD on real orchestral recordings exist. The instrument classes commonly considered in IAD datasets for popular music (e.g., [47], which contains guitar, drums, bass, etc.) do not usually appear in orchestral music. We thus assembled our own datasets for training and evaluating our system, based on existing datasets and our own annotation efforts.
For effectively training our deep learning system, we require a dataset of several hours in length. For such a size, manually annotating instrument activity for all 18 classes in our hierarchy would be prohibitively expensive. We thus consider two ways of obtaining orchestral recordings with aligned instrument annotations:
1) Use multi-track recordings of orchestral pieces, where activity annotations can easily be obtained from the individual tracks.
2) Use music synchronization techniques to align a score representation to an audio recording of a piece. Instrument activity in the recording can then be transferred from the aligned score.
A third possible option would be to use artificial recordings of pieces based on synthesized score representations. Recently, Sarkar et al. [48] released a dataset of synthesized, multi-track recordings of classical pieces. However, their dataset contains, for the most part, chamber music rather than full orchestra pieces. Their work also demonstrates that synthesizing convincing renditions of classical music is a challenging task in itself. Here, rather than creating artificial recordings of orchestral scores, we instead collect multiple real recordings per score for synchronization (option 2).

A. Multi-Track Datasets
Due to the challenges of recording orchestra pieces in a multi-track fashion, only a few such datasets have been released. One of these is Phenicx Anechoic [40], which contains clean multi-track recordings of four orchestral excerpts by different composers with note onsets and offsets manually annotated for each track. We derive instrument activity from these annotations. Böhm et al. [46] provide multi-track recordings for three movements of Beethoven's Symphony No. 8, resulting in the longest multi-track dataset we use. The individual tracks are mostly free of cross-talk. We therefore use a simple energy thresholding procedure on the tracks to obtain instrument activity annotations.
Since both datasets are annotated based on clean multi-track recordings, we expect the derived activity labels for Phenicx and Beethoven Anechoic to be highly reliable. Nevertheless, there remains some ambiguity in defining note onsets and offsets, especially for strings and woodwind instruments [49].
Prätzlich et al. [45] provide multi-track audio for three numbers from Carl Maria von Weber's opera "Der Freischütz". However, due to their recording setup, the individual tracks contain a large amount of cross-talk coming from other sources. We obtained annotations for this dataset using music synchronization, see below.

B. Music Synchronization
Audio-to-score alignment techniques are used to temporally align a recorded music performance with the corresponding musical score. Once an alignment is obtained, information about instruments or pitches played can be transferred from the aligned score to the recorded performance. Here, we use audio-to-score alignment to transfer instrument activity from a symbolic score to several recorded performances of a piece. A popular dataset annotated in this fashion is MusicNet [50], which contains chamber rather than orchestra pieces. Note that automatic audio-to-score alignments may introduce annotation errors. We thus expect the resulting activity labels to be less reliable compared to those obtained from multi-track data.
Even though a large number of MIDI files for classical music pieces can be found online, only a few correspond to full orchestral scores with separate MIDI tracks per instrument. The lack of available score data represents a major bottleneck for this approach to obtaining instrument annotations. For this work, we manually encoded a score representation of the first act of Richard Wagner's opera "Die Walküre" (requiring several months of work for musically trained annotators). For some other musical works (namely, several movements of Beethoven's Symphony No. 3, Dvorak's Symphony No. 9, and Tschaikowsky's Violin Concerto), we obtained clean orchestral scores from the Mutopia project. We chose these works because they each contain many instruments from the hierarchy we employ. Furthermore, they belong to the classical and romantic periods and are thus stylistically similar to the remaining pieces we use. We then obtained six orchestral audio versions (i.e., performances, recordings) of these pieces from commercial CD releases and created instrument activity annotations using a state-of-the-art score-to-audio synchronization pipeline [51], [52]. Care had to be taken to verify that there are no structural differences between score and recorded versions, which would corrupt the alignment results. Thus, the annotation process can be considered a semi-automatic approach, where one must expect alignment errors on the order of 0.2 s [53]. Such errors need to be kept in mind when interpreting evaluation results.

C. Dataset Overview and Split
An overview of all recordings used in this work is given in Table I. The multi-track datasets we use each contain one recording per piece and contribute a total duration of around 50 minutes of orchestral music. For the pieces annotated via music synchronization, we have several versions per piece. In total, our dataset contains roughly 14 hours of orchestral music, making it amenable to deep learning. We make all instrument activity annotations publicly available on our accompanying website (link provided in Section I).
Note that not all instrument classes are present in every recording and that there is a large imbalance among the activity of different classes. We define the fraction $\delta_c$ of frames where class $c \in C$ is active as
$$\delta_c = \frac{|I^{\mathrm{Ref}}_c|}{|I|}.$$
Additionally, for any $h < H$ and $c \in C_h$, we denote the fraction of frames where both $c$ and some other class $c' \in C_h$ of the same hierarchy level are active as
$$\lambda_c = \frac{\left| I^{\mathrm{Ref}}_c \cap \bigcup_{c' \in C_h \setminus \{c\}} I^{\mathrm{Ref}}_{c'} \right|}{|I|}.$$
Note that $0 \leq \lambda_c \leq \delta_c$. Intuitively, $\lambda_c$ captures the amount of "multi-labeledness" for class $c$. Fig. 3 shows the values of $\delta_c$ and $\lambda_c$ for each class $c$ in the different subsets of our dataset. We can observe that strings and woodwind instruments are the most common classes. Only "Die Walküre" and Freischütz Digital contain singing. For most classes and pieces, there is a large amount of joint activity among classes on the same hierarchy level, indicated by $\lambda_c$ being close to $\delta_c$. A notable exception is Vn in the Tschaikowsky recordings ($\delta_c = 0.92$, $\lambda_c = 0.35$), due to the many solo parts of the violin in this concerto.
To train and evaluate our IAD system, we split our dataset into train and test recordings. We put different movements into train and test in order to investigate whether our models are capable of generalizing to new musical content or whether they overfit to specific compositions (see also [54]). We also choose different versions in train and test to control for varying recording characteristics and aspects of interpretation (reducing the impact of the so-called album effect [55]). Among the multi-track datasets, we select No. 6 of Freischütz Digital, the second movement of Beethoven's Symphony No. 8, and all recordings in Phenicx Anechoic for our test set. From the remaining pieces, we choose the first movement of Beethoven's Symphony No. 3, the fourth movement of Dvorak's Symphony No. 9, and the third movement of Tschaikowsky's Violin Concerto for testing. Since we do not have multiple opera works that could be distributed into train and test sets, we choose an excerpt of the Wagner opera act (measures 697 to 955, corresponding to around twelve minutes of music), omit this excerpt during training, and use it for testing. In all cases, we choose one version for testing and use the remaining five versions for training.
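The class statistics $\delta_c$ and $\lambda_c$ introduced earlier in this subsection can be computed from reference annotations as in the following sketch. It assumes that annotations are available as sets of frame indices per class and that both fractions are taken relative to the total number of frames; the function name `class_statistics` and the toy data are hypothetical.

```python
def class_statistics(ref, level_classes, num_frames):
    """Compute (delta_c, lambda_c) for all classes of one hierarchy level.

    ref:           dict mapping class name to the set of its active frames.
    level_classes: the classes of the hierarchy level under consideration.
    num_frames:    total number of frames.
    """
    stats = {}
    for c in level_classes:
        # Frames where at least one other class of the same level is active.
        others = set().union(*(ref[o] for o in level_classes if o != c))
        delta = len(ref[c]) / num_frames
        lam = len(ref[c] & others) / num_frames
        stats[c] = (delta, lam)
    return stats


# Toy example with ten frames: a violin that mostly plays alone.
ref = {"Vn": set(range(9)), "Fl": {0, 1, 2}, "Hn": {2, 3}}
stats = class_statistics(ref, ["Vn", "Fl", "Hn"], num_frames=10)
print(stats["Vn"])  # large delta, much smaller lambda
```

A large gap between the two values, as for the violin in this toy example, corresponds to the solo passages mentioned above for the Tschaikowsky recordings.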

V. MODEL ARCHITECTURE
In this section, we give details on the architecture of our model for hierarchical classification. Note that the main technical focus of our work is on hierarchical structures and consistency losses rather than on a particular architecture; alternative architectures (e.g., based on ResNets [56]) could also be used here. We employ a convolutional neural network (CNN) inspired by standard VGG-like CNN architectures [57]. The architecture is illustrated in Table II. The network takes a harmonic CQT representation (HCQT, [58]) of an audio excerpt as input and outputs a vector of 18 values in [0, 1], corresponding to the activity of the 18 classes in C predicted for the center frame of the input excerpt.
The HCQT input consists of 201 frames (roughly 4.7 seconds), computed using a hop size of 512 samples on recordings sampled at 22050 Hz (i.e., a frame rate of roughly 43 Hz). The constant-Q spectrum ranges from C1 to B7 with three bins per pitch, meaning 252 bins in total. For the harmonic CQT, five harmonic representations (including one subharmonic) are stacked along the channel dimension. The final input tensor has a size of (201, 252, 5).
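The input dimensions stated above follow from a few lines of arithmetic, reproduced in the sketch below. All numbers are taken from the text; only the variable names are hypothetical.

```python
SAMPLE_RATE = 22050        # Hz
HOP_SIZE = 512             # samples between consecutive frames
NUM_FRAMES = 201           # frames per input excerpt
BINS_PER_PITCH = 3
NUM_HARMONICS = 5          # including one subharmonic

frame_rate = SAMPLE_RATE / HOP_SIZE          # frames per second
duration = NUM_FRAMES / frame_rate           # excerpt length in seconds

# C1 to B7 spans seven full octaves with twelve pitches each.
num_pitches = 7 * 12
num_bins = num_pitches * BINS_PER_PITCH

input_shape = (NUM_FRAMES, num_bins, NUM_HARMONICS)
print(round(frame_rate, 2), round(duration, 2), input_shape)
```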
The network consists of three stages, separated by double lines in the table. Inspired by [59], [60], we first process each input tensor with a large pre-filtering kernel of size (15, 15) and strides (1, 1), followed by a kernel of size and stride (1, 3), i.e., the kernel is applied in pitch direction only. The resulting intermediate feature map has a pitch axis with a single bin per pitch. The second stage of our network applies three conv-conv-pool processing blocks, as in [57], [61]. In the final stage, we use a convolutional filter of size (5, 1) and stride (1, 1) to aggregate temporal context (as in [59]) and apply several dense layers to obtain the final output. We further use batch normalization after each learnable layer and apply dropout before dense layers. All layers are followed by a leaky ReLU activation, except for the final dense layer, which is followed by a sigmoid.
TABLE II: Network architecture used for our IAD system.
We train our network by minimizing a binary cross-entropy loss for a maximum of 1000 epochs (with 320 batches per epoch, each containing 32 input excerpts) using the Adam optimizer with a learning rate of 0.002. We additionally use early stopping by terminating training after the validation loss (evaluated on a randomly selected subset of the training set) has not decreased for 15 epochs. We further halve the learning rate after 10 epochs without improvement of the validation loss. We use label smoothing inside the cross-entropy loss as regularization (thus, 0-labels are replaced with 0.02 and 1-labels are replaced with 0.98).
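The effect of label smoothing on the binary cross-entropy can be illustrated with a small framework-independent sketch. It uses the smoothing values from the text (0.02 and 0.98); the function names, the clipping constant, and the averaging over classes are assumptions made for this example and are not taken from the original training code.

```python
import math


def smooth(label, eps=0.02):
    """Replace hard 0/1 labels by eps and 1 - eps, respectively."""
    return eps if label == 0 else 1.0 - eps


def smoothed_bce(probs, labels, eps=0.02):
    """Mean binary cross-entropy over classes with smoothed targets."""
    total = 0.0
    for p, y in zip(probs, labels):
        t = smooth(y, eps)
        p = min(max(p, 1e-7), 1.0 - 1e-7)   # avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(probs)


# With smoothed targets, an overconfident correct prediction is penalized
# more strongly than a prediction that matches the smoothed target.
loss_calibrated = smoothed_bce([0.98], [1])
loss_overconfident = smoothed_bce([0.999999], [1])
print(loss_calibrated < loss_overconfident)
```

The comparison at the end shows why smoothing acts as a regularizer: the loss is minimized at the smoothed target rather than at an output of exactly 0 or 1.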
Each of the 252 bins in the input excerpts is individually normalized to be zero-mean and unit-variance (for this, mean and standard deviation per bin are estimated on the training set). As is common practice [42], [62], [63], [64], we augment training excerpts by randomly applying time warping, pitch shifting, masking of time-frequency bins, adding random noise, or applying random equalization. Training the model on our dataset takes around 14 hours on an RTX 2080 Ti.
At test time, we evaluate our network for every frame in the test recordings, apply a median filter of length 0.5 seconds on the sequence of predictions for each class, and then downsample to a feature rate of 5 Hz. This is common practice in musical activity detection systems (e.g., [63], [65]) and makes our evaluation results more robust to annotation errors introduced by music synchronization (see Section IV). We finally binarize these predictions using a threshold of 0.5. Inference requires 3.5 seconds per minute of input audio (given pre-computed HCQT representations).
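The test-time post-processing described above (median filtering, downsampling, thresholding) can be sketched for a single class as follows. Several details are assumptions, since the text does not specify them: the filter length is rounded down to an odd number of frames, windows are truncated at the boundaries, downsampling picks the nearest input frame, and a value equal to the threshold counts as active. All function names are hypothetical.

```python
from statistics import median


def median_filter(x, length):
    """Median filter with an odd window length and truncated boundary windows."""
    half = length // 2
    return [median(x[max(0, i - half): i + half + 1]) for i in range(len(x))]


def resample_nearest(x, rate_in, rate_out):
    """Downsample by picking the nearest input frame for each output frame."""
    n_out = int(len(x) * rate_out / rate_in)
    step = rate_in / rate_out
    return [x[min(len(x) - 1, round(k * step))] for k in range(n_out)]


def postprocess(probs, frame_rate=22050 / 512, filter_seconds=0.5,
                target_rate=5.0, threshold=0.5):
    """Turn frame-wise probabilities of one class into binary activity at 5 Hz."""
    length = int(filter_seconds * frame_rate)
    if length % 2 == 0:          # enforce an odd window length
        length += 1
    smoothed = median_filter(probs, length)
    coarse = resample_nearest(smoothed, frame_rate, target_rate)
    return [p >= threshold for p in coarse]


# Toy example: 100 active frames with a one-frame dropout, then 100 inactive frames.
probs = [0.9] * 100 + [0.1] * 100
probs[50] = 0.1
activity = postprocess(probs)
print(len(activity), activity)
```

The one-frame dropout at index 50 is removed by the median filter, which is the kind of short-term error that this step is meant to suppress.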

VI. MAIN RESULTS
In this section, we present the main evaluation results for our IAD system and, additionally, demonstrate that hierarchy information reduces the need for fine-grained labels during training.
We begin with our main experiment. We train the model described in Section V on the training subset of our dataset and subsequently evaluate it on the test set (according to the split described in Section IV). Recall that we choose a joint classification approach HC (short for hierarchical classification), i.e., the network outputs predictions for all 18 classes in C, see also Section III.
Evaluation results on the test set are shown in Table III. Columns contain different metrics, computed over the entire test set (note that the consistency metrics $\gamma$ are only defined for classes in $C_h$ for $h > 1$). Rows correspond to different classes in $C$, and the last three rows show averages over classes. In particular, we report averages for families ($C_2$) and instruments ($C_1$). These are macro averages, i.e., computed as the arithmetic mean of the results for the individual classes. As such, both common and uncommon classes contribute equally to these averages.
Overall, evaluation results are moderately high with an average F-measure of 0.81 and specificity of 0.83. For comparison, Hung and Yang [16] achieve an average F-measure of 0.89 for IAD on recordings of small classical ensembles with seven different instrument classes. However, our results vary across classes. We can observe that classifying instrument families works better on average ($F = 0.82$) than classifying fine-grained classes ($F = 0.79$). For example, for woodwinds (WW), the family F-measure of 0.87 is much higher than the detection results obtained for the individual woodwind instruments (e.g., for clarinets: $F_{\mathrm{Cl}} = 0.72$). Some very common classes yield high F-measures, but low specificity (e.g., for strings: $\delta_{\mathrm{ST}} = 0.87$, $F_{\mathrm{ST}} = 0.95$, and $S_{\mathrm{ST}} = 0.65$), indicating that our system produces many false positive predictions for these classes. In Section VIII, we will conduct some additional analyses to better understand these detection results with regard to possible confounding effects.
For singing activity (VOC), we obtain a family F-measure of 0.90 and the difference to the results for the individual vocal classes female (Fe) and male (Ma, for both: $F = 0.88$) is small. For reference, one obtains accuracies of around 0.91 for singing voice detection on popular music [61], [63]. With regard to consistency, $\gamma^{\uparrow}$ is always high. Therefore, predictions on a fine-grained class are almost always accompanied by a prediction for the parent class. However, $\gamma^{\downarrow}$ is lower for some classes such as WW ($\gamma^{\downarrow}_{\mathrm{WW}} = 0.92$). Thus, for about eight percent of frames where the woodwind family is predicted as active, none of the woodwind instruments is identified. In Section VII, we will consider loss terms for improving these consistency issues.
To demonstrate the impact of our hierarchical classification approach HC, we now analyze whether utilizing the instrument hierarchy during training can reduce the amount of fine-grained labels required. To this end, we compare HC with a flat classification baseline FC. There, our model is trained to only produce predictions for classes in $C_1$ (i.e., at the instrument level). At test time, we obtain predictions for classes in $C \setminus C_1$ by aggregating predictions from lower levels in a bottom-up fashion. For example, VOC is predicted if and only if the model has classified the input as Fe or Ma. By construction, the predictions obtained in this way are always consistent, so $\gamma_c = \gamma^{\uparrow}_c = \gamma^{\downarrow}_c = 1$ for all classes $c$. The FC baseline allows us to determine the detection quality that can be achieved without informing the model about the class hierarchy.
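The bottom-up aggregation used for the FC baseline can be sketched as a logical OR over the children of each class, applied to the binarized instrument-level predictions of one frame. This follows the VOC example in the text; whether the original system aggregates binary decisions or probabilities is not stated here, so the binary variant is an assumption, as are the function name and the toy hierarchy.

```python
def aggregate_bottom_up(leaf_active, children):
    """Derive activity of higher-level classes from instrument-level decisions.

    leaf_active: dict mapping each fine-grained class to True/False.
    children:    dict mapping each higher-level class to a list of its children.
    A higher-level class is active if and only if at least one child is active.
    """
    active = dict(leaf_active)

    def resolve(c):
        if c not in active:
            active[c] = any(resolve(k) for k in children[c])
        return active[c]

    for c in children:
        resolve(c)
    return active


# Toy example for one frame with a partial hierarchy.
children = {"INST": ["VOC", "BR"], "VOC": ["Fe", "Ma"], "BR": ["Hn", "Tpt"]}
leaf_active = {"Fe": False, "Ma": True, "Hn": False, "Tpt": False}
result = aggregate_bottom_up(leaf_active, children)
print(result["VOC"], result["BR"], result["INST"])
```

Because every parent decision is computed from its children, neither a bottom-up nor a top-down inconsistency can occur in the output.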
We now reduce the amount of fine-grained instrument labels that are available during training. For FC (flat classification baseline), we do so by reducing the size of the training set, since this approach does not utilize class labels at higher levels for training. For HC (hierarchical classification), we disable the cross-entropy loss associated with classes in $C_1$ on a portion of the training items, but we still use the instrument family and activity labels for these items. This experimental setup is illustrated in Fig. 4.
Fig. 5 shows the results of this experiment. The upper plot shows results for higher-level classes (families and instrument activity), whilst the lower plot contains results for fine-grained classes. When utilizing all fine-grained labels of the training dataset, we observe almost identical average F-measures for $C_1$ with both FC and HC (lower plot, leftmost point). For $C \setminus C_1$ (upper plot), HC yields slightly higher average F-measures at 0.89 (compared to $F = 0.87$ for FC). When reducing the amount of $C_1$ labels used, HC outperforms FC on both fine-grained and higher-level classes. For example, at 10% of labels used, we obtain an average F-measure of 0.84 on $C \setminus C_1$ for FC, which drops to $F = 0.60$ for 0.1% of labels. Meanwhile, the results for HC on $C \setminus C_1$ stay roughly constant at around $F = 0.88$. HC also yields higher F-measures for $C_1$, with $F = 0.76$ at 1% of labels used as opposed to $F = 0.55$ for FC.
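Disabling the cross-entropy loss for fine-grained classes on some training items can be sketched as a masked loss. The example below is a simplified, framework-independent illustration for a single training item; the function name `masked_bce`, the averaging over the remaining classes, and the clipping constant are assumptions rather than details of the original implementation.

```python
import math


def masked_bce(probs, targets, fine_classes, has_fine_labels):
    """Mean binary cross-entropy for one item, skipping unlabeled fine classes.

    probs:           dict mapping class to predicted probability.
    targets:         dict mapping class to 0/1 label (fine-grained classes may
                     be missing if has_fine_labels is False).
    fine_classes:    set of instrument-level classes.
    has_fine_labels: whether instrument-level labels exist for this item.
    """
    total, count = 0.0, 0
    for c, p in probs.items():
        if c in fine_classes and not has_fine_labels:
            continue                            # no loss for unlabeled classes
        t = targets[c]
        p = min(max(p, 1e-7), 1.0 - 1e-7)       # avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        count += 1
    return total / count


probs = {"Tpt": 0.1, "BR": 0.9, "INST": 0.9}
fine = {"Tpt"}
# Fully labeled item: the missed trumpet contributes to the loss.
full = masked_bce(probs, {"Tpt": 1, "BR": 1, "INST": 1}, fine, True)
# Item with family-level labels only: the trumpet output is ignored.
partial = masked_bce(probs, {"BR": 1, "INST": 1}, fine, False)
print(round(full, 3), round(partial, 3))
```

With this kind of masking, items that only carry family and activity labels still provide a training signal for the higher hierarchy levels.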
We have seen that, by utilizing higher-level structure when fine-grained labels are scarce, our hierarchical classification approach HC can still yield good results, even for small amounts of instrument-level labels. This opens up the possibility of incorporating partially labeled data for training, where the instrument family is known, but fine-grained labels are unavailable (e.g., monotimbral recordings of brass or string ensembles).

VII. CONSISTENCY LOSSES
As discussed in Section VI, our hierarchical IAD approach HC outperforms the flat classification baseline FC in terms of F-measures. However, the predictions of FC are always consistent, making the output of that system easier to understand compared to HC, which may produce inconsistent outputs. In this section, we will investigate additional loss terms for HC that can address this shortcoming.
In our previous work [2], which this article extends, we describe two loss terms for improving bottom-up and top-down consistency of predictions, respectively. In the following, we write $p_c$ for the probability predicted by a model for class $c$ on a given training item. In order to improve bottom-up consistency, we minimize a loss term $L^{\uparrow}$, which is adapted from [32]. This loss is a sum containing terms for every pair $(c, c')$ of a parent class $c$ and a child class $c'$, where the model incurs a loss whenever $p_{c'} > p_c$. Thus, the model is penalized for any bottom-up inconsistency. Similarly, to improve top-down consistency, we introduced a loss term $L^{\downarrow}$ in [2]. Here, the model incurs a penalty whenever the output $p_c$ for a parent class $c$ is larger than $p_{c'}$ for any of the child classes $c'$ (i.e., in case of a top-down inconsistent output). These losses are combined with the standard cross-entropy loss $L_{\mathrm{BCE}}$ using weights $\alpha, \beta \in \mathbb{R}$, where $\alpha$ weights $L^{\downarrow}$ and $\beta$ weights $L^{\uparrow}$, yielding the final loss for our model. We will denote the hierarchical classification approach trained with these additional losses as $\mathrm{HC}_{\alpha,\beta}$.

Results for training with these additional consistency losses are shown in Table IV. Here, we set $\alpha = \beta = 10$ (as in [2]). With regard to detection measures such as F-measure and specificity, we obtain similar results as in our previous experiments that did not employ consistency losses (see Table III). For example, we get an average F = 0.82 and S = 0.82 for $\mathrm{HC}_{\alpha,\beta}$ compared to F = 0.81 and S = 0.83 for HC. However, the top-down consistency scores $\gamma^{\downarrow}$ have improved (e.g., $\gamma^{\downarrow}_{\mathrm{BR}} = 1.00$ and $\gamma^{\downarrow}_{\mathrm{WW}} = 0.98$ for $\mathrm{HC}_{\alpha,\beta}$ compared to $\gamma^{\downarrow}_{\mathrm{BR}} = 0.94$ and $\gamma^{\downarrow}_{\mathrm{WW}} = 0.92$ for HC). Thus, the additional loss terms can improve consistency while retaining the overall quality of results.
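The two consistency terms described in this section can be illustrated with a small sketch. The exact functional form of the losses is not reproduced here, so the hinge penalties `max(0, ...)` below are an assumption that merely matches the verbal description (a loss is incurred only when a child exceeds its parent, or when a parent exceeds all of its children). The function names and the example hierarchy are hypothetical.

```python
def bottom_up_loss(p, children):
    """Sum of hinge penalties over parent-child pairs where child > parent."""
    return sum(
        max(0.0, p[child] - p[parent])
        for parent, kids in children.items()
        for child in kids
    )


def top_down_loss(p, children):
    """Hinge penalty per parent whose probability exceeds that of every child."""
    return sum(
        max(0.0, p[parent] - max(p[child] for child in kids))
        for parent, kids in children.items()
    )


def total_loss(bce, p, children, alpha=10.0, beta=10.0):
    """Combine a given BCE value with both terms (alpha: top-down, beta: bottom-up)."""
    return bce + alpha * top_down_loss(p, children) + beta * bottom_up_loss(p, children)


children = {"VOC": ["Fe", "Ma"]}
# Child Fe is more probable than its parent VOC: bottom-up inconsistent.
p_bottom_up_violation = {"VOC": 0.3, "Fe": 0.7, "Ma": 0.1}
# Parent VOC is more probable than both children: top-down inconsistent.
p_top_down_violation = {"VOC": 0.9, "Fe": 0.2, "Ma": 0.4}
```

In this sketch, a fully consistent output contributes zero to both terms, so the combined loss reduces to the cross-entropy term.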
A more detailed analysis of the impact of the loss weights $\alpha$ and $\beta$ is provided in Fig. 6. Here, we either use solely $L^{\downarrow}$ (by setting $\beta = 0$ and increasing $\alpha$, first column), use solely $L^{\uparrow}$ (setting $\alpha = 0$ and increasing $\beta$, second column), or use both losses simultaneously ($\alpha = \beta$, third column). The upper row shows classification results in terms of average F-measure (F) and specificity (S), while the lower row shows average consistency scores. As expected, by using solely $L^{\downarrow}$, we are able to improve $\gamma^{\downarrow}$. However, $\gamma^{\uparrow}$ decreases for large values of $\alpha$. The opposite behavior can be observed when using only $L^{\uparrow}$. In both cases, $\gamma$ decreases while F-measure and specificity remain roughly constant. By utilizing both $L^{\downarrow}$ and $L^{\uparrow}$, we are able to improve $\gamma$. However, F and S are reduced for large values of $\alpha = \beta$. We conclude that both $L^{\downarrow}$ and $L^{\uparrow}$ are required to increase consistency, while $\alpha = \beta$ should be chosen small enough in order not to deteriorate detection results.
Overall, our results suggest that, while consistency is a necessary condition for interpretable system outputs, it is not sufficient to achieve good classification results. Our losses can be used to induce more consistent detection outputs for our model, but high consistency needs to be balanced with preserving classification results.

VIII. ANALYSIS OF CONFOUNDING FACTORS
In this section, we aim at a deeper understanding of the behavior of our model. In particular, we analyze how the detection of an instrument class is affected by the presence of other instruments playing simultaneously. To this end, we systematically evaluate our model on audio inputs for which we can calculate the relative salience of different classes within the overall mixture (concretely, we use the multi-track orchestral datasets discussed in Section IV). In this way, we can reach conclusions about confounding effects exploited by our model.
A representative example of an analysis result is shown in Fig. 7. In this example, we are interested in the interactions between brass (BR) and woodwind (WW) instruments. In orchestra music, brass and woodwinds are often active simultaneously. For this analysis, we select frames for which no brass (BR) is active and then observe the predictions for BR depending on the salience of woodwind (WW) instruments. The vertical axis shows average probabilities predicted by our model for BR. The probabilities are averaged over all frames where woodwinds have a certain salience within the orchestral mixture (horizontal axis). Here, salience is measured as a signal-to-noise ratio, with woodwinds considered as signal and all other instrument sounds considered as noise. We observe that the model outputs low probabilities of around 0.1 for BR if WW is inactive. However, these outputs gradually increase as woodwinds become more salient in the input. We conclude that brass detection is highly sensitive to woodwind instruments, even if no brass instrument is active in the mixture. This is a confounding effect.
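The analysis procedure described above (measuring salience as a signal-to-noise ratio and averaging predictions over frames of similar salience) can be sketched as follows. This is an illustration under assumptions: expressing the ratio in decibels, the frame representation as lists of samples, the bin edges, and the function names `salience_db` and `mean_probability_per_bin` are all hypothetical choices, not details taken from the experiments.

```python
import math


def salience_db(signal_frame, noise_frame, eps=1e-12):
    """SNR in dB between the target stems (signal) and all other stems (noise)."""
    signal_power = sum(x * x for x in signal_frame) / len(signal_frame)
    noise_power = sum(x * x for x in noise_frame) / len(noise_frame)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


def mean_probability_per_bin(saliences, probabilities, bin_edges):
    """Average predicted probabilities over frames grouped by salience bin.

    Bin i covers bin_edges[i] <= salience < bin_edges[i + 1].
    Empty bins yield None.
    """
    n_bins = len(bin_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for s, prob in zip(saliences, probabilities):
        for i in range(n_bins):
            if bin_edges[i] <= s < bin_edges[i + 1]:
                sums[i] += prob
                counts[i] += 1
                break
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]


# Hypothetical per-frame woodwind saliences (dB) and model outputs for BR,
# restricted to frames without active brass.
ww_salience = [-15.0, -12.0, 3.0]
br_probability = [0.1, 0.2, 0.6]
curve = mean_probability_per_bin(ww_salience, br_probability, [-20, -10, 0, 10])
```

Plotting such a curve against the bin centers would give a diagram of the same kind as the one discussed for Fig. 7, with rising values indicating a confounding effect.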
We can reach a number of similar conclusions by performing systematic analyses:
1) The model also exploits the presence of woodwind instruments for detecting brass if brass instruments are present in the mixture.
2) Our model is biased towards predicting string activity, even when no strings are active in the input. Thus, the model exploits the fact that strings are active in most frames of the recordings in our dataset (see also Fig. 3).
3) Predictions for strings are not affected much by the presence of other instrument families.
4) Looking at the model behavior for woodwind instrument classes, we find that detection of flutes is improved by the simultaneous activity of other woodwind instruments, similar to the behavior observed for BR.
It is important to note that it may indeed be desirable for our model to use these confounding factors for detection, since these are strong cues for IAD on many orchestral works from the classical and romantic periods (as present in our dataset). Yet, use of these confounding factors may also limit the generalization ability of our model to music with other instrument statistics, e.g., works with many brass-only sections. To reduce the impact of these effects, one may collect additional training data (which is cumbersome, see Section IV). Entirely removing confounding effects from our system may be impossible, however, as we also discuss in the next section.

IX. CONCLUSION
In this article, we investigated instrument activity detection in the context of complex orchestral music recordings. We showed that utilizing information about hierarchical relationships between instruments is helpful, especially when only few fine-grained instrument-level labels are available. Furthermore, we demonstrated how one can increase the consistency of predictions across hierarchy levels using additional consistency losses, while preserving detection quality. To perform these experiments, we collected a large dataset of real-world opera and orchestra recordings with aligned instrument activity annotations. Finally, we analyzed the behavior of our detection system and identified confounding effects exploited by our model.
Future work may make use of more complex instrument hierarchies (e.g., hierarchies that incorporate knowledge about different sound production techniques), train more complex detection models, collect additional data for training and testing (e.g., using music synchronization or synthesizers), or extend the scenario considered here towards orchestral music from other epochs (e.g., Baroque).
However, even when collecting additional data, it is likely that our model will continue to exploit confounding effects arising from the training data, as shown in our analysis in Section VIII. This may be desirable (in case that training and test conditions are very similar) or undesirable (when generalizing to music from different styles). Our analysis results mirror those obtained for other systems for music classification. For example, Kelz and Widmer [67] found that a neural network for piano transcription trained on combinations of notes struggles with correctly classifying unknown note combinations. The system performs transcription for certain notes by considering other, unrelated notes as well. The confounding effects exploited in MIR systems can often be even less intuitive. In [68], for example, a system for recognizing Latin music styles (which are usually characterized by their rhythms) was shown to exploit tempo information. Similarly, the authors in [69] found that a system for genre recognition is sensitive to sounds outside the human hearing range. In [70], a deep learning system for leitmotif detection in opera recordings is introduced. It is shown to rely heavily on spectral statistics as opposed to melody or rhythm. Such problems are amplified when evaluating a system on music styles not seen during training. For example, the authors in [71] showed that off-the-shelf music classifiers perform poorly on a large and diverse music database and are highly sensitive to encoding artifacts.
In order to tackle this challenge for IAD, one may consider using source separation as pre-processing, thereby reducing the impact of unrelated instruments on detection outputs. However, owing to the scarcity of multi-track orchestral data, no off-the-shelf systems for source separation in orchestra recordings currently exist, leaving this as an avenue for future work.