Vocal92: Audio Dataset With a Cappella Solo Singing and Speech

Singer recognition plays a vital role in music information retrieval systems. Most songs used in singer recognition systems are mixtures of music and voice, and labeled a cappella solo singing data suitable for singer recognition is lacking. Text-independent singer recognition systems encode audio features such as voice pitch, intensity, and timbre to achieve good performance. Most such systems are trained and evaluated on music with accompaniment, and the background music limits the performance of the singer recognition model. By contrast, a powerful singer identification system can be trained and evaluated on a cappella solo singing voice, which is clear and covers a broad range of vocal qualities. Clear, labeled singing data suitable for singer recognition research is therefore needed. To address this issue, we present Vocal92, a multivariate a cappella solo singing and speech audio dataset spanning around 146.73 hours sourced from volunteers. Furthermore, we use three models to construct a singer recognition baseline system. In experiments, the singer recognition model trained on a cappella solo singing data performs well on both single-mode and cross-modal verification data, improving significantly on related work. The dataset is accessible to everyone at https://pan.baidu.com/s/1Pn62DHfal2OOZ_5JqgGBdQ with jnz5 as the validation code. For non-commercial use, the dataset is available free of charge at the IEEE DataPort (https://ieee-dataport.org/documents/vocal92-multimodal-audio-dataset-cappella-solo-singing-and-speech).


I. INTRODUCTION
Singing is an exclusive sound art produced by a rhythmic combination of one or more vocal organs [1]. Singing voice has always been an exciting and abundant area of research. There is a variety of singing styles and techniques. Different singing styles require proper coordination and control of vocal organs such as the lungs, throat, pharynx, nose, and mouth [2]. To analyze and study different singing styles, we need to analyze the production process of singing carefully. The analysis of singing sounds is challenging. It enables exploring various areas of study (e.g., song emotion analysis, lyric recognition, separation of singing sounds, classification of singing types, singer identification, and singer tracking in duet songs) only through songs [3]. Singer recognition is one of the leading research areas of the singing voice. Speech and singing are different expressive entities of human beings. Even though the organs involved in producing sound are the same, their extended frequency domain information is different. Speech is a natural use of vocal organs, but singing involves precise control of various organs. E. A. Zveglic presented a comprehensive study of the relationship between speech and singing [1]. The research shows that singing stretches or lengthens acoustic features, while speech sacrifices acoustic features. Medeiros et al. invited three speakers and three singers to give a lecture and sing on a book of modern Brazilian literature [4]. They tested the hypothesis that singing is more stable than speech, particularly in pitch and duration. Livingstone et al. observed that singing exhibits longer duration, higher pitch, and greater sound intensity than speech [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Sunil Karamchandani.
Singer recognition based on speaker recognition requires comparing two audio samples and evaluating whether the voices belong to the same person [6]. Most research in singer recognition focuses on modeling the features of singers from hybrid entities of music and voice [7]. However, accompanied songs only exhibit a limited part of the singer's possible dynamic vocal range [8]. As a result, such singer recognition systems may generalize less well to various singing styles and pronunciation effects. The voice in a cappella solo singing is a speaking style that clearly demonstrates the singer's multiple features [9]. There are significant differences between the spoken language and the a cappella solo singing of the same speaker. Apart from the perceived differences in pitch, intensity, and timbre, there are also differences in the physiological formation of sung speech [10], [11]. Different singing styles and languages further enrich the acoustic differences between spoken and sung sounds, bringing some challenges to the speaker recognition system [12]. Due to intentional voice modulation, singing voice increases intra-speaker variance and decreases inter-speaker variance, resulting in a broader acoustic spectrum, which is one of the main challenges in identifying a singer from a singing voice [13]. In addition, the presence of background music and choruses in existing music datasets increases the uncertainty of the task [8]. Thus, the ability of a singer recognition system to correctly evaluate whether multiple songs belong to the same person can be used to assess its robustness.
Although many audio datasets exist, speech-singing modes audio datasets, especially those containing a cappella solo singing, still need to be improved. Therefore, a sizeable dataset containing a cappella solo singing and speech is necessary. In this study, we collected a new audio dataset, Vocal92, to study singer identification from speech and a cappella singing voice. We also explore the influence of taking singing data into training and testing on the generalization ability and robustness of the singer recognition model.
The structure of this paper is as follows. First, we compare Vocal92 with existing datasets, including existing work on singing sound analysis and application. Then, we record the collection and collation of Vocal92 and describe the structure of the dataset in detail. Finally, we construct a singer recognition baseline system to prove a broader range and richer feature information of the a cappella solo and achieve better performance, which also shows the practicability of this dataset.

II. RELATED WORK
Some early literature classified singing as a speech style and used speaker clustering algorithms to cluster it [5], [9], [16]. In another paper [14], the author used the singing voice for speaker recognition. However, no cross-modal experiments were conducted in which models were trained on the speaking data and tested on the singing voice (and vice versa).
Those works were extended in [15] and [16] to evaluate cross-modal speaker recognition; however, the results were not yet satisfactory. The JukeBox dataset expanded cross-modal experiments and facilitated speaker recognition research on singing voice data. A key reason for the lack of research on singer recognition is the lack of adequate development and evaluation data [17], [18]. Although there have been some singing voice datasets in the field of singer recognition, they cannot yet be used to evaluate the robustness of singer recognition systems across modalities.
The Artist20 dataset [19] contains 1413 songs from different albums by 20 European and American pop music artists or groups. The labels of artists/groups refer to associated musical groups or bands rather than individual singers, so the features may vary considerably.
Vocalset [20] is a dataset of clean singing by nine female and eleven male professional singers. The dataset consists of 3560 wave files with a total of 10.1 hours of recorded audio ranging from 1 second to 1 minute. Vocalset recorded vowels and various vocal techniques, such as scales, arpeggios, and long notes.
The singing voice dataset [21] contains over 70 significant recordings of Chinese opera performed by 28 professional and amateur singers. It is mainly opera, with no multilingual popular music, and the dataset can only be used as a test set since it is not large enough.
The JukeBox [22] dataset contains 467 hours of 16 kHz sampled singing audio data downloaded from the Internet Archive (IA). It covers a total of 936 different singers, 533 of whom are male. The singing voice dataset is annotated with singer, gender, and language labels for developing and evaluating speaker recognition methods. However, most of the singing audio data downloaded from the Internet has background accompaniment, which affects the accuracy of singer identification.
In the era of data-driven deep learning technology development, the lack of high-quality datasets with a cappella solo singing data has limited the progress of singer recognition research and applications [23].
In this paper, we propose a large vocal dataset of a cappella solo singing and speech that is annotated with labels such as singer, gender, age, and language. Table 1 shows a list of music datasets in comparison to Vocal92. We also illustrate the usefulness of Vocal92 by implementing a baseline system trained on Vocal92, which can be used in areas such as singer recognition and song conversion. In the following sections, we describe the dataset in detail, along with the data collection process and several experimental scenarios, and analyze the performance of state-of-the-art singer recognition methods on this dataset.

III. DATA COLLECTION

A. SINGER RECRUITMENT
A call for participants for vocal recordings was posted online when offline activities were suspended because of the COVID-19 pandemic. We had the following requirements for volunteers:
1) Passion for music and a certain level of singing ability.
2) Ability to record clear audio through a smartphone or computer microphone in a quiet environment.
3) A choice of at least 10 songs that the volunteer sings well, each no less than 2 minutes long.
We recruited 92 amateur singers to record voice data. The majority of participants were university graduates, along with some of their family and friends. The collected dataset consists of popular music sung in Mandarin Chinese, Cantonese, and English; each recording is in the language of the song chosen by the volunteer. The recordings were collected over a period of 10 to 50 days. The gender and age distribution of the singers is shown in Figure 1.

B. RECORDING SETTINGS
Singers recorded a cappella songs and read the lyrics in a quiet setting using either a mobile phone or a computer microphone. Each performer recorded a minimum of ten songs, and each recording was stored as a separate audio file in formats such as WAV, MP3, and M4A. Sampling rates for these recordings were generally 48 kHz or 44.1 kHz.
The audio files were recorded in a 2-channel stereo format and converted during preprocessing to single-channel WAV files with a 16 kHz sampling rate.
The distribution of audio length is shown in Figure 2.

Dataset Organization:
The Vocal92 dataset includes 4453 a cappella solo recordings and lyric readings, totaling 146.73 hours of voice data. It consists of both singing and speech by singers (36 male and 56 female) speaking in various languages. The data is organized in folders, with subfolders for each artist and song title, and includes audio files. To facilitate use in research, the dataset has been divided into train and test subsets. Additionally, the set of 92 singers in the dataset has been split into two subsets, as shown in Table 2.
140960 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
We also selected audio from two volunteers singing and speaking for 5 seconds each and plotted the narrowband spectrograms, shown in Figure 3.
• Test set: Some singers with at least two audio samples constitute the test set (9 subjects). The test set is separated into a speech test set, a singing test set, and a speech plus singing test set. This test set is reserved for evaluating trained singer recognition models on speech and singing voice data.

IV. METHODS
In this section, we discuss the proposed methodology and the workflow used to identify singers. First, we discuss the architecture of our singer recognition baseline system.

A. SINGER RECOGNITION SYSTEM
The singer recognition model consists of a training stage and a testing stage. During the training stage, three advanced neural networks are used to create embedding models for individual singers. In the testing stage, the similarity between the enrollment audio and the test audio is measured by scoring their embeddings with probabilistic linear discriminant analysis (PLDA) or cosine similarity. The general architecture of our baseline is depicted in Figure 4.
The feature extraction component converts the input audio into spectrogram features using the SpeechBrain toolkit [24].
Our baseline system has been designed to facilitate the recognition of singers through the integration of these advanced algorithms and the neural network model.

B. MODEL ARCHITECTURE
In this study, we utilize three state-of-the-art systems to extract speaker embeddings for the singer identification task: X-vector [25], emphasized channel attention, propagation and aggregation in time delay neural network (ECAPA-TDNN) [26], and ResNet50. Table 3 lists the detailed parameters for the X-vector and ECAPA-TDNN architectures.

1) X-VECTOR-PLDA
The x-vector model [25] is a time delay neural network (TDNN) that aggregates variable-length inputs across time to create fixed-length representations capable of capturing speaker characteristics. Speaker embeddings are extracted from a bottleneck layer before the output layer. The method uses time-delay DNNs to generate embeddings, which are then compared through an independently trained back-end such as PLDA. First, the time-delay layers extract short-time frame-level context. The statistical pooling layer then aggregates over the input segment and calculates the mean and standard deviation. Finally, the singer is classified by a DNN. The resulting segment-level singer embeddings are called x-vectors. Zhang et al. [27] selected the X-vector model in the training phase of the singer recognition system and used PLDA to calculate the verification score in the testing phase.
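The statistical pooling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `statistics_pooling` is hypothetical, frames are plain lists of floats, and the population form of the standard deviation is an assumption (toolkits differ on this detail). The point it shows is that inputs of any length map to a vector of fixed size, twice the frame dimension.

```python
import math


def statistics_pooling(frames):
    """Collapse a variable-length list of frame vectors into [means..., stds...]."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    # Per-dimension mean over all frames of the segment.
    means = [sum(frame[d] for frame in frames) / n for d in range(dim)]
    # Per-dimension population standard deviation (assumed form).
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Two segments of different length produce vectors of the same size.
short_segment = statistics_pooling([[1.0, 2.0], [3.0, 6.0]])
long_segment = statistics_pooling([[1.0, 2.0], [3.0, 6.0], [1.0, 2.0], [3.0, 6.0]])
```

In a real x-vector network the pooled vector is passed to further fully connected layers, and the embedding is read from one of those layers rather than from the pooled statistics directly.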
In our approach, the x-vectors of the training set are used to train the PLDA model [28], which is subsequently utilized for scoring. The two parameters of the PLDA model, one of which is φ, are estimated from the training data with the classical iterative EM algorithm. In the test phase, we calculate whether two audio samples are generated in the same speaker space regardless of intra-class spatial differences.
We use the log-likelihood ratio to calculate the score, as presented in Equation (1):
$$\mathrm{score} = \log \frac{p(\eta_1, \eta_2 \mid H_s)}{p(\eta_1 \mid H_d)\, p(\eta_2 \mid H_d)} \tag{1}$$
Here, $\eta_1$ and $\eta_2$ are the x-vectors of the two sounds. $H_s$ is the hypothesis that the two sounds come from the same space, and $H_d$ is the hypothesis that they come from different spaces. $p(\eta_1, \eta_2 \mid H_s)$ is the likelihood that the two voices come from the same space, and $p(\eta_1 \mid H_d)\, p(\eta_2 \mid H_d)$ is the likelihood that they come from different spaces. By calculating the log-likelihood ratio, we can measure how similar the two sounds are. The higher the score, the more likely it is that the two voices belong to the same speaker.
2) ECAPA-TDNN
Desplanques et al. [26] propose several improvements to the x-vector architecture. Specifically, they introduce the ECAPA-TDNN model, which includes 1-dimensional Res2Net modules with skip connections and squeeze-and-excitation (SE) blocks to capture channel interdependencies, and a channel-dependent self-attention mechanism that uses global context at the frame-level layers and the statistics pooling layer. Additionally, the ECAPA-TDNN model aggregates and propagates features across multiple layers.
To measure the similarity of two audio segments using the ECAPA-TDNN model, we employ the cosine similarity measure. Cosine similarity is a measure of similarity between two non-zero vectors in a multi-dimensional space, calculated as the cosine of the angle between the vectors. It is given in Equation (2):
$$\cos(\theta) = \frac{A \cdot B}{\lVert A \rVert\, \lVert B \rVert} \tag{2}$$
Here, $A$ and $B$ are two non-zero vectors and $\cos(\theta)$ is their cosine similarity.
We introduce ResNet50 as the third neural network model for the baseline system and use the same cosine similarity as the ECAPA-TDNN model for scoring.
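The cosine scoring used for the ECAPA-TDNN and ResNet50 embeddings can be sketched in a few lines of Python. The function name `cosine_similarity` and the toy vectors are illustrative assumptions; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy embeddings: parallel, orthogonal, and opposite directions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Because the score depends only on the angle between the embeddings, it is insensitive to their length; a verification decision is made by comparing the score with a threshold.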

C. METRICS
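The equal error rate (EER) is a commonly used metric for evaluating the performance of singer recognition systems. It is defined as the operating point at which the false acceptance rate and the false rejection rate are equal. In addition to EER, the minimum detection cost function (minDCF) is used as a secondary metric for comparing the decision thresholds of speaker recognition systems. The detection cost is given by Equation (3):
$$C_{det} = C_{MISS} \times P_{MISS} \times P_{target} + C_{FalseAlarm} \times P_{FalseAlarm} \times (1 - P_{target}) \tag{3}$$
Here, $P_{MISS}$ and $P_{FalseAlarm}$ are the miss rate and the false alarm rate at a given threshold, and minDCF is the minimum of $C_{det}$ over all thresholds. The minDCF is computed at a prior probability of 0.01 for the specified target speaker ($P_{target}$), with the cost of a missed detection ($C_{MISS}$) and the cost of a false alarm ($C_{FalseAlarm}$) both set to 1.0.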
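As a rough illustration of how both metrics follow from a list of trial scores, the sketch below sweeps every observed score as a threshold. It is an assumption-laden example rather than the evaluation code used in this work: the function names are hypothetical, a trial is accepted when its score is at or above the threshold, the EER is approximated at the threshold where the two error rates are closest, and the detection cost is returned without the normalization that some toolkits apply.

```python
def error_rates(target_scores, nontarget_scores):
    """Return (threshold, miss_rate, false_alarm_rate) for each candidate threshold."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    thresholds.append(thresholds[-1] + 1.0)  # extra point that rejects every trial
    points = []
    for t in thresholds:
        miss = sum(s < t for s in target_scores) / len(target_scores)
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        points.append((t, miss, false_alarm))
    return points


def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: average of the two rates where they are closest."""
    points = error_rates(target_scores, nontarget_scores)
    _, miss, false_alarm = min(points, key=lambda p: abs(p[1] - p[2]))
    return (miss + false_alarm) / 2.0


def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum (unnormalized) detection cost over all thresholds."""
    return min(
        c_miss * miss * p_target + c_fa * false_alarm * (1.0 - p_target)
        for _, miss, false_alarm in error_rates(target_scores, nontarget_scores)
    )


# Toy scores: one same-singer trial scores below two different-singer trials.
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer = equal_error_rate(targets, nontargets)
dcf = min_dcf(targets, nontargets)
```

With a target prior of 0.01, false alarms dominate the cost, so the cost-minimizing threshold sits higher than the EER threshold in this toy example.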

A. EXPERIMENTAL SETUP

1) DATASET PARTITIONING
In each experiment described in this work, the entire dataset is randomly divided into a training set and an evaluation set with a 9:1 ratio. The evaluation set consists of an enrollment set and a test set: each audio file from each singer in the evaluation set becomes the enrollment data in turn, and the remaining files are used as test audio.
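One possible reading of this rotating enrollment scheme is sketched below in Python. The function name `build_trials`, the dictionary layout, and the file names are hypothetical, and pairing each enrollment file with every other evaluation file (from the same singer and from other singers) is an assumption about how the trial list is formed.

```python
def build_trials(eval_files):
    """Build (enroll_file, test_file, is_same_singer) trials.

    eval_files maps a singer ID to that singer's evaluation files.
    Every file is used once as enrollment against all remaining files.
    """
    all_files = [
        (singer, name) for singer, names in eval_files.items() for name in names
    ]
    trials = []
    for enroll_singer, enroll_file in all_files:
        for test_singer, test_file in all_files:
            if (enroll_singer, enroll_file) == (test_singer, test_file):
                continue  # a file is never tested against itself
            trials.append((enroll_file, test_file, enroll_singer == test_singer))
    return trials


# Hypothetical evaluation set with two singers.
example_trials = build_trials({"singer_a": ["a1.wav", "a2.wav"], "singer_b": ["b1.wav"]})
target_trials = [t for t in example_trials if t[2]]
```

Same-singer pairs act as target trials and cross-singer pairs as non-target trials when the error rates are computed.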
The training set consists of the speaking training set, the singing training set, and the overall training set.Similarly, the evaluation data is separated into a speech evaluation set, a singing evaluation set, and a speech plus singing evaluation set.

2) TRAINING SETUP
During the training phase, we use a random sampling strategy in which 3-second segments are chosen from the audio files, with their starting times selected on the fly. The ECAPA-TDNN models in this study are trained using the Additive Angular Margin (AAM) loss [29], and the x-vector models are trained using the Negative Log Likelihood (NLL) loss [30].
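The on-the-fly segment sampling can be sketched as follows. The function name `random_segment` is hypothetical, the 16 kHz rate is taken from the preprocessing description, and returning the whole file when it is shorter than 3 seconds is an assumption; the source does not say how short files are handled.

```python
import random


def random_segment(num_samples, sample_rate=16000, seconds=3.0, rng=random):
    """Pick the [start, end) sample range of a random fixed-length crop."""
    segment_length = int(round(seconds * sample_rate))
    if num_samples <= segment_length:
        # Assumed fallback: use the whole file when it is too short.
        return 0, num_samples
    start = rng.randrange(0, num_samples - segment_length + 1)
    return start, start + segment_length


# A 10-second file at 16 kHz; a seeded generator keeps the example repeatable.
generator = random.Random(0)
start, end = random_segment(160000, rng=generator)
short_file = random_segment(1000)
```

Drawing a new start position each time a file is visited means the network sees a different 3-second window of the same recording in different epochs.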
The input features for the x-vector models consist of 24-dimensional filterbanks with a frame length of 25ms, which are mean-normalized over a sliding window of up to 3 seconds.The input features for the ResNet50 model and the ECAPA-TDNN model are 80-dimensional filterbanks from a 25ms window.
The training set is divided into 90% and 10% for training and validation purposes.The validation set is randomly chosen from the training set.As in the test set, it is possible for different performers to sing the same song.To optimize the ECAPA-TDNN models, we utilized the Adam optimizer [31] with a learning rate of 0.0001 and a weight decay of 0.000002.If the validation loss does not change for two epochs, the learning rate is reduced by a factor of 0.3.
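The learning rate rule can be made concrete with a small framework-free sketch. The class name `PlateauSchedule` is hypothetical, and two readings of the text are assumptions: "does not change" is treated as "does not improve", and "reduced by a factor of 0.3" is treated as multiplying the current rate by 0.3.

```python
class PlateauSchedule:
    """Multiply the learning rate by `factor` after `patience` stale epochs."""

    def __init__(self, lr=1e-4, factor=0.3, patience=2):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.stale_epochs = 0

    def step(self, validation_loss):
        """Record one epoch's validation loss and return the updated rate."""
        if validation_loss < self.best:
            self.best = validation_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr


# Validation loss stalls at 0.9 for two epochs, which triggers one reduction.
schedule = PlateauSchedule()
rates = [schedule.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.8]]
```

In practice the same behavior is usually obtained from the plateau scheduler of the training framework in use; the sketch only spells out the rule.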

3) DATA AUGMENTATION
In this study, we also investigated suitable Data Augmentation (DA) strategies, i.e., the creation of moderately changed new data obtained from the original. With DA, neural networks see more varied training examples and can improve performance without overfitting. In addition to the training and evaluation datasets mentioned above, we use the MUSAN and RIRs datasets for noise augmentation. The former contains three types of noise, and the latter contains reverberation data recorded under several different conditions. We used SpeechBrain's augmentation module to add room impulse responses (RIRs) and noise, and to resample the audio at a slightly different rate to alter its speed. Models trained without DA required 100 epochs of training, while models trained with DA required 150.

4) ADAPTIVE SCORE NORMALIZATION
Adaptive score normalization computes the normalization mean and variance from utterances selected from an impostor cohort. With adaptive normalization, each verification pair may use a different impostor cohort. The cohort is selected according to specific rules, often by taking the utterances that score highest against the enrollment utterance or the test utterance. We applied adaptive normalization to the test scores of the ECAPA-TDNN model, which provided some improvement in the singer identification results.
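A common form of adaptive score normalization, the symmetric variant often called AS-norm, is sketched below. Whether this exact variant was used here is an assumption; the function name `adaptive_s_norm`, the cohort size `top_n`, and the toy scores are illustrative only.

```python
import statistics


def adaptive_s_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n=2):
    """Normalize a trial score with statistics of the closest cohort utterances.

    enroll_cohort_scores: scores of the enrollment utterance against the cohort.
    test_cohort_scores: scores of the test utterance against the cohort.
    """

    def top_stats(scores):
        # Keep only the top_n highest-scoring cohort utterances.
        top = sorted(scores, reverse=True)[:top_n]
        return statistics.mean(top), statistics.pstdev(top)

    mean_e, std_e = top_stats(enroll_cohort_scores)
    mean_t, std_t = top_stats(test_cohort_scores)
    eps = 1e-8  # guards against a zero standard deviation
    enroll_side = (raw_score - mean_e) / (std_e + eps)
    test_side = (raw_score - mean_t) / (std_t + eps)
    return 0.5 * (enroll_side + test_side)


# Toy cohort scores for one verification pair.
normalized = adaptive_s_norm(
    0.8,
    enroll_cohort_scores=[0.1, 0.3, 0.5, -0.2],
    test_cohort_scores=[0.2, 0.4, 0.0, 0.6],
)
```

Real systems typically use cohorts of several hundred utterances; the value of `top_n` here is kept small only so the arithmetic is easy to follow.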

5) SINGLE OR MULTIPLE SPEAKING STYLES EXPERIMENTS
Both human and machine recognition performance degrades when the audio being evaluated is in a modality unfamiliar to the evaluator. Most previous speaker recognition systems use homogeneous data for experiments. We experimentally investigate the effect of cross-modal training and test data on the performance and generalization ability of speaker recognition systems. In this paper, experiments were conducted on data with a single speaking style and with multiple speaking styles to verify the effect of adding a modality on speaker recognition. We found that the a cappella solo singing data performs better in the cross-modal experiments and helps the singer recognition system generalize better because of its more comprehensive range and variable timbre.

1) EXPERIMENTS ON SINGLE AUDIO MODALITIES OF EXPRESSION
The experimental results of the X-vector-PLDA, ECAPA-TDNN, and ResNet50 models when homogeneous data are used for training and evaluation are shown in Table 4.
It is observed that a cappella solo singing data presents a more comprehensive representation of the singer due to its wider vocal range and more diverse features.When trained with a cappella solo singing data, the robustness of the models was significantly improved compared to using speech data, with the equal error rate of the x-vector-PLDA model decreasing to 0.4718%.Additionally, the recognition performance of the ECAPA-TDNN model was also satisfactory, correctly identifying the singing evaluation set.Upon comparison of the three models, it appears that the ECAPA-TDNN model exhibits a greater advantage on this dataset.

2) EXPERIMENTS ON MULTIPLE AUDIO MODALITIES OF EXPRESSION
The experimental results obtained when using different audio modalities of expression data for training and testing are presented in Table 5. These results demonstrate that the ECAPA-TDNN model trained on a cappella singing data exhibits a stronger transfer ability and robustness across the different speaking styles of the test data, achieving an equal error rate of 1.4723%. The use of all available training data led to a noteworthy enhancement in performance on the cross-modal test set, as evidenced by an equal error rate (EER) of 1.1659% for the X-vector-PLDA model and an EER of 1.0496% for the ECAPA-TDNN model. The results of the multiple speaking styles experiments indicate that the models trained with a cappella solo singing data exhibit superior generalization ability. According to our findings, models trained on clean singing data exhibit superior generalization performance when evaluated on speech data. The X-Vector-PLDA model, ECAPA-TDNN model, and ResNet50 model have equal error rates of 2.3418%, 1.1360%, and 0.4435%, respectively. In contrast, the models trained on speech data do not perform well when evaluated on singing data. The ResNet50 model performs comparably to ECAPA-TDNN in terms of total performance, and both exhibit strong performance.

3) DATA AUGMENT EXPERIMENTS
The results of our experiments revealed the effectiveness of using data augmentation in the X-vector-PLDA model. In the singer verification experiments reported in Table 6, we observed that the X-vector-PLDA models performed better with data augmentation. However, the use of data augmentation did not improve the performance of the ECAPA-TDNN and ResNet50 models.

4) X-VECTOR MODEL WITH PLDA OR COSINE SIMILARITY FOR SCORING
Multiple scoring methods were utilized for the X-Vector model, as demonstrated in Table 7. The experimental results show that scoring after training a PLDA model has better robustness than directly calculating cosine similarity.

5) THE EFFECT OF AUDIO LENGTH
We also explored the effect of different audio lengths on the experimental results, which are shown in Table 8. During the evaluation of 3-second audio segments from our test set, the x-vector-PLDA model and the ECAPA-TDNN model demonstrated equal error rates (EER) of 7.6483% and 9.8382%, respectively, on the speech evaluation set. When the audio segment length was increased to 5 seconds, the EERs of the two models on the speech evaluation set were 4.4893% and 3.8105%, respectively. The results indicated that longer audio segments resulted in better performance. When the audio length reached 10 seconds, both models exhibited significant improvements in all test results. On the other hand, the song evaluation set did not perform as well as the speech evaluation set when the test audio length was shorter. However, as the test audio length increased, the singing data experimental results improved. In comparison, the x-vector-PLDA model outperformed the ECAPA-TDNN model on the speech evaluation set.
These findings suggest that longer audio samples tend to have richer and more varied audio features, leading to improved experimental results in singer recognition tasks.

VI. CONCLUSION
We present the first audio dataset specifically focusing on a cappella solo singing and speech. Vocal92 consists of both singing and speech by 92 singers and represents a significant advancement in the field, filling a gap in the availability of audio datasets with multiple speaking styles for singer recognition. The experimental results demonstrate that singer recognition models trained on singing data exhibit a stronger transfer ability and robustness on the cross-modal test data. These findings suggest that singing data may contain a wider range of features, such as timbre and pitch, which contribute to better model performance. The Vocal92 dataset will also be a valuable resource for music information retrieval, singer recognition, and speaker recognition.

FIGURE 1. The distribution of singers' gender and age in the Vocal92 dataset.

FIGURE 2. The distribution of audio length in the Vocal92 dataset.

FIGURE 3. The narrowband spectrograms of four audio files.

• Training set: Some singers with at least ten audio samples constitute the training set (83 subjects). The training set consists of the speech training set, the singing training set, and the overall training set. This set is reserved for training singer recognition models.
ResNet50, a 50-layer convolutional neural network architecture, was first introduced by Microsoft Research in 2015. The design of this network was motivated by the vanishing gradient problem, which affects the effectiveness of very deep neural networks in image recognition tasks. ResNet50 addresses this issue by incorporating residual connections that allow the network to bypass certain layers, thereby mitigating the vanishing gradient problem. A residual connection lets the input of a block skip one or more layers and be added directly to the output of that block, facilitating effective training of very deep networks. The architecture includes 1×1, 3×3, and 7×7 convolutional layers, as well as maximum pooling and average pooling layers. It also incorporates batch normalization and ReLU activation functions to further enhance performance. These design elements contribute to ResNet50's high performance in various image recognition tasks, including object detection, image classification, and semantic segmentation.

TABLE 4. Experimental results for single speaking styles. (Each training audio is divided into 3-second segments, and the full audio of the enrollment set and the test set is used as input.)

TABLE 8. The experimental results of different audio lengths for the models trained without data augmentation. (All training audios are divided into 3-second segments.)

TABLE 1. A list of music datasets compared to Vocal92.

TABLE 2. Dataset statistics of the Vocal92 dataset.

TABLE 5. Results of multiple speaking styles experiments (the training audio was divided into 3-second segments, and the whole audio was used as input for both enrollment and test audio).
The ResNet50 model shows good performance when the training, enrollment, and test data are not in the same audio expression, especially when the training and test data are from different sources.

TABLE 6. Experimental results of data augmentation.

TABLE 7. Experimental results of the X-Vector model using different back-ends for scoring.