Music Deep Learning: Deep Learning Methods for Music Signal Processing—A Review of the State-of-the-Art

The discipline of Deep Learning has been recognized for its strong computational tools, which have been extensively used in data and signal processing, with innumerable promising results. Among the many commercial applications of Deep Learning, Music Signal Processing has received an increasing amount of attention over the last decade. This work reviews the most recent developments of Deep Learning in Music signal processing. Two main applications that are discussed are Music Information Retrieval, which spans a plethora of applications, and Music Generation, which can fit a range of musical styles. After a review of both topics, several emerging directions are identified for future research.


I. INTRODUCTION
A. DEEP LEARNING IN MUSIC SIGNAL PROCESSING
Deep Learning (DL) [1], a sub-field of Machine Learning (ML), has been established as a strong computational toolbox, with applications in numerous tasks, like feature extraction, classification, and pattern recognition. Such functionalities enable the extraction of meaningful information from raw data, and thus find applications in a wide range of disciplines, including computer vision (CV) [2], natural language processing (NLP) [3], bioinformatics [4], medical diagnosis [5], speech recognition [6], image processing (IP) [7], system identification [8], recommendation systems [9], and more [10].
The associate editor coordinating the review of this manuscript and approving it for publication was Pasquale De Meo.
A research field where DL has emerged as a valuable tool over the last decade is that of audio signal processing (ASP) [11] and music signal processing (MSP) [12]. Music is an art form that plays a central part in many recreational and educational human activities. As a result, the music industry includes a wide range of organizations and consumers. The application of DL tools in MSP has led to a collection of successful commercial applications, the most famous of which is Music Recommendation Systems (MRS) [13]. As shown in Fig. 1, the number of publications indexed in Scopus under the keywords "deep," "learning," and "music" demonstrates the applicability of DL in music processing. From 2014 to 2021, there are 638 publications, with the number increasing sharply each year. This shows that scientists are becoming more interested in this field. The diversity of the field is also made apparent when looking at the subject area categorization of these works, with 567 being listed in Computer Science, 296 in Engineering, 136 in Mathematics, 74 in Physics and Astronomy, 63 in Decision Sciences, 51 in Arts and Humanities, and the rest covering disciplines such as Materials Science, Medicine, Social Sciences, Energy and more.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The broad field of DL in music-related applications could be termed Music Deep Learning (MDL) and can be divided into two categories, Music Information Retrieval (MIR) [11] and Music Generation (MG) [14]. MIR refers to the extraction of characterizing information from music data. Such information can then be exploited for a wide range of applications, such as genre classification [15], [16], music recommendation [17], [18], music source separation [19], singing voice detection [20], instrument recognition [21], music emotion recognition [22] and transcription [23]. All of the above applications aid in the digital preservation of music, by constructing and managing song databases, as well as the study of different music genres.
MG, under the framework of DL, broadly refers to the automatic generation of music content. This task is performed by first extracting valuable information from music databases using MIR techniques, and then building DL architectures to generate original music content. This has several commercial applications, like movie and game score generation. The automatic generation of music content has sparked discussions on whether this new way to create art will eventually replace musicians. However, the more realistic projection for the future is that MG can serve as a valuable tool to musicians and educators alike, to explore new approaches to composition and teaching [24].

B. RELATED SURVEYS
There have been some reviews of the results so far in MDL. In [11], a review of the (at the time) current DL techniques for ASP is provided. Three types of audio are considered, namely speech, music, and environmental sounds, with applications like audio recognition, synthesis, and transformation. Several reviews have also considered specific applications of MDL. In [21], a tutorial on MIR is provided that is especially useful to newcomers in the field. MRS are reviewed in [13], [25]. Music Genre Classification (MGC) is reviewed in [26]. Drum transcription is reviewed in [27], focusing on non-negative matrix factorization and recurrent neural network architectures. In [28], a review of the audio signal representations for use with CNNs is given. A review of DL for speech recognition is available in [6], though the focus is not on music signals. For singing information processing, [29] reviews several aspects, like singing skill evaluation, singing voice synthesis, singing voice separation, lyric synchronization, and transcription. Specifically for singing voice detection, the review in [20] investigates the traditional and deep learning techniques available. DL for music emotion recognition is reviewed in [22].
For MG, the extensive survey in [14] offers an in-depth analysis, covering five key aspects of MG: the objective, representation, architecture, challenge, and strategy. The work [30] provides a systematic review of AI techniques in MG with valuable information regarding publications, citations, geographical distribution, and more. A review of the composition tasks for various music generation levels is provided in [31]. Finally, [32] discusses the challenges and limitations of MG. These include, for example, the designer's creative limitations, the lack of structure, the extent of control the designer has over the generated music features, and the lack of direct user interaction. Moreover, it discusses how to address these issues.

C. MOTIVATION
From the above, it is clear that different aspects of DL in MSP have been surveyed, with many reviews being dedicated to focused topics, thus providing highly detailed insights into them. In this work, DL for both MIR and MG is discussed; to the authors' knowledge, this is the first time the two are discussed together. The purpose of this work is to provide a more well-rounded overview of the current research in this field, which could serve as a guide for identifying new research trends.
To that end, after a review of recent results on both MIR and MG, a section is dedicated to identifying future directions on MDL. Specifically, four research directions are identified, all of which can yield fruitful results in MDL. An earlier version of this study was presented in [33]. The current work extends [33] by expanding upon the literature review, and the discussion on future topics of interest.
The main contributions of this work are summarized as follows:
1) To complement previous surveys, emphasis is given to works published in 2020 or later. In this way, the evolution of MDL into a mature field is presented.
2) To the best of the authors' knowledge, this is the first time that MIR techniques and MG processes are reviewed together, highlighting the interconnection between the two research directions.
3) Attention is given to four areas, which are identified as emerging research topics. These areas are hybrid architectures, DL in traditional music genres, MDL in medical applications, and DL for music generated from dynamic systems.
The rest of the work is outlined as follows: In Section II, the DL methods for Music Information Retrieval are presented. In Section III, the field of DL-based Music Generation is discussed. Section IV identifies future research directions. Finally, Section V concludes the work. For a list of Abbreviations, see Appendix A.

II. DL METHODS FOR MIR
In this section, the application of DL for different MIR applications is reviewed. The section is divided into subsections based on the DL architecture used, and the different applications are discussed in each subsection. Table 1 summarizes all the reviewed works in MIR, organized by architecture.
First, a short description is provided of the various applications of MIR:
1) Music Recommendation Systems (MRS): MRS is the most fundamental application of MDL. Its goal is to successfully recommend new music tracks to users based on their previous listening history. For new users with no prior information, the problem is termed "cold-start MR."
2) Music classification: The goal is to identify the musical genre of a song, which is of fundamental importance in MRS. A more general goal is to distinguish music from other audio tracks, like speech, natural sounds, etc.
3) Emotion classification and prediction: The goal is to identify the underlying emotions that can be triggered by a song. This is again useful in MRS and music therapy.
4) Instrument/voice identification: The goal is to identify and separate the different instruments used to compose a music track. This also applies to detecting singing voices.
Several objective measures can be used to evaluate MIR architectures. These include accuracy, precision, recall, F1 score, mean absolute and squared error, Area Under the Receiver Operating Characteristic Curve (ROC-AUC), and more. In the following, a note is made for each work on the accuracy achieved, or the ROC-AUC score, when provided. The reader should refer to each work for an extensive presentation of the evaluation analysis. Information on the dataset used for training and validation is also provided, for works that used public datasets.
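To make the evaluation measures listed above concrete, the following is a minimal sketch of how accuracy, precision, recall, F1 score, and ROC-AUC can be computed for a binary task. It is not taken from any of the reviewed works; the function names and the toy labels are illustrative assumptions, and ROC-AUC is computed through its pairwise-ranking interpretation rather than by integrating the curve.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_scores(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 score from hard predictions."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive example is
    scored above a random negative one (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy example: 1 = "contains singing voice", 0 = "instrumental only".
    print(classification_scores([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

Accuracy alone can be misleading on imbalanced datasets, which is one reason several of the reviewed works report ROC-AUC instead.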

A. FULLY CONNECTED DEEP NEURAL NETWORKS (FCDNN)
FCDNNs refer to the most basic type of deep neural network, where multiple hidden layers are applied and all nodes between consecutive layers are connected, as shown in Fig. 2.
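As an illustration of the fully connected structure described above, the following minimal sketch computes a forward pass through a stack of dense layers. The choice of ReLU for hidden layers and softmax for the output is an assumption made for the example and is not prescribed by the reviewed works; all names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights, bias) of one dense layer


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """One fully connected layer: every output node sees every input node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def softmax(v: Vector) -> Vector:
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x: Vector, layers: List[Layer]) -> Vector:
    """ReLU after each hidden layer, softmax after the last layer."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(h, weights, bias))


if __name__ == "__main__":
    # Two input features, one hidden layer with two nodes, two output classes.
    layers = [
        ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
        ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ]
    print(forward([2.0, 1.0], layers))
```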
For MR, an FCDNN-based architecture termed HitMusicNet was presented in [18] for predicting the popularity of a music recording, using inputs that incorporate text, audio, and meta-data. The authors also construct a database termed the SpotGenTrack Popularity Dataset (SPD), which unifies information from the Spotify and Genius music and lyric databases. The meta-data considered were the number of an artist's followers, an artist's popularity, and market availability. The resulting system can reach an 83% precision score. In [34], an FCDNN was used for MR, combining content-based and collaborative filtering in its input. The dataset used was the Spotify Recsys Challenge 2018 million playlist dataset [35], reaching an 88% precision score.
For emotion classification, in [36] classification was performed on the Music4All dataset [37], using valence, danceability, and energy as features. The classification is binary, with happy/sad classes. The model has a mean accuracy of 98.3%.

B. RECURRENT NEURAL NETWORKS (RNN)
RNNs are a class of neural networks used for processing sequential data [10], and are thus suitable for time series input signals. In contrast to the FCDNN architectures, RNNs contain loops or cycles. RNNs also possess an internal memory state that is utilized to process long sequences. There are many variants of such architectures, including Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), bidirectional RNNs, Hopfield networks, etc. [10]. A simple RNN structure is shown in Fig. 3.
In [38], a tagging system is developed using an RNN. A scattering transform is used to extract features from the data. The MagnaTagATune dataset [39] is used. The resulting architecture achieves an average AUC-ROC score of 0.909. In [40], a web application was developed that can take as input any YouTube video song and classify its music genre, using four different architectures. The classification is performed for individual 10-second segments of the input track. The results are visualized in a graph. The music genre samples from the Audioset database [41] are used for training. The supporting website, being highly visual, can offer great help to music composers and students, and also has the potential to be used for user feedback.
For emotion classification tasks, in [42] an RNN is proposed that uses a two-note melody trend as a music feature. Five emotion classes were considered: aggressive, bittersweet, happy, humorous, and passionate. Data files from YouTube were used, and the accuracy is up to 75.4%. In [43], emotion recognition is performed on classes of instruments. Four instrument classes are considered: string, percussion, woodwind, and brass, and four emotion classes are considered: happy, sad, neutral, and fear. The study shows that the system recognizes more specific instrument-emotional pairings.
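The recurrent update with an internal memory state, described at the start of this subsection, can be illustrated with a minimal sketch of a simple (Elman-type) RNN. The tanh nonlinearity and all names below are assumptions made for illustration; individual works may use different cells.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_step(x: Vector, h_prev: Vector, W: Matrix, U: Matrix, b: Vector) -> Vector:
    """h_k = tanh(W x_k + U h_{k-1} + b): the new state mixes the current
    input with the previous state, which is where the 'memory' comes from."""
    wx = matvec(W, x)
    uh = matvec(U, h_prev)
    return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]


def run_rnn(sequence: List[Vector], W: Matrix, U: Matrix, b: Vector) -> Vector:
    """Process a whole sequence and return the final hidden state."""
    h = [0.0] * len(b)  # zero initial state
    for x in sequence:
        h = rnn_step(x, h, W, U, b)
    return h


if __name__ == "__main__":
    W, U, b = [[1.0]], [[0.5]], [0.0]
    # The same two inputs in a different order give different final states.
    print(run_rnn([[1.0], [0.0]], W, U, b))
    print(run_rnn([[0.0], [1.0]], W, U, b))
```

Because the final state depends on the order of the inputs, the same set of frames presented in a different order generally yields a different output, which is what makes such networks suitable for time series.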
RNNs have also been employed for music recommendation. In [44], an RNN architecture was used, and the study showed that song order does not significantly affect the quality of playlist recommendations. The AotM-2011 [45] and 8tracks [46] playlist datasets were used.
For singing voice separation, in [47], a curriculum learning approach was considered, where the learning begins with easy examples and the difficulty is steadily increased. Three different databases were tested: MIR-1K [48], ccMixter [49], and MUSDB18 [50], with the model yielding improved performance with respect to the global normalized source to distortion ratio measure.
A piano harmony automatic arrangement architecture is proposed in [51]. The model performs three tasks, note detection, multibasic frequency estimation, and training. Apart from objective evaluation, the resulting tracks were evaluated by human listeners and were positively received.
For music classification, the Attention Mechanism (AM) has proven to be a strong technique for improving performance and is adopted in many architectures. An RNN with an attention mechanism is used with MIDI-formatted input in [52]. Five classes are considered: classical, country, dance music, folk, and metal. The accuracy achieved is 90.1%.

C. LONG SHORT-TERM MEMORY (LSTM)
Long Short-Term Memory networks (LSTM) [53] constitute a special case of RNNs, which have found applications in MIR. An LSTM unit is shown in Fig. 4.
An LSTM network can be mathematically represented as follows. For a given input vector $u_k$ at time step $k$ and $N_h$ hidden units, the activation vector of the forget gate is $f_k \in (0, 1)^{N_h}$,

$$f_k = \sigma(W_f u_k + U_f q_{k-1} + b_f),$$

where $W_f$ and $U_f$ are weight matrices, $q_k \in (-1, 1)^{N_h}$ is the vector representing the hidden state, and $b_f$ is the bias vector. In addition, the activation vectors for the input/update gate $I_k \in (0, 1)^{N_h}$ and the output gate $O_k \in (0, 1)^{N_h}$ are represented similarly as

$$I_k = \sigma(W_I u_k + U_I q_{k-1} + b_I),$$
$$O_k = \sigma(W_O u_k + U_O q_{k-1} + b_O),$$

where the subscripts $I$ and $O$ represent input and output, respectively, whereas the rest of the symbols have the same meaning as before.
An LSTM unit also contains a cell input activation vector denoted by $C_k \in (-1, 1)^{N_h}$, expressed as

$$C_k = \tanh(W_C u_k + U_C q_{k-1} + b_C).$$

The cell state vector $S_k$ and the hidden state vector $q_k$ are then updated by combining the preceding equations,

$$S_k = f_k \bullet S_{k-1} + I_k \bullet C_k,$$
$$q_k = O_k \bullet \tanh(S_k),$$

where $\bullet$ is the Hadamard product and $S_0 = 0$ and $q_0 = 0$. Finally, $\sigma$ denotes the logistic sigmoid function and $\tanh$ the hyperbolic tangent.
For music classification, a model is proposed in [54], where the segment features are the statistics of frame features in each segment. The ISMIR database [55] is used, which includes a collection of songs from different genres. The model achieves an accuracy of 89.71%. In [56], a complex architecture is used, combining a Bidirectional Long Short-Term Memory (BLSTM) model with an attention mechanism, paired with a Graphical Convolutional Network. Three datasets are tested: GTZAN [57], ISMIR [55] and MagnaTagATune [58]. An accuracy of 93.51% is achieved.
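As a companion to the mathematical description of the LSTM unit at the start of this subsection, the following is a minimal sketch of a single LSTM time step. It assumes the standard LSTM formulation (sigmoid gates, tanh cell input, Hadamard-product state update) and reuses the document's symbols: u for the input, q for the hidden state, S for the cell state, and f, I, O, C for the gates and cell input. The parameter layout and function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Gate = Tuple[Matrix, Matrix, Vector]  # (W, U, b) of one gate


def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))


def pre_activation(gate: Gate, u: Vector, q_prev: Vector) -> Vector:
    """Compute W u_k + U q_{k-1} + b for one gate."""
    W, U, b = gate
    return [
        sum(w * x for w, x in zip(w_row, u))
        + sum(r * h for r, h in zip(u_row, q_prev))
        + bias
        for w_row, u_row, bias in zip(W, U, b)
    ]


def lstm_step(u: Vector, q_prev: Vector, s_prev: Vector,
              params: Dict[str, Gate]) -> Tuple[Vector, Vector]:
    """One time step; returns the new hidden state q_k and cell state S_k."""
    f = [sigmoid(a) for a in pre_activation(params["f"], u, q_prev)]    # forget gate
    i = [sigmoid(a) for a in pre_activation(params["I"], u, q_prev)]    # input gate
    o = [sigmoid(a) for a in pre_activation(params["O"], u, q_prev)]    # output gate
    c = [math.tanh(a) for a in pre_activation(params["C"], u, q_prev)]  # cell input
    # Element-wise (Hadamard) products combine the old state and the new input.
    s = [fk * sp + ik * ck for fk, sp, ik, ck in zip(f, s_prev, i, c)]
    q = [ok * math.tanh(sk) for ok, sk in zip(o, s)]
    return q, s


if __name__ == "__main__":
    zero_gate: Gate = ([[0.0]], [[0.0]], [0.0])
    params = {name: zero_gate for name in ("f", "I", "O", "C")}
    # With all-zero parameters every gate equals 0.5 and the cell input is 0,
    # so the cell state is simply halved at each step.
    print(lstm_step([1.0], [0.0], [2.0], params))
```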
For emotion prediction, in [59] the valence-arousal (V-A) emotion model was used to represent the dynamic emotion, using a BLSTM network. The dataset used was taken from the Emotion in Music task in MediaEval 2015 [60].
The problem of music source separation was studied using a BLSTM network for instrument detection and identification in [61]. Data augmentation was used during the training to avoid overfitting. To improve performance, the BLSTM network is combined with a feed-forward neural network, which outperforms both individual networks. The SiSEC DSD100 dataset [62] is used.
For MR, an architecture was developed in [63] that analyzes the connection between dance moves and music to recommend tracks. The database used is [64], which includes samples of synchronized dance and music. The dataset contains four classes of dance, waltz, tango, cha-cha, and rumba. The accuracy can reach up to 91.3%.
For singing voice detection, in [65] a Long-Term Recurrent Convolutional Network (LRCN) was considered for electronic music. The architecture consists of a voice separation step and a feature extraction step. The CNN layer extracts the audio features, and the LSTM layer uses the CNN output to differentiate between the singing and non-singing parts. The Arcadium [66] and NCS [67] were used as sources to create "Electrobyte," a new copyright-free electronic music dataset. The model was also tested on the pop dataset Jamendo [68], yielding an accuracy score of 0.833 (Electrobyte) and 0.939 (Jamendo). In [69], an LRCN architecture was developed for vocal separation and temporal smoothing. The CNN layer is again used for feature extraction, and the LSTM learns the time-sequence relationship. The model was tested on five datasets, namely the RWC pop music dataset [70], Jamendo [68], MedleyDB [71], MIR-1K [48], and iKala [72], yielding accuracy as high as 0.992.

D. CONVOLUTIONAL NEURAL NETWORKS (CNN)
CNNs are models that can operate on data with a grid-like structure [10]. This is why they have been successful in problems involving IP, CV, NLP, and other fields [73]. In MIR, CNNs are often used to obtain information from music signals, which are mostly represented as two-dimensional time-frequency data. A deep CNN model utilizes the convolution operation instead of the general matrix multiplication in at least one of its layers. In addition, the architecture consists of fully connected layers and pooling layers. The purpose of the latter is to reduce the size of the incoming data in a computationally efficient manner. In contrast to a fully connected layer, a convolutional layer is characterized by a neuron's receptive field: every unit receives input from only a restricted area of the previous layer. As an activation function, most CNNs in the current literature use either the rectified linear unit (ReLU) function or one of its variants. ReLU is mathematically defined in [10] and can be expressed by $f(x) = \max(0, x)$. A general CNN architecture is depicted in Fig. 5.
In audio classification, an architecture was developed for spatial audio location and classification between speech and music in [74]. Two different microphone arrangements were considered. The classification can achieve an accuracy of up to 97.9%. Although audio location is not unique to music signals, it can be especially useful in MIR applications such as live audio processing. In [75], different CNN architectures are used for the classification of the audio of videos, using a wide class of labels and a large dataset from YouTube, which is termed YouTube-100M. The ROC-AUC can reach up to 0.926. The Audioset [41] is also considered. In [76], a CNN is used for sound representation learning, using sound from an unlabeled video dataset gathered from the Flickr website. To improve its performance, the network is trained by transferring knowledge from networks that recognize images to networks that recognize sounds.
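The three building blocks mentioned above (convolution with a restricted receptive field, the ReLU activation, and pooling) can be sketched on a small time-frequency patch as follows. This is an illustrative sketch only: the 2x2 kernel, the "valid" boundary handling, and the function names are assumptions, and, as in most DL frameworks, the "convolution" is implemented as a cross-correlation without flipping the kernel.

```python
from typing import List

Grid = List[List[float]]  # e.g. a small spectrogram patch (frequency x time)


def conv2d_valid(image: Grid, kernel: Grid) -> Grid:
    """Slide the kernel over the image; each output unit only sees a
    kernel-sized receptive field of the input."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu_map(feature_map: Grid) -> Grid:
    """Apply f(x) = max(0, x) element-wise."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool2d(feature_map: Grid, size: int) -> Grid:
    """Non-overlapping max pooling, which shrinks the feature map."""
    return [
        [
            max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


if __name__ == "__main__":
    patch = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    kernel = [[-1.0, 0.0], [0.0, 1.0]]  # responds to increases along the diagonal
    features = relu_map(conv2d_valid(patch, kernel))
    print(features)
    print(max_pool2d(features, 2))
```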
For music classification, in [15] a CNN is tested on the ISMIR dataset [55], a Latin Music Database (LMD) [77], and an African ethnic database, provided by the Royal Museum of Central-Africa (RMCA) in Belgium [78]. In all cases, the CNN performed either equally well or better than other architectures. In [16], the CNN input consists of eight music features chosen in three music dimensions: dynamics, timbre, and tonality. This outperforms the use of a spectrogram. The GTZAN dataset [57] is used for the experiments, and an accuracy of 91% is reached. In [79], sample-level CNNs were used for auto-tagging using raw waveform data. The term "sample-level" refers to learning representations from very small waveforms, like 2-3 samples. The MagnaTagATune [39] and Million Song Dataset [80] were considered, and an AUC of over 0.905 can be achieved. In [81], a 3D convolutional denoising autoencoder architecture is built for music classification, using MIDI input format. The model outputs latent representations of the data, which are then used to classify the data with a multi-layer perceptron network. The Lakh MIDI dataset [82], [83] was used for testing, with accuracy surpassing 88% and a ROC-AUC of over 0.86.
For sound event recognition, CNNs are used in the early work [84] for note onset detection in audio recordings. The use of a spectrogram as an input to the network instead of the enhanced auto-correlation yields better detection performance. The dataset used is combined from several different sources. In [85], a simple CNN was proposed for event recognition under noise, with only three layers: convolutional, pooling, and softmax. The databases used are the Real World Computing Partnership (RWCP) Sound Scene Database in Real Acoustic Environments [86], and the NOISEX-92 database [87]. The accuracy can reach up to 99%.
For singing voice separation, in [88], a CNN architecture was successfully developed that utilized pixel-wise classification on the spectrogram image. The model is trained using the Ideal Binary Mask as the target label and cross-entropy as the objective function. The iKala database [72] was used, as well as the DSD100 dataset [62], [89].
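The training target described above can be illustrated with a minimal sketch. It assumes the common definition of the Ideal Binary Mask, where a time-frequency bin is labeled 1 when the vocal magnitude exceeds the accompaniment magnitude, and a per-bin binary cross-entropy; the exact formulation in [88] may differ, and all names and values are hypothetical.

```python
import math
from typing import List

Spectrogram = List[List[float]]  # magnitudes, frequency bins x time frames


def ideal_binary_mask(vocal_mag: Spectrogram, accomp_mag: Spectrogram) -> List[List[int]]:
    """Label each time-frequency bin 1 if the vocal dominates, else 0."""
    return [
        [1 if v > a else 0 for v, a in zip(v_row, a_row)]
        for v_row, a_row in zip(vocal_mag, accomp_mag)
    ]


def binary_cross_entropy(mask: List[List[int]], probs: Spectrogram, eps: float = 1e-12) -> float:
    """Mean per-bin cross-entropy between the target mask and predicted probabilities."""
    total, count = 0.0, 0
    for m_row, p_row in zip(mask, probs):
        for m, p in zip(m_row, p_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= m * math.log(p) + (1 - m) * math.log(1.0 - p)
            count += 1
    return total / count


def apply_mask(mixture_mag: Spectrogram, mask: List[List[int]]) -> Spectrogram:
    """Keep only the bins assigned to the vocal."""
    return [[x * m for x, m in zip(x_row, m_row)] for x_row, m_row in zip(mixture_mag, mask)]


if __name__ == "__main__":
    vocal = [[0.9, 0.1], [0.2, 0.8]]
    accomp = [[0.1, 0.7], [0.3, 0.2]]
    mask = ideal_binary_mask(vocal, accomp)
    print(mask)
    print(binary_cross_entropy(mask, [[0.5, 0.5], [0.5, 0.5]]))
    print(apply_mask([[1.0, 0.8], [0.5, 1.0]], mask))
```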
For singing voice evaluation, in [90], a one-dimensional CNN is used, that applies fractional processing node theory for training, which reduces the training time. For the experiment, 100 music major students were selected to provide input. Accuracy can be as high as 86.3%.
For musical instrument identification, a CNN with a simple architecture is used for classification into 11 different classes in [91]. The MedleyDB database is used [71], and the accuracy surpasses 82%. In [92], three different weight-sharing strategies for CNNs are considered: temporal kernels, time-frequency kernels, and a linear combination of time-frequency kernels which are one octave apart. MedleyDB is used [71] for training and testing, with hybrid models having the best overall performance. In [93], a Temporal Convolutional Network was trained on a weakly labeled dataset. The OpenMIC-2018 [94] dataset was used for training and testing, and the MUSDB18 [50] for testing. The model slightly outperforms an LSTM model with respect to the ROC-AUC score, which indicates that it is a strong candidate for such problems. Attention-augmented CNNs are used for instrument identification in [95]. When 25% of the filters are assigned to attention, the resulting CNN outperforms the attention-free ones. The datasets used were the London Philharmonic Orchestra Dataset [96], and the University of Iowa Musical Instrument Samples [97]. Given the consistently positive outcomes, it is reasonable to expect that AM-enhanced NNs will be extensively used for MIR in the future. In [98], identification is performed for four instruments: bass, drums, piano, and guitar. The model architecture consists of four identical, independent sub-models, each catering to one instrument. The Slakh dataset is used [99], and the AUC-ROC measure reached an average of 0.96, with the drums being easier to identify, and the guitar and piano being the more difficult ones.
In [100], a CNN is developed for emotion classification with 18 emotion tags, using time and frequency domain information. The experiments make use of the CAL500 [288] and CAL500exp [101] datasets. In [102], classification is performed specifically for film music, with 9 emotional classes. Each class is also associated with specific colors. The Epidemic Sound Online database [103] was used. The classification is performed using 30-second excerpts of tracks.
In [104], a feature combination CNN architecture for automatic playlist continuation is proposed, with collaborative filtering integrating information from curated playlists as well as song feature vectors. The databases used are Art of the Mix [105] and 8tracks [46]. In [106], distance measuring is used for the classification system, which is then used for the recommendation system. The GTZAN database [57] is used for training, and the Emotify music dataset [107] and Music Audio Benchmark Dataset (MABD) [108] for testing. The designed system can reach a good level of accuracy on the 10-best list. In [109], a CNN architecture is tested using the MIREX database [110], along with the Baidu Music service. The model has a ROC-AUC that can exceed 0.90.
For music transcription, a toolbox termed nnAudio was developed for audio-to-spectrogram conversion using one-dimensional CNNs in [111]. The MusicNet dataset [112] is used for testing. The toolbox can significantly reduce execution time compared to the existing librosa Python library [113].

E. GENERATIVE ADVERSARIAL NETWORKS (GAN)
Although RNNs and CNNs are the most popular MIR architectures, there have been studies that look at alternative networks for MIR. GANs (Fig. 6) were first proposed in the original version of [114]. A GAN consists of two competing agents: a generator and a discriminator. Starting with a training set of real data, the generator is trained to generate new samples that follow the distribution of the real data, while the discriminator must distinguish the real from the artificial samples.
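The competition between the two agents is commonly formalized as a two-player minimax game. The expression below is the standard GAN objective, stated here as a reference sketch; the symbols $G$ (generator), $D$ (discriminator), $p_{\text{data}}$ (distribution of the real data), and $p_z$ (noise prior) are notation introduced for this illustration, and the works reviewed in this subsection may use modified losses.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The discriminator maximizes $V$ by assigning high probability to real samples $x$ and low probability to generated samples $G(z)$, while the generator minimizes $V$ by producing samples that the discriminator accepts as real.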
For emotion classification, a GAN is proposed in [115] that utilizes a double-channel fusion strategy to extract local and global features of an input voice or image. There are five emotion classes considered: sad, happy, quiet, lonely, and miss. The information used in the experiments comes from a number of websites, such as Kuwo Music Box, Baidu Heartlisten, and others. The recognition rates achieved are between 87.6% and 91.2% for all emotions.
In [116], an architecture combining computer vision and note recognition is proposed for music notation recognition. The experiments make use of several datasets, including the JSB Chorales [117], Maestro [118], Video Game [119], Lakh MIDI [82], [83], and another MIDI dataset. The recognition accuracy ranges from 0.88 to 0.92 for all the datasets. The proposed model's intended application is music education.
For singing voice separation, in [120], a GAN with a time-frequency masking function is used. The databases MIR-1K [48], iKala [72], and DSD100 [62], [89] are used in the experiments, and the model outperforms a conventional DNN.

F. CONVOLUTIONAL RNNs (CRNN)
Complementary to standard models, more complex ones have been developed that utilize couplings between different architectures, often in a series interconnection, to combine their characteristics and improve performance. Convolutional RNNs (CRNNs) are one of these examples.
For music classification, a CRNN was considered in [121], which is a CNN network with the last layers replaced by an RNN. The CNN part is used for feature extraction and the RNN part as a temporal summarizer. The Million Song Dataset [80] is used for training, to predict genre, mood, instrument, and era. The model outperforms other architectures with respect to AUC-ROC.
For MR, a CRNN is used in [122] for classifying and recommending music, in the categories of classical, electronic, folk, hip-hop, instrumental, jazz, and rock music. The database used is the Free Music Archive [123]. The system was tested on a group of 30 users, and the best architecture was the one that implemented a cosine similarity, along with information on music genre.
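The cosine similarity mentioned above can be sketched as follows for ranking candidate tracks against a query track. The feature vectors, track names, and the `recommend` helper are hypothetical and only illustrate the similarity measure itself, not the full system of [122].

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query: Sequence[float], catalog: Dict[str, Sequence[float]], top_k: int = 3) -> List[str]:
    """Return the names of the top_k catalog tracks most similar to the query."""
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


if __name__ == "__main__":
    # Hypothetical two-dimensional embeddings, e.g. produced by a trained network.
    catalog = {"track_a": [1.0, 0.0], "track_b": [0.0, 1.0], "track_c": [1.0, 1.0]}
    print(recommend([1.0, 0.1], catalog, top_k=2))
```

Cosine similarity depends only on the direction of the vectors, so two tracks with proportional feature vectors are treated as identical regardless of magnitude.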

G. CNN-LSTM
Similarly to CRNNs, some works combine the architectures of CNNs with LSTMs. For emotion classification, a model in [124], consisting of a 2D input through a CNN-LSTM and a 1D input through a DNN, combines two types of features and improves audio and lyrics classification performance. Four classes are considered: angry, happy, relaxed, and sad. The dataset used is the Last.fm tag subset of the Million Song Dataset [80], with an average accuracy of 78%. In [125], a novel database of Turkish songs is constructed for experimentation. The model uses a CNN as the feature extractor and an LSTM with a DNN as the classifier. An accuracy of over 99% is obtained. In [126], the model extracts features from the lyrics, combining a word vector and a CNN-LSTM architecture, with a word frequency weight vector along with a DNN. The outputs of the two architectures are combined through a matching attention mechanism to derive the text emotion classification. Four classes are considered: happy, sad, healing, and calm. The classification accuracy for all emotions ranges between 0.809 and 0.903.
For music score recognition, the architecture proposed in [127] takes as input an image of a music score and outputs the duration, pitch, and coordinate for each note. Data from Muse Score [128] were used for the experiments, and the model outperforms other architectures with respect to all accuracy measures.
For sound event recognition, [129] considers polyphonic sounds, for a wide family of 61 classes, including music, taken out of a dataset of ten different daily contexts, like a sports game, a bus, a restaurant, and more [130]. The model achieves an average F1 score of around 65%.

H. ARCHITECTURE OVERVIEW
From the above review, it is clear that the "classical" DL models perform well in a variety of MIR tasks. However, the models under consideration need to be appropriately designed, so that they can achieve good results for their set problem. Thus (and in accordance with the no free lunch theorem), there is no architecture that can be considered holistically better than the rest. Instead, complex architectures that incorporate layers of different types are the most promising, since they combine the best characteristics of each DL module, as discussed in Section IV.

III. DL METHODS FOR MUSIC GENERATION
In this section, the application of DL in MG is reviewed. Automatic MG utilizes the MIR techniques mentioned in the previous section to generate novel music scores of desired characteristics, like genre, rhythm, tonality, and underlying emotion. The resulting output can either be a music track in the form of audio, so it can be directly listened to, or it can be in a symbolic notation form. Along with the generation of novel tracks, some tasks can be considered adjacent to MG. One such application is Genre Transfer (GT). This refers to preserving key content characteristics of a music score and applying style characteristics that are typical of a different genre. An example would be transforming a pop song into its heavy metal cover. Another application is Music Inpainting (MI), which refers to filling a missing part of a music track, using information from the rest of its content. Again, the section is divided into subsections based on the DL architecture used. The public databases used in each work are also mentioned. Table 2 summarizes the reviewed works for MG, categorized by their architecture.
The MG architectures can be evaluated both objectively and subjectively. Objective evaluation refers to using mathematical and statistical tools, to measure the similarity of the generated music tracks to the training dataset, as well as other characteristics that can measure their similarity to real music. For objective evaluation, there are several measures, including the loss and accuracy of the training process, the empty bar rate, polyphonicity, note in a scale, qualified note rate, tonal distance, and note length histogram, among others. Most studies consider a subset of these measures or similar ones, so the reader can refer to each work for details.
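Two of the objective measures listed above can be sketched on a toy symbolic representation. The sketch assumes one common reading of the measures: the empty bar rate as the fraction of bars without any note, and the note-in-scale measure as the fraction of notes whose pitch class belongs to a given scale. The bar/time-step data layout, the C major default, and the function names are assumptions for illustration; individual works define these measures with their own conventions.

```python
from typing import FrozenSet, List

# A piece is a list of bars; each bar is a list of time steps;
# each time step is a list of active MIDI pitch numbers.
Bar = List[List[int]]

C_MAJOR: FrozenSet[int] = frozenset({0, 2, 4, 5, 7, 9, 11})  # pitch classes


def empty_bar_rate(bars: List[Bar]) -> float:
    """Fraction of bars in which no note is played at any time step."""
    empty = sum(1 for bar in bars if all(len(step) == 0 for step in bar))
    return empty / len(bars)


def notes_in_scale_ratio(bars: List[Bar], scale: FrozenSet[int] = C_MAJOR) -> float:
    """Fraction of played notes whose pitch class lies in the given scale."""
    pitches = [p for bar in bars for step in bar for p in step]
    if not pitches:
        return 0.0
    return sum(1 for p in pitches if p % 12 in scale) / len(pitches)


if __name__ == "__main__":
    piece = [
        [[60], [62, 64]],  # C4, then D4 + E4
        [[], []],          # empty bar
        [[61], []],        # C#4, outside C major
        [[], []],          # empty bar
    ]
    print(empty_bar_rate(piece))
    print(notes_in_scale_ratio(piece))
```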
For subjective evaluation, a test audience is usually given a collection of DL-generated tracks from different architectures, along with human compositions, and is asked to rate them with respect to different aspects, usually on a five-point Likert scale. Variations of this include comparing pairs of tracks and choosing which one they prefer the most or being asked to decide if a track is computer or human-made. In the following sections, we point out which works have conducted subjective evaluations, as the positive audience perception of AI music tracks is essential for the future applicability of MDL. The reader can again refer to each work for the extensive presentation of the evaluation results.
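The two survey formats above lend themselves to simple summaries. The following sketch, which is only one possible analysis and not the procedure of any reviewed work, computes the mean and sample standard deviation of Likert ratings per system, and an exact two-sided binomial p-value for a human-versus-computer guessing test against the 50% chance level. The function names and the toy ratings are hypothetical.

```python
from math import comb
from statistics import mean, stdev


def likert_summary(ratings):
    """Mean and sample standard deviation of 1-5 ratings for each system."""
    return {name: (mean(r), stdev(r)) for name, r in ratings.items()}


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value against the chance level `p`.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one.
    """
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    return min(1.0, sum(x for x in pmf if x <= observed * (1 + 1e-9)))


# Toy data: four listeners rate two sources on a five-point scale.
summary = likert_summary({"model": [4, 3, 5, 4], "human": [5, 4, 5, 4]})
print(summary)

# Listeners labelled 10 of 20 generated tracks as human-made.
print(binomial_two_sided_p(10, 20))
```

A p-value close to 1 in the guessing test means the listeners' answers are indistinguishable from chance, which is the outcome several of the works below report.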
As a closing note, it is worth mentioning an issue that emerges from the field of AI-based MG, that of copyright [131], [132]. As AI methods use different software and sample databases, legal problems may arise when claiming authorship of the final musical product. It is thus important that legislators update the existing policies, so that such issues do not arise in the future.

A. RNNs
As with MIR, RNNs have proved popular for MG tasks. For works on classical music, the model termed SampleRNN [133] generates one audio sample at a time, with the resulting signals receiving positive evaluation from human listeners. Three different datasets were considered, one containing a female English voice actor, one containing human sounds like breathing, grunts, coughs, etc., and one containing Beethoven's piano sonatas, taken from the Internet Archive [134]. The models were evaluated by a human group, with the samples of the 3-tier model gaining the highest preference. In [117], an RNN model termed DeepBach is designed, for generating hymn-like scores mimicking the style of Bach. The dataset is taken from the music21 library [135]. The model offers some control to the user, allowing the placement of constraints like notes, rhythms, or cadences to the score. The model was evaluated by human listeners of varying expertise, who were given several samples, and had to guess whether each was composed by Bach or computer generated. Around 50% of the time, the computer tracks were passed as real samples, which is a very satisfying result for such complex music. The work was expanded in [136], with an architecture termed Anticipation-RNN, which again offered control to the user to place defined positional constraints. The music21 library [135] was used once again.
In [137], a Graphical User Interface (GUI) system termed BachDuet was developed for promoting classical music improvisation training through user and computer interaction. The JSB chorales data from the music21 dataset [135] is used for training. The GUI was warmly received by test users, who found the improvisation interaction easy to use, enjoyable, and helpful for improving their counterpoint improvisation skills. Additionally, a second group of participants were asked to listen to music clips, rate them, and also decide whether they resulted from a human-machine improvisation using BachDuet, or human-human interaction. Both types of tracks received similar scores, and the listeners were also unable to differentiate between the duets, as they wrongly classified them around 50% of the time.
In [138] the model produces drum rhythms for a seven-piece drum kit. Natural language translation was used to express the hit sequences. An online interface was designed and evaluated by users, who gave an overall average to positive score.
In [139], the effects of different conditioning inputs on the performance of a recurrent monophonic melody generation model are studied. The model was trained on the FolkDB dataset [140] and a novel Bebop Jazz dataset. The validation Negative Log Likelihood loss (NLL) can be as low as 0.190 for the pitch and 0.045 for the duration.
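For reference, the NLL of a categorical prediction (such as a distribution over pitches or durations) is the mean of the negative log-probabilities assigned to the correct classes. A minimal sketch follows; the function name and the toy probabilities are hypothetical and do not reproduce the setup of [139].

```python
import numpy as np


def mean_nll(probs, targets) -> float:
    """Mean negative log-likelihood of the target classes.

    probs:   (n_steps, n_classes) predicted probabilities, rows sum to 1.
    targets: (n_steps,) integer index of the correct class at each step.
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets)
    picked = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log(picked)))


# Two time steps, two classes: the correct classes get 0.5 and 0.75.
print(mean_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1]))
```

Lower values indicate that the model places more probability mass on the notes that actually occur, with 0 reached only for perfectly confident correct predictions.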
In [141], the problem of inpainting was considered. The proposed model combines a VAE, which takes as input the past and future context sequences, with an RNN, which takes as input the latent vectors from the VAE and outputs a latent vector sequence that is passed through a decoder to create the inpainting sequence. A folk dataset from The Session [142] is used for testing. The model outperforms others with respect to the NLL measure. The architecture was also tested by users, who were given pairs of segmented sequences, and had to choose among excerpts that fit. The model performance was on the same level as other architectures.

B. LSTMs
LSTMs have been considered for several scenarios. In [143], data preprocessing has been applied to improve the quality of the generated music, and also reduce training time.
In [144], BLSTM networks are used for chord generation. The database used was Wikifonia (now inactive), which included sheets for several music genres [145]. The user evaluation showed a preference for the BLSTM model over others, although the original music still received the highest score.
In [146], BLSTM is used for chord generation. The model consists of three parts: a chord generator, which uses some starting chords as input, a chord-to-note generator, which generates the melody line from the generated chords, and a music styler, which combines the chords and melody into a final music piece. Multiple music genres were used as a training database, including Nottingham [147], a collection of British and American folk tunes, Wikifonia [145], and the McGill-Billboard Chord Annotations [148]. The model was evaluated by listeners, who gave a score ranging from neutral to positive, taking into consideration harmony, rhythm, and structure.
In [149], a combination of two LSTM models, termed CLSTMS, is used to build chords that can match a given melody. One sub-model is used for the analysis of measure note information, and the other is used for chord transfer information. Wikifonia is used with data taken from [144] and [145].
In [150], a variation of Biaxial LSTM was used, and a model termed DeepJ was developed for MG. The model was tested on three types of music, baroque, classical, and romantic, with test participants being able to successfully categorize the generated samples most of the time. The Piano-MIDI dataset [151] was used. The model is also capable of mixing musical styles by tuning the values of a single input vector.
In [152], a two-stage architecture is proposed that utilizes BLSTM, where the harmony and rhythm templates are first produced, and the melody is then generated and conditioned on these templates. The Wikifonia dataset is used [145]. In the subjective evaluation, participants were given a collection of tracks and were asked to rate them according to how much they found them pleasing and coherent, and whether they believe they were human or AI-generated. The highest scores were achieved by the model where the melody generator is conditioned on an existing chord and rhythm scheme from a real song. This melody is also perceived as human-made by many participants. The authors also noted that there are high standard deviations in all answers, and slightly more so in the models rated positively, indicating that there is a much wider perception of what is considered good-sounding music, than a bad one.
In [153], an architecture combining LSTM with a Recurrent Temporal Restricted Boltzmann Machine is designed. Experiments were conducted on MuseData [154], a classical music dataset, and the JSB chorales [155] dataset. The model outperforms other architectures with respect to Log-likelihood (LL) and frame-level accuracy (ACC%) measures.
In [156], variations of the LSTM are discussed, termed Tied Parallel LSTM with a neural autoregressive distribution estimator (NADE), and Biaxial LSTM. The model was tested on the datasets of JSB Chorales [155], MuseData [154], Nottingham [147], and Piano-MIDI [151], a classical piano dataset. The architectures perform well concerning the Log-likelihood measure. The architectures also have translation invariance.
In [157], an RNN-LSTM architecture is proposed, using the Mel-frequency cepstral coefficients (MFCCs) as features. The dataset consists of folk tunes collected by the author. The model achieves an accuracy of 99% and a loss rate of 0.03.
In [158], a model termed Chord conditioned Melody Transformer (CMT) is proposed, which generates rhythm and pitch conditioned on a chord progression. The training has two phases: first, a rhythm decoder is trained, and second, a pitch decoder is trained based on the rhythm decoder. The model was trained on a novel K-Pop dataset. In addition to various measures, like rhythm accuracy, the model was also evaluated by listeners, with respect to rhythm, harmony, creativity, and naturalness. The model outperforms the Explicitly-constrained conditional variational auto-encoder (EC²-VAE) [159], with respect to rhythm, harmony, and naturalness. The model also has a higher score for creativity than the real dataset tracks, meaning that it can indeed generate novel melodies.
In [160], an LSTM specifically for Jazz music was designed, using a novel Jazz music dataset in MIDI format, and the Piano-MIDI [151]. The model can also generate music using only a chosen instrument. The model can achieve a very low final loss value.
In [161], a BLSTM network with attention is considered for Jazz MG. The architecture consists of a BLSTM network, an attention layer, and another LSTM layer. The Jazz ML ready MIDI dataset [162] is considered. The model outperforms simpler architectures like LSTM without attention and the attention LSTM without the BLSTM layer.
In [163], a piano composer is designed, that uses information from given composers to generate music. The datasets used were Classical Music MIDI [164] and MIDI_classic_music [165], from which tracks of Beethoven, Mozart, Bach, and Chopin were considered. The model was evaluated through a human survey, where participants had to choose the real sample among the computer-generated and composer ones. Around half the time, people mistook the model-generated music for the human-composed track, meaning that the model can generate music that is relatively indistinguishable from real samples. The generated tracks can also be perceived as fairly interesting, pleasing, and realistic.
In [166], an architecture comprising an LSTM paired with a Feed Forward layer can generate drum sequences resembling a learned style, and can also match up to set constraints. The LSTM part learns drum sequences, while the feed-forward part processes information on guitar, bass, metrical structure, tempo, and grouping. The dataset was collected from 911tabs [167], and broken into three parts, for 80s disco, 70s blues and rock, and progressive rock/metal, with the model being effective in all styles.
Finally, in [168], the MI problem was considered by combining half-toning and steganography, and various methods were compared using a dataset of various instruments, with satisfying results for the considered models.

C. CNNs
For CNN architectures, in [169], the architecture comprises an LSTM as a generator, a CNN as a discriminator, and a control network that introduces restriction rules for a particular style of music generation. The matching subset of the Lakh MIDI dataset (LMD) [82] and Piano-MIDI dataset [151] was used. The model was evaluated by music experts, with respect to melody, rhythm, chord harmony, musical texture, and emotion. The model is rated higher than other ones in all of the above aspects.
In [170], a CNN with a Bidirectional Gate Recurrent Unit (BiGRU) and attention mechanism is used for folk music generation. The ESAC dataset [171] is used for testing. The results were evaluated by listeners, who gave overall positive ratings, although lower than the real ones. There were also some exceptions of low scores, meaning that the model generation may have some inconsistencies in its performance.
In [172] a Convolution-LSTM for piano track generation is considered. The CNN layer is used for feature extraction, and the output is fed into the LSTM for music generation. Piano tracks from Midiworld [173] were used for training. The model was evaluated by listeners, who were given 10 music segments, and had to decide whether they were human-made or computer generated. In most cases, the segments were correctly identified, but the Convolution-LSTM model performed better than the simple LSTM.

D. GANs
Symbolic music is stored using a notation-based format, which makes it an easier-to-use input for training NNs. For symbolic music generation, a GAN model is proposed in [174] for piano roll generation, equipped with LSTM layers in the generator and discriminator. The generated files were evaluated by participants with respect to melody and rhythm, and the proposed model received a higher score than files generated from other architectures.
In [175], an inception model conditional GAN termed INCO-GAN is proposed that can generate variable-length music. This complex architecture consists of two phases, that of training and generation, and each phase is broken into three processes: preprocessing, CVG training, and conditional GAN training for the training stage, and CVG executing, phrase generation, and postprocessing for the generation phase. The Lakh MIDI dataset is used for the experiments [82]. The model achieves high cosine similarity with the human-composed music for the frequency vector.
In [176], the problem of symbolic music GT was studied using CycleGAN, a model consisting of two GANs that exchange data and are trained simultaneously. The model was evaluated using genre classifiers, verifying the successful style transfer.
In [177], DrumGAN is proposed, an architecture for generating drum sounds (kick, snare, and cymbal). The model offers user control over the resulting score, by tuning the timbre features.
In [178], the authors generated log-magnitude spectrograms and phases directly with GAN to produce more coherent waveforms than directly generating waveforms with strided convolutions. The resulting scores are generated at a much higher speed. The NSynth dataset [179] is used, which contains single notes from many instruments, at different pitches, timbres, and volumes. The human audience rated the audio quality of the tracks, and the model was received as slightly inferior to the real tracks.
In [180], a GAN equipped with a self-attention mechanism is used to generate multi-instrument music. The self-attention mechanism is used to allow the extraction of spatial and temporal features from data. The Lakh MIDI [82] and Million Song [80] datasets were used here.
In [181], a GAN was designed for symbolic MG, along with a conditional mechanism to use available prior information, so that the model can generate melodies either starting from zero, by following a chord sequence, or by conditioning on the melody of previous bars. Pop music tabs from Theory-Tab [182] were used. The resulting system, termed MidiNet, is compared to Google's MelodyRNN and performs equally, with the test audience characterizing the results as being more interesting.
In [183], multi-track MG was considered using three different GAN models, termed the Jamming, Composer, and Hybrid. The Jamming model consists of multiple independent generators. The Composer consists of a single generator and discriminator, and a shared random input vector. In the Hybrid model, the independent generators have both an independent and a shared random input vector. The models were trained on a rock music database and used to generate piano rolls for bass, drums, guitar, piano, and strings. The database is termed Lakh Pianoroll Dataset, as it is created from the Lakh MIDI [82], by converting the MIDI files to multi-track piano rolls. A subset is also used with matched entries from the Million Song dataset [80]. In addition to using the training database, the model can also take as input a music track given by the user and generate four additional tracks from it. The model was evaluated by professional and casual users and received overall neutral to positive scores.
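The difference between the three models lies mainly in how the random input vectors are distributed among the generators. The sketch below illustrates only that aspect; the vector dimension, the use of concatenation in the Hybrid case, and all names are assumptions made for the example, not details taken from [183].

```python
import numpy as np

TRACKS = ["bass", "drums", "guitar", "piano", "strings"]


def sample_generator_inputs(model: str, z_dim: int = 8, rng=None) -> dict:
    """Sample the random input vector(s) fed to the generator(s).

    jamming:  one independent vector per track generator.
    composer: a single shared vector for one generator producing all tracks.
    hybrid:   per-track input = shared vector joined with an independent one.
    """
    if rng is None:
        rng = np.random.default_rng()
    if model == "jamming":
        return {t: rng.standard_normal(z_dim) for t in TRACKS}
    if model == "composer":
        return {"all_tracks": rng.standard_normal(z_dim)}
    if model == "hybrid":
        shared = rng.standard_normal(z_dim)
        return {t: np.concatenate([shared, rng.standard_normal(z_dim)])
                for t in TRACKS}
    raise ValueError(f"unknown model: {model}")


inputs = sample_generator_inputs("hybrid", z_dim=4, rng=np.random.default_rng(0))
for track, z in inputs.items():
    print(track, z.shape)
```

Sharing part of the input is one way to encourage the tracks to agree with each other, while the independent part leaves room for track-specific variation.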
In [184], Sequence Generative Adversarial Net (SeqGAN) is proposed, which applies policy gradient update. The Nottingham folk dataset [147] is used in the experiments. The model outperforms a maximum likelihood estimation (MLE) trained LSTM with respect to the mean squared error and other measures.
In [185], sequence generative GANs were considered for polyphonic music generation. The method condenses the duration, octaves, and keys of melodies and chords into a one-word vector representation. The Nottingham dataset [147] was used. The results were well received by a test audience, with respect to pleasantness, realism, and interest.
In [186], a conditional GAN is proposed for long inpainting up to a few seconds. The model was trained on datasets of increasing complexity, like the Lakh MIDI [82] and Million song [80], the Maestro dataset [118], recordings of grand pianos, and free music archive dataset [123], and extensive audience experiments were performed to evaluate the model. The inpaintings were generally detectable, especially in tracks with higher complexity, but were considered slightly or non-disturbing.

E. TRANSFORMERS
Transformers constitute a relatively recent architecture [187], which has found popularity in NLP. A key aspect of transformers is self-attention, which refers to the process of weighting the relevance between different positions of a single sequence. Transformers process sequential input data, but not necessarily in order.
The transformer's architecture is basically an encoder-decoder scheme. The encoder maps the sequence of inputs $(x_1, \ldots, x_N)$ to a sequence of vector representations $(z_1, \ldots, z_N)$. The decoder then takes this vector representation and generates a sequence of outputs $(y_1, \ldots, y_M)$, one at a time.
Let $W_q$, $W_k$, $W_v$ be the three parameter matrices that are trained. Following the standard scaled dot-product formulation of [187], these matrices are used to define the query, key, and value vectors of each input:
$$q_i = W_q x_i, \quad k_i = W_k x_i, \quad v_i = W_v x_i.$$
The self-attention score is calculated as follows. For every input, we want to calculate how it attends to all the tokens in the sequence. To achieve this, the query vector is used, and since every token serves as the query once, we calculate
$$e_{ij} = q_i^{\top} k_j, \quad i, j \in \{1, \ldots, N\}.$$
To have more stable gradients, normalization is performed as
$$\alpha_{ij} = \frac{\exp\left(e_{ij}/\sqrt{d_k}\right)}{\sum_{l=1}^{N} \exp\left(e_{il}/\sqrt{d_k}\right)},$$
where $d_k$ is the dimension of the key vectors. The final step is to calculate the self-attention score as
$$z_i = \sum_{j=1}^{N} \alpha_{ij} v_j.$$
In practice, the aforementioned procedure is performed in matrix form and is depicted in Fig. 7.
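A minimal NumPy sketch of this procedure in matrix form is given below. It assumes the standard scaled dot-product formulation of [187], in which the scores are divided by the square root of the key dimension and then passed through a row-wise softmax. The inputs are stored as rows of `X`, so the projections are written as `X @ W_q` and so on; the function name, the shapes, and the random test data are illustrative assumptions.

```python
import numpy as np


def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a sequence stored row-wise in X.

    X:             (N, d_model) input sequence.
    W_q, W_k, W_v: (d_model, d_k) trained projection matrices.
    Returns the (N, d_k) outputs and the (N, N) attention weights.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # e_ij = q_i . k_j, scaled by sqrt(d_k) for more stable gradients.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax; subtracting the row maximum avoids overflow.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted sum of the value vectors.
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))                      # N = 4 tokens, d_model = 6
W_q, W_k, W_v = (rng.standard_normal((6, 3)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.sum(axis=1))
```

Every row of `weights` sums to one, so each output vector is a convex combination of the value vectors of all positions in the sequence.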
Modifications of the simple transformer are proposed in various works. In [188], a relative attention mechanism is used to generate minute-long compositions, with intermediate memory requirements reduced from quadratic to linear. The JSB chorales dataset [155] and Piano-e-Competition dataset [189] were used. The model was evaluated by listeners, who were asked to rate pairs of musical excerpts. The model outperformed other architectures and was second only to the real music tracks.
In [190], an adversarial transformer is proposed to generate single-track or multitrack music. The results were positively received by a test audience, who rated tracks with respect to being human-like, harmonious, rhythmic, structured, fluent, and overall quality. The model scores better compared to another architecture, and much closer to the real track scores.
In [191], sparse factorization was applied to the attention matrix, which reduced the memory and time requirements from quadratic to sub-quadratic. Five-second-long samples were generated. A piano recording dataset from [192] was used for training.
In [193] a model termed Pop Music Transformer is proposed to generate pop piano music. The model uses a beat-based music representation. The generated tracks were evaluated by experts and casual listeners and were preferred by both groups over other architectures.
In [194], a model termed Jukebox can generate music along with vocals in various musical styles. The model uses multiscale Vector Quantization-Variational Autoencoders (VQ-VAE) to compress the raw audio input to discrete codes. Then the output is generated using an auto-regressive transformer. The architecture provides lyric conditioning, to control the singing part. The Maestro dataset [118] was used for training, and LyricWiki (now closed) was used to gather metadata, among others. The model can generate music in any chosen style by supplying conditioning signals during training.
In [195], a model for symbolic MG for Mandarin pop is proposed, where the transformer training considers the conditioning sequence as a thematic material. The POP909 dataset is used [196]. The model was evaluated by participants, on the aspects of theme controllability, repetition, timing, variation, and overall structure and quality. The proposed model outperforms others in all metrics.
In [197], conditional drum generation is considered, inspired by [166]. A BLSTM encoder receives the conditioning parameter information, and a transformer-based decoder with relative global attention generates the drum sequence. A subset of rock and metal songs from the Lakh MIDI dataset is used [82]. For subjective evaluation, participants were given a set of three tracks, two being the accompanying or condition tracks, and the third being the drum track to be evaluated. They were asked to rate the drum tracks with respect to rhythm, pitch, naturalness, groove, and coherence. The tracks generated from the proposed model outperform another baseline model and are even rated higher than real compositions with respect to naturalness, groove, and coherence. The users were also asked their opinion on whether the given drum tracks each time were real compositions or computer generated. The drum tracks from the model were perceived as computer generated only 39% of the time, indicating the natural feel of the tracks.
In [198], the problem of melody harmonization was considered. The model maps lower-level melody notes into semantic higher-level chords. Three architectures are proposed using a standard transformer, variational transformer, and regularized variational transformer. The Chord Melody [199] and Hooktheory Lead Sheet [200] datasets are used. In the human evaluation conducted, participants, comprising casual music listeners and professionals, were asked to rate samples with respect to harmonicity, unexpectedness, complexity, and preference. The standard model achieved the highest scores in harmony and preference, whereas the variational model achieved the highest in unexpectedness and complexity.

F. ARCHITECTURE OVERVIEW
As with the case of MIR, it is clear that there is no single architecture that can outperform the rest in MG tasks. Multilayered architectures though can be a path for building better models, especially when additional objectives are set, like conditioning the generated music to desired features.

IV. FUTURE STUDIES IN MDL
In this section, future research directions in MDL are identified and discussed.

A. MIXED ARCHITECTURES
So far there have been multiple approaches and different architectures to address key problems in MDL. However, despite most works reporting positive results, due to the complexity of the applications under study and their peculiarities, there is no dominant method that should be followed for a given task. Thus, there is no overall superior architecture that is guaranteed to outperform all others for any given MDL problem.
Apart from hybrid architectures, MDL will benefit significantly from the fusion of diverse input modalities. This would increase performance, as the conjunction of different modalities can help build connections between different features. For example, in [76] sound signals were extracted from unlabelled video sources. In [205], singing signals were combined with laryngoscope images for voice parts division. In [206], a system that combined heart rate measurements and facial expressions was built to detect drowsiness in drivers, accompanied by a music recommendation system used as a countermeasure to avoid accidents. In [63] and [64], a synchronized music and dance dataset was used for recommendation. In [207], music emotion classification is performed for four emotional classes, combining features from lyrics and acoustics. These are indicative examples of an emerging trend of bridging the gap between different modalities.
For the above techniques, an ever-present problem is the computational cost of training [208]. The increase in hardware requirements creates practical issues with energy consumption and environmental footprint, which, under the scope of the global energy and environmental crisis, must be addressed. Addressing the above will require the performance improvement of current architectures, or the consideration of different ones [209]. Understandably, any improvements in the computational cost will, by extension, also boost the commercialization of MDL applications.

B. TRADITIONAL MUSIC
Most of the existing works use widely available training databases, which mainly include western music genres, like classical music, pop, rock, metal, jazz, blues, etc. Using widely established music genres makes sense, due to their popularity, but it is highly important to enrich and diversify the training databases by including more genres. So, while it is essential to consider new and emerging genres, especially ones that are computer-based, like electronic, synth-wave, and vaporwave [65], [210], [211], [212], another trend that is gaining popularity is the application of MDL and MG for traditional and regional music. Traditional music refers to music originating from a specific country or region and is closely tied to its culture [213]. Examples include the recitation of religious excerpts like the Holy Qur'an [214], and traditional music from different regions, like Byzantine [215], Greek [216], [217], [218], Persian [219], Chinese [220], [221], Indian [222], [223], and many more.
In the development of MDL for regional and traditional music, several challenges may appear, as a result of the distinct nature of the topic. One issue is dataset availability: in contrast to western popular music, data are in many cases hard to gather, especially in the large amounts required for optimal training. In most cases, the research groups take it upon themselves to build their own dataset, due to the lack of existing ones, so hopefully, in the future, more authorities will help towards building free databases [77], [78], [142], [196], [221], [224], [225], [226]. For this task, recording difficulties may arise, especially for recordings made outside a music studio, with varying acoustics, for example in religious singing. Coming along with the problem of dataset collection is that of appropriate feature tagging of the tracks. This is strenuous work that requires time, and often the collaboration of music experts, for tasks like the annotation of music features, and testing audiences, for more ambiguous characterizations, like the emotion that a track evokes.
Moreover, many musical instruments, like the guitar and piano, are present in almost all music genres, so it is easier to adopt MG architectures for a specific instrument to many different styles. This may not be the case for regional instruments, which are only used for playing a region's traditional music. So, for preserving and learning musical styles through DL, it is essential to build datasets for specific instruments [221]. Finally, many traditional music styles have a distinct musical notation, like Mensural notation, Chinese Gongche, and Organ tablature, meaning that MDL architectures for transcription, pattern recognition, and symbolic MG would have to be adjusted to fit the characteristics of each genre. This again requires the existence of appropriate databases for different musical notations.
Overall, it seems that there are still several practical challenges to fully developing DL for traditional music. These are steadily addressed by the efforts of several research groups over the world. Table 3 lists the recent works that study Traditional Music Deep Learning (TMDL), categorized by music type. These works offer great service to the preservation of history, culture, and art, as the digitization, study, and generation of traditional music will help open it up to new generations of listeners and also promote thematic (music, religious) tourism. Thus, it is expected that more research groups will contribute to regional MDL in the future, and hopefully, such research endeavors will also receive governmental support and recognition.

C. MEDICAL APPLICATIONS
The field of Music Therapy (MT) lies at the intersection of Medicine and Music. MT is an evidence-based approach for treating a plethora of pathological conditions, including, among others, anxiety, depression, substance abuse, Alzheimer's, eating disorders, sleep disorders, and more [261], [262], [263]. Naturally, DL can prove a valuable tool to therapists and patients, as a complement to existing treatments. Table 4 summarizes the recent applications of DL in music therapy, categorized by architecture. The conditions that have been addressed include music remixing to improve cochlear implant performance, effective MRS and MG for mood transformation, including anxiety and depression, MG for stimulating the musical memory in patients with Alzheimer's, MG for relieving Tinnitus, and voice parts classification for vocal art medicine. Existing architectures of DL for tasks like music recommendation and emotion classification can be adapted to fit many of the above conditions. For example, music recommendation systems can be updated to make suggestions based on emotion and mood, using a collection of patient inputs, like facial expressions, and other physiological signals, like heart rate, temperature, respiratory rate, EEG signals, and more. By designing appropriate user interfaces [40], [117], [137], MDL architectures could also be used as an entertainment and educational tool, especially for interventions with children. Finally, it would also be interesting to see if knowledge transfer could be applied to models developed for treating conditions with overlapping symptoms, for example, anxiety and depression.
MT is a field that is constantly developing, with medical researchers turning to it as a method for effectively treating, or reducing the symptoms of, many conditions. By developing proper training databases and MIR and MG architectures, DL will help in establishing open-access tools that can be used by anyone, without the need for increased medical expenses. Moreover, tools like MRS for mood transformation can be directly available to patients, providing daily help coverage. Overall, there are many promising future directions to be considered by researchers.

D. MUSIC GENERATED FROM DYNAMICAL SYSTEMS
Another field that would also be interesting to consider is that of chaos-based music generation [278], [279], [280], [281], [282], [283]. In this interdisciplinary field, which bridges MG with the rich area of chaos theory, the time series solution of a chaotic system is used as a high entropy source for music generation, by mapping it to parameters like the pitch, duration, amplitude, and velocity of each musical note. Chaotic systems are characterized by nonperiodicity and sensitivity to initial conditions and parameter changes, meaning that two solutions of the same system, starting from almost identical initial configurations, will quickly diverge from each other, yielding two different, non-periodic time series. This feature can thus be exploited in MG, as it can aid in the generation of non-repeating musical patterns. So exploring DL methods in this area could give rise to applications in numerous fields, including medical treatment [284], [285], and possibly secure communications [286], and system identification [287].
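As a toy illustration of this idea, the sketch below iterates the logistic map, a standard example of a chaotic system, and quantizes its time series to the pitches of a C major scale. The choice of map, the parameter value r = 4, the scale, and the quantization rule are all assumptions made for the example and are not taken from the cited works.

```python
def logistic_series(x0: float, n: int, r: float = 4.0) -> list:
    """Iterate the logistic map x -> r * x * (1 - x), starting from x0."""
    xs = []
    x = x0
    for _ in range(n):
        x = r * x * (1.0 - x)
        xs.append(x)
    return xs


# MIDI note numbers of one octave of C major (assumed target scale).
C_MAJOR = [60, 62, 64, 65, 67, 69, 71, 72]


def to_pitches(series, scale=C_MAJOR) -> list:
    """Quantize values in [0, 1] to notes of `scale`."""
    last = len(scale) - 1
    return [scale[min(int(x * len(scale)), last)] for x in series]


# Two trajectories whose initial conditions differ by one part in a billion.
melody_a = to_pitches(logistic_series(0.3, 64))
melody_b = to_pitches(logistic_series(0.3 + 1e-9, 64))
print(melody_a[:16])
print(melody_b[-16:] == melody_a[-16:])
```

The two melodies start identically and then drift apart, which reflects the sensitivity to initial conditions described above. The same series could equally be mapped to note durations or velocities.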

V. CONCLUSION
MDL has evolved into a very active field, with an increasing number of contributions each year, addressing its vast applications. This work provided a review of the recent developments in Music Deep Learning. The review was divided into two main categories, Music Information Retrieval, and Music Generation. After reviewing each field, future research trends were identified.
The future of MDL lies in developing hybrid architectures to improve performance, with applications spanning commercial, conservational, medical, and experimental domains. Of these, applying DL to the study and preservation of each country's cultural heritage is of high importance. So is the exploitation of MDL for medical applications. The integration of MDL and chaos is more experimental, but its multidisciplinary nature is likely to lead to new developments in both fields. For all of the aforementioned applications, bringing together research groups with heterogeneous and complementary backgrounds, such as computer scientists, physicists, mathematicians, musicians, audio engineers, and medical practitioners, is the key to success. The authors hope that the present work can be of service to these researchers, by providing a clear overview of recent and emerging developments in the field. Table 5 lists the abbreviations used throughout the text.

APPENDIX A LIST OF ABBREVIATIONS
CHRISTOS VOLOS received the Diploma degree in physics, the M.Sc. degree in electronics, and the Ph.D. degree in chaotic electronics from the Physics Department, Aristotle University of Thessaloniki, Greece, in 1999, 2002, and 2008, respectively. He is currently an Associate Professor with the Physics Department, Aristotle University of Thessaloniki. He is also a member of the Laboratory of Nonlinear Systems, Circuits and Complexity, Physics Department, Aristotle University of Thessaloniki. His current research interests include the design and study of analog and mixed signal electronic circuits, chaotic electronics and their applications (secure communications, cryptography, and robotics), experimental chaotic synchronization, chaotic UWB communications, and measurement and instrumentation systems.
SPIRIDON NIKOLAIDIS (Senior Member, IEEE) received the Diploma and Ph.D. degrees in electrical engineering from Patras University, Greece, in 1988 and 1994, respectively. Since September 1996, he has been with the Department of Physics, Aristotle University of Thessaloniki, Greece, where he is currently a Full Professor. From 2003 to 2017, he was also a contract teaching staff of Hellenic Open University. He has worked in the areas of digital circuits and system design. He is the author or coauthor of more than 200 scientific articles in international journals and conference proceedings, while his work has more than 2300 references (Google Scholar, H-index=23). Two articles presented at international conferences achieved honorary awards. His current research interests include the design of high-speed and low-power digital circuits and embedded systems, modeling the operations of basic CMOS structures, modeling the power consumption of embedded processors, and development of algorithms for leak detection and localization in pipelines. He was a member of the organization committees of three international conferences.
He is the founder and organizer of the Annual International Conference on Modern Circuit and System Technologies (MOCAST) since 2012. He also organized the 27th International Symposium on Power and Timing Modeling, Optimization and Simulation (PATMOS), in 2017. He contributes or has contributed to a number of research projects funded by the European Union and the Greek Government, for many of which he has scientific responsibility. He is currently an Associate Professor with the Department of Physics, Aristotle University of Thessaloniki. He is also the Director of the ELEDIA@AUTH and a Laboratory Member of the ELEDIA Research Center Network. He has participated in more than 16 national and European-funded projects and has been a principal investigator of five national funded research projects. He is the author of the book titled Emerging Evolutionary Algorithms for Antennas and Wireless Communications (The Institution of Engineering and Technology, 2021). His research interests include antenna and microwave structures design, evolutionary algorithms, wireless communications, machine learning, and semantic web technologies.
Prof. Goudos is a member of the IEICE, the Greek Physics Society, the Technical Chamber of Greece, and the Greek Computer Society. He is also a member of the editorial boards of the International Journal of Antennas and Propagation (IJAP), the EURASIP Journal on Wireless Communications and Networking, and the International Journal on Advances on Intelligent Systems. He is also a member of the Topic Board of the open access journal Electronics. He has also served as a member of the technical program committees for several IEEE and non-IEEE conferences. He is the founding Editor-in-Chief of the open access journal Telecom (MDPI publishing). He is serving as an Associate Editor for the IEEE TRANSACTIONS ON ANTENNAS AND PROPAGATION, IEEE ACCESS, and the IEEE OPEN JOURNAL OF THE COMMUNICATION SOCIETY. He was honored as an IEEE ACCESS Outstanding Associate Editor, in 2019, 2020, and 2021. He has participated as a guest editor or a lead guest editor of more than 20 special issues of international journals. He has co-organized four special sessions in international conferences. He is also serving as the Chapter/AG Coordinator for the IEEE Greece Section. He has been elected as the IEEE Greece Section Secretary, in 2022.