Toward Interactive Music Generation: A Position Paper

Music generation using deep learning has received considerable attention in recent years. Researchers have developed various generative models capable of imitating musical conventions, comprehending the musical corpora, and generating new samples based on the learning outcome. Although the samples generated by these models are persuasive, they often lack musical structure and creativity. For instance, a vanilla end-to-end approach, which deals with all levels of music representation at once, does not offer human-level control and interaction during the learning process, leading to constrained results. Indeed, music creation is a recurrent process that follows some principles by a musician, where various musical features are reused or adapted. On the other hand, a musical piece adheres to a musical style, breaking down into precise concepts of timbre style, performance style, composition style, and the coherency between these aspects. Here, we study and analyze the current advances in music generation using deep learning models through different criteria. We discuss the shortcomings and limitations of these models regarding interactivity and adaptability. Finally, we draw the potential future research direction addressing multi-agent systems and reinforcement learning algorithms to alleviate these shortcomings and limitations.


I. INTRODUCTION
Computers have introduced a new way of approaching music composition to create an elaborate piece of music. There are several approaches for the algorithmic composition of music [1], such as mathematical models [2], knowledge-based systems and grammars [3], evolutionary methods [4], and Markov models [5]. Although these models have shown the ability to create melodies in various styles, such as [6] and [7], they lack generalization [8] and, in some cases, require manual preparation of rule-based definitions for different types of music. In contrast to handcrafted models, machine learning models, and particularly deep learning (DL) models, can learn from a large corpus of musical examples and generate new content. Besides, deep learning models exhibit strength in processing raw unstructured data by extracting higher-level features associated with the task.
The associate editor coordinating the review of this manuscript and approving it for publication was Wei Jiang.
Mainly, music generation comprises subtasks like melody and multi-instrument generation, style transfer, and audio synthesis. Models such as DeepJ [9], DeepBach [10], and BachBot [11] can mimic a particular musical style with plausible results. JukeBox [12] can generate complete high-quality songs with singing in raw audio in an end-to-end approach.
Despite this promising progress, there are challenges in using end-to-end deep generative models for music generation. These models often suffer from a lack of musical structure, expressiveness, and creativity. Besides, there is no unified music evaluation method for deep learning models [13]. Furthermore, these models are primarily limited in interactivity and controllability. It is demanding for artists to generate creative and genuine content using end-to-end models [14]. Consequently, it is essential to have a clear perspective of challenges and problems to improve the performance and ability of these models.
VOLUME 10, 2022
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
In this study, we provide an overview of the advances in deep learning methods for music generation in the symbolic domain. We further outline different evaluation techniques and the challenges and limitations of these models in music generation tasks. Additionally, we summarize the characteristics of 73 deep learning models and the challenges they address. Accordingly, we describe a potential approach to overcome these issues. Our study concentrates explicitly on adaptability and interactivity issues by demonstrating a better approach using multi-agent systems and reinforcement learning algorithms.
This paper is organized as follows. Section II briefly introduces various aspects of the music generation task. Section III presents different domains of music representation. Section IV sorts out the common deep learning architectures of generative models. Section V deals with the methods for music generation, categorized based on the architectures in Section IV. Section VI presents different music evaluation methods from objective and subjective points of view. Section VII points out some shortcomings of current methods and challenges in the music generation task. Section VIII exposes the potential future research direction. Finally, Section IX concludes this paper.

II. ASPECTS OF MUSIC GENERATION TASK
The objective of the music generation task refers to the musical content to be generated. Reference [15] determines the music generation objectives with five aspects: type, destination, use, mode, and style. The most important factor among these five aspects is the type, which defines the nature of the music generation model. In this context, we can classify the main musical types as single-track monophony, single-track polyphony, multi-track polyphony, and accompaniment. The single-track monophony represents the sequence of notes with at most one note at a time for a single instrument or vocal.
In comparison, single-track polyphony represents more than one note at a time. Examples of single-track polyphony instruments are the piano and guitar. While single-track monophony and polyphony are for a single instrument, multi-track polyphony is intended for more than one voice or instrument. Multi-track polyphony can capture a complete band, such as a Jazz trio with piano, bass, and drums, and it constitutes the traditional recording format. Additionally, the accompaniment can be rhythmic or harmonic support (or both) to a given melody, like chord progression and counterpoint. Note that this is only one of several ways to classify musical types, but it is useful for discussing music generation tasks in this study.
The mode aspect defines whether humans can intervene in the music generation process or if it is fully automated. The interactive ability of a musical system provides some degree of control over the content generation. Based on the mode, we can determine the destination and use of the generated content. For instance, the generated musical content can be played by an audio system (waveform), processed by sequencer software (Musical Instrument Digital Interface (MIDI)), or performed by a human (score). Moreover, the generated musical content can be influenced by the style of certain musicians, such as Bach. Indeed, the choice of training examples directly affects the model's learning outcome regarding the musical style.

III. REPRESENTATION OF MUSIC
Musicians work with many levels of inference, ranging from abstract symbolic representation like the lead sheet to the continuous and concrete representation of audio signals. We can divide music into symbolic and audio domains [16]. Mainly, the symbolic domain consists of discrete variables, while the audio domain is continuous. Additionally, the symbolic domain includes a representation referred to as performance control. Considering the multi-level and multimodal characteristics of music representation, we can distinguish three levels:
• The high level is the score representation, including the structure and symbolic features (like note, pitch, and chord). It is an abstract representation of music that enables musicians to develop and communicate musical ideas seamlessly.
• The middle level is the performance representation consisting of detailed timing and dynamics for the musical expression. The performance representation conveys the changes in emotion and information, which are not marked in the score but performed by the musician.
• The bottom level is the audio representation related to acoustic features, such as timbre, that can be determined as a sound.
Music generation can be addressed at each of these levels. Deep learning models and computer programs generally require a precise definition of the input representation. In this study, we concentrate on deep learning models for the symbolic representation of music.

IV. DEEP LEARNING
In recent years, deep learning has seen many advancements in the architecture of generative models. The most utilized generative models in the music generation tasks are Recurrent Neural Networks (RNNs), Variational AutoEncoders (VAEs), Generative Adversarial Networks (GANs), and Transformers. Additionally, researchers have investigated the potential of reinforcement learning algorithms for music generation. This section outlines the architecture of these methods.

A. RECURRENT NEURAL NETWORKS
Recurrent Neural Networks (RNNs) are neural network architectures suitable for learning from sequential data. They can capture the time dependencies within input sequences by sampling from the neuron's output and feeding in the sample as input in the next time step. However, due to the vanishing gradient problem, RNNs struggle to learn long-term dependencies within the input sequences. The Long Short-Term Memory (LSTM) network [17] is an advanced type of RNN that comprises layers of neurons with recurrent connections. LSTM contains a computational unit called a memory cell or memory block, consisting of weights and gates connected recurrently. The network interacts with memory cells through the gates, which increase the number of parameters to be estimated during training. In this manner, the network can control the flow of information in detail for each cell, resulting in faster convergence.
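As a minimal sketch of the gating described above, the following Python function performs one time step of a single LSTM memory cell with a scalar input and a scalar state. The function names, the parameter layout, and the toy weights are illustrative assumptions and are not taken from [17]; practical models use vector-valued states and learned weight matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each gate name ("i", "f", "o", "g") to a tuple
    (w_x, w_h, b): input weight, recurrent weight, and bias.
    """
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: how much new content to write
    f = sigmoid(pre("f"))      # forget gate: how much old cell state to keep
    o = sigmoid(pre("o"))      # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))    # candidate cell content
    c = f * c_prev + i * g     # updated cell state
    h = o * math.tanh(c)       # updated hidden state
    return h, c

# Hypothetical weights, shared across gates for brevity.
toy_params = {gate: (0.5, 0.1, 0.0) for gate in ("i", "f", "o", "g")}
h, c = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:     # a short toy input sequence
    h, c = lstm_step(x, h, c, toy_params)
```

The cell state c is carried from step to step and is only rescaled by the forget gate, which is what lets information persist over many time steps.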

B. GENERATIVE ADVERSARIAL NETWORKS
Generative adversarial networks (GANs) [18] are another family of deep generative models. The main idea is to train two neural networks at the same time. The GAN architecture includes the generator G and the discriminator D. The generator learns a distribution of the input data during the training process to resemble the actual samples. At the same time, the discriminator takes examples of the real data (input examples) and the generated data (examples output by the generator) and attempts to maximize the probability of assigning the correct label to real and synthetic (generated) data. Indeed, the training process of a GAN forms a two-player minimax game in which the models are trained until the discriminator is fooled half the time.
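For reference, the two-player game described above is commonly written as the following minimax objective, following the formulation in [18]. The symbols p_data (the distribution of real examples) and p_z (the prior over the generator's noise input z) are notation introduced here for illustration.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

At the equilibrium of this game, the generated distribution matches p_data and D outputs 1/2 for every input, which corresponds to the discriminator being fooled half the time.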

C. VARIATIONAL AUTO-ENCODERS
The variational autoencoders (VAEs) [19] are powerful deep generative models. They have shown an excellent capacity to produce various high-quality content such as images, texts, and sounds. VAE is an autoencoder (AE) with constraints on encoded representation (latent variables), denoted by the variable z. The applied constraints ensure that the encoder produces latent variables with a predefined structure and properties.
To elaborate, AE is a neural network with one hidden layer in which the output layer (decoder) reflects the input layer (encoder). In other words, the encoder compresses each example in the dataset into a vector of numbers (latent variables) to create the latent space of the dataset. The decoder reconstructs the same examples using the latent variables. However, it is difficult to ensure the regularity of the latent space organized (encoded) by the encoder. The training regime in AE results in encoding and decoding with no information loss, which indicates the overfitting problem. Therefore, the decoder is prone to generating poor-quality content because of the lack of structure in the latent space.
The VAE architecture alleviates the latent space irregularity issue by encoding the examples following a probability distribution P(z), such as the Gaussian distribution. In this manner, VAEs ensure a better structure of the latent space by forcing the encoder to return a distribution over the latent space instead of a single point.
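The idea of returning a distribution rather than a single point can be sketched as follows, assuming a diagonal Gaussian encoder output and a standard normal prior P(z). The function names and the toy values are illustrative assumptions; the sampling step is the reparameterization commonly used to train VAEs [19], and the closed-form KL term is the regularizer that pulls each encoded distribution toward the prior.

```python
import math
import random

def sample_latent(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

# Hypothetical encoder output for one training example (2-D latent space).
mu, log_var = [0.3, -1.2], [0.0, -0.5]
z = sample_latent(mu, log_var)
penalty = kl_to_standard_normal(mu, log_var)
```

The penalty is zero only when the encoder already outputs the prior (zero mean, unit variance), so minimizing it alongside the reconstruction error keeps the latent space regular.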

D. TRANSFORMERS
Transformers [20] have been used widely in natural language processing (NLP) [21] and computer vision [22] tasks with outstanding performance. The transformer architecture relies on an attention mechanism that computes the representation of its input and output by concentrating on specific elements of the input sequences. In particular, transformers belong to the family of sequence-to-sequence models. Their architecture includes an encoder and a decoder, which makes it comparable to AE models, and it is trained with backpropagation. To train the transformer model, the inputs are prepared as tokens, which form a structured representation. In this manner, the positional information is preserved, which enables the model to determine temporal dependencies within the input sequences.
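A minimal sketch of the attention computation, assuming the scaled dot-product form introduced in [20]. The helper names and the toy embeddings are illustrative; real models add learned projections, multiple heads, and positional encodings.

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[d] for wj, v in zip(w, values))
                        for d in range(len(values[0]))])
    return outputs, weights

# Three toy tokens with 2-D embeddings, used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

Each row of weights sums to one and indicates which input elements a given position concentrates on.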

E. REINFORCEMENT LEARNING
In Reinforcement Learning (RL), an agent learns to interact with an environment through trial and error. The agent selects and performs actions sequentially within the environment. Each action takes the agent into a new state, where the agent receives a reward. The given reward relies on the fitness of the action to the current state (environment). The agent's goal is to learn an optimal policy to maximize its cumulative rewards (gain) through the learning process. Indeed, the agent maximizes its gain by knowing when to explore to learn more and when to exploit what it has learned.
For instance, Q-learning is a model-free reinforcement learning algorithm in which the agent learns to estimate the value of an action in a particular state. Q-learning is model-free because it does not require a model of the environment's dynamics, that is, its transition and reward functions. Indeed, Q-learning is a value-based learning algorithm that updates the value function based on the Bellman equation. The agent maintains the estimated values in a Q-table and updates the table's values during its interaction with the environment.
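The tabular update can be sketched as below. The learning rate alpha, the discount factor gamma, and the toy state and action names are assumptions made for illustration; the update itself is the standard Q-learning rule.

```python
from collections import defaultdict

def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (td_target - q_table[(state, action)])
    return q_table[(state, action)]

# Hypothetical toy setting: the state is the last note played and the
# action is the next note to play.
actions = ["C", "E", "G"]
q_table = defaultdict(float)          # unseen entries default to 0.0
new_value = q_update(q_table, state="C", action="E", reward=1.0,
                     next_state="E", actions=actions)
```

Note that the update uses only the observed reward and next state, so no transition or reward model of the environment is needed.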
For an overview of approaches and algorithms for reinforcement learning, we refer to [23] and [24]. In this study, we mainly concentrate on using deep learning in reinforcement learning algorithms, known as deep reinforcement learning (DRL). Reinforcement learning is emerging as a promising approach to the music generation task. It can mitigate the interactivity issue of deep learning architectures through control methods like the reward mechanism. Furthermore, DRL algorithms can process large input examples, which is important in the case of music generation tasks.

V. RELATED WORK
This section studies the current advances and state-of-the-art approaches to music generation using deep learning techniques in the symbolic domain. We discuss the approaches based on the architectures mentioned in Section IV. Note that some of the mentioned papers are under peer review or are preprints.

A. RECURRENT NEURAL NETWORKS
Following the success of the LSTM architecture [17], Eck and Schmidhuber [34] used LSTM networks to address the lack of global coherence in algorithmic composition, since they learn temporal dependencies better than vanilla RNNs. Their work demonstrated the LSTM network's ability to learn the local and global structures and reproduce long-term conventions. However, the network's tendency to adhere to the conventions of the training set prevents it from exploring and producing new musical forms. To further improve the performance of the LSTM model in the music generation task, Eck and Lapalme [35] proposed a music-specific sequence learner that can capture long-timescale structure in the musical piece. They introduced a bias toward the metrical structure by providing time-delayed copies of the input, which addresses the network's difficulty in learning repetitive musical sequences.
Sturm et al. [36] used a character-based approach that works with a vocabulary of single characters, with textual transcriptions of folk music, to train a deep LSTM model. The training examples contain 24,000 high-level transcriptions of folk tunes in ''ABC'' notation with a vocabulary size of 134. The input representation carries the one-hot encoded input vectors similar to [35] with the softmax output layer providing the distribution over the vocabulary adapted to the input. They developed two models, charRNN trained on a consecutive text file, and folkRNN trained on single complete transcriptions. Similarly, Choi et al. [37] used word-based learning (wordRNN) in addition to character-based learning (charRNN) for automatic music composition. They utilized textual data representation to generate Jazz chord progressions and Rock music drum tracks. In the preprocessing step, the start and end flags indicate the score's beginning and end. They transposed all the scores to the key of C. For drum tracks, they used a binary representation of pitches to encode drum components where only nine components were included for training efficiency.
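To make the character-level input representation concrete, the sketch below builds a vocabulary from a short ABC-style string and one-hot encodes it. The tune fragment and the helper names are made up for illustration and are not drawn from the dataset used in [36].

```python
def build_vocab(text):
    """Map each distinct character to an integer index (sorted for stability)."""
    return {ch: idx for idx, ch in enumerate(sorted(set(text)))}

def one_hot_encode(text, vocab):
    """Return one vector per character, with a single 1 at the character's index."""
    vectors = []
    for ch in text:
        vec = [0] * len(vocab)
        vec[vocab[ch]] = 1
        vectors.append(vec)
    return vectors

# Hypothetical ABC-style fragment; a real corpus would yield a larger vocabulary.
tune = "M:4/4 K:C |: G2 A2 B2 c2 :|"
vocab = build_vocab(tune)
encoded = one_hot_encode(tune, vocab)
```

A softmax output layer over the same vocabulary then yields, at each step, a probability for every possible next character.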
Li et al. [38] proposed a novel technique to improve the performance of LSTM RNN models to learn long-temporal dependencies. The proposed model is named Enhanced Memory Network (EMN), which consists of several recurrent units known as Enhanced Memory Units (EMU). EMN incorporates musical beat information and historical hidden states to improve the learning ability of LSTM RNN. Medeot et al. [39] proposed StructureNet that learns musical structure space to generate melody. StructureNet includes two networks: structure and melody model. The structure model induces the musical structure within the given training examples (melodies) and encodes them as a sequence of binary vectors. They used the trained structure model to steer the melodies generated by the melody model during the generation process. The melody model is a probabilistic model that predicts the probability distribution of musical events.
Similarly, Dai et al. [40] introduced MusicFrameworks for controllable melody generation using hierarchical music structures. Their system composes a melody by arranging a musical piece into sections and phrase-level structures. Then, it generates rhythm and basic melody parts using two transformer-based models. Finally, the system generates the final melody by conditioning on various musical attributes. Keerti et al. [41] utilized Bi-directional LSTM RNN to compose polyphonic Jazz pieces. Their model employs the attention mechanism to identify the parts of the input sequences with salient musical features. For similar approaches to address the structure in music using LSTM RNN, we refer to [42], [43], and [44].
Although the above systems can generate musical content, their generations lack musical expression. Oore et al. [45] proposed PerformanceRNN to address the expressiveness in music. They utilized a dataset of recorded human performances, including the exact timing and dynamics of the notes. Hadjeres and Nielsen [46] proposed AnticipationRNN to implement positional constraints on the model's generation. Their architecture and method provide interactivity to the RNN-based model, enabling the users to impose positional constraints on notes.
We often desire to generate music based on sentiment or a specific music style. Ferreira and Whitehead [47] proposed a method to control the deep learning model and generate musical pieces using a specific sentiment. Their generative model includes an LSTM network paired with a Logistic Regression model. Their model also shows the potential to perform sentiment analysis of symbolic music. Furthermore, they provided a labeled dataset of symbolic music annotated according to sentiment for future research. Cífka et al. [48] presented a style transfer method to generate polyphonic accompaniment styles for Jazz. They trained neural networks with encoder-decoder architecture in a supervised manner on synthetic parallel training data labeled by the styles of music [49]. The training data includes chord charts from a chord language model of the Jazz music standards and rhythmic variations. The sampled chord charts are prepared as a token of the chord's root, quality, and duration. To evaluate the model's performance, they used the content preservation technique to estimate how well the model captured the harmonic structure and the style fit technique to measure how well the output matched the desired style.
Chen et al. [50] used the chord progression as constraints to generate melody using WaveNet. Their work compares temporal-CNN and LSTM RNN models systematically. Furthermore, they propose a technique to encode chords and melodies in a staggering representation. They used the Information Dynamics method to analyze and evaluate the content generated by the model through pattern identification. Lu et al. [51] proposed MeloForm, an expert system to compose music according to a musical form. Their system consists of two modules: expert systems and a transformer model. MeloForm can generate different forms of music, such as verse and chorus, rondo, and sonata forms. The expert systems module utilizes handcrafted rules (music theory) for melody generation. The transformer model refines the generated melody using various strategies, such as refining phrase by phrase, conditioning on harmony and rhythm, and others.
Ziegler and Rush [52] utilized the normalizing flow architectures for generative models to compose the melody and polyphonic music. In their approach, they considered character-level language modeling and polyphonic music generation, where the normalizing flow method models the continuous representation of the input sequences.

B. VARIATIONAL AUTO-ENCODERS
An example of a VAE-based model for music composition tasks is MusicVAE [53] for monophonic and polyphonic music. The architecture of the proposed model includes an encoder and a decoder with a two-level hierarchical RNN structure. They utilized a corpus of MIDI files collected from the web to extract monophonic melodies, drum patterns, and trio sequences (drum, bass, melody (piano or guitar)). They trained the model on sequences of 2 or 16 measures for monophonic melodies and drum patterns, and of 16 measures for trio sequences. Furthermore, by utilizing the latent space of the VAE architecture, the model can generate musical content through different operations like translation and interpolation.
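The interpolation operation can be sketched as a walk along the straight line between two latent codes, where each intermediate code would then be passed to the decoder. The vectors below are hypothetical stand-ins for encoder outputs, the decoder is omitted, and the choice of linear interpolation is an assumption made for simplicity (spherical interpolation is a common alternative).

```python
def interpolate(z_a, z_b, steps):
    """Return `steps` latent vectors evenly spaced from z_a to z_b inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for k in range(steps):
        t = k / (steps - 1)           # goes from 0.0 to 1.0
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

# Hypothetical latent codes of two melodies.
z_a, z_b = [0.0, 1.0, -2.0], [4.0, 1.0, 2.0]
path = interpolate(z_a, z_b, steps=5)
```

Decoding the codes in path one after another would, under these assumptions, produce a sequence of pieces that morphs gradually from the first melody to the second.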
Later, Simon et al. [54] proposed an extension of MusicVAE, called multi-track MusicVAE, to generate musical pieces with an arbitrary number of instruments. In their model, both the encoder and decoder adopted the hierarchical architecture. Similarly, by benefiting from VAE latent space, the model has the capacity to generate samples by chord conditioning. Dinculescu et al. [55] proposed a new method to learn the latent space of the MusicVAE model to enhance conditional sampling. They achieved this by employing the latent constraints [56] to lower the dimension of the latent space, concentrating on the portions that are similar to a particular style or genre. Wang et al. [57] proposed a novel tree-structure model called PianoTreeVAE by addressing the hierarchical structure of music. The architecture of the network resembles the tree structure, where each node represents the embeddings of musical elements with bidirectional edges.
Liang et al. [58] proposed MIDI-Sandwich2, a hierarchical VAE-based model for polyphonic music generation. In contrast to other hierarchical VAE-based models, they used RNN instead of CNN models to build the generative model. Their model utilizes the Binary VAE (BVAE) method to handle various multi-track music information. Mittal et al. [59] presented a new approach to utilizing probabilistic diffusion models for melody generation. Their training regime consists of first training the VAE model on input sequences and then training the diffusion model to learn the VAE latent space. Indeed, the diffusion model is trained to learn long-term dependencies and expand the ability of the VAE model to generate long sequences, in this case, 64 bars.
Chen et al. [60] introduced Music SketchNet, a novel guided music generation framework. Their model is intended to complete the missing parts of musical measures, given the musical piece and related parameters as input. The input parameters are pitch contours and rhythm patterns defined by the user. The model's architecture consists of three components: SketchVAE, SketchInpainter, and SketchConnector. SketchVAE is a VAE model that encodes and decodes the training examples into high-dimensional latent variables, while SketchInpainter is a stacked RNN model that handles the prediction of musical ideas by utilizing the latent variables. SketchConnector combines the predictions from SketchInpainter and the musical ideas given by the user to produce the final latent variables. The decoder of the SketchVAE receives these latent variables to generate music output.
Akbari and Liang [61] proposed a semi-recurrent CNN-based VAE-GAN model for melody generation. The model includes the encoder, generator/decoder, and discriminator. They merged the VAE decoder and the GAN generator into a single module, so that the two share parameters and are trained together. The encoder encodes the input sequences and constructs the latent representation. The generator/decoder utilizes the latent variables to produce the output. Then, the discriminator module receives the real (original training examples) and fake (generated output) data. They trained their model for piano music generation. Similarly, Brunner et al. [62] proposed MIDI-VAE for polyphonic music generation and modeling the dynamics of music. Their model includes a VAE model paired with a style classifier, which guides the encoder in the VAE to construct the latent space based on the style information. Their model can perform style transfer by changing attributes such as the pitch, velocity, and instrument of a musical piece.
Wang et al. [63] introduced hierarchical variational recurrent auto-encoders (VRAE) to model polyphonic music. They used the normalized note representation proposed by BachProp [44] and multiple embedding layers to project each melodic feature. For the encoder, they utilized four GRU layers to construct the latent representation of the melodic features given by the embedding layers. The decoder has a similar architecture with seven GRU layers for modeling attribute-specific context, combining multiple attributes, and generating corresponding note attributes. The architecture of their model demonstrates the capability to generate dynamic music with various time signatures.
Tan and Herremans [64] proposed Music FaderNets, a framework that utilizes latent variable models to learn high-level musical features through the low-level representation of music. They used Gaussian Mixture Variational Autoencoders (GM-VAEs) as their model architecture to capture the latent space of low-level musical attributes. Indeed, by employing such a hierarchical latent space architecture, they could derive high-level musical attributes from low-level representations. Music FaderNets provide an interactive and controllable generation by tweaking the low-level musical features. This control is presented as sliding knobs and is inspired by the visual controllers in Fader Networks [65].
Pati et al. [66] proposed music inpainting, a technique to traverse the latent space of VAE models. Inpainting is a task in which the purpose is to refine or complete the missing parts of a medium [67]. Their model can generate content based on past and future musical contexts in an interactive manner.

C. GENERATIVE ADVERSARIAL NETWORKS
Mogren [68] presented one of the earliest GAN-based music generation models. The model, C-RNN-GAN, is an RNN model with adversarial training using a continuous sequence of data. It uses real-valued continuous quadruplets of frequency, length, intensity, and timing as musical features to model the musical signals. Later, Guimaraes et al. [69] proposed ORGAN, a new GAN-based approach to compose polyphonic music. The ORGAN architecture includes an LSTM RNN for the generator and a CNN for the discriminator. It uses a reinforcement learning (RL) based reward function representing domain-specific metrics to train the generator model.
Multi-track polyphonic music includes multiple voices that are independent in terms of time. Each of these voices has its own temporal dynamics, and the voices are layered on top of each other to shape the desired sound. Dong et al. [70] proposed MuseGAN to generate multi-track polyphonic music. MuseGAN is the integration and extension of generative and temporal models. The generative models are forward multi-track music generators based on WGAN-GP [71], including composer, jamming, and hybrid models. Each generative model can generate multi-track music bar by bar, following a specific scenario. Therefore, they proposed temporal models to generate multiple bars with temporal structure and coherency. Nevertheless, the music generated by MuseGAN is inconsistent in musical segments and harmony and contains fragmented notes [16]. The instrument set in MuseGAN is a fixed quintet composed of bass, drums, guitar, piano, and strings.
The instability of GAN-based models for music generation is mainly due to the use of convolutional layers in their architecture to extract features [16]. Indeed, the CNNs are not effective in capturing the temporal dependencies. Therefore, Guan et al. [72] proposed Dual Multi-branches GAN (DBM-GAN) to overcome the lack of consistency. DBM-GAN integrates the self-attention mechanism in its architecture to learn temporal dependencies and extract spatial features. Besides, the model's multi-branch architecture enables the arrangement of various instruments across time. Similarly, Valenti et al. [73] proposed the first music adversarial autoencoder called MusAE. MusAE uses adversarial regularization instead of the Kullback-Leibler (KL) divergence in VAEs. It can reconstruct new phrases and interpolate between latent representations to change specific musical attributes.
Liu and Yang [74] defined a new music generation task called lead sheet arrangement for multi-instrument music generation. The proposed model takes the lead sheet as input and generates accompaniment for the given melody with instruments such as guitar, bass, piano, strings, and drum. The model architecture includes a recurrent convolutional network with adversarial training composed of three stages: lead sheet generation to generate lead sheets of eight bars from scratch, feature extraction to extract harmonic features, and arrangement generation stage to generate five-track piano-rolls of one bar, respectively.
Angioloni et al. [75] introduced CONLON to generate polyphonic and multi-instrument music. Their work presented a Wasserstein autoencoder (WAE) model trained on lossless input representation, including the velocity and duration information from MIDI data in two separate channels. The proposed generative process includes exploring the WAE model's latent space based on interpolation to maintain consistency between transitions and variations within the generated musical piece.
One of the exciting tasks within the music generation field is the ability to transfer a musical piece from one domain to another. Notably, we would like to obtain a mapping function that learns and underlines the attributes and characteristics of musical structure. Accordingly, Chen et al. [76] proposed a GAN-based model with a dual learning method to combine music across multiple domains. They utilized the Wasserstein-based metric to approximate the distance between the target and existing domains and represent the model's learning progress. Furthermore, Brunner et al. [77] explored the ability of the CycleGAN-based model [78] for music genre transfer in the symbolic domain of music. The CycleGAN architecture includes two GANs arranged in a cyclic manner and trained together, in which one generator transfers data from domain A to B and the other from B to A. One discriminator is tied to each generator's output to distinguish fake from real outputs. Later, Brunner et al. [79] further analyzed the influence of spectral normalization and self-attention on GAN training using the model proposed in [77].
Tokui [80] proposed an extended GAN model to compose genre-conditioned music rhythm patterns. To do so, they added a second discriminator model with genre ambiguity loss to classify the genre of the generated musical piece. Particularly, the genre ambiguity loss is a cross-entropy loss [81]. In this manner, the generator is encouraged to generate new content in a new musical genre. Similarly, Lattner and Grachten [82] proposed a convolutional variant of the gated autoencoder (GAE) to generate music rhythm patterns. Their model encodes the rhythmic interactions of the kick drum against bass and snare patterns and captures the local relations between them.

D. TRANSFORMERS
The attention mechanism facilitates the extraction of spatial and temporal dependencies but depends on absolute positions in its inputs. Therefore, it struggles to track the dependencies in music, such as regularities, event orderings, and periodicity. To alleviate this issue, Shaw et al. [83] proposed the relative attention mechanism, which focuses on relational features by approximating the distance between two tokens. Huang et al. [84] proposed Music Transformer, which employs the relative attention mechanism to generate polyphonic music. The model can learn the long-term musical structure to develop long melodies or continue a given motif. Similarly, Payne [85] created MuseNet, based on GPT-2, which can generate a long musical piece with ten different instruments in various styles. Nevertheless, Music Transformer and MuseNet tend to generate random notes and harmonies after a few bars [16].
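To make the idea of relative attention concrete, the following is a minimal single-head sketch in pure Python. It assumes the key-side formulation, in which a learned embedding indexed by the clipped distance j - i is added to the attention logits. The value-side relative term and the memory-efficient implementation used in Music Transformer are omitted, and all names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def relative_attention(q, k, v, rel_emb, max_dist):
    """Single-head self-attention with relative position embeddings.

    q, k, v: lists of n vectors (q and k share dimension d).
    rel_emb: dict mapping a clipped distance in [-max_dist, max_dist]
             to a d-dimensional embedding (assumed to be learned).
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        logits = []
        for j in range(n):
            # Clip the relative distance so that far-apart tokens share an embedding.
            rel = max(-max_dist, min(max_dist, j - i))
            logits.append((dot(q[i], k[j]) + dot(q[i], rel_emb[rel])) / math.sqrt(d))
        weights = softmax(logits)
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(len(v[0]))])
    return out


# Toy usage: the embedding for distance 0 dominates, so each token
# attends almost exclusively to itself regardless of absolute position.
queries = [[1.0]] * 3
keys = [[0.0]] * 3
values = [[10.0], [20.0], [30.0]]
embeddings = {-1: [0.0], 0: [50.0], 1: [0.0]}
attended = relative_attention(queries, keys, values, embeddings, max_dist=1)
```

Because the logits depend on j - i rather than on i and j separately, a pattern learned at one position in a piece applies equally at any other position, which suits periodic and repeating musical material.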
Many attempts have been made to overcome the issue of randomness and generate pieces with a high musical structure. Zhang [86] proposed a novel adversarial transformer, which combines generative adversarial learning with the attention mechanism. The adversarial objectives help the transformer concentrate on temporal dependencies within the musical structure. Compared to Music Transformer and MuseNet, their model shows an improvement in musical quality for both monophonic and polyphonic generation. Similarly, Jiang et al. [87] proposed TransformerVAE, a combination of VAEs and transformers. Their approach benefits from the hierarchical structure of MusicVAE and the attention mechanism in transformer models for representation learning. Huang and Yang [88] expand the learning ability of the generative models by introducing a new approach for discrete representation of music. They proposed revamped MIDI-derived events (REMI), a representation with an explicit metrical grid that exposes the hierarchical structure of music using events such as Chord, Bar, and Position. Their study experimented with transformer-based models, where they examined various musical features to capture higher-level characteristics of music.
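As a rough illustration of an event representation with an explicit metrical grid, the sketch below converts a list of notes into Bar, Position, Note-On, and Note-Duration tokens. It is a simplified, REMI-like encoding rather than the exact vocabulary of [88]: the token spellings, the 16-positions-per-bar grid, and the omission of Chord, Tempo, and Velocity events are all assumptions made for brevity.

```python
def to_remi_like(notes, positions_per_bar=16):
    """Convert notes to a REMI-like token sequence.

    notes: iterable of (start_step, pitch, duration_steps), where steps lie
           on a fixed grid (assumed here: 16 steps per bar).
    """
    tokens = []
    current_bar = -1
    for start, pitch, duration in sorted(notes):
        bar, position = divmod(start, positions_per_bar)
        # Emit one Bar token per bar line crossed, including empty bars.
        while current_bar < bar:
            tokens.append("Bar")
            current_bar += 1
        tokens.append(f"Position_{position + 1}/{positions_per_bar}")
        tokens.append(f"Note-On_{pitch}")
        tokens.append(f"Note-Duration_{duration}")
    return tokens


# Three notes: two in the first bar, one at the start of the second bar.
example_notes = [(0, 60, 4), (4, 64, 4), (16, 67, 8)]
example_tokens = to_remi_like(example_notes)
```

The Bar and Position tokens give the model direct access to where each note falls within the metre, instead of requiring it to infer this by accumulating time-shift events.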
Peracha [89] concentrated on the sequential modeling of polyphonic music instead of the network architecture. Their study experimented with a multi-layer transformer encoder and a GRU-based model named TonicNet using the JSB chorales dataset. Their results show an improvement in both models' performance by introducing new salient musical features in the form of chords and intra-voice token repetition. Dai et al. [40] presented Music Frameworks to generate customizable full-length melodies. Music Frameworks uses a hierarchical architecture to represent high-level musical features such as repeated sections and phrases, and low-level features such as rhythm structure and melodic contour. Music Frameworks can generate long-term music structures conditioned on the basic melody and rhythm structures. Wu and Yang [90] proposed MuseMorphose to generate full songs and perform style transfer. Their model demonstrates the ability to generate long sequences with fine-grained controllability and conditioning over musical attributes such as rhythmic intensity and polyphony.
Zhang et al. [91] proposed a transformer-based model that learns and captures the harmonic attributes of the musical structure, such as form and texture. Rütte et al. [92] proposed FIGARO, a novel self-supervised task called description-to-sequence, that can generate music based on the defined descriptions with global and fine-grained control. Their model includes two distinct description functions: learned and expert modules. The learned module extracts the salient musical features using the constructed low-fidelity, human-interpretable sequences by the expert module. For music generation, they utilized a transformer-based model that receives the extracted features by learned and expert modules. For similar approaches to address the structure and control in music using Transformer, we refer to [93], [94], [95], [96], [97], [98], and [99].
Zou et al. [100] introduced MELONS, a full-song melody generation framework using a graph representation of music and transformer models. The MELONS generation process includes structure generation and conditional melody generation. Their work concentrates on the generation of pop music by constructing eight types of bar-level relations to represent the musical structure. Furthermore, they used a directed graph to describe the melody structure of a song using bar-level relations. The MELONS architecture includes two transformer-based generation models: structure generation and melody generation. The structure generation model generates the structure graph as a sequence of relations. The melody generation model uses an event-based music representation to compose conditional or unconditional structured melodies. The unconditional generator is trained on the original training data, while the training data for the conditional generator is organized according to the specified condition.
Liu et al. [101] introduced a novel approach to composing symphony music. Their study presented Multi-track Multi-instrument Repeatable (MMR) and Music Byte Pair Encoding (BPE) methods to model and represent symphony music. MMR models symphony music by separating and capturing repeated instruments within a single track. On the other hand, Music BPE is a BPE-based algorithm to tokenize and preprocess the musical examples by considering the concurrence of the notes. Their model inherits transformer-based model architecture with 3-D positional embedding that compresses the spatial and structural details of the input sequences. Furthermore, they gathered and processed a large-scale corpus of symphonic music, which is made publicly available. Furthermore, Shih et al. [102] introduced a theme-based method to condition the generative model. Their model uses contrastive learning [103] and density-based [104] methods to cluster similar fragments of a musical piece to form a latent space. In this manner, they formed an augmentation strategy to generate various variations of musical examples for each cluster to train the transformer-based model. Besides, they utilized the same clustering approach to generate new test examples and evaluate the model. Hawthorne et al. [105] proposed TransformerNADE, a transformer-based model for expressive piano performances. To generate meaningful piano performances, they proposed a new representation using NADE [106]. Their model architecture is inspired by RNN-NADE [107].
Training deep learning models often requires a large amount of data. Researchers have used methods such as transfer learning to address data scarcity [108], [109]. For music generation, Donahue et al. [110] demonstrated the benefit of transfer learning for improving transformer-based model performance. They also employed data augmentation methods in their study. Similarly, Hung et al. [111] examined the outcome of two transfer learning methods for jazz music generation. Their work studied model fine-tuning and multitask learning methods for unconditioned melody generation.

E. REINFORCEMENT LEARNING
Although automatic music generation can inspire human creation, it is limited to certain musical examples such as Bach. Interactive music generation can help enhance the generated samples by incorporating human objectives and preferences in the music creation process. Jaques et al. [112] proposed RL-Tuner, a reinforcement learning model to generate music using user-defined constraints. The RL-Tuner architecture includes two deep Q networks and two RNN models. One RNN model, called NoteRNN, is trained on the dataset of melodies. The second RNN model is a copy of NoteRNN, called RewardRNN. The Q network's goal is to learn to select the following note (action) based on the melody generated so far (state). A second Q network, called the Target Q network, runs in parallel to the Q network. The Target Q network is trained to estimate the accumulated rewards (gain) achieved by NoteRNN. The Q network's reward combines the RewardRNN output and adherence to music theory constraints. Kumar and Ravindran [113] used an LSTM RNN with RL to compose melody and basic chords. They processed the polyphonic pieces by dividing them into a stream of monophonic examples. They trained the LSTM model on these examples and created an RL agent to find a suitable combination of songs.
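The way RL-Tuner's reward combines a learned prior with music theory constraints can be sketched as follows. This is a minimal illustration, assuming the reward is the log-probability of the chosen note under the pretrained model plus a rule-based term scaled by a constant c. The single repetition rule, the probability table standing in for RewardRNN, and the value of c are hypothetical placeholders.

```python
import math


def theory_reward(melody, action):
    """Toy music-theory rule (an assumption for illustration):
    penalise playing the same note three times in a row."""
    if len(melody) >= 2 and melody[-1] == action and melody[-2] == action:
        return -1.0
    return 0.0


def combined_reward(prior_probs, melody, action, c=0.5):
    """Log-probability of `action` under a pretrained prior (standing in for
    RewardRNN) plus the rule-based reward, weighted by 1 / c."""
    return math.log(prior_probs[action]) + theory_reward(melody, action) / c


# Hypothetical next-note distribution given the melody so far (MIDI pitches).
prior = {60: 0.5, 62: 0.5}
melody_so_far = [60, 60]

reward_repeat = combined_reward(prior, melody_so_far, 60)  # rule is violated
reward_move = combined_reward(prior, melody_so_far, 62)    # rule is satisfied
```

A smaller c gives the hand-written rules more weight relative to what the model learned from data, which is the lever that trades stylistic fidelity against rule adherence.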
Later, Jiang et al. [114] proposed RL-Duet for online accompaniment using reinforcement learning. It can generate melodic and harmonic music responses to the human part. RL-Duet uses actor-critic with a generalized advantage estimator (GAE) for the reinforcement learning architecture. They introduce a reward function that considers the fittingness of the inter-part and intra-part of the generated notes in horizontal and vertical perspectives. The reward model is learned from monophonic and polyphonic examples instead of hand-crafted composition rules and criteria utilized in RL-Tuner.
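For readers unfamiliar with the generalized advantage estimator, the sketch below computes it for a single trajectory using the standard recursion A_t = delta_t + gamma * lambda * A_{t+1}, with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t). The discount gamma, the smoothing factor lam, and the toy rewards are assumptions; they are not the settings used in RL-Duet.

```python
def generalized_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    rewards: list of T rewards r_0 .. r_{T-1}.
    values:  list of T + 1 critic estimates V(s_0) .. V(s_T); the last entry
             is the bootstrap value (0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory: two generated notes, each receiving a reward of 1.
adv = generalized_advantages([1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
```

Setting lam to 0 reduces the estimate to the one-step TD error, while lam equal to 1 recovers the full discounted return minus the value baseline.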
Subsequently, Liu et al. [115] proposed RE-RLTuner, an extension to RL-Tuner that uses the Latent Dirichlet Allocation (LDA) as a musical feature extractor. The LDA extractor represents the musical structure characteristics by clustering music at different scales (musical segments) and extracting the musical features into three aspects called topics. The topic models maintain different music structure information. The architecture of the model is similar to RL-Tuner. The network's reward combines the reward model (RewardRNN) and topic models extracted by the LDA extractor.

F. OTHERS
This study mainly focuses on deep learning methods for music generation. However, researchers have also investigated other approaches alongside deep learning methods to tackle music generation tasks. For instance, Moulieras and Pachet [116] introduced a new approach for melody generation using the maximum entropy statistical model [117]. In this approach, the melodies are considered a network of interacting notes. The model assigns a probability distribution to this network and learns the statistical dependencies of the pitch sequences. Later, Hadjeres et al. [118] and Moulieras and Pachet [116] extended the model to handle polyphonic music with multiple voices and generate expressive music, respectively.
Zhao and Xia [119] proposed a hybrid model that can generate piano accompaniment based on a lead sheet. Their model includes phrase selection and neural transfer models to generate content. Phrase selection is a rule-based model that carries out the phrase montages from the database. The neural transfer model receives the phrase montages and manipulates them to match the corresponding style of the given lead sheet. Furthermore, the model's output can be conditioned on rhythm density and voice number.

VI. EVALUATION
Researchers use diverse methods to evaluate deep learning models for music generation. These methods mainly depend on the model's output and can be subjective or objective. Subjective evaluation is often appropriate in music generation tasks, as they involve creativity. However, a thorough subjective evaluation requires an appropriate experimental design and resources to produce reliable, valid, and replicable results [120]. Objective evaluation methods, in turn, facilitate the evaluation of the generative models by providing comparable and relevant results. Indeed, by utilizing objective methods, it is easier to control the variables entangled in the test and reduce bias. The final evaluation results are obtained from both subjective and objective approaches for a better model assessment and a reliable scientific benchmark. This section covers the current evaluation methods for music generation tasks. We refer to [16] and [121] for a complete review of music evaluation methods.

A. SUBJECTIVE EVALUATION
The subjective methods evaluate the model's generated content in terms of creativity and novelty. It is essential to evaluate the music from a subjective stance, as a musical piece consists of perceptual qualities that numerical metrics cannot measure. Among the available listening tests [16], the Turing test is a standard method for subjective evaluation [122]. This test was introduced by Alan Turing [123] to answer the question: "Can a machine think?". In the case of music generation tasks, the questions often include whether the generated content is aesthetically pleasing and whether it is composed by a human. During the Turing test, the human listener tries to differentiate the machine-generated piece from the human-created piece. Two examples of Turing-test models for music generation systems are the musical directive toy test (MDtT) and the musical output toy test (MOtT) [124]. The MDtT depends on musical directives such as genre, style, or melodic or rhythmic fragments, while the MOtT is free from musical directives. Both of these models depend only on the human listener's judgments.
Overall, to obtain a valid listening test, [13] specifies some requirements:
• A sufficient number of listening subjects with diverse musical knowledge to obtain meaningful statistical results;
• The subjects are evenly distributed based on their musical knowledge, including the amateurs with no or basic music knowledge and experts in the field;
• Experiments are performed in a controlled environment under specific acoustic characteristics and equipment;
• Each subject receives the exact instructions and stimuli.
Note that each of these requirements confines a study's degree of accuracy and repeatability. Furthermore, it is possible to utilize online platforms to conduct listening tests. For example, crowdMOS [13] is a platform for subjective listening tests using Amazon Mechanical Turk. CrowdMOS contains a set of freely distributable and open-source tools that delivers quality results by detecting and discarding inaccurate or malicious submissions. Défossez et al. [125] used crowdMOS in their study to obtain Mean Opinion Score for the ground truth samples.
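A Mean Opinion Score is simply the average of listener ratings, usually reported together with a confidence interval. The sketch below assumes a 1 to 5 rating scale and a normal-approximation 95% interval; it does not reproduce crowdMOS's own screening or statistical procedure, and the ratings are made-up placeholders.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ratings: list of numeric listener scores (assumed scale: 1 to 5).
    z:       normal quantile; 1.96 corresponds to roughly 95% coverage.
    """
    mean = statistics.mean(ratings)
    if len(ratings) < 2:
        return mean, 0.0
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical ratings from five listeners for one generated sample.
mos, ci = mean_opinion_score([1, 2, 3, 4, 5])
```

With few listeners the interval is wide, which is one quantitative reason the requirements above ask for a sufficient number of subjects.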
Another method of subjective assessment of music is visual analysis conducted by a human expert. The methods in visual analysis utilize visual representations like score, waveform, and spectrogram instead of the auditory form of music. For instance, the authors of MuseGAN [70] performed score analysis on different aspects of generated melodies, such as stability and smoothness analysis of the chord and rhythm patterns.
Engel et al. [126] performed spectrogram analysis by employing the Rainbowgram to compare the reconstructed notes of different instruments with the original audio.

B. OBJECTIVE EVALUATION
The objective evaluation methods measure the model's performance and generated content. We can measure the model's performance using numerical metrics such as loss and accuracy, while for the evaluation of the generated content, we use statistical descriptors derived from musical concepts. In the following, we explain each of these measurement methods.
Numerical metrics do not contain music domain knowledge and only represent the model's ability to process the data. It is common to use numerical metrics like loss and perplexity during the training process. They mainly consider the statistical distribution of the generated samples or classification accuracy. For instance, loss indicates the difference between inputs and outputs from a mathematical perspective, while perplexity evaluates the model's generalization capability [127]. Additionally, Jeong et al. [128] used mean squared error (MSE) and correlation metrics to assess the model's performance ability using the generated performance and human performance characteristics. Similarly, Gillick et al. [129] proposed metrics such as Timing mean absolute error (MAE), Timing MSE, Velocity and Timing Kullback-Leibler (KL) divergence to measure the model's performance.
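The numerical metrics named above have standard definitions, sketched here in plain Python. The inputs are illustrative placeholders (for example, predicted versus human note timings), and the functions are generic textbook formulas rather than the exact evaluation code of [128] or [129].

```python
import math


def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists.
    Terms with p_i == 0 contribute nothing; q_i is floored at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def perplexity(token_probs):
    """Exponential of the average negative log-probability that the model
    assigned to each ground-truth token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))


# Placeholder example: predicted vs. reference timing offsets.
timing_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
timing_mse = mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A model that assigns probability 0.25 to every correct token has a perplexity of 4, meaning it is as uncertain as a uniform choice among four options.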
Besides the numerical metrics, we can evaluate the generated music by utilizing methods such as log-likelihood and density estimation [130], [70], [131], [132]. For instance, Huang et al. [131] proposed a frame-wise evaluation of the generated content by calculating the negative log-likelihood between the model's output and the ground truth. However, based on the observations of Theis et al. [133], the probabilistic measure is not always consistent, as generative models can produce irrelevant samples and still achieve a perfect probabilistic measurement. Other techniques such as chord classification [134], style classification [77], style likelihood [77], and reconstruction accuracy [53] are examples of metrics for specific tasks.
To improve the interpretability of the generative system's outcome, researchers proposed musical metrics by integrating the musical domain knowledge. These metrics provide a detailed evaluation concerning specific music characteristics. Ji et al. [16] categorize these metrics into pitch-related, rhythm-related, chord/harmony-related, and style transfer and provide a comprehensive overview of these methods. As an example, Sabathé et al. [135] proposed a novel evaluation method based on the Mahalanobis distance [136], using high-level symbolic music descriptors to describe the musical samples. Yang and Lerch [121] introduced a set of musical metrics comprising absolute and relative measurements. They represent a practical and reproducible approach to evaluating the model's performance and generated content. Their evaluation framework has been used by [111], [137], [138], and [114]. Furthermore, there are evaluation methods to assess specific musical aspects using other theories or algorithms. Variable Markov Oracle (VMO) [139] is a method to evaluate the repetitive patterns in a musical piece. The authors of [10] introduced a technique to assess the originality and creativity of a piece and avoid plagiarism. Minimum Distance Classifier (MDC) [140] is a method to determine the style similarity of the generated content with the expected style. Lattner et al. [141] utilized the Humdrum toolkit [142] to evaluate the tonality of the generated musical piece. Wu and Yang [93] used the scape plot [143] to capture, visualize, and compare the repetitive structure of the generated piece with the original examples.
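As an illustration of the pitch-related category of musical metrics, the sketch below computes three simple descriptors from a list of MIDI pitches: the number of distinct pitches, the pitch range in semitones, and a normalised pitch class histogram. These are common descriptors of this kind; the function names and the example melody are assumptions, and this is not the reference implementation of any cited framework.

```python
from collections import Counter


def pitch_count(pitches):
    """Number of distinct pitches used in the sample."""
    return len(set(pitches))


def pitch_range(pitches):
    """Distance in semitones between the highest and lowest pitch."""
    return max(pitches) - min(pitches)


def pitch_class_histogram(pitches):
    """Relative frequency of each of the 12 pitch classes (C = 0, ..., B = 11)."""
    counts = Counter(p % 12 for p in pitches)
    total = len(pitches)
    return [counts.get(pc, 0) / total for pc in range(12)]


# Hypothetical generated melody as MIDI note numbers (60 = middle C).
melody = [60, 62, 64, 60, 72]
histogram = pitch_class_histogram(melody)
```

Computing such descriptors on both the training set and the generated set, and then comparing their distributions, gives an interpretable view of where the model's output departs from the data.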

VII. CHALLENGES
Compared to traditional approaches, deep learning methods have shown great capabilities in the music generation task. However, there are still many difficulties and challenges in using deep learning to generate music. Indeed, the multi-modal nature of music makes the field of music generation with deep learning even more challenging. On the other hand, the black-box nature of deep learning models makes it hard to diagnose their learning process. Here, we address some challenges deep learning models face in music generation tasks.

A. STRUCTURE
A musical piece evolves over time through the development of musical ideas. The musical structure refers to the arrangement of these musical ideas as a whole. Particularly, the musical structure consists of local and global structures. Global structure relates to the long patterns, extended multiple bars like AABA. On the other hand, local structure relates to each musical idea repeated or developed to create themes and variations. Although much work has been done to model and generate music, making a complete musical piece is still challenging. In most cases, the generated content by deep learning models gradually becomes tedious as there is no clear sense of direction, and it may end unexpectedly.
Researchers have investigated various methods for better structure representation. Models such as [100] used graph representation of melody with eight types of bar-level relations such as repetition, transposition, rhythmic sequence, and harmonious cadence. Other models, such as [53], [54], and [58], utilized hierarchical architectures to address this issue. The template-based method proposed by Zhou et al. [42] has shown the ability to generate a specific overall structure. The Harmony-Aware Hierarchical model proposed by Zhang et al. [91] improved on this further, showing the ability to imitate the outline structure of real music. Nonetheless, the generated content by these models still lacks musical details and requires refinement to present an actual musical piece.

B. REPRESENTATION
The representation in nearly all of the current deep learning models involves the pitch and duration of notes, and primarily triads for chords [16]. This simplification restricts the musical understanding of the deep learning models to generate quality musical content. Furthermore, the current methods use relatively simple mechanisms to model instrument characteristics. For instance, it is challenging to model the piano's sustain pedal, which influences the duration of all notes until the pedal is released [105]. Indeed, it is necessary to utilize a better form of representation that can convey musical intricacies, such as the performance of instruments, harmonic content, and ornaments.
Some efforts have been made to ameliorate this issue. Revamped MIDI-derived events (REMI) [88] is an enhanced representation of music that denotes an explicit metrical grid to model music. Specifically, REMI has been shown to be effective for pop piano music. Wu and Yang [93] and Chen et al. [94] expand REMI further for other scenarios such as guitar tablature and Jazz music. Compound Words [98] is another technique that utilizes REMI to generate musical tokens and group them into super tokens. Nevertheless, these methods are primarily tailored and applied to a specific genre like pop music. Therefore, further investigation is required to determine their effectiveness for other scenarios.

C. CREATIVITY
Another issue that comes to the scene with the deep learning music generation is the shortcoming of creative musical ideas. The deep learning models are data-driven, and the learning outcome of the models relies heavily on the given training examples. Even with a good learning outcome, the generations can be marked as inaccurate, inconsistent, or monotonous when studied by human listeners.
We can define creativity as an innovative combination of two or more variations in a meaningful manner. Therefore, a generative model requires first understanding the underlying dynamics of musical compositions and second learning how to compile that knowledge into a new meaningful composition. Models like MusicVAE [53] can generate variations by interpolating motifs and sampling from latent space. However, we can encounter a lack of quality in harmonic content and understanding of rhythmic patterns by analyzing the generated content. In other words, the current models can mainly exploit the learning outcome rather than explore and extrapolate to create new variations.
Models such as [77] and [79] attempted to create new musical styles by compelling the model to diverge from the existing styles. Other models, such as [90] and [40], utilized conditioning techniques as a strategy to address creativity. However, the lack of evaluation methods to measure the creativity aspect of a musical piece makes creativity an arduous and open challenge.

D. STYLE
Currently, there are some deep learning models which can generate music with specific styles, like DeepJ [9] and DeepBach [10]. However, these models are limited to the style of classical music extracted from the training examples. Indeed, the main challenge lies in the ability of the model to extract the musical features according to the musical style. Other models such as [77] and [79] can perform style transfer from Jazz to classical music genres. However, the generated content lacks musical details, although it sounds plausible. In fact, different musical styles require distinct definitions, making it challenging to obtain an adaptable framework for diverse musical styles. To achieve this, we need a better representation of music. As we have discussed previously, there are challenges tied to music representation, limiting the generative models' ability.

E. INTERACTIVITY
The algorithmic composition systems are desired to achieve the ability to create musical pieces inspired by human compositions rather than pure imitation. However, the black-box nature of neural networks makes it demanding to interact with and control the output of the deep learning models for human users.
It is necessary to differentiate control from interactivity in generative models. To elaborate, control refers to the possibility of defining a set of parameters to achieve an objective and generate a specific context. While models such as Markov Chains allow the definition of constraints during the generation process [6], [7], deep learning models do not possess such possibilities. Therefore, some techniques are introduced to alleviate this issue, such as the unary constraints [146], positional constraints [46], and conditioning [54]. Although these methods provide some degree of control, they are still insufficient to control the model generation in an arbitrary direction.
On the other hand, interactivity refers to the model's ability to be utilized in a fine-grained manner. Music creation is a concurrent, iterative process. Artists adopt various strategies to develop a musical idea and create a musical piece. An example of musical strategies in music generation is incremental variable instantiation, which has been used by [11] and [10]. Comparably, models such as [64] provide interactive and controllable generation through the captured latent space. Indeed, interactivity allows artists to perform local modifications and regenerate specific musical parts incrementally. This functionality is essential for the music generation systems to be practical and assist artists in composing music.

F. EVALUATION
Often, it is the case that a musical piece performs well in the objective evaluation and poorly in the subjective evaluation. On the other hand, the subjective assessment is only conducted on the generated content, not during the training process. Moreover, the current deep learning models lack automatic content evaluation, and there is no direct objective method to evaluate attributes such as creativity. Furthermore, a good subjective evaluation lacks a clear explanation of quantitative metrics. Auditory fatigue must also be considered in the case of subjective evaluation, which can cause bias in the listeners if they listen to similar samples for an extended period. Consequently, it is demanding to define an evaluation metric for performance generations similar to human experts to obtain a meaningful assessment based on musical attributes. Indeed, the challenge of music evaluation portrays a complex task that is hard to automate using computational models. Therefore, the development of a universal evaluation system would facilitate maintaining an accurate benchmark of the model's performance subjectively and objectively.
Table 1 summarises the characteristics of the models overviewed in this work. Music production is an iterative process where a musician or composer as an artist creates and develops musical ideas. Indeed, it is a complex task that involves multiple levels of processing. Although these models can generate novel, innovative and pleasant music, they cannot handle various musical objectives. Therefore, they fail to model the process of music composition.

VIII. FUTURE DIRECTION
Mainly, music production is a complex and hierarchical process divided into five main stages: composition, arrangement, sound design, mixing, and mastering. The composition stage includes creating and developing new melodic, harmonic, and rhythmic ideas. The arrangement is a stage of organizing the created musical ideas in the form of a timeline to make a complete piece. The sound design stage consists of sampling, synthesizing, and manipulating sounds. The mixing stage involves instrument arrangement, combining, and balancing the audio layers. Finally, the mastering stage includes the post-production process to balance all the audio elements and ensure the final mix is ready. Note that a musician may step into these stages concurrently by following a particular strategy or approach to create a complete song. Indeed, the creative process in music production involves a complex relationship between each of the music production stages [147].
A cooperative system like Multi-agent systems (MAS) [148] can be a suitable approach for music generation. MAS are distributed artificial intelligence systems consisting of multiple autonomous agents that work together and make independent decisions. The MAS architecture allows the utilization of various computational intelligence methods like deep learning, which is advantageous for modeling music production and musical creativity. The action abilities and perception of MAS agents enable them to cooperate and coordinate with each other to satisfy the objectives of the task [148].
The main challenge of using deep generative models is performing the creative and technological processes while conserving the balance between these two processes. These models involve a series of processing decisions that can significantly influence how artists think about music when they collaborate with these models. Indeed, the shortage of interpretability makes it hard to understand the decision-making process behind the generated content. The interpretability of the system allows us to locate and correct causes of undesired results. However, the lack of interpretability influences the extensibility of AI systems. Extensibility is important to interact and extend the behaviors and features of a system. Notably, for the algorithmic composition of music, artists often like to create music in a specific style to comply with their desires and musical ideas. Indeed, the extensibility allows human users to be creative and experiment with the system differently. Models such as PianoTreeVAE [57], and MeloForm [51] alleviate the interpretability issue and can provide a better framework. Nevertheless, this is still an open issue for deep learning models.
By identifying effective model combinations, we can formulate the strategic part of the model exploration, exploitation, and selection processes. Through MAS, we can combine the flexibility of smaller models with the benefit of global structure awareness of end-to-end models in a modular manner. Indeed, this approach represents a more dynamic behavior as it divides the main task into sub-tasks and distributes them among the multiple agents. Consequently, MAS can further improve the extensibility and interpretability of the system. The system of Hutchings and McCormack [149] is an example of MAS using deep learning models. It consists of harmonic and melodic agents working cooperatively. The harmonic agent is an RNN-based model, while the melodic agent is a rule-based system. Additionally, Tatar and Pasquier [150] survey the typology and state of the art of agent-based learning in music generation tasks.
Moreover, some of the deep learning models provide some degree of control and interactivity. However, they still lack human participation during content generation. As presented in Table 1, these models are primarily standalone systems. Based on a study conducted by Huang et al. [14], artists mainly achieve their musical goals by leveraging and incorporating a wide range of generative models in a modular way. Indeed, it is challenging to control end-to-end deep learning models to produce high-quality songs in one shot. Artists would like to retain a certain amount of control and freedom to navigate deep learning models to generate samples creatively.
Furthermore, artists may want to generate musical content that is strictly coherent with their style. This requires the system to be adaptable and flexible. In Section V, we studied reinforcement learning models for music generation. These models can learn and adapt to changes by observing modifications in the environment. Based on these observations, the agent takes an action and receives a reward that reflects the action's suitability. The reward function can combine objective and subjective evaluation methods to preserve a balance between performance and creativity. Therefore, the combination of RL and MAS could provide a more flexible workflow and building blocks through a dynamic learning process. For instance, the agents can cooperate and share their progress using the Blackboard [151] communication approach to fulfill the task (music generation).
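As a rough illustration of Blackboard-style cooperation, the sketch below lets two agents share partial results through a common store. All names (`Blackboard`, `ChordAgent`, `BassAgent`, `run`) and the chord and pitch values are hypothetical; the sketch omits learning entirely and only shows the communication pattern.

```python
class Blackboard:
    """Shared workspace that agents read from and post partial results to."""

    def __init__(self):
        self.entries = {}

    def post(self, key, value):
        self.entries[key] = value

    def read(self, key, default=None):
        return self.entries.get(key, default)


class ChordAgent:
    """Contributes a chord progression if none has been posted yet."""

    def act(self, board):
        if board.read("chords") is None:
            board.post("chords", ["C", "G", "Am", "F"])
            return True
        return False


class BassAgent:
    """Waits for chords on the board, then posts one bass root per chord."""

    ROOTS = {"C": 36, "G": 43, "Am": 45, "F": 41}  # MIDI pitches, assumed voicing

    def act(self, board):
        chords = board.read("chords")
        if chords is not None and board.read("bass") is None:
            board.post("bass", [self.ROOTS[chord] for chord in chords])
            return True
        return False


def run(board, agents, max_rounds=10):
    """Let every agent act each round until no agent can contribute anything new."""
    for _ in range(max_rounds):
        progressed = [agent.act(board) for agent in agents]
        if not any(progressed):
            break
    return board


# The order of the agents does not matter: the bass agent waits for the chords.
board = run(Blackboard(), [BassAgent(), ChordAgent()])
```

The agents never call each other directly; they only depend on what is on the board, so new agents (for example, a drum or melody agent) can be added without modifying the existing ones.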
Besides, users can work on new ideas efficiently by benefiting from past experiences, using the system for inspiration, and broadening the creative process to various extents. Additionally, the flexibility of the RL framework regarding the models' learnability lets artists adapt the system to their needs. Therefore, we can enhance human-AI interaction in the context of music generation.

IX. CONCLUSION
In this paper, we have investigated and studied the generative models in symbolic music generation using deep learning techniques. We have underlined the current state-of-the-art methods and provided an overview of their architectures and strategies to generate musical content. We have outlined the main criteria to model, generate and evaluate musical content.
We have discussed the current challenges in music generation and emphasized the essential aspects of these challenges in deep generative models. Notably, we have concentrated on the interactivity and adaptability of these models and proposed a potential research direction to alleviate these challenges and strengthen AI and human interaction.
Almost all studies of deep learning models concentrate on developing algorithms and specific methods in an end-to-end manner. Indeed, these models are mainly autonomous music-making systems. This type of system is intended more for purposes such as commercial use or entertainment. Artists, in contrast, are more interested in assisted composition systems, where the system is intended, for instance, to provide a glimpse into possible musical variations and inspire the artists to develop new musical ideas. Besides, it is essential to note that music creation is a concurrent process involving many stages of pre-processing and post-processing of musical ideas and materials.
Multi-agent systems have shown great potential in music generation tasks, particularly in modeling the music creation process. They can provide a framework in which multiple approaches are combined to fulfill the desired goal, yielding a system capable of processing various tasks and inputs. Indeed, their modular and hybrid characteristics can help to alleviate the shortcomings and challenges of music generation tasks. For example, each instrument has specific nuances and characteristics that distinguish its representation of music and musical style. By utilizing a MAS architecture, we can simplify the representation of music by concentrating on one instrument at a time, with each agent assigned to a specific instrument. This is analogous to how musicians work in a band.
In RL algorithms, the reward function plays an important role: it assesses how suitable the agent's action is for the current state of the environment. Therefore, we can formulate the model's evaluation through the RL reward function by combining objective and subjective techniques. The objective evaluation can involve one or multiple agents assessing the consistency of a sample with the musical goal, using a combination of the methods provided in Section VI. The subjective evaluation, on the other hand, can be performed by the human listener (agent) who interacts with the musical system. For instance, we can formulate this with a thumbs-up or thumbs-down approach, where the agent receives a reward accordingly. Consequently, the agents within the system incorporate the provided feedback to adapt and adjust their behavior, strategy, or musical goals.
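One possible way to combine the two evaluation signals is a weighted sum, sketched below. The objective proxy (the fraction of notes that fall in a C major scale), the mapping of thumbs-up and thumbs-down to 1 and 0, and the `weight` parameter are illustrative assumptions, not metrics taken from Section VI.

```python
# Pitch classes of the C major scale, used here as a hypothetical musical goal.
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}


def objective_score(notes, scale=C_MAJOR):
    """Fraction of MIDI notes whose pitch class belongs to the target scale."""
    if not notes:
        return 0.0
    in_scale = sum(1 for note in notes if note % 12 in scale)
    return in_scale / len(notes)


def subjective_score(thumbs_up):
    """Map the listener's binary feedback to a score: 1.0 for up, 0.0 for down."""
    return 1.0 if thumbs_up else 0.0


def reward(notes, thumbs_up, weight=0.5):
    """Weighted sum of the objective and subjective scores, both in [0, 1]."""
    objective = objective_score(notes)
    subjective = subjective_score(thumbs_up)
    return weight * objective + (1.0 - weight) * subjective


sample = [60, 62, 64, 66]  # C, D, E, F#: three of four notes are in C major
liked = reward(sample, thumbs_up=True)
disliked = reward(sample, thumbs_up=False)
```

Setting `weight` closer to 1 favors the objective measure, while a value closer to 0 lets the listener's feedback dominate, which is one simple way to tune the balance between performance and creativity.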