Monophonic Music Generation With a Given Emotion Using Conditional Variational Autoencoder

The rapid increase in the importance of human-machine interaction and the accelerating pace of life pose various challenges for the creators of digital environments. Continuous improvement of human-machine interaction requires precise modeling of the physical and emotional state of people. By implementing emotional intelligence in machines, robots are expected not only to recognize and track emotions when interacting with humans, but also to respond and behave appropriately. The machine should match its reaction to the mood of the user as precisely as possible. Music generation with a given emotion can be a good start to fulfilling such a requirement. This article presents the process of building a system generating music content of a specified emotion. As the emotion labels, four basic emotions: happy, angry, sad, relaxed, corresponding to the four quarters of Russell’s model, were used. Conditional variational autoencoder using a recurrent neural network for sequence processing was used as a generative model. The obtained results in the form of the generated music examples with a specific emotion are convincing in their structure and sound. The generated examples were evaluated with two methods, in the first using metrics for comparison with the training set and in the second using expert annotation.


I. INTRODUCTION
More and more devices and machines enter our everyday life. Nowadays, human-machine interaction can be encountered not only in industry. It started more than half a century ago with industrial robots [1]. Gradually, they were joined by increasingly complex and multifunctional information and vending machines, and today this interaction is almost everywhere; for example, a great number of people use e-assistants like Amazon Alexa (https://developer.amazon.com/en-IN/alexa) and Google Assistant (https://assistant.google.com/). The importance of human-machine interaction on the one hand and customer expectations on the other set quality requirements for the new generation of machines. The continuously improving digital environment reflects the gradually progressing state of people. Machines and robots with implemented emotional intelligence are expected not only to recognize and track emotions when interacting with humans but also to respond and behave appropriately to the actual human mood. One way to fulfill this requirement is through appropriate creative behavior such as music generation with a given emotion. The generation of content with a specific emotion [2] by intelligent machines is the next stage in the development of systems that deal with the emotions expressed by humans, their recognition and tracking. Expressing emotions by robots interacting with humans is quite an important issue if this interaction is to be successful. Generating music with a specific emotion is also part of this form of communication, as music transmits content in which emotions play a dominant role.
The associate editor coordinating the review of this manuscript and approving it for publication was Luca Turchet.
Deep learning techniques for music generation are a relatively new phenomenon [3] entering the area of music composition, which is typically an area of human creativity and artistry. There are more and more music generating systems [4]-[6] that try to imitate human creativity, even compete with it, and learn from compositions created over the past centuries of human development. Music is an expression of human thoughts in the form of an organization of sounds in time. It can be compared to verbal expression, which is also spread over time in the form of words, creating sentences, conveying content and abstract concepts. Due to this similarity, technological solutions to problems such as text generation and music generation also take similar approaches.
Similarly to verbal expression, in addition to its content, music conveys emotions, which in music are evoked by musical elements spread over time. Depending on the changes over time in melody, timbre, dynamics, rhythm, or harmony, we can notice different emotions in music [7]. Song lyrics may also affect emotions [8]; however, in this study we focused on music files without lyrics.
The aim of this paper was to build a model generating monophonic music sequences with one of four basic emotions: happy, angry, sad, relaxed. The musical elements of the generated sequences should affect the emotion they contain. The model should recognize the emotion-affecting patterns in the training set and apply them to the generated examples.
The use of four categories of emotions is of course a simplification of the possible emotional variations of the generated music, but it helps to start experiments on the problem of generating music with a specific emotion. This choice facilitates the process of labeling music data with emotions as well as model building. A more advanced version of the music generation problem would be based on continuous values of emotion descriptions. A similar choice of four categories for generating emotional symbolic music was also made in [9].

II. RELATED WORK
Research on human-machine interaction in industrial environments has produced many studies on robot perception and on hardware for recognizing human activities [10], [11]. As an important basis for successful human-machine interaction, emotion recognition is also a popular field of exploration. Over the last decades, a wide range of deep learning techniques based on various models and databases [12], [13] have been developed, and research on feature extraction algorithms [14], [15], etc. has been conducted. However, the application of these results to improving human-machine interaction has not yet been fully resolved.
In human-machine interaction, human emotions are seldom the central theme. Usually they are only a one-sided background. However, once the machine identifies and recognizes them, it should respond appropriately. By combining emotion recognition with an appropriate machine reaction, the machine is expected to generate or create a response containing at least a partially human element in the audio or video domain. Williams et al. [16] demonstrated how music generated in real time can improve runners' performance by taking an individual user's needs into account. In [17], Navarro-Cáceres et al. proposed generating melodies under the supervision of the user. The user is supported by a mechanical device that captures the user's movements and translates them into a melody.
After recognizing an emotion, as the next step in the development of human-machine interaction, the intelligent machine should be able to generate at least an appropriate musical phrase. The resulting music could be set as a background or audio theme to support the ongoing interaction. In this way, additional psycho-physical comfort is provided without any additional commitment.
A division into categorical and dimensional approaches can be found in papers devoted to music emotion recognition [18]. In the categorical approach, a number of emotional categories (adjectives) are used for labeling music excerpts [19]-[21]. In the dimensional approach, emotion is described using a dimensional space, like the 2D model proposed by Russell [22], where the dimensions are represented by arousal and valence [23]-[27]. In our work we use the categorical approach with four basic emotions: happy, angry, sad, relaxed.
A comprehensive overview of deep learning architectures used in music generating systems, such as recurrent neural networks, convolutional networks, generative adversarial networks, and autoencoders, was presented by Briot et al. [3]. A functional taxonomy and the state of the art in music generation systems were presented by Herremans et al. [28]. The main concepts, specific tasks, and open challenges of music generation were the topics of the work of Carnovalini and Rodà [29].
A review of systems for algorithmic composition with the intention of targeting specific emotional responses in the listener was presented by Williams et al. [30]. It described the use of sequencing, transformative, and generative algorithms to create novel and emotionally satisfying music. It also considered the various emotional models and musical features employed by such systems. Scirea et al. [31] described a music generator for games, MetaCompose, which is based on evolutionary computation and creates music that can express different mood states in real time. The authors evaluated the affective expression perceived in the music generated by the proposed system, based on human annotation. The idea of automatically generating music with a given sentiment (positive/negative) was presented in [32]. It extended the method used for generating textual product reviews with a sentiment [33] by using a single-layer multiplicative long short-term memory (mLSTM) network. The network is controlled by optimizing the weights of the neurons found to be responsible for the sentiment signal. A variant of this network, where logistic regression uses the hidden states of the generative mLSTM to encode the labeled MIDI phrases, was used as a sentiment classifier. The training dataset was extracted from video game soundtracks in MIDI format, a part of which was annotated according to a two-dimensional model that represents emotion using valence and arousal.
In [34], Hadjeres et al. proposed geodesic latent space regularization for the variational autoencoder, which enhances latent space navigation with the change of the attributes of the decoded sequences. The paper presents a music generation system using the proposed regularization that controls the number of notes generated by variations of a given monophonic melody. In [35], Valenti et al. presented an architecture for music generation that is based on an adversarial autoencoder. The conducted experiments show that the model can organize the latent space according to high-level genre information of the musical pieces, which makes it possible to modify the style of the input song. In [36], a generative VAE model was used to control tonal tension in generated music. To identify latent tension variables, the positions of the labeled musical fragments in the latent space were calculated. The generated music is similar to the original music, keeping the rhythm and manipulating the pitches to match the tonal tension.
What distinguishes this work from others is that it uses a conditional variational autoencoder with the emotion parameter influencing the generated examples. The use of this model with four basic emotions has not yet been noted in the literature.
The rest of this paper is organized as follows. Section III describes the phases of building a music dataset and the emotion model used in the experiments. Section IV presents the representation of symbolic music, which is the data form used during the training of the generative model. Section V describes the concept of the conditional variational autoencoder, its implementation, parameters, and training. Section VI presents the generated music samples as well as their evaluation. Finally, Section VII summarizes the main findings.

A. PREPARING OF SYMBOLIC MUSIC DATASET
The first phase of building a music generating system is building or selecting a database with musical compositions. In this study, the symbolic music library music21 [37] containing compositions by J.S. Bach was used. This collection mostly includes chorales (382) as well as several other compositions, 410 pieces in total. The full list of compositions in the MusicXML format is available in [38].
Because the symbolic music library was to be annotated with emotion labels, the database had to contain files with varying emotions. In [39], Dong et al. studied the key mode distributions of different music datasets, including the Lakh MIDI Dataset, the Wikifonia Lead Sheet Dataset, the Hymnal Dataset, and the J. S. Bach music21 Dataset. They found that key mode distributions (minor, major) in most databases were rather imbalanced, with the exception of the J. S. Bach music21 Dataset, where major compositions account for 56% of the whole. A fairly even key mode distribution of compositions is important when creating a database in which emotions will be assessed; therefore, the J. S. Bach music21 Dataset was selected as the starting database for building the training set. The database was accessed via the MusPy Toolkit [39] and imported into the MusPy format. The music generation system created in this work should generate monophonic sequences; therefore, the original J. S. Bach music21 Dataset underwent several transformations (Fig. 1). First, the tempo of all songs was standardized to 120 BPM. The note values in songs with a tempo other than 120 BPM were adjusted so that only the note lengths (sixteenths, eighth notes, quarter notes, half notes, whole notes) affected the tempo. Another transformation was limiting the length of each music example to four bars and selecting only pieces in a 4/4 time signature, which prevail in the J. S. Bach music21 Dataset; this nevertheless reduced the number of examples in the dataset. Thus, the rhythmic structure of the examples was standardized, covering four bars of four quarter notes each. The result was eight-second examples, each having 16 beats at a tempo of 120 BPM.
Another transformation concerned the keys of the examples, which vary greatly in the J. S. Bach music21 Dataset. When generating simple musical sequences, the distances between sounds and the rhythmic values are important, while the key does not play a significant role; examples in different keys could even interfere with model training. Therefore, all compositions were transposed into C minor or C major.
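The transposition step can be illustrated with a minimal sketch. It assumes that the tonic of each piece is already known as a pitch class (0 = C, ..., 11 = B) and that the shorter of the two possible shifts is chosen; the function name `transpose_to_c` and the shift-direction rule are illustrative assumptions, not details given in the paper.

```python
def transpose_to_c(pitches, tonic_pitch_class):
    """Shift MIDI pitches so that the tonic of the piece lands on C.

    Assumption: the shift direction is chosen to keep the melody close
    to its original register (at most 6 semitones down or 5 up).
    """
    if tonic_pitch_class <= 6:
        shift = -tonic_pitch_class       # move down to the nearest C
    else:
        shift = 12 - tonic_pitch_class   # move up to the nearest C
    return [p + shift for p in pitches]


# Hypothetical fragment in G major: G4, A4, B4, D5 (G has pitch class 7).
print(transpose_to_c([67, 69, 71, 74], 7))  # [72, 74, 76, 79]
```

Because every pitch is shifted by the same number of semitones, the intervals between notes, and therefore the major or minor mode, are preserved.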
Our model is supposed to generate one-voice musical sequences, and therefore the next transformation concerned only the highest voice of the composition, the soprano part, which usually contains the main melody of the piece. After applying all the transformations, a unified set of 344 single-voice musical sequences was obtained, all examples of which have the same length (8 s), are in the key of C major or C minor, and are saved in the MIDI format.

B. DATASET ANNOTATION
During annotation of music samples, we used one of four basic emotions: happy, angry, sad, relaxed, which correspond to the four quarters of Russell's model (Fig. 2). The model consists of two independent dimensions: arousal (vertical axis) and valence (horizontal axis). Happy, angry, sad, and relaxed are just labels representing the individual quarters of the emotion model. Under each label, there are secondary emotions from a given quarter of Russell's model, i.e. the happy label groups emotions with high arousal and high valence; angry, high arousal and low valence; sad, low arousal and low valence; and relaxed, low arousal and high valence. Similar divisions of emotions into categories were used in papers [19], [40], [41].

FIGURE 2. Russell's circumplex model [22].
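The mapping between the two dimensions and the four labels can be written as a small function. This is only a sketch: the numeric scale centred on zero and the treatment of values lying exactly on an axis are assumptions made for illustration, since the annotation in this work was categorical.

```python
def quadrant_label(valence, arousal):
    """Map a point of the valence-arousal plane to one of the four labels.

    Assumption: both dimensions are centred on 0, and boundary values
    are assigned to the positive side.
    """
    if arousal >= 0:
        return "happy" if valence >= 0 else "angry"    # e1 / e2
    return "relaxed" if valence >= 0 else "sad"        # e4 / e3


print(quadrant_label(0.7, 0.6))    # happy
print(quadrant_label(-0.5, -0.4))  # sad
```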
The annotated set of MIDI files was played with one volume and one timbre (MIDI instrument: Grand Piano), so these elements do not affect the emotions in our experiment. What affects the emotions of a music fragment is the musical content: sounds, their number, the pitch, rhythmic values, organization, and minor/major scale [20], [42], [43].
The psychologist Gabrielsson in his work [44] distinguished between perceived and felt (induced) emotions. In the case of the former, we can perceive emotional expression in music without necessarily being affected ourselves, while in the latter, we have an actual emotional response to the music. Perceived emotion is the emotion recognized in the music, and induced emotion is the emotion experienced by the listener. The music expert's task was to annotate the MIDI files with the perceived emotion.
Data annotation was done by three music experts with a university musical education. The musical education of the experts, people who deal with the creation and analysis of emotions in music on a daily basis, allows us to trust the quality of their annotations. The musicians involved in the annotation are practitioners. They play in music bands, compose, give concerts, and express emotions through music, i.e. they specialize not only in perceiving emotions but also in creating them, which makes them more competent in perceiving musical emotions than people who only listen to music.
Each music expert heard all the examples, 344 eight-second MIDI files, so each annotator was able to notice all the shades of emotions in the music, which is not always the case in databases with annotated emotions. This had a positive effect on the quality of the received data, as emphasized by Aljanaki et al. [45]. The data collected from the three music experts was averaged. Considering the internal consistency of the collected data, Cronbach's α [46]

IV. REPRESENTATION OF SYMBOLIC MUSIC
Data from the MIDI files must be processed into a form understandable for the neural network before being used to train the model. Since the music generation system learns from monophonic melodies, all MIDI files from the dataset were encoded into a pitch-based representation using the MusPy Toolkit. The pitch-based representation represents music as a sequence of pitch, rest, and hold tokens. The output shape is T × 1, where T is the number of time steps. The values in the sequence indicate whether the current time step is a pitch (0-127), a rest (128), or a hold (129). Hold tokens are used to extend the duration of a note when the note is longer than the selected resolution; in our case the resolution was a sixteenth note.
Details of the transformation are presented in Fig. 3. The first note, a quarter note with pitch E4, was coded with MIDI number 64; because its length is four times the length of a sixteenth note, it was supplemented with three hold values (129). The next note (an eighth note E4) was coded similarly. An eighth note is two times longer than a sixteenth note and therefore was coded with two values: MIDI number 64 and hold value 129. The coding of subsequent notes followed the same rules. The length of each example from the dataset corresponds to four bars in a 4/4 time signature, which is four quarter notes per bar, making a total of 16 quarter notes. The shortest note value in the dataset is the sixteenth note, and therefore the examples were discretized at sixteenth-note resolution. There are four sixteenth notes in each quarter note, so dividing each example into segments of the shortest note value gives 64 time steps: 4 (bars) × 4 (quarter notes) × 4 (sixteenth notes). Thus, each MIDI file from the dataset was encoded into a pitch-based representation with 64 time steps.
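The encoding rules described above can be sketched in plain Python. This is not the MusPy implementation used in the paper; the function `encode_melody` and its input format (a list of pitch and duration pairs) are assumptions made to keep the example self-contained.

```python
REST, HOLD = 128, 129
STEPS_PER_QUARTER = 4  # sixteenth-note resolution


def encode_melody(notes, total_steps=64):
    """Encode (pitch, duration) pairs as pitch / rest / hold tokens.

    `notes` holds tuples of (MIDI pitch or None for a rest, duration in
    quarter notes). Each note becomes its pitch followed by hold tokens;
    a rest becomes rest tokens only.
    """
    tokens = []
    for pitch, quarters in notes:
        n_steps = max(1, int(round(quarters * STEPS_PER_QUARTER)))
        if pitch is None:
            tokens.extend([REST] * n_steps)
        else:
            tokens.append(pitch)
            tokens.extend([HOLD] * (n_steps - 1))
    # Truncate or pad with rests so every example has the same length.
    tokens = tokens[:total_steps]
    tokens.extend([REST] * (total_steps - len(tokens)))
    return tokens


# Quarter note E4 followed by an eighth note E4, as in the description above.
print(encode_melody([(64, 1.0), (64, 0.5)], total_steps=6))
# [64, 129, 129, 129, 64, 129]
```

A full four-bar example in 4/4 yields 4 × 4 × 4 = 64 tokens with the default `total_steps`.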
After processing the MIDI dataset, the number of different pitch notes was reduced to 29, which after adding rest and hold tokens gives a total of 31 different tokens in a sequence, which were additionally one-hot encoded. The shape of the target output tensor for one example was 64(time step) × 31(token).
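The reduction to a compact vocabulary followed by one-hot encoding might look as follows. This is a sketch under the assumption that the vocabulary is simply the sorted set of tokens occurring in the dataset; the helper names `build_vocab` and `one_hot` are hypothetical.

```python
def build_vocab(sequences):
    """Map every token that occurs in the dataset to a compact index."""
    tokens = sorted({tok for seq in sequences for tok in seq})
    return {tok: idx for idx, tok in enumerate(tokens)}


def one_hot(sequence, vocab):
    """Turn a token sequence into a (time steps x vocabulary size) matrix."""
    rows = []
    for tok in sequence:
        row = [0] * len(vocab)
        row[vocab[tok]] = 1
        rows.append(row)
    return rows


# Toy sequence with two pitches, a rest (128) and a hold (129).
seq = [60, 129, 128, 62]
vocab = build_vocab([seq])
matrix = one_hot(seq, vocab)
print(len(matrix), len(matrix[0]))  # 4 4
```

With the 31 tokens and 64 time steps reported above, the same procedure gives a 64 × 31 matrix per example.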

V. CONDITIONAL VAE
A generative model based on the variational autoencoder (VAE) [47] was used to generate the musical sequences. The VAE encodes the input data into a latent space with a Gaussian distribution and then decodes samples from the latent vector into a form similar to the input. The advantage of the VAE is the ability to move in the continuous latent space of the trained model, which makes it possible to generate new musical sequences. In order to control the type of emotion in the generated musical sequences, the model was extended to a conditional VAE (CVAE) [48]. What makes the CVAE different from the VAE is the addition of a condition, which in our case is an emotion label (Fig. 4). The condition is added on both the encoder and decoder inputs.

A. IMPLEMENTATION OF GENERATIVE MODEL
To implement the CVAE network and conduct the experiments, the Keras deep learning library, written in Python, was used with TensorFlow as the backend. Figs. 5 and 6 show the encoder and decoder of the CVAE, which were implemented using a recurrent neural network (RNN). The CVAE makes it possible to generate musical sequences with a specific emotion through random sampling from the latent space, which in our case has 20 dimensions.
On the first encoder input (Fig. 5), music sequences with 64 time steps and 31 unique one-hot encoded token values are given. For faster RNN learning, the sequences are normalized (mean: 0.00, std: 1.00). On the second encoder input, the four one-hot encoded emotion labels are given. Before the two inputs are concatenated, the dimension of the labels is extended with a Dense layer and reshaped to the same size as the shape of the music sequences. The combined sequences are processed by 512 Gated Recurrent Units (GRU) [49], which make up the RNN. The next two Dense layers reduce dimensionality and generate the mean and log variance. The last output layer of the encoder samples the latent vector z.
On the first decoder input (Fig. 6), the samples of latent vector z from the encoder output are given. On the second decoder input, the four one-hot encoded emotion labels are given, the same as for the encoder. After being combined, the two inputs are passed to a RepeatVector layer, which prepares the data size for the next layer, an RNN with 512 GRU units. The last TimeDistributed layer applies a Dense layer across the time steps of the music sequence.
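The sampling step at the end of the encoder is commonly implemented with the reparameterization trick of the VAE. The sketch below assumes that this standard trick is used here, which the text does not state explicitly; it is written in plain Python rather than Keras, and the function name `sample_latent` is illustrative.

```python
import math
import random


def sample_latent(mean, log_var, rng=random):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1) for every dimension.

    sigma is recovered from the log variance as exp(0.5 * log_var), so
    the randomness is isolated in eps.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


# A 20-dimensional latent vector, matching the latent size used in this work.
z = sample_latent([0.0] * 20, [0.0] * 20, rng=random.Random(0))
print(len(z))  # 20
```

Expressing the sample as a deterministic function of the mean, the log variance, and external noise is what allows gradients to flow through the sampling layer during training.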
The CVAE network consists of the encoder and the decoder joined together. The shape of the music sequences on the CVAE input and output is the same (None, 64, 31). The encoder takes input x and estimates the mean µ and the standard deviation σ of the multivariate Gaussian distribution of latent vector z. The decoder takes samples from latent vector z to reconstruct the input on the output as x̂. The loss function is the sum of the Reconstruction Loss (L_R) and the Latent Loss (L_L). The Reconstruction Loss measures the difference between input x and output x̂ using cross entropy. The Latent Loss is calculated using the Kullback-Leibler divergence, which measures the distance between the target distribution (the Gaussian distribution) and the actual distribution in latent vector z:

$$L_L = -\frac{1}{2} \sum_{i=1}^{K} \left( 1 + \log \sigma_i^2 - \mu_i^2 - \sigma_i^2 \right)$$

where K is the dimensionality of latent vector z, and µ_i and σ_i are the mean and standard deviation of the i-th dimension of latent vector z.
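The two loss terms can be checked numerically with a short sketch. It assumes the standard closed form of the Kullback-Leibler divergence between a diagonal Gaussian and the unit Gaussian, expressed through the log variance produced by the encoder, and a cross entropy summed over time steps; the function names are illustrative and the code is plain Python, not the Keras implementation.

```python
import math


def latent_loss(mean, log_var):
    """KL divergence between N(mean, sigma^2) and N(0, 1), summed over K dims."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mean, log_var))


def reconstruction_loss(target_one_hot, predicted_probs, eps=1e-9):
    """Categorical cross entropy summed over all time steps."""
    total = 0.0
    for t_row, p_row in zip(target_one_hot, predicted_probs):
        total -= sum(t * math.log(p + eps) for t, p in zip(t_row, p_row))
    return total


def cvae_loss(target_one_hot, predicted_probs, mean, log_var):
    """Total loss as the sum of the reconstruction and latent terms."""
    return (reconstruction_loss(target_one_hot, predicted_probs)
            + latent_loss(mean, log_var))


# A latent distribution equal to the unit Gaussian gives zero latent loss.
print(latent_loss([0.0] * 20, [0.0] * 20))  # 0.0
```

Shifting the mean of a single dimension to 1 while keeping unit variance raises the latent loss to 0.5, which shows how the term penalizes departures from the target distribution.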

B. TRAINING OF THE NETWORK
For our classification task, which is the prediction of one category (one pitch of note), the softmax function was used as the activation for the last decoder layer. As a loss function to train the CVAE network, categorical crossentropy was used, which computes the crossentropy loss between the one-hot pitch values and predictions. A tanh activation function was used for the GRU units. A series of experiments was performed with and without standardization of the input data, with different numbers of GRU units (64, 128, 248, 512), and with varying latent space sizes. Finally, a combination of 512 GRU units, a latent space of dimension 20, and standardization of the input data was selected.
The CVAE was trained with the RMSprop optimizer (lr = 0.001). The network was trained for up to 900 epochs, and an early stopping strategy was used to avoid overfitting. The training process was stopped as soon as the loss had not improved for 50 epochs. The loss was evaluated on a validation set (20% of the training data).
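The early stopping rule can be illustrated independently of any deep learning library. The sketch below assumes that "no improvement" means the validation loss failing to fall below its best value so far; in Keras, comparable behaviour is usually obtained with the `EarlyStopping` callback and its `patience` argument, though the exact configuration used in the experiments is not given here.

```python
def early_stopping_epoch(val_losses, patience=50):
    """Return the 0-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs; otherwise it runs to
    the last epoch.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical loss curve that stops improving after the second epoch.
print(early_stopping_epoch([1.0, 0.9, 0.95, 0.96, 0.97], patience=3))  # 4
```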
CVAE+Dense was chosen as a baseline for comparing the results of the obtained models. It differed from CVAE+GRU in that a simple Dense layer was used in the encoder and decoder instead of the recurrent GRU layer. Table 2 presents the validation loss obtained during model building. The number in parentheses next to the model name indicates the number of units used. The best results are marked in bold. From the obtained results, we can see that the CVAE+GRU models with more than 64 GRU units are superior to the baseline model (CVAE+Dense). This indicates that the recurrent units in the CVAE are better suited for encoding and decoding sequential data, which is consistent with what is well known about recurrent networks. How the use of the baseline model (CVAE+Dense) and the proposed model (CVAE+GRU) affects the metrics obtained for the generated music, depending on the type of emotion, is presented in Section VI-B.
Fig. 7 presents the stages of CVAE training using a visualization of the input and output data. The degree of training is illustrated by the ability of the decoder to reproduce the input sequence (Fig. 7a) on the output (Figs. 7b, 7c, 7d, 7e). The presented notations are accompanied by sequences of numbers constituting the pitch representation of a 64-element sequence of a given musical example. In the initial stage of training (Fig. 7b), the sequence is shorter and monotonous, with no clear musical meaning. The CVAE is not yet sufficiently trained and is unable to generate a sequence close to the input sequence. The next steps (Figs. 7c, 7d, 7e) show how the sequence obtained at the output of the autoencoder starts to resemble the input sequence.
Fig. 8 shows one view of the 20-dimensional latent space obtained during model training. New musical examples will be sampled and generated from the latent space. The points in the latent space correspond to the training files, and the colors define the emotion assigned to them. We can see that the coordinate values of all points are distributed around a mean value equal to 0. Different emotions are not grouped in one place, but spread throughout the entire latent space.

A. EXAMPLES OF GENERATED MUSIC SEQUENCES WITH PROVIDED EMOTION
A trained CVAE model was used to generate new music sequences with a specific emotion. The generation consisted of feeding an emotion label and a random sample of the latent space size to the decoder input (Fig. 9). We notice the minor scale in the examples of Figs. 10b and 10c, which places them on the negative part of the valence axis of Russell's model, i.e. the emotions angry (e2) and sad (e3). In Figs. 10a and 10d we notice the sounds of the C major scale, which indicate positive emotions from Russell's model: happy (e1) and relaxed (e4).
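The generation step can be sketched without the trained network: the decoder receives a random latent sample and a one-hot emotion label, and its token output is turned back into notes. The code below shows only these two surrounding steps. The label order e1-e4, the unit Gaussian sampling, and the helper names are assumptions, and the decoder itself is not included.

```python
import random

EMOTIONS = ["happy", "angry", "sad", "relaxed"]  # assumed order e1, e2, e3, e4


def decoder_inputs(emotion, latent_dim=20, rng=random):
    """Build the two decoder inputs: a random latent sample and a one-hot label."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {emotion}")
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    label = [1 if name == emotion else 0 for name in EMOTIONS]
    return z, label


def tokens_to_notes(tokens, rest=128, hold=129):
    """Convert pitch / rest / hold tokens into (pitch or None, n_steps) pairs."""
    notes = []
    for tok in tokens:
        if tok == hold and notes and notes[-1][0] is not None:
            notes[-1][1] += 1                 # extend the sounding note
        elif tok in (rest, hold):
            # A rest, or a stray hold with no note to extend, counts as silence.
            if notes and notes[-1][0] is None:
                notes[-1][1] += 1
            else:
                notes.append([None, 1])
        else:
            notes.append([tok, 1])            # a new note starts here
    return [(pitch, steps) for pitch, steps in notes]


z, label = decoder_inputs("sad", rng=random.Random(1))
print(len(z), label)  # 20 [0, 0, 1, 0]
print(tokens_to_notes([64, 129, 129, 129, 64, 129, 128, 128]))
# [(64, 4), (64, 2), (None, 2)]
```

Here the durations are counted in sixteenth-note steps, so `(64, 4)` is a quarter note E4.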

B. EVALUATION OF RESULTS USING METRICS
To evaluate the generated music sequences, they were tested using the following metrics [5], [39], [50]:
• pitch range - defined as the difference between the highest and the lowest pitch;
• n pitches used - defined as the number of unique pitches used in a melody;
• pitch in scale C major rate - defined as the ratio of the number of notes in the C major scale to the total number of notes;
• pitch in scale C minor rate - defined as the ratio of the number of notes in the C minor scale to the total number of notes.
To test the statistical difference between the training data and the generated samples, a set of 20 musical sequences was generated for each of the four emotions (e1, e2, e3, e4), for a total of 80 examples. Four metrics were calculated for each generated example. The same metrics were also calculated for the training set. Comparing the distributions of the values of these metrics allowed us to assess whether the generated files have the specific emotions. Table 3 presents the mean and standard deviation (σ) of the metrics obtained from the music generated with the proposed and baseline models, and from the music used as a training set. Note that for pitch range and n pitches used the mean values are lower for the baseline model than for the proposed model, especially for emotions e1 and e2. The baseline model produces melodies with smaller differences between the highest and lowest tones and with fewer unique pitches used in the melody. The mean and σ values obtained from the music generated with the proposed model are closer to the values obtained from the training set, especially for the metrics pitch range and n pitches used.
Distributions of the calculated metrics for the generated set (proposed model) and the training set labeled by emotion are shown in Figs. 11, 12, 13 and 14. In Fig. 11 we can see that the pitch range is lower for emotions e3 and e4, both in the generated and in the training set. This particularly concerns the emotion sad (e3), which has the lowest values. Similar differences between sets e1, e2 and e3, e4 can be seen in Fig. 12, which presents the number of unique pitches used in a melody. The sequences with emotions happy (e1) and angry (e2) use more varied sounds than sequences with emotions sad (e3) and relaxed (e4). We could conclude that the pitch range and n pitches used metrics are suitable for distinguishing emotions on the arousal axis of Russell's emotion model.
Footnote 6: https://github.com/grekowj/musgenvae
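The four metrics can be computed directly from a token sequence in the pitch-based representation. The sketch below is an illustration rather than the toolkit code used for the reported results; in particular, the use of the natural minor scale for C minor is an assumption.

```python
C_MAJOR = {0, 2, 4, 5, 7, 9, 11}   # pitch classes C D E F G A B
C_MINOR = {0, 2, 3, 5, 7, 8, 10}   # natural minor (assumption): C D Eb F G Ab Bb


def note_pitches(tokens):
    """Keep only note onsets, dropping rest (128) and hold (129) tokens."""
    return [t for t in tokens if t < 128]


def pitch_range(tokens):
    """Difference between the highest and the lowest pitch."""
    pitches = note_pitches(tokens)
    return max(pitches) - min(pitches) if pitches else 0


def n_pitches_used(tokens):
    """Number of unique pitches used in the melody."""
    return len(set(note_pitches(tokens)))


def pitch_in_scale_rate(tokens, scale):
    """Ratio of notes whose pitch class belongs to the scale."""
    pitches = note_pitches(tokens)
    if not pitches:
        return 0.0
    return sum(1 for p in pitches if p % 12 in scale) / len(pitches)


melody = [60, 129, 62, 63, 128, 67]  # C4 (held), D4, Eb4, rest, G4
print(pitch_range(melody), n_pitches_used(melody))  # 7 4
print(pitch_in_scale_rate(melody, C_MAJOR))         # 0.75
print(pitch_in_scale_rate(melody, C_MINOR))         # 1.0
```

The Eb4 in the toy melody lowers the C major rate while leaving the C minor rate at 1.0, which is the kind of contrast the two scale metrics are meant to capture.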
Analyzing the box plots in Fig. 13, we can see that the musical sequences with emotions e1 and e4 use the C major scale sounds both in the generated and the training set. The use of C major scale sounds in files with emotions e2 and e3 is much smaller. We see an inverse distribution of values for the pitch in scale C minor rate metric (Fig. 14), where files with emotions e2 and e3 have greater values than e1 and e4. It could be concluded that the pitch in scale C major rate and pitch in scale C minor rate metrics are suitable for distinguishing emotions on the valence axis of Russell's emotion model.
To compare the obtained value distributions for the individual metrics, the Kolmogorov-Smirnov (KS) statistic [51] was calculated to determine whether two distributions differ (Tables 4, 5, 6 and 7). The smaller the KS value, the more similar the two distributions are, i.e. the more likely it is that the samples are drawn from the same continuous distribution. The lowest values, which indicate the greatest similarity between sets, are in bold.
The KS values in one row of a table were computed by selecting a set with a specific emotion (e.g. e1) from the generated sets and comparing it with each set (e1-e4) from the training sets. This was repeated for the subsequent emotions (e2-e4), giving the next rows of the table.
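The two-sample KS statistic is the largest vertical distance between the empirical cumulative distribution functions of the two samples. A minimal standard-library sketch is given below; the function name `ks_statistic` and the sample values are illustrative, and the paper does not state which implementation was used.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)  # share of a not exceeding value
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical pitch range values for a generated and a training set.
print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))  # 0.5
print(ks_statistic([1, 2, 3], [1, 2, 3]))        # 0.0
```

A value of 0 means the empirical distributions coincide, while a value of 1 means the two samples do not overlap at all.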
Separate statistics are not always able to identify the most similar sets. To summarize the most similar sets, a win matrix was calculated based on the previous four tables (Tables 4, 5, 6 and 7). Table 8 shows which of the training sets are closest to the generated sets with the given emotion. The most similar sets for each metric were recorded with an increment of 1 (or 0.5 in the case of two winners) in the matrix.

FIGURE 12. Box plots of the metric n pitches used for the generated and training data sets labeled with emotions e1-e4.

The table should be viewed horizontally, as it indicates in how many cases the given generated set with a given emotion was similar to the training set with the same emotion.
The sum of each row in Table 8 is 4.0, as the similarities were counted for four metrics.
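The construction of the win matrix can be sketched as follows. Each KS table contributes one win per row, given to the training set with the smallest KS value and shared equally in the case of a tie. Splitting a win equally among more than two winners is an assumption that generalizes the 0.5 rule, and the table values below are hypothetical.

```python
def win_matrix(ks_tables):
    """Accumulate per-row winners (smallest KS value) over all metric tables."""
    size = len(ks_tables[0])
    wins = [[0.0] * size for _ in range(size)]
    for table in ks_tables:
        for i, row in enumerate(table):
            best = min(row)
            winners = [j for j, value in enumerate(row) if value == best]
            for j in winners:
                wins[i][j] += 1.0 / len(winners)  # 1 for one winner, 0.5 for two
    return wins


# Two hypothetical 2 x 2 KS tables (rows: generated sets, columns: training sets).
table_a = [[0.1, 0.5], [0.4, 0.2]]
table_b = [[0.3, 0.3], [0.6, 0.1]]
print(win_matrix([table_a, table_b]))  # [[1.5, 0.5], [0.0, 2.0]]
```

Every row of the result sums to the number of tables, which corresponds to the row sum of 4.0 obtained with four metrics.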
From the information generalized in Table 8, we can see that the diagonal values are the largest, i.e. that a given set generated with a given emotion is most similar to its counterpart from the training set. The diagonal values are also exactly the winners, which indicates the success of this method. All other elements of the table should be 0 in the ideal case. Their non-zero values show that the separately used metrics are not as sensitive as they should be, and these elements could be interpreted as errors. Hence, we can make one more interesting observation: the metric set used in this evaluation is much more sensitive to arousal than to valence. Such a conclusion can be reached by summing the appropriate elements of the table. Thus we can see that the errors between the examined sets are greater between the right and left hemispheres of Russell's model, emotions e1-e2 and e3-e4, i.e. on the valence axis the error is 2.5 vs. 1.5 on the arousal axis.

TABLE 4. Kolmogorov-Smirnov statistic between distributions of metric pitch range obtained from the generated and training sets labeled with emotions e1-e4.

TABLE 5. Kolmogorov-Smirnov statistic between distributions of metric n pitches used obtained from the generated and training sets labeled with emotions e1-e4.

TABLE 6. Kolmogorov-Smirnov statistic between distributions of metric pitch in scale C major rate obtained from the generated and training sets labeled with emotions e1-e4.

TABLE 7. Kolmogorov-Smirnov statistic between distributions of metric pitch in scale C minor rate obtained from the generated and training sets labeled with emotions e1-e4.
Additionally, the KS statistic was calculated between the music generated by the baseline model (CVAE+Dense) and the training sets; the results are summarized in Table 9. We notice a clear deterioration in the similarity of the generated sets to the training sets with emotions from the upper quarters of Russell's model (e1, e2), where arousal is high (diagonal values: 0.0, 2.0). Comparing Table 9 with Table 8, we notice a smaller number of wins on the diagonal, which indicates that the music generated by the baseline model is less similar to the original songs than that generated by the proposed model. This confirms that the CVAE+GRU model, with recurrent units for sequence processing, is better suited to the task than CVAE+Dense, as GRU provides better possibilities for encoding and decoding sequences. According to the presented metrics, the music generated by the proposed model is more similar to the music of J.S. Bach, which was used as the training set. The non-recurrent baseline is worse at generating music with emotions e1 and e2 than the CVAE+GRU model (Tables 8 and 9), while a smaller deterioration in the quality of the generated music was noticed for emotions e3 and e4. The simpler model thus generates worse music, particularly in the upper quarters of Russell's model. This shows that using a non-recurrent model as a baseline in our experiment made sense, because it revealed changes in the obtained metrics for different emotions, which is a very interesting observation.
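For reference, the two-sample KS statistic used throughout this section is the largest absolute difference between the empirical distribution functions of two samples. The sketch below is a plain-Python illustration; the function name `ks_statistic` and the sample values are hypothetical, and the tooling actually used in the experiments is not stated here. In practice, `scipy.stats.ks_2samp` returns the same statistic.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # The empirical CDFs are step functions that change only at data points,
    # so it is enough to evaluate the difference at every observed value.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)  # share of sample a that is <= x
        fb = bisect_right(sb, x) / len(sb)  # share of sample b that is <= x
        d = max(d, abs(fa - fb))
    return d


# Hypothetical pitch-range values (in semitones) for illustration only.
generated = [12, 14, 15, 17, 19]
training = [12, 13, 15, 16, 19]
print(ks_statistic(generated, training))
```

The statistic is 0 for identical samples and 1 for samples that do not overlap at all, so a smaller value indicates a generated set that is closer to the training set for the given metric.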

C. EVALUATION OF RESULTS USING EXPERT OPINIONS
The same method that was used to label the training dataset (Section III-B), i.e. asking the same three music experts with a university music education to annotate the emotion of the generated music files, was used as a second method of evaluating the generated music sequences.
The evaluation concerned the same files as the evaluation using the metrics (Section VI-B), i.e. each model was assessed using 80 music sequences, 20 generated for each of the four emotions (e1, e2, e3, e4). The assessment of the generated examples pertained to two models: the baseline (CVAE+Dense) and the proposed model (CVAE+GRU). The task of each music expert was to listen to all the examples generated by a given model and determine their emotions, i.e. to make 80 annotations per evaluated model. The examples were shuffled so that their order was not grouped by emotion. The annotations obtained from the music experts were averaged.
Expert annotations of the sets generated by the baseline model (CVAE+Dense) and by the proposed model (CVAE+GRU) are presented in Tables 10 and 11, respectively. The values in the rows refer to the files generated with the given emotion. Because 20 files were generated for each emotion, each row sums to 20. The values in the columns give the number of files with a given emotion as determined by the music experts. The obtained annotations show that the created models generated music sequences with four categories of emotions. Comparing Table 10 with Table 11, we notice the higher accuracy of the examples generated by the proposed model (89%) relative to the baseline model (85%).
An interesting observation is that the more complex model, i.e. the proposed model (CVAE+GRU), is better at generating files with positive emotions (e1: accuracy 100%; e4: accuracy 100%) and slightly worse at generating files with negative emotions (e2: accuracy 85%; e3: accuracy 70%).
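The overall accuracy of the proposed model can be checked against the per-emotion values quoted above. The short sketch below uses only those four percentages; because every emotion has the same number of files (20), the overall accuracy is their unweighted mean. The helper name `mean_accuracy` is our own.

```python
# Per-emotion accuracy (in percent) of the proposed model, as quoted in the text.
proposed = {"e1": 100, "e2": 85, "e3": 70, "e4": 100}


def mean_accuracy(percentages, labels):
    """Unweighted mean accuracy over the selected emotions (equal class sizes)."""
    return sum(percentages[label] for label in labels) / len(labels)


overall = mean_accuracy(proposed, ["e1", "e2", "e3", "e4"])
positive = mean_accuracy(proposed, ["e1", "e4"])  # positive valence
negative = mean_accuracy(proposed, ["e2", "e3"])  # negative valence
print(overall, positive, negative)
```

The mean of the four values is 88.75%, which rounds to the reported 89%; the positive-valence emotions average 100% and the negative-valence emotions 77.5%.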
In the case of files generated by the baseline model (CVAE+Dense), we notice a deterioration in the generation of files with positive emotions (e1: 80%; e4: 60%), i.e. the opposite of the situation for the proposed model (greater accuracy for emotions e2 and e3, and worse for e1 and e4). The music experts expressed the opinion that the melodies generated by the simpler model were often underdeveloped, chaotic when arousal was high, or monotonous when arousal was low, which shifted the emotions towards the negative side, i.e. the left hemisphere of Russell's model. Also, in the case of the baseline model, the lower similarity of the generated examples to the original melodies was reflected in the metrics in Section VI-B (Table 9).
We observed that in both models the errors occur mainly between emotions e1 and e2 or between e3 and e4, i.e. on the valence axis of Russell's model. This confirms that assessing and generating music with emotions is more difficult on the valence axis than on the arousal axis, where there are almost no errors.
In summary, although the annotations showed that the created models generate music sequences with a given emotion with an accuracy of at least 85%, the determination of the emotions of the generated music files by music experts alone did not give a definite answer as to which of the tested models is better: the difference in accuracy is only four percentage points. After conducting both evaluations (using metrics and expert opinions), we can see that using additional objective metrics to evaluate the model (Section VI-B) is helpful in this case. In the future, additional parameters for the expert assessment of the generated music sequences could be used, such as rhythm, melody, and musical structure.

VII. CONCLUSION
More and more different kinds of machines and devices are entering our daily life. Robots, machines, and even objects called things offer services and information. In order to improve their interaction and collaboration with the user, in-machine feedback is necessary. Machine perception, focused on the analysis of not only a human's orders but also their emotional state, must trigger an appropriate quasi-human reaction. Therefore, studies of music generation with a specific emotion are reasonable and follow current trends.
This article presents the stages of creating a system generating monophonic musical sequences with one of four basic emotions. The generated examples based on random samples from latent space resemble real musical sequences and, additionally, we notice the appropriate emotions in them. A trained model recognizes the patterns influencing emotions in the training set and is able to transfer them to the generated examples.
The evaluation showed that the generated music examples are similar to the training set. Due to the random element, the generated examples differ slightly from those in the training set, but their emotional characteristics are similar to the training data.
The limitations of this study include the emotional model we adopt, the musical area used in the training set, and the length of the monophonic pieces. All of these result from the initial stage of our research and were intentionally accepted as a compromise for this pilot study. The emotional model we apply considers just the four main emotional groups from Russell's model.
Thanks to such a system, in any human-machine interaction, a robot would be able to create a varied set of melodies that suit and correspond well to the current human mood. Sensing, in the meaning of "detecting and tracking", on the one hand and a proper acoustic response of the machine on the other complement the human-machine collaboration, making it a bit more human. The system could assist a composer in finding new themes with a specific emotion and could also be used to generate musical sequences in computer games depending on the emotional context, or background music in shops. Another potential application of the system is music therapy, where the generated melodies with a specific emotion could be used to change or enhance the emotional state of the patient.
In the future, the generating system should be extended to work with polyphonic, four-voice music. Also, describing emotions using continuous values of arousal and valence from Russell's model would be a natural continuation of this work.