Deep Correlation Analysis for Audio-EEG Decoding

Electroencephalography (EEG), one of the easiest modes of recording brain activations non-invasively, is often distorted by recording artifacts that adversely impact stimulus-response analysis. The most prominent techniques thus far attempt to improve the stimulus-response correlations using linear methods. In this paper, we propose a neural network based correlation analysis framework that significantly improves over the linear methods for auditory stimuli. A deep model is proposed for intra-subject audio-EEG analysis based on directly optimizing the correlation loss. Further, a neural network model with a shared encoder architecture is proposed for improving the inter-subject stimulus-response correlations. These models attempt to suppress the EEG artifacts while preserving the components related to the stimulus. Several experiments are performed using EEG recordings from subjects listening to speech and music stimuli. In these experiments, we show that the deep models improve the Pearson correlation significantly over the linear methods (average absolute improvements of 7.4% in speech tasks and 29.3% in music tasks). We also analyze the impact of several model parameters on the stimulus-response correlation.


I. INTRODUCTION
Understanding the human brain has been a topic of profound interest in both science and engineering. One of the most common methods of analysis is to measure the evoked brain response for a given stimulus and establish a relation between the two. Electroencephalography (EEG) constitutes the simplest non-invasive technique for collecting brain signals with sufficient temporal resolution for auditory analysis. Since EEG recordings involve scalp-level measurements, they are significantly impacted by noise [1]. The most popular method for analyzing auditory-evoked EEG signals is the classical event-related potential (ERP) approach [2], [3], which averages the EEG responses in the time/frequency domain to suppress the noise in the recordings [4]. However, this approach is limited to isolated stimuli that have to be repeated, and is therefore often too restrictive for the analysis of natural stimuli like speech and music. One of the first successful attempts at modeling responses to such continuous natural stimuli is the temporal response function (TRF) proposed by Lalor et al. [5]. The linear TRF model describes the relationship between a stimulus and its response as a linear time-invariant (LTI) system. It can be a forward model, where the model estimates the EEG response from the stimulus, or a backward model, where the model predicts components of the stimulus from the EEG response. The model estimation is performed using linear least squares. The performance of these models is typically validated using the Pearson correlation between the target signal and the predicted signal [6].
The initial studies used the slowly varying speech envelopes of the stimuli and the corresponding single-trial EEG responses [7], [8]. The analysis can also be extended to speech spectrograms [9], phonemes [10], or semantic features [11].
Canonical Correlation Analysis (CCA) extends these linear methods. Here, two signals are projected onto a subspace that maximizes the correlation between them [12]; it determines a set of orthogonal directions along which the two signals are highly correlated. The CCA has recently been shown to outperform forward and backward TRF models in auditory-EEG analysis [13], [14].
For each subject, the stimulus and response representations are defined as "views" of the auditory signal. The stimulus-view represents the audio signal using a temporal envelope, and the response-view represents it using the brain responses collected as EEG recordings. The linear CCA can be performed only on two views of the data at a time. In order to aggregate the EEG responses from multiple views (subjects), multiway CCA (MCCA), also called generalized CCA [15]-[17], has been proposed. As all views (EEG responses) represent the same object (audio stimulus), some components are common across the views [18]. The application of multiway CCA to EEG mapping has shown improvements over the intra-subject CCA [19].
In this paper, we explore a deep neural network based architecture for correlating the EEG response with the stimulus features. The deep CCA framework, introduced by Andrew et al. [20], showed promise over the linear CCA for image data. However, the direct application of deep CCA to EEG data is cumbersome, as the EEG data is significantly noisy, with a signal-to-noise ratio (SNR) below −20 dB [1]. The dropout strategy [21] partly alleviates the impact of noise. In audio-EEG experiments, we show that the deep CCA consistently improves over the linear CCA model.
We also propose an approach for deep multiway CCA where multiple EEG responses to the same stimuli can be combined in a neural network architecture. For this task, we use a reconstruction approach with a shared hidden representation to derive the deep transform that aligns multiple views. Using this novel approach, we show that the deep MCCA improves over the linear MCCA model [19]. In subsequent analysis, we also illustrate how the deep MCCA can be combined with the deep CCA model for EEG analysis. The data used in the experiments consist of EEG responses for speech and music listening tasks. The speech task uses the same dataset as the linear CCA work by Cheveigné et al. [13], where subjects listen to the narration of an audiobook. The music dataset used in this work is the Naturalistic Music EEG Dataset - Hindi (NMED-H) [22], an open dataset of EEG responses collected for Hindi pop songs.
The remainder of the paper is arranged as follows. Section II highlights prior work done in the domain. Section III describes the background of linear CCA and multiway CCA. Section IV discusses the proposed deep CCA model and the deep MCCA model. Section V details the datasets and experimental setup. Section VI reports the results of the proposed deep models and a comparison with linear models. Section VII presents a discussion on the hyper-parameters. Finally, Section VIII presents the summary of this work.

II. RELATED PRIOR WORK
Machine learning methods for the extraction of information from brain signals like EEG have a significant impact on both understanding and applications like brain-computer interfaces (BCI) [23]. It is therefore of profound importance to transfer the recent advancements in machine learning (for example, deep learning [24]) to improve the models for brain signal decoding and single-trial analysis. One of the first works in this direction involved the use of convolutional neural networks to identify the P300 wave in EEG signals [25]. The recent years have seen the use of deep learning for several brain mapping tasks like computational memory prediction [26], driver's cognitive state prediction [27], and the brain activity reconstruction for visual stimuli [28]. A review of several efforts in decoding brain activity using deep learning techniques is given in Zheng et al. [29].
In auditory tasks, EEG recordings have been shown to contain rhythm information in music perception using classifiers based on deep networks [30]. A recent work by Das et al. [31] has shown that auditory attention decoding in the perception of noisy speech can also be improved by deep learning techniques. In multi-speaker cocktail party scenarios, Deckers et al. [32] showed that neural networks are capable of identifying the attended speaker. A Convolutional Neural Network (CNN) based model for EEG-based speech stimulus reconstruction was also proposed by de Taillez et al. [33], showing that deep learning is a feasible alternative to linear decoding methods. Our prior work on deep CCA models for intra-subject analysis [34] and inter-subject analysis [35] is extended in this paper with inter-subject models and additional evaluations on a music dataset.

A. Linear Canonical Correlation Analysis
For a dataset with pairs of multi-variates, linear Canonical Correlation Analysis [36] obtains an optimal linear transform for each of the two views such that the Pearson correlation of the two transformed vectors is maximized. Let x ∈ R^{D_1} and y ∈ R^{D_2} be two random vectors that represent the two views of the data. Let d be the dimension of the projected subspace. The subspace is determined such that the resultant projection vectors are maximally correlated. For example, if d = 1, then a pair of transform vectors u_1 ∈ R^{D_1} and v_1 ∈ R^{D_2} need to be determined such that x̂ = u_1ᵀx and ŷ = v_1ᵀy are maximally correlated. Mathematically,

$$ (\mathbf{u}_1^*, \mathbf{v}_1^*) = \arg\max_{\mathbf{u}_1, \mathbf{v}_1} \frac{\mathbf{u}_1^\top C_{xy} \mathbf{v}_1}{\sqrt{(\mathbf{u}_1^\top C_{xx} \mathbf{u}_1)\,(\mathbf{v}_1^\top C_{yy} \mathbf{v}_1)}} \qquad (1) $$

where C_xx, C_yy are the auto-correlation matrices of x, y respectively, while C_xy = E[(x − μ_x)(y − μ_y)ᵀ] is the cross-correlation matrix. Here, μ_x and μ_y are the mean vectors of x and y respectively.
It can be shown that the optimal solution for the transform vectors u_1* and v_1* is given by the first left and right singular vectors of the matrix

$$ T = C_{xx}^{-1/2}\, C_{xy}\, C_{yy}^{-1/2} \qquad (2) $$

mapped back through the respective whitening transforms C_xx^{-1/2} and C_yy^{-1/2}. For d > 1, the solution is obtained from the subsequent singular vectors of T [20].
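As a concrete illustration, the SVD-based solution can be sketched in a few lines of NumPy (a minimal sketch of the standard CCA computation; the function name, the ridge term `eps` for numerical stability, and the sample-covariance estimates are our own choices, not from the paper):

```python
import numpy as np

def linear_cca(X, Y, d=1, eps=1e-8):
    """Linear CCA via SVD of the whitened cross-covariance T.

    X: (m, D1), Y: (m, D2) -- m paired samples of the two views.
    Returns the d-dimensional projections of each view and the
    corresponding canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    m = X.shape[0]
    Cxx = Xc.T @ Xc / (m - 1) + eps * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (m - 1) + eps * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (m - 1)

    def inv_sqrt(C):
        # Inverse square root of a symmetric positive-definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # T = Cxx^{-1/2} Cxy Cyy^{-1/2}; its singular values are the
    # canonical correlations.
    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(T)
    A = inv_sqrt(Cxx) @ U[:, :d]   # (D1, d) transform for view x
    B = inv_sqrt(Cyy) @ Vt[:d].T   # (D2, d) transform for view y
    return Xc @ A, Yc @ B, s[:d]
```

On synthetic data with a shared latent component, the Pearson correlation of the two returned projections matches the first singular value of T.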

B. Linear Multiway CCA
The multiway CCA is a linear method that generalizes the linear CCA to multiple (more than two) data-views. It finds a linear transform for each data-view, such that all the projections are maximally correlated with each other [18], [19]. Let x_n ∈ R^{d_n}, for n = 1 to N, denote the N data-views and let D_N = \sum_{n=1}^{N} d_n. For the 1-D projection case, let v_n ∈ R^{d_n} denote the transform vector that projects x_n onto the common subspace. The goal of MCCA is to find the transform vectors {v_n}_{n=1}^{N} such that the inter-set correlation (ISC) is maximized. The ISC is defined as

$$ \rho_{ISC} = \frac{r_B}{r_W} \qquad (3) $$

where r_B is the between-set covariance and r_W is the within-set covariance. The between-set and within-set covariances are

$$ r_B = \frac{1}{N(N-1)} \sum_{j=1}^{N} \sum_{\substack{k=1 \\ k \neq j}}^{N} \mathbf{v}_j^\top R_{jk} \mathbf{v}_k, \qquad r_W = \frac{1}{N} \sum_{j=1}^{N} \mathbf{v}_j^\top R_{jj} \mathbf{v}_j $$

where R_{jk} ∈ R^{d_j × d_k} is the cross-covariance matrix between the views x_j and x_k. The cross-covariance matrices among all the views form the elements of a block matrix R ∈ R^{D_N × D_N} such that [R]_{jk} = R_{jk}. By considering only the auto-covariance matrices, a block-diagonal matrix D ∈ R^{D_N × D_N} is formed whose block-diagonal entries are the same as those of R. The optimal transform vectors {v_n}_{n=1}^{N} are obtained by solving the generalized eigenvalue problem [18]:

$$ R \mathbf{v} = \lambda D \mathbf{v} \qquad (4) $$

The eigenvector v ∈ R^{D_N} with the maximum eigenvalue is the concatenation of the transform vectors {v_n}_{n=1}^{N}. For projection onto a higher dimensional subspace d (> 1), the transform matrix for each multivariate x_n becomes V_n ∈ R^{d_n × d}. This involves finding the top d eigenvectors of Equation (4).
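A compact NumPy sketch of this generalized eigenvalue solution follows (our own illustration: it whitens with D^{-1/2} and solves an ordinary symmetric eigenproblem rather than calling a generalized eigensolver, and adds a small ridge `eps` for stability):

```python
import numpy as np

def linear_mcca(views, d=1, eps=1e-8):
    """Solve R v = lambda D v for the stacked MCCA transform vectors.

    views: list of N arrays of shape (m, d_n) with paired rows.
    Returns a list of (d_n, d) transforms whose projections are
    maximally inter-correlated across the views.
    """
    centered = [X - X.mean(axis=0) for X in views]
    Z = np.hstack(centered)                  # (m, D_N) stacked views
    m = Z.shape[0]
    R = Z.T @ Z / (m - 1)                    # full block covariance
    D = np.zeros_like(R)                     # block-diagonal part of R
    dims = [X.shape[1] for X in views]
    o = 0
    for dn, Xc in zip(dims, centered):
        D[o:o + dn, o:o + dn] = Xc.T @ Xc / (m - 1)
        o += dn
    # Whiten with D^{-1/2}; the problem becomes a symmetric eigenproblem.
    w, V = np.linalg.eigh(D + eps * np.eye(D.shape[0]))
    Dinv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    evals, evecs = np.linalg.eigh(Dinv_sqrt @ R @ Dinv_sqrt)
    top = Dinv_sqrt @ evecs[:, ::-1][:, :d]  # largest eigenvalues first
    out, o = [], 0
    for dn in dims:
        out.append(top[o:o + dn, :])         # slice back per view
        o += dn
    return out
```

With three synthetic views sharing a common latent component, the projected views come out pairwise highly correlated, as the ISC objective intends.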

C. Understanding CCA for EEG Decoding
The CCA model attempts to find a subspace of brain activity which is maximally correlated with the auditory stimulus. The EEG signal, in the form of spatial channels (electrodes) and time-domain lags, is used as the response, while the time-lagged audio envelopes are used as stimulus features. These two vectors form the data-views for CCA [13]. Specifically, the stimulus features, x, in our experiments represent the time-lagged envelope of the audio signal, while the response features, y, represent the EEG recordings. These two features are provided at the same sampling rate, and the CCA model attempts to relate them. This is done by finding the linear transforms, u_1* and v_1*, applied to the stimulus and the response respectively, that maximize the correlation.
In this case, the components of CCA can also be regarded as spatio-temporal filters applied on the EEG data and modulation filters on the audio envelope. The multiple CCA components of the audio signal correspond to FIR filtered envelopes that are orthogonal. Similarly, the CCA components of the EEG signal represent spatio-spectrally filtered projections that are orthogonal. The CCA was used recently for auditory and audio-visual EEG analysis by Dmochowski et al. [37]. The use of CCA in forward and backward models with time lags has shown additional improvements in correlation values [13].
When multiple subjects are presented with the same stimulus, the functional similarity is expected to generate similar responses [38]. However, the position or orientation of the sources with respect to the electrodes may be different for different subjects and therefore, the direct mapping of the EEG responses for the same stimulus from different subjects is cumbersome. The multi-CCA attempts to align the data from each subject to a common representation that makes it possible to compare across subjects. This is achieved by deriving spatial filters that are specific to each subject [39].
For the MCCA model, the N subjects' response features (EEG recordings), x_n for n = 1 to N, and the common stimulus features (audio envelope), x_{N+1}, are provided as inputs. The model then provides a linear transform for each of them, v_n* for n = 1 to N + 1, such that the projected representations (v_n*ᵀ x_n) are highly correlated with each other.

A. Deep CCA
A deep CCA model finds a pair of optimal non-linear transforms for the two views of the dataset through a pair of neural networks, such that the two new projections are highly correlated [20]. As before, let the random vectors x ∈ R^{D_1} and y ∈ R^{D_2} denote the data-views. Let f_1(·) and f_2(·) denote the non-linear functions realized by the neural networks operating on x and y respectively, and let θ_1 and θ_2 represent their trainable parameters. We find the optimal parameter values by solving the following optimization problem:

$$ (\theta_1^*, \theta_2^*) = \arg\max_{\theta_1, \theta_2}\; \rho\big(f_1(\mathbf{x}; \theta_1),\, f_2(\mathbf{y}; \theta_2)\big) \qquad (5) $$

where ρ corresponds to the correlation coefficient. For a batch of m examples from each of the (x, y) data-views, let H_x ∈ R^{d×m} denote the matrix whose columns are the outputs of the first network f_1(·). Similarly, let H_y ∈ R^{d×m} denote the outputs from the second network f_2(·).
Let H̄_x = H_x − (1/m) H_x 1 and, similarly, H̄_y = H_y − (1/m) H_y 1 denote the centred data matrices, where 1 is an all-ones matrix of dimension m×m. The covariance matrices of the feed-forward network outputs are given by

$$ \Sigma_{xx} = \frac{1}{m-1}\bar{H}_x\bar{H}_x^\top + rI, \quad \Sigma_{yy} = \frac{1}{m-1}\bar{H}_y\bar{H}_y^\top + rI, \quad \Sigma_{xy} = \frac{1}{m-1}\bar{H}_x\bar{H}_y^\top \qquad (6) $$

where r > 0 is a small regularization constant that keeps the estimates invertible. Let T_H = Σ_xx^{-1/2} Σ_xy Σ_yy^{-1/2}, and let T_H = U D Vᵀ denote its SVD. It can be shown [20] that the gradient of the objective of Equation (5) is given by

$$ \frac{\partial \rho}{\partial H_x} = \frac{1}{m-1}\big(2\nabla_{xx}\bar{H}_x + \nabla_{xy}\bar{H}_y\big) \qquad (7) $$

where

$$ \nabla_{xy} = \Sigma_{xx}^{-1/2} U V^\top \Sigma_{yy}^{-1/2}, \qquad \nabla_{xx} = -\tfrac{1}{2}\, \Sigma_{xx}^{-1/2} U D U^\top \Sigma_{xx}^{-1/2} $$

A similar expression can be obtained for the gradient with respect to H_y. These gradients are backpropagated to learn the optimal model parameters θ_1 and θ_2. Note that the gradient ascent, for j = 1, 2, is performed as

$$ \theta_j \leftarrow \theta_j + \eta\, \frac{\partial \rho}{\partial \theta_j} \qquad (8) $$

where η is the learning rate for the gradient ascent.
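The forward computation of this correlation objective, i.e. the sum of singular values of T_H, can be sketched as follows (a minimal NumPy illustration; in practice the gradients are computed by a framework with automatic differentiation, and the ridge terms `rx`, `ry` are our additions for invertibility):

```python
import numpy as np

def dcca_corr(Hx, Hy, rx=1e-4, ry=1e-4):
    """Sum of canonical correlations of two (d, m) network outputs.

    This is the quantity the deep CCA objective maximizes; rx and ry
    are small ridge terms that keep the covariance estimates invertible.
    """
    d, m = Hx.shape
    Hxc = Hx - Hx.mean(axis=1, keepdims=True)   # centre each row
    Hyc = Hy - Hy.mean(axis=1, keepdims=True)
    Sxx = Hxc @ Hxc.T / (m - 1) + rx * np.eye(d)
    Syy = Hyc @ Hyc.T / (m - 1) + ry * np.eye(d)
    Sxy = Hxc @ Hyc.T / (m - 1)

    def inv_sqrt(C):
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    # T_H = Sxx^{-1/2} Sxy Syy^{-1/2}; its trace norm is the objective.
    T = inv_sqrt(Sxx) @ Sxy @ inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False).sum()
```

Identical outputs score close to the full output dimension, while statistically independent outputs score near zero, which is the behaviour the objective rewards.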

B. Deep Multiway CCA
For each of the N data-views {x_n ∈ R^{d_n}}_{n=1}^{N}, the goal of deep MCCA is to derive optimal non-linear transforms such that the transformed vectors are highly correlated. Let f_n(·), for n = 1 to N, represent neural networks with trainable parameters θ_n operating on x_n. The N neural networks are trained to maximize the inter-set correlation defined as:

$$ \rho_{Total} = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \rho\big(f_i(\mathbf{x}_i; \theta_i),\, f_j(\mathbf{x}_j; \theta_j)\big) \qquad (11) $$

Comparing with Equation (3), the correlation cost here (ρ_Total) is the summation of pairwise correlation coefficients. While the two definitions are related, the cost based on the sum of pairwise correlations is more suitable for gradient based optimization. The parameters are obtained as:

$$ \{\theta_n^*\}_{n=1}^{N} = \arg\max_{\{\theta_n\}}\; \rho_{Total} \qquad (12) $$

The backpropagation for each network is similar to the deep CCA model, as described in Equation (7).
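The pairwise correlation cost is straightforward to compute; a minimal sketch (function and variable names are ours) over the transformed views:

```python
import numpy as np

def rho_total(outputs):
    """Sum of pairwise Pearson correlations over N transformed views.

    outputs: list of length-m 1-D arrays, one per view (e.g. the
    encoder output for one projected dimension of each data-view).
    """
    total = 0.0
    N = len(outputs)
    for i in range(N):
        for j in range(i + 1, N):
            # Pearson correlation of views i and j for this batch.
            total += np.corrcoef(outputs[i], outputs[j])[0, 1]
    return total
```

For N identical views the cost reaches its maximum of N(N−1)/2, one unit per pair.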
The proposed model (shown in Figure 1) has multiple autoencoders sharing encoded representations (N autoencoders for the N data-views respectively). Each view x_n forward propagates through the encoder part of its autoencoder, f_n(·; θ_n). All the encoded representations f_n(x_n; θ_n) are concatenated (denoted as y) and propagated to the decoders, f̃_n(·; θ̃_n). This shared encoder-decoder model allows the learning of data-view specific transforms that align the views.
The model is trained to maximize a joint cost function combining the correlation and the negative of the mean square error (MSE) in reconstruction. This cost function is given as

$$ C = \rho_{Total} - \lambda \sum_{n=1}^{N} \mathrm{MSE}\big(\mathbf{x}_n,\, \tilde{\mathbf{x}}_n\big) $$

where ρ_Total is defined by Equation (11), x̃_n denotes the decoder reconstruction of x_n, and MSE(·) is the average squared reconstruction loss. The parameter λ controls the trade-off between maximizing the correlation metric and minimizing the MSE in learning the model parameters.
The model is trained using the N data-views {x_n}_{n=1}^{N} with the cost metric defined above. Note that the correlation loss is independent of the decoder parameters θ̃_n, while MSE(·) is a function of both the encoder parameters θ_n and the decoder parameters θ̃_n. Once the model is trained, each data-view x_n is projected using the encoder f_n(x_n; θ_n).

V. AUDIO-EEG SETUP

A. Datasets
We evaluate our methods on two datasets. The first one is a dataset of EEG responses for speech stimuli, recorded by Liberto et al. [10]. The second dataset is NMED-H [22], which contains EEG recordings for a music listening task.
Speech-EEG dataset: The EEG recordings were collected using 128 channels while the subjects were listening to a male speaker reading snippets of a novel. A Biosemi system, sampled at 512 Hz, was used to collect the EEG data. We perform the same preprocessing steps as described in Cheveigné et al. [13]. Specifically, the EEG data are down-sampled to 64 Hz and processed using noise suppression software [40]. A band-pass filter with a passband of 0.1–12 Hz is applied to the EEG data. At the stimulus side, the speech envelopes, sampled at 44,100 Hz, are squared and smoothed by convolution with a square window. Finally, the stimuli data are downsampled to 64 Hz with a cubic-root compression.
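The stimulus-side envelope processing can be sketched as follows (a simplified illustration only: the 25 ms smoothing window length and the plain decimation in place of a proper resampler are our assumptions, not values specified in the paper):

```python
import numpy as np

def audio_envelope(audio, fs_in=44100, fs_out=64, win_ms=25):
    """Envelope feature sketch following the recipe above: square,
    smooth with a rectangular (square) window, downsample, then
    cube-root compress.  win_ms and the decimation step are assumed.
    """
    sq = np.asarray(audio, dtype=float) ** 2          # instantaneous power
    win = np.ones(int(fs_in * win_ms / 1000))
    win /= win.sum()                                  # moving-average window
    smooth = np.convolve(sq, win, mode='same')        # smoothed power
    step = fs_in // fs_out                            # crude decimation factor
    env = smooth[::step]                              # ~fs_out samples/sec
    return np.cbrt(env)                               # cube-root compression
```

On an amplitude-modulated tone, the resulting envelope tracks the modulator: segments with larger amplitude yield larger envelope values.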
Music-EEG dataset: The NMED-H [41] is an open-source dataset containing EEG responses to naturalistic music: 4 versions (normal, time-reversed, phase-scrambled, and shuffled) of 4 full-length "Bollywood" songs, each approximately 4.5 minutes long. The latter three stimulus versions were chosen to manipulate the temporal features to varying degrees, while the aggregate frequency content of each stimulus was the same. The shuffled version imposes minimal temporal disruption, whereas the phase-scrambled versions were considerably distorted [41].
In the phase-scrambled subset, three stimuli files were not considered in the analysis as the features had discontinuities. The EEG recordings were collected from 125 electrodes at 1 kHz. Each recording is filtered between 0.3–50 Hz and downsampled to 125 Hz. We use the "Clean EEG" recordings, which are cleaned and aggregated on a per-stimulus, per-listen basis. More details on data acquisition and preprocessing are given in Kaneshiro [42], [41].
The stimuli features are extracted as described in Gang et al. [43]. The acoustic features are extracted using the music information retrieval (MIR) toolbox, Version 1.7.2 [44]. From the stimuli, 20 features are extracted in 25 ms analysis windows with a 50% overlap between frames [45], [46]. The 20 features are discussed in Section IX-B as an appendix. Principal component analysis (PCA) is performed to obtain a 1D representation (PC1) from these 20 extracted features. The two individual features, root mean square (RMS) and spectral flux, along with the PC1, are chosen to obtain a 3D representation of the stimuli. The EEG responses are re-sampled to the sampling rate of the acoustic features (80 Hz).

B. CCA Methods
In all our experiments, the linear CCA (LCCA) [13] and linear MCCA (LMCCA) [19] analyses act as the baseline setup for comparison with the deep CCA (DCCA) and the deep multiway CCA (DMCCA) methods. For the multi-subject EEG analysis, the outputs from the multiway CCA are further processed with CCA (either LCCA or DCCA).
1) LCCA: On the 1D preprocessed stimuli data, a dyadic bank of 21 FIR band-pass filters is applied, whose filters are approximately uniformly distributed on a logarithmic scale [13]. At the response end, a PCA is applied to the EEG data that transforms the 128D (or 125D for music data) EEG data to 60D. The filterbank is applied to these 60D EEG data to yield 1260D data. A second PCA transforms them to a 139D subspace. Now, the 21D stimuli data and the 139D EEG data are projected onto a common subspace using CCA transforms. The data are processed (the choice of PCA and the dimensions after each stage) as proposed by Cheveigné et al. [13]. The combination of PCA and filterbank acts as a spatio-temporal filter on the data.
2) DCCA: The neural networks used in the DCCA model have a 2-hidden-layer architecture for each view, with 2038 and 1608 units in the first and second layers respectively, followed by a d-dimensional output layer. The data are processed similarly to the LCCA method, and the final 21D and 139D representations are input to a deep CCA model. Figure 2 describes the LCCA and DCCA methods.
3) LMCCA: The preprocessed EEG responses from N subjects, and a time-lagged version of their common stimuli (of dimension d_s), are provided to a linear MCCA model to obtain denoised representations of each subject's EEG response. Now, each subject's denoised EEG data and the corresponding stimuli can be provided, separately, to the LCCA and DCCA methods to obtain the final representations. This is performed as proposed in Cheveigné et al. [19].
4) DMCCA: The preprocessed EEG responses, along with the common stimuli, are provided to a deep MCCA model to obtain a dD denoised representation of each EEG response.
The architecture of the DMCCA model is shown in Figure 1. The encoder has two hidden layers of 60 units each and an output layer of d units. The decoding part has two hidden layers of 60 and 110 units respectively.
The d_s and d are hyperparameters, and the best values are selected by varying them for both variants of MCCA. More details are discussed in Section VII-A. The outputs from the linear MCCA model are 128D for the speech (125D for the music) dataset. For both MCCA methods, the denoised responses are passed through the filterbank followed by a PCA to generate 139D vectors. The dD stimuli obtained are projected onto a 1D subspace using PCA, followed by the filterbank, resulting in 21D data. These steps ensure that the inputs to the CCA transforms are equivalent in both versions of MCCA.
For intra-subject analysis, the LCCA/DCCA are performed on each subject's data separately. For inter-subject analysis, the LMCCA/DMCCA are performed on data from multiple subjects, followed by a subject-specific LCCA/DCCA method. Thus, we have four combinations: 1) LMLC: LMCCA + LCCA, 2) LMDC: LMCCA + DCCA, 3) DMLC: DMCCA + LCCA, and 4) DMDC: DMCCA + DCCA. Figure 3 shows the four combinations of the MCCA denoising followed by CCA analysis for each subject.

C. Experimental setup
For the speech dataset, the methods are performed on the preprocessed 1D stimuli envelopes and 128D EEG responses. For NMED-H, along with the 1D stimuli envelopes, each dimension of the 3D preprocessed stimuli is also used with the 125D clean EEG recordings.
From the speech dataset, stimulus-response data of 8 randomly chosen subjects are considered for the experiments. The NMED-H dataset contains recordings from 48 subjects and 16 stimuli. The subjects were divided into 16 groups of 12 subjects, with each subject appearing in 4 groups. Each group is presented with 2 trials of a stimulus, which results in each subject listening to 2 trials of 4 different stimuli. In our analysis, we have used all the data available. All 12 subjects available for each stimulus were used in the inter-subject and intra-subject analyses.
We split the data 90%-5%-5% for training, validation, and test respectively. This results in about 155k samples for training and 8.5k samples each for testing and validation, for each subject in the CCA experiments. Similarly, we use 38k samples for training and 2k samples each for testing and validation, for each subject per stimulus in the MCCA experiments.
A leaky ReLU activation function with a negative slope coefficient of 0.1 is used in the DCCA model and the encoder part of the DMCCA model. A linear activation function is used at the output layer of the decoder sections in the DMCCA model. Further, dropout regularization [21], [47] is incorporated in training the deep models to avoid over-fitting in the noisy conditions.
Performance Metric: The primary metric used is the Pearson correlation between the transformed EEG and audio signals on the held-out test set. For each subject, the performance of the LCCA and DCCA methods is measured by the correlation coefficient (ρ) of the two final representations. The LMCCA tries to maximize ρ_ISC, and the DMCCA tries to maximize ρ_Total. For the overall results, since direct averaging of Pearson correlation values is mathematically incorrect, we perform a z-score based averaging, as implemented in the Statsoft software [48].
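The z-score based averaging is, in essence, Fisher's r-to-z transform: each correlation is mapped through atanh, averaged in z-space, and mapped back through tanh (a minimal sketch of this standard procedure; the referenced software implementation may differ in detail):

```python
import numpy as np

def average_correlations(rs):
    """Average Pearson correlations via Fisher's z-transform.

    Direct arithmetic averaging of r values is biased because r is
    not additive; z = atanh(r) is approximately normally distributed,
    so averaging is done in z-space.
    """
    z = np.arctanh(np.asarray(rs, dtype=float))  # r -> z
    return np.tanh(z.mean())                     # mean z -> r
```

For equal correlations the result is unchanged; for mixed values the z-average differs from the naive mean, weighting high correlations more heavily.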
We also use a secondary performance metric based on classification of aligned versus misaligned EEG-audio segments [13]. Here, fixed-length segments of EEG and audio signals that are randomly located are correlated using the models. If the audio and EEG segments are aligned, the model is expected to generate a higher correlation score than when the two signals are misaligned. The correlation scores are analyzed using the sensitivity index (Cohen's d statistic).
Let the means of the matched and mismatched segments' correlation coefficients be μ₁ and μ₂ respectively, and let σ₁² and σ₂² be their respective variances. The Cohen's d is:

$$ d = \frac{\mu_1 - \mu_2}{\sqrt{(\sigma_1^2 + \sigma_2^2)/2}} $$
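A direct NumPy implementation of this effect size, pooling the two variances (function and argument names are ours):

```python
import numpy as np

def cohens_d(matched, mismatched):
    """Cohen's d between matched and mismatched correlation scores:
    d = (mu1 - mu2) / sqrt((s1^2 + s2^2) / 2).
    """
    m1, m2 = np.mean(matched), np.mean(mismatched)
    v1 = np.var(matched, ddof=1)      # unbiased sample variances
    v2 = np.var(mismatched, ddof=1)
    return (m1 - m2) / np.sqrt((v1 + v2) / 2.0)
```

For two unit-variance samples whose means differ by one, d evaluates to exactly 1.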

VI. RESULTS
The results comparing the linear and deep CCA models for intra-subject and inter-subject experiments on the speech-EEG dataset are given in Table I and Table II respectively. The results for the music-EEG dataset are shown in Table III and Table IV. Pairwise one-tailed t-tests are performed on the pairs LCCA-DCCA (Tables I and III) and LMLC-DMDC (Tables II and IV).

A. Speech-EEG dataset results
For the speech-EEG experiments, the cross-validation results (correlation values) over the 20 folds for all the subjects are considered for a pairwise t-test. As seen in Table I, the DCCA improves over the LCCA for all the subjects. The average absolute improvement of DCCA over the LCCA in terms of correlation value is 5%. The improvements are also statistically significant. The inter-subject results are reported in Table II. Here, the inter-subject alignment using the linear method (LMLC) improved over the linear intra-subject correlation model (LCCA) for all subjects except subjects 3 and 4. The inter-subject alignment using deep learning (DMLC/DMDC) improves the correlation scores compared to the intra-subject scores reported in Table I in all cases except subject 4. Further, the deep models consistently improve over their linear counterparts. In particular, the deep multiway CCA approach improves over the linear multiway CCA by an absolute correlation value of 7.4% on average. The improvements are found to be statistically significant (p < 0.025) for 5 out of 8 subjects. The overall aggregate score is found to be statistically significant as well.

B. Music-EEG dataset results
For the LCCA/DCCA methods, the average correlation values for the 48 subjects in the NMED-H dataset are reported in Table III. The results are reported for different music conditions (normal, shuffled, time-reversed, and phase-scrambled) and stimuli features (envelope, PC1, RMS, and spectral flux). The pairwise t-test on the NMED-H dataset shows that all improvements obtained by the deep versions are statistically significant (p < 0.025). The performance of the LCCA and DCCA methods on the PC1 features of the 48 subjects from the NMED-H dataset is shown in Figure 6. The average absolute improvement is about 11% for the DCCA over the LCCA method. For the inter-subject analysis, Table IV shows that the DMDC improves over the LMLC method with an average absolute improvement of 29.3%.

C. Statistical Analysis
In order to measure the significance of the improved correlations of our proposed deep models over the baseline system, we have performed two statistical tests: a) a one-tailed pairwise t-test and b) Cohen's d. The pairwise t-test is performed to objectively quantify the difference in the distribution of the correlation scores from the two methods. Given that the same hypothesis (LCCA versus DCCA in the intra-subject analysis, or LMLC versus DMDC in the inter-subject analysis) is tested on two different datasets (speech-EEG and music-EEG), a compensation is required for multiple comparisons. We use the Bonferroni correction [49]. Thus, a p-value threshold of 0.05/2 = 0.025 is used to check whether the improvements in the correlation are statistically significant on each dataset. The pairwise t-test results comparing the linear and deep models are reported for speech-EEG (Tables I, II) and music-EEG (Tables III, IV).
As mentioned in Section V-C, a classification metric is also used, where audio-EEG segments are classified as aligned or misaligned based on the Pearson correlation measure. The second statistical test, Cohen's d [50], is an effect size used to indicate the standardised difference between two classes (in our case, aligned and misaligned audio-EEG pairs). The d metric quantifies the model's ability to match the corresponding stimulus-response pair based on the correlation value, ρ. The test data is divided into N segments of t seconds each, and the correlation coefficient values ρ are calculated using the linear and the deep methods for both aligned and misaligned segments (speech/audio-EEG pairs). The Cohen's d metric measures the model's ability to separate aligned versus misaligned pairs. The LMLC/DMLC methods are applied to the audio-EEG segments, and the Cohen's d is computed on the resulting correlation scores. The d statistics are presented in Figure 5, separately for the speech-EEG and music-EEG datasets. As seen in Figure 5, the deep model improves over the linear model in all the cases except for the 1-second segments in the speech-EEG data. For longer segments, considerable improvements in the d statistic are observed for the deep models.

A. Impact of Hyperparameters
In this section, we analyze the impact of the hyperparameters of the deep CCA/MCCA models on the correlation metric. The parameters analyzed are: model architecture, dropout percentage, the regularization parameter in DMCCA, and the number of output dimensions in DMCCA. We use a learning rate of 1e−3 and a batch size of 2048 in experiments where these parameters are not varied. Further, the number of time-lags used in the stimulus input is also varied to understand its impact. For initializing the models, we start with multiple random seeds and, before model training, choose the one which gives the best correlation on the validation data. Unless specified otherwise, the values of the parameters d_s and λ are fixed to 60 and 0.1, and the value of d is fixed to 1 for all DCCA models and 10 for all DMCCA models.
1) Dropout: For the speech-EEG dataset, we experiment with dropout percentages from 0–20% in the DCCA/DMCCA models. The correlation values obtained by the DMLC and DCCA methods are shown in Figure 6 (A). When there is no dropout, the model tends to overfit. A similar effect is seen in the DCCA model as well.
2) Batch size: The effect of the batch size is analyzed for the DCCA model. The average correlation values of 6 subjects from the speech dataset, over 20 cross-validation trials, are reported in Figure 6 (B). Given the noisy nature of the data, we find that larger batch sizes (compared to the typical choice of a few hundred in supervised classification settings) improve the final correlation value. The optimal batch size on the validation data is 2048.
3) DCCA output dimension: Just as multiple canonical components can be derived from the data in the linear CCA model, the DCCA model can also be trained with multiple output dimensions. The comparison of the linear and deep CCA models for 5 canonical correlation components is shown in Figure 6 (C) for each subject in the speech-EEG dataset. As seen there, the DCCA model improves over the linear CCA model consistently for all the subjects.

B. Model Architecture
We also experimented with various architecture choices for the DCCA model. The experiments varying the number of hidden layers ($L$) from 2 to 5 and the number of units ($n_L$) in each layer are shown in Figure 7. As seen in this plot, increasing the number of layers degrades the correlation, as the model tends to over-fit. This trend may also be attributed to the lack of sufficient audio-EEG data for each subject. We hypothesize that, with more training data, the deeper models may further improve over the linear models as well as the shallow models.
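For concreteness, a minimal numpy sketch of the encoder family varied here (an MLP with $L$ hidden ReLU layers of $n_L$ units each, with inverted dropout) is given below. The layer sizes are hypothetical; the sketch only illustrates how the parameter count grows with depth, which is consistent with the overfitting trend observed on limited per-subject data:

```python
import numpy as np

rng = np.random.default_rng(1)

def init_mlp(d_in, n_L, L, d_out):
    # He-initialized weights for L hidden layers of n_L units each
    sizes = [d_in] + [n_L] * L + [d_out]
    return [(rng.standard_normal((a, b)) * np.sqrt(2.0 / a), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, X, drop=0.0, train=False):
    H = X
    for i, (W, b) in enumerate(params):
        H = H @ W + b
        if i < len(params) - 1:                  # hidden layers only
            H = np.maximum(H, 0.0)               # ReLU
            if train and drop > 0.0:
                keep = rng.random(H.shape) >= drop
                H = H * keep / (1.0 - drop)      # inverted dropout
    return H

def n_params(params):
    return sum(W.size + b.size for W, b in params)

shallow = init_mlp(60, 256, 2, 10)   # hypothetical sizes
deep = init_mlp(60, 256, 5, 10)
```

Each additional hidden layer adds $n_L^2 + n_L$ parameters, so depth inflates the model size quickly relative to the amount of audio-EEG data available per subject.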

C. Impact of Improved Correlations
The EEG recordings capture various unrelated brain processes along with the response to the stimuli. Thus, only a fraction of the variance in the EEG signal can be explained by its stimulus. This results in the low correlation values reported for many of the linear methods proposed in the past. In this paper, we have explored the application of deep models and illustrated consistent improvements in correlations. Many EEG-based applications would benefit significantly from improved stimulus-response correlations. For example, the improved correlations can help EEG-enabled auditory assistance devices (e.g., hearing aids), as suggested by Cheveigné et al. [13]. In music information retrieval, the performance improvement in EEG decoding systems will enable a better understanding of the perceptual attributes of music. Throughout the study, we have pursued simple features like the envelope. However, audio signals are encoded along several other dimensions, such as pitch, rhythm, zero-crossings, phase, and semantic/linguistic features. Exploring the models with additional features may throw further light on the encoding of these dimensions in the EEG signals. Further, the techniques proposed in this work would be applicable to other brain signals, such as MEG, ECoG and fMRI, as well.

VIII. SUMMARY
In this paper, we have proposed extensions to models that uncover the stimulus-response relationships between auditory signals, like speech and music, and their EEG responses. The models advance linear correlation methods and are proposed for improving single-trial analyses. They pose the problem of finding the optimal transforms to be applied to the stimulus and the response in a deep learning framework, which enables the learning of these transforms using established optimization methods. Using the proposed deep models, we show that the correlations between the stimulus and the response can be significantly improved over the linear methods. Further, the applicability of the proposed models is separately analyzed for the speech and music EEG tasks.

A. Derivation of the deep CCA model gradients
The matrix $T_H$ is defined as $T_H \triangleq \hat{C}_{xx}^{-1/2}\,\hat{C}_{xy}\,\hat{C}_{yy}^{-1/2}$ and its singular value decomposition (SVD) is denoted as $T_H = UDV^T$. The objective function that needs to be maximized is
$$\rho(H_x, H_y) = \|T_H\|_{tr} = \operatorname{tr}\Big(\big(T_H^T T_H\big)^{1/2}\Big),$$
where $\|\cdot\|_{tr}$ denotes the trace norm of the matrix. The partial derivative of the objective function with respect to the matrix $\hat{C}_{xx}$, denoted $\nabla_{xx}$, can be derived as
$$\nabla_{xx} = \frac{\partial\,\rho(H_x, H_y)}{\partial\,\hat{C}_{xx}} = -\frac{1}{2}\,\hat{C}_{xx}^{-1/2}\,U D U^T\,\hat{C}_{xx}^{-1/2}.$$
Similarly, the partial derivative of the objective function with respect to the matrix $\hat{C}_{xy}$, denoted $\nabla_{xy}$, can be derived as
$$\nabla_{xy} = \frac{\partial\,\rho(H_x, H_y)}{\partial\,\hat{C}_{xy}} = \hat{C}_{xx}^{-1/2}\,U V^T\,\hat{C}_{yy}^{-1/2}.$$
The gradient of the objective function with respect to the network output $H_x$ then follows through the covariance estimates. With $\bar{H}_x = H_x - \frac{1}{m} H_x \mathbf{1}\mathbf{1}^T$ denoting the centered output matrix ($m$ being the number of samples), the estimates are $\hat{C}_{xx} = \frac{1}{m-1}\bar{H}_x\bar{H}_x^T + r_x I$ and $\hat{C}_{xy} = \frac{1}{m-1}\bar{H}_x\bar{H}_y^T$. Applying the chain rule through these estimates gives
$$\frac{\partial\,\rho(H_x, H_y)}{\partial\,H_x} = \frac{1}{m-1}\left(2\,\nabla_{xx}\,\bar{H}_x + \nabla_{xy}\,\bar{H}_y\right).$$
An analogous expression holds for the gradient with respect to $H_y$.
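The expression for $\nabla_{xy}$ can be checked numerically. The sketch below (with small random positive-definite covariance matrices; all sizes are hypothetical) compares the analytic gradient of the trace-norm objective against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)

def inv_sqrt(M):
    # inverse square root of a symmetric positive-definite matrix
    w, Q = np.linalg.eigh(M)
    return Q @ np.diag(w ** -0.5) @ Q.T

d = 4                                   # hypothetical dimension
A = rng.standard_normal((d, d))
B = rng.standard_normal((d, d))
Cxx = A @ A.T + d * np.eye(d)           # random SPD covariance estimates
Cyy = B @ B.T + d * np.eye(d)
Cxy = rng.standard_normal((d, d))

def rho(Cxy):
    # trace norm (sum of singular values) of T_H
    T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(T, compute_uv=False).sum()

# analytic gradient: grad_xy = Cxx^{-1/2} U V^T Cyy^{-1/2}
T = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
U, D, Vt = np.linalg.svd(T)
grad_xy = inv_sqrt(Cxx) @ U @ Vt @ inv_sqrt(Cyy)

# central finite differences over each entry of Cxy
eps = 1e-6
num = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        num[i, j] = (rho(Cxy + E) - rho(Cxy - E)) / (2 * eps)
```

The two gradients agree to within finite-difference accuracy, provided the singular values of $T_H$ are distinct and nonzero (true almost surely for random matrices).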

B. Acoustic features used for NMED-H Dataset
The 20 acoustic features are extracted from the NMED-H dataset as discussed in the baseline work by Gang et al. [43]. The features are extracted using the MIR toolbox provided by Lartillot et al. [44]. The acoustic features are calculated as follows.
Let $M_t[f]$ represent the magnitude of the discrete Fourier transform (DFT) of a given audio signal $m_t(n)$ at frame instant $t$ and frequency bin $f$, and let the number of frequency bins be $F$.
1) Zero Crossing Rate: The number of sign changes of the audio signal $m_t(n)$ within the frame.
2) Spectral Centroid: The first-order moment of the DFT magnitudes, given as
$$SC_t = \frac{\sum_{f=1}^{F} f\, M_t[f]}{\sum_{f=1}^{F} M_t[f]}.$$
3) High/Low Energy Ratio: The ratio of the highest to the lowest magnitudes in $M_t[f]$.
4) Spectral Spread: The standard deviation of $M_t[f]$ in the frequency domain.
5) Spectral Roll-off: At a given instant, the roll-off $R_t$ is measured as the frequency below which 85% of the magnitude of the Fourier transform is concentrated.
6) Spectral Entropy: The relative Shannon entropy of the normalized magnitude $M_t[f]$.
7) Spectral Flatness: Indicates whether the magnitude distribution in the frequency domain is smooth or spiky. It is measured as the ratio between the geometric mean and the arithmetic mean of the magnitudes in the frequency domain at each instant.
More details about these features are available in the primer of the music information retrieval (MIR) toolbox by Lartillot et al. [44].
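A minimal numpy sketch of a few of these frame-level features follows. The Hann window, frame length, and small ε floors are assumptions for illustration; the MIR toolbox's exact windowing and normalization may differ:

```python
import numpy as np

def frame_features(frame, sr):
    """Simplified frame-level spectral features (illustrative only)."""
    # zero crossing rate: sign changes in the raw time-domain frame
    zcr = int(np.sum(np.abs(np.diff(np.sign(frame))) > 0))
    # Hann window (an assumption) before taking DFT magnitudes M_t[f]
    M = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    p = M / (M.sum() + 1e-12)                       # normalized magnitudes
    centroid = float(np.sum(freqs * p))             # first spectral moment
    spread = float(np.sqrt(np.sum(p * (freqs - centroid) ** 2)))
    # roll-off: frequency below which 85% of the magnitude lies
    rolloff = float(freqs[np.searchsorted(np.cumsum(p), 0.85)])
    # relative Shannon entropy of the normalized magnitudes, in [0, 1]
    entropy = float(-np.sum(p * np.log(p + 1e-12)) / np.log(len(p)))
    # flatness: geometric mean over arithmetic mean of the magnitudes
    flatness = float(np.exp(np.mean(np.log(M + 1e-12))) /
                     (np.mean(M) + 1e-12))
    return dict(zcr=zcr, centroid=centroid, spread=spread,
                rolloff=rolloff, entropy=entropy, flatness=flatness)

sr = 8000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 440.0 * t)                    # peaky spectrum
noise = np.random.default_rng(3).standard_normal(1024)  # broadband

f_tone, f_noise = frame_features(tone, sr), frame_features(noise, sr)
```

As expected, the pure tone yields a centroid and roll-off near 440 Hz with low flatness and entropy, while broadband noise yields flatness close to 1 and high entropy.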