Deep Latent Fusion Layers for Binaural Speech Enhancement

This work addresses the problem of enhancing speech in binaural hearing scenarios. Specifically, we present a method to improve binaural noise reduction by integrating latent features produced by monaural speech enhancement algorithms through the use of “fusion layers.” These layers perform Hadamard (element-wise) products between tensors representing latent representations at the same processing stage, mimicking the physiological excitatory and inhibitory mechanisms of the binaural hearing system. They draw inspiration from multi-task learning techniques, which share model weights across models that handle interconnected tasks. This study first presents a general fusion model and demonstrates its ability to fit synthetic data better than independent linear models, equalize activation variance between learning modules, and exploit input data redundancy to reduce the training error. We then apply the concept of fusion layers to enhance speech in binaural listening conditions. The proposed method shows promise for improved noise reduction compared to other feature-sharing approaches. The study also suggests that including fusion can enhance predicted speech intelligibility and quality, but that fusing too many features may reduce predicted speech intelligibility. Furthermore, the results suggest that fusion layers should share parameterized (learned) latent representations, rather than deterministic representations, to effectively use the information from each listening side. Overall, this study highlights the potential of sharing information between speech enhancement modules through deep fusion layers to improve binaural speech enhancement while keeping the number of trainable parameters constant and improving generalization.

In this work, we address the problem of speech enhancement in binaural listening by introducing a simple weight-sharing mechanism between two monaural speech enhancement algorithms.
Commonly, deep learning models are trained to perform one task at a time. For example, in image processing, a deep neural network (DNN) can be trained to classify images between a set of classes or to segment particular objects of interest within images (e.g., [3], [4], [5]). In the context of speech processing, DNNs can be trained to recognize the words in speech sentences from the raw audio (e.g., [6], [7], [8]), or to automatically remove the unwanted components of a corrupted speech signal, such as noise or other speakers (e.g., [9], [10], [11], [12]). These approaches work generally well, but they may ignore potentially rich sources of information contained in real-world problems. For instance, speech enhancement systems improve noise reduction performance when also relying on visual feedback, giving rise to audio-visual speech enhancement [13]. Here is where multi-task learning (MTL) comes into play.
MTL is a subset of deep learning techniques in which multiple learning tasks are solved at the same time while exploiting similarities and differences between them. This technique is generally the result of sharing parameters between different models [14], [15], [16]. MTL can provide the models with higher generalization capabilities by leveraging the domain-specific information contained in the training signals of related tasks. It does this by training tasks in parallel while sharing latent representations of the input data. This method can be used, for example, to identify an object within an image, recognize the overall scene and generate a verbal caption for it (e.g., [17], [18]). Also, for speech processing, MTL can be used to improve speech activity detection (e.g., [19], [20]).
Much of the current deep learning research has focused on coming up with better architectures, and it is not different for MTL. Actually, architecture plays possibly an even larger role in MTL because of the number of possibilities that one has to tie multiple tasks together. In other words, the way the parameter sharing between the networks is performed is not obvious. In fact, there is research devoted to finding optimal latent multi-task architectures [21], [22]. However, simple approaches such as cross-stitch networks that learn linear combinations of latent representations between the models have proven to be successful in generalizing into multiple tasks [23], [24]. In this work, we present a simple weight-sharing method to perform binaural speech enhancement.
A healthy human auditory system is excellent at isolating target signals in acoustically challenging conditions. This is due to its ability to exploit the acoustic inputs captured by each ear and to centrally compare the features contained in them, which is known as binaural hearing [25], [26]. Binaural speech enhancement has been an active research problem for some time (e.g., [27], [28], [29], [30]). More recently, DNNs have proven successful at performing speech separation in binaural listening by sharing acoustic binaural features. For example, previous research has used feature concatenation at the input level to perform binaural speech enhancement (e.g., [2], [31]). These methods have been shown to improve speech enhancement performance compared to independent models; however, they rely on explicit spectral feature extraction and are not necessarily motivated by the human binaural auditory system.
Although the exact physiological mechanisms by which the binaural hearing system exploits different acoustic cues are not fully understood [32], [33], there have been attempts to develop computational models that explain empirically observed human binaural hearing abilities, such as the equalization-cancellation model [34], [35]. This model explains binaural masking level differences through relative delay compensation followed by subtraction of particular acoustic features captured by each ear to attenuate the interfering noise. In this work, we propose DNNs that, although they do not perform the same operations as the equalization-cancellation model, may learn to combine latent features to emulate the neural excitation and inhibition processes that occur in the brain stem during binaural acoustic processing [33].
Inspired by the physiological excitatory and inhibitory mechanisms that occur in the binaural hearing system [36], we investigate the influence that sharing the latent representations of two single-channel end-to-end speech enhancement DNNs has on the enhancement of binaural noisy speech signals. The latent representations are shared through fusion layers that apply element-wise (Hadamard) product operations to each of the features contained in them. These layers are designed to introduce non-linearities into the learning model that allow better fitting of the training data while improving generalization, without affecting the number of trainable parameters. We expect that the fused models will emphasize latent target feature representations in the fused layers by canceling unwanted noisy elements contained in the input audio signal, which should also decrease the layer activation variance. Here we extend a previous study presented at the 2021 Clarity speech enhancement challenge [37] by formalizing the concept and by analyzing the effect of input data correlation, latent activation variance, and encoding methods.
This work proposes a method for improving binaural speech enhancement by combining latent representations generated by DNNs using "fusion layers." These layers perform element-wise (Hadamard) products between tensors representing latent representations at the same processing stage, inspired by the physiological excitatory and inhibitory mechanisms of the binaural hearing system. The proposed method shows potential for better noise reduction compared to other data merging methods, such as spectral feature concatenation, and for improving predicted speech intelligibility and quality. However, fusing too many features can have a negative impact on predicted speech intelligibility. This highlights the need for caution when using fusion to prevent excessive degradation of the output signals.
The rest of this manuscript is organized as follows. Section II describes the method. The experimental results are presented in Section III, and Section IV concludes this manuscript.

A. General Fused Model
The main aspect we aim to investigate in this study is the effect that sharing information between deep learning models has on data fitting and generalization performance. We propose to share this information by means of fusion layers that apply element-wise product operations to specific latent representations at different stages of data processing. We first describe a general fused model to formalize the notation used throughout the manuscript.
Let $Y_m \in \mathbb{R}^{D_L}$, where $D_L$ is the dimensionality of the output tensors, be the output tensors computed by a set of learning models $\Omega_m(\cdot)$ for a given set of input tensors $X_m \in \mathbb{R}^{D_0}$, where $D_0$ is the dimensionality of the input tensors, $m = \{1, \ldots, M\}$, and $M$ is the number of DNNs. Each of the models contains $L$ learning modules (i.e., layers, multi-layer perceptrons, etc.) that apply a function $\omega_{l,m}(\cdot)$ to transform their input tensor into a latent representation of it, i.e., $X_{l,m} = \omega_{l,m}(X_{l-1,m})$, $l = \{1, \ldots, L\}$ (note that for the input and output tensors, the index $l$ is omitted).
At this point, we introduce the fusion layer. This layer is designed to share information between the different models by means of an element-wise product of the latent representations at different stages of the processing. Let $\rho(\cdot)$ be the Hadamard product operator. The output of the fusion layers is represented by tensors $\chi_{l,m} = \rho(X_{l,m}, \Lambda_{l,m})$, where $X_{l,m}$ is the output of the learning module $(l, m)$, and $\Lambda_{l,m}$ is the set of tensors that are fused at layer $(l, m)$ with $X_{l,m}$, such that $\Lambda_{l,m} := \{X_{l,m'} \mid m' \neq m \wedge 1 \leq m' \leq M\}$. Here, the direct path without fusion is indicated by $\Lambda_{l,m} = \{J_l\}$, with $J_l \in \mathbb{R}^{D_l}$ an all-ones tensor and $D_l$ the output dimensionality of layer $l$. In this case $\chi_{l,m} = X_{l,m}$.
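As an illustration of the notation above, the following is a minimal NumPy sketch of a fusion layer, not the authors' implementation. The function name `fuse` and the index-list encoding of the fusion sets are assumptions made here for clarity; an empty fusion set plays the role of the all-ones tensor $J_l$ (direct path).

```python
import numpy as np

def fuse(latents, fusion_sets):
    """Apply one fusion layer to the M latent tensors of a processing stage.

    latents     : list of M arrays with identical shape (X_{l,1}, ..., X_{l,M}).
    fusion_sets : list of M index lists; fusion_sets[m] holds the indices
                  m' != m whose latents are multiplied into model m.
                  An empty list is the direct path (multiplying by all-ones).
    """
    fused = []
    for m, x in enumerate(latents):
        out = x.copy()
        for other in fusion_sets[m]:
            if other == m:
                raise ValueError("a model cannot be fused with itself")
            out = out * latents[other]  # Hadamard (element-wise) product
        fused.append(out)
    return fused

# Two models (M = 2), e.g. left and right listening sides (toy values).
x_left = np.array([1.0, 2.0, 3.0])
x_right = np.array([0.5, -1.0, 2.0])

# Fully fused stage: each side is multiplied by the other side.
chi_left, chi_right = fuse([x_left, x_right], [[1], [0]])

# Direct path: no fusion, outputs equal the inputs.
direct_left, direct_right = fuse([x_left, x_right], [[], []])
```

Note that the layer has no trainable parameters of its own, which is consistent with the statement that fusion does not change the number of trainable parameters.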
A general deep fusion model is shown in Fig. 1. In this graph, learning modules and fusion layers are indicated by black and white vertices, respectively, whereas the flow of tensors is indicated by directed edges. This model can be described compactly in matrix notation through the deep latent fusion matrix $\Delta$, defined for each fusion set $\Lambda_{l,m} \in \mathbb{R}^{D_l}$ in (1).
The fusion layers presented here have three purposes: 1) to introduce non-linearities into the model in a controlled way; 2) to leverage input feature redundancy (i.e., correlations) to improve data fitting; and 3) to act as a channel through which the gradients back-propagate, reducing the activation variance between learning modules and improving generalization on unseen data [38].

B. Fully Fused Linear Models
To investigate the effects that the fusion layers have on a specific model, we simplify the generic fused model by assuming that all learning modules (i.e., fully connected layers) are linear and that the input tensors are vectors $X_m \in \mathbb{R}^{1 \times T}$, where $T$ can be interpreted as the number of time steps. This allows us to assess how non-linearities are introduced by the interconnection of the independent models, characterize how the input data correlation affects the data fitting, and assess how the variance of the layer activations is impacted. The general model shown in Fig. 1 without any fusion layers will be referred to as "independent" (i.e., $\Lambda_{l,m} = \{J_l\}\ \forall\, l, m$). Each of the models contains $L$ layers (i.e., the learning modules) consisting of $n_l$ parameters. Activation functions for each of the layers are defined by $\varphi_{l,m}(\cdot)$, $\forall\, l, m$. The output at layer $l$ for model $m$ is given by $X_{l,m} = \omega(X_{l-1,m}; w_{l,m}, b_{l,m}) = \varphi_{l,m}(X_{l-1,m} w_{l,m} + b_{l,m})$, where $w_{l,m} \in \mathbb{R}^{n_{l-1} \times n_l}$ and $b_{l,m} \in \mathbb{R}^{1 \times n_l}$ are the weights and biases, respectively. Assuming that all activations are linear, the output of each layer and model will satisfy $\partial Y_m(X_{l,m}) / \partial X_{l-1,m} = C_{l,m} \in \mathbb{R}$, i.e., a constant. Hence, every model $m$ reduces to a linear regression.
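The claim that a stack of linear layers reduces to a linear regression can be checked numerically. The sketch below is illustrative only: the layer sizes and random weights are arbitrary choices made here, and each time step is treated as one scalar sample.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n1, n2 = 5, 4, 3  # hypothetical sizes: T samples, two hidden widths

# Three linear layers (identity activation): 1 -> n1 -> n2 -> 1.
shapes = [(1, n1), (n1, n2), (n2, 1)]
params = [(rng.normal(size=s), rng.normal(size=(1, s[1]))) for s in shapes]

def forward(x, params):
    """Apply X_l = X_{l-1} w_l + b_l for every layer."""
    for w, b in params:
        x = x @ w + b
    return x

def collapse(params):
    """Fold all layers into one affine map y = x * w_eff + b_eff."""
    w_eff, b_eff = params[0]
    for w, b in params[1:]:
        w_eff, b_eff = w_eff @ w, b_eff @ w + b
    return w_eff, b_eff

x = rng.uniform(0.0, 1.0, size=(T, 1))
w_eff, b_eff = collapse(params)
y_deep = forward(x, params)
y_flat = x @ w_eff + b_eff  # a single slope and intercept
```

The slope `w_eff` is the constant $C$ mentioned above: the derivative of the output with respect to the input does not depend on the input.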
1) Generating Non-Linear Models Through Fusion: Now let us define a fused model where the outputs of all learning modules at a given layer are multiplied with each other, that is, $\Lambda_{l,m} := \{X_{l,m'}\ \forall\, l \wedge m' \neq m\}$. We introduce two fusion modalities, namely side-wise fusion and depth-wise fusion. These two ways of making the models interact have different effects on the non-linearities introduced and on how latent information is transmitted throughout the models. They are described below.
Side-wise fusion level is defined as the size of the fusion set, that is, $|\Lambda_{l,m}|$ (where $|A|$ represents the cardinality of a set $A$). In general, the fusion output at layer $l$ in a fully fused model (side-wise fusion level of $|\Lambda_{l,m}| = M - 1$) is given by the chained Hadamard product in (2), $\chi_{l,m} = X_{l,m} \odot \bigodot_{m' \neq m} X_{l,m'}$. This fusion operator causes the $M$ models to no longer be independent, introducing non-linearities at the output of a given learning module $l$ whose leading order term (LOT) is given by (3).
Depth-wise fusion level is here defined as the number of fusion operations that precede the deepest fusion layer. It occurs in models with multiple learning modules (i.e., deep multi-layer models) that include deeper processing stages to increase the order of the modeled function. If we consider a fully fused linear model, the fusion output of layer $l$ can be written as (2). At layer $L - 1$, the output of the fusion layer depends not only on the side-wise fusion operation but also on the previous latent representations. This output can be written as a function of the previous fusion operations, as in (4), where $L$ is the number of learning modules that each model contains. In this case, the non-linearities introduced at the output of a given learning module are such that the LOT is given by (5).
It is important to note that the special case where $M = 1$ leads to a model with no fusion operations, where $|\Lambda_{l,1}| = 0\ \forall\, l$, and the output reduces to a linear regression.
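The growth of the polynomial order can be sketched as follows. This is an illustrative derivation under simplifying assumptions made here (scalar and identical inputs $X_m = X$, purely linear modules, generic non-zero coefficients $a_m$, and a fusion layer after each of the first $L-1$ modules); it is not necessarily the form of (3)-(5) in the original manuscript, although it agrees with the linear, quadratic, and quartic fits reported for the synthetic experiment.

```latex
% Assumptions: scalar identical inputs X_m = X, linear modules, a_m \neq 0.
\begin{align*}
X_{1,m} &= a_m X + b_m, \\
\chi_{1,m} &= \prod_{m'=1}^{M} \left( a_{m'} X + b_{m'} \right)
  \quad\Rightarrow\quad \deg \chi_{1,m} = M, \\
\deg \chi_{k,m} &= M \cdot \deg \chi_{k-1,m} = M^{k}
  \quad \text{(a linear module preserves the degree; fusion multiplies $M$ such terms)}, \\
\deg \tilde{Y}_m &= M^{L-1}
  \quad \text{(fusion after each of the first $L-1$ modules)}.
\end{align*}
% Checks: M = 1 gives degree 1 (linear regression);
%         M = 2 with one fusion gives degree 2 (quadratic);
%         M = 2, L = 3 gives degree 4 (quartic).
```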

A. Experiment 1: Study on Synthetic Data
In this experiment, our objective is to examine the impact of the fusion operation on basic regression problems using an artificially generated dataset. We partition this experiment into two sub-experiments. The first demonstrates empirically that the operation presented in (2) introduces non-linearities. In the second sub-experiment, we explore the trade-off between the correlation of the input data in each sub-model and its fitting capabilities.
Model: In this experiment we keep the number of sub-models at $M = 2$ (as shown in Fig. 2). All learning modules are fully connected layers with linear activation functions. The input and output layers of all sub-models consist of one single unit, and the number of units in each of the hidden layers is specified by $n_l$, for which we tested $n_l = \{32, 64, 128, 256\}$.
Dataset: The dataset for this experiment was artificially generated by creating input vectors with elements sampled from uniform random distributions. Because we keep the number of models at $M = 2$, two input vectors were created, $X_1 \sim \mathcal{U}(0, 1)$ and $X_2 \sim \mathcal{U}(0, 1)$, containing 500 samples each ($T = 500$; see Fig. 3, first panel). From the input data, we generated a non-linear output for each sub-model ($Y_1$ for sub-model 1 and $Y_2$ for sub-model 2) as given in (6), where $X_{n1}$ and $X_{n2}$ are noise samples with a maximum amplitude of 0.3, and $d$ is a multiplicative factor that controls the amount of correlation at the input ($d = 0$ for fully correlated inputs, i.e., identical input signals, and $d = 1$ for fully uncorrelated inputs).
Loss function: To fit the artificial training data to the target functions described in (6), we minimized the mean-squared error (MSE) between the predicted output $\tilde{Y}$ and the target $Y$. The MSE computed over $n$ samples is defined as:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \tilde{Y}_i \right)^2. \qquad (7)$$
Training: The models were trained for a maximum of 100 epochs in batches of 10 samples. The initial learning rate was set to 1e-3. The learning rate was halved if the validation error did not improve during 3 consecutive epochs, early stopping with a patience of 5 epochs was applied as a regularization method, and only the best-performing model was saved. For the model optimization, Adam [39] was used to minimize the MSE (see (7)) between the estimated and true outputs.
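The training schedule described above (halving the learning rate after 3 epochs without improvement, stopping after 5, keeping only the best model) can be sketched in plain Python. The function `simulate_schedule` and the validation-loss values are hypothetical, and the exact counter behavior of the framework callbacks used by the authors may differ from this simplified logic.

```python
def simulate_schedule(val_losses, lr=1e-3, lr_patience=3,
                      stop_patience=5, max_epochs=100):
    """Replay a list of validation losses through the schedule.

    Returns the epoch of the best (checkpointed) model and the learning
    rate in effect at the end of every epoch that was actually run.
    """
    best = float("inf")
    best_epoch = None
    since_best = 0        # epochs since the best validation loss
    since_lr_change = 0   # non-improving epochs since the last LR update
    history = []
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        if loss < best:
            best, best_epoch = loss, epoch  # only the best model is saved
            since_best = 0
            since_lr_change = 0
        else:
            since_best += 1
            since_lr_change += 1
            if since_lr_change >= lr_patience:
                lr *= 0.5                   # halve the learning rate
                since_lr_change = 0
        history.append(lr)
        if since_best >= stop_patience:     # early stopping
            break
    return best_epoch, history

# Hypothetical validation curve that stops improving after epoch 2.
losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76]
best_epoch, lr_history = simulate_schedule(losses)
```

In this toy run the learning rate is halved once and training stops before the full list of losses is consumed.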

1) Visual Intuition:
An illustrative example of how the output of a model of size $n_l = 64$, for $l = \{2, 3\}$ (see Fig. 2), is affected by the addition of fusion layers is shown in Fig. 3. The first panel shows the raw data generated by (6). The second panel shows the data fitted by an independent model. The third panel shows the non-linearity introduced by a model using a side-wise fusion level of 1 and a depth-wise fusion level of 0 (i.e., a polynomial of order 2). Finally, the last panel shows the fit obtained by a fully fused model with a side-wise fusion level of 1 and a depth-wise fusion level of 1, which yields a quartic polynomial regression.
a) Independent model: The data regressions obtained with this model can be seen to be linear for both predicted outputs (Fig. 3, second panel). In this model, the two sub-models shown in Fig. 2 are disconnected, that is, latent representations at any stage are independent of each other. This is equivalent to having a side-wise and depth-wise fusion level of zero. Because no fusion layers are present throughout the model, we can apply (5) for $M = 1$, obtaining outputs that satisfy $\partial \tilde{Y}_{1,2} / \partial X = C_{1,2} \sim \mathcal{O}(n^0)$, i.e., linear regressions.
b) Single fusion model: The regressions produced by this model display a quadratic trend in both predicted outputs, as depicted in the third panel of Fig. 3. In this case, we fuse the latent representations of the model between two fully connected layers (note that it does not matter whether the fusion is placed between $l = 1$ and $l = 2$ or between $l = 2$ and $l = 3$, because of the symmetry of the model, that is, all learning modules have the same dimensionality). The fusion operation performed in this case is a side-wise fusion and not a depth-wise fusion. Therefore, we apply (3) for $M = 2$, which satisfies $\partial \tilde{Y}_{1,2} / \partial X = f(X)_{1,2} \sim \mathcal{O}(n^1)$, as can be seen from the unique global minimum of each regression in the third panel of Fig. 3. Note that in this example both quadratic functions are convex, because most raw target data points are located in the top half of the panel. One would expect the second derivative of the output regressions to change sign if the target data were vertically flipped around $Y = 0.5$.
c) Double fusion model: In this case, the model's regressions are quartic functions for both outputs, as illustrated in the fourth and last panel of Fig. 3. This model has a depth-wise fusion level of one, which in this particular model corresponds to a fully fused model. For this reason, we can apply (5) for $M = 2$ and $L = 3$, which satisfies $\partial \tilde{Y}_{1,2} / \partial X = f(X)_{1,2} \sim \mathcal{O}(n^3)$, as can be seen from the three turning points in each regression.
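The three cases above can be reproduced in a small NumPy sketch that uses random (untrained) weights, so it only illustrates the function class each topology can represent, not the fits in Fig. 3. The hidden size, the random seed, and the helper names are assumptions made here; both sub-models receive the same input.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8  # hypothetical hidden size (the paper tests 32 to 256)

def make_params():
    """Random weights and biases for three linear layers: 1 -> n -> n -> 1."""
    shapes = [(1, n), (n, n), (n, 1)]
    return [(rng.normal(size=s), rng.normal(size=(1, s[1]))) for s in shapes]

def run(x, p1, p2, fuse_after):
    """Two linear sub-models with Hadamard fusion after the listed layers."""
    h1, h2 = x, x  # identical inputs (X_1 = X_2)
    for layer, ((w1, b1), (w2, b2)) in enumerate(zip(p1, p2), start=1):
        h1, h2 = h1 @ w1 + b1, h2 @ w2 + b2
        if layer in fuse_after:
            h1, h2 = h1 * h2, h2 * h1  # each side multiplied by the other
    return h1[:, 0]

def rel_residual(x, y, degree):
    """Worst-case error of the best polynomial fit, relative to max |y|."""
    coeffs = np.polyfit(x, y, degree)
    return np.max(np.abs(np.polyval(coeffs, x) - y)) / np.max(np.abs(y))

x = np.linspace(0.0, 1.0, 200)
p1, p2 = make_params(), make_params()
y_ind = run(x[:, None], p1, p2, fuse_after=())         # independent
y_single = run(x[:, None], p1, p2, fuse_after=(1,))    # single fusion
y_double = run(x[:, None], p1, p2, fuse_after=(1, 2))  # double fusion
```

Fitting polynomials of increasing degree with `rel_residual` shows that the independent output is matched by a line, the single-fusion output needs a quadratic, and the double-fusion output needs a quartic.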
2) Experiment 1.1: In this experiment, we aim to investigate the effect that the non-linearities introduced by the fusion mechanisms have on the training error. We do this by comparing the output errors obtained by the linear independent and fused models. For this experiment, the input vector fed to each sub-model is identical ($X_1 = X_2$). This experiment may reveal whether one can profit from adding non-linearities in a controlled way through fusion, compared to a completely linear model. Fig. 4 shows box plots of the MSE improvement given by the fused models with linear activations, computed as $\delta\mathrm{MSE} = \mathrm{MSE}_{\mathrm{ind}} - \mathrm{MSE}_{\Lambda}$, where $\mathrm{MSE}_{\mathrm{ind}}$ and $\mathrm{MSE}_{\Lambda}$ represent the MSE produced by the independent and fused model, respectively. $\delta\mathrm{MSE}$ is shown for the front, back, and double fusion.
Fig. 4. Box plots of $\delta$MSE for the front, back, and double fusion models. The top and bottom extremes of the boxes indicate the $Q_3 = 75\%$ and $Q_1 = 25\%$ quartiles, respectively. The box length is given by the interquartile range (IQR), used to define the whiskers that show the variability of the data above the upper and lower quartiles (the upper whisker is given by $Q_3 + 1.5 \cdot \mathrm{IQR}$ and the lower whisker is given by $Q_1 - 1.5 \cdot \mathrm{IQR}$ [40]). Black dots indicate observations that fall beyond the whisker range (outliers).
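The quantities reported in Fig. 4 can be computed as in the following sketch. The sample values are made up for illustration, and the quartiles use NumPy's default linear interpolation, which may differ slightly from the plotting library used by the authors.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean-squared error between target and prediction."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(diff ** 2))

def delta_mse(y_true, y_ind, y_fused):
    """MSE improvement of the fused model: positive means fusion helps."""
    return mse(y_true, y_ind) - mse(y_true, y_fused)

def box_stats(values):
    """Quartiles, whisker limits (1.5 * IQR rule) and outliers."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {"q1": q1, "median": median, "q3": q3, "iqr": iqr,
            "lower": lower, "upper": upper, "outliers": outliers}

# Toy example (hypothetical numbers, not results from the paper).
improvement = delta_mse([0, 1, 2], [1, 1, 1], [0, 1, 1])
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```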

3) Experiment 1.2:
In this experiment, our goal is to explore the sensitivity of the proposed attention mechanism to variations between the inputs of each sub-model. To achieve this, we calculate the errors at the outputs of both the independent and fused models as a function of the correlation of the input data. This investigation is important because the motivation for employing fusion layers in binaural speech enhancement systems is the correlation between hearing sides. Our aim is to determine a potential threshold below which fusion might not provide significant benefits in fitting the training data distribution. Fig. 5 shows a dot plot together with its polynomial regression, showing how the input data correlation affects the training $\delta$MSE. It can be seen that for the fully fused model, the performance is proportional to the input data correlation, whereas for the single fused models, the performance reaches its maximum at around 75% correlation. Note that the error of the fully fused models is smaller than the error of the independent models.
We next consider the variance of the activations at the fusion layers (see Fig. 2) as well as at their independent counterparts. The variance of the activations contained in each layer is defined as:
$$\mathrm{Var}(w_{l,m}) = E\left[ \left( w_{l,m} - \bar{w}_{l,m} \right)^2 \right], \qquad (8)$$
where $E[\cdot]$ is the expected value operator, $w_{l,m}$ is the tensor containing all of the learned weights in layer $l$ in model $m$, and $\bar{w}_{l,m}$ is the average activation value in layer $l$ and model $m$.
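The variance defined above, and the $\log_{10}$ scale used for plotting, can be computed as in this minimal sketch. It assumes the tensor is flattened and the expectation is taken over all of its elements, which is an interpretation made here; the example values are arbitrary.

```python
import numpy as np

def activation_variance(tensor):
    """E[(w - mean(w))^2] over all elements of a layer tensor."""
    w = np.asarray(tensor, dtype=float).ravel()
    return float(np.mean((w - w.mean()) ** 2))

def log10_variance(tensor):
    """Variance in the log10 domain, as used for visualization."""
    return float(np.log10(activation_variance(tensor)))

# Toy layer tensor (hypothetical values).
layer = np.array([[1.0, 2.0], [3.0, 4.0]])
var_layer = activation_variance(layer)
log_var_layer = log10_variance(layer)
```

This is the population variance, so it matches `np.var` with its default arguments.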
To assess how variance changes across models, we train an independent model and all possible fused models (from Fig. 2, using only $\Lambda_{1,1}$ and $\Lambda_{1,2}$, only $\Lambda_{2,1}$ and $\Lambda_{2,2}$, both pairs of sets, or none of them) 50 times using different random initialization seeds. This gives an idea of how the activation variance is affected by the fusion operation. We also measure the variance for both correlated and uncorrelated input data to remove possible training bias. Fig. 6 shows violin plots of the activation variance (in the $\log_{10}$ domain) for the front and back fusion layers in the different linear and fused models. Box plots are also overlaid on the violin plots to show the mean, median, and overall locality of the data.
Fig. 6. Violin plots indicating the activation variance across predictions for the front and back fusion layers (see Fig. 2) in the different models for generated synthetic data. Data are plotted on a logarithmic scale for visualization purposes. The black horizontal bars within each box represent the median for each condition, the circle-shaped marks indicate the mean improvement, and the top and bottom extremes of the boxes indicate the $Q_3 = 75\%$ and $Q_1 = 25\%$ quartiles, respectively. The box length is given by the interquartile range (IQR), used to define the whiskers that show the variability of the data above the upper and lower quartiles (the upper whisker is given by $Q_3 + 1.5 \cdot \mathrm{IQR}$ and the lower whisker is given by $Q_1 - 1.5 \cdot \mathrm{IQR}$ [40]). Black dots indicate observations that fall beyond the whisker range (outliers).
The violin plots show that fusion reduces the range of activation values, especially in the back layers (see in Fig. 6 how the violin plots show less deviation from the mean when the fusion operation is added). It can also be seen that fusion equalizes the variance not only between sides but also between the front and back layers, as depicted by the violin plots corresponding to the double fusion model. An equalized and balanced variance throughout the model is relevant because it helps ensure that all learning modules learn at the same rate [38].

B. Experiment 2: Ablation Study
In this experiment, we investigate the effect that fusion layers have on noise reduction performance in the context of end-to-end speech enhancement.
Model: The fusion method is investigated in the context of a well-known fully-convolutional time-domain audio separation network (Conv-TasNet [9], which we will refer to as "TasNet" for simplicity). In this ablation experiment we analyze the effect of introducing and/or removing fusion layers between specific latent representations of the input signals. The binaural system relies on two end-to-end TasNet speech enhancement models, each consisting of three processing stages, as shown in Fig. 7: an encoder, a separator (a temporal convolution module (TCN) and a mask estimator), and a decoder. The encoder extracts features from the input audio signal. These are passed to the separator, which estimates a mask to remove noisy elements of the input audio, and the enhanced speech is resynthesized by the decoder. The range of hyperparameters used is presented in detail in Table I. The implementation was done in TensorFlow 2.0 [41], and the code for training and evaluating can be found online.

TABLE I HYPERPARAMETERS USED TO TRAIN THE DEEP LEARNING MODELS
Dataset: The speech material used for the evaluation of the speech enhancement models was obtained from the TIMIT acoustic-phonetic Continuous Speech Corpus [42] (consisting of a set dedicated to training and another set for testing). TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions as well as a 16-bit, 16-kHz speech waveform file for each utterance. The speech data contained in this corpus consists of fluent spoken sentences with a total duration of 18 hours.
The interfering noisy signals were all obtained from the DEMAND collection of multi-channel recordings of acoustic noise in diverse environments [43]. The environmental noises recorded to create this dataset are split into six categories; four are indoor noises and the other two are outdoor recordings. The indoor environments are further divided into domestic, office, public, and transportation; the open-air environments are divided into streets and nature. There are 3 environment recordings per category.
The training set was obtained by mixing all of the training data contained in the TIMIT speech dataset with 50% of the DEMAND noise signals. The validation dataset, used to monitor the models' training process, consisted of 20% of the training material. The testing set was obtained by mixing the remaining 50% of the DEMAND noise signals with the TIMIT speech testing set. As a preprocessing stage, all audio material was ensured to be stereo and sampled at 16 kHz.
Each acoustic scene corresponded to a unique target utterance and a unique segment of noise from an interferer, mixed at signal-to-noise ratios (SNRs) ranging from −6 to 6 dB. The three sets were balanced for the target speaker's gender. Binaural room impulse responses (BRIRs) [44] were used to model a listener in a realistic acoustic environment. The BRIR recording data set consisted of 4 different rooms of different sizes and acoustic properties. The audio signals for the scenes were generated by convolving the source signals with the BRIRs and summing the results.
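The scene generation described above can be sketched as follows. This is a simplified illustration with random placeholder signals and impulse responses; the function names are hypothetical, and the choice to measure the SNR over both channels after convolution is an assumption made here, since the text does not state where the SNR was defined.

```python
import numpy as np

def spatialize(source, brir):
    """Convolve a mono source with a two-channel BRIR of shape (taps, 2)."""
    channels = [np.convolve(source, brir[:, ch]) for ch in range(2)]
    return np.stack(channels, axis=1)  # shape: (samples + taps - 1, 2)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) = snr_db, then sum."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

rng = np.random.default_rng(0)
speech = rng.normal(size=100)            # placeholder for a TIMIT utterance
noise = 3.0 * rng.normal(size=100)       # placeholder for a DEMAND segment
brir_speech = rng.normal(size=(8, 2))    # placeholder BRIR for the target
brir_noise = rng.normal(size=(8, 2))     # placeholder BRIR for the interferer

speech_bin = spatialize(speech, brir_speech)
noise_bin = spatialize(noise, brir_noise)
mixture, gain = mix_at_snr(speech_bin, noise_bin, snr_db=-6.0)

# SNR actually obtained after scaling (should equal the requested value).
achieved = 10.0 * np.log10(np.mean(speech_bin ** 2)
                           / np.mean((gain * noise_bin) ** 2))
```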
Tested topologies: To investigate how the fusion operation affected the models' performance, we tested four configurations described in Table II. To expand our intuition about the effect that fusion layers have on speech enhancement performance, two different
Tested encodings: We explore the impact of fusion operations on the performance of models when using different encodings of the input signals. Our investigation focuses on comparing a non-deterministic learned representation and a deterministic representation. The objective of this analysis is to examine whether the fusion layers effectively use redundant binaural data by sharing underlying representations among models through the inclusion of adaptable non-linearities that align with the input data.
The input mixture sound can be divided into overlapping segments of length $R$, represented by $X_k \in \mathbb{R}^{1 \times R}$, where $k = 1, \ldots, \hat{T}$ denotes the segment index and $\hat{T}$ denotes the total number of segments in the input. At the encoding stage, $X_k$ is transformed into an $F$-dimensional representation, $\lambda_k \in \mathbb{R}^{1 \times 1 \times F}$. This representation can be obtained through 1-D convolution operations (non-deterministic encoding; deep encoding), such as in [9], or with a classic spectro-temporal representation of the signal, i.e., deterministic encoding (short-time Fourier transform; STFT). These encoding-decoding stages are represented by the encoder-decoder blocks shown in Fig. 7.
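The two encodings can be contrasted with a small NumPy sketch. Several details are assumptions made here rather than taken from the paper: the hop size, the Hann window and magnitude spectrum for the deterministic branch, and the random basis with a ReLU standing in for a trained 1-D convolutional encoder.

```python
import numpy as np

def segment(signal, R, hop):
    """Split a 1-D signal into overlapping segments X_k of length R."""
    n_seg = 1 + (len(signal) - R) // hop
    return np.stack([signal[k * hop:k * hop + R] for k in range(n_seg)])

def stft_encode(segments):
    """Deterministic encoding: windowed magnitude spectrum per segment."""
    window = np.hanning(segments.shape[1])
    return np.abs(np.fft.rfft(segments * window, axis=1))

def learned_encode(segments, basis):
    """Deep encoding: a 1-D convolution with kernel R and stride `hop`
    is a matrix product with a basis of shape (R, F); here followed by ReLU."""
    return np.maximum(segments @ basis, 0.0)

rng = np.random.default_rng(0)
signal = rng.normal(size=16)        # placeholder audio
R, hop, F = 4, 2, 6                 # hypothetical sizes
segments = segment(signal, R, hop)  # shape: (number of segments, R)
spec = stft_encode(segments)        # shape: (number of segments, R // 2 + 1)
basis = rng.normal(size=(R, F))     # would be trainable in a real model
latent = learned_encode(segments, basis)
```

Only the `basis` in the deep encoding would be updated during training, which is the sense in which the learned representation can adapt to what is shared through the fusion layers.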
Tested loss functions: To assess whether the effect of the fusion mechanisms depends on the loss function used to train the models, we investigated two typical cost functions used in the context of speech enhancement, namely, the SNR and the scale-invariant signal-to-distortion ratio (SI-SDR) [45]. The SNR between a given signal with $T$ samples, $X \in \mathbb{R}^{1 \times T}$, and its estimate $\tilde{Y} \in \mathbb{R}^{1 \times T}$ is defined as:
$$\mathrm{SNR}(X, \tilde{Y}) = 10 \cdot \log_{10} \frac{\|X\|^2}{\|X - \tilde{Y}\|^2}. \qquad (9)$$
The SI-SDR between a given signal and its estimate is defined as:
$$\text{SI-SDR}(X, \tilde{Y}) = 10 \cdot \log_{10} \frac{\|\gamma \cdot X\|^2}{\|\gamma \cdot X - \tilde{Y}\|^2}, \quad \gamma = \frac{\tilde{Y} X^{\top}}{\|X\|^2}. \qquad (10)$$
Training: The models were trained for a maximum of 100 epochs on batches of two 4-s long audio segments. The initial learning rate was set to 1e-3. The learning rate was halved if the validation loss did not improve during 3 consecutive epochs, early stopping with 5-epoch patience was applied as a regularization method, and only the best-performing model was saved. For the model optimization, Adam [39] was used. The models were trained and evaluated using a PC with an Intel(R) Xeon(R) W-2145 CPU @ 3.70 GHz, 256 GB of RAM, and an NVIDIA TITAN RTX as the accelerated processing unit.
Table III shows the absolute testing and validation results of the speech enhancement algorithm with no fusion layers for the tested loss functions (SNR and SI-SDR), encodings (deep non-deterministic encoding based on 1-D convolutions, and deterministic encoding based on the STFT), $N$ (encoding size; the number of filters in the 1-D convolution or the number of STFT bins), and $S$ (number of filters in the latent representation at the output of the temporal convolutions, before the mask estimation module; for details refer to [9]).
Fig. 8. Box plots indicating the activation variance on the testing set. The black horizontal bars within each box represent the median for each condition, the circle-shaped marks indicate the mean improvement, and the top and bottom extremes of the boxes indicate the $Q_3 = 75\%$ and $Q_1 = 25\%$ quartiles, respectively. The box length is given by the interquartile range (IQR), used to define the whiskers that show the variability of the data above the upper and lower quartiles (the upper whisker is given by $Q_3 + 1.5 \cdot \mathrm{IQR}$ and the lower whisker is given by $Q_1 - 1.5 \cdot \mathrm{IQR}$ [40]). Black dots indicate observations that fall beyond the whisker range (outliers).
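Returning to the two loss functions used for training, the SNR and SI-SDR between a clean signal and its estimate can be sketched in NumPy as below. The example vectors are toy values chosen here, and in practice the negative of each quantity would be minimized; numerical safeguards such as a small epsilon in the denominators are omitted for brevity.

```python
import numpy as np

def snr_db(x, y_hat):
    """SNR in dB between the clean signal x and its estimate y_hat."""
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y_hat) ** 2))

def si_sdr_db(x, y_hat):
    """Scale-invariant SDR in dB: x is rescaled by gamma before comparison."""
    gamma = np.dot(y_hat, x) / np.dot(x, x)
    target = gamma * x
    return 10.0 * np.log10(np.sum(target ** 2)
                           / np.sum((target - y_hat) ** 2))

x = np.array([1.0, 1.0, 1.0, 1.0])       # toy clean signal
y_hat = np.array([1.0, 1.0, 1.0, 0.0])   # toy estimate

snr_value = snr_db(x, y_hat)
si_sdr_value = si_sdr_db(x, y_hat)
si_sdr_scaled = si_sdr_db(x, 3.0 * y_hat)  # rescaled estimate
```

Rescaling the estimate leaves the SI-SDR unchanged but changes the SNR, which is the practical difference between the two cost functions.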

2) Relative Denoising Performance With Fusion Layers:
To assess the generalization capabilities of the fusion layers, we report the test score difference ($\delta$) of the different fused models relative to the values shown in Table III. Fig. 9 shows bar plots of the increment in the validation and testing error ($\delta_{\text{Test score}} = \mathrm{Loss}_{\Lambda} - \mathrm{Loss}_{\mathrm{ind}}$) of the different fused models (see Table II) as a function of fusion size, loss function, and encoding. Here it can be seen that fusion seems to improve the performance of the "independent" models only when using deep encoding. In the case of deterministic STFT encoding, the fusion mechanisms may blur or distort the signal and fail to produce a faithful final decoding. This suggests that the shared information between sides is learned.
3) Speech Denoising Performance as a Function of the Number of Fused Channels: To investigate how the number of fused channels between the left and right speech enhancement models affects the testing error, we correlated the total number of fused channels with the objective test loss for the different encodings and loss functions. Fig. 10 shows the performance difference between the fused and independent models as a function of the total number of fused latent channels and the encoding type.
This plot corroborates that a deep encoding is necessary to take advantage of the fusion layers: not only is the deterministic STFT encoding negatively correlated with the total number of fused channels (frequency bins when fusing the encoder outputs), but this encoding also generally performs worse than the independent model.
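The kind of analysis summarized in Fig. 10 can be sketched as an ordinary least-squares fit of the score difference against the number of fused channels. The function `linear_fit` and the data values below are hypothetical: the numbers are synthetic placeholders rather than results of this study, and the sketch reports only the slope, the intercept, and Pearson's r, not the adjusted statistic and p-value shown in the figure.

```python
import math


def linear_fit(x, y):
    """Least-squares line y = slope * x + intercept, plus Pearson's r."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic placeholder values, NOT measurements from this study.
fused_channels = [128, 256, 512, 1024]
delta_scores = [0.1, 0.2, 0.4, 0.8]

slope, intercept, r = linear_fit(fused_channels, delta_scores)
```

The sign of the fitted slope indicates whether the score difference grows or shrinks as more channels are fused, which is the trend compared across encodings in the text.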

4) Layer Variance Analysis of the Different Speech Denoising Topologies: Fig. 8 shows box plots of the layer activation variances of the different speech enhancement algorithms tested in this study. The left panel shows the variance of the encoder output (note that this analysis is only applicable to the deep non-deterministic encoding) and the right panel shows the variance of the temporal convolution outputs. It can be seen that the activation variance is again affected by the fusion operation. For example, the single fusion models obtained an unbalanced variance, which is smaller at the layer where the fusion operation is performed.
The fusion operation thus reduces the layer activation variance. The double fusion model obtains activation variance values at the front and back layers that are numerically closer to each other than those of the other three models. Fundamentally, this may indicate that the fusion operation causes the gradient to propagate between the left and right enhancement modules, acting as a channel that balances the learning rate.
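One way to see why fusion may couple the two modules is to write the gradient of an element-wise product. The following is a sketch under the assumption that the fused representation is z = x_l ⊙ x_r and that a scalar training loss depends on z. The symbols x_l, x_r, z, and the loss symbol are introduced here for illustration and are not the notation of Table II.

```latex
z = x_l \odot x_r, \qquad
\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial z} \odot x_r, \qquad
\frac{\partial \mathcal{L}}{\partial x_r} = \frac{\partial \mathcal{L}}{\partial z} \odot x_l
```

Under this assumption, the gradient reaching each side is scaled element-wise by the activations of the opposite side, which is consistent with the interpretation of the fusion layer as a channel between the two enhancement modules.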

C. Experiment 3: Comparative Study
In this section, we assess the effect of fusion compared with other baseline models. We also extend the baseline models with fusion layers to assess their efficacy in improving binaural speech enhancement. All the tested models based on TCN separation share the hyper-parameters shown in Table I with N = S = 256, and all use deep encoders and decoders. For all the other models, the number of trainable parameters was set to be roughly the same as that of the TCN-based models. To further assess the effect of the fusion layers on speech enhancement, we computed the modified binaural short-time objective intelligibility (MBSTOI) metric [46] for each deep learning topology. MBSTOI is an extension of STOI [47] that includes a modified version of the equalization-cancellation model and enables predictions that account for binaural advantages, while maintaining the monaural performance of the STOI measure.
To support this last analysis, we include the STOI [47] averaged across listening sides. To monitor the quality of the separated speech, we also include the PESQ [48] measure, likewise averaged across listening sides.
1) Final Tested Models:
a) Independent [9]: This is the simplest baseline model. It comprises two TasNets performing single-channel speech enhancement on each listening side independently.
b) CDNN [49]: In this model, binaural speech enhancement is performed by means of a fully connected complex DNN. Here, the signals in the left and right channels are treated as the real and imaginary components of a monaural complex signal. Unlike the alternative models, this architecture faces the challenge of estimating a complex ideal ratio mask.
c) Front fusion: This model uses two TasNets connected with a fusion layer after the encoding blocks, as defined in Table II [...] (defined in Table II, third row) after the TCN outputs, as shown in Fig. 7 (see back fusion block).
f) Stitch [23]: In this configuration, we substitute the front fusion operation with a cross-stitch network. The inputs to the TCN module at each side (X_{f,b},{r,l}, following the notation shown in Fig. 7) are given by a linear combination of the encoder outputs of both sides, weighted by α_ij for i, j ∈ {r, l}, which are trainable parameter tensors of adequate dimensionality.
g) Stitch+Fusion [23]: This model extends the "Stitch" model by introducing a back fusion layer (defined in Table II, third row) after the TCN outputs, as shown in Fig. 7 (see back fusion block).
h) Parallel concat [2]: This model is described in [2] as "parallel encoder + sum & mask," and here we concatenate the encoded spectral features obtained from the encoders.
i) Parallel fusion: This model is architecturally identical to the "Parallel Concat" model, but the intermediate feature concatenation layer is replaced with a fusion layer.
j) Parallel cross [31]: This model is based on two TasNets using two encoders per channel and shares cross-domain features. Specifically, cross-channel features are concatenated to the encoder output using interaural time and level differences as spatial features. An implementation of this model can be found online. 4
k) Parallel Cross+Fusion: This model adds a back fusion layer (defined in Table II, third row) to the "Parallel Cross" model.
l) Double fusion: This model fuses the latent representations after the decoder and TCN outputs, as shown in Fig. 7 and defined in Table II, last row.

4 [Online]. Available: https://github.com/speechbrain

TABLE IV. ABSOLUTE TESTING OBJECTIVE INSTRUMENTAL SCORES FOR THE DIFFERENT FINAL TESTED MODELS TRAINED USING THE SNR LOSS

TABLE V. ABSOLUTE TESTING OBJECTIVE INSTRUMENTAL SCORES FOR THE DIFFERENT FINAL TESTED MODELS TRAINED USING THE SI-SDR LOSS

Fig. 10. Regression of the testing error difference between the fused and independent models as a function of the total number of fused channels for each of the investigated encoders. Shaded areas represent a point-wise 95% confidence interval on the fitted values. Correlation analysis is expressed as the adjusted-R and p-value, and it is considered to be significant when p < 0.05.

2) Final Performance Results: Table IV shows the objective measures for each of the tested models using the SNR loss, and Table V shows the results using the SI-SDR loss. It can be seen that the proposed fusion operation improves noise reduction performance for all baseline models. However, it can also be seen that, in general, the fusion operation causes a slight drop in predicted speech intelligibility and quality, which may be related to artifacts and distortions introduced by the operation. This observation aligns with the findings of [50], which show that increased noise reduction leads to a corresponding loss of spatial information and to added distortions. Both are key factors in determining speech intelligibility, as predicted by MBSTOI, and quality, as predicted by PESQ. Interestingly, one fused model improves noise reduction as well as the speech intelligibility indices and quality, namely the "Concat+Fusion" model. While the exact causes remain uncertain and require additional investigation, the results suggest that fusion can be advantageous when the balance between noise reduction and the distortions it may introduce is taken into account. Consequently, integrating fusion with other feature-sharing techniques has the potential to enhance both noise reduction and speech quality.
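To make the difference between the feature-sharing operations compared in this section concrete, the following sketch contrasts a Hadamard fusion with a cross-stitch unit on two short latent vectors. It is a simplified illustration, not the implementation evaluated here: the function names are hypothetical, latent tensors are reduced to flat lists, and the cross-stitch weights are scalars, whereas the text describes them as trainable tensors.

```python
def hadamard_fusion(left, right):
    """Element-wise product of the left and right latent representations."""
    return [l * r for l, r in zip(left, right)]


def cross_stitch(left, right, alpha):
    """Linear mix of both sides with weights alpha['ll'], ['lr'], ['rl'], ['rr'].

    Scalar weights are a simplifying assumption of this sketch.
    """
    new_left = [alpha["ll"] * l + alpha["lr"] * r for l, r in zip(left, right)]
    new_right = [alpha["rl"] * l + alpha["rr"] * r for l, r in zip(left, right)]
    return new_left, new_right


left = [1.0, 2.0]
right = [3.0, 4.0]

fused = hadamard_fusion(left, right)

# With identity weights the cross-stitch unit leaves both sides unchanged,
# i.e. it reduces to the "independent" configuration.
identity = {"ll": 1.0, "lr": 0.0, "rl": 0.0, "rr": 1.0}
stitched_left, stitched_right = cross_stitch(left, right, identity)
```

The contrast of interest is that the cross-stitch unit introduces trainable mixing weights, while the Hadamard product has none and combines the two sides multiplicatively rather than additively.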

IV. CONCLUSION
In this manuscript, we have proposed the use of deep fusion layers as an approach to enhance speech in binaural listening scenarios. First, we introduced the concept of the general fused model, presenting its notation and describing its characteristics. We have demonstrated that fusion layers introduce non-linearities to the model, improving its ability to represent the distribution of the input data. Our empirical analysis has also shown that fused models are susceptible to input decorrelation, highlighting the importance of considering this aspect. Additionally, we observed that the fusion layers act as a channel through which the gradients flow, equalizing the activation variance between learning modules. Furthermore, we have analyzed the impact of fusion layers on binaural speech enhancement. Our findings indicate that fused models exhibit promising capabilities in reducing noise compared to independent models. Among the topologies explored, the model incorporating the largest double fusion layers yields the best performance on unseen data.
Importantly, our results have demonstrated that the fusion operation leads to enhanced noise reduction performance when compared to all investigated baseline models. Nonetheless, we recognize the trade-off between the extent of noise reduction and the MBSTOI and PESQ scores, which are important metrics for evaluating noise reduction quality. It is worth noting that the fusion layers we propose not only improve noise reduction but also maintain a constant number of parameters. This aspect becomes particularly relevant when there is a necessity to share large latent representations between the listening sides.
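The point about parameter count can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that the shared latent representation is followed by a 1x1 convolution with N output channels. This following layer and the helper `pointwise_conv_params` are hypothetical and are not taken from the architecture described in this work.

```python
def pointwise_conv_params(in_channels, out_channels, bias=True):
    """Trainable parameters of a 1x1 convolution (a per-frame linear layer)."""
    return in_channels * out_channels + (out_channels if bias else 0)


N = 256  # latent channels per listening side (N = S = 256 in the comparative study)

# Hadamard fusion keeps N channels and has no trainable parameters of its own,
# so the layer that follows it is unchanged.
params_after_fusion = pointwise_conv_params(N, N)

# Concatenating the left and right latents doubles the input width of the
# layer that follows, and with it the number of weights.
params_after_concat = pointwise_conv_params(2 * N, N)
```

Under this assumption, sharing by concatenation roughly doubles the size of the layer that consumes the shared features, whereas sharing by fusion leaves it unchanged, and the gap grows with the size of the latent representation.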
Based on these findings, we firmly believe that our approach holds potential for enhancing future binaural speech processing systems. However, it is crucial to acknowledge that our work assumes instantaneous transmission of information between the listening sides, which may not hold true in real-life applications. Therefore, an important avenue for further investigation is evaluating how latency and the necessary reduction in bitrate for transmitting latent spaces impact the performance of fused models.
Overall, this study may help advance binaural speech processing techniques, and we anticipate that future research will build upon these insights to further refine and optimize the fusion-based approach.