Signal-to-Interference-Plus-Noise Ratio Based Optimization for Sound Zone Control

Sound zone control is posed as an optimization problem where finite impulse response control filters are jointly optimized for low transmit power, while maintaining a sufficiently high signal-to-interference-plus-noise ratio (SINR) in all zones. This problem statement in particular allows the consideration of sound zone control under the additional influence of external noise, which is rarely taken into account and indeed necessitates an alternative problem statement. In addition, the spectral characteristics of the audio signals are taken into account, both for the transmit power minimization and computing the SINR. The optimization problem is shown to be solved optimally by semidefinite relaxation, but the computational cost is high. An alternate method is proposed, inspired by a duality between transmit and receive beamforming commonly exploited in digital communications. A virtual receive optimization problem for sound zone control is derived, and it is shown to have the same optimal solution as the original transmit optimization problem. The receive optimization problem is solved efficiently through a fixed point iteration algorithm. The two proposed methods are compared against acoustic contrast control in a simulated reverberant environment, where it is shown that the proposed methods require less transmit power to produce similar SINRs.


I. INTRODUCTION
In sound zone control different audio content is reproduced in different zones of the same space using a loudspeaker array [1]. Popular methods include acoustic contrast control (ACC) [2], [3], [4], [5], [6], [7] and pressure matching [8], [9], [10], both of which can be described as special cases of the variable span trade-off filter [11], [12]. Recent methods considering more general problem formulations include [13], [14], [15]. The task is to find a control filter for each loudspeaker and audio signal, such that when the filtered signals are transmitted, only the desired signal is heard in each zone.
Sound zone control is generally solved using superposition, a simplified problem formulation where only one zone has a desired signal that should be reproduced, whereas silence should be maintained in the other zones. Solving such a problem for each zone individually and superimposing the obtained solutions, the joint sound zone control problem is solved. In practice however, there is likely a non-negligible level of noise from external sources, which can be beneficial to take into account. In that case, the superposition strategy is no longer well suited, necessitating an alternative problem statement where all zones are considered jointly.
The sound zone control problem without superposition has a close resemblance to the transmit beamforming problem in digital communications [16], [17]. In contrast to transmit beamforming, in sound zone control the signal of interest is the sound field rather than the microphone signals, meaning an operation analogous to receive beamforming to extract the transmitted signal from the noise and interference is not of interest. In addition, the transmitted signals are broadband, and the transmission channels are generally stationary on a longer time scale.
There is a duality between transmit and receive beamforming commonly exploited in digital communications, also referred to as uplink-downlink duality [18], [19], [20], [21]. It is particularly useful for solving transmit beamforming problems, which are in general more difficult than the corresponding receive beamforming problems [22]. An analogous duality exists for sound field reproduction, but is seldom used. It is used in [23] for sound zone control, however the method places significant restrictions on the loudspeaker and microphone arrays.
A common simplifying assumption in sound zone control is that the audio signals are spectrally white, which avoids the control filter being dependent on the likely non-stationary audio signals [5], [6], [7]. This is suboptimal, and sound zone control performance can be improved by taking the spectral characteristics of the audio signals into account.
Because the signals are broadband, the problem can be solved in the time domain or for a number of frequencies separately in the discrete Fourier transform domain. The latter generally has lower computational cost, but can give rise to degraded performance between frequency bins, especially when the control filter is short [5], [12], [24]. A time domain formulation avoids this problem, and allows for a natural way of taking the audio signal characteristics into account. However, it can introduce a new problem if the spectral distribution of the resulting audio is not controlled. The result can be a narrow bandpass effect at a frequency where the optimization criteria can be minimized. This problem is addressed in [5], [6], [7] for ACC, but the same ideas can be applied to the proposed methods.
The first contribution of this paper is the formulation of the sound zone control problem without superposition, taking both external noise and the spectral characteristics of the audio signals into account. Two methods are proposed to solve the problem, the first based on semi-definite relaxation (SDR), which is optimal but computationally costly. For the second method a virtual receive optimization problem for sound zone control is derived, and it is shown to have the same optimal solution as the original transmit optimization problem. It is solved efficiently through a modified version of the fixed point iteration algorithm from [20].

A. NOTATION
Vectors are denoted by lowercase bold letters (such as a), matrices by uppercase bold letters (such as A or A), and scalar variables by non-bold letters (such as a or A). Timedependent quantities are functions of the discrete time index n. The inequality ≥ of a vector should be interpreted elementwise, and denotes a generalized inequality over the cone of positive semi-definite matrices. The expectation operator is denoted by E, and the trace operator by tr. The col operator creates a column vector from its arguments, with the ordering col{a(i) . The diag and blkdiag operators create a diagonal and block diagonal matrix respectively with the analogous ordering. The identity matrix is denoted by I, and a column vector with all ones is denoted by 1, the sizes of which should be clear from context.

A. SIGNAL MODEL
An array of L loudspeakers transmits audio signals to Z zones, each with M z microphones. The problem setting is depicted in Fig. 1. The set of indices for the zones is Z = {1, . . . , Z} and for the loudspeakers L = {1, . . . , L}. The monophonic audio signal x z (n) associated with zone z should be reproduced only in that zone by filtering it with the I-tap finite impulse response (FIR) filter w zl ∈ R I before driving loudspeaker l. The audio propagates from loudspeaker l to microphone m through an acoustic path modelled by the time-invariant J-tap room impulse response (RIR) h lm ∈ R J . The signal received at microphone m is where η m (n) is additive external noise, and X z (n) ∈ R J×I is a Toeplitz matrix whose jth row and ith column is x z (n − i − j + 2). The signal model can be expressed more compactly by introducing the RIR matrix for all RIRs associated with zone z, the duplicated audio signal matrix X z (n) = blkdiag{X z (n)} L l=1 , the control filter w z = col{w zl } l∈L , and the noise signal η z (n) = col{η m (n)}

B. SIGNAL-TO-INTERFERENCE-PLUS-NOISE RATIO
With the signal model (3), a natural metric for the quality of the sound zone control is the ratio of desired to undesired sound in each zone, which is the signal-to-interference-plusnoise ratio (SINR). The SINR can be written as where the noise power of zone z is σ 2 z = E[ η z (n) 2 2 ], and the spatial covariance matrices are defined as The definition of the SINR used here to characterize a zone is using the summed power of the signals at the microphones, which is likely satisfactory if the microphones are spaced densely and evenly. Otherwise, a method such as kernel interpolation [25] could be applied according to [26]. Defining the SINR for each microphone individually instead of the zone as a whole is also possible, but leads to a non-convex NP-hard problem [27].

C. OPTIMIZATION PROBLEM
The considered optimization problem is a SINR constrained transmit power minimization problem, which allows for the desired level of audio separation to be selected via the SINR constraints. Transmit power in this context refers to the expected sum power of the signals emitted by the loudspeakers.
With the loudspeaker signals denoted by s(n) ∈ R L , the optimization problem is where the parameter γ z ∈ R ≥0 is the required SINR for the zone, and α ∈ R ≥0 is a parameter controlling the amount of regularization. The regularization can help mitigate problems in the case of imperfectly estimated covariance matrices. Introducing an audio signal vector , the loudspeaker signals are s(n) = col{ z∈Z w zl x z (n)} l∈L , and the audio signal covariance for zone z is Assuming the audio signals for each zone are mutually independent and zero mean, the objective function in (6) can be rewritten as whereÃ z = A z + αI. Any additional penalty terms, to promote for example spectral or spatial uniformity, can be added to the matrixÃ z as well [5], [6], [7]. The weighting matrix A z can be viewed as a description of the user's preference between all control filters satisfying the SINR constraints.

D. ESTIMATION OF EXPECTED VALUES
The optimization objective in (6) includes the expectation over the audio signals, the RIRs, and the external noise power. These values must in general be estimated from the available data. The audio signals are assumed to be locally stationary, such that the covariances are approximately constant over a fixed segment length. With a segment of N samples starting at time index n 0 , the audio signal covariances can be estimated as The spatial covariance matrices in (5) are estimated in an analogous way, but filtering the audio signals through the estimated RIRs first, and then calculating the sample covariance matrix.
The RIRs are slowly time-varying even if loudspeakers and zones are stationary, as temperature changes alone can cause a significant change [28]. The time-variations of the RIRs are likely slower than for the audio signals. The RIRs are therefore modelled as constant, and assumed to be measured to sufficient precision. If the variation of the RIRs is not excessive, they can be measured in an initial calibration stage and the microphones subsequently removed.
If the noise power is stationary over a long time, it could be directly estimated from the microphone signals before transmission of the audio signals is begun, for example when measuring the RIRs. In the general case, at least one microphone per zone is required to be present during operation. The transmitted audio signals at the microphones can be calculated according to (1) and subtracted from the recorded microphone signals, which will provide the true noise signal in the case of perfectly estimated RIRs.

E. ACOUSTIC CONTRAST CONTROL
For comparison, the sound zone control method ACC [5] can be obtained from the optimization problem where controls the regularization, applied according to [4]. The optimal control filter w z is the principal generalized eigenvector of the pencil (R zz , R −z ). The optimal control filter associated with each zone is computed independently, and then the obtained solutions are superimposed.

III. METHOD 1: SEMI-DEFINITE RELAXATION
The problem in (6) can be written in the form of a separable quadratically constrained quadratic program (QCQP). QCQPs are non-convex and NP-hard in general [29], however, some can be solved by SDR. Introducing the matrix W z = w z w z and using the fact that w i R zi w i = tr(W i R zi ), the SDR of (6) leads to the optimization problem The semi-definite program (10) is equivalent to (6) if the rank constraint rank(W z ) = 1 is added for all z. However, the rank constraint is impractical as it is not convex. Without the rank constraint, the problem in (10) is a relaxation of the original problem (6), and therefore provides a lower bound for the optimal value. If a rank-1 solution is optimal for the relaxed problem, it is feasible for the original problem, and since the relaxed problem provides a lower bound, the solution is optimal for the original problem. The optimization problem (10) can also be identified as the Lagrangian bidual of (6). It is shown in [29], [30], [31] that when SDR is used to solve a real-valued separable QCQP with K constraints and Z optimization variables, there is at least one optimal solution satisfying As long as no optimal matrix is the zero matrix (a rank-0 matrix), a rank-1 solution is then guaranteed to exist if K ≤ Z + 1. The optimization problem (6) corresponds to a QCQP satisfying K = Z, and the zero matrix cannot be feasible for any γ z > 0, so the problem (10) is equivalent to the original problem (6). When the optimal solution has been found, the control filter can be extracted as w z = √ λ z u z where λ z and u z are the principal eigenvalue and eigenvector of the matrix W z respectively.

IV. METHOD 2: VIRTUAL RECEIVE OPTIMIZATION PROBLEM
It can be shown that the optimal transmit control filters of (6) are the same up to a scaling as the optimal control filters for the receive optimization problem where the receive SINR is defined as The power allocations associated with the different zones are collected in a vector q = col{q z } z∈Z , as are the noise powers σ = col{σ 2 z } z∈Z . Two derivations of the receive optimization problem (12) is provided in Appendix A and B, together with proof that (12) and (6) are solved by the same control filters and with the same objective value.

A. FIXED POINT ITERATION
The problem in (12) can be solved using a fixed point iteration algorithm as in [20]. With an initial value of q = 0, the following two steps are repeated until the control filters and power allocation have converged to a fixed value.
1) Calculate control filtersw z according to Section IV-B.
2) Calculate power allocation q according to Section IV-C if (17) is satisfied, otherwise according to Section IV-D. Once converged, the transmit power allocation for the original problem (6) can be calculated according to Section IV-E.
If the chosen SINR constraints can not be supported by any set of control filters, the problem will stay infeasible, meaning (17) will remain false. A maximum number of iterations can be set, after which the problem is considered infeasible. Because the algorithm generally only requires a few iteration to converge, the maximum number of iterations can be set to a moderately low number.

B. OPTIMAL CONTROL FILTER FOR GIVEN POWER ALLOCATION
For a given power allocation q, the optimal control filter should maximize the SINR. Only the control filterw z affects SINR rec z in (13), so the control filters can be found independently. Maximization of (13) is the maximization of a generalized Rayleigh quotient, which is solved by the principal generalized eigenvector of the pencil ⎛ The principal generalized eigenvector is normalized to ensure w z 2 = 1.

C. RECEIVE POWER ALLOCATION FOR FEASIBLE CONTROL FILTER
The optimal power allocation of (12) for a given feasible control filter will satisfy the SINR constraints with equality [20]. The optimal power allocation can therefore be directly solved for. Introducing the diagonal matrix D = diag{ γ z the transmit power vector s = col{w zÃ zwz } z∈Z and the interference matrix the optimal power allocation for (12) is given by The power allocation problem is feasible if there is a power allocation q > 0 that supports the requirement SINR rec z ≥ γ z for all z. A sufficient and necessary criterion for feasibility is where ρ(·) is the spectral radius [19]. The transmit power allocation is feasible if and only if the receive power allocation is feasible, because ρ(D ) = ρ(D ) [19].

D. RECEIVE POWER ALLOCATION FOR INFEASIBLE CONTROL FILTER
If the given control filter is not feasible according to (17), then the power allocation scheme in (16) cannot be used. The following power allocation scheme can then be used instead, which if iterated according to the algorithm in Section IV-A will find a feasible control filter if it exists. While the control filter is infeasible, the power constrained SINR maximization problem will be solved instead, which can be stated as for a given fixed control filter. The optimum of (18) will satisfy [20]. The optimum objective can thereby be written in matrix form as The power constraint will also be satisfied exactly at the optimum, meaning that 1 q = P max . Equation (19) can be multiplied by 1 from the left, giving Together, (19) and (20) produces the eigenvalue problem in terms of the extended coupling matrix Only the maximum eigenvalue has a strictly positive eigenvector [20], so the power allocation is given by the first K components of the principal eigenvector of , scaled such that its last component is one.
If the ratio SINR rec z γ z ≥ 1 is satisfied then so is the feasibility criterion (17). Because the ratio SINR rec z γ z is equal for all z at the optimum, if the criterion can be satisfied for one zone, it can be satisfied for all zones. Therefore, if a feasible control filter exists, and P max is set to a large enough value, the control filter will eventually be found. The dependence on P max should not be restrictive in practice, as P max can be set to a value much larger than a reasonable transmit power. The global solution to the power constrained SINR maximization problem can be obtained instead of the SINR constrained power minimization problem by continuing to use this power allocation method until convergence.

E. TRANSMIT POWER ALLOCATION FOR OPTIMAL CONTROL FILTER
Once a set of normalized optimal control filters is obtained, the power allocation for (6) can be obtained directly as This is a required last step, as the power allocation differs between the transmit and receive optimization problems. The final control filters are obtained as w z = √ p zwz for all z.

A. EXPERIMENT DESIGN
To characterize the performance of the proposed methods, experiments in a simulated reverberant environment have been conducted. There were three zones with 16 microphones each, sharing 10 loudspeakers, all placed on the same plane in a room of size 5 × 4.8 × 2 m. The geometry is shown in Fig. 2.
The reverberation was simulated with the randomized imagesource method [32], [33] with a maximum image position shift of 5 cm, using the implementation from [34].
The external noise was simulated as spatio-temporally white Gaussian noise. The same noise power was used for all microphones associated with the same zone. The noise of any two microphones was mutually independent. The noise powers for the three zones were σ = [3 1 5] , and were assumed to be known. The SINR constraints were set to γ = [10 20 30] .
The considered methods were the SDR of Section III, the fixed point iteration (FPI) of Section IV, and ACC of Section II-E. To illustrate the impact of regularization, both FPI and ACC were computed with regularization values α = 10 −2 and α = 0. In addition, two modifications of FPI were used for comparison. The first, referred to as FPI withÃ z = I, has the optimization objective z∈Z w z 2 2 , and is designed to show it is suboptimal to only minimize the norm of the control filter in a broadband setting. The second, referred to as FPI with x z (n) = δ(n), assumes the audio signals to be spectrally white when computing the control filters, and is designed to show that it is essential to take the characteristics of the audio signal into account.
To solve the optimization problem (10) for the SDR, the splitting conic solver [35] was used, available in the Python package CVXPY [36], [37]. The maximum number of iterations for the solver was set to 10000. The FPI method was considered converged when the entry-wise mean square difference between both subsequent control filters and subsequent power allocations was less than 10 −12 .
The control filters obtained from all methods were normalized and then scaled according to the power allocation method (23). The SINR constraints were selected to be relaxed enough such that all methods generated feasible control filters, so that the power allocation method could be used. Because ACC solves an entirely different optimization problem, it is not guaranteed to find a feasible control filter even if it exists.

B. SYNTHETIC STATIONARY SIGNALS
The first experiment considered stationary audio signals. Each audio signal was generated as white Gaussian noise filtered through a sharp bandpass filter, with bandwidths 40 Hz to 450 Hz, 40 Hz to 250 Hz, and 80 Hz to 450 Hz, for zone 1 to 3 respectively. The signals were scaled after the bandpass filter to have mean power E[x 1 (n) 2 ] = 4, E[x 2 (n) 2 ] = 3, and E[x 3 (n) 2 ] = 2. The spatial covariance matrices were calculated using 10 s of audio. The sample rate was 1000 Hz, the control filter length I = 32, and the reverberation time RT 60 = 0.18 s.
The square norm of the control filters is shown in Table 1, as well as the computed transmit power, i.e. the transmit power that would be generated if the estimated covariance A z was the true covariance. The cost function P, which is a linear combination of the two using α = 10 −2 is shown as well. It can be seen that for both ACC and FPI, the control filter norm becomes significantly larger when the regularization is set to zero.
Using the calculated control filters, the sound in the room was then simulated. The resulting SINR for each zone and total transmit power is shown in Table 2. The values were obtained from the mean power estimated from 1 min of audio. The covariance estimation errors lead to SINR values slightly different from the specified level for all methods. Because FPI with x z (n) = δ(n) assumes the audio is spectrally white, the resulting SINR is not close to the desired values. For FPI, the regularization was critical to obtain a reasonable control filter, as Table 2 shows that the SINR values are not close to the targets without regularization. Even in other cases, such as for ACC in this experiment, the regularization can also lower the transmit power in practice. ACC required more transmit power to achieve similar SINRs compared to all proposed methods except FPI with x z (n) = δ(n).

1) EQUIVALENCE OF PROPOSED METHODS
In Table 1 it can be seen that SDR and FPI obtained the same objective value. Note that the optimal control filters are ambiguous with regards to sign. If w z is optimal, so is −w z . Therefore, the solutions from SDR and FPI are identical up to a sign difference, which is independent for each zone z, i.e. the two solutions can be equal for some z while being negative inverses for others. This does not affect the norm or the computed transmit power, as Table 1 confirms. However, in using the filters in practice, there can be some differences in performance depending on the signs, which can be seen in Table 2 if the values for SDR and FPI are compared. This performance difference is entirely accounted for if the sign differences are corrected.

2) COMPUTATIONAL COST
The primary computational cost of FPI and ACC is the generalized eigenvalue decomposition (GEVD), which is performed once per iteration and once in total, respectively. Each GEVD has a complexity of O((LI ) 3 ) [38]. The matrices are of the same size, so one iteration of the FPI algorithm has about the same computational cost as ACC. For the experiment presented here, it took 6 to 7 iterations for FPI to converge. The computational cost of SDR is dependent on the choice of solver, but has a considerably higher complexity in terms of LI [29]. For the experiment presented here, SDR took 1752 s, FPI took between 0.233 s to 0.303 s, and ACC took 0.046 s. These numbers only include the time required to calculate the control filter and power allocation, and not the covariance matrices which are similarly calculated for all methods.

C. MUSIC SIGNALS
More realistic signals were also considered, specifically music from the dataset [39]. The sample rate was increased to 8000 Hz, the control filter length to I = 256, and the reverberation time to RT 60 = 0.55 s. Other parameters were identical to the previous experiment. SDR was not included because of its high computational cost and because it gives identical results as FPI. The covariance matrices were computed from 1 min of audio. The computed transmit power and the control filter square norm can be seen in Table 3. The same 1min of audio used to estimate the covariances was transmitted, for which the resulting transmit power and SINR can be seen in Table 4. Table 4 shows that for FPI with x z (n) = δ(n) which does not take the audio signal characteristics into account at all, the resulting SINR is not close to the desired values. The music signals are further from spectrally white compared to the synthetic signals, so the method performs even worse here. Comparing FPI withÃ z = I to FPI with α = 10 −2 , by including the audio signal characteristics in the power minimization, the resulting transmit power is approximately halved. Finally, ACC produces more than 10 times the transmit power compared to FPI at similar SINR.
The mean spectrum of the microphone signals in zone 1 is shown in Fig. 3 for a representative selection of the considered methods, compared against the resulting spectrum without sound zone control, which is defined as x 1 (n) transmitted unchanged from a single loudspeaker in the absence of interference and noise. Both FPI and ACC produces a narrowband response compared to the uncontrolled sound. To obtain a more spectrally smooth response, additional penalty terms can be added to the cost function (6).

VI. CONCLUSION
A formulation of sound zone control without superposition was proposed, where the control filters are jointly optimized for low transmit power, while maintaining a sufficiently high SINR. The optimization problem was shown to be solved optimally by semidefinite relaxation, but the computational cost was high. An alternate method was proposed, based on a virtual receive optimization problem derived from the original transmit optimization problem. The two problems were shown to have the same optimal control filters, while the former can be solved more efficiently. Through simulations it was shown that the proposed methods are effective, leading to lower transmit power at similar SINR compared to the popular ACC method. The importance of considering the spectral characteristics of the audio signals was demonstrated, as well as including appropriate regularization.

APPENDIX A EQUIVALENCE OF TRANSMIT AND VIRTUAL RECEIVE OPTIMIZATION PROBLEMS
In this section the virtual receive optimization problem (12) is derived from the transmit optimization problem (6), demonstrating that the two problems have the same optimal control filters. The derivations follows the approach in [40].
The optimal solution of (6) will satisfy the SINR constraints with equality, meaning the problem can be equivalently expressed as minimize p,w 1 ,...,w Z p s subject to (I − D )p = Dσ The positivity constraint for p can be replaced by a constraint on the spectral radius of the matrix D . An irreducible nonnegative square matrix Q has a strictly positive spectral radius ρ(Q) > 0 which is an algebraically simple eigenvalue, and there is a unique vector that satisfies x > 0, x 1 = 1, and Qx = ρ(Q)x [41,Th. 8.4.4]. If the constraints are fulfilled in (24), so that (I − D )p = Dσ, then looking at each row individually, all three of By combining (25) and (26) the following can be asserted Any feasible point from (24) will therefore satisfy ρ(D ) < 1, and the reverse is also true, that any D satisfying ρ(D ) < 1 will give rise to a strictly positive p. This means that (24) can be equivalently rewritten as Expanding the first constraint, the same optimization problem becomes minimize q,w 1 ,...,w Z σ q subject to q zw z R zzwz where the equality constraint is relaxed into an inequality constraint, which does not change the optimum. This is the virtual receive optimization problem (12) in a familiar form, and it is thereby shown that the optimal control filters for (31) and (6) are the same.
The Lagrangian dual function is obtained by minimizing (32) over w 1 , . . . , w Z [44]. The value of the dual function is σ q if it is bounded below. The Lagrangian dual problem is the maximization of the Lagrangian dual function, which is where the constraint ensures that the dual problem is feasible, that min w 1 ,...,w Z L(w 1 , . . . , w Z , q) > −∞.
The first constraint in (34) is equivalent tow z G zwz ≥ 0 for allw z . The condition being true for all vectors is equivalent to maxw z −w z G zwz ≤ 0. Replacing the constraint in (34), the problem becomes maximize q,w 1 ,...,w Z σ q subject to q zw z R zzwz The constraints will be satisfied with equality at the optimum, so for a given control filterw z the power allocation q has a unique solution determined by the constraint. Therefore the solution of (35) is given by the fixed point of maximizē w 1 ,...,w Z q zw z R zzwz The same argument can be applied to the problem minimize q,w 1 ,...,w Z σ q subject to q zw z R zzwz so (35) and (37) have the same optimal solution. Scalingw z does not change the left-hand side of the constraint in (37). It is therefore possible to introduce the constraint w z 2 = 1, meaning that the derived problem is exactly equal to (12).