Quantized Low-Rank Multivariate Regression With Random Dithering

Low-rank multivariate regression (LRMR) is an important statistical learning model that combines highly correlated tasks into a multiresponse regression problem with a low-rank prior on the coefficient matrix. In this paper, we study quantized LRMR, a practical setting where the responses and/or the covariates are discretized to finite precision. We focus on the estimation of the underlying coefficient matrix. To make possible a consistent estimator that can achieve arbitrarily small error, we employ uniform quantization with random dithering, i.e., we add appropriate random noise to the data before quantization. Specifically, uniform dither and triangular dither are used for the responses and covariates, respectively. Based on the quantized data, we propose constrained Lasso and regularized Lasso estimators and derive non-asymptotic error bounds. With the aid of dithering, the estimators achieve the minimax optimal rate, while quantization only slightly worsens the multiplicative factor in the error rate. Moreover, we extend our results to a low-rank regression model with matrix responses. We corroborate and demonstrate our theoretical results via simulations on synthetic data, image restoration, and a real data application.


I. INTRODUCTION
Quantization is the process of mapping a continuous input to a discrete form (e.g., a finite dictionary or a finite number of bits) [27]. Quantization of signals or data has recently received considerable attention in the signal processing, statistics and machine learning communities. In some signal processing problems, the power consumption, manufacturing cost and chip area of analog-to-digital devices grow exponentially with their resolution [36]. In this situation, it is infeasible to use high-precision data or signals, and quantization with relatively low resolution is preferable, e.g., see the distributed machine learning system described in [19]. Besides, in modern machine learning problems extremely large datasets and highly complex models are ubiquitous, which often lead to distributed learning systems [40], i.e., settings involving repeated communication among multiple compute nodes, which are oftentimes GPUs or linked processors within a single machine or even across multiple machines. When the participating workers are large in number and have slow or unstable internet connections (e.g., low-power or low-bandwidth devices such as mobile devices), the communication cost becomes prohibitive [40], [53], and recent works have studied how to send a small number of bits by quantization to overcome this bottleneck [3], [33], [34], [38], [53], [58], [72]. More specifically, working with low-precision training data has proven useful in reducing computation cost when training linear models, as shown by the experimental results in [70]. Additionally, while sending quantized gradients is the mainstream approach in machine learning, it may be inefficient in distributed learning with a huge number of parameters to learn; in this case, transmitting some important quantized data samples can provably reduce the communication cost [30]. Thus, it is of particular interest to theoretically investigate the interplay between parameter learning and data quantization in fundamental statistical learning or estimation problems, e.g., [15], [16], [23].
Departing momentarily from quantization, low-rank multivariate regression (LRMR), also known as multi-task learning or reduced-rank regression [2], [12], [52], is a widely used statistical machine learning model. For clarity we first provide its mathematical formulation:

y_k = Θ_0^⊤ x_k + ε_k,  k = 1, ..., n,  (1)

where the main goal is to learn the underlying parameter Θ_0 ∈ R^{d_1×d_2} from the covariate-response pairs (x_k, y_k) ∈ R^{d_1} × R^{d_2}. Compared to the canonical regression problem with scalar response (e.g., linear regression), the core spirit of LRMR is to combine and jointly solve d_2 highly correlated tasks. In particular, the coefficient vectors of the d_2 tasks are merged into Θ_0 in (1), and the low-rankness of Θ_0 is often assumed to exploit the "intrinsic relatedness" of the d_2 learning problems (e.g., [25], [26], [46]). This model can capture many natural phenomena and hence has a broad range of applications. For example, in genomics studies [8], the gene expression profiles (y_k) and the genetic markers (x_k) can be approximately associated through only a few linear combinations of highly correlated genetic markers. Therefore, recovering a low-rank, and sometimes also sparse, coefficient matrix holds the key to revealing such connections between the responses and predictors. In addition, in the study of functional magnetic resonance imaging (fMRI) [32], each voxel within the brain is represented by a time series of neurophysiological activity. Combining the multivariate voxel-based time series, researchers use a linear model to describe the underlying large-scale network connectivities among functionally specialized regions in the brain. A practical way is to use a suitable matrix to identify these complex interconnections in the brain; when aiming at modelling the connections via only a small subset of the given data, one often imposes appropriate structures (e.g., low-rankness, sparsity) on the coefficient matrix. Other applications include the analysis of electroencephalography (EEG) data decoding [1], neural response modeling [7], analysis of financial data [52], chemometrics, psychometrics and econometrics [69], to name just a few.
To address the issue, in this paper we study LRMR under dithered quantization, i.e., quantization involving random dithering, a process that adds random noise to the signal before quantization. The benefit of dithering for image or speech signals was empirically observed quite early [35], [43], [54], while the theoretical results for the quantization error/noise were established in [56]; see also the cleaner proof provided by [28]. In a nutshell, the benefit of dithering is to whiten the quantization noise. Even more surprisingly, with a suitable dither the quantization errors follow an i.i.d. uniform distribution (Lemma 1(a)). While we focus on the dithered uniform quantizer, interested readers may consult [66] for an extensive treatment of quantization noise under various quantizers.
We deal with the quantization of both the response and the covariate. We propose to use uniform dither for y_k and triangular dither for x_k (see the precise definitions later), and then apply the uniform quantizer. Note that the quantization method is memoryless and thus well suited to hardware implementation. Our main contributions are as follows:
• Based on the quantized data, we develop an empirical ℓ2 loss, which, coupled with either a nuclear norm constraint or regularization, leads to Lasso estimators. We establish minimax optimal non-asymptotic error bounds for the estimators in the cases of "partial quantization" (i.e., only y_k is quantized) and "complete quantization" (i.e., both x_k and y_k are quantized). The bounds also characterize how the quantization resolution affects the estimation error.
• We show that our quantization method is also applicable to a low-rank linear regression model with matrix response recently studied in [41]. Our Lasso estimators based on quantized data still achieve error rates comparable to the full-data regime in [41].
Although we adopt a similar dithered quantization scheme (specifically, similar to [15]), the estimation problem in this paper totally differs from CS. In particular, we study regression models whose multivariate response can be a vector of considerably large dimension (LRMR in section III) or even a huge matrix (section IV), in sharp contrast to the scalar measurement y_k in CS. A different point of view is to consider each scalar entry of the response in (1). Let Θ_0 = [θ_{0,1}, ..., θ_{0,d_2}]; then the i-th entry of y_k in (1) can be expressed as y_{ki} = x_k^⊤ θ_{0,i} + ε_{ki}. Because y_{ki} only involves the i-th column of the desired signal Θ_0, it is often referred to as a local measurement and considered to be less informative than the global measurements used in CS (see, e.g., [63]). As a consequence, the technical ingredients in this work, especially the techniques to bound the various random terms arising in the proofs, significantly deviate from those in quantized CS.
From the more statistical side, without considering any data quantization procedure, many statistical procedures have been developed for estimation and prediction in multivariate regression. Among them the most relevant are the regularized ones that minimize an objective constituted by a loss function and a suitable regularizer; see [17], [39], [41], [46], [55], [69] for instance. Indeed, the key theoretical achievement of this work is to show the compatibility between the dithered uniform quantizer and the Lasso estimator. That is, the Lasso estimator can still achieve near optimal estimation error from data quantized by a uniform quantizer with appropriate random dither.

B. Outline
The remainder of this paper is organized as follows: we provide the notational conventions and preliminaries in section II; we propose our Lasso estimators for quantized LRMR and present the theoretical results in section III; the main results are then extended to low-rank linear model with matrix response in section IV; we provide experimental results in section V to validate our theory; we give some remarks to conclude this work in section VI.

II. PRELIMINARIES

(Notation). We denote matrices and vectors by boldface letters and scalars by regular letters. We write [m] = {1, ..., m} for a positive integer m. For vectors x, y ∈ R^d, we work with the ℓ_p norm ‖x‖_p and the inner product ⟨x, y⟩ = x^⊤y. For matrices A = [a_{ij}] and B, we work with the transpose A^⊤, the operator norm ‖A‖_op, the Frobenius norm ‖A‖_F, the nuclear norm ‖A‖_nu (sum of the singular values), the max norm ‖A‖_∞ = max_{i,j} |a_{ij}|, and the inner product ⟨A, B⟩ = Tr(A^⊤B). The standard Euclidean sphere of R^d is denoted by S^{d−1}. We let ‖X‖_{ψ_2} (resp. ‖X‖_{ψ_1}) be the sub-Gaussian norm (resp. sub-exponential norm) of a random variable X, and ‖X‖_{L_p} = (E|X|^p)^{1/p} be the L_p norm. We represent universal constants by C, c, C_i or c_i, whose values may vary from line to line. We write T_1 = O(T_2) (or T_1 ≲ T_2) if T_1 ≤ CT_2, T_1 = Ω(T_2) (or T_1 ≳ T_2) if T_1 ≥ cT_2, and T_1 ≍ T_2 if T_1 = O(T_2) and T_2 = Ω(T_1) simultaneously hold. We use U(W) to denote the uniform distribution over W. We use vec(A) ∈ R^{mn×1} to vectorize a matrix A ∈ R^{m×n}, while mat(·) denotes the inverse operator.

A. High-dimensional probability
A random variable X with finite ‖X‖_{ψ_2} is said to be sub-Gaussian. Note that a sub-Gaussian X exhibits an exponentially decaying probability tail, i.e., for any t > 0,

P(|X| ≥ t) ≤ 2 exp(−ct²/‖X‖²_{ψ_2}).

Similarly, X with finite ‖X‖_{ψ_1} is sub-exponential and has the following tail bound for any t > 0:

P(|X| ≥ t) ≤ 2 exp(−ct/‖X‖_{ψ_1}).

For an n-dimensional random vector X we let ‖X‖_{ψ_2} = sup_{v∈S^{n−1}} ‖v^⊤X‖_{ψ_2}.

B. Dithered uniform quantization
First, we describe the dithered uniform quantization with quantization level δ > 0 for an input signal x ∈ R^N as follows:
• Independent of x, we i.i.d. draw the entries of the random dither τ ∈ R^N from some suitable distribution;
• Then, we quantize x to ẋ = Q_δ(x + τ), with Q_δ(a) := δ(⌊a/δ⌋ + 1/2) (a ∈ R) applied element-wise.
We adopt the following conventions (as in [27], [28]): w := ẋ − (x + τ) is the quantization error, and ξ := ẋ − x is the quantization noise.
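The scheme above is memoryless and simple to implement. The following is a minimal NumPy sketch of it; the function and variable names are ours for illustration, not from any released code.

```python
import numpy as np

def uniform_quantize(a, delta):
    """Uniform quantizer Q_delta(a) = delta * (floor(a / delta) + 1/2), element-wise."""
    return delta * (np.floor(a / delta) + 0.5)

def dithered_quantize(x, delta, dither="uniform", rng=None):
    """Quantize x to x_dot = Q_delta(x + tau) after adding a random dither tau.

    dither="uniform":    tau ~ U([-delta/2, delta/2])           (used for responses)
    dither="triangular": tau ~ sum of two independent uniforms  (used for covariates)
    """
    rng = np.random.default_rng(rng)
    if dither == "uniform":
        tau = rng.uniform(-delta / 2, delta / 2, size=x.shape)
    elif dither == "triangular":
        tau = (rng.uniform(-delta / 2, delta / 2, size=x.shape)
               + rng.uniform(-delta / 2, delta / 2, size=x.shape))
    else:
        raise ValueError("unknown dither type")
    return uniform_quantize(x + tau, delta)
```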
The principal properties of the dithered quantization that underlie our analysis are provided in Lemma 1.
Lemma 1. (Theorems 1-2 in [28]). We consider the above dithered uniform quantization: x = [x_i] is the input signal, τ = [τ_i] is the random dither whose entries are i.i.d. copies of a random variable Y. We use i to denote the complex unit.
(a) (Quantization Error). Let w := ẋ − (x + τ) = [w_i] be the quantization error. If f(u) := E(exp(iuY)) satisfies f(2πl/δ) = 0 for all non-zero integers l, then x_i and w_j are independent for all i, j ∈ [N], and the w_j are i.i.d. distributed as U([−δ/2, δ/2]).
(b) (Quantization Noise). Let ξ := ẋ − x = [ξ_i] be the quantization noise, let Z ∼ U([−δ/2, δ/2]) be independent of Y, and set g(u) := E(exp(iu(Y + Z))). Given a positive integer p, if the p-th order derivative g^{(p)}(u) satisfies g^{(p)}(2πl/δ) = 0 for all non-zero integers l, then the p-th conditional moment of ξ_i does not depend on x; more precisely, we have E[ξ_i^p | x] = E(Y + Z)^p.
Given quantization level δ > 0, in this work we focus on the uniform dither τ_i ∼ U([−δ/2, δ/2]) and the triangular dither τ_i ∼ U([−δ/2, δ/2]) + U([−δ/2, δ/2]) (i.e., the sum of two independent uniform distributions). From Lemma 1, the following properties are immediate. The proof can be found in the Appendix.
Corollary 1. For both the uniform dither τ_i ∼ U([−δ/2, δ/2]) and the triangular dither τ_i ∼ U([−δ/2, δ/2]) + U([−δ/2, δ/2]), x_i and w_j are independent (∀i, j), and {w_j : j ∈ [N]} are i.i.d. distributed as U([−δ/2, δ/2]). In addition, for the triangular dither, the variance of the quantization noise ξ_i is independent of the signal; more precisely, it holds that E[ξ_i² | x] = δ²/4.
The benefit of using a proper dither (e.g., the uniform dither) is now clear, i.e., to whiten the quantization noise: under either dither above, the quantization errors behave as i.i.d. U([−δ/2, δ/2]) noise independent of the input.
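These moment properties are easy to check empirically. The sketch below (our own illustration, building on the quantizer above) estimates the first two moments of the quantization noise ξ = ẋ − x under the triangular dither; the second moment should be close to δ²/4 regardless of the input signal.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.3
x = rng.normal(2.0, 1.0, size=1_000_000)   # arbitrary (even non-centered) input

# triangular dither: sum of two independent uniforms on [-delta/2, delta/2]
tau = (rng.uniform(-delta / 2, delta / 2, x.shape)
       + rng.uniform(-delta / 2, delta / 2, x.shape))
x_dot = delta * (np.floor((x + tau) / delta) + 0.5)

xi = x_dot - x                              # quantization noise
print(np.mean(xi))                          # close to 0 (unbiased)
print(np.mean(xi**2), delta**2 / 4)         # close to delta^2/4, signal-independent
```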

III. QUANTIZED LOW-RANK MULTIVARIATE REGRESSION
The low-rank multivariate regression (LRMR) model is

y_k = Θ_0^⊤ x_k + ε_k,  k = 1, ..., n,  (5)

where x_k ∈ R^{d_1} is the covariate, y_k ∈ R^{d_2} is the response perturbed by the random noise ε_k, and Θ_0 ∈ R^{d_1×d_2} is the desired parameter. Our goal is to estimate Θ_0 from the (x_k, y_k)'s. We make the following sub-Gaussian assumption.

Assumption 1. The covariates x_1, ..., x_n are i.i.d., zero-mean (E x_k = 0) and sub-Gaussian with ‖x_k‖_{ψ_2} ≤ K, and the covariance matrix Σ_xx := E(x_k x_k^⊤) satisfies λ_min(Σ_xx) ≥ κ_0 for some κ_0 > 0; independent of the x_k's, the noise vectors ε_1, ..., ε_n are i.i.d., zero-mean and sub-Gaussian with ‖ε_k‖_{ψ_2} ≤ E.

Note that we assume E x_k = 0 for simplicity, and the case of E x_k ≠ 0 can be addressed by data centering or by including an intercept term in (5). It should be noted that these distributional assumptions are standard and commonly adopted for analysing regularized M-estimators (defined in (6) shortly) in multiresponse regression problems; see [46, Coro. 3], [29], [51] for instance. Indeed, our Assumption 1 slightly relaxes the assumptions made in these prior works from Gaussian data to sub-Gaussian data, and this relaxation is important for certain cases, e.g., when we work with binary data that cannot be captured by a Gaussian distribution.
Although this multivariate regression model has been intensively studied in the literature (e.g., [26], [46], [55]), the novelty of this work lies in the quantization that is inevitable in the era of digital signal processing. In particular, we study "partial quantization", where only the response is quantized, as well as the trickier setting of "complete quantization", where the entire covariate-response pair (x_k, y_k) is quantized to finite precision. We propose the dithered quantization scheme as follows:
• (Covariate quantization). Independent of (x_k, y_k), we i.i.d. draw the triangular dither φ_k ∼ U([−δ_1/2, δ_1/2]^{d_1}) + U([−δ_1/2, δ_1/2]^{d_1}), and then quantize x_k to ẋ_k = Q_{δ_1}(x_k + φ_k);
• (Response quantization). Independent of (x_k, y_k), we i.i.d. draw the uniform dither τ_k ∼ U([−δ_2/2, δ_2/2]^{d_2}), and then quantize y_k to ẏ_k = Q_{δ_2}(y_k + τ_k).
Note that δ_1 = 0 means no quantization of x_k, thus corresponding to "partial quantization" that only involves response quantization. While almost all related works studied response quantization (as reviewed in section I-A), we comment on the necessity of also studying covariate quantization (δ_1 > 0). For instance, when LRMR appears as a distributed learning problem where the features are transmitted among multiple parties, quantization is often needed to reduce the communication cost. Also note that a more direct benefit is the lower memory load.

A. The empirical loss under quantization
Using the vector ℓ1 norm as a regularizer to promote sparsity, Lasso is viewed as a benchmark procedure for recovering sparse vectors [62]. The efficacy of Lasso extends to the recovery of low-rank matrices by replacing the ℓ1 norm with the matrix nuclear norm; see [11], [46] for instance. Having assumed Θ_0 to be low-rank, one can apply a similar idea to LRMR and formulate the regularized Lasso recovery program as

arg min_{Θ ∈ R^{d_1×d_2}} L(Θ) + λ‖Θ‖_nu,  (6)

where L(Θ) := (2n)^{−1} Σ_{k=1}^n ‖y_k − Θ^⊤x_k‖_2² is the ℓ2 loss function for data fitting purposes, λ‖Θ‖_nu is the regularization part for the low-rank structure, and λ should be tuned to balance data fidelity and low-rankness. Note that (6) also falls into the class of M-estimators [46], [47]. When a good estimate of ‖Θ_0‖_nu is available, one can also consider the constrained Lasso

arg min_{‖Θ‖_nu ≤ R} L(Θ).  (7)

We note that Lasso is known to achieve the minimax rate in LRMR; see [46] for instance. However, under data quantization one can only access (ẋ_k, ẏ_k) for recovery; while the regularizer ‖Θ‖_nu is unproblematic, one evidently lacks the full data for constructing the empirical ℓ2 loss L(Θ), so a modification of L(Θ) is needed. To draw some inspiration, a quite instructive first step is to calculate the expected ℓ2 loss:

E L(Θ) = (1/2) E‖y_k − Θ^⊤x_k‖_2² (i)= (1/2)⟨ΘΘ^⊤, E(x_kx_k^⊤)⟩ − ⟨Θ, E(x_ky_k^⊤)⟩ (ii)= (1/2)⟨ΘΘ^⊤, Σ_xx⟩ − ⟨Θ, Σ_xy⟩,

where (i) holds up to a constant that has no effect on the optimization, and in (ii) we introduce the shorthand for the covariances Σ_xx = E(x_kx_k^⊤) and Σ_xy = E(x_ky_k^⊤). Therefore, in order to construct a suitable empirical ℓ2 loss, we need to find surrogates for Σ_xx, Σ_xy based on (ẋ_k, ẏ_k).
To facilitate the exposition, we reserve the following notation in subsequent developments: for the quantization of x_k with dither φ_k, w_{k1} := ẋ_k − (x_k + φ_k) is the quantization error and ξ_{k1} := ẋ_k − x_k is the quantization noise; for the quantization of y_k with dither τ_k, w_{k2} := ẏ_k − (y_k + τ_k) stands for the quantization error, while ξ_{k2} := ẏ_k − y_k is the quantization noise. We use ξ_{kj,i} to denote the i-th entry of ξ_{kj}, and the meanings of notation like w_{kj,i}, φ_{k,i} are similar. Now we are ready to present a lemma that indicates suitable surrogates for Σ_xx, Σ_xy.

Lemma 2. Based on the quantized data (ẋ_k, ẏ_k), we let

Σ̂_xy := (1/n) Σ_{k=1}^n ẋ_kẏ_k^⊤  and  Σ̂_xx := (1/n) Σ_{k=1}^n ẋ_kẋ_k^⊤ − (δ_1²/4) I_{d_1};  (8)

then E Σ̂_xy = Σ_xy and E Σ̂_xx = Σ_xx.

Proof. We first calculate the easier E(ẋ_kẏ_k^⊤) = E[(x_k + φ_k + w_{k1})(y_k + τ_k + w_{k2})^⊤] = E(x_ky_k^⊤), where all terms but E(x_ky_k^⊤) vanish, due to the nice property that the dithers and quantization errors φ_k, τ_k, w_{k1}, w_{k2} are zero-mean and independent of (x_k, y_k) (Corollary 1). Then, since E(ξ_{k1}ξ_{k1}^⊤) = (δ_1²/4) I_{d_1}, we calculate E(ẋ_kẋ_k^⊤) as follows:

E(ẋ_kẋ_k^⊤) = E[(x_k + ξ_{k1})(x_k + ξ_{k1})^⊤] = Σ_xx + E(ξ_{k1}ξ_{k1}^⊤) = Σ_xx + (δ_1²/4) I_{d_1},

where the diagonal entries E ξ_{k1,i}² = δ_1²/4 (Corollary 1), and for i ≠ j, E(ξ_{k1,i}ξ_{k1,j}) = E[(φ_{k,i} + w_{k1,i})(φ_{k,j} + w_{k1,j})] = 0, again due to the properties in Corollary 1. The proof is complete.
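In code, Lemma 2 amounts to forming the plug-in covariances from the quantized samples and subtracting the δ_1²/4 bias on the diagonal. A sketch under our own naming conventions:

```python
import numpy as np

def surrogate_covariances(X_dot, Y_dot, delta1):
    """Unbiased surrogates of Sigma_xx and Sigma_xy from quantized data (Lemma 2).

    X_dot: (n, d1) quantized covariates (triangular dither, level delta1)
    Y_dot: (n, d2) quantized responses (uniform dither)
    """
    n, d1 = X_dot.shape
    Sigma_xx_hat = X_dot.T @ X_dot / n - (delta1**2 / 4) * np.eye(d1)
    Sigma_xy_hat = X_dot.T @ Y_dot / n
    return Sigma_xx_hat, Sigma_xy_hat
```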
Remark 1. (Triangular dither) While the uniform dither is a quite standard choice in the literature, we comment on the necessity of using triangular dither for x_k. In essence, this is because in the estimation of Σ_xx, the quantized sample covariance contains the bias E(ξ_{k1}ξ_{k1}^⊤) (see (8)), which must be removed. However, the diagonal entry E[ξ_{k1,i}² | x_k] remains signal-dependent, hence unknown, under a uniform dither; see [28, Page 3]. Fortunately, by Lemma 1(b), the direct remedy is to use a dither whose quantization noise has signal-independent variance, e.g., the triangular dither U([−δ_1/2, δ_1/2]) + U([−δ_1/2, δ_1/2]). Such triangular dither was also adopted in [15] when studying covariate quantization in compressed sensing.
With all these preparations, we are in a position to specify the empirical loss

L̂(Θ) := (1/2)⟨ΘΘ^⊤, Σ̂_xx⟩ − ⟨Θ, Σ̂_xy⟩.  (9)

Note that L̂(Θ) reduces to the ordinary ℓ2 loss L(Θ) (up to an additive constant) if δ_1 = δ_2 = 0. Further combined with the regularizer, the Lasso recovery procedures for our quantized setting can be proposed. The remainder of this section is devoted to the theoretical analysis of Lasso.
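Since (9) is quadratic in Θ, the loss and its gradient ∇L̂(Θ) = Σ̂_xxΘ − Σ̂_xy are cheap to evaluate, which is all a first-order solver needs. A minimal sketch (our own helper functions):

```python
import numpy as np

def loss_hat(Theta, Sigma_xx_hat, Sigma_xy_hat):
    """Empirical loss (9): 0.5 * <Theta Theta^T, Sigma_xx_hat> - <Theta, Sigma_xy_hat>."""
    return 0.5 * np.sum((Sigma_xx_hat @ Theta) * Theta) - np.sum(Theta * Sigma_xy_hat)

def grad_loss_hat(Theta, Sigma_xx_hat, Sigma_xy_hat):
    """Gradient of (9): Sigma_xx_hat @ Theta - Sigma_xy_hat."""
    return Sigma_xx_hat @ Theta - Sigma_xy_hat
```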

B. Constrained Lasso
First, we study the constrained Lasso, where the low-rankness is promoted by a "hard" constraint. Indeed, we simply substitute the unknown L(Θ) in (7) with L̂(Θ), and to focus on the estimation problem per se we ideally assume the prior estimate is precise, i.e., R := ‖Θ_0‖_nu. Hence, we formulate the constrained Lasso estimator as

Θ̂_c := arg min_{‖Θ‖_nu ≤ R} L̂(Θ),  (10)

where L̂(Θ) is defined in (9). For convenience we define the estimation error ∆_c := Θ̂_c − Θ_0. We begin with two lemmas that will support the proof of our main theorem.
Lemma 3 follows a similar course to [46, Lemma 3] and involves a standard covering argument; we defer the proof to the Appendix.

Lemma 3. Assume a_1, ..., a_n ∈ R^{d_1} are independent with max_k ‖a_k‖_{ψ_2} ≤ K_a, and b_1, ..., b_n ∈ R^{d_2} are independent with max_k ‖b_k‖_{ψ_2} ≤ K_b. If n ≳ d_1 + d_2, then with probability at least 1 − exp(−c(d_1 + d_2)) it holds that

‖(1/n) Σ_{k=1}^n {a_kb_k^⊤ − E(a_kb_k^⊤)}‖_op ≲ K_aK_b √((d_1 + d_2)/n).

Lemma 4. Under Assumption 1 and the above quantization scheme, for any t > 0,

‖Σ̂_xx − Σ_xx‖_op ≤ A_1 (√((d_1 + t)/n) + (d_1 + t)/n)  (11)

holds with probability at least 1 − 2 exp(−t), where the multiplicative factor A_1 ≍ (K + δ_1)².

Proof. By (8) we first note that Σ̂_xx − Σ_xx = (1/n) Σ_k ẋ_kẋ_k^⊤ − E(ẋ_kẋ_k^⊤). We then verify the sub-Gaussianity of ẋ_k: since ẋ_k = x_k + φ_k + w_{k1}, with ‖v^⊤φ_k‖_{ψ_2} ≲ δ_1 and ‖v^⊤w_{k1}‖_{ψ_2} ≲ δ_1, we have ‖v^⊤ẋ_k‖_{ψ_2} ≲ K + δ_1 for any v ∈ S^{d_1−1}. Therefore, it holds that ‖ẋ_k‖_{ψ_2} ≲ K + δ_1. Finally, we can invoke [64, Exercise 4.7.3], which is a well-known estimate in covariance estimation, to arrive at the desired claim.
We are now in a position to present our first main theorem on the error bound of (10). The proof follows standard lines for analysing regularized M-estimators (e.g., [46]), but there are additional random terms to bound due to the quantization noise/error, e.g., (ξ_{k1}, ξ_{k2}) in T_1 and T_2 in (18).

Theorem 1. (Constrained Lasso). We consider LRMR under Assumption 1 and the quantization procedure described above. We assume the sample complexity n ≳ (A_1/κ_0)²(d_1 + d_2), where A_1 is the multiplicative factor in (11). Then for the estimator Θ̂_c in (10) we have the following guarantees.
(a) (Partial Quantization). If δ_1 = 0, then with probability at least 1 − c_1 exp(−c_2(d_1 + d_2)) it holds that

‖∆_c‖_F ≲ (A_2/κ_0) √(r(d_1 + d_2)/n), where A_2 := K(E + δ_2).  (12)

(b) (Complete Quantization). If δ_1 > 0, then with probability at least 1 − c_3 exp(−c_4(d_1 + d_2)) it holds that

‖∆_c‖_F ≲ (A_3/κ_0) √(r(d_1 + d_2)/n), where A_3 := (K + δ_1)(E + δ_2 + δ_1R).  (13)

Proof. We begin with the optimality of Θ̂_c: since ‖Θ_0‖_nu ≤ R, we have L̂(Θ̂_c) ≤ L̂(Θ_0). Then we use Θ̂_c = Θ_0 + ∆_c and perform some algebra to arrive at

(1/2)⟨∆_c∆_c^⊤, Σ̂_xx⟩ ≤ ⟨∆_c, Σ̂_xy − Σ̂_xxΘ_0⟩,  (14)

and the remainder of the proof is essentially to bound both sides of (14).
Step 1. Bound the left-hand side from below.
Due to the scaling n ≳ (A_1/κ_0)²(d_1 + d_2), Lemma 4 (with t ≍ d_1 + d_2) implies that λ_min(Σ̂_xx) ≥ κ_0/2 with high probability. Therefore, with high probability we have

(1/2)⟨∆_c∆_c^⊤, Σ̂_xx⟩ ≥ (κ_0/4)‖∆_c‖_F².  (15)

Step 2. Bound the right-hand side from above. Note that

⟨∆_c, Σ̂_xy − Σ̂_xxΘ_0⟩ ≤ ‖∆_c‖_nu ‖Σ̂_xy − Σ̂_xxΘ_0‖_op.  (16)

To bound ‖∆_c‖_nu, we let Θ_0 = U_1ΣV_1^⊤ be the (compact) singular value decomposition, where U_1 ∈ R^{d_1×r} and V_1 ∈ R^{d_2×r} have orthonormal columns. Following [48], we define a pair of subspaces

M := {A ∈ R^{d_1×d_2} : col(A) ⊆ col(U_1), row(A) ⊆ col(V_1)}, M̄^⊥ := {A ∈ R^{d_1×d_2} : col(A) ⊥ col(U_1), row(A) ⊥ col(V_1)}.

For a subspace V, we let V^⊥ be its orthogonal complement and P_V(·) be the projection onto V. Then it is not hard to see the decomposability [48]:

‖P_M A + P_{M̄^⊥} B‖_nu = ‖P_M A‖_nu + ‖P_{M̄^⊥} B‖_nu.  (17)

Combined with the constraint ‖Θ_0 + ∆_c‖_nu ≤ ‖Θ_0‖_nu, this yields ‖P_{M̄^⊥}∆_c‖_nu ≤ ‖P_{M̄}∆_c‖_nu, and hence

‖∆_c‖_nu ≤ 2‖P_{M̄}∆_c‖_nu ≤ 2√(2r)‖∆_c‖_F.

The last inequality is because rank(A) ≤ 2r if A ∈ M̄, and it always holds that ‖A‖_nu ≤ √(rank(A)) ‖A‖_F. It remains to bound ‖Σ̂_xy − Σ̂_xxΘ_0‖_op. We first plug in Σ̂_xx, Σ̂_xy and use y_k = Θ_0^⊤x_k + ε_k to decompose

Σ̂_xy − Σ̂_xxΘ_0 = T_1 + T_2,  (18)

where T_1 := (1/n) Σ_k ẋ_k(ε_k + ξ_{k2})^⊤ and T_2 := −(1/n) Σ_k {ẋ_kξ_{k1}^⊤ − E(ẋ_kξ_{k1}^⊤)} Θ_0. We consider the case of partial quantization (δ_1 = 0). In this case ξ_{k1} = 0, so T_2 = 0 and we only need to bound ‖T_1‖_op. Since ‖ẋ_k‖_{ψ_2} ≲ K and ‖ε_k + ξ_{k2}‖_{ψ_2} ≲ E + δ_2, Lemma 3 yields, with the promised probability,

‖T_1‖_op ≲ K(E + δ_2) √((d_1 + d_2)/n).  (19)

Overall, we have

⟨∆_c, Σ̂_xy − Σ̂_xxΘ_0⟩ ≲ K(E + δ_2) √(r(d_1 + d_2)/n) ‖∆_c‖_F.  (20)

The result of part (a) follows by putting (15) and (20) into (14).
(b) We then consider the complete quantization case (δ_1 > 0).
Similarly to (a), we have the bound ‖T_1‖_op ≲ (K + δ_1)(E + δ_2) √((d_1 + d_2)/n). So it remains to bound ‖T_2‖_op. Since ‖Θ_0‖_op ≤ ‖Θ_0‖_nu = R, and by applying Lemma 3 to (1/n) Σ_k {ẋ_kξ_{k1}^⊤ − E(ẋ_kξ_{k1}^⊤)}, with the promised probability we have

‖T_2‖_op ≲ (K + δ_1)δ_1 R √((d_1 + d_2)/n).  (21)

By putting the pieces together similarly, we conclude the proof.
Several remarks are in order.

Remark 2. (Prediction error)
As presented in Theorem 1, we focus on the estimation of Θ_0 in this work, whereas in regression one may also be interested in the prediction performance. From the bound on ‖∆_c‖_F, a bound on the prediction error is indeed immediate. For instance, when δ_1 = 0, the in-sample prediction error satisfies (1/n) Σ_k ‖∆_c^⊤x_k‖_2² = ⟨∆_c∆_c^⊤, (1/n) Σ_k x_kx_k^⊤⟩ ≲ ‖∆_c‖_F², because ‖(1/n) Σ_k x_kx_k^⊤‖_op = O(1) with high probability.

Remark 3. (Compared to the least squares estimation) The ordinary least squares (OLS) estimator Θ̂_LS minimizes L̂(Θ) over Θ ∈ R^{d_1×d_2} without the nuclear norm constraint. This amounts to estimating the d_2 columns of Θ_0 separately without utilizing their correlations. Under similar assumptions on the covariate and noise, one can easily show ‖Θ̂_LS − Θ_0‖_F = O(√(d_1d_2/n)), which is essentially inferior to O(√(r(d_1 + d_2)/n)) in the case of r ≪ min{d_1, d_2}. This illustrates the benefit of incorporating the low-rank prior on Θ_0, which will be complemented by a numerical example later (Figure 5).

Remark 4. (Minimax optimality and the role of quantization) The non-asymptotic error bound O(√(r(d_1 + d_2)/n)) is minimax optimal compared to the information-theoretic lower bound in [55, Theorem 5] (also see [25, Remark 11], [26, Fact 1], [9, Page 12] for alternative statements). In fact, the quantization does not affect the order of (n, r, d_1, d_2) in the sample complexity and error bounds but only slightly worsens the multiplicative factors, i.e., A_1 in the sample complexity n ≳ (A_1/κ_0)²(d_1 + d_2), A_2 in (12) and A_3 in (13). Thus, in a regime where the quantization levels δ_1, δ_2 are fixed, our result matches the one without quantization up to a multiplicative constant. In addition, δ_2 and E are on an equal footing in A_2, hence the role of partial quantization can be nicely interpreted as additional sub-Gaussian noise. This extends similar findings in [59], [61], [68] from compressed sensing to LRMR. Further, a useful perspective is that the result for the setting without quantization can be recovered by letting δ_1 = δ_2 = 0. For instance, when δ_2 = 0 the bound in Theorem 1(a) reads as O((KE/κ_0)√(r(d_1 + d_2)/n)), thus agreeing with the bound in [46, Coro. 3]. The above discussions regarding the role of quantization remain valid for our subsequent results.

C. Regularized Lasso
Since a prior estimate of ‖Θ_0‖_nu is often unavailable, a more practically appealing recovery procedure is the following regularized Lasso:

Θ̂_p = arg min_{Θ ∈ R^{d_1×d_2}} L̂(Θ) + λ‖Θ‖_nu,  (22)

and we let ∆_p := Θ̂_p − Θ_0 be the estimation error. By properly tuning λ, the regularized Lasso estimator Θ̂_p achieves the same error rate as the previous Θ̂_c.
Theorem 2. (Regularized Lasso). We consider LRMR under Assumption 1 and the quantization procedure described above. We assume the scaling n ≳ (A_1/κ_0)²(d_1 + d_2), where A_1 is the multiplicative factor in (11). Then for the estimator Θ̂_p in (22) we have the following guarantees.
(a) (Partial Quantization). If δ_1 = 0, we set λ = C_1A_2√((d_1 + d_2)/n) with sufficiently large C_1; then with probability at least 1 − c_3 exp(−c_4(d_1 + d_2)) it holds that ‖∆_p‖_F ≲ (A_2/κ_0)√(r(d_1 + d_2)/n).
(b) (Complete Quantization). If δ_1 > 0 and ‖Θ_0‖_op ≤ R, we set λ = C_5A_3√((d_1 + d_2)/n) with sufficiently large C_5; then with probability at least 1 − c_7 exp(−c_8 d_1) it holds that ‖∆_p‖_F ≲ (A_3/κ_0)√(r(d_1 + d_2)/n).
By using some standard analyses for regularized M-estimators (e.g., see [46]), the proof of Theorem 2 follows similar lines to that of Theorem 1. We defer the proof to the Appendix.
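Program (22) can be solved by proximal gradient descent, whose proximal map for λ‖·‖_nu is singular value soft-thresholding. The sketch below is our own minimal solver for illustration (the experiments in section V use ADMM instead); note that for δ_1 > 0 the surrogate Σ̂_xx may be indefinite, so this plain iteration is a heuristic sketch rather than a certified algorithm.

```python
import numpy as np

def svt(Theta, thresh):
    """Proximal map of thresh * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def regularized_lasso(Sigma_xx_hat, Sigma_xy_hat, lam, n_iter=500):
    """Proximal gradient for (22): min 0.5*<ΘΘ^T, Σ̂_xx> - <Θ, Σ̂_xy> + lam*||Θ||_nu."""
    d1, d2 = Sigma_xy_hat.shape
    step = 1.0 / np.linalg.norm(Sigma_xx_hat, 2)   # 1 / Lipschitz constant of the gradient
    Theta = np.zeros((d1, d2))
    for _ in range(n_iter):
        grad = Sigma_xx_hat @ Theta - Sigma_xy_hat
        Theta = svt(Theta - step * grad, step * lam)
    return Theta
```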
It is clear that we need an additional operator norm bound ‖Θ_0‖_op ≤ R for the cases of complete quantization in Theorems 1-2, while this is not needed when we have access to the full-precision covariates. The following remark elaborates on this point.
Remark 5. (The norm constraint on Θ_0) When there is error in the covariate, a norm constraint on the true parameter Θ_0 seems indispensable rather than an artifact of the proof technique. The main reason is that the error in the covariate propagates along the true parameter, and hence its overall contribution to the response is proportional to the size of Θ_0. Note that a similar observation was also made in [44, Section 3.2] for corrected linear regression where the covariates suffer from zero-mean random noise with a known covariance matrix.

IV. QUANTIZED LOW-RANK LINEAR REGRESSION MODEL WITH MATRIX RESPONSE

The proposed quantization scheme enjoys broader applicability: as we show in this section, the dithered quantizer can be similarly applied to the problem of low-rank linear regression (L2RM) with matrix response [41]. In particular, such a regression model finds application in imaging genetics, with the matrix responses representing the weighted or binary adjacency matrices of a finite graph that characterizes structural or functional connectivity patterns, while the covariates are a set of genetic markers [45], [60], [65]. We would also like to note some recent advances on variable selection [31] and covariance estimation [71] for matrix-valued data.
Following the notation in [41], L2RM with matrix response can be formulated as

Y_k = Σ_{i=1}^s x_{ki} Θ_0^{(i)} + E_k,  k = 1, ..., n,  (24)

where Θ_0^{(1)}, ..., Θ_0^{(s)} ∈ R^{p×q} are the true coefficient matrices, x_k = [x_{k1}, ..., x_{ks}]^⊤ ∈ R^s is the covariate, and E_k, Y_k ∈ R^{p×q} are respectively the noise matrix and the response. Our goal is to estimate Θ_0 := [Θ_0^{(1)}, ..., Θ_0^{(s)}] ∈ R^{p×(sq)} under moderately large s but p, q that can be extremely huge.⁴ Analogously to Assumption 1, for analysing the nuclear norm regularized M-estimator (see (27) below), we make the following sub-Gaussian data assumptions that relax the Gaussian ones in [41, (A9)-(A11)].
Assumption 2. The assumptions on the covariates x_k are the same as in Assumption 1; independent of the x_k's, the noise matrices E_1, ..., E_n are i.i.d., zero-mean and sub-Gaussian with ‖E_k‖_{ψ_2} := sup_{u∈S^{p−1}} sup_{v∈S^{q−1}} ‖u^⊤E_kv‖_{ψ_2} ≤ E; the

⁴In fact, s in real applications can also be very large. For dimension reduction, [41] assumed Θ^{(i)}_0 = 0 for most i's and developed a screening method to estimate those i's with non-zero Θ^{(i)}_0. We focus on the estimation after this screening step.
matrix responses Y_k's are generated from (24) for some Θ_0 = [Θ_0^{(1)}, ..., Θ_0^{(s)}] satisfying Σ_{i=1}^s rank(Θ_0^{(i)}) ≤ r, with each Θ_0^{(i)} ∈ R^{p×q}.
To be concise we only consider the more practical regularized Lasso. Based on the full data (x_k, Y_k), [41] proposed the unconstrained convex program that minimizes L_1(Θ) + λ Σ_i ‖Θ^{(i)}‖_nu over Θ = [Θ^{(1)}, ..., Θ^{(s)}], where L_1(Θ) is the empirical ℓ2 loss and Σ_i ‖Θ^{(i)}‖_nu is the regularizer that incorporates the low-rank structures of the Θ^{(i)}'s. However, in our quantized regime one only observes (ẋ_k, Ẏ_k) (with ẋ_k = x_k in partial quantization, δ_1 = 0), so a modification of L_1(Θ) is needed. By vectorization we first reformulate (24). Here, for Θ = [Θ^{(1)}, ..., Θ^{(s)}] we define the rearrangement

Θ̃ := [vec(Θ^{(1)}), ..., vec(Θ^{(s)})]^⊤ ∈ R^{s×pq};  (25)

then we have

vec(Y_k) = Θ̃_0^⊤ x_k + vec(E_k),  (26)

which agrees with (5). Now we can employ the prior developments. Similarly to (9), we let

L̂_1(Θ) := (1/2)⟨Θ̃Θ̃^⊤, Σ̂_xx⟩ − ⟨Θ̃, Σ̂_xy⟩, with Σ̂_xx := (1/n) Σ_k ẋ_kẋ_k^⊤ − (δ_1²/4) I_s and Σ̂_xy := (1/n) Σ_k ẋ_k vec(Ẏ_k)^⊤,

which can be constructed from the quantized data. Combining these pieces, we are in a position to define the Lasso estimator:

Θ̂ = arg min_Θ L̂_1(Θ) + λ Σ_{i=1}^s ‖Θ^{(i)}‖_nu.  (27)

We have the following theoretical guarantee for Θ̂.

Theorem 3. (Regularized Lasso). We consider L2RM with matrix response under Assumption 2 and the quantization procedure described above. We assume the scaling n ≳ max{s, p, q} for some sufficiently large hidden constant and log s = O(p + q). Then for the estimator Θ̂ in (27) we have the following guarantees.
(a) (Partial Quantization). If δ_1 = 0, we let A_6 := K(E + δ_2). Set λ = C_1A_6√((p + q)/n) with sufficiently large C_1; then with probability at least 1 − exp(−s) − c_3 exp(−c_4(p + q)) it holds that ‖Θ̂ − Θ_0‖_F ≲ (A_6/κ_0)√(r(p + q)/n).
(b) (Complete Quantization). If δ_1 > 0, Σ_i ‖Θ_0^{(i)}‖_op² ≤ R² for some R > 0, and s = O(p + q), we let A_7 := (K + δ_1)(E + δ_2 + δ_1R). Set λ = C_2A_7√((p + q)/n) with sufficiently large C_2; then with probability at least 1 − exp(−s) − c_1 exp(−c_2(p + q)) it holds that ‖Θ̂ − Θ_0‖_F ≲ (A_7/κ_0)√(r(p + q)/n).

Setting δ_2 = 0 in Theorem 3(a) exactly recovers [41, Theorem 5]. While beyond the range of [41], our results clearly display how the dithered quantization affects the error bounds, i.e., through slightly worse multiplicative factors (A_6, A_7). Specifically, when δ_1 and δ_2 are chosen and then fixed, the estimation error still scales as O(√(r(p + q)/n)), which matches the case without quantization up to a multiplicative constant.
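The rearrangement (25) simply stacks the vectorized coefficient matrices as rows, so the L2RM data can be fed to the LRMR machinery above. A sketch of this reshaping (the row-major vec convention below is our own choice; any fixed convention works as long as it is applied consistently to both Θ and Y_k):

```python
import numpy as np

def rearrange(Theta_list):
    """Map [Theta^(1), ..., Theta^(s)] (each p x q) to the s x (p*q) matrix whose
    i-th row is vec(Theta^(i)), so that vec(Y_k) = Theta_tilde^T x_k + vec(E_k)."""
    return np.stack([Th.reshape(-1) for Th in Theta_list], axis=0)

def vectorize_responses(Y_list):
    """Flatten each p x q response Y_k into a length-(p*q) vector; returns (n, p*q)."""
    return np.stack([Y.reshape(-1) for Y in Y_list], axis=0)
```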
There are some technical differences between our proof and the one for [41, Theorem 5]. First, because we assume sub-Gaussian (x_k, E_k) rather than Gaussian as in [41], different arguments are required to proceed with the proof. More specifically, Gaussian (x_k, E_k) enables [41, Theorem 5] to control the random term ‖(1/n) Σ_k x_{ki}E_k‖_op via Gaussian-specific tools. In contrast, this term is bounded via Lemma 5 in (32); besides handling sub-Gaussian (x_k, E_k), Lemma 5 itself represents a cleaner way to bound this random term compared to the arguments in [41]. Second, in the "complete quantization" case, due to the error in the covariate, an additional random term appears in (33), (34), and to bound it we need to further assume Σ_i ‖Θ_0^{(i)}‖_op² ≤ R² (as explained in Remark 5). We defer the detailed proof to the Appendix.
We give the following remark to compare (27) with the ordinary least squares method and with the Lasso for LRMR based on the reformulation (26).

Remark 6. (Compared to OLS and LRMR via vectorization) For the estimator Θ̂_LS defined by minimizing the empirical ℓ2 loss over Θ ∈ R^{p×(sq)}, the error ‖Θ̂_LS − Θ_0‖_F would scale as O(√(spq/n)) even without quantization. By contrast, the deduced O(√(r(p + q)/n)) can be essentially better when r ≪ s·min{p, q}. This illustrates the benefit of incorporating the low-rank structure. Moreover, if we impose low-rankness on Θ̃_0 after the vectorization (26), then by Theorem 2 the estimation error scales as O(√(r_1pq/n)) (here, r_1 = rank(Θ̃_0)), which still suffers from the extremely large pq. Thus, the method in this section (also, as in [41]) achieves more effective dimension reduction in the case of matrix response.

V. EXPERIMENTAL RESULTS
In this section we provide experimental results to support and demonstrate our theoretical results. Unless otherwise specified, each data point is the mean value of 50 independent trials.

A. Simulations with synthetic data
We first present simulation results on synthetic data. Our main purpose is to verify that the established error rates, specifically O(A√(r(d_1 + d_2)/n)) in Theorems 1-2 and O(A√(r(p + q)/n)) in Theorem 3, are of the correct order for characterizing the Lasso estimation errors. In particular, the dithered quantization only results in a slightly larger multiplicative factor A. We will also demonstrate the important role played by random dithering.

1) Constrained Lasso for quantized LRMR:
To simulate the setting of quantized LRMR we generate the low-rank underlying Θ_0 ∈ R^{d_1×d_2} as follows: we first generate Θ_1 ∈ R^{d_1×r}, Θ_2 ∈ R^{r×d_2} with i.i.d. standard Gaussian entries, and then use a rescaled version of Θ_1Θ_2 (with unit Frobenius norm) as Θ_0. To simulate the sub-Gaussian data in Assumption 1, for simplicity, we use x_k ∼ N(0, I_{d_1}) and ε_k ∼ N(0, 0.1·I_{d_2}). The constrained Lasso is fed with R = ‖Θ_0‖_nu and optimized by an algorithm based on the alternating direction method of multipliers (ADMM) [6]. To verify and demonstrate the error rate of O(A√(r(d_1 + d_2)/n)), we test different choices of (d_1, d_2, r, δ_1, δ_2) under n = 1000 : 500 : 3500, with the log-log error plots displayed in Figure 1. Firstly, the experimental curves are aligned with the dashed line that represents the decreasing rate of n^{−1/2}, thus confirming the order regarding the sample size. Then, to illustrate that quantization merely affects the multiplicative factors, we compare the curves of δ_2 = 0.2, 0.3, 0.4 in Figure 1(a) (partial quantization) and the curves of δ_1 = δ_2 = 0.2, 0.3, 0.4 in Figure 1(b) (complete quantization). Note that these curves are still parallel to each other, while the ones with larger δ_i are higher, which is consistent with our theory. Moreover, we note that increasing d_1 (from 50 to 70) or r (from 5 to 8) also leads to larger estimation errors. This is also predicted by the theoretical bound O(√(r(d_1 + d_2)/n)); that is, LRMR with more coefficients or a weaker low-rank structure is harder.
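For reproducibility, the data-generating mechanism just described takes only a few lines. A sketch (the parameter values follow the text; the random seed is our own choice):

```python
import numpy as np

rng = np.random.default_rng(42)
d1, d2, r, n = 50, 60, 5, 2000

# rank-r target with unit Frobenius norm
Theta1 = rng.standard_normal((d1, r))
Theta2 = rng.standard_normal((r, d2))
Theta0 = Theta1 @ Theta2
Theta0 /= np.linalg.norm(Theta0, "fro")

# covariates and noisy responses, per Assumption 1
X = rng.standard_normal((n, d1))                    # x_k ~ N(0, I_{d1})
Eps = np.sqrt(0.1) * rng.standard_normal((n, d2))   # eps_k ~ N(0, 0.1 * I_{d2})
Y = X @ Theta0 + Eps
```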
2) Regularized Lasso for quantized LRMR: We switch to the regularized Lasso estimator, which is more practically appealing in that it does not require a pre-estimate of ‖Θ_0‖_nu. The choices of parameters, data generation and quantization are exactly the same as before. We follow the guidance of Theorem 2 for choosing λ in (22). That is, for each curve we slightly tune C(λ) and then set λ = C(λ)√(r(d_1 + d_2)/n). We solve the regularized Lasso with the ADMM algorithm and show the results in Figure 1(c)-(d). Note that these results have implications similar to the previous ones for the constrained Lasso, in terms of the O(n^{−1/2}) decreasing rate and the effects of quantization, problem size and low-rank structure. Thus, we do not repeat the discussion.
As suggested by an anonymous reviewer, we also simulate quantized LRMR under sample sizes n close to or even smaller than d_1, d_2. Specifically, we generate the low-rank Θ_0 ∈ R^{30×30} using the same mechanism, and then test the constrained/regularized Lasso estimators under the sample sizes n = [20, 25, 30, 50, 70, 90, 110, 130] for partial quantization, or under n = [30, 50, 70, 90, 110, 130] for complete quantization. The results in Figure 2 indicate that, even for sample sizes close to d_1 and d_2, the theoretical error bounds still characterize the estimation errors of our Lasso estimators fairly well.
3) Lasso for quantized L2RM with matrix response: Now we move to the problem of low-rank linear regression with matrix response. Specifically, we set s = 4 in (24), so there are Θ_0^{(1)}, ..., Θ_0^{(4)} as the underlying coefficient matrices. We generate each Θ_0^{(i)} ∈ R^{p×q} with rank r/4 as before. To fulfill Assumption 2, we adopt covariates x_k ∼ N(0, I_s) and noise matrices E_k ∼ N_{p×q}(0, 0.01). We simulate different choices of (p, q, r, δ_1, δ_2) under n = 4000 : 1000 : 8000. We note the following facts from the results in Figure 3 that support our theoretical error rate O(√(r(p + q)/n)): all experimental curves decrease with n at a rate of O(n^{−1/2}); coarser quantization only lifts the curves a little; larger (p, q, r) results in larger estimation errors.
4) The importance of dithering: As already analysed in section I, under direct uniform quantization without dithering, it is in general not possible to estimate the low-rank parameter matrix to arbitrarily small error. To demonstrate this, we use covariates with entries i.i.d. drawn from the {±1}-valued Bernoulli distribution to simulate LRMR with a 50 × 60 underlying low-rank matrix, and we quantize the data either with dithering (as proposed) or directly without dithering. Then we estimate the parameters via the regularized Lasso under different sample sizes; the results are shown in Figure 4. We find that, compared to direct quantization, using dithering significantly reduces the estimation errors; more prominently, the errors under dithering decrease at a sharp rate, whereas the curves without dithering reach an error floor where more data can no longer improve the estimation. We refer to [59, Figure 1], [15, Figure 5] for similar experimental results in the contexts of compressed sensing, matrix completion, and covariance estimation.

B. Simulations of image restoration
Note that natural images are approximately low-rank⁶ (e.g., [13], [18]), and our theoretical results can be easily extended to the approximately low-rank case with slightly more work (e.g., [13], [25], [46]). (⁶This means that the singular values decrease rapidly and only the first few are dominant.) To better visualize the effect of quantization, following prior work like [41], we conduct simulations with images as the underlying low-rank matrices in this part.
1) Quantized LRMR: This numerical example simulates (5) with each channel of "Peppers" as Θ_0, aiming to test the effect of quantization in a relatively high-noise setting. We also demonstrate the advantage of LRMR over the ordinary least squares (OLS) estimation (see Remark 3). In the experiment, we deal with each channel separately, which is a 256 × 256 approximately low-rank matrix (see the bottom left of Figure 5). Specifically, we draw the entries of x_k from N(0, 1); letting e be the average magnitude of the signal part (Θ_0^⊤x_k)_{k=1}^n, we use ε_k ∼ (2e/5)·N(0, I_{256}) to simulate relatively large noise (signal-to-noise ratio less than 7); in the quantized setting, we use uniform dithering and quantize y_k with δ_2 = e/8. Under n = 300 or n = 400, we test the regularized Lasso with noisy unquantized/quantized y_k, as well as OLS with noisy quantized y_k. The results in Figure 5 indicate that quantization does not notably harm the restoration (compare columns 2 and 3). Moreover, in such a noisy and quantized setting, the Lasso estimator significantly outperforms the OLS estimation that is ignorant of the low-rank structure (compare columns 3 and 4).
2) Quantized L2RM with matrix response: We follow the experiment in [41, Figures 1-2]. Specifically, we simulate (24) with s = 4, where the Θ_0^{(i)}'s are 64 × 64 0-1 matrices shown as images in the first row of Figure 6. It can be verified that they are approximately low-rank. We also adopt the method of generating (x_k, E_k) in [41]. While the experiment in [41] aims at comparing different methods of recovering Θ_0^{(i)}, our main goal here is to exhibit how the quantization resolution affects the recovery. Thus, we simulate the regularized Lasso (27) under response quantization with δ_2 = 0.0, 0.5, 1.0, 3.0. Under the sample size n = 2000, the reconstructed images are shown in rows two through five of Figure 6. We also run 100 independent trials and report the mean (relative) Frobenius norm error and the standard deviation for each setting (Table I). It is clear, both visually and from the mean errors, that under quantization with relatively high resolution (δ_2 = 0.5, 1.0), Lasso returns estimates fairly close to the ones obtained in the full-data regime. In fact, even if we quantize Y_k with δ_2 = 3, the Lasso estimator still delivers quite acceptable results. Therefore, we conclude that the dithered quantization does not significantly deteriorate one's ability to recover the underlying low-rank parameters; rather, the dithered uniform quantizer preserves the information fairly well. Generally speaking, there should be a trade-off between quantization resolution and recovery accuracy in practice. Note that the smaller sample size n = 400 is also simulated; see Table II for the results, which have similar implications.

C. A real data application
To confirm the efficacy of the proposed method, we perform quantization and estimation in a genetic association study examining the regulatory control mechanisms in gene networks for isoprenoids in Arabidopsis thaliana [57], [67]. We adopt the LRMR model (5) with x_k being the expression levels of d_1 = 39 genes from the two isoprenoid biosynthesis pathways and y_k being the expression levels of d_2 = 62 genes from four downstream pathways, and we use n = 115 samples in total. Besides, the mean magnitudes of the entries of x_k and y_k are 2160 and 3707, respectively.
We focus on how the dithered quantization of (x_k, y_k) affects the estimation and prediction of the regularized Lasso (6). Note that the two major differences between this real data application and the previous simulations are that the data here may not be nicely captured by the sub-Gaussian distributions (Assumption 1), and that the relation between x_k and y_k may not be perfectly modeled by LRMR (5). Thus, there is no underlying Θ_0 serving as the ground truth. Alternatively, since the emphasis is on the effect of quantization, we use the Lasso estimator with suitable λ obtained from the unquantized data as Θ_0.
For partial quantization, we quantize y_k to ẏ_k under δ_2 = 0 : 100 : 1000 and obtain Θ̂_δ from (x_k, ẏ_k) as in (22), where the parameter λ increases with δ_2, as instructed by Theorem 2. The relative estimation error ‖Θ̂_δ − Θ_0‖_F/‖Θ_0‖_F and the relative prediction error are reported as their mean values over 50 independent trials in Figure 7(a)-(b). Specifically, the curves increase slowly with δ_2; compared to the unquantized case δ_2 = 0, the estimation and prediction under the coarse quantization δ_2 = 1000 are still acceptable. We also test the complete quantization setting where x_k is quantized to ẋ_k with δ_1 = 0 : 5 : 50 and y_k is quantized to ẏ_k with δ_2 = 0 : 50 : 500. Similar results are reported in Figure 7(c)-(d), but comparing Figure 7(c) and Figure 7(a), we also note that Θ̂_δ deviates from Θ_0 more significantly under complete quantization (even though δ_1 = 0 : 5 : 50 is relatively small compared to the mean magnitude of x_k); that is, the quantization of x_k affects the estimation more severely. Finally, we conduct a more practical learning and prediction setting as follows: randomly dividing the columns of X ∈ R^{39×115}, Y ∈ R^{62×115} into the "training data" X_1 ∈ R^{39×95}, Y_1 ∈ R^{62×95} and the "testing data" X_2 ∈ R^{39×20}, Y_2 ∈ R^{62×20}, we quantize Y_1 to Ẏ_1 with δ_2 = 0 : 100 : 1000 and use (X_1, Ẏ_1) to obtain the estimator Θ̂_δ defined in (22); then we track the relative prediction error over the testing data, i.e., ‖Θ̂_δ^⊤X_2 − Y_2‖_F/‖Y_2‖_F, whose mean value over 50 independent trials is reported in Figure 7(e). Compared to Figure 7(b), the prediction error increases even more slowly with δ_2. In conclusion, our quantization scheme preserves the data information well for subsequent estimation and prediction procedures.
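The train/test protocol just described is straightforward to script. A sketch under our own naming conventions (the `estimator` callable, standing in for the regularized Lasso (22), and the split sizes are illustrative assumptions):

```python
import numpy as np

def split_and_evaluate(X, Y, estimator, delta2, n_train=95, rng=None):
    """Randomly split columns into train/test, quantize training responses with a
    dithered uniform quantizer, fit, and return the relative prediction error."""
    rng = np.random.default_rng(rng)
    perm = rng.permutation(X.shape[1])
    tr, te = perm[:n_train], perm[n_train:]
    X1, Y1, X2, Y2 = X[:, tr], Y[:, tr], X[:, te], Y[:, te]

    if delta2 > 0:  # dithered uniform quantization of the training responses
        tau = rng.uniform(-delta2 / 2, delta2 / 2, Y1.shape)
        Y1 = delta2 * (np.floor((Y1 + tau) / delta2) + 0.5)

    Theta_hat = estimator(X1, Y1)              # returns a d1 x d2 coefficient matrix
    Y2_hat = Theta_hat.T @ X2
    return np.linalg.norm(Y2_hat - Y2, "fro") / np.linalg.norm(Y2, "fro")
```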
VI. CONCLUSIONS

This paper, for the first time, studied low-rank multivariate regression (LRMR) in a realistic setting that involves data quantization. We proposed to use the dithered uniform quantizer, associated with uniform dither for the response and triangular dither for the covariate. We proposed Lasso estimators based on the quantized data in a constrained or regularized manner. With the aid of random dithering, albeit losing information in quantization, our estimators achieve the minimax optimal error rate. In fact, the derived error bounds demonstrate that quantization only results in slightly worse multiplicative factors, which is reminiscent of similar results in quantized CS (Remark 4) and was clearly observed in our simulations (e.g., Figure 1). Moreover, we similarly applied the quantization scheme to a low-rank regression problem with matrix response and established the corresponding theoretical results. Experimental results were reported to complement our theoretical developments.
For future work, our first direction is to study LRMR under the more extreme 1-bit quantization, which only retains the sign of the data. Secondly, while we separately worked on LRMR and L2RM with matrix response in this paper, it would be of interest to attempt to unify their analyses, and ideally build a general theoretical framework for quantized multiresponse regression. Last but not least, it is desirable to investigate whether our quantization method and theoretical results can be extended to a high-dimensional setting where n < d_1, which probably requires new machinery in the technical proofs and structure on Θ_0 beyond low-rankness.

APPENDIX

A. The proof of Corollary 1

Proof. For the first part of the claim, we only need to verify that both choices of τ_i satisfy the condition in Lemma 1(a): for the uniform dither, f(u) = E(exp(iuY)) = (2/(δu)) sin(δu/2), which obviously vanishes at u = 2πl/δ for non-zero integers l; it is similar for the triangular dither. For the second part of the claim, let us show that the triangular dither satisfies the condition in Lemma 1(b) with p = 2: in this case g(u) = E(exp(iu(Y + Z))) = [(2/(δu)) sin(δu/2)]³, and it is evident that g''(u) contains a common factor sin(δu/2); thus g''(2πl/δ) = 0 holds for any non-zero integer l. Hence, the proof is complete.

B. The proof of Theorem 2
Proof. We start with the optimality of Θ̂_p:

L̂(Θ̂_p) + λ‖Θ̂_p‖_nu ≤ L̂(Θ_0) + λ‖Θ_0‖_nu.  (30)

Recall that ∆_p = Θ̂_p − Θ_0; by some algebra we arrive at

(1/2)⟨∆_p∆_p^⊤, Σ̂_xx⟩ ≤ ⟨∆_p, Σ̂_xy − Σ̂_xxΘ_0⟩ + λ(‖Θ_0‖_nu − ‖Θ̂_p‖_nu).

Note that the left-hand side is always non-negative (this holds deterministically when δ_1 = 0, and holds with the promised probability when δ_1 > 0; see Step 1 in the proof of Theorem 1). By (18), (19), (21) in the proof of Theorem 1, in both "partial quantization" and "complete quantization" our choices of λ guarantee that λ ≥ 4‖Σ̂_xy − Σ̂_xxΘ_0‖_op holds with the promised probability. Under the same probability, we thus obtain

‖P_{M̄^⊥}∆_p‖_nu ≤ 3‖P_{M̄}∆_p‖_nu,

where the involved subspaces and projections are defined in the proof of Theorem 1. Thus,

‖∆_p‖_nu ≤ 4‖P_{M̄}∆_p‖_nu ≤ 4√(2r)‖∆_p‖_F,

and the last inequality is because rank(A) ≤ 2r if A ∈ M̄.
Having deduced ‖∆_p‖_nu ≲ √r‖∆_p‖_F, we can upper bound the right-hand side of (30) by O(λ√r‖∆_p‖_F). As deduced in (15), with the promised probability the left-hand side of (30) is lower bounded by (κ_0/4)‖∆_p‖²_F. Thus, we arrive at ‖∆_p‖_F ≲ κ_0^{−1}λ√r. To complete the proof, we only need to plug in the value of λ in both cases.
C. The proof of Theorem 3

Proof. We let Θ̃ be the rearrangement of Θ defined in (25) and continue to use the prior notation for the quantization noise/error. Now we use the optimality of Θ̂ and obtain

L̂_1(Θ̂) + λ Σ_i ‖Θ̂^{(i)}‖_nu ≤ L̂_1(Θ_0) + λ Σ_i ‖Θ_0^{(i)}‖_nu.

Then we perform some algebra to arrive at

(1/2)⟨∆̃∆̃^⊤, Σ̂_xx⟩ ≤ T + λ Σ_i (‖Θ_0^{(i)}‖_nu − ‖Θ̂^{(i)}‖_nu), where T := ⟨∆̃, Σ̂_xy − Σ̂_xxΘ̃_0⟩.  (31)

Step 1. Bound the left-hand side from below. This is exactly the same as Step 1 in the proof of Theorem 1. In more detail, because n ≳ s, one can invoke Lemma 4 to show that λ_min(Σ̂_xx) ≥ κ_0/2 holds with probability at least 1 − 2 exp(−s). Assume that we are on this event; then evidently we have (1/2)⟨∆̃∆̃^⊤, Σ̂_xx⟩ ≥ (κ_0/4)‖∆‖²_F.

Step 2. Bound T. As in (18) we decompose Σ̂_xy − Σ̂_xxΘ̃_0 = T_1 + T_2; thus we have |T| ≤ |⟨T_1, ∆̃⟩| + |⟨T_2, ∆̃⟩|, and it amounts to estimating ⟨T_1, ∆̃⟩ and ⟨T_2, ∆̃⟩. For the first term, by turning back to R^{p×q} we have

|⟨T_1, ∆̃⟩| ≤ Σ_{i=1}^s ‖(1/n) Σ_k ẋ_{ki}(E_k + mat(ξ_{k2}))‖_op ‖∆^{(i)}‖_nu ≲ (K + δ_1)(E + δ_2) √((p + q)/n) Σ_i ‖∆^{(i)}‖_nu,  (32)

where in the last inequality we invoke Lemma 5 and a union bound over i ∈ [s]; it holds with probability at least 1 − 2 exp(−c(p + q)) because log s = O(p + q). Note that the second term ⟨T_2, ∆̃⟩ vanishes in partial quantization (δ_1 = 0), so we estimate it in the complete quantization case (δ_1 > 0), where we further assume Σ_i ‖Θ_0^{(i)}‖_op² ≤ R² and s = O(p + q). In particular, we define

Ψ := (1/n) Σ_k {ẋ_kξ_{k1}^⊤ − E(ẋ_kξ_{k1}^⊤)}  (33)

and note that EΨ = 0. Moreover, Lemma 5 provides that ‖Ψ‖_op ≲ (K + δ_1)δ_1 √(s/n) holds with probability at least 1 − exp(−s). On this event, we estimate

|⟨T_2, ∆̃⟩| = |⟨ΨΘ̃_0, ∆̃⟩| ≲ (K + δ_1)δ_1 R √(s/n) Σ_i ‖∆^{(i)}‖_nu ≲ (K + δ_1)δ_1 R √((p + q)/n) Σ_i ‖∆^{(i)}‖_nu,  (34)

where the last inequality is because s = O(p + q). To conclude, in "partial quantization" we have shown T = O(A_6 √((p + q)/n) Σ_i ‖∆^{(i)}‖_nu), and in "complete quantization" T = O(A_7 √((p + q)/n) Σ_i ‖∆^{(i)}‖_nu). Compared to our choices of λ, we can assume 2|T| ≤ (1/2)λ Σ_i ‖∆^{(i)}‖_nu with the promised probability. Because the left-hand side of (31) is non-negative (deterministically if δ_1 = 0, with the promised probability if δ_1 > 0), and λ > 0, we arrive at

Σ_i ‖P_{M̄_i^⊥}∆^{(i)}‖_nu ≤ 3 Σ_i ‖P_{M̄_i}∆^{(i)}‖_nu.  (35)

Step 3. Conclude the proof. We use a decomposability argument. In particular, we let r_i = rank(Θ_0^{(i)}), and, exactly as in the definition of (M, M̄, M̄^⊥) at the beginning of Step 2 in the proof of Theorem 1 (regarding the Θ_0 thereof), we now define (M_i, M̄_i, M̄_i^⊥) regarding Θ_0^{(i)}. Similarly, we have the decomposability

‖P_{M_i}A + P_{M̄_i^⊥}B‖_nu = ‖P_{M_i}A‖_nu + ‖P_{M̄_i^⊥}B‖_nu

for all i ∈ [s] and A, B ∈ R^{p×q}. Thus, we can use (17) to obtain Σ_i ‖∆^{(i)}‖_nu ≤ 4 Σ_i ‖P_{M̄_i}∆^{(i)}‖_nu ≤ 4 Σ_i √(2r_i)‖∆^{(i)}‖_F ≤ 4√(2r)‖∆‖_F. Now we are ready to put the pieces together. Because ‖Θ_0^{(i)}‖_nu − ‖Θ̂^{(i)}‖_nu ≤ ‖∆^{(i)}‖_nu, overall, the right-hand side of (31) is bounded by O(λ Σ_i ‖∆^{(i)}‖_nu) = O(√rλ‖∆‖_F), while the left-hand side is lower bounded by (κ_0/4)‖∆‖²_F, so it holds with the promised probability that ‖∆‖_F = O(√rλ/κ_0). The proof can be concluded by using the chosen value of λ.

D. Auxiliary facts
1) The proof of Lemma 3: Proof. The proof is a standard covering argument for controlling the matrix operator norm. We construct N_1 ⊂ S^{d_1−1} as a (1/4)-net of S^{d_1−1}, meaning that for any v ∈ S^{d_1−1} there exists x ∈ N_1 such that ‖x − v‖_2 ≤ 1/4. Similarly, let N_2 be a (1/4)-net of S^{d_2−1}. By [64, Corollary 4.2.13] we can assume |N_1| ≤ 9^{d_1}, |N_2| ≤ 9^{d_2}. Note that for any u ∈ N_1, v ∈ N_2, we have

‖u^⊤{a_kb_k^⊤ − E(a_kb_k^⊤)}v‖_{ψ_1} (i)≲ ‖(u^⊤a_k)(b_k^⊤v)‖_{ψ_1} (ii)≲ K_aK_b,

where (i) is due to centering [64, Exercise 2.7.10], and we use (4) in (ii). Thus, we can use Bernstein's inequality (see [64, Theorem 2.8.1]) to obtain the concentration of (1/n) Σ_k u^⊤{a_kb_k^⊤ − E(a_kb_k^⊤)}v, followed by a union bound over (u, v) ∈ N_1 × N_2: for any t > 0,

sup_{(u,v)∈N_1×N_2} |(1/n) Σ_k u^⊤{a_kb_k^⊤ − E(a_kb_k^⊤)}v| ≤ t  (38)

holds with probability at least 1 − 2|N_1||N_2| exp(−cn·min{t/(K_aK_b), t²/(K_aK_b)²}). Setting t = C_3K_aK_b√((d_1 + d_2)/n) with sufficiently large C_3, and recalling that n ≥ d_1 + d_2, the event (38) holds with probability at least 1 − exp(−c_4(d_1 + d_2)). Note that [64, Exercise 4.4.3] gives ‖(1/n) Σ_k {a_kb_k^⊤ − E(a_kb_k^⊤)}‖_op ≤ 2·(the left-hand side of (38)); the proof is complete.
2) A lemma for the proof of Theorem 3:

Lemma 5. Assume a_1, ..., a_n ∈ R are independent and satisfy max_k ‖a_k‖_{ψ_2} ≤ K, and B_1, ..., B_n ∈ R^{p×q} are independent and satisfy sup_{u∈S^{p−1}} sup_{v∈S^{q−1}} ‖u^⊤B_kv‖_{ψ_2} ≤ E for each k. Assume n ≳ p + q; then it holds with probability at least 1 − 2 exp(−c(p + q)) that

‖(1/n) Σ_{k=1}^n {a_kB_k − E(a_kB_k)}‖_op ≲ KE √((p + q)/n).

Proof. Similarly to that of Lemma 3, the proof is essentially a standard covering argument for controlling the operator norm of a random matrix, with (1/4)-nets N_1 ⊂ S^{p−1}, N_2 ⊂ S^{q−1}. For fixed u ∈ S^{p−1}, v ∈ S^{q−1},

‖a_ku^⊤B_kv − E(a_ku^⊤B_kv)‖_{ψ_1} ≲ KE.

Thus, we can apply Bernstein's inequality [64, Theorem 2.8.1], together with a union bound over N_1 × N_2, to obtain that for any t > 0,

sup_{(u,v)∈N_1×N_2} |(1/n) Σ_k u^⊤{a_kB_k − E(a_kB_k)}v| ≤ t  (39)

holds with probability at least 1 − 2|N_1||N_2| exp(−cn·min{t/(KE), t²/(KE)²}). We set t = CKE√((p + q)/n) with sufficiently large C; recalling that we assume n ≳ p + q, we obtain that (39) holds with probability at least 1 − 2 exp(−c_1(p + q)). Combined with the covering bound ‖·‖_op ≤ 2·(the left-hand side of (39)), as in the proof of Lemma 3, the result follows.



TABLE I: n = 2000. Mean relative Frobenius norm errors (standard deviation (×10⁻³)).

Junren Chen is currently pursuing the Ph.D. degree with the Department of Mathematics, The University of Hong Kong. He received a Hong Kong Ph.D. Fellowship from the Hong Kong Research Grants Council to support his Ph.D. study. Before that, he received the B.Sc. degree in Mathematics and Applied Mathematics from Sun Yat-sen University. His research interests include compressed sensing, high-dimensional statistics, signal and image processing, quantization and optimization.

Yueqi Wang received the B.S. degree from Zhejiang University, Zhejiang, China, in 2021. She is currently pursuing the Ph.D. degree at The University of Hong Kong, Hong Kong, China. Her major research interests include photonic dispersion relation reconstruction, topological optimization, and machine learning.

Michael K. Ng (Senior Member, IEEE) received the B.Sc. and M.Phil. degrees from The University of Hong Kong, Hong Kong, in 1990 and 1992, respectively, and the Ph.D. degree from The Chinese University of Hong Kong, Hong Kong, in 1995. From 1995 to 1997, he was a Research Fellow with the Computer Sciences Laboratory, The Australian National University, Canberra, ACT, Australia. He was an Assistant Professor/Associate Professor with The University of Hong Kong from 1997 to 2005, a Professor/Chair Professor (2005-2019) with the Department of Mathematics, Hong Kong Baptist University, Hong Kong, and Chair Professor (2019-2023) with the Department of Mathematics, The University of Hong Kong. He is currently a Chair Professor in Mathematics and Chair Professor in Data Science at Hong Kong Baptist University. His research interests include applied and computational mathematics, machine learning and artificial intelligence, and data science. Dr. Ng serves as an editorial board member of several international journals. He was selected for the 2017 Class of Fellows of the Society for Industrial and Applied Mathematics. He received the Feng Kang Prize for his significant contributions to scientific computing.