A Few Interactions Improve Distributed Nonparametric Estimation, Optimally

Consider the problem of nonparametric estimation of an unknown $\beta$-Hölder smooth density $p_{XY}$ at a given point, where $X$ and $Y$ are both $d$-dimensional. An infinite sequence of i.i.d.\ samples $(X_i,Y_i)$ is generated according to this distribution, and two terminals observe $(X_i)$ and $(Y_i)$, respectively. They are allowed to exchange $k$ bits, either one-way or interactively, in order for Bob to estimate the unknown density. We show that the minimax mean square risk is of order $\left(\frac{k}{\log k} \right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols. The logarithmic improvement is absent in the parametric counterparts, and therefore can be regarded as a consequence of the nonparametric nature of the problem. Moreover, a few rounds of interaction achieve the interactive minimax rate: the number of rounds can grow as slowly as the super-logarithm (i.e., inverse tetration) of $k$. The proof of the upper bound is based on a novel multi-round scheme for estimating the joint distribution of a pair of biased Bernoulli variables, and the lower bound is built on a sharp estimate of a symmetric strong data processing constant for biased Bernoulli variables.


I. INTRODUCTION
The communication complexity problem was introduced in the seminal paper of Yao [50] (see also [26] for a survey), where two terminals (which we call Alice and Bob) compute a given Boolean function of their local inputs $\mathbf{X} = (X_i)_{i=1}^n$ and $\mathbf{Y} = (Y_i)_{i=1}^n$ by means of exchanging messages. The famous log-rank conjecture provides an estimate of the communication complexity of a general Boolean function, which is still open to date. Meanwhile, the communication complexity of certain specific functions is better understood. For example, the Gap-Hamming problem [24], [12] concerns testing $f(\mathbf{X},\mathbf{Y}) > \frac{1}{\sqrt{n}}$ against $f(\mathbf{X},\mathbf{Y}) < -\frac{1}{\sqrt{n}}$, where $f(\mathbf{X},\mathbf{Y}) := \frac{1}{n}\sum_{i=1}^n X_i Y_i$ denotes the sample correlation and $X_i, Y_i \in \{+1,-1\}$. It was shown in [12] with a geometric argument that the communication complexity (for worst-case deterministic $\mathbf{X}, \mathbf{Y}$) is $\Theta(n)$; therefore a one-way protocol where Alice simply sends $\mathbf{X}$ cannot be improved (up to a multiplicative constant) by an interactive protocol.
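To make the statistic concrete, the following minimal sketch (illustrative only; not part of the argument in [12]) computes $f(\mathbf{X},\mathbf{Y})$ on a toy instance whose correlation is slightly above zero.

```python
import numpy as np

def gap_hamming_statistic(x, y):
    """Sample correlation f(X, Y) = (1/n) * sum_i X_i * Y_i for +/-1 valued sequences."""
    x, y = np.asarray(x), np.asarray(y)
    return float(np.mean(x * y))

# Toy instance: Y agrees with X slightly more often than half the time,
# so f(X, Y) concentrates around 2*eps, to be compared with the 1/sqrt(n) gap.
rng = np.random.default_rng(0)
n, eps = 100_000, 0.01
x = rng.choice([-1, 1], size=n)
flip = rng.random(n) < 0.5 - eps          # disagree with probability 1/2 - eps
y = np.where(flip, -x, x)
print(gap_hamming_statistic(x, y))        # approximately 2*eps = 0.02
```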
Gap-Hamming is closely related to the problem of estimating the joint distribution of a pair of binary or Gaussian random variables (using $n$ i.i.d. samples). Indeed, for $n$ large we may assume that Alice (resp. Bob) can estimate the marginal distribution of $X$ (resp. $Y$) very well, so that the joint distribution is parameterized by only one scalar, which is the correlation. An information-theoretic proof of Gap-Hamming was previously provided in [19], building on a converse for correlation estimation for the binary symmetric distribution, which pinned down the exact prefactor in the risk-communication tradeoff. In particular, the result of [19] implies that the naive algorithm where Alice simply sends $\mathbf{X}$ can be improved by a constant factor in the estimation risk by a more sophisticated scheme using additional samples. For the closely related problem of correlation (distribution) testing, [38] and [48] provided asymptotically tight bounds on the communication complexity under the one-way and interactive protocols when the null hypothesis is the independent distribution (zero correlation), which also implies that the error exponent can be improved by an algorithm using additional samples. The technique of [48] is based on the tensorization of internal and external information ((20) ahead), whereas the bound of [38] uses hypercontractivity. More recently, [20] derived bounds for testing against dependent distributions using optimal transport inequalities.
In this paper, we take the natural step of introducing nonparametric (NP) statistics to Alice and Bob, whereby the two parties estimate a nonparametric density by means of sending messages interactively. It will be seen that this problem is closely related to a "sparse" version of the aforementioned Gap-Hamming problem, where interaction does help, in contrast to the usual Gap-Hamming problem.
For concreteness, consider the problem of nonparametric estimation of an unknown $\beta$-Hölder smooth density $p_{XY}$ at a given point $(x_0, y_0)$. For simplicity we assume the symmetric case where $X$ and $Y$ are both $d$-dimensional. An infinite sequence of i.i.d. samples $(X_i, Y_i)$ is generated according to $p_{XY}$, and Alice and Bob observe $(X_i)$ and $(Y_i)$, respectively. After they exchange $k$ bits (either one-way or interactively), Bob estimates the unknown density at the given point. We characterize the minimax rate in terms of the communication complexity $k$: it is of order $\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}$ for one-way protocols and $k^{-\frac{2\beta}{d+2\beta}}$ for interactive protocols. Notably, allowing interaction strictly improves the estimation risk. Previously, separations between one-way and interactive protocols were known, but in very different contexts: in [32, Corollary 1] (see also [31]), a separation was found in the rate region of common randomness generation from biased binary distributions, using certain convexity arguments, but this only implies a difference in the leading constant rather than in the asymptotic scaling. On the other hand, the example distribution in [42] is based on the pointer-chasing construction of [35], which appears to be a highly artificial distribution designed to entail a separation between the one-way and interactive protocols. Another example, in which interaction improves zero-error source coding with side information via a "bit location" algorithm, was described in [36], where it was shown that the two-round communication complexity differs from the unbounded-round interactive communication complexity only by constant factors. In contrast, the logarithmic separation in the present paper arises from the nonparametric nature of the problem: if we consider the problem of correlation estimation for Bernoulli pairs with a fixed bias (a parametric problem), the risk will be of order $k^{-\frac{1}{2}}$, and there will be no separation between one-way and interactive protocols (which is indeed the case in [19]). In contrast, nonparametric estimation is analogous to Bernoulli correlation estimation where the bias changes with $k$ (since the optimal bandwidth adapts to $k$), which gives rise to the separation.
For the risk upper bound, in the one-way setting it is efficient for Alice to just encode the set of indices $i$ such that $X_i$ falls within a neighborhood (whose size is the optimal bandwidth for the given $k$) of the given point $x_0$. To achieve the optimal $k^{-\frac{2\beta}{d+2\beta}}$ rate for interactive protocols, we provide a novel scheme that uses $r>1$ rounds of interaction, where $r = r(k)$ grows as slowly as the super-logarithm (i.e., the inverse of tetration) of $k$. With the sequence $r(k)$ we use in Section V-C (and supposing $\beta = d = 1$), while $r = 4$ rounds are barely enough for $k$ equal to the number of letters in a short sentence, $r = 8$ is more than sufficient for $k$ equal to the number of all elementary particles in the entire observable universe. Thus from a practical perspective, $r(k)$ is effectively a constant, although it remains an interesting theoretical question whether $r(k)$ really diverges (Conjecture 1).
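To convey how slowly the super-logarithm grows, the following sketch (with base 2 chosen purely for illustration; the precise sequence $r(k)$ is the one in Section V-C, which differs from this toy quantity by small constant factors) counts how many iterated exponentiations are needed to exceed $k$.

```python
def super_log(k, base=2):
    """Inverse tetration: number of times t -> base**t must be iterated,
    starting from t = 1, before t exceeds k."""
    t, r = 1, 0
    while t <= k:
        t = base ** t          # exact integer arithmetic, so no overflow
        r += 1
    return r

examples = {"letters in a short sentence (~30)": 30,
            "particles in the observable universe (~1e80)": 10 ** 80,
            "near the double-precision ceiling (~1e300)": 10 ** 300}
for label, k in examples.items():
    print(f"{label}: super_log = {super_log(k)}")
# Output: 4, 5, 5 -- essentially constant over astronomically large ranges of k.
```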
For the lower bound, the proof is based on the symmetric data processing constant introduced in [32]. Previously, the data processing constant $s_r^*$ was connected to two-party estimation and hypothesis testing in [19]; the idea can be summarized in the following statement: "information for testing hypotheses locally" is upper bounded by $s_r^*$ times "information communicated mutually". However, $s_r^*$ is not easy to compute, and previous bounds on $s_r^*$ are also not tight enough for our purpose. Instead, we first use a simulation-of-continuous-variables idea to reduce the problem to estimation of Bernoulli distributions, for which $s_r^*$ is easier to analyze. Then we use some new arguments to bound $s_\infty^*$. Let us emphasize that this paper concerns density estimation at a given point, rather than estimating the global density function. For the latter problem, it is optimal for Alice to just quantize the samples and send them to Bob, which we show in the companion paper [28]. The mean square error (in $\ell_2$ norm) of estimating the global density function scales differently from the case of pointwise density estimation, since the messages cannot be tailored to the given point.
Related work. Besides function computation, distribution estimation and testing, other problems which have been studied in the communication complexity or privacy settings include lossy source coding [25] and common randomness or secret key generation [47], [30], [32], [43]. The key technical tool for interactive two-way communication models, namely the tensorization of internal and external information ((20) ahead), appeared in [25] for lossy compression, [9], [33] for function computation, [47], [32] for common randomness generation, and [48], [19] for parameter estimation.
For one-way communication models, the main tool is a tensorization property related to the strong data processing constant (see (10) ahead), which was first used in [4] in the study of the error exponents in communication-constrained hypothesis testing. The hypercontractivity method for single-shot bounds in one-way models was used in [30], [27] for common randomness generation and [38] for testing.
In statistics, communication-constrained estimation has received considerable attention recently, starting from [52], which considered a model where distributed samples are compressed and sent to a central estimator. Further works on this communication model include settings of Gaussian location estimation [10], [11], parametric estimation [22], nonparametric regression [53], the Gaussian noise model [54], statistical inference [2], and nonparametric estimation [21] (with a bug fixed in [7]), [1]. Related problems solved using similar techniques include differential privacy [16] and data summarization [37], [46], [45]. Communication-efficient construction of test statistics for distributed testing using the divide-and-conquer algorithm is studied in [8]. Generally speaking, these works on statistical minimax rates concern the so-called horizontal partitioning of data sets, where data sets share the same feature space but differ in samples [49], [18]. In contrast, vertical distributed or federated learning, where data sets differ in features, has been used by corporations such as those in finance and medical care [49], [18]. It is worth mentioning that such a horizontal partitioning model was also introduced in Yao's paper [50] in the context of function computation under the name "simultaneous message model", where different parties send messages to a referee instead of to each other. The direct sum property (similar to the tensorization property of internal and external information) of the simultaneous message model was discussed in [13].
Organization of the paper. We review the background on nonparametric estimation, data processing constants, and testing independence in Section II. The formulation of the two-party nonparametric estimation problem and the summary of the main results are given in Section III. Section IV examines the problem of estimating a parameter of a pair of biased Bernoulli distributions, which will be used as a building block in our nonparametric estimation algorithm. Section V proves some bounds on information exchanges, which are the key auxiliary results for the proof of the upper bound for Bernoulli estimation in Section VI, and for nonparametric estimation in Section VII. Finally, lower bounds are proved in Section VIII for the one-way case and in Section IX for the interactive case.

A. Notation
We use capital letters for probability measures and lower case letters for density functions. We use the abbreviations $U_i^j := (U_i, \ldots, U_j)$ and $U^j := U_1^j$. We use boldface letters to denote vectors, for example $\mathbf{U}_i = (U_i(l))_{l=1}^n$. Unless otherwise specified, the base of the logarithm can be arbitrary but remains consistent throughout an equation. The precise meaning of the Landau notations, such as $O(\cdot)$, will be explained in each section or in the proofs of specific theorems. We use $\sum_{\mathrm{odd}\ 1\le i\le r}$ to denote summation over $i \in \{1, \ldots, r\}\setminus 2\mathbb{Z}$. For the vector representation of a binary probability distribution, we use the convention $P_U = [P_U(0), P_U(1)]$. For the matrix representation of the joint distribution of a pair of binary random variables, we use the convention
$$P_{XY} = \begin{bmatrix} P_{XY}(0,0) & P_{XY}(0,1) \\ P_{XY}(1,0) & P_{XY}(1,1)\end{bmatrix}.$$

B. Nonparametric Estimation
Let us recall the basics of the problem of estimating a smooth density; more details may be found in [44], [41]. Let $d \ge 1$ be an integer, and let $s = (s_1, \ldots, s_d) \in \{0,1,2,\ldots\}^d$ be a multi-index. For $x = (x_1,\ldots,x_d)\in\mathbb{R}^d$, let $D^s$ denote the differential operator
$$D^s := \frac{\partial^{s_1+\cdots+s_d}}{\partial x_1^{s_1}\cdots\partial x_d^{s_d}}.$$
Given $\beta\in(0,\infty)$, let $\lfloor\beta\rfloor$ be the maximum integer strictly smaller than $\beta$ [44] (note the difference with the usual convention). Given a function $f$ whose domain includes a set $A\subseteq\mathbb{R}^d$, define $\|f\|_{A,\beta}$ as the minimum $L\ge 0$ such that
$$|D^s f(x) - D^s f(x')| \le L\,\|x - x'\|^{\beta-\lfloor\beta\rfloor}, \qquad \forall x, x'\in A,$$
for all multi-indices $s$ such that $s_1+\cdots+s_d = \lfloor\beta\rfloor$. For example, $\beta = 1$ defines a Lipschitz function, and an integer $\beta$ defines a function with bounded $\beta$-th derivatives. Given $L>0$, let $\mathcal{P}(\beta, L)$ be the class of probability density functions $p$ satisfying $\|p\|_{\mathbb{R}^d,\beta}\le L$. Let $x_0\in\mathbb{R}^d$ be arbitrary. The following result on the minimax estimation error is well known:
$$\inf_{T_n}\sup_{p\in\mathcal{P}(\beta,L)}\mathbb{E}\big[|T_n - p(x_0)|^2\big] = \Theta\big(n^{-\frac{2\beta}{2\beta+d}}\big), \qquad (3)$$
where the infimum is over all estimators $T_n$ of $p(x_0)$, i.e., measurable maps from i.i.d. samples $X_1,\ldots,X_n\sim p$ to $\mathbb{R}$; $\Theta(\cdot)$ in (3) may hide constants independent of $n$.
We say $K:\mathbb{R}^d\to\mathbb{R}$ is a kernel of order $l$ ($l\in\{1,2,\ldots\}$) if $\int K = 1$ and all derivatives of the Fourier transform of $K$ up to order $l$ vanish at $0$ [44, Definition 1.3]. Therefore the rectangular kernel, which is the indicator of a set, is of order 1. A kernel estimator has the form
$$\hat p_n(x_0) := \frac{1}{nh^d}\sum_{i=1}^n K\Big(\frac{X_i - x_0}{h}\Big), \qquad (4)$$
where $h\in(0,\infty)$ is called the bandwidth. If $K$ is a kernel of order $l = \lfloor\beta\rfloor$, then the kernel estimator (4) with an appropriate $h$ achieves the bound in (3) [44, Chapter 1]. In particular, the rectangular kernel is minimax optimal for $\beta\in(0,2]$.
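As a minimal illustration of the kernel estimator (4) with the rectangular kernel (a standalone sketch, not the two-party protocol of Section VII), the pointwise estimate is just a normalized count of samples in a bandwidth-$h$ cube around $x_0$:

```python
import numpy as np

def rect_kernel_density_at_point(samples, x0, h):
    """Kernel estimate p_hat(x0) = (1/(n*h^d)) * sum_i K((X_i - x0)/h) with the
    rectangular kernel K(u) = 2^{-d} * 1{||u||_inf <= 1} (an order-1 kernel)."""
    samples = np.atleast_2d(samples)                     # shape (n, d)
    n, d = samples.shape
    inside = np.all(np.abs(samples - x0) <= h, axis=1)   # samples in the cube x0 +/- h
    return inside.sum() / (n * (2.0 * h) ** d)

rng = np.random.default_rng(1)
x = rng.normal(size=(100_000, 2))                        # samples from a 2-d standard normal
print(rect_kernel_density_at_point(x, x0=np.zeros(2), h=0.2))   # true value 1/(2*pi) ~ 0.159
```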
If $K$ is compactly supported, then only local smoothness is needed, and a density lower bound does not change the rate: we have
$$\inf_{T_n}\sup_{p\in\mathcal{P}_S(\beta,L,A)}\mathbb{E}\big[|T_n - p(x_0)|^2\big] = \Theta\big(n^{-\frac{2\beta}{2\beta+d}}\big),$$
where $S$ is any compact neighborhood of $x_0$, $A\in[0,\frac{1}{\mathrm{vol}(S)})$ is arbitrary (with $\mathrm{vol}(S)$ denoting the volume of $S$), and $\mathcal{P}_S(\beta, L, A)$ denotes the non-empty set of probability density functions $p$ satisfying $\|p\|_{S,\beta}\le L$ and $\inf_{x\in S} p(x)\ge A$.

C. Strong and Symmetric Data Processing Constants
The strong data processing constant has proved useful in many distributed estimation problems [10], [4], [16], [52]. In particular, it is strongly connected to two-party hypothesis testing under the one-way protocol. In contrast, the symmetric data processing constant [32] can be viewed as a natural extension to interactive protocols. This section recalls their definitions and auxiliary results, which will mainly be used in the proofs of the lower bounds; however, the intuitions are useful for the upper bounds as well.
Given two probability measures $P, Q$ on the same measurable space, define the KL divergence
$$D(P\|Q) := \int \log\frac{\mathrm{d}P}{\mathrm{d}Q}\,\mathrm{d}P$$
if $P \ll Q$, and $D(P\|Q) := \infty$ otherwise. Define the $\chi^2$-divergence
$$\chi^2(P\|Q) := \int\Big(\frac{\mathrm{d}P}{\mathrm{d}Q}\Big)^2\mathrm{d}Q - 1.$$
Let $X, Y$ be two random variables with joint distribution $P_{XY}$, and define the mutual information $I(X;Y) := D(P_{XY}\|P_X\times P_Y)$.

Definition 1. Let $P_{XY}$ be an arbitrary distribution on $\mathcal{X}\times\mathcal{Y}$. Define the strong data processing constant
$$s^*(X;Y) := \sup_{P_{U|X}}\frac{I(U;Y)}{I(U;X)},$$
where $P_{U|X}$ is a conditional distribution (with $\mathcal{U}$ being an arbitrary set) such that $0 < I(U;X) < \infty$, and the mutual informations are computed under the joint distribution $P_{U|X}P_{XY}$.
Clearly, the value of $s^*(X;Y)$ does not depend on the choice of the base of the logarithm. A basic yet useful property of the strong data processing constant is tensorization:
$$s^*(X^n; Y^n) = s^*(X;Y) \qquad (10)$$
when $(X^n, Y^n)$ consists of $n$ i.i.d. copies of $(X,Y)$. Now if $(\mathbf{X},\mathbf{Y})$ are the samples observed by Alice and Bob and $\Pi_1$ denotes the message sent to Bob, then $I(\Pi_1;\mathbf{X})\le k$ implies that
$$D(P_{\mathbf{Y}\Pi_1}\,\|\,P_{\mathbf{Y}}\times P_{\Pi_1}) = I(\Pi_1;\mathbf{Y}) \le s^*(X;Y)\, k. \qquad (11)$$
The left side is the KL divergence between the distribution under the hypothesis that $(X,Y)$ follows some joint distribution, and the distribution under the hypothesis that $X$ and $Y$ are independent. Thus the error probabilities in testing against independence with one-way protocols can be lower bounded. This simple argument dates back at least to [4], [3].
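The supremum in Definition 1 can be probed numerically for binary pairs using the well-known equivalent characterization $s^*(X;Y) = \sup_{Q_X\ne P_X} D(Q_Y\|P_Y)/D(Q_X\|P_X)$, where $Q_Y$ is the output of the channel $P_{Y|X}$ on input $Q_X$. The following sketch (illustrative only) recovers the classical value $(1-2\epsilon)^2$ for the doubly symmetric binary source.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two finite distributions given as arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def sdpi_constant_binary(P_XY, grid=2000):
    """Numerical sketch of s*(X;Y) for a 2x2 joint distribution, via the equivalent
    characterization  s* = sup_{Q_X != P_X} D(Q_Y || P_Y) / D(Q_X || P_X)."""
    P_XY = np.asarray(P_XY, float)
    P_X, P_Y = P_XY.sum(axis=1), P_XY.sum(axis=0)
    channel = P_XY / P_X[:, None]              # rows: P_{Y|X=x}
    best = 0.0
    for q0 in np.linspace(1e-6, 1 - 1e-6, grid):
        Q_X = np.array([q0, 1 - q0])
        d_x = kl(Q_X, P_X)
        if d_x < 1e-12:
            continue
        best = max(best, kl(Q_X @ channel, P_Y) / d_x)
    return best

# Doubly symmetric binary source with crossover 0.1: the value approaches (1-2*0.1)**2 = 0.64.
eps = 0.1
P = np.array([[0.5 * (1 - eps), 0.5 * eps],
              [0.5 * eps, 0.5 * (1 - eps)]])
print(sdpi_constant_binary(P))
```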
A similar argument can be extended to testing independence under interactive protocols [48]. The fundamental fact enabling such extensions is the tensorization of certain information-theoretic quantities, which appeared in various contexts [25], [9], [32]; here the auxiliary random variables $U_1,\ldots,U_r$ are such that $U_i - (X, U^{i-1}) - Y$ for odd $i$ and $U_i - (Y, U^{i-1}) - X$ for even $i$ are Markov chains. We call $s^*_\infty(X;Y)$ the symmetric data processing constant.
Let us remark that, using the Markov chains, the right sides of (12) and (13) can be identified with standard quantities: in the computer science literature [9], $I(U^r; XY)$ is called the external information, whereas $I(U^r; X|Y) + I(U^r; Y|X)$ is called the internal information.
The symmetric strong data processing constant is symmetric in the sense that $s^*_\infty(X;Y) = s^*_\infty(Y;X)$, since $r=\infty$ in the definition. On the other hand, $s^*_1(X;Y)$ coincides with the strong data processing constant, which is generally not symmetric. Furthermore, a tensorization property holds for the internal and external information: denoting by $\mathcal{R}(X;Y)$ the set of all $(R,S)$ satisfying (12) and (13), the region $\mathcal{R}$ tensorizes over i.i.d. pairs. In particular, taking the slope of the boundary at the origin yields $s^*_\infty(X;Y)$. A useful and general upper bound on $s^*_\infty$ in terms of the SVD was provided in [32, Theorem 4], which implies that $s^*_\infty = s^*_1$ when $X$ and $Y$ are unbiased Bernoulli. However, that bound is not tight enough for the nonparametric estimation problem we consider, and in fact we adopt a new approach in Section IX for the biased Bernoulli distribution. Let us remark that $s^*_\infty = s^*_1$ holds also for Gaussian $(X,Y)$, which follows by combining the result on the unbiased Bernoulli distribution and a central limit theorem argument [32] (see also [19]). Moreover, it was conjectured in [32] that the set of possible $(R,S)$ satisfying (12)-(13) does not depend on $r$ when $X$ and $Y$ are unbiased Bernoulli.

Now by Definition 2, we immediately obtain a generalization of (11): $s^*_r(X;Y)$ can be used to bound $D(P_{\mathbf{Y}\Pi}\,\|\,\bar P_{\mathbf{Y}\Pi})$, and in turn, the error probability in independence testing.

III. PROBLEM SETUP AND MAIN RESULTS
We consider estimating the density function at a given point, where the density is assumed to be Hölder continuous in a neighborhood of that point. It is clear that there is no loss of generality in assuming that such a neighborhood is the unit cube and that the given point is its center. More precisely, the class of densities under consideration, denoted $\mathcal{H}(\beta, L, A)$, consists of the probability density functions $p_{XY}$ on $\mathbb{R}^d\times\mathbb{R}^d$ satisfying $\|p_{XY}\|_{[0,1]^{2d},\beta}\le L$ and $\inf_{(x,y)\in[0,1]^{2d}} p_{XY}(x,y)\ge A$.

Definition 4. We say $\mathcal{C}$ is a prefix code [14] if it is a subset of the set of all finite non-empty binary sequences with the property that for any distinct $s_1, s_2\in\mathcal{C}$, $s_1$ is not a prefix of $s_2$.
The problem is to estimate, at a given point, the density of an unknown distribution from $\mathcal{H}(\beta,L,A)$. More precisely:
• $P_{XY}$ is a fixed but unknown distribution whose corresponding density $p_{XY}$ belongs to $\mathcal{H}(\beta,L,A)$ for some $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$.
• An infinite sequence of i.i.d. sample pairs $(X_i, Y_i)\sim P_{XY}$ is generated; Alice observes $\mathbf{X} = (X_i)$ and Bob observes $\mathbf{Y} = (Y_i)$.
• Unlimited common randomness $\Pi_0$ is observed by both Alice and Bob; that is, an infinite random bit string independent of $(\mathbf{X},\mathbf{Y})$ shared by Alice and Bob.
• In round $i = 1,\ldots,r$: if $i$ is odd, Alice sends to Bob a message $\Pi_i$, which is an element of a prefix code, computed using the common randomness $\Pi_0$, the previous transcripts $\Pi^{i-1}$, and $\mathbf{X}$; if $i$ is even, then Bob sends to Alice a message $\Pi_i$ computed using $\Pi_0$, $\Pi^{i-1}$, and $\mathbf{Y}$.
• Bob computes an estimate $\hat p$ of the true density $p_{XY}(x_0, y_0)$, where $x_0 = y_0$ is the center of $[0,1]^d$.

One-way NP Estimation Problem. Suppose that $r = 1$. Under the constraint on the expected length of the transcript (i.e., the length of the concatenated bit string)
$$\mathbb{E}\Big[\sum_{i=1}^r \mathrm{length}(\Pi_i)\Big] \le k, \qquad (25)$$
where $k>0$ is a real number, what is the minimax risk
$$\inf\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}\big[|\hat p - p_{XY}(x_0,y_0)|^2\big]\,?$$

Interactive NP Estimation Problem. Under the same constraint on the expected length of the transcript, but without any constraint on the number of rounds $r$, what is the minimax risk?
Remark 1. The prefix condition ensures that Bob knows that the current round has terminated after finishing reading each $\Pi_i$. Alternatively, the problem can be formulated by stating that $\Pi_i$ is a random variable on an arbitrary alphabet and replacing (25) by the entropy constraint $H(\Pi^r)\le k$. Furthermore, one may use the information leakage constraint $I(\mathbf{X},\mathbf{Y};\Pi^r)\le k$ instead. From our proofs it is clear that the minimax rates will not change under these alternative formulations.
Remark 2. There would be no essential difference if the problem were formulated with $|\Pi|\le k$ almost surely and with the error guarantee holding with high probability. Indeed, for the upper bound direction, those conditions can be met via a truncation argument once we have an algorithm satisfying the expected-length and expected-risk constraints, by Markov's inequality and the union bound; therefore the results only differ by a constant factor. For the lower bound, the proof can be extended to the high-probability version, since we used a Le Cam style argument [51].
Remark 3. The common randomness assumption is common in the communication complexity literature and, in some sense, is equivalent to private randomness [34]. In our upper bound proof, the common randomness is the randomness in the codebooks. Random codebooks give rise to convenient properties, such as the fact that the expectation of the distribution of the matched codewords equals exactly the product of idealized single-letter distributions (78). It is likely, however, that some approximate versions of these proof steps, and ultimately the same asymptotic risk, should hold for some carefully designed deterministic codebooks.
Theorem 1. In one-way NP estimation, for any $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$, the minimax risk satisfies
$$\inf\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}\big[|\hat p - p_{XY}(x_0,y_0)|^2\big] = \Theta\Big(\big(\tfrac{k}{\log k}\big)^{-\frac{2\beta}{d+2\beta}}\Big),$$
where $\Theta(\cdot)$ hides multiplicative factors depending on $L$, $\beta$ and $A$.

The proof of the upper bound is in Section VII-B. Recall that nonparametric density estimation using a rectangular kernel is equivalent to counting the frequency of samples in a neighborhood of a given diameter, the bandwidth, which we denote as $\Delta$. A naive protocol is for Alice to send the indices of the samples in $x_0 + [-\Delta,\Delta]^d$. Locating each sample in that neighborhood requires on average $\Theta(\log\frac{1}{\Delta}) = \Theta(\log k)$ bits. Thus $\Theta(k/\log k)$ samples in that neighborhood can be located. It turns out that the naive protocol is asymptotically optimal.
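The following toy implementation (a sketch with illustrative constants and encodings; the exact scheme and its analysis are in Section VII) mimics the naive protocol and its $\Theta(\log k)$ bits-per-index accounting.

```python
import numpy as np

def naive_one_way_estimate(x_samples, y_samples, x0, y0, delta, bit_budget):
    """Sketch of the naive one-way protocol (constants are illustrative):
    Alice reports indices i with X_i in the cube x0 +/- delta, paying roughly
    d*log2(1/delta) bits per index (gap encoding); Bob counts how many of those
    indices also have Y_i in the cube y0 +/- delta and normalizes by the volume."""
    n, d = x_samples.shape
    bits_per_index = d * np.log2(1.0 / delta)          # order of the per-index cost
    max_indices = int(bit_budget / bits_per_index)     # how many hits fit in the budget
    hits = np.flatnonzero(np.all(np.abs(x_samples - x0) <= delta, axis=1))[:max_indices]
    if len(hits) == 0:
        return 0.0
    n_scanned = hits[-1] + 1                           # samples Alice effectively scanned
    joint_hits = np.sum(np.all(np.abs(y_samples[hits] - y0) <= delta, axis=1))
    return joint_hits / (n_scanned * (2.0 * delta) ** (2 * d))

rng = np.random.default_rng(2)
n, d = 500_000, 1
x = rng.random((n, d))
y = x + 0.1 * rng.normal(size=(n, d))                  # correlated pair
print(naive_one_way_estimate(x, y, x0=0.5, y0=0.5, delta=0.05, bit_budget=10_000))
```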
The proof of the lower bound (Section VIII) follows by a reduction to testing independence for biased Bernoulli distributions, via a simulation argument. Although some arguments are similar to [19], the present problem concerns biased Bernoulli distributions instead. The (KL) strong data processing constant turns out to be drastically different from the $\chi^2$ data processing constant, as opposed to the case of many familiar distributions such as the unbiased Bernoulli or the Gaussian distributions.
As alluded to above, our main result is that the risk can be strictly improved when interaction is allowed.

Theorem 2. In interactive NP estimation, for any $\beta\in(0,\infty)$, $L\in(0,\infty)$, and $A\in[0,1)$, we have
$$\inf\sup_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}\big[|\hat p - p_{XY}(x_0,y_0)|^2\big] = \Theta\big(k^{-\frac{2\beta}{d+2\beta}}\big), \qquad (28)$$
where $\Theta(\cdot)$ hides multiplicative factors depending on $L$, $\beta$ and $A$.
To achieve the scaling in (28), r can grow as slowly as the super-logarithm (i.e., inverse tetration) of k; for the precise relation between r and k, see Section V-C.
The proof of the upper bound of Theorem 2 is given in Section VII-C; it is based on a novel multi-round estimation scheme for biased Bernoulli distributions formulated and analyzed in Sections IV, V, and VI. Roughly speaking, the intuition is to "locate" the samples within neighborhoods of $(x_0, y_0)$ by successive refinements, which is more communication-efficient than revealing the locations all at once.
The lower bound of Theorem 2 is proved in Section IX. The main technical hurdle is to develop new and tighter bounds on the symmetric data processing constant of [32] for the biased binary case.

IV. ESTIMATION OF BIASED BERNOULLI DISTRIBUTIONS
In this section, we shall describe an algorithm for estimating the joint distribution of a pair of biased Bernoulli random variables. The biased Bernoulli estimation problem can be viewed as a natural generalization of the Gap-Hamming problem [24], [12] to the sparse setting, and is the key component in both the upper and lower bound analyses for the nonparametric estimation problem. Indeed, we shall explain in Section VII that our nonparametric estimator is based on a linear combination of rectangular kernel estimators, which estimate the probability that $X$ and $Y$ fall into neighborhoods of $x_0$ and $y_0$. Indicators that samples fall within such neighborhoods are Bernoulli variables, so the biased Bernoulli estimator can be used. For the lower bound, we shall explain in Section VIII that the nonparametric estimation problem can be reduced to the biased Bernoulli estimation problem via a simulation argument.
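As a concrete, purely illustrative rendering of the first reduction, the sketch below converts continuous samples into the Bernoulli pairs used in this section; the parameterization of the joint law by $(m_1, m_2, \delta)$ follows the convention recalled in the setup below, and the value $0$ marks the rare neighborhood event.

```python
import numpy as np

def neighborhood_indicators(x_samples, y_samples, x0, y0, h):
    """Map each continuous sample pair to a Bernoulli pair: value 0 marks the rare
    event that the sample falls in the bandwidth-h cube around the query point."""
    x_in = np.all(np.abs(np.atleast_2d(x_samples) - x0) <= h, axis=1)
    y_in = np.all(np.abs(np.atleast_2d(y_samples) - y0) <= h, axis=1)
    return np.where(x_in, 0, 1), np.where(y_in, 0, 1)

# The empirical joint law of (X, Y) then feeds the biased Bernoulli estimator:
# roughly m1 ~ 1/P(X=0), m2 ~ 1/P(Y=0), and delta measures excess co-occurrence of zeros.
rng = np.random.default_rng(3)
x = rng.random((200_000, 1))
y = (x + 0.05 * rng.normal(size=x.shape)) % 1.0
X, Y = neighborhood_indicators(x, y, x0=0.5, y0=0.5, h=0.05)
p00 = np.mean((X == 0) & (Y == 0)); p0x = np.mean(X == 0); p0y = np.mean(Y == 0)
print(p00 / (p0x * p0y) - 1.0)     # empirical delta: positive, since X and Y are dependent
```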
For notational simplicity, we shall use X, Y for the Bernoulli variables in this section as well as Sections V-VI, although we should keep in mind that these are not the continuous variables in the original nonparametric estimation problem.

Setup.
• An infinite sequence of pairs $(X(l), Y(l))_{l=1}^{\infty}$, each taking values in $\{0,1\}^2$, is generated i.i.d. according to the distribution $P_{XY}$ parameterized by $(m_1, m_2, \delta)$ through $P_X(0) = \frac{1}{m_1}$, $P_Y(0) = \frac{1}{m_2}$, and $P_{XY}(0,0) = \frac{1+\delta}{m_1 m_2}$, where we recall our convention that the upper left entry of the matrix denotes the probability that $X = Y = 0$. Alice observes $(X(l))_{l=1}^{\infty}$ and Bob observes $(Y(l))_{l=1}^{\infty}$.
• Unlimited common randomness $\Pi_0$.
Goal: Alice and Bob exchange messages in no more than $r$ rounds in order to estimate $\delta$.
Our algorithm is described as follows.
Input: $m_1, m_2\in(10,\infty)$; positive integers $n$ and $r$; a sequence of real numbers $\alpha_1,\ldots,\alpha_r\in(1,\infty)$ satisfying $\sum_{\mathrm{odd}\ 1\le i\le r}\alpha_i^{-1}\ge\frac{10}{m_1}$.
The $\alpha_1,\ldots,\alpha_r$ can be viewed as parameters of the algorithm, and control how much information is revealed about the locations of the "common zeros" of $\mathbf{X},\mathbf{Y}$ in each round of communication. For example, setting $\alpha_1 = \frac{m_1}{10}$ and all other $\alpha_i = 1$ yields a one-way communication protocol, whereas setting all $\alpha_i > 1$ yields a "successive refinement" algorithm which may incur a smaller communication budget yet convey the same amount of information.
Before describing the algorithm, let us define a conditional distribution $P_{U^r|XY}$ by recursion, which will be used later in generating random codebooks.
Definition 5. For each $i\in\{1,\ldots,r\}\setminus 2\mathbb{Z}$, define $P_{U_i|XU^{i-1}}$ as a binary-output conditional distribution whose parameters depend on $\alpha_i$, and then set $P_{U_i|XYU^{i-1}} = P_{U_i|XU^{i-1}}$. For even $i = 1,\ldots,r$, we use similar definitions, but with the roles of $X$ and $Y$ switched. This specifies $P_{U_i|XYU^{i-1}}$ for $i=1,\ldots,r$.
Note that by Definition 5, $U_i = 1$ implies $U_{i+1} = 1$ for each $i = 1,\ldots,r-1$. In words, for odd $i$, $U_i$ marks all coordinates with $X = 0$ as $0$, and marks coordinates with $X = 1$ as either $0$ or $1$; whenever $U_i = 1$ is marked, then $X$ is definitely $1$, and will be forgotten in all subsequent rounds. Now set the target joint distribution of $(\mathbf{X},\mathbf{Y},\mathbf{U}^r)$ to be the $n$-fold product of $P_{XY}P_{U^r|XY}$, where $P_{U^r|XY}$ is induced by $(P_{U_i|XYU^{i-1}})_{i=1}^r$ in Definition 5.

Initialization. By applying a common function to the common randomness, Alice and Bob can produce a shared infinite array $(V_{i,j}(l))$, where $i\in\{1,\ldots,r\}$, $j\in\{1,2,\ldots\}$, $l\in\{1,2,\ldots,n\}$, such that the entries of the array are independent random variables with prescribed distributions (serving as random codebooks).

Iterations. Consider any odd $i = 1,\ldots,r$. We want to generate $\mathbf{U}_i$ by selecting a codeword so that $(\mathbf{X},\mathbf{Y},\mathbf{U}^i)$ follows the target distribution. Note that Alice knows both $A_0$ and $A_1$, while Bob knows $A$, since it will be seen from the recursion that Alice and Bob both know $\mathbf{U}_1,\ldots,\mathbf{U}_{i-1}$ at the beginning of the $i$-th round.
Alice chooses $\hat\jmath_i$ as the minimum nonnegative integer $j$ satisfying an acceptance condition determined by her local data $(\mathbf{X},\mathbf{U}^{i-1})$ and the shared array. Alice encodes $\hat\jmath_i$ using a prefix code, e.g., the Elias gamma code [17], and sends it to Bob. Then both Alice and Bob compute $\mathbf{U}_i$ from the index $\hat\jmath_i$ and the shared array. The operations in the $i$-th round for even $i$ are similar, with the roles of Alice and Bob reversed. We will see later that the notation $\mathbf{U}_i$ is consistent in the sense of (49).
Estimator. Recall that in classical parametric statistics, one can evaluate the score function at the sample, compute its expectation and variance, and construct an estimator achieving the Cramér-Rao bound asymptotically. Now for $i\in\{1,\ldots,r\}\setminus 2\mathbb{Z}$, define the score function $\Gamma_i(u^i, y)$ associated with the parametric family $P^{(\delta)}_{XYU^r}$. For $i\in\{1,\ldots,r\}\cap 2\mathbb{Z}$, define $\Gamma_i(u^i, x)$ similarly with the roles of $X$ and $Y$ reversed. Alice and Bob can each compute the corresponding empirical sums of these scores over $l = 1,\ldots,n$ from their own observations. Finally, Alice's and Bob's estimators $\hat\delta_A$ and $\hat\delta_B$ are obtained by centering these statistics and normalizing by the derivative $\partial_\delta E^{(\delta)}[\,\cdot\,]$, where $E^{(\delta)}$ refers to expectation when the true parameter is $\delta$, and $\partial_\delta$ denotes the derivative in $\delta$. We will show that these estimators are well-defined: the normalizing quantities are independent of $\delta$ (Lemma 3), and can be computed by Alice and Bob without knowing $\delta$.
Proof.By (49), for each i odd and l ∈ {1, . . ., n}, and similar expressions hold for i even.The claims then follow.
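For orientation, the following is a generic sketch of the classical construction alluded to above (the estimators in (48) instantiate it with the scores $\Gamma_i$); the symbols $T$ and $\mu$ are generic placeholders rather than the paper's notation.

```latex
% Generic "statistic/score" estimator (sketch). Let T be a statistic of the data with
% mean mu(delta) := E^{(delta)}[T]. Define
\hat{\delta} := \frac{T - \mu(0)}{\mu'(0)} .
% Then, to first order in delta,
\mathbb{E}^{(\delta)}[\hat{\delta}] \approx \delta ,
\qquad
\operatorname{Var}^{(0)}(\hat{\delta}) = \frac{\operatorname{Var}^{(0)}(T)}{\mu'(0)^{2}} .
% When T is the score \partial_\delta \log p_\delta evaluated at delta = 0, both mu'(0) and
% Var^{(0)}(T) equal the Fisher information I(0), so Var(\hat{\delta}) = 1/I(0):
% the Cramer--Rao benchmark mentioned above.
```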

V. BOUNDS ON INFORMATION EXCHANGES
In this section we prove key auxiliary results that will be used in the upper bounds.

A. General (α i )
The following theorem is crucial for the achievability part of the analysis of the Bernoulli estimation problem described in Section IV (and hence for the nonparametric estimation problem). Specifically, (52)-(53) bound the communication from Alice to Bob and in reverse, and (54)-(55) bound the information exchanged which, in turn, will bound the estimation risk via Fano's inequality.
and assuming the natural base of logarithms. The proof can be found in Appendix A.
Theorem 4 also implies the following bound on the external information (see (18)).

Remark 5. Let us provide some intuition for why interaction helps, assuming the case $m_1 = m_2 = m$ for simplicity. From the proof of Theorem 4, it can be seen that, up to a constant factor, $s^*_\infty(X;Y)$ is at least of order $\frac{\delta^2}{m}$ (which is in fact sharp, as will be seen from the upper bound on $s^*_\infty(X;Y)$ in Theorem 6). Moreover, lower bounds on $s^*_r(X;Y)$ can be computed by replacing the integrals with discrete sums with $r$ terms over a grid $1 = t_0 < t_1 < \cdots < t_{\lceil r/2\rceil} = \frac{\ln m}{100}$. In particular, when $r = 1$, we recover $s^*_1(X;Y)\sim\frac{\delta^2}{m\ln m}$, whereas choosing $t_i - t_{i-1} = 1$, $i = 1,\ldots,\lceil r/2\rceil$, shows that $r\sim\ln m$ achieves $s^*_r(X;Y)\sim\frac{\delta^2}{m}$. Even better, later we will take $t_k$ as the $k$-th iterated power of $2$, and then $r$ will be the super-logarithm of $m$.
Recall that (α i ) control the amount of information revealed in each round and serve as hyperparameters of the algorithm to be tuned.Next we shall explain how to select the value of (α i ) so that the optimal performance is achieved in the one-way and interactive settings.
Let us remark that the sequence $(\alpha_i)$ we used in (66)-(68) is essentially optimal. Let $\beta_k := \sum_{\mathrm{even}\ 2\le j\le 2k}\alpha_j^{-1}$. In order for (69) to converge, we need $\sum_k \ln\big(\frac{\beta_k}{\beta_{k-1}}\big)\beta_{k-1}^{-1}$ to be convergent. Therefore $\beta_k$ cannot grow faster than $\beta_k = \exp(\beta_{k-1})$, which is tetration. However, this only amounts to a lower bound on $r$ for the particular design of $P_{U^r|XY}$ in Definition 5. Since tetration grows extremely fast, from a practical viewpoint $r$ is essentially a constant. Nevertheless, it remains an interesting theoretical question whether $r$ needs to diverge:
Conjecture 1. If there is an algorithm (indexed by $k$) achieving the optimal risk (28) for nonparametric estimation, then necessarily $r\to\infty$ as $k\to\infty$.

VI. ACHIEVABILITY BOUNDS FOR BERNOULLI ESTIMATION
In this section we analyze the performance of the Bernoulli distribution estimation algorithm described in Section IV.
As for the communication cost, the bound follows from (85) and Corollary 4.
Proof. The bound on the mean square error is similar to the $r=1$ case, except that we use Corollary 5 in the last step.
For the communication cost, we use (85) and Corollary 5.

VII. DENSITY ESTIMATION UPPER BOUNDS
In this section we prove the upper bounds in Theorem 1 and Theorem 2, by building nonparametric density estimators based on the Bernoulli distribution estimator. For $\beta\in(0,2]$, the rectangular kernel is minimax optimal (Section II-B), so that the integral against the kernel can be directly estimated using the Bernoulli distribution estimator, which we explain in Sections VII-B and VII-C. Extension to higher-order kernels is possible using a linear combination of rectangular kernels, which is explained in Section VII-D.

A. Density Lower Bound Assumption
First, we observe the following simple argument showing that it suffices to consider $A>0$. Define the constant $B$ (see (123)) via a supremum over all densities $p_{XY}$ on $\mathbb{R}^{2d}$ satisfying $\|p_{XY}\|_{[0,1]^{2d},\beta}\le L$. Clearly $B>1$ is finite and depends only on $\beta$, $L$, and $d$.
Lemma 9. In either the one-way or the interactive setting, suppose that there exists an algorithm achieving $\max_{p_{XY}\in\mathcal{H}(\beta,L,A)}\mathbb{E}[|\hat p - p_{XY}(x_0,y_0)|^2]\le R$ for some $R>0$ and $A\in(0,1)$. Then there must be an algorithm achieving mean square risk at most a constant (depending only on $A$, $\beta$, $L$, and $d$) times $R$ without the density lower bound, i.e., over $\mathcal{H}(\beta,L,0)$.

The idea is to mix the target density with a known density, rescale by $1+A$, and run the given algorithm on a new pair drawn according to the resulting density $\tilde p_{XY}$. Clearly $\tilde p_{XY}\in\mathcal{H}(\beta,L,A)$, and by assumption, $\tilde p_{XY}(x_0,y_0)$ can be estimated with mean square risk $R$. Since the mixing density is known, this implies that $q_{XY}(x_0,y_0)$ can be estimated with mean square risk of the same order.

For the rest of the section, we will assume that there is a density lower bound $A>0$ and $p_{XY}\in\mathcal{H}(\beta,L,A)$, which is sufficient in view of Lemma 9. Consider a bandwidth $h>0$ (which will be specified later as an inverse polynomial of $k$), and introduce the notations in (124). Define the binary pair $(X,Y)$ as the indicators induced by $P_{XY}$ and the bandwidth-$h$ neighborhoods of $(x_0, y_0)$, and define $m_1, m_2$ through the marginals of this binary pair (see (128)-(129)). Note that $m_1/m_2$ is bounded above and below by positive constants depending on $A$, $\beta$, and $L$ (see (132) and (137)). Also, we can assume Alice and Bob both know $m_1$ and $m_2$, since with infinite samples Alice and Bob know their marginal densities $p_X$ and $p_Y$, and Alice can send $m_1$ to Bob with very high precision using a negligible number of bits. Let $\delta\ge -1$ be the number such that $P_{XY}$ equals the matrix parameterized by $(m_1, m_2, \delta)$ as in Section IV. Let $\hat\delta_B$ be Bob's estimator of $\delta$ in (48); Bob's density estimator is then defined accordingly.

We next show that the smoothness of the density ensures that $1+\delta$ is at most the order of a constant. Recall that $A$ is a density lower bound on $p_X$ and $p_Y$. Define $M := \max\{m_1, m_2\}$ and $m := \min\{m_1, m_2\}$. The definition of $(m_1, m_2)$ implies $Ah^d\le\frac{1}{M}$. Recalling $B$ defined in (123), we also have $h^d\ge\frac{1}{m_1 B}$ and similarly $h^d\ge\frac{1}{m_2 B}$; together with (132), we see that $1+\delta$ is bounded by a constant depending only on $(A,\beta,L)$.

Next, the bias of the density estimator is just the bias of the rectangular kernel estimator (with bandwidth $h$ in each of the two subspaces). The rectangular kernel is of order 1 [44, Definition 1.3] and compactly supported while, by assumption, $\beta\in(0,2]$; therefore the bias is bounded by $Ch^\beta$ ([44, Proposition 1.2]), where $C$ is a constant depending only on $\beta$, $d$ and $L$.

B. One-Way Case
By Corollary 7 and (137), we can bound the variance of the density estimator, where (142) uses (137). Also by Corollary 7, the communication constraint is satisfied under an appropriate choice of the parameters: we choose $h$ so that $m_1 = \left(\frac{k}{\log_2 k}\right)^{\frac{d}{d+2\beta}}$ as defined by (128), and set the number of samples used by the Bernoulli estimator accordingly, where $\delta_{\max}$, defined as the right side of (136) and hence depending only on $(A,\beta,L)$, is an upper estimate of the true parameter $\delta$. Then the communication constraint is satisfied.
Moreover, by the bias bound (139) and the variance bound (142), the risk is bounded by $D\left(\frac{k}{\log k}\right)^{-\frac{2\beta}{d+2\beta}}$, where $D$ is a constant depending only on $\beta$, $L$, and $A$; here we used the fact that $\delta$ is bounded above by (137) and the bound on $h$ shown in (132). This proves the upper bound in Theorem 1 for $\beta\in(0,2]$.
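Heuristically, and suppressing all constants, the choice of $m_1$ above can be read off from the following bias-variance balance; the variance scaling $m_1\log m_1/k$ used here is an assumption consistent with the discussion around (142) rather than a restatement of (142) itself.

```latex
% Heuristic bias--variance balance for the one-way case (all constants suppressed).
% Assumptions: 1/m_1 \asymp h^d (cf. (132)), bias \asymp h^\beta (cf. (139)), and a
% variance of order m_1 \log m_1 / k.
\mathrm{bias}^2 \asymp h^{2\beta} \asymp m_1^{-2\beta/d},
\qquad
\mathrm{variance} \asymp \frac{m_1 \log m_1}{k}.
% Equating the two terms:
m_1^{-2\beta/d} \asymp \frac{m_1 \log m_1}{k}
\;\Longrightarrow\;
m_1 \asymp \Big(\frac{k}{\log k}\Big)^{\frac{d}{d+2\beta}},
\qquad
\mathrm{risk} \asymp \Big(\frac{k}{\log k}\Big)^{-\frac{2\beta}{d+2\beta}}.
```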

C. Interactive Case
Choose $h$ such that $m$, as defined by $m := \min\{m_1, m_2\}$ and (128)-(129), satisfies $m \asymp k^{\frac{d}{d+2\beta}}$, and set the remaining parameters of the Bernoulli estimator accordingly, where as before $\delta_{\max}$ is an upper bound on $\delta$ and only depends on $(A,\beta,L)$. By Corollary 8, for $k$ large enough we have $m>10$, and the communication cost is bounded by $k$. Moreover, from (146), the risk is bounded by $Dk^{-\frac{2\beta}{d+2\beta}}$ for some $D$ depending only on $\beta$, $L$, and $A$. This proves the upper bound in Theorem 2 when $\beta\in(0,2]$.

D. Extension to β > 2
For $\beta>2$, the rectangular kernel is no longer minimax optimal. However, observe the following:
Proposition 10. For any positive integers $d$ and $l$, there exists a kernel of order $l$ in $\mathbb{R}^d$ which is a linear combination of $(\lfloor l/2\rfloor+1)^d$ indicator functions.
Proof. In the following we prove the case $d=1$; the case of general $d$ then follows by taking the tensor product of kernel functions on $\mathbb{R}$. Note that a kernel of order $l$ must satisfy the equations
$$\int K(u)\,\mathrm{d}u = 1, \qquad \int u^j K(u)\,\mathrm{d}u = 0, \quad j = 1,\ldots,l. \qquad (150)$$
Let us consider $K$ of the form
$$K(u) = \sum_{i=1}^{k_0} c_i\, \mathbf{1}\{|u|\le i\},$$
where $k_0 := \lfloor l/2\rfloor + 1$. Since such a $K(u)$ is an even function, the equations in (150) yield $k_0$ nontrivial equations for $c_1,\ldots,c_{k_0}$ (i.e., only when $j$ is even). From the formula for the determinant of the Vandermonde matrix, we see that these equations have a unique solution for $c_1,\ldots,c_{k_0}$.
Now for general $\beta>0$, we can take a kernel of order $l = \lfloor\beta\rfloor$ as in Proposition 10. The corresponding kernel-weighted probabilities can then be estimated by applying the Bernoulli distribution estimator repeatedly, $(\lfloor l/2\rfloor+1)^{2d}$ times (once for each pair of indicator components). Therefore, by arguments similar to those in the preceding sections, we see that the upper bounds in Theorem 1 and Theorem 2 hold for $\beta>2$ as well.
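As a quick numerical check of Proposition 10 (a sketch; the supports $\{|u|\le i\}$, $i=1,\ldots,k_0$, are one convenient choice), the coefficients can be obtained by solving the system of even-moment equations, whose matrix is of Vandermonde type in $i^2$:

```python
import numpy as np

def indicator_kernel_coeffs(l):
    """Solve for c_1..c_{k0} so that K(u) = sum_i c_i * 1{|u| <= i} is a kernel of
    order l in the moment sense: int K = 1 and int u^j K(u) du = 0 for j = 1..l.
    Odd moments vanish by symmetry; the even moments give a Vandermonde-type system."""
    k0 = l // 2 + 1
    even_js = [2 * t for t in range(k0)]                  # j = 0, 2, ..., 2*(k0-1)
    # int_{-i}^{i} u^j du = 2 * i^(j+1) / (j+1) for even j
    A = np.array([[2.0 * i ** (j + 1) / (j + 1) for i in range(1, k0 + 1)] for j in even_js])
    b = np.zeros(k0); b[0] = 1.0                          # only the j = 0 equation is nonzero
    return np.linalg.solve(A, b)

def check_moments(c, l, grid=2_000_001):
    """Numerically verify the moment conditions of the resulting kernel."""
    k0 = len(c)
    u = np.linspace(-k0, k0, grid)
    du = u[1] - u[0]
    K = sum(c[i] * (np.abs(u) <= i + 1) for i in range(k0))
    return [float(np.sum(u ** j * K) * du) for j in range(l + 1)]

c = indicator_kernel_coeffs(l=3)          # k0 = 2 indicator components suffice for order 3
print(c, check_moments(c, l=3))           # c = [2/3, -1/12]; moments approx [1, 0, 0, 0]
```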

VIII. ONE-WAY DENSITY ESTIMATION LOWER BOUND
A. Upper Bounding $s^*(X;Y)$

The pointwise estimation lower bound is obtained by lower bounding the risk for estimating $P_{XY}$ (with $X$ and $Y$ being indicators of neighborhoods of $x_0$ and $y_0$), and applying Le Cam's inequality to the latter. Therefore we are led to considering the strong data processing constant for the biased Bernoulli distribution.

B. Lower Bounding One-Way NP Estimation Risk
Given $k, d, \beta, L, A$, consider a distribution $P_{XY}$ on $\{0,1\}^2$ with $P_X(0) = P_Y(0) = \frac{1}{m}$ and $P_{XY}(0,0) = \frac{1+\delta}{m^2}$ (the matrix in (163)), where $m := a\left(\frac{k}{\ln k}\right)^{\frac{d}{2\beta+d}}$ and $\delta := m^{-\frac{\beta}{d}}$, with $a := \frac{16\beta+8d}{d\ln 2}$ being a constant. We then need to "simulate" smooth distributions from $P_{XY}$. Let $f:\mathbb{R}^d\to[0,\infty)$ be a function satisfying suitable smoothness, compact support, and normalization properties; clearly, such a function exists for any given $\beta, L, d$. For $m$ sufficiently large that $m^{-1/d}\,\mathrm{supp}(f) + x_0\subseteq[0,1]^d$ and $m^{-1/d}\,\mathrm{supp}(f) + y_0\subseteq[0,1]^d$ (recall that $(x_0,y_0)$ is the given point in the density estimation problem), define conditional densities built from shifted and rescaled copies of $f$; since $P_X(0) = P_Y(0) = \frac{1}{m}$, these are valid probability densities supported on $[0,1]^d$. Define the analogous densities for the null hypothesis, which are also probability densities supported on $[0,1]^d$. Define $P_{XY|XY} = P_{X|X}P_{Y|Y}$, where $P_{X|X}$ and $P_{Y|Y}$ are the conditional distributions defined by the densities above. Under the joint distribution $P_{XYXY}$, the density of the simulated pair takes an explicit form, and we define $\bar P_{XY}$ analogously with the roles of $P_{XY}$ played by the independent distribution $P_X\times P_Y$ (so that $\bar p_{XY}$ is the uniform density on $[0,1]^{2d}$).

We now check that the density of $P_{XY}$ satisfies $\|p_{XY}\|_{(0,1)^{2d},\beta}\le L$ for $m$ sufficiently large. Indeed, for $x,y\in[0,1]^d$, the deviation of $p_{XY}$ from the uniform density is controlled by (175)-(176) under the assumptions on $f$; therefore, with the choice $\delta = m^{-\beta/d}$, we have $\|p_{XY}\|_{(0,1)^{2d},\beta}\le L$ for $m\ge 10$.

Now we can apply a Le Cam style argument [51] for the estimation lower bound. Suppose that there exists an algorithm that estimates the density at $(x_0,y_0)$ as $\hat p$. Alice and Bob can convert this into an algorithm for testing the binary distribution $P_{XY}$ against $\bar P_{XY}$. Indeed, suppose that $(\mathbf{X},\mathbf{Y})$ is an infinite sequence of i.i.d. random variable pairs distributed according to either $P_{XY}$ or $\bar P_{XY}$. Using local randomness (which is implied by the common randomness), Alice and Bob can simulate a sequence of i.i.d. random variables according to either $P_{XY}$ or $\bar P_{XY}$, by applying the random transformations $P_{X|X}$ and $P_{Y|Y}$ coordinate-wise. Then Alice and Bob can apply the density estimation algorithm to obtain $\hat p$. Note that $\bar p_{XY}(x_0,y_0) = 1$, while $p_{XY}(x_0,y_0)$ differs from $1$ by an amount of order $\delta$, the latter following from (173). Now suppose that Bob declares $P_{XY}$ if $\hat p$ is closer to $p_{XY}(x_0,y_0)$ than to $\bar p_{XY}(x_0,y_0)$, and $\bar P_{XY}$ otherwise. By Chebyshev's inequality, the error probability (under either hypothesis) is upper bounded by (180).

On the other hand, from (22) and Theorem 5 we obtain an upper bound on the strong data processing constant of $P_{XY}$ when $m$ is sufficiently large. Moreover, it is known (from Kraft's inequality, see e.g. [14]) that the expected length of a prefix code upper bounds the entropy. Thus $H(\Pi)\le k$, and therefore $D(P_{\mathbf{Y}\Pi}\,\|\,\bar P_{\mathbf{Y}\Pi})$ is bounded in terms of $s^*(X;Y)\,k$. Then, by Pinsker's inequality (e.g., [44]), the total variation distance between $P_{\mathbf{Y}\Pi}$ and $\bar P_{\mathbf{Y}\Pi}$ is bounded away from one, where the last step uses our choice $a = \frac{16\beta+8d}{d\ln 2}$. However, $\int\mathrm{d}P_{\mathbf{Y}\Pi}\wedge\mathrm{d}\bar P_{\mathbf{Y}\Pi}$ lower bounds twice the error bound in (180). Therefore, the mean square risk of the density estimator is lower bounded by a constant times $\delta^2$, which is in turn lower bounded by a constant times $\left(\frac{k}{\ln k}\right)^{-\frac{2\beta}{2\beta+d}}$.
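For reference, the estimation-to-testing step used above can be summarized as follows (a sketch with generic constants; (179)-(180) in the paper instantiate it).

```latex
% Le Cam two-point reduction (sketch). Let Delta := |p_{XY}(x_0,y_0) - \bar p_{XY}(x_0,y_0)|,
% which here is of order delta. If an estimator achieves E|\hat p - p(x_0,y_0)|^2 <= R under
% both hypotheses, the test "declare the hypothesis whose value at (x_0,y_0) is closer to
% \hat p" errs only when |\hat p - p(x_0,y_0)| >= Delta/2, so by Markov's inequality
\Pr[\mathrm{error}] \;\le\; \frac{\mathbb{E}\,|\hat p - p(x_0,y_0)|^{2}}{(\Delta/2)^{2}} \;\le\; \frac{4R}{\Delta^{2}} .
% If the two hypotheses cannot be distinguished with error probability below a constant c_0
% (as follows from the data processing bound and Pinsker's inequality), then
R \;\ge\; \frac{c_0\,\Delta^{2}}{4} \;\asymp\; \delta^{2} = m^{-2\beta/d} .
```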

IX. INTERACTIVE DENSITY ESTIMATION LOWER BOUND
In this section we prove the lower bound in Theorem 2.
A. Upper Bounding $s^*_\infty(X;Y)$

The heart of the proof is the following technical result.

Theorem 6. There exists $c>0$ small enough such that the following holds: for any $P_{XY}$ which is a distribution on $\{0,1\}^2$ with $P_X(0) = P_Y(0) = p$ and $P_{XY}(0,0) = (1+\delta)p^2$, where $p, |\delta|\in[0,c)$ and we use the notation $\bar p := 1-p$, the symmetric data processing constant $s^*_\infty(X;Y)$ is upper bounded by a constant multiple of $\delta^2 p$ (cf. Remark 5).

The proof can be found in Appendix B.

B. Lower Bounding Interactive NP Estimation Risk
The proof is similar to the one-way case (Section VIII-B). Consider the distribution $P_{XY}$ on $\{0,1\}^2$ as in (163), with $c$ being the absolute constant in Theorem 6. Pick the function $f$, and define $P_{XYXY}$ and $\bar P_{XYXY}$ as before. Note that, as before, $\bar p_{XY}$ is uniform on $[0,1]^{2d}$, while $\|p_{XY}\|_{(0,1)^{2d},\beta}\le L$ for $m\ge 10$; $p_{XY}(x_0,y_0)$ is given by the same formula (178), and Alice and Bob can convert a (now interactive) density estimation algorithm into an algorithm for testing $P_{XY}$ against $\bar P_{XY}$. With the same testing rule (179), the error probability under either hypothesis is again upper bounded by (180).
Changes arise in (181), where we shall apply Theorem 6 instead. Note that, for the absolute constant $c$ in Theorem 6, the condition $\frac{1}{m}, |\delta| < c$ is satisfied for sufficiently large $k$ (hence sufficiently large $m$).

The divergence $D(P_{\mathbf{Y}\Pi}\,\|\,\bar P_{\mathbf{Y}\Pi})$ is now controlled using Theorem 6. Again using Kraft's inequality to bound $H(\Pi)$, we obtain $H(\Pi)\le k$, and then Pinsker's inequality yields that the total variation distance between $P_{\mathbf{Y}\Pi}$ and $\bar P_{\mathbf{Y}\Pi}$ is bounded away from one, since we selected $a = \frac{2}{c\ln 2}$. Again, $\int\mathrm{d}P_{\mathbf{Y}\Pi}\wedge\mathrm{d}\bar P_{\mathbf{Y}\Pi}$ lower bounds twice the error bound in (180); therefore the risk is lower bounded by a constant times $\delta^2$, which is of order $k^{-\frac{2\beta}{2\beta+d}}$, where the last relation holds for sufficiently large $k$. Since $a$ is a universal constant and $f$ depends on $d, \beta, L$ only, this completes the proof of the interactive lower bound.

X. ACKNOWLEDGEMENT
The author would like to thank Professor Venkat Anantharam for bringing the reference [36] to the author's attention and for some interesting discussions. This research was supported by a starting grant from the Department of Statistics, University of Illinois, Urbana-Champaign.
By continuity, we have $b^{(\delta)}\to b^{(0)}$ as $\delta\to 0$. It is also easy to see from (217) that $\delta_i = O(\delta)$ (for this proof, only $\delta$ is the variable, and all other constants, such as $m$ and $(\alpha_i)$, can be hidden in the Landau notations); therefore (217), (220) and (221) yield the corresponding expansions. Using the fact that $X$ and $Y$ are independent under $P^{(0)}$, noting (202) and the assumption $\sum_{\mathrm{odd}\ 1\le j\le r}\alpha_j^{-1}\ge\frac{10}{m_1}$, we can bound $b^{(0)}$, where $i'$ is the largest odd integer not exceeding $i$; similarly we also have $c^{(0)}\le\frac{1}{10}$. Consequently, (223) yields the desired bound. Moreover, define $a^{(\delta)}$ via $P^{(\delta)}$ as above. In the following paragraph we consider an arbitrary $i\in\{1,2,\ldots,r\}\setminus 2\mathbb{Z}$, and we shall omit the superscripts $(\delta)$ in $a^{(\delta)}, b^{(\delta)}, c^{(\delta)}$, unless otherwise noted. We can then verify the claimed identities, so that, as $\delta\to 0$, the divergence expansion holds, where $d(p\|q) := p\log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$ denotes the binary divergence function, and recall that we assumed the natural base of logarithms. Turning back to (228), we obtain (239), which follows since (208) implies the required inequalities. Moreover, by (202) and the analogous bounds for even rounds, and therefore by (240) and (211), summing over $\mathrm{odd}\ 1\le i\le r$ establishes the claim (54) of the theorem. The proof of (55) is similar.

$$\left(\int_0^{\frac{\ln m}{100}} e^{t}\,\mathrm{d}t\right)\left(\frac{1}{m}\int_0^{\frac{\ln m}{100}} e^{-t}\,\mathrm{d}t\right)^{-1}$$