Minimax Converse for Identification via Channels

A minimax converse for identification via channels is derived. By this converse, a general formula for the identification capacity, which coincides with the transmission capacity, is proved without the assumption of the strong converse property. Furthermore, the optimal second-order coding rate of identification via channels is characterized when the type I error probability is non-vanishing and the type II error probability is vanishing. Our converse is built upon the so-called partial channel resolvability approach; however, the minimax argument enables us to circumvent a flaw reported in the literature.


I. INTRODUCTION
Identification is one of the typical functions for which randomization significantly reduces the amount of communication necessary to compute them; e.g., see [18]. Inspired by the work of Ja'Ja' [17], Ahlswede and Dueck studied the problem of identification via noisy channels in [3], [2]; they showed that, with randomization, a number of messages that is doubly exponential in the block-length can be identified, and the optimal coefficient is given by Shannon's transmission capacity. Since then, the problem of identification in the context of information theory has been studied extensively in the literature [10], [11], [23], [24], [4], [12], [19], [5]; see [1] for a thorough review.
In many cases, the difficulty of identification problems lies in proving converse coding theorems. Initially, the so-called soft converse was proved in [3]; that is, the converse coding theorem was only proved under the assumption that the identification error probabilities converge to zero exponentially fast in the block-length. Later, Han and Verdú proved the strong converse coding theorem for identification via channels in [10]. The crucial step of the proof in [10] is to replace general stochastic encoders with stochastic encoders having a specific form, termed "M-types." In [11], Han and Verdú further studied this step as a separate problem, which they termed the channel resolvability, by introducing the information spectrum approach.
The information spectrum approach provides effective tools to investigate coding problems for general non-ergodic sources/channels; see [9] for a thorough treatment. For the channel resolvability, the optimal rate is upper bounded by the spectral sup-mutual information rate maximized over input processes. On the other hand, the identification capacity of general channels can be lower bounded by the spectral inf-mutual information rate maximized over input processes. When the upper bound and the lower bound coincide, which is termed the strong converse property, it was shown in [11] that the identification capacity and the optimal rate of the channel resolvability coincide with the transmission capacity of the same channels. Later, it was proved in [12] that, without the assumption of the strong converse property, the optimal rate of the channel resolvability is characterized by the spectral sup-mutual information rate maximized over input processes.
In an attempt to determine the identification capacity without the assumption of the strong converse property, Steinberg introduced the partial channel resolvability [23]. In the partial channel resolvability, we consider a truncated channel so that the tail probability of the information spectrum is not accumulated twice in the argument relating the channel resolvability to the identification code. It should be noted that, in the modern terminology, considering the partial response is essentially equivalent to the technique termed "smoothing" [22]. For instance, the channel resolvability for smoothed channels has been effectively used to derive second-order bounds on coding problems with side-information [31].
Using the partial channel resolvability, it was claimed in [23] that the identification capacity of general channels coincides with the transmission capacity of the same channels. However, there is a flaw in the proof of [23, Lemma 2], which was reported in [12, Remark 2]. Thus, without the assumption of the strong converse property, the identification capacity of general channels has remained an open problem. The main purpose of this paper is to provide a remedy to the result claimed in [23]. In fact, our converse is built upon the partial channel resolvability; however, in order to circumvent the aforementioned flaw, we leverage the minimax argument described below.
In the past few decades, arguments based on hypothesis testing have been successfully used to derive converse bounds on transmission codes over general channels [16], [21], an approach termed the meta converse. Particularly, a useful feature of the meta converse bound is that we can choose an auxiliary output distribution; thus, the expression of the converse bound involves a minimum over the output distribution and a maximum over the input distribution. For the asymptotic analysis of discrete memoryless channels, the Shannon capacity is recovered from the minimax expression by the Topsøe identity [29]. In fact, the flexibility of choosing the output distribution has been effectively used to derive finer asymptotic results: the second-order coding rate [13], [21] and the third-order coding rate [28]; see also [26]. Also, Polyanskiy proved that the order of the minimax in the meta converse bound can be interchanged under certain regularity conditions [20].
In this paper, we derive a minimax converse bound for identification via channels. To that end, we utilize a modified version [15] of the so-called soft covering lemma reported in [12], [19], [8]; the modified bound on the channel resolvability involves an auxiliary output distribution. The main contribution of this paper is to apply the flexibility of choosing the auxiliary output distribution to the argument connecting the channel resolvability and the identification code. The key difference between our argument and the argument in [23, Lemma 2] is as follows: in our argument, we consider a truncated channel induced from a fixed auxiliary output distribution; on the other hand, in [23, Lemma 2], truncated channels are constructed from output distributions that depend on the input distributions. In the former case, we can bound the number of messages of an identification code by the number of M-types without causing any trouble; this enables us to circumvent the flaw reported in [12, Remark 2]. See Remark 2 of Section IV for more detail.
By using the minimax converse bound, we derive the identification capacity of general channels; it turns out that the identification capacity coincides with the transmission capacity without the assumption of the strong converse property. In the derivation of this result, we invoke the aforementioned result in [20] to interchange the order of the minimum over the output distribution and the maximum over the input distribution. Furthermore, we also derive the optimal second-order coding rate of identification via channels when the type I error probability is non-vanishing and the type II error probability is vanishing.
Notation: Throughout the paper, random variables (e.g., X) and their realizations (e.g., x) are denoted by capital and lower case letters, respectively. All random variables take values in finite alphabets, which are denoted by the corresponding calligraphic letters (e.g., X). The probability distribution of a random variable X is denoted by P_X.
Similarly, X^n = (X_1, ..., X_n) and x^n = (x_1, ..., x_n) denote, respectively, a random vector and its realization in the n-th Cartesian product X^n. For a finite set S, the cardinality of S is denoted by |S|. For a subset T ⊆ S, the complement S\T is denoted by T^c. The set of all distributions on X is denoted by P(X). The indicator function is denoted by 1[·]. Information theoretic quantities are denoted in the same manner as in [6], [7], [9]. All information quantities and rates are evaluated with respect to the natural logarithm. For given sub-distributions P and Q that are not necessarily normalized, the variational distance is denoted by

d(P, Q) = Σ_x |P(x) − Q(x)|.

II. PROBLEM FORMULATION OF IDENTIFICATION VIA CHANNELS
In this section, we describe the problem formulation of identification via channels and review basic results.
We start with the problem formulation in the single-shot regime. Given a channel W from X to Y, the sender tries to transmit one of N messages; the receiver then shall identify whether message i ∈ {1, ..., N} was transmitted or not.
The encoder is given by stochastic mappings P_1, ..., P_N ∈ P(X), and the decoder is given by acceptance regions D_1, ..., D_N ⊂ Y, one for each message. Note that, unlike for a standard transmission code, the acceptance regions of an identification code need not be disjoint. In other words, if the receiver intends to identify message i, there is no need to distinguish among messages other than i.
For a given identification code {(P_i, D_i)}_{i=1}^N, the type I error probability is given by

P_I = max_{1 ≤ i ≤ N} P_iW(D_i^c),

and the type II error probability is given by

P_II = max_{i ≠ j} P_jW(D_i),

where

P_iW(y) = Σ_x P_i(x) W(y|x)

is the output distribution of the channel W corresponding to the input distribution P_i. For given error probabilities ε and δ, a code is called an (N, ε, δ)-ID code if P_I ≤ ε and P_II ≤ δ are satisfied. Then, the optimal code size of identification via the channel is defined by

N^⋆(ε, δ|W) = max{ N : there exists an (N, ε, δ)-ID code }.

When we consider block coding over n uses W^n of a channel, it is known that the optimal code size N^⋆(ε, δ|W^n) grows doubly exponentially in the block length n. For a discrete memoryless channel, it has been known that the identification capacity coincides with the transmission capacity C(W) [3], [10], i.e.,

lim_{n→∞} (1/n) log log N^⋆(ε, δ|W^n) = C(W)

as long as ε + δ < 1. It should be noted that the identification capacity is infinite when ε + δ ≥ 1 [10].
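As an illustration of these single-shot definitions, the following Python sketch evaluates the type I and type II error probabilities of a hand-crafted two-message ID code over a toy channel; the channel, encoders, and acceptance regions are arbitrary choices made for this example, and the acceptance regions are allowed to overlap.

import numpy as np

# Toy channel W from X = {0, 1} to Y = {0, 1}; rows are inputs.
W = np.array([[0.9, 0.1],
              [0.2, 0.8]])

# Stochastic encoders P_1, P_2 and acceptance regions D_1, D_2
# (boolean masks over Y); unlike a transmission code, the D_i may overlap.
P = [np.array([1.0, 0.0]), np.array([0.1, 0.9])]
D = [np.array([True, False]), np.array([False, True])]

# Output distributions P_i W.
Q = [p @ W for p in P]

# Type I error: max_i P_i W(D_i^c), message i sent but not accepted.
type1 = max(Q[i][~D[i]].sum() for i in range(2))
# Type II error: max_{i != j} P_j W(D_i), message j accepted as message i.
type2 = max(Q[j][D[i]].sum() for i in range(2) for j in range(2) if j != i)

print(f"type I error  = {type1:.3f}, type II error = {type2:.3f}")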

III. HYPOTHESIS TESTING
In this section, we summarize known facts on hypothesis testing and the meta converse that are needed in the rest of the paper. Consider a binary hypothesis testing problem with a null hypothesis Z ∼ P_Z and an alternative hypothesis Z ∼ Q_Z, where P_Z and Q_Z are distributions on the same alphabet Z. Upon observing Z = z, we shall decide whether the value was generated by the distribution P_Z or the distribution Q_Z. The most general test can be described by a channel T from Z to {0, 1}, where 0 indicates the null hypothesis and 1 indicates the alternative hypothesis. When z ∈ Z is observed, the test T chooses the null hypothesis with probability T(0|z) and the alternative hypothesis with probability T(1|z) = 1 − T(0|z). Then, the type I error probability of the test is defined by

α[T] = Σ_z P_Z(z) T(1|z),

and the type II error probability of the test is defined by

β[T] = Σ_z Q_Z(z) T(0|z).

For a given 0 ≤ ε < 1, denote by β_ε(P_Z, Q_Z) the optimal type II error probability under the condition that the type I error probability is at most ε, i.e.,

β_ε(P_Z, Q_Z) = inf{ β[T] : α[T] ≤ ε }.

In fact, since β_ε(P_Z, Q_Z) can be described as a linear program when Z is finite, the infimum is attained.
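Since the optimization defining β_ε(P_Z, Q_Z) is linear in the values T(0|z), it can be computed with an off-the-shelf LP solver. The following sketch does exactly that with scipy; the two distributions are arbitrary examples.

import numpy as np
from scipy.optimize import linprog

def beta_eps(P, Q, eps):
    """Optimal type II error beta_eps(P, Q): minimize sum_z Q(z) t(z) over
    tests t(z) = T(0|z) in [0, 1] subject to the type I error constraint
    1 - sum_z P(z) t(z) <= eps."""
    res = linprog(c=Q,                    # objective: sum_z Q(z) t(z)
                  A_ub=[-np.asarray(P)],  # -sum_z P(z) t(z) <= -(1 - eps)
                  b_ub=[-(1.0 - eps)],
                  bounds=[(0.0, 1.0)] * len(P))
    return res.fun

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
print(beta_eps(P, Q, eps=0.1))  # Neyman-Pearson optimum, here 0.75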
For a threshold parameter γ ∈ R, the test given by

T(0|z) = 1[ log(P_Z(z)/Q_Z(z)) > γ ]

is termed the likelihood ratio test, also known as the Neyman-Pearson test. For given 0 ≤ ε < 1, let

D_s^ε(P_Z ‖ Q_Z) = sup{ γ ∈ R : Pr( log(P_Z(Z)/Q_Z(Z)) ≤ γ ) ≤ ε },

where the probability is with respect to Z ∼ P_Z. Note that this quantity is the supremum of thresholds such that the type I error probability of the likelihood ratio test is at most ε, and it is referred to as the ε-information spectrum divergence [27]. This quantity and the optimal type II error probability defined above have the following relationship (e.g., see [26, Lemma 2.4]); it can be understood as a variant of the Neyman-Pearson lemma, claiming that the likelihood ratio test is essentially optimal.
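On a finite alphabet, D_s^ε(P_Z ‖ Q_Z) can be evaluated by sorting the log-likelihood ratios and locating where their cumulative P_Z-probability first exceeds ε; a minimal sketch (the distributions are again arbitrary examples):

import numpy as np

def ds_eps(P, Q, eps):
    """epsilon-information spectrum divergence D_s^eps(P || Q): the supremum
    of gamma such that Pr_{Z~P}[log P(Z)/Q(Z) <= gamma] <= eps.
    Assumes supp(P) is contained in supp(Q)."""
    llr = np.log(P / Q)            # log-likelihood ratio of each symbol
    order = np.argsort(llr)        # scan candidate thresholds upward
    cdf = np.cumsum(P[order])      # Pr[llr(Z) <= llr_(k)] under P
    k = np.argmax(cdf > eps)       # first threshold violating the constraint
    return llr[order][k]           # the supremum is attained there

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])
print(ds_eps(P, Q, eps=0.1))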
Lemma 1 For a given 0 ≤ ε < 1 and any η ∈ (0, 1 − ε), it holds that

D_s^ε(P_Z ‖ Q_Z) ≤ log(1/β_ε(P_Z, Q_Z)) ≤ D_s^{ε+η}(P_Z ‖ Q_Z) + log(1/η).

This lemma enables us to use the two quantities almost interchangeably.
As we mentioned in Section I, in the past few decades, hypothesis testing has become a useful tool to derive converse bounds on transmission codes over a channel W from X to Y. For such an application, we consider the hypothesis testing between the null hypothesis P × W and the alternative hypothesis P × Q, where P ∈ P(X) and Q ∈ P(Y) are given input/output distributions. More specifically, the optimal coding rate of transmission codes is bounded in terms of

min_{Q ∈ P(Y)} max_{P ∈ P(X)} β_ε(P × W, P × Q).   (1)

It can be easily verified from the definition that β_ε(P × W, P × Q) is concave with respect to the output distribution Q ∈ P(Y). On the other hand, it was proved in [20] that β_ε(P × W, P × Q) is convex with respect to the input distribution P ∈ P(X). Thus, β_ε(P × W, P × Q) is a convex-concave function on P(X) × P(Y), and regularity conditions for the saddle-point property were discussed in [20]; particularly, since P(X) and P(Y) are compact for finite alphabets X and Y, the following saddle-point property follows from the classic minimax theorem.
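The convexity in the input distribution can be probed numerically by applying the LP above to the joint distributions P × W and P × Q. The following sketch, with a toy channel and arbitrary test points, checks the convexity inequality at a midpoint of two input distributions.

import numpy as np
from scipy.optimize import linprog

def beta_eps(P, Q, eps):
    # Optimal type II error for testing P against Q (flattened vectors).
    res = linprog(c=Q, A_ub=[-np.asarray(P)], b_ub=[-(1.0 - eps)],
                  bounds=[(0.0, 1.0)] * len(P))
    return res.fun

W = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def beta_joint(P_in, Q_out, eps):
    # beta_eps(P x W, P x Q) on the joint alphabet X x Y.
    return beta_eps((P_in[:, None] * W).ravel(),
                    (P_in[:, None] * Q_out[None, :]).ravel(), eps)

eps, Q = 0.1, np.array([0.5, 0.5])
P0, P1 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = 0.5 * (P0 + P1)
# Convexity in P, proved in [20]: the midpoint value is at most the average.
print(beta_joint(mid, Q, eps), "<=",
      0.5 * beta_joint(P0, Q, eps) + 0.5 * beta_joint(P1, Q, eps))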

Lemma 2 ([20]) For a given 0 ≤ ε < 1, the optimal value in (1) is attainable and

min_{Q ∈ P(Y)} max_{P ∈ P(X)} β_ε(P × W, P × Q) = max_{P ∈ P(X)} min_{Q ∈ P(Y)} β_ε(P × W, P × Q).

When we evaluate the asymptotic behavior of coding rates for a DMC, it is more convenient to use the ε-information spectrum divergence. Particularly, we will use the following symbol-wise relaxation (e.g., see [28]):

max_{P ∈ P(X)} D_s^ε(P × W ‖ P × Q) ≤ max_{x ∈ X} D_s^ε(W(·|x) ‖ Q)   (2)

for any Q ∈ P(Y).

IV. MAIN RESULT: MINIMAX CONVERSE FOR IDENTIFICATION VIA CHANNELS
In this section, we present our main result, i.e., the minimax converse bound for identification via channels.
To that end, we first explain the problem of channel resolvability.
For an integer M, a distribution P ∈ P(X) is said to be an M-type if P(x) is an integer multiple of 1/M for every x ∈ X. Then, an (N, ε, δ)-ID code {(P_i, D_i)}_{i=1}^N is said to be M-canonical if P_i is an M-type for every 1 ≤ i ≤ N. For an M-canonical (N, ε, δ)-ID code with ε + δ < 1, it is not difficult to see that all the P_i are distinct; in fact, if there exist i ≠ j such that P_i = P_j, then

1 = P_iW(D_i^c) + P_iW(D_i) = P_iW(D_i^c) + P_jW(D_i) ≤ ε + δ,

a contradiction. In [11], among other motivations, the channel resolvability was introduced as a tool to handle general ID codes by relating their analysis to that of M-canonical codes. In the channel resolvability problem, we shall approximate the output distribution PW of an arbitrarily given input distribution P ∈ P(X) by the output distribution P̂W of an M-type P̂ so that

d(PW, P̂W) ≤ ζ

is satisfied for a prescribed approximation error ζ. If such an approximation is realized, then we can replace each P_i with an M-type P̂_i, and use the above-mentioned counting argument for M-canonical codes.
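Since an M-type is the empirical distribution of a multiset of M symbols from X, the number of distinct M-types equals the binomial coefficient C(M + |X| − 1, |X| − 1), which in particular is at most |X|^M. A small enumeration sketch illustrating the count and the bound:

from itertools import combinations_with_replacement
from collections import Counter

def m_types(alphabet_size, M):
    """Enumerate all M-types on an alphabet of the given size, i.e.,
    distributions whose entries are integer multiples of 1/M."""
    types = set()
    for multiset in combinations_with_replacement(range(alphabet_size), M):
        counts = Counter(multiset)
        types.add(tuple(counts.get(x, 0) / M for x in range(alphabet_size)))
    return types

for M in range(1, 6):
    print(f"M={M}: {len(m_types(3, M))} M-types on |X|=3, bound |X|^M={3**M}")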
In an attempt to derive a tighter converse bound than that in [11], the partial channel resolvability was introduced in [23]. For a given subset S ⊂ X × Y, let us introduce the truncated channel

W_S(y|x) = W(y|x) 1[(x, y) ∈ S]

and the truncated output distribution

PW_S(y) = Σ_x P(x) W_S(y|x).

Note that PW_S is a sub-distribution, i.e., it may not add up to 1, and it is referred to as the partial response of the input distribution P. It can be immediately verified that

PW(D) − Pr((X, Y) ∉ S) ≤ PW_S(D) ≤ PW(D)   (3)

for every D ⊆ Y, where (X, Y) ∼ P × W. In the partial channel resolvability problem, we shall approximate the partial response PW_S of an arbitrarily given input distribution P ∈ P(X) by the partial response P̂W_S of an M-type P̂ so that

d(PW_S, P̂W_S) ≤ ζ

is satisfied for a prescribed approximation error ζ.
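Concretely, the truncation zeroes out the channel outside S, and the partial response loses exactly the probability mass that P × W places outside S; the bound (3) is the set-wise version of this observation. A toy illustration (channel, input, and S are arbitrary):

import numpy as np

W = np.array([[0.9, 0.1],
              [0.2, 0.8]])     # toy channel from X = {0, 1} to Y = {0, 1}
P = np.array([0.5, 0.5])

# A subset S of X x Y as a boolean mask; here the pair (0, 1) is removed.
S = np.array([[True, False],
              [True, True]])

W_S = W * S        # truncated channel W_S(y|x) = W(y|x) 1[(x, y) in S]
PW_S = P @ W_S     # partial response, in general a sub-distribution

print("partial response:", PW_S)
print("total mass:", PW_S.sum())  # 0.95 = 1 - Pr[(X, Y) not in S]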
A standard approach to constructing a (partial) channel resolvability code is to randomly generate M symbols x_1, ..., x_M according to the distribution P. The performance analysis of such a random code construction is referred to as the soft covering lemma [8]. The following lemma is a variant of the soft covering lemma, and it can be derived in almost the same manner as in [12], [19], [8] with a simple modification. Even though the modified version is available in the literature [15], [32], we provide a proof here for completeness.
Lemma 3 For arbitrarily given Q ∈ P(Y) and γ ∈ R, let

S = S(Q, γ) = { (x, y) ∈ X × Y : log(W(y|x)/Q(y)) ≤ γ }.   (4)

Then, for a given input distribution P ∈ P(X), there exists an M-type P̂ such that

d(PW_S, P̂W_S) ≤ √(e^γ/M).   (5)

Proof: Let C = {X_1, ..., X_M} be a codebook such that each X_i is independently generated according to the distribution P.
Then, we define the M-type P̂ = P̂_C by

P̂_C(x) = |{1 ≤ i ≤ M : X_i = x}| / M.

We shall evaluate the approximation error averaged over the random generation of the codebook C. By Jensen's inequality and the convexity of t ↦ t², we have

E[d(PW_S, P̂_C W_S)]² ≤ E[d(PW_S, P̂_C W_S)²].   (6)

Then, we have

E[d(PW_S, P̂_C W_S)²] = E[( Σ_y |P̂_C W_S(y) − PW_S(y)| )²] ≤ Σ_y E[(P̂_C W_S(y) − PW_S(y))²] / Q(y),   (7)

where the summation over y is taken over supp(Q), and the last inequality follows from the Cauchy-Schwarz inequality.
Denoting Y ∼ Q, we can rewrite the above formula as

Σ_y E[(P̂_C W_S(y) − PW_S(y))²] / Q(y) = E_Y[ E[(P̂_C W_S(Y) − PW_S(Y))²] / Q(Y)² ].   (8)

Furthermore, by noting that, for i ≠ j, X_i and X_j are independent and E[W_S(y|X_i)] = PW_S(y), we can rewrite (8) as

E_Y[ E[(P̂_C W_S(Y) − PW_S(Y))²] / Q(Y)² ] ≤ (1/M) E_Y[ E_X[W_S(Y|X)²] / Q(Y)² ],   (9)

where X ∼ P. By combining (6), (7), (8), and (9), together with the fact that W_S(y|x) ≤ e^γ Q(y) for every (x, y) ∈ S, we have

E[d(PW_S, P̂_C W_S)]² ≤ (e^γ/M) E_Y[ E_X[W_S(Y|X)] / Q(Y) ] ≤ e^γ/M,

which implies the existence of an M-type P̂ satisfying (5).
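The 1/√M behavior promised by (5) can be observed empirically. The following Monte Carlo sketch uses a toy channel without truncation (i.e., S = X × Y): it draws M symbols i.i.d. from P, forms the induced M-type, and measures the variational distance between its output distribution and PW.

import numpy as np

rng = np.random.default_rng(0)

W = np.array([[0.85, 0.15],
              [0.10, 0.90]])   # toy channel
P = np.array([0.4, 0.6])
PW = P @ W                     # target output distribution

def resolvability_error(M, trials=200):
    """Average of d(PW, P_hat W) over random codebooks, where P_hat is the
    empirical M-type of M i.i.d. samples from P and d is the (unhalved)
    variational distance used in the text."""
    errs = []
    for _ in range(trials):
        samples = rng.choice(len(P), size=M, p=P)
        P_hat = np.bincount(samples, minlength=len(P)) / M
        errs.append(np.abs(P_hat @ W - PW).sum())
    return np.mean(errs)

for M in [10, 100, 1000]:
    print(f"M={M}: average approximation error {resolvability_error(M):.4f}")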
The difference between Lemma 3 and the standard soft covering lemmas is that we can arbitrarily choose an auxiliary output distribution Q ∈ P(Y) instead of the output distribution PW corresponding to the input distribution P. A similar use of an auxiliary distribution is known in the context of a related problem, privacy amplification [22].
The main innovation of this paper is that we use the above-mentioned flexibility of choosing the auxiliary output distribution to derive a novel converse bound on the ID code.
Theorem 1 For arbitrarily given Q ∈ P(Y) and γ ∈ R, let S = S(Q, γ) be defined as in (4). Then, for an arbitrary integer M, any (N, ε, δ)-ID code with N > |X|^M must satisfy

1 − ε − δ ≤ 2√(e^γ/M) + max_{P ∈ P(X)} Pr( log(W(Y|X)/Q(Y)) > γ ),   (10)

where (X, Y) ∼ P × W.

Proof: For an arbitrarily given (N, ε, δ)-ID code {(P_i, D_i)}_{i=1}^N, we have

P_iW(D_i) − P_jW(D_i) ≥ 1 − ε − δ   (11)

for every i ≠ j. By applying Lemma 3 to each P_i, we can find an M-type P̂_i such that

d(P_iW_S, P̂_iW_S) ≤ √(e^γ/M).   (12)

Since the number of distinct M-types is upper bounded by |X|^M and since N > |X|^M by assumption, there must exist a pair i ≠ j such that P̂_i = P̂_j. For such a pair, we have

P_iW(D_i) − P_jW(D_i) ≤ P_iW_S(D_i) + Pr_{P_i × W}((X, Y) ∉ S) − P_jW_S(D_i)
≤ d(P_iW_S, P̂_iW_S) + d(P_jW_S, P̂_jW_S) + Pr_{P_i × W}((X, Y) ∉ S)
≤ 2√(e^γ/M) + Pr_{P_i × W}((X, Y) ∉ S),   (13)

where the first inequality follows from (3), and the second inequality follows from (12) together with P̂_iW_S = P̂_jW_S. Then, (13) together with (11) imply (10).
Remark 1 Without using the partial channel resolvability, it can be proved that any (N, ε, δ)-ID code with N > |X|^M must satisfy a bound of the form (14), in which the tail probability of the information spectrum is accumulated twice. The factor 2 in the first term of (14) has prevented us from deriving a general formula of the ID capacity without the strong converse property.

Remark 2
The proof of Theorem 1 is inspired in part by the argument in [23, Lemma 2], which has a flaw reported in [12, Remark 2]. A crucial difference between our argument and that in [23, Lemma 2] is that the set S for a fixed Q is used to construct the truncated channel W_S in our argument, while the set T_{P_i} defined by (15) is used to construct the truncated channel W_{T_{P_i}} for each i in [23, Lemma 2]. In the former case, P̂_i = P̂_j implies P̂_iW_S = P̂_jW_S, and the size N of the ID code is eventually bounded by the number |X|^M of M-types. On the other hand, in the latter case, we cannot conclude that P̂_1, ..., P̂_N are all distinct since P̂_i = P̂_j does not necessarily imply P̂_iW_{T_{P_i}} = P̂_jW_{T_{P_j}}; thus, the size N of the ID code cannot be bounded by the number of M-types. Instead, it was attempted in [23, Lemma 2] to bound N by the number of some alternative measures induced by M-types, which has a flaw [12, Remark 2].
From Lemma 1 and Lemma 2, Corollary 1 implies the following corollary (Corollary 2), in which the optimal type II error probability is replaced by the ε-information spectrum divergence, minimized over the output distribution and maximized over the input distribution. Up to some residual terms, the upper bounds on the doubly exponential rate of the optimal ID code in Corollary 1 and Corollary 2 have the same form as the upper bounds on the rate of the optimal transmission code reported in the literature [21]. In the next section, we will discuss the asymptotic behavior of those bounds.

V. CAPACITY FOR GENERAL CHANNELS
In this section, we derive the identification capacity of general channels. Let W = {W^n}_{n=1}^∞ be a sequence of general channels from X^n to Y^n, where X and Y are finite alphabets; the channel W may be neither stationary nor ergodic. For each integer n, an (N_n, ε_n, δ_n)-ID code for the channel W^n is defined in exactly the same manner as in Section II. We are interested in characterizing the optimal doubly exponential growth rate of the message size N_n.
Definition 1 For given 0 ≤ ε, δ < 1, a rate R is said to be an (ε, δ)-achievable ID rate for a general channel W if there exists a sequence of (N_n, ε_n, δ_n)-ID codes satisfying

lim sup_{n→∞} ε_n ≤ ε,   (21)

lim sup_{n→∞} δ_n ≤ δ,   (22)

and

lim inf_{n→∞} (1/n) log log N_n ≥ R.   (23)

Then, the supremum of (ε, δ)-achievable ID rates for W is termed the (ε, δ)-ID capacity, and is denoted by C_ID(ε, δ|W). Particularly, for (ε, δ) = (0, 0), it is termed the ID capacity and denoted by C_ID(W).
For a sequence X = {X^n}_{n=1}^∞ of input processes, denote by Y = {Y^n}_{n=1}^∞ the corresponding output processes via W = {W^n}_{n=1}^∞, i.e., P_{Y^n} = P_{X^n}W^n for each n. Then, for 0 ≤ ε < 1, let

I_ε(X ∧ Y) = sup{ R : lim sup_{n→∞} Pr( (1/n) log( W^n(Y^n|X^n) / P_{Y^n}(Y^n) ) ≤ R ) ≤ ε }

be the ε-spectral inf-mutual information rate. Particularly, when ε = 0, we just write I(X ∧ Y).
Proposition 1 For 0 ≤ ε, δ < 1 and a sequence W = {W^n}_{n=1}^∞ of general channels, we have

C_ID(ε, δ|W) ≥ sup_X I_ε(X ∧ Y),

where the supremum is taken over all sequences of input processes X.
On the other hand, from Corollary 2, we can derive the following upper bound on the (ε, δ)-ID capacity.
Theorem 2 For 0 ≤ ε, δ < 1 with ε + δ < 1 and a sequence W = {W^n}_{n=1}^∞ of general channels, we have

C_ID(ε, δ|W) ≤ sup_X I_{ε+δ}(X ∧ Y).

Proof: Suppose that R is an (ε, δ)-achievable ID rate, i.e., there exists a sequence of (N_n, ε_n, δ_n)-ID codes satisfying (21), (22), and (23). By Corollary 2, we have the bound (25) with η_n = 1/n. Let X̂ = {X̂^n} be a sequence of input processes that attains the maximum in (25) for each n, and let Ŷ = {Ŷ^n} be the corresponding output process. Then, the bound (25) is expressed in terms of the pair (X̂, Ŷ). Furthermore, by applying the right-hand inequality of Lemma 1, we can bound the right-hand side of (25) by I_{ε+δ}(X̂ ∧ Ŷ) up to vanishing terms for sufficiently large n, which implies R ≤ sup_X I_{ε+δ}(X ∧ Y).
When the requirement on the type II error probability is δ = 0, we can completely characterize the ID capacity from Proposition 1 and Theorem 2 as follows.
Corollary 3 For 0 ≤ ε < 1 and a sequence W = {W^n}_{n=1}^∞ of general channels, we have

C_ID(ε, 0|W) = sup_X I_ε(X ∧ Y).   (28)

Particularly, for ε = 0, we have

C_ID(W) = sup_X I(X ∧ Y).   (29)

Note that (29) coincides with the general formula of the transmission capacity [30]. Thus, the ID capacity and the transmission capacity coincide for general channels. Previously, the coincidence of the ID capacity and the transmission capacity was known only for channels satisfying the strong converse property [9]; it should be emphasized that (29) holds without the assumption of the strong converse property.
VI. SECOND-ORDER CODING RATE FOR DMCS
Next, we consider a DMC W with transmission capacity C(W). Let V_min(W) and V_max(W) denote the minimum and the maximum of the conditional information variance V(W‖P_XW|P_X) over the capacity-achieving input distributions. Using these quantities, the ε-dispersion of the channel W is defined as

V_ε(W) = V_min(W) if 0 < ε < 1/2, and V_max(W) if 1/2 ≤ ε < 1.

For a given input distribution P_X and the corresponding output distribution P_Y = P_XW, let

U(P_X, W) = Var( log( W(Y|X) / P_Y(Y) ) ),  (X, Y) ∼ P_X × W,

be the unconditional information variance. Then, we define the minimum and the maximum of the unconditional information variances over the capacity-achieving input distributions as U_min(W) and U_max(W), respectively. Even though the unconditional information variance U(P_X, W) can be strictly larger than the conditional information variance V(W‖P_XW|P_X) in general, for capacity-achieving input distributions, these quantities coincide. Thus, the quantity

U_ε(W) = U_min(W) if 0 < ε < 1/2, and U_max(W) if 1/2 ≤ ε < 1

coincides with the ε-dispersion V_ε(W) defined above [26]. Now, we are ready to present the characterization of the second-order ε-ID capacity.
Theorem 3 For a given DMC W and 0 < ε < 1, if V_ε(W) > 0, then the second-order ε-ID capacity, i.e., the supremum of L such that there exists a sequence of (N_n, ε_n, δ_n)-ID codes with lim sup_{n→∞} ε_n ≤ ε, lim_{n→∞} δ_n = 0, and lim inf_{n→∞} ( log log N_n − nC(W) )/√n ≥ L, is given by

L^⋆(ε|W) = √(V_ε(W)) Φ^{-1}(ε),

where Φ^{-1}(·) is the inverse of the cumulative distribution function of the standard Gaussian distribution.
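As a numerical illustration of the resulting Gaussian approximation, the following sketch tabulates n C(W) + √(n V_ε(W)) Φ^{-1}(ε), the approximate value of log log N^⋆, for a binary symmetric channel; the crossover probability is an arbitrary example, and the closed-form capacity and dispersion of the BSC (in nats) are used.

import numpy as np
from scipy.stats import norm

def bsc_capacity_dispersion(p):
    """Capacity C and dispersion V of a BSC(p) in nats; the uniform input is
    capacity-achieving, and U_eps = V_eps for every eps in this case."""
    C = np.log(2) + p * np.log(p) + (1 - p) * np.log(1 - p)
    V = p * (1 - p) * np.log((1 - p) / p) ** 2
    return C, V

p, eps = 0.11, 0.05
C, V = bsc_capacity_dispersion(p)
for n in [100, 1000, 10000]:
    approx = n * C + np.sqrt(n * V) * norm.ppf(eps)  # Phi^{-1}(eps) < 0 here
    print(f"n={n}: log log N* ~ {approx:.1f} nats")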

A. Proof of achievability
The achievability part of Theorem 3 is a straightforward consequence of the achievability bound derived in [12].
Lemma 4 ([12]) For a given channel W and input distribution P_X, let P_Y be the corresponding output distribution.
Assume that real numbers a, a′, b, b′, τ, κ > 0 satisfy the conditions (34) and (35). Then, for any integer M > 0 and for any real number K > 0, there exists an (N, ε, δ)-ID code whose size N and error probabilities ε and δ are bounded as stated in [12], provided that the threshold condition (36), which bounds a b · Pr( log(W(Y|X)/P_Y(Y)) ≤ · ) for (X, Y) ∼ P_X × W, is satisfied.
Now, we go back to the proof of achievability. For a given 0 < ε < 1, fix a capacity-achieving input distribution P_X that attains U_ε(W); then, let P_Y be the corresponding output distribution of the channel W. By setting a = b = 1 + 2/n, a′ = b′ = n + 2, τ = 1/(n + 2), and κ = (1 + log 2)/log n, we can verify that the conditions in (34) and (35) are satisfied for n ≥ 2. For R > 0, we apply Lemma 4 by setting K = e^{nR} and M = ⌈e^{nR}/(n + 2)^4⌉; then, there exist a constant F > 0 and a sequence of (N_n, ε_n, δ_n)-ID codes for which the condition in (36) reduces to a bound on the information spectrum of (X^n, Y^n) ∼ P_X^n × W^n. Here, set

R = C(W) + √(U_ε(W)/n) Φ^{-1}(ε) − δ/√n

for an arbitrary δ > 0. Then, by applying the central limit theorem, we can verify that the condition in (36) is satisfied for sufficiently large n, and there exists a sequence of (N_n, ε_n, δ_n)-ID codes satisfying (30)-(32) for L = √(U_ε(W)) Φ^{-1}(ε) and an arbitrary δ > 0. Thus, we have

lim inf_{n→∞} ( log log N_n − nC(W) ) / √n ≥ √(U_ε(W)) Φ^{-1}(ε),

which completes the proof of the achievability part of Theorem 3.

B. Proof of converse
By Corollary 1 and the symbol-wise relaxation (2), we have the bound (37). Since the terms other than the first one in (37) are o(√n), the remaining task is to evaluate the first term of (37) for an appropriate choice of the output distribution Q_n. For the purpose of deriving the second-order rate, it suffices to choose a mixture of the capacity-achieving output distribution and output distributions induced from types on X^n [13]. Although it is more than necessary for deriving the second-order rate, we refer to a stronger result (Lemma 5) that is derived by a more sophisticated choice of the output distribution [28]. By (37) and Lemma 5, we have

( log log N_n − nC(W) ) / √n ≤ √(V_ε(W)) Φ^{-1}(ε + δ) + o(1).

Finally, by taking the limit of δ → 0, we have the converse part of Theorem 3.