Anomaly Detection based on Compressed Data: an Information Theoretic Characterization

We analyze the effect of lossy compression on the processing of sensor signals used to detect anomalous events in the system under observation. The intuitive relationship between the quality loss at higher compression and the possibility of telling anomalous behaviours from normal ones is formalized in terms of information-theoretic quantities. Some analytic derivations are made within the Gaussian framework and, where possible, in the asymptotic regime with respect to the dimensionality of the signals considered. Analytical conclusions are matched with the performance of practical detectors in a toy case that allows the assessment of different compression/detector configurations.


I. INTRODUCTION
A typical scenario for today's massive acquisition systems can be modelled as a large number of sensing units, each transforming some unknown physical quantity into samples of random processes that are then transmitted over a network. To reduce the transmission bitrate, signals are often compressed by a lossy mechanism that is theoretically capable of preserving the useful information. Before reaching some cloud facility in which they will be ultimately stored or processed, the corresponding bitstreams may traverse several levels of hierarchical aggregation and intermediate devices that are often indicated as the edge of the cloud [1]. For latency or privacy reasons, some computational tasks may benefit from their deployment at the edge. One of those tasks is the detection of anomalies/novelties. This is especially true when dealing, for example, with networks that sensorize plants or structures subject to monitoring, as depicted in Fig. 1. The aggregated sensor readings may be processed in the cloud for off-line monitoring relying on long-term historical trends, while the outputs of subsets of sensors may be processed at the edge to give low-latency feedback on possible critical events that require immediate intervention. Usually, compression schemes applied to sensor data are asymmetric and entail a lightweight encoding performed on very-low-complexity devices paired with a possibly expensive decoding stage running on the cloud. In these conditions, it is sensible that anomaly detectors work on compressed data and not on the recovered signal.
Yet, lossy compression bases its effectiveness on neglecting some of the signal details. This translates into a distortion between the original and the recovered signal, but also into a loss of details that, in principle, could have been used to tell normal behaviours from anomalous ones.
In general, acquisition systems must obey a distortion constraint, so they are designed to best address the trade-off between compression and distortion. However, such a trade-off goes in parallel with the one between distortion and the ability to determine whether the signal is normal or anomalous. Here, we analyze the latter with the same information-theoretic machinery used in the well-known rate-distortion analysis and implicitly show that the two trade-offs are different.
How compression affects distinguishability has been investigated in the literature. In [2] the problem of hypothesis testing is discussed for a single source under a rate constraint. Such a basis has been extended to information-theoretic problems of statistical inference in the case of multiterminal data compression in [3]. These works address the inference problem with no constraint on distortion, since compressed data is not required to allow signal reconstruction. In contrast, we show that, for detectors that do not exploit knowledge of the anomaly, there exists at least one critical distortion level that makes the white anomaly undetectable.

Fig. 2. The signal chain is tuned on the normal signal x ok to best address the rate-distortion trade-off, guaranteeing a certain quality of service to a given application. An anomalous signal x ko may occur and a detector working on the compressed signal y should be able to detect it.
The paper is organised as follows. Section II reviews the classical rate-distortion theory, first in a general setting and then in the specific case of Gaussian sources, with considerations on the expression of the optimal encoding mapping. Section III provides the definition of the normal and anomalous signals together with the formulation of the distinguishability measures for both anomaly-agnostic and anomaly-aware scenarios. Section IV focuses on distinguishability in the average case, with an emphasis on the asymptotic characterization of high-dimensional signals. Section V reports some numerical evidence analysing the behaviour of some suitably simplified anomaly detection strategies with respect to ideal and suboptimal compression strategies. Theoretical curves anticipate many aspects of practical performance trends and show that a compression that optimizes the rate-distortion trade-off does not necessarily best address the compromise with distinguishability. Conclusions are finally drawn. Proofs of the theorems and lemmas stated in the discussion are reported in the Appendix.

II. RATE VS. DISTORTION
We consider the context in which a system has the main task of transferring the information content of a signal source x to a receiver through a communication channel that has a constraint on rate. At any time instant t, an instance x[t] is passed to an encoding stage producing a compressed version y[t] that may then be decompressed into an approximation x̂[t]. The constraint on rate is such that it implies a lossy compression mechanism. The encoding stage is therefore not injective and introduces some distortion. The encoder is tuned on the source x, which is modelled as an independent discrete-time, n-dimensional stochastic process.
The trade-off between rate and distortion is addressed in rate-distortion theory [17, Chapter 13]. Distortion may be defined as

D = E[ ‖x − x̂‖² ]  (1)

where E[·] stands for expectation, and the minimal achievable rate ρ can be expressed as a function of the maximal accepted distortion δ as follows [17, Theorem 13.2.1]

ρ(δ) = min over f x̂|x such that D ≤ δ of I(x; x̂)  (2)

where I(x; x̂) is the mutual information between x and x̂ [17, Chapter 8], and f x̂|x is a conditional probability density function (PDF) modeling the possibly stochastic mapping characterizing the encoder-decoder pair. Although [17, Theorem 13.2.1] defines the rate-distortion function in the discrete case, it can also be proved for well-behaved continuous sources [17, Chapter 13], as considered in this work.
If the source is memoryless (thus allowing us to drop the time index t) and generates vectors of independent and zero-mean Gaussian variables, i.e., when x ∼ G(0, Σ) where Σ is a diagonal covariance matrix such that Σ = diag(λ 0, ..., λ n−1) with λ 0 ≥ ... ≥ λ n−1, then the rate-distortion function follows the classical reverse water-filling principle

ρ(δ) = Σ_{j=0}^{n−1} (1/2) log2( λ j / min{θ, λ j} )  (3)

where θ ∈ [0, λ 0] is the so-called reverse water-filling parameter [17, Theorem 13.3.3], chosen so that δ = Σ_{j=0}^{n−1} min{θ, λ j} = Σ_{j=0}^{n−1} τ j λ j, and τ j = min{1, θ/λ j} accounts for the fraction of energy cancelled by distortion along the j-th component. The coding theorems behind such a classical development imply that the optimal trade-off (2) between rate and distortion is asymptotically obtained by simultaneously encoding an increasing number of subsequent source symbols into a single block that can then be reverted to a sequence of distorted symbols. Hence, in principle, the intermediate symbols y feeding the anomaly detector in Fig. 2 cause it to work simultaneously on multiple instances of the signals.
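As an illustration, the reverse water-filling rule can be evaluated numerically. The following sketch is ours, not part of the paper's material: given a distortion budget δ, it finds θ by bisection (the total distortion is non-decreasing in θ) and returns the corresponding rate in bits per vector; the function name and test eigenvalues are illustrative.

```python
# Sketch: reverse water-filling for a Gaussian source with eigenvalues lambda_j.
# Finds theta with sum_j min(theta, lambda_j) = delta, then computes
# rho = sum_j 0.5*log2(lambda_j / min(theta, lambda_j)) over surviving components.
import math

def reverse_water_filling(lambdas, delta):
    """Return (theta, rate) for a distortion budget 0 < delta <= sum(lambdas)."""
    lo, hi = 0.0, max(lambdas)
    for _ in range(100):  # bisection on theta; distortion grows with theta
        theta = 0.5 * (lo + hi)
        if sum(min(theta, l) for l in lambdas) < delta:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    rate = sum(0.5 * math.log2(l / min(theta, l)) for l in lambdas if l > theta)
    return theta, rate

theta, rate = reverse_water_filling([4.0, 2.0, 1.0, 0.5], 1.0)
```

With these eigenvalues and δ = 1, the budget is split evenly as θ = 0.25 over the four components, for a total rate of 5 bits per vector.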
Though this is not incoherent with what happens in real detectors that observe more than one suspect instance before declaring an anomaly, we here instead consider a per-use analysis which is typical and scales the key merit figures (rate, distortion and, in our case, distinguishability -see Section III) by the number of source symbols aggregated to obtain them.This allows us to pursue the classical approach defining a test channel whose single use has the same expected behaviour as the average of infinite uses and, in the case of Gaussian sources, has a particularly simple expression that we derive and exploit to imagine that a source instance x is encoded into a compressed symbol y from which x can be recovered [17,Chapter 13], [18].
In the same Gaussian framework, it is also possible to derive the PDF of the distorted signal x̂ and the conditional PDF f x̂|x that stochastically maps an input x ∼ G(0, Σ) to x̂. If we accept to identify a zero-variance Gaussian with a Dirac's delta and define S θ = I n − T θ with T θ = diag(τ 0, ..., τ n−1) to account for the fraction of energy that survives distortion along each component, then we can derive the following Lemma, whose proof is in the Appendix.

Lemma 1. If x ∼ G(0, Σ) is a memoryless source and we constrain the distortion to D ≤ δ, the optimally distorted signal has distribution

x̂ ∼ G(0, Σ S θ)  (5)

and the optimal encoding mapping is

f x̂|x (α|β) = G_{S θ β, θ S θ}(α)  (6)

where G_{m,K}(·) represents the PDF of a Gaussian variable with mean m and covariance matrix K.
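To make Lemma 1 concrete in the diagonal case, the test channel can be simulated per component as x̂ j = s j x j + ∆ j, with s j the j-th diagonal entry of S θ and ∆ j ∼ G(0, θ s j), so that Var(x̂ j) = s j² λ j + θ s j = λ j s j, as the lemma predicts. A minimal Monte Carlo sketch (ours, not the paper's code; eigenvalues are illustrative):

```python
# Sketch: sample the per-component Gaussian test channel of Lemma 1 and check
# that the empirical variance of x_hat_j approaches lambda_j * s_j.
import math, random

random.seed(0)

def sample_hat(x, lambdas, theta):
    """One draw of x_hat given x for the diagonal test channel."""
    out = []
    for xj, lj in zip(x, lambdas):
        s = max(0.0, 1.0 - theta / lj)  # surviving-energy fraction (entry of S_theta)
        out.append(s * xj + random.gauss(0.0, math.sqrt(theta * s)))
    return out

lambdas, theta = [4.0, 1.0, 0.25], 0.5
n_trials = 100000
acc = [0.0] * 3
for _ in range(n_trials):
    x = [random.gauss(0.0, math.sqrt(l)) for l in lambdas]
    xh = sample_hat(x, lambdas, theta)
    for j in range(3):
        acc[j] += xh[j] * xh[j]
var = [a / n_trials for a in acc]  # should approach lambda_j * s_j
```

Note that the third component, with λ j < θ, is entirely cancelled by the compression (s j = 0), which is the zero-variance Gaussian identified with a Dirac's delta above.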
Although its expression is in general not explicitly reported, f x̂|x is important when the compression mechanism is employed to encode a signal different from the one for which it was designed. This is the case of an unexpected anomalous source that replaces the normal signal.

III. ANOMALIES AND THEIR DETECTABILITY
In the path from encoder to decoder, the compressed signal y may be intercepted for some local processing.The local processing we focus on is the task of distinguishing whether the transmitted signal differs from what is usually observed, i.e., anomaly detection.
To include this aspect in our model, each observable instance x[t] has to be considered as a realization of one of two different sources: one modeling the normal behaviour, x ok, and one representing an anomaly, x ko. These two sources are modelled as two discrete-time, stationary, n-dimensional stochastic processes, each generating independent and identically distributed (i.i.d.) vectors x ok ∈ R^n and x ko ∈ R^n with different PDFs f ok : R^n → R+ and f ko : R^n → R+. As a result, at any time t the observable process is either x[t] = x ok[t] or x[t] = x ko[t] (visually represented in Fig. 2). Since we assume the generated vectors to be i.i.d., from now on we may drop the time indication.
More specifically, according to the framework characterizing Lemma 1, we here consider the case in which both sources are Gaussian. In particular, we focus on signals with zero mean and covariance matrices Σ ok, Σ ko ∈ R^{n×n}. In general, Σ ok ≠ Σ ko, but we will assume tr(Σ ok) = tr(Σ ko) = n, where tr(·) stands for matrix trace, meaning that, on average, each sample in the vector contributes a unit energy. With the assumption that signals are zero-mean and of equal energy, we can focus our analysis on one of the possible effects of anomalies, i.e., the distribution of energy over the signal subspace.
Moreover, with no loss of generality, we assume Σ ok = diag(λ ok 0, ..., λ ok n−1) with λ ok 0 ≥ ... ≥ λ ok n−1.

The signal x is encoded with a compression mechanism tailored for the typical condition in which x = x ok. The objective consists in guaranteeing a proper quality of service D ≤ δ to the final user. Hence, in this specific case, the rate-distortion function in (3) considers Σ = Σ ok and f x̂|x = f ok x̂|x. Simultaneously, a detector observes y for anomaly detection. Since we assume the decoding stage to be injective, y brings the same information as x̂ so that, in abstract terms, processing y is equivalent to working on x̂. As a result, the detector works on the difference between the two marginal distributions f ok x̂ and f ko x̂ that can be computed as follows

f ok x̂ (α) = ∫ f ok x̂|x (α|β) f ok (β) dβ  (7)

f ko x̂ (α) = ∫ f ok x̂|x (α|β) f ko (β) dβ  (8)

From (8), it is evident that the compression mechanism f ok x̂|x that optimally addresses the rate-distortion trade-off for the normal source is used also on the anomalous instances. Under the i.i.d. Gaussian assumption, (7) reduces to (5) with Σ = Σ ok, while the PDF of x̂ ko is given by the following Lemma, whose proof is in the Appendix.
Lemma 2. If an anomalous source x ko ∼ G(0, Σ ko) is encoded with the compression scheme f ok x̂|x of Lemma 1, then

x̂ ko ∼ G(0, S θ Σ ko S θ + θ S θ)  (9)

Such a result has two noteworthy corner cases.
• If x ko ∼ x ok there is no anomaly, Σ ok = Σ ko, and

x̂ ko ∼ G(0, S θ Σ ok S θ + θ S θ) = G(0, Σ ok S θ (S θ + θ (Σ ok)^{-1})) = G(0, Σ ok S θ)  (10)

where the last equality holds since S θ = max{0, I n − θ (Σ ok)^{-1}}: the possible disagreements between S θ + θ (Σ ok)^{-1} and I n correspond to components multiplied by zero by the last S θ factor. Hence, Lemma 2 can be compared with Lemma 1 to confirm that x̂ ko ∼ x̂ ok.

Lemma 1 and Lemma 2 imply that, when the normal and anomalous signals are Gaussian before compression, the performance of anomaly detectors depends on how much we are capable of distinguishing between the two distributions in (5) and (9). We quantify the difference between them with two kinds of information-theoretic measures, which model two distinct scenarios: one in which the detector knows both f ok x̂ and f ko x̂ and one in which it knows only f ok x̂. To proceed further it is convenient to define the functional

L(x′; x″) = −∫ f x′ (α) log2 f x″ (α) dα

that is the average coding rate, measured in bits per symbol, of a source characterized by the PDF f x′ with a code optimized for a source with PDF f x″, so that L(x; x) is equal to the differential entropy of x [17, Chapter 8]. As an alternative statistical point of view, if f x′ is the PDF of the symbols generated by a source x′, f x″ is the PDF of the symbols generated by a source x″ and ℓ x″ (α) = −log2 f x″ (α) is the negative log-likelihood that the symbol α has been generated by the source x″, then L(x′; x″) = E[ℓ x″ (α) | x′], i.e., the average negative log-likelihood that an instance is generated by the source x″ when it is actually generated by the source x′. Within the Gaussian assumption, we can derive the analytical expression for L in the following Lemma, whose derivation is in the Appendix.
Lemma 3. If x′ ∼ G(0, Σ′) and x″ ∼ G(0, Σ″) with Σ″ non-singular, then

L(x′; x″) = (1/2) log2( (2π)^n |Σ″| ) + tr( (Σ″)^{-1} Σ′ ) / (2 ln 2)  (11)

where |·| indicates the determinant of its matrix argument.
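For zero-mean Gaussians with diagonal covariances, the functional L reduces to sums over components. The sketch below is ours, assuming the standard Gaussian cross-entropy expression; it checks that L(x; x) equals the differential entropy (1/2) log2((2πe)^n |Σ|) and that, by Gibbs' inequality, a mismatched code can only increase the average rate.

```python
# Sketch: cross coding rate L(x'; x'') in bits/symbol for zero-mean Gaussians
# with diagonal covariances cov_p (true source) and cov_q (code design source).
import math

def L_bits(cov_p, cov_q):
    n = len(cov_p)
    log_det_q = sum(math.log2(q) for q in cov_q)
    trace_term = sum(p / q for p, q in zip(cov_p, cov_q))
    return 0.5 * (n * math.log2(2 * math.pi) + log_det_q) + trace_term / (2 * math.log(2))

cov = [2.0, 1.0, 0.5]       # |Sigma| = 1, so h = 1.5 * log2(2*pi*e)
h = L_bits(cov, cov)        # matched code: differential entropy
h_mismatch = L_bits(cov, [1.0, 1.0, 1.0])  # mismatched code: larger rate
```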
A. Distinguishability in anomaly-agnostic detection

When f ko x̂ is unknown and only f ok x̂ is given, we can only consider the average coding rates referring to a code optimized for x̂ ok, i.e., L(x̂ ko; x̂ ok) and L(x̂ ok; x̂ ok). One may quantify the difference between the normal behaviour and an anomalous one by measuring the increase or decrease in the average coding rate with respect to the expected case L(x̂ ok; x̂ ok) as follows:

ζ = L(x̂ ko; x̂ ok) − L(x̂ ok; x̂ ok)  (12)

Since there may be anomalies whose encoding yields a lower rate with respect to normal signals, ζ is not always positive. As a result, a distinguishability measure is given by considering its magnitude, i.e., |ζ|.
From a statistical perspective, ζ corresponds to the difference between the expectations of the negative log-likelihood that α is normal, given that α is actually an instance of either x̂ ko or x̂ ok:

ζ = E[ℓ ok x̂ (α) | x̂ ko] − E[ℓ ok x̂ (α) | x̂ ok]  (13)

The use of the quantity ℓ ok x̂ (α) = −log2 f ok x̂ (α) can be found in other anomaly-detection-related works, e.g., in [8], where it is referred to as a coding cost of α.
With the assumption of Gaussian sources, the optimal encoder (in the rate-distortion sense) lets survive only the components j for which λ ok j > θ. Hence, f ok x̂ and f ko x̂ given in (5) and (9) have only the first n θ components non-null, with n θ the number of components such that λ ok j > θ. The other n − n θ components are set to 0 and thus cannot be used to tell anomalous from normal cases. We therefore focus on the first n θ components of x̂ ok and x̂ ko, which are Gaussian with covariance matrices Σ̂ ok θ and Σ̂ ko θ corresponding to the n θ × n θ upper-left submatrices of Σ ok S θ in (5) and of S θ Σ ko S θ + θ S θ in (9), respectively.
By properly combining the definition of ζ in (12) with the expression of L within the Gaussian assumption in (11), we obtain

ζ = ( tr(Σ̃ θ) − n θ ) / (2 ln 2)  (14)

where Σ̃ θ = (Σ̂ ok θ)^{-1} Σ̂ ko θ, which corresponds to the n θ × n θ upper-left submatrix of (Σ ok)^{-1} Σ ko S θ + T θ. Note that, since Σ̃ θ is linear with respect to Σ̂ ko θ, so is ζ. In addition, ζ vanishes when Σ̂ ok θ = Σ̂ ko θ. As a noteworthy particular case, when the normal signal is white, i.e., when Σ ok = I n, we have that θ ∈ [0, 1] and that, for any θ < 1, T θ = θ I n and n θ = n. Hence, Σ̃ θ = (1 − θ) Σ ko + θ I n, which leads to ζ = 0. This result is not surprising since the distinguishability modelled by |ζ| depends only on the statistics of x ok, which has no exploitable structure.
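When both Σ ok and Σ ko are diagonal, ζ can be evaluated componentwise. The sketch below is ours (illustrative eigenvalues, not the paper's code); it also reproduces the white-normal corner case, where ζ = 0 for any unit-energy anomaly.

```python
# Sketch: zeta = (tr(tilde Sigma_theta) - n_theta) / (2 ln 2) for diagonal
# Sigma_ok, Sigma_ko; diag entries of tilde Sigma are (lk/lo)*s_j + tau_j.
import math

def zeta(lam_ok, lam_ko, theta):
    z = 0.0
    for lo, lk in zip(lam_ok, lam_ko):
        if lo > theta:                     # only surviving components contribute
            s, tau = 1.0 - theta / lo, theta / lo
            z += (lk * s + lo * tau) / lo - 1.0
    return z / (2.0 * math.log(2.0))

z_white = zeta([1.0, 1.0, 1.0], [2.0, 0.5, 0.5], 0.3)  # white normal: zeta = 0
z_loc = zeta([2.0, 0.7, 0.3], [1.0, 1.0, 1.0], 0.1)    # localized normal: zeta != 0
```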

B. Distinguishability in anomaly-aware detection
When both f ok x̂ and f ko x̂ are known, the anomaly detection task reduces to a binary classification problem for which we may resort to the Neyman-Pearson Lemma [17, Theorem 12.7.1], [19, Theorem 3.1]. This lemma can be understood in the sense that the cardinal quantity to observe is ℓ x̂ (α), which can be interpreted as a measure of abnormality of α, i.e., a score that the detector employs to distinguish whether the single α behaves normally or not. Consequently, one may measure the distinguishability between the distributions f ok x̂ and f ko x̂ as the difference between the score observed on average when x̂ = x̂ ko and the score observed on average when x̂ = x̂ ok:

κ = E[ℓ ok x̂ (α) − ℓ ko x̂ (α) | x̂ ko] − E[ℓ ok x̂ (α) − ℓ ko x̂ (α) | x̂ ok]  (16)
  = [L(x̂ ko; x̂ ok) − L(x̂ ko; x̂ ko)] + [L(x̂ ok; x̂ ko) − L(x̂ ok; x̂ ok)]  (17)
  = D KL (f ko x̂ ‖ f ok x̂) + D KL (f ok x̂ ‖ f ko x̂)  (18)

where, given distributions f′ and f″, D KL (f′ ‖ f″) refers to the Kullback-Leibler divergence [17, Chapter 2], of which κ results to be the symmetrized version.
The measure κ models a detector that knows the distributions of both the normal and anomalous sources, so that their optimal codes are also known. From (17), it is evident that κ may be interpreted as the sum of the differences in the average coding rate for both distorted sources with a code optimized for the normal source, L(x̂ ko; x̂ ok) − L(x̂ ko; x̂ ko), and optimized for the anomalous source, L(x̂ ok; x̂ ko) − L(x̂ ok; x̂ ok). Since the average coding rate is expected to be shorter when employed to code the source for which it is optimized, these differences are expected to grow as the difference between the two distributions f ok x̂ and f ko x̂ increases. As a result, large κ values correspond to system configurations with high detection capability.
Differently from ζ, κ is a quantity that is always positive and can be directly used as distinguishability measure.
Within the Gaussian assumption, the distinguishability measure κ becomes

κ = ( tr(Σ̃ θ) + tr(Σ̃ θ^{-1}) − 2 n θ ) / (2 ln 2)  (19)

from which it is evident that κ is convex with respect to Σ̂ ko θ and, as for ζ, κ vanishes for Σ̂ ok θ = Σ̂ ko θ. As a final remark, coherently with the typical per-use analysis, distinguishability measures implicitly consider detectors that scrutinize an increasing number of subsequent source instances and scale their performance by such a number. Hence, as the rate and distortion coming from (2) are best-case bounds that can be approximated by increasing the complexity of the system, the distinguishability measures indicate how fast a detector accumulates information allowing it to declare an anomaly. The higher such a figure, the lower the number of subsequent symbols needed to arrive at a conclusion or, alternatively, the higher the confidence in a conclusion drawn after analyzing a single instance.
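κ admits the same componentwise evaluation in the diagonal case. The sketch below is ours (illustrative eigenvalues); since each term has the form r + 1/r − 2 ≥ 0, κ is zero only when the two distorted distributions coincide and is otherwise positive.

```python
# Sketch: kappa = (tr(tilde Sigma) + tr(tilde Sigma^-1) - 2 n_theta) / (2 ln 2)
# for diagonal covariances; components killed by compression carry no information.
import math

def kappa(lam_ok, lam_ko, theta):
    k = 0.0
    for lo, lk in zip(lam_ok, lam_ko):
        if lo > theta:
            s, tau = 1.0 - theta / lo, theta / lo
            r = (lk / lo) * s + tau        # diagonal entry of tilde Sigma_theta
            k += r + 1.0 / r - 2.0         # always >= 0, zero iff r == 1
    return k / (2.0 * math.log(2.0))

k_same = kappa([1.0, 1.0], [1.0, 1.0], 0.2)      # identical sources: kappa = 0
k_diff = kappa([1.5, 0.5], [0.5, 1.5], 0.1)      # swapped energies: kappa > 0
```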

IV. AVERAGE AND ASYMPTOTIC DISTINGUISHABILITY

A. Average on the set of possible anomalies
Anomalies, modelled as zero-mean Gaussian vectors with fixed energy, are completely defined by their covariance matrix Σ ko with tr(Σ ko) = n. We decompose Σ ko = U ko Λ ko (U ko)ᵀ with Λ ko = diag(λ ko 0, ..., λ ko n−1) and U ko orthonormal. The set of all possible λ ko = (λ ko 0, ..., λ ko n−1) is

S n = { λ ∈ R^n | λ j ≥ 0, Σ_{j=0}^{n−1} λ j = n }

while the set of all possible U ko is that of n × n orthonormal matrices

O n = { U ∈ R^{n×n} | U Uᵀ = I n }

By indicating with U(·) the uniform distribution on the argument domain, we will assume that when λ ko is not known then λ ko ∼ U(S n) and, similarly, when U ko is not known then U ko ∼ U(O n).

Note now that S n is invariant with respect to any permutation of the λ j. Since λ ko ∼ U(S n), also E[λ ko] must be invariant with respect to the same permutations, so that E[λ ko j] = E[λ ko k] for any j, k. Since λ ko has a constrained sum and is the diagonal of Λ ko, we have E[Λ ko] = I n. This implies

E[Σ ko] = E[U ko Λ ko (U ko)ᵀ] = E[U ko (U ko)ᵀ] = I n

Hence, in our setting, the average anomaly is white and we may compute the corresponding distinguishability measures ζ I and κ I, i.e., ζ and κ when Σ ko = I n. Note that, in this case, Σ̃ θ is the n θ × n θ upper-left submatrix of (Σ ok)^{-1} S θ + T θ, which is a diagonal matrix whose diagonal elements are

σ̃ j = (1 − θ/λ ok j) / λ ok j + θ/λ ok j,  j = 0, ..., n θ − 1  (20)

With these quantities, the expressions of the distinguishability measures become

ζ I = ( Σ_{j=0}^{n θ − 1} σ̃ j − n θ ) / (2 ln 2)  (21)

κ I = ( Σ_{j=0}^{n θ − 1} (σ̃ j + 1/σ̃ j) − 2 n θ ) / (2 ln 2)  (22)

Note that, due to Jensen's inequality, the linearity of ζ and the convexity of κ imply

E[ζ] = ζ I  and  E[κ] ≥ κ I

Moreover, the very simple structure of ζ I allows the derivation of the following Theorem, whose proof is in the Appendix.

Theorem 1. If Σ ok ≠ I n, then ζ I, seen as a function of θ, vanishes for at least one critical value of θ ∈ (0, λ ok 0).
Considering a white anomaly, the intuition behind this theorem is the following. When distortion is null (no compression), since x̂ ko and x̂ ok have the same average energy and the coding is tuned on x̂ ok, L(x̂ ko; x̂ ok) > L(x̂ ok; x̂ ok), so that ζ I is positive. On the other hand, when distortion is so high that only the first component of x ok survives, i.e., Σ̂ ok θ = λ ok 0 − θ, a single component also survives in x̂ ko. In this setting, ζ I depends on the difference between the two scalar quantities Σ̂ ok θ and Σ̂ ko θ. With a few numerical manipulations, it is possible to prove that Σ̂ ok θ > Σ̂ ko θ, so that ζ I results to be negative. Since ζ I is continuous in θ, it must pass through zero at least once. Therefore, at least one critical level of distortion exists that makes detectors that do not use information on the anomaly ineffective.
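The sign change described above can be checked numerically for a white anomaly against a non-white normal source. The sketch is ours and the eigenvalues below are illustrative (unit average energy, as assumed in the paper):

```python
# Sketch: zeta_I for the white anomaly (Sigma_ko = I_n) and diagonal Sigma_ok;
# positive at low distortion, negative when only one component survives.
import math

def zeta_white(lam_ok, theta):
    z = 0.0
    for lo in lam_ok:
        if lo > theta:
            z += (1.0 - theta / lo) / lo + theta / lo - 1.0
    return z / (2.0 * math.log(2.0))

lam = [2.0, 0.7, 0.3]           # non-white normal source, tr = n = 3
z_low = zeta_white(lam, 1e-6)   # nearly no distortion
z_high = zeta_white(lam, 1.9)   # only the first component survives
# a sign change in theta implies at least one critical theta with zeta_I = 0
```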

B. Asymptotic distinguishability
White signals are not only the average anomalies but are also typical anomalies, in a sense specified by the following Theorem, whose proof is in the Appendix.

Theorem 2. If λ ko ∼ U(S n) and U ko ∼ U(O n) are drawn independently, then, as n → ∞, the random matrix Σ ko = U ko Λ ko (U ko)ᵀ converges to I n, in the sense that both its average squared deviation and its maximum deviation from I n vanish.
Hence, when n increases, most of the possible anomalies behave as white signals, i.e., ζ tends to ζ I, which thus enjoys the property shown in Theorem 1. From an anomaly detection perspective, if the signal is characterized by a sufficiently large dimension n, the designer may consider the white anomaly as a reference.
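This whitening effect can be probed empirically by sampling Σ ko as in Section IV-A: λ ko uniform on the simplex (normalized i.i.d. exponentials give the flat Dirichlet distribution) and U ko from the orthogonal group (QR factor of a Gaussian matrix). Our sketch, assuming NumPy is available; trial counts and dimensions are illustrative:

```python
# Sketch: the maximum entrywise deviation of a uniformly drawn Sigma_ko from
# the identity shrinks as n grows, in line with the asymptotic whitening claim.
import numpy as np

rng = np.random.default_rng(0)

def random_sigma_ko(n):
    """Draw Sigma_ko = U diag(lam) U^T with lam ~ U(S_n), U orthonormal."""
    e = rng.exponential(size=n)
    lam = n * e / e.sum()                  # uniform on {lam >= 0, sum(lam) = n}
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q @ np.diag(lam) @ q.T

def mean_deviation(n, trials=50):
    return float(np.mean([np.abs(random_sigma_ko(n) - np.eye(n)).max()
                          for _ in range(trials)]))

d8, d64 = mean_deviation(8), mean_deviation(64)  # d64 should be smaller than d8
```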

V. NUMERICAL EXAMPLES
In this section we match the theoretical derivations with the quantitative assessment of the performance of some practical anomaly detectors applied to compressed signals.
Normal signals are assumed to be x ok ∼ G(0, Σ ok), where Σ ok is the diagonal matrix of the eigendecomposition of the matrix Σ = U Σ ok Uᵀ, with Σ j,k = ω^|j−k| for j, k = 0, ..., n − 1, and U an orthonormal matrix. The parameter ω is set to yield a different degree of non-whiteness, measured with the so-called localization defined as

L x = Σ_{j=0}^{n−1} ( λ j / Σ_{k=0}^{n−1} λ k )² − 1/n

where the λ j are the eigenvalues of the covariance matrix of x. The localization goes from L x ok = 0 when the signal is white to L x ok = 1 − 1/n when all the energy is concentrated along a single direction of the signal space (see [20] for more details). To show the effect of realistic localization [21], we consider values of ω corresponding to L x ok ∈ {0, 0.05, 0.2}. Anomalous signals are generated as x ko ∼ G(0, Σ ko), where Σ ko = U ko Λ ko (U ko)ᵀ is randomly picked according to the uniform distribution defined in Section IV-A.
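The localization of the ω-correlated source can be computed directly from the eigenvalues of Σ. Our sketch below assumes the reconstruction L x = Σ j (λ j / Σ k λ k)² − 1/n and the availability of NumPy:

```python
# Sketch: localization of the exponentially correlated covariance
# Sigma_{j,k} = omega^{|j-k|}; omega = 0 gives a white signal (L_x = 0).
import numpy as np

def localization(omega, n=32):
    idx = np.arange(n)
    sigma = omega ** np.abs(idx[:, None] - idx[None, :])
    lam = np.linalg.eigvalsh(sigma)        # eigenvalues of the covariance
    p = lam / lam.sum()
    return float(np.sum(p ** 2) - 1.0 / n)

l_white = localization(0.0)   # white signal
l_corr = localization(0.9)    # strongly correlated signal, larger localization
```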
A first use of this random sampling is the possibility of pairing Theorem 2 with some numerical evidence. Fig. 3 reports the vanishing trend of the average squared and uniform deviation from I n of a population of uniformly distributed covariance matrices Σ ko. Though not a theoretical result, note that the empirical evidence supports a classical 1/√n convergence.

As far as detector assessment is concerned, we set n = 32 and consider three compression techniques tuned to the normal signal and applied to both normal and anomalous instances. More specifically, x is mapped to x̂ by
• the minimum-rate-given-distortion compression in (6) (Rate-Distortion Compression, RDC);
• projecting x along the subspace spanned by the eigenvectors of Σ ok with the largest eigenvalues (Principal Component Compression, PCC);
• a family of autoencoders [25, Chapter 14] with an increasingly deficient latent representation (Auto-Encoder Compression, AEC). Assuming that p is the dimensionality of the latent representation, the encoder is a neural network with fully connected layers of dimensions n, 4n, 2n, p, and the decoder is the dual network, whose layers have dimensions 2n, 4n, and n and whose number of inputs is p. The family of autoencoders is trained to minimize distortion computed as in (1). To smooth performance degradation we first train an autoencoder with p = n − 1. Then, the node of the latent representation along which we measure the least average energy is dropped to produce a smaller network with a (p − 1)-dimensional latent space. The obtained network is re-trained using the previous weights as initialization. This process is repeated, decreasing p and thus considering larger distortion values.
These three schemes address in different ways the trade-off between compression and distortion. Since we refer to a theoretical model based on continuous quantities and for which rate is potentially infinite, the compressors have to be paired with a quantization stage ensuring that rate values are
finite. In particular, we encode each component of x̂ with 16 bits, which yields rates of less than 16n = 512 bits per time step. We assume that quantization is fine enough to substantially preserve the Gaussian distribution of x̂ and thus evaluate the mutual information between x and x̂ as if they were jointly Gaussian, with a covariance matrix that we estimate by Monte Carlo simulation [26]. Such an estimation yields the rate-distortion curves in Fig. 4.
As expected, RDC yields the smallest rates while PCC gives the largest ones.Between the two we have AEC, whose performance depends on the effectiveness of the training.
Note that only the results of Fig. 4 refer to the additional quantization stage, while in the remaining part of our analysis we consider continuous sources.
The compressed version of the signal is then passed to a detector whose task is to compute a score such that high-score instances should be more likely to be anomalous. The final binary decision is taken by matching the score against a threshold.
We consider two detectors not relying on information on the anomaly:
• a Likelihood Detector (LD), whose score is the same considered for ζ, so that to each instance x̂ we associate the score ℓ ok x̂ (x̂) = −log f ok x̂ (x̂);
• a One-Class Support-Vector Machine (OCSVM) [27] with a Gaussian kernel, trained on a set of instances of normal signals contaminated by 1% of unlabelled white instances to help the algorithm find the envelope of normal instances.
We also consider two detectors that are able to leverage information on the anomaly:
• a Neyman-Pearson Detector (NPD), whose score is the same considered for κ, so that to each instance x̂ we associate the score r(x̂) = log f ko x̂ (x̂) − log f ok x̂ (x̂);
• a Deep Neural Network (DNN) with three fully connected hidden layers with p, 2n, n neurons with ReLU activations and a final sigmoid neuron producing the score. The network is trained with a binary cross-entropy loss against a dataset containing labelled normal and anomalous instances.
LD and NPD detectors can be employed only on signals compressed by the RDC or PCC methods, since they rely on the statistical characterization of the signals, which is not available after the nonlinear processing in AEC.
Table I shows how many different anomalies and how many signal instances are generated for the training (when needed) and for the assessment of the detectors. Note that in the DNN case we limited the analysis to 50 anomalies, since the training process must be repeated for each of them.
To be independent of the choice of thresholds, detectors' performance is assessed by the Area-Under-the-Curve (AUC) methodology [29]. AUC estimates the probability that, given a random normal instance and a random anomalous instance, the former has a lower score than the latter, as it should be in an ideal setting. Hence, AUC is a positive performance index.
Clearly, detectors with AUC = 1/2 are no better than coin tossing. Yet, if AUC < 1/2, the score has some ability to distinguish normal and anomalous signals if it is interpreted in a reverse way. Hence, it is convenient to set our empirical distinguishability measure to

ψ = 2 |AUC − 1/2|

Note that, if AUC must be estimated from samples, reversing values lower than 1/2 is not always possible. There are classes of estimators for which values less than 1/2 are not reliable [30], [31]. From now on, we report results referring to AUC estimated as in [29], for which reversing values lower than 1/2 is possible. In the following, the trends of ψ are reported and matched with the trends of |ζ| and κ to show how theoretical properties reflect on real cases. Comparisons must be partially qualitative, as ζ and κ quantify distinguishability in bits per symbol while ψ comes from the probability of correct detection. Note also that ζ and κ refer to the difference between the average values of the score in the normal and anomalous cases, while ψ takes into account the entire distributions of these scores.
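An AUC estimator and the resulting ψ can be sketched with the plain Wilcoxon-Mann-Whitney statistic. The sketch is ours, and the form ψ = 2 |AUC − 1/2| is our reconstruction of the measure described in the text:

```python
# Sketch: rank-based AUC (ties counted as 1/2) and the empirical
# distinguishability psi = 2*|AUC - 1/2| built on top of it.
def auc(scores_ok, scores_ko):
    """Probability that a random anomalous score exceeds a random normal one."""
    wins = 0.0
    for a in scores_ko:
        for b in scores_ok:
            wins += 1.0 if a > b else (0.5 if a == b else 0.0)
    return wins / (len(scores_ok) * len(scores_ko))

def psi(scores_ok, scores_ko):
    """Deviation of AUC from the coin-tossing value 1/2, reversed if needed."""
    return 2.0 * abs(auc(scores_ok, scores_ko) - 0.5)
```

Note that psi treats an AUC below 1/2 symmetrically, i.e., a score that is systematically reversed still counts as informative.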
All plots are made against a normalized distortion d = D/n in the range d ∈ [0, 0.64] as larger relative distortions are usually beyond operative ranges.
A. RDC

Fig. 5 summarizes the results we have in this case with two rows of 3 plots each. The upper row of plots corresponds to detectors that do not exploit information on the anomaly, while the lower row concerns detectors that may leverage information on the anomaly. Colors correspond to different L x ok, dashed trends assume that the anomaly is the average one, i.e., white, and shaded areas show the span of 50% of the Monte Carlo population. The profiles of |ζ| and κ on the left shall be matched with the profiles on the right, which correspond to the four detectors we consider. No |ζ| profile appears for L x ok = 0 since, as discussed in Section IV-A, ζ vanishes for white normal signals. This corroborates the role of the white anomaly as a reference case, since it represents the average behaviour in the anomaly-agnostic scenario and a lower bound in the anomaly-aware one. The white anomaly is not only a reference case but also the case to which any possible anomaly tends when n increases, as demonstrated in Theorem 2.
Theory also anticipates that, without any knowledge of the anomaly (upper row), a limited amount of distortion may cause distinguishability to vanish and thus detectors to fail. This happens for practical detectors such as LD and OCSVM. The distortion level at which detectors fail is also anticipated by |ζ| and depends on L x ok, as predicted by Theorem 1. Overall, the theoretical measures |ζ| and κ anticipate that, in the low-distortion region, more localized signals are more distinguishable from anomalies, though they cause detector failures at smaller distortions with respect to less localized signals.
Detectors leveraging knowledge of the anomaly (lower row) fail completely only at the maximum level of distortion, as revealed by the abstract distinguishability measure κ. Also in this case, by comparing the trend of κ with the zoomed areas in the NPD and DNN plots, we see how the theoretical measures anticipate that, in the low-distortion region, more localized signals tend to be more distinguishable from anomalies but cause a more definite performance degradation of detectors when d increases.

B. PCC
From the point of view of the rate-distortion trade-off, PCC is largely suboptimal. Yet, due to its linear nature, x and x̂ are still jointly Gaussian, so that, also in this case, we can compute the theoretical |ζ| and κ by means of (21) and (19). Fig. 6 summarizes the results we have in this case with plots of the same kind as Fig. 5. The qualitative behaviours commented on in the previous subsection appear in the new plots and are anticipated by the trends of the theoretical quantities.
The distortion levels at which anomaly-agnostic detectors fail change with respect to the RDC case but are still anticipated by the theoretical curves and Theorem 1.
In this case, the values of |ζ| beyond the breakdown distortion levels increase slightly more than in the optimal compression scenario. Hence, by adopting a compression strategy that is suboptimal in the rate-distortion sense, one may obtain a better distinguishability of the compressed normal signal from the compressed anomalies. This is, indeed, what happens in practice, as highlighted by the LD and OCSVM plots in the first row of Fig. 6.

C. AEC
In this case, compression is non-linear, so that x and x̂ may not be jointly Gaussian. This prevents us from computing the theoretical curves |ζ| and κ and from applying LD and NPD, which rely on the knowledge of the distribution of the signals. For this reason, Fig. 7 reports only the performance of the OCSVM and DNN detectors.
Notice how the qualitative trends of those performances still follow, though with a larger level of approximation, what is indicated by the theoretical curves for PCC.

VI. CONCLUSION
Massive sensing systems may rely on lossy compression to reduce the bitrate needed to transmit acquisitions to the cloud while theoretically preserving the important information. At some intermediate point along their path to centralized servers, compressed sensor readings may be processed for the early detection of anomalies in the systems under observation. Such detection must be performed on compressed data.
To measure detection performance, we define two information-theoretic metrics referring to the anomaly-agnostic and anomaly-aware cases, for which a statistical interpretation is also provided.
In a framework approximating normal and anomalous signals with Gaussian sources, we revise the classical rate-distortion theory to report the distributions of the distorted signals, the mappings that produce them (see Lemma 1 and Lemma 2), and closed forms for the distinguishability metrics.
Focusing on the anomaly-agnostic case, we prove with Theorem 1, and confirm with numerical evidence, that there exists at least one critical level of distortion at which the detector is ineffective.
We also prove that the white anomaly is a reference case that can be employed in the design of the system. Indeed, it provides information about the average and minimum performance in the anomaly-agnostic and anomaly-aware scenarios, respectively. Moreover, we demonstrate with Theorem 2 that any possible anomaly tends to be white in the asymptotic case.
All these results are confirmed with numerical examples in a toy case. We show that the theoretical measures of distinguishability anticipate the performance of real detectors for both optimal (in the rate-distortion sense) and suboptimal compressors.

APPENDIX

Proof of Lemma 1. Distortion is tuned to the normal case, which entails a memoryless source. Hence, we may drop time indications and concentrate on a vector x with independent components x_j ∼ G(0, λ_j) for j = 0, …, n − 1.
We know from [18] that, for a given value of the parameter θ, each component x_j is transformed separately into x̂_j. In particular, x_j = x̂_j + ∆_j, where, to achieve the Shannon lower bound, ∆_j must be an instance of a Gaussian random variable independent of x̂_j. Hence, the three quantities x̂_j, x_j and ∆_j must be jointly Gaussian with zero mean and covariance matrix

Σ_{x̂_j,x_j,∆_j} = [[λ_j − θ, λ_j − θ, 0], [λ_j − θ, λ_j, θ], [0, θ, θ]]   (24)

That explains in which sense x̂_j encodes x_j. In fact, the non-diagonal elements λ_j − θ are positive, and thus x̂_j and x_j are positively correlated.
Moreover, (x̂_j, x_j) ∼ G(0, Σ_{x̂_j,x_j}), with Σ_{x̂_j,x_j} the upper-left 2 × 2 submatrix of Σ_{x̂_j,x_j,∆_j} in (24). If we assume that θ < λ_j, from the joint probability of x_j and x̂_j we may compute the action of f_{x̂|x} on the j-th component of x as the PDF of x̂_j given x_j, i.e., f_{x̂_j|x_j}(α|β), which is Gaussian in α with mean s_j β and variance s_j τ_j λ_j, where τ_j = min{1, θ/λ_j} ∈ [0, 1] and s_j = 1 − τ_j. Note that f_{x̂_j|x_j} becomes δ(α) for τ_j → 1 (maximum distortion of this component implies that the corresponding output is set to 0) and δ(α − β) for τ_j → 0 (no distortion of this component, the output is equal to the input).
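The per-component action of the compressor can be simulated directly. The sketch below assumes the Gaussian test channel x̂_j = s_j x_j + w_j with w_j ∼ G(0, s_j τ_j λ_j), which reproduces the limiting behaviours δ(α) and δ(α − β) noted above; the values of λ_j and θ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, theta, N = 2.0, 0.5, 200_000   # lambda_j, distortion parameter, sample size

tau = min(1.0, theta / lam)         # tau_j = min{1, theta/lambda_j}
s = 1.0 - tau                       # s_j = 1 - tau_j

x = rng.normal(0.0, np.sqrt(lam), N)            # x_j ~ G(0, lambda_j)
w = rng.normal(0.0, np.sqrt(s * tau * lam), N)  # channel noise, assumed form
xhat = s * x + w                                # compressed component

# For theta < lambda_j the output is G(0, lambda_j - theta)
# and the per-component distortion E[(x_j - xhat_j)^2] equals theta.
print(np.var(xhat))              # ≈ lam - theta = 1.5
print(np.mean((x - xhat) ** 2))  # ≈ theta = 0.5
```

The empirical output variance λ_j − θ and distortion θ match the covariance structure stated in the proof.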
Proof of Lemma 2. The PDF of x̂^ko, i.e., of x^ko distorted by means of f^ok_{x̂|x}, can be computed by marginalizing f^ok_{x̂|x} against the PDF of x^ko. Assume first to be in the low-distortion condition θ < λ^ok_{n−1}, which implies T_θ = θ(Σ^ok)^{−1}, and consider the resulting exponent, to which one may add and subtract q Q. Putting this back into f^ok_{x̂}, a straightforward expansion of the definitions under the low-distortion assumption finally rearranges the covariance matrix as in the statement of the Lemma. To address the case in which θ exceeds λ^ok_{n−1}, note that for θ → (λ^ok_{n−1})^−, the last diagonal entry of S_θ tends to 0 and thus, by (25), the covariance tends to have zeros in its last row and column. Since a Gaussian with vanishing variance can be considered a Dirac's delta, this models the fact that the last component of both x and x^ko is fully distorted and set to 0. With this, (25) is valid also for λ^ok_{n−1} < θ < λ^ok_{n−2}. Analogous considerations can be carried out for θ → (λ^ok_j)^− and j = n − 2, n − 3, …, 0, so that (25) is valid for any value of θ.
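Component-wise, the mechanism in the two proofs is the classical reverse water-filling solution for Gaussian sources: each eigenvalue receives distortion min{θ, λ_j} and the corresponding output is zeroed once θ exceeds it. A minimal sketch with an illustrative spectrum summing to n:

```python
import numpy as np

def reverse_water_fill(lams, theta):
    """For a Gaussian source with spectrum `lams` and water level `theta`,
    return per-component distortion, output variance, and rate in bits."""
    lams = np.asarray(lams, dtype=float)
    dist = np.minimum(theta, lams)            # distortion spent per component
    out_var = np.maximum(lams - theta, 0.0)   # Var(xhat_j): 0 once theta >= lambda_j
    rate = 0.5 * np.sum(np.log2(np.maximum(lams / theta, 1.0)))
    return dist, out_var, rate

lams = [2.0, 1.0, 0.6, 0.4]   # illustrative eigenvalues, sum = n = 4
dist, out_var, rate = reverse_water_fill(lams, theta=0.8)
print(out_var)   # [1.2, 0.2, 0.0, 0.0]: the last two components are fully distorted
```

As θ sweeps past each eigenvalue, one more component collapses to the Dirac's delta behaviour described in the proof.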
Proof of Lemma 3.

L(x
where the last summand has been computed as the expectation of a quadratic form in a Gaussian multivariate, for which Corollary 3.2b.1 in [32, Chapter 3] gives a formula.

Proof of Theorem 1. From (21) we have an expression of ζ_I as a sum of contributions α_j(θ). Note that α_j(θ) is continuous and its derivative is ∂α_j/∂θ = (1 − 1/λ^ok_j)/λ^ok_j. For simplicity's sake, assume that the λ^ok_j are distinct. As a function of θ, ζ_I is continuous. In fact, it is trivially continuous in each Θ_j. Yet, it is continuous also at any chosen λ^ok_ℓ with ℓ = 0, …, n − 1. To see why, note that the left and right limits at λ^ok_ℓ coincide, where we have exploited that the α_j(θ) are continuous and that α_ℓ(λ^ok_ℓ) = 0. On the left-hand side of its domain, when θ = λ^ok_n = 0 (no distortion), we have n_θ = n and thus ζ_I > 0, where the last inequality follows from the fact that Σ_{j=0}^{n−1} λ^ok_j = n and thus Σ_{j=0}^{n−1} 1/λ^ok_j ≥ n. On the right-hand side of its domain, when θ = λ^ok_0 (maximum distortion), we have n_θ = 0 and thus ζ_I = 0. Yet, we also have that the summands in ∂ζ_I/∂θ are positive if λ^ok_j > 1. Hence, if k = arg max_k {λ^ok_k ≥ 1}, for θ ≥ λ^ok_k all the summands in the above expression are positive and thus ∂ζ_I/∂θ > 0 for λ^ok_k < θ ≤ λ^ok_0. Given that ζ_I = 0 at the end of that interval, it must be negative in its interior.
Since we know that ζ_I is positive for θ = λ^ok_n = 0 and it is continuous for θ ∈ ]λ^ok_n, λ^ok_0[, it must pass through zero at least once in the region where it is not necessarily negative, i.e., for 0 < θ < λ^ok_k.

Proof of Theorem 2. We will use the following Lemma, whose proof follows this one.

Lemma 4. If λ^ko ∼ U(S_n), then for any integrable function f : R → R and any j = 0, …, n − 1, E[f(λ^ko_j)] = ((n − 1)/n^{n−1}) ∫_0^n f(p) (n − p)^{n−2} dp.

From [33] we know that, if Q is uniformly distributed in O_n (i.e., if it is distributed according to the Haar measure on the orthogonal group), then, for any sequence of integers M_n < n increasing with n but such that M_n = o(n/log n), the entries of the first M_n columns of Q converge in probability to independent random variables such that √n Q_{j,k} ∼ G(0, 1) for j = 0, …, n − 1 and k = 0, …, M_n − 1.
Such a property can be extended to any subset of M_n columns. In fact, given any subset of M_n columns of Q, there is a permutation matrix P such that QP has those columns as its first ones.
Yet, since P ∈ O_n and Q is distributed according to the Haar measure on that group, QP is also distributed according to that measure, and the entries in those columns tend to independent Gaussians. Divide now n by M_n as in n = N_n M_n + m_n, where N_n is the quotient and 0 ≤ m_n < M_n is the remainder. We can look at Q as the concatenation of N_n matrices Q_i, for i = 0, …, N_n − 1, each n × M_n, and of a last matrix Q_{N_n} that is n × m_n. From M_n = o(n/log n), we have lim_{n→∞} N_n/log n = ∞, and we can choose M_n such that N_n = o(n^α) for any α > 0.
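The Gaussian limit of the rescaled entries of a Haar-distributed Q can be observed numerically. A sketch, sampling Q via the QR decomposition with the usual sign correction that makes the orthogonal factor exactly Haar-distributed (the value of n is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# QR of a Gaussian matrix, with column signs fixed so Q is Haar on O_n.
A = rng.normal(size=(n, n))
Q, R = np.linalg.qr(A)
Q = Q * np.sign(np.diag(R))

# Entries of sqrt(n) * Q in any fixed column are close to G(0, 1).
col = np.sqrt(n) * Q[:, 0]
print(col.mean(), col.var())   # ≈ 0 and ≈ 1
```

Multiplying by any permutation matrix P leaves this distribution unchanged, which is the invariance exploited in the text.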
If we set U^ko = Q, then Σ^ko = U^ko Λ^ko (U^ko)^⊤ can be written componentwise as a sum in which a term W_i, for i = 0, …, N_n − 1, scans each of the first submatrices Q_i, a last summand accounts for the remainder matrix Q_{N_n}, and, thanks to the above considerations, ν_{j,l} ∼ ν_{k,l} ∼ G(0, 1) for all j, k, l.
We now have to address the case j ≠ k and the case j = k separately.
A. Asymptotics of Σ^ko_{j,k} for j ≠ k

Let us now consider W_0 as the representative of all the other W_i for i = 0, …, N_n − 1, written as W_0 = Σ_{l=0}^{M_n−1} X_{n,l} with X_{n,l} = (1/n) ν_{j,l} λ^ko_l ν_{k,l}. All the normal random variables involved in such a sum are asymptotically independent, but this is not true for the λ^ko_l, since the eigenvalues are constrained to sum to n.
Hence, W_0 is a triangular array of row-dependent random variables whose asymptotic behaviour can be analyzed by means of [34, Theorem 2.1], which is essentially a Lindeberg-Feller Central Limit Theorem with row-wise independence relaxed to asymptotic row-wise uncorrelation. To analyze the asymptotics of W_0, note that E[X_{n,l}] = E[λ^ko_l] E[ν_{j,l}] E[ν_{k,l}] = 0. Note also that, though not independent, the X_{n,l} and X_{n,l′} with l ≠ l′ are asymptotically uncorrelated. With this, we also know the variance σ²_{W_0} up to the expectation E[(λ^ko_l)²], which can be computed by resorting to Lemma 4. Finally, let L_{M_n}, R_{M_n} ⊂ {0, …, M_n − 1} be two index subsets such that L_{M_n} ∩ R_{M_n} = ∅. If g_{L_{M_n}} is any function of the random variables X_{n,l} with l ∈ L_{M_n} and h_{R_{M_n}} = Σ_{l∈R_{M_n}} X_{n,l}, the covariance between g_{L_{M_n}} and h_{R_{M_n}} vanishes asymptotically.

Fig. 1. A sensorized plant whose acquisitions are aggregated at the edge before being sent to the cloud.

Fig. 4. Rate-distortion curves for the three compression schemes we consider and for different values of the localization of the original signal.

E[(λ^ko_l)²] = ((n − 1)/n^{n−1}) ∫_0^n p² (n − p)^{n−2} dp = 2n/(n + 1)

Hence, we have σ²_{W_0} = 2M_n/(n(n + 1)) → 0 for n → ∞. This helps satisfying the Lindeberg condition since, if for a given ε > 0 we indicate with E[X²_{n,l}; |X_{n,l}| ≥ ε] the expectation of X²_{n,l} restricted to its values that are not less than ε in modulus, then the sum of such restricted expectations vanishes as n → ∞.
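The second moment 2n/(n + 1) can be verified by Monte Carlo: a uniform sample on S_n is n times a flat Dirichlet vector (the values of n and the number of trials are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 50, 200_000

# lambda^{ko} ~ U(S_n): n * Dirichlet(1, ..., 1) is uniform on
# the simplex {lambda >= 0, sum(lambda) = n}.
lam = n * rng.dirichlet(np.ones(n), size=trials)

second_moment = np.mean(lam[:, 0] ** 2)
print(second_moment, 2 * n / (n + 1))   # both ≈ 1.96
```

The agreement confirms that σ²_{W_0} = 2M_n/(n(n + 1)) vanishes whenever M_n grows more slowly than n².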

TABLE I
NUMBER OF ANOMALIES (Σ^ko) AND, FOR EACH ANOMALY, THE NUMBER OF NORMAL (ok) AND ANOMALOUS (ko) SIGNAL INSTANCES USED IN THE TRAINING AND ASSESSMENT OF THE DETECTORS.