Importance of Small Probability Events in Big Data: Information Measures, Applications, and Challenges

In many smart-city applications (e.g., anomaly detection and security systems), rare events dominate the importance of the information contained in the big data collected by the Internet of Things (IoT). It is therefore crucial to explore the valuable information associated with the rare events contained in minority subsets of voluminous data. A fundamental question is how to measure, from the perspective of information theory, the information carried by small probability events together with its importance. This paper first surveys theories and models of importance measures and investigates the relationship between subjective or semantic importance and rare events in big data. Moreover, some applications to message processing and data analysis are discussed from the viewpoint of information measures. In addition, some open challenges for information measures based on rare event detection, in areas such as smart cities, autonomous driving, and anomaly detection in the IoT, are introduced as future research directions.

new opportunities, while many new challenges come up, including environmental deterioration, sanitation problems, traffic congestion and terrorist attacks. To address these problems so that citizens may enjoy a secure and convenient daily life, the Internet of Things (IoTs) has emerged as an effective solution [1]-[5].
In IoTs, an explosively increasing number of sensors and devices are deployed to sense and collect different types of data, e.g., the states of moving cars, crossroads and subway tracks, which drives us into a "big data" era. To make things smart, massive data has to be mined for useful information and knowledge. The key point therefore lies in how to deal with the observed data and dig out the hidden valuable information [6]-[13]. To do so, a series of promising technologies have been put forward, such as statistical learning, computer vision and signal processing [14]-[20].

A. Importance of Rare Events with Small Probability
As a matter of fact, some applications explore the regular patterns of systems' or users' behaviours from common events that occur often, whereas in other applications rare events attract more attention than those occurring with large probability. For example, in financial crime detection systems, only the few illegal identities that cause financial fraud catch our attention [21], [22], and these are subjectively more important. Similarly, in intrusion detection systems, only a small number of security alarms need to be detected and handled [23]-[27].

So far, many works have investigated network intrusion and reliable communications to protect IoTs from being attacked [28]-[38], which shows that rare events deserve attention for their special value in IoTs. By resorting to IoTs or other monitoring devices [39], the smart city is becoming a trend in city planning, construction, management and operations [40]-[45]. In this case, the rare events observed by monitoring systems also carry the more significant features in the massive data, which can provide effective references for transportation management, city planning and public safety.
Because anomalous events may be hidden in big data [46]-[50], it is important to process rare events or minorities in object detection. For autonomous driving on highways, it is crucial to detect unexpected moving obstacles in the lanes (which can be viewed as rare events).
It is reported that around 150 people die from road hazards in American traffic accidents every year [51], [52]. Developing autonomous cars based on anomalous object detection is beneficial in many respects, such as reducing traffic congestion and accidents, improving energy efficiency and ensuring transportation safety. Indeed, some studies try to design intelligent vehicle systems that avoid dangerous driving events with small probability [53]-[59].
In brief, rare events have special value in many emerging fields such as IoTs, smart cities and autonomous driving. Accordingly, approaches to small probability processing are being investigated from many perspectives in the big data era.

B. Information Theory for Rare Events
From the viewpoint of information theory, information measures have a place in rare event processing for big data. According to conventional information theory, the uncertainty of probability distributions can be characterized by information measures such as Kolmogorov complexity, Shannon entropy, Rényi entropy and mutual information. These measures are also applicable to infrequent or abnormal events [60], [61]. By using information measures to analyze the complexity of the different classes in big data, rare events can be recognized and handled [62]. For instance, a factorization-based objective function of the distribution was proposed to detect the subsets with smaller probability [63].
Additionally, as an effective information distance, the relative (or differential) entropy is also applied to outlier detection [64]. Although there are special scenarios where the above approaches can be used, they evidently deal with rare events by focusing on the large probability elements or subsets.
From the perspective of small probability elements, there are also some technologies in the framework of statistical mechanics, such as the large deviation approaches and the measure of concentration of rare events [65], [66]. In these cases, traditional information measures are explained and extended with the aim of processing minority subsets. These technologies can also be used in many applications such as secure lossy compression and anomaly detection [67], [68].
In the framework of data distribution processing, the information divergence, as a kind of information measure, lies at the intersection of information theory and big data analytics. In fact, information divergences can be adopted to measure the distance between two distributions with small probability elements.
Currently, information divergences are used in many applications involving rare events, such as fault detection [69], key frame selection [70] and image recognition [71], [72]. Therefore, how to use information measures to cope with small probability events becomes all the more interesting.

C. The motivations and contributions
The purpose of this paper is to integrate the works on the importance analysis of small probability events and to clarify the relationship between the more important small probability cases and information processing, including the corresponding information measures and applications. Essentially, this paper is not a technical work but a survey that summarizes some classical theories and approaches of information processing based on small probability events, so that the related literature can be discovered in a logical and reasonable way.
As far as the contribution of this work is concerned, a theoretical framework with a common fundamental form of message importance measure is constructed to show the core idea of the importance of small probability events and to characterize its mathematical representation. Moreover, similar to Shannon entropy, an information processing architecture is proposed from the perspective of message importance to combine message importance compression, transmission loss and receiver preprocessing, which may broaden the scope of conventional information theory. In this case, some novel source coding strategies and information distortion analyses are obtained in an information system based on the importance of small probability events. For big data analytics, some related technologies, including measure estimation, dimension reduction and correlation analysis, are also unified into an information system architecture to process important small probability events. This provides a reasonable data processing procedure for the small probability events hidden in massive samples. Finally, some modern and challenging applications, such as smart cities, autonomous driving and IoTs, may adopt the information measures based on message importance as novel criteria or metrics for rare event detection. In this regard, we present some schemes with information measures for the corresponding applications.

D. Organization
The rest of this paper is organized as follows. In Section I, we analyze some theories and technologies of information measures in the scenarios where rare events are valuable. In Section II, we discuss some applications based on information measures for rare events, including information compression, transmission and preprocessing. Section III first introduces some effective estimations of distributions and their functionals. Then, information coupling, directed information and some applications involving rare events are introduced to reduce the dimension of big data and to analyze data causality or correlation from the perspective of information theory. In Section IV, some challenging research directions for information measures are presented based on rare event detection.
Finally, we conclude the paper in Section V.

I. INFORMATION THEORIES AND TECHNOLOGIES FOR MEASURING RARE EVENTS
Information measures play important roles not only in traditional information theory but also in numerous applications of big data, such as detection, classification and clustering [73], [74]. In fact, by emphasizing the small probability elements, some information measures focusing on rare events have been proposed to settle big data problems such as anomaly detection, feature selection and pattern recognition [75]-[77]. In these cases, rare events can be extremely eye-catching, in good agreement with the fact that the vital part of the information attracts more attention than the complete information. Consequently, in this paper, we focus only on the cases where small probability events, referred to as rare events, carry the importance of information.
To characterize the importance of rare events mathematically, the Message Importance Measure (MIM) [78]-[82], the fixed-parameter MIM [83] and the NMIM (Non-parametric MIM) [84] have been proposed, whose details are summarized in Table I. We also analyze the characteristics of these information measures and compare their similarities and differences as follows.
TABLE I. Information measures for rare events (MIM, fixed-parameter MIM and NMIM).

MIM [78]: $L(\mathbf{p}, \varpi)$ with the parameter $\varpi \ge 0$ [Definition 1 in [78]].
• Event decomposition and merging: The MIM can be increased by dividing one event into two sub-events.
• Minority detection combined with the Bayes method.
• Measure system states with small probability events.
• Characterize users' preference distribution for recommendation.

The fixed-parameter MIM [83]: $L_j(\mathbf{p}, \varpi_j)$ with the parameter selection given by $\varpi_j = F(p_j)$ [Eq. (3) in [83]].
• Principal component: $p_j$ becomes the principal component in the MIM.
• To focus on the probability $p_j$ by using $\varpi_j = F(p_j) = 1/p_j$.
• Applied to minority subset detection.
• The mean and variance can converge when the number of samples $N_i \to \infty$.

NMIM [84]: [Definition 1 in [84]].
• Event decomposition and merging: The NMIM can be increased by dividing one event into two sub-events.
• Without the constraint of parameter selection.
• Exponential operator rather than logarithm or polynomial operator.
• Storage code design and transmission planning in wireless communication.
i) Intrinsic sense of the information measures: The common fundamental form of the information measures (including MIM, fixed-parameter MIM and NMIM) can be given by
$$L(\mathbf{p}) = \log \sum_{i=1}^{n} p_i e^{V(p_i)}, \quad (1)$$
where $\mathbf{p} = (p_1, p_2, ..., p_n)$ is the given distribution, and the components $V(p_i)$ of MIM, fixed-parameter MIM and NMIM are respectively given by
$$V(p_i) = \varpi (1 - p_i), \qquad V(p_i) = \varpi_j (1 - p_i), \qquad V(p_i) = \frac{1 - p_i}{p_i}, \quad (2)$$
where $\varpi$ denotes the coefficient of importance. Actually, these values agree with the intuitive notion of an importance value, which can be viewed as an invariant of the system and is referred to as the self-scoring value. This implies that larger weights are allocated to the small probability events than to those with large probability. Furthermore, Fig. 1 describes the above information measures visually. Specifically, by treating important events from the probabilistic viewpoint, the status of the atypical sets with small probability is highlighted, which matches many scenarios such as anomaly detection, anti-terrorist activities, abnormal weather forecasting, and classification and clustering for binary events.
ii) Comparison of the information measures in the Bernoulli case: The comparison of different importance measures for the Bernoulli distribution $(p, 1 - p)$ is shown in Fig. 2. It illustrates that the parameter of the MIM makes a great difference to how the Bernoulli distribution is characterized, whereas the non-parametric MIM (namely NMIM) and the parametric MIM (namely fixed-parameter MIM) perform similarly in measuring small probability elements. In brief, the details of the comparison are listed as follows.
• Because the MIM is controlled by the parameter $\varpi$, its value can be kept within the computing range of computers.
• If the probability elements are small enough, the MIM amplifies small probabilities less strongly than the NMIM and the parametric MIM.
• In the neighbourhood of the uniform distribution, the parametric MIM amplifies the smaller probability better than the NMIM.
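The Bernoulli comparison above can be reproduced numerically with a short script. The following is only a minimal sketch: it assumes the forms commonly used in the MIM literature, namely log Σ p_i exp(ϖ(1 − p_i)) for the MIM and log Σ p_i exp((1 − p_i)/p_i) for the NMIM, and the function names are illustrative rather than taken from [78]-[84].

```python
import math


def _log_sum_exp(terms):
    """Numerically stable log(sum(exp(t) for t in terms))."""
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def mim(p, w):
    """Parametric MIM (assumed form): log sum_i p_i * exp(w * (1 - p_i))."""
    return _log_sum_exp([math.log(pi) + w * (1.0 - pi) for pi in p if pi > 0])


def fixed_parameter_mim(p, j):
    """Fixed-parameter MIM with the selection w_j = 1 / p_j (assumed F)."""
    return mim(p, 1.0 / p[j])


def nmim(p):
    """NMIM (assumed form): log sum_i p_i * exp((1 - p_i) / p_i)."""
    return _log_sum_exp([math.log(pi) + (1.0 - pi) / pi for pi in p if pi > 0])


def shannon_entropy(p):
    """Shannon entropy in nats, for comparison."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Sweep the Bernoulli parameter from uniform towards a rare event.
for q in (0.5, 0.1, 0.01):
    p = (q, 1.0 - q)
    print(q, shannon_entropy(p), mim(p, 5.0), fixed_parameter_mim(p, 0), nmim(p))
```

As the first probability shrinks, the entropy falls while the importance measures grow, which is the amplification of small probabilities that the comparison describes. The log-sum-exp helper is a design choice that keeps the exponentials from overflowing for very small probabilities.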

II. APPLICATIONS IN MESSAGE PROCESSING
For big data in IoTs, it is important to design efficient strategies for message processing, including information compression and transmission [86]. In particular, given the rapidly exploding volume of data [87], we no longer need to store all the data samples as before. Besides, since data traffic is increasing exponentially, it is a challenge for transmission resources (including links and networks) to carry so many data packets [88]. Hence, data processing techniques for lossy compression and transmission are investigated in many aspects [89]-[92]. In fact, information theory is a fundamental theory for data compression and transmission [93]. To be specific, it provides the optimal coding strategy and tight bounds for lossless and lossy compression [94]. Moreover, it also provides information measures, including relative entropy, Rényi divergence and f-divergence, to guide information transmission and analysis [95], [96].
From the perspective of rare events, a message processing architecture based on the message importance measure is presented in Fig. 3, whose details are listed as follows:
i) At the information source, it is important to preserve the rare events, which are regarded as important messages, and to discard some normal events. In this case, it is feasible to use the importance measures to design lossy compression schemes. To this end, the reconstruction error weighted by message importance can be minimized to achieve the lower bounds of the code length.
ii) From the viewpoint of transmission for message importance, the core idea is that the receiver gains as much information as possible from the source while keeping the loss of message importance affordable. In this case, the change of an information measure focusing on rare events can be used to characterize the upper bound of the information importance loss.
iii) At the information sink, it is possible to use information divergences to distinguish two adjacent distributions containing rare elements. This can be regarded as a preprocessing step for the received data.
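The compression idea in item i) can be illustrated with a toy allocation problem. This is a hedged sketch, not the scheme of [84]: it assumes NMIM-style importance weights p_i exp((1 − p_i)/p_i), a reconstruction error per unit importance that decays as γ^(−l) with the code length l, and a fixed total budget of code symbols. The function names and the greedy rule are hypothetical.

```python
import heapq
import math


def importance_weights(p):
    """Hypothetical weights p_i * exp((1 - p_i) / p_i), normalised to sum to 1."""
    logs = [math.log(pi) + (1.0 - pi) / pi for pi in p]
    m = max(logs)
    raw = [math.exp(v - m) for v in logs]
    total = sum(raw)
    return [r / total for r in raw]


def allocate_code_lengths(p, budget, gamma=2.0):
    """Greedily split `budget` code symbols to minimise sum_i w_i * gamma**(-l_i)."""
    w = importance_weights(p)
    lengths = [0] * len(p)
    # Max-heap (via negated keys) on the error reduction from one more symbol.
    heap = [(-wi * (1.0 - 1.0 / gamma), i) for i, wi in enumerate(w)]
    heapq.heapify(heap)
    for _ in range(budget):
        gain, i = heapq.heappop(heap)
        lengths[i] += 1
        # The next symbol for the same event reduces the error gamma times less.
        heapq.heappush(heap, (gain / gamma, i))
    return lengths


probabilities = [0.5, 0.3, 0.2]
code_lengths = allocate_code_lengths(probabilities, budget=9)
print(code_lengths)
```

Because the assumed error term is convex and decreasing in the code length, the greedy step-by-step allocation minimises the weighted error for the given budget, and the rarest event ends up with the longest codeword.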
For the specific analysis of the message processing architecture, three possible applications of information measures are summarized in Table II, whose main details and interpretations are given as follows.

TABLE II. Work areas, descriptions and key points.

Work area: Information Compression [84]
Description: Compression scheme based on NMIM.
Key points:
• NMIM is regarded as a measure to weight the importance of code length.
• Longer codewords are allocated to the rare events rather than the large probability ones.
• Lower bounds of the code length in the sense of message importance.
• $W(l_i)$ is selected by different expressions such as $l_i^{-1}$ and $\gamma^{-l_i}$ ($\gamma$ is a constant).

Work area: Information Transmission [85]
Description: NMIM loss distortion function with respect to distortion $D$.
Key points:
• The upper bound of message importance loss caused by transmission distortion.
• NMIM-loss-distortion viewpoint consisting of the message importance loss $\phi(X, \hat{X})$, the distortion $D$ and the distribution of events.

Work area: Information Preprocessing [73]
Description: The test method for outlier detection based on information divergence $F(\cdot)$.
Key points:
• Distribution estimations obtained by the maximum likelihood estimator, k-nearest neighbor or Gaussian kernel estimator.
• Message identification divergence given by Eq. (3) as $F(\cdot)$ to detect the outlier sequences with small probability of occurrence.
• Information Compression: Although standard compression reduces redundant information to some degree [94], there still exist large volumes of data that contain unimportant messages. Further compression abandons the less vital messages based on the probability of events, which may be achieved by using the compression scheme based on NMIM [84]. In this case, lower bounds of the code length $l_i$ (with the limited total code length $C_n$) are obtained in the sense of message importance (based on the function of reconstruction error per unit importance, denoted by $W(l_i)$).
• Information Transmission: As far as big data is concerned, the dominant part of the message with more importance is favored over the redundant message. In traditional information transmission, distortions or errors may have more disastrous impacts on important messages than on worthless ones. For instance, based on this characteristic, the strategy of unequal error protection (UEP) codes has been proposed as a reliable transmission approach [97]-[99]. From the new viewpoint of rare events, data transmission under the constraint of message importance loss is discussed to guide the design of information transmission [85]. In particular, the upper bound of the message importance loss $\phi(X, \hat{X})$ (based on the NMIM operator $L_{non}(\cdot)$) is given when there exists a distortion $d(x, \hat{x})$ (such as the Hamming distortion) between a source $X$ and a distorted source $\hat{X}$.
• Information Preprocessing: In information preprocessing, information divergences play vital roles in discriminating different distributions (namely information identification). That is, an information divergence can be used as a test tool for outlier detection [73], [74], [100]. In particular, an information divergence between two distributions, denoted by $F(\cdot)$, can classify the pending sample sequences $X^{(i)}$ into the normal sequence set $M_t$ or the outlier sequence set $M_f$. In fact, the message identification (MI) divergence has an advantage in outlier detection [73]; its definition is given by Eq. (3), in which the adjustable coefficient is positive and $p$ and $q$ are two finite probability distributions on the same support set. Here, we also take two Bernoulli distributions $P$ and $Q$ as examples to compare different information divergences, as shown in Fig. 4. It illustrates that the MI divergence in Eq. (3) is more sensitive in distinguishing two distributions than the Kullback-Leibler (KL) divergence and the squared Euclidean distance when the distribution $P$ is close to $Q$.
Remark 1. i) For information compression: In data compression based on information measures for rare events, the common core idea is that the code length mainly depends on the message importance of events. That is, most of the code size is assigned to the small probability events. In this case, a smaller part of the storage can be used to save much more important information.
ii) For information transmission: Compared with traditional communication, the transmission of big data has its own characteristics, such as a larger volume of data, a wide variety of events, and the value of information. Thus, it is sensible to preserve more information importance while reducing redundant information. In fact, the NMIM can be used as an efficient information importance measure to design rules for communication systems.
iii) For information preprocessing: In information preprocessing, it is possible to analyze the performance of different divergences in distinguishing distinct distributions. In particular, the MI divergence is a superior divergence for discerning a typical distribution from its adjacent distributions caused by rare events.
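The divergence-based test for outlier sequences can be sketched in a few lines. This is a simplified illustration under stated assumptions: the plug-in (empirical) distribution is used as the estimate, the KL divergence and the squared Euclidean distance (the two comparators named above) stand in for $F(\cdot)$ rather than the MI divergence, and the threshold value and function names are hypothetical.

```python
import math
from collections import Counter


def empirical(seq, alphabet):
    """Plug-in estimate of the distribution of `seq` over `alphabet`."""
    counts = Counter(seq)
    n = len(seq)
    return [counts[a] / n for a in alphabet]


def kl_divergence(p, q):
    """D(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def squared_euclidean(p, q):
    """Squared Euclidean distance between two probability vectors."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def split_sequences(sequences, reference, alphabet, divergence, threshold):
    """Assign each sequence to the normal set or the outlier set."""
    normal, outliers = [], []
    for seq in sequences:
        score = divergence(empirical(seq, alphabet), reference)
        (outliers if score > threshold else normal).append(seq)
    return normal, outliers


reference = [0.9, 0.1]                 # typical distribution over {0, 1}
typical_seq = [0] * 18 + [1] * 2       # empirical distribution (0.9, 0.1)
atypical_seq = [0] * 10 + [1] * 10     # empirical distribution (0.5, 0.5)
normal, outliers = split_sequences(
    [typical_seq, atypical_seq], reference, (0, 1), kl_divergence, threshold=0.1
)
print(len(normal), len(outliers))
```

Swapping `kl_divergence` for `squared_euclidean` (with a rescaled threshold) changes only the sensitivity of the test, which is the kind of comparison the text draws between divergences.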

III. APPLICATIONS IN DATA ANALYTICS OF IOTS
For rare event analytics in IoTs, it is necessary to reduce the dimension and to estimate the distributions and their functionals efficiently. That is, we should adopt methods that save computing resources and improve the efficiency of data utilization [101]-[103]. Moreover, it is also necessary to analyze the relationships among rare events so that more valuable information can be dug out [104]-[106]. From the perspective of information theory, some approaches are discussed to deal with numerous information sources and to perform data mining. Considering the relationship between information theory and big data analytics, we design an architecture based on information measures for rare events, as shown in Fig. 5, whose details are summarized as follows:
i) Focusing on rare events: Rare events with small probability may contain more valuable information in some applications such as outlier detection and emergency alarm. In this case, it is necessary to define the rare events in a specific scenario as the first step.
ii) Selecting an information measure: An appropriate information measure can be adopted to characterize the distribution and highlight the importance of rare events. This is a mathematical representation of small probability events in the sense of message importance.
iii) Dimension reduction and efficient estimation: For sample processing, it is essential to extract the most significant low-dimensional information from the original high-dimensional data. Especially in the case of rare events, we can use low-dimensional samples and estimate the selected information measure to decrease the computational complexity.
iv) Analyzing relationships: For big data processing, it may be efficient to analyze the relationships among rare events by use of information measures.
In the architecture of data analytics for rare events, the information measures are those discussed in Section I. We now introduce some applications of information measures in big data analytics as follows.

A. Efficient Estimation of Information Measures
From the perspective of big data, efficient methods for estimating information measures are essential, especially for considerably large alphabet sizes. However, conventional estimation approaches do not work well [107]-[110], since rare events cannot be observed accurately when the number of samples is not very large. It is also worth investigating asymptotics in high dimension, especially when the number of samples is not much larger than the dimension. Accordingly, some related works are listed in Table III, whose details are described as follows.
• Estimation of Distributions: Based on some risk functions, different distribution estimations are investigated, which play crucial roles in information measure estimation [111]-[115]. For example, in the case that the alphabet size $S$ increases with the number of samples $n$, a minimax estimation
• Estimation of functionals of distribution: When the unknown support size $S$ is not smaller than, or is even larger than, the number of samples $n$, a general methodology based on the minimax estimator is presented to estimate the functionals of the distribution [116], [117]. Compared with the minimax estimator with non-smooth and smooth regions, the MLE is strictly sub-optimal for large supports [118]-[121].
• Entropy Estimation: As a widely used information measure, entropy is especially worth estimating. An adaptive estimation framework is adopted to achieve the minimax rates despite the unknown support size $S$ of the distribution [122]. Besides, the estimator based on the best polynomial approximation has the same performance [123]. Moreover, an inferior estimator is constructed by use of Dirichlet prior smoothing, which is similar to the MLE but not as good as the above two [124]. In addition, a weighted ensemble of plug-in estimators is proposed to protect the estimation results from decaying as the sample dimension increases [125].
• Information Divergences Estimation: As a class of information measures, information divergences such as the KL divergence, the Hellinger distance and the $\chi^2$-divergence can be estimated in similar ways [126]-[131]. In this regard, an augmented plug-in estimator and a methodology combining polynomial approximation with the plug-in rule are constructed to achieve the consistent estimator and the minimax rate-optimal estimator, respectively [132]. Moreover, an optimally weighted ensemble estimator is also designed, which performs well in high-dimensional cases [133].
In fact, the above classification is based on the work areas of estimation. There also exist some common criteria that unify these estimators [117], [121], whose details are discussed as follows.
i) The maximum risk: Essentially, the MLE of distributions or their functionals complies with the maximum risk criterion, which is given by
$$\sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left[ D_{error}\big(\hat{F}, F(P)\big) \right], \quad (4)$$
where $D_{error}$ denotes an error metric such as the one-norm or the two-norm, $F(P)$ is a function of the distribution $P \in \mathcal{M}_S$ with support size $S$, and $\hat{F}$ is the estimate of $F(P)$. In general, the MLE of distributions can be regarded as the fundamental plug-in estimator, which is given by
$$\hat{p}_i = \frac{1}{n} \sum_{j=1}^{n} \mathbb{1}(Z_j = i), \quad i = 1, 2, ..., S,$$
where $Z_j$ ($j \in \{1, 2, ..., n\}$) denotes the sample value, $n$ is the number of samples and $S$ is the support size. Furthermore, we can substitute $\hat{p}_i$ into the functionals, including $F(P) = P$ (namely the distribution itself), to obtain the estimate of the functionals of the distribution. Moreover, as another example of the MLE, the Dirichlet prior smoothing estimator is similar to the plug-in estimator in the case of maximum squared risk, and is given by
$$\hat{P}_{\alpha} = \frac{n \hat{P} + \alpha}{n + \sum_{i=1}^{S} \alpha_i},$$
where $S$ is the alphabet size, $\hat{P}$ is the empirical distribution, and $\alpha = (\alpha_1, \alpha_2, ..., \alpha_S)$ denotes the adjustable parameter vector. Besides, the weighted ensemble of plug-in estimators also belongs to the MLE family, and is defined by
$$\hat{F}_e = \sum_{l \in \bar{l}} \lambda_l \hat{F}_l,$$
where $\hat{F}_l$ is the plug-in estimator or its function, $\bar{l} = \{l_1, l_2, ..., l_L\}$ is a set of parameters and $\lambda_l$ denotes the weight value. In this estimator, the weights can be adjusted flexibly by using different optimality rules.
ii) The minimax risk: The minimax estimator for distributions or information functionals is based on the criterion of minimizing the maximum risk, which is given by
$$\inf_{\hat{F}} \sup_{P \in \mathcal{M}_S} \mathbb{E}_P \left[ D_{error}\big(\hat{F}, F(P)\big) \right],$$
in which the notations are the same as those in Eq. (4). As an instance of the minimax estimator, an approach based on the polynomial approximation rule is proposed, which treats the estimation problem as two cases, "small $p_i$" and "large $p_i$" ($p_i$ denotes the probability element). In the case of "small $p_i$", the best polynomial approximation is used to guide the estimation, which is given by
$$\psi_K^{*} = \arg\min_{\psi \in \Psi_K} \max_{x \in \Omega} \left| g(x) - \psi(x) \right|,$$
where $g(x)$ is the objective function and $\Psi_K$ is the set of polynomials of order no more than $K$ on the domain $\Omega$. Moreover, in the case of "large $p_i$", the estimate can be obtained by use of a kind of MLE such as the plug-in estimator.
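The plug-in and Dirichlet prior smoothing estimators discussed under the maximum risk criterion can be sketched as follows. This is an illustrative example only: it assumes the standard additive form (count_i + α_i) / (n + Σ α_i) for the smoothed estimate, and the function names and the toy sample are hypothetical.

```python
import math
from collections import Counter


def plug_in_distribution(samples, support):
    """MLE (plug-in) estimate: relative frequency of each symbol."""
    counts = Counter(samples)
    n = len(samples)
    return [counts[s] / n for s in support]


def dirichlet_smoothed_distribution(samples, support, alpha):
    """Dirichlet prior smoothing (assumed form): (count_i + alpha_i) / (n + sum(alpha))."""
    counts = Counter(samples)
    total = len(samples) + sum(alpha)
    return [(counts[s] + a) / total for s, a in zip(support, alpha)]


def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


# Toy sample in which the rare symbol "c" is never observed.
samples = "aaaaaaab"
support = "abc"
plug_in = plug_in_distribution(samples, support)
smoothed = dirichlet_smoothed_distribution(samples, support, alpha=[1.0, 1.0, 1.0])

print(plug_in, entropy(plug_in))
print(smoothed, entropy(smoothed))
```

The plug-in estimate assigns zero probability to the unobserved rare symbol, whereas the smoothed estimate keeps it positive; this is the small-sample regime in which the text notes that conventional estimators struggle.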
Moreover, to assess the reliability of the estimators based on these criteria (the minimax risk or the maximum risk of the MLE), it is necessary to compare their performance in some specific cases. Here, the results of estimating some classical information measures are summarized in Table IV, in which $H(P) = -\sum_{i=1}^{S} p_i \ln p_i$ denotes the Shannon entropy, $H_\xi(P) = \sum_{i=1}^{S} p_i^{\xi}$ ($\xi > 0$) is the dominant part of the Rényi entropy, $S$ is the support size, $n$ is the number of samples, and the notation $a_k \lesssim b_k$ denotes $\sup_k a_k / b_k \le A$ ($A$ is a constant). It is remarkable that the performance of the minimax estimator with $n$ samples equals that of the MLE with $n \ln n$ samples in the case of small probability estimation, which is called "effective sample size enlargement".

B. Dimension Reduction Based on Information Coupling
In the era of big data, "dimension reduction" is a buzzword that appears in many fields such as machine learning, data mining and computer vision. To address this problem, more and more new techniques are being developed, including principal component analysis, independent component analysis and regression analysis [134]-[137]. Besides, many practical algorithms enable these newly developed approaches to be used in many applications [138], [139]. However, these approaches are all designed from the viewpoint of the data space rather than the intrinsic information flow.

TABLE IV. Squared error rates for estimating functionals of distributions [111], [117], [121], [127], [128]. Columns: functional of distribution; minimax squared error rates; maximum squared error rates of MLE. For $H_\xi(P)$ with $\xi \ge 3/2$, both rates are $1/n$.

On the contrary, the information coupling based on information measures is discussed to construct a framework for information-centric data processing. In fact, analyzing the information exchange process among related data nodes by use of information coupling is a novel view.
Mathematically, information coupling can be formulated in a fundamental communication scenario, where the input $X$ contributes to the output $Y$ through a transition probability matrix $W_{Y|X}$. In a typical communication system, a message $U$ forms a Markov chain $U \to X \to Y$ with the input $X$ and the output $Y$, where the message $U$ is encoded into the input $X$. To design an efficient encoding scheme, it is usual to maximize the mutual information $I(U; Y)$, which depends on the distribution $P_U$ and the conditional distributions $P_{X|U=u}$. Similarly, the information coupling is to maximize the objective function $I(U; Y)$ constrained by a small mutual information $I(U; X)$. The constraints require that the conditional distributions $P_{X|U}(\cdot|u)$ are neighbors of the marginal distribution $P_X$. That is, the information coupling [140] maximizes $I(U; Y)$ over the Markov chains $U \to X \to Y$ subject to these constraints, which are controlled by a parameter $\sigma$ that is small enough.
In practice, the solution of the information coupling optimization problem provides a theoretically optimal result for dimension reduction from the perspective of information correlation. This can guide us to approximate the optimum by using low-dimensional information to represent the high-dimensional data. Specifically, suppose that there exist a hidden source sequence $x^n = \{x_1, x_2, ..., x_n\}$ following the distribution $P_X$, an observed sequence $y^n = \{y_1, y_2, ..., y_n\}$ following the distribution $P_Y$, and a transfer matrix $W_{Y|X}$ between the input $X$ and the output $Y$. To infer the hidden source $X$ from $Y$, we usually require a sufficient statistic of $y^n$ containing the whole information of $x^n$. However, it is difficult to compute this statistic when $x^n$ and $y^n$ have high-dimensional structures. To reduce the dimension, we would like to acquire a statistic from the observation $y^n$ that characterizes a certain feature of $x^n$. According to the information coupling, a feature $U$ in $x^n$ is extracted most efficiently from the observed data $y^n$ in terms of the maximized mutual information $I(U; Y)$, which corresponds to the solution of this optimization problem. This efficient statistic based on the feature $U$ can be considered as a low-dimensional label containing the most significant information of the high-dimensional data, which implies an information-theoretic method to reduce dimension [141].
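The two quantities that information coupling trades off, $I(U; X)$ and $I(U; Y)$, can be evaluated directly for a small Markov chain $U \to X \to Y$. The sketch below is only a numerical illustration, not the optimization of [140]: the binary alphabets, the perturbation size `eps` and the channel matrix are hypothetical choices that make $P_{X|U=u}$ neighbors of $P_X$.

```python
import math


def mutual_information(joint):
    """I(A; B) in nats from a joint pmf given as a nested list joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        v * math.log(v / (pa[a] * pb[b]))
        for a, row in enumerate(joint)
        for b, v in enumerate(row)
        if v > 0
    )


def joint_u_x(p_u, p_x_given_u):
    """Joint pmf of (U, X)."""
    return [[p_u[u] * pxu for pxu in p_x_given_u[u]] for u in range(len(p_u))]


def joint_u_y(p_u, p_x_given_u, w_y_given_x):
    """Joint pmf of (U, Y) for the Markov chain U -> X -> Y."""
    n_x, n_y = len(w_y_given_x), len(w_y_given_x[0])
    return [
        [
            p_u[u] * sum(p_x_given_u[u][x] * w_y_given_x[x][y] for x in range(n_x))
            for y in range(n_y)
        ]
        for u in range(len(p_u))
    ]


eps = 0.05                                   # small local perturbation (assumption)
p_u = [0.5, 0.5]
p_x_given_u = [[0.5 + eps, 0.5 - eps],       # both rows are close to P_X = (0.5, 0.5)
               [0.5 - eps, 0.5 + eps]]
w_y_given_x = [[0.9, 0.1], [0.1, 0.9]]       # binary symmetric channel W_{Y|X}

i_ux = mutual_information(joint_u_x(p_u, p_x_given_u))
i_uy = mutual_information(joint_u_y(p_u, p_x_given_u, w_y_given_x))
print(i_ux, i_uy)
```

With a small perturbation, $I(U; X)$ stays small and $I(U; Y)$ cannot exceed it, as the data processing inequality requires; information coupling looks for the perturbation direction that keeps $I(U; Y)$ as large as possible under that budget.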
Remark 2. Actually, it is not difficult to see that the information coupling is an efficient tool for statistics, which can extract the significant information from high dimensional original data.This can correspond to the goal of the dimension reduction and feature extraction for the rare events, which may use φ(U ; X) = L(p U ) − L(p X ) to replace I(U ; X) to take the message importance transfer quantifying.

C. Directed Information for Relationship Analysis
Directed information, derived from information theory, is a commonly used approach for identifying the interplay and causality between two stochastic processes [142]- [147]. Furthermore, it is also rational to adopt this approach to analyze stochastic processes with rare events. Some details of directed information are given as follows.
In order to solve the causality problem in information systems [148], [149], an information measure referred to as "directed information" is defined as

$$I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1}),$$

where $X^n = (X_1, X_2, ..., X_n)$ and $Y^n = (Y_1, Y_2, ..., Y_n)$ are random sequences, $X^i$ and $Y^{i-1}$ denote their prefixes, $X_i$ and $Y_i$ (i = 1, 2, ..., n) are random variables, and I(•) denotes the mutual information. Moreover, due to the fact that the upper bound of the feedback channel capacity can be obtained by maximizing the normalized directed information [150], [151], another formulation of directed information has been given, which is obtained by use of the side information $(X^{i-1}, Y^{i-1})$ [152].
Furthermore, this information measure has been adopted in some applications of relationship analysis, such as computational biology with intrinsic causality [153], [154], the prediction of rate distortion [155] and data compression with causal side information. Besides, directed information provides an upper bound for the growth rates of optimal portfolios, which can also tightly bound horse race gambling [142]. Notably, directed information can also measure the best error exponent for hypothesis testing, which may be involved in rare events identification.
Remark 3. Directed information is an efficient information measure which can interpret the causality transfer between two variables. This measure provides a significant tool to analyze causal side information. Besides, it also plays a crucial role in dealing with inference problems involving causal influence factors. A similar approach is needed to extend the MIM, which may bring some new insights into the discussion of message importance.
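To illustrate how directed information separates the two directions of influence, the following minimal sketch evaluates $I(X^n \to Y^n) = \sum_{i=1}^{n} I(X^i; Y_i \mid Y^{i-1})$ on a toy example. It assumes this standard causally conditioned sum as the definition. The joint distribution (a one-step delay channel with $Y_2 = X_1$), the array layout, and all function names are hypothetical choices made for illustration only.

```python
import itertools
import numpy as np

def entropy(p, keep):
    """Shannon entropy (bits) of the marginal of joint pmf `p` on the axes in `keep`."""
    drop = tuple(ax for ax in range(p.ndim) if ax not in keep)
    marginal = p.sum(axis=drop) if drop else p
    marginal = np.asarray(marginal).ravel()
    marginal = marginal[marginal > 0]
    return float(-(marginal * np.log2(marginal)).sum())

def cond_mutual_info(p, a, b, c):
    """I(A; B | C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), with axes given as lists."""
    a, b, c = set(a), set(b), set(c)
    return (entropy(p, a | c) + entropy(p, b | c)
            - entropy(p, a | b | c) - entropy(p, c))

def directed_information(p, n):
    """Directed information from the first n axes (X_1..X_n) to the last n axes (Y_1..Y_n)."""
    total = 0.0
    for i in range(1, n + 1):
        x_prefix = range(i)             # X^i
        y_now = [n + i - 1]             # Y_i
        y_past = range(n, n + i - 1)    # Y^{i-1}
        total += cond_mutual_info(p, x_prefix, y_now, y_past)
    return total

# Toy joint pmf with axes (x1, x2, y1, y2): X1, X2 and Y1 are independent fair bits
# and Y2 = X1, so X influences Y with one step of delay (assumed example).
p = np.zeros((2, 2, 2, 2))
for x1, x2, y1 in itertools.product(range(2), repeat=3):
    p[x1, x2, y1, x1] = 1 / 8

di_forward = directed_information(p, 2)                          # X^2 -> Y^2
di_backward = directed_information(p.transpose(2, 3, 0, 1), 2)   # Y^2 -> X^2
print(di_forward, di_backward)
```

In this toy case the forward quantity is one bit while the backward quantity vanishes, which reflects the asymmetry that ordinary mutual information cannot capture.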

Fig. 6. Architecture of small probability events (namely rare events) processing based on information measures for challenging applications.

D. Rare Events Detection for Probability Derivation Process
In the data mining of IoTs, some scenarios, such as urban abnormal pattern recognition as well as fire early warning and detection, can be treated as probability derivation processes which may be characterized and analyzed by means of information theory. It is worth noting that rare events detection lies in the intersection of the probability derivation process and the practical applications related to information measures. This problem has been investigated from many perspectives. In particular, the common methods of rare events detection are based on specific models or frameworks [156]- [160], such as Bayesian network anomaly detection, anomaly pattern classification in images, as well as the definition of normal behaviors for data points or groups.
As a typical probability derivation process, urban abnormal events detection has been investigated widely, which may provide advice for governments and communities in smart city planning and management.
In this regard, spatio-temporal data or multiple data sources are used to detect rare events of urban traffic states, such as mining uncommon trajectories of people, detecting road traffic anomalies [161], as well as identifying anomalous regions or locations [162]- [164]. The essential idea of these approaches is to construct a conditional probability model based on a Hidden Markov process or the Maximum Likelihood rule to detect or predict anomalous events. That is, the underlying distribution of rare patterns can be obtained from probabilistic models which are constructed based on the different patterns of spatio-temporal data.
Moreover, message measures based on similarity and correlation also play crucial roles in identifying urban abnormal events [165], [166]. For instance, L −∞ distance is adopted as a kind of similarity measurement to evaluate the degree of anomalous traffic [167]. Besides, KL divergence is also commonly used as a metric to measure correlation [168], [169]. In video surveillance systems of urban traffic states, when a small video clip is represented as a histogram of a multi-set bag of codewords by using a Fourier-based trajectory feature descriptor [168], KL divergence is applied to classify the pending video clips into normal or abnormal ones. In the corresponding metric based on KL divergence, $D_{KL}(\cdot)$ denotes the operator of KL divergence, $p(v_i|c = 1)$ and $p(v_i|c = 0)$ are probability elements from the codewords of normal video clips and abnormal ones (the corresponding distributions are $P_1$ and $P_0$), and $q_i$ denotes the probability element from the codewords of pending video clips (the corresponding distribution is Q). Furthermore, a spatio-temporal detector for the mixture of dynamic textures (MDT) model is proposed, in which the center-surround saliency detection is based on the KL divergence between feature responses and event class labels [169]. In this detector, $p^i_{X|c}$ are class-conditional densities (based on the class c ∈ {0, 1}), $p^j_X$ are sample densities, $\pi^c_j$ and $\omega_j$ are parameters, and $K_c$ (c ∈ {0, 1}) denotes the number of samples in the corresponding class c.
Similar to the KL divergence, the message measures mentioned in Section I may also be efficient in rare events detection for spatio-temporal data and may perform better on some special data sets, which can be investigated further in probability derivation processes.
Remark 4. Some message measures reveal the similarity or correlation in probability derivation processes. Specifically, these measures can be regarded as criteria for urban abnormal events mining. In general, it is promising to make good use of novel information measures to extend the strategies of rare events detection.
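As a rough illustration of how KL divergence can serve as a criterion for abnormal events mining, the sketch below labels a pending codeword histogram Q by comparing $D_{KL}(Q \| P_1)$ with $D_{KL}(Q \| P_0)$. This nearest-distribution rule is an assumption chosen for simplicity and is not claimed to be the exact metric of [168]. The histograms, the smoothing constant `eps`, and the function names are likewise hypothetical.

```python
import numpy as np

def kl_divergence(q, p, eps=1e-12):
    """D_KL(q || p) in nats for two discrete distributions over the same codebook."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    # Small additive smoothing keeps the logarithm finite if p has empty bins
    # (an illustrative choice, not taken from the referenced works).
    p = (p + eps) / (p + eps).sum()
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

def classify_clip(q, p_normal, p_abnormal):
    """Label a pending clip by the closer class distribution in the KL sense."""
    d_normal = kl_divergence(q, p_normal)      # distance to P_1 (c = 1, normal)
    d_abnormal = kl_divergence(q, p_abnormal)  # distance to P_0 (c = 0, abnormal)
    label = "normal" if d_normal <= d_abnormal else "abnormal"
    return label, d_normal, d_abnormal

# Hypothetical codeword histograms over a four-word codebook.
p_normal = [0.70, 0.20, 0.08, 0.02]
p_abnormal = [0.10, 0.15, 0.25, 0.50]
q_pending = [0.15, 0.10, 0.30, 0.45]

label, d_normal, d_abnormal = classify_clip(q_pending, p_normal, p_abnormal)
print(label, round(d_normal, 3), round(d_abnormal, 3))
```

Other divergences could be substituted for `kl_divergence` without changing the surrounding decision logic.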

IV. FUTURE CHALLENGES
Considering future research directions, new approaches and challenging applications can promote the development of information measures with respect to rare events. By combining big data analytics, an architecture of rare events processing based on information measures is constructed as shown in Fig. 6.
In particular, we can apply big data analytics and information measures in the challenging scenarios involved with rare events, including smart cities, autonomous driving, and detection in IoTs.
Actually, in the above applications, the common technique playing a core role is rare events detection.
Here, we design a technology framework from the viewpoint of information measures to help detect rare events, as shown in Fig. 7. To be specific, assume there exist two different kinds of message sequences in the data set, that is, the data set consists of two message sources X and Y with different distributions. In this case, the message sequences from the message source Y are considered as the rare events. The goal of our framework is to detect the message sequences of Y. Our core idea is to make use of information measures such as KL divergence, Rényi divergence and f-divergence to distinguish the two kinds of information distributions. In this case, how to design efficient information measures is a fundamental problem in the first step. Moreover, once an information measure is obtained, we also need to analyze the samples in the message sequences and adopt efficient methods to estimate the information measure.
Furthermore, the estimated results can be classified by resorting to machine learning algorithms so that we can make a decision for rare events detection.
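The steps of this framework (choose an information measure, estimate it from samples, then decide) can be sketched as follows. The sketch assumes a known reference distribution for the source X, a smoothed plug-in estimator, the Rényi divergence of order 2, and a fixed threshold in place of a learned classifier. All names, distributions and parameter values are hypothetical.

```python
import numpy as np

def empirical_pmf(samples, alphabet_size, smoothing=0.5):
    """Plug-in estimate of a pmf with additive smoothing (assumed estimator)."""
    counts = np.bincount(samples, minlength=alphabet_size).astype(float)
    return (counts + smoothing) / (counts.sum() + smoothing * alphabet_size)

def renyi_divergence(p, q, order=2.0):
    """Renyi divergence D_order(p || q) in nats, for order > 0 and order != 1."""
    return float(np.log(np.sum(p ** order * q ** (1.0 - order))) / (order - 1.0))

def detect_rare_sequences(sequences, reference_pmf, threshold, order=2.0):
    """Score each sequence against the reference source and flag large divergences."""
    k = len(reference_pmf)
    scores = [renyi_divergence(empirical_pmf(s, k), reference_pmf, order)
              for s in sequences]
    flags = [score > threshold for score in scores]
    return flags, scores

# Hypothetical sources: X is the common source, Y generates the rare sequences.
rng = np.random.default_rng(0)
p_x = np.array([0.6, 0.3, 0.1])
p_y = np.array([0.1, 0.2, 0.7])
sequences = [rng.choice(3, size=200, p=p_x) for _ in range(9)]
sequences.append(rng.choice(3, size=200, p=p_y))  # one rare sequence from Y

flags, scores = detect_rare_sequences(sequences, p_x, threshold=0.5)
print(flags)
```

In practice the thresholding step would be replaced by a trained classifier operating on the estimated measures, and the estimator would be chosen according to the alphabet size and the sample budget.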
In addition, it is promising to measure rare events based on message importance and then analyze the relationship among the big data. The emerging applications related to big data require new ways to deal with anomaly detection or small probability events mining. To this end, we summarize some challenges and perspectives associated with rare events processing, which can be future research directions for information measures as shown in Table V.

A. Smart Cities
1) Anomaly Detection for Urban Monitoring Data: As a typical application of big data, the smart city has been evolving rapidly with the increase of urban population. This implies that cities can be monitored by countless devices in many aspects such as road traffic, transportation management, environment monitoring, healthcare, etc. In cities, it is significant to detect the anomalies with small probability, which may provide effective guidance or warning information.
In order to investigate the anomaly detection problem in smart cities [170], the major challenges are listed as follows.
• Security problems in the urban monitoring systems with wireless sensor networks (WSN).
• The way to optimize the validity and reliability of transportation schedule system by avoiding the anomalies.
• The long time prediction for the regular pattern of cities.
• To distinguish the unexpected events from popular anomalies.
• Automatic anomaly detection algorithms for the urban monitoring systems with IoTs.
In fact, anomaly detection (or rare events detection) can be carried out in many ways including machine learning, signal analysis and even information theory. Indeed, there are some specific methods to detect the anomalies in smart cities, which may overcome the above challenges from different perspectives. Particularly, in order to improve the security of the WSNs in urban monitoring systems, a non-intrusive architecture is proposed to detect attacks by use of the support vector machine (SVM) [171]. Moreover, for the IoTs of smart cities, by using automatic clustering or classification, the events with low probability can be identified in many applications such as the car parking scenario, polluted
Notably, it is promising to exploit information theory to deal with anomaly detection by emphasizing the importance of rare events. Combined with machine learning techniques, the importance measures focusing on rare events may provide new ways to cope with anomaly detection and the evaluation of post processing, which play a vital role in smart cities.
2) Detecting Urban Black Holes: As an important part of smart cities, the urban black hole denotes a region in which the whole traffic inflow is larger than the whole traffic outflow. The urban black hole can reflect emergencies or irregular events, namely rare events, including disasters, accidents, as well as traffic jams or congestion [177], [178]. It is worth detecting urban black holes efficiently, which has a beneficial effect on urban safety. Therefore, some approaches are investigated as follows.
• Graph Clustering: With regard to the graph clustering, the approaches with the pruning schemes and the random matrix are proposed to characterize the potential black holes in a directed graph [179].
Besides, there are some other approaches detecting black holes by means of different measures [180], [181] such as attribute, modularity and density.
• Dynamic Graph Detection: To detect black holes emerging in dynamic graphs, some efficient approaches are proposed by means of the increment, pattern trees, and the pattern recognition with constraints [182], [183].
• Groups Moving Recognition: On one hand, the density of regions is used to discover the object groups beyond the threshold during the observation time [184]. On the other hand, moving together behaviors during a given time period are investigated to find out the tracking of a group of objects [185], [186].
• Spatio-Temporal Graph: Based on the spatio-temporal graph, some approaches are presented to mine spatial urban black holes [187], as well as to detect the tracking of data temporally and spatially [188].
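The definition of an urban black hole given above (a region whose total traffic inflow exceeds its total traffic outflow) can be checked directly on a directed traffic graph. The following minimal sketch assumes the traffic is given as weighted directed edges between locations. The edge list, the region, the threshold and the function names are hypothetical, and the sketch does not reproduce the pruning or clustering schemes of the cited approaches.

```python
def boundary_flows(edges, region):
    """Return (inflow, outflow) of traffic crossing the boundary of `region`.

    `edges` is an iterable of (source, destination, flow) tuples; traffic that
    stays inside the region or stays outside it is ignored.
    """
    region = set(region)
    inflow = sum(f for u, v, f in edges if u not in region and v in region)
    outflow = sum(f for u, v, f in edges if u in region and v not in region)
    return inflow, outflow

def is_black_hole(edges, region, threshold=0):
    """A region is flagged when its net inflow exceeds `threshold`."""
    inflow, outflow = boundary_flows(edges, region)
    return inflow - outflow > threshold

# Hypothetical traffic counts between five locations during one time slot.
edges = [
    ("a", "b", 120), ("c", "b", 80), ("b", "d", 30),
    ("d", "a", 40), ("b", "e", 10), ("e", "b", 25),
]

print(boundary_flows(edges, {"b", "e"}))
print(is_black_hole(edges, {"b", "e"}, threshold=100))
print(is_black_hole(edges, {"a"}, threshold=100))
```

Searching over candidate regions is the expensive part in practice, which is where the graph clustering and pruning strategies listed above come in.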
From the perspective of probability distribution, it is possible to use information measures to find urban black holes which may be described by graph methods. To do so, a detection scheme for the smart city is shown in Fig. 8, whose details are as follows.
Specifically, data from the monitoring system and database are used to detect the emergency events or accidents which can be regarded as the rare events. Next, we can apply the information measures to analyze the relationships among events. By using data analysis and processing, the control center predicts the state of the city. This can be adopted as a reference to update the system model and database.
Moreover, when anomalies are detected or the system breaks down, the control center can reset the system. If anomalies or accidents are detected or some unsolved emergencies are reported, the control center will take measures to handle them. Then, if the system generates strategies for the anomaly detection, it will issue commands to executors to solve the problems. Besides, humans can also step in directly when the system fails to finish the work.

B. Autonomous Driving
As an important part of autonomous driving, obstacle detection has a great influence on warning of and predicting collisions and accidents [189]. However, it is still a challenge to accurately detect the obstacles or objects with small probability from the viewpoint of computer vision. In general, some key issues of autonomous driving are summarized as follows.
• Obstacles Detection: On one hand, some approaches are presented to characterize obstacles by use of image data [190], v-disparity histogram [191], as well as the models for the height-over ground [192]- [194]. On the other hand, deep learning tasks are used to detect obstacles by means of the image features and related information. Moreover, a technique called "6D-vision" is also put forward to discover the dangerous events on the roads [195], [196].
• Object Detection: There are some approaches to detect and track objects by means of classification or clustering [197]. The strategies and frameworks for object localization or tracking are also proposed depending on the Kalman filter [198] and deep convolutional neural networks [199]. Furthermore, some other approaches are designed by use of the trade-off between camera orientations prediction and monitoring techniques [200], [201].
• Detecting Road Surface and Lanes: As for road surface detection, the discriminant analysis (DA) is presented to characterize the road crack [202], [203]. This can provide a threshold for classification according to the road texture and color in images. Besides, in order to detect the road curb and lanes, it is common to regard color and texture as interesting features of roads. These can be used by combining classification with the hue-saturation-intensity (HSI) color space or red-green-blue (RGB) color space [204], [205]. In addition, another framework of road curb and lanes detection is addressed by extracting the 3D parameters from some curb models [206].
Moreover, there are some works proposed based on probabilistic approaches and learning strategies.
Gaussian process (GP) regression decomposition based on a superpixel-like algorithm is employed to validate quasi-constant velocity models which build a set of Kalman filters to identify the abnormal motions online [207]. A particle swarm optimization (PSO) and bacterial foraging optimization (BFO)-based learning strategy (PBLS) is presented to improve the classifier and loss function of the strengthened region proposal network (SRPN), which can be applied in object detection of autonomous driving [208].
A set of 3D object proposals based on an energy function are obtained to detect high-quality 3D objects by use of a convolutional neural net (CNN) [209].
Additionally, with regard to detection for autonomous driving, it is apparent that rare events play important roles in many aspects of the vehicular safety system. By measuring small probability, it is appropriate to apply information theory to autonomous driving detection. To do so, an obstacle detection scheme is shown in Fig. 9, whose details are given as follows. Based on the data from monitoring devices or radars, the autonomous control system can detect obstacles or other outlier events, which can be analyzed by use of information measures. If no obstacle is discovered, the system will continue the normal surveillance. However, if some obstacles are detected, the system will take measures to solve the problem by slowing down and choosing a new route. When the emergencies are not solved well, it will apply the brake and report them to drivers for further commands.

C. Applications on Detection in IoTs
Outlier detection in IoTs aims to identify exactly the minority of sensor data [210], [211]. In fact, it is essential to differentiate the outlier data or observations from the normal data so that one can gain the warning information and prevent the outlier data from misleading us [212]. There exist a variety of studies focusing on outlier detection which are also considered for detecting rare events in IoTs systems.
On one hand, there are some approaches to detect outliers in IoTs directly, such as using the Jaccard coefficient or Euclidean distance as the criterion of decision making [213], referring to the expert knowledge on security [214], as well as monitoring the abnormal traffic among communication devices [215], etc. On the other hand, several studies divide observations into different groups to find the outliers by use of classification and clustering algorithms [216], [217]. To address this kind of matter, a few approaches also introduce static data series [218] or dynamic time series into the machine learning algorithms. Besides, a framework of data analysis is put forward by means of the recursive principal component analysis (R-PCA) [210], which provides another way to investigate the security of IoTs systems.
In light of the fact that the data observed from IoTs are usually fed to cloud service systems, some approaches are proposed by blending both IoTs and cloud technologies [219]. Moreover, to test IoTs systems conveniently, a new method is presented to emulate the environments of IoTs by means of a network emulator, which can improve the processing efficiency for outlier detection [220]. Furthermore, some probabilistic models and large-scale processing approaches are also exploited in the anomaly detection of IoTs. A statistical decision framework based on temporally correlated traffic is designed, which develops two low-complexity algorithms (based on the cross entropy method and the generalized likelihood ratio test) to achieve anomaly detection and attribution [221]. An adversarial statistical learning mechanism, outlier Dirichlet mixture-based anomaly detection systems (ODM-ADS), is presented to obtain legitimate profiles and discover suspicious anomalies [222]. Besides, two methods are proposed, namely a one-class support Tucker machine (OCSTuM) and an OCSTuM based on a genetic algorithm called GA-OCSTuM, which extend one-class support vector machines to tensor space to detect anomalies in IoTs [223].
However, in spite of many efficient approaches for outlier detection, few studies exploit the small probability character from the viewpoint of probability distribution. It is promising to make use of information measures to analyze the outliers of IoTs.
From the perspective of information theory, importance measures can provide a specific way to tackle the outlier detection problem by using the probability distribution, which is shown in Fig. 10 and whose details are as follows. The data collected by distributed sensors are used to detect the potential or ongoing outliers by resorting to information measures. If an outlier is detected and handled, the local center will continue to collect data and update the database. However, if a detected outlier is not handled well, the local center will contact executors to solve the problem and save the data to the database. If there is no answer to the request, the local center will report it to the control center.
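As a simple illustration of exploiting the small probability character of outliers, the sketch below scores each sensor reading by its self-information $-\log_2 p(x)$ under the empirical distribution and flags readings whose score exceeds a threshold. Self-information is used here only as a familiar stand-in for an importance measure and is not the MIM itself. The readings, the threshold and the function name are hypothetical.

```python
import math
from collections import Counter

def surprisal_outliers(readings, threshold_bits):
    """Return (index, reading) pairs whose empirical self-information exceeds the threshold."""
    counts = Counter(readings)
    n = len(readings)
    # Self-information of each distinct reading under the empirical distribution.
    surprisal = {value: -math.log2(count / n) for value, count in counts.items()}
    return [(i, r) for i, r in enumerate(readings) if surprisal[r] > threshold_bits]

# Hypothetical status stream from one monitoring sensor.
readings = ["ok"] * 97 + ["overheat"] + ["ok"] * 2

outliers = surprisal_outliers(readings, threshold_bits=5.0)
print(outliers)
```

A reading that occurs once in a hundred samples carries roughly 6.6 bits, so it is flagged, whereas the dominant reading carries almost none. The handling and reporting steps of the scheme would then operate on the flagged entries.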

V. CONCLUSION
In this paper, we gave a comprehensive review of information measures for rare events in big data. In order to characterize the importance of rare events, we summarized some message measures such as the parametric MIM and the non-parametric MIM, which emphasize small probability elements of a given distribution. These information measures are regarded as promising criteria or tools for statistical big data analytics. Furthermore, we showed that measures focusing on rare events can provide new ways for message processing such as compression and transmission. Moreover, some other applications in big data have been discussed, including efficient estimation, dimension reduction and relationship analysis.
Additionally, we showed that information measures for rare events could be applicable to some future research directions including smart cities, autonomous driving, and anomaly detection in IoTs. In these cases, there exist several future challenges of information measures, summarized as follows:
i) Data storage and low latency computation for the data sets containing rare events.
ii) Feature extraction and data cleaning for data containing rare events.
iii) Design of information theoretic criteria to measure distributions while considering the values of rare events.
iv) Efficient methods of information measure estimation.
v) Correlation and causality analysis based on information measures.
vi) Decision making strategies for rare events (or small probability events) mining.

Fig. 1. The interpretation of information importance measures focusing on rare events.

Fig. 3. Message processing architecture from the viewpoint of rare events.

Fig. 5. Architecture of data analytics based on message importance of rare events.

Fig. 8. The rare events detection scheme for the smart city.

Fig. 9. The obstacle detection scheme for the autonomous driving.

Fig. 10. The outlier detection scheme for the IoTs composed of monitoring sensors.

TABLE I. SUMMARY OF INFORMATION MEASURES FOR RARE EVENTS

TABLE II. SUMMARY OF MESSAGE PROCESSING APPLICATIONS

TABLE III. SUMMARY OF LITERATURE ON THE EFFICIENT ESTIMATIONS OF DISTRIBUTION AND ITS FUNCTIONAL

TABLE IV. PERFORMANCE OF MINIMAX ESTIMATOR AND MLE AND THE COMPARISON

TABLE V. PERSPECTIVE APPLICATIONS AND USE CASES OF RARE EVENTS PROCESSING