TVOR: Finding Discrete Total Variation Outliers among Histograms

Pearson's chi-squared test can detect outliers in the data distribution of a given set of histograms. However, in fields such as demographics (for e.g. birth years), outliers may be more easily found in terms of the histogram smoothness where techniques such as Whipple's or Myers' indices handle successfully only specific anomalies. This paper proposes smoothness outliers detection among histograms by using the relation between their discrete total variations (DTV) and their respective sample sizes. This relation is mathematically derived to be applicable in all cases and simplified by an accurate linear model. The deviation of the histogram's DTV from the value predicted by the model is used as the outlier score and the proposed method is named Total Variation Outlier Recognizer (TVOR). TVOR requires no prior assumptions about the histograms' samples' distribution, it has no hyperparameters that require tuning, it is not limited to only specific patterns, and it is applicable to histograms with the same bins. Each bin can have an arbitrary interval that can also be unbounded. TVOR finds DTV outliers easier than Pearson's chi-squared test. In case of distribution outliers, the opposite holds. TVOR is tested on real census data and it successfully finds suspicious histograms. The source code is given at https://github.com/DiscreteTotalVariation/TVOR.


I. INTRODUCTION
O UTLIERS can be defined as data patterns that do not conform to an expected normal data behavior [1]. Since identifying outliers or anomalies can often be useful, performing outlier, i.e. anomaly, detection has an important role in many data related areas. For example, with the ever growing application of machine learning in various fields, having clean training sets, free of any unwanted outliers, can often significantly benefit the final production accuracy. On the other hand, in real-time applications such as network traffic or health monitoring, it is usually highly important to detect anomalies that could represent any form of unwanted behavior to prevent their potentially detrimental effects. Alternatively, it may be required to see which samples differ the most from the rest of the data and study them in more detail.
Since there is a relatively high demand for anomaly and outlier detection methods in fields dealing with some form of data, numerous methods have been proposed for various applications, as can be seen in several review papers [1]- [3].
A particular kind of data are histograms. First introduced by Pearson [4], histograms are by definition estimates of the probability distribution of a continuous variable. If there is a sample of real numbers drawn from the same distribution and all inside a given interval, then histograms can be used as their simple representation, and are also suitable for visual presentation. For histograms to be useful, the bin size should be adjusted accordingly to the data being described [5]- [8]. In certain cases for a group of such histograms it may be interesting to know whether some of them are outliers. This may include histograms describing samples drawn from another distribution different from the one of the majority of the samples, but it may also include histograms just describing some less likely samples from the same distribution. To be clear, in such a case, histograms are not used as tools for outlier detection like in e.g. [9], but they are the data representations to be analyzed for the presence of outliers.
In the simple case when only a single histogram is given, instead of multiple histograms, a straightforward approach to check whether it represents a sample that differs from a given distribution would be to use the Pearson's chi-squared test [10]. It tests how likely it is that any observed difference between the bins counts of the given histogram and the VOLUME 4, 2016 1 arXiv:2012.11574v1 [stat.ME] 21 Dec 2020 expected bin counts occurred by chance. However, for this to work, it is required to know the expected bin counts.
On the other hand, if multiple histograms are given for samples that are assumed to have been drawn from the same distribution, then it is possible to find outliers among them by means of the Pearson's chi-squared test even if the distribution is unknown. Namely, under Glivenko-Cantelli theorem [11] all the given histograms, except the currently tested one, can be used to get a reliable empirical distribution function, which in turn can be used to get the expected bin counts. Over time, numerous other techniques that can be applied in the described cases have been proposed [12]- [15].
While the problem of finding outliers in terms of distribution is common, in some cases it is required to find histogram outliers in terms of some specific histogram property. For example, census data histograms are usually smooth, i.e. the difference between the counts of neighboring bins is relatively low, but in the presence of anomalies such as age heaping [16], this often stops being the case. One way to measure smoothness is to calculate total variation [17]. This means that by detecting deviations from the expected total variation it could be possible to detect smoothness outliers more easily than by means of some of the previously described techniques. Single-value properties similar to total variation in terms of simplicity, such as skewness, have already been used for outlier detection [18]. As a matter of fact, total variation has also found application in tasks such as classification [19] and outlier detection for graph signals [20]. Therefore, in this paper a new method for outlier detection in terms of discrete total variation (DTV) among histograms that describe samples drawn from the supposedly same, but unknown distribution is proposed. There are several contributions of this paper. First, it is mathematically proven that in terms of the underlying distribution there are only two possible cases of the relation between the sample size and the expected discrete total variation with the first case only being a special case of the second one. Second, a method is proposed that utilizes this relation to detect outliers that deviate from their expected discrete total variation. Third, it is shown that while the proposed method is not supposed to be used as a general outlier detector in terms of distribution, in some special cases it still performs better in this task than Pearson's chi-squared test. Fourth, the proposed method is shown to be able to detect suspicious histograms on real-life census data. The practical applicability and usefulness of the proposed method are shown on synthetic data and real-life census data. The proposed method is simple to implement and it does not require prior knowledge of any distribution.
The paper is structured as follows: in Section II the total variation is formally described, in Section III the theoretical derivation of the proposed method and its underlying model are given, in Section IV the experimental results obtained on synthetic data and historical real-life census data are presented and discussed, and Section V concludes the paper.

III. THE PROPOSED METHOD
In this section, the proposed method for finding discrete total variation outliers among histograms and the method's underlying model are described. In order to try to avoid any misunderstandings, the structure of this section has purposely been slightly extended. Section III-A gives the general idea of how to use the discrete total variation for outlier detection, Section III-B gives an initial statistical foundation, Sections III-C and III-D use this foundation to derive the relation between the sample size and its expected discrete total variation for two general cases, Section III-E uses this relation to propose the sample models based on the discrete total variation, Section III-F describes the score calculation, Section III-G explains how to combine all these results into a single method, and, finally, Section III-H names this method.

A. THE GENERAL IDEA
Let there be a sample of N values, x n its histogram with n bins, and x i the number of values that fell in the i-th bin with Each of the n bins has an arbitrary interval that can also be unbounded. The bins are not required to be of the same size. Let p i be the probability of a value falling in the i-th bin and Due to randomness the discrete total variation of x n , i.e.
x n V can differ for each sampling, but it should mostly not differ significantly from its expected value E [ x n V ] for a given N and probabilities p i . For a given x n the difference between its x n V and E [ x n V ] can serve as a score of how much the sample differs from the expected behavior. Such a score has several drawbacks as well as advantages.
The main disadvantage is that it is required to know E [ x n V ] for any given N or at least to know the relation between these two values for proper scaling and comparison.
The main advantage of such a scoring is the simplicity of its calculations due to the very definition of the discrete total variation. Further, because of that it is not necessary to know the desired sample distribution, which significantly widens the application possibilities. Finally, it is not very likely that two samples of the same size have histograms of the same or similar smoothness, i.e. discrete total variation and that their scores differ significantly. That means that if this score is calculated for every sample in a group of samples that are expected to have similar smoothness, then the ones with the highest scores can be considered as outlier candidates.
However, in order for this to be practically usable, first an analytical relation between N and E [ x n V ] has to be found.

B. THE STATISTICAL BACKGROUND
The first step in finding a relation between N and The value of x i for a given i has binomial distribution so that Var For the second term of the last form of Eq. (6) it holds that The result of Eq. (9) can now be further developed as follows: Combining Eq. (7), Eq. (8), and Eq. (10) develops Eq. (6) to (11) Based on the values of p i there are two cases of further actions for establishing a relation between N and E [ x n V ]. These two cases are covered in the following subsections.

C. UNIFORM DISTRIBUTION 1) Upper bound
The first case is when the distribution of the sample and consequently the distribution of the histogram are uniform so that the probability of a value falling in the i-th bin is then When this is applied to Eq. (11), it eliminates its first term and it simplifies its second term, which then gives the form Taking into account that the square root is a concave function and applying the Jensen's inequality [36] to Eq. (13) gives (14) This inequality can than be applied to all neighboring bins: Due to the basic properties of the expectation, it holds that Applying Eq. (3), Eq. (13), and Eq. (16) to Eq. (15) gives This gives the upper bound for the expected value of the discrete total variation and thus the first relation between N and E [ x n V ] if the sample numbers are uniformly distributed.

2) Exact values
Let F (n, N ) denote the expected value of the discrete total variation as a function of two key parameters n and N : Theorem 1: The exact value of F (2, N ) in closed form is The proof of Theorem 1 is given later in Appendix. VOLUME 4, 2016 It is relatively easy to show that for each r it holds that F (2, 2r) = F (2, 2r − 1) (20) and this leads to some unwanted consequences later on in the paper, but there they are mentioned and handled properly. The case of uniform distribution means that a histogram is a realization of the multinomial distribution and its bins x 1 , x 2 , ..., x n are random variables. The distribution of each it is binomially distributed with parameters N and 1 n . Variables x i are not independent, since their sum equals N . However, because of the symmetry, variables x 2 − x 1 , . . . , x n − x n−1 have the same distribution, which gives Before continuing, for the sake of convenience, first the notation for the multinomial coefficient has to be given as Theorem 2: The expected value of the total variation of a histogram of uniformly distributed values is calculated as The proof of Theorem 2 is given later in Appendix. By using Eq. (23) it is possible to calculate the expected total variation for all reasonable values of n and N with some examples being shown in Table 1. However, if using Eq. (23) turns out to be computationally too demanding, the solution is to develop and use some appropriate asymptotic forms.

3) Asymptotics
By taking into account the well-known asymptotic form of the central binomial coefficients that is commonly given as it follows that the asymptotic form of F (2, N ) is given as The experimental calculations suggest that the following hypothesis can be stipulated for the uniform distribution: The right side of this equation represents the sum of the discrete total variations of two-binned histograms of the uniform distribution with sample size being equal to the expected number of values. If this hypothesis is accepted, then the following asymptotic is true for the uniform distribution: In Table 1    Hypothesis 1 and the results of the numerical calculation furthermore suggest that the following hypothesis is true: Hypothesis 2: For each n ≥ 3, the function N → F (n, N ) is increasing and strictly concave, hence, for each 0 ≤ k ≤ N F (n, k) + F (n, N − k) < 2F (n, N 2 ).
Function N → F (2, N ) is nondecreasing, but it is not strictly concave, because as demonstrated by Eq. (20) its neighboring values can be equal. The proof of these two hypotheses may be very difficult, but they are not essential for the conclusions that are drawn later in the paper. The diagram in Fig. 1 shows the situation for n = 4 and 1 ≤ N ≤ 100.
Let F c (n, N ) denote the expected value of the the circular variation, which unlike the usual variation has an additional term |x 1 − x n | for the absolute value of the difference between the first and the last bin. F c (n, N ) is then defined as Applying Eq. (21) and adjusting the result for later use gives All possible histograms x n can be split into disjoint groups, according to the number of realizations which fall into the first n/2 bins. Let q k be the probability that these bins contain exactly k realizations. Because of the symmetry, q k = q N −k for each k. Since other n/2 bins contain exactly N − k realizations, the conditional distribution of the realizations in the first n/2 bins is again uniform. Having all this in mind and applying the partition theorem to Eq. (31) gives (32) Applying Eq. (28) and the equality N k=0 q k = 1 leads to the following inequality that holds for each even n > 4: Here n has to be greater than 4 because having n = 4 effectively leads to use of the function F (2, N ) on the right side of the inequality, and as explained earlier, this is inappropriate for Eq. (28). If n = k2 r where k ≥ 3 and r ≥ 0 are integers, then taking the inequality above recursively leads further to wherefrom for all suitable N and n it then follows that If k = 2 is taken, then the inequality is no longer necessarily valid because of the involvement of F (2, N ). However, the obtained form yields a better approximation of F c (n, N ) as wherefrom after applying Eq. (30) it then further follows that which in turn is an approximation stipulated in Hypothesis 1.

4) Approximation error
Fig . 2 shows the difference between the results of Eq. (23) and Eq. (27), which represent the exact and approximated values of F (n, N ), respectively. It can be seen that in cases where N is several times greater than n, the approximation error becomes relatively insignificant for practical purposes. The error only becomes significant when the value of N is relatively close to the value of n or below it, but it must be additionally stressed that this rarely occurs in practice since having such values of n and N is not too useful. The plots in Fig. 3 further suggests that if required, the approximation error could be modelled accurately. However, for the later use here it is enough to conclude that having a sufficiently large value of N renders the approximation error insignificant.

D. NON-UNIFORM DISTRIBUTION
The second case is when the distribution of the sample and consequently the distribution of the histogram are not uniform. In other words this is the case where Eq. (12) does not hold, i.e. when p i = p j for at least one pair of i and j. Applying to Eq. (11) all steps that have led to Eq. (17) gives (36) If D is the sample's theoretical distribution, then the first term of Eq. (36) is the discrete total variation of D that is given as The second term is a bound for expectation of the deviation of this given sample from its theoretical distribution. A rough estimate for this second term is the value 2 √ n − 1 √ N . It is obtained by first removing the subtracting part and applying the inequality Since √ p i and √ p i+1 are non-negative, the sums in Eq. (38) can effectively be seen as It is useful to know the discrete total variation of some important distributions. Examples of their histograms are shown in Fig. 4. The uniform distribution has a zero total variation. For the triangular distribution T with n bins this is for an even n, while in the case of an odd n this is given as The square distribution Q for which p i = Ci 2 with n bins has a discrete total variation that can be approximated as Next, in the case of the square root distribution S for which p i = C √ i and with n bins the approximation is given as For the geometric distribution G with parameter p this is for the Poisson distribution P with parameter λ > 1 it is The discrete total variation for a unimodal discrete distribution with mode M is bounded by 2M . The mode for symmetric binomial distribution B(n, 1 2 ) is 1 2 n n n/2 and The normal distribution N (0, σ 2 ) is a continuous one with unbounded support and its theoretical DTV depends on rasterization. The total variation of the probability density VOLUME 4, 2016 is essentially the support of the distribution and if n ≥ 2c σ , then N V can be approximated: Let D be any distribution and x n the histogram with n bins of a corresponding sample of N values drawn from the distribution D. Then similarly to Eq. (36) it can be written where R is a deviation from the theoretical distribution. If there was no randomness and all values were distributed exactly as predicted by the probabilities, then E [ x n V ] would be D V · N . Therefore, the second term is due to the randomness. A further thing to notice here is that as N grows, randomness plays an ever smaller role in Eq. (48) and as N limits at infinity, the term C 1 N gets to fully dominate in Eq. (48), which is also expected in accordance with the Glivenko-Cantelli theorem. In Fig. 5 the total variation of the theoretical distribution and the total variation of a sample are equal. This will be the case for all samples which do not alter order between adjacent bins. Therefore, the alteration from the theoretical distribution means that the corresponding sample is essentially different from theoretical one. Deviation from the theoretical distribution can be approximated as total variation of a sample from uniform distribution and therefore the bounds written before can be applied to any distribution.
With regard to the distribution, the use of the discrete total variation that is somewhat similar to the L 1 -norm may be reminiscent of the assumption of the Laplace distribution. However, no minimization, regularization, or any similar process that requires such an assumption is being performed here. Therefore, it should be stressed again that the relations obtained here can be applied to samples of any distribution.

E. THE PROPOSED MODEL
After taking into account the previous subsections' results, it is reasonable to consider the model for E [ x n V ] to be This model can be fitted directly to the sizes and discrete total variations obtained on the given histograms that are to be checked for outliers. If there is not enough given histograms to cover the desired value ranges of N , then additional ones can be created by randomly subsampling the given ones. In the case where a larger amount of histogram outliers is suspected, then their detrimental influence on fitting of Eq. (49) can be reduced by applying methods such as RANSAC [38]. Alternatively, if the distribution, i.e. the values of p i for the histograms' bins are known, then a and b can be obtained through Monte Carlo simulation by randomly creating arbitrarily many histograms of various sizes N and then fitting the model Eq. (49) to their sizes and discrete total variations.

F. SCORE CALCULATION
Once the model described by Eq. (49) has been fitted to data, the next step is to assign an outlier score to each of the given histograms. The first step is to calculate a histogram's discrete total variation. Next, the discrete total variation expected for the histograms's size is obtained by using Eq. (49). Finally, the absolute difference between these two values is However, d cannot yet be used as the score because the standard deviation of the discrete total variation for histograms of random samples varies depending on the samples size N , which means that the significance of d depends on N . This means that first the influence of the sample size on the standard deviation has to be removed. Additionally, the discrete total variation is already a statistic of the sample, which means that its standard deviation is actually the standard error [39]. Many standard errors that do not include division by N are proportional or close to being proportional to √ N , at least in limit, and in practice this is also the case with the discrete total variation. This can intuitively be seen in the form of the second term of Eq. (48) as discussed earlier. Therefore, for practical purposes the influence of N on d can be approximately removed by calculating the distance d as (51) The value of d can now be used instead of the value d as the outlier score for the histogram that it was calculated for because it is normalized with respect to the standard error.
It must be mentioned that strictly speaking Eq. (51) is theoretically not correct because the expected value of the discrete total variation is not always proportional to N . However, during the research conducted for this paper it has been empirically shown that for all tested distributions the standard error was proportional to √ N and that using Eq. (51) is a good practice, even though it may introduce inaccuracies. Since Eq. (51) was specifically designed to comply with the statistical properties related to the discrete total variation as discussed here, using some other score calculation would potentially require a major overhaul of the whole framework. An alternative to using Eq. (51) that unlike Eq. (51) does not include a explicitly derived formula is to take all data from the given histograms, use it in Monte Carlo simulations to create samples of various desired sized, for each of these sizes calculate the discrete total variations and their standard deviation, and fit a model to these sizes and their respective standard deviations. If enough data is available, this should result in a relation that is very similar to the one in Eq. (51).
Since d is the normalized distance from the expected discrete total variation and since it resembles the t-statistic, it could be further used to also provide a probabilistic interpretation for a given histogram. However, the goal of this paper is not to propose a new statistical test that can be used in hypothesis testing with predetermined significance levels.
The main goal of this paper is just to find the most likely outlier candidates based on the discrete total variation and the distance d also suffices for such ranking. Therefore, probabilistic interpretation calculation is omitted in this paper.

G. APPLICATION
With all the required background given in the previous subsections, it is possible to propose a new method for detecting histogram outliers in terms of the discrete total variation.
First, multiple histograms for the samples of various sizes are given as input. The histograms are supposed to have the same bins where each of the bins can have an arbitrary interval. It is also supposed that all these samples are drawn from the same distribution and the goal is to check which of them are most likely to be outliers in terms of the discrete total variation. Next, the discrete total variation is calculated for each of these histograms. Then, model Eq. (49) is fitted to histogram sizes and discrete total variations. Finally, each of the histograms is scored by applying Eq. (51). The histograms for which the highest score values were obtained are the most likely outlier candidates in terms of their discrete total variation. All these steps are summarized in Algorithm 1.
Here it should be additionally stressed that the proposed method has no hyperparameters whatsoever that would have to be tuned or that would otherwise influence the result. It may seem that the number of histogram bins n is a tunable hyperparameter, but the proposed method is agnostic of the underlying histogram samples -it merely receives already existing histograms as inputs. The histograms are only assumed to have the same bins. It is not even important what the range of the bins is nor is it important whether they are bounded.

H. THE PROPOSED METHOD'S NAME
Due to the proposed method's model's reliance on the discrete total variation, it was named Total Variation Outlier Recognizer (TVOR) or for the sake of simplicity just Tvor, which is pronounced /tVô:r/ and it means skunk in Croatian.
Calculate the score 8: end for

IV. EXPERIMENTAL RESULTS
In order to validate the proposed method, several experiments have been conducted on both synthetic and real-life data. Additionally, it is shown why the proposed method is more appropriate than some other similar methods. To give a clear and descriptive overview of the method's properties, the structure of this section is purposely slightly more extended. First, Section IV-A describes a baseline method for histogram outlier detection based on the Pearson's chi-squared test [10] to compare its results to the ones of the proposed method. In Section IV-B the behavior of the proposed method in several scenarios of changing conditions is demonstrated and additionally explained by several experiments for distribution outlier detection among histograms of random samples of different sizes drawn from the normal distribution and the beta distribution with various parameter values. Similar to that, Section IV-C contains experiments for discrete total variation outlier detection among histograms of random samples of various sizes drawn from the beta distribution. The real-life practical use of the proposed method is demonstrated in Section IV-D on the histograms of the birth years taken from census data of several populations from the same time frame. Section IV-E shows the advantage of the proposed method over some other methods that can be used for similar purposes. The obtained results are discussed in Section IV-F. The online repository with the source code and the data required to recreate the results is described in Section IV-G.

A. THE BASELINE METHOD
The proposed method's goal is to detect outliers speficically in terms of the expected discrete total variation, which can differ significantly from detecting distribution outliers in general. Therefore, the goal of this section is to show the difference in the performance of the proposed method and the Pearson's chi-squared test [10]. This test can be used to check whether a histogram is an outlier by comparing the values of the histogram's bins, which serve here as the categorical variables, to the values that are expected under a supposed distribution. However, since in the problem that is being analyzed in this paper the supposed distribution is unknown, the expected bin values first have to be estimated.
The observed bin value 12: The expected bin value 13: end for 14: Calculate the score 15: end for statistic is used as the outlier score for the tested histogram.
The described procedure is summarized in Algorithm 2.

B. SYNTHETIC DATA FOR DISTRIBUTION OUTLIERS 1) The goal
Since there is much freedom in the overall data generation procedure when using synthetic data and less or no limitations when compared to using real-life data, the goal of this subsection is to demonstrate and explain in more detail the behavior of the proposed method depending on gradual changes of various conditions. The performance is here first measured in terms of distribution outlier detection, even though the proposed method was not designed specifically for that task, while the performance in terms of DTV outlier detection is described in the following subsection. The experiments were performed for cases when the inlier and outlier samples for histograms were from the same distribution with changed parameter values and from different distributions.

2) Experimental setup
The experiments for distribution outlier detection on synthetic data, i.e. histograms of random samples, were conducted by repeatedly first simulating the mixtures of inlier and outlier samples, then trying to recognize the outlier samples by means of applying the baseline method and the proposed method, and finally examining the results of these simulations. The experiments were conducted for two general cases of inlier and outlier random sample distributions by mixing them in 10 4 simulations. In the first case both the inlier and outlier samples were from the normal distribution.
In each simulation of this first case, the inlier data was prepared by generating 100 random inlier samples drawn from the normal distribution with mean 0 and variance 1, i.e .  N (0, 1). The size of each individual sample was randomly Furthermore, in each simulation, the outlier data was generated by drawing a certain number of random samples from N 0, σ 2 for various σ = 1. The sample size was randomly determined in the same way as for the inlier samples. For both the inlier and outlier data the values of c and n were set to the same values to assure having histograms with the same bins. Next, the baseline method and the proposed method were applied to the combined inlier and outlier data to score individual histograms. Finally, the mean value of the rank of all outlier examples obtained by each method was calculated as the performance score of each method. A lower mean rank here means a better performance in terms of outlier detection. For the sake of simplicity, zero-based numbering was used for ranks. This means that in the case of a single added outlier sample, the optimal mean rank of a tested method is 0, while in the case of e.g. 10 added outlier samples, the optimal mean rank is 4.5 since this is the average value of the first 10 zerobased ranks, which should all be assigned to outlier samples' histograms in the case of a method that performs ideally.
In short, every instance of the simulation setup is uniquely determined by the number of histogram bins n, the number of added outlier samples, the value c used to determine the interval of the binned values, and the value of σ for outlier distribution. Simulations for each instance were repeated 10 4 times to check the performance of the baseline method and the proposed method in various sampling conditions. n the second general case, the inlier samples were drawn

3) Results
After examining the results of performing simulations for a large number of setups when both the inliers and the outliers are from the normal distribution, due to the similarity of many of the results, it was decided to show only those that can be used to summarize them all. These results are shown in Fig. 7. The first thing to observe is that in the majority of the cases the baseline method based on the Pearson's chisquared test performs better in terms of outlier ranking. This is mainly because the proposed method was not designed to find outliers in general, but to find outliers in terms of the discrete total variation. Interestingly, however, the exception to this are the cases when there is a relatively small number of bins, which can be seen in Figs. 7a and 7f, and cases with a high amount of added outlier sample histograms, which can be seen in Fig. 7e where the proposed methods outperforms the baseline method for all given numbers of bins. This means that even if the proposed method was not designed for the same task as the baseline method, in some cases it is still able to outperform it, which may be useful should such cases emerge. A more detailed analysis of the performance results shown in Fig. 7 is given in Appendix, which also explains the sudden drops in the performance such as the one in Fig. 7d. In short, the proposed method generally performs worse than the baseline method. However, in the cases of smaller values of n, i.e. in the cases of a smaller number of bins, as well as in the cases with a high amount of outliers, it may perform better. Similar results can be obtained with some other distributions as well and therefore they have been omitted here. If required, any other experiments with a similar setup can be conducted by using the source code publicly available in the repository that is described later.
Next, Fig. 8 shows the results of the experiment where the inlier and the outlier samples were drawn from the beta and the triangular distribution, respectively. As can be expected by viewing Fig. 6, the baseline method outperforms the proposed method in most cases since the difference between the used distributions is significant. Nevertheless, Fig. 8b again shows that the proposed method may be able to outperform the baseline method in the case of a high amount of outliers.
The performance drop of the proposed method for several values of n shown in Fig. 8a deserves some additional comments. As shown in Fig. 9a, the theoretical DTVs of both distributions are clearly separated for all shown values of n. This means that if the random samples were sufficiently big, then the performance should significantly improve in accordance with Eq. (48). Namely, in that case the influence of the sample size significantly overpowers the influence of the randomness. As a matter of fact, if the whole experiment is repeated with random samples having their sizes increased by several orders of magnitude, then both the proposed method and the baseline method have the same ideal performance. However, as mentioned earlier, the size of each sample used in the experiment whose results are shown in Fig. 8a was randomly chosen to be between 500 and 1000. For such sizes, the randomness still has a substantial influence on the histograms' DTVs. This is illustrated in Fig. 9b, which shows the mean DTV calculated for 10 6 random samples of size 1000 for various values of n created for both the beta and the triangular distribution. It can be clearly seen how this differs from the case of the theoretical DTVs and this can be used to explains the particularly low performance of the proposed method when n is 30 and 35 shown in Fig. 8a. Namely, for these values of n, the mean values of DTVs become so close that, with the influence of randomness included, it becomes difficult to successfully distinguish between the inlier and the outlier histograms based only on their DTVs. The dependence of the proposed method's performance on the size of the samples is further analyzed in more detail in Appendix. Based on all the results shown here and in Appendix, it can be concluded that the proposed method's performance improves as the size of the samples increases.
Overall, in terms of distribution outlier detection, the performance of the proposed method is only indirectly dependent on the inlier and the outlier distributions. As shown, it is directly dependent on the difference between the theoretical DTVs of these distributions, which is in turn dependent on the chosen histogram bins. This means that, depending on the histogram bins, the proposed method may perform well even when the inlier and the outlier distribution are same, but with slightly different parameters. On the other hand, for significantly different inlier and outlier distributions that have similar theoretical DTVs for the chosen bins, the proposed method may perform poorly. The opposite cases are also possible. Nevertheless, this is not too problematic because the proposed method was not designed for distribution outlier detection, but specifically for the DTV outlier detection.

C. SYNTHETIC DATA FOR TOTAL VARIATION OUTLIERS 1) The goal
The goal of this subsection is to demonstrate the behavior of the proposed method for the case that it was originally designed for, i.e. for discrete total variation outlier detection. Additional emphasis is specifically put on cases where the number of outliers gets very close to the number as the inliers.

2) Experimental setup
Since earlier in the paper it was mentioned that demographics is one of the fields that can benefit from discrete total variation outlier detection, the beta distribution with α = 2 and β = 3 was chosen for the inlier samples' distribution. The reason is the resemblance of its histograms to the histograms of some population age distributions. For all experiments the number of bins n was fixed to 100. The outlier samples were initially also drawn from the same beta distribution and their histograms also had 100 bins. However, the outlier samples' histograms underwent an additional change to simulate the so called age heaping [16]. Namely, for a certain amount of randomly chosen bins with a count greater than 0, their count was decreased by 1 and the count of the closest bin to each of them whose ordinal number was divisible by 5 was increased by 1 as can be seen on the example that is shown in Fig. 10. This was done for various combinations of the amount of outlier samples and the amount of randomly chosen bins that were changed for these outlier samples' histograms. Finally, the performances of the proposed method and of the baseline method were then compared for all these combinations.

3) Results
The obtained results and comparisons are shown in Fig. 11. It can be seen that if there are only a few outliers, then the proposed and the baseline methods are on par with each other and there are only some smaller differences in performance for various amount of heaped values. However, as the number of outliers increases, the proposed method starts to significantly outperform the baseline method, especially in cases where RANSAC is used as suggested in Section III-E. This is especially noticeable in Fig. 11f where the number of outliers is very close to the number of inliers. There the baseline effectively degrades to a random chooser, while the proposed method used in combination with RANSAC excels at outlier  detection. This shows the usefulness of the proposed methods for the task of finding the discrete total variation outliers.

1) The goal
The goal of this subsection is to test the proposed method on an example of real-life census data with sample sizes spanning several orders of magnitude and being drawn from slightly different, but similar distributions. Here a closer look is taken at the samples of the top-scoring histograms. This can show the robustness of the proposed method in noisy conditions and its usefulness for real-life data applications.

2) Experimental setup
Several census data sources have been used for the experimental setup. The largest of them is the German census of 1939 [40] with the corresponding birth year histogram being shown in Fig. 12a. Since the significant gap for the years of World War I can be traced in age composition of other similar lists and censuses of other countries as well [46], [47], this census data is used here as a gold standard for the discrete total variation of the population histograms for that time.  In addition to that, 7106 variously sized censuses, i.e. lists of people with birth years available at the website of the United States Holocaust Memorial Museum (USHMM) [48] are used since they were composed for the populations from roughly the same time frame. The distribution of the majority of the sizes with the largest ones being excluded for practical purposes is shown in Fig. 13. The geographical locations of these populations differ, but they still mostly cover the populations whose birth year histograms should have similar dis-crete total variation properties. To make it clear immediately, this does not necessarily mean that the age distributions are similar as well. Namely, one census can have a significantly higher amount of e.g. young people in comparison to other censuses, but as it will be shown later on concrete examples, this should not necessarily affect the discrete total variation of the birth year histograms too significantly. Therefore, these lists available at USHMM constitute an interesting dataset in which to look for outliers in terms of discrete total variation.

3) Results
The first experiments that were carried out consisted of simply taking many variously sized subsamples of the birth years from the German census of 1939, calculating the discrete total variations of their birth year histograms, and fitting the proposed method's model in Eq. (49) to the data obtained in this way. Fig. 12b shows the result of this experiment. The proposed model fits well to all data. This also holds for smaller subsamples where the influences of the two terms in Eq. (49) are still on par. It can also be seen how the discrete total variations get more dispersed as the sample size grows. While this may hint at heteroscedasticity, applying weighted regression or variance-stabilizing data transformations did not significantly change the results that are described here.
The relation between the sample size and the standard deviation of the discrete total variation is shown in Fig. 12c.  Very similar results are obtained for other distributions as well. It can be seen that the relation is very similar to the square root function, which effectively justifies the use of Eq. (51) for practical purposes. The distribution of the discrete total variations for the subsamples of the same size closely resembles the normal one as shown in Fig. 12d with the remark that the discrete total variations there are integers. After conducting the relatively simple mentioned experiments in order to get a better insight into the inner workings of the proposed method, the next step was to apply the method to all USHMM lists whose data includes birth years. The distribution of the values of d described in Eq. (51) and obtained by the proposed method in this way is shown in Fig. 15, while the relation between the calculated discrete total variations and the predicted ones are shown in Fig. 14.
It can be seen that the majority of the values d in Fig. 15 are not spread too widely with the exception of several outliers. Before analyzing these outliers in more detail and commenting on Fig. 14, it must be stressed that in Fig. 14 the plot axes use the logarithmic scale to better accommodate the presentation to the list' size distribution. Therefore, the apparent misfit for the smallest lists can deceive into believing that the proposed model failed to fit properly, while it is actually only a misfit on a small scale. For similar reasons many of the differences between the calculated and the predicted discrete total variations for the larger lists are higher than they may appear to be on the plot. In addition  Fig. 14 for comparison. It can be seen that on several places its predictions are not quite aligned with the ones of the proposed model, which can be attributed to the distribution shown in Fig. 13 A more detailed analysis of the top-scoring histogram that provides additional insights and explanations of the behavior of the proposed method's scoring is available in Appendix.
In the case of Monte Carlo the score d was calculated as whereμ N andσ N are the mean and the standard deviation, respectively, of the discrete total variation obtained for a large number of subsamples of size N of the German census of 1939. The distribution of the values of d obtained for all lists from the USHMM is shown in Fig. 17

E. ADVANTAGES OVER EXISTING METRICS
Like for many other groups of population histograms, there is no ground-truth ordering for USHMM lists in terms of their histograms' smoothness or accordance with historical populations. Because of that, the quality of ordering obtained by the proposed method and by existing metrics such as Whipple's and Myers' indices can not be compared directly. However, it is possible to show cases that are problematic for both of these indices, but not for the proposed method. The first example is the histograms shown by Figs. 18a and 18b, which represent the top-scoring histograms among the USHMM lists' histograms for the Whipple's and Myers' indices, respectively. It can be seen that these histograms are actually relatively smooth, but they also contain only a few non-zero values: the first one 8 and the second one only a single. These histograms can hardly be considered outliers in terms of smoothness when compared to the histograms in Fig. 16, but rather outliers in terms of covered years span, which is different and also detectable by much simpler techniques. Additionally, the lists that produced these histograms have only a relatively small number of birth years and since the mentioned indices, unlike the proposed method, do not take into account the sample size, they are also more prone to anomalies that arise in smaller samples due to randomness.
Another problem with metrics such as Whipple's and Myers' indices is that they are mainly concerned with frequencies and do not take into account other properties such as shape or smoothness. Because of that, for different samples that have the same frequencies of last digits of their numbers, it is still possible to obtain the same values of the mentioned indices even if the samples' histograms differ significantly. An example of this is given in Fig. 19 with a fully smooth histogram that has the same indices values as a histogram that can hardly be considered smooth. While numerous similar examples exist, the ones presented are enough to show the frequency-based weakness of the Whipple's and Myers' indices. On the other hand, the proposed method has no such problems and its values for the histograms in Fig. 19 differ significantly with one being zero and the other one non-zero. In short, while being widely used and useful in certain cases, metrics such as the Whipple's and the Myers' indices are too simple to properly handle properties such as smoothness. Therefore, the proposed method's ability to specifically target smoothness is its main advantage over other metrics.

F. DISCUSSION
Looking at the distributions shown in Figs. 15 and 17 and observing the significant difference between the majority of the scores and the highest scores, it can be concluded that the histograms of the used USHMM lists that obtained the highest scores are indeed outliers in terms of the discrete total variation. Since the analyzed data consisted of birth years, it may seem that an appropriate tool for identifying outliers such as the ones in Fig. 16 could be the Whipple's index [46], but due to its fixed nature of checking only specific kinds of data, it is often inappropriate [49], [50]. This also holds in the case of the histogram of the Jasenovac inmates shown in 16a whose artifacts are marked more closely in Fig. 20. It can be seen that age heaping occurs in several forms that the Whipple's index not only cannot pick up, but it also gets hampered by them. Namely, in its slightly changed form the Whipple index checks for a surplus of years ending in 0 or 5 when compared to other years, but in the case of Jasenovac there is also a surplus of years ending in 2, which is not checked by the Whipple's index and it actually reduces the overall surplus of years ending in 0 or 5, thus hampering the Whipple's index in detecting the unusual data patterns. Since the proposed method has no such problems, it may be more appropriate in situations similar to the one in this experiment.
Besides . This means that the proposed method can also be used to detect samples that contain potentially problematic data with properties not always shared with the usual outliers.

G. SOURCE CODE AND DATA REPOSITORY
The source code written in the Python programming language and the data required to recreate the results described in this section are publicly available in a dedicated GitHub repository. 1 At the time of writing this paper, the census data used in this section was publicly available at the USHMM website, but for the sake of simplicity of recreating the results, it is also available in the repository. While the census data also contains other information alongside the birth years, only the birth years were copied to the repository in order to avoid data privacy violation for potentially still living persons. For example, according to the Jasenovac camp inmates list [42], which was already shown to be problematic, a certain Stojan Ražokrak [58] was allegedly killed in 1942, but a publicly available video of him 2,3 from 2012 and its transcript 4 clearly show the opposite. Because of that, it seemed reasonable to copy only the birth years, while any interested reader can check the rest of the data at the USHMM website by using the appropriate list identifier given in the repository.

V. CONCLUSIONS AND FUTURE WORK
In this paper, a method for finding discrete total variation outliers among histograms has been proposed. It scores histograms based on the deviation of their discrete total variation from its expected value. To carry out this scoring, a FIGURE 20: Same data for Jasenovac inmates as in Fig. 16a, but with additional markings for the age heaping [16] artifacts. statistical framework has been proposed. One of the method's main advantages is that in order to work it requires no information about the distribution of the samples that are being described by histograms. In some special cases the proposed method even outperforms the Pearson's chi-squared test when looking for the outlier histograms in terms of the sample distribution despite the fact that is was not designed for this task. On the other hand, the proposed method clearly outperforms the Pearson's chi-squared test when looking for discrete total variation outliers, especially in cases of a huge amount of outliers. Overall, the proposed method represents a successful proof-of-concept of how discrete total variation that is used in the method's modelling can be applied to histogram outlier detection in terms of discrete total variation, which has been experimentally confirmed on synthetic and real-life data. Future work may include looking for some other histogram properties that can also be used for histogram outlier detection in terms of their smoothness in the cases where the distribution of the histogram samples is unknown. As for improving the proposed method, future work will include at least two things. The first of them is the analysis of variance for the discrete total variation to potentially improve the scoring criteria. The second of them comprises other aspects of the histogram's discrete total variation that could decrease the scores obtained for the inlier samples, but simultaneously keep the scores obtained for the outliers high. By the definition of F (2, N ), it can be developed as follows: For an even N = 2r, the equality N k=0 N k = 2 N leads to Since here (N + 1)/2 = (2r + 1)/2 = r and N/2 = (2r + 1)/2 = r, it follows that Eq. (54) matches Eq. (19).
For an odd N = 2r + 1, a similar calculation as before gives To avoid any possible confusion, it has to be mentioned that the lower index of the binomial coefficient in Eq. (55) can also be set to r + 1 because N is supposed to be odd there. Furthermore, like in the previous case, it can be seen that Eq. (55) also matches Eq. (19), which proves Theorem 1.

2) Proof of Theorem 2
The expectation E [|x 2 − x 1 |] can be written as follows: The right-hand side of Eq. (56) can further be written as  To obtain Eq. (23) from here, it is sufficient to note that Eq. (58) can be applied to Eq. (57), which can then be applied to Eq. (21). This results in Eq. (23), which proves Theorem 2.

B. THEORETICAL DISCRETE TOTAL VARIATIONS
To facilitate a better understanding of the experimental results that were discussed in Section IV-B and shown in Fig. 7, the comparison of the values of the theoretical discrete total variation D V of the histograms of normal distribution with parameters used to obtain these results are given in Fig. 21. By comparing Figs. 7 and 21, it is relatively easy to explain phenomena such as the sudden drops in the proposed method's performance that can be seen in Fig. 7d when 10 bins are used. Namely, Fig. 21d clearly shows that for 10 bins the difference between the theoretical DTVs of the distributions used there is very small, which renders the proposed method inadequate for recognizing outlier samples for that specific case. Similar reasoning can also be applied to successful cases where this difference is sufficiently large. Fig. 9 clearly shows how randomness can have a significant impact on the performance of the proposed method. Nevertheless, as described by Eq. (48), when the samples' sizes grow, this impact becomes ever smaller. However, in order to decrease this impact in cases of e.g. larger values of n, the samples' sizes have to grow significantly more than in the cases of smaller values of n. This is illustrated on several examples shown in Fig. 22. There it can be seen that for n = 5 the samples with random sizes up to 1000 are clearly separated, while for the same sizes and n = 50 the samples can hardly be separated. However, as shown in Fig. 22e and Fig. 22f, if the upper bound for the sizes of random samples gets increased even further, the separation again becomes clear. As shown in Fig. 23, this has a direct influence on the performance of both the baseline and the proposed methods. In short, a successful application of the proposed method assumes a reasonably high ratio between the number of bins n and the sizes of samples. How high this ratio should be, however, depends on the specific distributions of the samples.

D. PARTITIONING THE TOP-SCORING HISTOGRAM
In order to describe the behavior of the proposed method in more detail, it may be useful to additionally analyze the topscoring histogram shown in Figs. 16a and 20. By partitioning the initial birth year sample into more smaller samples, it is possible to examine the behavior of the proposed method when the sample size is changing. One way of partitioning the sample is by nationality of the inmates. The nationalities for which there are more than 1000 listed inmates are, as specified in the Jasenovac inmates list, the following ones: Serbian, Roma, Jewish, Croatian, and Muslim. While the histograms of the Roma, Jewish, Croatian, and Muslim nationalities shown in Fig. 24 all exhibit signs of age heaping similar to the ones in Fig. 20, by far more prominent signs are exhibited by the Serbian nationality as shown in Fig. 25.
If the histograms for separate nationalities are also added to the set of USHMM lists and the proposed method is applied to this extended set, then the histogram for the Serbian nationality ends up being the second most likely outlier just after the whole Jasenovac list with d = 40.82. The Romani nationality histogram ends up on the 21st place with d = 15.12, the Jewish nationality histogram ends up on the 66th place with d = 9.08, while other histograms are not inside the 100 most likely outliers. This shows how the proposed method can also be used to detect the potentially problematic parts of a sample, which in the case of the Jasenovac list lies in the birth years of Serbian inmates.
Additionally, there is another thing to be observed here. Namely, while Figs. 20 and 25 seem to be very similar, the score d for the histogram of the birth years of the Serbian inmates was nevertheless smaller than the one for the whole Jasenovac list. This has to do with the fact that the sample with birth years of Serbian inmates has fewer values than the whole Jasenovac list, i.e. it makes up roughly 57% of the Jasenovac list. Because of that, such similar deviations are considered to be less likely on a larger sample and thus the whole Jasenovac list has a slightly larger value of score d .  NIKOLA BANIĆ received B.Sc., M.Sc., and Ph.D. degrees in computer science in 2011, 2013, and 2016, respectively. He is currently working as a senior computer vision engineer at Gideon Brothers, Croatia. He has worked in real-time image enhancement for embedded systems, digital signature recognition, people tracking and counting, and image processing for stereo vision. His research interests include image enhancement, color constancy, image processing for stereo vision, and tone mapping.
NEVEN ELEZOVIĆ received B.Sc., M.Sc., and Ph.D. degrees in mathematics in 1977, 1981, and 1985, respectively. He published numerous publications in high ranking journals and books for college mathematics. Currently he is a full professor at the University of Zagreb Faculty of Electrical Engineering and Computing and the founder of the publishing house Element. His research interests include mathematical analysis and probability theory.