SKEDSOFT

Data Mining & Data Warehousing

Introduction: The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set (e.g., a normal or Poisson distribution) and then identifies outliers with respect to the model using a discordancy test. Applying the test requires knowledge of the data set parameters (such as the assumed data distribution), the distribution parameters (such as the mean and variance), and the expected number of outliers.

Working of discordancy testing: A statistical discordancy test examines two hypotheses: a working hypothesis and an alternative hypothesis. A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is,

H: oi ∈ F, where i = 1, 2, . . ., n.

The hypothesis is retained if there is no statistically significant evidence supporting its rejection. A discordancy test verifies whether an object, oi, is significantly large (or small) in relation to the distribution F. Different test statistics have been proposed for use as a discordancy test, depending on the available knowledge of the data. Suppose that some statistic, T, has been chosen for discordancy testing and that the value of the statistic for object oi is vi. The distribution of T is then constructed, and the significance probability, SP(vi) = Prob(T > vi), is evaluated. If SP(vi) is sufficiently small, then oi is discordant and the working hypothesis is rejected. An alternative hypothesis, H̄, which states that oi comes from another distribution model, G, is adopted. The result depends heavily on which model F is chosen, because oi may be an outlier under one model and a perfectly valid value under another.
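The significance probability can be illustrated with a minimal sketch. It assumes, purely for illustration, that the working model F is a normal distribution with known mean and standard deviation, that the statistic T is the object's value itself, and that only the upper tail is tested. The function names, the threshold alpha = 0.01, and the sample values are hypothetical and do not come from the text.

```python
import math

def significance_probability(v, mean, std):
    """SP(v) = Prob(T > v), assuming T ~ Normal(mean, std**2)."""
    z = (v - mean) / std
    # Upper-tail probability of the standard normal, computed with the
    # complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def is_discordant(v, mean, std, alpha=0.01):
    """Treat the object as discordant when SP(v) falls below alpha (assumed threshold)."""
    return significance_probability(v, mean, std) < alpha

# Hypothetical working model F: Normal(mean=50, std=5).
for value in (52.0, 58.0, 71.0):
    sp = significance_probability(value, 50.0, 5.0)
    print(value, sp, is_discordant(value, 50.0, 5.0))
```

Testing for objects that are significantly small would use the lower tail, Prob(T < vi), in the same way. Swapping in a different model F changes SP(vi), which is why the same object can be discordant under one model and consistent under another.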

The alternative distribution is very important in determining the power of the test, that is, the probability that the working hypothesis is rejected when oi is really an outlier. There are different kinds of alternative distributions.
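As a rough illustration of how the alternative distribution drives the power, the sketch below assumes that F is normal with known parameters and that the alternative G is the same normal distribution with its mean shifted upward. The function name, the shift values, and the significance level alpha are assumptions made for this example only.

```python
from statistics import NormalDist

def power_against_shift(mean, std, shift, alpha=0.01):
    """Probability of rejecting the working hypothesis when the object comes from G.

    Assumes F = Normal(mean, std) and G = Normal(mean + shift, std),
    with a one-sided (upper-tail) test at level alpha.
    """
    f = NormalDist(mean, std)
    g = NormalDist(mean + shift, std)
    # Smallest value that the test would flag as discordant under F.
    critical_value = f.inv_cdf(1.0 - alpha)
    # Chance that an object drawn from G exceeds that critical value.
    return 1.0 - g.cdf(critical_value)

# The further G is shifted away from F, the higher the power.
for shift in (0.0, 5.0, 10.0, 20.0):
    print(shift, power_against_shift(50.0, 5.0, shift))
```

With no shift, G coincides with F and the rejection probability is simply alpha, the false-alarm rate of the test.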

  • Inherent alternative distribution: In this case, the working hypothesis that all of the objects come from distribution F is rejected in favor of the alternative hypothesis that all of the objects arise from another distribution, G:
    H̄: oi ∈ G, where i = 1, 2, . . ., n.
    F and G may be different distributions or differ only in parameters of the same distribution. There are constraints on the form of the G distribution in that it must have potential to produce outliers. For example, it may have a different mean or dispersion, or a longer tail.
  • Mixture alternative distribution: The mixture alternative states that discordant values are not outliers in the F population, but contaminants from some other population, G. In this case, the alternative hypothesis is
    H̄: oi ∈ (1 − λ)F + λG, where i = 1, 2, . . ., n.
  • Slippage alternative distribution: This alternative states that all of the objects (apart from some prescribed small number) arise independently from the initial model, F, with its given parameters, whereas the remaining objects are independent observations from a modified version of F in which the parameters have been shifted.
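The mixture alternative can be made concrete with a small simulation. The sketch below reads the mixing weight in the mixture formula as the probability that an object is a contaminant from G; that reading, the function name, the two normal distributions, and the weight 0.05 are all assumptions chosen for illustration.

```python
import random

def sample_mixture(n, lam, draw_f, draw_g, rng):
    """Draw n objects from (1 - lam) * F + lam * G.

    Returns (value, is_contaminant) pairs so the origin of each object is visible.
    """
    samples = []
    for _ in range(n):
        # With probability lam the object is a contaminant from G, otherwise it comes from F.
        from_g = rng.random() < lam
        value = draw_g(rng) if from_g else draw_f(rng)
        samples.append((value, from_g))
    return samples

rng = random.Random(0)  # fixed seed so the run is repeatable
data = sample_mixture(
    n=1000,
    lam=0.05,
    draw_f=lambda r: r.gauss(50.0, 5.0),  # hypothetical F
    draw_g=lambda r: r.gauss(80.0, 5.0),  # hypothetical G with a shifted mean
    rng=rng,
)
contaminants = sum(1 for _, from_g in data if from_g)
print(contaminants / len(data))  # observed share of contaminants, close to lam
```

A slippage alternative could be simulated in a similar way by fixing the number of objects drawn from the shifted version of F instead of drawing each object's origin at random.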

There are two basic types of procedures for detecting outliers:

  • Block procedures: In this case, either all of the suspect objects are treated as outliers or all of them are accepted as consistent.
  • Consecutive (or sequential) procedures: An example of such a procedure is the inside-out procedure. Its main idea is that the object that is least “likely” to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise, the next most extreme object is tested, and so on. This procedure tends to be more effective than block procedures.
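The testing order of the inside-out procedure can be sketched as follows. This is a simplified illustration rather than a definitive implementation: it assumes a normal working model with known parameters, upper-tail suspects only, and a fixed threshold alpha, and all names and values are hypothetical.

```python
import math

def upper_tail_sp(v, mean, std):
    """SP(v) = Prob(T > v), assuming T ~ Normal(mean, std**2)."""
    return 0.5 * math.erfc((v - mean) / (std * math.sqrt(2.0)))

def inside_out(suspects, mean, std, alpha=0.01):
    """Return the suspects declared outliers by a simplified inside-out procedure."""
    # Least extreme suspect first (smallest value, since only the upper tail is tested).
    ordered = sorted(suspects)
    for index, value in enumerate(ordered):
        if upper_tail_sp(value, mean, std) < alpha:
            # This suspect is discordant, so it and every more extreme
            # suspect are declared outliers without further testing.
            return ordered[index:]
    # No suspect was discordant; all are accepted as consistent.
    return []

# Hypothetical suspects under an assumed model F = Normal(50, 5).
print(inside_out([58.0, 71.0, 64.0], mean=50.0, std=5.0))
```

With a fixed, fully known F this reduces to a simple cutoff. The ordering matters more in practice when the test statistic or the estimated parameters depend on which objects are still under suspicion.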

“How effective is the statistical approach at outlier detection?” A major drawback is that most tests are for single attributes, yet many data mining problems require finding outliers in multidimensional space. Moreover, the statistical approach requires knowledge about parameters of the data set, such as the data distribution. However, in many cases, the data distribution may not be known. Statistical methods do not guarantee that all outliers will be found for the cases where no specific test was developed, or where the observed distribution cannot be adequately modeled with any standard distribution.