Data Mining & Data Warehousing

Introduction:

Data cubes facilitate the answering of data mining queries because they allow the computation of aggregate data at multiple levels of granularity. In this section, you will learn about multifeature cubes, which compute complex queries involving multiple dependent aggregates at multiple granularities. These cubes are very useful in practice. Many complex data mining queries can be answered by multifeature cubes without significant additional computational cost, in comparison to cube computation for simple queries with standard data cubes.

All of the examples in this section are from the Purchases data of AllElectronics, where an item is purchased in a sales region on a business day (year, month, day). The shelf life in months of a given item is stored in shelf. The item price and sales (in dollars) at a given region are stored in price and sales, respectively. To aid in our study of multifeature cubes, let's first look at an example of a simple data cube.
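To make this concrete, here is a minimal sketch in Python (pandas), using made-up Purchases rows with the columns named above (item, region, month, price, shelf, sales); the data values and the choice of pandas are assumptions for illustration only. It computes total sales at several granularities, which is exactly the kind of aggregate a simple data cube precomputes.

import pandas as pd

# Hypothetical Purchases rows; the column names follow the text, the values are made up.
purchases = pd.DataFrame({
    "item":   ["TV", "TV", "camera", "camera", "TV"],
    "region": ["West", "East", "West", "West", "East"],
    "month":  ["2004-01", "2004-01", "2004-02", "2004-01", "2004-02"],
    "price":  [800, 820, 300, 310, 790],
    "shelf":  [12, 12, 6, 6, 12],
    "sales":  [1600, 820, 900, 310, 790],
})

# A simple data cube materializes an aggregate (here, total sales) at multiple
# granularities: per (item, region, month), per (item, region), per item, and overall.
cuboids = {
    ("item", "region", "month"): purchases.groupby(["item", "region", "month"])["sales"].sum(),
    ("item", "region"):          purchases.groupby(["item", "region"])["sales"].sum(),
    ("item",):                   purchases.groupby(["item"])["sales"].sum(),
    ():                          purchases["sales"].sum(),   # apex cuboid: total sales
}

for dims, aggregate in cuboids.items():
    print("group-by", dims if dims else "()", ":")
    print(aggregate, "\n")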

Constrained Gradient Analysis in Data Cubes

Many data cube applications need to analyze the changes of complex measures in multidimensional space. For example, in real estate, we may want to ask how the average house price in the Vancouver area changed in 2004 compared with 2003, and the answer could be "the average price of houses sold to professionals in the West End went down by 20%, while the average price of those sold to business people in Metrotown went up by 10%," and so on. Expressions such as "professionals in the West End" correspond to cuboid cells and describe sectors of the business modeled by the data cube.
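As a rough illustration (not from the text), a cuboid cell such as "professionals in the West End" can be represented by assigning a value to each dimension and using "*" (all) for the dimensions that are generalized away. The short Python sketch below, with assumed dimension names, helper names, and made-up prices, shows one way to encode such cells and to compute the change of an average measure between a sector and a more general sector.

# Minimal sketch; dimension names, helper names, and numbers are illustrative only.
ALL = "*"   # the special "all" value of a generalized dimension

def is_more_specific(a, b):
    """True if cell a describes a sub-sector of cell b (a is a specialization of b)."""
    return a != b and all(b[d] == ALL or a[d] == b[d] for d in b)

def relative_change(new_avg, old_avg):
    """Signed relative change of an average measure, e.g. -0.20 for 'down by 20%'."""
    return (new_avg - old_avg) / old_avg

west_end_professionals = {"city": "Vancouver", "area": "West End", "buyer": "professional"}
all_vancouver_buyers   = {"city": "Vancouver", "area": ALL,        "buyer": ALL}

print(is_more_specific(west_end_professionals, all_vancouver_buyers))  # True
# Hypothetical average prices of the sector in 2004 vs. 2003:
print(relative_change(400_000, 500_000))  # -0.2, i.e. "went down by 20%"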

The problem of mining changes of complex measures in a multidimensional space was first proposed by Imielinski, Khachiyan, and Abdulghani [IKA02] as the cube grade problem, which can be viewed as a generalization of association rules and data cubes. It studies how changes in a set of measures (aggregates) of interest are associated with changes in the underlying characteristics of sectors, where changes in sector characteristics are expressed in terms of the dimensions of the cube and are limited to specialization (drill-down), generalization (roll-up), and mutation (a change in one of the cube's dimensions). For example, we may want to ask what kinds of changes in sector characteristics are associated with major changes in average house price in the Vancouver area in 2004. The answer will be pairs of sectors, associated with major changes in average house price, including, for example, "the sector of professional buyers in the West End area of Vancouver" versus "the sector of all buyers in the entire area of Vancouver" as a specialization (or generalization). The cube grade problem is significantly more expressive than association rules, because it captures data trends and handles complex measures, not just count, as association rules do. The problem has broad applications, from trend analysis to answering "what-if" questions and discovering exceptions or outliers.

The curse of dimensionality and the need for understandable results pose serious challenges for finding an efficient and scalable solution to the cube grade problem. Here we examine a confined but interesting version of the cube grade problem, called constrained multidimensional gradient analysis, which reduces the search space and derives interesting results. It incorporates the following types of constraints:

•  Significance constraint: This ensures that we examine only the cells that have certain "statistical significance" in the data, such as containing at least a specified number of base cells or at least a certain total sales. In the data cube context, this constraint acts as the iceberg condition, which prunes a huge number of trivial cells from the answer set.

•  Probe constraint: This selects a subset of cells (called probe cells) from all of the possible cells as starting points for examination. Because the cube grade problem needs to compare each cell in the cube with other cells that are specializations, generalizations, or mutations of the given cell, it extracts pairs of similar cell characteristics associated with big changes in measure in a data cube. Given three cells, a, b, and c, if a is a specialization of b, then we say it is a descendant of b, in which case, b is a generalization or ancestor of a. Cell c is a mutation of a if the two have identical values in all but one dimension, where the dimension in which they vary cannot have the value "*" (that is, "all"). Cells a and c are considered siblings. Even when considering only iceberg cubes, a large number of pairs may still be generated. Probe constraints allow the user to specify a subset of cells that are of interest for the analysis task. In this way, the study is focused only on these cells and their relationships with corresponding ancestors, descendants, and siblings.

•  Gradient constraint: This specifies the user's range of interest on the gradient (measure change). A user is typically interested in only certain types of changes between the cells (sectors) under comparison. For example, we may be interested in only those cells whose average profit increases by more than 40% compared to that of the probe cells. Such changes can be specified as a threshold in the form of either a ratio or a difference between certain measure values of the cells under comparison. A cell that captures the change from the probe cell is referred to as a gradient cell. A small sketch combining the three constraints is given after this list.
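As noted above, the following self-contained Python (pandas) sketch combines the three constraints. The dimension names (city, area, occupation), the measure (price), the data values, and the thresholds are all assumptions chosen for illustration. It enumerates cuboid cells, keeps only those passing a significance (iceberg) condition, selects probe cells, and reports ancestors, descendants, and siblings whose average price differs from the probe by at least the gradient threshold.

import itertools
import pandas as pd

ALL = "*"                                      # the "all" value of a generalized dimension
DIMS = ["city", "area", "occupation"]          # assumed dimensions
MIN_COUNT = 2                                  # significance (iceberg) threshold: >= 2 base cells
GRADIENT_THRESHOLD = 1.4                       # ratio threshold: average price changes by >= 40%

# Hypothetical house-sales rows; names and prices are made up for illustration.
sales = pd.DataFrame({
    "city":       ["Vancouver"] * 6,
    "area":       ["West End", "West End", "West End", "West End", "Metrotown", "Metrotown"],
    "occupation": ["professional", "professional", "business", "business",
                   "professional", "professional"],
    "price":      [500_000, 540_000, 300_000, 320_000, 350_000, 360_000],
})

def rows_of(cell):
    """Base rows covered by a cell; '*' matches every value of that dimension."""
    mask = pd.Series(True, index=sales.index)
    for d, v in cell.items():
        if v != ALL:
            mask &= sales[d] == v
    return sales[mask]

# Materialize every cuboid cell that satisfies the significance constraint.
values = {d: sorted(sales[d].unique()) + [ALL] for d in DIMS}
cells = []
for combo in itertools.product(*(values[d] for d in DIMS)):
    cell = dict(zip(DIMS, combo))
    covered = rows_of(cell)
    if len(covered) >= MIN_COUNT:              # iceberg condition prunes trivial cells
        cells.append((cell, covered["price"].mean()))

# Probe constraint: start only from fully instantiated cells describing professional buyers.
probes = [(c, avg) for c, avg in cells
          if c["occupation"] == "professional" and ALL not in c.values()]

def is_ancestor(anc, desc):
    """anc is a generalization of desc (equivalently, desc is a descendant of anc)."""
    return anc != desc and all(anc[d] == ALL or anc[d] == desc[d] for d in DIMS)

def is_sibling(a, b):
    """a is a mutation of b: they differ in exactly one dimension, and not via '*'."""
    diff = [d for d in DIMS if a[d] != b[d]]
    return len(diff) == 1 and ALL not in (a[diff[0]], b[diff[0]])

# Gradient constraint: report related cells whose average price differs by a factor >= 1.4.
for probe, probe_avg in probes:
    for cell, avg in cells:
        related = is_ancestor(cell, probe) or is_ancestor(probe, cell) or is_sibling(cell, probe)
        if related and max(avg, probe_avg) / min(avg, probe_avg) >= GRADIENT_THRESHOLD:
            print(f"gradient cell {cell} (avg {avg:,.0f}) vs probe {probe} (avg {probe_avg:,.0f})")

This brute-force enumeration is only for illustration; the point of constrained gradient analysis, as described above, is precisely to prune such a search space by pushing the constraints into the cube computation.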