SKEDSOFT

Data Mining & Data Warehousing

Introduction: Unlike normal database queries, which find data that match the given query exactly, a similarity search finds data sequences that differ only slightly from the given query sequence. Given a set of time-series sequences, S, there are two types of similarity searches: subsequence matching and whole sequence matching. Subsequence matching finds the sequences in S that contain subsequences that are similar to a given query sequence x, while whole sequence matching finds a set of sequences in S that are similar to each other (as a whole). Subsequence matching is a more frequently encountered problem in applications. Similarity search in time-series analysis is useful for financial market analysis (e.g., stock data analysis), medical diagnosis (e.g., cardiogram analysis), and in scientific or engineering databases (e.g., power consumption analysis).

Data Reduction and Transformation Techniques: Due to the tremendous size and high-dimensionality of time-series data, data reduction often serves as the first step in time-series analysis. Data reduction leads to not only much smaller storage space but also much faster processing. As discussed in Chapter 2, major strategies for data reduction include attribute subset selection (which removes irrelevant or redundant attributes or dimensions), dimensionality reduction (which typically employs signal processing techniques to obtain a reduced version of the original data), and numerosity reduction (where data are replaced or estimated by alternative, smaller representations, such as histograms, clustering, and sampling). Because time series can be viewed as data of very high dimensionality where each point of time can be viewed as a dimension, dimensionality reduction is our major concern here. For example, to compute correlations between two time-series curves, the reduction of the time series from length (i.e., dimension) n to k may lead to a reduction from O(n) to O(k) in computational complexity. If kn, the complexity of the computation will be greatly reduced.

Several dimensionality reduction techniques can be used in time-series analysis. Examples include (1) the discrete Fourier transform (DFT) as the classical data reduction technique, (2) more recently developed discrete wavelet transforms (DWT), (3) Singular Value Decomposition (SVD) based on Principle Components Analysis (PCA),5 and (4) random projection-based sketch techniques , which can also give a good-quality synopsis of data. Because we have touched on these topics earlier in this book, and because a thorough explanation is beyond our scope, we will not go into great detail here. The first three techniques listed are signal processing techniques. A given time series can be considered as a finite sequence of real values (or coefficients), recorded over time in some object space. The data or signal is transformed (using a specific transformation function) into a signal in a transformed space. A small subset of the “strongest” transformed coefficients is saved as features. These features form a feature space, which is simply a projection of the transformed space. This representation is sparse so that operations that can take advantage of data sparsity are computationally very fast if performed in feature space. The features can be transformed back into object space, resulting in a compressed approximation of the original data.

Many techniques for signal analysis require the data to be in the frequency domain. Therefore, distance-preserving ortho normal transformations are often used to transform the data from the time domain to the frequency domain. Usually, a data-independent transformation is applied, where the transformation matrix is determined a priori, independent of the input data. Because the distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain, the DFT does a good job of preserving essentials in the first few coefficients. By keeping only the first few (i.e., “strongest”) coefficients of the DFT, we can compute the lower bounds of the actual distance.

Indexing Methods for Similarity Search: “Once the data are transformed by, say, a DFT, how can we provide support for efficient search in time-series data?” For efficient accessing, a multidimensional index can be constructed using the first few Fourier coefficients. When a similarity query is submitted to the system, the index can be used to retrieve the sequences that are at most a certain small distance away from the query sequence. Post processing is then performed by computing the actual distance between sequences in the time domain and discarding any false matches.

For subsequence matching, each sequence can be broken down into a set of “pieces” of windows with length w. In one approach, the features of the subsequence inside each window are then extracted. Each sequence is mapped to a “trail” in the feature space. The trail of each sequence is divided into “subtrails,” each represented by a minimum bounding rectangle. A multipiece assembly algorithm can then be used to search for longer sequence matches.

Various kinds of indexing methods have been explored to speed up the similarity search. For example, R-trees and R?-trees have been used to store the minimal bounding rectangles mentioned above. In addition, the ε-kdB tree has been developed for faster spatial similarity joins on high-dimensional points, and suffix trees have also been explored. References are given in the bibliographic notes.