SKEDSOFT

Data Mining & Data Warehousing

Introduction: Because mining the complete set of frequent subsequences can generate a huge number of sequential patterns, an interesting alternative is to mine frequent closed subsequences only, that is, those containing no super sequence with the same support. Mining closed sequential patterns can produce a significantly less number of sequences than the full set of sequential patterns. Note that the full set of frequent subsequences, together with their supports, can easily be derived from the closed subsequences. Thus, closed subsequences have the same expressive power as the corresponding full set of subsequences. Because of their compactness, they may also be quicker to find.

CloSpan is an efficient closed sequential pattern mining method. The method is based on a property of sequence databases, called equivalence of projected databases, stated as follows: Two projected sequence databases, S|a = S|b,8 a ⊆ b (i:e:;a is a subsequence of b), are equivalent if and only if the total number of items in S|a is equal to the total number of items in Sjb. Based on this property, CloSpan can prune the non closed sequences from further consideration during the mining process. That is, whenever we find two prefix-based projected databases that are exactly the same, we can stop growing one of them. This can be used to prune backward sub patterns and backward super patterns.

After such pruning and mining, a post processing step is still required in order to delete non closed sequential patterns that may exist in the derived set. A later algorithm called BIDE (which performs a bidirectional search) can further optimize this process to avoid such additional checking. Empirical results show that CloSpan often derives a much smaller set of sequential patterns in a shorter time than Prefix Span, which mines the complete set of sequential patterns.

Mining Multidimensional, Multilevel Sequential Patterns

Sequence identifiers (representing individual customers, for example) and sequence items (such as products bought) are often associated with additional pieces of information. Sequential pattern mining should take advantage of such additional information to discover interesting patterns in multidimensional, multilevel information space. Take customer shopping transactions, for instance. In a sequence database for such data, the additional information associated with sequence IDs could include customer age, address, group, and profession. Information associated with items could include item category, brand, model type, model number, place manufactured, and manufacture date. Mining multidimensional, multilevel sequential patterns is the discovery of interesting patterns in such a broad dimensional space, at different levels of detail.

Example: Multidimensional, multilevel sequential patterns. The discoveries that “Retired customers who purchase a digital camera are likely to purchase a color printer within a month” and that “Young adults who purchase a laptop are likely to buy a flash drive within two weeks” are examples of multidimensional, multilevel sequential patterns. By grouping customers into “retired customers” and “young adults” according to the values in the age dimension, and by generalizing items to, say, “digital camera” rather than a specific model, the patterns mined here are associated with additional dimensions and are at a higher level of granularity.

“Can a typical sequential pattern algorithm such as Prefix Span be extended to efficiently mine multidimensional, multilevel sequential patterns?” One suggested modification is to associate the multidimensional, multilevel information with the sequence ID and item ID, respectively, which the mining method can take into consideration when finding frequent subsequences. For example, (Chicago, middle aged, business) can be associated with sequence ID 1002 (for a given customer), whereas (Digital camera, Canon, Super shot, SD400, Japan, 2005) can be associated with item ID 543005 in the sequence. A sequential pattern mining algorithm will use such information in the mining process to find sequential patterns associated with multidimensional, multilevel information.