SKEDSOFT

Data Mining & Data Warehousing

Introduction: Sequential pattern mining is computationally challenging because such mining may generate and/or test a combinatorial explosive number of intermediate subsequences.

Recent developments have made progress in two directions: (1) efficient methods for mining the full set of sequential patterns, and (2) efficient methods for mining only the set of closed sequential patterns, where a sequential pattern s is closed if there exists no sequential pattern s0 where s0 is a proper super sequence of s, and s0 has the same (frequency) support as s. Because all of the subsequences of a frequent sequence are also frequent, mining the set of closed sequential patterns may avoid the generation of unnecessary subsequences and thus lead to more compact results as well as more efficient methods than mining the full set. We will first examine methods for mining the full set and then study how they can be extended for mining the closed set. In addition, we discuss modifications for mining multilevel, multidimensional sequential patterns (i.e., with multiple levels of granularity).

The major approaches for mining the full set of sequential patterns are similar to those introduced for frequent itemset mining in Chapter 5. Here, we discuss three such approaches for sequential pattern mining, represented by the algorithms GSP, SPADE, and Prefix Span, respectively. GSP adopts a candidate generate-and-test approach using horizontal data format (where the data are represented as {sequence ID: sequence of item sets}, as usual, where each itemset is an event). SPADE adopts a candidate generate and-test approach using vertical data format (where the data are represented as {item set: (sequence ID, event ID)}). The vertical data format can be obtained by transforming from a horizontally formatted sequence database in just one scan. Prefix Span is a pattern growth method, which does not require candidate generation.

All three approaches either directly or indirectly explore the A priori property, stated as follows: every nonempty subsequence of a sequential pattern is a sequential pattern. (Recall that for a pattern to be called sequential, it must be frequent. That is, it must satisfy minimum support.) The A priori property is anti monotonic (or downward-closed) in that, if a sequence cannot pass a test (e.g., regarding minimum support), all of its super sequences will also fail the test. Use of this property to prune the search space can help make the discovery of sequential patterns more efficient.

GSP: A Sequential Pattern Mining Algorithm Based on Candidate Generate-and-Test

GSP (Generalized Sequential Patterns) is a sequential pattern mining method that was developed by Srikant and Agrawal in 1996. It is an extension of their seminal algorithm for frequent itemset mining, known as A priori (Section 5.2). GSP uses the downward-closure property of sequential patterns and adopts a multiple-pass, candidate generate-and-test approach. The algorithm is outlined as follows. In the first scan of the database, it finds all of the frequent items, that is, those with minimum support. Each such item yields a 1-event frequent sequence consisting of that item. Each subsequent pass starts with a seed set of sequential patterns—the set of sequential patterns found in the previous pass. This seed set is used to generate new potentially frequent patterns, called candidate sequences. Each candidate sequence contains one more item than the seed sequential pattern from which it was generated (where each event in the pattern may contain one or multiple items). Recall that the number of instances of items in a sequence is the length of the sequence. So, the entire candidate sequences in a given pass will have the same length. We refer to a sequence with length k as a k-sequence. Let Ck denote the set of candidate k-sequences. A pass over the database finds the support for each candidate k-sequence. The candidates in Ck with at least min sup form Lk, the set of all frequent k-sequences. This set then becomes the seed set for the next pass, k 1. The algorithm terminates when no new sequential pattern is found in a pass, or no candidate sequence can be generated.