SKEDSOFT

Data Mining & Data Warehousing

Introduction:

The A priori-like sequential pattern mining approach (based on candidate generate-and test) can also be explored by mapping a sequence database into vertical data format. In vertical data format, the database becomes a set of tuples of the form {item set: (sequence ID, event ID)}.

That is, for a given itemset, we record the sequence identifier and corresponding event identifier for which the itemset occurs. The event identifier serves as a timestamp within a sequence. The event ID of the ith itemset (or event) in a sequence is i. Note than an itemset can occur in more than one sequence. The set of (sequence ID, event ID) pairs for a given itemset form the ID list of the itemset. The mapping from horizontal to vertical format requires one scan of the database. A major advantage of using this format is that we can determine the support of any k-sequence by simply joining the ID lists of any two of its (k-1)-length subsequences. The length of the resulting ID list (i.e., unique sequence ID values) is equal to the support of the k-sequence, which tells us whether the sequence is frequent.

SPADE (Sequential Pattern Discovery using Equivalent classes) is an A priori-based sequential pattern mining algorithm that uses vertical data format. As with GSP, SPADE requires one scan to find the frequent 1-sequences. To find candidate 2-sequences, we join all pairs of single items if they are frequent (therein, it applies the A priori property), if they share the same sequence identifier, and if their event identifiers follow a sequential ordering. That is, the first item in the pair must occur as an event before the second item, where both occur in the same sequence. Similarly, we can grow the length of itemsets from length 2 to length 3, and so on. The procedure stops when no frequent sequences can be found or no such sequences can be formed by such joins. The following example helps illustrate the process.

Example: SPADE: Candidate generate-and-test using vertical data format. Let min sup = 2. Our running example sequence database, S, of Table 8.1 is in horizontal data format. SPADE first scans S and transforms it into vertical format, as shown in Figure 8.6(a). Each itemset (or event) is associated with its ID list, which is the set of SID (sequence ID) and EID (event ID) pairs that contain the itemset. The ID list for individual items, a, b, and so on, is shown in Figure 8.6(b). For example, the ID list for item b consists of the following (SID, EID) pairs: f(1, 2), (2, 3), (3, 2), (3, 5), (4, 5)g, where the entry (1, 2) means that b occurs in sequence 1, event 2, and so on. Items a and b are frequent. They can be joined to form the length-2 sequence, ha, bi. We find the support of this sequence as follows. We join the ID lists of a and b by joining on the same sequence ID wherever, according to the event IDs, a occurs before b. That is, the join must preserve the temporal order of the events involved. The result of such a join for a and b is shown in the ID list for ab of Figure 8.6(c). For example, the ID list for 2-sequence ab is a set of triples, (SID, EID (a), EID (b)), namely f(1, 1, 2), (2, 1, 3), (3, 2, 5), (4, 3, 5)g. The entry (2, 1, 3), for example, shows that both a and b occur in sequence 2, and that a (event 1 of the sequence) occurs before b (event 3), as required. Furthermore, the frequent 2-sequences can be joined (while considering the A priori pruning heuristic that the (k-1)-subsequences of a candidate k-sequence must be frequent) to form3-sequences, as in Figure 8.6(d), and so on. The process terminates when no frequent sequences can be found or no candidate sequences can be formed. Additional details of the method can be found in Zaki [Zak01].

The use of vertical data format, with the creation of ID lists, reduces scans of the sequence database. The ID lists carry the information necessary to find the support of candidates. As the length of a frequent sequence increases, the size of its ID list decreases, resulting in very fast joins. However, the basic search methodology of SPADE and GSP is breadth-first search (e.g., exploring 1-sequences, then 2-sequences, and so on) and A priori pruning. Despite the pruning, both algorithms have to generate large sets of candidates in breadth-first manner in order to grow longer sequences. Thus, most of the difficulties suffered in the GSP algorithm recur in SPADE as well.