SKEDSOFT

Data Mining & Data Warehousing

Introduction: Mining that is performed without user- or expert-specified constraints may generate numerous patterns that are of no interest. Such unfocused mining can reduce both the efficiency and usability of frequent-pattern mining. Thus, we promote constraint-based mining, which incorporates user-specified constraints to reduce the search space and derive only patterns that are of interest to the user.

Constraints can be expressed in many forms. They may specify desired relationships between attributes, attribute values, or aggregates within the resulting patterns mined. Regular expressions can also be used as constraints in the form of “pattern templates,” which specify the desired form of the patterns to be mined. The general concepts introduced for constraint-based frequent pattern mining apply to constraint-based sequential pattern mining as well. The key idea to note is that these kinds of constraints can be used during the mining process to confine the search space, thereby improving (1) the efficiency of the mining and (2) the interestingness of the resulting patterns found. This idea is also referred to as “pushing the constraints deep into the mining process.”

We now examine some typical examples of constraints for sequential pattern mining. First, constraints can be related to the duration, T, of a sequence. The duration may be the maximal or minimal length of the sequence in the database, or a user-specified duration related to time, such as the year 2005. Sequential pattern mining can then be confined to the data within the specified duration, T.

Constraints relating to the maximal or minimal length (duration) can be treated as anti monotonic or monotonic constraints, respectively. For example, the constraint T 10 is anti monotonic since, if a sequence does not satisfy this constraint, then neither will any of its super sequences (which are, obviously, longer). The constraint T >10 is monotonic. This means that if a sequence satisfies the constraint, then all of its super sequences will also satisfy the constraint. We have already seen several examples in this chapter of how anti monotonic constraints (such as those involving minimum support) can be pushed deep into the mining process to prune the search space. Monotonic constraints can be used in a way similar to its frequent-pattern counterpart as well.

Constraints related to a specific duration, such as a particular year, are considered succinct constraints. A constraint is succinct if we can enumerate all and only those sequences that are guaranteed to satisfy the constraint, even before support counting begins. Suppose, here, T = 2005. By selecting the data for which year = 2005, we can enumerate all of the sequences guaranteed to satisfy the constraint before mining begins. In other words, we don’t need to generate and test. Thus, such constraints contribute toward efficiency in that they avoid the substantial overhead of the generate-and-test paradigm.

Durations may also be defined as being related to sets of partitioned sequences, such as every year, or every month after stock dips, or every two weeks before and after an earthquake. In such cases, periodic patterns (Section 8.3.4) can be discovered.

Second, the constraint may be related to an event folding window, w. A set of events occurring within a specified period can be viewed as occurring together. If w is set to be as long as the duration, T, it finds time-insensitive frequent patterns—these are essentially frequent patterns, such as “In 1999, customers who bought a PC bought a digital camera as well” (i.e., without bothering about which items were bought first). If w is set to 0 (i.e., no event sequence folding), sequential patterns are found where each event occurs at a distinct time instant, such as “A customer who bought a PC and then a digital camera is likely to buy an SD memory chip in a month.” If w is set to be something in between (e.g., for transactions occurring within the same month or within a sliding window of 24 hours), then these transactions are considered as occurring within the same period, and such sequences are “folded” in the analysis.

Third, a desired (time) gap between events in the discovered patterns may be specified as a constraint. Possible cases are: (1) gap = 0 (no gap is allowed), which is to find strictly consecutive sequential patterns like a i-1aiai 1. For example, if the event folding window is set to a week, this will find frequent patterns occurring in consecutive weeks; (2) min gap ≤ gap ≤ max gap, which is to find patterns that are separated by at least min gap but at most max gap, such as “If a person rents movie A, it is likely she will rent movie B within 30 days” implies gap ≤ 30 (days); and (3) gap = c ≠ 0, which is to find patterns with an exact gap, c. It is straightforward to push gap constraints into the sequential pattern mining process. With minor modifications to the mining process, it can handle constraints with approximate gaps as well.