SKEDSOFT

Data Mining & Data Warehousing

Introduction: Before presenting graph mining methods, it is necessary to first introduce some preliminary concepts relating to frequent graph mining. We denote the vertex set of a graph g by V(g) and the edge set by E(g). A label function, L, maps a vertex or an edge to a label. A graph g is a sub-graph of another graph g0 if there exists a sub-graph isomorphism from g to g’. Given a labeled graph data set, D = {G1;G2; : : : ;Gn}, we define support(g) (or frequency(g)) as the percentage (or number) of graphs in D where g is a sub-graph. A frequent graph is a graph whose support is no less than a minimum support threshold, min_sup.

Apriori-based Approach

Apriori-based frequent substructure mining algorithms share similar characteristics with Apriori-based frequent itemset mining algorithms (Chapter 5). The search for frequent graphs starts with graphs of small “size,” and proceeds in a bottom-up manner by generating candidates having an extra vertex, edge, or path. The definition of graph size depends on the algorithm used.

The general framework of Apriori-based methods for frequent substructure mining is outlined in Figure 9.3.We refer to this algorithm as Apriori Graph. Sk is the frequent substructure set of size k. We will clarify the definition of graph sizewhenwe describe specific Apriori-based methods further below. Apriori Graph adopts a level-wise mining methodology. At each iteration, the size of newly discovered frequent substructures is increased by one. These new substructures are first generated by joining two similar but slightly different frequent sub-graphs that were discovered in the previous call to Apriori Graph.

This candidate generation procedure is outlined on line 4. The frequency of the newly formed graphs is then checked. Those found to be frequent are used to generate larger candidates in the next round.

The main design complexity of Apriori-based substructure mining algorithms is the candidate generation step. The candidate generation in frequent itemset mining is straightforward. For example, suppose we have two frequent itemsets of size-3: (abc) and (bcd). The frequent itemset candidate of size-4 generated from them is simply (abcd), derived from a join. However, the candidate generation problem in frequent substructure mining is harder than that in frequent itemset mining, because there are many ways to join two substructures

Algorithm: Apriori Graph. Apriori-based frequent substructure mining.

Input:

  • D, a graph data set;
  • min sup, the minimum support threshold.

Output:

Sk, the frequent substructure set.

Method:

S1 frequent single-elements in the data set;

Call Apriori Graph(D, min sup, S1);

procedure Apriori Graph(D, min_sup, Sk)

(1) Sk 1 <- φ;

(2) for each frequent gi ∈Sk do

(3) for each frequent gj ∈Sk do

(4) for each size (k 1) graph g formed by the merge of gi and gj do

(5) if g is frequent in D and g ∉Sk 1 then

(6) insert g into Sk 1;

(7) if sk 1 ≠ φthen

(8) Apriori Graph(D, min_sup, Sk 1);

(9) return;