SKEDSOFT

Data Mining & Data Warehousing

Introduction: Knowing the characteristics of small world networks is useful in many situations. We can build graph generation models, which incorporate the characteristics. These may be used to predict how a network may look in the future, answering “what-if ” questions. If a hypothesis contradicts the generally accepted characteristics, this raises a flag as to the questionable plausibility of the hypothesis. This can help detect abnormalities in existing graphs, which may indicate fraud, spam, or Distributed Denial of Service (DDoS) attacks. Models of graph generation can also be used for simulations when real graphs are excessively large and thus, impossible to collect (such as a very large network of friendships). In this section, we study the basic characteristics of social networks as well as a model for graph generation.

Most studies examine the nodes’ degrees, that is, the number of edges incident to each node, and the distances between a pair of nodes, as measured by the shortest path length. (This measure embodies the small world notion that individuals are linked via short chains.) In particular, the network diameter is the maximum distance between pairs of nodes. Other nodeto- node distances include the average distance between pairs and the effective diameter (i.e., the minimum distance, d, such that for at least 90% of the reachable node pairs, the path length is at most d).

Social networks are rarely static. Their graph representations evolve as nodes and edges are added or deleted over time. In general, social networks tend to exhibit the following phenomena:

1. Densification power law: Previously, it was believed that as a network evolves, the number of degrees grows linearly in the number of nodes. This was known as the constant average degree assumption. However, extensive experiments have shown that, on the contrary, networks become increasingly dense over time with the average degree increasing (and hence, the number of edges growing super linearly in the number of nodes). The densification follows the densification power law (or growth power law), which states

e(t) n(t)a

where e(t) and n(t), respectively, represent the number of edges and nodes of the graph at time t, and the exponent a generally lies strictly between 1 and 2. Note that if a = 1, this corresponds to constant average degree over time, whereas a = 2 corresponds to an extremely dense graph where each node has edges to a constant fraction of all nodes.      

2. Shrinking diameter: It has been experimentally shown that the effective diameter tends to decrease as the network grows. This contradicts an earlier belief that the diameter slowly increases as a function of network size. As an intuitive example, consider a citation network, where nodes are papers and a citation from one paper to another is indicated by a directed edge. The out-links of a node, v (representing the papers cited by v), are “frozen” at the moment it joins the graph. The decreasing distances between pairs of nodes consequently appears to be the result of subsequent papers acting as “bridges” by citing earlier papers from other areas.

3.  Heavy-tailed out-degree and in-degree distributions: The number of out-degrees for a node tends to follow a heavy-tailed distribution by observing the power law, 1=na, where n is the rank of the node in the order of decreasing out-degrees and typically, 0 < a < 2 (Figure 9.17). The smaller the value of a, the heavier the tail. This phenomena is represented in the preferential attachment model, where each new node attaches to an existing network by a constant number of out-links, following a “rich-get-richer” rule. The in-degrees also follow a heavy-tailed distribution, although it tends be more skewed than the out-degrees distribution.

A Forest Fire model for graph generation was proposed, which captures these characteristics of graph evolution over time. It is based on the notion that new nodes attach to the network by “burning” through existing edges in epidemic fashion. It uses two parameters, forward burning probability, p, and backward burning ratio, r, which are described below. Suppose a new node, v, arrives at time t. It attaches to Gt , the graph constructed so far, in the following steps:

1. It chooses an ambassador node, w, at random, and forms a link to w.

2. It selects x links incident to w, where x is a random number that is binomially distributed with mean (1-p)-1. It chooses from out-links and in-links of w but selects in-links with probability r times lower than out-links. Let w1;w2; : : : ;wx denote the nodes at the other end of the selected edges.

3. Our new node, v, forms out-links to w1;w2; : : : ;wx and then applies step 2 recursively to each of w1;w2; : : : ;wx. Nodes cannot be visited a second time so as to prevent the construction from cycling. The process continues until it dies out.