SKEDSOFT

Data Mining & Data Warehousing

Introduction:

Social networks are dynamic. New links appear, indicating new interactions between objects. In the link prediction problem, we are given a snapshot of a social network at time t and wish to predict the edges that will be added to the network during the interval from time t to a given future time, t0. In essence, we seek to uncover the extent to which the evolution of a social network can be modeled using features intrinsic to the model itself.

As an example, consider a social network of coauthor ship among scientists. Intuitively, we may predict that two scientists who are “close” in the network may be likely to collaborate in the future. Hence, link prediction can be thought of as a contribution to the study of social network evolution models.

Approaches to link prediction have been proposed based on several measures for analyzing the “proximity” of nodes in a network. Many measures originate from techniques in graph theory and social network analysis. The general methodology is as follows: All methods assign a connection weight, score(X, Y), to pairs of nodes, X and Y, based on the given proximity measure and input graph, G. A ranked list in decreasing order of score(X, Y) is produced. This gives the predicted new links in decreasing order of confidence. The predictions can be evaluated based on real observations on experimental data sets.

The simplest approach ranks pairs, hX, Yi, by the length of their shortest path in G. This embodies the small world notion that all individuals are linked through short chains. (Since the convention is to rank all pairs in order of decreasing score, here, score (X, Y) is defined as the negative of the shortest path length.) Several measures use neighborhood information. The simplest such measure is common neighbors—the greater the number of neighbors that X and Y have in common, the more likely X and Y are to form a link in the future. Intuitively, if authors X and Y have never written a paper together but have many colleagues in common, the more likely they are to collaborate in the future. Other measures are based on the ensemble of all paths between two nodes. The Katz measure, for example, computes a weighted sum over all paths between X and Y, where shorter paths are assigned heavier weights. All of the measures can be used in conjunction with higher-level approaches, such as clustering. For instance, the link prediction method can be applied to a cleaned-up version of the graph, in which spurious edges have been removed.

In experiments conducted on bibliographic citation data sets, no one method is superior to all others. Several methods significantly outperform a random predictor, which suggests that network topology can provide useful information for link prediction. The Katz measure, and variations of it based on clustering, performed consistently well, although the accuracy of prediction is still very low. Future work on link prediction may focus on finding better ways to use network topology information, as well as to improve the efficiency of node distance calculations such as by approximation.