SKEDSOFT

Data Mining & Data Warehousing

Introduction: In some applications, such as information retrieval, text document clustering, and biological taxonomy, we need to compare and cluster complex objects (such as documents) containing a large number of symbolic entities (such as keywords and phrases). To measure the distance between complex objects, it is often desirable to abandon traditional metric distance computation and introduce a non-metric similarity function.

There are several ways to define such a similarity function, s(x, y), to compare two vectors x and y. One popular way is to define the similarity function as a cosine measure as follows:

s(x, y) = (xt , y) / (||x||.||y||)

where xt is a transposition of vector x, ||x|| is the Euclidean norm of vector x,1 ||y|| is the Euclidean norm of vector y, and s is essentially the cosine of the angle between vectors x and y. This value is invariant to rotation and dilation, but it is not invariant to translation and general linear transformation.

When variables are binary-valued (0 or 1), the above similarity function can be interpreted in terms of shared features and attributes. Suppose an object x possesses the ith attribute if xi = 1. Then xt.y is the number of attributes possessed by both x and y, and |x||y| is the geometric mean of the number of attributes possessed by x and the number possessed by y. Thus s(x, y) is a measure of relative possession of common attributes.

Example: Non-metric similarity between two objects using cosine. Suppose we are given two vectors, x = (1, 1, 0, 0) and y = (0, 1, 1, 0). By Equation (7.16), the similarity between x and

y is s(x, y) = (0 1 0 0) / √2 √2 = 0.5.

A simple variation of the above measure is

s(x, y) = (xt . y) /  (xt . x yt . y - xt . y )

which is the ratio of the number of attributes shared by x and y to the number of attributes possessed by x or y. This function, known as the Tanimoto coefficient or Tanimoto distance, is frequently used in information retrieval and biology taxonomy.

Notice that there are many ways to select a particular similarity (or distance) function or normalize the data for cluster analysis. There is no universal standard to guide such selection. The appropriate selection of such measures will heavily depend on the given application. One should bear this in mind and refine the selection of such measures to ensure that the clusters generated are meaningful and useful for the application at hand.