Introduction: Conceptual clustering is a form of clustering in machine learning that, given a set of unlabeled objects, produces a classification scheme over the objects. Unlike conventional clustering, which primarily identifies groups of like objects, conceptual clustering goes one step further by also finding characteristic descriptions for each group, where each group represents a concept or class. Hence, conceptual clustering is a two-step process: clustering is performed first, followed by characterization. Here, clustering quality is not solely a function of the individual objects. Rather, it incorporates factors such as the generality and simplicity of the derived concept descriptions.
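COBWEB, an incremental conceptual clustering method that organizes objects into a classification tree, uses a heuristic evaluation measure called category utility to guide construction of the tree. Category utility (CU) is defined as

$$CU = \frac{\sum_{k=1}^{n} P(C_k)\left[\sum_i \sum_j P(A_i = v_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = v_{ij})^2\right]}{n}$$
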
where $n$ is the number of nodes, concepts, or "categories" forming a partition, $\{C_1, C_2, \ldots, C_n\}$, at the given level of the tree, $A_i$ is the $i$-th attribute, and $v_{ij}$ is its $j$-th possible value. In other words, category utility is the increase in the expected number of attribute values that can be correctly guessed given a partition (where this expected number corresponds to the term $P(C_k) \sum_i \sum_j P(A_i = v_{ij} \mid C_k)^2$) over the expected number of correct guesses with no such knowledge (corresponding to the term $\sum_i \sum_j P(A_i = v_{ij})^2$). Although we do not have room to show the derivation, category utility rewards intraclass similarity and interclass dissimilarity, where:
Intraclass similarity is the probability $P(A_i = v_{ij} \mid C_k)$. The larger this value is, the greater the proportion of class members that share this attribute-value pair and the more predictable the pair is of class members.
Interclass dissimilarity is the probability $P(C_k \mid A_i = v_{ij})$. The larger this value is, the fewer the objects in contrasting classes that share this attribute-value pair and the more predictive the pair is of the class.
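
To make the measure concrete, the following is a minimal Python sketch of category utility for a flat partition of objects with nominal attributes, together with the two probabilities just described. It is an illustration under stated assumptions, not COBWEB itself: it does not build or update a tree, probabilities are estimated as simple relative frequencies, and the function names (sum_sq_value_probs, category_utility, predictability, predictiveness) and the toy color/shape data are hypothetical.

```python
from collections import Counter


def sum_sq_value_probs(objects, attributes):
    """Estimate sum_i sum_j P(A_i = v_ij)^2 from a list of objects (dicts)."""
    n = len(objects)
    total = 0.0
    for attr in attributes:
        counts = Counter(obj[attr] for obj in objects)
        total += sum((c / n) ** 2 for c in counts.values())
    return total


def category_utility(partition):
    """CU of a partition given as a list of non-empty clusters (lists of dicts)."""
    all_objects = [obj for cluster in partition for obj in cluster]
    attributes = sorted(all_objects[0].keys())
    # Expected number of correct guesses with no knowledge of the partition.
    baseline = sum_sq_value_probs(all_objects, attributes)
    cu = 0.0
    for cluster in partition:
        p_ck = len(cluster) / len(all_objects)  # P(C_k)
        cu += p_ck * (sum_sq_value_probs(cluster, attributes) - baseline)
    return cu / len(partition)  # divide by n, the number of categories


def predictability(cluster, attr, value):
    """Intraclass similarity P(A_i = v_ij | C_k) as a relative frequency."""
    return sum(1 for obj in cluster if obj[attr] == value) / len(cluster)


def predictiveness(partition, k, attr, value):
    """Interclass dissimilarity P(C_k | A_i = v_ij) as a relative frequency."""
    in_k = sum(1 for obj in partition[k] if obj[attr] == value)
    overall = sum(1 for c in partition for obj in c if obj[attr] == value)
    return in_k / overall if overall else 0.0


# Hypothetical toy data: two red round objects and two blue square objects.
a = {"color": "red", "shape": "round"}
b = {"color": "red", "shape": "round"}
c = {"color": "blue", "shape": "square"}
d = {"color": "blue", "shape": "square"}

good = [[a, b], [c, d]]   # groups identical objects together
mixed = [[a, c], [b, d]]  # splits each natural group across clusters

print(category_utility(good))   # 0.5
print(category_utility(mixed))  # 0.0
```

In this toy case the partition that groups identical objects scores higher than the one that mixes them, which matches the reading of CU as the gain in correctly guessed attribute values. For the good partition, both P(color = red | C_1) and P(C_1 | color = red) equal 1, while for the mixed partition both drop to 0.5.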