SKEDSOFT

Data Mining & Data Warehousing

Introduction: There are many approaches to text mining, which can be classified from different perspectives, according to the inputs the text mining system takes and the data mining tasks to be performed. In general, the major approaches, based on the kinds of data they take as input, are:

1. the keyword-based approach, where the input is a set of keywords or terms in the documents,

2. the tagging approach, where the input is a set of tags, and

3. the information-extraction approach, which inputs semantic information, such as events, facts, or entities uncovered by information extraction.

A simple keyword-based approach may only discover relationships at a relatively shallow level, such as rediscovery of compound nouns (e.g., “database” and “systems”) or co-occurring patterns with less significance (e.g., “terrorist” and “explosion”). It may not bring much deep understanding of the text. The tagging approach may rely on tags obtained by manual tagging (which is costly and infeasible for large collections of documents) or by some automated categorization algorithm (which may process a relatively small set of tags and requires the categories to be defined beforehand). The information-extraction approach is more advanced and may lead to the discovery of some deep knowledge, but it requires semantic analysis of text by natural language understanding and machine learning methods. This is a challenging knowledge discovery task.

Various text mining tasks can be performed on the extracted keywords, tags, or semantic information. These include document clustering, classification, information extraction, association analysis, and trend analysis. We examine a few such tasks in the following discussion.

Keyword-Based Association Analysis: Such analysis collects sets of keywords or terms that occur frequently together and then finds the association or correlation relationships among them.

Like most analyses of text databases, association analysis first preprocesses the text data by parsing, stemming, removing stop words, and so on, and then invokes association mining algorithms. In a document database, each document can be viewed as a transaction, while a set of keywords in the document can be considered as a set of items in the transaction. That is, the database is in the format

{document_id, a_set_of_keywords}
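The preprocessing step and the transaction format can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: the stop-word list is tiny, `crude_stem` only strips a trailing "s" and stands in for a real stemmer, and the three sample documents are invented.

```python
import re

# Tiny illustrative stop-word list (an assumption; real systems use larger lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "is", "are", "to", "for"}

def crude_stem(word):
    """Strip a trailing plural 's'; a stand-in for a real stemming algorithm."""
    if len(word) > 4 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def to_transaction(text):
    """Parse the text, drop stop words, and stem, returning the keyword set."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}

# Hypothetical sample documents keyed by document_id.
documents = {
    "d1": "Mining association rules in database systems",
    "d2": "The database system stores transactions",
    "d3": "Text mining finds associations in documents",
}

# The {document_id, a_set_of_keywords} format: one keyword set per document.
transactions = {doc_id: to_transaction(text) for doc_id, text in documents.items()}
```

Each value in `transactions` plays the role of one transaction's item set, so the result can be passed to any itemset mining routine that accepts sets of items.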

The problem of keyword association mining in document databases is thereby mapped to item association mining in transaction databases, where many interesting methods have been developed, as described in Chapter 5.
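To make the mapping concrete, the sketch below counts keyword sets over a handful of documents and derives rules from the frequent pairs. It is a brute-force count written for readability, not one of the optimized algorithms used in practice; the sample data, the function names, and the thresholds are all assumptions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical document database in {document_id, a_set_of_keywords} form.
transactions = {
    "d1": {"database", "system", "query"},
    "d2": {"database", "system", "index"},
    "d3": {"database", "system", "query", "index"},
    "d4": {"text", "mining", "query"},
}

def frequent_itemsets(transactions, min_support, max_size=2):
    """Count keyword sets up to max_size and keep those meeting min_support."""
    counts = Counter()
    for keywords in transactions.values():
        for size in range(1, max_size + 1):
            # Sorting gives every itemset one canonical tuple form.
            for itemset in combinations(sorted(keywords), size):
                counts[itemset] += 1
    n = len(transactions)
    return {items: c / n for items, c in counts.items() if c / n >= min_support}

def rules_from_pairs(supports, min_confidence):
    """Derive rules a -> b from frequent pairs; confidence = sup(a, b) / sup(a)."""
    rules = []
    for items, support in supports.items():
        if len(items) != 2:
            continue
        for a, b in (items, items[::-1]):
            confidence = support / supports[(a,)]
            if confidence >= min_confidence:
                rules.append((a, b, support, confidence))
    return rules

supports = frequent_itemsets(transactions, min_support=0.5)
rules = rules_from_pairs(supports, min_confidence=0.9)
```

With this sample data, "database" and "system" appear together in three of the four documents, so the pair has support 0.75 and each keyword implies the other with confidence 1.0.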

Notice that a set of frequently occurring consecutive or closely located keywords may form a term or a phrase. The association mining process can help detect compound associations, that is, domain-dependent terms or phrases, such as [Stanford, University] or [U.S., President, George W. Bush], or noncompound associations, such as [dollars, shares, exchange, total, commission, stake, securities]. Mining based on these associations is referred to as “term-level association mining” (as opposed to mining on individual words). Term recognition and term-level association mining enjoy two advantages in text analysis: (1) terms and phrases are automatically tagged, so there is no need for human effort in tagging documents; and (2) the number of meaningless results is greatly reduced, as is the execution time of the mining algorithms.
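One simple way to approximate term recognition is to treat adjacent word pairs that recur across documents as candidate terms and then rewrite each document at the term level. The sketch below does only that; it handles two-word terms, ignores "closely located" but non-adjacent words, and its function names, threshold, and sample token lists are assumptions.

```python
from collections import Counter

def candidate_terms(token_lists, min_count):
    """Return adjacent word pairs that appear in at least min_count documents."""
    doc_freq = Counter()
    for tokens in token_lists:
        # A set ensures each document counts a given pair at most once.
        doc_freq.update(set(zip(tokens, tokens[1:])))
    return {pair for pair, count in doc_freq.items() if count >= min_count}

def merge_terms(tokens, terms):
    """Rewrite a token list so each recognized pair becomes one term-level item."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in terms:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical preprocessed documents (already lowercased and tokenized).
docs = [
    ["stanford", "university", "announced", "research", "funding"],
    ["research", "at", "stanford", "university", "expanded"],
    ["university", "funding", "increased"],
]

terms = candidate_terms(docs, min_count=2)
term_level_docs = [merge_terms(tokens, terms) for tokens in docs]
```

Here only the pair ("stanford", "university") recurs, so it is merged into the single item `stanford_university`, while the standalone "university" in the third document is left as an individual word.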

With such term and phrase recognition, term-level mining can be invoked to find associations among a set of detected terms and keywords. Some users may like to find associations between pairs of keywords or terms from a given set of keywords or phrases, whereas others may wish to find the maximal set of terms occurring together. Therefore, based on user mining requirements, standard association mining or max-pattern mining algorithms may be invoked.
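The difference between the two requirements can be sketched as follows: a frequent term set is maximal when no proper superset of it is also frequent. The code enumerates every candidate set by brute force, which is only workable for a toy vocabulary; dedicated max-pattern algorithms avoid this enumeration. The sample term sets, function names, and threshold are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count):
    """Enumerate every term set contained in at least min_count transactions."""
    vocabulary = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(vocabulary) + 1):
        for candidate in combinations(vocabulary, size):
            items = frozenset(candidate)
            if sum(items <= t for t in transactions) >= min_count:
                frequent.append(items)
    return frequent

def maximal_itemsets(frequent):
    """Keep only the frequent sets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Hypothetical term-level transactions, one set of terms per document.
transactions = [
    {"dollars", "shares", "exchange"},
    {"dollars", "shares", "exchange", "stake"},
    {"dollars", "shares", "securities"},
    {"stake", "securities"},
]

frequent = frequent_itemsets(transactions, min_count=2)
pairs = [s for s in frequent if len(s) == 2]   # pairwise associations
maximal = maximal_itemsets(frequent)           # maximal co-occurring term sets
```

A user interested in pairwise associations would read `pairs`, whereas a user looking for the largest groups of co-occurring terms would read `maximal`, which summarizes the three frequent pairs here as the single set {dollars, shares, exchange}.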