SKEDSOFT

Data Mining & Data Warehousing

Introduction: Many data mining systems specialize in one data mining function, such as classification, or just one approach for a data mining function, such as decision tree classification. Other systems provide a broad spectrum of data mining functions. Most of the systems described below provide multiple data mining functions and explore multiple knowledge discovery techniques. Website URLs for the various systems are provided in the bibliographic notes.

From database system and graphics system vendors:

Intelligent Miner is an IBM data mining product that provides a wide range of data mining functions, including association mining, classification, regression, predictive modeling, deviation detection, clustering, and sequential pattern analysis. It also provides an application toolkit containing neural network algorithms, statistical methods, data preparation tools, and data visualization tools. Distinctive features of Intelligent Miner include the scalability of its mining algorithms and its tight integration with IBM’s DB2 relational database system.

Microsoft SQL Server 2005 is a database management system that incorporates multiple data mining functions smoothly in its relational database system and data warehouse system environments. It includes association mining, classification (using decision tree, naïve Bayes, and neural network algorithms), regression trees, sequence clustering, and time-series analysis. In addition, Microsoft SQL Server 2005 supports the integration of algorithms developed by third-party vendors and application users.

MineSet, available from Purple Insight, was introduced by SGI in 1999. It provides multiple data mining functions, including association mining and classification, as well as advanced statistics and visualization tools. A distinguishing feature of MineSet is its set of robust graphics tools, including a rule visualizer, a tree visualizer, a map visualizer, and a (multidimensional data) scatter visualizer, for the visualization of data and data mining results.

Oracle Data Mining (ODM), an option to Oracle Database 10g Enterprise Edition, provides several data mining functions, including association mining, classification, prediction, regression, clustering, and sequence similarity search and analysis. Oracle Database 10g also provides an embedded data warehousing infrastructure for multidimensional data analysis.

From vendors of statistical analysis or data mining software:

Clementine, from SPSS, provides an integrated data mining development environment for end users and developers. Multiple data mining functions, including association mining, classification, prediction, clustering, and visualization tools, are incorporated into the system. A distinguishing feature of Clementine is its object-oriented, extended module interface, which allows users’ algorithms and utilities to be added to Clementine’s visual programming environment.

Enterprise Miner was developed by SAS Institute, Inc. It provides multiple data mining functions, including association mining, classification, regression, clustering, time-series analysis, and statistical analysis packages. A distinctive feature of Enterprise Miner is its variety of statistical analysis tools, which build on SAS’s long history in the statistical analysis market.

Insightful Miner, from Insightful Inc., provides several data mining functions, including data cleaning, classification, prediction, clustering, and statistical analysis packages, along with visualization tools. A distinguishing feature is its visual interface, which allows users to wire components together to create self-documenting programs.

Originating from the machine learning community:

CART, available from Salford Systems, is the commercial version of the CART (Classification and Regression Trees) system discussed in Chapter 6. It creates decision trees for classification and regression trees for prediction. CART employs boosting to improve accuracy. Several attribute selection measures are available.
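As an illustration of what an attribute selection measure does, the following is a minimal sketch of the Gini index, the measure most commonly associated with CART, used to choose a binary split on one numeric attribute. It is not Salford Systems' implementation; the function names `gini` and `best_binary_split` and the toy data are hypothetical and chosen only for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_binary_split(values, labels):
    """Find the threshold t for the test `value <= t` with the lowest weighted Gini.

    Returns (threshold, weighted_gini). The threshold is None when no split
    improves on the impurity of the unsplit data.
    """
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_threshold, best_score = None, gini(labels)
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # no threshold can separate two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            # Candidate threshold: midpoint between adjacent distinct values.
            best_threshold, best_score = (low + high) / 2, score
    return best_threshold, best_score


# Toy data (hypothetical): one numeric attribute and a two-class label.
ages = [1, 2, 3, 10, 11, 12]
classes = ["a", "a", "a", "b", "b", "b"]
threshold, score = best_binary_split(ages, classes)
```

A tree builder would repeat this search over every attribute at each node and recurse on the two resulting partitions; pruning and boosting are separate steps that this sketch does not cover.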

See5 and C5.0, available from RuleQuest, are commercial versions of the C4.5 decision tree and rule generation method described in Chapter 6. See5 is the Windows version, while C5.0 is its UNIX counterpart. Both incorporate boosting. The source code is also provided.
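For comparison with other tree learners, the sketch below computes the gain ratio, the attribute selection measure used by C4.5, for a single categorical attribute. It is an illustrative assumption about how the measure can be coded, not code from See5 or C5.0; the function names `entropy` and `gain_ratio` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (in bits) of the value distribution in `items`."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(attribute_values, labels):
    """Information gain of splitting on the attribute, divided by its split information."""
    n = len(labels)
    partitions = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        partitions[value].append(label)
    # Expected entropy of the class labels after partitioning on the attribute.
    info_after = sum(len(part) / n * entropy(part) for part in partitions.values())
    gain = entropy(labels) - info_after
    # Split information penalizes attributes with many distinct values.
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): `outlook` separates the classes perfectly, `windy` not at all.
labels = ["yes", "yes", "no", "no"]
outlook = ["sunny", "sunny", "rain", "rain"]
windy = ["t", "f", "t", "f"]
ratio_outlook = gain_ratio(outlook, labels)
ratio_windy = gain_ratio(windy, labels)
```

The attribute with the highest gain ratio would be chosen as the splitting attribute at a node, so `outlook` would be preferred over `windy` here.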

Weka, developed at the University of Waikato in New Zealand, is open-source data mining software in Java. It contains a collection of algorithms for data mining tasks, including data preprocessing, association mining, classification, regression, clustering, and visualization.