Full Analysis

PolyAnalyst mines data and text, and its engines run the algorithmic gamut

by Greg James
Megaputer Intelligence Inc., the U.S.-based corporation behind PolyAnalyst, traces its roots back to the Artificial Intelligence (AI) Research and Development group at Moscow State University. PolyAnalyst debuted in 1994 and has been enhanced continually ever since. Version 4.5 adds decision forests, transactional market basket analysis, and link analysis to the base product. Notably, it's also the first of several commercial packages to offer integrated text mining within the same system as numeric data mining. Until recently, text and numeric mining were separate endeavors.
PolyAnalyst's user interface is a familiar, interactive development environment: standard menus and a horizontal toolbar across the top; a hierarchical project tree on the left; a multiview workspace in the center; and a system message window across the bottom. Its design focuses on high-level data-mining functions and user-defined data-mining projects. These projects comprise Attributes, Data Sets, Illustrations and Graphs, Rules, Reports, Mining Models, and Data Links. These components are listed in the project tree and visualized in the multiview workspace. (See Figure 1.)

Perhaps the most distinctive thing about PolyAnalyst is its holistic approach to combining low-level algorithms into high-level data-mining functions. Megaputer's goal is to provide a comprehensive set of "exploration engines" built out of best-of-class machine-learning, statistical, and data-mining algorithms. The current version of PolyAnalyst includes 15 exploration engines: Summary Statistics, Linear Regression, Find Dependencies, Classify, Cluster, Decision Tree, Decision Forest, Discriminate, Find Laws, Nearest Neighbor, PolyNet Predictor, Basket Analysis, Transactional Basket Analysis, Link Analysis, and Text Analysis. PolyAnalyst is methodologically agnostic, and its complete set of exploration engines supports almost any data-mining strategy. I review several here to provide insight into PolyAnalyst's algorithmic sophistication.

Find Laws

The Find Laws exploration engine is unique to PolyAnalyst and sits at the core of the system. Find Laws automatically generates candidate formulas and tests them to find the ones that best fit the data. Find Laws is capable of describing complex, multidimensional, nonlinear relationships even though the equations are restricted to rational expressions (ratios of polynomials).
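The generate-and-test idea behind Find Laws can be illustrated with a toy sketch in Python. It is not Megaputer's algorithm: the candidate set (single terms of the form x, x*y, and x/y), the helper names, and the sample data are all assumptions made for illustration, and every candidate is tried exhaustively rather than being steered by a guided search. Each candidate is scored by its residual standard error, as in ordinary regression.

```python
import itertools
import math


def candidate_terms(names):
    """Yield (label, function) pairs for simple terms: x, x*y, and x/y."""
    for n in names:
        yield n, (lambda row, n=n: row[n])
    for a, b in itertools.combinations_with_replacement(names, 2):
        yield f"{a}*{b}", (lambda row, a=a, b=b: row[a] * row[b])
    for a, b in itertools.permutations(names, 2):
        yield f"{a}/{b}", (lambda row, a=a, b=b: row[a] / row[b])


def find_best_formula(rows, target, names):
    """Fit y = c * term for every candidate; return (label, c, std_error) of the best."""
    ys = [row[target] for row in rows]
    best = None
    for label, term in candidate_terms(names):
        try:
            xs = [term(row) for row in rows]
        except ZeroDivisionError:
            continue  # candidate is undefined on this data set
        sxx = sum(x * x for x in xs)
        if sxx == 0:
            continue  # term is identically zero, nothing to fit
        c = sum(x * y for x, y in zip(xs, ys)) / sxx  # least-squares coefficient
        sse = sum((y - c * x) ** 2 for x, y in zip(xs, ys))
        std_error = math.sqrt(sse / (len(ys) - 1))  # one fitted parameter
        if best is None or std_error < best[2]:
            best = (label, c, std_error)
    return best


# Hypothetical data generated from y = 3 * a / b.
rows = [{"a": a, "b": b, "y": 3 * a / b} for a in (1, 2, 3, 4) for b in (1, 2, 4)]
label, coeff, err = find_best_formula(rows, "y", ["a", "b"])
print(f"best candidate: y = {coeff:.2f} * {label} (standard error {err:.3g})")
```

Because the winning model in a sketch like this is a plain algebraic expression, it can be copied into a scoring script or a query with little translation.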
Megaputer's developers drew upon their extensive background in AI to develop an algorithm that keeps Find Laws' equation generator from producing numerous trivial and redundant expressions. A special search algorithm directs the system to generate the candidate equations most likely to improve on previous ones. Find Laws evaluates candidate equations by their standard errors, the same measure you use with standard regression models. The benefit of this design is that Find Laws generates complex models stated in algebraic terms, making them relatively easy to deploy in production environments. PolyAnalyst uses its own Symbolic Rule Language (SRL), which doubles as its scripting language and its internal model representation language.

Find Dependencies

The Find Dependencies exploration engine is another example of higher-level data-mining functionality. One of the first steps in a predictive modeling project is identifying the relationships between a specific target (dependent) variable and all relevant input (independent) variables. Next, feature selection reduces the set of inputs to only those that have the most predictive relationship to the target. This process becomes exponentially more challenging as the number of inputs and their possible values grow. PolyAnalyst's Find Dependencies algorithm accomplishes both steps simultaneously, requiring you merely to specify the target and input variables. Furthermore, Megaputer encourages data miners to consider selecting all available input variables if very little is known about the data set. Find Dependencies can run in two modes: strict or liberal. Use strict mode to discover which attributes have the strongest influence on the chosen target. Liberal mode, on the other hand, will identify exceptional cases or outliers, depending upon the research objective.
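One way to picture the ranking-and-selection step is the Python sketch below. The article does not say how Find Dependencies measures influence, so this example swaps in absolute Pearson correlation as a stand-in; the function names, the 0.5 threshold, and the sample records are hypothetical, and the strict and liberal modes are not modeled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_inputs(rows, target, inputs):
    """Return (input, score) pairs sorted from strongest to weakest."""
    ys = [row[target] for row in rows]
    scores = {name: abs(pearson([row[name] for row in rows], ys)) for name in inputs}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_inputs(rows, target, inputs, threshold=0.5):
    """Keep only inputs whose score reaches the (assumed) threshold."""
    return [name for name, score in rank_inputs(rows, target, inputs) if score >= threshold]


# Hypothetical records: spend tracks income closely, age loosely, shoe size barely.
records = [
    {"income": 10, "age": 30, "shoe_size": 9, "spend": 21},
    {"income": 20, "age": 25, "shoe_size": 7, "spend": 39},
    {"income": 30, "age": 40, "shoe_size": 10, "spend": 62},
    {"income": 40, "age": 35, "shoe_size": 8, "spend": 80},
    {"income": 50, "age": 45, "shoe_size": 9, "spend": 101},
]
all_inputs = ["income", "age", "shoe_size"]
for name, score in rank_inputs(records, "spend", all_inputs):
    print(f"{name}: {score:.2f}")
print("selected:", select_inputs(records, "spend", all_inputs))
```

Passing every available column as an input, as Megaputer suggests for unfamiliar data, costs little in a scheme like this because weak inputs simply fall below the cutoff.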
Used in tandem, the two modes make the Find Dependencies engine an extremely efficient way of getting a handle on new data sets.

Nearest Neighbor

The Nearest Neighbor exploration engine is useful for classifying cases into one of several mutually exclusive categories and for predicting categories for new cases. Thus, you can use it for both exploration and prediction. Nearest Neighbor's accuracy improves with more cases, but Megaputer warns that data sets in excess of 100,000 records can require a significant amount of time to complete.
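A brute-force k-nearest-neighbor classifier, sketched below in Python, shows the basic idea: a new case takes the majority category of the closest stored cases. The value of k, the Euclidean distance, the function name, and the sample data are assumptions; the article does not describe PolyAnalyst's implementation. This naive version compares each new case against every stored case, so its cost grows with the size of the data set.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training cases.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort all stored cases by Euclidean distance to the query and keep k of them.
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training cases in two mutually exclusive categories.
train = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((9, 8), "high"), ((8, 9), "high"),
]
print(knn_predict(train, (2, 2)))
print(knn_predict(train, (7, 9)))
```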