Eric Sebo
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Field Notes | |
|
Faced with increased competition in the customer relationship management (CRM) space, companies are turning to products and techniques that use data profiling and mapping to automate their data analysis process and ensure the integrity of their production models. Such projects are underway in various industries, including financial services, telecommunications, insurance, and transportation.
Data profiling systems enable the analysis of far more rows of data than with traditional methods, with a higher accuracy rate and fewer resources required, resulting in significant project savings and a very fast return on investment. This new technology will be implemented at an enterprise rather than project level, and becomes part of a larger data quality process supporting CRM initiatives.
I am involved with such a CRM effort at AXA Financial Inc., parent company of The Equitable Life Assurance Society of the United States, AXA Advisors, Donaldson, Lufkin, & Jenrette, DLJdirect, and Alliance Capital Management. Our AXA Client Solutions unit is implementing a CRM strategy to support the companys continued emphasis on customer service. The CRM project involves the migration of large quantities of data to a second-generation customer-centric data warehouse and business reporting environment.
AXA Financial recognized that in order to build a world-class financial services organization, it needed to accelerate the building of the CRM framework. The company developed an enterprise data warehouse architecture (EDWA) to provide underlying support for strategic business initiatives and the major CRM goals of customer acquisition, expansion, and retention. The EDWA includes information- and technical-architecture components, as well as multiple data store approaches to meet informational and operational data requirements. The EDWA uses new technologies and a multitiered data mart architecture to provide decision-makers with easy access to performance information. Reuse opportunities include requirements, designs, specifications, documentation, and system code. An initial information-architecture component was a new customer database sourced from an existing data warehouse.
The Value of Timely Data
The EDWA opens the door to new data possibilities. The consolidation of data from disparate sources onto a common platform provides a competitive advantage through timely access to a consistent source of information. With the advent of sophisticated knowledge discovery and data mining products, data used in the past to facilitate mainframe processes can now be leveraged for strategic purposes. For example, data used on an annual statement for an insurance policy can now be used as input to product, sales force, and customer profitability.
However, in order to achieve a true understanding of profitability, the integrity of the target model and the quality of the data being measured must be of the highest level. With a solid infrastructure being put into place, managements attention turned to taking a closer look at the data itself.
Datas Potential And Personality
AXA Financial knew that its existing data analysis process was extremely labor and time intensive and that even with knowledgeable workers putting in long hours, data issues would surface after the analysis phase. Being open to new technology, AXA Financial looked toward the emerging category of data profiling and mapping strategies to automate the data analysis process.
What is data profiling? One of the dictionary definitions of profile is a concise description of a persons abilities, personality, or career. The purpose of data profiling is to uncover the personality and potential of relational data so you can gain a true understanding of its content. In order to describe an attributes personality, profiling will provide data characteristics, such as data type, length, uniqueness, values, and frequency of occurrence.
Understanding the personality of the data lets you assess its quality; you can identify missing values and defects, for example. Datas value lies in its potential for use.
Can the data support the business rules? A profile of the functional dependencies that exist between tables columns can answer that question before you build the data model. Do similar-sounding fields from two tables have the same business meaning? Uncovering the overlap of data values across tables via a redundancy profile can pave the way for the integration of data from multiple sources.
In any integration or migration process, you need to know where data fields are coming from and where they are going. A data mapping document tracks the evolution of data fields from source schema, through normalization, and on to the production model. The benefit of data mapping software is that it tracks changes to attributes automatically, thereby eliminating the time-consuming task of manually adding to and maintaining a mapping document. A completed data mapping document can be input for the creation of source-to-target data maps within a transformation tool.
Data Quality Checks in the Architecture
Seeking to add data quality capabilities to the EDWA, AXA Financial turned to Evoke Software Corp. Evokes data profiling and mapping solution, Axio, saves significant time and resources in several data quality tasks, as shown in Table 1.
| Data Quality Tasks | Benefits of Data Profiling |
| 1. Identify risk factors | Data quality Issues discovered early |
| 2. Gain access to the data | Imports data from multiple platforms |
| 3. Specify the content of the data sample | Interactive sampling strategy |
| 4. View the data | GUI interface based on windows concepts |
| 5. Understand data structure and content | Infers metadata and points out anomalies |
| 6. Compare data fields for similar values | Provides overlapping domain values across files |
| 7. Understand data relationships | Determines functional dependencies |
| 8. Document and present the results | Capture and report on data discovery |
| Table 1 Profiling benefits to common data quality tasks. | |
For example, with the ability to define sample parameters through data profiling, you no longer have a reason to compromise on the size or content of the sample. Axio provides several sample load methods including random load, consecutive rows, and every nth row.
Without a profiling solution, the samples taken from large files will represent less than an entire table because the time to review thousands of rows of data is simply not available. With a profiling product, statistical inference takes the place of best guess.
Axio lets AXA Financial accelerate project completion time dramatically by automating what was a manual, error-prone process, and by eliminating iterative rework particularly in the development of target data models.
Putting Theory Into Action
AXA Financial recently used Axio to analyze data being moved into its new customer database. Axio reviewed more than 700 attributes on 32 tables within the project timeframe: a volume of 9GB! Data profiling contributed to a quicker deployment time and a greater confidence in data quality.
Output generated from Axio, along with analyst comments and observations, were placed into a spreadsheet format and shared with project team members. Table 2 is a partial example of documentation containing inferred metadata and observations based on a review of a columns domain values profile. A high-level rating of the files being reviewed can also be produced, as shown in Table 3.
| Attribute | # Distinct | Distinct Ratio | Documented Data Type | Inferred Data Type | Data Profile | Observations |
| ADDRBD | 419 | 0.0001 | DATE | DATE | Null (99%) | The <1% of data populating this field should be deleted if field not being used |
| ADDRID | 4,754,480 | 0.8503 | NUMBER (9, 0) | NUMBER (9, 0) | Contains varying length values: e.g.100,030, 10,166,922 | Determine reason for and effect of varying length fields |
| ENTDATE | 1722 | 0.0003 | DATE | DATE | 31 JUL 95 (74%) | Consistent population of values |
| Table 2 Table quality snapshot. | ||||||
| Source Table Name | Number of Fields | Excellent/Good Rating | Fair/Poor Rating | Comments |
| PRODUCT_ADDR | 8 | 4 | 4 | Clean-up and consistency issues |
| PRODUCT | 8 | 6 | 2 | Data availability issues |
| HOUSEHOLD | 23 | 23 | 0 | No significant data issues |
| PLAN_UNIT | 14 | 10 | 4 | Redundancy issues |
| Table 3 Sample column profiling results. | ||||
A Change in Mindset
Trepidation often persuades people not to look for problems: Dont look under the hood, you might find something wrong! Being the internal advocate of a data profiling and mapping solution can be challenging at times, because the technology really does look under the hood at the data. New data quality issues come to light, and these issues need to be addressed immediately, rather than later. And immediately may not be included in the project plan.
Other items to keep in mind when implementing a data profiling and mapping solution are as follows:
Training. Users will require a three- to four-day training class, which the vendor will most likely provide.
Project resources. Successful use of the software requires both product knowledge and modeling skills that will necessitate the training and availability of both business and technical resources.
Baby steps. As with a good video game, it is best to master the easier levels of the product first such things as discovery of metadata and data defects. Plunging too deep, too fast can be overwhelming.
Knowledge trap. Knowledge of the data may already exist within the project team, leading its members to believe that they dont need an automated analysis process. Profiling will always shed new light on data!
In contrast to these concerns, learning and using data profiling techniques will:
Reduce project time by minimizing scrap and rework
Add to the learning and growth perspective of the organizations balanced scorecard measurement
Extend the knowledge of a project team through statistical inference capabilities
Become a building block of an enterprise data quality process that also includes a data correction and standardization product, metadata management, and so on.
At AXA Financial, Evokes data profiling and mapping product is already considered a key part of the organizations technology framework and is being used in two additional strategic initiatives in early 2000. Whereas the initial use of Axio concentrated on understanding the datas personality (values, anomalies, quality, and so on), current project teams are also exploring the datas potential. For example, several uses work in conjunction with the extract, transform, and load (ETL) process. AXA Financial uses redundancy profiling to determine possible data gaps in the building of multidimensional cubes and dependency profiling to validate business rules discovered during transformation. Also, column profiling is being used to benchmark the output of a file when first moved onto the Oracle platform, so that future comparisons can be made at various staging areas within the data flow.
Eric Sebo (eric.sebo@equitable.com) is a technology consultant in information architecture for AXA Financial Inc. He has more than 10 years of experience in IT and has worked as a business/systems analyst on many of the companys strategic initiatives.
|
|
|
| |||||||||||||||||||||||||||||||





















