Guide to the TechWeb Network

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home




December 05, 2000



Is Your Data Correct?

Simple statistical techniques can help you ensure that users have accurate information at their fingertips

by Ralph Kimball

One common problem in the data warehouse back room is verifying that the data is correct before you release it to your users. Is the warehouse an accurate image of the production system? Was this morning's download complete? Could some of the numbers be corrupted?

There is no single technique for validating a data load because so much variability exists in the data sources. If you are downloading an unchanged image of a production source (preserving the original granularity), you can probably run a "flash report" on the production system with up-to-the minute totals, then recapitulate the same report on the data warehouse. In this case, you "know" the answer before running the report, and the two results should match to the last decimal place.

But more often you will not have a known baseline of data. For example, perhaps you are receiving the individual sales transactions from 600 retail stores every night. You can certainly perform a gross count on the number of stores reporting, but how can you apply some additional judgment to determine the probability that the data is correct?

Continuing with the 600 stores as an example, let's look at the sales totals for each department within each store for each morning and ask if today's new numbers are reasonable. We will decide that today's sales total is reasonable if it falls within three standard deviations of the mean of the previous sales totals for that department in that store.

I chose three standard deviations because in a "normal" distribution, 99 percent of the values lie within three standard deviations above or below the mean. If all the data values are valid, approximately one percent of them will still fail our test. Perhaps after reviewing these outliers, we will decide that the overall data load seems reasonable.

I'll describe the process of checking the data using this simple technique just in words. After that I'll include some SQL, but you can skip the SQL and still get the basic idea.

In order to make this process run quickly, you want to avoid looking at the complete time history of old data when you are calculating the standard deviation. You can do this by keeping three accumulating numbers for each department in each store in a special table used only in the data validity pass. You need to keep the number of days you are accumulating, the accumulating sum of each day's sales (by department in each store) and the accumulating sum of the SQUARE of each day's sales (again, by department within store). These could be kept in a little, stand-alone accumulating department table. The grain of this table is department by store, and the three numeric fields, NUMBER_DAYS, SUM_SALES, and SUM_SQUARE_SALES, are all Type 1 attributes that are overwritten each day. You can update all three of these fields just by adding the next day's values to the ones already present. So if you have 600 stores and 20 departments in each store, this table has 12,000 rows but does not grow over time. The table also carries the store names and department names in each row.

Using this accumulating department table, look at all 12,000 department totals in the morning's data load, and kick out the morning's numbers that are more than three standard deviations from the mean. You can choose to examine the specific numbers with unusual values if there aren't too many, or you can reject the entire load if you see more than one percent of the data values being flagged as out-of-bounds.

If the morning's load passes muster, release the data to your end users and update the accumulating department table to get ready for the next day's load.

Here's some untested SQL that might work in this scenario. Recall that the standard deviation is the square root of the variance. The variance is the sum of the squares of the differences between each of the historical data points and the mean of the data points, divided by N21, where N is the number of days of data. Normally the calculation of the variance requires us to look at the entire time history of sales, which although possible, makes the computation unattractive. But if you have been keeping track of SUM_SALES and SUM_SQUARE_SALES, you can write the variance as (1/(N21))*(SUM_SQUARE_SALES_(1/N)*SUM_SALES*SUM_SALES).

So if you abbreviate your variance formula with "VAR," then your data validity check looks like this:

SELECT s.storename, p.department-
name, sum(f.sales)
FROM fact f, store s, product p, time t,
accumulatingdept a
WHERE

First, take care of joins between tables:

f.storekey = s.storekey and
f.productkey = p.productkey and
f.timekey = t.timekey and
s.storename = a.storename and
p.departmentname = a.department-
name and

Then, constrain the time to today to get the newly loaded data:

t.full_date = #December 13, 2000# and

Finally, invoke the standard deviation constraint:

HAVING
ABS(sum(f.sales)2
(1/a.N)*a.SUM_SALES) &#gt; 3*SQRT
(a.VAR)

Expand VAR as in the previous explanation and use the "a." prefix on N, SUM_SALES, and SUM_SQUARE_SALES. I have assumed that departments are groupings of products and hence are available as a rollup in the product dimension.

Embellishments on this scheme could include running two queries: one for the sales more than three standard deviations above the mean, and another for sales less than three standard deviations below the mean. Maybe there is a different explanation for these two situations. Running two queries would also get rid of the ABS function if your SQL doesn't like this in the HAVING clause.

If you normally have significant daily fluctuations in sales (let's say Monday and Tuesday are very slow compared to Saturday), you could add a DAY-OF-WEEK to the accumulating department table and constrain to the appropriate day. Although doing this expands the accumulating department table by a factor of seven, it may remove a significant source of variance and make your test more accurate. Adding this bit of SQL is a kind of poor man's way of handling known daily fluctuations. I'll describe a high power way to deal with "trading day" and seasonal fluctuations in the last section of this column.

Judging Data Quality With No History

There are some special situations in which you can calculate the expected mean and variance of incoming data even if you have not been accumulating history. Suppose you are collecting data from a large number of individual respondents within the age range of 40 to 49 years. If you have no reason to believe that there is a bias that preferentially selects one gender over another, you can use the ratio of males in the population at large (assume 47/100) and the ratio of females (therefore 53/100) as the basis of your statistics. A statistician would say that your data represents a set of Bernoulli trials, and the resulting data should fit a binomial distribution. Consulting your statistics textbooks, you know that if you have recorded 1,000 respondents, you would expect the mean number of males to be N*P where N= 1,000 and P=47/100. This works out to 470 males in your sample. Seems reasonable. From the textbooks, the variance is N*P*(12P), which is 1,000* 0.47*0.53=249.1. The standard deviation is the square root of the variance, which in this case is 15.78.

Again, applying the three standard deviation criteria, you will worry about any data element that is more than three standard deviations away from the mean. In other words, if you see that the number of males reported in a sample of 1,000 is less than 47023*15.78=422.6 or is greater than 47013*15.78=517.3, stop to ask if the reported data is reasonable.

Consider whether you are comfortable assuming that your data is generated with a constant probability. If you are, you can use this technique without accumulating a complex past history. Other examples in which the constant probability assumption is plausible might include the yield percentages of manufacturing processes such as wafer fabs, paper production, or steel mills.

Compensating for Predictable Changes

Many of the measures (facts) that occur in a data warehouse come from marketplaces that exhibit predictable seasonal fluctuations. The holiday season and the summer are predictable, high-volume periods. Remove this seasonality as a source of variance when you judge your incoming data. Similarly, maybe business has been growing at an accelerating rate for the past several years; this acceleration could even be nonlinear. Finally, there may be specific days in the week or month during which you know your business volume is predictably high or low.

You can remove these sources of variance in your data in order to judge whether your data is reasonable, but you need some high-powered statistical help. You will want to project your previous data forward in time to today, then compare this projection with the data that has just come in.

Professional statisticians for many years have used a technique known as X-11-ARIMA to remove these effects from time series data. ARIMA is an acronym standing for auto-regressive integrated moving average. The X-11 algorithm takes your existing data and projects it forward. In other words, it tells you what it expects today's data value to be.

The venerable X-11 algorithm, which statisticians have been using in various forms for more than 30 years, has been superseded recently by X-12-ARIMA, which offers more flexibility for dealing with constant (nonseasonal) trends and short term effects such as trading day fluctuations. All the serious statistics software vendors, such as SAS, SPSS, SYSTAT, and others, will have X-12 modules that you can use to filter out unreasonable data values.

Most of the serious extract-transform-load (ETL) vendors such as Informatica Corp., Sagent Technology Inc., Ardent Software Inc. (now part of Informix), and Evolutionary Technologies International (ETI) offer "transform" modules with advanced statistics processing, including X-11-ARIMA and X-12-ARIMA. Have a talk with your ETL vendor and ask them how to attach one of the serious statistics packages to your backroom data flow.



Ralph Kimball co-invented the Star Workstation at Xerox and founded Red Brick Systems. He has three best-selling data warehousing books in print, including the newly released The Data Webhouse Toolkit (Wiley, 2000). Ralph teaches dimensional data warehouse design through Kimball University and critically reviews large data warehouse projects. You can reach Ralph through his Web site at www.ralphkimball.com.






IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo Jitter
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet Evolution
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space