The Soul of the Data Warehouse, Part Two: Drilling Across
Drilling across means asking for the same row headers from another fact table
by Ralph Kimball
The three fundamental themes that make up the soul of every data
warehouse are drilling down, drilling across, and handling time. In
Part One of "The Soul of the Data Warehouse," I showed that drilling
down was nothing more than adding a row header, any row header, to an
existing query. Although we often grouse about SQL's limitations as a
report writer, when it comes to drilling down, SQL gracefully expands
an existing query whenever a new row header is added. This simple
result led to the powerful realization that when data is organized in
a symmetrical, predictable fashion starting at the most atomic level,
all queries and applications benefit.
If drilling down is the most fundamental maneuver in a data
warehouse, drilling across is a close second. From the perspective of
an answer set, drilling across adds more data to an existing row.
Note that this result isn't what you get from a UNION of rows from
separate queries. It's better described as the column accretion from
separate queries.
Drilling across by adding another measured fact to the SELECT list
from the existing fact table mentioned in the query is a trivial
accomplishment. What's more interesting and important is adding
another measured fact from a new fact table.
The issues raised by this simple view of drilling across are at
the heart of data warehouse architecture. These issues boil down to
an observation and a choice.
Drill-across observation: The new fact table called for in the
drill-across operation must share certain dimensions with the fact
table in the original query.
Certain dimensions will be named in the original query because
they contribute row headers. Remember that these row headers are the
basis of the grouping that creates the answer-set row. These
dimensions will appear in the FROM clause of the SQL code and will be
joined to the fact table through the relationship of a foreign key to
primary key. The new fact table must also support exactly these same
row headers, or the context of the answer-set row is meaningless.
Drill-across choice: Either send a single, simultaneous SQL
request to the two fact tables or send two separate requests.
Although sending a single SQL request to the two fact tables seems
cleaner, this choice can become a showstopper. Sending a single
request means mentioning both fact tables in the FROM clause of the
SQL code and joining both fact tables in some way to the common
dimension tables I just discussed. This commingling of two fact
tables in the same SQL statement causes these problems:
- Because the two fact tables will be joined together either
directly or through the common dimensions, the query must specify
whether the many-to-many relationship between the two fact tables is
handled with inner or outer joins. This fundamental challenge arises
from the relational model. It's effectively impossible to get this
right, even if you're an expert SQL programmer. Depending on the
relative cardinality of the two fact tables, your aggregated numeric
totals can either be too low or too high, or both! Even if you don't
believe me, you have to deal with the next bullet point.
- The vast majority of queries the relational database receives
are generated by powerful query tools and report writers, and you
have no direct control over the SQL they emit. You don't want control
over the SQL. Some of these tools generate mind-boggling reams of SQL
and you can't effectively intervene.
- Emitting a single SQL statement precludes you from requesting
data from separate table spaces, separate machines, or separate
database vendors. You're stuck in the same table space on the same
machine talking to one database vendor. If you can easily avoid this
problem, why take on these restrictions?
- Finally, if you emit a single SQL statement involving both fact
tables, you'll almost certainly be unable to use any of the powerful
query-rewrite tools that perform aggregate navigation. Aggregate
navigation is the most cost-effective way to make dramatic gains in
data warehouse performance. For more on aggregate navigation, see my
column "Aggregate Navigation with (Almost) No Metadata" (August 1996)
in the DBMS magazine archives accessible through
www.ralphkimball.com
or at www.dbmsmag.com/9608d54.html.