June 26, 2000, Volume 3 - Number 10
Look Back in Anger
An example of using "forensic tests" to sort out real-world SQL
In textbooks, you always get to design a database from scratch, using all the right tools and proper techniques. The database is always in at least Third Normal form. The identifiers are always consistently named. And there is documentation a human being can both find and read easily.
Back in the real world, things are not that smooth. You frequently will enter a project where none of those things are true. Here is a worst-case scenario based on an actual consulting job; it is not atypical.
The first problem is that no documentation exists about the database. Time to play SQL detective. Here is what I need to do just to understand what a mess I have.
First, I obtain a script that creates all the tables in the schema. Every product has this, so there is nothing really special about it. This file is the source for all of the following scripts I will use to look at that database, which means a lot of text editing.
I ignore the views, triggers, most of the constraints, and so forth on the first attempt, because the base tables will certainly change during the cleanup.
The first "forensic test" is simply a sort on all the rows and their declarations. You would be surprised at how many times I find the same name used with totally different data types, sizes, and constraints.
account CHAR (10) NOT NULL,
account VARCHAR (50) NOT NULL,
active_ingredient VARCHAR (63) NOT NULL,
active_ingredient VARCHAR (80) NOT NULL,
address1 VARCHAR (30) NOT NULL,
address1 VARCHAR (50) NOT NULL,
and so forth.
Look at the data types used. This schema has NVARCHAR(n) and VARCHAR(n) data types; that makes sense because it has foreign mailing addresses and can use the national character sets. This schema also has DECIMAL(s,p), NUMERIC(s,p), MONEY, and FLOAT data types. This is bad. The difference between DECIMAL(s,p) and NUMERIC(s,p) is subtle in the SQL-92 Standard -- DECIMAL(s,p) must be exactly as precise as declared, while NUMERIC(s,p) must be at least as precise as declared. The Sybase and SQL Server family treats them the same.
So, why would anyone use both approaches to DECIMAL(s,p) and NUMERIC(s,p) in the same schema? Two or more programmers and no coding standards! The game is afoot, Watson! NUMERIC(s,p) looks like the Oracle proprietary NUMBER(s,p) declarations. Make a note, detectives, we can expect to find at least one programmer who worked on this database trying to use Oracle code in SQL Server.
The moron that declared columns to be FLOAT in a commercial system is probably a C programmer who does not understand rounding errors. While I do not like the use of the MONEY data type because it is proprietary, it does have the advantage in the Sybase and Microsoft SQL Server families of displaying with floating dollar signs, commas, and two decimal places.
The different naming styles for the identifiers support the multiple programmer theory . I see lowercase names, mixed-case names, underscores, abbreviations and full words all being used.
Okay, put that printout aside for now.
Now I apply a script that finds all the columns that are nothing but NULLs. The MAX() function will return a NULL if all the values in a column are NULLs, and it works for all data types, except the proprietary BIT, IMAGE, and other blobs.
However, some of the columns are insanely long and the display tool allocates the maximum column width for the printout. A CASE expression that puts a 1 in the columns I can drop from the table and a 0 in those that have some contents will keep things small enough to read. Here is an example of the code.
SELECT 'AccessDB',
CASE WHEN MAX(id) IS NULL
THEN 1 ELSE 0 END AS 'id',
CASE WHEN MAX(password) IS NULL
THEN 1 ELSE 0 END AS 'password',
CASE WHEN MAX(departno) IS NULL
THEN 1 ELSE 0 END AS 'departno',
CASE WHEN MAX(lname) IS NULL
THEN 1 ELSE 0 END AS 'lname',
CASE WHEN MAX(fname) IS NULL
THEN 1 ELSE 0 END AS 'fname'
FROM AccessDB;
You can also use the CAST() function in SQL Server to convert BIT data types into integers, but it still ignores image, text and other blobs.
The third script shows the allocated size of a column and the actual size used. This particular database likes to declare things as much too big. Grabbing a text editor, I take the declaration script and produce a few pages of this kind of code.
SELECT 'AccessDB',
MAX(CHAR_LENGTH(id)) AS id, (50),
MAX(CHAR_LENGTH(password)) AS password, (255),
MAX(CHAR_LENGTH(departno)) AS departno, (100),
MAX(CHAR_LENGTH(lname)) AS lname, (50),
MAX(CHAR_LENGTH(fname)) AS fname, (50)
FROM AccessDB;
This trick cleans up the CHAR(n) declarations, but I still see declarations like DECIMAL(18,0) used when INTEGER would have been smaller and faster.
The reason for the DECIMAL(18,0) declaration, I believe, is that the original programmer had worked with Oracle and tried to imitate that data storage convention in SQL Server. Oracle uses Cobol-style "PICTURE" storage, while SQL Server uses C-style internal binary formats. In one of the older Cobol standards, 18 was a "magic number" for the length of identifiers in the language.
I have a mental list of such "magic numbers" that show up in programming. These values had some physical basis at one time but have often outlived the reason for their creation. For instance, you will find obvious examples like powers of two minus one -- the number 255 became common from its use in the 16 bit machines that spawned C. The 80-column width of many forms originated from the width of an old CRT screen, which in turn stemmed from the 80-column punch card. The 80-column punch card goes back to the size of a U.S. Dollar bill because when Herman Hollerith invented the punch card for the Census, he borrowed storage cabinets from the Treasury Department.
I now look at the original script and see that DECIMAL(18,0) is an IDENTITY data type column. This data type is a proprietary Sybase and SQL Server family convention that auto-increments values in a column. It is nonrelational, proprietary, and nonportable and creates a nightmare of internal problems in the Sybase and SQL Server family.
Furthermore, I see that while no PRIMARY KEY declarations are made, the intent is to use the IDENTITY columns as a surrogate key or a replacement for an Oracle-style ROWID. This idea is awful; you can duplicate the meaningful data in the table, but the IDENTITY columns keep the row unique. This usage creates the impression of having a key without any of the benefits. In addition, no FOREIGN KEY constraints exist to tie an order and its details together.
This is not a database, but a set of separate files using SQL Server to hold them.
The invoice header table has a column called total, which probably contains the total of the invoice details. Having summary data in a row is always a bad idea because:
- It is not safe. Rounding errors accumulate, procedural code does not always update the table correctly, and so forth. Why do by hand that which you can do more accurately by machine?
- Totals can be calculated in main storage faster than they can be read from secondary storage. No disk drive spins faster than electrons can move.
The naming conventions in this database are irregular, sometimes spelling things out in full and sometimes abbreviating them. Some identifiers contain underscores, while square brackets, which allow the use of embedded blanks, enclose other identifiers.
By the way, the use of square brackets is proprietary to SQL Server and MS Access; the correct syntax is double quote marks. These days, when you see square brackets, you know you are in deep trouble, because it almost always means that you have code written by an Access programmer. Access is not SQL. I spend days in my classes getting the victims to unlearn Access before I can begin to teach them real SQL. But that is a rant for another day.
Another carryover from the old Cobol file system days is that there are recurring column groups of identifiers whose names end with a number. That is, "shipper1, ship_street1, ..." followed by "shipper2, ship_street2, ..." and so forth. In a Cobol program, these would have been handled with an OCCURS clause in a storage declaration. You Pascal programmers can think of a variant record structure. In SQL, you have a table with First Normal Form problems. They need to be split off into a separate Shippers table, then joined to the Shipments table as needed.
My next step is to make a copy of the existing database and use the information from the first probe to remove columns that are not being used. It might also be worth writing a SQL query script that will look for columns that have one and only one value in them as candidates for removal. That is a pretty easy set of ALTER TABLE statements.
You can also use the second script that gave us the actual vs. declared column widths to create ALTER TABLE statements, but you have to be careful. If your sample data is not representative, you might shorten the column too much. The other problem is that some SQL products do not allow you to shorten a column, but only to widen it. Some products are picky about changing numeric data types, too.
Okay, this gets the DDL pretty well under way. Phase two will look at the embedded SQL in the programs. Changing the data types might make some problems for the host programs. Now I will go for the "Wright Brothers systems engineering" approach -- put it all together, push it off a cliff, see if it flies and keep patching the thing until it does. Detective work can be dirty.
Puzzle:
While I do not recommend this puzzle as a technique, and it will vary from SQL dialect to dialect, this puzzle is a good exercise in learning to think in sets. You have a quoted string made up of integers separated by commas, and your goal is to break each integer out as a row in a table.
The obvious approach is to write procedural code that will loop over the input string and cut off all characters from the start up to but including the first comma in each iteration. Can you do this without an iterative construct, using pure SQL-92?
Answer:
One way to do this is with an auxiliary table of sequential numbers and this strange looking query.
INSERT INTO ParmList (parmeter_position, parameter)
SELECT S1.i,
CAST (SUBSTRING ((','|| :instring ||',')
FROM (S1.i + 1)
FOR (S2.i - S1.i - 1))
AS INTEGER)
FROM Sequence AS S1,
Sequence AS S2
WHERE SUBSTRING((','|| :instring ||',') FROM S1.i FOR 1) = ','
AND SUBSTRING((','|| :instring ||',') FROM S2.i FOR 1) = ','
AND S2.i
= (SELECT MIN(S3.i)
FROM Sequence AS S3
WHERE S1.i < S3.i
AND SUBSTRING((','|| :instring ||',')
FROM S3.i
FOR 1) = ',')
AND S1.i < CHAR_LENGTH (:instring + 1)
AND S2.i < CHAR_LENGTH (:instring + 2);
The trick here is to concatenate commas on the left and right sides of the input string. To be honest, you would probably want to trim blanks and perhaps do other tests on the string, such as LOWER(:instring) = UPPER(:instring) to avoid alphabetic characters, and so forth. That edited result string would be kept in a local variable and used in the INSERT INTO statement.
The integer substrings are located between the i-th and ((i+1)-th comma pairs. In effect, the sequence table replaces the loop counter. The Sequence table has to have enough numbers to cover the entire string, but unless you really like to type in long parameter lists, this requirement should not be a problem. The last two predicates are to avoid a Cartesian product with the Sequence table.
Joe Celko is an Atlanta-based independent consultant. He is the author of three
books on SQL -- SQL For Smarties (Morgan-Kaufmann, 1995),
SQL Puzzles and Answers (Morgan-Kaufmann, 1995), and
Instant SQL Programming (Wrox Press, 1997) -- and wrote the SQL for Smarties column for DBMS
magazine. You can contact him via email at www.celko.com or 71062.1056@compuserve.com.
Copyright © 2004 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission