Tapes and Records
It's All Just Terminology, Right?
In January, we got a late Christmas present when Inprise (nee Borland)
decided to get very serious about the Linux database market by open-sourcing
the beta version of InterBase 6. It will also be available on other
platforms, including Linux, Windows NT, and Solaris.
Dale Fuller, the interim CEO and president of Inprise Corp., made the
announcement and immediately received praise from Eric S. Raymond, president of
the Open Source Initiative, who said "As open-source releases of operating
systems and critical middleware become the norm, Inprise's release will
undoubtedly raise the quality bar and customers will reap huge benefits in
reliability, security, and total cost of ownership."
While this sounds like a typical PR release, I have always liked Borland's
software. Unlike Microsoft, Borland products work well in the first release.
Does everyone remember how clean and easy Turbo Pascal was? How
nice it is to work with Delphi?
The beta version of InterBase 6 is scheduled to be available during the first
part of the year 2000. If you need a stable, maintainable embedded database
with a small footprint, you need to look at InterBase. The company also
announced it will continue to sell and support InterBase 5.6 through normal
distribution channels. You can get the details on the Borland Web site at
www.borland.com
This leads to another topic, since I have not done a technical column on the
basics in awhile -- data and concurrency control. Let's go back to the Dark
Ages, when I was young, wore a Nehru Jacket, and looked good in it (eat your
heart out, Austin Powers). People processed data largely with magnetic
tape systems and punch cards. Punch cards are such a bizarre concept for kids today
that I will not bother to explain them, but thanks to video and audio cassettes,
they understand tapes.
Tape systems have certain properties that are due to their physical nature. We took
those physical properties and made them into logical properties
in the file system. The first property is that the files are very closely
associated with their physical storage. You did not put more than one file
on a tape and did not like to put a single file on multiple tapes if you
could help it, which was a reasonable restriction for obvious reasons. If
you wrote past the end of your file, you would have overwritten the start of
someone else's file. How many times have you done that with a video tape?
Right now, I am waging a one-man war to get people in Internet chat groups
to refer to a "column" in a table instead of using the word "field" because
they are different things. In the old file systems and 3GL programming
languages, a statement such as "READ a, b, c FROM My file;" will not produce
the same data as "READ c, a, b FROM My_file;" because the application
program, and not the file system, defined the fields. The file system would
count off so many bytes or bits for the datatype of the field it was reading or
look for a field separator, and then stuff whatever value it found into the local
variable. Fields have an important order in the record.
Compare this to SQL, where "SELECT a, b, c FROM My_Table;" will produce the
same data as "SELECT c, a, b FROM My_Table;" because the database holds the
datatypes and names of the columns in a table. The data in the file can be almost
anything, even if it is dead wrong. People building data
warehouses from legacy data sources are discovering just how bad old data can
be. My favorite horror story so far is a part number of "I hate my job" that
had been in the system for decades, unnoticed.
A column in a table can have very elaborate constraints on it, so one is much more
sure of its correctness. A good database excludes bad data, and
does not have to depend on application programs for this function.
A "record" is not a "row" for pretty much the same reasons. On a tape,
records are in sequence. There is a first, next and last record. A table
has no order. Unfortunately, many SQL products have "vendor extensions"
that allow the programmer to reference a row by its physical location on a hard
disk or its position within the physical storage the product uses, which forces
design restrictions.
For example, if I have a ROWID "pseudo-column" like Oracle, I have forced the product to store each row of a
table in contiguous physical storage or to have an elaborate set of pointer
chains from which it can construct a single row on demand from non-contiguous
storage. I cannot dynamically relocate the data under the covers during a
session, nor can I play with a single column separately from the other columns in
its table. Also, distributed databases are much more complex than they need to
be. In short, these products have demolished set-oriented processing in
favor of "row-at-a-time" processing because under the covers and in the minds
of their original creators, they are really 1960s file systems. We used to cry in our beers
about this type of design flaw after ANSI X3H2 meetings.
And likewise, "tables" are not "files" because files are totally separate
from one another and have no enforceable relationships between them. That is why
a part number like "I hate my job" in the Orders file was not verified
with the Inventory file. In a tape system, you hung tapes on tape drives in
a particular order and on particular drive units to do a job. The payroll
system may require one subset of the accounting department files, the
retirement program would require another, slightly different subset, and so
forth.
In SQL, you connect to the database as a whole unit, or not at all. I cannot
place an Order for a part that is not in Inventory if I have used
Declarative Referential Integrity (DRI) between the tables. Tables are
not disjoint, separate things, but parts of a real system.
In a tape system, read-write rings were physical loops of plastic that you
put into a groove on the back side of a reel of magnetic tape to write-
protect it. If you improperly inserted or removed the ring, there was
nothing the application program or file system could do to correct the error.
The database Data Control Language (DCL) determines what you can do to what
tables and represents logical vs. physical access control.
A single application program mounted a file on a tape drive, opened the file
and read from it without sharing access or data with another program. If we
were to try to do this with a database, it would mean that only one user at a
time could be in the database. This is not as bad as it sounds at first. In certain
situations, such as physical reorganization or restoration, you do want to have
only programs in the database.
However, multiple users share a database, and everyone sees the same data.
One of the major problems with file systems was that the system kept the same
data element was kept in multiple files, so that in one file John Doe was
married and in another he
was single, proving that "a man with two watches will never know what time
it is," to quote a fortune cookie.
But how do you manage concurrent users? There are three basic schemes: (1)
Pessimistic, (2) Optimistic, and (3) Logical.
The Pessimistic family is the most common approach, and involves using locks
of some sort. Each vendor will have a slightly different scheme of locks,
but the essential idea is that no one other than the user who holds the lock can access
certain physical or logical units of data. In other words, expect conflicts.
Because locking is common, it is the best-understood method.
The Optimistic family is the next most common approach and it is the one that
InterBase uses (remember the lead paragraph of this column?). It involves
timestamps, or generations of data. Again, there will be slightly different
schemes, but a situation with two users trying to modify the same
data is rare and so you must handle it as an exception instead of an
expectation.
This method actually goes back to the days of microfilm archives. You
would go to central records and get a photocopy of a microfilmed file, and
take it back to your desk to work on it. The central records people would
timestamp the copy. If someone else got a copy of the same data, the system
would check to see if there were any conflicts and execute some conflict
resolution procedure. One user might have priority over another, the last
change might take effect, both users might be asked to try again, or some
other rule could be used.
Logical concurrency is still a laboratory idea and I do not know if anyone
has built a commercial product with it yet. The essential idea is to let
many users as possible into the database so long as they do not run into each
other. The way that the concurrency system determines who will not run into each
other is by looking at the predicates. For example, if one user is executing this
statement:
UPDATE Foobar
SET a = 5
WHERE b >= 0;
it can run at the same time as
UPDATE Foobar
SET a = 10
WHERE b < 0;
because the set of rows they affect do not overlap.
A deadlock occurs when two users need the same resources to complete their
tasks. Let's call the resources A and B, with user number one holding A while
user number two holds B. They sit facing each other, neither one willing to
give up the resource they hold. This can happen in a Pessimistic because of
the locks. A conflict, but not a deadlock, can occur in an Optimistic system,
but a Logical concurrency control system is completely free of this problem.
You can resolve the Optimistic conflict by aborting one or both of the user's session(s).
A livelock occurs when a user cannot get to a resource because a series of other
users keep excluding him. This has been a running gag for slapstick comedy
since the beginning of time; the runt of the litter is pushed away from his
mother by his siblings and never gets to suckle. This can happen in all
three systems, but it is essentially a Logical concurrency control system
characteristic. This can be resolved by giving the runt special priority, so
that he gets his turn. We have a lot of methods for this from multi-tasking
operating systems, but the most common ones consist of assigning a priority
to each task and increasing that priority over time until every task gets to
the highest level.
Well that is the end of my diatribe of the month. Feel like a little mental
exercise?
Puzzle:
This problem showed up in a Deja.com database group. The person submitting
it has a table declared something like this:
CREATE TABLE Foobar
(a INTEGER NOT NULL PRIMARY KEY,
b INTEGER NOT NULL DEFAULT 97,
c INTEGER NOT NULL DEFAULT 98,
d INTEGER NOT NULL DEFAULT 99);
What he wants to do is get data into the table from a host program form and
to convert the NULLs in the input into the DEFAULT values. However, if you
write something like this, you get an error message:
INSERT INTO Foobar
VALUES (:aa,
COALESCE (:bb, DEFAULT),
COALESCE (:cc, DEFAULT),
COALESCE (:dd, DEFAULT));
You get an error message. This can be done with CASE expressions or
procedural code that would look at all of the combinations of NULLs and
values in the four parameters. I am assuming that you have to provide the
key in the :aa parameter. But you also need to know what each default value
is to write the code:
INSERT INTO Foobar
VALUES (:aa,
CASE WHEN :bb IS NULL THEN 97 ELSE :bb END,
CASE WHEN :cc IS NULL THEN 98 ELSE :cc END,
CASE WHEN :dd IS NULL THEN 99 ELSE :dd END);
Can you make this more general, so that if a DEFAULT is changed in the table
declaration, you do not have to change the code in the INSERT INTO statement?
Answer:
First change the table declaration to allow NULLs in the columns.
CREATE TABLE Foobar
(a INTEGER NOT NULL PRIMARY KEY,
b INTEGER DEFAULT 97,
c INTEGER DEFAULT 98,
d INTEGER DEFAULT 99);
Now allow the user to insert the NULLs in the usual manner, but follow the
insertion with an UPDATE like this:
BEGIN ATOMIC
INSERT INTO Foobar VALUES (:aa, :bb, :cc, :dd);
UPDATE Foobar
SET b = CASE WHEN :bb IS NULL THEN DEFAULT ELSE b END,
c = CASE WHEN :cc IS NULL THEN DEFAULT ELSE c END,
d = CASE WHEN :dd IS NULL THEN DEFAULT ELSE d END
WHERE a = :aa;
END;
The ability to set a value to the current DEFAULT in an UPDATE is a
little-known feature of SQL-92.
Joe Celko is an Atlanta-based independent consultant. He is the author of three
books on SQL -- SQL For Smarties (Morgan-Kaufmann, 1995),
SQL Puzzles and Answers (Morgan-Kaufmann, 1995), and
Instant SQL Programming (Wrox Press, 1997) -- and wrote the SQL for Smarties column for DBMS
magazine. You can contact him via email at www.celko.com or 71062.1056@compuserve.com.
Copyright © 2004 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission