Guide to the TechWeb Network

Intelligent Enterprise

Better Insight for Business Decisions

Intelligent Enterprise - Better Insight for Business Decisions
search Intelligent Enterprise
Advanced Search
RSS
Webcasts
Whitepapers
Subscribe
Home



Tapes and Records


It's All Just Terminology, Right?

In January, we got a late Christmas present when Inprise (nee Borland) decided to get very serious about the Linux database market by open-sourcing the beta version of InterBase 6. It will also be available on other platforms, including Linux, Windows NT, and Solaris.

Dale Fuller, the interim CEO and president of Inprise Corp., made the announcement and immediately received praise from Eric S. Raymond, president of the Open Source Initiative, who said "As open-source releases of operating systems and critical middleware become the norm, Inprise's release will undoubtedly raise the quality bar and customers will reap huge benefits in reliability, security, and total cost of ownership."

While this sounds like a typical PR release, I have always liked Borland's software. Unlike Microsoft, Borland products work well in the first release. Does everyone remember how clean and easy Turbo Pascal was? How nice it is to work with Delphi?

The beta version of InterBase 6 is scheduled to be available during the first part of the year 2000. If you need a stable, maintainable embedded database with a small footprint, you need to look at InterBase. The company also announced it will continue to sell and support InterBase 5.6 through normal distribution channels. You can get the details on the Borland Web site at www.borland.com

This leads to another topic, since I have not done a technical column on the basics in awhile -- data and concurrency control. Let's go back to the Dark Ages, when I was young, wore a Nehru Jacket, and looked good in it (eat your heart out, Austin Powers). People processed data largely with magnetic tape systems and punch cards. Punch cards are such a bizarre concept for kids today that I will not bother to explain them, but thanks to video and audio cassettes, they understand tapes.

Tape systems have certain properties that are due to their physical nature. We took those physical properties and made them into logical properties in the file system. The first property is that the files are very closely associated with their physical storage. You did not put more than one file on a tape and did not like to put a single file on multiple tapes if you could help it, which was a reasonable restriction for obvious reasons. If you wrote past the end of your file, you would have overwritten the start of someone else's file. How many times have you done that with a video tape?

Right now, I am waging a one-man war to get people in Internet chat groups to refer to a "column" in a table instead of using the word "field" because they are different things. In the old file systems and 3GL programming languages, a statement such as "READ a, b, c FROM My file;" will not produce the same data as "READ c, a, b FROM My_file;" because the application program, and not the file system, defined the fields. The file system would count off so many bytes or bits for the datatype of the field it was reading or look for a field separator, and then stuff whatever value it found into the local variable. Fields have an important order in the record.

Compare this to SQL, where "SELECT a, b, c FROM My_Table;" will produce the same data as "SELECT c, a, b FROM My_Table;" because the database holds the datatypes and names of the columns in a table. The data in the file can be almost anything, even if it is dead wrong. People building data warehouses from legacy data sources are discovering just how bad old data can be. My favorite horror story so far is a part number of "I hate my job" that had been in the system for decades, unnoticed.

A column in a table can have very elaborate constraints on it, so one is much more sure of its correctness. A good database excludes bad data, and does not have to depend on application programs for this function.

A "record" is not a "row" for pretty much the same reasons. On a tape, records are in sequence. There is a first, next and last record. A table has no order. Unfortunately, many SQL products have "vendor extensions" that allow the programmer to reference a row by its physical location on a hard disk or its position within the physical storage the product uses, which forces design restrictions.

For example, if I have a ROWID "pseudo-column" like Oracle, I have forced the product to store each row of a table in contiguous physical storage or to have an elaborate set of pointer chains from which it can construct a single row on demand from non-contiguous storage. I cannot dynamically relocate the data under the covers during a session, nor can I play with a single column separately from the other columns in its table. Also, distributed databases are much more complex than they need to be. In short, these products have demolished set-oriented processing in favor of "row-at-a-time" processing because under the covers and in the minds of their original creators, they are really 1960s file systems. We used to cry in our beers about this type of design flaw after ANSI X3H2 meetings.

And likewise, "tables" are not "files" because files are totally separate from one another and have no enforceable relationships between them. That is why a part number like "I hate my job" in the Orders file was not verified with the Inventory file. In a tape system, you hung tapes on tape drives in a particular order and on particular drive units to do a job. The payroll system may require one subset of the accounting department files, the retirement program would require another, slightly different subset, and so forth.

In SQL, you connect to the database as a whole unit, or not at all. I cannot place an Order for a part that is not in Inventory if I have used Declarative Referential Integrity (DRI) between the tables. Tables are not disjoint, separate things, but parts of a real system.

In a tape system, read-write rings were physical loops of plastic that you put into a groove on the back side of a reel of magnetic tape to write- protect it. If you improperly inserted or removed the ring, there was nothing the application program or file system could do to correct the error. The database Data Control Language (DCL) determines what you can do to what tables and represents logical vs. physical access control.

A single application program mounted a file on a tape drive, opened the file and read from it without sharing access or data with another program. If we were to try to do this with a database, it would mean that only one user at a time could be in the database. This is not as bad as it sounds at first. In certain situations, such as physical reorganization or restoration, you do want to have only programs in the database.

However, multiple users share a database, and everyone sees the same data. One of the major problems with file systems was that the system kept the same data element was kept in multiple files, so that in one file John Doe was married and in another he was single, proving that "a man with two watches will never know what time it is," to quote a fortune cookie.

But how do you manage concurrent users? There are three basic schemes: (1) Pessimistic, (2) Optimistic, and (3) Logical.

The Pessimistic family is the most common approach, and involves using locks of some sort. Each vendor will have a slightly different scheme of locks, but the essential idea is that no one other than the user who holds the lock can access certain physical or logical units of data. In other words, expect conflicts. Because locking is common, it is the best-understood method.

The Optimistic family is the next most common approach and it is the one that InterBase uses (remember the lead paragraph of this column?). It involves timestamps, or generations of data. Again, there will be slightly different schemes, but a situation with two users trying to modify the same data is rare and so you must handle it as an exception instead of an expectation.

This method actually goes back to the days of microfilm archives. You would go to central records and get a photocopy of a microfilmed file, and take it back to your desk to work on it. The central records people would timestamp the copy. If someone else got a copy of the same data, the system would check to see if there were any conflicts and execute some conflict resolution procedure. One user might have priority over another, the last change might take effect, both users might be asked to try again, or some other rule could be used.

Logical concurrency is still a laboratory idea and I do not know if anyone has built a commercial product with it yet. The essential idea is to let many users as possible into the database so long as they do not run into each other. The way that the concurrency system determines who will not run into each other is by looking at the predicates. For example, if one user is executing this statement:

 UPDATE Foobar
SET a = 5
WHERE b >= 0;

it can run at the same time as

UPDATE Foobar
SET a = 10 
WHERE b < 0;

because the set of rows they affect do not overlap.

A deadlock occurs when two users need the same resources to complete their tasks. Let's call the resources A and B, with user number one holding A while user number two holds B. They sit facing each other, neither one willing to give up the resource they hold. This can happen in a Pessimistic because of the locks. A conflict, but not a deadlock, can occur in an Optimistic system, but a Logical concurrency control system is completely free of this problem. You can resolve the Optimistic conflict by aborting one or both of the user's session(s).

A livelock occurs when a user cannot get to a resource because a series of other users keep excluding him. This has been a running gag for slapstick comedy since the beginning of time; the runt of the litter is pushed away from his mother by his siblings and never gets to suckle. This can happen in all three systems, but it is essentially a Logical concurrency control system characteristic. This can be resolved by giving the runt special priority, so that he gets his turn. We have a lot of methods for this from multi-tasking operating systems, but the most common ones consist of assigning a priority to each task and increasing that priority over time until every task gets to the highest level.

Well that is the end of my diatribe of the month. Feel like a little mental exercise?

Puzzle:

This problem showed up in a Deja.com database group. The person submitting

it has a table declared something like this:

CREATE TABLE Foobar
(a INTEGER NOT NULL PRIMARY KEY, 
b INTEGER NOT NULL DEFAULT 97, 
c INTEGER NOT NULL DEFAULT 98, 
d INTEGER NOT NULL DEFAULT 99);

What he wants to do is get data into the table from a host program form and to convert the NULLs in the input into the DEFAULT values. However, if you write something like this, you get an error message:

 INSERT INTO Foobar
VALUES (:aa, 
COALESCE (:bb, DEFAULT), 
COALESCE (:cc, DEFAULT),
COALESCE (:dd, DEFAULT)); 

You get an error message. This can be done with CASE expressions or procedural code that would look at all of the combinations of NULLs and values in the four parameters. I am assuming that you have to provide the key in the :aa parameter. But you also need to know what each default value is to write the code:

INSERT INTO Foobar
VALUES (:aa, 
CASE WHEN :bb IS NULL THEN 97 ELSE :bb END,
CASE WHEN :cc IS NULL THEN 98 ELSE :cc END,
CASE WHEN :dd IS NULL THEN 99 ELSE :dd END);

Can you make this more general, so that if a DEFAULT is changed in the table declaration, you do not have to change the code in the INSERT INTO statement?

Answer:

First change the table declaration to allow NULLs in the columns.

CREATE TABLE Foobar
(a INTEGER NOT NULL PRIMARY KEY, 
b INTEGER DEFAULT 97, 
c INTEGER DEFAULT 98, 
d INTEGER DEFAULT 99);

Now allow the user to insert the NULLs in the usual manner, but follow the insertion with an UPDATE like this:

BEGIN ATOMIC
INSERT INTO Foobar VALUES (:aa, :bb, :cc, :dd); 
UPDATE Foobar
SET b = CASE WHEN :bb IS NULL THEN DEFAULT ELSE b END,
c = CASE WHEN :cc IS NULL THEN DEFAULT ELSE c END,
d = CASE WHEN :dd IS NULL THEN DEFAULT ELSE d END
WHERE a = :aa;
END;

The ability to set a value to the current DEFAULT in an UPDATE is a little-known feature of SQL-92.




Joe Celko is an Atlanta-based independent consultant. He is the author of three books on SQL -- SQL For Smarties (Morgan-Kaufmann, 1995), SQL Puzzles and Answers (Morgan-Kaufmann, 1995), and Instant SQL Programming (Wrox Press, 1997) -- and wrote the SQL for Smarties column for DBMS magazine. You can contact him via email at www.celko.com or 71062.1056@compuserve.com.
 

Copyright © 2004 CMP Media Inc. ALL RIGHTS RESERVED
No Reproduction without permission





IE Weekly Newsletter
Subscribe to the newsletter
    Email Address







InformationWeek Business Technology Network
InformationWeekInformationWeek 500InformationWeek 500 ConferenceInformationWeek AnalyticsInformationWeek CIO
InformationWeek EventsInformationWeek ReportsInformationWeek MagazinebMightyByte and SwitchDark Reading
Digital LibraryIntelligent EnterpriseInternet EvolutionNetwork ComputingNo Jitter
space
Techweb Events Network
InteropVoiceConWeb 2.0 ExpoWeb 2.0 SummitEnterprise 2.0 ConferenceMobile Business ExpoSoftware ConferenceCSI - Computer Security Institute
Black HatGTECEnergy CampMashup CampStartup Camp
space
Light Reading Communications Network
Light ReadingLight Reading EuropeUnstrungLight Reading's Cable Digital NewsConstantinopleInternet Evolution
Heavy ReadingLight Reading Live!Light Reading InsiderEthernet ExpoOptical ExpoTeleco TVTower Technology Summit
space
Financial Technology Network
Advanced TradingBank Systems & TechnologyInsurance & TechnologyWall Street & TechnologyAccelerating Wall StreetBank Systems & Technology Executive SummitBuyside Trading SummitInsurance & Technology Executive Summit
space
Microsoft Technology Network
MSDN MagazineTechNetThe Architecture Journal
space