In the past three or four years, a segment of the data storage industry has been quietly building a new architecture that has real potential for our larger, busier data warehouses. This new architecture is called the storage area network (SAN). Think of a SAN as a way to take all the disk drives off all your mainframes and servers, concentrate the drives in a single location, and then allow all the servers to read and write to any combination of the drives simultaneously.
If you could concentrate all your storage technology in one location, together with universal access, you could realize some interesting economies of scale. You could also eliminate redundant costs, compared with the conventional processor-controls-its-own-storage architecture that you are probably using with most of your systems.
Let's take a quick look at a typical SAN configuration, which Figure 1 illustrates. As its name implies, a SAN is its own network, almost always based on fiber channel technology. Fiber channel technology is capable of very high bandwidths, matching the ability of high-performance disk drives to transfer data at their highest sustained rates. But unlike computer buses and SCSI chains, fiber channel can be extended across very large campuses. A SAN based on 9-micron fiber optics can extend to a 10km diameter. Keep this thought in mind when I discuss backup and disaster recovery.
SANs normally contain storage devices, servers, and switches. A server can be any of the familiar server types, including online transaction processing (OLTP) servers, data staging servers for your data warehouse back room, and presentation servers for your data warehouse front room. Other possibilities include servers devoted to data administration, servers for specialized functions such as data mining, multimedia servers, conventional file servers, and the "hot response caches" found in Web-centric data warehouses.
Every server that is part of the SAN normally has a fiber channel interface to connect inward to the SAN and a local network interface to connect outward to a conventional local area network (LAN). The SAN switches are capable of connecting every server to every storage device on the SAN, at fiber channel speeds.
At this point, you are probably thinking of a number of advantages that a SAN could bring to a large, busy data warehouse. Here's an attempt to list all the ways a SAN could be interesting to a data warehouse:
High-performance disk access. Above all, a SAN offers very high data transfer rates from disk to server and directly from disk to disk. SANs transfer data at 100MBps, with promises in the near future ranging up to 400MBps. The current speed of 100MBps is comparable to the speed of Gigabit Ethernet but has the immense advantage (compared to a LAN) that every server has intimate access to every storage device. Some people have described the SAN as "SCSI on steroids."
High-performance transfer between applications. A typical data warehouse operation is bottlenecked by two, or possibly three, major data transfer steps. The OLTP system must transfer the primary production data to the staging area (back room) of the data warehouse. Or maybe this first step transfers data to an operational data store (ODS), which we will think of as being in the back room. In either case, a lot of very granular data must be physically copied from one storage device to another. A large retailer could transfer 50 million sales transaction records per day to the staging area. A Regional Bell Operating Company could transfer 200 million call detail records to the staging area each day. And finally, a huge Internet site, such as AOL or Microsoft, could transfer several billion page event records from production Web servers to a staging area each day. The secret is to have both the production servers and the components of the data warehouse back rooms and front rooms all on the same SAN.
A second transfer in the data warehouse must take place after the data goes through all the cleaning steps in the data staging area. In this second step, a "dimension authority" replicates conformed dimensions to many distributed data marts. Because an entire enterprise can use a single SAN, the separate data marts can all be resident on the SAN and can receive the conformed dimensions at high data rates. This possibility raises an interesting, subtle point. The data warehouse can still be a highly distributed affair with separate data marts organized around primary data sources. Having a SAN does not require you to build a monolithic, centralized data warehouse!
A third data transfer step might take place for certain kinds of data warehouse clients, such as data miners, who need to transfer very large "observation sets" from the normal presentation services of the data warehouse into their specialized tools - such as decision tree, neural network, and memory-based reasoning tools. These same specialized end users may also transfer large data sets back into the data warehouse after they have run what-if scenarios or after they have computed behavior scores for all the enterprise's customers.
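To get a feel for what these rates mean for the daily transfers described above, here is a rough back-of-the-envelope sketch in Python. The record counts and the 100MBps and 400MBps rates come from the text; the 100-byte record size, the choice of 3 billion for "several billion" page events, and the function name `transfer_seconds` are assumptions made purely for illustration. The calculation also assumes the link is the only bottleneck and runs at its full rated speed, which a real system will not achieve.

```python
# Back-of-the-envelope estimate of bulk transfer time over a SAN link.
# ASSUMPTIONS (not from the article): 100 bytes per record, 3 billion
# page events standing in for "several billion", and a link that
# sustains its full rated speed with no protocol or disk overhead.

ASSUMED_RECORD_BYTES = 100
BYTES_PER_MB = 1_000_000


def transfer_seconds(records, bytes_per_record, rate_mb_per_s):
    """Return the seconds needed to move `records` rows at the given rate."""
    total_bytes = records * bytes_per_record
    return total_bytes / (rate_mb_per_s * BYTES_PER_MB)


# Daily record volumes mentioned in the text.
workloads = {
    "retailer sales transactions": 50_000_000,
    "RBOC call detail records": 200_000_000,
    "Web page event records": 3_000_000_000,  # assumed value
}

if __name__ == "__main__":
    for rate in (100, 400):  # MBps: current and promised SAN speeds
        print(f"At {rate}MBps:")
        for name, count in workloads.items():
            minutes = transfer_seconds(count, ASSUMED_RECORD_BYTES, rate) / 60
            print(f"  {name}: about {minutes:.1f} minutes")
```

Under these assumptions, the retailer's 50 million records amount to 5GB, which a 100MBps link would move in under a minute. Changing `ASSUMED_RECORD_BYTES` to match your own record layouts gives a more realistic picture.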