VERITAS Software Corporation
1600 Plymouth Street
Mountain View, CA 94043
Modern Storage Systems consist of many components, including File Systems, Volume Managers, Device Drivers, Backup/Recovery tools, and Hardware Subsystems that include complex software components to implement RAID and other advanced Storage Management functions. The components are usually provided by a variety of different vendors, and each component typically comes with its own management interface, which may or may not expose its functional behavior to the System Administrator. In addition, System Administrators cannot readily obtain insight into the I/O access patterns of applications running on the systems under their control. These factors make managing Storage Systems for optimal availability, cost, and performance a nearly impossible task for ordinary (or even extraordinary) humans.
This paper discusses products and technologies that make it possible to manage storage for Oracle databases for maximum availability and performance, while limiting the effort and knowledge required by System Administrators.
Since the early days of computing, storage devices have been viewed as peripherals to computer systems, in the same manner as workstations were considered peripherals before the evolution of the Local Area Network and client-server computing.
Until recently, interconnection technology has restricted the distance over which storage devices can be attached, and the number of hosts and devices that can be connected. This has caused this “peripheral view” of storage to persist, despite the fact that companies now invest more in storage resources than in computing resources.
With the emergence of Fibre Channel and other new storage interconnection technologies, we are now at the beginning of an era in which storage systems will become central elements of Information Systems. We refer to this as “Storage Centric” or “Information Centric” computing.
Storage Management comprises all activities required for configuring and maintaining the storage resources used for electronically storing information. The most important goals of Storage Management are:
The ability to keep vast amounts of data accessible on a continuous basis, and with optimal performance, is increasingly critical to companies for meeting their business objectives. In conjunction with the shift towards Information Centric computing, this is causing Storage Management to evolve rapidly from a little-known activity in the bowels of IS organizations to a critical element of the IS strategy of companies around the world.
With much of the business critical data managed by Oracle, there is a need and opportunity for Storage Management solutions that are optimized for Oracle databases.
Before discussing Storage Management technologies and products, we will review the objectives IS organizations have in performing Storage Management tasks:
As stated earlier, the most important goal of Storage Management is to protect data against loss that may result from hardware failures, human errors, or software errors, and to minimize loss of access to data. Both planned and unplanned unavailability need to be minimized, which translates into the requirement to minimize backup windows and recovery times, and to perform backup as well as other storage management tasks (such as re-configuration) on-line.
This implies the need to optimize database I/O policies for a given application and storage configuration, as well as the need to configure storage in a manner that is optimal for the I/O access patterns generated by the application.
Some analysts estimate that the cost of managing storage is as much as 8 times higher than the cost of the storage devices themselves. Limiting Storage Management costs, despite the rapid increase in the volume of information that is stored electronically and the need to keep more of this information available on a continuous basis, is therefore an important objective.
The Storage System of a (networked) Computing Environment is comprised of a variety of different types of components, as shown in figure 1.
Figure 1: Overview of Storage Management Components
Add to this that most enterprises deploy heterogeneous system configurations, and it will be clear that performing Storage Management reliably and affordably has become a significant challenge for many IS organizations.
To support IS organizations in addressing these challenges, vendors have embarked on a number of initiatives. These include the development of new products and technologies in cooperation with other vendors to integrate Storage Management products through common API’s, management tools, and interfaces. The resulting “End-To-End Storage Management” solutions aim to increase data availability, improve performance, simplify and automate management tasks, and reduce management costs. An overview of initiatives that are important for Oracle database environments is given below:
File Systems are an attractive storage solution for databases from a manageability point of view. However, the UNIX file system has traditionally been considered too slow and unreliable to host mission critical databases. The new file systems offer the manageability advantages of conventional file systems without compromising reliability or performance. Extensive tests have been performed with Oracle databases to ensure their correct behavior.
To automate and simplify the manner in which products work together, API’s are being defined through which products from different vendors can share relevant information. This makes it possible for these products to transparently optimize their behaviors, and simplify Storage Management tasks by hiding complexities from the System Administrator.
The goal is to support these capabilities on a variety of configurations: regular uniprocessor (UP) and SMP systems, parallel cluster configurations, and configurations deploying intelligent Storage Subsystems.
Creating these “End-To-End Storage Management” solutions requires vendors of Hardware Storage Systems, Database Management Systems, and Storage Management software products to cooperate in the definition and implementation of API's between their products, and in the validation and testing of products and solutions based on these API's. Oracle and its partners share the vision that superior, easy-to-manage solutions can be produced by cooperating in this manner, and are working together on the definition and implementation of such solutions. The remainder of this paper gives a more detailed overview of the initiatives mentioned above.
File Systems are an attractive storage solution for databases from a point of view of manageability. However, traditionally many databases on UNIX have been implemented using the raw device interface, because file systems were considered too slow and unreliable to host mission critical databases. The main reasons for this are the long recovery times after system failures, use of buffered I/O, and the fact that only a single process can write to a file at any given time.
The new file systems use an internal log to ensure that all metadata changes are atomic. As a result, they do not need to perform a full fsck and recover very quickly after system failures. To address the performance drawbacks traditionally associated with file systems, file systems such as the one offered by VERITAS provide a “Quick I/O” interface. By providing unbuffered and asynchronous I/O, this interface offers characteristics similar to the raw device, with a comparable degree of read/write parallelism and CPU overhead.
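To illustrate the idea of unbuffered, asynchronous I/O (not the Quick I/O interface itself, which has its own API), the following minimal C sketch uses POSIX AIO together with the Linux O_DIRECT flag as stand-ins; the block size and the file name are assumptions.

```c
/*
 * Illustrative only: unbuffered, asynchronous reads using POSIX AIO and
 * O_DIRECT as stand-ins for the kind of I/O path Quick I/O provides.
 * This is not the Quick I/O API itself.
 */
#define _GNU_SOURCE
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 8192   /* assumed database block size */

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
        return 1;
    }

    /* O_DIRECT bypasses the file system buffer cache (unbuffered I/O). */
    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    /* Direct I/O requires suitably aligned buffers. */
    void *buf;
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0) return 1;

    /* Issue the read asynchronously; the caller is free to do other work. */
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = BLOCK_SIZE;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

    /* ... overlap other processing here ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);            /* wait for completion */

    ssize_t n = aio_return(&cb);
    printf("read %zd bytes without buffering\n", n);

    free(buf);
    close(fd);
    return 0;
}
```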
As Computer Systems evolve towards support of 64-bit address spaces, some Operating Systems do not yet make it possible for Oracle to take advantage of system memory configurations larger than 4GB. To make it possible to achieve better performance on such large systems, file systems also support a “Cached Quick I/O” capability. In this mode, read operations are served through the file system cache. This often reduces the number of physical I/O operations and thus improves read performance. For write operations, Cached Quick I/O functions as the “standard” Quick I/O facility to guarantee data integrity. For on-line transaction processing on Solaris 2.5 and 2.6, Cached Quick I/O achieves better-than-raw-device database throughput on large memory configurations. Cached Quick I/O also helps sequential table scans, due to the read-ahead algorithm used in the VERITAS File System, resulting in reduced query response times.
VERITAS has worked closely with Oracle to execute validation and performance tests of Oracle databases on VxFS. This has helped VxFS evolve into a storage solution for Oracle databases that offers significant manageability advantages without compromising reliability, and with performance that is equivalent to, or up to 100% higher than, the raw device interface.
Figure 2 gives an overview of OLTP transaction rates for Oracle 8.0.3 on Solaris 2.6, and VxFS 3.3 with Quick I/O and Cached Quick I/O, as compared to the raw device. Tests were run on a Sun Microsystems Ultra Enterprise 10000 with fifteen processors, 6GB of memory, and 37 disk drives (4 per controller). A 200 warehouse TPC-C derived OLTP benchmark was used. All the VxFS file system tunable parameters, including the mkfs parameters, were left at their defaults.
Combined with the Continuous Availability features discussed in the next section, this makes VxFS a compelling storage solution for Oracle databases. Table 1 gives an overview of the advantages of the new file systems over the use of Raw Partitions or traditional UFS.
Figure 2: OLTP throughput using raw device, ufs, and vxfs
| Issue | UFS (Unix File System) | Oracle Raw Partitions | VERITAS File System |
| --- | --- | --- | --- |
| Data Integrity | Short window for data corruption. (-) | No window for data corruption. (+) | No window for data corruption. (+) |
| Access efficiency | Requires 250,000 calls to the kernel to read a 2GB file. (-) | No calls to the kernel. (+) | Requires 1 call to the kernel to read a 2GB file. (+) |
| Caching | Double caching wastes memory and CPU resources. (-) | No double caching. (+) | Only caches read operations for >4GB memory configurations. (+) |
| Write Performance | A single CPU can write to a file at a time. (0) | Multiple CPUs can write to a file at a time. (+) | Multiple CPUs can write to a file at a time. (+) |
| Backup | Easy. (+) | Difficult; harder to manage and restore. (-) | Easy. (+) |
| Manageability | Easy: grow the FS after unmounting. (+) | Difficult: must dump and restore. (-) | Easy: grow or shrink the FS while mounted. (+) |
| Protection against user error | Good. (+) | Poor: a partition actively used by Oracle can be formatted. (-) | Good. (+) |

Table 1: Comparison of UFS, Raw Partitions, and VxFS as a storage solution for Oracle
As stated earlier, the most important goals of Storage Management are to maximize data availability, in the face of hardware failures, human errors and software errors, and to minimize the impact of data protection measures on application availability and performance. This requires:
It should be possible to perform any storage management task without the need to take applications off-line, and without the need to have Application Servers perform I/O or compute intensive tasks.
Volume Managers and File Systems have long provided the possibility to perform all management operations on-line. This includes the ability to grow or shrink logical volumes and file systems, to reorganize file systems, and to improve performance by moving data or changing storage layouts.
More recently, innovation at Storage Solution providers has focused on making significant improvements in the areas of backup & recovery and on-line Decision Support. The emphasis of these efforts is on the development of new technologies as well as on integration between Storage Management products. Specific facilities that are available now, or will be available in the foreseeable future, are:
Several techniques have been developed that make it possible to provide these capabilities. They have in common that they make it possible to create a “snapshot” image of a file system or database. A snapshot is a consistent image of the data as it was at a specific point in time (the moment the snapshot was created). Snapshots can be accessed for read, and in some cases write access, while applications are running and modifying the original “live” data. This allows them to be used for on-line backup, Decision Support, and other applications. The most common snapshot techniques are:
The simplest, but most expensive, snapshot technology uses mirroring (RAID-1). A snapshot image is created by “breaking off” a mirror. This requires an amount of disk space equal to the total size of the file system or database. This technique is supported by some advanced Hardware Storage Subsystems.
In this case a stable image is maintained by saving the pre-images of changed blocks in a separate storage area. The first time a write changes a particular data block, the old data is first read and copied to the snapshot area before the new data is written. A subsequent read request for that block in the snapshot is satisfied by reading the data from the snapshot area, rather than from the “live” file system or database. Subsequent writes to the block on the live database do not result in additional copies to the snapshot area, since the old data only needs to be saved once. Read requests for blocks in the snapshot database that have not changed are satisfied from the “live” database. The advantage of copy-on-write snapshots is that they minimize the number of disk accesses and the storage space required for maintaining the snapshot, and they work for database files as well as flat files. The challenge is to implement them with minimal performance degradation.
Copy-on-write snapshots can be implemented in Hardware Storage Subsystems, file systems or driver level products. File Systems now provide a copy-on-write snapshot (referred to as “Storage Checkpoint”) that can also be combined with incremental backup solutions (see below). Through the use of coalesced write operations, logging and other optimization techniques, the performance overhead associated with maintaining Storage Checkpoints is very small. Multiple Storage Checkpoints can exist concurrently, representing images of the file system or database at different points in time. Creating a Storage Checkpoint is a fast operation, which is typically completed in a few seconds. The file system and database must be in a consistent state while the Checkpoint is created. In the case of Oracle, this can be accomplished by using the “Archive Log Mode”. In this mode, a consistent image of the database is available of which a snapshot can be taken without any database down time.
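The following minimal in-memory sketch illustrates the copy-on-write mechanism described above. The block count, block size, and data are illustrative assumptions; a real implementation such as a Storage Checkpoint operates on disk blocks and adds logging and write coalescing.

```c
/*
 * Minimal in-memory sketch of a copy-on-write snapshot. Illustrative only;
 * not a description of any particular product's on-disk format.
 */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 8
#define BLKSZ   16

static char live[NBLOCKS][BLKSZ];        /* the "live" file system/database   */
static char preimage[NBLOCKS][BLKSZ];    /* snapshot area holding old blocks  */
static int  saved[NBLOCKS];              /* 1 if pre-image already copied     */

/* Write to the live data; save the old block to the snapshot area first,
 * but only the first time the block changes after the snapshot was taken. */
static void live_write(int blk, const char *data)
{
    if (!saved[blk]) {
        memcpy(preimage[blk], live[blk], BLKSZ);
        saved[blk] = 1;
    }
    strncpy(live[blk], data, BLKSZ - 1);
}

/* Read the snapshot: changed blocks come from the snapshot area,
 * unchanged blocks are read from the live data. */
static const char *snap_read(int blk)
{
    return saved[blk] ? preimage[blk] : live[blk];
}

int main(void)
{
    strncpy(live[0], "balance=100", BLKSZ - 1);
    strncpy(live[1], "balance=200", BLKSZ - 1);
    /* snapshot taken here (all 'saved' flags are 0) */

    live_write(0, "balance=150");         /* first change: pre-image saved    */
    live_write(0, "balance=175");         /* later changes: nothing extra     */

    printf("live block 0: %s\n", live[0]);        /* balance=175 */
    printf("snap block 0: %s\n", snap_read(0));   /* balance=100 */
    printf("snap block 1: %s\n", snap_read(1));   /* balance=200 */
    return 0;
}
```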
We will now explain the use of the Snapshot technology for improving data availability.
The techniques described above make it possible to create a consistent, point-in-time view of a database, and can therefore be used to perform a backup while the application is running and updating the “live” database. As described above, creating a consistent snapshot of an Oracle database can be accomplished without database down time by using Oracle's Archive Log Mode. Standard backup utilities, Decision Support, and other applications can then process the snapshot. In this manner it is possible to perform a cold database backup while the database is actually online (“hot”)!
To support on-line backup of Oracle databases, backup products have been integrated with the snapshot facilities offered by EMC in its Symmetrix Storage Subsystem and with the file system. These products transparently handle the operations required to establish and discard the snapshots. They also support the Oracle RMAN on-line backup capabilities, giving users the opportunity to select the most appropriate solution for their application environment.
Although online backup eliminates the time a database has to be taken offline for backup, it still requires all of the data in the database or table space(s) that are backed up to be copied. This will often have a significant impact on application performance during the time the backup takes place. To minimize this impact, it is desirable to perform an “incremental backup”, copying only the data that has changed since the previous backup was made. This will often reduce the amount of data to be copied by one to two orders of magnitude. When incremental backup is done on-line, the impact of backup on application availability and performance is dramatically reduced.
Below we will discuss the different types of incremental backup solutions that are available or will be available in the foreseeable future, and their suitability for various application environments.
File systems maintain a timestamp for each file, which indicates when the file was last changed. This makes file level incremental backup relatively simple to implement, and most backup products, including NetBackup, support it.
File level incremental backup works well for backing up “traditional” file server environments. In these environments, the average file size is small and updates to a file are accomplished by truncating the file and re-writing it in its entirety. In database environments, the average file size is large, and the unit of change is a fixed size block rather than an entire file. This makes the technique unsuitable for those environments.
Statistical “differencing” algorithms make it possible to determine small changes in files, such as the insertion of a small number of characters. This technique, possibly in combination with compression, can reduce the amount of data to be backed up significantly. However, it requires the reading and processing of all changed files, and does not scale well in large (or even medium) database environments.
Block level incremental backup is targeted at the typical database environment, where data is written at the block level. There are fundamentally two approaches for block level incremental backup:
Like the byte level differencing described above, this technique also reads all data. However, it can use a simple method to determine which blocks have changed. Since databases use a small number of large files, it requires almost the entire database to be read. As a result, it does not scale for larger databases, and does not result in a significant reduction in backup times. The incremental backup facility offered by Oracle’s Recovery Manager (RMAN), uses this technique.
This technique keeps track of which blocks have changed at run-time, as write operations are performed. At backup time it is only necessary to read the changed blocks.
Some file systems implement this technique, incurring a very small runtime overhead that is independent of the size of the database. This solution results in a reduction in backup time that is proportional to the percentage of data blocks that have changed, and is scalable to very large databases. The block size can be set to match the block size used by Oracle.
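A hedged sketch of run-time change tracking is given below; it assumes a simple one-bit-per-block map, which is one possible way to implement the idea, not a description of the VERITAS implementation.

```c
/*
 * Sketch of run-time block change tracking for block level incremental
 * backup, assuming a simple per-block bitmap. Illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 1024              /* assumed number of database blocks */

static uint8_t changed[NBLOCKS / 8];   /* one bit per block */

/* Called on every write: mark the block dirty. The overhead is constant
 * and independent of the size of the database. */
static void note_write(unsigned blk)
{
    changed[blk / 8] |= (uint8_t)(1u << (blk % 8));
}

/* At backup time, only the marked blocks need to be read and copied. */
static unsigned backup_changed_blocks(void)
{
    unsigned copied = 0;
    for (unsigned blk = 0; blk < NBLOCKS; blk++) {
        if (changed[blk / 8] & (1u << (blk % 8))) {
            /* read block 'blk' and write it to the backup medium ... */
            copied++;
            changed[blk / 8] &= (uint8_t)~(1u << (blk % 8)); /* reset */
        }
    }
    return copied;
}

int main(void)
{
    note_write(7);
    note_write(42);
    note_write(42);                 /* repeated writes cost nothing extra */
    printf("blocks to back up: %u of %u\n", backup_changed_blocks(), NBLOCKS);
    return 0;
}
```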
Table 2 shows the results of a comparative test between Oracle RMAN incremental backup and VERITAS Block Level Incremental Backup, performed by Jeffrey Carter at Boeing. It clearly shows the significant reduction in backup times that can be achieved with block-level run-time differencing solutions such as the one provided by the VERITAS File System. By using disk as the backup medium for the usually small amount of changed data, backup times can be further reduced. Also important to note is the significant reduction in restore time that is achieved.
Table 2: Backup and Restore Elapsed Times (MIN:SEC)

| Method | Full | Incremental_1 | Incremental_2 | Restore |
| --- | --- | --- | --- | --- |
| VxFS BLI (Tape) | 138:24 | 5:01 | 5:02 | 16:11 |
| RMAN (Tape) | 103:33 | 15:10 | 15:24 | 34:09 |
| VxFS BLI (Disk) | 31:01 | 1:42 | 1:59 | 1:04 |
| RMAN (Disk) | 25:00 | 12:17 | 11:59 | 4:48 |
The tests were performed on a Sun UE3000, using Oracle 8.0.4 and VxFS 3.3. The total database file size allocated was 5.7 GB, with 3.4 GB of actual data. An 8-table schema with TPC-D data was used for the test. The approximate table sizes, in records, are as follows:
| Table | Records |
| --- | --- |
| Customer | 300,000 |
| Line item | 12,000,000 |
| Nation | 25 |
| Orders | 3,000,000 |
| Parts | 400,000 |
| Partsupp | 1,600,000 |
| Region | 5 |
| Supplier | 20,000 |
The test consisted of performing one full and two incremental backups. Prior to the first incremental backup, 12,000 records in the CUSTOMER table were updated. Prior to the second incremental backup, 80,000 records were updated in the PARTS table. Elapsed times for performing the backups were logged. A restore and recovery was executed on both databases after the second incremental backup. Both disk and tape backups were performed for this comparison. VERITAS NetBackup was used for writing to tape, and only RMAN for writing to disk.
Figure 3 shows the results of a series of tests executed by VERITAS that illustrate how VxFS Block Level Incremental Backup times correlate with the percentage of data blocks that have changed.
Figure 3: Elapsed times for backup using VxFS based Incremental Backup
Track level incremental backup is similar to block level incremental backup in the sense that changes are tracked at runtime. However, the unit of data is larger than in case of block level incremental backup, which means that more data has to be backed up. Some Intelligent Storage Subsystems are expected to implement this technique.
Backup products will support a variety of incremental backup solutions for Oracle databases, transparently handling the complexities of the database operations and snapshot management required. At this time, they support the Oracle RMAN incremental backup facility as well as a file system Storage Checkpoint based incremental backup solution.
Although incremental backup can significantly reduce backup time and overhead, a recent full image of a database is still required to be able to restore a database in the shortest possible time in case of a major disaster. By making it possible to integrate one or more incremental backups with an existing “full” image of a file system, it is possible to always have a recent “full” image of the data available, without ever making a full backup! Creation of such a “synthetic” full backup can be done on a secondary server, without impacting application performance. Vendors will offer a “synthetic” full backup capability in a future release of their products.
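The following sketch illustrates how a synthetic full image could be assembled by overlaying incrementals on an existing full image; the data layout, block counts, and contents are hypothetical, and real backup products store this information in their own catalog formats.

```c
/*
 * Sketch of "synthetic" full backup construction on a secondary server:
 * the newest copy of every block is taken from the most recent incremental
 * that contains it, falling back to the existing full image. Illustrative
 * data layout only.
 */
#include <stdio.h>
#include <string.h>

#define NBLOCKS 4
#define BLKSZ   8

struct image {
    int  present[NBLOCKS];        /* which blocks this image contains */
    char data[NBLOCKS][BLKSZ];
};

/* Overlay an incremental on top of the (synthetic) full image. */
static void merge(struct image *full, const struct image *incr)
{
    for (int b = 0; b < NBLOCKS; b++) {
        if (incr->present[b]) {
            memcpy(full->data[b], incr->data[b], BLKSZ);
            full->present[b] = 1;
        }
    }
}

int main(void)
{
    struct image full  = { {1, 1, 1, 1}, { "A0", "B0", "C0", "D0" } };
    struct image incr1 = { {0, 1, 0, 0}, { "",   "B1", "",   ""   } };
    struct image incr2 = { {0, 0, 1, 0}, { "",   "",   "C2", ""   } };

    merge(&full, &incr1);          /* apply the oldest incremental first */
    merge(&full, &incr2);

    for (int b = 0; b < NBLOCKS; b++)
        printf("block %d: %s\n", b, full.data[b]);   /* A0 B1 C2 D0 */
    return 0;
}
```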
On-line incremental backup results in dramatic reductions in the impact of backup on application availability and performance. However, the Application Server still needs to copy the data from disk to the backup medium or network connection. Off-host backup techniques make it possible for another system to perform the backup. The Application Server only needs to be involved in quiescing the file system or database before the backup can start.
Off-host backup can be accomplished in a number of ways, as summarized below:
Vendors will support a variety of off-host backup solutions.
Vendors also offer software replication products to create copies of data that can be backed up on a secondary host, and many are actively working with several suppliers of Intelligent Storage Subsystems to support peer-to-peer copy facilities. In the future, clustered versions of file systems will offer yet another solution for performing off-host backup, using commodity hardware systems.
With the techniques described until now, it is possible to almost completely “close the backup window”. By making the backup process less intrusive, it is possible to take backups more frequently. This will also have a positive impact on recovery time, since the time required to restore the data lost since the backup (through log replay) will be shorter. However, the improvement in recovery time is less dramatic than the improvement in backup time and system overhead.
It is possible to reduce recovery times for data loss that is the result of software or human error (“logical data corruption”), by using copy-on-write snapshot techniques as discussed earlier. Copy-on-write snapshots maintain a complete and consistent image of the data as it was at the time the snapshot was created, and can therefore provide the ability to “rollback” changes. This makes it possible to recover from logical data corruption without the need to restore from a backup!
The benefit of a file system based copy-on-write snapshot facility is that roll back can happen for the entire database or for individual files (typically table spaces in the database).
By combining the use of a copy-on-write snapshot solution with RAID to protect against physical device failures, lengthy recovery from tape backup is needed only in extremely rare circumstances.
The standard UNIX or NT file system interfaces limit the extent to which databases can minimize the risk of data corruption through human error, and achieve optimal performance.
Many vendors are working with Oracle and Hardware Subsystem vendors to define and implement API’s that aim to make it possible for Oracle to achieve the best possible performance and availability, while at the same time simplifying administration. Below, we will briefly discuss some areas that could be covered in such interfaces:
In present-day computer system configurations, Oracle does not have access to information about the geometry of the storage configuration underlying the database, and hence cannot adapt its I/O policies for optimal performance.
API’s between Hardware Storage Subsystems, volume managers, file systems, and Oracle can make it possible to pass information about optimal I/O size and alignment from the Hardware Storage Subsystem or (for SW managed arrays) the volume manager to Oracle. This will allow Oracle to adapt its I/O policy accordingly, for example ensuring that writes to a RAID-5 storage configuration would happen as full-stripe writes.
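As an illustration, the sketch below rounds a write request to full-stripe boundaries for an assumed RAID-5 geometry; the stripe unit and column count are examples, and in practice these values would be obtained through the kind of API discussed here.

```c
/*
 * Illustration of using geometry information: given an assumed RAID-5
 * stripe geometry, compute a write extent that covers whole stripes and
 * avoids the read-modify-write penalty. Illustrative values only.
 */
#include <stdio.h>

#define STRIPE_UNIT  (64 * 1024)   /* bytes per column, assumed              */
#define DATA_COLUMNS 4             /* data disks in the RAID-5 set, assumed  */
#define FULL_STRIPE  (STRIPE_UNIT * DATA_COLUMNS)

/* Align an offset down and a length up to full-stripe boundaries. */
static void full_stripe_extent(long long off, long long len,
                               long long *aligned_off, long long *aligned_len)
{
    *aligned_off = (off / FULL_STRIPE) * FULL_STRIPE;
    long long end = ((off + len + FULL_STRIPE - 1) / FULL_STRIPE) * FULL_STRIPE;
    *aligned_len = end - *aligned_off;
}

int main(void)
{
    long long off, len;
    full_stripe_extent(300000, 100000, &off, &len);
    printf("write [%lld, %lld) as full stripes of %d bytes\n",
           off, off + len, FULL_STRIPE);
    return 0;
}
```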
In the same manner as Oracle has no access to information about the geometry of the storage configuration, Hardware Storage Subsystems cannot obtain information about the I/O access patterns to be expected. If available, such information could be used to optimize use of the subsystem's resources.
API’s can be defined that make it possible for Oracle to provide file systems, volume managers, and Hardware subsystems with information about expected I/O access patterns, allowing these subsystems to optimize caching and logging policies. Examples of information to be passed through such API’s are sequential I/O hints and identification of data that will no longer be accessed (and can hence be discarded from logs and/or caches).
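For illustration, the standard posix_fadvise() interface conveys the same kinds of hints (sequential access, data that will not be re-read). It is shown here only as an analogy to the API's under discussion, not as the proposed interface itself.

```c
/*
 * Illustrative only: POSIX posix_fadvise() as an analogy for the access
 * pattern hints discussed in the text.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Hint: the file will be scanned sequentially (e.g. a full table scan),
     * so aggressive read-ahead is worthwhile. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* ... perform the scan ... */

    /* Hint: the data will not be accessed again and may be dropped from
     * the cache, freeing memory for other workloads. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return 0;
}
```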
Oracle's preferred I/O policies cannot be mapped onto standard UNIX or NT File System API's in an optimal manner. By implementing Oracle-specific facilities, it is possible to increase performance by reducing the number of system calls and context switches required, and by increasing parallelism.
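As an example of the kind of interface that reduces system calls and increases parallelism, the sketch below batches several asynchronous writes in a single POSIX lio_listio() call; this is a standard facility shown for illustration, not the Oracle-specific interface under discussion.

```c
/*
 * Illustrative only: submitting several asynchronous writes with one
 * lio_listio() call, cutting the number of system calls per I/O.
 */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NIO   4
#define BLKSZ 8192

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <datafile>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[NIO][BLKSZ];
    struct aiocb cbs[NIO];
    struct aiocb *list[NIO];

    for (int i = 0; i < NIO; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        memset(buf[i], 'A' + i, BLKSZ);
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = buf[i];
        cbs[i].aio_nbytes     = BLKSZ;
        cbs[i].aio_offset     = (off_t)i * BLKSZ;
        cbs[i].aio_lio_opcode = LIO_WRITE;
        list[i] = &cbs[i];
    }

    /* One system call submits all four writes and waits for completion. */
    if (lio_listio(LIO_WAIT, list, NIO, NULL) != 0) {
        perror("lio_listio");
        return 1;
    }
    printf("submitted %d writes in a single call\n", NIO);
    close(fd);
    return 0;
}
```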
Existing file system interfaces make it necessary for the administrator to “manually” perform certain file and table space management tasks, and expose databases to corruption as a result of human error. This is caused by the fact that certain file and table space management functions cannot be performed by Oracle directly. Also, it is impossible to protect critical files, which should be accessed and managed exclusively by Oracle, against undesired access by standard file system utilities. Through proper API's, Oracle could perform these functions completely under its own control, and access by standard file system utilities to certain critical files could be prohibited. This will significantly improve manageability and reduce the chance of data corruption as a result of human error.
Recovery of mirrored storage configurations following a failure of the system managing the configuration requires that data that may not have been written to all mirrors be read from one mirror and written to the others. By making use of Oracle's change logs, it is possible to minimize the amount of data that has to be copied, and therefore to speed up recovery and minimize the impact of the recovery process on application performance.
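The sketch below illustrates the idea: only regions recorded as possibly out of sync (here a simple in-memory array, an assumption made for illustration) are re-copied between mirrors, instead of the entire volume.

```c
/*
 * Sketch of log-driven mirror recovery: regions that were being written
 * when the managing system failed (e.g. as derived from change logs) are
 * the only ones resynchronized. Region size and counts are assumptions.
 */
#include <stdio.h>

#define REGION_SIZE (256 * 1024)      /* bytes per tracked region, assumed */
#define NREGIONS    4096

static int possibly_dirty[NREGIONS];  /* set at run time for in-flight writes */

static unsigned long long resync(void)
{
    unsigned long long copied = 0;
    for (int r = 0; r < NREGIONS; r++) {
        if (possibly_dirty[r]) {
            /* read region r from one mirror and write it to the others ... */
            copied += REGION_SIZE;
            possibly_dirty[r] = 0;
        }
    }
    return copied;
}

int main(void)
{
    possibly_dirty[10] = possibly_dirty[11] = 1;   /* two in-flight regions */
    printf("resynchronized %llu bytes instead of %llu\n",
           resync(), (unsigned long long)NREGIONS * REGION_SIZE);
    return 0;
}
```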
To achieve optimal performance and availability for database applications, a System Administrator has to perform the following tasks:
These tasks are collectively referred to as “Storage Resource Management” or “SRM”. There are several classes of SRM products to support the Administrator:
Products that are in the market today are still of a generic nature and limited in scope. It is expected that in the future we will see products that can be configured with knowledge of specific Storage Management products and databases. This could include knowledge of Oracle-specific storage requirements and I/O access patterns, as well as knowledge of the characteristics of specific Hardware Storage Subsystems or other products. By adding product-specific knowledge, it will be possible to provide more automation of management tasks and improved ease-of-use for administrators.
As discussed earlier in this document, the Storage System consists of a large number of components, supplied by a variety of vendors. We have shown how availability and performance can be enhanced by a combination of new technologies, integration between components from different vendors, and specialization for Oracle.
To further improve manageability, it is beneficial to combine multiple components into Product Suites that provide a complete “End-to-End” storage management solution for Oracle. In addition to combining multiple (Oracle optimized) products into a single package, such Product Suites should be extensively tested and benchmarked against representative Oracle configurations, and documented and installable as a single package.
In the future, the number of components in such Product Suites will increase to offer more of the capabilities described in this document, and to cover a wider range of configurations. Future extensions are expected to include:
The main objective of Storage Management is to maximize data availability and performance, while limiting the knowledge and effort required for performing the management tasks. End-to-end Storage Management solutions that integrate file systems, volume managers, Device Drivers, Backup/Recovery tools, and Hardware Subsystems, and that are specialized for Oracle databases, aim to offer an effective answer to this challenge. The discussion in this document is intended to show the value that these solutions already bring today, as well as the promise of the significantly greater intelligence that will be incorporated into storage management solutions in the future.