Computing Through a Disaster with Assured Availability™ Systems
by Marathon Technologies Corporation
Businesses that provide critical hardware and applications to their customers - in fields such as process automation, air traffic control, and securities trading - need assurance that there will be no downtime in operations, since an outage could lead not only to lost profit but, more seriously, to risk to life and limb. In addition, users of VMS, UNIX, and other traditional high-end operating systems are increasingly demanding the highest levels of system availability and reliability from Windows NT servers. To keep pace with this demand for high availability systems, the enterprise will have to move to Assured Availability™ technologies with ComputeThru™ capabilities that enable applications to continuously compute through all manner of disruptions.
More specifically, disaster recovery and disaster tolerance are a necessity, rather than an option, for an increasing number of businesses. Globalization is driving a dramatic increase in business awareness of the importance of avoiding information system failures. Research points to a burgeoning Web enablement of institutions and organizations that have traditionally been paper- and mail-based. Web enablement is, even as you read this article, accelerating the rate of change in the way organizations work, propelling them into a new on-line, interactive, information-based reality that is full of challenges and opportunities.
In
the future, businesses, their customers, suppliers and competitors will depend
on information access to such an extent that any service outage or problem with
information availability will cause them to switch service providers with the
click of a mouse button. In the Web enabled world any organization without a
highly available information infrastructure runs the risk of falling behind the
competition and potentially being driven out of business. E-Commerce,
telecommuting, the emergence of the virtual corporation, and competition on a
global scale are forcing companies to provide their services on a global 7x24
basis. This in turn is driving the ever-increasing requirement for computer
systems that provide enhanced or continuous availability. This future reality,
which is already taking form, will impose the requirement for disaster or fault
tolerant information systems without which most businesses will not be able to
effectively compete, let alone survive.
A significant increase in the number and variety of customer-facing applications and information sources is accelerating the requirement for disaster and fault tolerant systems. Customer-facing applications are those that a business’s customers interact with on-line, over the phone, or at an ATM. Businesses implement these applications to increase business efficiency, enhance productivity, and, ideally, lower costs, both to the customer and for the business. Through customer-facing applications, customers interact more directly with the corporate information resource than at any other time in the history of modern business. Any service outage experienced by users of this type of application has a dramatic impact in terms of lost revenue, lost transactions, and customer defection to competitors.
The role of risk assessment and analysis in disaster recovery and in fault tolerant information systems is to minimize the risk associated with either a disaster or a fault / system failure. In both cases, the greatest benefit comes from avoiding the disaster or fault altogether. When a business cannot carry on its work because the information system, or part of it, has failed, the impact will be one or more of the following:
1. An inconvenience;
2. A temporary loss of productivity and revenue;
3. A severe impact to the business’s financial health;
4. A threat to public and personnel safety.
When
situations two, three, or four occur, the event is determined to be a disaster
and businesses should have a plan to either recover from it (Disaster Recovery)
or avoid it in the first place (Disaster Tolerance/Fault Tolerance). The choice
of which way to go must be based on a realistic assessment of the potential severity of the consequences of the disaster as compared to the cost in resources and money to avoid or recover from it.
As
organizations of all sizes increasingly depend on information technology, the
value of the data, information, and services delivered via fully automated
computer systems will continue to increase dramatically. Implementing a procedure of regularly backing up data and applications, whether to tape, disk, or a remote site, is one proven way to provide an avenue for recovery from a catastrophic loss or disaster. However, the preferred and most effective solution is to avoid the risk altogether and ensure that the data and the computer will always be available!
Fault
Tolerant Information Systems, when properly implemented, allow applications to
continue processing without impacting the user, application, network, or
operating system in the event of a failure, regardless of the nature of that
system failure. All fault tolerant information systems use redundant components
running simultaneously to check for errors and provide continuous processing in
the event of a component failure. However, to truly meet the requirements of
mission critical applications such as data servers, network servers, and web
servers, fault tolerant information systems must satisfy the following
requirements:
1. The system must uniquely identify any single error or failure;
2. The system must be able to isolate the failure and operate without the failed component;
3. The failed system must be repairable while it continues to perform its intended function;
4. The system must be able to be restored to its original level of redundancy and operational configuration.
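As a minimal sketch of how these four requirements interact, the following hypothetical Python fragment supervises a redundant pair of workers: it detects a replica failure, isolates the failed replica, keeps serving on the survivor, and restores full redundancy with a spare. The Worker and Supervisor classes are illustrative assumptions, not Marathon's implementation.

```python
class Worker:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def compute(self, request):
        # Stand-in for the real workload; raising simulates a component failure.
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return request * 2


class Supervisor:
    """Runs each request on both replicas, compares results to detect errors,
    isolates a failed replica, and restores redundancy from a spare."""

    def __init__(self):
        self.replicas = [Worker("A"), Worker("B")]
        self.spares = [Worker("spare-1")]

    def handle(self, request):
        results, survivors = [], []
        for w in self.replicas:
            try:
                results.append(w.compute(request))   # 1. identify any single failure
                survivors.append(w)
            except RuntimeError:
                pass                                  # 2. isolate the failed component
        self.replicas = survivors                     #    and keep operating without it
        if len(self.replicas) < 2 and self.spares:    # 3. repair (swap in a spare) while
            self.replicas.append(self.spares.pop())   #    the survivor keeps serving
                                                      # 4. full redundancy is restored
        if len(set(results)) > 1:
            raise RuntimeError("replica results disagree")  # error checking by comparison
        return results[0] if results else None


supervisor = Supervisor()
supervisor.replicas[0].healthy = False   # simulate a failure in replica A
print(supervisor.handle(21))             # computes through the failure: prints 42
```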
We
have developed a series of best practices guidelines for implementing true
information system fault tolerance. These practices are based on years of
research, user interviews, and widely accepted concepts of systems and
information theory.
Basic
computer theory tells us that system reliability can be improved by
appropriately employing multiple components (redundancy) to perform the same
function. Redundancy can be applied, and therefore should be considered, in
terms of both time and space. For example, to improve communications through a
noisy phone, one can repeat the message several times until the message gets
through. The message takes more time to get through the noisy phone than it
would take through a clear phone, but it gets there. Alternatively, having two
phones, each carrying the same conversation, provides better reliability than
one phone. If one phone fails, the other phone can still carry the conversation.
The downside to this redundant approach is that it takes either more time to get the message through or an extra (redundant) communication device (phone). No matter how you look at it, reliability requires redundancy, and redundancy expends either time or resources, neither of which is free. Furthermore, redundancy is only the starting point: it provides the basis on which one can build a reliable or continuously available information system. To provide the most complete protection, eight additional and critical steps must be taken.
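To make the time/space distinction concrete, here is a minimal Python sketch under assumed conditions (the send_over function and its 30% failure rate are purely illustrative): one function repeats a message over a single noisy channel, the other sends it over two channels at once.

```python
import random

def send_over(channel, message, failure_rate=0.3):
    """Stand-in for an unreliable channel: returns the message, or None on failure."""
    return message if random.random() > failure_rate else None

def redundancy_in_time(message, attempts=5):
    """Time redundancy: repeat the message over one channel until it gets through."""
    for attempt in range(1, attempts + 1):
        received = send_over("phone-1", message)
        if received is not None:
            return received, attempt          # the extra cost is the retries (time)
    return None, attempts

def redundancy_in_space(message):
    """Space redundancy: carry the same message over two channels at once."""
    copies = [send_over(ch, message) for ch in ("phone-1", "phone-2")]
    # the extra cost is the second channel (resources); either surviving copy will do
    return next((c for c in copies if c is not None), None)

print(redundancy_in_time("hello"))
print(redundancy_in_space("hello"))
```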
Minimizing single points of failure provides the basis for ensuring fault tolerant information systems. To minimize single points of failure, redundancy must be applied, as appropriate, in all aspects of the computing infrastructure. The authors have heard war stories of system managers who were careful to run dual power cords to their computer systems but unfortunately ran them through the same wire channel, creating the opportunity for an unaware service person to accidentally dislodge both cords while servicing the system. Some options for avoiding single points of failure include using alternative power sources and RAID disk subsystems to protect the system from being brought down by the failure of either a disk drive or a power supply. The ultimate application of this principle is to duplicate a complete physical facility at a different geographical location to provide a disaster recovery site.
The
trade-off between availability and cost should be analyzed during the planning
and implementation phases of an information system. For example, it costs more
to run a system from multiple power sources or double up on the amount of disk
used for data storage. A primary consideration is the cost of a highly
available system as compared to a conventional system. While some highly available information systems can cost as much as twice, and some fault tolerant systems as much as six times, the price of a standalone system, these costs are small in comparison to the opportunity cost of a service outage. In general, the direct and indirect cost of system downtime, along with the nature of the application and the end user’s needs, should determine the amount of investment to be made in system availability.
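As a back-of-the-envelope illustration of this trade-off, the short sketch below weighs the hardware premium of a fault tolerant configuration against the expected annual cost of the downtime it avoids; every figure is a hypothetical assumption, not vendor or industry data.

```python
# Hypothetical cost/availability trade-off; all figures below are assumptions.
downtime_cost_per_hour = 50_000                 # lost revenue, transactions, goodwill ($/hr)
standalone_cost        = 20_000                 # conventional standalone server ($)
fault_tolerant_cost    = 3 * standalone_cost    # "two to three times" a conventional system

standalone_downtime_hrs_per_year     = 20.0     # assumed for a conventional server
fault_tolerant_downtime_hrs_per_year = 0.5      # assumed for a fault tolerant server

premium        = fault_tolerant_cost - standalone_cost
downtime_saved = (standalone_downtime_hrs_per_year
                  - fault_tolerant_downtime_hrs_per_year) * downtime_cost_per_hour

print(f"Extra hardware cost:     ${premium:,.0f}")
print(f"Annual downtime avoided: ${downtime_saved:,.0f}")
print(f"Worth the premium? {'yes' if downtime_saved > premium else 'no'}")
```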
An often overlooked factor that affects all parts of the system is capacity planning: the analysis of the performance of the various system components to assure that the necessary performance is delivered to the users. A number of factors need to be addressed during this process, such as network loading, peak and average bandwidth requirements, disk capacity, memory size, and the speed and number of CPUs required. Care must be taken to address the interactions of all system hardware and application software under the expected system load. This is particularly important when considering high availability failover configurations, where the interrelationship of all applications and middleware must be fully understood. Otherwise, the system could fail over the specific user application but fail to bring over all of the necessary supporting middleware, such as the database.
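As a minimal example of this kind of sizing exercise, the sketch below derives peak bandwidth and CPU count from a handful of planning inputs; all of the workload figures and the 1.5x headroom factor are assumptions for illustration, and a failover configuration would additionally require each surviving node to carry the full load.

```python
# Hypothetical capacity planning sketch; the workload figures are assumptions.
peak_users            = 2_000
requests_per_user_sec = 0.5
bytes_per_request     = 8_000
cpu_ms_per_request    = 15
headroom              = 1.5            # margin for growth and failover load

peak_requests_sec   = peak_users * requests_per_user_sec
peak_bandwidth_mbps = peak_requests_sec * bytes_per_request * 8 / 1_000_000
cpu_seconds_needed  = peak_requests_sec * cpu_ms_per_request / 1_000
cpus_required       = int(cpu_seconds_needed * headroom) + 1

print(f"Peak load:      {peak_requests_sec:,.0f} requests/sec")
print(f"Peak bandwidth: {peak_bandwidth_mbps * headroom:,.1f} Mbit/s (with headroom)")
print(f"CPUs required:  {cpus_required}")
```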
Serial paths are composed of multiple steps in which the failure of any single step causes a complete system failure. Serial paths exist in operational procedures as well as in the physical system implementation. Application software is often the most critical serial element because an application software bug cannot be fixed while in operation. The application can be restarted or rebooted, but it cannot be repaired. A well-written application can minimize the opportunity to lose data by employing techniques such as checkpoint and restart, which stores intermediate compute results when passing data from one process to another so that work is not lost to a serial failure.
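A minimal checkpoint-and-restart sketch appears below; the checkpoint file name, the step list, and the process_step function are hypothetical and serve only to illustrate persisting intermediate results so that a restart resumes at the step where the failure occurred rather than from the beginning.

```python
import json, os

CHECKPOINT = "pipeline.checkpoint.json"   # hypothetical checkpoint file

def process_step(step, data):
    """Stand-in for one stage of a multi-step (serial) pipeline."""
    return data + [f"result of {step}"]

def run_pipeline(steps):
    # Restart: resume from the last saved checkpoint if one exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "data": []}

    for i in range(state["next_step"], len(steps)):
        state["data"] = process_step(steps[i], state["data"])
        state["next_step"] = i + 1
        # Checkpoint: persist intermediate results after each step so a crash
        # here loses at most one step of work.
        with open(CHECKPOINT, "w") as f:
            json.dump(state, f)

    # Pipeline finished; remove the checkpoint so the next run starts fresh.
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)
    return state["data"]

print(run_pipeline(["extract", "transform", "load"]))
```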
Selecting
and managing the software used for critical applications are important steps
that must not be overlooked. First, utilities and applications must be stable,
as determined by careful selection and testing. Many IT organizations test new
versions of critical applications in an offline simulated environment or in a
non-critical part of the organization for several months before full deployment
to minimize the probability of crashing a critical application. The software industry has promoted software upgrades as the pathway to computing heaven; however, the rate of release and the complexity of upgrades often exceed the ability of IT managers to fully qualify one upgrade before the next is out. The pressure to upgrade should be resisted; the installation of an unstable application can be more devastating than a physical server meltdown. Likewise, even though management may be pushing to consolidate distributed applications onto fewer servers, such consolidation should be avoided because it can jeopardize the availability of critical applications. For example, the consolidated server can become unstable, or the network may not support the new traffic load. Either of these events can be disastrous.
The
physical aspects of the computing environment must be considered when
establishing a reliable and safe information system environment. The primary
components of the computers and network must be addressed initially. Then,
consider the physical environment of space, temperature, humidity, and power
sources. Most of the time, these factors get attention only when building a new facility and are totally overlooked when making small system changes, installing new systems, or upgrading current systems.
Another key yet often ignored element in managing the physical environment is physical security and access control, which is a basic element of protecting the business’s information assets. Allowing casual access to critical information systems can result in inadvertent or even intentional system outages.
The
processes and procedures used in managing the information infrastructure should
provide maximum system availability with minimum interruption in service in the
event of a failure. This includes access control, backup policies, virus
screening / recovery, staffing, training, and disaster recovery. These
processes and procedures should be documented and updated regularly. They
should also be exercised and revised, if necessary, at least once a year.
Exception procedures are elements of last resort and must be complemented by
proper day-to-day operational processes which ensure the proper allocation of
system resources via application and operating system tuning. In too many cases
processes and procedures are ignored until a crisis. Then it may be too late to
avoid a system outage. Finally, remember that even a well-documented process
has little value if the operators and system managers have not been trained and
updated on a regular basis.
The
overall architecture of the system, including the major functions of each
subsystem and component, must behave as an integrated whole to accomplish the
business goals of the enterprise. The design of any system requires the
application of trade-offs and design decisions to implement the architecture.
The architecture and design decisions should be documented and managed on an
ongoing basis to maintain the system’s architectural and design integrity and
also to provide a means for transferring knowledge to new personnel.
Commercially
available fault tolerant computers have been around since the 1980s.
Historically, they have been characterized as expensive to buy, proprietary in
nature, and complex to manage. Today, fault tolerant systems are not necessarily
proprietary, but they still tend to be the most expensive. For example, fault tolerant systems based on the Unix operating environment are more open and somewhat easier to manage, but they can cost four times as much as a standalone solution. Recently, with the advent of commodity PC servers, the NT operating system, and new hardware and software technologies for high availability clustering and fault tolerance, the paradigm is shifting. It is now possible to purchase an NT based fault tolerant system that costs only two to three times as much as a conventional computer and offers significant savings by avoiding downtime costs.
In this new environment, where fault tolerant solutions are relatively inexpensive and easy to use, there will no longer be any barriers to implementing the solution with the most appropriate level of availability.
The
key to deploying a disaster or fault tolerant information system is to assess
all the risks and then take the most appropriate action(s). In the case of
making your computer applications and data fault tolerant, we recommend IT
managers consider the following:
1. Begin with redundancy in hardware and software,
2. Minimize all single points of failure,
3. Choose the right server availability for the job,
4. Employ thorough capacity planning,
5. Eliminate hardware and software serial paths,
6. Carefully select and manage software,
7. Consider all the physical issues,
8. Apply good processes and procedures,
9. Maintain consistent architecture and design control.
The fundamental guideline is not to be distracted by the cost of implementing the proper solution, but rather to look at all the cost factors, including the cost of lost business and customer goodwill. Together, these guidelines provide a basis for IT managers to determine the most appropriate allocation of resources for the highest level of availability consistent with the mission of the enterprise and the cost of downtime.