Computing Through a Disaster with Assured Availability™ Systems
by Marathon Technologies Corporation
Businesses that provide critical hardware and applications to their customers - in fields such as process automation, air traffic control, and securities trading - need assurance that there will be no downtime in operations, since an outage could lead not only to lost profit but, more seriously, to risk to life and limb. In addition, users of VMS, UNIX, and other traditional high-end operating systems are increasingly demanding the highest levels of system availability and reliability from Windows NT servers. To keep pace with this demand for high availability systems, the enterprise will have to move to Assured Availability™ technologies with ComputeThru™ capabilities that enable applications to continuously compute through all manner of disruptions.
More specifically, disaster recovery and disaster tolerance are a necessity, rather than an option, for an increasing number of businesses. Globalization is driving a dramatic increase in business awareness of the importance of avoiding information system failures. Research points to a burgeoning Web enablement of institutions and organizations that have traditionally been paper- and mail-based. Web enablement is, even as you read this article, accelerating the rate of change in the way organizations work, propelling them into a new on-line, interactive, information-based reality that is full of challenges and opportunities.
In
the future, businesses, their customers, suppliers and competitors will depend
on information access to such an extent that any service outage or problem with
information availability will cause them to switch service providers with the
click of a mouse button. In the Web enabled world any organization without a
highly available information infrastructure runs the risk of falling behind the
competition and potentially being driven out of business. E-Commerce,
telecommuting, the emergence of the virtual corporation, and competition on a
global scale are forcing companies to provide their services on a global 7x24
basis. This in turn is driving the ever-increasing requirement for computer
systems that provide enhanced or continuous availability. This future reality,
which is already taking form, will impose the requirement for disaster or fault
tolerant information systems without which most businesses will not be able to
effectively compete, let alone survive.
A significant increase in the number and variety of customer-facing applications and information sources is accelerating the requirement for disaster and fault tolerant systems. Customer-facing applications are those that a business’s customers interact with on-line, over the phone, or at an ATM. Businesses implement these applications to increase business efficiency, enhance productivity, and, ideally, lower costs, both to the customer and for the business. Through customer-facing applications, customers interact more directly with the corporate information resource than at any other time in the history of modern business. Any service outage experienced by users of this type of application has a dramatic impact in terms of lost revenue, lost transactions, and customer defection to competitors.
The role of risk assessment and analysis in disaster recovery and in fault tolerant information systems is to minimize the risk associated with either a disaster or a fault / system failure. In both cases, the greatest benefit comes from avoiding the disaster or fault altogether. When a business cannot carry on its work because the information system, or part of it, has failed, the impact will be one or more of the following:
1. An inconvenience;
2. A temporary loss of productivity and revenue;
3. A severe impact to the business’s financial health;
4. A threat to public and personnel safety.
When
situations two, three, or four occur, the event is determined to be a disaster
and businesses should have a plan to either recover from it (Disaster Recovery)
or avoid it in the first place (Disaster Tolerance/Fault Tolerance). The choice
of which way to go must be based on a realistic assessment of the potential severity of the consequences of the disaster as compared to the cost in resources and money to avoid or recover from it.
As
organizations of all sizes increasingly depend on information technology, the
value of the data, information, and services delivered via fully automated
computer systems will continue to increase dramatically. Implementing a procedure of regularly backing up data and applications, whether to tape, disk, or a remote site, is one proven way to provide an avenue for recovery from a catastrophic loss or disaster. However, the preferred and most effective solution is to avoid the risk altogether and ensure that the data and the computer will always be available!
Fault
Tolerant Information Systems, when properly implemented, allow applications to
continue processing without impacting the user, application, network, or
operating system in the event of a failure, regardless of the nature of that
system failure. All fault tolerant information systems use redundant components
running simultaneously to check for errors and provide continuous processing in
the event of a component failure. However, to truly meet the requirements of
mission critical applications such as data servers, network servers, and web
servers, fault tolerant information systems must satisfy the following
requirements:
1. The system must uniquely identify any single error or failure;
2. The system must be able to isolate the failure and operate without the failed component;
3. The failed system must be repairable while it continues to perform its intended function;
4. The system must be able to be restored to its original level of redundancy and operational configuration.
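As a minimal sketch of how these four requirements interact, the following hypothetical Python fragment supervises a redundant pair of workers: it detects a replica failure, isolates the failed replica, keeps serving on the survivor, and restores full redundancy with a spare. The Worker and Supervisor classes are illustrative assumptions, not Marathon's implementation.

```python
class Worker:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def compute(self, request):
        # Stand-in for the real workload; raising simulates a component failure.
        if not self.healthy:
            raise RuntimeError(f"{self.name} has failed")
        return request * 2


class Supervisor:
    """Runs each request on both replicas, compares results to detect errors,
    isolates a failed replica, and restores redundancy from a spare."""

    def __init__(self):
        self.replicas = [Worker("A"), Worker("B")]
        self.spares = [Worker("spare-1")]

    def handle(self, request):
        results, survivors = [], []
        for w in self.replicas:
            try:
                results.append(w.compute(request))   # 1. identify any single failure
                survivors.append(w)
            except RuntimeError:
                pass                                  # 2. isolate the failed component
        self.replicas = survivors                     #    and keep operating without it
        if len(self.replicas) < 2 and self.spares:    # 3. repair (swap in a spare) while
            self.replicas.append(self.spares.pop())   #    the survivor keeps serving
                                                      # 4. full redundancy is restored
        if len(set(results)) > 1:
            raise RuntimeError("replica results disagree")  # error checking by comparison
        return results[0] if results else None


supervisor = Supervisor()
supervisor.replicas[0].healthy = False   # simulate a failure in replica A
print(supervisor.handle(21))             # computes through the failure: prints 42
```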
We
have developed a series of best practices guidelines for implementing true
information system fault tolerance. These practices are based on years of
research, user interviews, and widely accepted concepts of systems and
information theory.
Basic
computer theory tells us that system reliability can be improved by
appropriately employing multiple components (redundancy) to perform the same
function. Redundancy can be applied, and therefore should be considered, in
terms of both time and space. For example, to improve communications through a
noisy phone, one can repeat the message several times until the message gets
through. The message takes more time to get through the noisy phone than it
would take through a clear phone, but it gets there. Alternatively, having two
phones, each carrying the same conversation, provides better reliability than
one phone. If one phone fails, the other phone can still carry the conversation.
The downside to this redundant approach is that it takes either more time to get the message through or an extra (redundant) communication device (phone). No matter how you look at it, reliability requires redundancy, and redundancy expends either time or resources, neither of which is free. Furthermore, redundancy is only the starting point: it provides the basis on which one can build a reliable or continuously available information system. To provide the most complete protection, eight additional and critical steps must be taken.
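To make the time/space distinction concrete, here is a minimal Python sketch under assumed conditions (the send_over function and its 30% failure rate are purely illustrative): one function repeats a message over a single noisy channel, the other sends it over two channels at once.

```python
import random

def send_over(channel, message, failure_rate=0.3):
    """Stand-in for an unreliable channel: returns the message, or None on failure."""
    return message if random.random() > failure_rate else None

def redundancy_in_time(message, attempts=5):
    """Time redundancy: repeat the message over one channel until it gets through."""
    for attempt in range(1, attempts + 1):
        received = send_over("phone-1", message)
        if received is not None:
            return received, attempt          # the extra cost is the retries (time)
    return None, attempts

def redundancy_in_space(message):
    """Space redundancy: carry the same message over two channels at once."""
    copies = [send_over(ch, message) for ch in ("phone-1", "phone-2")]
    # the extra cost is the second channel (resources); either surviving copy will do
    return next((c for c in copies if c is not None), None)

print(redundancy_in_time("hello"))
print(redundancy_in_space("hello"))
```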
Minimizing single points of failure provides the basis for ensuring fault tolerant information systems. To minimize single points of failure, redundancy must be applied, as appropriate, in all aspects of the computing infrastructure. The authors have heard war stories of system managers who were careful to run dual power cords to their computer systems but unfortunately ran them through the same wire channel, creating the opportunity for an unaware service person to accidentally dislodge both cords while servicing the system. Some options for avoiding single points of failure include using alternative power sources and RAID disk subsystems to protect the system from being brought down by the failure of either a disk drive or a power supply. The ultimate application of this principle is to duplicate a complete physical facility at a different geographical location to provide a disaster recovery site.
The
trade-off between availability and cost should be analyzed during the planning
and implementation phases of an information system. For example, it costs more
to run a system from multiple power sources or double up on the amount of disk
used for data storage. A primary consideration is the cost of a highly
available system as compared to a conventional system. While some highly available information systems can cost as much as twice, and some fault tolerant systems as much as six times, the price of a standalone system, these costs are small in comparison to the opportunity cost of a service outage. In general, the direct and indirect cost of system downtime, along with the nature of the application and the end user’s needs, should determine the amount of investment to be made in system availability.
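As a back-of-the-envelope illustration of this trade-off, the short sketch below weighs the hardware premium of a fault tolerant configuration against the expected annual cost of the downtime it avoids; every figure is a hypothetical assumption, not vendor or industry data.

```python
# Hypothetical cost/availability trade-off; all figures below are assumptions.
downtime_cost_per_hour = 50_000                 # lost revenue, transactions, goodwill ($/hr)
standalone_cost        = 20_000                 # conventional standalone server ($)
fault_tolerant_cost    = 3 * standalone_cost    # "two to three times" a conventional system

standalone_downtime_hrs_per_year     = 20.0     # assumed for a conventional server
fault_tolerant_downtime_hrs_per_year = 0.5      # assumed for a fault tolerant server

premium        = fault_tolerant_cost - standalone_cost
downtime_saved = (standalone_downtime_hrs_per_year
                  - fault_tolerant_downtime_hrs_per_year) * downtime_cost_per_hour

print(f"Extra hardware cost:     ${premium:,.0f}")
print(f"Annual downtime avoided: ${downtime_saved:,.0f}")
print(f"Worth the premium? {'yes' if downtime_saved > premium else 'no'}")
```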
An often overlooked factor that affects all parts of the system is capacity planning: the analysis of the performance of the various system components to assure that the necessary performance is delivered to the users. A number of factors need to be addressed during this process, such as network loading, peak and average bandwidth requirements, disk capacity, memory size, and the speed and number of CPUs required. Care must be taken to address the interactions of all system hardware and application software under the expected system load. This is particularly important when considering high availability failover configurations, where the interrelationship of all applications and middleware must be fully understood. Otherwise, the system could fail over the specific user application but fail to bring over all of the necessary supporting middleware, such as the database.
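As a minimal example of this kind of sizing exercise, the sketch below derives peak bandwidth and CPU count from a handful of planning inputs; all of the workload figures and the 1.5x headroom factor are assumptions for illustration, and a failover configuration would additionally require each surviving node to carry the full load.

```python
# Hypothetical capacity planning sketch; the workload figures are assumptions.
peak_users            = 2_000
requests_per_user_sec = 0.5
bytes_per_request     = 8_000
cpu_ms_per_request    = 15
headroom              = 1.5            # margin for growth and failover load

peak_requests_sec   = peak_users * requests_per_user_sec
peak_bandwidth_mbps = peak_requests_sec * bytes_per_request * 8 / 1_000_000
cpu_seconds_needed  = peak_requests_sec * cpu_ms_per_request / 1_000
cpus_required       = int(cpu_seconds_needed * headroom) + 1

print(f"Peak load:      {peak_requests_sec:,.0f} requests/sec")
print(f"Peak bandwidth: {peak_bandwidth_mbps * headroom:,.1f} Mbit/s (with headroom)")
print(f"CPUs required:  {cpus_required}")
```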
Serial paths are composed of multiple steps in which the failure of any single step causes a complete system failure. Serial paths exist in operational procedures as well as in the physical system implementation. Application software is often the most critical serial element because an application software bug cannot be fixed while in operation. The application can be restarted or rebooted, but it cannot be repaired. A well-written application can minimize the opportunity to lose data by employing techniques such as checkpoint and restart, which stores intermediate compute results when passing data from one process to another so that work is not lost to a serial failure.
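A minimal checkpoint-and-restart sketch appears below; the checkpoint file name, the step list, and the process_step function are hypothetical and serve only to illustrate persisting intermediate results so that a restart resumes at the step where the failure occurred rather than from the beginning.

```python
import json, os

CHECKPOINT = "pipeline.checkpoint.json"   # hypothetical checkpoint file

def process_step(step, data):
    """Stand-in for one stage of a multi-step (serial) pipeline."""
    return data + [f"result of {step}"]

def run_pipeline(steps):
    # Restart: resume from the last saved checkpoint if one exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            state = json.load(f)
    else:
        state = {"next_step": 0, "data": []}

    for i in range(state["next_step"], len(steps)):
        state["data"] = process_step(steps[i], state["data"])
        state["next_step"] = i + 1
        # Checkpoint: persist intermediate results after each step so a crash
        # here loses at most one step of work.
        with open(CHECKPOINT, "w") as f:
            json.dump(state, f)

    # Pipeline finished; remove the checkpoint so the next run starts fresh.
    if os.path.exists(CHECKPOINT):
        os.remove(CHECKPOINT)
    return state["data"]

print(run_pipeline(["extract", "transform", "load"]))
```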
Selecting
and managing the software used for critical applications are important steps
that must not be overlooked. First, utilities and applications must be stable,
as determined by careful selection and testing. Many IT organizations test new
versions of critical applications in an offline simulated environment or in a
non-critical part of the organization for several months before full deployment
to minimize the probability of crashing a critical application. The software industry has promoted software upgrades as the pathway to computing heaven; however, the rate of release and the complexity of upgrades often exceed the ability of IT managers to fully qualify one upgrade before the next is out. The pressure to upgrade should be resisted; the installation of an unstable application can be more devastating than a physical server meltdown. Likewise, even though management may be pushing to consolidate distributed applications onto fewer servers, such consolidation should be avoided because it can jeopardize the availability of critical applications. For example, the consolidated server can become unstable, or the network may not support the new traffic load. Either of these events can be disastrous.
The
physical aspects of the computing environment must be considered when
establishing a reliable and safe information system environment. The primary
components of the computers and network must be addressed initially. Then,
consider the physical environment of space, temperature, humidity, and power
sources. Most of the time, these factors get attention only when building a new facility and are totally overlooked when making small system changes, installing new systems, or upgrading current systems.
Another key yet often ignored element in managing the physical environment is physical security and access control, which is a basic element of protecting the business’s information assets. Allowing casual access to critical information systems can result in inadvertent or even intentional system outages.
The
processes and procedures used in managing the information infrastructure should
provide maximum system availability with minimum interruption in service in the
event of a failure. This includes access control, backup policies, virus
screening / recovery, staffing, training, and disaster recovery. These
processes and procedures should be documented and updated regularly. They
should also be exercised and revised, if necessary, at least once a year.
Exception procedures are elements of last resort and must be complemented by
proper day-to-day operational processes which ensure the proper allocation of
system resources via application and operating system tuning. In too many cases
processes and procedures are ignored until a crisis. Then it may be too late to
avoid a system outage. Finally, remember that even a well-documented process
has little value if the operators and system managers have not been trained and
updated on a regular basis.
The
overall architecture of the system, including the major functions of each
subsystem and component, must behave as an integrated whole to accomplish the
business goals of the enterprise. The design of any system requires the
application of trade-offs and design decisions to implement the architecture.
The architecture and design decisions should be documented and managed on an
ongoing basis to maintain the system’s architectural and design integrity and
also to provide a means for transferring knowledge to new personnel.
Commercially
available fault tolerant computers have been around since the 1980s.
Historically, they have been characterized as expensive to buy, proprietary in
nature, and complex to manage. Today, fault tolerant systems are not necessarily
proprietary, but they still tend to be the most expensive. For example, fault tolerant systems based on the Unix operating environment are more open and somewhat easier to manage, but they can cost four times as much as a standalone solution. Recently, with the advent of commodity PC servers, the NT operating system, and new hardware and software technologies for high availability clustering and fault tolerance, the paradigm is shifting. It is now possible to purchase an NT based fault tolerant system that costs only two to three times as much as a conventional computer and offers significant savings by avoiding downtime costs.
In this new environment, where fault tolerant solutions are relatively inexpensive and easy to use, there will no longer be any barriers to implementing the solution with the most appropriate level of availability.
The
key to deploying a disaster or fault tolerant information system is to assess
all the risks and then take the most appropriate action(s). In the case of
making your computer applications and data fault tolerant, we recommend IT
managers consider the following:
1. Begin with redundancy in hardware and software,
2. Minimize all single points of failure,
3. Choose the right server availability for the job,
4. Employ thorough capacity planning,
5. Eliminate hardware and software serial paths,
6. Carefully select and manage software,
7. Consider all the physical issues,
8. Apply good processes and procedures,
9. Maintain consistent architecture and design control.
The fundamental guideline is not to be distracted by the cost of implementing the proper solution, but rather to look at all the cost factors, including the cost of lost business and customer goodwill. Together, these guidelines provide a basis for IT managers to determine the most appropriate allocation of resources for the highest level of availability consistent with the mission of the enterprise and the cost of downtime.