The "5-Nines" Data Availability Initiative: Make It Reality

Steve Lemme

PLATINUM technology, inc.

Abstract

With information technology and planning, organizations are now able to achieve five minutes of yearly downtime, or 99.999 percent availability of their enterprise IT systems. This "5-Nines" initiative professed by Hewlett Packard and other vendors is setting industry standards, but what does it take in terms of technical capability to achieve it?

In this presentation, the speaker will help attendees answer these questions and more. Attendees will learn proactive steps they can take to foster maximum data availability, including how to effectively monitor and maintain enterprise systems. The presentation will offer techniques to help attendees predict and prevent unplanned downtime; provide faster response to potentially critical problems; reduce risk; maximize an organization's IT investment; identify what causes system failures; determine how to increase availability while decreasing outages and downtime; cover the three high availability categories: fault tolerant, highly available, and continuously available; define architecture and methodology required to support service levels; and apply proactive prediction principles to prevent system failures.

Introduction

What is 5-Nines and what does it involve? Many companies are being exposed to this moniker and are trying to understand and evaluate the merits of the initiative. But isn't 5-Nines just another term for High Availability? Yes, but 5-Nines involves more than just hardware High Availability. Let's look at what is driving the need for 5-Nines.

In today's competitive business landscape, 7x24 operation is becoming the standard, especially in areas driven by E-Commerce, Intranets, and Web enablement of data. The unavailability of applications, systems, or networks can mean a significant loss of revenue to an organization. Industry experts and analysts agree that in order to support E-Commerce applications, a typical network must provide a minimum of 99.99 percent availability. This level of availability requires careful planning and a comprehensive, end-to-end strategy, with Business Availability as the goal.

The High Availability Environment

Avoiding downtime is not a new practice. The availability requirements facing companies today stem from a fiercely competitive, fast-moving, technology-driven marketplace that is ever more dependent on computing.

High Availability cannot be achieved merely by implementing a service level or by purchasing a hardware solution. Availability is a combination of environment, process, software, development strategies, and computing hardware, together with an investment of dollars and human capital, all made to minimize the time that the application and system are not available. It also involves daily process and methodology to support the environment in meeting business requirements.

What Defines an Outage?

Downtime

Downtime is the amount of time during which the application is not accessible to the user. Outages may be caused by environmental factors (loss of electricity, fires, floods, earthquakes), hardware failures, application or software failures, or human error. Hardware failures have traditionally been the major cause of system failures. However, in distributed environments, software failures and user error both account for a growing percentage of total system failures. There are two types of outages: planned and unplanned.

Planned Downtime

Planned downtime is usually scheduled for off-peak processing times, such as off-hours or holidays. It is time set aside for maintenance activities such as database reorganizations, adding disk storage, performing off-line backups, and installing patches, upgrades, or new application or operating system software. Because today’s applications run in a global 7x24x365 full-time environment, IT organizations are eliminating planned downtime.

Unplanned Downtime

Unplanned downtime is associated with unexpected events such as sudden network, hardware, and software failures. A typical distributed application usually consists of a Web browser front-end and an application reliant upon servers, networks, and databases. A problem with any one of them will cause the application to become unavailable. Consequently, all components need to be treated equally and monitored, as any one piece could create downtime.
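
To illustrate why every component in the chain matters, here is a minimal Python sketch. It assumes that component failures are independent, in which case the availabilities of components that must all be up multiply together. The component names and the 99.9 percent figures are hypothetical, chosen only for illustration.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the fractions multiply.
    """
    return prod(component_availabilities)


# Hypothetical figures for illustration only (not measurements).
chain = {
    "web_server": 0.999,
    "network": 0.999,
    "app_server": 0.999,
    "database": 0.999,
}

end_to_end = serial_availability(chain.values())
print(f"End-to-end availability: {end_to_end:.4%}")
```

Under these assumptions the end-to-end figure is always lower than that of the weakest single component, which is why no one piece can be left unmonitored.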

Downtime from a percentage perspective

Looking at percentages, a 97 percent availability rating means that you incur approximately 263 hours of downtime per year, or about 11 days, whereas 99 percent availability means about 88 hours, or roughly 3.7 days. The majority of companies today would go out of business at 97 percent availability. With E-Commerce, anything less than 99.999 percent availability can pose a serious detriment.

Availability (%)	Downtime per Year
99	3.7 Days
99.5	44 Hours
99.9	8.8 Hours
99.95	4.4 Hours
99.99	53 Minutes
99.999	5.3 Minutes
99.9999	32 Seconds
99.99999	3.2 Seconds
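
The figures in the table can be checked with a few lines of arithmetic. The following Python sketch assumes a 365-day year of 8,760 hours; the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def yearly_downtime_hours(availability_percent):
    """Hours of downtime per year allowed at a given availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99, 99.9, 99.99, 99.999):
    hours = yearly_downtime_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Each additional nine divides the allowable downtime by ten.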

Uptime

Uptime is the opposite of downtime and is usually used as a measurement of availability. The term refers to the time, usually measured over a week, during which the application can be accessed by its users. An outage occurs when a loss of required service or a performance impact lasts long enough to create an issue for the user. Although downtime may be planned, if it is not properly communicated it is still considered an outage by the user. The user does not care about the source, only that they have been impacted and how soon the problem will be resolved.

What Defines Availability?

Availability is usually measured on an annual basis. Availability refers to the percentage of time that the application is available to the user. By standard usage, the term availability does not account for planned downtime. Today, the definition is broader, covering mission-critical applications including distributed applications, network, email, scheduling services, and other business solutions that companies rely on for day-to-day operations.

High Availability

A system or application designed to prevent a total loss of service by reducing or managing failures. Preventing Single Points of Failure (SPOF) through component redundancy is a way of providing hardware High Availability. Common hardware SPOFs are: CPU, disks, host adapters, network adapters, hubs, routers, power, cooling. The major goal of highly available systems is to provide a higher level of availability than standard systems.

Fault Tolerance

More expensive than Highly Available, a fault tolerant system contains multiple hardware components that function concurrently, duplicating all of the computation and I/O. This type of system protects against hardware failures by incorporating redundant hardware components in a single system. Keep in mind, a fault tolerant system can still fail. System or application software failures can easily cause application non-availability. Fault tolerant systems cost as much as five times more than non-fault tolerant solutions.

Continuous Availability

The expectation in these systems is to provide Continuous Availability, which equates to non-stop service, no planned or unplanned outages. Hardware and software failures may occur; however, the intent is to insulate the user from the failure and to reduce the time needed to recover from that failure down to only several minutes or less. In a Continuously Available environment patches and upgrades can be performed with no impact to the user.
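
The effect of recovery time on availability can be sketched with the standard steady-state relation: availability equals mean time between failures (MTBF) divided by the sum of MTBF and mean time to repair (MTTR). The terms MTBF and MTTR, and the once-a-year failure rate used below, are assumptions introduced for illustration and do not come from any measured system.

```python
def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time in service: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system that fails about once a year.
MTBF_HOURS = 8760.0

for recovery_minutes in (240, 60, 5):
    a = steady_state_availability(MTBF_HOURS, recovery_minutes / 60)
    print(f"recover in {recovery_minutes:>3} min -> {a:.5%} available")
```

With the failure rate held fixed, only the shortest recovery time in this example clears the 99.999 percent mark, which is why continuously available designs concentrate on reducing recovery to minutes.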

Preparing for 5-Nines

With more and more companies dependent on computer technology to run their business, applications, databases, and network systems have become mission critical. Loss of availability, or outages that affect customers, translates into lost opportunity or revenue for a company. System uptime and availability result from a combination of process, software, and computing hardware designed to minimize unavailability of the applications to the business.

Business Requirements Define the Solution

An understanding of a company's strategic business objectives is key to identifying practical, scalable enterprise solutions. Without an understanding of the business and how the applications are being utilized, the IT organization is caught in a never-ending downward spiral. An available, well-defined, and properly configured scalable platform is one that has enough capacity to handle the application’s estimated or measured workload, with no bottlenecks in the hardware.

Scalability of the system, network, databases, and applications is key to availability. Scalability includes the amount of data accessible to the application, the number of concurrent users supported, the number of transactions that can be processed in a given unit of time, and the breadth of functionality that the application encompasses.

In gathering information for availability, important decision drivers are:

· Total raw data footprint

· Query complexity

· Expected response time

· Number of concurrent users

· Backup and recovery

· Security

· Integration with other applications or systems

· Skillsets

Achieving a goal of dependability and availability requires effort at all phases of development. Steps must be taken at design time, implementation time, and execution time, and during maintenance utilizing risk reduction and avoidance techniques.

A 5-Nines implementation has its foundation in these principles:

· Mitigation of risks

· Resiliency

· Redundancy

· Inclusivity

· Serviceability

· Manageability

· Methods and skills

Mitigation of Risks

Analysis, Design, and Architecture

Because companies are so dependent upon their data and systems, proper analysis and design are crucial in the selection of reliable and scalable servers. To jump-start the analysis process, research can be obtained from sources such as industry standard benchmarks, reviews, and networking with peers in other companies, and by conducting benchmarks in-house or at vendor competency centers.

A properly architected solution will provide a quality, high performing available system with capacity for growth. Poor planning, architecture, or operational support can foster poor performance, impeded functionality, high cost of ownership, complex administration, lack of capacity and scalability, poor reliability and availability.

A properly architected solution relies on capturing the business requirements, designing to meet those requirements, and providing a path for the future. A common pitfall is employing "Bleeding Edge" technologies in the expectation of meeting or overcompensating for existing requirements. Technologies should be employed only after the business requirements have been gathered and understood. Criteria important to consider in distributed system architectures are:

· Processor technology

· Storage and I/O subsystems

· Ability to support change

· Ability to support growth

· Well defined usage and capacity planning

· Elimination of data redundancy

· Elimination of process redundancy

· Price/performance

· Implementation and integration planning

· Administration automation

· Server, Client-Server, N-Tier, and Web-based architecture

Eliminating the risks that derive from failures requires an architecture coupled with monitoring and processes that prevent outages. Prevention depends on real-time and historical monitoring, trending, and rules and models that predict hardware and software failures so that they can be prevented. Common risk mitigation techniques are:

· Fault Avoidance - Use of processes and tools to minimize the introduction of faults

· Fault Minimization - In spite of efforts to prevent or eliminate them, there will be faults in any system. Proper risk assessment and fault minimization can help ensure uptime

· Fault Recognition and Removal - Monitoring and recognition can proactively locate faults and assist with remediation of their root cause
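
As one illustration of the trending and prediction described above, the following Python sketch fits a straight line to historical disk usage readings and estimates how many days remain before a file system fills. The linear growth model, the function name, and the sample readings are all assumptions made for the example; real workloads may call for a different model.

```python
def days_until_full(samples, capacity):
    """Estimate days until `capacity` is reached from (day, used) samples.

    Fits a least-squares straight line through the samples.
    Returns None if usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = max(x for x, _ in samples)
    # Day on which the fitted line reaches capacity, relative to the last sample.
    return (capacity - intercept) / slope - last_day


# Hypothetical daily readings: (day number, gigabytes used).
history = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
remaining = days_until_full(history, capacity=100.0)
print(f"Estimated days until full: {remaining:.1f}")
```

A monitoring rule could raise an alert when the estimate drops below the lead time needed to add storage, turning an unplanned outage into a scheduled change.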

Approaches to Maintaining Uptime and Availability

Data Protection

Backup and recovery are among the most critical tasks that must be performed regularly as part of system administration. A lot of effort usually goes into getting working backups after a system is installed, but far less goes into recovery: a majority of organizations never test or even assemble a recovery plan. Most often this is realized only after data has been lost, the backup tapes have turned out to be blank or overwritten, or a disaster such as a flood has occurred. Any data critical to the business should be protected. Backups are the easiest way to protect data. However, it is almost certain that data is generated between backups. Safeguards through hardware, replication, or software should be used to bridge protection between backup periods.
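
A simple automated check can catch the blank or overwritten backup before it is needed. The Python sketch below is a minimal example that assumes the backup is a plain file copy that can be read back and compared with its source; tape or incremental backups would need the verification features of the backup product itself. File and function names are hypothetical.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_is_valid(source, backup):
    """True if the backup exists, is non-empty, and matches the source."""
    if not os.path.isfile(backup) or os.path.getsize(backup) == 0:
        return False
    return sha256_of(source) == sha256_of(backup)


# Demonstration with throwaway files: one good copy and one blank copy.
with tempfile.TemporaryDirectory() as workdir:
    src = os.path.join(workdir, "data.db")
    good = os.path.join(workdir, "data.bak")
    blank = os.path.join(workdir, "blank.bak")
    for path, payload in ((src, b"orders"), (good, b"orders"), (blank, b"")):
        with open(path, "wb") as handle:
            handle.write(payload)
    results = (backup_is_valid(src, good), backup_is_valid(src, blank))

print(results)
```

A check like this verifies only that the copy is intact; it does not replace a periodic test restore.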

Disaster Recovery

The ability to recover from a natural disaster such as a fire, flood, earthquake, or tornado is equally important. Results of these disasters usually include physical damage or complete loss of systems, data, and workplace. Recovery time is directly related to how much up-front planning was done and which procedures were established to restore the business locally or at a Hotsite. Thus the impact of a disaster and its cost to the business must be weighed against the cost of preventing it.

Application Protection & Recoverability

Web servers are an excellent example of why application recoverability is a critical issue: Most companies with E-Commerce servers cannot afford the business impact caused by downtime.

Careful consideration should be given to the design and usage of an application in a High Availability situation, with the primary goal to insulate the user from outages or failovers. Methods include: client reconnects to an alternate server if a connection is lost, error handling automation, restartable and recoverable transactions.
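
The client-reconnect method listed above can be sketched as follows in Python. This is a minimal illustration, assuming the application speaks TCP and that an ordered list of equivalent servers is known to the client; the function name and the host names in the usage comment are hypothetical.

```python
import socket


def connect_with_failover(servers, timeout=2.0):
    """Return a connected socket to the first reachable (host, port) pair.

    Servers are tried in order; OSError is raised only if all of them fail.
    """
    last_error = None
    for host, port in servers:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next server
    raise OSError(f"no server reachable: {last_error}")


# Example usage (hypothetical host names):
# conn = connect_with_failover([("primary.example.com", 5000),
#                               ("standby.example.com", 5000)])
```

Calling the same function again after a connection drops gives the reconnect behavior; making in-flight transactions restartable is a separate concern that the application itself must handle.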

Network Management

The network has become so ubiquitous to the computing environment that it is normally taken for granted. Proper network architecture, planning, maintenance, and monitoring are just as important as any of the other system components or applications. Redundancy, switching, as well as capacities should be considered.

System Management Monitoring and Measurement

One of the most critical factors in availability is system management. Systems management starts before the software is even ordered. Most often, planning and selection of the architecture, procedures, and system management processes are overlooked. A vast majority of installations occur on an existing platform because it has space available. Then, once the application is in production, performance problems, administration problems, and bottlenecks appear. Most often, these issues are viewed as the System Administrator's responsibility. To address these issues, systems must be properly planned, architected, and refined through a set of methods and processes before "slapping in" a set of monitoring tools and expecting them to take care of all deficiencies.

True system management involves monitoring, measuring, altering, and reporting on the levels of availability, performance and service levels. System management tools can also provide real-time business and operational visualization for the many operational components. Systems management usually begins with measurements, baselining and extrapolation of uptime and performance metrics.

Automation of Processes

Human error is a leading cause of downtime. Any effort to reduce human interaction, and with it human error, reduces the risk. Consequently, anything that can be automated should be, and tools that perform, control, or monitor the automation should be employed to eliminate downtime. Areas in which automation, policies, and procedures can reduce the risk are:

· Backups and Recovery

· Upgrades

· Operations and Administration

· Maintenance

· Usage

· Performance

· Capacity Planning

· Security and Control

· Testing, Upgrading, and Implementation

Reporting / Service Level Agreements

Service Level Agreements should be derived from the business requirements that list availability goals, response time, planned downtime periods, and specific performance requirements. An SLA typically specifies user response times and expectations for key business applications, networks, and servers. In addition, the SLA provides a valuable baseline that the MIS department and system managers can use to help justify additional resources for improving availability.
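
A service level report can be reduced to a comparison between measured downtime and the downtime budget implied by the availability target. The Python sketch below is one possible shape for such a report; the record layout, the field names, and the choice to exclude agreed planned downtime from the calculation are assumptions for illustration, since each SLA defines its own terms.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float
    planned: bool  # True if it fell inside an agreed maintenance window


def sla_report(outages, period_minutes, target_percent):
    """Compare measured availability for a period with an SLA target."""
    unplanned = sum(o.minutes for o in outages if not o.planned)
    measured = 100.0 * (period_minutes - unplanned) / period_minutes
    budget = (1 - target_percent / 100) * period_minutes
    return {
        "measured_percent": measured,
        "downtime_budget_minutes": budget,
        "budget_used_minutes": unplanned,
        "target_met": unplanned <= budget,
    }


# Hypothetical 30-day month with one unplanned and one planned outage.
month = 30 * 24 * 60
report = sla_report([Outage(20, False), Outage(120, True)], month, 99.9)
print(report)
```

Tracking the budget used, rather than only a pass or fail result, gives early warning when a period is on course to miss its target.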

Resiliency

The capability of a system to prevent degradation or failure as well as the damage or loss from a failure or malfunction. Resiliency includes quality, design, and stability.

Redundancy

Redundancy means providing two or more of each critical component: CPUs, network cards, electrical transmission, power supplies, disk drives, switches, routers, cooling systems, or other equipment used to support operations. Redundancy is not limited to hardware, as it can also include mirrored applications, setup, and configuration files.
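
The benefit of redundancy can be estimated with a short calculation. The Python sketch below assumes identical units that fail independently and a failover that is instantaneous and always succeeds; real systems fall short of both assumptions, so the result is an upper bound. The 99 percent unit availability is a hypothetical figure.

```python
def redundant_availability(unit_availability, copies):
    """Availability of `copies` identical units when any one can carry the load.

    The group is down only when every copy is down at the same time.
    """
    return 1 - (1 - unit_availability) ** copies


for copies in (1, 2, 3):
    a = redundant_availability(0.99, copies)
    print(f"{copies} unit(s) at 99% each -> {a:.6%}")
```

Under these assumptions each added copy multiplies the unavailability by the failure fraction of a single unit, which is why duplicating a weak component often yields more than upgrading it.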

Inclusivity

Including High Availability through the entire application stack including client, middleware, and hardware.

Serviceability

Ability of the system to detect problems, rapidly correct problems, and reconfigure on-line.

Manageability

Ease and ability for evaluating and tuning for maintenance and availability, identification of critical resources, traffic patterns, performance levels and configuration of business-critical applications.

Methods and Skills

To achieve a 5-Nines environment, administration, monitoring, and control of the High Availability IT environment must be simple. Otherwise, installation issues, upgrades, human error, customization, and other factors will impact availability. As user error is a growing cause of outages, techniques should be applied that reduce the chance of user or administrator error.

Installation and Deployment

For many organizations, installations and deployments can prove to be more work than anticipated. One primary reason for this is inadequate planning. Properly installing or deploying a system requires more than unboxing a system or just spinning a tape. In fact, many organizations have found they have the software on media that their system doesn’t support! Prior to installation, performing an environment audit and system overview is critical to a successful implementation. Many hours of precious installation time have been lost to a lack of disk space or to performance problems caused by an overloaded system. Prior to any installation, upgrade, or deployment activity, all appropriate team members should review the system and connectivity environments.
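
Part of an environment audit can be automated. As a minimal example, the Python sketch below checks that a target file system has room for an installation plus a safety margin before work begins. The function name, the 10 percent headroom, and the example path and size are assumptions for illustration.

```python
import shutil


def has_free_space(path, required_bytes, headroom=0.10):
    """True if `path` has room for `required_bytes` plus a safety margin."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1 + headroom)


# Example: confirm roughly 2 GB is available in the current directory's
# file system before starting a hypothetical installation.
if not has_free_space(".", 2 * 1024**3):
    print("Insufficient disk space: resolve before scheduling the install.")
```

Similar pre-flight checks can be added for media type, operating system level, and current system load.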

Change Management, Documentation, and Sustaining

Standards and procedures provide the foundation for implementation. Without a serious effort given to consistent adherence to standards and procedures, a project will decline into an indecipherable hodgepodge of unsupportable variations and techniques. Compounding it can be the lack of documentation when a staff transition occurs with no one remaining on the team that understands the riddle of code left behind. Standards, Procedures and Documentation to consider for the 5-Nines Environment should include the following major topics:

· OS Standards and Procedures - file system layouts, kernel parameters, system backup/recovery, security, performance monitoring, installation and upgrades of the operating system

· Database Standards and Procedures - instance parameters, object sizing, storage and naming, procedures for installation and upgrades, security and backup/recovery

· Applications Development Standards - techniques for managing change procedures, detailed coding standards including source code control, change control, naming conventions, table/index creation

· Network Standards and Procedures - define network addressing and protocols supported for database and application communication

Training and Support

Training and support are critical to sustaining and maintaining availability. Depending upon business requirements and service level agreements, support can make or break the business. Understanding and maintaining the appropriate vendor contracts is also part of the job.

With technology and product updates leapfrogging every six months, personnel need the ability to quickly judge which features and upgrades map to existing business requirements. In order to do that, there needs to be some level of understanding of the features and the technology. Annual training, as well as participation in User Groups, can assist in keeping abreast of issues, features, and technologies.

In Summary

A 5-Nines approach provides survivability to companies that cannot afford downtime. With the explosion of E-Commerce and the increased focus of the deployment of Web-enabled data for most business applications, both users and customers expect to have 7x24 accessibility.

5-Nines is a manifestation of those business requirements.

A successful strategy incorporates procedures and components that work together to ensure fail-over situations are handled appropriately. For 5-Nines to succeed, all components and dependencies need to be identified and prioritized in terms of availability and service levels. Comprehensive service levels should address availability (in the form of both planned and unplanned downtime), performance, and recoverability. A key enabler of always-available applications is the ability to rapidly identify issues and redirect application connectivity. When a computing resource failure is identified, applications and their data need to quickly move to an alternative server with minimal impact to users.

5-Nines distributed system availability and scalability cannot be an afterthought. It is impossible to add scalability to an application or system that was not designed to handle the anticipated load. All components must be tightly integrated, from the computer hardware and operating system to the database software, application layer, network, and interfaces. In addition, the need for tighter integration between applications like Enterprise Resource Planning and the systems they support becomes more apparent with increases in data volumes, transactions, and usage.

Selecting products that support High Availability and ensure success is the only way to maximize the computing power investment and see a viable return on investment. A Highly Available system provides substantial benefits, but can require a significant investment in terms of money and resources. As with any investment, ensuring the proper strategy and tools are in place is the foundation for successfully managing and maintaining the environment.

Bio:

Steve Lemme is a Systems Architect for PLATINUM technology, inc., and has several years of experience with mission-critical VLDB system architecture and distributed multi-tier enterprise computing. Prior to joining PLATINUM, Mr. Lemme worked for several Fortune 500 companies, including Allied Signal, GTE, Apple Computer, and Motorola. Mr. Lemme is also President of the Arizona Oracle Users Group.

For more information on tools to assist with the implementation and lifecycle of the 5-Nines initiative, visit the PLATINUM technology website at www.platinum.com.

Send email to Interex or to the Webmaster©Copyright 1999 Interex. All rights reserved.
