The "5-Nine's" Data Availability Initiative: Make It Reality
Steve Lemme
PLATINUM technology, inc.
With information technology and planning, organizations are now able to achieve five minutes of yearly downtime, or 99.999 percent availability of their enterprise IT systems. This "5-Nines" initiative professed by Hewlett Packard and other vendors is setting industry standards, but what does it take in terms of technical capability to achieve it?
In this presentation, the speaker will help attendees answer these questions and more. Attendees will learn proactive steps they can take to foster maximum data availability, including how to effectively monitor and maintain enterprise systems. The presentation will offer techniques to help attendees predict and prevent unplanned downtime; provide faster response to potentially critical problems; reduce risk; maximize an organization's IT investment; identify what causes system failures; determine how to increase availability while decreasing outages and downtime; cover the three high availability categories: fault tolerant, highly available, and continuously available; define architecture and methodology required to support service levels; and apply proactive prediction principles to prevent system failures.
What is 5-Nines and what does it involve? Many companies are being exposed to this moniker and are trying to understand and evaluate the merits of the initiative. But isn't 5-Nines just another term for High Availability? Yes, but 5-Nines involves more than just hardware High Availability. Let's look at what is driving the need for 5-Nines.
In today's competitive business landscape, 7x24 operations are becoming the standard, especially in those areas driven by E-Commerce, intranets, and Web enablement of data. The unavailability of applications, systems, or networks can mean a significant loss of revenue to an organization. Industry experts and analysts agree that, in order to support E-Commerce applications, a typical network must provide a minimum of 99.99 percent availability. This level of availability requires careful planning and a comprehensive, end-to-end strategy, with Business Availability as the goal.
Avoiding downtime is not a new practice. The availability requirements facing companies today are driven by a fiercely competitive, fast-moving, technology-driven marketplace that is ever more dependent on computing.
High Availability cannot be achieved by implementing a service level or by just purchasing a hardware solution. Availability is a combination of environmental, process, software, and development strategies, computing hardware, and an investment of dollars and human capital, all made to minimize the time that the application and system are not available. It also involves daily processes and methodology to support the environment in meeting business requirements.
Downtime is the amount of time that the application is not accessible to the user. Outages may be caused by environmental factors (loss of electricity, fires, floods, earthquakes), by hardware, application, or software failures, or by human error. Hardware failures have traditionally been the major cause of system failures. However, in distributed environments, software failures and user error both account for a growing percentage of total system failures. There are two types of outages: planned and unplanned.
Planned downtime is usually scheduled for off-peak processing times, such as off-hours or holidays. It is time set aside for maintenance activities such as database reorganizations, adding disk storage, performing off-line backups, and installing patches, upgrades, or new application or operating system software. Because today's applications run in a global, full-time 7x24x365 environment, IT organizations are eliminating planned downtime.
Unplanned downtime is associated with unexpected events such as sudden network, hardware, and software failures. A typical distributed application usually consists of a Web browser front-end and an application reliant upon servers, networks, and databases. A problem with any one of them will cause the application to become unavailable. Consequently, all components need to be treated equally and monitored, as any one piece could create downtime.
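The dependency on every component can be put in numbers. As a rough sketch, assuming the components fail independently and the application needs all of them, the end-to-end availability is the product of the component availabilities. The component names and figures below are illustrative assumptions, not measurements.

```python
from math import prod

def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures; each value is a fraction (0.999 = 99.9%).
    """
    return prod(component_availabilities)

# Hypothetical figures for illustration only.
components = {
    "web server": 0.999,
    "application server": 0.999,
    "network": 0.999,
    "database": 0.999,
}

end_to_end = serial_availability(components.values())
print(f"End-to-end availability: {end_to_end:.4%}")
```

Under these assumptions, four components at 99.9 percent each deliver only about 99.6 percent end to end, which is why no single piece can be left unmonitored.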
Looking at percentages, a 97 percent availability rating means that you incur approximately 263 hours of downtime per year, or about 11 days, whereas 99 percent availability means about 88 hours, or 3.65 days. The majority of companies today would go out of business based upon 97 percent availability. With E-Commerce, anything less than 99.999 percent availability can pose a serious detriment.
Availability Percentage | Yearly Downtime
99 | 3.65 Days
99.5 | 44 Hours
99.9 | 8.8 Hours
99.95 | 4.4 Hours
99.99 | 53 Minutes
99.999 | 5.3 Minutes
99.9999 | 32 Seconds
99.99999 | 3.2 Seconds
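The downtime figures follow directly from the availability percentage. The short sketch below reproduces the arithmetic, assuming a 365-day year of 8,760 hours and counting every hour of the year as scheduled service time; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored

def yearly_downtime_hours(availability_percent):
    """Hours of downtime per year allowed by an availability percentage."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR

def describe(hours):
    """Format a duration in the largest unit that keeps the value at 1 or more."""
    if hours >= 24:
        return f"{hours / 24:.2f} days"
    if hours >= 1:
        return f"{hours:.1f} hours"
    if hours * 60 >= 1:
        return f"{hours * 60:.1f} minutes"
    return f"{hours * 3600:.1f} seconds"

for pct in (97, 99, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999, 99.99999):
    print(f"{pct}% -> {describe(yearly_downtime_hours(pct))}")
```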
Uptime
Uptime is the opposite of downtime and is usually a measurement of availability. The term refers to the time, usually measured over a week, that the application can be accessed by the users. An outage occurs when a loss of required service or a performance impact lasts long enough to create an issue for the user. Although downtime may be planned, if it is not properly communicated it is still considered an outage by the user. The user does not care about the source, only that they have been impacted and how soon the issue will be resolved.
Availability is usually measured on an annual basis. Availability refers to the percentage of time that the application is available to the user. By standard usage, the term availability does not account for planned downtime. Today, the definition is broader, covering mission-critical applications including distributed applications, network, email, scheduling services, and other business solutions companies rely on for day-to-day operations.
A Highly Available system or application is designed to prevent a total loss of service by reducing or managing failures. Preventing Single Points of Failure (SPOF) through component redundancy is one way of providing hardware High Availability. Common hardware SPOFs are CPUs, disks, host adapters, network adapters, hubs, routers, power, and cooling. The major goal of highly available systems is to provide a higher level of availability than standard systems.
More expensive than a Highly Available system, a fault tolerant system contains multiple hardware components that function concurrently, duplicating all of the computation and I/O. This type of system protects against hardware failures by incorporating redundant hardware components in a single system. Keep in mind that a fault tolerant system can still fail: system or application software failures can easily make the application unavailable. Fault tolerant systems cost as much as five times more than non-fault tolerant solutions.
The expectation of a Continuously Available system is non-stop service, with no planned or unplanned outages. Hardware and software failures may occur; however, the intent is to insulate the user from the failure and to reduce the time needed to recover from it to several minutes or less. In a Continuously Available environment, patches and upgrades can be performed with no impact to the user.
With more and more companies dependent on computer technology to run their business, application, database, and network systems have become mission critical to organizations. Loss of availability, or outages to customers, translates to lost opportunity or revenue for a company. System uptime and availability result from a combination of process, software, and computing hardware designed to minimize unavailability of the applications to the business.
An understanding of a company's strategic business objectives is key to helping identify practical, scalable enterprise solutions. Without an understanding of the business and how the applications are being utilized, the IT organization is caught in a never-ending downward spiral. An available, well-defined, and properly configured scalable platform is one that has enough capacity to handle the application's estimated or measured workload, with no bottlenecks in the hardware.
Scalability of the system, network, databases, and applications is key to availability. Scalability includes the amount of data accessible to the application, the number of concurrent users supported, the number of transactions that can be processed in a given unit of time, and the breadth of functionality that the application encompasses.
In gathering information for availability, important decision drivers are:
· Total raw data footprint
· Query complexity
· Expected response time
· Number of concurrent users
· Backup and recovery
· Security
· Integration with other applications or systems
· Skillsets
Achieving a goal of dependability
and availability requires effort at all phases of development. Steps must be
taken at design time, implementation time, and execution time, and during
maintenance utilizing risk reduction and avoidance techniques.
A 5-Nines implementation has its foundation in these principles:
· Mitigation of risks
· Resiliency
· Redundancy
· Inclusion
· Manageability
· Methods and skills
Because companies are so dependent upon their data and systems, proper analysis and design are crucial in the selection of reliable and scalable servers. To jump-start the analysis process, research can be obtained from sources such as industry standard benchmarks, reviews, networking with peers in other companies, and benchmarks conducted in-house or at vendor competency centers.
A properly architected solution will provide a quality, high-performing, available system with capacity for growth. Poor planning, architecture, or operational support can foster poor performance, impeded functionality, high cost of ownership, complex administration, lack of capacity and scalability, and poor reliability and availability.
A properly architected solution relies on capturing the business requirements, designing to meet those requirements, and providing a path for the future. A common pitfall is employing "Bleeding Edge" technologies in expectation of meeting or overcompensating for existing requirements. Only after the business requirements have been gathered and understood should technologies be employed to meet them. Criteria important to consider in distributed system architectures are:
· Processor technology
· Storage and I/O subsystems
· Ability to support change
· Ability to support growth
· Well defined usage and capacity planning
· Elimination of data redundancy
· Elimination of process redundancy
· Price/performance
· Implementation and integration planning
· Administration automation
· Server, Client-Server, N-Tier, and Web-based architecture
Eliminating risks derived from failures involves architecture coupled with monitoring and processes that prevent outages. Prevention depends on monitoring (real-time and historical), trending, and rules and models that predict and prevent hardware and software failures. Common risk mitigation techniques are:
· Fault Avoidance - Use of processes and tools to minimize the introduction of faults
· Fault Minimization - In spite of efforts to prevent or eliminate them, there will be faults in any system. Proper risk assessment and fault minimization can help ensure uptime
· Fault Recognition and Removal - Monitoring and recognition can proactively locate faults and assist with remediation of their root cause
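Trending on historical monitoring data is one simple way to predict a failure before it happens. The sketch below fits a straight line to past samples of a metric and estimates when it will cross a limit. The metric (percent of disk used), the sample values, and the 90 percent limit are assumptions chosen for illustration; real workloads are rarely perfectly linear, so the result is an early warning rather than a forecast.

```python
def linear_trend(samples):
    """Least-squares slope and intercept for (time, value) samples.

    Needs at least two samples taken at different times.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t

def days_until_threshold(samples, threshold):
    """Days after the last sample until the trend line crosses the threshold.

    Returns None when the metric is flat or falling, and 0.0 when the
    trend line has already crossed the threshold.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    last_t = samples[-1][0]
    return max(0.0, crossing - last_t)

# Hypothetical history: (day number, percent of disk used).
history = [(0, 61.0), (1, 62.5), (2, 63.8), (3, 65.6), (4, 67.0)]
remaining = days_until_threshold(history, threshold=90.0)
if remaining is not None:
    print(f"Disk projected to reach 90% in about {remaining:.0f} days")
```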
Backup and Recovery are among the most critical administration tasks that must be regularly performed as part of system administration. There is usually a lot of effort put into getting working backups after the installation of a system. Yet a majority of organizations never test or even assemble a recovery plan. Most often, this gap is discovered only after data has been lost, the backup tapes turn out to be blank or overwritten, or a disaster such as a flood has occurred. Any data critical to the business should be protected. Backups are the easiest way to protect data. However, it is almost certain that data is generated between backups. Safeguards through hardware, replication, or software should be used to bridge protection between backup periods.
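A backup is only useful if it can be restored. As a minimal sketch of a recovery test, the code below restores a backup copy into a scratch directory and compares its checksum with the original. The file names are hypothetical, and the plain file copy stands in for whatever restore procedure a given backup product actually provides.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def backup_is_restorable(original, backup):
    """Restore the backup to a scratch directory and compare checksums."""
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / Path(original).name
        shutil.copy2(backup, restored)  # stand-in for a real restore step
        return sha256_of(restored) == sha256_of(original)

# Demonstration with throwaway files; names are hypothetical.
with tempfile.TemporaryDirectory() as work:
    data = Path(work) / "orders.dat"
    copy = Path(work) / "orders.dat.bak"
    data.write_bytes(b"critical business data")
    shutil.copy2(data, copy)
    print("restorable:", backup_is_restorable(data, copy))
```

Comparing against the live original only makes sense for data that has not changed since the backup; for active data, a checksum recorded at backup time would be the reference instead.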
The ability to recover from a natural disaster, such as a fire, flood, earthquake, or tornado, is equally important. Results of these disasters usually include physical damage or complete loss of systems, data, and workplace. Recovery time is directly related to how much up-front planning was done and which procedures were established to restore the business locally or at a hot site. Thus the impact of a disaster and its cost to the business must be weighed against the cost of preventing it.
Web servers are an
excellent example of why application recoverability is a critical issue: Most
companies with E-Commerce servers cannot afford the business impact caused by
downtime.
Careful consideration should be given to the design and usage of an application in a High Availability situation, with the primary goal to insulate the user from outages or failovers. Methods include: client reconnects to an alternate server if a connection is lost, error handling automation, restartable and recoverable transactions.
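Client reconnection to an alternate server can be sketched in a few lines. The example below is a simplified illustration, not a particular product's API: the `connect` callable, the server names, and the retry counts are all assumptions, and a real client would also need to replay or restart any transaction that was in flight.

```python
import time

def connect_with_failover(servers, connect, retries_per_server=2, delay_seconds=0.0):
    """Try each server in order; return (server, connection) for the first success.

    `connect` is any callable that takes a server name and either returns a
    connection or raises ConnectionError.
    """
    last_error = None
    for server in servers:
        for _attempt in range(retries_per_server):
            try:
                return server, connect(server)
            except ConnectionError as error:
                last_error = error
                time.sleep(delay_seconds)
    raise ConnectionError(f"all servers unavailable: {servers}") from last_error

# Hypothetical stand-in for a real client library: the primary is down.
def fake_connect(server):
    if server == "primary.example.com":
        raise ConnectionError("primary is unreachable")
    return f"connection to {server}"

server, connection = connect_with_failover(
    ["primary.example.com", "standby.example.com"], fake_connect
)
print(server, "->", connection)
```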
The
network has become so ubiquitous to the computing environment that it is
normally taken for granted. Proper network architecture, planning, maintenance,
and monitoring are just as important as any of the other system components or
applications. Redundancy, switching, as well as capacities should be
considered.
One of the most critical factors in availability is systems management. Systems management starts before the software is even ordered. Most often, planning and selection of the architecture, procedures, and system management processes are overlooked. A vast majority of installations occur on an existing platform because of available space. Then, once the application is in production, performance, administration, and bottleneck issues appear. Most often, these issues are viewed as the System Administrator's responsibility. To address these issues, systems must be properly planned, architected, and refined through a set of methods and processes before "slapping in" a set of monitoring tools and expecting them to take care of all deficiencies.
True system management involves monitoring, measuring, alerting, and reporting on the levels of availability, performance, and service. System management tools can also provide real-time business and operational visualization for the many operational components. Systems management usually begins with measurements, baselining, and extrapolation of uptime and performance metrics.
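Measurement can start with something as simple as an outage log. The sketch below computes availability for a reporting period from recorded outage intervals; the dates and durations are invented for illustration, and the intervals are assumed not to overlap.

```python
from datetime import datetime, timedelta

def measured_availability(period_start, period_end, outages):
    """Percentage of the period during which the service was up.

    `outages` is a list of (start, end) datetimes; intervals are assumed
    not to overlap and are clipped to the reporting period.
    """
    period = period_end - period_start
    down = timedelta()
    for start, end in outages:
        clipped_start = max(start, period_start)
        clipped_end = min(end, period_end)
        if clipped_end > clipped_start:
            down += clipped_end - clipped_start
    return 100.0 * (1 - down / period)

# Hypothetical outage log for one week.
week_start = datetime(2024, 1, 1)
week_end = week_start + timedelta(days=7)
outage_log = [
    (datetime(2024, 1, 2, 3, 0), datetime(2024, 1, 2, 3, 45)),     # 45 minutes
    (datetime(2024, 1, 5, 14, 10), datetime(2024, 1, 5, 14, 25)),  # 15 minutes
]
weekly = measured_availability(week_start, week_end, outage_log)
print(f"Weekly availability: {weekly:.3f}%")
```

One hour of outage in a 168-hour week already puts the measured figure near 99.4 percent, which shows how little slack a 5-Nines target leaves.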
Human error is a leading cause of downtime. Any effort to reduce human interaction and error reduces the risk. Consequently, anything that can be automated should be, and tools that perform the automation, control it, or monitor it should be employed to eliminate downtime. Areas in which to reduce the risk through automation, policies, and procedures are:
· Backups and Recovery
· Upgrades
· Operations and Administration
· Maintenance
· Usage
· Performance
· Capacity Planning
· Security and Control
· Testing, Upgrading, and Implementation
Service Level Agreements should be derived from the business requirements that list availability goals, response time, planned downtime periods, and specific performance requirements. An SLA typically specifies user response times and expectations for key business applications, networks, and servers. In addition, the SLA provides a valuable baseline that the MIS department and system managers can utilize to help justify additional resources for improving availability.
Resiliency
The capability of a system
to prevent degradation or failure as well as the damage or loss from a failure
or malfunction. Resiliency includes quality, design, and stability.
Redundancy
Multiple redundant critical components, with redundancy of two or more times for CPUs, network cards, electrical transmission, power supplies, disk drives, switches, routers, cooling systems, or other equipment used to support operations. Redundancy is not limited to hardware; it can also include mirrored applications, setup, and configuration files.
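The benefit of redundancy can be estimated with a standard reliability formula. As a hedged sketch, assuming identical components that fail independently and a failover that always works, the combined availability of n copies is 1 - (1 - a)^n. These assumptions are optimistic: shared causes such as a software fault or a site-wide power loss affect all copies at once.

```python
def redundant_availability(component_availability, copies):
    """Availability of `copies` identical components when one is enough.

    Assumes independent failures and instantaneous, flawless failover.
    """
    return 1 - (1 - component_availability) ** copies

# Illustrative figure: a component that is 99% available on its own.
for copies in (1, 2, 3):
    combined = redundant_availability(0.99, copies)
    print(f"{copies} x 99% component -> {combined:.6%}")
```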
Inclusion
High Availability included through the entire application stack, including client, middleware, and hardware.
Ability of the system to
detect problems, rapidly correct problems, and reconfigure on-line.
Manageability
Ease and ability of evaluating and tuning for maintenance and availability, including identification of critical resources, traffic patterns, performance levels, and configuration of business-critical applications.
To achieve a 5-Nines environment, administration, monitoring, and control of the High Availability IT environment must be simple. Otherwise, installation issues, upgrades, human error, customization, and other factors will impact availability. As user error is a growing cause of outages, techniques should be applied that reduce the chance of user and/or administrator error.
For many organizations, installations and deployments can prove to be more work than anticipated. One primary reason for this is a lack of planning. To properly install or deploy a system requires more than unboxing a system or just spinning a tape. In fact, many organizations have found they have the software on media that their system doesn't support! Prior to installation, performing an environment audit and system overview is critical to a successful implementation. Many hours of precious installation time have been lost to a lack of disk space or to performance problems caused by an overloaded system. Prior to any installation, upgrade, or deployment activity, all appropriate team members should review the system and connectivity environments.
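Part of an environment audit can be scripted. The sketch below checks free disk space before an installation begins, using Python's standard `shutil.disk_usage`. The 2 GiB requirement and the path checked are assumptions for illustration; a real audit would take them from the product's installation guide and cover memory, media, and connectivity as well.

```python
import shutil

def check_free_space(path, required_bytes):
    """Return (ok, free_bytes) for the file system that holds `path`."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes, free_bytes

# Hypothetical requirement: the installation needs 2 GiB on the current volume.
REQUIRED = 2 * 1024 ** 3
ok, free = check_free_space(".", REQUIRED)
status = "OK" if ok else "INSUFFICIENT"
print(f"{status}: {free / 1024 ** 3:.1f} GiB free, {REQUIRED / 1024 ** 3:.1f} GiB required")
```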
Standards and procedures provide the foundation for implementation. Without a serious effort given to consistent adherence to standards and procedures, a project will decline into an indecipherable hodgepodge of unsupportable variations and techniques. Compounding this can be a lack of documentation when a staff transition occurs, with no one remaining on the team who understands the riddle of code left behind. Standards, Procedures, and Documentation to consider for the 5-Nines environment should include the following major topics:
· OS Standards and Procedures - file system layouts, kernel parameters, system backup/recovery, security, performance monitoring, installation and upgrades of the operating system
· Database Standards and Procedures - instance parameters, object sizing, storage and naming, procedures for installation and upgrades, security and backup/recovery
· Applications Development Standards - techniques for managing change procedures, detailed coding standards including source code control, change control, naming conventions, table/index creation
· Network Standards and Procedures - define network addressing and protocols supported for database and application communication
Training and support are critical to sustaining and maintaining availability. Depending upon business requirements and service level agreements, support can make or break the business. Understanding and maintaining the appropriate vendor contracts is also part of the job.
With technology and product updates leapfrogging every six months, personnel need the ability to quickly judge which features and upgrades map to existing business requirements. In order to do that, there needs to be some level of understanding of the features and/or the technology. Annual training, as well as participation in User Groups, can assist in keeping abreast of issues, features, and technologies.
A 5-Nines approach provides survivability to companies that cannot afford downtime. With the explosion of E-Commerce and the increased focus on the deployment of Web-enabled data for most business applications, both users and customers expect to have 7x24 accessibility. 5-Nines is a manifestation of those business requirements.
A successful strategy incorporates procedures and components that work together to ensure fail-over situations are handled appropriately. For 5-Nines to succeed, all components and dependencies need to be identified and prioritized in terms of availability and service levels. Comprehensive service levels should address availability (in the form of both planned and unplanned downtime), performance, and recoverability.
A key enabler to always available applications is the ability to rapidly
identify issues and redirect application connectivity. When a computing
resource failure is identified, applications and their data need to quickly
move to an alternative server with minimal impact to users.
5-Nines distributed system availability and scalability cannot be an afterthought. It is impossible to add scalability to an application or system that was not designed to handle the anticipated load. All components must be tightly integrated, from the computer hardware and operating system to the database software, application layer, network, and interfaces. In addition, the need for tighter integration between applications like Enterprise Resource Planning and the systems they support becomes more apparent with increases in data volumes, transactions, and usage.
Selecting products that support High Availability and ensure success is the only way to maximize the computing power investment and see a viable return on investment. A Highly Available system provides substantial benefits, but can require a significant investment in terms of money and resources. As with any investment, ensuring that the proper strategy and tools are in place is the foundation for successfully managing and maintaining the environment.
Bio:
Steve Lemme is a Systems Architect for PLATINUM technology, inc., and has several years of experience with mission-critical VLDB system architecture and distributed multi-tier enterprise computing. Prior to joining PLATINUM, Mr. Lemme worked for several Fortune 500 companies, including Allied Signal, GTE, Apple Computer, and Motorola. Mr. Lemme is also President of the Arizona Oracle Users Group.
For more information on tools to assist with the implementation and lifecycle of the 5-Nines initiative, visit the PLATINUM technology website: www.platinum.com