With
the advent of more sophisticated multimedia applications, heightened interest
in web-based commerce, deployment of extranets/intranets, and the convergence
of data and telecommunications solutions, more traffic than ever is traversing
the Internet. The Internet is the
“Great Equalizer” in that businesses, large and small, are benefiting from
exposure to a wider audience and the potential increase in their customer
base. Not only are internal and
external users driving increased bandwidth requirements, they are also
demanding a greater quality of service in their networks. External customers accessing company
extranets for information and other electronic commerce applications will be
less tolerant of poor network service quality and its inherent downtime than
in-house users accessing the Web, making the “non-stop network” a
necessity.
Network managers concerned with quantifying service-level metrics such as network availability, session integrity, and percent uptime were once given more lip service than results by equipment vendors. Today, companies relying on electronic commerce or continuous video streaming (e.g., commercials on broadcast TV) as a source of revenue, or brokerages making split-second financial decisions based on global market factors, cannot afford the slightest glitch in their networks. Network downtime of as little as three seconds can carry an opportunity cost of thousands of dollars.
Because data networks are becoming the business lifeline, network managers are increasingly concerned about network availability and session integrity, and need to be aware of the causes and costs of downtime. We will examine the external causes of downtime, the available research quantifying the costs of downtime, the various methods vendors use to provide fault tolerance, and finally, how evolving network technology can play a role in safeguarding your network.
Controllable factors, such as scheduled support, software/hardware installation and upgrades, and routine maintenance/backup, can impact network availability and reduce uptime. But it is the uncontrollable factors, such as power outages, brownouts, equipment failures, EMI, and operator errors, that predominantly impact the network at the worst possible moment. Given the tremendous variability of users' networks, let's take a look at the uncontrollable causes of downtime and provide a perspective on the cost of downtime from the available research.
Blackouts and Brownouts - Putting it in perspective
A recent study cites that PCs and
workstations in a typical large network can be subject to infrequent (monthly),
but significant power interruptions.
Although blackouts and voltage spikes are the most noticeable, they only
account for 12% of power problems. The
majority (80%) of power disturbances are caused by undervoltage occurrences
(brownouts), overvoltages, surges, and power sags. In addition, LAN-related power problems, such as internode
communications interference caused by ground loops between two devices linked
by a data cable, noise on the data cable caused by EMI (electromagnetic
interference), and other seemingly minor interruptions can bring down the
network.
Most network managers are aware of blackout/brownout conditions and take steps to protect their networks by using UPSs (uninterruptible power supplies) or installing fiber optic cables with noise immunity. These measures provide a large degree of protection for workstations and servers, but the network is left vulnerable at the cable and switch-port levels.
Quantifying the Cost of Downtime - Why 99% network uptime
won’t suffice
Any power
disruption can result in expensive downtime. Computer Reseller News cites that the cost of downtime can
range from $300/minute for a medium-sized LAN up to $633/minute for a UNIX
network, not including the cost of lost revenue. Market data from Contingency Planning Research calculates
the cost of downtime ranging from $1200/minute for retail businesses up to a
high of $108K/minute for brokerage operations. If we assume that a small company's 24 x 7 network uptime is 99%, and that every minute of downtime costs at least $1K, a conservative estimate of lost revenue/opportunity cost is $5.25 million per year. Besides lost revenue, the opportunity cost includes the cost of technical support personnel, who typically spend 75% of their time on network problem resolution and support.
According to the available research, Fortune 1000 network administrators across the board stated that they need 99.96% uptime, equal to no more than 3.5 hours of downtime per year (365 days, 24 hrs/day). Against that requirement, merely adding UPS boxes to the network provides only a gross measure of protection. Clearly, today's virtual "non-stop network" requires additional measures of reliability and fault tolerance.
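The arithmetic behind these uptime and cost figures is easy to reproduce. A minimal sketch (the flat $1K/minute cost is the text's assumption for a small company; real per-minute costs vary by business as cited above):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 24 x 7 year

def annual_downtime_minutes(uptime_pct):
    """Minutes of downtime per year for a given uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_pct) / 100.0

def annual_downtime_cost(uptime_pct, cost_per_minute):
    """Opportunity cost of that downtime at a flat per-minute rate."""
    return annual_downtime_minutes(uptime_pct) * cost_per_minute

# 99% uptime leaves 5,256 minutes of downtime per year
print(annual_downtime_minutes(99.0))          # 5256.0 minutes/year
# ...which at $1K/minute is roughly the $5.25 million quoted above
print(annual_downtime_cost(99.0, 1000))       # 5256000.0
# 99.96% uptime corresponds to the 3.5 hours/year figure
print(annual_downtime_minutes(99.96) / 60)    # ~3.5 hours/year
```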
Vendors have traditionally offered many methods to increase network reliability and availability, all based on adding extra hardware. These methods can be classified by how the redundant hardware is switched into service. Some of the more common are:
·	Using the Spanning Tree Protocol or other proprietary protocols,
·	Trunking multiple parallel links for increased bandwidth, which typically includes redundant features,
·	Using network interface cards (NICs) with failover software in the server, or
·	Deploying server clustering.
The choice of method depends largely on the network topology, the number of devices (switches) in the network, and the recovery time required. As mentioned earlier, all of these methods require additional hardware (e.g., duplicate ports, switches, routers) in conjunction with other elements in the network, such as Spanning Tree, proprietary protocols, or drivers for system software, and coordinating these elements is often difficult. Coupled with the uncertainty of the hardware/software interaction is the difficulty of predicting the worst-case switchover time and whether network layer sessions will survive.
Duplicating Hardware Components in the Network
The popularity of distributed switch architectures increases the number of points of failure on the network. As a result, demand for on-board reliability, network availability, and other fault tolerant features has permeated down to low-end stackable switches and hubs. Many switch vendors offer redundant, load-sharing, "hot swappable" N+1 AC power supplies, dual fans (to prevent overheating), redundant management modules, or duplicate port expansion/network modules in the chassis to ensure high network availability. In addition, some vendors use passive backplanes with no active circuitry, redundant switching fabrics to guarantee 100% of the bandwidth all the time, and non-volatile memory in the management module to preserve and reapply configuration data after a power failure. The advantage of these methods is that network connectivity is not compromised, although network performance may still be affected. Beyond the cost of the extra hardware components, the data cable itself can still fail.
Some network managers take preventive measures by duplicating all the equipment in the network, for example by using two switches. This method allows fully meshed network topologies and active, redundant links. It is particularly well suited to applications, such as some video streams, that require low bandwidth and can tolerate slight delays in recovery time. The main drawback is that for time-critical, high-bandwidth applications, use of Spanning Tree or other software-based protocols may cause network sessions to time out during the reconvergence process.
Using Spanning Tree or other Proprietary Protocols
Spanning Tree is universally used by switch vendors to provision multiple paths through an Ethernet network. It permits a completely fault tolerant design by allowing multiple links at every point in the network. The Spanning Tree Algorithm ensures a loop-free path by automatically controlling (severing) ports to avoid network loopbacks while still providing redundancy, and the integrity of the network topology is maintained by each device. The Spanning Tree Protocol enables a learning bridge or router to dynamically work around loops in a network topology by creating a spanning tree. Since Layer 2 switches behave like multiport bridges and deploy the Spanning Tree Algorithm, the secondary port in the network is automatically activated once the primary port fails.
Although using Spanning Tree provides a measure of loop-free network redundancy, its notoriously slow failover times do not provide "instantaneous" recovery in the event of a link failure. According to the available research, recalculating a spanning tree following a network change can take up to 50 seconds (summing parameters such as the forward delay timer and the listening/learning states). At best, tuning these parameters for fast convergence yields a recovery time of about 30 seconds, which does not include delays in message delivery, session interruptions, or timeouts, and the increased traffic flow during reconvergence can create backbone problems.
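The 30-to-50-second range follows directly from the IEEE 802.1D timer defaults. A minimal sketch of that arithmetic (standard default timer values; a direct link failure skips the max-age wait, an indirect failure does not):

```python
# IEEE 802.1D default spanning tree timers (seconds)
MAX_AGE = 20        # time for stale root information to age out after an indirect failure
FORWARD_DELAY = 15  # time spent in each of the listening and learning states

def stp_reconvergence(direct_failure):
    """Worst-case seconds before a blocked port starts forwarding again.

    An unblocking port must pass through the listening and learning
    states (2 x forward delay). For an indirect failure, the switch
    must first wait out the max-age timer before transitioning.
    """
    transition = 2 * FORWARD_DELAY            # listening + learning
    return transition if direct_failure else MAX_AGE + transition

print(stp_reconvergence(direct_failure=True))   # 30 seconds (best case)
print(stp_reconvergence(direct_failure=False))  # 50 seconds (indirect failure)
```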
Major networking vendors such as Cisco, HP, and Cabletron have recognized this problem and devised alternatives to deal with the inherent slowness of Spanning Tree. Cisco's Spanning Tree Uplink Fast (STUF) feature, Hewlett-Packard's proprietary LAN aggregation protocol, and Cabletron's SecureFast technology enable faster convergence and minimize downtime in meshed switch networks with high data throughput rates. By optimizing the Spanning Tree Protocol (Cisco) or using proprietary protocols (HP and Cabletron), these vendors claim that convergence around a failed link can be reduced to 3-5 seconds.
Cisco's Uplink Fast feature, an optimized version of the IEEE 802.1D spanning tree standard, groups uplink ports in a specific way. The disadvantage of the Cisco solution is that you must configure STUF on the access layer switches, and Cisco recommends using a Layer 3 routing protocol at the core. Cabletron's SecureFast uses the Virtual Link State Protocol (VLSP), based on OSPF, to deliver fault tolerant load balancing over an active meshed topology that routes traffic around failures.
The drawbacks of the HP and Cabletron solutions are that the protocols are proprietary and do not interoperate with Spanning Tree. The HP solution needs Cisco's Fast EtherChannel technology to provide point-to-point trunking, and Cabletron requires an integrated hardware/software switching solution that has not been fully supported throughout its product line.
Using Various Load Balancers
Other vendors have championed various load-balancing modules that balance incoming traffic across multiple servers while ensuring fault tolerance. Load balancers, while providing some degree of redundancy, are not designed solely for this purpose; their primary mission is to provide higher bandwidth by aggregating multiple links. Cisco's Fast EtherChannel™ and Cabletron's SecureFast™ are leading technologies in this area.
Cisco's Fast EtherChannel™ has garnered support from fifteen vendors, while Cabletron uses SecureFast™ to create fully meshed network topologies by configuring up to three media-independent links between each pair of switches using the Virtual Link State Protocol (VLSP). Fast EtherChannel is a trunking technology that groups multiple full-duplex Fast Ethernet links to provide fault-tolerant connections between switches, servers, and routers. The main advantage is that the modules transparently provide fault tolerance at the server and switch levels to the external (incoming) user. The disadvantage is that a SecureFast link failover to re-route a connection in a fully active, meshed topology takes an average of three seconds, and convergence time increases with the number of switches in the network; with five or more switches it can take up to 30 seconds. For applications requiring a continuous flow of traffic, the resulting loss of up to three billion bits of data on a Fast Ethernet connection may be unacceptable.
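The data-loss figures quoted here are simply line rate multiplied by outage time. A quick sketch, assuming a saturated 100 Mbps Fast Ethernet link (an upper bound; a lightly loaded link loses less):

```python
FAST_ETHERNET_BPS = 100_000_000  # 100 Mbps Fast Ethernet line rate

def bits_lost(outage_seconds, link_bps=FAST_ETHERNET_BPS):
    """Upper bound on bits lost on a saturated link during an outage."""
    return int(outage_seconds * link_bps)

print(bits_lost(3))    # 300000000 bits for the average 3-second failover
print(bits_lost(30))   # 3000000000 bits (3 billion) when convergence takes 30 seconds
```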
Server-based adapters and failover software
Another recommended approach is to
install a second network interface card (NIC) in the server with failover
software installed. The software allows
the backup adapter to kick in if the primary link fails. Additional features let you bind a single network address to multiple NICs, balance load across multiple adapters, or maintain multiple active connections to a single switch for increased fault tolerance and improved performance. This is a straightforward approach which
will interoperate with various switch vendors’ equipment, has a relatively
short failover time of 3-6 seconds, and is cost-effective ($300-$500 per server
card). However, this method requires
two adapters and uses up an additional PCI slot in the server.
Deploying Clustering between Servers
Server clustering employs duplicate servers, each with its own NIC, connected via a SCSI or Fibre Channel link; Windows NT clustering software typically runs on the servers. It is used for real-time or near-real-time switchover requirements, in that the backup server activates within a minute. However, not all applications can run in an NT clustering environment, and this method does not provide redundancy at the network level.
Conclusions
Table 1 summarizes the advantages and disadvantages of each approach, with the expected recovery (convergence) times in the event of a failure. As shown below, most of these methods take a finite amount of time to re-engage the network link, most rely on Spanning Tree to reroute traffic, and factors such as IP session timeouts, incomplete packet transmission, and network performance degradation remain concerns. Although recovery time is relative and depends on the application and user (i.e., employees accessing e-mail are a lower priority than a brokerage facilitating a multi-million dollar transaction), for networks running mission-critical applications the choice of redundancy method is a crucial decision. Surprisingly, using media converters as an additional fault tolerance safeguard is a little-known but quite effective method of building redundancy into your network. In the next section we explore the use of the media converter.
Table 1. Summary of Fault Tolerant Approaches

Method | Approx. Reconvergence Time | Pros | Cons
Redundant hardware using spanning tree | 30-50 seconds | Network connectivity is maintained; uses existing equipment | Cost of surplus equipment; still uses spanning tree to reroute links; performance degradation due to additional traffic load
Proprietary protocols | 3-5 seconds | Fast failover time | Non-standard protocol; proprietary solution
Load balancers | 3-30 seconds | Fast failover time; transparently provides fault tolerance for incoming traffic | Convergence time increases with the number of switches on the network; not design-optimized for redundancy
NICs with failover software | 3-30 seconds | Straightforward approach; interoperates with various vendors' equipment | May not support UNIX workstations; requires two adapters and takes up PCI slots in the server
NT server clustering | A few seconds up to 1 minute | Provides near-real-time switchovers; no degradation in network performance | Not all applications will run in an NT clustered environment; does not support redundancy at the network level
Media converters | Less than 55 ns | "Instantaneous" failover time, does not require spanning tree; interoperates with various vendors' equipment; media independent, standards based | Can be perceived as another point of failure in the network; requires additional hardware (backup ports, switches)
Media converters are commonly used to integrate fiber optics with Fast Ethernet technology to support growing demands for increased distance and enhanced data security. They also play another role: they are highly effective in establishing redundant links between devices to ensure fault tolerance in Fast Ethernet networks. Media converters can provide fully redundant paths for Fast Ethernet devices such as hubs, routers, servers, and switches.
Redundant converters offer data link duplication to ensure network integrity and to provide the non-stop networking capability essential for high-priority traffic and mission-critical applications. The redundant converter actively monitors the primary link; upon link failure, it automatically redirects traffic to the secondary link with no interruption to normal network operation. When the signal is re-established, the primary link is reactivated and the secondary link returns to standby mode, transparent to the end user. The failover time is imperceptible: less than 20 µs (fewer than 1,500 bits of data are lost).
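The link-selection behavior described above can be sketched as a simple state machine. This is an illustrative model only, not any vendor's firmware; the link-status inputs are assumptions standing in for the converter's signal-detect hardware:

```python
class RedundantConverter:
    """Illustrative model of a redundant media converter's link selection.

    The converter forwards on the primary link while it has signal,
    fails over to the secondary when the primary drops, and reverts to
    the primary (secondary back on standby) once its signal returns.
    """

    def __init__(self):
        self.active = "primary"

    def update(self, primary_up, secondary_up):
        if primary_up:
            self.active = "primary"    # primary restored: secondary on standby
        elif secondary_up:
            self.active = "secondary"  # automatic failover, no operator action
        else:
            self.active = "none"       # both links down: traffic stops
        return self.active

conv = RedundantConverter()
print(conv.update(primary_up=True,  secondary_up=True))   # primary
print(conv.update(primary_up=False, secondary_up=True))   # secondary (failover)
print(conv.update(primary_up=True,  secondary_up=True))   # primary (restored)
```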
Using redundant media converters, you can build a fully redundant, switched network without using Spanning Tree or other routing protocols. The redundant converter is typically connected to a standards-based NIC in the server with either copper or fiber media. By loading a media converter chassis with multiple modules, you can cost-effectively safeguard a fully meshed, multi-switch network.
Managed Network-on-Demand
Network administrators want their networks to function as a utility, just like their phone and electric services. For this to become a reality, they can no longer afford to have dumb, unmanaged devices dispersed throughout critical areas of the network.
Ideally
they want a “lights-out data center”, but can’t afford to be left in the dark
in terms of visibility into the health of critical network convergence
points. That really sums up the network administrator's dilemma and the obvious solutions: manageability and reliability first and foremost, and then interoperable, affordable solutions that scale with ever-changing requirements.
Management is a critical concept in today's networks. With chassis-based media converters, the data traffic of hundreds to thousands of users converges at one point in the network, making it a very critical network juncture. It is now possible to truly manage media conversion operationally, functionally, and environmentally: to proactively collect data and actively control the device, as well as receive alarms.
Managed
media conversion easily integrates with other SNMP management platforms, such
as CiscoWorks, Optivity, Spectrum, HP Network Node Manager, as well as with
emerging network management software.
Managed
media conversion eliminates what was once a black hole in enterprise
management. Seeing and controlling functional, operational and environmental
characteristics is very important, because it gives network managers a new
level of flexibility. Now,
administrators can implement proactive management strategies. They can detect problems before they cause
downtime in the network. For example,
temperature and voltage monitoring allows administrators to know when there is
going to be an overheating problem in a wiring closet before it becomes
catastrophic. The problem can then be
remedied before a single user experiences a minute of downtime.
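In practice, this kind of proactive management amounts to polling environmental values and raising an alert before a threshold is crossed. The sketch below is illustrative only; the variable names, thresholds, and sample readings are assumptions, not taken from any product's MIB, and a real management station would obtain the readings via SNMP polling:

```python
# Illustrative environmental thresholds; real values come from the product MIB
THRESHOLDS = {
    "chassis_temperature_c": 50.0,    # alert before the wiring closet overheats
    "supply_voltage_v": (4.75, 5.25)  # acceptable range for a nominal 5 V supply
}

def check_environment(readings):
    """Return a list of alert strings for readings outside their thresholds."""
    alerts = []
    if readings["chassis_temperature_c"] > THRESHOLDS["chassis_temperature_c"]:
        alerts.append("temperature high: %.1f C" % readings["chassis_temperature_c"])
    low, high = THRESHOLDS["supply_voltage_v"]
    if not low <= readings["supply_voltage_v"] <= high:
        alerts.append("voltage out of range: %.2f V" % readings["supply_voltage_v"])
    return alerts

# Normal readings produce no alerts; out-of-range readings produce two
print(check_environment({"chassis_temperature_c": 42.0, "supply_voltage_v": 5.01}))  # []
print(check_environment({"chassis_temperature_c": 55.5, "supply_voltage_v": 4.60}))
```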
Because
a number of configuration options are available directly through the management
console, it’s no longer necessary to play
hide and seek in wiring closets across a campus or a large network.
Network managers have more users and more complex networking devices to support every day. By automating many aspects of network management, they can control their costs and still manage larger networks with existing staff.
Troubleshooting time within the network can be dramatically reduced, because administrators not only have monitoring tools to alert them to potential problems, but also active management tools to correct many of those problems from the management console itself.
Managed Network-on-Demand provides two levels of security. Dynamic Recovery Mode is used in areas of the network that need high fault tolerance, where you cannot afford a single point of failure. At those points in the network, you simply introduce one or more intelligent redundant converters and connect and cross-connect network links to avoid downtime, even if one or more links, or an entire module, fails. Non-disruptive link reconnection is an immediate by-product of Dynamic Recovery Mode. An alarm notification is sent to the management console so that corrective steps can be initiated while users are still up and running. End users will not notice any disruption in service as a link fails and the failover completes; recovery is instantaneous.
For areas of the network that don't require redundancy, a secondary port on the intelligent media converter can be configured in Network Select Mode (NSM). This redirects traffic from one link to another from the management console at any time. Dedicated networks for backup applications can be implemented, and NSM can also provide security through link isolation. There are many other ways in which this capability can be used within the network.
Figure 1. Network-on-Demand: media converter with redundant modules
Some Considerations in Building Fault Tolerant Networks
As discussed in the previous section, typical fault tolerant approaches each have limitations and drawbacks. Most of these methods involve significant capital investment in duplicate equipment or additional support personnel (in the case of highly technical clustering software). Other approaches can significantly reduce reconvergence time, but rely on proprietary protocols and equipment. Often, reliance on software can exacerbate recovery times because it forces continual retransmissions of TCP/IP packets, causing session timeouts or breakdowns. Reconvergence times are also directly proportional to the number of switches in the network; large switched networks will suffer longer delays. In addition, Spanning Tree protocol parameters can dictate failover times of 50 seconds or more. Redundant media converters offer a few unique advantages over traditional fault tolerant approaches and provide the network administrator with an alternative means of delivering fault tolerance for time-sensitive, mission-critical applications.
Because media converters are hardware-based, they are by nature highly deterministic and do not rely on a specific protocol, software, or communication link between devices. Switches can be configured in parallel without loops, and in many cases the redundant media converter can be used instead of the Spanning Tree Protocol. The redundant converter can be cost-effectively incorporated into a network design at the physical layer, so network MAC addresses and router tables do not have to be modified. When used to provide network redundancy for a NIC in a server, failover time is reduced from 3 or more seconds to 20 µs (from 300 million bits or more down to 1,500 bits of lost data).
Media converters provide seamless, transparent operation when used with IEEE 802.3u Fast Ethernet routers, switches, hubs, and NICs. They may be placed on a desktop or installed within a 19" chassis. Redundant converters can be used with a single switch or scaled up to multiple switches and servers, and are available in a variety of media types (Category 5 UTP, singlemode, and multimode fiber) supporting full-duplex (200 Mbps) or half-duplex (100 Mbps) Fast Ethernet data rates.
Recommendations
Network managers running applications that require session integrity and demand 24 x 7 reliability cannot afford a network failure. To maximize network uptime, network administrators need end-to-end visibility of all network components and the ability to initiate active control through sophisticated management tools. Applying fault-tolerant, intelligent, redundant solutions at critical junctures can be enormously beneficial in safeguarding today's mission-critical, non-stop networks. Used in conjunction with standards-based NICs and server failover software, redundant converters can help eliminate points of failure in the network, preventing data loss due to cable failure, port failure, or catastrophic switch failure. Redundant converter failover time can be 100,000 times faster than most solutions. For mission-critical networks relying on Fast Ethernet backbones, adding the redundant converter is a unique, effective method of building an extra measure of high reliability into today's LANs.
Further Reading/Bibliography
1. Joel Conover, "Building Fault Tolerant Networks," Network Computing, Feb. 15, 1998.
2. Joel Conover, "ATM Switches: Network Muscle Machines," Network Computing, Feb. 1, 1998.
3. Pankaj Chowdhry, "Putting Packets Where They Belong," PC Week, Oct. 15, 1997. A good summary of IP routing schemes.
4. Cisco Systems white paper, "Understanding and Designing Networks Using Spanning Tree and UplinkFast Groups," http://www.cisco.com/warp/public/729/fec/spane_an.htm. Information on spanning tree recovery times.
5. "SecureFast™ Top Ten List," Cabletron solutions paper, http://www.cabletron.com/securefast/topen.html. Cabletron's SecureFast is a standards-based architecture for building private and public networking infrastructures using VLSP.
6. Cabletron Systems white paper, "Reducing the Cost of Network Ownership: A Business Model." Information on downtime costs.
7. Robert J. Kohlhepp, "Two NIC Array Solutions Offer Fault Tolerance and Load Balancing," Network Computing, Aug. 1, 1998.
8. Performance Technologies, Inc. white paper, "The Effects of Network Downtime on Profits and Productivity," http://www.pt.com/whitepaper.pdf. Data and information on the cost of network downtime.
9. Rahul Vir, "LAN Switching," Ohio State University, http://www.cis.ohio-state.edu/~jain/cis788-97/lan_switching/.
10. John Morency, "The Business Case for Non-Stop Networking," white paper, Renaissance Worldwide, Feb. 1998.
11. Michael Reget, "Guarding Data Integrity in a Dirty World," Electronic Engineering Times, Feb. 16, 1998.