With
the advent of more sophisticated multimedia applications, heightened interest
in web-based commerce, deployment of extranets/intranets, and the convergence
of data and telecommunications solutions, more traffic than ever is traversing
the Internet. The Internet is the
“Great Equalizer” in that businesses, large and small, are benefiting from
exposure to a wider audience and the potential increase in their customer
base. Not only are internal and
external users driving increased bandwidth requirements, they are also
demanding a greater quality of service in their networks. External customers accessing company
extranets for information and other electronic commerce applications will be
less tolerant of poor network service quality and its inherent downtime than
in-house users accessing the Web, making the “non-stop network” a
necessity.
Network managers concerned with quantifying service-level metrics such as network availability, session integrity, and percent uptime were once given more lip service than results by equipment vendors. Today, companies relying on electronic commerce or continuous video streaming (e.g., commercials on broadcast TV) as a source of revenue, or brokerages making split-second financial decisions based on global market factors, cannot afford the slightest glitch in their networks. Network downtime of as little as three seconds can carry an opportunity cost of thousands of dollars.
Because data networks are becoming the business lifeline, network managers are increasingly concerned about network availability and session integrity, and need to be aware of the causes and costs of downtime. We will examine the external causes of downtime, the available research quantifying the costs of downtime, the various methods vendors use to provide fault tolerance, and finally, how evolving network technology can play a role in safeguarding your network.
Controllable factors, such as scheduled support, software/hardware installation and upgrades, and routine maintenance/backup, can impact network availability and reduce uptime. But it is the uncontrollable factors, such as power outages, brownouts, equipment failures, EMI, and operator errors, that predominantly impact the network at the worst possible moment. Given the tremendous variability of users' networks, let's take a look at the uncontrollable causes of downtime and provide a perspective on the cost of downtime from the available research.
Blackouts and Brownouts - Putting it in perspective
A recent study cites that PCs and
workstations in a typical large network can be subject to infrequent (monthly),
but significant power interruptions.
Although blackouts and voltage spikes are the most noticeable, they only
account for 12% of power problems. The
majority (80%) of power disturbances are caused by undervoltage occurrences
(brownouts), overvoltages, surges, and power sags. In addition, LAN-related power problems, such as internode
communications interference caused by ground loops between two devices linked
by a data cable, noise on the data cable caused by EMI (electromagnetic
interference), and other seemingly minor interruptions can bring down the
network.
Most network managers are aware of blackout/brownout conditions and take steps to protect their networks by using UPSs (uninterruptible power supplies) or installing fiber optic cables with noise immunity. These measures provide a large degree of protection for workstations and servers, but the network is left vulnerable at the cable and switch-port levels.
Quantifying the Cost of Downtime - Why 99% network uptime
won’t suffice
Any power
disruption can result in expensive downtime. Computer Reseller News cites that the cost of downtime can
range from $300/minute for a medium-sized LAN up to $633/minute for a UNIX
network, not including the cost of lost revenue. Market data from Contingency Planning Research calculates
the cost of downtime ranging from $1200/minute for retail businesses up to a
high of $108K/minute for brokerage operations. If we assume that a small company's 24 x 7 network uptime is 99%, and that every minute of downtime costs at least $1K, a conservative estimate of lost revenue/opportunity cost is $5.25 million per year. Besides lost revenue, the opportunity cost includes the cost of technical support personnel, who typically spend 75% of their time on network problem resolution and support.
According to the available research, Fortune 1000 network administrators across the board stated that they need 99.96% uptime, equal to no more than 3.5 hours of downtime per year (365 days, 24 hrs/day). Against that requirement, merely adding UPS boxes to the network provides only a gross measure of protection. Clearly, today's virtual "non-stop network" requires additional measures of reliability and fault tolerance.
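The arithmetic behind these uptime and cost figures is easy to reproduce. A minimal sketch (the flat $1K/minute cost is the text's assumption for a small company; real per-minute costs vary by business as cited above):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 24 x 7 year

def annual_downtime_minutes(uptime_pct):
    """Minutes of downtime per year for a given uptime percentage."""
    return MINUTES_PER_YEAR * (100.0 - uptime_pct) / 100.0

def annual_downtime_cost(uptime_pct, cost_per_minute):
    """Opportunity cost of that downtime at a flat per-minute rate."""
    return annual_downtime_minutes(uptime_pct) * cost_per_minute

# 99% uptime leaves 5,256 minutes of downtime per year
print(annual_downtime_minutes(99.0))          # 5256.0 minutes/year
# ...which at $1K/minute is roughly the $5.25 million quoted above
print(annual_downtime_cost(99.0, 1000))       # 5256000.0
# 99.96% uptime corresponds to the 3.5 hours/year figure
print(annual_downtime_minutes(99.96) / 60)    # ~3.5 hours/year
```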
Vendors have traditionally offered many methods to increase network reliability and availability, all based on adding extra hardware. These methods can be classified by how the redundant hardware is switched into service. Some of the more common are:
·	Using the Spanning Tree Protocol or other proprietary protocols,
·	Trunking multiple parallel links for increased bandwidth, which typically includes redundant features,
·	Using network interface cards (NICs) with failover software in the server, or
·	Deploying server clustering.
The choice of method depends largely on the network topology, the number of devices (switches) in the network, and the recovery time required. As mentioned earlier, all of these methods require additional hardware (e.g., duplicate ports, switches, routers) in conjunction with other elements in the network, such as Spanning Tree, proprietary protocols, or drivers for system software, and coordinating these elements is often difficult. Coupled with the uncertainty of the hardware/software interaction is the difficulty of predicting the worst-case switchover time and whether network layer sessions will survive.
Duplicating Hardware Components in the Network
The popularity of distributed switch architectures increases the number of points of failure on the network. As a result, demand for on-board reliability, network availability, and other fault tolerant features has permeated down to low-end stackable switches and hubs. Many switch vendors offer redundant, load-sharing, "hot swappable" N+1 AC power supplies, dual fans (to prevent overheating), redundant management modules, or duplicate port expansion/network modules in the chassis to ensure high network availability. In addition, some vendors use passive backplanes with no active circuitry, redundant switching fabrics to guarantee 100% of the bandwidth all the time, and non-volatile memory in the management module to preserve and reapply configuration data after a power failure. The advantage of these methods is that network connectivity is not compromised, although network performance may still be affected. Beyond the cost of the extra hardware components, the data cable itself can still fail.
Some network managers take preventive measures by duplicating all the equipment in the network, for example by using two switches. This method allows fully meshed network topologies and active, redundant links. It is particularly well suited to applications, such as some video streams, that require low bandwidth and can tolerate slight delays in recovery time. The main drawback is that for time-critical, high-bandwidth applications, use of Spanning Tree or other software-based protocols may cause network sessions to time out during the reconvergence process.
Using Spanning Tree or other Proprietary Protocols
Spanning Tree is universally used by switch vendors to provision multiple paths through an Ethernet network. It permits a completely fault tolerant design by allowing multiple links at every point in the network. The Spanning Tree Algorithm ensures a loop-free path by automatically controlling (severing) ports to avoid network loopbacks while still providing redundancy, and the integrity of the network topology is maintained by each device. The Spanning Tree Protocol enables a learning bridge or router to dynamically work around loops in a network topology by creating a spanning tree. Since Layer 2 switches behave like multiport bridges and deploy the Spanning Tree Algorithm, the secondary port in the network is automatically activated once the primary port fails.
Although using Spanning Tree provides a measure of loop-free network redundancy, its notoriously slow failover times do not provide "instantaneous" recovery in the event of a link failure. According to the available research, recalculating a spanning tree following a network change can take up to 50 seconds (summing parameters such as the forward delay timer and the listening/learning states). At best, tuning these parameters for fast convergence yields a recovery time of about 30 seconds, which does not include delays in message delivery, session interruptions, or timeouts, and the increased traffic flow during reconvergence can create backbone problems.
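The 30-to-50-second range follows directly from the IEEE 802.1D timer defaults. A minimal sketch of that arithmetic (standard default timer values; a direct link failure skips the max-age wait, an indirect failure does not):

```python
# IEEE 802.1D default spanning tree timers (seconds)
MAX_AGE = 20        # time for stale root information to age out after an indirect failure
FORWARD_DELAY = 15  # time spent in each of the listening and learning states

def stp_reconvergence(direct_failure):
    """Worst-case seconds before a blocked port starts forwarding again.

    An unblocking port must pass through the listening and learning
    states (2 x forward delay). For an indirect failure, the switch
    must first wait out the max-age timer before transitioning.
    """
    transition = 2 * FORWARD_DELAY            # listening + learning
    return transition if direct_failure else MAX_AGE + transition

print(stp_reconvergence(direct_failure=True))   # 30 seconds (best case)
print(stp_reconvergence(direct_failure=False))  # 50 seconds (indirect failure)
```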
Major networking vendors such as Cisco, HP, and Cabletron have recognized this problem and devised alternatives to deal with the inherent slowness of Spanning Tree. Cisco's Spanning Tree Uplink Fast (STUF) feature, Hewlett-Packard's proprietary LAN aggregation protocol, and Cabletron's SecureFast technology enable faster convergence and minimize downtime in meshed switch networks with high data throughput rates. By optimizing the Spanning Tree Protocol (Cisco) or using proprietary protocols (HP and Cabletron), these vendors claim that convergence around a failed link can be reduced to 3-5 seconds.
Cisco's Uplink Fast feature, an optimized version of the IEEE 802.1D spanning tree standard, groups uplink ports in a specific way. The disadvantage of the Cisco solution is that you must configure STUF on the access layer switches, and Cisco recommends using a Layer 3 routing protocol at the core. Cabletron's SecureFast uses the Virtual Link State Protocol (VLSP), based on OSPF, to deliver fault tolerant load balancing over an active meshed topology that routes traffic around failures.
The drawbacks of the HP and Cabletron solutions are that the protocols are proprietary and do not interoperate with Spanning Tree. The HP solution needs Cisco's Fast EtherChannel technology to provide point-to-point trunking, and Cabletron requires an integrated hardware/software switching solution that has not been fully supported throughout its product line.
Using Various Load Balancers
Other vendors have championed various load-balancing modules that balance incoming traffic across multiple servers while ensuring fault tolerance. Load balancers, while providing some degree of redundancy, are not designed solely for this purpose; their primary mission is to provide higher bandwidth by aggregating multiple links. Cisco's Fast EtherChannel™ and Cabletron's SecureFast™ are leading technologies in this area.
Cisco's Fast EtherChannel™ has garnered support from fifteen vendors, while Cabletron uses SecureFast™ to create fully meshed network topologies by configuring up to three media-independent links between each pair of switches using the Virtual Link State Protocol (VLSP). Fast EtherChannel is a trunking technology that groups multiple full-duplex Fast Ethernet links to provide fault-tolerant connections between switches, servers, and routers. The main advantage is that the modules transparently provide fault tolerance at the server and switch levels to the external (incoming) user. The disadvantage is that a SecureFast link failover to re-route a connection in a fully active, meshed topology takes an average of three seconds, and convergence time increases with the number of switches in the network; with five or more switches it can take up to 30 seconds. For applications requiring a continuous flow of traffic, the resulting loss of up to three billion bits of data on a Fast Ethernet connection may be unacceptable.
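The data-loss figures quoted here are simply line rate multiplied by outage time. A quick sketch, assuming a saturated 100 Mbps Fast Ethernet link (an upper bound; a lightly loaded link loses less):

```python
FAST_ETHERNET_BPS = 100_000_000  # 100 Mbps Fast Ethernet line rate

def bits_lost(outage_seconds, link_bps=FAST_ETHERNET_BPS):
    """Upper bound on bits lost on a saturated link during an outage."""
    return int(outage_seconds * link_bps)

print(bits_lost(3))    # 300000000 bits for the average 3-second failover
print(bits_lost(30))   # 3000000000 bits (3 billion) when convergence takes 30 seconds
```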
Server-based adapters and failover software
Another recommended approach is to
install a second network interface card (NIC) in the server with failover
software installed. The software allows
the backup adapter to kick in if the primary link fails. Additional features let you bind a single network address to multiple NICs, balance load across multiple adapters, or maintain multiple active connections to a single switch for increased fault tolerance and improved performance. This is a straightforward approach which
will interoperate with various switch vendors’ equipment, has a relatively
short failover time of 3-6 seconds, and is cost-effective ($300-$500 per server
card). However, this method requires
two adapters and uses up an additional PCI slot in the server.
Deploying Clustering between Servers
Server clustering employs duplicate servers, each with its own NIC, connected via a SCSI or Fibre Channel link; Windows NT clustering software typically runs on the servers. It is used for real-time or near-real-time switchover requirements, in that the backup server activates within a minute. However, not all applications can run in an NT clustering environment, and this method does not provide redundancy at the network level.
Conclusions
Table 1 summarizes the advantages and disadvantages of each approach, with the expected recovery (convergence) times in the event of a failure. As shown below, most of these methods take a finite amount of time to re-engage the network link, most rely on Spanning Tree to reroute traffic, and factors such as IP session timeouts, incomplete packet transmission, and network performance degradation remain concerns. Although recovery time is relative and depends on the application and user (i.e., employees accessing e-mail are a lower priority than a brokerage facilitating a multi-million dollar transaction), for networks running mission-critical applications the choice of redundancy method is a crucial decision. Surprisingly, using media converters as an additional fault tolerance safeguard is a little-known but quite effective method of building redundancy into your network. In the next section we explore the use of the media converter.
Table 1. Summary of Fault Tolerant Approaches

Method | Approx. Reconvergence Time | Pros | Cons
Redundant hardware using spanning tree | 30-50 seconds | Network connectivity is maintained; uses existing equipment | Cost of surplus equipment; still uses spanning tree to reroute links; performance degradation due to additional traffic load
Proprietary protocols | 3-5 seconds | Fast failover time | Non-standard protocol; proprietary solution
Load balancers | 3-30 seconds | Fast failover time; transparently provides fault tolerance for incoming traffic | Convergence time increases with the number of switches on the network; not design-optimized for redundancy
NICs with failover software | 3-30 seconds | Straightforward approach; interoperates with various vendors' equipment | May not support UNIX workstations; requires two adapters and takes up PCI slots in the server
NT server clustering | A few seconds up to 1 minute | Provides near-real-time switchovers; no degradation in network performance | Not all applications will run in an NT clustered environment; does not support redundancy at the network level
Media converters | Less than 55 ns | "Instantaneous" failover time, does not require spanning tree; interoperates with various vendors' equipment; media independent, standards based | Can be perceived as another point of failure in the network; requires additional hardware (backup ports, switches)
Media converters are commonly used to integrate fiber optics with Fast Ethernet technology to support growing demands for increased distance and enhanced data security. They also play another role: they are highly effective in establishing redundant links between devices to ensure fault tolerance in Fast Ethernet networks. Media converters can provide fully redundant paths for Fast Ethernet devices such as hubs, routers, servers, and switches.
Redundant converters offer data link duplication to ensure network integrity and to provide the non-stop networking capability essential for high-priority traffic and mission-critical applications. The redundant converter actively monitors the primary link; upon link failure, it automatically redirects traffic to the secondary link with no interruption to normal network operation. When the signal is re-established, the primary link is reactivated and the secondary link returns to standby mode, transparent to the end user. The failover time is imperceptible: less than 20 µs (fewer than 1,500 bits of data are lost).
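The link-selection behavior described above can be sketched as a simple state machine. This is an illustrative model only, not any vendor's firmware; the link-status inputs are assumptions standing in for the converter's signal-detect hardware:

```python
class RedundantConverter:
    """Illustrative model of a redundant media converter's link selection.

    The converter forwards on the primary link while it has signal,
    fails over to the secondary when the primary drops, and reverts to
    the primary (secondary back on standby) once its signal returns.
    """

    def __init__(self):
        self.active = "primary"

    def update(self, primary_up, secondary_up):
        if primary_up:
            self.active = "primary"    # primary restored: secondary on standby
        elif secondary_up:
            self.active = "secondary"  # automatic failover, no operator action
        else:
            self.active = "none"       # both links down: traffic stops
        return self.active

conv = RedundantConverter()
print(conv.update(primary_up=True,  secondary_up=True))   # primary
print(conv.update(primary_up=False, secondary_up=True))   # secondary (failover)
print(conv.update(primary_up=True,  secondary_up=True))   # primary (restored)
```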
Using redundant media converters, you can build a fully redundant, switched network without using Spanning Tree or other routing protocols. The redundant converter is typically connected to a standards-based NIC in the server with either copper or fiber media. By loading a media converter chassis with multiple modules, you can cost-effectively safeguard a fully meshed, multi-switch network.
Managed Network-on-Demand
Network administrators want their networks to function as a utility, just like their phone and electric services. For this to become a reality, they can no longer afford to have dumb, unmanaged devices dispersed throughout critical areas of the network.
Ideally
they want a “lights-out data center”, but can’t afford to be left in the dark
in terms of visibility into the health of critical network convergence
points. That really sums up the network administrator's dilemma and the obvious solutions: manageability and reliability first and foremost, and then interoperable, affordable solutions that scale with ever-changing requirements.
Management is a critical concept in today's networks. With chassis-based media converters, the data traffic of hundreds to thousands of users converges at one point in the network, making it a very critical network juncture. It is now possible to truly manage media conversion operationally, functionally, and environmentally: to proactively collect data and actively control the device, as well as receive alarms.
Managed
media conversion easily integrates with other SNMP management platforms, such
as CiscoWorks, Optivity, Spectrum, HP Network Node Manager, as well as with
emerging network management software.
Managed
media conversion eliminates what was once a black hole in enterprise
management. Seeing and controlling functional, operational and environmental
characteristics is very important, because it gives network managers a new
level of flexibility. Now,
administrators can implement proactive management strategies. They can detect problems before they cause
downtime in the network. For example,
temperature and voltage monitoring allows administrators to know when there is
going to be an overheating problem in a wiring closet before it becomes
catastrophic. The problem can then be
remedied before a single user experiences a minute of downtime.
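In practice, this kind of proactive management amounts to polling environmental values and raising an alert before a threshold is crossed. The sketch below is illustrative only; the variable names, thresholds, and sample readings are assumptions, not taken from any product's MIB, and a real management station would obtain the readings via SNMP polling:

```python
# Illustrative environmental thresholds; real values come from the product MIB
THRESHOLDS = {
    "chassis_temperature_c": 50.0,    # alert before the wiring closet overheats
    "supply_voltage_v": (4.75, 5.25)  # acceptable range for a nominal 5 V supply
}

def check_environment(readings):
    """Return a list of alert strings for readings outside their thresholds."""
    alerts = []
    if readings["chassis_temperature_c"] > THRESHOLDS["chassis_temperature_c"]:
        alerts.append("temperature high: %.1f C" % readings["chassis_temperature_c"])
    low, high = THRESHOLDS["supply_voltage_v"]
    if not low <= readings["supply_voltage_v"] <= high:
        alerts.append("voltage out of range: %.2f V" % readings["supply_voltage_v"])
    return alerts

# Normal readings produce no alerts; out-of-range readings produce two
print(check_environment({"chassis_temperature_c": 42.0, "supply_voltage_v": 5.01}))  # []
print(check_environment({"chassis_temperature_c": 55.5, "supply_voltage_v": 4.60}))
```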
Because
a number of configuration options are available directly through the management
console, it’s no longer necessary to play
hide and seek in wiring closets across a campus or a large network.
Network managers have more users and more complex networking devices to support every day. By automating many aspects of network management, they can control their costs and still manage larger networks with existing staff.
Troubleshooting time within the network can be dramatically reduced, because administrators not only have monitoring tools to alert them to potential problems, but also active management tools to correct many of those problems from the management console itself.
Managed Network-on-Demand provides two levels of security. Dynamic Recovery Mode is used in areas of the network that need high fault tolerance, where you cannot afford a single point of failure. At those points in the network, you simply introduce one or more intelligent redundant converters and connect and cross-connect network links to avoid downtime, even if one or more links, or an entire module, fails. Non-disruptive link reconnection is an immediate by-product of Dynamic Recovery Mode. An alarm notification is sent to the management console so that corrective steps can be initiated while users are still up and running. End users will not notice any disruption in service as a link fails and the failover completes; recovery is instantaneous.
For areas of the network that don't require redundancy, a secondary port on the intelligent media converter can be configured in Network Select Mode (NSM). This redirects traffic from one link to another from the management console at any time. Dedicated networks for backup applications can be implemented, and NSM can also provide security through link isolation. There are many other ways in which this capability can be used within the network.
Figure 1. Network-on-Demand: media converter with redundant modules
Some Considerations in Building Fault Tolerant Networks
As discussed in the previous section, typical fault tolerant approaches each have limitations and drawbacks. Most of these methods involve significant capital investment in duplicate equipment or additional support personnel (in the case of highly technical clustering software). Other approaches can significantly reduce reconvergence time, but rely on proprietary protocols and equipment. Often, reliance on software can exacerbate recovery times because it forces continual retransmissions of TCP/IP packets, causing session timeouts or breakdowns. Reconvergence times are also directly proportional to the number of switches in the network; large switched networks will suffer longer delays. In addition, Spanning Tree protocol parameters can dictate failover times of 50 seconds or more. Redundant media converters offer a few unique advantages over traditional fault tolerant approaches and provide the network administrator with an alternative means of delivering fault tolerance for time-sensitive, mission-critical applications.
Because media converters are hardware-based, they are by nature highly deterministic and do not rely on a specific protocol, software, or communication link between devices. Switches can be configured in parallel without loops, and in many cases the redundant media converter can be used instead of the Spanning Tree Protocol. The redundant converter can be cost-effectively incorporated into a network design at the physical layer, so network MAC addresses and router tables do not have to be modified. When used to provide network redundancy for a NIC in a server, failover time is reduced from 3 or more seconds to 20 µs (from 300 million bits or more down to 1,500 bits of lost data).
Media converters provide seamless, transparent operation when used with IEEE 802.3u Fast Ethernet routers, switches, hubs, and NICs. They may be placed on a desktop or installed within a 19" chassis. Redundant converters can be used with a single switch or scaled up to multiple switches and servers, and are available in a variety of media types (Category 5 UTP, singlemode, and multimode fiber) supporting full-duplex (200 Mbps) or half-duplex (100 Mbps) Fast Ethernet data rates.
Recommendations
Network managers running applications that require session integrity and demand 24 x 7 reliability cannot afford a network failure. To maximize network uptime, network administrators need end-to-end visibility of all network components and the ability to initiate active control through sophisticated management tools. Applying fault-tolerant, intelligent, redundant solutions at critical junctures can be enormously beneficial in safeguarding today's mission-critical, non-stop networks. Used in conjunction with standards-based NICs and server failover software, redundant converters can help eliminate points of failure in the network, preventing data loss due to cable failure, port failure, or catastrophic switch failure. Redundant converter failover time can be 100,000 times faster than most solutions. For mission-critical networks relying on Fast Ethernet backbones, adding the redundant converter is a unique, effective method of building an extra measure of high reliability into today's LANs.
Further Reading/Bibliography
1. Joel Conover, "Building Fault Tolerant Networks," Network Computing, Feb. 15, 1998.
2. Joel Conover, "ATM Switches: Network Muscle Machines," Network Computing, Feb. 1, 1998.
3. Pankaj Chowdhry, "Putting Packets Where They Belong," PC Week, Oct. 15, 1997. A good summary of IP routing schemes.
4. Cisco Systems white paper, "Understanding and Designing Networks Using Spanning Tree and UplinkFast Groups," http://www.cisco.com/warp/public/729/fec/spane_an.htm. Information on spanning tree recovery times.
5. "SecureFast™ Top Ten List," Cabletron solutions paper, http://www.cabletron.com/securefast/topen.html. Cabletron's SecureFast is a standards-based architecture for building private and public networking infrastructures using VLSP.
6. Cabletron Systems white paper, "Reducing the Cost of Network Ownership: A Business Model." Information on downtime costs.
7. Robert J. Kohlhepp, "Two NIC Array Solutions Offer Fault Tolerance and Load Balancing," Network Computing, Aug. 1, 1998.
8. Performance Technologies, Inc. white paper, "The Effects of Network Downtime on Profits and Productivity," http://www.pt.com/whitepaper.pdf. Data and information on the cost of network downtime.
9. Rahul Vir, "LAN Switching," Ohio State University, http://www.cis.ohio-state.edu/~jain/cis788-97/lan_switching/.
10. John Morency, "The Business Case for Non-Stop Networking," white paper, Renaissance Worldwide, Feb. 1998.
11. Michael Reget, "Guarding Data Integrity in a Dirty World," Electronic Engineering Times, Feb. 16, 1998.