IPv4 Anycast on Linux

Karthik Prabhakar

Hewlett-Packard Company
3000 Hanover Street, MS20CX
Palo Alto, CA 94304
Phone: (650) 236 - 3720
Fax: (650) 236 - 3632
karthik@corp.hp.com

Disclaimer: This paper does not recommend any specific approach. It describes a technique that has not been commonly employed in the past, and does not reflect any commitment from my employer, Hewlett-Packard Company, toward the concept. The use of anycast has risks associated with it, and the user needs to understand and assume those risks.

1. Abstract

A number of mechanisms are available to provide failover to an alternate server in case of system failure. This paper describes another option, IP anycast, that is not commonly known. Some of the other options are briefly described as well, and anycast is positioned relative to them. Depending on the application and the topological location of the servers relative to each other, anycast may be worth considering as a high-availability mechanism, especially when used with differentiated route metrics, as suggested in this paper.


2. Introduction

The concept of anycast is not new; it was described for IPv4 as early as 1993 [1]. Fundamentally, an anycast group is a group of servers that provide a common service and are identified by an address common to all members of the group. Packets sent to that address (the "anycast address") are delivered to any one server in the group (typically the topologically closest one). One capability that such a mechanism could offer is fault tolerance - i.e., if one server in an anycast group fails, the impact of the failure could be minimized or eliminated, because another server in the same group is readily available.

The benefits of such a mechanism would be immediately apparent to any system or network engineer who is architecting a service that is mirrored on multiple systems for high availability reasons. There are a number of products that address this target segment, and span the entire range of complexity and cost. Anycast may be another option to consider for certain applications, especially if cost and simplicity are important.


3. Assumptions

It is taken for granted that one is trying to provide high availability across a number (two or more) of mirrored servers. The servers provide identical services - i.e., the transactions between the client and a server should be identical irrespective of which specific server is contacted by the client. Mechanisms for mirroring data between the servers are outside the scope of this document, although this is a relatively straightforward task.

It is also assumed that the servers in question are attached to an IP network. Although anycast has been suggested (and used) for other network types (such as ATM), this paper will focus on IP internets.


4. Current Technology Options for Failover and Load-Balancing

4.1 DNS Based Mechanisms

4.1.1 Simple DNS Round Robin

This is the most commonly used mechanism to distribute load between a set of mirrored servers. Some DNS servers (such as recent versions of BIND) have the ability to rotate or otherwise randomize the sequence in which IP addresses are returned in cases where a given canonical name maps to multiple IP addresses. Given a sufficiently large number of queries, this would have the effect of roughly distributing load equally between the various servers. The problem with this approach is that if any one server becomes unavailable, DNS responses could still contain the IP address of the offline server, since the DNS server would be unaware that a given server has failed. As a result, clients that have been pointed to that server would be denied service.
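
The limitation described above can be illustrated with a minimal Python sketch. It is a toy model only: the class name `RoundRobinZone` and the addresses (taken from a documentation range) are assumptions made for this example, and the rotation shown is a simplification, not the algorithm of BIND or any other DNS server.

```python
from collections import deque


class RoundRobinZone:
    """Toy model of a DNS server that rotates the addresses for one name."""

    def __init__(self, addresses):
        self._records = deque(addresses)

    def resolve(self):
        # Return the current ordering, then rotate so that the next
        # query sees a different address first.
        answer = list(self._records)
        self._records.rotate(-1)
        return answer


zone = RoundRobinZone(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
offline = {"192.0.2.2"}  # the DNS server has no knowledge of this set

# Assume each client simply uses the first address in the response.
first_choices = [zone.resolve()[0] for _ in range(6)]
failed = [addr for addr in first_choices if addr in offline]

print(first_choices)
print(f"{len(failed)} of {len(first_choices)} clients were sent to an offline server")
```

Load is spread evenly, but one third of the clients in this run are still pointed at the failed server, because nothing in the rotation depends on server health.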

4.1.2 Enhanced DNS mechanisms

A number of commercial products go a step beyond the basic DNS functions and check the health of each server at periodic intervals to verify that it is available. If a server goes offline, the server's IP address is no longer returned in DNS responses. Some of these DNS server products offer additional options, such as the ability to rate servers with different metrics. A few also try to determine the server that is closest to the system initiating the DNS query on behalf of the client (not necessarily the client itself), with the expectation that this server would provide the best response times for that specific client.

A key dependency for such a mechanism is the time-to-live (TTL) value that is returned along with the DNS response. A reasonably small TTL value may be appropriate so that the most recent server health information is used.

However, a problem encountered in the real world is that some client systems do not respect the TTL value returned in DNS responses. For example, some web proxy systems can be configured with a fixed minimum TTL, and Netscape Navigator caches DNS responses for a minimum of 15 minutes irrespective of the TTL value in DNS responses. Internet Explorer caches the DNS response indefinitely (until the browser is restarted). Thus in practice, such enhanced DNS mechanisms may not be able to guarantee that all clients are aware of the most recent server health information, although clients issuing a new DNS request will receive a response that factors in the latest health information.
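
The interaction between health checks and client-side caching can be sketched as follows. This is a simplified model under stated assumptions: `HealthAwareDNS` and `CachingClient` are hypothetical names, the 30-second TTL is an arbitrary choice, and the 900-second floor stands in for the 15-minute minimum mentioned above.

```python
class HealthAwareDNS:
    """Toy DNS server that omits addresses which failed a health check."""

    def __init__(self, addresses, ttl):
        self.addresses = list(addresses)
        self.healthy = set(addresses)
        self.ttl = ttl

    def resolve(self):
        # Return the first address that is currently considered healthy.
        for address in self.addresses:
            if address in self.healthy:
                return address, self.ttl
        raise LookupError("no healthy server")


class CachingClient:
    """Client resolver that may enforce its own minimum cache lifetime."""

    def __init__(self, dns, min_ttl=0):
        self.dns = dns
        self.min_ttl = min_ttl
        self._cached = None
        self._expires = 0

    def lookup(self, now):
        # Serve from the cache until the (possibly inflated) lifetime ends.
        if self._cached is not None and now < self._expires:
            return self._cached
        address, ttl = self.dns.resolve()
        self._cached = address
        self._expires = now + max(ttl, self.min_ttl)
        return address


dns = HealthAwareDNS(["192.0.2.1", "192.0.2.2"], ttl=30)
compliant = CachingClient(dns)               # honours the 30 s TTL
stubborn = CachingClient(dns, min_ttl=900)   # ignores TTLs below 15 minutes

compliant.lookup(now=0)
stubborn.lookup(now=0)
dns.healthy.discard("192.0.2.1")             # health check fails at t = 10 s

after_compliant = compliant.lookup(now=60)
after_stubborn = stubborn.lookup(now=60)
print(after_compliant, after_stubborn)
```

One minute after the failure, the compliant client has moved to the healthy server, while the client with the cache floor is still using the address of the failed one.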


4.2 Network Address Translation (NAT) based approaches

These approaches typically involve the use of some form of NAT device that assumes the IP address of the server (as visible to clients). Traffic sent to this device is redirected in some manner to the servers where the services are actually running. Typically this is done by translating the destination IP address in the data packets from the address of the NAT device to the address of the actual server. The front end NAT device could be a dedicated hardware device (sometimes termed a "Layer-4 switch") or it could be one system in the cluster that takes on the lead role. In addition to monitoring service health, the front end could also use other metrics, such as system capability, in determining how many sessions are sent to each server.
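
A minimal sketch of the destination-address translation described above is given below. It is an illustration only: the `NatFrontEnd` class, the dictionary packet representation, the addresses, and the plain round-robin session assignment are all assumptions for this example and do not describe any particular product.

```python
class NatFrontEnd:
    """Toy front end that rewrites the destination address per session."""

    def __init__(self, virtual_ip, backends):
        self.virtual_ip = virtual_ip
        self.backends = list(backends)
        self.healthy = set(backends)
        self.sessions = {}   # (client_ip, client_port) -> backend address
        self._next = 0

    def _pick_backend(self):
        # Round robin over the backends, skipping unhealthy ones.
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next % len(self.backends)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backend")

    def translate(self, packet):
        # Packets not addressed to the virtual IP pass through unchanged.
        if packet["dst_ip"] != self.virtual_ip:
            return packet
        key = (packet["src_ip"], packet["src_port"])
        backend = self.sessions.get(key)
        if backend is None or backend not in self.healthy:
            backend = self._pick_backend()
            self.sessions[key] = backend
        # Only the IP header field is rewritten; any address embedded in
        # the application payload would be left untouched.
        return {**packet, "dst_ip": backend}


nat = NatFrontEnd("203.0.113.10", ["192.168.1.11", "192.168.1.12"])
client_a = {"src_ip": "198.51.100.7", "src_port": 40000, "dst_ip": "203.0.113.10"}
client_b = {"src_ip": "198.51.100.8", "src_port": 40001, "dst_ip": "203.0.113.10"}

first = nat.translate(client_a)["dst_ip"]
again = nat.translate(client_a)["dst_ip"]    # same session, same backend
other = nat.translate(client_b)["dst_ip"]
nat.healthy.discard("192.168.1.11")          # health monitor marks it down
moved = nat.translate(client_a)["dst_ip"]
print(first, again, other, moved)
```

Every packet passes through `translate`, which is why the scalability and availability of the front end itself become a concern.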


A number of commercial products providing this functionality are available, and span the entire spectrum of cost and complexity. A number of factors would need to be considered by administrators when following this approach. Scalability of the NAT device or front end system could be a concern, since all traffic is being switched and modified by it. Mechanisms (possibly involving the use of redundant front end systems) would need to be in place to account for the possibility of a failure in the front end system itself. NAT based approaches may not work in cases where the application protocol embeds the server IP address within the protocol data itself, since many NAT devices do not have an awareness of the application protocol itself. Furthermore, in most cases, this scheme constrains all the servers to a single location, typically directly attached to the Layer-4 switch, or on the same subnet as the front end system.


4.3 IP address Remapping

A few approaches involve the placement of backup servers in the same subnet as the primary servers. Failure of the primary server is detected through various heartbeat mechanisms, and the backup server assumes the IP address of the primary through mechanisms such as ARP spoofing. This mechanism again typically constrains the primary and the backup systems to be in the same IP subnet.


5. The Suggested Anycast Approach

5.1 IP Anycast - Simple Mode

IP anycast uses a fundamental principle of IP routing, namely that if multiple routes exist to a particular IP destination, an IP router will pick one route (typically the route with the lowest metric). Thus if it is possible to reach an IP unicast destination through multiple neighbors, then the router will pick one neighbor to send the packet to (although there is no guarantee that subsequent packets will be sent along the same path).


5.1.1 Theory of Operation - Simple Mode

[image]

Given a sample IP topology in the figure, we would enable the use of anycast in a naive manner as described below. However, as pointed out at the end of this section, there are a number of constraints to be considered that may prevent this from working cleanly in all cases.

Essentially, we have multiple mirrored servers in different locations, each with its own unique IP unicast address. We define an anycast address that will now refer to the common server group. An IP alias is defined for each server, and is configured with the anycast address (i.e., the alias on each server is configured with the identical IP address). We then run a routing protocol daemon (running BGP in our example) on each server that advertises reachability to the IP anycast address.

Thus when a client sends packets to the anycast address, they will follow the normal routed path. As far as routers in the internet are concerned, it simply appears that there are multiple paths possible to reach the anycast address (they are unaware of the fact that there are actually multiple systems). As a result, they pick one path over which they forward the packet. In most cases, routers will pick the shortest path to the destination, and as a result, the traffic from a given client will be forwarded to the server that is topologically closest to it (in terms of routing metrics).

If a given server crashes, routing advertisements cease to occur from that server. As a result, the routing tables on adjacent routers are updated, and all future packets to the anycast IP address are routed to an alternate server that is still online, and generating route advertisements.

Thus in the example, Server 1 and Server 2 are both configured with the IP address 10.10.10.1 as an IP alias. Server 1 has an IBGP session to its neighboring router chihgw34, and advertises reachability to 10.10.10.1 through 15.30.45.1. Similarly, Server 2 has an IBGP session to its neighboring router aushgw43, and advertises reachability to 10.10.10.1 through 15.75.90.1. We use Merit's mrtd code as the routing daemon on Server 1 and Server 2. Apache was used to illustrate the example of a mirrored web server.

The name anycast.dummy.hp.com was created and mapped to the anycast address 10.10.10.1. Now when a client connects to anycast.dummy.hp.com, it gets pointed to one of the two servers. If the server that it is talking to fails, traffic gets rerouted to the other server.
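
The path selection and failover behavior described above can be sketched as a toy routing table. This is only an illustration: the `Router` class and the metric values 100 and 600 are assumptions made for the example and are not part of mrtd or of any real routing daemon; only the anycast address and the router names come from the example above.

```python
class Router:
    """Toy routing table: prefix -> {next hop: metric}."""

    def __init__(self):
        self.routes = {}

    def advertise(self, prefix, next_hop, metric):
        self.routes.setdefault(prefix, {})[next_hop] = metric

    def withdraw(self, prefix, next_hop):
        # Called when advertisements from a neighbor cease.
        self.routes.get(prefix, {}).pop(next_hop, None)

    def best_next_hop(self, prefix):
        candidates = self.routes.get(prefix, {})
        if not candidates:
            return None
        # Lowest metric wins. Tie-breaking between equal metrics is
        # implementation specific and is not modelled here.
        return min(candidates, key=candidates.get)


ANYCAST = "10.10.10.1/32"

# A router that happens to be closer to Chicago than to Austin
# (metric values are hypothetical).
router = Router()
router.advertise(ANYCAST, "chihgw34", metric=100)
router.advertise(ANYCAST, "aushgw43", metric=600)
before_failure = router.best_next_hop(ANYCAST)

# Server 1 crashes: its route is withdrawn and traffic shifts to Austin.
router.withdraw(ANYCAST, "chihgw34")
after_failure = router.best_next_hop(ANYCAST)
print(before_failure, after_failure)
```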


5.1.2 Problems with the Simple Mode

While this seems straightforward, there are a number of situations that need to be considered. Most of these have to do with the possibility of some routers along the path having more than one equal cost route to the destination. In this case, the behavior of the router is not defined, and different routers may choose different approaches to determining the next hop. In any case, one cannot guarantee (in an arbitrary internet) that different packets from the same client to the anycast address will be sent to the same server. Sending different packets within the same TCP connection to different servers would be fatal to the connection, and would probably result in a premature reset of the connection.

Thus, in the example above, if the client is located in Jakarta and tries to correspond with anycast.dummy.hp.com, the router in Jakarta might find two equal cost routes to reach 10.10.10.1 - one via Austin, and one via Chicago. In this case, the behavior is specific to the router implementation. For example, in the case of the specific switching mode used in our routers, it was possible that packets from Jakarta to 10.10.10.1 get switched to Austin for one minute, and then to Chicago for the next minute. As a result, the client browser in Jakarta could occasionally see reset connections (although the browser typically reestablishes the connection if the user reloads the page).
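
The effect of such alternation on a TCP connection can be sketched with a small simulation. The one-minute alternation below mirrors the behavior observed above, but the functions `next_hop_for` and `simulate_connection` are hypothetical and model time in whole minutes only.

```python
def next_hop_for(minute, equal_cost_hops):
    """Router alternates between equal-cost next hops once per minute."""
    return equal_cost_hops[minute % len(equal_cost_hops)]


def simulate_connection(start_minute, duration_minutes, hops):
    """Return the minute at which the connection is reset, or None."""
    # The server that receives the SYN is the only one holding state.
    owner = next_hop_for(start_minute, hops)
    for minute in range(start_minute, start_minute + duration_minutes):
        if next_hop_for(minute, hops) != owner:
            # Packets now reach a server with no state for this
            # connection, which answers with a reset.
            return minute
    return None


hops = ["Chicago", "Austin"]
short_lived = simulate_connection(start_minute=0, duration_minutes=1, hops=hops)
long_lived = simulate_connection(start_minute=0, duration_minutes=3, hops=hops)
print(short_lived, long_lived)
```

A connection that completes within one switching interval survives, which is consistent with the observation that resets are seen only occasionally.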

This is one reason why current literature recommends that an anycast address not be used as the destination address of a TCP connection. A number of alternatives have been suggested.

RFC 1546 [1] suggested one approach to address the problem, although it requires non-trivial changes to the host TCP stack on both the client and the server. Quoting from RFC 1546:

"The solution to this problem is to only permit anycast addresses as the remote address of a TCP SYN segment (without the ACK bit set). TCP can then initiate a connection to an anycast address. When the SYN-ACK is sent back by the host that received the anycast segment, the initiating TCP should replace the anycast address of its peer, with the address of the host returning the SYN-ACK. (The initiating TCP can recognize the connection for which the SYN-ACK is destined by treating the anycast address as a wildcard address, which matches any incoming SYN-ACK segment with the correct destination port and address and source port, provided the SYN-ACK's full address, including source address, does not match another connection and the sequence numbers in the SYN-ACK are correct.) This approach ensures that a TCP, after receiving the SYN-ACK is always communicating with only one host."

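The matching rule in the quoted passage can be sketched from the point of view of the initiating host. This is a hedged illustration, not an implementation of a TCP stack: the data classes and the function `accept_syn_ack` are hypothetical, the client address and ports are invented, and 15.30.45.1 is assumed here to be the unicast address of the answering server.

```python
from dataclasses import dataclass


@dataclass
class PendingConnection:
    local_addr: str
    local_port: int
    remote_addr: str      # the anycast address until a SYN-ACK arrives
    remote_port: int
    iss: int              # initial sequence number sent in our SYN
    peer_bound: bool = False


@dataclass(frozen=True)
class SynAck:
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int
    ack: int


def accept_syn_ack(pending, seg, other_connections=()):
    """Bind the pending connection to the host that answered, if valid."""
    # A segment whose full address matches another connection belongs there.
    if (seg.dst_addr, seg.dst_port, seg.src_addr, seg.src_port) in other_connections:
        return False
    # The anycast peer address is a wildcard: only the local address and
    # port and the remote port have to match.
    addressing_ok = (
        seg.dst_addr == pending.local_addr
        and seg.dst_port == pending.local_port
        and seg.src_port == pending.remote_port
    )
    sequence_ok = seg.ack == (pending.iss + 1) % 2**32
    if not (addressing_ok and sequence_ok):
        return False
    # Replace the anycast address with the address of the responder.
    pending.remote_addr = seg.src_addr
    pending.peer_bound = True
    return True


conn = PendingConnection("192.0.2.50", 40000, "10.10.10.1", 80, iss=1000)
bad = accept_syn_ack(conn, SynAck("15.30.45.1", 80, "192.0.2.50", 40000, ack=5))
good = accept_syn_ack(conn, SynAck("15.30.45.1", 80, "192.0.2.50", 40000, ack=1001))
print(bad, good, conn.remote_addr)
```

After the valid SYN-ACK, all further segments are addressed to the unicast address, so ordinary routing keeps them on one server.
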
Researchers at IBM suggest another approach [3]: the use of the IP source route option to ensure that subsequent packets are sent to the same server. However, their suggested scheme would require changes to the network stack on the server. Furthermore, packets containing IP options are not treated with the same precedence as packets without IP options in current routers, leading to the possibility of arbitrary effects. In addition, some firewalls do not permit packets containing source route options.

Another possibility is that the simple mode might be used where the administrator has precise knowledge of, and control over, the location of the servers, the routing metrics used in the internet topology, and the location of clients relative to the servers. For example, if the administrator places one server in the US and one in Europe, and can guarantee that no client is equidistant from both servers (possibly as a result of an extremely high routing metric used for trans-oceanic links), then the simple mode of anycast might be considered in this constrained topology.

A third domain where the simple mode of anycast may be used could be where the individual client-server transactions are short lived and stateless. For example, the transaction may be completed through the exchange of a single request and response packet. In this case, it does not matter that the next request could be sent to a different server, since that is a different transaction.


5.1.3 Simple Mode - Conclusion

The simple mode seems like a nice way to distribute load between multiple servers that are geographically distributed, and to automatically point clients to the topologically closest server. However, there does not yet seem to be any clean way to handle the general case where a client is equidistant from two or more servers without requiring changes to the server and/or client. As a result, this mode does not yet seem applicable in arbitrary topologies, and requires caution.


5.2 IP Anycast - Differentiated Metrics mode

From the previous sections, it should be obvious that the simple mode of anycast is constrained in that it does not provide us with an easy way to ensure that all packets for a given connection are sent to the same server. We therefore suggest an alternate mechanism to ensure that packets of a given connection are sent to the same server as long as the server is not offline.


5.2.1 Theory of Operation - Differentiated Metrics Mode

In this case, we configure each server in the cluster with two IP aliases. A simple IP unicast address (termed the primary address) is used for one alias (each server has its own primary IP address). A routing protocol daemon runs on each server, and advertises reachability to its primary IP address with a small route metric (say 10) associated with the route.

In addition, corresponding to each primary address of the systems in the cluster, another system is configured with the same address (but designated as the secondary address) as an IP alias. Thus, for every primary IP address of a system in the cluster, there is a corresponding secondary IP address (identical to the primary IP address) that is configured as an alias on another system. The routing protocol daemon on the system with the secondary address is configured to advertise reachability to that IP address, but with a suitably high route metric (e.g., 25000).
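
One way to lay out the primary and secondary aliases is sketched below. The text above only requires that each primary address also be configured on some other system; the ring arrangement used here (each server backs up its predecessor), the function `build_plan`, the server names, and the addresses are assumptions for illustration.

```python
PRIMARY_METRIC = 10
SECONDARY_METRIC = 25000


def build_plan(primary_addresses):
    """Map each server to the (address, role, metric) routes it advertises."""
    servers = list(primary_addresses)
    if len(servers) < 2:
        raise ValueError("at least two servers are needed for failover")
    plan = {
        server: [(primary_addresses[server], "primary", PRIMARY_METRIC)]
        for server in servers
    }
    for index, server in enumerate(servers):
        # Assumed layout: the next server in the ring carries this
        # server's address as its secondary alias.
        backup = servers[(index + 1) % len(servers)]
        plan[backup].append(
            (primary_addresses[server], "secondary", SECONDARY_METRIC)
        )
    return plan


plan = build_plan({
    "server1": "192.0.2.1",
    "server2": "192.0.2.2",
    "server3": "192.0.2.3",
})
for server, routes in plan.items():
    print(server, routes)
```

Each address ends up advertised exactly twice: once with the low metric by its primary and once with the high metric by its secondary.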

We would then use a scheme such as simple DNS round robin to distribute load between systems in the cluster. Since the metric to the system configured as the primary is always less than the metric to the system configured as the secondary, traffic to a given IP address will always be sent to its primary system. If the primary system crashes, however, routing updates from that system stop, and the route to the secondary becomes the preferred route. Thus on the failure of the primary, traffic is automatically redirected to the secondary.

It should be noted that connections that were active when the server crashed will be reset when they are redirected to the secondary. However, this cannot be prevented: the primary has crashed, and the connections could not be sustained anyway.

Thus in the example, Server 1 is configured to advertise reachability to 10.10.10.1 with a metric of 10, and Server 2 is configured to advertise 10.10.10.1 with a metric of 30000. If the cost of the link between Austin and Chicago is, say, 500, then all packets destined to 10.10.10.1 will be routed to Server 1 (i.e., to Chicago): even a router in Austin sees a cost of 510 via Chicago against 30000 locally. However, if Server 1 is offline and route updates from Server 1 stop, then the route being advertised from Austin (with a metric of 30500+ as seen from Chicago) is now the best route. As a result, traffic to 10.10.10.1 gets routed to Server 2 (in Austin).
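
The arithmetic of this example can be checked with a short sketch. It assumes that path costs are simply additive (advertised metric plus the cost of the Austin-Chicago link when the server is at the other site) and that reaching a local server adds no cost; the function and dictionary names are hypothetical.

```python
LINK_COST = 500  # cost of the Austin <-> Chicago link in the example

ADVERTISED_METRIC = {"Server 1": 10, "Server 2": 30000}
SITE_OF = {"Server 1": "Chicago", "Server 2": "Austin"}


def route_costs(router_site, online_servers):
    """Total cost to 10.10.10.1 via each online server, seen from one site."""
    return {
        server: ADVERTISED_METRIC[server]
        + (0 if SITE_OF[server] == router_site else LINK_COST)
        for server in online_servers
    }


def preferred_server(router_site, online_servers):
    costs = route_costs(router_site, online_servers)
    return min(costs, key=costs.get)


both = ["Server 1", "Server 2"]
chicago_costs = route_costs("Chicago", both)
austin_costs = route_costs("Austin", both)
print(chicago_costs, austin_costs)

normal = {site: preferred_server(site, both) for site in ("Chicago", "Austin")}
failed = {site: preferred_server(site, ["Server 2"]) for site in ("Chicago", "Austin")}
print(normal, failed)
```

Both sites choose Server 1 while it is online, and both fall back to Server 2 once its route is the only one left.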


5.2.2 Differentiated Metrics Mode - Conclusion

It appears that by careful selection of route metric values, it is possible to ensure that traffic from a single connection does not get switched to different members of an anycast group. As long as the route metric being advertised by the secondary is significantly greater than the aggregate link cost between the primary and the secondary, traffic should flow to the primary. However, in the case of the primary failing, traffic is automatically rerouted to the secondary.
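
The condition on the metrics can be stated a little more precisely. The derivation below is a sketch in notation introduced here, not taken from the original text: $m_p$ and $m_s$ are the metrics advertised by the primary and the secondary, $d_p(r)$ and $d_s(r)$ are the path costs from a router $r$ to the primary and secondary sites, and $C$ is the aggregate link cost between the two sites. It assumes that metrics are additive and that path costs are symmetric shortest-path costs.

```latex
% Router r forwards to the primary whenever
m_p + d_p(r) < m_s + d_s(r).
% Shortest-path costs satisfy the triangle inequality
d_p(r) \le d_s(r) + C,
% so for every router r
m_p + d_p(r) \le m_p + d_s(r) + C < m_s + d_s(r)
\quad \text{whenever} \quad m_s - m_p > C.
% With the values used in this paper:
30000 - 10 = 29990 > 500.
```

Under these assumptions, a secondary metric that exceeds the primary metric by more than the inter-site cost is sufficient for every router to prefer the primary while it is online.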


6. Caveats

Some caveats that merit consideration:


7. Notable Benefits of Anycast

8. Conclusions

IPv4 anycast in a general sense is clearly not suitable for casual deployment in an arbitrary topology. However, when modified to make use of routing metrics as suggested in this paper, anycast seems like a powerful mechanism, and may not require modifications to the network stacks in the client, server or the network infrastructure.

Anycast has not been widely deployed, possibly because of the lack of a clean mechanism to ensure that traffic on a given connection does not get forwarded to more than one system in the anycast group. We suggest a way (using route metrics) that may provide a relatively clean way to address this problem.


9. References

1. RFC 1546 - Host Anycasting Service - Craig Partridge, Trevor Mendez, Walter Milliken [November 1993]

2. RFC 2373 - IP Version 6 Addressing Architecture - Robert Hinden, Stephen Deering [July 1998]

3. IBM Research Report RC20938 - Using network layer anycast for load distribution in the Internet - Erol Basturk, et al. [July 1997]
