High Availability - Clusters have Issues

As IT services become more and more important to the organization, the notion of a service being down becomes scary. So the organization begins to search for ways to make its IT services more available. The usual solution for high availability is to place the IT service on a cluster system.

So, let's start with a definition:
A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer. They come in three generic flavors:

  • High-availability (HA) clusters - implemented for the purpose of better availability of IT services
  • Load-balancing clusters - distributing a workload evenly over multiple nodes
  • Grid computing - large sets of computers optimized for workloads which consist of many independent jobs or packets of work

High Availability Cluster
For a typical corporation the 'weapon of choice' is the high-availability cluster. The simplest form of a high-availability cluster contains two computers and a shared disk resource.

Most high-availability clusters run in a 'failover' mode, also known as 'active/standby' mode. This means that one of the computers (nodes) is running the IT service (web server, database server or similar) while the other node is idling and waiting for the first node to fail.

Should it fail, the second node will take over the IT services and related resources (usually disk volumes, IP addresses and hostnames) and continue to run the service. This takeover takes anywhere from several seconds to a minute, which is acceptable for most types of services.

The takeover includes a process called 'voting'.
  1. Both nodes check each other's health at regular intervals. This health check is known as a heartbeat.
  2. When one of the nodes does not respond, the other one assumes that it has failed and that it must take over the IT service.
  3. The problem with an immediate decision to take over is that the missing response may be just a connectivity issue, in which case the first node is still up and running, and both nodes will end up fighting over the IT service. This is known as a 'split brain' cluster.
  4. To avoid this situation, the cluster needs an odd number of voting elements, so a third element must be included. Since a third computer can be expensive, the usual third element is a disk drive that is connected to both servers. This disk drive is known as a 'quorum disk'.
  5. So, in case of a failure, the surviving node first contacts the quorum disk and performs a 'vote': usually it writes a file and waits a predetermined time to see whether the other node erases it. If the file is still there, the vote is successful, and the surviving node takes over the IT services.
This entire voting process takes several milliseconds, so it does not delay the failover process.
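
The steps above can be sketched in a few lines of Python. This is only an illustration, not how any particular cluster product does it: the function names, the vote file name and the use of an ordinary directory as a stand-in for the quorum disk are all assumptions made for the example.

```python
import os
import tempfile
import time


def heartbeat_missed(last_seen, now, timeout):
    """Return True when the peer has been silent for longer than `timeout` seconds."""
    return (now - last_seen) > timeout


def quorum_vote(quorum_dir, node_name, wait_seconds, sleep=time.sleep):
    """Write a vote file to the (simulated) quorum disk and check that it survives.

    A peer that is still alive is expected to erase the file during the wait.
    `quorum_dir` is a hypothetical stand-in for the shared quorum disk.
    """
    vote_path = os.path.join(quorum_dir, "vote-" + node_name)
    with open(vote_path, "w") as vote_file:
        vote_file.write(node_name)
    sleep(wait_seconds)  # give a live peer time to erase the vote
    return os.path.exists(vote_path)


def should_take_over(last_seen, now, timeout, quorum_dir, node_name, wait_seconds):
    """Take over only if the heartbeat is missing AND the quorum vote succeeds."""
    if not heartbeat_missed(last_seen, now, timeout):
        return False  # peer is healthy, nothing to do
    return quorum_vote(quorum_dir, node_name, wait_seconds)


if __name__ == "__main__":
    shared = tempfile.mkdtemp()  # pretend this directory is the quorum disk
    # Peer last seen 5 seconds ago, timeout is 3 seconds, nobody erases the vote.
    print(should_take_over(0.0, 5.0, 3.0, shared, "node2", 0.0))
```

The point of the second check is that a missing heartbeat alone is never enough to trigger a takeover: the vote on the shared disk is what separates a dead peer from a broken network link.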

Naturally, there are issues with using clusters. Here are the most common:
  • Cost - Cluster systems need specific cluster-aware software, the hardware is usually highly redundant and the shared disk systems are quite expensive.
  • Resource Waste - In a failover cluster, the most common variety, one of the cluster computers is mostly idle, just sitting there and waiting for the first node to fail.
  • Difficult performance scaling - In a failover cluster, if the active node does not have sufficient power, it is not easy to simply swap in a faster CPU. Every component of a computer designed to run in a cluster is more expensive and needs approval from the cluster software vendor confirming that it is compatible with the cluster solution. And even if you manage to upgrade the system, you must take care to upgrade both nodes, so that performance remains the same if a failover occurs.
  • No protection against software error - In essence, the cluster is not a silver bullet. It protects against hardware failure, but in no way helps against corruption of information caused by faulty software or human error.
The high-availability cluster is an excellent solution for increasing IT service availability, if you can live with its issues:
  • For maximum effect it needs to be supported by methods of protection against software or human error (backup and archive).
  • To reduce resource waste, you can run several IT services and balance them across both nodes, so each node acts as failover for the services running on the other node. But bear in mind that when a failover occurs, you'll have to run all services on one node, which can create a performance issue.
  • If the cost of hardware and upgrades is a major issue, you can even consider an asymmetric cluster, with one node being much more powerful than the other. This is a double-edged sword: should a failover occur, you'll be left running on considerably lower resources, which may not be acceptable to the organization.
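
The last two trade-offs come down to simple arithmetic: after a failover, the surviving node has to carry the combined load of both. Here is a minimal sketch of that check, assuming load and capacity are expressed in the same made-up unit (say, percent of a reference machine); the function name and the numbers are purely illustrative.

```python
def survivors_that_cope(load_by_node, capacity_by_node):
    """For each node, report whether it could carry the whole cluster's load alone.

    load_by_node:     load currently running on each node (hypothetical units)
    capacity_by_node: how much load each node can handle (same units)
    """
    total_load = sum(load_by_node.values())
    return {node: total_load <= capacity
            for node, capacity in capacity_by_node.items()}


if __name__ == "__main__":
    # Asymmetric cluster: node "a" is the big machine, node "b" the small one.
    load = {"a": 40, "b": 30}
    capacity = {"a": 100, "b": 60}
    print(survivors_that_cope(load, capacity))
```

In this made-up example the big node can absorb everything if the small one fails, but the small node cannot carry the combined load of 70, which is exactly the double-edged sword described above.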

Talkback and comments are most welcome
