High Availability - Clusters have Issues

As IT services become more and more important to the organization, the notion of a service being down becomes scary. So the organization begins to search for ways to make its IT services more available. The usual solution for high availability is to place the IT service on a cluster system.

So, let's start with a definition:
A computer cluster is a group of linked computers, working together closely so that in many respects they form a single computer. They come in three generic flavors:

  • High-availability (HA) clusters - implemented for the purpose of better availability of IT services
  • Load-balancing clusters - distributing a workload evenly over multiple nodes
  • Grid computing - large sets of computers optimized for workloads which consist of many independent jobs or packets of work

High Availability Cluster
For a typical corporation the 'weapon of choice' is the High-availability cluster. The simplest form of a high availability cluster contains two computers and a shared disk resource.

Most high availability clusters run in a 'failover' mode, also known as 'active/standby' mode. This means that one of the computers (nodes) is running the IT service (web server, database server or similar) while the other node is idling and waiting for the first node to fail.

Should it fail, the second node will take over the IT services and related resources (usually disk volumes, IP addresses and hostnames) and continue to run the service. This takeover takes anywhere from several seconds to a minute, which is acceptable for most types of services.

The process of takeover includes a process called 'voting'.
  1. Both nodes check each other's health at regular intervals. This health check is known as a heartbeat.
  2. When one of the nodes does not respond, the other one will assume that it has failed and that the IT service needs to be taken over.
  3. The problem with an immediate decision to take over is that the missing response can be just a connectivity issue, in which case the first node is still up and running - and both nodes will end up fighting over the IT service. This is known as a 'split brain' cluster.
  4. To avoid this situation, a third element must be included to act as a tie-breaker, so that the number of voters is odd. Since a third computer can be expensive, the usual third element is a disk drive that is connected to both servers. This disk drive is known as a 'quorum disk'.
  5. So, in case of a failure, the surviving node will first contact the quorum disk and perform a 'vote' - usually by writing a file and waiting a predetermined time to see whether the other node erases it. If the file is still there after the wait, the vote is successful, and the surviving node will take over the IT services.
This entire voting process takes several milliseconds, so it does not delay the failover process.
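The voting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular cluster product implements it: the function names, the `.vote` file name and the 10 ms default wait are all hypothetical, and the quorum disk is modelled as a plain directory.

```python
import os
import tempfile
import time


def heartbeat_ok(last_seen: float, now: float, timeout: float) -> bool:
    """Return True if the peer's last heartbeat is recent enough."""
    return (now - last_seen) <= timeout


def vote_on_quorum_disk(quorum_dir, node_name, wait_seconds, wait=time.sleep):
    """Write a vote file, wait, and report whether the vote survived.

    A peer that is still alive is expected to erase the file,
    which vetoes the takeover. (Hypothetical file layout.)
    """
    vote_path = os.path.join(quorum_dir, f"{node_name}.vote")
    with open(vote_path, "w") as vote_file:
        vote_file.write(node_name)
    wait(wait_seconds)
    return os.path.exists(vote_path)


def should_take_over(last_seen, now, timeout, quorum_dir, node_name,
                     wait_seconds=0.01):
    """Take over only if the heartbeat is missing AND the vote survives."""
    if heartbeat_ok(last_seen, now, timeout):
        return False
    return vote_on_quorum_disk(quorum_dir, node_name, wait_seconds)


if __name__ == "__main__":
    quorum = tempfile.mkdtemp()  # stands in for the shared quorum disk
    # Peer last seen 10 seconds ago with a 3 second timeout.
    print(should_take_over(0.0, 10.0, 3.0, quorum, "node2"))
```

The point of the sketch is the order of checks: a missed heartbeat alone never triggers a takeover, it only triggers the vote.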

Naturally, there are issues with using clusters. Here are the most common:
  • Cost - Cluster systems need specific cluster-aware software, the hardware is usually highly redundant and the shared disk systems are quite expensive.
  • Resource Waste - In a failover cluster, the most common variety, one of the cluster computers is mostly idle, just sitting there and waiting for the first node to fail.
  • Difficult performance scaling - In a failover cluster, if the active node does not have sufficient power, it is not easy to replace its CPU with a faster one. Everything inside a computer designed to run in a cluster is more expensive and needs special approval by the cluster software vendor to confirm that it is compatible with the cluster solution. And even if you manage to upgrade the system, you must take care to upgrade both nodes, so that performance remains the same if a failover occurs.
  • No protection against software error - In essence, the cluster is not a silver bullet. It protects against hardware error, but in no way helps against corruption of information caused by faulty software or human error.
The High Availability cluster is an excellent solution for increasing IT service availability - if you can live with its issues:
  • For maximum effect it needs to be supported by methods of protection against software or human error (backup and archive)
  • For resource waste, you can run several IT services and balance them on both nodes, so each node acts as failover for the services running on the other node. But bear in mind that when a failover occurs, you'll have to run all services on one node - thus creating a possible performance issue.
  • If the cost of hardware and upgrades is a major issue, you can even consider an asymmetric cluster - one node being much more powerful than the other. This is a double-edged sword: should a failover occur, you'll be left running on considerably lower resources, which may not be acceptable to the organization.
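
Both the balanced and the asymmetric setup come down to the same question: can a single node carry every service after a failover? Here is a minimal sketch of that check. The capacity and load figures are made-up units and the function name is hypothetical; in practice you would measure CPU, memory and I/O separately.

```python
from typing import Dict


def failover_fits(node_capacity: Dict[str, float],
                  service_load: Dict[str, float]) -> Dict[str, bool]:
    """For each node, report whether it alone could carry every service."""
    total_load = sum(service_load.values())
    return {node: total_load <= capacity
            for node, capacity in node_capacity.items()}


if __name__ == "__main__":
    # Asymmetric cluster: node2 is much weaker than node1 (illustrative numbers).
    capacity = {"node1": 100.0, "node2": 60.0}
    load = {"web": 40.0, "database": 35.0}
    for node, fits in failover_fits(capacity, load).items():
        print(node, "can run everything alone" if fits else "would be overloaded")
```

If any node comes back as overloaded, a failover onto that node is exactly the performance issue described above.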

Talkback and comments are most welcome

Know the Difference - Backup vs. Archive

Information availability and IT operations require Data Backup. Legal and Compliance requirements dictate Data Archival. But many organizations make the mistake of equating Archive with Backup, which can lead to the wrong choice of backup or archival media, very poor restore times and even loss of information.

Example Scenario
As part of an audit, an auditor reviewed the backup and archival system of a company. The company presented their backup systems, access controls and audit trail. When asked about archived data, they again pointed to the tapes containing their backup. But their backup tapes are rotated every 6 months, so the company does not have any archive from earlier than 6 months ago.
The company failed the legal Archival requirement.

In order to properly design and architect a backup or archive system, one must clearly understand the differences between backup and archive:

The key reason for the existence of backup is to provide an alternative data source in case the primary data source is corrupted or destroyed. A backup process creates a copy of the current state of the data. It is understood and accepted that the state of the backed up data will change in the future under controlled circumstances. At that point the old backup will become irrelevant for operational purposes and the data will need to be backed up again.

Criteria for selecting a backup solution

  • The backup needs to be quickly accessible
  • The media should be reusable for maximum cost efficiency
  • The media should survive transport in less than ideal conditions (trunk of a car)
  • The backed up information should survive with full integrity and availability for several months on the backup media.
  • The backup should be able to span multiple media (if the backup set is larger than the media capacity).
  • The solution should be intelligent enough to enable different backup sets (full backup, incremental backup, differential backup etc)
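
To make the difference between the backup sets in the last bullet concrete, here is a small sketch that picks files by modification time. It uses the common definitions (differential = everything changed since the last full backup, incremental = everything changed since the last backup of any kind). The function name and the in-memory dictionary of timestamps are assumptions made for the example; real backup software tracks changes in its own catalog.

```python
from typing import Dict, List


def select_files(mtimes: Dict[str, float], kind: str,
                 last_full: float, last_backup: float) -> List[str]:
    """Return the file names that belong in a backup set of the given kind.

    mtimes      -- file name -> last modification time
    last_full   -- time of the most recent full backup
    last_backup -- time of the most recent backup of any kind
    """
    if kind == "full":
        return sorted(mtimes)          # everything, regardless of age
    if kind == "differential":
        cutoff = last_full             # changed since the last full backup
    elif kind == "incremental":
        cutoff = last_backup           # changed since the last backup of any kind
    else:
        raise ValueError(f"unknown backup kind: {kind}")
    return sorted(name for name, mtime in mtimes.items() if mtime > cutoff)


if __name__ == "__main__":
    files = {"a.doc": 1.0, "b.doc": 5.0, "c.doc": 9.0}
    for kind in ("full", "differential", "incremental"):
        print(kind, select_files(files, kind, last_full=3.0, last_backup=7.0))
```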

The key reason for the existence of archive is to provide a historical reference of information. The final product of the archive process is a long-term, non-changeable copy of data or information. It is understood and accepted that the archive media must be resilient, capable of surviving over long periods of time (years), and must guarantee that the archived data remain unchanged during the entire archive lifespan.

Criteria for selecting archive solution
  • The archive media needs to be able to handle different data collections while treating them at the same level of integrity - individual data records from a database as well as entire documents.
  • The access speed to an archive can be slow, but archive media should have an extremely high level of reliability (remember, archives can span several decades)
  • When creating an archive, always plan the lifetime of the archive, and make sure that the manufacturer will provide systems that can retrieve the stored data - having an archive that is unreadable because there is nothing to read it on is a terrible idea.
  • Data integrity must be maintained over the entire period of the archive existence - there is no point in having an archive if you can't trust that it's the same as it was when archived.
  • There should be an index of archive media so that relevant information can be retrieved from the archive
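
One common way to support the integrity and index criteria above is to record a checksum for every archived file when the archive is created and to re-check it later. The sketch below assumes the archive is a directory of files and uses SHA-256; the function names are hypothetical, and the manifest should be stored separately from the archive media it describes.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archive files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_dir: Path) -> Dict[str, str]:
    """Map each file (relative path) to its checksum at archive time."""
    return {str(path.relative_to(archive_dir)): sha256_of(path)
            for path in sorted(archive_dir.rglob("*")) if path.is_file()}


def changed_files(archive_dir: Path, manifest: Dict[str, str]) -> List[str]:
    """List files that are missing or no longer match the manifest."""
    current = build_manifest(archive_dir)
    return sorted(name for name, checksum in manifest.items()
                  if current.get(name) != checksum)
```

An empty result from `changed_files` means every file recorded at archive time is still present and unchanged. The manifest also doubles as a simple index of what the archive contains.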

Backup and archive solutions may be part of an integral system, but they perform a different function, so the actual media and individual systems will most likely vary.

While backup is still performed mostly on magnetic tapes, archive is usually performed on optical disks or microfilm. You may choose magnetic media for archive, but if you do, you need to plan for shielding your archive tapes from long-term adverse influences, and you must maintain a functional reader for the tapes over the entire lifespan of the archive.



New Helix3 Forensic CD - Welcome

E-fense has published a new version of their acclaimed Helix Forensic Live CD. It is now in version 2.0.

UPDATE: Helix3 is no longer a free product. e-Fense decided to make it a commercial product.

Just like the old version, the new one contains two major components:

  • A LiveCD (based on Ubuntu) - A full blown forensic toolkit with a nice all-encompassing set of tools.
  • A Windows set of tools - which allows the user to use a subset of forensic tools within a running Windows system (most often during first response).
The Windows toolkit maintains the same interface as before, but the Windows-based application set is now coherent, with no missing applications. The previous version had a number of links in the Windows toolkit that weren't working, which could cause a lot of grief at the wrong time.

Just a reminder of the Windows Helix Menu

The Linux LiveCD interface has seen a major overhaul. It is now based on Gnome, and the overall interface is much better organized.

The following screenshot depicts the new Helix boot menu

Unfortunately, probably in search of better overall performance, Helix is departing from the forensic track and moving much more into the mainstream: the toolkit is missing a lot of nice new forensic tools that could have been installed and utilized. Hopefully, they'll be included in the next version.
There is one new major feature that was missing from the previous version - the LiveCD can now be installed on a hard drive - effectively creating a full blown Forensic investigation computer without the need to lug around a bootable CD.

The installer suffers from several bugs, so make sure you partition the target hard drive manually - the automatic option doesn't work.

The following Screenshot depicts the installed version of Helix

The new version of Helix is much easier to use and overall a much more complete product.




Strategic Choice - Proper Selection of Web Hosting

The days of expensive hosting and limited functionality on web servers are long gone. Today, everyone and their mother is doing web hosting, with huge disk capacity at very acceptable prices. But even though on paper most hosting providers differ only in price, things are much different in the real world.

You can get stuck with poor hosting, a lot of non-functional elements on the site and even huge downtime.
Here is a practical approach to selecting a good but affordable web hosting provider. In order to properly evaluate providers, you'll need to engage both your technical and business teams.

Make a table like the one on the following slide and start grading according to the following bullets

  1. Business Support Quality - Through this category, you will evaluate how prepared the hosting provider is to meet your business expectations of hosting. When evaluating business support quality, you need to answer the following questions. Add two points for each Yes answer to your business support category grade:
    • Does the hosting provider's sales rep answer calls and e-mails in a timely manner?
    • Does the hosting provider's sales rep try to understand what you are trying to achieve?
    • Does the sales rep discuss how to meet your requirements?
    • Does the sales rep provide direct contact with a dedicated technical person for clarifications?
  2. Technical Support Quality - Through this category, you will evaluate how prepared the hosting provider is to meet your technical requirements for hosting. When evaluating technical support quality, you need to answer the following questions. Add two points for each Yes answer to your technical support category grade:
    • Does the hosting provider's technical support person answer calls and e-mails in a timely manner?
    • Does the hosting provider actually support the technical requirements of your site?
    • Does the hosting provider's technical support person answer your team's technical questions in a clear manner?
    • Does the hosting provider's technical support person ask for clarification on your requirements?
    • Does the hosting provider's technical support person warn you of any specific policies and limitations in their hosting solution that might hamper you?
    • Does the hosting provider offer remote tools for managing the technical side of the web site (service stop/start, add-on and library management, etc.)?
  3. Hosting Solution Breadth - Through this category, you will evaluate what other services you might be able to utilize in the near future combined with web hosting. When evaluating hosting solution breadth, you need to answer the following questions. Add one point for each Yes answer to your solution breadth category grade:
    • Is the hosting provider prepared to take over DNS hosting?
    • Is DNS record management available to your technical staff via a remote interface?
    • Is there an e-mail service available?
    • Can the e-mail service capture all e-mails for you if the need arises?
    • Are they offering any other services as a bundle or for an additional payment?
  4. Hosting Contention Ratio - Through this category, you will evaluate how many other sites you'll have to compete with for server resources, and how many different sites can impact your own in terms of security since they are on the same server. When evaluating contention ratio, you need to answer the following questions. Add one point for each Yes answer to your contention ratio category grade.
    • Is your site on a dedicated server?
    • Is your site on a server with no more than 50 large customer sites?
    • Is your site on a server with dedicated and isolated resources from other sites (virtual machine or chroot type of isolation)?
  5. Error Recovery - Through this category, you will evaluate how the hosting provider will react to recover your web site should an error occur. When evaluating error recovery, you need to answer the following questions. Add one point for each Yes answer to your error recovery category grade:
    • Is backup of the site performed daily?
    • Is backup of the site performed together with backup of the site's backend database?
    • Is hacker attack detection/prevention present?
    • Will you get alerting/notice from the provider if suspect hacker activity is detected?
    • If site defacement occurs, can the hosting provider recover to a working site within 15 minutes of detection or notice by you?
    • If site defacement occurs, is proper forensic investigation performed with results submitted to you?

After you've finished answering your questions, you'll have a table like the one below

Select the top 20% of providers by total grade and add the pricing of their solutions. The cheapest of these will be your affordable web hosting provider: one you can afford to pay, without having to accept low quality.
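
If you are comparing more than a handful of providers, the grading and selection can be automated. The sketch below follows the point values from the five categories above (two points per Yes for business and technical support, one point for the rest). The category keys, the provider dictionaries and the choice to round the 20% shortlist up to at least one provider are assumptions made for the example.

```python
from typing import Dict, List

# Points awarded for each Yes answer, per category.
POINTS_PER_YES = {
    "business_support": 2,
    "technical_support": 2,
    "solution_breadth": 1,
    "contention_ratio": 1,
    "error_recovery": 1,
}


def total_grade(yes_counts: Dict[str, int]) -> int:
    """Sum the category grades from the number of Yes answers in each."""
    return sum(POINTS_PER_YES[category] * count
               for category, count in yes_counts.items())


def pick_provider(providers: List[dict]) -> str:
    """Shortlist the top 20% by total grade, then return the cheapest name."""
    ranked = sorted(providers, key=lambda p: total_grade(p["yes"]), reverse=True)
    shortlist_size = max(1, (len(ranked) + 4) // 5)  # 20%, rounded up
    shortlist = ranked[:shortlist_size]
    return min(shortlist, key=lambda p: p["price"])["name"]
```

Each provider is a dictionary with a `name`, a `price` and a `yes` dictionary holding the number of Yes answers per category, for example `{"name": "HostA", "price": 12.0, "yes": {"business_support": 3, "error_recovery": 4}}`.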

