Showing posts with label disaster recovery. Show all posts

iPhone Failed - Disaster Recovery Practical Insight

A lot of Disaster Recovery procedures are considered failed simply because they took longer than originally planned and documented. And many of these procedures take longer not because of poor equipment or incompetence. On the contrary, they take longer because the responsible people are focusing primarily on the effort to fix the problem. Here is a practical example:

On Tuesday my iPhone failed. Since its warranty is long gone, I decided to fix it myself. I finally got it fixed on Wednesday night.

In my zeal to repair it, I forgot the first rule of business continuity - recover functionality within an acceptable time frame. And for an iPhone, just as for any other mobile phone, the main functionality is TELEPHONY!!! I was unavailable for most of Tuesday and during parts of business hours on Wednesday.

In the end, the problem was solved, and my iPhone is working again. But then all the missed calls came raining down, and that kicked me back into reality and gave me a real perspective on what I should have done: find a low-end replacement phone instead of meddling with low-level formats, firmware flashing and DFU modes. That way, I would have been contactable, and under much less pressure to fix my iPhone quickly.

In perspective, the same behavior can be seen in many organizations during IT disaster recovery. Disaster recovery is organized and coordinated by IT people - mostly very capable engineers. And yet, a large number of Disaster Recovery actions are delayed because these good engineers focus primarily on fixing the engineering problem - not the business problem.

In a Disaster Recovery situation, the recovery timer is known as the Recovery Time Objective (RTO). That is the time interval, starting from the moment of disaster, within which operation must be recovered to limited but essential functionality.

A good DR manager - regardless of his position and education - does his work with a stopwatch. The time he can allow the engineers to try to fix the problem does not have a formal name, so let's call it Fixing Time. It is the difference between the RTO and the tested time required to activate the Disaster Recovery systems.
Once this Fixing Time passes, activation of the Disaster Recovery systems must start. If the problem gets fixed before DR system activation completes, all is well. If not, the DR systems take over and the RTO has still been met. Oh, and the engineers can relax from the urgency pressure and work on fixing the original problem for as long as it takes.

Back to my iPhone example - what was my timing? A phone RTO should be the recharge time - 2 hours. Getting a replacement phone means a walk to the store to buy the cheapest prepaid model, or borrowing a spare from a friend - 30 minutes. So I needed to keep my cool, and try to fix the problem for only 1.5 hours before looking for an alternative. After that, I could have spent a week on the iPhone - no pressure to fix it fast.
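
The timing above can be sketched in a few lines of Python. This is only an illustration: the helper name `fixing_time` and its parameters are invented for this example, and the numbers are the ones from the iPhone story.

```python
from datetime import timedelta


def fixing_time(rto: timedelta, dr_activation: timedelta) -> timedelta:
    """Time engineers may spend on a fix before DR activation must start.

    Hypothetical helper: Fixing Time = RTO - tested DR activation time.
    """
    if dr_activation > rto:
        # The DR system cannot be activated within the RTO at all,
        # so there is no room to attempt a fix first.
        raise ValueError("DR activation takes longer than the RTO")
    return rto - dr_activation


# Numbers from the iPhone example
phone_rto = timedelta(hours=2)             # recharge time
replacement_phone = timedelta(minutes=30)  # buy a prepaid phone or borrow a spare

budget = fixing_time(phone_rto, replacement_phone)
print(budget)  # 1:30:00
```

The check at the top is a design choice for this sketch: if the tested activation time is already longer than the RTO, the plan itself needs fixing, and no stopwatch will help.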

Related posts

3 Rules to Prevent Backup Headaches
Business Continuity Analysis - Communication During Power Failure
Example Business Continuity Plan for Brick&Mortar Business
Business Continuity Plan for Blogs
Example Business Continuity Plan For Online Business

Talkback and comments are most welcome

3 Rules to Prevent Backup Headaches

Any modern IT infrastructure needs and (usually) has a solution for backup of information. But due to the constant drive to reduce expenditures, very undesirable situations can occur, such as not being able to read the backed-up data.

Example scenario:
A telco company has two datacenters - one operational and one warm backup datacenter, which is kept in sync via replication services. Due to a rise in the volume of stored data, the old tape library in the primary datacenter has to be replaced with a much larger tape library to accommodate proper backup.
The old tape library is still operational, and is moved to the warm backup datacenter to provide backup for the servers in the backup datacenter, should they become operational.
After 6 months, a major power failure occurs and the backup datacenter needs to be brought online. During the process, it is concluded that one of the ERP databases became corrupted during the replication and cannot be recovered. Since tape backup from the primary location is kept offsite in a bank vault, the tape backups of the ERP database are taken from the bank and brought to the backup datacenter. Upon attempting to restore the data from the backup tapes, it is concluded that the tapes are unreadable, and the database cannot be restored immediately.
The database is restored from an old backup and then rebuilt by manual entry over the course of 5 days.

Analysis:

  1. The vision of the backup systems operation for the primary and backup locations was that the servers at each location would back up to and restore from the local tape library only.
  2. Nobody bothered to check whether the old and new tape drives and tapes were compatible with each other, and whether tapes from the primary location could be read at the backup location and vice versa.
  3. This led to the problem in which the last resort - the tape backup - although properly archived and protected, was unusable.
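
The missing check from point 2 can be sketched as a simple inventory comparison. Everything here is hypothetical: the site names, the format labels and the `unreadable_pairs` function are made up for illustration and do not describe any particular tape technology.

```python
# Hypothetical inventory: which tape formats the drives at each site can read,
# and which format each site writes its backups in.
readable_formats = {
    "primary": {"new-format"},
    "backup": {"old-format"},
}
written_format = {
    "primary": "new-format",
    "backup": "old-format",
}


def unreadable_pairs(readable, written):
    """Return (source_site, restore_site) pairs where tapes written at
    source_site cannot be read by the drives at restore_site."""
    problems = []
    for source, fmt in written.items():
        for site, formats in readable.items():
            if fmt not in formats:
                problems.append((source, site))
    return sorted(problems)


print(unreadable_pairs(readable_formats, written_format))
# [('backup', 'primary'), ('primary', 'backup')]
```

An empty list would mean that any tape can be restored at any site. An inventory check like this does not replace an actual test restore, it only catches the obvious mismatch early.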

Recommendations:
To avoid these and similar problems, follow these rules:
  1. Make sure that you have full compatibility of all tape drives used within the organization - such compatibility will ensure that you can easily use any drive for any tape, even move one drive to a specific location if the need arises.
  2. Make sure that your tape drives are functional - perform regular 'exercises' of backup and restore on ALL drives within the organization. If you don't do this, by Murphy's law, the only remaining drive you have during an incident will be clogged up with dust or will simply have failed.
  3. Make sure that your tapes are functional - perform regular 'restore exercises' for all tapes, and keep track of tape lifetime. The last thing you want is to have a possibly failing tape during a disaster recovery.
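
Keeping track of restore exercises and tape lifetime, as rule 3 suggests, can be as simple as a small script run against the tape inventory. The sketch below is a minimal illustration: the `Tape` record, the function name and both policy values (a 90-day test interval and a 5-year lifetime) are assumptions made for this example, not vendor figures.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Tape:
    label: str
    first_used: date
    last_restore_test: date


# Assumed policy values, for illustration only - use your own figures.
MAX_TEST_INTERVAL = timedelta(days=90)
MAX_TAPE_AGE = timedelta(days=5 * 365)


def tapes_needing_attention(tapes, today):
    """Return {label: [reasons]} for tapes that are overdue for a restore
    test or older than the assumed lifetime."""
    report = {}
    for tape in tapes:
        reasons = []
        if today - tape.last_restore_test > MAX_TEST_INTERVAL:
            reasons.append("restore test overdue")
        if today - tape.first_used > MAX_TAPE_AGE:
            reasons.append("past assumed lifetime")
        if reasons:
            report[tape.label] = reasons
    return report


# Made-up inventory entries
tapes = [
    Tape("ERP-001", date(2020, 1, 10), date(2024, 5, 1)),
    Tape("ERP-002", date(2018, 3, 1), date(2023, 11, 15)),
]
print(tapes_needing_attention(tapes, today=date(2024, 6, 1)))
# {'ERP-002': ['restore test overdue', 'past assumed lifetime']}
```

The same kind of record could be kept per drive to cover rule 2.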

Talkback and comments are most welcome
