Showing posts with label business continuity. Show all posts

Contingency Planning Conference 2010

If you are near New York City, you can check out the Contingency Planning & Management Conference (CPM 2010 East) on November 3-4.

According to the promoters, it is a 4-track advanced-level program taught by expert faculty in small, classroom settings. Plus, you can earn up to 35 Continuing Education Activity Points (CEAPs) just for attending.

You can register with a $100 discount off the full conference rate. Visit http://bit.ly/CPM2010MIS and register with the promotion code NX1C79.

Shortinfosec is distributing this information without any commercial interest. Sadly, we won't be able to visit. But anyone who visits is welcome to publish a guest post on Shortinfosec about the conference.

Choosing a Disaster Recovery Center Location

When preparing a Disaster Recovery Center, one of the most important decisions is its location. Up until 9/11, a lot of companies kept their DR centers in an adjacent building, and right after 9/11, everyone wanted to go as far from the primary data center as possible.


One of the common misconceptions of Disaster Recovery planning is that longer distance ensures better disaster protection. Of course, increasing the distance between data centers reduces the likelihood that the two centers are affected by the same disaster. But just putting distance between locations may not be sufficient protection. In reality, the best distance for a DR location is dictated by a multitude of factors:

  • Minimal parameters dictated by regulators - certain businesses, especially telco and finance, must maintain regulatory compliance. It is not unusual for regulators to mandate a minimal distance between the primary and the Disaster Recovery location. You must comply with these parameters
  • Corporate RTO parameters - the company has decided that the Disaster Recovery Center must be up and running within the time defined as the RTO - Recovery Time Objective. This time includes the travel time to the Disaster Recovery center and the system activation times, so always take this parameter into account when choosing a Disaster Recovery site
  • Telecommunications services - larger distance between the primary and DR site means higher telecommunication costs and limits the choice of appropriate remote copy technology. For instance, synchronous replication is still very difficult to achieve past the 40km mark. Choose a location that is sufficiently distant but still manages to deliver the required bandwidth for the chosen replication/remote copy technology
  • Geophysical conditions - In order to avoid a natural disaster, it is not always sufficient to move your Disaster Recovery center to a specific distance from the primary center. Most natural disasters deliver high impact in areas which support their spread by terrain configuration or other geophysical conditions. For instance, 150 km was considered a safe distance from hurricane impact. However, hurricane Katrina only lost strength more than 240 km inland, since there was no terrain feature to stop it. The best location is in a separate flood basin, off a seismic fault line (or at least on a different one) and with a large mountain between the primary and the DR site
  • Means of Transportation - increased distance between primary and DR site may make it difficult for employees to travel to the recovery site. This is especially true in situations of crisis, when roads may be damaged or blocked, or public transport is stopped by strikes. Choose a site that has multiple travel options - railroad, motorway, even river boat
  • Vicinity of Strategic objects - It is never smart to place your Disaster Recovery center in the vicinity of objects of strategic importance to the country. Such locations are prone to terrorist attacks and to attack by opposing forces in a military conflict. Also, even in natural disasters, strategic locations will have a strong military presence that may limit access to your Disaster Recovery center. Strategic objects include military bases, airports, refineries and oil depots. Choose a safe distance from such locations

There is no such thing as an ideal Disaster Recovery location. The optimal location is the one that minimizes the risks at an acceptable cost and meets the required SLAs and authorities' regulations.
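
Three of the factors above can be checked mechanically: the regulator's minimum distance, the RTO, and the replication distance. The sketch below is only an illustration. The `CandidateSite` fields, the function name and the sample numbers are assumptions made for this example, and the 40 km limit is simply the figure mentioned above for synchronous replication.

```python
from dataclasses import dataclass

# Figure quoted in the post: synchronous replication is hard to achieve past ~40 km.
SYNC_REPLICATION_LIMIT_KM = 40.0


@dataclass
class CandidateSite:
    """Hypothetical description of a candidate DR location."""
    name: str
    distance_km: float         # distance from the primary data center
    travel_minutes: float      # worst-case staff travel time to the site
    activation_minutes: float  # tested time needed to activate the DR systems


def check_site(site: CandidateSite, rto_minutes: float,
               regulator_min_km: float, needs_sync_replication: bool) -> list:
    """Return the reasons a site fails these checks (an empty list means it passes)."""
    problems = []
    if site.distance_km < regulator_min_km:
        problems.append("closer than the regulator's minimum distance")
    # The RTO must cover both getting to the site and starting the systems.
    if site.travel_minutes + site.activation_minutes > rto_minutes:
        problems.append("travel plus activation time exceeds the RTO")
    if needs_sync_replication and site.distance_km > SYNC_REPLICATION_LIMIT_KM:
        problems.append("too far for synchronous replication")
    return problems


# Made-up sample values, not real requirements.
site_a = CandidateSite("Site A", distance_km=35, travel_minutes=60, activation_minutes=120)
print(check_site(site_a, rto_minutes=240, regulator_min_km=30, needs_sync_replication=True))
```

Geophysical conditions, transport options and nearby strategic objects do not reduce to a simple threshold, so they still need a judgment call per site.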

Talkback and comments are most welcome

Related posts
Mitigating Risks of the IT Disaster Recovery Test
iPhone Failed - Disaster Recovery Practical Insight
Business Continuity Analysis - Communication During Power Failure
Business Continuity Plan for Brick & Mortar Businesses
Example Business Continuity Plan For Online Business

Mitigating Risks of the IT Disaster Recovery Test

The IT Disaster Recovery Test, as part of Business Continuity testing, is becoming an annual event for most IT departments. It is mandated by a lot of regulators, nearly insisted upon by internal audit and, of course, a very healthy thing to do.

But performing the IT DRP test without proper risk management can put your organization at significant risk.


To put things into perspective, let's analyze the steps, risks and countermeasures of an IT Disaster Recovery test:


DRP Test Step 1: Failure of primary systems
Activity: To simulate a disaster, the primary systems need to be made to fail on some level
Risks:
  1. Databases not closed properly or damaged due to forced shutdown or forced power failure
  2. Hardware components failing due to forced shutdown or power failure
  3. Split-brain cluster due to an uncontrolled sequence of failures of servers and storage
Countermeasures:
  1. Full backup prior to the initiation of the DRP test
  2. Backup components and vendor presence at the ready during the entire test
  3. Not performing a direct forced shutdown, but forcing a network-level isolation at the routers

DRP Test Step 2: Activation of Disaster Recovery systems
Activity: Severing any relation between the DR and the primary systems and running the DR systems as temporary primary
Risks:
  1. Actual failure of the primary system during the test
  2. Failure of the primary system while the DR system is concluded to be non-functional
Countermeasures:
  1. Full awareness of the test among all interested parties (business custodians, directors of divisions and top management), so that the real Business Continuity Plan can be initiated if needed
  2. Full backup at the DRP site prior to the initiation of the DRP test, and full vendor support

DRP Test Step 3: Reconfiguring the user environment
Activity: Intervening in the end-user environment in a way that will make users work on the DR system
Risks:
  1. Error in reconfiguration which may cause end-users to input test data into the primary systems
  2. Error in reconfiguration which may cause the primary system to stop functioning
Countermeasures:
  1, 2. Scripted and documented steps of reconfiguration. All steps should be performed by 2 persons - one observing the other's actions

DRP Test Step 4: Reverting to the primary systems
Activity: Resuming the primary systems at some level and reestablishing the relation between the DR and the primary systems
Risks:
  1. Error in reconfiguration which may cause the primary system to stop functioning
  2. Copying of test data that was input into the DR test system back into the primary location
  3. Failure of primary systems during resumption
Countermeasures:
  1. Scripted and documented steps of reconfiguration. All steps should be performed by 2 persons - one observing the other's actions
  2. Fully controlled and documented process of resumption, which guarantees that only the primary system is data master
  3. Full backup prior to the initiation of the DRP test, plus backup components and vendor presence at the ready during the entire test



With all these risks, is it more prudent to never perform an IT DRP test? - Absolutely NOT, and here is why:
  • Performing the IT DRP test actually confirms that things are running, and if something breaks, you are much more prepared for the next time.
  • Not performing the test will just make you think everything is great, until the incident occurs. And the incident is just as certain as death and taxes.
So, perform the IT DRP test regularly, but with a whole set of countermeasures for the risks that can materialize during the test. Of course you will miss some risks, but planning for 10 and missing 1 is much better than not planning at all!

Talkback and comments are most welcome

Related posts
iPhone Failed - Disaster Recovery Practical Insight
Business Continuity Analysis - Communication During Power Failure
Business Continuity Plan for Brick & Mortar Businesses
Example Business Continuity Plan For Online Business

Is the Server Running - optimal use of redundancy on a budget

When purchasing a server, most companies select a server class computer from a reputable manufacturer. These days, servers usually come loaded with redundant components that improve availability and resilience. And yet a lot of these servers fail at the first glitch simply because they are not configured properly. Here is a brief blueprint on how to make optimal use of the redundancy you have paid for.

First, let's analyze what is usually redundant in a server. If we take into account only the garden variety commercial servers and ignore the hugely expensive fault tolerant machines, here is what you usually get:

  • Redundant Disk drives
  • Redundant Power Supplies
  • Redundant Network Adapters

To get the most out of these elements, you should perform the following steps:
  • Redundant Disk drives - organize them into a RAID configuration. RAID 1 (mirror) is the best in terms of redundancy and speed, but you lose exactly 50% of capacity. RAID 5 (parity) gives you the best trade-off between capacity loss and performance. When planning a RAID, look for a server that has a hardware RAID controller. Modern server operating systems can build a RAID themselves, but then the operating system has to dedicate resources and run specific software to maintain the RAID - thus burdening the main CPU with this task

  • Redundant Power Supplies - connect each of the server's power supplies to a power line coming from a different circuit breaker. This will save you a lot of grief if the cleaning lady decides to connect her vacuum cleaner to an outlet on the same circuit breaker as the server and overloads it. If possible, also connect each power supply to a different Uninterruptible Power Supply (UPS). This way, every UPS will help your server ride out a blackout.

  • Network adapters - First, organize the network adapters to work as a failover team. This is realized with specific drivers delivered by the manufacturer, and the driver creates a virtual network adapter. The virtual network adapter is configured with the IP address of the server, and it binds to one of the physical network adapters. Should that adapter lose connectivity, the driver will bind the virtual network adapter to the other physical one, thus reestablishing connectivity. For the best result, connect the physical network adapters to different switches which are interconnected via trunk links - thus creating one large meta-switch.

All described actions can be performed by your in-house system administrator, and do not require any special expertise. With these simple steps, you'll achieve excellent availability of your server.
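
The capacity trade-off between the two RAID levels mentioned above is simple arithmetic. A minimal sketch follows, assuming identical disks, a two-disk mirror for RAID 1 and at least three disks for RAID 5. The function name and the sample sizes are made up for illustration.

```python
def usable_capacity_gb(raid_level: int, disks: int, disk_size_gb: float) -> float:
    """Usable capacity of an array of identical disks (RAID 1 and RAID 5 only)."""
    if raid_level == 1:
        # Two-disk mirror: every block is stored twice, so half the raw capacity is lost.
        if disks != 2:
            raise ValueError("this sketch models RAID 1 as a two-disk mirror")
        return disk_size_gb
    if raid_level == 5:
        # Parity takes the equivalent of one disk, whatever the array size.
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * disk_size_gb
    raise ValueError("only RAID 1 and RAID 5 are covered here")


print(usable_capacity_gb(1, 2, 500))  # two 500 GB disks, mirrored
print(usable_capacity_gb(5, 4, 500))  # four 500 GB disks with parity
```

With four disks RAID 5 gives up 25% of the raw capacity instead of 50%, which is the trade-off the post refers to.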

Talkback and comments are most welcome


iPhone Failed - Disaster Recovery Practical Insight

A lot of Disaster Recovery procedures are considered failed simply because they took longer than originally planned and documented. And a lot of these procedures take longer not because of poor equipment or incompetence. On the contrary, they take longer because the responsible people are focusing primarily on the effort to fix the problem. Here is a practical example:

On Tuesday my iPhone failed. And since its warranty is long gone, I decided to fix it myself. I finally got it fixed on Wednesday night.

In my zeal to repair it, I forgot the first rule of business continuity - recover functionality within an acceptable time frame. And for the iPhone, just as for any other mobile phone, the main functionality is TELEPHONY!!! I was unavailable for most of Tuesday and during parts of business hours on Wednesday.

In the end, the problem was solved, and my iPhone is working again. But then all the missed calls came raining down, and that kicked me back into reality and gave me a real perspective of what I should have done: find a low end replacement phone instead of meddling with low-level format, firmware flashing and DFU modes. That way, I would have been contactable, and under much less pressure to quickly fix my iPhone.

In perspective, the same behavior can be seen in many organizations during IT disaster recovery. Disaster recovery is organized and coordinated by IT people - mostly very capable engineers. And yet, a large number of Disaster Recovery actions are delayed because these good engineers focus primarily on fixing the engineering problem - not fixing the business problem.

In a Disaster Recovery situation, the recovery timer is known as the Recovery Time Objective (RTO). That is the time interval, starting from the moment of disaster, within which operation must be recovered to limited but essential functionality.

A good DR manager, regardless of position and education, works with a stopwatch. The time he can allow the engineers to try to fix the problem does not have a formal name, so let's call it Fixing Time. It is the difference between the RTO and the tested time required to activate the Disaster Recovery systems.
Once this Fixing Time passes, activation of the Disaster Recovery systems must start. If the problem gets fixed before the DR system activation completes, all is well. If not, the DR systems take over and the RTO has still been met. And the engineers can relax from the urgency and work on fixing the original problem for as long as it takes.

Back to my iPhone example - what was my timing? A phone RTO should be the recharge time - 2 hours. Getting a replacement phone means a walk to the store to buy the cheapest prepaid model, or borrowing a spare from a friend - 30 minutes. So I needed to keep my cool, and try to fix the problem for only 1.5 hours before looking for an alternative. After that, I could have spent a week on the iPhone - no pressure to fix it fast.
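
The Fixing Time arithmetic can be written down in a couple of lines. This is a minimal sketch using the iPhone numbers from this post. The function name is made up, since Fixing Time is an informal term coined here.

```python
from datetime import timedelta


def fixing_time(rto: timedelta, dr_activation: timedelta) -> timedelta:
    """Time the engineers may spend on a fix before DR activation has to start."""
    if dr_activation > rto:
        # The tested activation time alone already breaks the RTO.
        raise ValueError("DR activation takes longer than the RTO allows")
    return rto - dr_activation


# iPhone example: RTO of 2 hours, 30 minutes to get a replacement phone.
print(fixing_time(timedelta(hours=2), timedelta(minutes=30)))  # 1:30:00
```

If the tested activation time is longer than the RTO, there is no Fixing Time at all, and either the RTO or the DR setup has to be revisited.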

Related posts

3 Rules to Prevent Backup Headaches
Business Continuity Analysis - Communication During Power Failure
Example Business Continuity Plan for Brick&Mortar Business
Business Continuity Plan for Blogs
Example Business Continuity Plan For Online Business

Talkback and comments are most welcome

Business Continuity Analysis - Communication During Power Failure

As the world gets ever hungrier for power, resources are depleting, the climate is changing and large storms are becoming more frequent, so power outages and massive grid problems all over the world will become more common. While massive power outages will bring a lot of problems, companies will strive to continue some level of operation. To achieve it, they need to communicate - both internally and externally. And massive power failures call for a special analysis of the telco backup resources. Here is the analysis and recommendations:

What happens to the telco infrastructure during a massive power failure?

  • Every advanced telco device not on UPS will stop functioning immediately, including: routers and modems, PBX, faxes, cordless phones, ISDN phones
  • The advanced telco devices supported by UPS will fail within 90-180 minutes after the power failure, since the same UPS is also supporting PCs and other equipment
  • The alarm systems which usually have their independent battery pack will stop operating after approximately 24 hours
  • The GSM telephony base stations are mostly supported by UPS, with only the largest ones supported by generators. Therefore, most of them will fail within 100-200 minutes after the power failure.
  • The only remaining telco resources after approximately 4 hours of blackout will be
    • The advanced telco devices supported by a diesel generator
    • Public Switched Telephony Network (PSTN) lines - they are powered over the telephone line by the telco PBX, which in turn is powered by a generator
    • Islands of mobile telephony in the cells created by the Large Mobile Telephony base stations
    • Satellite communication devices, like VSAT or IRIDIUM phones - these are a very temporary solution, since they are strongly dependent on battery capacity
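
The timeline above can be turned into a small lookup that shows what is still working at a given point in the blackout. This is only a sketch. It takes the upper end of each range quoted above as the failure time, the resource names are shortened for readability, and satellite phones are left out because the text gives no battery figure for them.

```python
# (resource, minutes after the outage at which it is expected to have failed)
# None means the text states no time limit (generator or line powered).
RESOURCES = [
    ("Telco devices without UPS", 0),
    ("Telco devices on shared UPS", 180),
    ("GSM base stations on UPS", 200),
    ("Alarm systems on own battery", 24 * 60),
    ("Telco devices on diesel generator", None),
    ("PSTN lines", None),
    ("Large GSM base stations on generator", None),
]


def still_working(minutes_since_outage: int) -> list:
    """Names of the resources expected to be available at the given time."""
    return [name for name, fails_at in RESOURCES
            if fails_at is None or minutes_since_outage < fails_at]


for name in still_working(4 * 60):
    print(name)
```

At the 4 hour mark only the battery-backed alarm systems and the generator or line powered resources remain, which matches the list above.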

Although diesel generators are not expensive, companies avoid them for all except the largest company locations for the following reasons:
  • installation brings a wealth of problems for companies, since they need approval from fire inspectors,
  • the company must adhere to safety and pollution regulations to install the generator
  • maintenance costs cannot be ignored, especially when the normal grid is
  • the diesel generators can become unreliable in very hot or very cold days
  • generators can become dysfunctional due to neglect or external influence, for instance, another company sealing off the exhaust pipe during remodeling

Recommended Measures
  • Place diesel generators at all locations where it is possible - don't go overboard, just use a small device with 6-8 hours of autonomy and an internal tank. After 10 hours of operation, you can create a controlled shutdown for a refill.
  • Have a dedicated "red phone" PSTN line (or several of them) at each location, attached to a simple phone device with no external power requirements. It can be used during normal operations, but it will become the primary means of communication during a longer blackout.
  • Include the threat in your Business Continuity Plan (BCP) and define proper steps to be taken in case of occurrence
  • Test the BCP with the power failure scenario.

Naturally, the measures are simple and well known, and naturally, few managers will accept the first two until the first power failure event.

But the Business Continuity Manager can do the following: Create a BCP test scenario in which it will be forbidden to communicate via any advanced telco devices, and present the results of the BCP test to Management. The results will not be good, so be prepared to take the fire!

Related posts
Example Business Continuity Plan for Brick&Mortar Business
Business Continuity Plan for Blogs
Example Business Continuity Plan For Online Business

Talkback and comments are most welcome

Business Continuity Plan for Brick & Mortar Businesses

Business Continuity Plan for Blogs covered the Business Continuity activities for a very small online business. A BCP is even more important for standard everyday businesses.

As a continuation of the series of Business Continuity Plan examples, we are happy to present a BCP for "Brick and Mortar" businesses. This example BCP is modeled after a mid-range accounting business, and it is easily adapted to any office based business.

The Incidents included in this BCP are

  • Fire
  • Flood
  • Earthquake
  • Employee Illness - Epidemic
  • Strike blocking transport routes to site of business
Also, the BCP includes elements which are applicable to a multi-person organization, like chain of command, locations of alternative resources and communication plans. All of these need to be in print, and all employees need to be aware of them for proper BCP execution.

You can download the Example Business Continuity Plan for Brick and Mortar business HERE

Related posts
Business Continuity Plan for Blogs
Example Business Continuity Plan For Online Business

Talkback and comments are most welcome

Business Continuity Plan for Blogs

After the post on Example Business Continuity Plan For Online Business, there was a mail discussion with a reader about whether it is at all relevant to blogs. Here I would like to stress one fact: the blog hosting providers have BCP plans, but those plans are there to recover THEIR services, not every blog. A lost blog may be collateral damage, since it is, after all, a free service.
Here is a Business Continuity Plan for Blogs. It is actually the BCP of Shortinfosec, which I am using.

SHORTINFOSEC BUSINESS CONTINUITY PLAN BEGINS

Incidents

  1. Loss of broadband link communication
  2. Loss of Hosting (Blogspot down)
  3. Loss of Hosting (Blogspot lost content)

Loss of broadband link communication
Time to wait before using BCP plan - 24 hours


  • Find an alternative means of communication:
    • Use dial-up for connectivity - Time to achieve - Immediately
    • Use public hot spot at the Mall or Cafe - Time to achieve - 1 hour
    • Use GPRS from the iPhone
  • Publish the following message, post in the hotlink spot and as a first post:

We are experiencing difficulties in publication of new content. We will continue with publication within the next 24 hours. In the meantime, please review our Archive

Total time of minimal function recovery - 1 hour after BCP activation

Total time of full recovery - 48 hours after BCP activation

Resources

  • Charged Laptop Battery
  • Charged iPhone
  • Modem within Laptop/PC
  • WiFi adapter for Laptop

Loss of Hosting (Blogspot down)

Time to wait before using BCP plan - 6 hours

  • Find alternative host and register - Time to achieve - 15 minutes
    • Wordpress http://wordpress.com/signup/
    • Typepad https://www.typepad.com/t/app/register
  • Choose a default template and Browse to see that it works - Time to achieve - 15 minutes
  • Login to feedburner and modify the feedburner path to new RSS feed - Time to achieve - 10 minutes
  • Publish post with content below - Time to achieve - 10 minutes

Title: Temporarily Moved
We are experiencing difficulties in hosting of http://www.shortinfosec.net/. We are working to resume normal operation. In the meantime, this is our temporary home. Please send your comments, questions and reactions to shortinfosec _at_ gmail dot com

  • Set-up the temp blog to accept the address http://www.shortinfosec.net/ - Time to achieve - 15 minutes
  • Log-On to DNS Hosting and redirect http://www.shortinfosec.net/ to new blog location - Time to achieve - 15 minutes
  • If the Blogger problem persists for more than 24 hours, post new content to the new blog.
  • Wait for Blogger recovery, and if required restore template and content so the original site is available.
  • If Blogger is not recovered within 48 hours, post old content as archive on the new site (PDF or backdated posts)

Total time of minimal function recovery - 80 minutes after BCP activation

Total time of full recovery - 48-72 hours after BCP activation

Resources

  • Charged Laptop Battery
  • Functioning Internet access (refer to incident 1)
  • URL and account name/password of DNS Hosting Service - written down on paper, in laptop bag, also saved in laptop
  • Current Backup of Blogspot XML Template - Backup Weekly and send as attachment to two web-mail services
  • Current Backup of custom Widgets - Backup Weekly and send as attachment to two web-mail services
  • Current Backup of Template Images and Icons - Backup Monthly and send as attachment to two web-mail services
  • Current Backup of Blogspot Posts - Subscribe two web-mail accounts to the Feedburner feed - Immediate Backup
  • Current backup of Downloads section - Backup Monthly and send as attachment to two web-mail services

Loss of Hosting (Blogspot lost content)
Time to wait before using BCP plan - 1 hour

  • Login to blogspot or re-register if account is lost - Time to achieve - 15 minutes
  • Choose a default template and Browse to see that it works - Time to achieve - 15 minutes
  • Login to feedburner and modify the feedburner path to new RSS feed (if changed) - Time to achieve - 10 minutes
  • Publish post with content below - Time to achieve - 10 minutes

Title: Temporarily Moved
We are experiencing difficulties in hosting of http://www.shortinfosec.net/. We are working to resume normal operation. In the meantime, this is our temporary home. Please send your comments, questions and reactions to shortinfosec _at_ gmail dot com

  • Set-up the temp blog to accept the address http://www.shortinfosec.net/ - Time to achieve - 15 minutes
  • Log-On to DNS Hosting and redirect http://www.shortinfosec.net/ to new blog location - Time to achieve - 15 minutes
  • If required restore template and content so the original site is available.

Total time of minimal function recovery - 80 minutes after BCP activation
Total time of full recovery - 24-48 hours after BCP activation

Resources

  • Charged Laptop Battery
  • Functioning internet access (refer to incident 1)
  • URL and account name/password of DNS Hosting Service - written down on paper, in laptop bag, also saved in laptop
  • Current Backup of Blogspot XML Template - Backup Weekly and send as attachment to two web-mail services
  • Current Backup of custom Widgets - Backup Weekly and send as attachment to two web-mail services
  • Current Backup of Template Images and Icons - Backup Monthly and send as attachment to two web-mail services
  • Current Backup of Blogspot Posts - Subscribe two web-mail accounts to the Feedburner feed - Immediate Backup
  • Current backup of Downloads section - Backup Monthly and send as attachment to two web-mail services

SHORTINFOSEC BUSINESS CONTINUITY PLAN ENDS
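
The 80 minutes quoted as total time of minimal function recovery for "Loss of Hosting (Blogspot down)" is simply the sum of the timed steps in that incident. A short sketch of that check follows, with the step names shortened. Keeping the steps in a structure like this is an illustration, not part of the plan itself.

```python
# Timed steps of "Loss of Hosting (Blogspot down)", in minutes, as listed in the plan.
BLOGSPOT_DOWN_STEPS = [
    ("Find alternative host and register", 15),
    ("Choose a default template and check that it works", 15),
    ("Point Feedburner to the new RSS feed", 10),
    ("Publish the 'Temporarily Moved' post", 10),
    ("Set up the temp blog to accept the site address", 15),
    ("Redirect DNS to the new blog location", 15),
]


def total_minutes(steps) -> int:
    """Sum of the step durations, assuming the steps are done one after another."""
    return sum(minutes for _, minutes in steps)


print(total_minutes(BLOGSPOT_DOWN_STEPS))  # 80
```

Re-running the sum whenever a step is added or re-timed keeps the headline recovery figure honest.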

Related Posts

Example Business Continuity Plan For Online Business

Talkback and comments are most welcome

Example Business Continuity Plan For Online Business

Online businesses are 100% dependent on IT services, but a lot of them don't even consider what will happen if the IT systems hosting their business/service fail.
Furthermore, a lot of online business owners simply assume that their hosting providers will recover their services - THIS IS WRONG - the providers will restore the information, but not necessarily the functionality!
Here is an analysis and a summary plan for business continuity of an online business:

First, a couple of definitions:

  • The goal of business continuity is to resume business operation in a reduced but controlled manner after a disaster which impacts operation - until full recovery is achieved
  • The goal of disaster recovery is to resume IT operations after a disaster which impacts IT operation - until full recovery is achieved

Requirement analysis
For large companies, the initial step of planning business continuity is the Business Impact Analysis (BIA), during which the company identifies which processes are critical to the company's survival and need to be restarted immediately, and which can be restored later.

Small online portals/services typically have the following processes:
  • Service Delivery - actual service running on web and database servers
  • Service Development - design, programming, upgrading, bug fixing of the service
  • Sales and Marketing - promotion, communication with affiliates
  • Accounting and back office operations - self explanatory
To simplify the BIA process, let's grade each process with a number that indicates how soon it must be restarted. Here are the numbers and their meaning:
  • 1 - Process must never stop, immediate restart is needed
  • 2 - We can survive without this process for 1 day
  • 3 - We can survive without this process for 5 days
  • 4 - We can survive without this process for 15 days
So, for our processes, these are the numbers
  • Service Delivery - 1
  • Service Development - 3
  • Sales and Marketing - 2
  • Accounting and back office operations - 3
So, the most critical process (surprise) is Service Delivery. This process depends on network, hosting, servers and databases. Our continuity plan will limit itself to this process and only to one incident that can impact this process. A real Business Continuity Plan should take into account multiple incidents (power outage, DoS, loss of DNS, virus)
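
The grading above is easy to keep as data, so the restart order falls out of a sort. A minimal sketch using the grades from this post follows. The dictionary and function names are made up for illustration, and ties are broken alphabetically, which is an arbitrary choice.

```python
# Maximum tolerable downtime in days for each grade (grade 1 = must never stop).
MAX_DOWNTIME_DAYS = {1: 0, 2: 1, 3: 5, 4: 15}

# Grades assigned to the processes in this post.
GRADES = {
    "Service Delivery": 1,
    "Service Development": 3,
    "Sales and Marketing": 2,
    "Accounting and back office operations": 3,
}


def restart_order(grades: dict) -> list:
    """Processes sorted by grade, most critical first; ties sorted by name."""
    return sorted(grades, key=lambda process: (grades[process], process))


for process in restart_order(GRADES):
    days = MAX_DOWNTIME_DAYS[GRADES[process]]
    print(f"{process}: tolerable downtime {days} day(s)")
```

The same table can be extended with more processes as the business grows, without changing the logic.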

Example Business Continuity Plan

I. Incident type - Loss Of Application and Database Data due to hosting server errors
Steps to achieve continuity
  1. Post a temporary information and contact page on alternative free hosting - Time to achieve - 15 minutes
  2. Redirect DNS to temporary information page - Time to achieve - 10 minutes
  3. Investigate whether servers are available. If not available, consult the list of alternative hosting providers that can provide hosting for 1 to 3 months - Time to achieve - 1 hour
  4. Restore latest trusted backup of Database to operational DB server - Time to achieve - 1 hour
  5. Restore latest trusted backup of Web Application to operational Web server - Time to achieve - 30 minutes
  6. Perform functional test of updated infrastructure - Time to achieve - 1 hour
  7. Redirect DNS from the temporary information page to the restored service - Time to achieve - 10 minutes
Total maximum time to recovery - approximately 4 hours

Resources to achieve continuity
  • Temporary page prepared and available for publishing
  • Funds on credit card to purchase hosting for 1 month
  • List of alternative hosting providers which can support the application with contact information
  • Functional broadband link - alternative, direct access to hosting provider premises and vehicle for transport
  • Database Administrator/Developer available for activities
  • Web Application Administrator/Developer available for activities
  • Trusted and Stable Backup of Database
  • Trusted and Stable Backup of Web Application
Naturally, the plan must be tested to confirm that it works.

This example plan is very limited (one process, one incident), but it shows the general structure of a continuity plan. For an online business, in which every second of downtime counts, such a plan may be the difference between a minor incident and loss of business.

Talkback and comments are most welcome
