I wasn't planning to touch the issue of the Service Level Agreement (SLA) for some time, but it appears that the incident report (Link to Blog Post) has stirred up enough attention to merit a post on the subject.
As I already mentioned, the SLA is very frequently just an afterthought when preparing a contract, and the buyer usually waits for the supplier to produce the SLA. Of course, this leads to a situation in which the SLA actually protects the supplier, not the buyer. So here are the things one must do to achieve at least a reasonable, if not good, SLA:
- Remember that any SLA is open for negotiation, but only at initial purchase - although the supplier may take a very rigid position on the SLA (especially common in large companies), the SLA is part of the sales process. A rigid position should immediately raise red flags that the proposed "unchangeable" SLA is protecting the supplier, not the buyer. The best opportunity to negotiate it is during the initial RFP negotiations. Once the product/service is sold and goes into production use, the buyer has lost all power of negotiation. So be very wary of agreeing to negotiate the SLA after delivery, at the end of warranty, or under some similar wording.
- Define availability as you would expect it - availability is usually calculated as the percentage of time the product under SLA is up and running. Usual numbers vary from 98% to 99.999% of the time. Now, let's examine the "time" factor in the formula. Upon first reading, a person will usually assume that 98% means 98% of any time measure, whether it be hour, day, month, year, century... But let's observe the following table: In an SLA contract specifying a percentage of availability per time period, the total downtime is accumulated over the entire time period. Furthermore, if no time period is specified with the availability percentage, the default time period is the period of validity of the contract - which is very often 1 year or more. So, if you sign a yearly contract with an SLA of 99%, it doesn't guarantee that you will have at most about 100 minutes of downtime per week. It means that you won't have more than 3.65 days (87.6 hours) of downtime over the entire year, which means that you can have 10 full 8-hour workdays WITHOUT ANY SERVICE in that year. If you take the same 99% but insist on applying it at a weekly level, you suddenly get much better odds - now you can't have more than 1.68 hours of downtime in any single week. So take a day of meetings in your company to define your maximum tolerable downtime per day, and use the above matrix to find the best option for you.
- Always keep in mind the distinction between reaction time and correction time - during the negotiation of an SLA it is usual to have very tense negotiations to achieve a good "response time". But this umbrella term is an excellent shelter - for the supplier! Response time is defined as the time that passes between the formal logging of a problem and the moment a representative of the supplier logs a response (sends a reply by e-mail, makes a phone call or arrives on-site). It says nothing about when the problem will actually be fixed. So when defining the response times, ALWAYS define two or three different times: reaction time - which is equivalent to response time; workaround time - the time within which a temporary solution that alleviates the problem is expected; and correction time - the time within which a final solution is expected.
- Make precise definitions of problem severity levels and tie them in with reaction and correction times - as in my previous post, the severity of a problem can be viewed differently by the buyer and the supplier. So, define a clear matrix of severity levels, and include a clause which states that if the two parties assess the severity level differently, the view of the buyer prevails. A sample of severity levels is presented in the table below:
- Define response times for all levels of severity - naturally, the buyer should expect faster reaction and correction for more severe problems. When defining the severity levels, include for each one at least the expected reaction time and workaround time.
- Define channels of communication and escalation - at first glance a very simple thing, but one that is very often the reason a buyer cannot dispute a breach of the SLA. For a problem to be considered properly reported, the supplier will expect a report from an authorized person to specific persons via e-mail, fax or phone. Any deviation from the agreed-upon process gives the supplier an excellent excuse for not meeting the SLA parameters on the grounds of "not being informed". So always have at least three authorized persons for problem reporting, and modify internal procedures so that these persons are the first to be informed of a problem. The same is true for the escalation of problems to higher levels, should the problem persist.
- Define the conditions under which the SLA criteria are applied to a problem - it is not uncommon in SLAs to see that the SLA criteria start to apply from the time the buyer reports the problem to the supplier. This is an element usually insisted upon by the supplier, since it shifts the burden of monitoring and reporting onto the buyer. By the time the problem is reported, the actual problem has already existed for anywhere from several minutes to half an hour. Moreover, there are products which the supplier cannot monitor, so the supplier cannot determine on its own that a problem is occurring. So even if this point cannot be changed in the contract, adjust internal procedures so that the authorized persons of the buyer IMMEDIATELY report the problem to the supplier. Internal metrics can even be applied to this process, to identify internal lags in communication.
- Define measurements and reporting - an SLA is useless if you can't properly measure and document the length of each problem. So the buyer should keep track of problems, with information on the severity, duration of the problem, reaction time and correction time, together with all relevant e-mails and messages exchanged. Tracking can be achieved with something as simple as an Excel sheet; all it requires is regular updating.
- Tie in penalties and contract back-out options - this is the actual big stick in the SLA. Breach of SLA parameters should be tied to serious penalties and the possibility of contract termination. When defining penalties, always strive to define them as a monetary value payable immediately upon breach of the SLA. Also, you should try to negotiate a penalty that grows exponentially with each further hour of SLA breach. Do not accept a penalty that is compensated with other goods or services from the same supplier, since the supplier will value such services at sales price in the refund, while its internal costs for such services are significantly lower, thus reducing the supplier's actual loss from the SLA breach.
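
To make the availability arithmetic from the list above easier to check, here is a minimal Python sketch that computes the maximum downtime a given availability percentage allows over different measurement periods. The function and variable names are my own, and the 30-day month and 365-day year are simplifying assumptions; your contract may define the periods differently.

```python
# Sketch: allowed downtime for an availability target, per measurement period.
# All names here are illustrative; period lengths are simplifying assumptions.

HOURS_PER_PERIOD = {
    "day": 24,
    "week": 7 * 24,
    "month": 30 * 24,   # assumption: 30-day month
    "year": 365 * 24,   # assumption: non-leap year
}


def allowed_downtime_hours(availability_pct: float, period: str) -> float:
    """Return the maximum downtime in hours that still meets the target
    when downtime is accumulated over one full period."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_PERIOD[period] * (100.0 - availability_pct) / 100.0


def print_downtime_table(levels=(98.0, 99.0, 99.9, 99.99, 99.999)) -> None:
    """Print one row per availability level, one column per period."""
    periods = list(HOURS_PER_PERIOD)
    print("availability".ljust(14) + "".join(p.rjust(12) for p in periods))
    for pct in levels:
        cells = "".join(
            f"{allowed_downtime_hours(pct, p):.2f} h".rjust(12) for p in periods
        )
        print(f"{pct}%".ljust(14) + cells)


if __name__ == "__main__":
    print_downtime_table()
```

For 99%, the yearly column works out to 87.6 hours and the weekly column to 1.68 hours, which is the difference discussed in the availability item: the same percentage is far stricter when it is measured over a shorter period.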
The SLA Lesson: software bug blues