The SLA Lesson: software bug blues

I have been hugely busy in the past weeks with several projects, so the blogging got stuck... I will try to avoid this in the future. Now back to my latest experience.

Part of every Information Security Management System is the incident management process. It is a process in which the company identifies a problem which is occurring or has occurred, and performs steps to contain it, minimize the impact, identify the root cause and take measures to prevent the incident from recurring.

The incident in question was the dreaded application block - a company of 1000 employees uses a custom-made, fully integrated CRM/ERP system, which exhibited complete or partial non-responsiveness lasting several minutes at a time, over a period of nearly two hours. The situation was reported by several departments, while the rest of the company was functioning as usual.

As soon as the call came in, the incident response team was formed and the problem was analyzed. After 15 minutes, the cause was identified: Accounting had started a program which runs once a week and affects the billing information of most Key Customers. The program was started at its usual time, with its usual parameters. The problem was rectified by stopping the processing and postponing it until after business hours.

Upon further investigation of the incident, it turned out that the problem had occurred before, at regular intervals, but was never reported as an incident. The situation had been handled by the IT department, who reported it as a bug to the software company which created the software.

When I requested a status update from IT on this bug report, I received shocking news: the software company had closed the bug report with a status of DENIED.

So I called the release manager at the software company, and I got an even bigger shock: he explained that the software company decided to deny this bug report because of the overwhelming number of change requests and bug reports from our company. In his words, this bug was a mere nuisance, since it blocked part of the software for about an hour once a week - just run it during lunch!

At this point, the incident was no longer just an incident. It had become a support contract issue, so I reported the situation to management and recommended an intervention from their side.

This incident is a very good lesson in the different priorities and focus of the parties involved:

For a user of the system any problem can be a show stopper.

For the manufacturer of the system, the same problem can be played down to the importance of an itch. There can be many reasons for such a difference in opinion, but here are a few:

  1. There are insufficient human resources to address the issue
  2. There are profitable change requests or projects to address, so this item is merely postponed, since the software company will not see a profit from committing its resources to correcting this problem.
  3. The problem is caused by a design flaw in the system that is either very difficult or impossible to rectify in a reasonable time and within a reasonable budget

The only way to increase the value of the users' incident to the manufacturer is by applying proper controls and penalties in the support contract. That is why the history and results of security incidents should also be used as a valid input into the preparation and negotiation of the SLA.
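
To put some numbers on this, here is a minimal sketch of how the incident history from this case could be turned into an availability figure for SLA negotiations. Only the "about an hour once a week" outage comes from the incident above. The 5 x 8 hour business week, the 99.99% target and the function names are my own assumptions for illustration, not terms from any real contract.

```python
# Availability sketch. Only the one-hour weekly outage is taken from the incident;
# the measurement windows and the target below are hypothetical.

HOURS_PER_WEEK_CALENDAR = 7 * 24   # 24x7 measurement window
HOURS_PER_WEEK_BUSINESS = 5 * 8    # assumed business-hours window (5 days x 8 hours)
OUTAGE_HOURS_PER_WEEK = 1.0        # "about an hour once a week"
TARGET_PERCENT = 99.99             # hypothetical SLA target


def availability_percent(downtime_hours: float, window_hours: float) -> float:
    """Return the share of the measurement window without downtime, in percent."""
    if window_hours <= 0:
        raise ValueError("window_hours must be positive")
    if not 0 <= downtime_hours <= window_hours:
        raise ValueError("downtime_hours must lie between 0 and window_hours")
    return 100.0 * (window_hours - downtime_hours) / window_hours


def allowed_downtime_minutes(target_percent: float, window_hours: float) -> float:
    """Return the downtime budget in minutes that a target leaves in a window."""
    if not 0 <= target_percent <= 100:
        raise ValueError("target_percent must lie between 0 and 100")
    return window_hours * 60.0 * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    calendar = availability_percent(OUTAGE_HOURS_PER_WEEK, HOURS_PER_WEEK_CALENDAR)
    business = availability_percent(OUTAGE_HOURS_PER_WEEK, HOURS_PER_WEEK_BUSINESS)
    budget = allowed_downtime_minutes(TARGET_PERCENT, HOURS_PER_WEEK_CALENDAR)

    print(f"Measured over a 24x7 week:     {calendar:.2f}%")
    print(f"Measured over business hours:  {business:.2f}%")
    print(f"Weekly budget at {TARGET_PERCENT}% (24x7): {budget:.1f} minutes")
```

Under these assumptions the same weekly outage comes out at roughly 99.4% when measured around the clock and 97.5% when measured against business hours, which is why the measurement window belongs in the contract text and not in anyone's head.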

2 comments:

gsyoungblood said...

The problem is that getting an SLA with teeth can be difficult, especially if there is a significant size disparity between the customer and the vendor.

I can't tell you how many SLAs I've seen that weren't worth the paper they were written on. It's an SLA in name only, with virtually no penalties for failure to perform. And, as such, the vendor has minimal motivation to meet or exceed the criteria in the SLA.

Another bit of advice: even if the SLA you do get is toothless, make sure that you can tie incidents in with your contract - especially if the contract term is a year or longer. Having the ability to terminate the contract without (or with minimal) penalty due to multiple incidents can be a lifesaver.

I recall one company that had a 3-year contract for service that included an SLA guaranteeing 99.99% availability. The problem was that the method of calculating availability was not defined, and the customer and vendor had two different definitions. Short story: customer wanted to cancel contract for failure to live up to SLA, vendor threatened termination penalties and fees. No one was happy.

dark0 said...

I fully agree that the topic of SLAs is never an easily solved one.
On the other hand, it has been my experience that SLAs are very frequently an afterthought and that the buyer is usually waiting for the seller to produce the SLA. Of course, this leads to the situation presented by gsyoungblood. If, on the other hand, the SLA is properly addressed at the time of purchase and tied in with incident tracking, it can be used as a very big stick, even if the buyer is John Doe Inc. and the seller is Microsoft.
I feel that this will merit another post to communicate my experiences on the subject.
