Friday, April 11, 2008

The SLA Lesson: software bug blues

5:27 PM Posted by Bozidar Spirovski
I have been hugely busy in the past weeks with several projects, so the blogging got stuck. I will try to avoid this in the future. Now, back to my latest experience.

Part of every Information Security Management System is the incident management process. It is a process in which the company identifies a problem which is occurring or has occurred, and performs steps to contain it, minimize the impact, identify the root cause and take measures to prevent the incident from recurring.

The incident in question is the dreaded application blocking: a company of 1000 employees uses a custom-made, fully integrated CRM/ERP system, which exhibited complete or partial non-responsiveness lasting several minutes at a time over a period of nearly two hours. The situation was identified in several departments, while the rest of the company was functioning as usual.

As soon as the call came in, the incident response team was formed and the problem was analyzed. After 15 minutes, the cause was identified: Accounting had started a program which should run once a week and which affects the billing information of most Key Customers. The program was started at its usual time, with the usual parameters. The problem was rectified by stopping the processing and postponing it until after business hours.

Upon further investigation of the incident, it was identified that the problem had occurred before, at regular intervals, but was never reported as an incident. The situation had been handled by the IT department, which reported the problem as a bug to the software company that created the software.

When I requested a status update from IT on this bug report, I received a shocking piece of information: the software company had closed the bug report with a status of DENIED.

So I called the release manager at the software company, and I got an even bigger shock: he explained that the software company had decided to deny this bug report because of the overwhelming number of change requests and bug reports from our company. In his words, this bug was a mere nuisance, since it blocked part of the software for about an hour once a week. Just run it during lunch!

At this point, the incident was no longer just an incident. It had become a support contract issue, so I reported the situation to management and asked for their involvement.

This incident is a very good lesson in the different priorities and focus of the parties involved:
For a user of the system, any problem can be a show stopper.
For the manufacturer of the system, the same problem can be played down to the importance of an itch. There can be many reasons for such a difference in opinion, but here are a few:
  1. There are insufficient human resources to address the issue
  2. There are profitable change requests or projects to address, so this issue is merely postponed, since the software company will not see a profit from committing its resources to correcting this problem.
  3. The problem is caused by a design flaw in the system that is either very difficult or impossible to rectify in a reasonable time and within a reasonable budget
The only way to increase the value of the users' incident to the manufacturer is through applying proper controls and penalties in the support contract. That is why the history and results of security incidents should also be used as a very valid input into the preparation and negotiation of the SLA.
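
As a rough illustration of how incident history can feed an SLA negotiation, here is a minimal sketch that turns a recurring incident into a lost-hours and availability figure. The numbers are assumptions made for the example, not measurements from the incident above: a 60-minute block once a week (the release manager's own estimate), 52 weeks a year, and a 40-hour business week. The `Incident` class and the `availability` function are hypothetical names used only here.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One recorded incident (hypothetical structure for this example)."""
    description: str
    minutes_degraded: float


def availability(incidents, business_minutes):
    """Fraction of business time during which the system was NOT degraded."""
    lost = sum(i.minutes_degraded for i in incidents)
    return 1.0 - lost / business_minutes


# Assumption: the weekly billing run blocks the system for 60 minutes,
# every week, for a full year of 52 weeks.
history = [Incident("weekly billing run blocks CRM/ERP", 60) for _ in range(52)]

# Assumption: 40 business hours per week.
business_minutes = 52 * 40 * 60

lost_hours = sum(i.minutes_degraded for i in history) / 60
availability_pct = availability(history, business_minutes) * 100

print(f"Hours lost per year: {lost_hours:.1f}")
print(f"Availability during business hours: {availability_pct:.1f}%")
```

Under these assumptions the "mere nuisance" adds up to 52 hours a year for every affected user, or 97.5% availability during business hours. A figure like this, taken from the real incident log, is much harder to wave away at the negotiating table than a single bug report.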