System Management - When Do IT Admins Screw Up?

The main purpose of IT within a company is to provide IT services to the business. This means that the responsibility for availability, response time, and service quality rests mostly on the shoulders of IT admins.

In most cases IT personnel understand the burden they bear very well, and are extremely careful in their daily activities. But if certain processes and IT culture are not in place in an organization, system admins can cause disruptions.

Here are the conditions under which an IT admin can screw up, each with a real-life example:

  1. Lack of Proper Testing and Contingency Planning 1 - A corrective update batch process was run on the CRM system. The admin started the process at 9 PM, expecting it to complete overnight, and left it unsupervised. The process ran until 5 AM, when it failed and the database began a rollback. The rollback took roughly another 8 hours, incapacitating the company's CRM until around noon of the following business day.
  2. Lack of Proper Testing and Contingency Planning 2 - During database maintenance, several large tables were moved directly to archive and manually recreated as empty tables. The system ran well for 5 days, after which every operation became very slow or could not be performed at all. A simple analysis concluded that during the archive and recreation process, the indexes were not recreated on the new tables, forcing the database to do a full table scan for every operation. Since the tables started out empty, this did not become an immediate problem.
  3. Lack of Coordination and Communication - A clustered mail server exhibited errors in mailbox processing. Two administrators were called in to remedy the problem. The first administrator initiated a mailbox rebuild process. Ten minutes later, the second admin instructed the cluster to fail over the mail server resources to the other server. The rebuild process crashed and corrupted the entire mailbox pool, which had to be restored from backup. All emails received after the backup were lost.
  4. Not Following Procedures - The corporate web server raised a low disk space alert, so a system admin searched the disk for items to delete. He found a folder named "Copy of wwwroot" and assumed that it was a copy of the web server root directory. He deleted the folder and all its subfolders, freeing up space. Five minutes later, the manager called to report that the corporate web site was down. The previous day, another admin had helped web development deploy a new version of the portal, and they had placed it in "Copy of wwwroot". Luckily, the old version was still available, so a temporary version of the portal went up in 10 minutes.
  5. Direct Training or Testing on a Live Environment - A newly hired administrator was given access to administrative passwords. Since his new job would require him to administer routers, he decided to try some router commands after work. He chose to connect to a router whose IP address was commonly mentioned, logged on, and started typing basic commands, specifically the 'show' command set. He also used the abbreviated version of 'show' - 'sh'. He got braver, entered an interface configuration, typed 'sh' again, and pressed Enter. The router complied and did not return anything. What the admin didn't know was that at interface level 'sh' means 'shutdown'. The company's Internet link was down for 2 hours until a senior admin brought the interface back up.
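
The missing indexes in example 2 are the kind of thing a short post-maintenance check could catch before users do. The following is only a minimal sketch, assuming an SQLite database and hypothetical table names (orders_archive, orders); the function names are illustrative, and other database engines expose index metadata through different catalog views.

```python
import sqlite3


def index_columns(conn, table):
    """Return the indexed column tuples for a table (SQLite only).

    The table name is interpolated directly, so it must come from a trusted source.
    """
    columns = set()
    for row in conn.execute(f"PRAGMA index_list('{table}')"):
        index_name = row[1]
        info = conn.execute(f"PRAGMA index_info('{index_name}')").fetchall()
        # Sort by position within the index so column order is preserved.
        columns.add(tuple(r[2] for r in sorted(info, key=lambda r: r[0])))
    return columns


def missing_indexes(conn, archived, recreated):
    """Column tuples indexed on the archived table but not on the recreated one."""
    return index_columns(conn, archived) - index_columns(conn, recreated)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the archive keeps its index, the new table has none.
    conn.execute("CREATE TABLE orders_archive (id INTEGER, customer_id INTEGER)")
    conn.execute("CREATE INDEX idx_archive_customer ON orders_archive (customer_id)")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
    print(missing_indexes(conn, "orders_archive", "orders"))  # {('customer_id',)}
```

Running a check like this right after the tables are recreated would flag the problem while the tables are still empty, rather than days later.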
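
In example 3, neither administrator knew the other was already working on the server. One simple safeguard is a maintenance lock that the second operation has to acquire before it starts. The sketch below assumes a lock file on storage both admins can see; the path, the owner names, and the maintenance_lock helper are all hypothetical, and a real cluster would more likely rely on its own coordination mechanism.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def maintenance_lock(path, owner):
    """Hold an exclusive lock file for the duration of a maintenance task."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(path) as f:
            holder = f.read().strip() or "unknown"
        raise RuntimeError(f"maintenance already in progress by {holder}") from None
    with os.fdopen(fd, "w") as f:
        f.write(owner)
    try:
        yield
    finally:
        os.remove(path)  # release the lock even if the task fails


if __name__ == "__main__":
    lock_path = os.path.join(tempfile.mkdtemp(), "mailserver.lock")
    with maintenance_lock(lock_path, "admin1"):
        # A second admin trying to start a failover is refused here.
        try:
            with maintenance_lock(lock_path, "admin2"):
                pass
        except RuntimeError as err:
            print(err)  # maintenance already in progress by admin1
```

The lock does not replace talking to each other, but it turns a silent conflict into an explicit refusal.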
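
The 'sh' surprise in example 5 follows from how abbreviated commands are typically resolved: the typed prefix is matched against the commands available in the current mode only. The sketch below illustrates that idea with made-up command tables; it is an assumption about the mechanism, not any vendor's actual command set or parser.

```python
# Hypothetical command tables per CLI mode, for illustration only.
COMMANDS = {
    "exec": ["show", "ping", "configure"],
    "interface": ["shutdown", "description", "exit"],
}


def resolve(mode, typed):
    """Expand an abbreviation to the single command it prefixes in this mode."""
    matches = [cmd for cmd in COMMANDS[mode] if cmd.startswith(typed)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command {typed!r} in {mode} mode")
    raise ValueError(f"ambiguous command {typed!r} in {mode} mode: {matches}")


if __name__ == "__main__":
    print(resolve("exec", "sh"))       # show
    print(resolve("interface", "sh"))  # shutdown
```

The same two letters expand to different commands depending on the mode, which is why abbreviations learned in one context are risky to reuse in another on a live device.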

1 comment:

WinDiagnostic said...

A helpful service for server sysadmins is a system auditor. You might consider WinDiagnostic, which provides server agents that continuously monitor both the Windows Registry and all file systems for any changes.

Generally, SysAdmins will see a 90% reduction in server problem diagnosis efforts.

See http://www.WinDiagnostic.com