Thursday, December 16, 2010

A Comedy of IT errors

It's worth taking a look at Humid Beings' post on the Orleans Parish Clerk of (Civil) Court computer crash.

Having worked in IT long enough to experience computer crashes, database crashes, database archive configuration issues, and backup configuration issues, I can tell you it takes a lot of things going wrong simultaneously for this kind of thing to happen.

We forget how critical it is to have people who:
1) understand the technology being used
2) understand the data and its importance
to be able to keep these critical systems happy and safe.

I have yet to read anything which indicates that anyone knows what really went wrong.


The servers containing all of the conveyance and mortgage records crashed. The company hired in August 2009 to back up the records had stopped receiving good data in July, and it lost the older data in monthly purges. A batch of fully updated records was recovered, but it was garbled and deemed unusable.

Let's take this apart bit by bit:

The servers containing all of the conveyance and mortgage records crashed.

Crashed... as in they were running and went down. Servers reboot, folks. Disks crash, and they can be recovered through RAID or restored from a tape or disk backup. Servers (as in more than one) all crashing and not rebooting or recovering is, well, rare in today's computer world.

Not using a RAID system to protect the data in today's world - STUPID. But say that no one in City Government is smart enough to configure RAID disks. Disk space is so cheap that you could literally keep a complete copy of the database on a completely different system pretty inexpensively.

And when systems crash it is critical to have people available who understand how the systems are configured and what the error messages mean.

The company hired in August 2009 to back up the records had stopped receiving good data in July, and it lost the older data in monthly purges. A batch of fully updated records was recovered, but it was garbled and deemed unusable.
AND
The court eventually was able to recover digital conveyance records from the 1980s up to March 27, 2009, and mortgage data through Aug. 6, 2009.


Really? NOBODY... NOT ONE PERSON was watching to ensure that the backups were completing successfully? No one? Monthly purges of what exactly? Weekly backups? Known in the business as incrementals. Folks, you can lose every single incremental backup and still RECOVER, at least to the point of your last full backup, as long as that full backup is valid. This makes it sound like the last successful FULL backup they got was August 2009. So for a whole year no one checked to confirm that the backups were functioning. No ONE? When we are talking about ALL the data on real estate transactions in Orleans Parish. Wow.

I've seen incompetence. More than my share. An honest misunderstanding of how the systems function, both hardware and software, can cause this to happen. But a year of not making sure the backups work? A year? Not a couple of weeks, which is more likely to happen in big companies than you'd like to think.... but a YEAR?
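To make the point concrete, here is a minimal sketch of the kind of check that someone, or some script, should have been running. Everything in it is hypothetical: the `BackupRecord` type, the seven-day threshold, and the idea that the backup software can export a list of finished jobs are my assumptions for illustration, not anything the article reports about the Clerk's setup.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class BackupRecord:
    """One finished backup job (hypothetical shape, for illustration only)."""
    kind: str        # "full" or "incremental"
    finished: date   # day the job finished
    verified: bool   # True only if the job completed AND passed a check


def latest_good_full(records: List[BackupRecord]) -> Optional[BackupRecord]:
    """Return the most recent verified full backup, or None if there is none."""
    fulls = [r for r in records if r.kind == "full" and r.verified]
    return max(fulls, key=lambda r: r.finished, default=None)


def backup_alerts(
    records: List[BackupRecord],
    today: date,
    max_full_age: timedelta = timedelta(days=7),  # assumed policy, pick your own
) -> List[str]:
    """List the problems a human should be told about today."""
    alerts = []
    full = latest_good_full(records)
    if full is None:
        alerts.append("no verified full backup exists")
    elif today - full.finished > max_full_age:
        age_days = (today - full.finished).days
        alerts.append(f"last verified full backup is {age_days} days old")
    return alerts
```

Run something like `backup_alerts(records, date.today())` every morning and send the result to a person who is expected to act on it. A stale full backup then gets noticed in days, not a year later.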

Databases (Oracle/SQL) have the capacity to keep archives which allow them to be "rolled back" or "rolled forward" to a specific recovery point. It takes knowledge of Oracle/SQL, the application, and the volume of data changes, as well as finesse, to set this up effectively and efficiently. And in the best of all possible configurations the data should be on one server and the database and its archive logs should be on another so that you never lose both at once but... well... that probably falls under the finesse category. It sounds like they lost this too.
Either that OR
they have everything successfully on backup but they lost the backup server's own database, which is what allows the backup tapes/disks to be read back and the data rebuilt. The article isn't clear as to whether the problem was with the recovered data (we pulled it off backup tape, but what we got we can't use to rebuild the database) or with the ability to recover data (we've got stuff on tape/disk, but we can't figure out how to get it off). Understanding the problem is a big step toward a solution, and I'm not sure that anyone at the Clerk of (Civil) Court understands what really happened.
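For anyone wondering what "rolling forward" to a recovery point means, here is a toy model of the idea. It is only a sketch: a real database's archive logs are far more involved, and the `Change` tuple and `roll_forward` function below are names I made up for illustration, not Oracle's or SQL Server's actual mechanism.

```python
from typing import Dict, List, Tuple

# One logged change: (sequence number, record key, new value). Hypothetical format.
Change = Tuple[int, str, str]


def roll_forward(
    full_backup: Dict[str, str],
    archive_log: List[Change],
    recover_to: int,
) -> Dict[str, str]:
    """Rebuild the data as of `recover_to` from a full backup plus logged changes."""
    restored = dict(full_backup)  # start from a copy of the last full backup
    for seq, key, value in sorted(archive_log):
        if seq > recover_to:
            break  # stop at the chosen recovery point
        restored[key] = value  # replay the change on top of the backup
    return restored
```

The full backup gets you to a known starting point, and the archived changes get you from there to the moment you choose. Lose either piece and you are stuck at whatever the other one can give you, which is why keeping them on separate servers matters.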

The backup system doesn't seem to have been configured properly, functioning properly, or tested to confirm it was working properly.
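Testing a backup means actually restoring it and comparing what comes back against the original. Here is a minimal sketch of that comparison using file checksums. The function names are hypothetical, and it assumes the data can be exported to a file and restored to a scratch location, which may not match how these particular systems worked.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large exports do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(source: Path, restored: Path) -> bool:
    """True only if the restored copy is byte-for-byte identical to the source."""
    return sha256_of(source) == sha256_of(restored)
```

A garbled restore fails this check the day it happens. You don't find out months later when you actually need the data.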

But real estate agents and title attorneys say they have never seen two servers full of data knocked out simultaneously, all the backup systems fail and 21 months of records disappear in one fell swoop.

No, it takes a number of large doses of bad luck, incompetence, and just flat not caring to get us where we are. Someone from the outside should have been watching and questioning what the recovery process was, because it is real hard to believe that the same people who got us into this mess could ever dig us out.

1 comment:

ErinL said...

This office, and other city offices that manage critical data, should have a disaster recovery plan, and an out-of-area data replication site. And just like fire drills, it needs to be tested occasionally. This lesson should have been learned with Katrina.

We all need and deserve to learn what really happened, and what steps are being taken to ensure it doesn't happen again.