THE EVENING OF SUNDAY, May 4, 2003, at Aeneas Internet and Telephone began as any previous Sunday evening had. The Jackson, Tennessee-based company that serves about 10,000 Internet and 2500 telephone customers was closed for the weekend, awaiting the return of its 17 employees the next morning. Just before midnight, however, all hell broke loose. An F-4 category twister touched down just outside of town, then tore through Jackson's downtown area, levelling houses, historical sites and municipal buildings alike. The tornado ripped straight through Aeneas's one-story building, leaving only a pile of rubble.
Meanwhile, Aeneas CIO and Operations Manager Josh Hart, who'd heard about multiple tornadoes in the area that day, was home, 84 kilometres away in Martin, Tennessee, huddling in his bathroom with his family. As soon as he was able, he flipped on the TV for news footage of the devastation. What he saw looked like "a war zone," bricks and concrete everywhere and piles upon piles of rubble.
At 2 am, with those images in the background, Hart's cell phone rang - it was Aeneas Network Administrator Jason Warren calling from what he likened to Ground Zero to report that everything in Jackson was lost. Another call came in from CEO Jonathan Harlan.
"I'm listening to [Warren] tell me what it's like, and he says, 'It doesn't even look like there was an office here,'" remembers Hart, 25. "The tornado destroyed our computers, our desks, everything. I couldn't believe what he was telling me."
Aeneas lost nearly $US1 million in hardware and software that night and suffered an estimated 72 hours of downtime. But just as Aeneas in Virgil's Aeneid endured the worst the gods had to offer, so too did this Aeneas. This one, however, was wise enough to have created a contingency plan - one that minimized the damage and kept the company afloat during its darkest hour.
The company is not alone. After a nationwide scramble to prepare for high-impact, low-probability events similar to the attacks of Sept. 11, CIOs have since realized that their organizations are far more likely to succumb to another type of event - one that has a high probability of occurring and, curiously enough, is probably simpler to predict: the weather. For example, in June, while the Atlantic seaboard was bracing for the start of hurricane season, Arizona was busy battling forest fires. And in Harris County, Texas, in 2001, a tropical storm and resulting flood taught one IT executive the importance of flexibility.
Both Aeneas's Hart and Steven W. Jennings, Harris County's executive director of central technology, share their experiences here in an effort to provide best practices and battle-tested secrets about which preparations work best. According to Carol Kelly, vice president of government strategies for Meta Group, these are lessons from which everyone can learn. "When disaster strikes, you want to be ready with a plan of action and an approach of how to deal," she says. "You might be ready for the next terrorist attack, but if you're not ready for the next nor'easter, your plans won't amount to much."
Big Plans for a Small Company
Aeneas launched its contingency plan when it was founded in 1996; since then, CIO Hart has enhanced the strategy gradually almost every year. In early 2002, as the ISP neared 10,000 Internet customers, he and his network administrator, Warren, thought up the company's most comprehensive approach yet. While they determined that the likelihood of a terrorist attack on the western Tennessee town of Jackson, population 59,600, was slim to none, they concluded that because of the municipality's location in the central United States' infamous Tornado Alley, the plan should respond to the next most likely cause of disaster - twisters. What ensued was a three-pronged plan that hinged upon colocation, distribution and backups.
- First, by employing Border Gateway Protocol (BGP) programming on a high-class circuit shared with an ISP 90 miles down Interstate 40 in Memphis, Aeneas would colocate in real-time its IP addresses and reroute data traffic offsite during any local disruption. With this system, servers would automatically reroute Internet service operations the moment a disruption occurred. In theory, at least, that would guarantee continuity of operations across the board.
- Next, the company distributed its voice traffic dynamically, paving the way to switch its T1 connections from one fibre node in the Bell South network to another, in the event of a sudden telecommunications infrastructure failure. This system was designed to preserve continuity much like the BGP system.
- Finally, the company's network administration team engineered applications that stored customer records and other data on tape as well as on backup hard drives. Though the tape and hard drives were stored onsite at the Jackson location, Hart and Warren figured onsite backup was better than none.
This strategy wasn't put to the test until tornado season this year, when hardware, software and pieces of the local infrastructure were destroyed May 4. Business customers on T1 lines lost their connections as soon as the tornado struck. ISP traffic also went down immediately and took 36 hours to restore. The fibre node switch to recover voice traffic took a bit more time, as Aeneas programmers worked around the clock with technicians from Bell South to migrate the T1 connections from the old node to the new, finalizing the switch nearly three days after the twister hit.
"When you have hundreds of T1 lines that need to be moved from one node to the next, there's a lot of reengineering that needs to take place," says Hart. "We thought we were prepared, but I'm not sure we ever considered just how difficult this would be."
Bumps in the Disaster Recovery Road
Beyond the challenges inherent in rerouting traffic, the remediation effort hit two other snags. The first revolved around colocation; because the colocation arrangement with the Memphis ISP was still being set up at the time of the tornado, the Memphis site didn't yet have sufficient servers. To remedy the situation, Aeneas staff members - and family and friends - drove to Memphis with additional equipment to handle the load. The company had some of this equipment on hand - what it didn't have, Hart and Warren purchased online and had overnighted to their homes. All told, colocation was down for about a day and a half.
The larger and more formidable of the two setbacks involved the company's tape and hard-drive backups. It was clear from the beginning that most of the company's paper-based customer records had fallen victim to Mother Nature, but four days after the tornado, Hart and Warren discovered that the electronic tape and hard-drive backups had failed as well. Hart finally uncovered the tape and hard drives May 8 - when he pulled the tape from the rubble, it was so badly damaged that he hardly recognized it. Hart passed the hard drives on to a number of local data recovery specialists to see if they could retrieve anything. One by one, each came up empty.
Finally, as a last resort, Hart plucked the hard drives from four different nonfunctioning computers and turned them over to Kroll OnTrack, a data recovery company in Minneapolis. Miraculously, the vendor discovered a recent copy of the customer records database on all four computers and was able to recover all of the customer data and return it to Aeneas, delaying printing of its May bills only minimally.
Large Organization, Even Larger Plans
For an IT organization as small as Aeneas, the tornado presented sizeable challenges. But for the IT organization of Harris County, Texas, which services more than 15,000 county employees and nearly 3.5 million constituents, the problems presented by Tropical Storm Allison were downright monumental.
Disaster struck June 6, 2001 - the second day of a five-day storm - when atmospheric conditions caused a cloud to linger over the Houston area for nearly six hours, dropping more than 39 inches of rain. By the time the clouds parted, Harris County government had lost five buildings and most of the communications and other hardware and software in them to water damage. The price tag: a whopping $US24 million.
Fortunately, though, Executive Director of Central Technology Steve Jennings had prepared for such an event. When Jennings joined county government in 1975, he established continuity planning to address natural disasters, such as flooding and hurricanes. The plan, which he dubbed the Four R strategy, hinges on four incremental steps - review, rewire, relocate and rebuild.
With this in mind, Jennings attacked the recovery immediately, following his plan like a bible. The morning after the deluge, he and his top advisers met to review assets and assess damages. Next, because Harris County is public and qualifies for federal aid, Jennings called in the Federal Emergency Management Agency (FEMA) to inspect the damage and lend him some disaster recovery expertise. He also brought in NetVersant Solutions to lay new fibre-optic cables. This process took approximately six weeks. In the meantime, Jennings reconvened his advisers, and put together an emergency relocation plan to disperse county employees into available office space on high, dry ground. Three months later, he tapped into the first of several batches of funding from FEMA to start rebuilding, spending millions on treating buildings for water damage.
Jennings also worked double time to ensure that county communications didn't miss a beat. "We utilized existing remote access facilities that allowed county employees to dial in from home until their new offices were finished," he says. This was done for employees whose jobs were deemed critical to county operations and for those for whom the county couldn't find alternative space. Jennings then mobilized a force of technicians to install high-speed connections at the homes of those employees who needed it most.
Finally, with the help of the county clerk's office, Jennings activated a cache of 300 Cingular cell phones, which had been reserved to help the blind vote on Election Day, and distributed them on an as-needed basis to county departments. "Those phones are deactivated for 11 months of the year, but they were available and we needed them," he says, noting that network administrators deactivated the phones and retrieved them once they managed to bring each department back online. "Part of recovering from a disaster is making use of everything you can find, and we did just that." When all was said and done, it took the county about a year to return to normal, which, according to Jennings, was pretty good given the scope of the damage.
Lessons Learned
Jennings says the storm confirmed his belief that continuity plans should be flexible and horizontally applicable. Before the flood, Harris County's disaster recovery plan was conceived to respond to potentially any disaster, but it typically addressed single events such as the loss of a building, a network or a system. It was flexible enough, however, that it worked even when the county was faced with recovering multiple facilities. He adds that Harris County government "uses different portions of the plan for total recovery." Today, the Harris County continuity plan incorporates suggestions from employees who were part of the recovery process and lists scenarios for various "disaster combinations" that could occur during the next big storm - such as what to do if both the jail and family court get hit. When that storm does happen, Jennings says he'll respond even faster than he did in 2001.
The next time a weather event occurs, Jennings says he'll also have the added benefit of wireless. After the flooding, as Jennings tried to rewire the Harris County jail, he spent $US200,000 on Lynx high-definition wireless technology as an interim solution. The technology worked so well that he kept it and now has it on hand to pinch-hit during the next crisis. If, for example, a storm knocks out phone lines in the southeast corner of the county, Harris can set up wireless in hours. In addition, if another rainstorm waterlogs some of the underground fibre optics downtown, Harris can use the technology to provide emergency telephone service to anyone who needs it.
"Mother Nature never follows a script, especially not the one you wrote," Jennings quips. "As we have more experience recovering from the disasters she wields, we'll have a better sense of which remedies work best."
At Aeneas, Hart notes that from "now until the end of time," he'll keep an electronic records backup offsite to eliminate the problems he endured in recovering those mission-critical customer files. Planning for offsite backup had begun before the May tornado, and the site is now up and running in Memphis. Hart admits that his error in planning nearly cost Aeneas everything, adding that he'll never make that mistake again. Another misstep Hart says he'd correct is the way he handled the media in the days following the tornado. If he could do it all over again, Hart says, he would have been on the phone immediately with newspapers, TV stations and radio outlets to jump-start the company's PR campaign and assuage customer concerns.
"[Our customers] must have been watching the TV news thinking, 'Man, that's my ISP,' and we're too busy working on restoring systems to think about putting their minds at ease," he says. "Restoring technology after a disaster is important. But rebuilding customer confidence. . . it doesn't get more important than that."
SIDEBAR: Five Keys to Ensuring Business Continuity
You get only one chance to respond to a weather catastrophe. According to Al Berman, senior vice president and leader of the national business continuity management practice at Marsh, IT organizations of every size should test their disaster recovery strategies regularly. He offers the following tips.
1. Back up all critical data at least daily to redundant servers, network drives, and tape or optical drives. Backups should be performed more frequently for data that cannot be reconstructed from any other source.
2. Back up laptops and ensure that critical data is not stored on C drives but is stored to network servers that are backed up and stored offsite. Many businesses fail to realize the importance of data stored locally on laptops. Due to their mobile nature, they can easily be lost or damaged.
3. Maintain copies of all backups offsite. This is especially important if an entire server is damaged or destroyed. SmartSync Software, Sun Microsystems and Veritas Software provide software that automatically backs up data to an offsite server daily.
4. Schedule regular reviews of computer security to help prevent unauthorized access, modification or deletion of data.
5. Create a disaster recovery plan as part of a comprehensive business continuity plan.
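For readers who want to see what tips 1 and 3 might look like in practice, here is a minimal sketch in Python, not a description of what Aeneas, Harris County or any vendor named above actually runs. It copies one file to several backup locations and confirms each copy with a checksum, so a damaged backup is caught before it is needed. The function names (`sha256_of`, `back_up`), the file name and the "onsite" and "offsite" folders are all hypothetical; in a real setup the offsite folder would be a mounted remote volume or a separate transfer step, and the script would be run daily by a scheduler.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, destinations: list[Path]) -> dict[Path, bool]:
    """Copy source into each destination directory and verify each copy.

    Returns a mapping of copied file path -> True if its checksum
    matches the checksum of the source file.
    """
    expected = sha256_of(source)
    results = {}
    for directory in destinations:
        directory.mkdir(parents=True, exist_ok=True)
        copy_path = directory / source.name
        shutil.copy2(source, copy_path)  # copies contents and file metadata
        results[copy_path] = sha256_of(copy_path) == expected
    return results


if __name__ == "__main__":
    # Temporary directories stand in for real onsite and offsite storage
    # (assumption for illustration only).
    with tempfile.TemporaryDirectory() as root:
        root_path = Path(root)
        records = root_path / "customer_records.db"
        records.write_bytes(b"example customer data")
        outcome = back_up(records, [root_path / "onsite", root_path / "offsite"])
        for copy_path, ok in outcome.items():
            print(copy_path.parent.name, "verified" if ok else "FAILED")
```

The verification step is the point of the sketch: a backup that has never been read back offers no guarantee that it can be restored.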