WebMail on recovery
USF faculty and students hoping to catch up on their University email over Thanksgiving break ran into a bit of a snag when USF’s email service crashed Nov. 21.
The crash caused a day-long holdup on incoming mail, the short-term loss of all existing data in USF email accounts and the permanent erasure of all mail received between 4 a.m. and 4:30 p.m. that day.
The crash came less than a year after a more catastrophic crash on Dec. 22, 2005. Both were caused by similar hardware failures and have prompted administrators in Academic Computing to replace all of the hard drives and the server used to maintain USF email, a process they hope to complete before spring term begins.
“There’s no guarantees, but the new system should be more reliable,” said Eric Pierce, an email administrator for Academic Computing.
The short-term loss of all pre-existing emails posed a serious problem for those who needed access to data stored in their University accounts. Frantz Aubry, a senior majoring in biomedical science, was unable to access some documents he needed for a personal statement on his medical school application.
“It wasn’t a huge deal, but it was a hassle,” Aubry said.
Faculty and students who were sent important emails Tuesday morning or afternoon also faced problems. Posts on the USF WebMail homepage instructed them to contact the senders and ask that the messages be re-sent.
For Jeff Hill, a sophomore majoring in psychology, the crash didn’t cause any serious problems.
“I’ve learned not to rely on the USF mail server,” said Hill, who sends most of his emails through an alternate account.
On Nov. 21 at 4:30 p.m., about 50,000 email accounts were wiped out after two hardware controllers malfunctioned. The controllers coordinate the 12 hard drives that store all email account data, ensuring that the drives work together and that, if one or more of them fail, other drives take over.
When the controllers failed to perform this vital task, all 12 drives crashed, Pierce said. Much of the data on the drives had been saved to backup tapes, but the emails received in the roughly 12 hours preceding the crash were irretrievably lost.
“It’s kind of like taking a snapshot,” said Alex Campoe, the associate director for Academic Computing. “The last snapshot we had was early Tuesday morning.”
Pierce and fellow email administrator Chance Gray then had to recreate all of the email accounts from scratch, a process that was completed Wednesday evening. All of the emails received after the crash on Tuesday, which had been piling up on another drive, could then be transferred into the appropriate email accounts.
After that, the lengthy process of restoring all the backed-up data began. Part of the reason this process took so long, Campoe said, was that the hard drives are backed up on tapes rather than on other hard drives, and retrieving data from tapes takes much longer.
The final transfer of all this data into email accounts was completed at 6 a.m. Sunday.
Pierce, who did the lion’s share of the work to get the system up and running, said he spent part of his Thanksgiving working on the system from home and nearly all of Wednesday and Sunday on the restoration process.
“It was a long weekend,” Pierce said.
After the initial failure, USF had to run the email system on a different set of drives until the faulty controllers could be replaced. Although the controllers were replaced on Wednesday, USF email administrators have decided to continue running the system on the other drives, which they feel are more reliable.
“With two failures in a year we just don’t feel comfortable with that system anymore,” Pierce said. “So until we’re more comfortable, we aren’t going to put any more data on it.”
While they may be more reliable, these other drives are also slower, since they are not directly connected to the USF email server and instead must transfer data through a network connection. This explains the slowdowns users are currently experiencing, Pierce said, and users should expect slower load times for the next few days.
Pierce said the new hardware, consisting of new hard drives, new controllers and a new server from Dell, should be more stable, and he hopes it will prevent future crashes.
Pierce, who also spent much of Christmas Eve last year working on the email system following the crash on Dec. 22, added that he hoped the new hardware would keep him from spending part of his next holiday restoring thousands of email accounts.
“As soon as I got the page that the server was down on Tuesday, I had flashbacks to last Christmas,” he said.