

BQBackup server down since this morning.




Posted by ZKuJoe, 09-07-2010, 05:30 PM
Anybody else experiencing downtime? It's not a big deal for me because I use multiple backup solutions, but I've sent them a few e-mails and haven't received any updates since 2:20 AM, so I was hoping somebody might have some additional information.

Thanks!

Posted by bqinternet, 09-07-2010, 05:34 PM
Quote:
Originally Posted by ZKuJoe
Anybody else experiencing downtime? It's not a big deal for me because I use multiple backup solutions, but I've sent them a few e-mails and haven't received any updates since 2:20 AM, so I was hoping somebody might have some additional information.

Thanks!
Hi Joe,

We're having trouble with the RAID array on the server that you're on, and have had a technician there a couple of times to work on it. I am working on it remotely as well. I don't actually see any emails from you other than the one this morning that I replied to, so please check where you sent them.

Posted by bqinternet, 09-07-2010, 06:30 PM
Status Update:

We're currently working on a problem with the storage server that your backups are on. The RAID controller card locked up this morning, and we dispatched a technician to revive it. While the controller is now running, several steps are still required in order for the RAID volumes to be usable. We are proceeding cautiously to avoid data loss, and expect it to be several more hours before the next status update is available.

Posted by ZKuJoe, 09-07-2010, 09:20 PM
I've been replying to your replies but I don't think they are getting to you. Funnily enough, the same day my backup server went offline, my main host had a SAN issue that took my site offline for a brief moment. Luckily I rsync my data to my home NAS every 4 hours in case something like this happens.
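
For anyone curious, the cron entry is roughly like this (the paths and hostname here are placeholders, not my actual setup):

Code:
# push the site data to the home NAS every 4 hours, over SSH
0 */4 * * * rsync -az --delete /var/www/ backup@nas.example.com:/backups/www/

The --delete flag keeps the NAS copy in step with the source by removing files that no longer exist.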

Both BQBackup and GigeNET have been solid, but it just goes to show that, against the odds, two providers can go down at once, so backups aren't to be taken lightly.

Posted by t325, 09-07-2010, 10:25 PM
Any updates? I just noticed mine was down too. Thanks

Posted by bqinternet, 09-07-2010, 10:42 PM
Quote:
Originally Posted by t325
Any updates? I just noticed mine was down too. Thanks
Unfortunately it's still not clear how long it will take to fix. In order to reset the frozen RAID controller, its battery backup had to be disconnected, and it came back up with all the disks marked as Free. We recreated the raidset, and recreated the volumes (set to "No Init" to preserve the data), but the data seems to be out of order. I have no reason to suspect data loss, but getting it back online is proving to be tricky.

Having more than one backup is definitely prudent. Although this type of failure is very rare, it is something that can and does happen. We've used this hardware platform for about 5 years on multiple racks of servers, and this is the first time the problem has come up, but anyone who has been on WHT for a while has heard similar stories from others.

Posted by bqinternet, 09-08-2010, 01:27 AM
Hi guys,

I'm starting to make some progress on figuring out why the data is out of order. I created an image file of the small 1GB OS partition, and I am using a combination of system tools and custom software (which I am writing on the fly) to figure out the puzzle. I will continue working on it through the night.
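
For those following along, the imaging step itself is nothing exotic; a rough sketch (the device path and filename here are placeholders) looks like this:

Code:
# take a raw image of the 1GB OS partition so it can be examined offline
dd if=/dev/sda1 of=/root/os-partition.img bs=1M conv=noerror,sync
# then poke around the image with standard tools
file /root/os-partition.img
hexdump -C /root/os-partition.img | less

The custom part is the software that scans the image for known structures to work out how the stripes were reordered.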

Posted by bqinternet, 09-08-2010, 06:05 AM
Update: We have determined how the stripes from the rebuilt raidset are out of order. The planned course of action is to clear the raidset metadata, have a technician physically swap drives so that they're in the expected order, and then recreate the raidset and volumes (again, as "No Init" to preserve the data). We expect to move forward within a few hours.

Posted by bqinternet, 09-08-2010, 09:09 AM
A technician is walking over to the server now. I should have another update shortly.

Posted by HostXNow, 09-08-2010, 09:48 AM
I just checked my account and it seems to be working fine. Either my account wasn't affected or the issue has already been fixed. Anyway, it's a good job I also keep backups with a second provider.

Posted by bqinternet, 09-08-2010, 09:57 AM
Quote:
Originally Posted by HostXNow
I just checked my account and it seems to be working fine. Either my account wasn't affected or the issue has already been fixed. Anyway, it's a good job I also keep backups with a second provider.
Only customers on this particular server are affected. In any case, I would agree with a recommendation to have multiple backups.

Posted by bqinternet, 09-08-2010, 10:19 AM
Hi guys,

There's good news and bad news. The good news is that the tech completed the task needed to repair the RAID array. The bad news is that one of the hard drives that wasn't touched changed to a Failed state right before the volumes were recreated. Looking at the logs, it appears that the failed drive is related to the original lockup. In short, additional work is still required to get this thing online, and I will have a new ETA available once I figure out the next steps.

Posted by bqinternet, 09-08-2010, 11:29 AM
The next step is to clone the failing hard drive to another hard drive. The RAID controller is marking it as failed because it is not passing self-tests, but I have confirmed that it is readable. I also confirmed that it is the same drive that the logs suggested was the initial cause of the problem. The drive cloning should start in the next 30 minutes and will then take about 4 hours to complete.

If anyone on this server would like a clean account on another server, email me at scott@bqinternet.com. That will at least allow you to keep your backups up to date while this server is being repaired.

Posted by bqinternet, 09-08-2010, 08:19 PM
To provide another update, most of the data has been copied from the failing drive, but there are some bad sectors that are causing the disk to freeze, requiring a power cycle. This is slowing down the copy, but it is still moving forward.

Posted by bqinternet, 09-09-2010, 04:41 AM
The duplication of the failing hard drive continues to make progress, and we were able to maintain stability by using a smaller block size. It is currently running on autopilot thanks to the GNU ddrescue software.
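
For reference, the cloning is being done in passes, roughly like this (the device paths and mapfile location are placeholders):

Code:
# first pass: copy everything that reads cleanly, skip the damaged areas for now
ddrescue -f -n /dev/sdb /dev/sdc /root/rescue.map
# later passes: revisit the bad areas with direct I/O, a small cluster size, and retries
ddrescue -f -d -r3 -c 1 /dev/sdb /dev/sdc /root/rescue.map

The mapfile is what lets ddrescue pick up exactly where it left off after each power cycle of the frozen disk.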

It is difficult to estimate the remaining time without knowing exactly how many bad sectors are on the hard drive. A healthy drive of this size would complete in a little under 4 hours, copying at 120MB/s. Unfortunately, when it encounters an area of the disk with bad sectors, it can take several minutes just to get through a few MB of data.
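
To put some rough numbers on it: for a hypothetical 1.5TB disk (the exact capacity doesn't change the point), 1,500,000MB at 120MB/s is about 12,500 seconds, or roughly 3.5 hours. By contrast, at a few MB every several minutes, even 100MB of badly damaged area can eat a few hours on its own.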

The number of bad sectors observed is significantly out of spec, and the uniform distribution of the bad sectors suggests that the read/write head may have collided with the platter during a seek operation, causing a scratch that spans many tracks. Normally a bad disk would simply be replaced in the running RAID array without any downtime, but the nature of the original RAID failure necessitates that this disk be repaired first.

Knowing the minimum amount of time that it will take this process to complete, I am going to take this opportunity to get a few hours of sleep. I will check on the status again later in the morning.

Posted by ZKuJoe, 09-09-2010, 05:37 AM
Thank you for the update. Is there any chance that clients affected by this can be moved to another server? I don't need my data on the disks, since the backups on there are already older than anything I'd have a use for.

Posted by bqinternet, 09-09-2010, 09:11 AM
Quote:
Originally Posted by ZKuJoe
Thank you for the update. Is there any chance that clients affected by this can be moved to another server?
Yes. I went ahead and created space for you on another server, and just sent an email about it. If you have any questions, just reply to the email.

Posted by ZKuJoe, 09-09-2010, 07:37 PM
Thanks! My backups are running again as normal.

Posted by bqinternet, 09-09-2010, 10:02 PM
Hi guys,

I've mostly been communicating by email, but I wanted to put an update here too. The ddrescue software has gone through one of the two bad parts of the disk, and is working on the second one now. Again, I can't provide a reliable time estimate since it depends on the number of damaged sectors, which is an unknown quantity, but forward progress is still being made.

I want to remind affected customers again that we can create new backup space on another server in order to get your backups running again. It only takes a few minutes to create, so please don't hesitate to request it by emailing me at scott@bqinternet.com.

Posted by bqinternet, 09-10-2010, 05:42 AM
We're within hours of finishing the hard drive duplication. There are now just a handful of bad sectors that cause the drive to freeze, and we are quickly narrowing them down to the exact sectors.

Posted by bqinternet, 09-10-2010, 10:52 AM
Wrapping up the hard drive duplication now, and about to begin testing in the next hour.

Posted by bqinternet, 09-10-2010, 01:48 PM
Testing is going well so far. I noticed a few minor things that need to be fixed, but nothing major.

Posted by bqinternet, 09-10-2010, 05:48 PM
The server will be booting up in about 2 hours.

Posted by bqinternet, 09-10-2010, 08:42 PM
Very good news. I'm currently booted into single-user mode on the server, with the RAID volumes intact.

Posted by bqinternet, 09-10-2010, 10:37 PM
The server is now running in read-only mode. Customers can log in and access their data. Writes are disabled while we run filesystem integrity checks.
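
For those wondering what that looks like in practice: each volume is mounted read-only so customers can reach their data, and writes only come back once its filesystem check is clean. Roughly (the device and mount point here are placeholders):

Code:
# give customers read-only access to a volume while the checks run
mount -o ro /dev/sdb1 /backup/vol1
# once the filesystem check passes, writes can be re-enabled
mount -o remount,rw /backup/vol1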

Posted by Steven, 09-11-2010, 01:52 AM
I have to say - this is how a company should run. Regular updates and a knowledgeable admin. This is why I recommend Scott to my customers.

Good job Scott.

Posted by TheServerExperts, 09-11-2010, 02:41 AM
Thumbs up for how this was handled...

Posted by bqinternet, 09-11-2010, 03:18 AM
Quote:
Originally Posted by Steven
I have to say - this is how a company should run. Regular updates and a knowledgeable admin. This is why I recommend Scott to my customers.

Good job Scott.
Quote:
Originally Posted by TheServerExperts
Thumbs up for how this was handled...
Thanks guys. You both know what a nightmare this type of server failure can be for an admin, especially when it's redundant hardware with so much data. Luckily the affected customers were patient and understanding, and I was able to spend my time repairing it.

I should have an official explanation of the failure this weekend. If any customer is unable to access the server, please email me at scott@bqinternet.com.

Posted by bqinternet, 09-11-2010, 12:13 PM
Copies of all of the filesystems are being made before making them writeable. My original intention was for these copies to act as a temporary backup during the repair work. Since the copies are being stored on a brand new server that is not yet being used for anything else, I have decided that I will just mount the image files on the new server, and leave the original data as the backup copy.
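
For the curious, serving a volume from its image file is straightforward; a minimal sketch (the file names and mount points here are placeholders) is:

Code:
# attach a volume image on the new server, read-only until its check completes
mount -o loop,ro /srv/images/vol1.img /backup/vol1
# after the filesystem check passes, switch it to read-write
mount -o remount,rw /backup/vol1

The originals on the repaired server then stay untouched as the backup copy.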

The short story is that everyone on this server is getting an early upgrade to our new platform:

Old server: Dual-core Opteron 2212HE, 8GB RAM, RAID6 SATA storage, 48-bit LBA
New server: Octo-core Opteron 6124HE, 16GB RAM, RAID6 SAS storage, 64-bit LBA

Posted by Dougy, 09-11-2010, 02:35 PM
Quote:
Originally Posted by bqinternet
Copies of all of the filesystems are being made before making them writeable. My original intention was for these copies to act as a temporary backup during the repair work. Since the copies are being stored on a brand new server that is not yet being used for anything else, I have decided that I will just mount the image files on the new server, and leave the original data as the backup copy.

The short story is that everyone on this server is getting an early upgrade to our new platform:

Old server: Dual-core Opteron 2212HE, 8GB RAM, RAID6 SATA storage, 48-bit LBA
New server: Octo-core Opteron 6124HE, 16GB RAM, RAID6 SAS storage, 64-bit LBA
Mmmm... good servers want

Posted by Dan_EZPZ, 09-11-2010, 05:09 PM
I'm very impressed with the way this was handled! It would have been much simpler to wipe the array and hand out new accounts but a lot of work was put in to recover all data and keep customers up to date.

Well done Scott.

Posted by woods01, 09-12-2010, 09:14 PM
We've been a customer of BQ for a little while now and haven't seen any issues.

If it's agreed that a backup needs a backup of its own, that seems like it should be the job of BQBackup and not the job of BQ's customers.

Why doesn't BQ purchase backup space to back up its backups?

Seems like a logical thought.

I know when I buy a car I don't buy two in case one breaks down, because I know I'll be provided with a rental until mine is fixed.

Posted by Steven, 09-12-2010, 10:23 PM
Quote:
Originally Posted by woods01
We've been a customer of BQ for a little while now and haven't seen any issues.

If it's agreed that a backup needs a backup of its own, that seems like it should be the job of BQBackup and not the job of BQ's customers.

Why doesn't BQ purchase backup space to back up its backups?

Seems like a logical thought.

I know when I buy a car I don't buy two in case one breaks down, because I know I'll be provided with a rental until mine is fixed.
It would be a never-ending cycle: BQBackup buys backups from someone, and then who does that someone buy backups from?

Besides - no data was lost.

Posted by bqinternet, 09-12-2010, 11:05 PM
Quote:
Originally Posted by woods01
Why doesn't BQ purchase backup space to back up its backups?
We get that question from time to time. For us to keep another backup of the backup would basically double the cost to provide the service. We would need twice as many servers, twice as much power, twice as much rack space, twice as much maintenance, etc. For that reason, we leave it up to the customer to decide how extensive their backup policy will be. To use your car analogy, the car maker doesn't make an extra car for every one that they sell.

Posted by t325, 09-12-2010, 11:26 PM
When will the servers be online and fully operational?

Posted by bqinternet, 09-12-2010, 11:29 PM
Quote:
Originally Posted by t325
When will the servers be online and fully operational?
The server is online and accessible to users, but some of the volumes are in read-only mode while filesystem checks continue to run. If you email your username to scott@bqinternet.com, I can tell you which volume you are on.

Posted by spaethco, 09-12-2010, 11:52 PM
Quote:
Originally Posted by woods01
Why doesn't BQ purchase backup space to back up its backups?

I know when I buy a car I don't buy two in case one breaks down, because I know I'll be provided with a rental until mine is fixed.
To use your car analogy, do you have a spare tire for your spare tire?

Backups are, by definition, not primary storage. Loss of a backup should be non-critical.

Posted by bqinternet, 09-13-2010, 12:34 AM
Reason For Outage

Summary

On September 7, 2010, an unusual equipment failure affected the service of a small portion of BQBackup's customers. Working around the clock, we were able to restore service without loss of the data.

What happened?

The storage controller in the backup server experienced a lockup. The controller's battery backup unit kept the controller in a frozen state, so a technician was dispatched to disconnect and reconnect the battery. Upon starting the server again, the storage controller did not recognize the existing RAID volumes.

Since the data could not be used as-is, the remainder of the day was spent verifying that the raidset and volumes could be rebuilt without data loss. We dispatched a technician again to complete the procedure. Unfortunately, just before the process was completed, the storage controller suddenly changed a hard drive to a Failed state, requiring additional repair work.

It was determined that the failing hard drive could still be read, so it was moved to another server, and the data was cloned to another hard drive. The majority of the data was quickly copied within hours. For small sections of the disk that were badly damaged, extensive manual interaction was required.

Once the damaged hard drive was duplicated, the good copy was put back in the server, and we were able to successfully boot the server and provide customer access to the data.

Why did the hard drive fail?

Based on observations during the duplication step, we suspect that the hard drive platter was physically damaged by coming in contact with the drive head. In several parts of the disk, the bad sectors are uniformly spread, suggesting that this happened during a seek operation, which damaged data on multiple tracks as the head moved.

Why did the storage controller fail?

We suspect that the nature of the hard drive failure caused the storage controller to execute a code path that perhaps was not well tested by the vendor. A review of controller logs suggests that it locked up shortly after logging a string of errors that were related to the failing hard drive.

If the storage system is redundant, why did it affect service?

In this particular case, there were effectively two simultaneous failures which interacted in such a way that the server went offline.

Is this type of failure common?

BQ Internet has not experienced a complete RAID failure in the past. We have used the same storage platform on multiple racks of servers for the last 5 years, and it has proven to be a reliable platform.

What is the current status?

The server is online and is accessible to customers. For safety reasons, some volumes are mounted in read-only mode while filesystem checks complete. Customers on those volumes can read the data, but cannot yet update it. Temporary space on other servers is available for customers that would like to update their data immediately.

Is the same server being used?

After booting the repaired server in read-only mode, we began sending a copy of each volume to another server. The other server was previously empty, and is part of the deployment of our new, modernized storage platform. As each volume completes its filesystem check, the copy on the new server will become active. When the volume is mounted as writeable, it will be running from the new server. The old server will then be retired.

How will future server failures be prevented?

While hardware failures can never truly be prevented, BQ Internet takes many steps to minimize their impact. Over the years, our redundant systems have handled many potential issues without loss of service. Some of the steps that we take are listed below:
  • We use dual-parity RAID storage. Two hard drives can fail without loss of data.
  • We use battery backup units for the storage controllers
  • We use redundant power supplies in our storage systems
  • We use server-grade hardware, including error-correcting RAM
  • We use high quality, carrier-neutral datacenters

Conclusion

While it is unfortunate that this hardware failure occurred, we are happy to report that we were able to repair it without data loss. Any customer with a lingering issue should feel free to email us at support@bqinternet.com.

Posted by Coolraul, 09-13-2010, 12:51 AM
Handled exceptionally well. Kudos. Now go get some sleep

Posted by HostXNow, 09-13-2010, 06:04 AM
Good work, Scott.

When will the accounts that weren't affected be moved to the new upgraded servers? Or do we just put in a request for this?

Posted by bqinternet, 09-13-2010, 06:13 AM
Quote:
Originally Posted by HostXNow
Good work, Scott.

When will the accounts that weren't affected be moved to the new upgraded servers? Or do we just put in a request for this?
I haven't made any formal announcements about the new platform yet. I was planning on doing it after all the upgrades have been completed. We've been working on it for a few months, and I expect it to be completed by mid-October.

Posted by HostXNow, 09-13-2010, 06:16 AM
Quote:
Originally Posted by bqinternet
We've been working on it for a few months, and I expect it to be completed by mid-October.
Ok, good stuff.


