SolarVPS :: WinVPS35 Hardware failure/outage

Posted by JanusMahzon, 02-07-2010, 12:46 AM
A VPS server had a RAID array failure, and the host has taken over a day to get it back up (it still is not up). They said it may need a bit-by-bit restore that will take several hours.

Is this legit?
Posted by enotchnet, 02-07-2010, 01:21 AM
It depends on whether your service provider has direct access to the servers. There may be a delay in communicating progress to you, since they have to get the information from their provider(s).

Based on the length of time, it sounds like a RAID 5+ failure, and the provider may have to restore some of the information from backups or with data recovery tools, depending on the situation. This process can take many hours in our experience. However, each situation is unique, and restore times vary by RAID level, amount of data and server hardware specifications.
Posted by IGXHost, 02-07-2010, 01:39 AM
RAID-5 always scares me especially when it's for a designated hosting environment. It is probably legit. I believe if it was RAID-10 it may have been much quicker to fix.
Posted by TonyB, 02-07-2010, 01:59 AM
Well, when I hear RAID failure I think the entire RAID was lost, thus they're using backups now to fix it. It does not make sense otherwise, as typically with servers you're either swapping the bad drive on the fly or you're shutting it down, replacing the drive, then bringing it back up. In both cases it will rebuild in the background whether it's RAID 1, 5, 6, 10, etc.

So I'm leaning towards them having to use backups which could take quite a while depending on the amount of data and the type of backup systems they have in place.
Posted by borgdrone7, 02-07-2010, 03:54 AM
Yeah, I guess somehow their complete RAID array failed, because otherwise they could just swap the failed HDD and rebuild the array. I guess in that case there is a possibility you will not get your most recent data.
Posted by SolarVPS|Justin, 02-07-2010, 04:02 AM
Greetings!

I am guessing based on the specific terminology used that you're referring to our WinVPS35 RAID failure.

To clarify: This occurred on a 500GB+ RAID10 array.

We certainly do wish that this was a simple single drive failure, as we'd probably have had it fixed within an hour or two of it failing, at most. Unfortunately, this escalated into corruption within the drives, so the problem encompasses the entire array.

After several attempts to rebuild the partition tables and boot the node, we've decided that the only option we have is to extract the data from the array itself and rebuild from there.

Normally, we'd use our own disaster recovery backups to rectify the situation. However, this particular node is one of the very few nodes left that are not yet on our internal backup system, which is still in the process of being fully implemented (progress is estimated at about 95% completion on our entire network, which spans 7 facilities in 5 cities and several dozen VLANs). When finished, our backup system will take full disaster recovery backups of each individual container on a nightly basis, which are then stored on a dedicated backup node. Due to a few bugs in our backup system, however, full implementation has not yet been achieved. Thus, our only course of action is to extract the data from the drives the hard way.

We certainly don't wish to hide the fact that this is a major failure, and we're certainly not going to hand our customers veiled half truths on the situation. In fact, I'd say that this is probably the most severe hardware failure we've experienced as a company.

We fully expect this to take several hours to get the data into a bootable state that can be used and be stable. First, the data must be extracted from the failed drives and checked for further corruption. This process is highly time consuming, and since we're looking at between half and three quarters of a terabyte of data, the time needed compounds on itself. Then, once we have the data, we need to restore the RAID10 array itself to bring the node up to spec.
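As a rough illustration of why a copy of this size takes hours, here is a minimal Python sketch that estimates the time for one sequential pass over the data. The throughput figures are assumptions chosen only for illustration, not measurements from this node, and `transfer_hours` is a hypothetical helper name. Reads from damaged drives are often much slower than from healthy ones, and verification adds further passes.

```python
def transfer_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy size_gb gigabytes at a steady throughput_mb_s.

    Uses decimal units (1 GB = 1000 MB), as drive vendors do.
    """
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = size_gb * 1000 / throughput_mb_s
    return seconds / 3600


# Data sizes come from the post (roughly 500 GB to 750 GB).
# The throughput values are assumed, not taken from the thread.
for size_gb in (500, 750):
    for mb_s in (10, 50):
        hours = transfer_hours(size_gb, mb_s)
        print(f"{size_gb} GB at {mb_s} MB/s: about {hours:.1f} hours")
```

Even under the more optimistic assumed rate, a single pass takes a few hours, before any corruption checks or the rebuild of the array.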

Our CEO is actually at the facility right now working on this process as we speak along with two of our remote technicians who will be taking over and monitoring the data import.

We have not yet 100% verified the cause of this failure, though we suspect that it has to do with the RAID controller in the node itself. We'll be examining the situation once we have gotten everyone back online, as that is our number one priority.

We are also very frustrated with the situation and can understand if any of our customers share that frustration.

Additionally, we would be more than happy to build you a new container on another node with the same IPs if you happen to have a backup of your data that can be imported. That would ultimately be the fastest means of getting you back online that we currently have.

We appreciate your understanding in the matter. It has been and will be a very long night for us. Good thing I have a sizeable stash of coffee.

Thanks!
Posted by SolarVPS|Justin, 02-07-2010, 07:55 PM
Unfortunately, the data on this server has become a complete loss. The index tables on the three disk drives are corrupted beyond repair. As such, all semblance of data structure has been entirely lost.

The below notification has been sent to all of the customers on WinVPS35:

Quote:

Greetings,

As you are most likely aware, our WINVPS35 Windows VPS node suffered a severe hardware failure early Saturday morning. After thorough investigation, we’ve determined that the incident was caused by multiple drive failures in the server’s RAID10 array (3 drives, to be exact). Using RAID10 technology, 2 independent RAID1 drive arrays are striped for speed. This means that in a RAID10, up to 1 drive from each RAID1 can be lost while maintaining operational status. Unfortunately, given that there were 3 drive failures, the RAID10 became unusable. Please note that a failure like this is *extremely* rare.

Given that this node was from an era before we provided automated daily backups for our customers, there were unfortunately no backups of its data. We are in the process of expanding our backup network to ultimately provide backups for all of our older machines; however, this is a costly and time-consuming process. Please note that we did not offer or guarantee any backups for our older HSPc connected VPS systems (purchased before September 2009). While a good number of the nodes in that older system are currently being backed up, the WINVPS35 was unfortunately not among them.

We’ve been attempting several recovery methods over the last 24 hours, which most of you have been notified about. I won’t go into details about those methods at this point as I believe you’ve all been made aware of them. Unfortunately, our efforts have not yielded any results, and so at this time we are pronouncing the data on that server a 100% loss. While this is devastating for us, we understand that it is even more devastating for all of you. Hardware failures are unfortunately a part of the hosting business, but that doesn’t make them any less difficult to deal with.

We sincerely value your business and we want to do what we can to help you through this. Every customer affected by this will be offered 2 free months of service on a comparable service plan in our new system. Our new system utilizes hardware that has been deployed in the last year and includes automated daily backups. We will also offer our technical support services to you, whether you are managed or unmanaged, to help you restore your servers. There are many choices out there for virtual server providers, and while we understand that you may have lost faith in our service, please understand that these failures do happen to every provider. We want to make sure that each and every one of you is taken care of and that we can offer you automated daily backups going forward.

If you would like to take advantage of this, please contact sales@solarvps.com with your customer and service plan details so that we can begin setting you up with a new system right away.

Thank you for your understanding.
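
The RAID10 fault tolerance described in the notification above can be checked with a short Python sketch. It assumes the layout the notification describes (two RAID1 mirror pairs striped together, so four drives in total). The drive names and the helper functions `raid10_survives` and `surviving_fraction` are hypothetical and used only for illustration. The sketch enumerates every combination of failed drives and counts how many leave at least one working drive in each mirror pair.

```python
from itertools import combinations


def raid10_survives(failed, mirror_pairs):
    """Return True if every mirror pair still has at least one working drive."""
    failed = set(failed)
    return all(not set(pair) <= failed for pair in mirror_pairs)


# Assumed layout: four drives, two RAID1 pairs striped together.
# Drive names are placeholders, not taken from the thread.
PAIRS = [("d0", "d1"), ("d2", "d3")]
DRIVES = [drive for pair in PAIRS for drive in pair]


def surviving_fraction(n_failed):
    """Count failure combinations of n_failed drives that the array survives."""
    combos = list(combinations(DRIVES, n_failed))
    ok = sum(raid10_survives(combo, PAIRS) for combo in combos)
    return ok, len(combos)


for n in range(1, 4):
    ok, total = surviving_fraction(n)
    print(f"{n} failed drive(s): array survives in {ok} of {total} cases")
```

Under this assumed four-drive layout, any three failures necessarily include both drives of one mirror pair, which is consistent with the array becoming unusable.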