

JaguarPC 2+ days downtime/Lessons learned




Posted by ReZa, 12-26-2010, 01:16 PM
One of my hybrid servers (100GB space, 4GB RAM) hosted with JaguarPC has been down since December 24, and I am desperately waiting for them to restore a server image and its containers.

There are many lessons to be learned from this holiday gift from JagPC; let me first review what happened.

Phase 1: Dec 23, 13:50 to Dec 24, 07:02

The VPS had been working for a very long time with no traffic spike and no new websites hosted on it. Everything was fine until I noticed that my VPN to the server had disconnected, the DNS server was not working properly, and some sites could not resolve.

I checked the server and noticed a high load on the VPS. I contacted support about the issue, and they told me the problem was related to my own services, especially the MySQL service.

Based on their response, I spent more and more time troubleshooting and stopped almost all services, confirming that nothing on my side was responsible.
I asked them a couple of times to check whether other servers were affecting mine, as iowait was so high that it pushed the load to around 8 to 15.
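
For reference, this is the kind of check that convinced me the load was iowait-bound rather than caused by my own services. A minimal sketch, assuming a Linux VPS that exposes /proc (the 5-second sampling interval is arbitrary):

Code:
# Sample /proc/stat twice and report the share of CPU time spent in
# iowait, alongside the 1-minute load average.
import time

def cpu_times():
    with open("/proc/stat") as f:
        # First line: "cpu user nice system idle iowait irq softirq ..."
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_times()
time.sleep(5)  # sampling interval
after = cpu_times()

deltas = [a - b for a, b in zip(after, before)]
iowait_pct = 100.0 * deltas[4] / sum(deltas)  # field 5 of /proc/stat is iowait

with open("/proc/loadavg") as f:
    load1 = f.read().split()[0]

print(f"1-min load: {load1}, iowait: {iowait_pct:.1f}% of CPU time")
# A load of 8 to 15 with iowait dominating points at the disk
# subsystem, not at runaway application processes.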

This phase took around 17 hours. I sent 12 tickets and got 4 unhelpful responses, including one sincere apology for support delays due to their high workload. The first ticket took an hour to get a response; the others took longer.

Phase 2: Dec 24, 07:02 to Dec 25, 10:42 PST

After 7 hours of waiting for a support reply, they told me they had found a bad HDD on the server and that hardware replacement was in progress. The server was up for around 3 hours after that, and the load increased to around the 25 mark.

Finally everything went down and a forum thread was created on the issue:
http://www.jaguarpc.com/forums/showthread.php?t=25762

The typical support reply in this phase was: we are sorry, please be patient, we have no more information, please wait for the forum to be updated.

Forum update excerpts:
4 hours' delay: we identified a bad HDD (Dec 24, 10:42 AM)
1 hour later: we replaced the bad drive, but the server didn't boot
3.5 hours later: we identified another bad HDD; it needs a manual fsck, and we hope to bring the server back online once the fsck completes
2 hours later: fsck still in progress (after a couple of tickets begging them to post progress updates every half hour)
1.5 hours later: fsck still in progress
1.2 hours later: fsck completed, but the server again failed to boot
1.5 hours later: we are still working on the issue, kindly hold on
3 hours later (Dec 25): NOCC engineers are working on the issue, rebuilding the RAID array offline
3 hours later: RAID rebuild (4-disk RAID 10) 50% complete
2 hours later: rebuild failed; next plan: create a new RAID array and restore the server image
4 hours later: we are configuring a new server for re-install; the backup image is dated Dec 22!!
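
As an aside on the rebuild updates above: on Linux software RAID, rebuild progress can be watched directly rather than waiting for forum posts. A minimal sketch, assuming mdraid and its /proc/mdstat interface; a hardware controller would need the vendor's CLI instead:

Code:
# Poll /proc/mdstat for rebuild/resync progress. Assumes Linux
# software RAID (mdraid); hardware controllers need vendor tools.
import re
import time

def rebuild_progress():
    with open("/proc/mdstat") as f:
        text = f.read()
    # Progress lines look like: "recovery = 12.6% (...) finish=107.2min"
    m = re.search(r"(?:recovery|resync)\s*=\s*([\d.]+)%.*?finish=([\d.]+)min", text)
    return m.groups() if m else None

while True:
    progress = rebuild_progress()
    if progress is None:
        print("no rebuild in progress")
        break
    pct, eta_min = progress
    print(f"rebuild {pct}% complete, ~{eta_min} min remaining")
    time.sleep(60)  # poll once a minute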

Phase 3: Dec 25, 10:42 PST to now (Dec 26, 9:00)

Around 24 hours have passed with no forum update. I have sent a couple of tickets; they are polite as always and just ask me to wait.
I asked them to restore my VPS individually; they said no, because the backup queue may be needed by the main server restore in the meantime.
I also asked about my CDP backups, and they told me the CDP backups had been destroyed too, and that what I have been paying for was only the CDP control panel.

Because of this Xmas gift, I am exhausted, under pressure from my customers, and have nothing to tell them.

What should I do in your opinion?
What lessons do we learn from this?

I had a good experience with JPC before this long downtime, and I believe I chose a reputable company with all the prerequisites of a reliable server, but what really matters is how emergencies like this are handled.

They claim 12 years in the hosting business, 100% network uptime, a hardware replacement SLA, RAID 10, and an exclusive client dashboard.

Does any of that really matter when they have 50 or more hours of downtime?
What could JPC have done to prevent all this downtime?
What do other companies do in similar situations?
Which strategy works best in these scenarios?

Posted by JSCL, 12-26-2010, 03:01 PM
fsck can take a while - it depends how big the server is. The rebuild, too, can take a while.

Hardware failures happen, and they can be damning, drawn-out experiences - but I can tell you that you can have faith in the Jag guys. They know what they are doing and are very experienced in this market.

Posted by ReZa, 12-26-2010, 03:25 PM
We all know that fsck can take a while, but the fsck was run twice on a 4-disk RAID 10 array with 2 (later 3) bad HDDs!! After that, about 30 hours were spent setting up a new hardware node.
They just updated the status page and mentioned a change of tactic because of complications. I'm just hopeful that their server image is a verified backup and actually works.
Experience should bring value. For example, they should have been monitoring their RAID arrays for failures instead of discovering 3 HDD failures at the same time (see the sketch below).
They should have spare RAID arrays and servers ready to serve as replacements and as restore destinations for backup containers, to make good on their hardware SLA.
They should have better response times in their ticket system; an average of 2 to 3 hours is very bad.
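
A minimal sketch of the kind of drive-health polling I mean, assuming Linux with smartmontools installed and root access; the device list is purely illustrative. SMART alone can miss a marginal drive, so this complements controller-level consistency checks rather than replacing them:

Code:
# Poll SMART health on a set of drives and flag anything not passing.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]  # illustrative 4-disk array

for dev in DEVICES:
    # `smartctl -H` prints an overall health self-assessment and exits
    # non-zero when the drive reports, or is predicted to be, failing.
    result = subprocess.run(["smartctl", "-H", dev],
                            capture_output=True, text=True)
    if result.returncode != 0 or "PASSED" not in result.stdout:
        # A real setup would page someone here rather than just print.
        print(f"ALERT: {dev} did not pass its SMART health check")
    else:
        print(f"{dev}: OK")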

There are many other things I think should be improved. I agree with you that they are among the big names, and I believe in their expertise, but their server monitoring, disaster recovery, and emergency plans need improvement.

I don't want to blame JagPC; I have no choice but to wait for the server restore to finish. I want to know what lessons you see in this downtime. The lessons learned may help us all provide better services as business owners or make better decisions as customers.

Posted by ReZa, 12-26-2010, 05:16 PM
I got the following informative reply from Nick, a JPC technical support representative. He permitted me to share his viewpoints with others on WHT:

Quote:
Our RAID arrays are monitored, but SMART had not flagged the drive as bad. What was happening was that every once in a while the drive would have an issue and force the RAID controller to rewrite whatever it was writing to the disk. This was causing disk IO on the system to be a lot slower than normal.

Due to the number of processes waiting on disk IO, this resulted in a higher load than normal. Eventually one of our technicians determined that one of the drives was failing and scheduled it for replacement. A second drive then failed under the increased disk IO and caused the system to freeze.

As this system was RAID 10, the data should still have been intact, so we replaced the two drives and started the rebuild. My best guess is that the drive caused the RAID controller to crash (which then caused the system to hang), which created a consistency issue because the controller did not finish flushing its cache.

We ran an fsck while the rebuild was running, which failed to repair the damage because the problem was at the block level.

We then resorted to having the drives replaced, which, due to the holidays, took an hour. We then found out that the R1Soft recovery CD would not detect the NIC. We decided to reinstall the OS on the hardware node and then restore the containers. This failed as well; we were not able to get R1Soft to restore to the system.

Eventually we fell back to trying a different NIC in the system and starting the restore from the recovery CD again, which worked this time (so far, anyway).

Which brings us to where we are now: waiting for the data to restore. This is my point of view of events, and I have to disclose that I wasn't around for most of it, so things may have happened a little differently than I have described.

What have we learned from this?
I'd have to say that our notes are not detailed enough, as I had trouble writing a detailed description of events. It might not have been useful in this case, but I'll probably end up writing a general outline of what to examine and note when approaching these issues.

Consider testing the R1Soft recovery CD on new hardware builds before putting them into production. I haven't seen this fail before, but I'd rather not see it fail again either.

Posted by cscarlet, 12-26-2010, 05:47 PM
It's not a good time for a hardware failure. I would suggest having a hot-spare HD on standby in the array, but I doubt this would have helped prevent any further issues.

Posted by AdmoNet, 12-26-2010, 08:17 PM
This really sounds like their array is to blame. Depending on the storage controller vendor, regular disk consistency checks (verifies) should be run. Bad sectors can exist on disks in areas that have not yet been written to, and this can easily cause double-disk faults.
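
To make that concrete: on Linux software RAID a consistency check (scrub) can be kicked off through sysfs, as in the minimal sketch below. It assumes an mdraid array at md0 and root privileges; hardware controllers expose the same idea as a scheduled patrol read or verify in the vendor's CLI:

Code:
# Trigger an mdraid consistency check and report mismatches afterwards.
import time

MD = "/sys/block/md0/md"  # illustrative array name

# Writing "check" makes the kernel read every sector on every member
# disk, surfacing latent bad sectors before a rebuild depends on them.
with open(f"{MD}/sync_action", "w") as f:
    f.write("check\n")

# Wait for the scrub to finish; a real cron job would just check later.
while True:
    with open(f"{MD}/sync_action") as f:
        if f.read().strip() == "idle":
            break
    time.sleep(60)

with open(f"{MD}/mismatch_cnt") as f:
    mismatches = int(f.read().strip())

print("scrub clean" if mismatches == 0 else f"{mismatches} mismatched sectors found")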


Quote:
Originally Posted by ReZa
We all know that fsck can take a while, but the fsck was run twice on a 4-disk RAID 10 array with 2 (later 3) bad HDDs!! After that, about 30 hours were spent setting up a new hardware node.
...

Posted by LesJPC, 12-27-2010, 02:51 PM
Reza, I've updated you already, but I wanted to post here as well that this is being worked out for you.


