Steadfast Networks Outage

Posted by KarlZimmer, 09-12-2010, 02:12 PM
Letter from our President/CEO Karl Zimmerman:
We fully understand the severity of this situation, that this negatively affects your business, as this is our business, providing reliable connectivity and data center services. Our own business is also damaged greatly by these types of events, we fully understand that and feel that our stellar uptime performance up to this point is a testament to that. In this case, I am gravely sorry that we have let you down and have not lived up to our promises. What I can do, is assure you that we are working as quickly and efficiently as possible on this matter.
At the beginning of the project we had assured all of our in-house engineering staff was on-site for the maintenance, along with 3rd party network engineering for support and additional supervision of the project. We thought we had everything prepared and had spent weeks in configuration and testing, but it appears we were wrong. I don't need to tell you, but things did not go smoothly. Overall, we faced many instances of undocumented differences in the handling of standardized protocols. During this time, we worked extensively with both Cisco and Brocade engineers. At this point, it is too early to tell whether it is Cisco or Brocade at fault for those issues, but in the end, we understand that the network reliability is fully our responsibility, thus we are to blame.
Once this maintenance is finally completed, we will have significantly improved capabilities, performance, and expandability. With the gear we'll have another maintenance like this would not be required for a decade, if not more. You have nothing to fear of this being a continuous issue or a recurring event. It was a one-time major network infrastructure improvement.
I am asking you to please stick with us through these times. We have provided robust and friendly service up to this point, don't let this one incident, even though it is a huge incident, destroy the quality business relationship we have together. If you help us through this time with your understanding, we can assure you we are still committed to providing a steadfast and reliable service and this will pay off long-term dividends.
- Karl Zimmerman
I agree, this outage is unacceptable. I am posting this here myself so we can accept our blame and take our shots. We deserve it for this one. You cannot understand how bad I feel for this and how sorry I am for the customers this has hurt.
Posted by Mekhu, 09-12-2010, 02:25 PM
Karl, I'm not here to whine or complain. Your server has been AMAZING over the years. We're more curious when we can expect a perm. fix to this?
Posted by David, 09-12-2010, 03:03 PM
Alright, enough of the ****ing circus act. Facility is 100% offline again.
Posted by PersonalJ, 09-12-2010, 03:09 PM
I thought the maintenance was only related to future hosting, I was not aware all of steadfast.net had lost connectivity. It's a bit strange when my services at FDC have better uptime than the VPS at steadfast...
Posted by AHDOnline, 09-12-2010, 03:10 PM
I'm sorry, but this is the most stupid line ever: "Once this maintenance is finally completed, we will have significantly improved capabilities, performance, and expandability. With the gear we’ll have another maintenance like this would not be required for a decade, if not more. You have nothing to fear of this being a continuous issue or a recurring event. It was a one-time major network infrastructure improvement." Don't worry, you won't have 15 hours of downtime again, so that makes this one alright.. Your updates need work; they are horrible. And it took YOU 12 hours to say anything.. Last edited by AHDOnline; 09-12-2010 at 03:18 PM.
Posted by VINAX, 09-12-2010, 03:10 PM
The network is down again. What's going on? Edit. Looks like it's back up now! Hope everything is restored ASAP. Last edited by VINAX; 09-12-2010 at 03:24 PM.
Posted by zenex5ive, 09-12-2010, 03:22 PM
Did anyone get their maintenance notice?
Posted by AHDOnline, 09-12-2010, 03:24 PM
It was said to have gone out on August 23rd, I think.
Posted by Mike343, 09-12-2010, 03:32 PM
So it seems the network went down a while ago and it is coming close to 9 hours, at least for me? Last edited by Mike343; 09-12-2010 at 03:35 PM.
Posted by RodrigoBR, 09-12-2010, 03:34 PM
My two servers have been down for hours, almost all day now (with only brief moments of the servers working). I was aware of the maintenance, but now I'm having serious problems because of this long delay. Too much downtime = unhappy customers = less money for me. I know that Steadfast was doing a good job in the last months, but this is a big problem now. Waiting for a final resolution... Best Regards, Rodrigo
Posted by AHDOnline, 09-12-2010, 03:35 PM
you're lucky if you are only noticing 9 hours..
Posted by drmorley, 09-12-2010, 03:54 PM
Here's my traffic monitor--looks like it's been bouncing since 6am CST. Attached image: traffic monitor.jpg
Posted by SuperJ, 09-12-2010, 03:58 PM
Does anyone know what's going on with FutureHosting in Chicago? My server has been failing for at least six hours. I've tried to contact admins but there's no response from them. FTP seems to work but is very sluggish...
Posted by AHDOnline, 09-12-2010, 03:58 PM
well I guess its how you look at it. My first alert of downtime was at 2:26 with a FEW very small uptime alerts. It looks like the majority of the servers did go offline around 5am EST
Posted by David, 09-12-2010, 03:59 PM
They're utilizing steadfast.net which has been on and off since midnight see: http://www.webhostingtalk.com/showthread.php?t=980579
Posted by futurehosting, 09-12-2010, 03:59 PM
There aren't any open tickets - if you submitted one, you should have received a response within a couple minutes, but yes, take a look at this thread for information: http://www.webhostingtalk.com/showth...wpost&t=980579
Posted by RobbyHicks, 09-12-2010, 04:12 PM
What's the deal with SLA Credit on this?
Posted by SuperJ, 09-12-2010, 04:17 PM
No, I haven't gotten any response. This is a copy of my email:
Server not responsive.
Sunday, September 12, 2010 2:31 PM
From: "Jacek Nabywaniec"
To: "Future Hosting"
Hello, My sites do not respond to http calls. FTP connected but sluggish. www.hockey.pl www.polville.com www.photojack.info
It's been an hour and a half since the email but I have a guy working for me who says the server has been down since about 10:00 EST. I understand that it was more unreachable than down. Anyways, you guys run your service well, overall. I'd expect you to put a message on futurehosting.com where I could read a few lines about when the service should be back up and things would be a lot simpler.
Posted by bmhrules, 09-12-2010, 04:18 PM
This Service Level Agreement does not cover outages due to scheduled or emergency network and/or facility maintenance, which will be broadcast to all customers in advance via the web page at https://support.steadfast.net/?_m=news&_a=view, and will not exceed 60 minutes per month. So I guess that since it was down for more than one hour, we get credit for it?
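The clause quoted above caps maintenance downtime at 60 minutes per month. A minimal sketch of that arithmetic follows, assuming an illustrative outage window from midnight to 6:00 PM on the same day. The function name and the window are hypothetical, and the quoted clause gives no credit formula, so this only measures how far an outage runs past the allowance. It does not say whether or how much credit is owed.

```python
from datetime import datetime

# Monthly maintenance allowance taken from the quoted SLA clause.
MAINTENANCE_ALLOWANCE_MIN = 60


def minutes_over_allowance(start: datetime, end: datetime,
                           allowance_min: int = MAINTENANCE_ALLOWANCE_MIN) -> int:
    """Return whole minutes of downtime beyond the maintenance allowance."""
    if end < start:
        raise ValueError("end must not precede start")
    downtime_min = int((end - start).total_seconds() // 60)
    # Downtime inside the allowance counts as zero excess.
    return max(0, downtime_min - allowance_min)


# Hypothetical window, for illustration only: 12:00 AM to 6:00 PM.
start = datetime(2010, 9, 12, 0, 0)
end = datetime(2010, 9, 12, 18, 0)
excess = minutes_over_allowance(start, end)
```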
Posted by futurehosting, 09-12-2010, 04:21 PM
If you log in to the customer portal, you will see the announcement listed in bright blue actually. We don't have any tickets under your account - tickets can't be opened via e-mail. If you go to my.futurehosting.com you will be able to log in and submit a ticket. I'll open a ticket under your account so you can respond to that via e-mail.
Posted by GCM, 09-12-2010, 04:21 PM
Your sites seem to be loading now. Best of Luck to FutureHosting and SteadFast recovering.
Posted by David, 09-12-2010, 04:28 PM
None of them load here, along with 85% of the planet. They're working in extremely limited areas.
Posted by IGXHost, 09-12-2010, 04:29 PM
Ah, that explains why our VPS with them was going up and down.
Posted by Mr Terrence, 09-12-2010, 04:30 PM
The whole network is still down? how long was it down for?
Posted by SuperJ, 09-12-2010, 04:34 PM
Nothing works for me. Not even FTP. I'm in Toronto. My guy in Europe says the service is blazing fast.
Posted by GCM, 09-12-2010, 04:36 PM
Ah, I was only testing from our San Francisco network and Detroit VPN. (Pretty fast might I add) Attached image: hockey-pl.png
Posted by AHDOnline, 09-12-2010, 04:36 PM
Man, see how many times you could have implemented a back-out plan? Assuming there was one...
Posted by Deathspawner, 09-12-2010, 04:39 PM
This downtime hurts, big time. I'll be the first to say that my experience with Steadfast up to this point has been nothing short of incredible, but this kind of downtime is hard to bear. Worse still, I didn't receive an e-mail regarding this maintenance at all. At least if I had a heads-up, I wouldn't have been scurrying all over the place wondering what's going on. The official site should have been kept up to date, not just this forum thread. Not everyone immediately heads here to get updates.
Posted by AHDOnline, 09-12-2010, 04:43 PM
They did sorta, kinda, pretend to have an update page. If you can still get to the site at support.steadfast.net, there were a few updates along the lines of "it's not working, no ETA", then three hours later: "not working, no ETA".
Posted by jlaws, 09-12-2010, 05:00 PM
I don't delete any email and I don't filter any out either. I just did a search for all Steadfast emails and there was not a notice of this sent to me. Some downtime is understandable for the infrastructure upgrades taking place but this has gotten a bit ridiculous. Hopefully service is returned within the next couple of hours.
Posted by cshenderson, 09-12-2010, 05:07 PM
I've been extremely happy with Steadfast .. until now. It's not necessarily the outage; as we all know, crap happens, and sometimes the backout plan leads to more long-term delays and huge overhead costs. What is disappointing is that Steadfast, like all other companies we critically rely on (like Covad or even Comcast Business), just will not set up a warning system. In the age of Email, IM, Text... a dozen and a half different ways of contacting people, nobody will do it; even power companies rely on people calling in to inform them something went terribly wrong during maintenance. Learning of the outage by one of my clients calling me is just plain BAD. Me coming into my small business only to learn that our ISP did maintenance an hour before and is still down is BAD. Why oh why can't we get decent warnings from companies (not just Steadfast) instead of having to constantly troll their boards for maintenance schedules?
Posted by HD-Sam, 09-12-2010, 05:10 PM
16 Hours and counting .... We cannot access our servers at Steadfast. Their site is down, support site is down, and their phone lines are down (at least for me, at some ISPs their site still up it seems). The premium we are paying for this is outrageous.
Posted by VINAX, 09-12-2010, 05:16 PM
Did you subscribe to the news at their support system? https://support.steadfast.net/index.php?_m=news&_a=view If not, enter your email and submit it.
Posted by SuperJ, 09-12-2010, 05:17 PM
Any idea what all this means, guys? I don't mean to be a pest but Sunday is prime time for me. Hockey games have finished hours ago and there's no stats because I can't access anything from Toronto, while my audience in Poland can. I'm looking pretty bad right now. Any idea when Toronto traffic will make it through?
Posted by VINAX, 09-12-2010, 05:20 PM
Their site and support system didn't go down because they are hosted on a different network.
Posted by GCM, 09-12-2010, 05:21 PM
No idea, what you can do is use a proxy in the meantime. http://www.hidemyass.com/proxy-list/
Posted by jlaws, 09-12-2010, 05:23 PM
My service is now back online. YEAH!
Posted by Deathspawner, 09-12-2010, 05:24 PM
My site is also back online. Whew.
Posted by jlaws, 09-12-2010, 05:25 PM
It was up for me earlier today but has been down for about 2 hours now (both their main site and their support site). Neither DNS of steadfast.net nor support.steadfast.net resolve or respond to ping.
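A check like the one described above (whether the hostnames resolve at all) can be scripted. The following is a minimal sketch, assuming Python's standard `socket.getaddrinfo`. The helper names `resolves` and `check_hosts` are hypothetical. It covers only DNS resolution, not ping, since ICMP usually needs raw-socket privileges.

```python
import socket
from typing import Callable, Dict, Iterable


def resolves(hostname: str,
             resolver: Callable[..., list] = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False


def check_hosts(hostnames: Iterable[str],
                resolver: Callable[..., list] = socket.getaddrinfo) -> Dict[str, bool]:
    """Map each hostname to whether it currently resolves."""
    return {name: resolves(name, resolver) for name in hostnames}
```

Calling `check_hosts(["steadfast.net", "support.steadfast.net"])` would return a dictionary of hostname to resolution status. The `resolver` parameter exists only so the logic can be exercised without network access.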
Posted by Mike343, 09-12-2010, 05:29 PM
Still facing downtime here.
Posted by VINAX, 09-12-2010, 05:29 PM
It's weird. Their site is always up for us.
Posted by HD-Sam, 09-12-2010, 05:30 PM
Not true. I thought it was on a different network as well. But... I tried it on my own ISP, my sprint phone, an AT&T phone, and the ultimate test was I verified it with Alertra.com's spotcheck. It was down on all, and Alertra.com listed it down at 6 of 10 of their worldwide locations. Last edited by HD-Sam; 09-12-2010 at 05:33 PM.
Posted by cshenderson, 09-12-2010, 05:30 PM
I thought I had signed up for the Steadfast support alerts. I guess my ignorance of this upgrade is my fault. I still maintain that most critical ISPs don't seem to have warning systems in place but, in this case, I seem to be wrong. To jump into the other conversations: Their support site was down for me across 3 different ISPs in the Chicagoland area.
Posted by Mekhu, 09-12-2010, 05:30 PM
We run VOIP servers on Steadfast and haven't been blocked out totally. It's just a matter of up and down like a yo yo all day. My last pingdom down report came in at 5:13pm EST (15 minutes ago). Let's hope that was the last one... receiving 200+ of those in an evening just sucks. The only thing I want an answer to is notification. Why in the $%#^ were we not notified of this maintenance window? That to me screams SLA credit. If I was aware, I'd have no issue with you copying TOS but with no notice comes some responsibility. You made EVERY business that relies on you look like amateurs today with this stunt.
Posted by chopsmidi, 09-12-2010, 05:31 PM
Been down for me now for half hour. Was stable before then for about an hour and a half.
Posted by Mekhu, 09-12-2010, 05:31 PM
Sign up for what? Since when must you sign up with a provider to receive important (service impacting) notices? That's ridiculous.
Posted by Deathspawner, 09-12-2010, 05:33 PM
+1. I was thinking the same thing. It should be a given. Edit: And I spoke too soon... our site is down again. Ugh.
Posted by crazylane, 09-12-2010, 05:37 PM
oops, down again.
Posted by HD-Sam, 09-12-2010, 05:39 PM
They sent out an email August 24th about the maintenance but never did I once think it would be an outage like this
Posted by RodrigoBR, 09-12-2010, 05:43 PM
Still down for me.
Posted by Mekhu, 09-12-2010, 05:44 PM
Confirmed down since 5:34pm EST. I have all of our clients updated. Time to go make some sweet BBQ and forget about this for a bit.
Posted by jlaws, 09-12-2010, 05:54 PM
Was up shortly and down again. :/ Up again. Guess this is to be expected for 12-24 hours. Last edited by jlaws; 09-12-2010 at 05:58 PM.
Posted by ManagerJosh, 09-12-2010, 06:03 PM
All our latest updates on the network maintenance can be found at https://support.steadfast.net/index....ews&newsid=285 I'm sorry for any problems and disruptions we are causing to your business.
Posted by Mekhu, 09-12-2010, 06:05 PM
sigh... you just had to post it didn't you "Too many connections". If anything new comes into that site, someone be sure to copy/paste here for us all!
Posted by HD-Sam, 09-12-2010, 06:09 PM
It says too many connections now... but I have my last refresh from about 30 minutes ago:
2:29 PM: Restoration is taking longer than anticipated. ETA revised to 4:30PM CDT
2:29 PM: ETA is about 30m-1h. We'll fully update when things are back online, but it's expected we'll see a full restoration of connectivity before 3:30PM CDT
11:32 AM: Core routers are both online. We're bringing things back VLAN by VLAN; as soon as you're back online we'll email replies to existing tickets to confirm that your connectivity has been restored.
7:38 AM: We are continuing to work on outstanding issues caused by the router replacements and currently have no ETA on resolution.
6:06 AM: The problems being caused by the compatibility issue seem to have been solved by removing core4 and replacing it with the Brocade. We are still diagnosing some of the remaining issues and hope to have them resolved shortly.
5:15 AM: The issue was traced to a compatibility issue that prevents the Brocade and Cisco from working together and we are going ahead with completing the replacement of core4 at which point we will resolve all outstanding issues. We are expanding the maintenance period by one hour to deal with this problem.
4:58 AM: We are continuing to work on solving the problems caused by the new router with Brocade. We are attempting several possible solutions at this time and will report when we have additional information.
4:00 AM: We are currently working with Brocade support to attempt to resolve some of the issues that have resulted from the upgrade. We will provide an update as soon as we can.
3:23 AM: We are currently diagnosing known issues with packet loss that are affecting most customers. We will not proceed with the core4 replacement until this problem is resolved.
2:30 AM: core3 has been replaced and we are currently working on bringing the new Brocade router into normal operation status and solving a few complications from the change before we begin the replacement of core4. We will provide an update when this step is complete.
12:53 AM: We are now proceeding with the removal of core3 and replacement with its Brocade equivalent. We will provide an update if there are any issues with this process or when the replacement is completed.
12:25 AM: We are currently verifying configuration settings on the routers to reduce the risk of errors that may cause transition issues. We will provide an update when we are ready to begin the physical migration.
11:39 PM: The maintenance is on schedule to begin at 12:00 AM.
Posted by ManagerJosh, 09-12-2010, 06:10 PM
I will do my best to post updates as fast as I receive them.
Posted by Mekhu, 09-12-2010, 06:10 PM
Thanks Sam. I was hoping to see another update after the 2:29 posting but I'll remain patient. Thank GOD we have some amazing clients.
Posted by MattE, 09-12-2010, 06:11 PM
We noticed major packet loss starting at around 3AM, lasting until about 7AM when everything completely died, and we've been offline since (now it is 5PM). We have had an exceptional service with Steadfast up to this point, though. We are still very happy with the service, though hopefully there is some form of credit offered for this downtime.
Posted by Mike343, 09-12-2010, 06:13 PM
it seems that ETA was too early as 4:30CDT has come and gone :/
Posted by TopekaHost, 09-12-2010, 06:14 PM
We just started with Geekstorage.com, and they use Steadfast Networks for their data center. This has been awful. We just started in the business, and I as an owner called all my clients from our offline database. Since almost all our data was on the server, at one time I decided to make an Access database on our computers so we can call clients in the event our HostBill is offline. So by all means, if anyone hears or sees status update data, please post. Thanks again, sincerely, from my heart to yours. Last edited by TopekaHost; 09-12-2010 at 06:16 PM. Reason: Asking for Status updates
Posted by Deathspawner, 09-12-2010, 06:18 PM
The only new update there is this: 5:210 PM: Key components are now working, services should be coming up for most customers shortly. Our site keeps going up and down sporadically though.
Posted by TopekaHost, 09-12-2010, 06:26 PM
Under no circumstances did I expect this to last almost 18 hours. My site went offline at 12AM and has been off since then; my entire VPS is offline thanks to them.
Posted by KarlZimmer, 09-12-2010, 06:30 PM
Wow, sorry for the delay. We've been actively working on this and even worked out a plan to revert back to the old setup, but when we started getting things together both the chassis and a line card refused to work again... NOTHING has been going right today, it is just crazy. Well, things ARE starting to stabilize. There will still be odds and ends to tie up and likely still some BGP convergence type stuff, but stabilizing for the most part.
Posted by HD-Sam, 09-12-2010, 06:34 PM
Our servers are starting to reappear. Hope it stays. Thanks for the update.
Posted by ManagerJosh, 09-12-2010, 06:34 PM
I'm sorry you've had a horrible first impression with us. I know how important our services are to the success of your organization, and we will do whatever it takes to restore service.
Posted by TopekaHost, 09-12-2010, 06:35 PM
So Karl, when will this outage finally be over? When will we have access to our systems again?
Posted by KarlZimmer, 09-12-2010, 06:36 PM
Yes, our engineers have experience with Brocade, though not extensive, but the 3rd party engineer we had on-site has extensive Brocade experience. We basically set up the boxes as part of a test environment with Cisco gear, though not the exact same gear, and did very similar configurations, etc. We did turn up BGP, and BGP was not the primary issue here. We had our own network engineer on site as well as the 3rd party network engineer, they were the ones performing the maintenance. There was absolutely no delay anywhere for bringing someone on-site, everyone was already on-site. Right now we're working out additional plans/changes regarding our overall network engineering and network configurations arrangement. I've been busy actually dealing with resolving this issue so don't have those plans finalized yet at this point.
Posted by David, 09-12-2010, 06:40 PM
Again, I find that extremely weird considering one of your own employees said otherwise just a few hours ago via IRC.
14:01 <@ub3r> okay, engineers are going to 350 right now.
14:02 <@ub3r> they were working remotely, going onsite now
14:02 <@David> ...
14:02 <@David> you. are.!@#$!@#$. kidding. me?
14:02 <@Steven> what the !@#$!
14:02 <@ub3r> What do you know about networking david?
14:02 <@David> nothing.
14:02 <@David> roughly the same as you folks.
Posted by TopekaHost, 09-12-2010, 06:40 PM
I can finally see steadfast core router but still not totally up but we are making progress
Posted by KarlZimmer, 09-12-2010, 06:42 PM
It states in our welcome email: Then in the TOS: We would love to mail everyone announcements, but we've gotten SO many complaints when we do...
Posted by David, 09-12-2010, 06:43 PM
There's quite a bit more as well, stating otherwise:
14:04 <@ub3r> well anywho, there isn't always a reason to be onsite
14:05 <@ub3r> we do have an out-of-band management system, and our IP-to-Serial boxes are connected to that network.
Posted by TopekaHost, 09-12-2010, 06:44 PM
Hallelujah we are finally up and running thanks steadfast for connectivity.
Posted by KarlZimmer, 09-12-2010, 06:44 PM
That is just purely wrong and incorrect. I can't explain why he would have said that. We've had engineers on site since 10PM last night preparing for the maintenance and then performing it. I can show you badge access logs for the building if you really don't believe me...
Posted by Steven, 09-12-2010, 06:45 PM
I honestly do not consider this to be routine maintenance or a simple service change. This is a serious change. There are several people I know that have had their lives turned upside down with this downtime. As you may or may not know - today is an important day for fantasy football websites.
Posted by David, 09-12-2010, 06:47 PM
I believe you, I'd have no reason to suspect otherwise -- merely going based on what he's stated. I assumed Mike was in-the-know, considering he's a team member. At any rate, I'm sure you're busy -- you've got an awful lot more clients to **** over so I'll let you get back to work.
Posted by Deathspawner, 09-12-2010, 06:52 PM
It would probably be easier to still have e-mails sent out by default, and just let those who don't want them to unsubscribe easily. I'd be willing to bet that there are a lot more people who would rather receive these e-mails than not receive them - given that their paychecks are likely to be the result of their site. I find it strange that so many people would complain anyway, given the sheer amount of spam that reaches our inboxes (mine, anyway).
Posted by PersonalJ, 09-12-2010, 06:56 PM
I'm back online, I think I was down for roughly 18 hours. I did not notice the downtime until around 3 AM EST.
Posted by RodrigoBR, 09-12-2010, 07:02 PM
Nothing yet, my servers still down here... I have large and important websites/customers in these servers, I'm losing money.
Posted by jlaws, 09-12-2010, 07:06 PM
Mine has been up and down for the past few hours. The down periods are a bit long for routing re-convergence but at least its not totally dead atm.
Posted by KarlZimmer, 09-12-2010, 07:13 PM
There are currently off and on memory issues with one of the routers and that is probably what you're seeing. We're working with the vendor on getting config properly transferred, replaced, etc.
Posted by RodrigoBR, 09-12-2010, 07:14 PM
For me there are only brief moments of uptime; it's totally unstable. Most of the time everything is down. Right now, since my last post, all services are still down. Best Regards, Rodrigo
Posted by dgessler, 09-12-2010, 07:14 PM
We're running high traffic e-commerce sites, we lost a ton of money because of this downtime. Although we were not subscribed to their maintenance e-mail (we didn't really think to do so before), Steadfast should have e-mailed everyone about this major scheduled downtime. Notified or not, 6 hours is still an unacceptable amount of time to be down and 17+ hours is epically unacceptable. This definitely warrants a change in datacenters.
Posted by Mekhu, 09-12-2010, 07:19 PM
That's funny, my welcome email from your company (I signed up 4-5 years ago I think) says nothing of the sort. Anyways, I'm not about to argue. I'm about done with your company as of right now and just want to forget this. I can only hope you're not a money hungry company and come good on some compensation for us all. I think I pay about 2x more at Steadfast compared to our Dallas, New York, etc locations and I've never had an issue with that until now. I'm still amazed I was sent no information about this. Maybe I'll have to start checking 10+ websites daily for updates from our DC's... yeah, that seems logical. BTW, still ******** network access from our end.
Posted by Steven, 09-12-2010, 07:21 PM
It's horrible. All day I was able to access it for the most part - been completely down for me for the last 2 1/2 hours.
Posted by menchibantam, 09-12-2010, 07:22 PM
I have been with steadfast for many years now because I like to think that if there are any problems they will be my fault only because steadfast is so good when it comes to handling their own problems. All our critical stuff is with them because they are the most reliable people with the nicest staff in the business as far as I'm concerned. I wonder if I have just gotten lucky all this time after this 18 hour downtime. I have lost a lot of advertising revenue and potential user upgrades today and it hurts more because Sunday is almost always our busiest day. Last edited by menchibantam; 09-12-2010 at 07:22 PM. Reason: typo
Posted by HD-Sam, 09-12-2010, 07:22 PM
Agreed. We've been down since 11:56pm last night and it is now under 6 hours until the 24 hour mark. Let's hope we don't reach that
Posted by Mekhu, 09-12-2010, 07:26 PM
Did the VPS machines need to FSCK? I'm lost as to why some of us are getting yo yo connections and others are 100% offline!?
Posted by HD-Sam, 09-12-2010, 07:30 PM
They shouldn't, this is a network issue. Karl mentioned: That and BGP convergence are most likely why we are going on/offline
Posted by Mekhu, 09-12-2010, 07:31 PM
Thanks for the reply. I thought FSCK was power loss only so that's good to know. As for the on/offline, I understand that. Just confused why some have NO access at all.
Posted by HD-Sam, 09-12-2010, 07:36 PM
Ah, they also mentioned they were bringing things up VLAN by VLAN. They may have not gotten to yours yet.
Posted by KarlZimmer, 09-12-2010, 07:38 PM
Yep, router going up and down causing BGP issues, etc. but there are a number of customers with their VLAN only on that router. We have people picking up a spare supervisor card to fix that right now.
Posted by KarlZimmer, 09-12-2010, 07:39 PM
If you have NO access please PM me and we'll look into that ASAP.
Posted by Deathspawner, 09-12-2010, 07:40 PM
Does that mean that we're at the end of the downtime?
Posted by Mike343, 09-12-2010, 07:41 PM
They're down here at 6:40 PM CST.
Posted by jlaws, 09-12-2010, 07:53 PM
My stuff has been up for the past 30 minutes or so with only 1 or 2 dropped packets. Things may be stabilizing on whichever network segment I'm currently attached.
Posted by drmorley, 09-12-2010, 08:09 PM
All my machines are still unreachable.
Posted by Mekhu, 09-12-2010, 08:16 PM
Agreed. While I don't have our streaming service running, at least my Pingdom notifications have stopped.
Posted by KarlZimmer, 09-12-2010, 08:16 PM
You're homed on the switch that currently doesn't have a management card, our people are on the way back with the replacement. We weren't expecting hardware issues with two of our management cards, as they had been known working beforehand... You'll be up once that is in, but I'd recommend you contact network operations once this is settled so you can have your VLAN setup with VRRP.
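VRRP is recommended above so that a VLAN's gateway does not depend on a single router. As a rough illustration of why that helps, here is a minimal sketch of the VRRP election rule: among routers that are still alive, the highest priority wins, and a tie goes to the higher primary IP address. The router names, priorities, and addresses below are hypothetical, and nothing in the thread describes how Steadfast actually configures VRRP.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Iterable, Optional


@dataclass(frozen=True)
class VrrpRouter:
    name: str
    priority: int            # 1-254 for backups, 255 for the address owner
    primary_ip: IPv4Address
    alive: bool = True


def elect_master(routers: Iterable[VrrpRouter]) -> Optional[VrrpRouter]:
    """Pick the master: highest priority, ties broken by higher primary IP."""
    candidates = [r for r in routers if r.alive]
    if not candidates:
        return None  # no router left to answer for the virtual gateway
    return max(candidates, key=lambda r: (r.priority, r.primary_ip))


# Hypothetical pair of gateways sharing one virtual address.
gw_a = VrrpRouter("gw-a", 120, IPv4Address("192.0.2.2"))
gw_b = VrrpRouter("gw-b", 100, IPv4Address("192.0.2.3"))
master = elect_master([gw_a, gw_b])
```

If `gw-a` fails, running the election again over the surviving routers selects `gw-b`, so hosts on the VLAN keep a working gateway. A VLAN homed on one router only has no such fallback.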
Posted by ManagerJosh, 09-12-2010, 08:21 PM
As of 6:10 PM Central Time: There are off and on memory issues with one of the core routers still causing some issues that we are working with the vendor to resolve, but everything else should be online. If not, please contact our support department.
Posted by jon222, 09-12-2010, 08:36 PM
Karl or someone else from Steadfast, are we going to get any kind of SLA credit for this extended amount of downtime?
Posted by Chrysalis, 09-12-2010, 08:39 PM
Karl, the problem is still ongoing for me. My server never actually went down for a long time, but it has been going up and down like a yoyo all day long, causing havoc with the services I host. I'm a long-term customer and don't plan on leaving; however, next time in an emergency situation, please send out an email. When it is down, I see routing changing between ntt and nlayer.
Posted by KarlZimmer, 09-12-2010, 08:59 PM
SLA terms will be followed, just open a ticket with billing.
Posted by ManagerJosh, 09-12-2010, 09:21 PM
If you are still experiencing issues with your account, please open a support ticket with us and one of our team members will investigate.
Posted by ManagerJosh, 09-12-2010, 09:34 PM
As of 7:20 PM: Core4 is actively being worked on and appears to be the last issue. Customers homed to Core4 only are down, but all other services should be normalized. If your account is still down, please contact our support department.
Posted by KarlZimmer, 09-12-2010, 09:37 PM
Parts are all in for core4 and in place, just finishing up configurations.
Posted by Deathspawner, 09-12-2010, 09:49 PM
I don't know what SLA is, but are regular customers going to see any retribution for this hassle? I don't run a site that drives sales like many others here, but I still lose money on advertising, and not to mention my traffic numbers for almost an entire day. Kind of hard to stomach.
Posted by ManagerJosh, 09-12-2010, 09:57 PM
SLA is service level agreement. Please see http://en.m.wikipedia.org/wiki/Servi...edirected=true for a better idea what SLA is. As for compensation for the downtime, please open a support ticket with billing and they will take care of you.
Posted by RodrigoBR, 09-12-2010, 10:01 PM
I was posting about services back up, but I have still one server down. How much time do you need to fix all issues?!? I can't accept almost one full day of downtime. Best Regards, Rodrigo
Posted by ManagerJosh, 09-12-2010, 10:13 PM
Hi Rodrigo: Please open a support ticket and one of our team members will investigate the matter in complete detail. Thank you for your continued patience with us.
Posted by qlites, 09-12-2010, 10:17 PM
Server went down again. No response from support ticket.
Posted by qlites, 09-12-2010, 10:44 PM
Came back up for a minute and back down. This is becoming a very bad joke!
Posted by KarlZimmer, 09-12-2010, 10:51 PM
Now resolving some routing loops from the new setup.
Posted by HD-Sam, 09-12-2010, 10:58 PM
Servers are still going up and down, can't update our ticket because we can't access the steadfast support site either.
Posted by pfak, 09-12-2010, 11:00 PM
SLA credit? I'd be looking for new hosting, clearly Steadfast does not know how to operate a network.
Posted by Mekhu, 09-12-2010, 11:04 PM
Unfortunately I'm not an *******. Steadfast has given myself and my clients years of uninterrupted service without issues. I still trust them. Just want this over and things back to the normal reliable ways. But yes, I agree this is more than a small mixup. I'm pretty amazed and can only hope this hit them in the Wallets hard so they learn their lesson.
Posted by KarlZimmer, 09-12-2010, 11:08 PM
And you don't know the full situation. This is the only major network outage in 3+ years at my last count. This was a MAJOR undertaking that we mis-evaluated. Yes, we made a mistake and were not fully prepared to make this migration, but there are MANY extenuating circumstances that have brought this issue to the point it is now at.
Posted by panopticon, 09-12-2010, 11:09 PM
I feel bad that this went wrong for them; until today's network incident, they've provided me with 100% uptime and excellent service.
Posted by Deathspawner, 09-12-2010, 11:15 PM
I have to agree with this. I was going to e-mail them regarding some sort of credit, but I said screw it. The company has helped me out in rather big ways in the past and didn't rape my wallet, and that's important. Plus, the service has always been fast and thorough, so I'm not about to let one screw-up (albeit a big one) cause me to move. Accidents happen, I guess.
Posted by David, 09-12-2010, 11:20 PM
Why, is the question? Why still trust? I'm afraid at least in my case, I've lost every little ounce of faith I once allocated to Steadfast & Karl. Something to consider is how companies respond during the worst of times -- not how great and fantastic things are on the average day. This wasn't your average day, and has left you without 99.9% uptime for the next year and a half. Though I've managed to successfully wield it into a situation where my clients have the utmost faith in my own service, Steadfast did the exact opposite to me today. Their responses and communication via every method they utilized (the few they did) weren't thorough, and the odd time we received an update beyond Karl's "letter from the CEO", it was absolutely !@#%$ing useless. The lack of thoroughness in the announcements / news from Steadfast was absolutely appalling; with hours between updates you would think more than a single line could be compiled, especially after ~nineteen hours of downtime and maintenance. Seriously? The times also changed on that 2:29 to 2:29 item as well. Every two hours I sent five-paragraph emails to my clients explaining or expanding on what little data I was receiving from Steadfast. This isn't service we can easily pass off to our clientele, especially not if you're an avid fan of thorough communication. Steadfast has proven to me today that they're a company that not only doesn't deserve a damn cent of my clients' funds, but isn't interested in working for it. Though I'm sure you've all heard enough of my disapproval, I'll make an exit now. God bless.
Posted by kaniini, 09-12-2010, 11:24 PM
What about the times last year and the year before when your distribution network completely failed? This is not the first serious outage you guys have had, but I do agree that the changes you are making are for the better.
Posted by jon222, 09-12-2010, 11:27 PM
We do not know the extenuating circumstances because all your status page has told us is give us 2 more hours for the past 14 hours. Oh and also everything is fixed except where it isn't.
Posted by panopticon, 09-12-2010, 11:45 PM
Despite the issue today, I plan to stay with steadfast for a long while. Their staff, service, hardware, and network have been excellent for me to date; everyone can make a mistake once.
Posted by kdill, 09-12-2010, 11:46 PM
Everyone has their days. I too lost money today, but I have lost MORE money jumping from host to host looking for a good one. I will continue my service with them for as long as I can afford it. They are number one in my book and will stay that way, 100 percent uptime or not. I have faith this will never happen again, as I have seen nothing but wonderful service from them up till today, and it's almost behind us.
Posted by Steven, 09-12-2010, 11:51 PM
http://www.webhostingtalk.com/showth...ight=steadfast That was a pretty bad string of outages - roughly 2 years ago.
Posted by jon222, 09-12-2010, 11:58 PM
You guys act like this was emergency downtime. Nobody imposed this on them and it was supposed to be network maintenance. They promised no more than 30 minutes of downtime for any customer. Their new equipment messed up; I wouldn't complain about that, as new equipment will give you problems, we all know. Why wasn't everything rolled back to the previous configuration while they figured out what went wrong? Should we suffer 24 hours of on-off downtime for what started as network maintenance? The lack of contingency for this is what saddens me. All that said, they still offer great service every other day. I am really stressed out because I personally can't do anything about this type of downtime and my business is suffering because of it. Last edited by jon222; 09-13-2010 at 12:11 AM.
Posted by NetDoc, 09-13-2010, 12:25 AM
You know, I read all 7 pages of hurt, angst and betrayal. I must say that I am happy that I chose Atlantic.net. Good luck peeps.
Posted by KarlZimmer, 09-13-2010, 12:27 AM
We DID roll it back. We rolled it back and then the chassis we were using could no longer be powered on, one primary supervisor engine could not be powered on in any of three chassis, and our backup supervisor card was having odd memory issues. Nothing was going right; no matter what we did there was another issue. We'll have a full explanation out tomorrow along with resolutions, etc. To note, the routing issues should be over; if you still see anything, please send us a ticket with a traceroute.
Posted by jon222, 09-13-2010, 12:32 AM
I'm sorry Karl, I did not know that, I'm only going on what the status page announced which made it sound like you guys just persevered. My bad.
Posted by Steven, 09-13-2010, 12:55 AM
I see what the problem is.. That there - I consider this beta hardware. What in the world were you thinking? Why did you break what was working and proven for something that is unproven?
Posted by misterd, 09-13-2010, 01:39 AM
I remember while I was touring DCs several years ago, some of them had "unreleased" core routers as well. -- Now, I don't know if it's standard practice for manufacturers, if providers have good connections with them, or if they just need some lab rats before official release. I'm stressed out by this situation and lost quite a bit of money & reputation as well. While 24-hour maintenance is not acceptable, these things happen and I believe they handled it the best they could. P.S. - My VLAN has been stable for almost 6 hours (fingers crossed). I had problems reaching some networks a few hours earlier, but things now appear to be normal. Last edited by misterd; 09-13-2010 at 01:41 AM. Reason: Add some more info.
Posted by joshribakoff, 09-13-2010, 01:43 AM
I understand issues happen, but there are multiple factors at play: 1) They sent an email reminding us about the scheduled maintenance, yet did not send one notifying us when it went awry. This provided false intelligence that prolonged our decision to jump ship. We regret this. 2) They did not attempt to roll back / did not have a solid enough rollback plan in place. OK, technology messes up, but when you have a data center to run you shouldn't go playing around with beta hardware without a solid rollback procedure. 3) Just an all-around lack of response. Phone circuits jammed. No emails. False timelines, false intelligence given out, crappy SLA policy. I've spent $2,000+ and all I'm going to get is a $30 credit, if that. I already set up a new account at the planet; the cloud hosting is running about 10x faster on page loads than steadfast was. I always liked working with steadfast because they're a small company, Karl answered the phone personally. But this is unforgivable; while my site was down and I was spending hours frantically pounding on my keyboard, they prolonged the outages for 17hrs (regardless of what they will say). The line on his site that says "won't need an upgrade for another decade" is the kicker. This is a huge bluff, an outright lie. You're just saying that to kiss up to us; you can't make an assertion like that. When stuff goes wrong you shift the blame onto the hardware, but at the same time assure us how good the hardware is. I can't continue to do business with a company that has no cognitive dissonance. It reminds me of the infamous Bush quote "fool me once, fool me twice". Last edited by joshribakoff; 09-13-2010 at 01:46 AM.
Posted by kdill, 09-13-2010, 01:58 AM
Hey life happens. Guess what? It happened to Karl also. He knows the severity of the situation. I mean, what do you tell your clients!? He knows there are going to be people like you that just rage out and leave, but he has to do something. When I read that comment, I had the common sense to see that he only meant to keep people relaxed about the future, not to literally say we are never going to touch our **** for 10 years.
Posted by joshribakoff, 09-13-2010, 02:12 AM
Shouldn't we be able to take his statements literally? I hired him to host my website, not soothe me. So when he said service was restored at 5am, and it was not - he was not being literal. When he says service is now restored, he's wrong again, is what you're implying. I'd rather be given a literal statement than a false sugar-coated one. Coupled with his lies here about never having 'major' [1] downtime, which was disproved by other long-time customers (and to which I can also personally attest). The statements show his true attitude towards the company, which is to keep BSing us. I was plenty relaxed, all day in fact. Until I read that statement, then my blood boiled. It's like 'drill baby drill' in that we are being assured the inevitable will not happen, when everyone knows that it will. My new host isn't making me false promises. [1] - (of course he injects the subjective term for plausible deniability) Last edited by joshribakoff; 09-13-2010 at 02:20 AM.
Posted by ManagerJosh, 09-13-2010, 02:29 AM
Note that I am not using this as an excuse or justification for our situation, but the point is that something routine and simple may not always work out as intended. One incident that comes to mind was the recent update McAfee released for their antivirus solution. By day's end of that release, system administrators had hundreds of computers with corrupt installations of Windows because of a corrupt patch. Something as mundane as an update, done on a weekly, if not daily, basis by McAfee, had issues and caused problems.
Posted by jon222, 09-13-2010, 02:29 AM
I wouldn't go that far; their uptime has been excellent bar that previous incident. It can be easy to forget how long ago it was, because I thought it was longer ago too. kdill asks what do you tell your clients? - Detailed reports of what is going on at the time, not one-liners. - Accurate timelines. Don't claim service is resumed if it isn't, because that just creates more problems. - Refund this month proactively so people don't have to beg. This gesture of goodwill would probably save some customers, because I know most people will just write it off and leave. I know I'm not looking forward to tomorrow when I message them to have them determine whether or not this even constitutes a credit (not hopeful due to the response that Karl gave earlier).
Posted by joshribakoff, 09-13-2010, 02:35 AM
Thanks captain obvious ;-) The problem with that is your website currently implies that is not possible. "You have nothing to fear of this being a continuous issue or a recurring event." So either you're not going to touch any settings or hardware for the next 10yrs, or you guys are being very misleading to us. 11 PM and they're still having "isolated" incidents. Coincidentally I have just finished moving all of my important sites over to "the planet" cloud hosting. Last edited by joshribakoff; 09-13-2010 at 02:41 AM.
Posted by pfak, 09-13-2010, 02:35 AM
Apparently Steadfast while claiming to have their network back up, is not actually back up. A number of customers I know do not have service still, and support tickets are not being responded to. 24 hours and counting.
Posted by ManagerJosh, 09-13-2010, 02:38 AM
@joshribakoff - All I can say is I'm sorry. I'm sorry for all the problems we caused you and that you feel your trust has been misplaced. I do hope you will continue placing your trust in our services and that we will be able to demonstrate why we deserve your trust.
Posted by ManagerJosh, 09-13-2010, 02:41 AM
Hi pfak: If you, or any Steadfast customer, is still experiencing issues with their service, please do not hesitate to open a support ticket. Please include in the ticket a traceroute and we'll work on getting it resolved immediately.
Posted by joshribakoff, 09-13-2010, 02:44 AM
That's a lie. There's something else you could say, for example "We are officially retracting our statements that 'there is nothing to worry about'; in fact things can go wrong in the future, although we will do our best. We want you to know that we have learned and that we have failed on this issue to communicate with our customers. In the future we will be more open & transparent, and we are also refunding everyone for the month's service". You could say something along those lines... that would be a start.
Posted by KarlZimmer, 09-13-2010, 02:45 AM
1) We specifically had it posted on the front of our web site and the top of our support pages showing the status page which was regularly updated. We said as soon as we felt things were going to run long. As far as false intelligence, what was false? The timelines were actually accurate, it was just as one problem was solved a brand new one arose. 2) We did roll back, we had a plan to roll back, but when we started to roll back we had a chassis failure and two management module failures. We've only had one failure before with a 6500 in the past 5 years we've been using them. How do you plan for that? 3) We kept the site as up-to-date as possible and to the best of my knowledge we responded to every one of the thousands of tickets. In addition, I dare you to find anyone with a better SLA...
Posted by joshribakoff, 09-13-2010, 02:49 AM
After the 17hr mark did it cross your mind it might warrant an exceptional mass mailing? This should have taken place 3hrs in, in my opinion. Last edited by joshribakoff; 09-13-2010 at 02:52 AM.
Posted by KarlZimmer, 09-13-2010, 02:55 AM
Network is up and all support tickets are being answered...
Posted by KarlZimmer, 09-13-2010, 02:59 AM
3 hours in we were still in the middle of the maintenance window. We set a 6 hour maintenance window for a reason, it was a large project. Customers were aware of the issue, we answered thousands of tickets and hundreds of phone calls. Regular updates were posted on the site and we made it known here and on our own forum. We will be sending out a full review of the events and future procedure changes, etc. tomorrow. That is a document that certainly could not have been assembled during the rush the day was.
Posted by joshribakoff, 09-13-2010, 03:46 AM
Totally missed my point. Should have sent it after 6hrs then. When you fix one thing and another breaks, and that happens 10 or so times in a row.. there's a trend that should be recognized. After 10x or so of thinking "you've got it", you needed to step up and admit you didn't quite "have it". This should have taken place during, not after the incident. We know you're honestly sorry, we just don't care. What matters is what took place, a simple notification would have saved you my business. You emailed when the maintenance started, why couldn't you have emailed when **** hit the fan, as well? Last edited by joshribakoff; 09-13-2010 at 03:50 AM.
Posted by KarlZimmer, 09-13-2010, 04:29 AM
If I had a way to predict hardware failures I would have certainly used that talent here. Once you have one failure, a once in a 5 year experience for us with the 6500's, then you're generally not expecting a 2nd and then a 3rd. You'd love to think you can plan for everything, but you can't. That is one thing we certainly learned here and will certainly be planning around that idea in the future, going in much smaller and much more reasonable increments.
Posted by joshribakoff, 09-13-2010, 04:39 AM
The whole timeline is a chronicle of "we've almost got it". It's really not an issue of planning. It's an issue of competence. The plan was not sufficient, I know. The issue I am pinning you on is not the plan though, it was your failure to issue a mass mailing. When the maintenance exceeds the given maintenance window, that is a critical decision point for your company. You were forced to decide between accepting humility, or trying to sweep your dirt under the rug and hoping no one noticed. You made a decision to play it low key, and it was to your customers' detriment, and now you assert that you made all the right decisions. No one asked you to notify us the second you exceeded the window. But what about an hour? Or two hours? Or three? If/when this happens again, how long are you going to wait? 17hrs again? ...... I still fail to understand how the *whole* data center going down for 12hrs is not an important enough event to do a mass mailing. Last edited by joshribakoff; 09-13-2010 at 04:45 AM.
Posted by kaniini, 09-13-2010, 05:11 AM
"Life happens" is not an appropriate statement for this situation. This was a voluntary maintenance and they should have been properly prepared for it. What did I tell my clients? The truth: that steadfast promised this maintenance would not go down this way and then I linked them to their status page. What else could I have done in this situation? Made up things that I didn't know were true or false? Steadfast was way too terse during this situation, and on top of that, I have heard their "on-site engineers" were actually working remotely. This was mentioned earlier in the thread. Karl promised this would not be the case, that all engineers would be onsite. I do not know who is telling the truth here, but I do know that Steadfast has had engineers work remotely in the past. I can understand why Karl wants people to be relaxed about the future, but since they have done a rollback, this means we're going to have another downtime in the near future. I cannot relax about that given the fact that this chaos has already happened. There are unresolved questions and there is not even an RFO available yet. Due to all of this, none of my customers have any confidence in the Chicago location anymore - in effect, we have gotten hundreds of transfer requests to our Los Angeles location at QuadraNet and more people asking whether or not our Chicago location will be stable again. That is how bad this outage was. 20 hours of pure hell for my clients, which makes them have a lot of doubt. I really hope that Steadfast gets this right, and gets the router replacement right so that we can just work through this, but I have to prepare for the possibility that they will not. The good news is that I have a contingency plan for that, but the bad news is that in the short term we're going to just have to live through the chaos regardless. I like Steadfast as a provider. 
Typically they have been pretty solid, but then these things happen (and they have happened before) and it really shakes things up...
Posted by spaethco, 09-13-2010, 09:41 AM
Irrespective of the other issues in this thread, this is really an unimportant point. Clearly you need on-site folks to handle physical activities like plugging in cables, but every other aspect of network changes can be managed remotely. You've seen the chassis - there is nothing on the hardware itself that requires the skills of a network engineer to manipulate -- you slide cards into slots and plug cables into ports. With a terminal server for console access and in-band IP access you can manage all aspects of network configuration. Being on-site or off-site doesn't change your approach to configuration and troubleshooting -- it's not like you need to look at lights on the front of the switch for anything. All the other points and concerns are valid, but calling out on-site vs remote resources is an issue in appearance only.
Posted by chrono-it, 09-13-2010, 10:04 AM
If anyone is still seeing issues please PM or email me directly at marc@steadfast.net with your ticket number and I will have it looked into right away.
Posted by dariusf, 09-13-2010, 11:12 AM
Talk about a stressful Sunday. It is very unfortunate you guys ran into all these issues and the down window stretched out so much. I have been colocating servers with you for a few years and until now I was extremely satisfied. Very stable, super fast response and support, very friendly, fast hardware access, reasonably priced. That stated, I'm definitely disappointed on several issues. 1) Notification - having colocated servers with you for a few years, I was totally unaware that I had to request notifications. I can maybe understand no default notifications for some shared-space website customers, but for people colocating their servers? This should be sent out as a default. I can't imagine someone colocating servers and NOT wanting to get service notifications. 2) Maintenance roll back policy - I feel there is absolutely no excuse for dragging out the decision to roll back. I understand that there are times when this much time might be needed, but 6 hours just for the roll out is quite excessive if roll back time is not included. The maintenance window should not be 6 hours without including the rollback time in it. If, for example, it takes 4 hours to roll out and 2 hours to roll back, then you should have your maintenance window set at 6 hours and your deadline set at 4 hours. Once you get to that 4-hour point you should automatically execute the roll back, regardless of how close you feel you are to completing the roll out. If everyone has been notified about the maintenance and you initiated the roll back at a set point, not exceeding the 6 hour maintenance window, then there would be no issues at all. There is always another day you could attempt to do the upgrade. Now, having additional roll back hardware issues would extend that a bit more, but it would still be understandable and close to the 6 hours. Things happened and there is no need to dwell on them, only to learn. I hope this unfortunate event will result in upgrades to your procedures and we will not see anything like that again. 
This is not the first outage, as things happen. I recall the power failure and subsequent backup generator failure at Equinix a few years back. I am looking to accelerate my implementation of backup servers at another provider; that was my fault for not having it in place. I have been a very satisfied customer so far and will remain your customer for some time to come. Darius Cybermash
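The window arithmetic in the post above (4 hours to roll out, 2 hours to roll back, a 6 hour window) can be written down as a small Python sketch. This is only an illustration of the rule as the poster describes it, not Steadfast's actual procedure; the function names and the 1:00 AM start time are hypothetical and chosen purely for the example.

```python
from datetime import datetime, timedelta


def rollback_deadline(window_start, window_length, rollback_time):
    """Latest moment a rollback can begin and still finish inside the window."""
    if rollback_time > window_length:
        raise ValueError("rollback cannot fit inside the maintenance window")
    return window_start + window_length - rollback_time


def must_roll_back(now, window_start, window_length, rollback_time, rollout_done):
    """True once the deadline has passed and the rollout is still unfinished."""
    deadline = rollback_deadline(window_start, window_length, rollback_time)
    return (not rollout_done) and now >= deadline


# Hypothetical start time; the 6 hour window and 2 hour rollback come from the post.
start = datetime(2010, 9, 12, 1, 0)
window = timedelta(hours=6)
rollback = timedelta(hours=2)

print(rollback_deadline(start, window, rollback))  # 2010-09-12 05:00:00
```

With these numbers the deadline falls 4 hours into the window, which matches the "deadline set at 4 hours" in the post. The check deliberately ignores how close the rollout feels to being finished, since that is the point the poster is making.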
Posted by kaniini, 09-13-2010, 11:49 AM
I would just like to point out that we are still awaiting an actual RFO statement from Steadfast about this. In the meantime, I would like to ask why there are so many customer cables going directly into the old core routers? See attached picture. Is the real reason for switching to the Brocade gear to get rid of the distribution network entirely? Attached Images corerouter.jpg (543.9 KB, 147 views)
Posted by David, 09-13-2010, 12:01 PM
No, it was to get rid of the network entirely. Worked for almost ~20 hours until Karl's evil plans were thwarted. Alas, until next time.
Posted by jlaws, 09-13-2010, 01:01 PM
Man, you guys are lethal in this thread. Steadfast has a rather long history of providing stellar uptime. I'm not going to kill someone for extended downtime over the course of one day due to the ridiculous amounts of bad luck they had with a large infrastructure update. I'm far less upset by that than by not knowing that I needed to sign up for notices to be sent to me. Something this large and with a window that large should really have been sent to all customers...if they complain about receiving the email that's too bad...but they can't say you didn't notify them about it. I think some people need to take a deep breath before continuing to bash Steadfast. Many people in here have gotten very childish in their rants. Take it private if you need to continue the bashing, I'm sure many don't really care to see you scream and stomp your feet.
Posted by dariusf, 09-13-2010, 01:04 PM
Totally agree on both. The notifications for this type of huge change should have been sent out at least a couple of times, well ahead of the change, to everyone, to make sure all are aware and can plan for this as well.
Posted by The Universes, 09-13-2010, 01:14 PM
I personally don't believe it boils down to just that. My main concern is that SF has consistently downplayed the impact of this "maintenance", and downplayed the resulting issues that occurred. Support provided no information about what was going on; the status page had 2 lines of gibberish and a non-informational letter from Karl. I would really like to see a provider be more open about their mistakes and more forthcoming about what is actually going on and what is being done to address the issues.
Posted by kdill, 09-13-2010, 01:20 PM
Read the first post in this topic; I never once felt it was downplayed. I'm sure Karl was sick to his stomach all yesterday worrying about getting you guys' service back. You guys act like they did this on purpose and that unexpected things never happen. I want to visit this perfect world. Last edited by kdill; 09-13-2010 at 01:21 PM. Reason: grammar
Posted by kaniini, 09-13-2010, 01:21 PM
The August 24th email was downplayed, coloured with phrases like "minimal downtime expected". 20 hours of downtime is not minimal. I am sure Karl was sick to his stomach all day too. Thanks to this, I am pretty sure I now have an ulcer. Last edited by kaniini; 09-13-2010 at 01:25 PM. Reason: conceptual expansion
Posted by KarlZimmer, 09-13-2010, 01:21 PM
There are no customers connected to the core switches directly; everything goes through an aggregation/access layer off of the core routers. The reason for the Brocades was, yes, to replace the Cisco 6500's, which were being used in a combined core/distribution configuration for the gear at 350 E Cermak; we have a separate distribution layer set up at 725 S Wells. The plan now, and as things are set up, is that the Brocades will simply be taking over BGP, OSPF, etc. and the Cisco will act purely as a distribution switch, nothing more, holding customer VLANs, handling VRRP, etc. This separation is significantly more expensive, but should certainly make our network more robust and make an upgrade such as the one we attempted to perform a thing of the past, as simply moving BGP sessions to an additional router configuration is MUCH, MUCH simpler and a task we've completed successfully on many occasions.
Posted by KarlZimmer, 09-13-2010, 01:22 PM
Yes, and simply put, things did not go as expected.
Posted by kdill, 09-13-2010, 01:24 PM
You didn't read what I said. It would have been minimal, if things had not gone awry. Some stuff you just can't control, a series of unlucky failures that were totally unexpected is going to increase downtime and make that statement look downplayed.
Posted by KarlZimmer, 09-13-2010, 01:25 PM
Personally, I thought we addressed that. The issues were noted on the announcement page as they were discovered and as things progressed. In addition, I feel I was very open, saying it was our mistake, our mis-management of the situation, etc. and I am currently working on a letter that will contain some more details. As I'm still crafting this letter, can you tell me how you felt we downplayed the issue, what specifically do you feel we weren't open about, what specific information do you want us to disclose?
Posted by kaniini, 09-13-2010, 01:38 PM
Yes, it is truly unfortunate that he had these problems. I am not saying that it isn't unfortunate, but his staff should have begun a rollback inside the maintenance window when it was obvious things weren't going to plan. Not at 3 in the afternoon, many hours after the point, and when they did begin the rollback they should have made it very clear they were doing a rollback. I have to stress this is not the first time this has happened. When I first moved to Steadfast from Equinix, it was followed up with a string of outages, previously mentioned in this thread. Since then, it has been pretty good though, and I do give him points for that. However, my customers are out for blood because of this outage, and rightly so. So we have a lot of pressure to ensure that Karl is going to get this right or to relocate their servers to a DC that is not Steadfast. Ultimately what they are looking for is a fix to this problem that will be permanent, so many of us are having to ask Karl questions about the outage and how it was handled. Karl is a very nice guy. I like working with him. I would like to continue working with him. But I need to know what went wrong, why it went wrong, and how it will be corrected in the next attempt to alter the network topology. If I can't tell my clients this information, then they will be moving their virtual machines to other regions or demanding that we move our POP to a different facility. Right now I have a good amount of customers doing both of these things, which leaves me at a crossroads as far as options go. The rest, I may have to offer them upgrades or a service credit to get them to stay. Who knows... my customers seem to like action a lot more than service credits. Those options are ultimately: trust that Karl will have fixed the problem in his next attempt and that there will be no more catastrophes in the near future (say a 6 month time window) or start formulating an exit strategy. 
Regardless, we are taking action now to move out of reassigned IPs which reduces dependency on our datacenter providers (including steadfast). We owe that to our clients. Hopefully, we won't have to take advantage of that increased portability anytime soon, but it will at least be assuring to our customers once we get it. The reason why it will be assuring is because it gives us the power to leave without forcing them to renumber.
Posted by kaniini, 09-13-2010, 01:49 PM
I want to know specifically what happened and why you felt you should continue pushing forward instead of rolling back at the first sign of trouble. I want to know what has changed that will make the next attempt work. I want clarity on whether or not there will be people from Brocade on-site, not working remotely, actually on-site. I want to know that you will abort your next attempt if there is any sign of a serious issue. I *need* to know this. I want clarity on whether or not what Mike said was accurate. (I surely hope you did not fire him over the IRC logs in this thread... if you did, you need to fix that by unfiring him. I cannot support that kind of business strategy and feel good about it when I go to sleep at night.) Ultimately what I want is something that I can bring to my customers to assure them that everything will be fine with their servers during the next upgrade. If you can't provide such an assurance, then you need to re-evaluate your plan.
Posted by KarlZimmer, 09-13-2010, 01:59 PM
To note, with the maintenance, things WERE working: everything was up and running on the Brocade gear without issue at around 6:15AM, and we figured we could get any necessary adjustments made for the few customers still seeing issues before the extended period ended at 7AM. The window was scheduled to be 6 hours because we knew that, with the amount of work needed, it would take a good six hours to get done, though the effects on customers were not supposed to be that great. That everything had been working smoothly made us sure it must have been one of the final changes we made that caused the issues. We reverted configs, put configs back in place, and worked with 3rd party engineers and Brocade. We got close several times, with CPU load going back down, etc., just for it to flare up again. When it became evident there was no way to get it resolved in a reasonable period of time, we went with plan B. As I stated before, I am working on a complete letter to describe the events and the actions we are now taking.
Posted by KarlZimmer, 09-13-2010, 02:10 PM
1) We pushed forward because things were working, things had been operating. We knew a rollback would be at LEAST 2 hours of downtime and were almost certain the repairs to the Brocades could be done in less time than that. That turned out to be wrong. 2) We are going with a completely different configuration, which will be detailed in the letter. 3) We had our own network engineering team on-site, a 3rd party engineer very familiar with Brocade gear, a 24/7 support contract with Brocade, and Brocade on the phone from the beginning. We thought we had taken the actions necessary to be prepared. Simply put, we will not carry out another maintenance of this scope again, ever, and the explanation will be outlined in the official letter. 4) Yes, all future maintenance will involve leaving the existing configuration in place and fully configured. There will be no more complete gear swaps, so rolling back will be much more trivial and thus not an impediment to doing a quick rollback. 5) What Mike said was not accurate. He had just logged on to the staff chat and saw I was asking for transportation to 350, as I was working out of our other office. Our head network engineer, CTO, 3rd party engineer and various other staff were on-site since 2 hours (or more for some) before the maintenance window. Mike has not been fired; it was a misunderstanding on his part in a hectic and stressful time. 6) That is entirely the focus of the new and revised plan.
Posted by joshribakoff, 09-13-2010, 02:11 PM
This right here is why I am firing you. My definition of "working" apparently differs from yours. Do I need to post my down time reports? I'll give you a hint: at no point since the maintenance began did things remain "up" for a continuous 10 minute period until 9am, only to go back to up/down all day at 10am. And "not a big deal" with the lie about engineers being on site? How is a lie not a big deal? He's willing to lie about when the down time occurred, lies about decisions that were made, lies about where the engineers are. The issue is not where the engineers were, as I'm sure if he told us they were working remotely, we would have no problem. The issue is that what he told us was not true. We are not mad at the situation, more so mad at Karl's handling of the situation. Karl should admit he was in the wrong for not sending out a notification email. He'll get on here and post all day about how "this & that" happened with the hardware which was out of his control, yet will not address the issues that are in his control (the lying, the negligence, the misleading time lines). When you reverted the configs and it didn't work, and you found yourself on the phone with tech support, don't you think you should have recognized the situation was out of your control? You wrote: "When it became evident there was no way to get it resolved in a reasonable period of time" When exactly did you realize? Was it 7am when you told us: "7:38 AM: We are continuing to work on outstanding issues caused by the router replacements and currently have no ETA on resolution." Last edited by joshribakoff; 09-13-2010 at 02:21 PM.
Posted by KarlZimmer, 09-13-2010, 02:27 PM
We have NEVER lied. I have tried to be open about this and have been truthful about the issues through the entire situation. The timelines provided were honestly the best known information at the time, but then things changed and additional issues surfaced, with these issues being outlined on the site. We had a full 20 minutes of complete network stability. Sure, there were likely customer specific issues, that was expected and to be taken care of on a case-by-case basis, but the vast majority of customers, according to our internal and external reporting, were stable. Honestly, I had no simple way to send an email. Our customer database systems are on our standard network, which was fully affected by these issues as well. Our support site, email, phones, office network, and own web site are on a separate AS, specifically to assure that they are reachable and can then be used for updates. Having the customer database fully accessible had not crossed our minds and was an oversight on our part, so we used the communication channels we did have available to us. This incident has of course led us to reconsider that and to put more items on our separate AS.
Posted by KarlZimmer, 09-13-2010, 02:32 PM
We had just had a working box and a working config, you need to evaluate whether the 2+ hour fix for doing a complete roll back will cause more downtime than working out a usable config. Working with the Brocades we had three separate instances where things were stable or getting to the point of being stable, just to have it all crash down again. It was after this third one, which was around noon, that we decided to roll back. It was being stuck where your options were a known bad, an additional 2+ hours of downtime, or an unknown...
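The trade-off described above (a known-bad state, a rollback costing 2+ hours, or an open-ended forward fix) is the kind of decision that can be pinned down before a maintenance starts. The following is only a minimal sketch of one possible pre-agreed abort rule, not the procedure Steadfast used or plans to use. The 2-hour rollback cost comes from the posts above; the function name `should_roll_back` and both thresholds are hypothetical.

```python
from datetime import timedelta

# Rollback cost, taken from the "2+ hour" figure quoted in the thread.
ROLLBACK_COST = timedelta(hours=2)


def should_roll_back(time_spent_fixing, relapses,
                     max_relapses=1, rollback_cost=ROLLBACK_COST):
    """Return True once the forward fix has used up its pre-agreed budget.

    Hypothetical rule: stop when troubleshooting has already cost as much
    downtime as the rollback itself would have, or when the network has
    relapsed max_relapses times after appearing stable.
    """
    return time_spent_fixing >= rollback_cost or relapses >= max_relapses


if __name__ == "__main__":
    # 30 minutes into troubleshooting, no relapse yet: keep going.
    print(should_roll_back(timedelta(minutes=30), relapses=0))
    # The fix looked stable and then fell over again: roll back.
    print(should_roll_back(timedelta(minutes=45), relapses=1))
```

The point of fixing the thresholds in advance is that the decision no longer depends on how close the fix feels in the moment.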
Posted by joshribakoff, 09-13-2010, 02:33 PM
Karl, I can't believe some of the stuff you are writing. Full 20 minutes of stability? Seriously wtf man.
Posted by kaniini, 09-13-2010, 02:46 PM
Yeah, things *were* working. What happened with that by the way? As soon as you took out the old cisco, the network came back to life and everything was happy... I mean things were so happy your new Brocade equipment was pushing rainbows and unicorns through its unused ports and slots. It was good times, man. Then things broke again. I was about to go to sleep when that happened. Ultimately, I didn't get to sleep until 9 hours later, when I finally just said "screw it, there's nothing more I can do to salvage this situation, I need to sleep before I punch a hole in my grandparents' nice new wall" (did I mention I am on holiday? what a great start to a holiday...) It's like someone who is a tease. Your network... it was working... then it went dark. I was so happy it was over. It was working perfectly. Things were looking *great* like rainbows and flowers and bunnies and stuff. And then it didn't. It went dark. For ultimately an additional 14 hours. Anyway. When you implement this revised plan: I am quite happy to do you a favor. I have a nagios setup monitoring my lines from Steadfast. If you would like, I can subscribe you guys to the outage alerts for that VLAN. That way you will know instantly if the problem is bigger than perceived. Deal? Last edited by kaniini; 09-13-2010 at 02:58 PM. Reason: needs moar cowbell
Posted by KarlZimmer, 09-13-2010, 03:07 PM
In the next post, William seems to confirm our analysis. Again, it is certainly possible that you had a customer specific issue that we were working on getting resolved in that remaining window, but we did significant testing and honestly, the performance and the way things were functioning was amazing. Then it wasn't... This was absolutely the most frustrating day I have ever had, how we would be teased with things working, just for them to fall apart.
Posted by joshribakoff, 09-13-2010, 03:16 PM
I truly feel bad for you, but I don't do business based upon emotion. That's not how I got to where I am now, and that's not how I'll succeed. Please see attached downtime report. The times it said "ok" my website was still not up (taking too long to load, not loading at all, timing out, etc). Essentially it was a continuous 17hr block of downtime on our end. The only communication we received during this 17hr period was the initial one telling us to ignore any downtime. I went to sleep with this down time going on, thinking you had it under control - which you did not. A single lost sale for me is $1,000+ (in profits lost, not just revenue...) Attached Images SiteUptime - Website Monitoring Service_1284405224437.png (36.2 KB, 106 views) Last edited by joshribakoff; 09-13-2010 at 03:22 PM.
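The disagreement over "20 minutes of stability" versus "never up for 10 continuous minutes" is something a monitoring log can settle mechanically. The sketch below is one way to compute the longest continuous healthy stretch from check samples, counting a check that answers too slowly as down, as the post above suggests. The sample format, the 10-second threshold, and the function name `longest_up_streak` are all assumptions for illustration. They are not taken from the attached report or from any monitoring product.

```python
from datetime import datetime, timedelta


def longest_up_streak(samples, max_response_s=10.0):
    """Longest span covered by consecutive healthy checks.

    samples: list of (timestamp, response_seconds) sorted by time, where
    response_seconds is None if the check failed outright. A check slower
    than max_response_s is treated as down, since a page that takes too
    long to load is not usable even if the monitor reports "ok".
    """
    best = timedelta(0)
    streak_start = None
    for ts, response in samples:
        healthy = response is not None and response <= max_response_s
        if healthy:
            if streak_start is None:
                streak_start = ts
            best = max(best, ts - streak_start)
        else:
            streak_start = None
    return best


if __name__ == "__main__":
    # Hypothetical 5-minute checks, not real data from this outage.
    t0 = datetime(2010, 9, 12, 6, 0)
    step = timedelta(minutes=5)
    responses = [0.5, 0.6, None, 30.0, 0.4, 0.5, 0.5]
    samples = [(t0 + i * step, r) for i, r in enumerate(responses)]
    print(longest_up_streak(samples))
```

With checks every few minutes the result is only as precise as the sampling interval, so a short streak in the log does not rule out brief periods of stability between checks.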
Posted by kaniini, 09-13-2010, 03:40 PM
I am sorry to be rude but frankly I find it hard to believe that you would host a $1000+/mo enterprise on a $30/mo service plan. The reason why I know it was $30/mo is because everyone was effectively down for 20 hours, meaning they get a 100% SLA credit. If your business was that big of a deal you would have service from multiple providers so that a failure at one company would not cause your site to go down so that you could continue doing your e-commerce sales...
Posted by joshribakoff, 09-13-2010, 03:45 PM
Yes you are being rude. I was on a $100+ dedicated server, until it got compromised. You don't know me, or how many hosting accounts I have or how much money I really make, so please stay out of my business. For all you know I could make $1 or $1M
Posted by DPG, 09-13-2010, 03:59 PM
I know that **** happens but this part is alarming. If the actual maintenance was going to take 6 hours and the maintenance window was only 6 hours, there was zero room for error.
Posted by KarlZimmer, 09-13-2010, 04:16 PM
The 6 hours to get it done allowed time for fix-ups, etc., which was of course calculated in. It was basically allocated as 60 minutes for base/config prep (including two hours of non-service-affecting work before the maintenance), 90 minutes for core3 replacement, 90 minutes for core4 replacement, and 2 hours for fixing up the odds and ends. From when we've done similar maintenance before, such as in New York in July, those were all over-estimates as well, as that maintenance, replacing one switch with a new platform, was done completely in less than 2 hours. We've swapped out switch platforms on many occasions previously, Foundry to Cisco, Cisco to Juniper, Juniper to Cisco, and thought we had a good handle on the time that would be needed.
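For readers following the arithmetic, here is a small sketch of the budget described above, set against the rollback cost quoted earlier in the thread. The phase durations and the 6-hour window are taken from the posts; the 2-hour rollback figure is the "at LEAST 2 hours" estimate, and the helper names are made up for this example. It illustrates the concern raised about slack and is not a reconstruction of the actual plan.

```python
from datetime import timedelta

# Phase budget in minutes, as listed in the post above.
PHASES = {
    "base/config prep": 60,
    "core3 replacement": 90,
    "core4 replacement": 90,
    "fix-ups": 120,
}

WINDOW = timedelta(hours=6)
# Assumption: uses the minimum rollback time quoted earlier in the thread.
ROLLBACK = timedelta(hours=2)


def planned_total(phases):
    """Sum of all planned phase durations."""
    return timedelta(minutes=sum(phases.values()))


def slack(window, phases):
    """Time left in the window after all planned phases."""
    return window - planned_total(phases)


def latest_rollback_start(window, rollback):
    """Offset into the window after which a rollback cannot finish inside it."""
    return window - rollback


if __name__ == "__main__":
    print("planned:", planned_total(PHASES))
    print("slack:", slack(WINDOW, PHASES))
    print("latest rollback start:", latest_rollback_start(WINDOW, ROLLBACK))
```

On these numbers the plan fills the whole window, and a rollback would have to begin by the 4-hour mark to finish inside it, which is before the core4 replacement and fix-ups are budgeted to end.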
Posted by Scott.Mc, 09-13-2010, 04:18 PM
His point is perfectly valid however. Regardless of how much you make (frankly who cares, everyone on WHT always loses millions every second in outages on their $2/month account). If your service is important you should have contingency plans and redundancy. Now if you want to grumble about being down then that's fine but please be quiet with the statements of I lost $xxxxxxxxxx. That's your problem, not theirs.
Posted by j4cbo, 09-13-2010, 05:20 PM
I was unable to connect to either of Steadfast's phone support lines throughout most of this incident. Are you planning on installing a real, non-VoIP phone line that won't stop working next time your entire network falls over?
Posted by spaethco, 09-13-2010, 05:50 PM
Why would this matter? You're probably not going to get more updates than were already provided electronically, and they don't exactly need you to call and report your server being down if it's a DC-wide network event.
Posted by KarlZimmer, 09-13-2010, 05:53 PM
The phones are on a separate AS and network, and we were receiving calls through almost the entire incident. It was likely some sort of routing issue you were facing, though it is hard to do a traceroute on a call. To solve this, we are going to ensure that we gracefully turn down the ports to our own network as part of our new maintenance checklist, to assure any potential routing issues/anomalies are passed on. To note, one single phone line wouldn't have helped much either.
Posted by joshribakoff, 09-13-2010, 05:55 PM
We'd still have experienced down time. 48hrs for DNS to propagate.
Posted by Steven, 09-13-2010, 05:58 PM
48 hours? Hardly.
Posted by spaethco, 09-13-2010, 05:58 PM
Only for registrar changes. If you were clueful with distributing your DNS servers you could have downtime on the order of a couple minutes, tops.
Posted by dariusf, 09-13-2010, 06:05 PM
One thing that might be useful is a redirection of the calls to an automated recording with the details on the outage, but then I would prefer the staff being busy resolving the issue rather than updating all the status update methods. I think it boils down to breaking the updates into smaller, more manageable chunks and including the rollback time in the maintenance window.
Posted by dariusf, 09-13-2010, 06:09 PM
I did get through at ~10am CST or so and spoke to someone (forgot the name), and was notified of the maintenance and issues. I was getting dropped calls, as in no ringing at all, from about 2pm to 3pm CST. At which time I googled this thread...
Posted by panopticon, 09-13-2010, 06:11 PM
This doesn't add up to me: Could you at least tell us what box(es) you had at steadfast at the time of the outage? The SLA won't cover your full losses from such an event if you're a for-profit, but at least it helps cover your time to respond or provides funds for an emergency setup if needed. I find Steadfast Network's SLA in my experience to be very fair and to actually be better than the planet's SLA, also in my experience hosting there for many years now.
Posted by KarlZimmer, 09-13-2010, 06:29 PM
That does make sense and would probably save us time. We'll see what we can do to work out a system for our staff to be able to insert and update such a message. Thank you for the suggestion.
Posted by joshribakoff, 09-13-2010, 07:31 PM
Please fill me in on how downtime due to DNS could possibly be prevented? What does it matter how many DNS servers I use or what I set the TTL to? I can't force a user's ISP to not cache the records? 48hrs far-fetched? You're fooling yourself. Even steadfast quoted me that estimate of 48hrs when I've asked about this in the past, and even though I observe my own ISP following TTLs, I have conducted enough experiments in the past and talked to enough of my customers to know that it truly does take up to 48hrs. Sad but true. If there's some way around this let me know.
Posted by spaethco, 09-13-2010, 07:51 PM
Well, clearly if your DNS servers are located on the network that's down, it's not going to work. The records at the gtld root (.com/.net) are set to 48 hours of cache time, so if you need to point to new DNS servers it will take a while for the records to age out on servers. You set a lower TTL for the records you want to be available for DNS failover -- something like 180 seconds. The number of ISPs out there that ignore DNS TTLs is a number very close to 0. I'm basing that off not only my experience with my personal gear, but also on professional experience with our globally load balanced member portals for one of the largest healthcare companies in North America. Either you're doing something wrong in testing, or your collection of users is such a statistical anomaly that you should start playing the lottery. We observe a ~99.9% tracking rate within 15 minutes on DNS changes according to session count numbers we track when we move things around.
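To put rough numbers on the two cases described above, the sketch below computes an upper bound on how long a resolver that honors TTLs could keep serving an old answer. The 180-second record TTL and the 48-hour figure for registrar-level changes are the values quoted in this post. The detection delay and the function name `worst_case_stale_seconds` are hypothetical, and the whole calculation assumes resolvers respect TTLs, which is exactly the point being argued in this thread.

```python
# TTLs in seconds. 180 s is the failover TTL suggested above; 48 h is the
# cache time quoted above for changes made at the registrar.
RECORD_TTL = 180
DELEGATION_TTL = 48 * 3600


def worst_case_stale_seconds(ttl, detection_delay=0):
    """Upper bound on how long a TTL-honoring resolver serves the old answer.

    A resolver may have cached the old record just before the change, so it
    can hold it for one full TTL. detection_delay is the (hypothetical) time
    it takes to notice the outage and publish the new record.
    """
    return detection_delay + ttl


if __name__ == "__main__":
    # Failover on DNS servers you already control, noticed within a minute.
    print(worst_case_stale_seconds(RECORD_TTL, detection_delay=60), "seconds")
    # Having to repoint the domain at new DNS servers through the registrar.
    print(worst_case_stale_seconds(DELEGATION_TTL) / 3600, "hours")
```

The difference between the two results is the practical argument for having distributed DNS servers registered before an outage rather than changing them during one.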
Posted by joshribakoff, 09-13-2010, 08:00 PM
Hmm well Friday I logged into my registrar and edited the 'A' record for a customer that couldn't reach one of my sites (my fault, had it pointed to the wrong IP). After fixing it I immediately could ping the new IP, because I keep my TTL low like you say. However, this user was unable to access my site until Sunday evening, he checked it every few hours. I did not ask where he was geographically located. Anyways shortly after that customer wrote me to let me know the DNS had propagated, my whole server went down due to this outage. I'd love to take what you wrote at face value but that doesn't explain my users complaining. My users typically don't complain for no reason. Also Karl has offered me an SLA compensation I feel very good about, however I'm still unable to reconcile what went down. They rolled back the old routers, which means they're going to attempt the whole upgrade again at a later date now? ... I found sources that back up what I am saying. What sources do you have to the contrary? I can't post them but search for "dns propagation" and click on the devshed link. This says that ISPs *do* in fact cache the records, and can be up to 72hrs
Posted by spaethco, 09-13-2010, 08:11 PM
'A' records at the gtld root (ie, the records you set at your registrar) are set to 48 hours, as I stated above. The key is to have your own distributed DNS servers already set up with the registrar, and then you can modify records that have TTLs that you can control. My source is actually implementing and operating DNS failover solutions on the production Internet. Last edited by spaethco; 09-13-2010 at 08:16 PM.
Posted by dariusf, 09-13-2010, 09:01 PM
You can also pay for a DNS failover service or, like spaethco mentioned, host your own DNS on two different networks. Here are some threads on DNS failover: webhostingtalk.com/showthread.php?t=524788 webhostingtalk.com/showthread.php?t=574218
Posted by ManagerJosh, 09-13-2010, 10:50 PM
For those of you who have not received the report, Karl posted it earlier today. It is available to read at https://support.steadfast.net/index....ews&newsid=286 or you may read it below.
Posted by Steven, 09-13-2010, 11:18 PM
How is that even a comparable SLA credit? Those upgrades do not help your customers in the short term with the SLAs they must pay out to their customers. Remember, the SLAs you would have to pay out are likely lower than what your customers pay out, as they may have many servers with lots of individual customers. Last edited by Steven; 09-13-2010 at 11:23 PM.
Posted by ManagerJosh, 09-13-2010, 11:54 PM
With all due respect, it's an option on the table for each customer, as each customer's requirements will vary. Some may find it a viable alternative and some may not.
Posted by misterd, 09-14-2010, 12:12 AM
I'm glad this is happening. --- Several providers that I'm also using will create a trouble ticket when there is going to be a maintenance & possible outage, and it's updated when there's something new. -- I'm not sure how Kayako & Ubersmith work, but if you can do something like that along with announcement on Steadfast.net, it will be perfect.
Posted by BELLonline, 09-14-2010, 09:14 AM
I've been with Steadfast for 2 years now and their network has been almost faultless until this problem. These things happen, they made an upgrade and it went wrong - but they have clearly learned from what happened.
Posted by KarlZimmer, 09-14-2010, 01:50 PM
Basically, yes. The Brocades will handle the core/BGP and the 6500's will just handle distribution for chi01 and chi02. It will likely be 1-2 months out, as we need to install some additional cabs and power, etc., and should be much, much simpler. The transition would involve moving over BGP sessions gradually.
Posted by superblade, 09-14-2010, 01:59 PM
I don't think it's an option for each customer. I asked for details on the free upgrades and was told it wasn't even an option for VPS clients. I guess I can understand this, but it wasn't clear in the statement that was sent out.
Posted by KarlZimmer, 09-14-2010, 02:20 PM
That shouldn't be the case, PM me your ticket # and I'll see what I can do.
Posted by sirius, 09-14-2010, 04:29 PM
It appears that this issue is now resolved. Please feel free to use Steadfast's normal support channels for any further issues.